### Dataset description

To demonstrate the performance and robustness of the Medical Diffusion model, we trained it on four different publicly available datasets.

The MRNet dataset^{25} contains 1250 knee MRI studies from n = 1199 patients, each comprising scans along the axial, sagittal, and coronal directions. To keep the study focused, we trained the model only on the sagittal T2-weighted sequences with fat saturation.

The Alzheimer’s Disease Neuroimaging Initiative (ADNI)^{26} dataset contains brain MRI studies from n = 2733 patients. ADNI was launched in 2003 as a public-private partnership led by principal investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography, other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease. We trained the model on 998 3D magnetization-prepared rapid gradient-echo (MP-RAGE) sequences labeled as cognitively normal.

Additionally, we evaluated the model on a breast cancer MRI dataset^{27} (referred to as the DUKE dataset) obtained from 922 breast cancer patients. We used the non-fat-saturated T1-weighted sequence of each patient.

To demonstrate the generalizability of the Medical Diffusion model, we also trained it to synthesize CT images. For this purpose, we used 1010 low-dose lung CT studies from 1010 patients from the Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI).^{28}

We also used an internal dataset of 200 non-fat-saturated axial T1-weighted breast sequences from 200 patients, together with the corresponding manual ground truth segmentations of the breast contours. This dataset was included to evaluate the use of synthetic breast MR images in a self-supervised pre-training approach, as detailed below.

### Data preprocessing

Knee MRI studies from the MRNet dataset were preprocessed by scaling the high-resolution image plane to 256 × 256 pixels and applying histogram-based intensity normalization^{29} to the images. This step was performed by the dataset provider.^{25} Additionally, each image was center-cropped to uniform dimensions of 256 × 256 × 32 voxels (height × width × depth).

Brain MRI sequences from the ADNI dataset were preprocessed by removing non-brain regions from the MR images. This step was performed by the dataset provider. To allow a comparison between diffusion models and GANs, we followed the approach of Kwon et al.^{4} and resized the brain MRI studies to 64 × 64 × 64 voxels before feeding the data to the neural network.

MR images from the breast cancer dataset were preprocessed by resampling all images to a common voxel spacing of (0.66 mm, 0.66 mm, 3 mm). Using the corresponding segmentation masks delineating the breast, each image was then split in half so that the left and right breasts were in separate images. Finally, the images were resized to uniform dimensions of 256 × 256 × 32 voxels.

Similarly, chest CT studies were resampled to a common voxel spacing of 1 mm in all dimensions. Pixel values were then converted to Hounsfield units, and the images were center-cropped to 320 × 320 × 320 voxels and then resized to 128 × 128 × 128 voxels.

Images from all datasets were min-max normalized to the range −1 to 1. Additionally, we augmented all datasets by vertically flipping the images with a probability of 50% during training.
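The steps shared by all datasets (center cropping, min-max normalization to the range −1 to 1, and random vertical flipping) can be illustrated with a minimal NumPy sketch. The function names, the toy input volume, and the choice of the first array axis as the vertical (height) axis are assumptions made for illustration; resampling and resizing are omitted.

```python
import numpy as np


def center_crop(volume, target_shape):
    """Crop a 3D array to target_shape around its center (volume must be at least that large)."""
    starts = [(dim - tgt) // 2 for dim, tgt in zip(volume.shape, target_shape)]
    region = tuple(slice(s, s + tgt) for s, tgt in zip(starts, target_shape))
    return volume[region]


def min_max_normalize(volume):
    """Linearly rescale intensities so the minimum maps to -1 and the maximum to 1."""
    lo, hi = float(volume.min()), float(volume.max())
    if hi == lo:  # constant volume: avoid division by zero
        return np.zeros(volume.shape, dtype=np.float32)
    return (2.0 * (volume - lo) / (hi - lo) - 1.0).astype(np.float32)


def random_vertical_flip(volume, rng, p=0.5):
    """Flip along the first axis (assumed to be height) with probability p."""
    return volume[::-1].copy() if rng.random() < p else volume


rng = np.random.default_rng(0)
# Toy stand-in for a loaded scan, ordered as height x width x depth.
raw = rng.integers(0, 4096, size=(300, 280, 40)).astype(np.float32)

vol = min_max_normalize(center_crop(raw, (256, 256, 32)))
vol = random_vertical_flip(vol, rng)  # applied on the fly during training
```

Normalizing after cropping means that the extreme values of the cropped volume, not of the full scan, define the range; whether the original pipeline normalizes before or after cropping is not stated in the text.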

### Architecture

Given their considerable size, 3D medical images pose the challenge of finding a model architecture that can generate synthetic images without exceeding the available computational resources. The Medical Diffusion architecture addresses this challenge with a two-step approach. We first encode the images into a low-dimensional latent space and then train a diffusion probabilistic model on the latent representation of the data. Operating in the low-dimensional latent space alleviates the problem of limited computational resources. In the following, we provide background information on vector-quantized autoencoders, focusing on the VQ-GAN^{16} architecture used to compress the images. We then introduce the concept of denoising diffusion probabilistic models.^{30}

#### VQ-GAN

To encode images into meaningful latent representations, vector-quantized autoencoders are a viable option, as they mitigate the blurry output commonly encountered in variational autoencoders.^{17,18} They map the latent feature vectors at the bottleneck of the autoencoder to quantized representations drawn from a learned codebook. The VQ-GAN architecture proposed by Esser et al.^{16} is a class of vector-quantized autoencoders in which the image reconstruction quality is further improved by imposing a discriminator loss on the output. More precisely, the image is passed to an encoder to construct the latent code \(z_{e} \in {\mathbb{R}}^{(H/s) \times (W/s) \times k}\), where *H* denotes the height, *W* the width, *k* the number of latent feature maps, and *s* the compression ratio. In the vector quantization step, the latent feature vectors are quantized by replacing each with the closest corresponding codebook vector *e*_{n} contained in the learned codebook *Z*. The image is then reconstructed by feeding the quantized feature vectors to the decoder *G*. The learning objective is defined as minimizing the reconstruction loss *L*_{rec}, the codebook loss *L*_{codebook}, and the commitment loss *L*_{commit}. As suggested by Esser et al. in their original publication,^{16} we use a perceptual loss as the reconstruction loss and a straight-through estimator to overcome the non-differentiable quantization step. The commitment loss is the mean squared error between the unquantized latent feature vectors and the corresponding codebook vectors. Note that its gradients are only computed for the continuous latent feature vectors, to enforce higher proximity to the quantized codebook vectors. Each learnable codebook vector is optimized by maintaining an exponential moving average over all latent vectors mapped to it. Additionally, a patch-based discriminator is applied to the output to improve the reconstruction quality.

To extend this architecture to 3D inputs, we adopted the methodology proposed by Ge et al.^{31} and replaced the 2D convolutions with 3D convolutions. Furthermore, we replaced the discriminator of the original VQ-GAN model with two discriminators: a slice-wise discriminator that takes a random slice of the image volume as input, and a 3D discriminator that takes the entire reconstructed volume as input. We also followed their approach of adding a feature matching loss to stabilize GAN training.
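The vector quantization step and the commitment loss described above can be illustrated with a short NumPy sketch. It is a simplified illustration rather than the model's implementation: the function names and the toy codebook are assumptions, and the encoder, decoder, discriminators, and exponential moving average update are left out.

```python
import numpy as np


def quantize(z_e, codebook):
    """Replace each latent feature vector by its nearest codebook vector.

    z_e:      array of shape (..., k) with continuous latent feature vectors
    codebook: array of shape (N, k) holding the codebook vectors e_n
    Returns the quantized latents and the index of the chosen code per vector.
    """
    flat = z_e.reshape(-1, z_e.shape[-1])  # (M, k)
    # Squared Euclidean distance between every latent vector and every code: (M, N).
    dist = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    z_q = codebook[idx].reshape(z_e.shape)
    return z_q, idx.reshape(z_e.shape[:-1])


def commitment_loss(z_e, z_q):
    """Mean squared error between unquantized latents and their codebook vectors."""
    return float(np.mean((z_e - z_q) ** 2))


# Toy example: a codebook with two 2-dimensional codes and two latent vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_e = np.array([[[0.1, 0.2], [0.9, 0.8]]])  # shape (1, 2, 2)

z_q, idx = quantize(z_e, codebook)
loss = commitment_loss(z_e, z_q)
```

In an automatic differentiation framework, the straight-through estimator is commonly written as `z_e + stop_gradient(z_q - z_e)`, so that the decoder sees the quantized values while gradients flow back to the encoder unchanged.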

#### Diffusion model

Diffusion models are a class of generative models defined through a Markov chain over latent variables \(x_{1} \cdots x_{T}\).^{30} The main idea is that, starting from an image \(x_{0}\), the image is successively perturbed by adding Gaussian noise with increasing variance over *T* timesteps. A neural network conditioned on the noisy version of the image at timestep *t* and on the timestep itself is then trained to learn the noise distribution used to perturb the image. As a result, the data distribution \(p(x_{t-1} | x_{t})\) at timestep *t* − 1 can be inferred. When *T* is sufficiently large, we can approximate \(p(x_{T})\) by the prior distribution \({\mathcal{N}}({\mathbf{0}},{\mathbf{I}})\), sample from this distribution, and traverse the Markov chain in reverse. This allows new images to be sampled from the learned distribution \(p_{\theta}(x_{0}) := \int p_{\theta}(x_{0:T})\,dx_{1:T}\).

The neural network that models the noise is usually chosen to be a U-Net,^{32} because the noisy input and the denoised output must have the same size. To support 3D inputs, we replaced the 2D convolutions of the original U-Net architecture with 3D convolutions. Each block in the encoder part of the U-Net consisted of a convolutional layer that downsampled the image using a kernel of size 3 × 3 × 1 and therefore operated only on the high-resolution plane of the 3D volume. It was followed by a spatial and a depth-wise attention layer, as proposed by Ho et al.^{33} The spatial attention layer applies a global attention mechanism by computing key, query, and value vectors for every element on the high-resolution image plane, treating the depth dimension as an extension of the batch dimension. The resulting vectors were combined using the attention mechanism proposed by Vaswani et al.^{34} The depth attention layer follows the spatial attention layer. Here, we treated the axes of the high-resolution image plane as batch axes, so that each feature vector on the high-resolution plane could attend to the feature vectors at the same position on the other depth slices.

The decoder of the U-Net was constructed similarly, with a convolutional layer in each block followed by spatial and depth attention layers. Additionally, the image was upsampled using transposed convolutions,^{35} and a skip connection was added to the output of each block.
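The forward noising process described at the start of this section can be sketched in NumPy using the closed form of denoising diffusion probabilistic models, \(x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon\). The linear variance schedule, its endpoint values, the number of timesteps, and all function names below are assumptions for illustration; the text does not specify them.

```python
import numpy as np


def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Noise variances beta_1..beta_T (assumed values, not taken from the text)."""
    return np.linspace(beta_start, beta_end, T)


def q_sample(x0, t, alphas_cumprod, noise):
    """Draw x_t given x_0 in a single step.

    Uses x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise, where abar_t is the
    cumulative product of (1 - beta_s) for s = 1..t. The timestep t is 1-based.
    """
    abar = alphas_cumprod[t - 1]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise


T = 1000  # assumed number of timesteps
betas = linear_beta_schedule(T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8, 8))  # toy volume with values in [-1, 1]
eps = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 1, alphas_cumprod, eps)  # almost identical to x0
x_late = q_sample(x0, T, alphas_cumprod, eps)   # almost pure Gaussian noise
```

During training, the network receives `x_t` and `t` and is asked to predict `eps`. Because `x_late` is nearly indistinguishable from a standard Gaussian sample, sampling can start from pure noise and run the chain in reverse.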

#### Putting it all together

In the first step, we trained a VQ-GAN model on the entire dataset to learn a meaningful low-dimensional latent representation of the data. Since the input to the diffusion model has to be normalized (i.e., lie in the range −1 to 1), we had to ensure that the latent representation of the images also fell within this range.^{30} Because the vector quantization step of the VQ-GAN model forces the learned codebook vectors to lie close to the latent feature vectors, we approximated the maximum of the unquantized feature representation by the maximum value in the learned codebook. Similarly, we approximated the minimum of the unquantized feature representation by the minimum value in the learned codebook. By performing a simple min-max normalization on the unquantized feature vectors, we thus obtained latent representations with values approximately in the range −1 to 1.

These latent representations were subsequently used to train a 3D diffusion model. By starting with noise sampled from a standard Gaussian distribution and running the reverse diffusion process, we can create a latent representation corresponding to a new image. The output of this process is quantized using the learned codebook of the previously trained VQ-GAN model and then fed to the decoder to produce the corresponding image.

All models were trained on an NVIDIA Quadro RTX 6000 with 24 GB of GPU RAM, and training took about 7 days per model. See Supplementary Table S1 for details on the training settings of each model.
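The latent normalization described above can be sketched as follows. The function names and the randomly generated codebook are hypothetical. The inverse mapping applied before quantization is an assumption about how generated latents are returned to the codebook's scale; the text does not spell this step out.

```python
import numpy as np


def normalize_latents(z_e, codebook):
    """Min-max normalize unquantized latents, using the codebook extremes as bounds."""
    lo, hi = codebook.min(), codebook.max()
    return 2.0 * (z_e - lo) / (hi - lo) - 1.0


def denormalize_latents(z_norm, codebook):
    """Assumed inverse mapping, applied to a generated latent before quantization."""
    lo, hi = codebook.min(), codebook.max()
    return (z_norm + 1.0) / 2.0 * (hi - lo) + lo


rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 8))  # toy codebook: 512 codes of dimension 8

# Toy encoder output: codebook vectors plus a small perturbation.
codes = rng.integers(0, 512, size=(4, 4, 4))
z_e = codebook[codes] + 0.01 * rng.standard_normal((4, 4, 4, 8))

z_norm = normalize_latents(z_e, codebook)       # values roughly within [-1, 1]
z_back = denormalize_latents(z_norm, codebook)  # recovers z_e
```

Because the bounds come from the codebook rather than from the latents themselves, normalized values can fall slightly outside the interval, which matches the "approximately" in the description.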

#### Self-supervised pre-training with Swin UNETR

To demonstrate the usefulness of the synthetic data generated by our model, we pre-trained a Swin UNETR^{36} model using self-supervised learning.^{23} This method relies only on the generated synthetic data and does not require ground truth labels. The self-supervised task was set up as an inpainting problem: random patches in the synthetic image were masked, and the image was then randomly flipped horizontally or vertically. The neural network was then tasked with reconstructing the missing pixels. We trained the model until convergence using a combination of the L1 distance and the multi-scale structural similarity index^{22} (MS-SSIM) as the loss function, with the AdamW^{37} optimizer, a learning rate of 1e−3, and a weight decay of 0.01.

After pre-training, we fine-tuned the model in a supervised fashion using a separate dataset of real images with manually annotated ground truth labels. A combination of cross-entropy loss and Dice loss served as the loss function, and the AdamW optimizer was used to train the model until convergence. In this supervised setup, the input images were augmented by applying flipping, affine transformations, ghosting, Gaussian noise, blurring, bias field, and gamma augmentation using the TorchIO framework.^{38}
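The construction of a training pair for the inpainting task can be sketched in NumPy. This is a minimal illustration under stated assumptions: the number and size of the masked patches, the fill value of zero, the flip probability, and all function names are hypothetical. Only the L1 term of the loss is shown; the MS-SSIM term is omitted.

```python
import numpy as np


def mask_random_patches(volume, rng, num_patches=4, patch_size=(8, 8, 4)):
    """Zero out randomly placed cuboid patches; return the masked copy and the mask."""
    mask = np.zeros(volume.shape, dtype=bool)
    for _ in range(num_patches):
        start = [rng.integers(0, dim - p + 1) for dim, p in zip(volume.shape, patch_size)]
        region = tuple(slice(s, s + p) for s, p in zip(start, patch_size))
        mask[region] = True
    masked = volume.copy()
    masked[mask] = 0.0
    return masked, mask


def make_inpainting_pair(volume, rng):
    """Mask the volume, then apply the same random flips to input, target, and mask."""
    masked, mask = mask_random_patches(volume, rng)
    target = volume
    for axis in (0, 1):  # axis 0: vertical flip, axis 1: horizontal flip (assumed layout)
        if rng.random() < 0.5:
            masked, target, mask = (np.flip(a, axis=axis) for a in (masked, target, mask))
    return masked, target, mask


def l1_loss(pred, target):
    """Mean absolute error between a reconstruction and its target."""
    return float(np.mean(np.abs(pred - target)))


rng = np.random.default_rng(0)
synthetic = rng.uniform(-1.0, 1.0, size=(32, 32, 16))  # stand-in for a generated volume

net_input, target, mask = make_inpainting_pair(synthetic, rng)
baseline = l1_loss(net_input, target)  # loss of a "network" that returns its input
```

Applying the same flips to the input and the target keeps the two aligned, so the reconstruction loss only penalizes the masked regions that the network fails to fill in.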