
Voxel-Wise Medical Image Generalization for Eliminating Distribution Shift

Published: 19 June 2024

Abstract

Currently, the medical field is witnessing an increase in the use of machine learning techniques. Supervised learning methods adopted in classification, prediction, and segmentation tasks for medical images always experience decreased performance when the training and testing datasets do not follow the independent and identically distributed assumption. These distribution shift situations seriously influence machine learning applications’ robustness, fairness, and trustworthiness in the medical domain. Hence, in this article, we adopt the CycleGAN (generative adversarial network) method to cycle train computed tomography data from different scanners/manufacturers, aiming to eliminate the distribution shift arising from diverse data terminals, based on our previous work [14]. However, due to the mode collapse problem and the generative mechanisms of GAN-based models, the images we generated contained serious artifacts. To remove the boundary marks and artifacts, we adopt score-based diffusion generative models to refine the images voxel-wise. This innovative combination of two generative models enhances the quality of the generated data while maintaining significant features. Meanwhile, we use five paired patients’ medical images to conduct the evaluation experiments, using structural similarity index measure metrics and a comparison of the segmentation model’s performance. We conclude that CycleGAN can be utilized as an efficient data augmentation technique rather than a distribution-shift-eliminating method. In contrast, the denoising diffusion model is more suitable for dealing with the distribution shift arising from different terminal modules. The limitation of generative methods applied to medical images is the difficulty in obtaining large and diverse datasets that accurately capture the complexity of biological structures and their variability. In our following research, we plan to assess the initial and generated datasets to explore more possibilities to overcome this limitation. We will also incorporate the generative methods into a federated learning architecture, which can maintain their advantages and resolve the distribution shift issue on a larger scale.

1 Introduction

Companies and institutions use big data and Artificial Intelligence (AI) to optimize processes and performance in this digital age, and rich data offers many opportunities for AI applications. In the ever-evolving healthcare landscape, data-driven decision making is pivotal in improving patient outcomes, streamlining operations, and advancing medical research. However, medical data is highly sensitive and often collected and hosted in different healthcare facilities, existing in isolation. In addition, a persistent challenge in healthcare is the dataset distribution shift problem, which can hinder machine learning models’ trustworthiness, fairness, and effectiveness. This problem arises when the data used to train a model significantly differs from the data it encounters in real-world applications, leading to decreased performance and potential bias. To combat this issue, generative methods have emerged as a promising solution, offering innovative ways to address distribution shifts and enhance the reliability of healthcare AI systems. This exploration delves into generative methods’ trustworthiness, fairness, and effectiveness in eliminating dataset distribution shifts within the radiology domain.
During cancer treatment, radiation oncologists require Magnetic Resonance Imaging (MRI) and/or Computed Tomography (CT) scans to observe a tumor’s changes, which is an essential aspect of cancer treatment. Many machine learning methods, such as segmentation and classification models, are now applied to these scans to segment tumors or predict their changes. Because treatment takes a long time, patients may be examined at different locations or on different scanners, or scans might be reconstructed from raw data using different approaches. A distribution shift may therefore emerge when machine learning methods are executed on data from different machines, impeding their performance. All of this stems from a fundamental assumption of supervised machine learning: that the training and testing datasets are independent and identically distributed (i.i.d.). When AI-based methods perform worse in deployment than during training, clinicians and patients lose trust in the algorithm’s decisions.
Machine learning methods’ performance is affected by distribution shifts and health information exchange, especially in medical imaging. In a study of MRI manufacturer shift and adaptation, Yan et al. [33] used a manufacturer-adaptation strategy based on CycleGAN to reverse the distribution shift between different data manufacturers, and they successfully improved the cross-manufacturer performance of their segmentation tool. However, if oncologists inspect the details of the CycleGAN-generated scans, they will find that the scans suffer from serious patch boundary marks, which can be classified as radiology artifacts [14]. To eliminate these effects, we turned to another generative method, the score-based diffusion model. UT Austin’s research team [12] demonstrated that it can be applied to solving inverse problems in medical imaging and to accelerating and improving MRI. The authors of the score-based generative model have also recently applied the method to MRI research and achieved better performance [24].
A significant challenge for generative methods in medical imaging is the requirement of large and diverse datasets that accurately capture the complexity of biological structures and disease variability. A further limitation is the difficulty of ensuring that the generated images are clinically accurate and safe for patient use. In this study, our key contributions are as follows:
Cycle-transformed lung cancer CT data from two reconstructions/scanners and enriched and augmented training datasets to eliminate the non-i.i.d. problem.
Examination of the synthesized data and finding the flaws of CycleGAN methods applied in the medical imaging domain.
Discovery that the score-based generative model can successfully solve the artifact problem and eliminate the distribution shift arising from the terminal modules.
Application of diffusion models to the imperfect GAN-generated images, enhancing the scans to higher fidelity while preserving significant features.
Evaluation of the performance of generative methods on the paired patient CT data to improve the safety of algorithm decisions.

2 Related Work

2.1 Fairness Transfer across Distribution Shift

The distribution shift problem seriously affects the fairness of machine learning methods. Many works propose novel methods to measure and eliminate the out-of-distribution problem. Their methods can be divided into three categories: transfer learning based, representation learning based, and generative.
For the transfer learning methods, McGill University’s work [6] points out that “deconfounding” does not correct dataset shift for predictive models, and that importance weighting in transfer learning is a simple approach to dataset shift that applies to many situations and is easy to implement. Wen’s team shows that domain shift may still exist via label distribution shift at the classifier, thus deteriorating model performance. To alleviate this issue, Wen et al. [30] propose an approximate joint distribution matching scheme that exploits prediction uncertainty. Liu et al. [17] propose a deep transfer learning framework called CDAR, namely conditional distribution deep adaptation in regression.
For the representation learning methods, Tanwani [26] presents the DIRL (domain-invariant representation learning) algorithm to adapt deep models to the physical environment with a small amount of real data. Shi et al. [21] propose an inter-domain gradient matching objective that targets domain generalization by maximizing the inner product between gradients from different domains. Liu et al. [16] propose a domain generalization approach to learn on several labeled source domains and transfer knowledge to a target domain inaccessible during training. The work of Zhang et al. [34–36] related to reachable distance can provide a solid basis for representation learning.
The generative methods are the significant methods we adopt in our work and will be introduced in detail in the next section.
These domain generalization and transfer learning methods can deal with the distribution shift problem to a certain degree. Meanwhile, diagnosing and mitigating changes in model fairness under distribution shift is also essential to safely deploying machine learning in healthcare settings. The following works emphasize fine-tuning and causal relations for improving the robustness of models when confronting distribution shift problems. Schrouff et al. [20] adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts, and they show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. Wortsman et al. [31] address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Djolonga et al. [5] study, for the first time, the interplay between out-of-distribution and transfer performance of modern image classification CNNs, investigating the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. Coston et al. [4] address an additional challenge beyond fairness: unsupervised domain adaptation under covariate shift between a source and target distribution.

2.2 Generative Methods

Unlike discriminative models, generative models are trained to learn the underlying distribution of the training data instead of learning the mapping function between observation and labels. Therefore, generative methods can use the learned distribution to generate new data similar to the original data. It can be considered an effective method to eliminate the distribution shift problem. Unlike transfer learning and domain adaptation methods, generative methods can rely on known invariances to implement data transformations, resampling, and data augmentation [1].
There are several types of generative models, including Generative Adversarial Networks (GANs) [8], variational autoencoders, flow-based models, and diffusion generative models. GANs are a type of neural network that consists of two parts: a generator and a discriminator. The generator learns to create data resembling the original, whereas the discriminator distinguishes real data from the generated data in a min-max adversarial training setup. Variational autoencoders [13] generate new data by sampling from a learned latent space, often following a normal distribution. They excel in image and video synthesis. Flow-based models [19] transform a simple base distribution into a more complex distribution resembling the source data and are mostly used in natural language processing for text generation. Diffusion generative models [9], also known as diffusion probabilistic models, are generative models that have gained popularity in recent years for their ability to generate high-quality images and videos.
Since our data are medical images, our study focuses on GANs and diffusion generative methods. CycleGAN [37] is an effective image-to-image translation model based on the GAN framework. It has also been applied in medical imaging and achieves competitive performance in the transformation between different data terminals (e.g., [7, 33]). Image reconstruction techniques have been developed to overcome the challenges of low signal-to-noise and contrast-to-noise ratios. They also improve the quality of images for better visual interpretation, understanding, and analysis. The state-of-the-art medical imaging technique is the denoising diffusion generative model. In this research direction, several works [3, 23–25, 32] have proposed practical improvements to Denoising Diffusion Probabilistic Models (DDPMs) for medical image reconstruction. However, more real-world radiology patient data are needed to expand generative capacity. The DDPM is a special form of autoencoder. To be more concrete, an autoencoder is what captures perceptual compression: the encoder in an autoencoder projects high-dimensional data to a latent space, whereas in the DDPM, the encoder projects an image, by adding minor noise at each step, to a total noise distribution. Then, the decoder recovers the image from the noise, which plays the role of the latent space.

3 Methods

The methods adopted in our work are divided into two steps, as illustrated in Figure 1. The first step adopts cycle training to generate coarse scans with patch boundaries; the second utilizes an effective score-based diffusion model to eliminate the artifacts and enhance the scans to a higher fidelity level. Finally, we evaluate our models’ generative performance with metrics and segmentation model tests.
Fig. 1. Generative methods under the federated learning framework.

3.1 CycleGAN

CycleGAN (cycle-consistent adversarial network) [37] is an unsupervised learning model that leverages deep convolutional neural networks to perform image-to-image translation without paired data. It is built on the GAN architecture and, for the purpose of cycle translation, implements two generators and two discriminators in its framework. It employs a cycle-consistency loss to ensure the translated image maintains fidelity to the original content and style. With these components, CycleGAN facilitates transformation between two domains, such as turning summer landscapes into winter landscapes, without explicitly matched pairs.
GANs operate on the principle of adversarial training, in which two neural networks, the generator and the discriminator, engage in a competitive process. The generator fabricates data to deceive the discriminator, while the discriminator learns to distinguish between real and generated data, so that each continually improves the other’s performance in a zero-sum game. In addition to the generator’s and discriminator’s losses, CycleGAN also involves a cycle-consistency loss.
During the CycleGAN training process [37], the objective is to learn the mapping functions between two distinct domains: Terminal X and Terminal Y. Even though the model can still produce subpar synthesized results, the cycle-consistency loss acts as an indirect structural similarity constraint between input and synthesized images. The key to GANs is the adversarial loss combining discriminator and generator. Two discriminator networks, denoted as \(D_X\) and \(D_Y\), are employed in the model training phase. The two mapping functions are \(G: X \rightarrow Y\) and \(F: Y \rightarrow X\). For each image x from domain X, the purpose of image translation is to transfer x by G, then through F to bring \(G(x)\) back to the original, as in the process \(x \rightarrow G(x) \rightarrow F(G(x)) \approx x\). Similar to the preceding forward process, each image y from domain Y has a backward process, \(y \rightarrow F(y) \rightarrow G(F(y)) \approx y\). The loss function combines forward and backward cycle-consistency losses, as in Equation (1), based on the L1 norm distance between the generated image and the original image.
\begin{equation} {\it L}_{Cycle}(G,F)=\mathbf {E}_{x\sim p_{data}(x)}\left[\left\Vert F(G(x))-x \right\Vert _{1} \right] +\mathbf {E}_{y\sim p_{data}(y)}\left[\left\Vert G(F(y))-y \right\Vert _{1} \right] \end{equation}
(1)
In addition, the model’s objective includes adversarial loss, as illustrated in Equation (2).
\begin{equation} {\it L}_{GAN}(G,D_Y,X,Y)=\mathbf {E}_{y\sim p_{data}(y) }\left[ \log D_Y(y)\right] + \mathbf {E}_{x\sim p_{data}(x) }\left[ \log (1-D_Y(G(x)))\right] \end{equation}
(2)
Generator G aims to generate images \(G(x)\) and makes it look like an image from domain Y; however, the purpose of \(D_Y\) is to distinguish between \(G(x)\) and y. Major insight into the GANs reveals that G tries to minimize Equation (2), whereas \(D_Y\) aims to maximize the loss. The same situation happens in the reverse cycle. Hence, the total objective function is shown in Equation (3). The \(\lambda\) denotes a factor hyperparameter, which represents the relative importance of the two objectives. In the default implementation, \(\lambda = 10\).
\begin{equation} {\bf L} = \min _G\max _{D_Y}{\it L}_{GAN}(G,D_Y,X,Y) + \min _F\max _{D_X}{\it L}_{GAN}(F,D_X,Y,X) + \lambda {\it L}_{Cycle}(G,F) \end{equation}
(3)
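
To make Equations (1) through (3) concrete, the following PyTorch sketch computes the loss terms for one training batch. It is a minimal illustration under our own assumptions: G, F_gen, D_X, and D_Y stand for user-defined generator and discriminator networks, and the discriminators are assumed to output probabilities in (0, 1); it is not the exact implementation used in our experiments.

import torch
import torch.nn.functional as nnf

def cyclegan_objective(G, F_gen, D_X, D_Y, real_x, real_y, lam=10.0):
    fake_y = G(real_x)            # x -> G(x)
    fake_x = F_gen(real_y)        # y -> F(y)
    rec_x = F_gen(fake_y)         # F(G(x)), should approximate x
    rec_y = G(fake_x)             # G(F(y)), should approximate y

    # Cycle-consistency loss, Equation (1): L1 reconstruction distance.
    l_cycle = nnf.l1_loss(rec_x, real_x) + nnf.l1_loss(rec_y, real_y)

    # Adversarial losses, Equation (2), for both mapping directions.
    eps = 1e-8  # numerical stability inside the logarithms
    l_gan_fwd = (torch.log(D_Y(real_y) + eps).mean()
                 + torch.log(1.0 - D_Y(fake_y) + eps).mean())
    l_gan_bwd = (torch.log(D_X(real_x) + eps).mean()
                 + torch.log(1.0 - D_X(fake_x) + eps).mean())

    # Total objective, Equation (3). In practice, training alternates:
    # the discriminators take gradient steps to maximize the GAN terms,
    # whereas the generators take steps to minimize the whole expression.
    return l_gan_fwd + l_gan_bwd + lam * l_cycle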

3.2 Score-Based Denoising Diffusion Model

Diffusion models [22] represent another successful generative modeling method without the disadvantages of GANs. Score-based generative modeling [10, 23, 25] uses score functions to build the generative model and is in fact another formulation of the diffusion model. In the work of Chung and Ye [3], a framework based on the score-based diffusion method is proposed for the task of MRI reconstruction.
The score function \(\nabla _{x}\log p(x)\) [23, 25] refers to the gradient of the log density, where \(p(x)\) is the probability density function. It is intractable to calculate this score function directly. However, a network \(s_{\theta }({x})\) can be trained to estimate \(\nabla _{x} \log p(x)\) even without knowing \(p(x)\) explicitly [23, 25].
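
In practice, \(s_{\theta }\) is commonly trained with denoising score matching [23, 25]: the data are perturbed with Gaussian noise, for which the conditional score is known in closed form. The sketch below shows this objective in PyTorch under our own assumptions; in particular, the score_net(x, sigma) signature, which takes the noisy image and the noise level, is illustrative.

import torch

def denoising_score_matching_loss(score_net, x, sigma):
    # Perturb the data with Gaussian noise of scale sigma.
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    # For Gaussian perturbation, the conditional score has a closed form:
    # grad log p(x_tilde | x) = -(x_tilde - x) / sigma^2.
    target = -noise / sigma ** 2
    score = score_net(x_tilde, sigma)
    # Weighting by sigma^2 keeps losses comparable across noise scales.
    sq_err = ((score - target) ** 2).sum(dim=tuple(range(1, x.dim())))
    return 0.5 * (sigma ** 2) * sq_err.mean()
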
The whole process can be divided into two parts. The forward diffusion process can be modelled as the solution to the following Stochastic Differential Equation (SDE) [25]:
\begin{equation} dx = f(x,t)dt + g(t)dw, \end{equation}
(4)
where \(\left\lbrace {x}(t) \right\rbrace _{t=0}^{T}\) denotes the diffusion process with \(t \in [0, T]\), w is an n-dimensional Wiener process, f: \({R}^{n}\) \(\rightarrow {R}^{n}\) represents the drift coefficient of \(x(t)\), and g: \({R}\rightarrow {R}\) is the diffusion coefficient of \(x(t)\). The reverse process of Equation (4) can be constructed with a reverse-time SDE [25]:
\begin{equation} d{x} = [{f}({x},t) - g(t)^2 \nabla _{{x}} \log p({x})]dt + g(t)d\bar{w}, \end{equation}
(5)
where \(\bar{w}\) is an n-dimensional Wiener process running from T to 0, and dt is an infinitesimal negative time step. Many methods [10, 23] can be used to train the score network \(s_{\theta }({x})\) to approximate the score function \(\nabla _{{x}}\log p({x})\). The framework of Chung and Ye [3] adopts the training method proposed in the work of Song et al. [25] for the task of MRI reconstruction. Based on the preceding methods, the model we utilize in this work will be abbreviated as DDPM.
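
To illustrate how samples are drawn from Equation (5), the following sketch integrates the reverse-time SDE with the Euler-Maruyama scheme for the variance-exploding SDE of Song et al. [25], where \(f(x,t)=0\) and \(\sigma (t)=\sigma _{min}(\sigma _{max}/\sigma _{min})^{t}\). The score_net interface and the default noise scales are our own assumptions, not the exact configuration used in our experiments.

import math
import torch

def reverse_sde_sample(score_net, shape, n_steps=500,
                       sigma_min=0.01, sigma_max=50.0, device="cpu"):
    # Start from the terminal distribution at t = T = 1 (pure noise).
    x = torch.randn(shape, device=device) * sigma_max
    dt = 1.0 / n_steps
    log_ratio = math.log(sigma_max / sigma_min)
    for i in range(n_steps):
        t = 1.0 - i * dt                       # integrate from T down to 0
        sigma = sigma_min * (sigma_max / sigma_min) ** t
        g2 = 2.0 * log_ratio * sigma ** 2      # g(t)^2 for the VE-SDE
        t_batch = torch.full((shape[0],), t, device=device)
        score = score_net(x, t_batch)          # approximates the score in Eq. (5)
        x = x + g2 * score * dt                # reverse drift term
        if i < n_steps - 1:                    # diffusion term, skipped at the end
            x = x + math.sqrt(g2 * dt) * torch.randn_like(x)
    return x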

4 Experiments

In our CycleGAN and diffusion model experiments, we have one training CT dataset covering two different reconstruction processes. There are 137 voxel-wise images reconstructed with Iterative Model Reconstruction (IMR) and 76 images from YA-iDose (Intelligent Dose reconstruction with YA kernel). All images were acquired on the same scanner (Philips IQon Spectral CT) but with two different image reconstructions (IMR and YA). These data are captured from unpaired patients and anonymized. In addition, we have five paired CT scans from the same patients, reconstructed with the two different approaches IMR and YA-iDose, to evaluate all experiments. Computation was performed on an NVIDIA RTX A4000 GPU.
The size of the generated image is [219,219,187], with one slice captured and displayed every seventh scan interval. Figure 2(a) displays the original scans from the YA terminal; Figure 2(b) displays the generated scans from the generator (YA-IMR); and Figure 2(c) represents the generated scans from the generator (IMR-YA), with Figure 2(b) serving as the input.
Fig. 2. Generated scans from CycleGAN.
For the evaluation phase, we first used the t-SNE (t-distributed stochastic neighbor embedding) method to distinguish the difference between all of these voxel-wise data. However, because the voxel-wise medical images’ homogeneity is relatively high, the t-SNE results are too sparse to prove that there is a distribution shift. Moreover, since our experiments are based on unsupervised learning, we do not have enough ground truth to evaluate the generative performance. Hence, we adopt two other methods to evaluate the similarity of the original and generated data in all experiments.
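
For reference, a minimal version of this t-SNE check can be written as below. It is a sketch under our own assumptions: volumes are first downsampled to a common shape, each volume is flattened into one feature vector, and scikit-learn’s TSNE embeds the vectors into 2D for visual inspection of the two terminals.

import numpy as np
from sklearn.manifold import TSNE

def tsne_embedding(volumes_imr, volumes_ya):
    # volumes_*: lists of 3D numpy arrays, downsampled to a common shape.
    feats = np.stack([v.ravel() for v in volumes_imr + volumes_ya])
    labels = np.array([0] * len(volumes_imr) + [1] * len(volumes_ya))
    emb = TSNE(n_components=2, perplexity=10, init="pca").fit_transform(feats)
    # Scatter-plot emb colored by labels; overlapping clusters indicate
    # that t-SNE cannot separate the two terminals.
    return emb, labels
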
First, the evaluation metrics are the Structural Similarity Index Measure (SSIM) [27] and the Multi-Scale Structural Similarity Index Measure (MS-SSIM) [28]. SSIM [27] measures the similarity between two images by evaluating their luminance, contrast, and structure, aiming to mimic human perception of image quality. MS-SSIM [28] extends SSIM by considering image structures across multiple scales, providing a more comprehensive assessment of perceptual image similarity by accounting for details at various levels. The resultant SSIM is a decimal value between –1 and 1, where 1 indicates perfect similarity, 0 indicates no similarity, and –1 indicates perfect anti-correlation. For an image, it is typically calculated using a sliding Gaussian window of size 11 × 11 or a block window of size 8 × 8. The window can be displaced pixel by pixel over the image to create an SSIM quality map of the image. There is a significant similarity in the SSIM measurement between the original medical images captured by the two terminals.
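
As a concrete reference, SSIM values of this kind can be reproduced with an off-the-shelf implementation such as scikit-image; the sketch below is our assumed setup, with gaussian_weights=True and sigma=1.5 corresponding to the sliding 11 × 11 Gaussian window described above.

import numpy as np
from skimage.metrics import structural_similarity

def ssim_score(img_a: np.ndarray, img_b: np.ndarray) -> float:
    # gaussian_weights=True with sigma=1.5 gives the 11 x 11 Gaussian
    # window of Wang et al. [27]; works for 2D slices and 3D volumes.
    return structural_similarity(
        img_a, img_b,
        gaussian_weights=True, sigma=1.5, use_sample_covariance=False,
        data_range=float(img_a.max() - img_a.min()),
    )
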
However, one research work from NVIDIA concludes that assessing image quality based on SSIM can lead to incorrect conclusions and using SSIM as a loss function for deep learning can guide network training in the wrong direction [18].
Hence, TotalSegmentator [29] has also been adopted as an indirect evaluation method. We use TotalSegmentator as a segmentation tool to generate volume and intensity statistics for every anatomical structure. The difference percentage is calculated using Equation (6).
\begin{equation} Diff_p = \left|\frac{Value_1-Value_2}{Value_1+Value_2}\right| \end{equation}
(6)
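
Applied to the per-organ volume statistics, Equation (6) reduces to a few lines of code. The sketch below assumes, for illustration, that the TotalSegmentator statistics have been exported to dictionaries mapping organ names to volumes; the exact export format may differ.

def diff_percentage(stats_1: dict, stats_2: dict) -> dict:
    """Equation (6): |v1 - v2| / (v1 + v2) per organ."""
    diffs = {}
    for organ, v1 in stats_1.items():
        v2 = stats_2.get(organ)
        if v2 is None or v1 + v2 == 0:
            continue  # skip organs missing or empty in one segmentation
        diffs[organ] = abs((v1 - v2) / (v1 + v2))
    return diffs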

4.1 Distribution Shift Evaluation

Distribution shift influences machine learning methods in the following aspects—performance degradation, generalization challenges, bias and fairness concerns, and model drift—and might raise other risks and uncertainties. We hypothesize that there is a distribution shift between datasets from the two different CT terminals. However, there are no effective methods to evaluate the 3D medical images’ distribution shift quantitatively. Hence, we use the two methods mentioned previously to prove that a distribution shift exists in our datasets.
First, the SSIM and MS-SSIM evaluation results are shown in Table 1. In this table, SSIM-ori and MS-SSIM-ori denote the metric values between the original paired patient images. SSIM-IMR and MS-SSIM-IMR are calculated by comparing the original IMR and synthesized IMR (from YA to IMR for the same patient) images. SSIM-YA and MS-SSIM-YA are calculated by comparing the original YA and synthesized YA (from IMR to YA for the same patient) images. Visualizations of the original and generated scans from the same patient are illustrated in Figure 3. From the SSIM-ori column in the table, there is only around a 0.5 SSIM score when comparing the same patient’s CT images from the IMR and YA reconstructions. The MS-SSIM results are, in general, better than the SSIM results because of the multiple scales that the metric considers. But for the other synthesized metric results, we can see the values are around 0, which means barely any structural similarity exists in terms of the SSIM comparison.
Table 1. Similarity Metrics Evaluation Results of CycleGAN

Patient Index | SSIM-ori | MS-SSIM-ori | SSIM-IMR | MS-SSIM-IMR | SSIM-YA | MS-SSIM-YA
p1 | 0.5044 | 0.8068 | –0.0402 | 0.0101 | –0.0084 | 0.0007
p2 | 0.4777 | 0.7986 | –0.0343 | 0.0101 | –0.0084 | 0.007
p3 | 0.4671 | 0.8097 | –0.0274 | 0.0269 | 0.0055 | 0.0181
p4 | 0.5395 | 0.8414 | 0.0423 | 0.0315 | 0.0007 | 0.0179
p5 | 0.4913 | 0.8331 | –0.0217 | 0.025 | –0.0032 | 0.0159
In addition, Figure 4 shows the difference percentage, calculated based on Equation (6), where the values come from the segmentation model’s volume results for each organ. The organ indices listed on the x-axis are mapped later in Table 3 of the appendix. From this figure, there is a significant difference (\(p \lt 0.05\)), and outliers exist at indices 7, 8, 9, 10, 11, 15, 17, 50, 61, and 63. These indices represent the pancreas, adrenal gland, parts of the lungs, esophagus, vertebra C1, left atrial appendage, and inferior vena cava. This illustrates a distribution shift between the CT outputs of the two reconstructions, which can affect the segmentation results to a certain degree.
Table 2. Similarity Metrics Evaluation Results after Denoising Diffusion

Patient Index | SSIM-ori | SSIM-IMR-be | SSIM-YA-be | SSIM-ori-di | SSIM-IMR-di | SSIM-YA-di
p1 | 0.8186 | 0.1251 | 0.1259 | 0.8226 | 0.1788 | 0.2308
p2 | 0.7565 | 0.1661 | 0.1657 | 0.7638 | 0.2535 | 0.3019
p3 | 0.6841 | 0.1622 | 0.1205 | 0.6996 | 0.2501 | 0.2545
p4 | 0.8574 | 0.2088 | 0.1766 | 0.865 | 0.3214 | 0.3172
p5 | 0.7338 | 0.1709 | 0.1311 | 0.7357 | 0.2655 | 0.2676
Fig. 3. Detailed generated scans from CycleGAN from the same layer and the same patient after the same preprocessing.
Fig. 4. Box plot of the volume difference percentage (\(Diff_p\)) between paired data from two terminals.

4.2 CycleGAN Experiments

We implemented CycleGAN to transform the 3D CT image from one reconstruction to the other, and the visualization results are illustrated in Figure 2. It is difficult to check the differences between the synthesized and original data from a human’s perspective.
When using the SSIM and MS-SSIM metrics to check the synthesized data against the original data quantitatively, we can see the differences. Table 1, from the third to the last column, shows the scores between the original IMR images and the synthesized IMR images, and the same comparison for the YA reconstruction. Although NVIDIA’s research stated that SSIM cannot be used as an exact measure of image quality similarity, in our experiment we can still make the point that there is essentially no structural similarity between the original images and the synthesized data. Even after revising the generator’s loss function by adding an SSIM loss, the results did not improve at all.
Additionally, when we use the segmentation tool on these data, the differences are obvious, as shown in Figure 5 and Figure 6. For some organ segmentations, there is significant divergence between the compared data.
Fig. 5. Box plot of the volume difference percentage (\(Diff_p\)) between IMR and synthesized IMR.
Fig. 6. Box plot of the volume difference percentage (\(Diff_p\)) between YA and synthesized YA.
Then, based on the results of Table 1, we also calculated the SSIM between the two synthesized images, and the results are \([0.7392, 0.7687, 0.7365, 0.7514, 0.783]\) for the five patients. This illustrates that the synthesized image’s module affects the SSIM values. Since the boundaries of the patches are obvious in the results, we tried to eliminate them by decreasing the patch size to \([32, 32, 32]\). The training results can be seen in Figure 7. In an ideal, well-functioning CycleGAN training process, the loss functions of the two discriminators and two generators should converge, meaning that a dynamic balance exists in the adversarial game, as illustrated in Figure 7(a). In Figure 7(b), none of the four losses converge. The model suffers from mode collapse, a common training problem of GAN-based methods. When the patch size was decreased to [32, 32, 32], the synthesized results could no longer be visualized as in Figure 2.
Fig. 7. Parameter tuning results. Left: Training losses of each model when dealing with patch size [64, 64, 32]. Right: Training losses of each model when dealing with patch size [32, 32, 32]. In the key, \(D_X\) denotes the loss function of the discriminator of X and \(D_Y\) denotes the loss function of the discriminator of Y, whereas \(G_X\) means the generator from X to Y and \(G_Y\) means the generator from Y to X.

4.3 Diffusion Experiments

From the CycleGAN evaluation results, we can see that a distribution shift exists between the two modules, even though the segmentation results do not appear very divergent. To address this issue, we applied the DDPM to refine the generated images. The refining process and corresponding experimental outcomes, as illustrated in Figure 8, indicate that the method can yield superior synthesis results when applied to unpaired scan images from the two CT reconstruction methods. To better eliminate the distribution shift, we use the denoising diffusion model to enhance the original and generated voxel-wise images and check the SSIM metrics. The evaluation results of the five paired patients are listed in Table 2.
Fig. 8. Reverse SDE of each refining slice from GAN-generated data.
The original data of the evaluation patients have different image sizes: [512, 512, 254], [512, 512, 281], [512, 512, 302], [512, 512, 341], and [512, 512, 313]. Due to our computation limit when applying diffusion models, we only evaluate the results via SSIM on the middle layer. In Table 2, the first column, “SSIM-ori,” is the comparison between the same layer of the voxel-wise CT results from the IMR and YA modules. The second column, “SSIM-IMR-be,” is the comparison between the original IMR and the synthesized IMR image before applying the diffusion model. The third column, “SSIM-YA-be,” is the comparison between the original YA and the synthesized YA image before the diffusion model is applied. The scores are higher than those in Table 1 because only one layer is compared rather than the whole voxel range. The following three columns, “SSIM-ori-di,” “SSIM-IMR-di,” and “SSIM-YA-di,” are the corresponding comparison results after applying the denoising diffusion model.

5 Results

Visualization results of the synthesized medical images from CycleGAN and the diffusion models are illustrated in Figure 2 and Figure 8. From a human observation perspective, only boundary artifacts and image resolution differences are visible in the preceding results. Without calibration methods, it is difficult to measure the dimensions of the organs or other objects. The SSIM results from the CycleGAN and diffusion models in Table 1 and Table 2 therefore provide a precise representation of the similarity between the generated images and the original images.
The results demonstrate that even though CycleGAN is good at generating fake images that merely look real, its output is still not as precise as the scans captured by physical terminals. It can be utilized as an effective data augmentation method to train machine learning models on a wider range of data and increase model robustness. However, it should not be employed to eliminate the distribution shift or to generate reliable diagnoses for patients. The diffusion model, in contrast, has more potential for dealing with the distribution shift arising from module diversity, as it can preserve the useful structural information while eliminating the influence of other factors.

6 Discussion

The majority of present approaches that deal with image-to-image translation/transformation need huge amounts of paired data from two separate modalities. This requirement severely restricts their use because paired data may not be available in some circumstances. Some initiatives to loosen this restriction have been put forth, such as CycleGANs. Specifically, CycleGANs were successful in transferring data across different terminals in our experiments. However, even though the model can still produce subpar synthesized results, the cycle-consistency loss is only an indirect structural similarity constraint between input and synthesized images. Hence, for smaller patches and rich details, the GAN-based models suffered from mode collapse due to an imbalance in the adversarial training process. In addition, the generative purpose of GAN-based methods is only to fool the discriminator; the discriminator can only roughly distinguish whether an image looks real or fake and is not trained to make the model preserve diagnostic features for the patients. Hence, CycleGAN can serve as an efficient data augmentation method rather than a direct distribution-shift-eliminating method for medical services.
In contrast, the DDPM is a trending generative model capable of reversing the imaging process. Because its training proceeds step by step on real medical images, it can enhance fidelity and resolution without wrongly generating unrelated structural information, provided no other inputs affect the results. Although it is computationally expensive and requires more time to run the reverse SDE process, its results are more reliable for medical services.
Our research aims to combine the reverse SDE process with the cycle training concept to synthesize images with greater fidelity and eliminate the distribution shift among different terminals. Thanks to the evaluation results of the five paired patients, we were able to evaluate the generated images quantitatively. Based on our experiments, we have observed that both CycleGANs and DDPMs show exemplary performance in synthesizing medical images and eliminating distribution shifts under certain conditions.

7 Conclusion and Outlook

7.1 Conclusion

Accurate and efficient medical data analysis requires high-quality imaging processes to facilitate precise diagnosis. Generative models are vital to ensure the stable performance of machine learning methods in medical image analysis, irrespective of the distribution of the input images. In this study, we employed two distinct generative models, CycleGAN and the score-based DDPM, to address this challenge. The former focuses on cycle training to generate similar distributions across different terminals, whereas the latter aims to reverse the imaging process and generate high-fidelity images by eliminating artifacts and the distribution shift between two modules. The two methods complement each other’s shortcomings, although their combined implementation results in some redundancy. As such, this work represents an initial attempt to synthesize voxel-wise medical images using generative models and visualize the generated results.
In addition, this work evaluated the generative results from two perspectives. One is SSIM evaluation to check the structural similarity of the synthesized and original data. The second is using the segmentation model to check the differences in segmentation results. The evaluation results demonstrated that CycleGAN can only be used for data augmentation, where it is beneficial for training machine learning models toward better robustness. Its mechanism is not suitable for increasing the trustworthiness and reliability of synthesized medical data, although with suitable utilization, GAN-based generative methods can raise the fairness and generalization ability of machine learning methods. The diffusion model, in contrast, has more potential to be adopted as a distribution-shift-eliminating method, especially when the distribution shift arises from module differences rather than fundamental differences. Hence, thanks to the paired patients’ evaluation experiments, we developed a deeper understanding of how and when to apply these generative methods to foster medical domain applications and increase their fairness and trustworthiness.

7.2 Outlook

In this study, we demonstrated that distribution shifts from two different CT reconstruction methods affect the segmentation results to a certain degree. These insights and evaluation results will be beneficial for our later research related to federated learning. Federated learning is a privacy-preserving distributed machine learning paradigm that supports collaborative model training on datasets distributed among multiple parties while preventing data leakage. Li et al. [15] propose a federated learning approach to collaboratively train functional MRI classification models for neurological diseases or disorders. In addition, the PHT (Personal Health Train) [2] is a novel approach aiming to establish a distributed data analytics infrastructure enabling the (re)use of distributed healthcare data. Hence, federated learning and the PHT can alleviate data owners’ security concerns because the owners retain control of their own data.
Generative methods and federated learning hold immense promise for enhancing the trustworthiness of AI applications in the medical domain. Combining these two approaches aims to improve the reliability, privacy, and fairness of AI systems. Generative methods can efficiently and effectively eliminate the distribution shift problem and augment the data sources. Federated learning has emerged as a powerful approach for training AI across decentralized data sources while maintaining data privacy and security.
Hence, in our next step, relying on the NFDI4Health project [11], we will utilize its infrastructure to implement model sharing between different terminals. Further research on the generated images from physician and algorithm perspectives will be carried out to increase the fairness and trustworthiness of generative methods’ application in the medical domain.

Appendix

Table 3 lists all classes, mapping the first 24 indices to anatomy names. More index information can be found in the TotalSegmentator GitHub [29].
Table 3. Anatomy Index Table

Index | Anatomy name
1 | spleen
2 | kidney right
3 | kidney left
4 | gallbladder
5 | liver
6 | stomach
7 | pancreas
8 | adrenal gland right
9 | adrenal gland left
10 | lung upper lobe left
11 | lung lower lobe left
12 | lung upper lobe right
13 | lung middle lobe right
14 | lung lower lobe right
15 | oesophagus
16 | trachea
17 | thyroid gland
18 | small bowel
19 | duodenum
20 | colon
21 | urinary bladder
22 | prostate
23 | kidney cyst left
24 | kidney cyst right

References

[1]
Antreas Antoniou, Amos Storkey, and Harrison Edwards. 2018. Data augmentation generative adversarial networks. arXiv:1711.04340 [stat.ML] (2018).
[2]
Oya Beyan, Ananya Choudhury, Johan Soest, Oliver Kohlbacher, Lukas Zimmermann, Holger Stenzhorn, Md. Karim, Michel Dumontier, Stefan Decker, Luiz Olavo Bonino da Silva Santos, and André Dekker. 2019. Distributed analytics on sensitive medical data: The personal health train. Data Intelligence 2 (2019), 96–107.
[3]
Hyungjin Chung and Jong Chul Ye. 2021. Score-based diffusion models for accelerated MRI. arXiv:2110.05243 (2021).
[4]
Amanda Coston, Karthikeyan Natesan Ramamurthy, Dennis Wei, Kush R. Varshney, Skyler Speakman, Zairah Mustahsan, and Supriyo Chakraborty. 2019. Fair transfer learning with missing protected attributes. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’19). ACM, 91–98.
[5]
Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D’Amour, Dan Moldovan, Sylvain Gelly, Neil Houlsby, Xiaohua Zhai, and Mario Lucic. 2021. On robustness and transferability of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’21). IEEE, 16458–16468.
[6]
Jérôme Dockès, Gaël Varoquaux, and Jean-Baptiste Poline. 2021. Preventing dataset shift from breaking machine-learning biomarkers. GigaScience 10, 9 (2021), giab055.
[7]
Yunhao Ge, Dongming Wei, Zhong Xue, Qian Wang, Xiang Zhou, Yiqiang Zhan, and Shu Liao. 2019. Unpaired Mr to CT synthesis with explicit structural constrained adversarial learning. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI ’19). IEEE, 1096–1099.
[8]
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks. arXiv:1406.2661 [stat.ML] (2014).
[9]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. CoRR abs/2006.11239 (2020). https://arxiv.org/abs/2006.11239
[10]
Aapo Hyvärinen and Peter Dayan. 2005. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, 4 (2005), 695–709.
[11]
Mehrshad Jaberansary, Macedo Maia, Yeliz Ucer Yediel, Oya Beyan, and Toralf Kirsten. 2023. Analyzing distributed medical data in FAIR data spaces. In Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion). ACM, 1480–1484.
[12]
Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G. Dimakis, and Jonathan I. Tamir. 2021. Robust compressed sensing MRI with deep generative priors. CoRR abs/2108.01368 (2021). https://arxiv.org/abs/2108.01368
[13]
Diederik P. Kingma and Max Welling. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML] (2022).
[14]
Feifei Li, Mirjam Schöneck, Oya Beyan, and Liliana Lourenco Caldeira. 2023. Voxel-wise medical imaging transformation and adaption based on CycleGAN and score-based diffusion. Studies in Health Technology and Informatics 302 (May 2023), 1027–1028.
[15]
Xiaoxiao Li, Yufeng Gu, Nicha Dvornek, Lawrence H. Staib, Pamela Ventola, and James S. Duncan. 2020. Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results. Medical Image Analysis 65 (2020), 101765.
[16]
Xiaofeng Liu, Bo Hu, Linghao Jin, Xu Han, Fangxu Xing, Jinsong Ouyang, Jun Lu, Georges El Fakhri, and Jonghye Woo. 2021. Domain generalization under conditional and label shifts via variational Bayesian inference. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Zhi-Hua Zhou (Ed.). International Joint Conferences on Artificial Intelligence Organization, 881–887.
[17]
Xu Liu, Yingguang Li, Qinglu Meng, and Gengxiang Chen. 2021. Deep transfer learning for conditional shift in regression. Knowledge-Based Systems 227 (2021), 107216.
[18]
Jim Nilsson and Tomas Akenine-Möller. 2020. Understanding SSIM. arXiv:abs/2006.13846 (2020). https://api.semanticscholar.org/CorpusID:220041631
[19]
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2021. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research 22, 1 (Jan. 2021), Article 57, 64 pages.
[20]
Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdulmohsin, Eva Schnider, Krista Opsahl-Ong, Alexander Brown, Subhrajit Roy, Diana Mincu, Christina Chen, Awa Dieng, Yuan Liu, Vivek Natarajan, Alan Karthikesalingam, Katherine A. Heller, Silvia Chiappa, and Alexander D’Amour. 2022. Maintaining fairness across distribution shift: Do we have viable solutions for real-world applications? CoRR abs/2202.01034 (2022). https://arxiv.org/abs/2202.01034
[21]
Yuge Shi, Jeffrey Seely, Philip H. S. Torr, N. Siddharth, Awni Y. Hannun, Nicolas Usunier, and Gabriel Synnaeve. 2021. Gradient matching for domain generalization. CoRR abs/2104.09937 (2021). https://arxiv.org/abs/2104.09937
[22]
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning. 2256–2265.
[23]
Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32 (2019), 1–13.
[24]
Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. 2022. Solving inverse problems in medical imaging with score-based generative models. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=vaRCHVj0uGI
[25]
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020).
[26]
Ajay Kumar Tanwani. 2020. DIRL: Domain-invariant representation learning for SIM-to-real transfer. CoRR abs/2011.07589 (2020). https://arxiv.org/abs/2011.07589
[27]
Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
[28]
Z. Wang, E. P. Simoncelli, and A. C. Bovik. 2003. Multiscale structural similarity for image quality assessment. In Proceedings of the 2003 37th Asilomar Conference on Signals, Systems, and Computers, Vol. 2. 1398–1402.
[29]
Jakob Wasserthal, Hanns-Christian Breit, Manfred T. Meyer, Maurice Pradella, Daniel Hinck, Alexander W. Sauter, Tobias Heye, Daniel T. Boll, Joshy Cyriac, Shan Yang, Michael Bach, and Martin Segeroth. 2023. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence 5, 5 (2023), e230024.
[30]
Jun Wen, Nenggan Zheng, Junsong Yuan, Zhefeng Gong, and Changyou Chen. 2019. Bayesian uncertainty matching for unsupervised domain adaptation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI ’19). 3849–3855.
[31]
Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. 2022. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’22). IEEE, 7959–7971.
[32]
Yutong Xie and Quanzheng Li. 2022. Measurement-conditioned denoising diffusion probabilistic model for under-sampled medical image reconstruction. arXiv:2203.04623 (2022).
[33]
Wenjun Yan, Lu Huang, Liming Xia, Shengjia Gu, Fuhua Yan, Yuanyuan Wang, and Qian Tao. 2020. MRI manufacturer shift and adaptation: Increasing the generalizability of deep learning segmentation for MR images acquired with different scanners. Radiology: Artificial Intelligence 2, 4 (2020), e190195.
[34]
Shichao Zhang and Jiaye Li. 2023. KNN classification with one-step computation. IEEE Transactions on Knowledge and Data Engineering 35, 3 (2023), 2711–2723.
[35]
S. Zhang, J. Li, and Y. Li. 2023. Reachable distance function for KNN classification. IEEE Transactions on Knowledge and Data Engineering 35, 7 (July 2023), 7382–7396.
[36]
Shichao Zhang, Jiaye Li, Wenzhen Zhang, and Yongsong Qin. 2022. Hyper-class representation of data. Neurocomputing 503 (Sept. 2022), 200–218.
[37]
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR abs/1703.10593 (2017). http://arxiv.org/abs/1703.10593


Published In

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 7
August 2024
505 pages
EISSN: 1556-472X
DOI: 10.1145/3613689

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2024
Online AM: 25 January 2024
Accepted: 16 January 2024
Revised: 09 January 2024
Received: 15 September 2023
Published in TKDD Volume 18, Issue 7


Author Tags

  1. Data fairness
  2. machine learning
  3. CycleGAN
  4. score-based generative model
  5. model robustness

Qualifiers

  • Research-article

Funding Sources

  • Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)

