
Voxel-Wise Medical Image Generalization for Eliminating Distribution Shift

Published: 19 June 2024

Abstract

Currently, the medical field is witnessing an increase in the use of machine learning techniques. Supervised learning methods adopted in classification, prediction, and segmentation tasks for medical images always experience decreased performance when the training and testing datasets do not follow the independent and identically distributed assumption. These distribution shift situations seriously influence machine learning applications’ robustness, fairness, and trustworthiness in the medical domain. Hence, in this article, we adopt the CycleGAN (generative adversarial network) method to cycle train computed tomography data from different scanners/manufacturers, aiming to eliminate the distribution shift arising from diverse data terminals, based on our previous work [14]. However, due to the mode collapse problem and the generative mechanisms of GAN-based models, the images we generated contained serious artifacts. To remove the boundary marks and artifacts, we adopt score-based diffusion generative models to refine the images voxel-wise. This innovative combination of two generative models enhances the quality of the generated data while maintaining significant features. Meanwhile, we use five paired patients’ medical images to conduct the evaluation experiments, using structural similarity index measure metrics and a comparison of the segmentation model’s performance. We conclude that CycleGAN can be utilized as an efficient data augmentation technique rather than a distribution-shift-eliminating method. In contrast, the denoising diffusion model is more suitable for dealing with the distribution shift arising from different terminal modules. The limitation of generative methods applied to medical images is the difficulty in obtaining large and diverse datasets that accurately capture the complexity of biological structures and their variability. In our following research, we plan to assess the initial and generated datasets to explore more possibilities to overcome this limitation. We will also incorporate the generative methods into a federated learning architecture, which can maintain their advantages and resolve the distribution shift issue on a larger scale.

1 Introduction

Companies and institutions use big data and Artificial Intelligence (AI) to optimize processes and performance in this digital age, and rich data offers many opportunities for AI applications. In the ever-evolving healthcare landscape, data-driven decision making is pivotal in improving patient outcomes, streamlining operations, and advancing medical research. However, medical data is highly sensitive and often collected and hosted in different healthcare facilities, existing in isolation. In addition, a persistent challenge in healthcare is the dataset distribution shift problem, which can hinder machine learning models’ trustworthiness, fairness, and effectiveness. This problem arises when the data used to train a model significantly differs from the data it encounters in real-world applications, leading to decreased performance and potential bias. To combat this issue, generative methods have emerged as a promising solution, offering innovative ways to address distribution shifts and enhance the reliability of healthcare AI systems. This exploration delves into generative methods’ trustworthiness, fairness, and effectiveness in eliminating dataset distribution shifts within the radiology domain.
During cancer treatment, radiation oncologists require Magnetic Resonance Imaging (MRI) and/or Computed Tomography (CT) scans to observe a tumor’s changes, which is an essential aspect of cancer treatment. Many machine learning methods, such as segmentation and classification models, are now applied to these scans to segment tumors or predict their changes. Because treatment takes a long time, patients may be examined at different locations or on different scanners, or scans might be reconstructed from raw data using different approaches. A distribution shift may therefore emerge when machine learning methods are executed on data from different machines, impeding their performance. All of this stems from a fundamental assumption of supervised machine learning: that the training and testing datasets are independent and identically distributed (i.i.d.). When AI-based methods perform worse in deployment than during training, clinicians and patients lose trust in the algorithm’s decisions.
Machine learning methods’ performance is affected by distribution shifts and health information exchange, especially in medical imaging. In a study of MRI manufacturer shift and adaptation, Yan et al. [33] used a manufacturer-adaptation strategy based on CycleGAN to reverse the distribution shift between different data manufacturers, and they successfully improved the cross-manufacturer performance of their segmentation tool. However, if oncologists inspect the details of the CycleGAN-generated scans, they will find that the scans suffer from serious patch boundary marks, which can be classified as radiology artifacts [14]. To eliminate these effects, we turned to another generative method, the score-based diffusion model. UT Austin’s research team [12] demonstrated that it can be applied to solving inverse problems in medical imaging and to accelerating and improving MRI. The authors of the score-based generative model have also recently applied the method to MRI research and achieved better performance [24].
A significant challenge for generative methods in medical imaging is the requirement of large and diverse datasets that accurately capture the complexity of biological structures and disease variability. A further limitation is the difficulty of ensuring that the generated images are clinically accurate and safe for patient use. In this study, our key contributions are as follows:
Cycle-transformed lung cancer CT data from two reconstructions/scanners and enriched and augmented training datasets to eliminate the non-i.i.d. problem.
Examination of the synthesized data and finding the flaws of CycleGAN methods applied in the medical imaging domain.
Discovery that the score-based generative model can successfully solve the artifact problem and eliminate the distribution shift arising from the terminal modules.
Application of diffusion models to the imperfect GAN-generated images, enhancing the scans to higher fidelity while preserving significant features.
Evaluation of the performance of generative methods on the paired patient CT data to improve the safety of algorithm decisions.

2 Related Work

2.1 Fairness Transfer across Distribution Shift

The distribution shift problem seriously affects the fairness of machine learning methods. Many works propose novel methods to measure and eliminate the out-of-distribution problem. Their methods can be divided into three categories: transfer learning based, representation learning based, and generative.
For the transfer learning methods, McGill University’s work [6] points out that “deconfounding” does not correct dataset shift for predictive models, and that importance weighting in transfer learning is a simple approach to dataset shift that applies to many situations and is easy to implement. Wen’s team shows that domain shift may still exist via label distribution shift at the classifier, thus deteriorating model performance. To alleviate this issue, Wen et al. [30] propose an approximate joint distribution matching scheme that exploits prediction uncertainty. Liu et al. [17] propose a deep transfer learning framework called CDAR, namely conditional distribution deep adaptation in regression.
For the representation learning methods, Tanwani [26] presents the DIRL (domain-invariant representation learning) algorithm to adapt deep models to the physical environment with a small amount of real data. Shi et al. [21] propose an inter-domain gradient matching objective that targets domain generalization by maximizing the inner product between gradients from different domains. Liu et al. [16] propose a domain generalization approach to learn on several labeled source domains and transfer knowledge to a target domain inaccessible during training. The work of Zhang et al. [34–36] related to reachable distance can provide a solid basis for representation learning.
The generative methods are the significant methods we adopt in our work and will be introduced in detail in the next section.
These domain generalization and transfer learning methods can deal with the distribution shift problem to a certain degree. Meanwhile, diagnosing and mitigating changes in model fairness under distribution shift is also essential to safely deploying machine learning in healthcare settings. The following works emphasize fine-tuning and causal relations for improving the robustness of models when confronting distribution shift problems. Schrouff et al. [20] adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts, and they show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. Wortsman et al. [31] address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Djolonga et al. [5] study, for the first time, the interplay between out-of-distribution and transfer performance of modern image classification CNNs, investigating the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. Coston et al. [4] address an additional challenge beyond fairness: unsupervised domain adaptation under covariate shift between a source and target distribution.

2.2 Generative Methods

Unlike discriminative models, generative models are trained to learn the underlying distribution of the training data instead of learning the mapping function between observation and labels. Therefore, generative methods can use the learned distribution to generate new data similar to the original data. It can be considered an effective method to eliminate the distribution shift problem. Unlike transfer learning and domain adaptation methods, generative methods can rely on known invariances to implement data transformations, resampling, and data augmentation [1].
There are several types of generative models, including Generative Adversarial Networks (GANs) [8], variational autoencoders, flow-based models, and diffusion generative models. GANs are a type of neural network that consists of two parts: a generator and a discriminator. The generator learns to create data resembling the original, whereas the discriminator distinguishes real data from the generated data in a min-max adversarial training setup. Variational autoencoders [13] generate new data by sampling from a learned latent space, often following a normal distribution. They excel in image and video synthesis. Flow-based models [19] transform a simple base distribution into a more complex distribution resembling the source data and are mostly used in natural language processing for text generation. Diffusion generative models [9], also known as diffusion probabilistic models, are generative models that have gained popularity in recent years for their ability to generate high-quality images and videos.
Since our data are medical images, our study focuses on GANs and diffusion generative methods. CycleGAN [37] is an effective image-to-image translation model based on the GAN framework. It has also been applied in medical imaging and achieves competitive performance in the transformation between different data terminals (e.g., [7, 33]). Image reconstruction techniques have been developed to overcome the challenges of low signal-to-noise and contrast-to-noise ratios. They also improve the quality of images for better visual interpretation, understanding, and analysis. The state-of-the-art medical imaging technique is the denoising diffusion generative model. In this research direction, several works [3, 23–25, 32] have proposed practical improvements to Denoising Diffusion Probabilistic Models (DDPMs) for medical image reconstruction. However, more real-world radiology patient data are needed to expand generative capacity. The DDPM is a special form of autoencoder. To be more concrete, an autoencoder is what captures perceptual compression: the encoder in an autoencoder projects high-dimensional data to a latent space, whereas in the DDPM, the encoder projects an image, by adding minor noise at each step, to a total noise distribution. Then, the decoder recovers the image from the noise, which plays the role of the latent space.

3 Methods

The methods adopted in our work are divided into two steps, as illustrated in Figure 1. The first step adopts cycle training to generate coarse scans with patch boundaries; the second utilizes an effective score-based diffusion model to eliminate the artifacts and enhance the scans to a higher fidelity level. Finally, we evaluate our models’ generative performance with metrics and segmentation model tests.
Fig. 1. Generative methods under the federated learning framework.

3.1 CycleGAN

CycleGAN (cycle-consistent adversarial network) [37] is an unsupervised learning model that leverages deep convolutional neural networks to perform image-to-image translation without paired data. It is built on the GAN architecture and, for the purpose of cycle translation, implements two generators and two discriminators in its framework. It employs a cycle-consistency loss to ensure the translated image maintains fidelity to the original content and style. With these components, CycleGAN facilitates transformation between two domains, such as turning summer landscapes into winter landscapes, without explicitly matched pairs.
GANs operate on the principle of adversarial training, in which two neural networks, the generator and the discriminator, engage in a competitive process. The generator fabricates data to deceive the discriminator, while the discriminator learns to distinguish between real and generated data, so that each continually improves the other’s performance in a zero-sum game. In addition to the generator’s and discriminator’s losses, CycleGAN also involves a cycle-consistency loss.
During the CycleGAN training process [37], the objective is to learn the mapping functions between two distinct domains: Terminal X and Terminal Y. Even though the model can still produce subpar synthesized results, the cycle-consistency loss acts as an indirect structural similarity constraint between input and synthesized images. The key to GANs is the adversarial loss combining discriminator and generator. Two discriminator networks, denoted as \(D_X\) and \(D_Y\), are employed in the model training phase. The two mapping functions are \(G: X \rightarrow Y\) and \(F: Y \rightarrow X\). For each image x from domain X, the purpose of image translation is to transfer x by G, then through F to bring \(G(x)\) back to the original, as in the process \(x \rightarrow G(x) \rightarrow F(G(x)) \approx x\). Similar to the preceding forward process, each image y from domain Y has a backward process, \(y \rightarrow F(y) \rightarrow G(F(y)) \approx y\). The loss function combines forward and backward cycle-consistency losses, as in Equation (1), based on the L1 norm distance between the generated image and the original image.
\begin{equation} {\it L}_{Cycle}(G,F)=\mathbf {E}_{x\sim p_{data}(x)}\left[\left\Vert F(G(x))-x \right\Vert _{1} \right] +\mathbf {E}_{y\sim p_{data}(y)}\left[\left\Vert G(F(y))-y \right\Vert _{1} \right] \end{equation}
(1)
In addition, the model’s objective includes adversarial loss, as illustrated in Equation (2).
\begin{equation} {\it L}_{GAN}(G,D_Y,X,Y)=\mathbf {E}_{y\sim p_{data}(y) }\left[ \log D_Y(y)\right] + \mathbf {E}_{x\sim p_{data}(x) }\left[ \log (1-D_Y(G(x)))\right] \end{equation}
(2)
Generator G aims to generate images \(G(x)\) and makes it look like an image from domain Y; however, the purpose of \(D_Y\) is to distinguish between \(G(x)\) and y. Major insight into the GANs reveals that G tries to minimize Equation (2), whereas \(D_Y\) aims to maximize the loss. The same situation happens in the reverse cycle. Hence, the total objective function is shown in Equation (3). The \(\lambda\) denotes a factor hyperparameter, which represents the relative importance of the two objectives. In the default implementation, \(\lambda = 10\).
\begin{equation} {\bf L} = \min _G\max _{D_Y}{\it L}_{GAN}(G,D_Y,X,Y) + \min _F\max _{D_X}{\it L}_{GAN}(F,D_X,Y,X) + \lambda {\it L}_{Cycle}(G,F) \end{equation}
(3)
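
To make Equations (1) through (3) concrete, the following PyTorch sketch computes the loss terms for one training batch. It is a minimal illustration under our own assumptions: G, F_gen, D_X, and D_Y stand for user-defined generator and discriminator networks, and the discriminators are assumed to output probabilities in (0, 1); it is not the exact implementation used in our experiments.

import torch
import torch.nn.functional as nnf

def cyclegan_objective(G, F_gen, D_X, D_Y, real_x, real_y, lam=10.0):
    fake_y = G(real_x)            # x -> G(x)
    fake_x = F_gen(real_y)        # y -> F(y)
    rec_x = F_gen(fake_y)         # F(G(x)), should approximate x
    rec_y = G(fake_x)             # G(F(y)), should approximate y

    # Cycle-consistency loss, Equation (1): L1 reconstruction distance.
    l_cycle = nnf.l1_loss(rec_x, real_x) + nnf.l1_loss(rec_y, real_y)

    # Adversarial losses, Equation (2), for both mapping directions.
    eps = 1e-8  # numerical stability inside the logarithms
    l_gan_fwd = (torch.log(D_Y(real_y) + eps).mean()
                 + torch.log(1.0 - D_Y(fake_y) + eps).mean())
    l_gan_bwd = (torch.log(D_X(real_x) + eps).mean()
                 + torch.log(1.0 - D_X(fake_x) + eps).mean())

    # Total objective, Equation (3). In practice, training alternates:
    # the discriminators take gradient steps to maximize the GAN terms,
    # whereas the generators take steps to minimize the whole expression.
    return l_gan_fwd + l_gan_bwd + lam * l_cycle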

3.2 Score-Based Denoising Diffusion Model

Diffusion models [22] represent another successful generative modeling method without the disadvantages of GANs. Score-based generative modeling [10, 23, 25] uses score functions to build the generative model and is in fact another formulation of the diffusion model. In the work of Chung and Ye [3], a framework based on the score-based diffusion method is proposed for the task of MRI reconstruction.
The score function \(\nabla _{x}\log p(x)\) [23, 25] refers to the gradient of the log density, where \(p(x)\) is the probability density function. It is intractable to calculate this score function directly. However, a network \(s_{\theta }({x})\) can be trained to estimate \(\nabla _{x} \log p(x)\) even without knowing \(p(x)\) explicitly [23, 25].
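
In practice, \(s_{\theta }\) is commonly trained with denoising score matching [23, 25]: the data are perturbed with Gaussian noise, for which the conditional score is known in closed form. The sketch below shows this objective in PyTorch under our own assumptions; in particular, the score_net(x, sigma) signature, which takes the noisy image and the noise level, is illustrative.

import torch

def denoising_score_matching_loss(score_net, x, sigma):
    # Perturb the data with Gaussian noise of scale sigma.
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    # For Gaussian perturbation, the conditional score has a closed form:
    # grad log p(x_tilde | x) = -(x_tilde - x) / sigma^2.
    target = -noise / sigma ** 2
    score = score_net(x_tilde, sigma)
    # Weighting by sigma^2 keeps losses comparable across noise scales.
    sq_err = ((score - target) ** 2).sum(dim=tuple(range(1, x.dim())))
    return 0.5 * (sigma ** 2) * sq_err.mean()
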
The whole process can be divided into two parts. The forward diffusion process can be modelled as the solution to the following Stochastic Differential Equation (SDE) [25]:
\begin{equation} dx = f(x,t)dt + g(t)dw, \end{equation}
(4)
where \(\left\lbrace {x}(t) \right\rbrace _{t=0}^{T}\) denotes the diffusion process with \(t \in [0, T]\), w is an n-dimensional Wiener process, f: \({R}^{n}\) \(\rightarrow {R}^{n}\) represents the drift coefficient of \(x(t)\), and g: \({R}\rightarrow {R}\) is the diffusion coefficient of \(x(t)\). The reverse process of Equation (4) can be constructed with a reverse-time SDE [25]:
\begin{equation} d{x} = [{f}({x},t) - g(t)^2 \nabla _{{x}} \log p({x})]dt + g(t)d\bar{w}, \end{equation}
(5)
where \(\bar{w}\) is an n-dimensional Wiener process running from T to 0, and dt is an infinitesimal negative time step. Many methods [10, 23] can be used to train the score network \(s_{\theta }({x})\) to approximate the score function \(\nabla _{{x}}\log p({x})\). The framework of Chung and Ye [3] adopts the training method proposed in the work of Song et al. [25] for the task of MRI reconstruction. Based on the preceding methods, the model we utilize in this work will be abbreviated as DDPM.
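
To illustrate how samples are drawn from Equation (5), the following sketch integrates the reverse-time SDE with the Euler-Maruyama scheme for the variance-exploding SDE of Song et al. [25], where \(f(x,t)=0\) and \(\sigma (t)=\sigma _{min}(\sigma _{max}/\sigma _{min})^{t}\). The score_net interface and the default noise scales are our own assumptions, not the exact configuration used in our experiments.

import math
import torch

def reverse_sde_sample(score_net, shape, n_steps=500,
                       sigma_min=0.01, sigma_max=50.0, device="cpu"):
    # Start from the terminal distribution at t = T = 1 (pure noise).
    x = torch.randn(shape, device=device) * sigma_max
    dt = 1.0 / n_steps
    log_ratio = math.log(sigma_max / sigma_min)
    for i in range(n_steps):
        t = 1.0 - i * dt                       # integrate from T down to 0
        sigma = sigma_min * (sigma_max / sigma_min) ** t
        g2 = 2.0 * log_ratio * sigma ** 2      # g(t)^2 for the VE-SDE
        t_batch = torch.full((shape[0],), t, device=device)
        score = score_net(x, t_batch)          # approximates the score in Eq. (5)
        x = x + g2 * score * dt                # reverse drift term
        if i < n_steps - 1:                    # diffusion term, skipped at the end
            x = x + math.sqrt(g2 * dt) * torch.randn_like(x)
    return x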

4 Experiments

In our CycleGAN and diffusion model experiments, we have one training CT dataset covering two different reconstruction processes. There are 137 voxel-wise images reconstructed with Iterative Model Reconstruction (IMR) and 76 images from YA-iDose (Intelligent Dose reconstruction with YA kernel). All images were acquired on the same scanner (Philips IQon Spectral CT) but with two different image reconstructions (IMR and YA). These data are captured from unpaired patients and anonymized. In addition, we have five paired CT scans from the same patients, reconstructed with the two different approaches IMR and YA-iDose, to evaluate all experiments. Computation was performed on an NVIDIA RTX A4000 GPU.
The size of the generated image is [219,219,187], with one slice captured and displayed every seventh scan interval. Figure 2(a) displays the original scans from the YA terminal; Figure 2(b) displays the generated scans from the generator (YA-IMR); and Figure 2(c) represents the generated scans from the generator (IMR-YA), with Figure 2(b) serving as the input.
Fig. 2. Generated scans from CycleGAN.
For the evaluation phase, we first used the t-SNE (t-distributed stochastic neighbor embedding) method to distinguish the difference between all of these voxel-wise data. However, because the voxel-wise medical images’ homogeneity is relatively high, the t-SNE results are too sparse to prove that there is a distribution shift. Moreover, since our experiments are based on unsupervised learning, we do not have enough ground truth to evaluate the generative performance. Hence, we adopt two other methods to evaluate the similarity of the original and generated data in all experiments.
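
For reference, a minimal version of this t-SNE check can be written as below. It is a sketch under our own assumptions: volumes are first downsampled to a common shape, each volume is flattened into one feature vector, and scikit-learn’s TSNE embeds the vectors into 2D for visual inspection of the two terminals.

import numpy as np
from sklearn.manifold import TSNE

def tsne_embedding(volumes_imr, volumes_ya):
    # volumes_*: lists of 3D numpy arrays, downsampled to a common shape.
    feats = np.stack([v.ravel() for v in volumes_imr + volumes_ya])
    labels = np.array([0] * len(volumes_imr) + [1] * len(volumes_ya))
    emb = TSNE(n_components=2, perplexity=10, init="pca").fit_transform(feats)
    # Scatter-plot emb colored by labels; overlapping clusters indicate
    # that t-SNE cannot separate the two terminals.
    return emb, labels
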
First, the evaluation metrics are the Structural Similarity Index Measure (SSIM) [27] and the Multi-Scale Structural Similarity Index Measure (MS-SSIM) [28]. SSIM [27] measures the similarity between two images by evaluating their luminance, contrast, and structure, aiming to mimic human perception of image quality. MS-SSIM [28] extends SSIM by considering image structures across multiple scales, providing a more comprehensive assessment of perceptual image similarity by accounting for details at various levels. The resultant SSIM is a decimal value between –1 and 1, where 1 indicates perfect similarity, 0 indicates no similarity, and –1 indicates perfect anti-correlation. For an image, it is typically calculated using a sliding Gaussian window of size 11 × 11 or a block window of size 8 × 8. The window can be displaced pixel by pixel over the image to create an SSIM quality map of the image. There is a significant similarity in the SSIM measurement between the original medical images captured by the two terminals.
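
As a concrete reference, SSIM values of this kind can be reproduced with an off-the-shelf implementation such as scikit-image; the sketch below is our assumed setup, with gaussian_weights=True and sigma=1.5 corresponding to the sliding 11 × 11 Gaussian window described above.

import numpy as np
from skimage.metrics import structural_similarity

def ssim_score(img_a: np.ndarray, img_b: np.ndarray) -> float:
    # gaussian_weights=True with sigma=1.5 gives the 11 x 11 Gaussian
    # window of Wang et al. [27]; works for 2D slices and 3D volumes.
    return structural_similarity(
        img_a, img_b,
        gaussian_weights=True, sigma=1.5, use_sample_covariance=False,
        data_range=float(img_a.max() - img_a.min()),
    )
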
However, one research work from NVIDIA concludes that assessing image quality based on SSIM can lead to incorrect conclusions and using SSIM as a loss function for deep learning can guide network training in the wrong direction [18].
Hence, TotalSegmentator [29] has also been adopted as an indirect evaluation method. We use TotalSegmentator as a segmentation tool to generate volume and intensity statistics for every anatomical structure. The difference percentage is calculated using Equation (6).
\begin{equation} Diff_p = \left|\frac{Value_1-Value_2}{Value_1+Value_2}\right| \end{equation}
(6)
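
Applied to the per-organ volume statistics, Equation (6) reduces to a few lines of code. The sketch below assumes, for illustration, that the TotalSegmentator statistics have been exported to dictionaries mapping organ names to volumes; the exact export format may differ.

def diff_percentage(stats_1: dict, stats_2: dict) -> dict:
    """Equation (6): |v1 - v2| / (v1 + v2) per organ."""
    diffs = {}
    for organ, v1 in stats_1.items():
        v2 = stats_2.get(organ)
        if v2 is None or v1 + v2 == 0:
            continue  # skip organs missing or empty in one segmentation
        diffs[organ] = abs((v1 - v2) / (v1 + v2))
    return diffs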

4.1 Distribution Shift Evaluation

Distribution shift influences machine learning methods in the following aspects—performance degradation, generalization challenges, bias and fairness concerns, and model drift—and might raise other risks and uncertainties. We hypothesize that there is a distribution shift between datasets from the two different CT terminals. However, there are no effective methods to evaluate the 3D medical images’ distribution shift quantitatively. Hence, we use the two methods mentioned previously to prove that a distribution shift exists in our datasets.
First, the SSIM and MS-SSIM evaluation results are shown in Table 1. In this table, SSIM-ori and MS-SSIM-ori denote the metric values between the original paired patient images. SSIM-IMR and MS-SSIM-IMR are calculated by comparing the original IMR and synthesized IMR (from YA to IMR for the same patient) images. SSIM-YA and MS-SSIM-YA are calculated by comparing the original YA and synthesized YA (from IMR to YA for the same patient) images. Visualizations of the original and generated scans from the same patient are illustrated in Figure 3. From the SSIM-ori column in the table, there is only around a 0.5 SSIM score when comparing the same patient’s CT images from the IMR and YA reconstructions. The MS-SSIM results are, in general, better than the SSIM results because of the multiple scales that the metric considers. But for the other synthesized metric results, we can see the values are around 0, which means barely any structural similarity exists in terms of the SSIM comparison.
Table 1. Similarity Metrics Evaluation Results of CycleGAN

Patient Index | SSIM-ori | MS-SSIM-ori | SSIM-IMR | MS-SSIM-IMR | SSIM-YA | MS-SSIM-YA
p1 | 0.5044 | 0.8068 | –0.0402 | 0.0101 | –0.0084 | 0.0007
p2 | 0.4777 | 0.7986 | –0.0343 | 0.0101 | –0.0084 | 0.007
p3 | 0.4671 | 0.8097 | –0.0274 | 0.0269 | 0.0055 | 0.0181
p4 | 0.5395 | 0.8414 | 0.0423 | 0.0315 | 0.0007 | 0.0179
p5 | 0.4913 | 0.8331 | –0.0217 | 0.025 | –0.0032 | 0.0159
In addition, Figure 4 shows the difference percentage, calculated based on Equation (6), where the values come from the segmentation model’s volume results for each organ. The organ indices listed on the x-axis are mapped later in Table 3 of the appendix. From this figure, there is a significant difference (\(p \lt 0.05\)), and outliers exist at indices 7, 8, 9, 10, 11, 15, 17, 50, 61, and 63. These indices represent the pancreas, adrenal gland, parts of the lungs, esophagus, vertebra C1, left atrial appendage, and inferior vena cava. This illustrates a distribution shift between the CT outputs of the two reconstructions, which can affect the segmentation results to a certain degree.
Table 2. Similarity Metrics Evaluation Results after Denoising Diffusion

Patient Index | SSIM-ori | SSIM-IMR-be | SSIM-YA-be | SSIM-ori-di | SSIM-IMR-di | SSIM-YA-di
p1 | 0.8186 | 0.1251 | 0.1259 | 0.8226 | 0.1788 | 0.2308
p2 | 0.7565 | 0.1661 | 0.1657 | 0.7638 | 0.2535 | 0.3019
p3 | 0.6841 | 0.1622 | 0.1205 | 0.6996 | 0.2501 | 0.2545
p4 | 0.8574 | 0.2088 | 0.1766 | 0.865 | 0.3214 | 0.3172
p5 | 0.7338 | 0.1709 | 0.1311 | 0.7357 | 0.2655 | 0.2676
Fig. 3. Detailed generated scans from CycleGAN from the same layer and the same patient after the same preprocessing.
Fig. 4. Box plot of the volume difference percentage (\(Diff_p\)) between paired data from two terminals.

4.2 CycleGAN Experiments

We implemented CycleGAN to transform the 3D CT image from one reconstruction to the other, and the visualization results are illustrated in Figure 2. It is difficult to check the differences between the synthesized and original data from a human’s perspective.
When using the SSIM and MS-SSIM metrics to check the synthesized data against the original data quantitatively, we can see the differences. Table 1, from the third to the last column, shows the scores between the original IMR images and the synthesized IMR images, and the same comparison for the YA reconstruction. Although NVIDIA’s research stated that SSIM cannot be used as an exact measure of image quality similarity, in our experiment we can still make the point that there is essentially no structural similarity between the original images and the synthesized data. Even after revising the generator’s loss function by adding an SSIM loss, the results did not improve at all.
Additionally, when we use the segmentation tool on these data, the differences are obvious, as shown in Figure 5 and Figure 6. For some organ segmentations, there is significant divergence between the compared data.
Fig. 5. Box plot of the volume difference percentage (\(Diff_p\)) between IMR and synthesized IMR.
Fig. 6. Box plot of the volume difference percentage (\(Diff_p\)) between YA and synthesized YA.
Then, based on the results of Table 1, we also calculated the SSIM between the two synthesized images, and the results are \([0.7392, 0.7687, 0.7365, 0.7514, 0.783]\) for the five patients. This illustrates that the synthesized image’s module affects the SSIM values. Since the boundaries of the patches are obvious in the results, we tried to eliminate them by decreasing the patch size to \([32, 32, 32]\). The training results can be seen in Figure 7. In an ideal, well-functioning CycleGAN training process, the loss functions of the two discriminators and two generators should converge, meaning that a dynamic balance exists in the adversarial game, as illustrated in Figure 7(a). In Figure 7(b), none of the four losses converge. The model suffers from mode collapse, a common training problem of GAN-based methods. When the patch size was decreased to [32, 32, 32], the synthesized results could no longer be visualized as in Figure 2.
Fig. 7. Parameter tuning results. Left: Training losses of each model when dealing with patch size [64, 64, 32]. Right: Training losses of each model when dealing with patch size [32, 32, 32]. In the key, \(D_X\) denotes the loss function of the discriminator of X and \(D_Y\) denotes the loss function of the discriminator of Y, whereas \(G_X\) means the generator from X to Y and \(G_Y\) means the generator from Y to X.

4.3 Diffusion Experiments

From the CycleGAN evaluation results, we can see that a distribution shift exists between the two modules, even though the segmentation results do not appear very divergent. To address this issue, we applied the DDPM to refine the generated images. The refining process and corresponding experimental outcomes, as illustrated in Figure 8, indicate that the method can yield superior synthesis results when applied to unpaired scan images from the two CT reconstruction methods. To better eliminate the distribution shift, we use the denoising diffusion model to enhance the original and generated voxel-wise images and check the SSIM metrics. The evaluation results of the five paired patients are listed in Table 2.
Fig. 8. Reverse SDE of each refining slice from GAN-generated data.
The original data of the evaluation patients have different image sizes: [512, 512, 254], [512, 512, 281], [512, 512, 302], [512, 512, 341], and [512, 512, 313]. Due to our computation limit when applying diffusion models, we only evaluate the results via SSIM on the middle layer. In Table 2, the first column, “SSIM-ori,” is the comparison between the same layer of the voxel-wise CT results from the IMR and YA modules. The second column, “SSIM-IMR-be,” is the comparison between the original IMR and the synthesized IMR image before applying the diffusion model. The third column, “SSIM-YA-be,” is the comparison between the original YA and the synthesized YA image before the diffusion model is applied. The scores are higher than those in Table 1 because only one layer is compared rather than the whole voxel range. The following three columns, “SSIM-ori-di,” “SSIM-IMR-di,” and “SSIM-YA-di,” are the corresponding comparison results after applying the denoising diffusion model.

5 Results

Visualization results of the synthesized medical images from CycleGAN and the diffusion models are illustrated in Figure 2 and Figure 8. From a human observation perspective, only boundary artifacts and image resolution differences are visible in the preceding results. Without calibration methods, it is difficult to measure the dimensions of the organs or other objects. The SSIM results from the CycleGAN and diffusion models in Table 1 and Table 2 therefore provide a precise representation of the similarity between the generated images and the original images.
The results demonstrate that even though CycleGAN is good at generating fake images that merely look real, its output is still not as precise as the scans captured by physical terminals. It can be utilized as an effective data augmentation method to train machine learning models on a wider range of data and increase model robustness. However, it should not be employed to eliminate the distribution shift or to generate reliable diagnoses for patients. The diffusion model, in contrast, has more potential for dealing with the distribution shift arising from module diversity, as it can preserve the useful structural information while eliminating the influence of other factors.

6 Discussion

The majority of present approaches that deal with image-to-image translation/transformation need huge amounts of paired data from two separate modalities. This requirement severely restricts their use because paired data may not be available in some circumstances. Some initiatives to loosen this restriction have been put forth, such as CycleGANs. Specifically, CycleGANs were successful in transferring data across different terminals in our experiments. However, even though the model can still produce subpar synthesized results, the cycle-consistency loss is only an indirect structural similarity constraint between input and synthesized images. Hence, for smaller patches and rich details, the GAN-based models suffered from mode collapse due to an imbalance in the adversarial training process. In addition, the generative purpose of GAN-based methods is only to fool the discriminator; the discriminator can only roughly distinguish whether an image looks real or fake and is not trained to make the model preserve diagnostic features for the patients. Hence, CycleGAN can serve as an efficient data augmentation method rather than a direct distribution-shift-eliminating method for medical services.
In contrast, the DDPM is a trending generative model capable of reversing the imaging process. Because its training proceeds step by step on real medical images, it can enhance fidelity and resolution without wrongly generating unrelated structural information, provided no other inputs affect the results. Although it is computationally expensive and requires more time to run the reverse SDE process, its results are more reliable for medical services.
Our research aims to combine the reverse SDE process with the cycle training concept to synthesize images with greater fidelity and eliminate the distribution shift among different terminals. Thanks to the evaluation results of the five paired patients, we were able to evaluate the generated images quantitatively. Based on our experiments, we have observed that both CycleGANs and DDPMs show exemplary performance in synthesizing medical images and eliminating distribution shifts under certain conditions.

7 Conclusion and Outlook

7.1 Conclusion

Accurate and efficient medical data analysis requires high-quality imaging processes to facilitate precise diagnosis. Generative models are vital to ensure the stable performance of machine learning methods in medical image analysis, irrespective of the distribution of the input images. In this study, we employed two distinct generative models, CycleGAN and the score-based DDPM, to address this challenge. The former focuses on cycle training to generate similar distributions across different terminals, whereas the latter aims to reverse the imaging process and generate high-fidelity images by eliminating artifacts and the distribution shift between two modules. The two methods complement each other’s shortcomings, although their combined implementation results in some redundancy. As such, this work represents an initial attempt to synthesize voxel-wise medical images using generative models and visualize the generated results.
In addition, this work evaluated the generative results from two perspectives. One is SSIM evaluation to check the structural similarity of the synthesized and original data. The second is using the segmentation model to check the differences in segmentation results. The evaluation results demonstrated that CycleGAN can only be used for data augmentation, where it is beneficial for training machine learning models toward better robustness. Its mechanism is not suitable for increasing the trustworthiness and reliability of synthesized medical data, although with suitable utilization, GAN-based generative methods can raise the fairness and generalization ability of machine learning methods. The diffusion model, in contrast, has more potential to be adopted as a distribution-shift-eliminating method, especially when the distribution shift arises from module differences rather than fundamental differences. Hence, thanks to the paired patients’ evaluation experiments, we developed a deeper understanding of how and when to apply these generative methods to foster medical domain applications and increase their fairness and trustworthiness.

7.2 Outlook

In this study, we demonstrated that distribution shifts from two different CT reconstruction methods affect the segmentation results to a certain degree. These insights and evaluation results will be beneficial for our later research related to federated learning. Federated learning is a privacy-preserving distributed machine learning paradigm that supports collaborative model training on datasets distributed among multiple parties while preventing data leakage. Li et al. [15] propose a federated learning approach to collaboratively train functional MRI classification models for neurological diseases or disorders. In addition, the PHT (Personal Health Train) [2] is a novel approach aiming to establish a distributed data analytics infrastructure enabling the (re)use of distributed healthcare data. Hence, federated learning and the PHT can alleviate data owners’ security concerns because the owners retain control of their own data.
Generative methods and federated learning hold immense promise for enhancing the trustworthiness of AI applications in the medical domain. Combining these two approaches aims to improve the reliability, privacy, and fairness of AI systems. Generative methods can efficiently and effectively eliminate the distribution shift problem and augment the data sources. Federated learning has emerged as a powerful approach for training AI across decentralized data sources while maintaining data privacy and security.
Hence, in our next step, relying on the NFDI4Health project [11], we will utilize its infrastructure to implement model sharing between different terminals. Further research on the generated images from physician and algorithm perspectives will be carried out to increase the fairness and trustworthiness of generative methods’ application in the medical domain.

Appendix

Table 3 lists all classes, mapping the first 24 indices to anatomy names. More index information can be found in the TotalSegmentator GitHub [29].
Table 3. Anatomy Index Table

Index | Anatomy name
1 | spleen
2 | kidney right
3 | kidney left
4 | gallbladder
5 | liver
6 | stomach
7 | pancreas
8 | adrenal gland right
9 | adrenal gland left
10 | lung upper lobe left
11 | lung lower lobe left
12 | lung upper lobe right
13 | lung middle lobe right
14 | lung lower lobe right
15 | oesophagus
16 | trachea
17 | thyroid gland
18 | small bowel
19 | duodenum
20 | colon
21 | urinary bladder
22 | prostate
23 | kidney cyst left
24 | kidney cyst right

References

[1]
Antreas Antoniou, Amos Storkey, and Harrison Edwards. 2018. Data augmentation generative adversarial networks. arXiv:1711.04340 [stat.ML] (2018).
[2]
Oya Beyan, Ananya Choudhury, Johan Soest, Oliver Kohlbacher, Lukas Zimmermann, Holger Stenzhorn, Md. Karim, Michel Dumontier, Stefan Decker, Luiz Olavo Bonino da Silva Santos, and André Dekker. 2019. Distributed analytics on sensitive medical data: The personal health train. Data Intelligence 2 (2019), 96–107.
[3]
Hyungjin Chung and Jong Chul Ye. 2021. Score-based diffusion models for accelerated MRI. arXiv:2110.05243 (2021).
[4]
Amanda Coston, Karthikeyan Natesan Ramamurthy, Dennis Wei, Kush R. Varshney, Skyler Speakman, Zairah Mustahsan, and Supriyo Chakraborty. 2019. Fair transfer learning with missing protected attributes. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’19). ACM, 91–98.
[5]
Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D’Amour, Dan Moldovan, Sylvain Gelly, Neil Houlsby, Xiaohua Zhai, and Mario Lucic. 2021. On robustness and transferability of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’21). IEEE, 16458–16468.
[6]
Jérôme Dockès, Gaël Varoquaux, and Jean-Baptiste Poline. 2021. Preventing dataset shift from breaking machine-learning biomarkers. GigaScience 10, 9 (2021), giab055.
[7]
Yunhao Ge, Dongming Wei, Zhong Xue, Qian Wang, Xiang Zhou, Yiqiang Zhan, and Shu Liao. 2019. Unpaired Mr to CT synthesis with explicit structural constrained adversarial learning. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI ’19). IEEE, 1096–1099.
[8]
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks. arXiv:1406.2661 [stat.ML] (2014).
[9]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. CoRR abs/2006.11239 (2020). https://arxiv.org/abs/2006.11239
[10]
Aapo Hyvärinen and Peter Dayan. 2005. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, 4 (2005), 695–709.
[11]
Mehrshad Jaberansary, Macedo Maia, Yeliz Ucer Yediel, Oya Beyan, and Toralf Kirsten. 2023. Analyzing distributed medical data in FAIR data spaces. In Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion). ACM, 1480–1484.
[12]
Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G. Dimakis, and Jonathan I. Tamir. 2021. Robust compressed sensing MRI with deep generative priors. CoRR abs/2108.01368 (2021). https://arxiv.org/abs/2108.01368
[13]
Diederik P. Kingma and Max Welling. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML] (2022).
[14]
Feifei Li, Mirjam Schöneck, Oya Beyan, and Liliana Lourenco Caldeira. 2023. Voxel-wise medical imaging transformation and adaption based on CycleGAN and score-based diffusion. Studies in Health Technology and Informatics 302 (May 2023), 1027–1028.
[15]
Xiaoxiao Li, Yufeng Gu, Nicha Dvornek, Lawrence H. Staib, Pamela Ventola, and James S. Duncan. 2020. Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results. Medical Image Analysis 65 (2020), 101765.
[16]
Xiaofeng Liu, Bo Hu, Linghao Jin, Xu Han, Fangxu Xing, Jinsong Ouyang, Jun Lu, Georges El Fakhri, and Jonghye Woo. 2021. Domain generalization under conditional and label shifts via variational Bayesian inference. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Zhi-Hua Zhou (Ed.). International Joint Conferences on Artificial Intelligence Organization, 881–887.
[17]
Xu Liu, Yingguang Li, Qinglu Meng, and Gengxiang Chen. 2021. Deep transfer learning for conditional shift in regression. Knowledge-Based Systems 227 (2021), 107216.
[18]
Jim Nilsson and Tomas Akenine-Möller. 2020. Understanding SSIM. arXiv:abs/2006.13846 (2020). https://api.semanticscholar.org/CorpusID:220041631
[19]
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2021. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research 22, 1 (Jan. 2021), Article 57, 64 pages.
[20]
Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdulmohsin, Eva Schnider, Krista Opsahl-Ong, Alexander Brown, Subhrajit Roy, Diana Mincu, Christina Chen, Awa Dieng, Yuan Liu, Vivek Natarajan, Alan Karthikesalingam, Katherine A. Heller, Silvia Chiappa, and Alexander D’Amour. 2022. Maintaining fairness across distribution shift: Do we have viable solutions for real-world applications? CoRR abs/2202.01034 (2022). https://arxiv.org/abs/2202.01034
[21]
Yuge Shi, Jeffrey Seely, Philip H. S. Torr, N. Siddharth, Awni Y. Hannun, Nicolas Usunier, and Gabriel Synnaeve. 2021. Gradient matching for domain generalization. CoRR abs/2104.09937 (2021). https://arxiv.org/abs/2104.09937
[22]
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning. 2256–2265.
[23]
Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32 (2019), 1–13.
[24]
Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. 2022. Solving inverse problems in medical imaging with score-based generative models. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=vaRCHVj0uGI
[25]
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020).
[26]
Ajay Kumar Tanwani. 2020. DIRL: Domain-invariant representation learning for SIM-to-real transfer. CoRR abs/2011.07589 (2020). https://arxiv.org/abs/2011.07589
[27]
Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
[28]
Z. Wang, E. P. Simoncelli, and A. C. Bovik. 2003. Multiscale structural similarity for image quality assessment. In Proceedings of the 2003 37th Asilomar Conference on Signals, Systems, and Computers, Vol. 2. 1398–1402.
[29]
Jakob Wasserthal, Hanns-Christian Breit, Manfred T. Meyer, Maurice Pradella, Daniel Hinck, Alexander W. Sauter, Tobias Heye, Daniel T. Boll, Joshy Cyriac, Shan Yang, Michael Bach, and Martin Segeroth. 2023. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence 5, 5 (2023), e230024.
[30]
Jun Wen, Nenggan Zheng, Junsong Yuan, Zhefeng Gong, and Changyou Chen. 2019. Bayesian uncertainty matching for unsupervised domain adaptation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI ’19). 3849–3855.
[31]
Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. 2022. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’22). IEEE, 7959–7971.
[32]
Yutong Xie and Quanzheng Li. 2022. Measurement-conditioned denoising diffusion probabilistic model for under-sampled medical image reconstruction. arXiv:2203.04623 (2022).
[33]
Wenjun Yan, Lu Huang, Liming Xia, Shengjia Gu, Fuhua Yan, Yuanyuan Wang, and Qian Tao. 2020. MRI manufacturer shift and adaptation: Increasing the generalizability of deep learning segmentation for MR images acquired with different scanners. Radiology: Artificial Intelligence 2, 4 (2020), e190195.
[34]
Shichao Zhang and Jiaye Li. 2023. KNN classification with one-step computation. IEEE Transactions on Knowledge and Data Engineering 35, 3 (2023), 2711–2723.
[35]
S. Zhang, J. Li, and Y. Li. 2023. Reachable distance function for KNN classification. IEEE Transactions on Knowledge and Data Engineering 35, 7 (July 2023), 7382–7396.
[36]
Shichao Zhang, Jiaye Li, Wenzhen Zhang, and Yongsong Qin. 2022. Hyper-class representation of data. Neurocomputing 503 (Sept. 2022), 200–218.
[37]
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR abs/1703.10593 (2017). http://arxiv.org/abs/1703.10593


Published In

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 7
August 2024
505 pages
EISSN: 1556-472X
DOI: 10.1145/3613689

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2024
Online AM: 25 January 2024
Accepted: 16 January 2024
Revised: 09 January 2024
Received: 15 September 2023
Published in TKDD Volume 18, Issue 7


Author Tags

  1. Data fairness
  2. machine learning
  3. CycleGAN
  4. score-based generative model
  5. model robustness

Qualifiers

  • Research-article

Funding Sources

  • Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)

