DODA: Diffusion for Object-detection Domain Adaptation in Agriculture

Shuai Xiang Pieter M. Blok James Burridge Haozhou Wang Wei Guo
Graduate School of Agricultural and Life Sciences
The University of Tokyo
1-1-1 Midori-cho, Nishitokyo City
{xiang-shuai, pieterblok, burridge-j, haozhou-wang, guowei}@g.ecc.u-tokyo.ac.jp

https://github.com/UTokyo-FieldPhenomics-Lab/DODA.git
Corresponding Author

Abstract

The diverse and high-quality content generated by recent generative models demonstrates the great potential of using synthetic data to train downstream models. However, in vision, and especially in object detection, this direction is not fully explored: synthetic images are merely used to balance the long tails of existing datasets, the accuracy of the generated labels is low, and the full potential of generative models has not been exploited. In this paper, we propose DODA, a data synthesizer that can generate high-quality object detection data for new domains in agriculture. Specifically, we improve the controllability of layout-to-image generation by encoding the layout as an image, thereby improving the quality of labels, and we use a visual encoder to provide visual clues to the diffusion model, which decouples visual features from the diffusion model and gives it the ability to generate data in new domains. On the Global Wheat Head Detection (GWHD) Dataset, which is the largest dataset in agriculture and contains diverse domains, using the data synthesized by DODA improves the performance of the object detector by 12.74-17.76 AP₅₀ in the domain that was significantly shifted from the training data.

1 Introduction

Real-world scenes are often more complex and diverse than training data, and applying models to the real world remains a challenge because of the gap between the two. In agriculture, object detection is of great significance for yield estimation, breeding, and related tasks. Agricultural scenes are inherently variable, so the problem of domain shift between training data and real scenes is particularly important. A breeder often needs to process hundreds of varieties to obtain a new variety, and these varieties all differ from one another. In addition, factors such as weather, flowering time, equipment, and crop development stage contribute to domain shift, which affects the reliability and usability of object detection models. Because manual labeling is expensive, a common solution for object detection is semi-supervised learning, where an existing model is used to recognize new scenes and generate pseudo-labels [1, 2, 3, 4], and the pseudo-labels are then used to train another model, thus extending the model’s ability to recognize different scenes. However, this approach amplifies both the existing drawbacks and the strengths of the model [2], and it cannot work if the domain shift is so strong that the model is completely unable to recognize certain scenes. The Global Wheat Head Detection (GWHD) Dataset [5] is the largest object detection dataset in agriculture, but we observed that models trained only on the GWHD training set cannot reasonably recognize certain domains in the test set because those domains differ from the training set. As a result, the practical use of object detection models still relies on expensive and laborious manual annotation of the target scenes [6, 7, 8].

Figure 1: Overview. (a) DODA can generate high-quality object detection data for new domains by encoding the layout image and a reference image from the target domain. (b) Fine-tuning with data generated by DODA can significantly improve the recognition of a new domain for various detectors.

A growing number of studies have delved into harnessing generative AI as a data reservoir for addressing data-related challenges. The availability of GPT4 [9] marks a new era in text generation, where high-quality content from large language models is extensively employed in crafting datasets for an array of novel and challenging tasks. Wang et al. [10] leveraged Large Language Models (LLMs) for generating data to facilitate fine-grained hallucination detection. Bernsohn et al. [11] utilized GPT4 [9] for generating data aimed at identifying legal violations and associating victims. Choi et al. [12] employed synthetic tweets to train FACT-GPT, achieving effective identification of false information in social media. Furthermore, in computer vision, the synthesized high-quality images of diffusion [13, 14] are widely used in tasks such as visual representation learning [15, 16, 17], image classification [18, 19], and semantic segmentation [20, 21, 22].

However, due to the challenge of acquiring labels, there are few studies on utilizing diffusion to synthesize data for object detection. A straightforward approach is to synthesize the foreground and background separately (or just the foreground) and then paste the foreground onto the background [23, 24]. However, this method often results in composite images with inadequate consistency. Alternatively, generating images directly using a text-to-image model and then deriving labels through some method [25] may yield more consistent images, yet it encounters the same predicament as pseudo-labeling. Letting diffusion generate images directly from the layout [26] is a more promising approach. Nevertheless, due to the inadequate alignment between the layout’s features and those of the image, the quality of the synthesized data is not high. Moreover, unlike the preceding approaches, this one requires label-image pairs from the object detection dataset to train the diffusion model, so the distribution of the generated images conforms to the training set, which restricts the utility of synthetic images to augmenting the training set.

To enable the model to generate data for different scenes, especially those that are not included in the training set, the approach introduced in this study, referred to as DODA (Diffusion for Object-detection Domain Adaptation), follows the idea of Domain Adaptation, which separates the learning of the differences between domains from the core components of the model. Specifically, we employ an additional pretrained visual encoder to provide image features for diffusion, ensuring that diffusion does not learn image features from the data when training with label-image pairs. Thus, when generating datasets, DODA can produce a high-quality dataset for the target domain by combining randomly generated layouts with image features specific to the target domain, which are extracted by the visual encoder. Furthermore, unlike previous approaches where layouts were encoded through a text encoder for layout-to-image (L2I) generation, we encode layouts in the form of images. This improves the alignment between layouts and images, thereby enhancing the quality of labels.

Below are the main contributions made in this work:

- We proposed an image-based layout encoding method in layout-to-image generation. Results on the COCO dataset [27] demonstrate that this approach achieves higher accuracy in generating labels compared to previous layout-to-image generation (L2I) methods, achieving a new state-of-the-art result.

- We follow the idea of Domain Adaptation by adopting a design that decouples image features from the core components of the model. This allows the generative model to generate images in a completely new domain without the need for additional training. Through training on datasets synthesized by DODA, the performance of object detectors has notably improved. The improvement remains consistent across object detectors of varying sizes and architectures.

- The decoupled design allows us to train different parts of the model with asymmetric data. We found that pretraining the model with more unlabeled data enables the model to acquire better feature composition ability, thereby enhancing performance in downstream tasks.

2 Related Work

2.1 Image Generation Model

Image generation has garnered significant attention in the last decade, leading to the emergence of various approaches. Among them, the autoregressive model [28, 29] is a classic method that focuses on learning the conditional distribution among pixels to generate images pixel by pixel. Generative Adversarial Networks (GANs) [30, 31, 32] have achieved remarkable success by employing adversarial training between a generator and a discriminator to produce realistic images. Additionally, the autoencoder [33, 34] learns the representation of images in the latent space and thus generates new samples, bringing new ideas and methods to image generation. More recently, methods based on diffusion models [13] have received increasing attention. The diffusion model generates images through a multistep denoising process. By combining the strengths of previous methods, such as using classifier guidance [35], encoding images into latent space [14], etc., it can produce more diverse and realistic images than prior methods. This has sparked wide interest and exploration in the field of image generation [36, 37, 38, 39].

2.2 Generative Models for Different Tasks

There have been early attempts to explore the use of image generators such as GANs to generate images and then train models [40, 41, 42, 43]. With the development of the diffusion model, its ability to learn image representations across datasets and generate diverse images has attracted attention. Consequently, an increasing number of studies are exploring the application of diffusion models to different tasks. For example, some studies have tried to directly use diffusion models as representation learners [44, 45]; this is still not as effective as contrastive learning [46] or masked-image-modeling [47], but it is a meaningful attempt. More research involves generating images using diffusion models [42, 19, 20, 21, 22] and then training downstream models with the synthesized images. Because of the design of the diffusion model, various conditions, such as text [37, 14, 38], can be easily incorporated into the generation process along with other controls, making the process of synthesizing images and training downstream models similar to knowledge distillation [17], which helps downstream models learn the implicit knowledge of large models. For instance, in visual representation learning, the utilization of synthetic data for training [17] can compete with approaches like CLIP [46] and DINOv2 [48].

2.3 Layout-to-image Generation

Integrating layout guidance into the image generation process to produce images that correspond to the input layout in terms of categories and positions is termed layout-to-image (L2I). Although there were early attempts with GANs [49, 50], the quality of generated images and the fidelity to the layout were constrained by GANs. Inspired by text-to-image generation tasks, diffusion-based L2I methods [26, 14, 51] represent the layout in text form, leverage a text encoder (such as CLIP [46] or BERT [52]) to encode the layout, and integrate the layout into the diffusion model through cross-attention. Compared with GAN-based methods, this approach achieves significant improvements in image quality and controllability. However, the text encoder is usually trained on image-text pairs or plain text data, so its ability to encode the layout and align it with image features is relatively weak. Moreover, when the category and position of objects are encoded together, the encoding not only consumes many tokens, but the category and position also compete with each other, resulting in weaker layout control or poorer image quality. Therefore, we propose representing the layout in image form, which removes the token-length limit on the number of generated objects and further improves the controllability of generated images and the quality of labels.

2.4 Domain Adaptation

In practical applications, due to various factors, real-world data may have distributions different from the training data, leading to decreased performance of the model. To address this, the goal of domain adaptation is to enable the knowledge learned by the model in the source domain to be used in the target domain [53]. The basic idea of adversarial-based domain adaptation is to use some method (such as an additional loss function [54, 55]) to separate out the differences between domains and reduce the model’s discrimination between them, compelling the feature extraction network to extract domain-invariant features for downstream tasks. Following this idea, we use an image encoder to separate image features and ensure that the diffusion model learns only how to generate images based on the layout and the image features provided by the image encoder. In this way, when generating new images, the image features of the target domain provided by the image encoder allow the model to generate data for the target domain without training.

3 Method

Given the random layout and the reference images from the target domain, DODA is designed to generate images that match the input layout and domain features, thus providing object detection data for the target domain. Fig. 2 shows the general structure of DODA.

3.1 Problem Formulation

In this study, our objective is to generate data tailored to a specific domain to enhance the performance of the target detector within that domain:

$$I_{i}=\phi_{\text{DODA}}\left(E_{\text{domain}}\left(I_{\text{domain}}^{j}\right),E_{\text{layout}}\left(I_{\text{layout}}^{i}\right),\varepsilon\right) \tag{1}$$

Here, $\phi_{\text{DODA}}(\cdot)$ takes a reference image $I_{\text{domain}}^{j}$ from the target domain, a randomly generated layout image $I_{\text{layout}}^{i}$, and random noise $\varepsilon\sim N(0,1)$ as input, and generates an image $I_{i}\in\mathbb{R}^{H\times W\times 3}$ that adheres to the layout. Sampling occurs in the latent space, where $\varepsilon\in\mathbb{R}^{h\times w\times c}$, with $h=H/f$, $w=W/f$, and $f$ the downsampling factor relative to the original image. In the GWHD dataset, images contain a lot of textured details. To preserve these texture details as much as possible and reduce the loss of details caused by downsampling, $f$ is set to 4 instead of the more common 8 [14].
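
As a quick check of the shapes involved, the sketch below computes the latent resolution for a given image size and downsampling factor. The function name `latent_shape` is hypothetical, and the channel count `c` is left as a parameter because the text does not state it; the value 3 used in the example is only an assumption.

```python
def latent_shape(height: int, width: int, f: int, c: int) -> tuple:
    """Return (h, w, c) for an H x W image downsampled by factor f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by the downsampling factor")
    return height // f, width // f, c


# 256 x 256 is the target image shape listed in the hyperparameter table;
# c = 3 is an assumed latent channel count, used for illustration only.
print(latent_shape(256, 256, f=4, c=3))  # (64, 64, 3)
print(latent_shape(256, 256, f=8, c=3))  # (32, 32, 3)
```

With $f=4$ the latent grid has four times as many spatial positions as with $f=8$, which is the extra capacity the text relies on to keep fine texture.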

3.2 Domain Encoding

In our approach, the generative model does not learn the representations of different domains during training, so that it can use randomly generated labels when generating images and obtains zero-shot image generation capability. To prevent the diffusion model from learning domain bias, we employ a pre-trained visual encoder to provide features:

$$F_{\text{domain}}^{i}=E_{\text{domain}}\left(I_{\text{domain}}^{i}\right) \tag{2}$$

Here, $E_{\text{domain}}(\cdot)$ denotes a ViT-B [56] pre-trained with MAE [47]. MAE trains the ViT to learn visual representations of images via masked-image-modeling (MIM). Specifically, during training, 75% of image patches are randomly masked, and the model is tasked with reconstructing the masked patches. Compared to contrastive learning, models trained with MIM pay more attention to the texture features of the image than to the structural features, thereby retaining more high-frequency features [57, 58]. These high-frequency texture features effectively encapsulate domain-specific information. $F_{\text{domain}}^{i}\in\mathbb{R}^{768\times 1}$ denotes the encoded features of the target-domain image $I_{\text{domain}}^{i}\in\mathbb{R}^{224\times 224\times 3}$. To prevent the model from obtaining layout information from the features provided by MAE, we use the [CLS] token as the image feature representation. Moreover, reference images use a different data augmentation method from target images. Further details on data augmentation can be found in the appendix.
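
The 75% random masking used by MAE during its own pre-training can be illustrated with a few lines of NumPy. This is a minimal sketch rather than the MAE implementation: the helper `random_mask` is hypothetical, and the 16 x 16 patch size is an assumption (the text only gives the 224 x 224 input size).

```python
import numpy as np


def random_mask(num_patches: int, mask_ratio: float, rng: np.random.Generator):
    """Return sorted index arrays of visible and masked patches."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    masked = np.sort(perm[:num_masked])
    visible = np.sort(perm[num_masked:])
    return visible, masked


# Assumption: 16 x 16 patches on a 224 x 224 image, i.e. 14 * 14 = 196 patches.
patches_per_side = 224 // 16
visible, masked = random_mask(patches_per_side ** 2, 0.75, np.random.default_rng(0))
print(len(visible), len(masked))  # 49 147
```

Only the visible patches are fed to the encoder in MAE, and the reconstruction target is the masked set, which is what pushes the encoder toward local texture statistics.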

3.3 Layout Encoding

Previous L2I methods [26, 14, 50] use a text encoding to represent the layout of the image and incorporate it into the diffusion model through cross-attention:

$$L_{i}=\{(b_{i,k},c_{i,k})\}_{k=1}^{K} \tag{3}$$

$$F_{\text{layout}}^{i}=E_{\text{text}}(L_{i}) \tag{4}$$

$$F_{i}=\text{Attention}(Q_{i},K_{L}^{i},V_{L}^{i}) \tag{5}$$

Here, $b_{i,k}=(x_{\text{min}}^{(i,k)},y_{\text{min}}^{(i,k)},x_{\text{max}}^{(i,k)},y_{\text{max}}^{(i,k)})\in[0,1]^{4}$ gives the normalized minimum and maximum corner coordinates of the $k$-th bounding box, and $c_{i,k}\in[0,C]$ denotes the class of the object within that bounding box. $F_{\text{layout}}^{i}$ is the layout feature encoded by the text encoder $E_{\text{text}}(\cdot)$, and $F_{i}$ is the feature map in diffusion. $Q_{i}$, $K_{L}^{i}$, and $V_{L}^{i}$ represent the query, key, and value of the cross-attention with text. Specifically, $Q_{i}=W_{q}F_{i}$, $K_{L}^{i}=W_{k}F_{\text{layout}}^{i}$, $V_{L}^{i}=W_{v}F_{\text{layout}}^{i}$.
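
To make the text-based conditioning concrete, the following NumPy sketch implements cross-attention from image features to layout tokens. It assumes the standard scaled dot-product form, $\text{softmax}(QK^{\top}/\sqrt{d})V$, which the text does not spell out, and it stores features as rows, so the projections are written `F @ W` rather than $WF$. All sizes and the function name `cross_attention` are illustrative.

```python
import numpy as np


def cross_attention(f_img, f_layout, w_q, w_k, w_v):
    """Scaled dot-product cross-attention from image features to layout tokens.

    f_img:    (N, d_img) flattened feature map F_i with N spatial positions
    f_layout: (T, d_txt) layout tokens F_layout produced by a text encoder
    """
    q = f_img @ w_q      # (N, d)
    k = f_layout @ w_k   # (T, d)
    v = f_layout @ w_v   # (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (N, T)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over layout tokens
    return weights @ v, weights                      # (N, d), (N, T)


rng = np.random.default_rng(0)
n, t, d_img, d_txt, d = 64, 10, 32, 48, 16  # illustrative sizes only
out, weights = cross_attention(
    rng.normal(size=(n, d_img)),
    rng.normal(size=(t, d_txt)),
    rng.normal(size=(d_img, d)),
    rng.normal(size=(d_txt, d)),
    rng.normal(size=(d_txt, d)),
)
print(out.shape, weights.shape)  # (64, 16) (64, 10)
```

Every spatial position receives a weighted mixture of layout tokens, so box coordinates reach the feature map only through token content and not through any spatial correspondence.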

Cross-attention is suitable for conveying conceptual information, such as the category of objects, the style of the image. However, layout represents the spatial arrangement of objects in the image, so the above method cannot well align layout features with the image. Additionally, this approach heavily consumes tokens, leading to a significant reduction in the number of encoded objects.

In terms of geometric structure control, enhancing the consistency of geometric structures through ControlNet [59] is a common practice [60, 61, 62, 63]. Here, we represent the layout in the form of an image and then incorporate the layout features into the diffusion model following ControlNet:

$$F_{\text{layout}}^{i}=E_{\text{layout}}(I_{\text{layout}}^{i}) \tag{6}$$

$$F_{i}=F_{i}+F_{\text{layout}}^{i} \tag{7}$$

Where $F_{\text{layout}}^{i}$ represents the layout feature encoded by the image encoder $E_{\text{layout}}(\cdot)$ .

ControlNet initializes the conditioning encoding module with the weights of the diffusion encoder, which effectively preserves the diversity of the generated image. Additionally, the pointwise addition of condition features with the original features better retains spatial information compared to cross-attention.

Note that ControlNet uses a mask-to-image approach, in which carefully labeled masks accurately represent the shape and location of each object and do not overlap. In our L2I task, by contrast, bounding boxes only reflect the positions and sizes of objects and may overlap. We therefore designed a dedicated method for drawing layout images.

For a single-category dataset like GWHD, we compute the overlap relationships between all objects and use a greedy algorithm to distribute the bounding boxes of overlapping objects across the three channels of an RGB image. For a multi-category dataset like COCO, the bounding boxes are drawn in descending order of area, ensuring that smaller objects are less likely to be occluded. Objects of the same category within an image are depicted with the same hue but slightly reduced brightness for differentiation.
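
The single-category case can be sketched as a greedy three-coloring of the overlap graph. The code below is one possible reading of that description, not the authors' implementation: the helper names are hypothetical, boxes are drawn as filled rectangles (the text does not say whether boxes are filled or outlined), and the fallback used when a box overlaps boxes in all three channels is an assumption.

```python
import numpy as np


def boxes_overlap(a, b):
    """True if two (x_min, y_min, x_max, y_max) boxes share positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def assign_channels(boxes, num_channels=3):
    """Greedily give each box a channel unused by overlapping boxes placed so far."""
    channels = []
    for i, box in enumerate(boxes):
        used = [channels[j] for j in range(i) if boxes_overlap(box, boxes[j])]
        free = [c for c in range(num_channels) if c not in used]
        # Fallback (assumption): if every channel is taken, reuse the least-used one.
        channels.append(free[0] if free else min(range(num_channels), key=used.count))
    return channels


def draw_layout(boxes, size=512, num_channels=3):
    """Rasterize normalized boxes into a (size, size, num_channels) uint8 image."""
    image = np.zeros((size, size, num_channels), dtype=np.uint8)
    for box, ch in zip(boxes, assign_channels(boxes, num_channels)):
        x0, y0, x1, y1 = (int(round(v * size)) for v in box)
        image[y0:y1, x0:x1, ch] = 255
    return image


boxes = [
    (0.10, 0.10, 0.40, 0.40),
    (0.30, 0.30, 0.60, 0.60),  # overlaps the first box
    (0.35, 0.05, 0.55, 0.45),  # overlaps the first two boxes
    (0.70, 0.70, 0.90, 0.90),  # isolated
]
print(assign_channels(boxes))  # [0, 1, 2, 0]
```

Because overlapping boxes land in different channels, each box remains fully recoverable from the layout image even where boxes intersect.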

3.4 Dataset Preparation and Model Training

The GWHD dataset is composed of nadir-view photographs of wheat. Its images have high resolution but are relatively few in number. Therefore, we divided each original $1024\times 1024$ image into 9 images of size $512\times 512$ using a step size of 256. Following the original dataset’s split, the training set then contains 32,913 images and the test set contains 25,722 images. In GWHD, all object detectors recognize the two domains ‘Terraref_1’ and ‘Terraref_2’ particularly poorly. We merge these two domains into a single one called ‘Terraref’ and use it to test the effect of DODA for object detection domain adaptation. The merged ‘Terraref’ dataset comprises 2250 images.
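
The tiling arithmetic can be verified with a short sketch. The function name `tile_origins` is hypothetical; only the image size, tile size, and step size come from the text.

```python
def tile_origins(image_size=1024, tile_size=512, stride=256):
    """Top-left (x, y) corners of all tiles that fit inside a square image."""
    steps = range(0, image_size - tile_size + 1, stride)
    return [(x, y) for y in steps for x in steps]


origins = tile_origins()
print(len(origins))  # 9
print(origins[:3])   # [(0, 0), (256, 0), (512, 0)]
```

A step of 256 with 512-pixel tiles means adjacent tiles overlap by half, giving a 3 x 3 grid per image. Both reported set sizes, 32,913 and 25,722, are divisible by 9, which is consistent with nine tiles per original image.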

For the COCO 2017 dataset, we train with the official training set and test the proposed L2I method on the validation set.

The training of DODA for GWHD can be divided into two stages:

1. The diffusion model training stage, also referred to as the pre-training stage. In this stage, the upper part of the model in Fig. 2 is trained to generate images based on the features provided by MAE.

2. The L2I training stage, in which the lower part of the model in Fig. 2 is trained to generate images based on layouts while the weights of the upper part are frozen.

We use all images from the GWHD training and test sets for pre-training, and image-label pairs from the training set for L2I training. Since bounding box annotation is not required during pre-training, and image generation is decoupled from image feature learning, we can use this asymmetric data for two-stage training. Detailed experiments have been conducted regarding the impact of pre-training dataset size on the quality of generated data.

3.5 Evaluation Metrics

Fréchet Inception Distance (FID) [64] reflects the quality of the generated images. FID measures the feature similarity between generated and real images, with features computed by a pre-trained Inception-V3 [65].
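
For reference, the Fréchet distance between two Gaussians fitted to feature sets is $\lVert\mu_{r}-\mu_{g}\rVert^{2}+\text{Tr}\left(\Sigma_{r}+\Sigma_{g}-2(\Sigma_{r}\Sigma_{g})^{1/2}\right)$. The NumPy sketch below evaluates this expression on arbitrary feature arrays. It is an illustration only: random vectors stand in for Inception-V3 activations, and the function name `frechet_distance` is hypothetical.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))          # stand-in for real-image features
shifted = rng.normal(size=(2000, 8)) + 1.0  # stand-in for generated features
print(frechet_distance(real, real))     # close to zero
print(frechet_distance(real, shifted))  # dominated by the squared mean shift
```

Identical feature sets give a distance near zero, while a shift in the mean or a change in covariance increases it, which is why lower FID is read as closer agreement with the real images.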

Inception Score (IS) [66] uses a pre-trained Inception-V3 [65] to classify the generated images, reflecting the diversity and quality of the images.

COCO Metrics refers to fine-tuning detectors using synthetic data and then calculating AP according to the official COCO evaluation protocol [27].

YOLO Score uses a pre-trained YOLOX L [67] to detect objects in the generated images and calculates the AP between the detection results and the input labels, which reflects the ability of the generative model to control the layout.

YOLO Distinction. We use DODA to generate images for the target domain, then use the GWHD-pretrained YOLOX L for detection, and calculate the AP₅₀ between the detection results and the input labels. Because YOLOX is not fine-tuned on the target domain, it can recognize domains in the GWHD training set but cannot recognize the target domain. A lower score therefore indicates that, from the detector’s perspective, the generated images differ more from the GWHD training set.
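
YOLO Score and YOLO Distinction both compare detector outputs with the input layout. The sketch below shows only the IoU-based matching criterion that underlies AP₅₀, not the full COCO AP computation (which also integrates precision over recall). The function names are hypothetical, and detections are assumed to be already sorted by confidence.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(detections, labels, threshold=0.5):
    """Greedily match each detection to the best unused label with IoU >= threshold."""
    unused = list(range(len(labels)))
    hits = 0
    for det in detections:
        best = max(unused, key=lambda j: iou(det, labels[j]), default=None)
        if best is not None and iou(det, labels[best]) >= threshold:
            unused.remove(best)
            hits += 1
    return hits


labels = [(10, 10, 50, 50), (60, 60, 100, 100)]   # boxes from the input layout
detections = [(12, 12, 52, 52), (0, 0, 20, 20)]   # boxes from the detector
print(count_true_positives(detections, labels))   # 1
```

Here the first detection matches the first label (IoU of about 0.82), while the second detection overlaps no remaining label and counts as a false positive.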

4 Experiment

4.1 Comparison with T2I-based L2I model

In this section, we compare our proposed L2I method with other text-to-image based L2I methods.

4.1.1 Quantitative Comparison

Table 1: Quantitative results on COCO-stuff-val2017. SD1.5 refers to Stable Diffusion v1.5. Max obj is the maximum number of objects that can be generated in one image, as limited by the token length. The mAP, AP₅₀, AP₇₅, AP^s, AP^m, and AP^l columns report the YOLO Score.

Method	Base model	Max obj	mAP $\uparrow$	AP₅₀ $\uparrow$	AP₇₅ $\uparrow$	AP^s $\uparrow$	AP^m $\uparrow$	AP^l $\uparrow$	FID $\downarrow$	IS $\uparrow$
$256\times 256$
LayoutDiffusion[51]	-	8	6.0	14.9	3.8	0.2	4.7	19.8	20.5	21.8 $\pm$ 1.1
GeoDiffusion[26]	SD1.5	18	27.3	8.5	29.3	2.8	40.3	63.2	34.3	24.7 $\pm$ 1.1
DODA(ours)	SD1.5	-	28.4	45.5	28.5	10.0	38.2	54.8	30.2	31.2 $\pm$ 1.0
$512\times 512$
GeoDiffusion[26]	SD1.5	18	27.7	40.7	29.6	0	13.0	57.8	28.8	26.4 $\pm$ 1.1
DODA(ours)	SD1.5	-	39.7	59.0	41.9	14.7	39.0	56.3	24.0	32.4 $\pm$ 1.0

To provide a fair comparison of the L2I methods, we followed the setup of GeoDiffusion [26] and trained the model based on Stable Diffusion [14] v1.5 (https://huggingface.co/runwayml/stable-diffusion-v1-5). Since Stable Diffusion is a T2I model, we constructed a simple text prompt for our method. Assuming that an image contains objects of $i$ different categories, the prompt is formulated as:

$$\text{prompt}=\text{`a photograph with }(N_{cls}^{1})({Cls}^{1}),(N_{cls}^{2})({Cls}^{2}),\ldots,(N_{cls}^{i})({Cls}^{i})\text{'}$$

Where $Cls^{i}$ represents a specific category, and $N_{cls}^{i}$ denotes the number of objects belonging to that category.
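
A prompt of this form can be assembled from the per-object category names as in the sketch below. The exact separators, the category order, and the absence of pluralization are assumptions, since the formula does not specify them; `build_prompt` is a hypothetical helper.

```python
from collections import Counter


def build_prompt(category_names):
    """Build the text prompt from the category name of every object in an image."""
    counts = Counter(category_names)  # keeps categories in first-seen order
    parts = [f"{n} {name}" for name, n in counts.items()]
    return "a photograph with " + ", ".join(parts)


print(build_prompt(["person", "dog", "person", "bicycle"]))
# a photograph with 2 person, 1 dog, 1 bicycle
```

The prompt length grows with the number of distinct categories and not with the number of objects, so it does not impose a per-object token budget.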

The results are shown in Table 1. The layout-image-to-image method achieves better overall control of the image layout, and the number of generated objects is not limited by the token length. Compared to previous methods, there is an improvement of 11.3 in mAP. When the object area is small, the competition between category and position within the prompt may cause text-based methods to ignore the positions of small objects entirely or to lose image diversity (as shown in Fig. 3). In contrast, DODA achieves the highest AP^s and IS, which means it generates small objects better and preserves more details. Additionally, after increasing the resolution, DODA’s image control capability improves rapidly, which is crucial for constructing high-quality datasets.

However, DODA’s ability to handle large objects is not as strong. This is because in drawing layout images, we simply arrange objects in descending order of area without considering their true depth relationships in the image, leading to the possibility of large objects being incorrectly occluded. And in terms of FID, due to the utilization of Stable Diffusion as the base model with frozen weights, although more diversity is retained, the FID score is higher compared to LayoutDiffusion, the model that was fully trained on COCO.

Qualitative Comparison. Fig. 3 illustrates a comparison of $256\times 256$ images generated on the COCO validation set using our method and previous SOTA L2I approaches.

Table 2: Performance of different models before and after fine-tuning using synthetic domain-specific data.

Method	Parameters	AP₅₀ $\uparrow$	AP₇₅ $\uparrow$	AP^s $\uparrow$	AP^m $\uparrow$	AP^l $\uparrow$
Deformable DETR[68]	41M	10.2	0.5	0.3	3.1	10.5
+ synthetic dataset		24.8	3.	0.54	9.9	30.2
YOLOX L[67]	54M	30.5	5.0	0.7	12.9	42.0
+ synthetic dataset		48.3	11.6	3.0	22.6	42.3
YOLOV7 X[69]	71M	27.2	7.5	0.2	13.9	40.4
+ synthetic dataset		42.9	11.7	2.2	21.2	40.4
FCOS X101[70]	90M	20.1	1.4	0.4	6.1	17.5
+ synthetic dataset		33.8	5.4	1.9	14.3	21.1

4.2 Synthetic data for Object Detection Domain Adaptation

In this section, we employ DODA to generate data for the domain ‘Terraref’ and evaluate the effectiveness of the generated data on a range of object detection models with varying sizes and structures. Each detector is initialized with the weights of a model pre-trained on COCO and trained on GWHD’s training set, and its performance on ‘Terraref’ serves as the baseline. We randomly sample images from ‘Terraref’ as references, combine them with randomly generated layouts, and leverage DODA to create a synthetic dataset comprising 800 images. We then fine-tune each object detector on this synthetic dataset five times and average the results. The results are illustrated in Fig. 1 (b) and summarized in Table 2.

After fine-tuning with the data synthesized by DODA, all models show significant improvement in recognizing ‘Terraref’, indicating DODA’s capability to extract the representation of a specific domain and transform it into knowledge that object detection models can utilize, assisting detectors in recognizing new and different domains.

As shown in Fig. 1 (b), different detectors have different starting points, but improvement rates are relatively close. DODA cannot help reduce the gap between detectors, suggesting that synthetic data only helps the detectors to transfer knowledge, while the final recognition performance still depends on the detectors’ inherent recognition abilities.

4.3 Ablation Study

To further understand the significance of each design in the proposed method for domain adaptation, we conducted an ablation study. We explored and analyzed in detail the impact of domain encoder and pre-training dataset size on the quality of data generation, as well as the specific role of generated data.

Domain Encoder. After removing the domain encoder module from the model, we tested the model’s ability to generate images. As illustrated in Fig. 4 (a), without the domain encoder to separate image features, the diffusion model learns the average representation of the entire dataset, so the generated images are highly similar to each other and cannot be used for training downstream models. Separating features through the domain encoder not only gives the model the ability to generate images for specific domains but, more importantly, prevents it from learning the average representation of the dataset during training.

Number of generated images. Fig. 4 (b) shows the effect of fine-tuning YOLOX L with more images. 800 images are sufficient to convey information about the target domain. Further increasing the number of images only results in oscillation in the performance of the detector without further improvement. This suggests that the primary role of DODA is to assist the detector in knowledge transfer rather than data augmentation.

Table 3: Performance of different models before and after fine-tuning using synthetic domain-specific data.

Additional unlabeled data	FID $\downarrow$	YOLO Distinction $\downarrow$	AP₅₀ $\uparrow$	AP₇₅ $\uparrow$
Baseline	-	-	30.5	5.0
0%	177.4	81.1	28.3	5.8
25%	158.4	80.3	30.1	5.4
50%	164.3	76.6	31.1	4.9
75%	144.1	76.5	32.2	5.5
100%	158.8	74.2	36.2	7.6
+Terraref	110.3	57.2	40.6	6.2

Scaling dataset. The separation of image features by the domain encoder allows us to pre-train the diffusion model using a larger dataset, without worrying about overfitting during the training of the layout encoder due to differences in data distribution. We explored the variation in data quality generated for ‘Terraref’ when training with different amounts of unlabeled data. Here, the unlabeled data refers to the GWHD test set, which will not be used when training the layout encoder.

Specifically, we decompose data quality into image quality and label quality. Image quality is evaluated by FID and YOLO Distinction, and label quality is evaluated by the performance of the detector that is fine-tuned with the generated images. The quantitative results are shown in Table 3. When no additional unlabeled data was used, the FID was relatively high, indicating poor image quality, and the generated data even impaired the performance of the detector. As the size of the pre-training dataset increases, FID does not capture the changes in the images very well, but YOLO Distinction decreases steadily while data quality increases steadily. This indicates that DODA can better distinguish and extract features of specific domains, and these features are helpful for the detector.

After additionally using images in ‘Terraref’ for pre-training, DODA could generate images with styles almost identical to the original images (as seen in the last column of Fig. 5). The performance of the downstream model improved further, but by less than the improvement in image quality. Although image similarity increased significantly, control over the image layout weakened, leading to a decline in label quality. This may be because, when data from the target domain is not used for pre-training, the model simply combines the knowledge about foreground and background learned from other data with the embedding of the target domain. After pre-training with data from the target domain, the model relies less on typical features from other domains and tends to identify the foreground and background of the target domain by itself. However, due to the difficulty of the data, this recognition is not good, so the layout control is weakened.

5 Limitation

After training with images generated by DODA, the detector shows a significant improvement in recognizing ‘Terraref’. However, compared with some simple domains (AP>=80), the recognition performance is still relatively limited. On GWHD, pretraining with larger unlabeled data can enhance the quality of generated data, and no performance saturation is observed. Collecting a more diverse range of images may be a potential approach to address this issue.

The proposed layout-image-to-image method exhibits a significant enhancement in controlling the layout compared to previous state-of-the-art layout-to-image methods, which means higher label quality. However, there are still erroneous labels present. Due to resource limitations, we performed simple fine-tuning using the generated data without focusing on improving scores, such as by incorporating certain semi-supervised learning methods to further enhance label quality.

6 Conclusion

In this paper, we propose DODA, an object detection data generation model that synthesizes data for the target domain in a zero-shot manner by utilizing unlabeled reference images from the target domain and randomly generated layouts. Through comprehensive experiments, we demonstrate its effectiveness in alleviating the performance degradation of object detectors caused by significant domain shifts in agricultural scenes.

Acknowledgments. This material is based upon work supported by the Google Cloud Research Credits program. This study was partially supported by the Japan Science and Technology Agency (JST) AIP Acceleration Research (JPMJCR21U3).

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

Appendix

A. Visualization of image representations

We encode the images with MAE, take the [CLS] token as the representation of each image, and employ UMAP [71] to reduce the feature dimension to 2 for visualization. As shown in Fig. 6, the GWHD training set is far from the ‘Terraref’ domain. Using images from ‘Terraref’ as the reference, DODA generates images closer to the target domain, and the synthetic data can help downstream models adapt to the new domain.

B. Hyperparameters

Table 4: Hyperparameters for pre-training DODA. DODA leverages latent diffusion (LDM) [14] as the base diffusion model, which uses a variational autoencoder (VAE) [33] to encode the image into the latent space and thus reduce computation. The pre-training of DODA is therefore divided into two stages: the VAE and the LDM.

		VAE	LDM
Dataset		All images in GWHD	All images in GWHD
Target Image Shape		$256\times 256\times 3$	$256\times 256\times 3$
Domain Reference Image Shape		-	$224\times 224\times 3$
Data Augmentation	Target Image	Random Rotation, Random Crop, Random Flip	Random Rotation, Random Crop, Random Flip
	Reference Image	-	Random Crop
f		4	4
Channels		128	224
Channel Multiplier		1,2,4	1,2,4
Attention Resolutions		-	2,4
Number of Heads		-	8
Learning Rate		2e-6	2.5e-5
Iterations		170k	315k
Batch Size		8	16

Table 5: Hyperparameters for layout-to-image.

Dataset		COCO 2017 training	COCO 2017 training	GWHD training
Target/Layout Image Shape		$256\times 256\times 3$	$512\times 512\times 3$	$256\times 256\times 3$
Domain Reference Image Shape		-	-	$224\times 224\times 3$
Data Augmentation	Target Image	Random Flip	Random Flip	Random Rotation, Random Crop, Random Flip
	Reference Image	-	-	Random Crop
Base Model		SD1.5	SD1.5	LDM in Table 4
f		8	8	4
Channels		320	320	224
Channel Multiplier		1,2,4,4	1,2,4,4	1,2,4
Attention Resolutions		1,2,4	1,2,4	2,4
Number of Heads		8	8	8
Learning Rate		2.5e-5	2.5e-5	1e-5
Iterations		100k	30k	80k
Batch Size		16	8	16
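
The row f in Tables 4 and 5 is the spatial downsampling factor of the autoencoder, following the convention of latent diffusion [14]: an image of height H and width W is encoded into a latent of size H/f by W/f. The short sketch below works out the latent resolutions implied by the table entries. The helper name `latent_hw` and the `configs` dictionary are hypothetical and introduced only for illustration, and the sketch assumes the image sides are divisible by f.

```python
def latent_hw(image_hw, f):
    """Return the (height, width) of the latent for an image of size
    image_hw = (H, W) and spatial downsampling factor f.

    Hypothetical helper for illustration; assumes H and W are divisible by f.
    """
    h, w = image_hw
    if h % f != 0 or w % f != 0:
        raise ValueError("image sides must be divisible by f")
    return h // f, w // f


# Image sizes and f values copied from Tables 4 and 5.
configs = {
    "LDM pre-trained on GWHD (f=4, 256 px)": ((256, 256), 4),
    "SD1.5 on COCO (f=8, 256 px)": ((256, 256), 8),
    "SD1.5 on COCO (f=8, 512 px)": ((512, 512), 8),
}

for name, (hw, f) in configs.items():
    print(name, "->", latent_hw(hw, f))
```

Under these assumptions, the GWHD model at f=4 denoises a latent with the same spatial resolution as SD1.5 at 512 px, and twice the resolution per side of SD1.5 at 256 px.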

C. More qualitative comparisons with previous L2I methods on COCO

References

Chen et al. [2022] B. Chen, P. Li, X. Chen, B. Wang, L. Zhang, and X.-S. Hua, “Dense learning based semi-supervised object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4815–4824.
Chen et al. [2023a] Z. Chen, W. Zhang, X. Wang, K. Chen, and Z. Wang, “Mixed pseudo labels for semi-supervised object detection,” arXiv preprint arXiv:2312.07006, 2023.
Liu et al. [2021] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda, “Unbiased teacher for semi-supervised object detection,” arXiv preprint arXiv:2102.09480, 2021.
Xu et al. [2021] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, “End-to-end semi-supervised object detection with soft teacher,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3060–3069.
David et al. [2021] E. David, M. Serouart, D. Smith, S. Madec, K. Velumani, S. Liu, X. Wang, F. Pinto, S. Shafiee, I. S. Tahir et al., “Global wheat head detection 2021: An improved dataset for benchmarking wheat head detection methods,” Plant Phenomics, 2021.
Meng et al. [2023] X. Meng, C. Li, J. Li, X. Li, F. Guo, and Z. Xiao, “Yolov7-ma: Improved yolov7-based wheat head detection and counting,” Remote Sensing, vol. 15, no. 15, p. 3770, 2023.
Wu et al. [2023] T. Wu, S. Zhong, H. Chen, and X. Geng, “Research on the method of counting wheat ears via video based on improved yolov7 and deepsort,” Sensors, vol. 23, no. 10, p. 4880, 2023.
Zhaosheng et al. [2022] Y. Zhaosheng, L. Tao, Y. Tianle, J. Chengxin, and S. Chengming, “Rapid detection of wheat ears in orthophotos from unmanned aerial vehicles in fields based on yolox,” Frontiers in Plant Science, vol. 13, p. 851245, 2022.
Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
Mishra et al. [2024] A. Mishra, A. Asai, V. Balachandran, Y. Wang, G. Neubig, Y. Tsvetkov, and H. Hajishirzi, “Fine-grained hallucination detection and editing for language models,” arXiv preprint arXiv:2401.06855, 2024.
Bernsohn et al. [2024] D. Bernsohn, G. Semo, Y. Vazana, G. Hayat, B. Hagag, J. Niklaus, R. Saha, and K. Truskovskyi, “Legallens: Leveraging llms for legal violation identification in unstructured text,” arXiv preprint arXiv:2402.04335, 2024.
Choi and Ferrara [2024] E. C. Choi and E. Ferrara, “Fact-gpt: Fact-checking augmentation via claim matching with llms,” arXiv preprint arXiv:2402.05904, 2024.
Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
Rombach et al. [2022] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
Jahanian et al. [2021] A. Jahanian, X. Puig, Y. Tian, and P. Isola, “Generative models as a data source for multiview representation learning,” arXiv preprint arXiv:2106.05258, 2021.
Tian et al. [2024] Y. Tian, L. Fan, P. Isola, H. Chang, and D. Krishnan, “Stablerep: Synthetic images from text-to-image models make strong visual representation learners,” Advances in Neural Information Processing Systems, vol. 36, 2024.
Tian et al. [2023] Y. Tian, L. Fan, K. Chen, D. Katabi, D. Krishnan, and P. Isola, “Learning vision from models rivals learning vision from data,” arXiv preprint arXiv:2312.17742, 2023.
Azizi et al. [2023] S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet, “Synthetic data from diffusion models improves imagenet classification,” arXiv preprint arXiv:2304.08466, 2023.
Sarıyıldız et al. [2023] M. B. Sarıyıldız, K. Alahari, D. Larlus, and Y. Kalantidis, “Fake it till you make it: Learning transferable representations from synthetic imagenet clones,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8011–8021.
Schnell et al. [2023] J. Schnell, J. Wang, L. Qi, V. T. Hu, and M. Tang, “Generative data augmentation improves scribble-supervised semantic segmentation,” arXiv preprint arXiv:2311.17121, 2023.
Tan et al. [2023] W. Tan, S. Chen, and B. Yan, “Diffss: Diffusion model for few-shot semantic segmentation,” arXiv preprint arXiv:2307.00773, 2023.
Xie et al. [2023] J. Xie, W. Li, X. Li, Z. Liu, Y. S. Ong, and C. C. Loy, “Mosaicfusion: Diffusion models as data augmenters for large vocabulary instance segmentation,” arXiv preprint arXiv:2309.13042, 2023.
Ge et al. [2022] Y. Ge, J. Xu, B. N. Zhao, N. Joshi, L. Itti, and V. Vineet, “Dall-e for detection: Language-driven compositional image synthesis for object detection,” arXiv preprint arXiv:2206.09592, 2022.
Lin et al. [2023] S. Lin, K. Wang, X. Zeng, and R. Zhao, “Explore the power of synthetic data on few-shot object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 638–647.
Zhang et al. [2023a] M. Zhang, J. Wu, Y. Ren, M. Li, J. Qin, X. Xiao, W. Liu, R. Wang, M. Zheng, and A. J. Ma, “Diffusionengine: Diffusion model is scalable data engine for object detection,” arXiv preprint arXiv:2309.03893, 2023.
Chen et al. [2023b] K. Chen, E. Xie, Z. Chen, L. Hong, Z. Li, and D.-Y. Yeung, “Integrating geometric control into text-to-image diffusion models for high-quality detection data generation via text prompt,” arXiv preprint arXiv:2306.04607, 2023.
Lin et al. [2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
Salimans et al. [2017] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications,” arXiv preprint arXiv:1701.05517, 2017.
Van den Oord et al. [2016] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with pixelcnn decoders,” Advances in neural information processing systems, vol. 29, 2016.
Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
Karras et al. [2019] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
Zhu et al. [2017] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
Kingma and Welling [2013] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
Van Den Oord et al. [2017] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.
Dhariwal and Nichol [2021] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
Lu et al. [2022] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
Ramesh et al. [2021] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International conference on machine learning. PMLR, 2021, pp. 8821–8831.
Saharia et al. [2022] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022.
Song et al. [2020] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
Besnier et al. [2020] V. Besnier, H. Jain, A. Bursuc, M. Cord, and P. Pérez, “This dataset does not exist: training models from generated images,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1–5.
Frid-Adar et al. [2018] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, “Synthetic data augmentation using gan for improved liver lesion classification,” in 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). IEEE, 2018, pp. 289–293.
Li et al. [2021] D. Li, J. Yang, K. Kreis, A. Torralba, and S. Fidler, “Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8300–8311.
Zhang et al. [2021] W. Zhang, K. Chen, J. Wang, Y. Shi, and W. Guo, “Easy domain adaptation method for filling the species gap in deep learning-based fruit detection,” Horticulture Research, vol. 8, 2021.
Chen et al. [2024] X. Chen, Z. Liu, S. Xie, and K. He, “Deconstructing denoising diffusion models for self-supervised learning,” arXiv preprint arXiv:2401.14404, 2024.
Xiang et al. [2023] W. Xiang, H. Yang, D. Huang, and Y. Wang, “Denoising diffusion autoencoders are unified self-supervised learners,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 802–15 812.
Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
He et al. [2022] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009.
Oquab et al. [2023] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
Sun and Wu [2019] W. Sun and T. Wu, “Image synthesis from reconfigurable layout and style,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10 531–10 540.
Wang et al. [2022] B. Wang, T. Wu, M. Zhu, and P. Du, “Interactive image synthesis with panoptic layout generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7783–7792.
Zheng et al. [2023] G. Zheng, X. Zhou, X. Li, Z. Qi, Y. Shan, and X. Li, “Layoutdiffusion: Controllable diffusion model for layout-to-image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 490–22 499.
Devlin et al. [2018] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
Wang and Deng [2018] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
Motiian et al. [2017] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto, “Unified deep supervised domain adaptation and generalization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5715–5725.
Tzeng et al. [2015] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4068–4076.
Dosovitskiy et al. [2020] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
Park et al. [2023] N. Park, W. Kim, B. Heo, T. Kim, and S. Yun, “What do self-supervised vision transformers learn?” arXiv preprint arXiv:2305.00729, 2023.
Vanyan et al. [2023] A. Vanyan, A. Barseghyan, H. Tamazyan, V. Huroyan, H. Khachatrian, and M. Danelljan, “Analyzing local representations of self-supervised vision transformers,” arXiv preprint arXiv:2401.00463, 2023.
Zhang et al. [2023b] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.
Chen et al. [2023c] W. Chen, J. Wu, P. Xie, H. Wu, J. Li, X. Xia, X. Xiao, and L. Lin, “Control-a-video: Controllable text-to-video generation with diffusion models,” arXiv preprint arXiv:2305.13840, 2023.
Lu et al. [2023] W. Lu, Y. Xu, J. Zhang, C. Wang, and D. Tao, “Handrefiner: Refining malformed hands in generated images by diffusion-based conditional inpainting,” arXiv preprint arXiv:2311.17957, 2023.
Zhang et al. [2024] D. J. Zhang, D. Li, H. Le, M. Z. Shou, C. Xiong, and D. Sahoo, “Moonshot: Towards controllable video generation and editing with multimodal conditions,” arXiv preprint arXiv:2401.01827, 2024.
Zhang et al. [2023c] Y. Zhang, Y. Wei, D. Jiang, X. Zhang, W. Zuo, and Q. Tian, “Controlvideo: Training-free controllable text-to-video generation,” arXiv preprint arXiv:2305.13077, 2023.
Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
Szegedy et al. [2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, 2016.
Ge et al. [2021] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
Zhu et al. [2020] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
Wang et al. [2023] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 7464–7475.
Tian et al. [2019] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” arXiv preprint arXiv:1904.01355, 2019.
McInnes et al. [2018] L. McInnes, J. Healy, and J. Melville, “Umap: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018.