PhenoBench: A Large Dataset and Benchmarks for Semantic Image Interpretation
in the Agricultural Domain

Jan Weyler, Federico Magistri, Elias Marks, Yue Linn Chong, Matteo Sodano, Gianmarco Roggiolani, Nived Chebrolu, Cyrill Stachniss, and Jens Behley

J. Weyler, F. Magistri, E. Marks, Y.L. Chong, M. Sodano, G. Roggiolani, and J. Behley are with the Center for Robotics, University of Bonn, Germany. E-mails: {firstname.lastname}@igg.uni-bonn.de
N. Chebrolu is with the University of Oxford, UK. E-mail: [email protected]
C. Stachniss is with the Center for Robotics, University of Bonn, Germany, the University of Oxford, UK, and the Lamarr Institute for Machine Learning and Artificial Intelligence, Germany. E-mail: [email protected]

Abstract

The production of food, feed, fiber, and fuel is a key task of agriculture, which has to cope with many challenges in the upcoming decades, e.g., a higher demand, climate change, lack of workers, and the availability of arable land. Vision systems can support making better and more sustainable field management decisions, but also support the breeding of new crop varieties by allowing temporally dense and reproducible measurements. Recently, agricultural robotics has attracted increasing interest in the vision and robotics communities since it is a promising avenue for coping with the aforementioned lack of workers and enabling more sustainable production. While large datasets and benchmarks in other domains are readily available and enable significant progress, agricultural datasets and benchmarks are comparably rare. We present an annotated dataset and benchmarks for the semantic interpretation of real agricultural fields. Our dataset, recorded with a UAV, provides high-quality, pixel-wise annotations of crops and weeds, but also crop leaf instances at the same time. Furthermore, we provide benchmarks for various tasks on a hidden test set comprised of different fields: known fields covered by the training data and a completely unseen field. Our dataset, benchmarks, and code are available at https://www.phenobench.org.

Figure 1: Our dataset, called *PhenoBench*, provides dense semantic plant-level instance annotations (shown by different colors) of sugar beet crops and weeds (green and red in the semantics) and leaf-level instance annotations of crops (different colors correspond to different instances) for high-resolution images recorded with a UAV. The dataset consists of images collected at different times during a growing season, which captures various growth stages of plants.

1 Introduction

The agricultural production of food, feed, fiber, and fuel has to cope with several challenges in the upcoming decades. The world population is increasing, yet the availability of arable land is limited or even decreasing, climate change increases uncertainties in crop yield, and we observe substantial losses in biodiversity [18]. At the same time, agricultural practices need to be more sustainable and have to reduce the use of agrochemical inputs, i.e., herbicides and fertilizers that potentially negatively impact yield [32] and the environment.

Robots and drones using vision-based perception systems could help with these challenges by offering tools to make better, more sustainable field management decisions and providing supporting tools for breeding new varieties of crops by estimating plant traits in a reproducible manner [72]. Such visual perception systems enable the development of agricultural robots that can support the monitoring of fields and replace labor-intensive tasks such as manual weeding [89]. Additionally, they potentially enable more targeted crop management, where agrochemicals are applied precisely and only where needed, thereby reducing the negative effects on the environment [53, 82].

With the advent of deep learning for visual perception [49, 41], the field of computer vision has made tremendous progress in image interpretation, achieving remarkable results in several domains. Datasets and associated benchmarks [14, 52, 66] were essential for achieving this progress as they provide a testbed for developing novel algorithms but also provide the necessary data to tackle novel tasks. Progress can be tracked quantitatively with metrics that measure the performance of developed approaches against benchmarks using hidden test sets. Novel tasks with increasing complexity drive the progress of the field by posing novel challenges for the community.

In this paper, we aim to provide a large dataset together with benchmarks for semantic interpretation under real-field conditions enabling similar progress in the agricultural domain. We target multiple tasks: semantic segmentation, panoptic segmentation, plant detection, leaf detection, and the novel task of hierarchical panoptic segmentation that provides a coarse-to-fine interpretation of plants.

For this purpose, we recorded high-resolution images with unmanned aerial vehicles (UAVs) of sugar beet fields under natural lighting conditions over multiple days, capturing a large range of growth stages. We annotated these images with dense, pixel-wise annotations to identify sugar beet crops and weeds at an instance level, as needed for semantic segmentation and plant-level instance segmentation tasks. Additionally, we labeled leaf instances of crops to enable the investigation of leaf instance segmentation (see Fig. 1). Furthermore, we provide temporal association of plant instances over the different dates, which allows identifying individual plants at different growth stages.

The combination of plant-level and leaf-level annotations enables the investigation of novel tasks needed for a holistic semantic interpretation in the agricultural domain. One such task is hierarchical panoptic segmentation, which aims to segment individual leaves and assign them to their associated plant instance to predict the total number of leaves per plant. Plant scientists and breeders commonly assess this information to describe the growth stage of individual plants, which is also linked to yield potential and plant performance [45]. However, this in-field assessment is nowadays done manually outside greenhouses, which is laborious and time-consuming [62]. Thus, developing vision systems to assess these properties per plant automatically is essential for large-scale, sustainable crop production.

Our data poses challenges in terms of plant variation and overlap between different plant and leaf instances that are specific to the agricultural domain. Such challenges are seldom addressed by general segmentation approaches prevalent in man-made environments, as shown by our experimental results, where we evaluate several state-of-the-art approaches but also provide results for more domain-specific approaches for the agricultural domain.

In summary, our main contributions are:

• We present a large dataset for plant segmentation providing accurate instance annotations at the level of plants and leaves.

• We provide benchmark tasks on a hidden test set for evaluating semantic, instance, and panoptic segmentation, and detection approaches targeted at plants, enabling reproducible and unbiased evaluation of novel plant perception approaches.

• We provide baseline results for general and domain-specific models for plant and leaf detection, but also semantic, instance, and panoptic segmentation.

We believe that the effort in generating high-quality annotations and establishing reliable benchmarks for multiple tasks with a hidden test set will accelerate progress in semantic perception of agricultural fields and potentially lead to novel avenues of research in this important domain. We make our dataset and benchmarks (https://www.phenobench.org), code for visualizing predictions and computing metrics (https://github.com/PRBonn/phenobench), and baselines (https://github.com/PRBonn/phenobench-baselines) with code, checkpoints, and predictions publicly available.

2 Related Work

In recent years, dense, pixel-wise semantic interpretation of images, i.e., semantic, instance, and panoptic segmentation [38], made rapid progress due to advances in deep learning [49], but also thanks to the availability of large-scale datasets for object detection [52, 21, 20], semantic segmentation [14, 66], instance segmentation [52], and lately panoptic segmentation [52, 14, 66].

Dataset	#Images	Image Size	Crop Sem.	Crop Inst.	Crop Leaves	Weed Sem.	Weed Inst.	Field?	Hidden Test Set?
CWFID [30]	60	1291 $\times$ 966	✓			✓		✓
CVPPP [1, 61]	1,311	2048 $\times$ 2448¹			✓				✓
Carrot-Weed [44]	39	3264 $\times$ 2448	✓			✓		✓
Sugar beets [7]	280	1296 $\times$ 966	✓			✓		✓
WeedMap [79]	1,670	480 $\times$ 360	✓			✓		✓
Carrots-Onion [3]	40	2464 $\times$ 2056	✓			✓		✓
Oil Radish [65]	129	1600 $\times$ 1600	✓			✓		✓
Sunflower [22]	500	1296 $\times$ 966	✓			✓		✓
GrowliFlowers [36]	2,198	448 $\times$ 368	✓	✓	✓			✓
CropAndWeed [81]	8,034	1920 $\times$ 1088	✓			✓		✓
PhenoBench (Ours)	2,872	1024 $\times$ 1024	✓	✓	✓	✓	✓	✓	✓

TABLE I: Comparison of datasets in the agricultural domain providing dense pixel-wise annotations. For the crop and weed, we indicate if semantic segmentation (Sem.), plant instances (Inst.), and leaf instances (Leaves) are densely annotated. We also record if the dataset was recorded under field conditions, as opposed to under lab conditions (Field?). Furthermore, we note if there is a hidden test set, such that approaches do not have access to test set labels (Hidden Test Set?). ¹We report the maximum image size, as it ranges from $441 \times 441$ px to $2048 \times 2448$ px.

Despite the availability of large datasets in man-made environments, the agricultural domain faces different challenges, such as large intra-class variability due to plant growth. Thus, there has been interest in large datasets to enable studying perception in the agricultural domain [57]. However, large agricultural datasets with accurate, dense annotations in combination with reproducible benchmarks on a hidden test set are still missing today, see Tab. I.

In particular, the crop/weed field image dataset (CWFID) by Haug et al. [30] is one of the first semantic segmentation datasets that provides pixel-level annotations of semantics for plants, i.e., sugar beets and weeds, using a multispectral camera. Lameski et al. [44] also provide a dataset for crop, i.e., carrot, and weed segmentation. CVPPP [61, 1] is one of the first datasets providing annotations for leaves in images of individual tobacco and arabidopsis plants recorded in a lab environment, which is also the basis for a series of workshops and competitions hosted at CVPR and ICCV. The dataset by Chebrolu et al. [7] provides images of sugar beets and weeds recorded by a ground robot under real field conditions with a ground sampling distance (GSD) of $0.3\,\mathrm{mm}/\mathrm{px}$ and provides annotations for semantic segmentation. Similar to our dataset, the WeedMap dataset [79] provides UAV imagery covering a large field with sugar beets and weeds. In contrast to our dataset, where we provide the original camera data, WeedMap first generated orthophotos via bundle adjustment. While we considered this option, we noticed that the lack of a detailed elevation model usually leads to artifacts on the boundaries of the plants. Additionally, the images of WeedMap have a coarse GSD between $8.2\,\mathrm{cm}/\mathrm{px}$ and $13\,\mathrm{cm}/\mathrm{px}$, while our images have a GSD of $1\,\mathrm{mm}/\mathrm{px}$ to assess detailed information for individual plants. The Sunflowers dataset [22] provides images collected with a multi-spectral sensor providing RGB and near-infrared images from a ground robot.
The Agriculture-Vision dataset [13] contains aerial images with a coarse GSD between $10\,\mathrm{cm}/\mathrm{px}$ and $20\,\mathrm{cm}/\mathrm{px}$ with corresponding annotations that cover rather large areas but not individual plants, e.g., regions with nutrient deficiencies and weed clusters. More recently, the GrowliFlowers dataset [36] provides images recorded with a UAV showing multiple growth stages of cauliflowers. While we recorded images on three dates roughly a week apart, this dataset contains images captured on four different dates, also roughly a week apart. Therefore, it captures an extended period of one month.

Lately, the CropAndWeed dataset [81] provides RGB images taken close to the field canopy showing a large variety of crops and weeds. While the number of annotated images is large, the pixel-wise annotations have been semi-automatically annotated exploiting a pre-segmentation via a deep neural network to lower the annotation effort. However, this sometimes leads to incomplete annotations and notable annotation artifacts. Also in our experience, we noted that correcting annotations is quite tedious and can counter-intuitively lead to even larger annotation effort as boundary regions generated using contemporary segmentation approaches almost always need to be corrected, which is also the part that takes most of the annotation time.

The recently published RumexLeaves dataset [27] provides fine-grained annotations of leaves of Rumex obtusifolius L., which is a problematic weed in grasslands. Besides the leaf annotations of this particular plant, the dataset also provides more fine-grained vein and stem annotations that give insights into the plant physiology corresponding to traits relevant for plant phenotyping.

Besides the aforementioned closely related datasets that also provide dense pixel-wise annotations, several datasets in the agricultural domain have recently been released for wheat detection [16], localization and mapping [69, 34], image classification of weed species [67], detection for phenotyping [58], crop row detection [96], or fruit detection [78]. Additionally, there is a small number of available datasets for semantic interpretation of 3D agricultural data [80, 35, 19].

While recent interactive labeling approaches, like SegmentAnything [39], can certainly speed up labeling of instance masks with only weak annotations delivering compelling results, we aim to generate a reliable and high-quality dataset and corresponding benchmark. Therefore, we resort to manual annotations from scratch, which entailed a rigorous correction and verification procedure to ensure accurate and consistent segmentation masks.

In contrast to the aforementioned datasets, which are great starting points for research, our dataset shows a unique level of annotations, including semantic and instance masks for crops and weeds of an overall larger number of individual plants (see Tab. I). Furthermore, we provide temporally consistent instance ids of crops that allow identifying individual plants over multiple dates. Note that our dataset provides large images with multiple completely visible plants, which is not always the case for other pixel-wise annotated datasets [36, 61]. Lastly, we enable comparable and reproducible results with the provided benchmarks on a hidden test set, i.e., labels are not released and the predictions are evaluated on a server via CodaLab [68].

3 Our Dataset

In this section, we present our setup for data collection, explain the labeling process, and provide statistics to show the variability of the data.

3.1 Data Collection

Our dataset provides RGB images in real field conditions recorded by a UAV equipped with a high-resolution camera that captures imagery of the field. For recording the data, we employed a DJI M600 and used the PhaseOne iXM-100 camera with an $80\,\mathrm{mm}$ RSM prime lens mounted on a gimbal to obtain motion-stabilized RGB images at a resolution of $11\,664 \times 8750$ px. The UAV was flying at a height of approx. $21\,\mathrm{m}$, resulting in a GSD of $1\,\mathrm{mm}/\mathrm{px}$. To cover the entire field, we used the DJI Ground Station Pro app to plan a flight that covers the field row-wise. We set the forward overlap between consecutive images by motion vector at $75\,\%$ and the side overlap between images placed in neighboring rows at $50\,\%$. Each image is geo-referenced using the on-board GNSS.
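As a rough plausibility check of the stated GSD, the following sketch applies the standard pinhole relation (GSD = pixel pitch × flying height / focal length). The pixel pitch of 3.76 µm is an assumption (a nominal value for this sensor class, not given in the text), and the function names are purely illustrative.

```python
def ground_sampling_distance_mm(height_m, focal_length_mm, pixel_pitch_um):
    """GSD in mm/px for a nadir-looking pinhole camera.

    Similar triangles: GSD / height = pixel pitch / focal length.
    """
    pixel_pitch_mm = pixel_pitch_um * 1e-3
    height_mm = height_m * 1e3
    return pixel_pitch_mm * height_mm / focal_length_mm


def footprint_m(gsd_mm, width_px, height_px):
    """Ground footprint (width, height) of one image in meters."""
    return gsd_mm * width_px / 1e3, gsd_mm * height_px / 1e3


def exposure_spacing_m(footprint_length_m, overlap):
    """Distance between neighboring exposures for a given overlap ratio."""
    return footprint_length_m * (1.0 - overlap)


# Values from the text: 21 m height, 80 mm lens, 11664 x 8750 px images.
# ASSUMPTION: 3.76 um pixel pitch (not stated in the text).
gsd = ground_sampling_distance_mm(21.0, 80.0, 3.76)
width_m, height_m = footprint_m(gsd, 11664, 8750)
print(f"GSD: {gsd:.3f} mm/px, footprint: {width_m:.1f} m x {height_m:.1f} m")
```

Under this assumption, the result is close to the reported $1\,\mathrm{mm}/\mathrm{px}$. Which image side is aligned with the flight direction is not stated, so the exposure spacing for the 75 % forward overlap depends on that choice.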

We performed three missions roughly a week apart to capture different growth stages of the plants. More specifically, we performed the flights on May 15, May 26, and June 6 in 2020. Additionally, we used the same sensor setup to record images at four different points in time in 2021 on a different field: May 20, May 28, June 1, and June 10. As the data was captured in the open field, we have a variety of different lighting conditions with sunny and also overcast weather, as shown in Fig. 2, which significantly changes the visual appearance of the plants.

From the approximately $1300\,\mathrm{m}^{2}$ sugar beet field located at the Campus Klein-Altendorf farm between Meckenheim and Rheinbach, Germany (50°37.51′N, 6°59.32′E), we selected eight crop rows that were covered by the recording mission. To have a clear spatial separation between the train and test set, we selected four crop rows for extracting training images, two crop rows for validation, and two crop rows for testing purposes as shown in Fig. 3. Additional data recorded in 2021 is only included in the test set to also evaluate the performance on an unseen field with the same crop but potentially different weeds.

Specifically, the sugar beet field contains a mixture of two different crop varieties, i.e., BTS 440 and Celesta KWS that are both from distinct agro-seed companies and differ in their properties regarding a beet’s mass and sugar yield. Furthermore, we observe six weed varieties that are most prominent in the field, i.e., Chenopodium album, Polygonum aviculare, Thlaspi arvense, Persicaria lapathifolia, Bilderdykia convolvulus, and Polygonum hydropiper.

The field belongs to a farm of the University of Bonn located at the Campus Klein-Altendorf. This allows us to conduct field studies and to study perception systems under varying conditions with respect to the application of herbicides, which leads to different scenarios with fully (conventional), partial ($80\%$ herbicides), and non-herbicided field conditions, as shown in Fig. 4. In conventional farming and field management operations, such conditions with less or no herbicides are usually not observable. While keeping most of the other field parameters constant, this makes our field setup distinct from other larger datasets, such as GrowliFlowers [36], which recorded data only under conventional field management conditions with only very few weeds.

3.2 Labeling Process

The full-sized images, which we denote as global images, $I_{g}$, are challenging to annotate due to their large size of $11\,664 \times 8750$ px. To parallelize the labeling process and ensure no plant is missed, we extracted from $I_{g}$ overlapping patches, $I_{p}$, of size $2000 \times 2000$ px. We extracted multiple iterations of overlapping patches such that each plant is completely visible in at least one of the resulting four tilings, cf. Fig. 5. As we ensure that each plant is fully visible in at least one of the patches, we instructed our annotators to label only completely visible plants in $I_{p}$.
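To illustrate why four tilings can suffice, the following sketch builds four grids of $2000 \times 2000$ px patches shifted by half a patch in x, in y, and in both directions. The half-patch shift is an assumption, since the exact offsets are not stated here. Under this assumption, any plant whose bounding box is at most half a patch wide and high lies completely inside a patch of at least one tiling.

```python
def _starts(size, patch, offset):
    """Patch start coordinates of a grid shifted by `offset` pixels."""
    # Start one patch earlier for shifted grids so the image border is covered.
    first = offset - patch if offset > 0 else 0
    return range(first, size, patch)


def tiling(image_w, image_h, patch, offset_x, offset_y):
    """Patch boxes (x0, y0, x1, y1) of one tiling, clipped to the image."""
    boxes = []
    for y0 in _starts(image_h, patch, offset_y):
        for x0 in _starts(image_w, patch, offset_x):
            boxes.append((max(x0, 0), max(y0, 0),
                          min(x0 + patch, image_w), min(y0 + patch, image_h)))
    return boxes


def four_tilings(image_w, image_h, patch):
    """ASSUMPTION: the four tilings are shifted by half a patch."""
    half = patch // 2
    return [tiling(image_w, image_h, patch, ox, oy)
            for oy in (0, half) for ox in (0, half)]


def fully_visible_somewhere(plant_box, tilings):
    """True if the plant bounding box lies completely inside some patch."""
    px0, py0, px1, py1 = plant_box
    return any(x0 <= px0 and y0 <= py0 and px1 <= x1 and py1 <= y1
               for boxes in tilings for (x0, y0, x1, y1) in boxes)


tilings = four_tilings(11664, 8750, 2000)
plant = (1900, 1900, 2600, 2700)  # hypothetical box straddling a patch corner
print(fully_visible_somewhere(plant, tilings[:1]))  # unshifted tiling only
print(fully_visible_somewhere(plant, tilings))      # all four tilings
```

The hypothetical plant box crosses a patch corner of the unshifted grid, but it is fully contained in a patch of the grid shifted in both directions.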

For labeling the plants and leaves at the same time, we developed a novel tool to enable a hierarchical annotation of the images. Please see the supplement for a more detailed description of the labeling tool and the provided features.

We first labeled the plant instances of sugar beet crops and weeds, which was completed by 9 annotators investing a total of 800 h. Each iteration was validated and corrected before we transferred the annotations to the global images $I_{g}$. Then, the next iteration was started with the transferred labels copied to the respective patches $I_{p}$, and these steps were repeated until the final fourth iteration.

Annotation of a single patch $I_{p}$ ranged from approx. 1 h for earlier growth stages to 3.5 h for later growth stages where plants had significant overlap. In sum, we annotated 705 patches over all dates and crop rows.

Split	#imgs	#crops	#weeds	#leaves
Train	1,407	11,875	8,141	71,264
Validation	772	6,482	3,926	35,503
Test	693	6,201	4,291	33,935
Unlabeled	129,000	–	–	–

TABLE II: Dataset statistics of the provided splits. Note that we have a hidden test set, i.e., we have a server-side evaluation [68]. We additionally provide unlabeled data of the fields to enable the study of self-supervised pre-training.

After the plant instances were labeled, we had 5 annotators labeling leaf instances. Annotators were tasked with identifying crop leaves and annotation of a patch $I_{p}$ took approx. 1 h to 2 h depending on the number of visible crops. With the masking of plant instances provided by our annotation tool, we ensure that we have consistent leaf labels that are inside the crop instance. Thus, it is possible to associate each leaf instance with its corresponding crop based on the plant instance annotations.

To ensure high-quality, accurate annotations of plants and leaves, we furthermore had an additional round of corrections performed by four additional annotators who revised the annotations. More details on our quality assurance process are provided in the supplementary material.

In total, we had 14 annotators who invested 1,400 h of annotation work and roughly 600 h invested into additional validation and refinement, leading to an overall labeling effort of approximately 2,000 h.

3.3 Temporal Alignment

As we recorded images in the same geographical location, we can furthermore provide temporally aligned plant instances, which enables the study of individual plant growth. By matching the occurrences of the same plants in different recordings we ensure that each crop plant has a unique instance id throughout our whole dataset.

To this end, we exploit the positions delivered by the RTK GNSS of the drone as initial guesses for a bundle adjustment procedure to determine the pose of the camera for each captured image in a global reference frame. This allows us to project the crop center locations, computed as the centroid of the plant pixels, of plants appearing in all images of a mission into a common plane.

As the estimated poses of the camera are not completely free of noise, we use Hungarian matching [42] based on the distances of crop centers to robustly associate instances of the same plant appearing in different images. To account for new crop instances but also missing crop instances, we only associate crop centers when their distance is below a threshold of $15\,\mathrm{cm}$, which was determined empirically. We experimented with using GNSS poses to associate crop instances between different missions collected at different points in time, but found the inaccuracies of the localization to be too high for our purpose. We, therefore, manually associated around 10 plants between the different missions and used these data points to compute a transformation between each pair of missions using a least-squares approach. Given those transformations, we then associated the crop ids again by projecting them onto a common plane and matching them with the Hungarian algorithm. Finally, we validated the temporal alignment by visualizing the matches between missions at different points in time.
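The association step can be sketched as follows, assuming crop centers are already projected into a common metric plane. The sketch uses `scipy.optimize.linear_sum_assignment` for the assignment and discards pairs above the distance threshold afterwards. Whether gating is applied before or after the assignment in the actual pipeline is not stated, so this ordering, as well as the function names, is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def crop_center(mask):
    """Centroid (x, y) of the True pixels of a boolean plant mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])


def match_crop_centers(centers_a, centers_b, max_dist=0.15):
    """Associate crop centers (in meters) of two sets via optimal assignment.

    Returns index pairs (i, j) whose distance is below `max_dist`.
    ASSUMPTION: gating is applied after solving the assignment.
    """
    a = np.asarray(centers_a, dtype=float).reshape(-1, 2)
    b = np.asarray(centers_b, dtype=float).reshape(-1, 2)
    # Pairwise Euclidean distances between all centers of both sets.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if cost[r, c] < max_dist]


# Hypothetical centers: two plants re-observed with small offsets, a plant
# that disappeared in the second set, and a new one far away.
first = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
second = [(1.05, 0.0), (0.02, 0.01), (5.0, 5.0)]
print(match_crop_centers(first, second))
```

Filtering after the assignment is the simplest variant. An alternative is to replace costs above the threshold with a large constant before solving, which prevents distant candidates from influencing the assignment of nearby ones.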

3.4 Dataset Statistics

We finally extracted from the global images $I_{g}$ smaller images of size $1024 \times 1024$ px to ensure that we have images containing complete crops at later growth stages, but also provide context such as the crop row structure. More specifically, we use an overlap of $50\,\%$ between extracted patches to ensure that plants in later growth stages are at least $50\,\%$ visible in the extracted patches.
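A 50 % overlap between $1024 \times 1024$ px patches corresponds to a stride of 512 px. The following minimal sketch enumerates patch origins for a sliding window. How the image border is handled is not specified here, so simply dropping incomplete patches is an assumption.

```python
def patch_origins(size, patch=1024, overlap=0.5):
    """Start coordinates of patches along one image axis.

    ASSUMPTION: patches that would exceed the image are dropped.
    """
    stride = int(patch * (1.0 - overlap))
    return list(range(0, size - patch + 1, stride))


def patch_grid(image_w, image_h, patch=1024, overlap=0.5):
    """Top-left corners (x, y) of all complete patches in the image."""
    return [(x, y)
            for y in patch_origins(image_h, patch, overlap)
            for x in patch_origins(image_w, patch, overlap)]


# Global image size from the text: 11664 x 8750 px.
grid = patch_grid(11664, 8750)
print(len(patch_origins(11664)), len(patch_origins(8750)), len(grid))
```

Note that this only counts candidate windows per global image. The number of images in the released splits also depends on the selected crop rows and dates.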

Tab. II shows an overview of the number of extracted images for the different splits from the earlier described train/validation/test rows, the number of crop instances, the number of crop leaves, and the number of weed instances annotated. Note that only the test data includes data from 2020 and 2021. As we ensured that we have completely annotated plants, we are able to generate a visibility map and differentiate between mostly visible plants with at least $50\,\%$ visible pixels and partially visible plants. Note that we provide a rather large validation set to allow researchers to conduct conclusive ablation studies.

In addition to the labeled data, we also provide unlabeled data from all fields, which can be exploited for pre-training, semi-supervised learning, or unsupervised domain adaptation, which we see as a promising avenue for future research.

As motivated earlier, we recorded images under real-world conditions of real agricultural fields leading to a diverse range of plant appearances due to varying growth stages. The crops are affected by different soil conditions leading to a variety of growth stages even on images of the same date. This intra-class variability of the crops poses an interesting challenge for learning approaches that have to correctly segment or detect small but also large crops at the same time. The extra data from a different field captured in 2021 leads to even greater diversity of recording conditions, which is a common challenge in the agricultural domain.

Additionally, we observe a large variability in terms of overlap between plants. They are clearly separated at the beginning of the recording campaign but show a considerable overlap at the last recording date. Fig. 2 shows the same area of the field over the course of three weeks showing the variation in terms of growth stage but also the overlap between crops.

In Fig. 6, we provide an overview of the plant sizes per data collection day in terms of the area covered by the plant instances, which shows the diversity in terms of growth stages. While on May 20th plants with a small coverage are predominantly present, the plant area naturally increased in the following weeks. On May 26th, the number of larger plants increases. At the latest date, June 5th, the number of larger plants further increases and the distribution becomes more long-tailed as now all plants directly compete for space, which is also visible from the larger overlap between neighboring plants. Thus, only few plants are able to develop a larger canopy cover.

Finally, we present in Fig. 7 the distribution of leaves per plant per data collection day of completely visible plants in the training and validation split. Similar to the trends for the canopy cover, we can also observe an increase in terms of the number of leaves over time. On May 20th, most of the plants are still in the two-leaf stage with only a few plants in later development stages with more than 10 leaves. Note that some leaves are also so-called germ leaves that are later replaced by the real leaves. The peak in the leaf count shifts to the right on May 26th as the sugar beet plants develop more leaves in later growth stages. On the last data collection date, June 5th, the distribution of leaves becomes more long-tailed as now larger plants are competing for space. At this stage, however, it is also more likely that leaves are covered by other leaves, since we observe the field from a UAV. Thus, the true number of leaves is not observable.

Overall, we annotated 583 unique crop plants at potentially different growth stages growing under real-world conditions in the open field. Thus, the individual plant growth is affected by the weather conditions and the soil quality, which changes over the whole field. As noted before, the visual appearance differs between plants but can also change substantially due to natural plant growth. More specifically, 496 plants appear on all three dates, 15 plants on only two of the dates, and 72 plants only at a single point in time, which is caused by the conventional field management operations or natural growing conditions.

4 Benchmarks

In this section, we present the benchmark tasks that we provide together with the dataset. These tasks cover different aspects of a perception system for the crop production domain in agriculture. While we cover classical, well-established tasks, we also want to provide a novel task of hierarchical panoptic segmentation that provides a complete picture of the plant structure.

We provide metrics on the test set of our dataset including data from known and unknown fields for all investigated baseline approaches. Note that we provide more details on the training setup, including hyperparameters, in the supplement. We furthermore provide code for the baselines in our code release. In the supplement, we also provide qualitative results together with more fine-grained quantitative results differentiating between the different fields of the test set.

Approach	mIoU	IoU Crop	IoU Weed	IoU Soil
ERFNet [76]	85.98	94.30	64.37	99.28
DeepLabV3+ [9]	85.97	94.07	64.59	99.25

TABLE III: Baseline results for semantic segmentation on the test set.

4.1 Semantic Segmentation

Task description. Semantic segmentation in images aims to train models capable of predicting each pixel’s class. Thus, we provide annotated ground truth data that assigns each pixel to the class soil, crop, or weed. Consequently, an approach for this task needs to provide dense predictions assigning each pixel to one of the aforementioned classes.

State of the Art. Semantic segmentation is a classical task that was first mainly tackled using conditional random fields [43, 40] to exploit the neighboring structure of images. With the advent of deep learning and the success in image classification [41], dense prediction tasks are nowadays mainly tackled by encoder/decoder architectures [54, 76, 77]. Recently, refined architectures add larger context [8, 9] and multi-resolution processing [84] or rely on Transformers [87] for the encoder [97, 12]. We refer to surveys [85, 46] for an overview of recent developments.

In the agricultural domain, most approaches [55, 56, 60] follow this development and adapt the pipelines to account for the row structure [55] or leverage additional background knowledge to cope with less labeled data [60].

Baselines. As baselines, we select DeepLabV3+ [9] (39.8 M params) and ERFNet [76] (2.1 M params) at different ends of model capacity.

Metrics. To evaluate the performance of semantic segmentation models, we report the common intersection-over-union (IoU) for each class individually, where higher values indicate a better performance [14]. Additionally, we compute the mean intersection over union (mIoU) across all classes as the main metric.
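For reference, the per-class IoU and the mIoU can be computed as in the following minimal sketch. It operates on flat lists of predicted and ground-truth class ids; the integer encoding of the classes and the function names are assumptions for illustration and do not reflect the benchmark's evaluation code.

```python
def class_iou(pred, target, cls):
    """Intersection-over-union of one class for flat label sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    # A class absent from both prediction and ground truth has no defined IoU.
    return inter / union if union else float("nan")


def mean_iou(pred, target, classes):
    """Unweighted mean of the per-class IoU values."""
    ious = [class_iou(pred, target, c) for c in classes]
    return sum(ious) / len(ious)


# Hypothetical encoding: 0 = soil, 1 = crop, 2 = weed.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print([class_iou(pred, target, c) for c in (0, 1, 2)])
print(mean_iou(pred, target, (0, 1, 2)))
```

The values in Tab. III are consistent with this definition: for ERFNet, the mean of the three class-wise IoUs (94.30, 64.37, and 99.28) is 85.98.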

Results and Discussion. In Tab. III, we show quantitative results of the selected baselines. The investigated off-the-shelf semantic segmentation methods already show an overall good performance in terms of mIoU. However, we observe a relatively low IoU for weeds, which are often confused with crops. We support these results qualitatively in Fig. 10 and Fig. 11 of the supplement, depicting the predictions of each approach as well as highlighting correct and false predictions. In terms of model capacity, the different investigated methods perform very similarly, indicating that a higher model capacity alone does not resolve the aforementioned issue. Surprisingly, the smaller, simpler, and faster architecture ERFNet performs on par with the more complex DeepLabV3+ model that commonly shows better performance in the context of autonomous driving. Furthermore, we refer to Tab. 9 of the supplement for more detailed quantitative results distinguishing between each data collection date.

4.2 Panoptic Segmentation

Task description. Panoptic segmentation [38] tackles the task of jointly estimating a pixel-wise semantic label and distinguishing instances. This task differentiates between so-called “stuff” and “thing” classes. The former corresponds to instance-less classes, i.e., soil, and the latter refers to classes with clearly separable objects, i.e., crops and weeds. Consequently, an approach for this task needs to produce semantic masks assigning each pixel to crop, weed, or soil and an instance segmentation for crops and weeds.

State of the Art. Most approaches for panoptic segmentation [37] extend classical semantic segmentation approaches with an instance branch or head to separate “thing” classes. Generally, two main paradigms for generating instances are prevalent: top-down and bottom-up approaches. Top-down approaches [37, 70, 51] follow the scheme pioneered by Mask R-CNN [31]: they use detection-based bounding box predictions to locate instances and predict a mask inside each bounding box to segment the located instance. Bottom-up approaches [10, 91] use a separate decoder to estimate embedding vectors and offsets to find clusters corresponding to instances of “thing” classes guided by the semantic segmentation branch. Research in this field mainly concentrates on improving the architecture to achieve better separation between instances [63, 51, 71]. However, recent approaches [11, 98, 83] based on the Vision Transformer [17] show substantial improvements.

In the agricultural domain, most methods adopt panoptic segmentation pipelines for crop and weed detection [6, 28] to contribute towards sustainable crop production and targeted weed management in real field conditions.

Baselines. We use Panoptic DeepLab [10] ( ${7.7\text{ M}}$ params) and Mask R-CNN [31] ( ${44.4\text{ M}}$ params). Furthermore, we report the performance of Mask2Former [11] ( ${44\text{ M}}$ params) as a Transformer-based approach.

Approach	PQ^†	$\text{PQ}_{\text{crop}}$	$\text{PQ}_{\text{weed}}$	$\text{IoU}_{\text{soil}}$
Panoptic DeepLab [10]	$57.97$	$52.02$	$22.61$	$99.27$
Mask R-CNN [31]	$65.79$	$67.61$	$31.30$	$98.47$
Mask2Former [11]	$69.99$	$71.21$	$40.39$	$98.38$

TABLE IV: Baseline results for panoptic segmentation on the test set.

Metrics. We separately compute the panoptic quality [38] for the predicted instance masks of crops ( $\text{PQ}_{\text{crop}}$ ) and weeds ( $\text{PQ}_{\text{weed}}$ ). During evaluation, we treat predicted instances associated with a partially visible instance, i.e., a plant where less than $50$ % of its pixels are inside the image, as “do not care” regions that do not affect the score. Additionally, we report the IoU for the semantic segmentation of soil ( $\text{IoU}_{\text{soil}}$ ) to consider predictions related to “stuff”. As our final metric, we compute the average over all three values and denote it as PQ^† as proposed by Porzi et al. [70].
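The following sketch illustrates how such a per-class panoptic quality and the averaged PQ^† could be computed. It represents instance masks as sets of pixel indices and assumes that the masks within each list do not overlap. The rule used to decide that a prediction belongs to a “do not care” region (more than half of its pixels lie on a partially visible instance) is an assumption of this sketch, as are all function and variable names. It is not the benchmark's evaluation code.

```python
def mask_iou(a, b):
    """IoU of two instance masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(gt_masks, pred_masks, ignore_masks=()):
    """PQ = (sum of IoUs over matched pairs) / (TP + 0.5 * FP + 0.5 * FN).

    A prediction matches a ground-truth instance if their IoU exceeds 0.5,
    which makes the matching unique for non-overlapping masks [38].
    Assumption of this sketch: an unmatched prediction is "do not care" if
    more than half of its pixels lie on a mask listed in ignore_masks.
    """
    matched_gt, iou_sum, tp, fp = set(), 0.0, 0, 0
    for pred in pred_masks:
        hit = None
        for idx, gt in enumerate(gt_masks):
            iou = mask_iou(pred, gt)
            if iou > 0.5:
                hit = (idx, iou)
                break
        if hit is not None:
            matched_gt.add(hit[0])
            iou_sum += hit[1]
            tp += 1
        elif any(len(pred & ign) > 0.5 * len(pred) for ign in ignore_masks):
            continue  # ignored: neither a true positive nor a false positive
        else:
            fp += 1
    fn = len(gt_masks) - len(matched_gt)
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


def pq_dagger(pq_crop, pq_weed, iou_soil):
    """Average of the two "thing" PQ values and the "stuff" IoU."""
    return (pq_crop + pq_weed + iou_soil) / 3.0


# Toy example: one matched crop, one missed crop, one false positive, and one
# prediction on a partially visible plant that is ignored.
gt_crops = [set(range(0, 10)), set(range(20, 30))]
pred_crops = [set(range(0, 8)), set(range(40, 45)), set(range(50, 54))]
partially_visible = [set(range(50, 60))]
pq_crop = panoptic_quality(gt_crops, pred_crops, partially_visible)

# Averaging the Panoptic DeepLab entries of Tab. IV reproduces its PQ^† value.
pq_dagger_deeplab = round(pq_dagger(52.02, 22.61, 99.27), 2)
```

In the toy example, the matched pair has an IoU of 0.8, and one false positive and one false negative remain, so the sketch yields a PQ of 0.8 / (1 + 0.5 + 0.5) = 0.4.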

Results and Discussion. In Tab. IV we show that Mask2Former [11] achieves the best overall performance. A more detailed quantitative evaluation provided in Tab. 11 of the supplement, distinguishing between different data collection days characterized by specific plant growth stages, reveals that the instance segmentation of plants is challenging in cases of barely visible small plants and large plants with high mutual overlap. We support these results qualitatively in Fig. 13 and Fig. 14 of the supplement. This suggests that domain-specific models could potentially exploit the plant growth stage.

4.3 Detection

Task description. While pixel-wise segmentation of instances allows for extracting fine-grained information, often detecting instances is sufficient. Therefore, we also propose using our data for studying plant or leaf detection in separate tasks. For plant detection, we distinguish between the classes of crop and weed. Similar to COCO [52], we extract bounding box annotations from the instance-level plant and leaf annotations to allow training of object detection approaches. An approach for either plant or leaf detection needs to provide bounding boxes and confidence scores for each detected instance.
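As an illustration of this extraction step, the sketch below derives the tightest axis-aligned bounding box for every instance in an instance id map. The representation of the annotation as a 2D list of integer ids with 0 as background, and the function name, are assumptions for this example and may differ from the actual data format.

```python
def boxes_from_instances(instance_map, background=0):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} in pixel coordinates.

    instance_map is a 2D list of integer instance ids. Each box is the
    tightest axis-aligned rectangle around all pixels of an instance, with
    inclusive bounds.
    """
    boxes = {}
    for y, row in enumerate(instance_map):
        for x, inst in enumerate(row):
            if inst == background:
                continue
            if inst not in boxes:
                boxes[inst] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[inst]
                boxes[inst] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


# Toy instance map with two instances (ids 1 and 2) on background 0.
instance_map = [
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 2],
    [0, 0, 0, 2, 2],
]
boxes = boxes_from_instances(instance_map)
```

The same routine can be applied to the plant-level and to the leaf-level instance maps to obtain boxes for both detection tasks.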

State of the Art. Early approaches for object detection rely on sliding window-based classification methods [88], and research before 2014 mainly concentrated on better feature representations [15, 24], part-based representations [23, 50], or better proposal generation [86].

Since 2013, CNN-based approaches have been prevalent as pioneered by R-CNN [26] and follow-up work [74, 25, 31]. Generally, one can distinguish between single-stage and two-stage approaches. Nowadays, single-stage approaches are mainly employed, and YOLO [73]-based approaches are popular choices. Recently, keypoint-based approaches [47, 99] were also proposed that depart from the anchor-based methods. Similarly to other tasks, the field recently shifted towards Transformer-based approaches [5].

In the agricultural domain, most methods use detectors to identify crops or weeds [28, 29] or suggest domain-specific adaptations, e.g., for fruit detection [59].

Baselines. We select established and commonly used approaches for object detection, namely Faster R-CNN [74] ( ${41.7\text{ M}}$ params), Mask R-CNN [31] ( ${44.4\text{ M}}$ params), and YOLOv7 [90] ( ${37.2\text{ M}}$ params). Since this task refers to either plant or leaf detection, we train models for each task separately. Although Mask R-CNN also provides an instance segmentation, we do not consider it here but rely on its predicted bounding boxes.

Approach	mAP	mAP₅₀	mAP₇₅	$\text{AP}_{\text{crop}}$	$\text{AP}_{\text{weed}}$
Faster R-CNN [74]	$40.43$	$65.07$	$40.19$	$63.23$	$17.62$
Mask R-CNN [31]	$38.68$	$63.72$	$38.07$	$60.32$	$17.05$
YOLOv7 [90]	$60.48$	$82.47$	$62.30$	$83.06$	$37.91$

TABLE V: Baseline results for plant detection on the test set.

Approach	mAP	mAP₅₀	mAP₇₅
Faster R-CNN [74]	$33.91$	$64.61$	$31.30$
Mask R-CNN [31]	$34.41$	$66.02$	$32.15$
YOLOv7 [90]	$57.90$	$86.85$	$62.92$

TABLE VI: Baseline results for leaf detection on the test set.

Metrics. In line with established benchmarks [21, 20, 52], we report the average precision (AP) for each class and the mean average precision (mAP) across all classes, which uses multiple IoU thresholds for matching between $0.5$ and $0.95$ with a step size of $0.05$ . Furthermore, we report the mean average precision at $0.5$ IoU (mAP₅₀) and $0.75$ IoU (mAP₇₅). As previously, we treat each predicted bounding box associated with a partially visible instance as a “do not care” region. Thus, these predictions do not affect the scores.
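To make the averaging over IoU thresholds concrete, the following simplified sketch computes the AP of a single class at one threshold and then averages it over the thresholds from 0.5 to 0.95. It matches detections greedily by score and integrates the precision-recall curve over all points. Established evaluation tools differ in details such as the interpolation scheme, and the handling of “do not care” regions is omitted here. All names are chosen for illustration only.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(gt_boxes, detections, iou_thr):
    """AP of one class at one IoU threshold (area under the PR curve).

    detections is a list of (score, box). Detections are matched greedily in
    order of decreasing score, and each ground-truth box is matched at most once.
    """
    if not gt_boxes:
        return 0.0
    used = [False] * len(gt_boxes)
    tp_flags = []
    for _, box in sorted(detections, key=lambda d: -d[0]):
        best, best_iou = -1, iou_thr
        for i, gt in enumerate(gt_boxes):
            iou = box_iou(box, gt)
            if not used[i] and iou >= best_iou:
                best, best_iou = i, iou
        if best >= 0:
            used[best] = True
        tp_flags.append(best >= 0)
    # Precision and recall after each detection in score order.
    precisions, recalls, tp = [], [], 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / rank)
        recalls.append(tp / len(gt_boxes))
    # Make precision non-increasing from right to left, then integrate.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def ap_over_thresholds(gt_boxes, detections):
    """Average the single-threshold AP over all IoU thresholds."""
    aps = [average_precision(gt_boxes, detections, t) for t in IOU_THRESHOLDS]
    return sum(aps) / len(aps)


# Toy example: one perfect box, one box with IoU 0.72, one false positive.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 10)), (0.8, (20, 20, 29, 28)), (0.3, (50, 50, 60, 60))]
ap50 = average_precision(gt_boxes, detections, 0.5)
ap75 = average_precision(gt_boxes, detections, 0.75)
ap_all = ap_over_thresholds(gt_boxes, detections)
```

In the toy example, the second detection counts as correct only for thresholds up to 0.7, which is why the AP drops from 1.0 at an IoU of 0.5 to 0.5 at an IoU of 0.75. The mAP is then obtained by averaging the per-class values over all classes.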

Results and Discussion. In Tab. V, we show results for plant detection, where we see that the more recent single-stage approach YOLOv7 has a clear edge over the two-stage approaches Faster R-CNN and Mask R-CNN. Apparently, weed detection is more difficult than crop detection, which could result from smaller plant sizes, as also suggested qualitatively in Fig. 16 and Fig. 17 of the supplement.

In Tab. VI, we summarize the results for leaf detection, which show a lower mAP across all methods compared with the aforementioned plant detection, indicating the need for domain-specific approaches. In Tab. 15 of the supplement, we provide more detailed results for each data collection day and additionally show qualitative results in Fig. 18 and Fig. 19 of the supplement.

4.4 Leaf Instance Segmentation

Task description. Leaf instance segmentation is relevant for estimating the growth stage of a plant [45] and also the basis for leaf disease detection [64]. Such approaches are involved in phenotyping activities to investigate new varieties of crops [62]. An automatic, vision-based assessment of such traits has the potential to have reproducible and objective measurements at a high temporal frequency. Consequently, an approach for this task needs to predict an instance mask for each visible crop leaf.

Approach	$\text{PQ}_{\text{leaf}}$
Mask R-CNN [31]	$59.74$
Mask2Former [11]	$57.50$

TABLE VII: Baseline results for leaf instance segmentation on the test set.

State of the Art. Instance segmentation is closely related to object detection. Therefore, earlier approaches rely on object detection approaches [74, 73] to perform top-down instance segmentation by predicting segmentation masks for bounding boxes [31, 2]. A different line of research [4] investigated the usage of bottom-up processing, where first pixel-wise embedding vectors are estimated such that pixels belonging to the same instance are close in embedding space, while embedding vectors of different instances are separated. The estimated embedding vectors can then be clustered, resulting in instances. Recently, several methods [92, 93] were proposed that directly estimate masks for each object instance. Most recently, Transformer-based approaches [48, 11] for instance segmentation also gained interest. Popularized by CVPPP [61], several approaches tackle the task of leaf instance segmentation [33] or leaf counting [95].
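The clustering step of such bottom-up approaches can be illustrated with a deliberately simple greedy grouping of embedding vectors. This is only a sketch of the general idea and not the clustering procedure of any cited method. The radius parameter, the function name, and the assumption that only foreground pixel embeddings are passed in are chosen for illustration.

```python
import math


def cluster_embeddings(embeddings, radius):
    """Greedily group embedding vectors into instances.

    An unassigned vector is taken as seed, and all unassigned vectors within
    `radius` (Euclidean distance) of the seed receive the same instance id.
    Returns one instance id (starting at 1) per input vector.
    """
    labels = [0] * len(embeddings)  # 0 means "not assigned yet"
    next_id = 1
    for seed, seed_vec in enumerate(embeddings):
        if labels[seed]:
            continue
        for i, vec in enumerate(embeddings):
            if not labels[i] and math.dist(seed_vec, vec) <= radius:
                labels[i] = next_id
        next_id += 1
    return labels


# Toy 2D embeddings of five foreground pixels forming two well-separated groups.
embeddings = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (0.05, 0.05), (5.1, 4.9)]
labels = cluster_embeddings(embeddings, radius=1.0)
```

The grouping works well only if embeddings of the same instance are compact and embeddings of different instances are far apart, which is what the training objective of such bottom-up methods aims to achieve.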

Baselines. As baselines for our experiments, we employ Mask R-CNN [31] ( ${44.4\text{ M}}$ params) and Mask2Former [11] ( ${44\text{ M}}$ params). While the former method represents a traditional top-down approach, the latter belongs to more recent methods relying on a Transformer decoder and masked attention.

Metrics. We compute the panoptic quality [38] for the predicted instance masks of crop leaves, denoted as $\text{PQ}_{\text{leaf}}$ . As previously, any instance prediction associated with a partially visible instance does not affect the score.

Results and Discussion. Tab. VII shows the results of the investigated baselines. In this setting, the approaches generally struggle to separate leaves, as they are naturally overlapping, even for smaller plants. In Fig. 20 and Fig. 21 of the supplement, we support these results qualitatively and provide more detailed metrics differentiating between each data collection day in Tab. 17 of the supplement. Again, we suspect that more domain-specific approaches could incorporate prior knowledge to achieve a better separation.

4.5 Hierarchical Panoptic Segmentation

Task description. Models for hierarchical panoptic segmentation target objects that can be represented as an aggregation of individual parts, e.g., plants can be represented as the union of their leaves [94]. Consequently, these methods provide a simultaneous instance segmentation of the whole object and each part. Thus, they are capable of providing more detailed information about each object, e.g., the association of individual leaves to a specific plant allows obtaining the total number of leaves per plant, which correlates with its growth stage [45]. We provide the annotated instance masks of all crops and their associated leaves. Since there are no leaf annotations for weeds, we do not consider them as part of the hierarchical structure. Thus, we treat weeds as “stuff” for this task.
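As a small illustration of how the hierarchical structure can be exploited, the sketch below counts the leaves per plant from a plant instance map and a leaf instance map. Representing both maps as 2D lists of integer ids with 0 as background, and assigning each leaf to the plant that covers most of its pixels, are assumptions of this example.

```python
from collections import Counter, defaultdict


def leaves_per_plant(plant_map, leaf_map, background=0):
    """Return {plant_id: number_of_leaves} from two aligned instance id maps.

    Each leaf is assigned to the plant id that covers most of its pixels
    (majority vote), then the leaves are counted per plant.
    """
    votes = defaultdict(Counter)  # leaf id -> plant ids found under its pixels
    for plant_row, leaf_row in zip(plant_map, leaf_map):
        for plant_id, leaf_id in zip(plant_row, leaf_row):
            if leaf_id != background and plant_id != background:
                votes[leaf_id][plant_id] += 1
    counts = Counter()
    for plant_votes in votes.values():
        owner = plant_votes.most_common(1)[0][0]
        counts[owner] += 1
    return dict(counts)


# Toy example: plant 1 consists of leaves 1 and 2, plant 2 of leaf 3.
plant_map = [
    [1, 1, 1, 0, 2, 2],
    [1, 1, 1, 0, 2, 2],
]
leaf_map = [
    [1, 1, 2, 0, 3, 3],
    [1, 2, 2, 0, 3, 3],
]
leaf_counts = leaves_per_plant(plant_map, leaf_map)
```

The resulting leaf count per plant is the kind of trait that relates to the growth stage mentioned above.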

Approach	PQ^†	PQ	$\text{PQ}_{\text{crop}}$	$\text{PQ}_{\text{leaf}}$	$\text{IoU}_{\text{weed}}$	$\text{IoU}_{\text{soil}}$
HAPT [75]	$65.27$	$50.73$	$54.61$	$46.84$	$61.11$	$98.50$
Weyler et al. [94]	-	$40.49$	$38.37$	$42.60$	-	-

TABLE VIII: Baseline results for hierarchical panoptic segmentation on the test set.

State of the Art. Several recent works exploit the underlying hierarchical structure of objects to obtain a panoptic segmentation [75, 94]. In the agricultural domain, recent methods [75, 94] operating in real field conditions exploit the hierarchical structure of plants to predict the instance segmentation of individual crops and their leaves.

Baselines. We select the methods by Weyler et al. [94] ( ${2.2\text{ M}}$ params) and Roggiolani et al. [75] ( ${2.4\text{ M}}$ params) as baselines that both perform a simultaneous instance segmentation of crops and their associated leaves, where the latter method is denoted as HAPT. The first method is a bottom-up approach that first predicts leaves, which are then associated with a plant. In contrast, HAPT uses a hierarchical feature aggregation that starts at the plants and provides plant-level features to then predict leaves.

Metrics. To evaluate the performance of this task, we compute the panoptic quality [38] for the predicted instance masks of all crops ( $\text{PQ}_{\text{crop}}$ ) and leaves ( $\text{PQ}_{\text{leaf}}$ ) separately. We report the average panoptic quality over both values, denoted as PQ. As previously, any instance prediction assigned to a partially visible instance does not affect the metrics. To account for methods that filter pixels related to weeds or soil with an additional semantic segmentation, we also report the IoU for both classes. Finally, we compute PQ^† as the average over $\text{PQ}_{\text{crop}}$ , $\text{PQ}_{\text{leaf}}$ , and both IoU values.
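Written out, the two aggregated metrics read as follows, where $\text{IoU}_{\text{weed}}$ and $\text{IoU}_{\text{soil}}$ denote the IoU of the weed and soil class, respectively. The numbers in the worked example are the HAPT entries of Tab. VIII.

```latex
\text{PQ} = \tfrac{1}{2}\left(\text{PQ}_{\text{crop}} + \text{PQ}_{\text{leaf}}\right),
\qquad
\text{PQ}^{\dagger} = \tfrac{1}{4}\left(\text{PQ}_{\text{crop}} + \text{PQ}_{\text{leaf}}
    + \text{IoU}_{\text{weed}} + \text{IoU}_{\text{soil}}\right)

% Worked example with the HAPT entries of Tab. VIII:
\text{PQ} = \tfrac{1}{2}\left(54.61 + 46.84\right) = 50.725 \approx 50.73,
\qquad
\text{PQ}^{\dagger} = \tfrac{1}{4}\left(54.61 + 46.84 + 61.11 + 98.50\right) = 65.265 \approx 65.27
```

Since the method by Weyler et al. [94] does not provide IoU values for weeds and soil, only PQ can be computed for it, which is why its PQ^† entry is missing in the table.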

Results and Discussion. In Tab. VIII, we show the results of the hierarchical approaches. Here, we can see that neither method obtains consistent predictions for plants at a late growth stage, where individual plants and their leaves overlap. In particular, the instance separation of leaves seems most challenging, in line with the plant instance segmentation. Thus, methods targeting these scenarios could improve the performance. We support these findings in Fig. 19 of the supplement, where we perform the evaluation for each data collection day separately. Finally, we show qualitative results in Fig. 22, which we separate into true positives, false positives, and false negatives in Fig. 23 of the supplement.

5 Challenge in Conjunction with CVPPA Workshop at IEEE/CVF ICCV 2023

In conjunction with the workshop on Computer Vision in Plant Phenotyping and Agriculture held at the IEEE/CVF International Conference on Computer Vision (ICCV) in 2023, we invited the community to tackle the most challenging task of hierarchical panoptic segmentation using our dataset. In total, we received 148 submissions from 107 registered participants in the competition hosted on CodaLab, where one could upload predictions until a fixed deadline. The concluded and now closed competition is still available at https://codalab.lisn.upsaclay.fr/competitions/13904.

For the top-performing entries of the leaderboard, we invited authors to provide a technical report of their approach. The non-archival, non-peer-reviewed technical reports are available at https://cvppa2023.github.io/challenges/. The submitted solutions surpassed the baselines by a large margin and often employed the Segment Anything Model [39], either in conjunction with a detection approach or with an initial segmentation that is subsequently refined. In addition, a Mask2Former-based [11] approach using a mask refinement on small plants and a second stage for leaf instance segmentation on plant masks showed promising results surpassing our off-the-shelf baselines presented in Sec. 4.5.

6 Potential Impact on Other Topics

Besides the already covered supervised tasks in agricultural perception, our dataset, which provides labeled and unlabeled images, has the potential to also impact other fields of research and applications in the agricultural domain. Examples are self-supervised representation learning, domain generalization, and unsupervised domain adaptation, which are currently attracting increasing interest in the computer vision and robotics communities. Exploiting developments in semi-supervised, but also unsupervised learning of vision models seems like an indispensable step to reduce the burden of annotating data and to unlock the scalable deployment of vision models in the agricultural domain.

Furthermore, the combination with other agricultural datasets providing pixel-wise annotations, e.g., GrowliFlower [36], opens the door for studying cross-domain transfer between different plant species towards the goal of developing more generalizable visual perception systems in the agricultural domain.

7 Conclusion

In this paper, we present a novel dataset for studying visual perception in the agricultural domain of crop production using real-world field images captured by a UAV. Together with dense pixel-wise annotations of crops and weeds that distinguish instances of plants, we also provide leaf-level pixel-wise annotations of crop leaves.

In line with the dataset, we presented our benchmark tasks that will be evaluated on a hidden test set to allow an unbiased and controlled evaluation of developed approaches. The server-side evaluation also ensures that metrics are consistent and reliable, which allows comparing approaches based on published results.

For each task, we also provide baseline results that show the performance of off-the-shelf approaches for the different tasks. These results show that certain tasks need further research to tackle the specific challenges of the agricultural domain. We believe that more domain-specific approaches exploiting domain knowledge could boost performance.

Acknowledgments

We thank all students annotating the data. The work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy, EXC-2070 – 390732324 (PhenoRob).

References

[1] J. Bell and H. M. Dee, “Aberystwyth Leaf Evaluation Dataset,” https://doi.org/10.5281/zenodo.168158, 2016.
[2] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT++ Better Real-Time Instance Segmentation,” IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), vol. 44, no. 2, pp. 1108–1121, 2022.
[3] P. Bosilj, E. Aptoula, T. Duckett, and G. Cielniak, “Transfer learning between crop types for semantic segmentation of crops versus weeds in precision agriculture,” Journal of Field Robotics (JFR), vol. 37, no. 1, pp. 7–19, 2019.
[4] B. D. Brabandere, D. Neven, and L. V. Gool, “Semantic Instance Segmentation with a Discriminative Loss Function,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
[5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in Proc. of the Europ. Conf. on Computer Vision (ECCV), 2020.
[6] J. Champ, A. Mora-Fallas, H. Goëau, E. Mata-Montero, P. Bonnet, and A. Joly, “Instance segmentation for the fine detection of crop and weed plants by precision agricultural robots,” Applications in Plant Sciences, vol. 8, no. 7, p. e11373, 2020.
[7] N. Chebrolu, P. Lottes, A. Schaefer, W. Winterhalter, W. Burgard, and C. Stachniss, “Agricultural Robot Dataset for Plant Classification, Localization and Mapping on Sugar Beet Fields,” Intl. Journal of Robotics Research (IJRR), vol. 36, pp. 1045–1052, 2017.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), vol. 40, no. 4, pp. 834–848, 2018.
[9] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking Atrous Convolution for Semantic Image Segmentation,” arXiv preprint:1706.05587, 2017.
[10] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen, “Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.
[11] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention Mask Transformer for Universal Image Segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022.
[12] B. Cheng, A. G. Schwing, and A. Kirillov, “Per-Pixel Classification is Not All You Need for Semantic Segmentation,” in Proc. of the Conf. on Neural Information Processing Systems (NeurIPS), 2021.
[13] M. T. Chiu, X. Xu, Y. Wei, Z. Huang, A. G. Schwing, R. Brunner, H. Khachatrian, H. Karapetyan, I. Dozier, and G. Rose, “Agriculture-vision: A large aerial image database for agricultural pattern analysis,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.
[14] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.
[16] E. David, M. Serouart, D. Smith, S. Madec, K. Velumani, S. Liu, X. Wang, F. P. Espinosa, S. Shafiee, I. S. A. Tahir, H. Tsujimoto, S. Nasuda, B. Zheng, N. Kichgessner, H. Aasen, A. Hund, P. Sadhegi-Tehran, K. Nagasawa, G. Ishikawa, S. Dandrifosse, A. Carlier, B. Mercatoris, K. Kuroki, H. Wang, M. Ishii, M. A. Badhon, C. Pozniak, D. S. LeBauer, M. Lilimo, J. Poland, S. Chapman, B. de Solan, F. Baret, I. Stavness, and W. Guo, “Global Wheat Head Dataset 2021: more diversity to improve the benchmarking of wheat head localization methods,” arXiv preprint:2105.07660, 2021.
[17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in Proc. of the Intl. Conf. on Learning Representations (ICLR), 2021.
[18] T. Duckett, S. Pearson, S. Blackmore, B. Grieve, W.-H. Chen, G. Cielniak, J. Cleaversmith, J. Dai, S. Davis, C. Fox, P. From, I. Georgilas, R. Gill, I. Gould, M. Hanheide, A. Hunter, F. Iida, L. Mihalyova, S. Nefti-Meziani, G. Neumann, P. Paoletti, T. Pridmore, D. Ross, M. Smith, M. Stoelen, M. Swainson, S. Wane, P. Wilson, I. Wright, and G.-Z. Yang, “Agricultural Robotics: The Future of Robotic Agriculture,” arXiv preprint: 1806.06762, 2018.
[19] H. Dutagaci, P. Rasti, G. Galopin, and D. Rousseau, “Rose-x: an annotated data set for evaluation of 3d plant organ segmentation methods,” Plant Methods, vol. 16, no. 1, pp. 1–14, 2020.
[20] M. Everingham, S. A. Eslami, L. van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes Challenge – a Retrospective,” Intl. Journal of Computer Vision (IJCV), vol. 111, no. 1, pp. 98–136, 2015.
[21] M. Everingham, L. van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes (VOC) Challenge,” Intl. Journal of Computer Vision (IJCV), vol. 88, no. 2, pp. 303–338, 2010.
[22] M. Fawakherji, C. Potena, A. Pretto, D. D. Bloisi, and D. Nardi, “Multi-Spectral Image Synthesis for Crop/Weed Segmentation in Precision Farming,” Journal on Robotics and Autonomous Systems (RAS), vol. 146, p. 103861, 2021.
[23] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object Detection with Discriminatively Trained Part-Based Models,” IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32, no. 9, pp. 1627–1645, 2010.
[24] J. Gall, A. Yao, N. Razavi, L. V. Gool, and V. Lempitsky, “Hough Forests for Object Detection, Tracking, and Action Recognition,” IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), vol. 33, no. 11, 2011.
[25] R. Girshick, “Fast R-CNN,” in Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV), 2015.
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[27] R. Güldenring, R. E. Adersen, and L. Nalpantidis, “Zoom in on the Plant: Fine-grained Analysis of Leaf, Stem and Vein Instances,” IEEE Robotics and Automation Letters (RA-L), vol. 9, no. 2, pp. 1588–1595, 2024.
[28] M. Halstead, A. Ahmadi, C. Smitt, O. Schmittmann, and C. McCool, “Crop Agnostic Monitoring Driven by Deep Learning,” Frontiers in Plant Science, vol. 12, 2021.
[29] M. S. Hammad, K. K. Velayudhan, J. Potgieter, and K. M. Arif, “Weed identification by single-stage and two-stage neural networks: A study on the impact of image resizers and weights optimization algorithms,” Frontiers in Plant Science, vol. 13, 2022.
[30] S. Haug and J. Ostermann, “A crop/weed field image dataset for the evaluation of computer vision based precision agriculture tasks,” in Proc. of the European Conference on Computer Vision (ECCV) Workshops, 2015, pp. 105–116.
[31] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV), 2017.
[32] L. Horrigan, R. S. Lawrence, and P. Walker, “How sustainable agriculture can address the environmental and human health harms of industrial agriculture,” Environ. Health Perspect., vol. 110, no. 5, pp. 445–456, 2002.
[33] W. Huang, S. Deng, C. Chen, X. Fu, and Z. Xiong, “Learning to Model Pixel-Embedded Affinity for Homogeneous Instance Segmentation,” in Proc. of the Conf. on Advancements of Artificial Intelligence (AAAI), 2022.
[34] M. Imperoli, C. Potena, D. Nardi, G. Grisetti, and A. Pretto, “An Effective Multi-Cue Positioning System for Agricultural Robotics,” IEEE Robotics and Automation Letters (RA-L), vol. 3, no. 4, pp. 3685–3692, 2018.
[35] R. Khanna, L. Schmid, A. Walter, J. Nieto, R. Siegwart, and F. Liebisch, “A spatio temporal spectral framework for plant stress phenotyping,” Plant Methods, vol. 15, no. 1, pp. 1–18, 2019.
[36] J. Kierdorf, L. V. Junker-Frohn, M. Delaney, M. D. Olave, A. Burkart, H. Jaenicke, O. Muller, U. Rascher, and R. Roscher, “GrowliFlower: An image time series dataset for GROWth analysis of cauLIFLOWER,” Journal of Field Robotics (JFR), vol. 40, no. 2, pp. 173–192, 2022.
[37] A. Kirillov, R. Girshick, K. He, and P. Dollar, “Panoptic Feature Pyramid Networks,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[38] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic Segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[39] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick, “Segment anything,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2023.
[40] P. Krähenbühl and V. Koltun, “Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials,” in Proc. of the Conf. on Neural Information Processing Systems (NIPS), 2011.
[41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[42] H. Kuhn, “The hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[43] L. Ladicky, C. Russell, and P. Kohli, “Associative Hierarchical CRFs for Object Class Image Segmentation,” in Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV), 2009.
[44] P. Lameski, E. Zdraveski, V. Trajkovik, and A. Kulkov, “Weed Detection Dataset with RGB Images Taken Under Variable Light Conditions,” in Proc. of the Intl. Conf. on ICT Innovations, 2017.
[45] P. D. Lancashire, H. Bleiholder, T. Boom, P. Langelüddeke, R. Stauss, E. Weber, and A. Witzenberger, “A Uniform Decimal Code for Growth Stages of Crops and Weeds,” Annals of Applied Biology, vol. 119, no. 3, pp. 561–601, 1991.
[46] F. Lateef and Y. Ruichek, “Survey on semantic segmentation using deep learning techniques,” Neurocomputing, vol. 338, pp. 321–348, 2019.
[47] H. Law and J. Deng, “CornerNet: Detecting Objects as Paired Keypoints,” in Proc. of the Europ. Conf. on Computer Vision (ECCV), 2018.
[48] J. Lazarow, W. Xu, and Z. Tu, “Instance Segmentation with Mask-supervised Polygonal Boundary Transformers,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022.
[49] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep Learning,” Nature, vol. 521, pp. 436–444, 2015.
[50] B. Leibe, A. Leonardis, and B. Schiele, “Combined Object Categorization and Segmentation with an Implicit Shape Model,” in Proc. of Workshop on Statistical Learning in Computer Vision at ECCV, 2004.
[51] Y. Li, H. Zhao, X. Qi, L. Wang, Z. Li, J. Sun, and J. Jia, “Fully Convolutional Networks for Panoptic Segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.
[52] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in Proc. of the Europ. Conf. on Computer Vision (ECCV), 2014, pp. 740–755.
[53] M. T. Linaza, J. Posada, J. Bund, P. Eisert, M. Quartulli, J. Döllner, A. Pagani, I. G. Olaizola, A. Barriguinha, T. Moysiadis, and L. Lucat, “Data-Driven Artificial Intelligence Applications for Sustainable Precision Agriculture,” Agronomy, vol. 11, no. 6, p. 1227, 2021.
[54] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[55] P. Lottes, J. Behley, A. Milioto, and C. Stachniss, “Fully convolutional networks with sequential information for robust crop and weed detection in precision farming,” IEEE Robotics and Automation Letters (RA-L), vol. 3, pp. 3097–3104, 2018.
[56] P. Lottes, M. Höferlin, S. Sander, and C. Stachniss, “Effective Vision-based Classification for Separating Sugar Beets and Weeds for Precision Farming,” Journal of Field Robotics (JFR), vol. 34, pp. 1160–1178, 2017.
[57] Y. Lu and S. Young, “A survey of public datasets for computer vision tasks in precision agriculture,” Computers and Electronics in Agriculture, vol. 178, p. 105760, 2020.
[58] S. L. Madsen, S. K. Mathiassen, M. Dyrmann, M. S. Laursen, L.-C. Paz, and R. N. Jørgensen, “Open Plant Phenotype Database of Common Weeds in Denmark,” Remote Sensing, vol. 12, no. 8, p. 1246, 2020.
[59] X. Mai, H. Zhang, and M. Q. Meng, “Faster R-CNN with Classifier Fusion for Small Fruit Detection,” in Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2018.
[60] A. Milioto, P. Lottes, and C. Stachniss, “Real-time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs,” in Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2018.
[61] M. Minervini, A. Fischbach, H. Scharr, and S. A. Tsaftaris, “Finely-grained annotated datasets for image-based plant phenotyping,” Pattern Recognition Letters, vol. 81, pp. 80–89, 2016.
[62] M. Minervini, H. Scharr, and S. A. Tsaftaris, “Image Analysis: The New Bottleneck in Plant Phenotyping,” IEEE Signal Processing Magazine, vol. 32, no. 4, pp. 126–131, 2015.
[63] R. Mohan and A. Valada, “EfficientPS: Efficient Panoptic Segmentation,” Intl. Journal of Computer Vision (IJCV), vol. 129, pp. 1551–1579, 2021.
[64] S. P. Mohanty, D. P. Hughes, and M. Salathé, “Using deep learning for image-based plant disease detection,” Frontiers in Plant Science, vol. 7, p. 1419, 2016.
[65] A. K. Mortensen, S. Skovsen, H. Karstoft, and R. Gislum, “The Oil Radish Growth Dataset for Semantic Segmentation and Yield Estimation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[66] G. Neuhold, T. Ollmann, S. R. Bulo, and P. Kontschieder, “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes,” in Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV), 2017.
[67] A. Olsen, D. A. Konovalov, B. Philippa, P. Ridd, J. C. Wood, J. Johns, W. Banks, B. Girgenti, O. Kenny, J. Whinney, B. Calvert, M. R. Azghadi, and R. D. White, “DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning,” Scientific Reports, vol. 9, no. 1, p. 2058, 2019.
[68] A. Pavao, I. Guyon, A.-C. Letournel, D.-T. Tran, X. Baro, H. J. Escalante, S. Escalera, T. Thomas, and Z. Xu, “CodaLab Competitions: An Open Source Platform to Organize Scientific Challenges,” Journal on Machine Learning Research (JMLR), vol. 24, no. 198, pp. 1–6, 2023.
[69] T. Pire, M. Mujica, J. Civera, and E. Kofman, “The Rosario dataset: Multisensor data for localization and mapping in agricultural environments,” Intl. Journal of Robotics Research (IJRR), vol. 38, no. 6, 2019.
[70] L. Porzi, S. R. Bulo, A. Colovic, and P. Kontschieder, “Seamless Scene Segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[71] L. Porzi, S. R. Bulo, and P. Kontschieder, “Improving Panoptic Segmentation at All Scales,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.
[72] M. P. Pound, J. A. Atkinson, A. J. Townsend, M. H. Wilson, M. Griffiths, A. S. Jackson, A. Bulat, G. Tzimiropoulos, D. M. Wells, E. H. Murchie, T. P. Pridmore, and A. P. French, “Deep machine learning provides state-of-the-art performance in image-based plant phenotyping,” Gigascience, vol. 6, no. 10, p. gix083, 2017.
[73] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[74] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Proc. of the Conf. on Neural Information Processing Systems (NIPS), 2015.
[75] G. Roggiolani, M. Sodano, T. Guadagnino, F. Magistri, J. Behley, and C. Stachniss, “Hierarchical Approach for Joint Semantic, Plant Instance, and Leaf Instance Segmentation in the Agricultural Domain,” in Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2023.
[76] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “ERFNet: Efficient residual factorized convnet for real-time semantic segmentation,” IEEE Trans. on Intelligent Transportation Systems (ITS), vol. 19, no. 1, pp. 263–272, 2017.
[77] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Proc. of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
[78] I. Sa, Z. Ge, F. Dayoub, B. Upcroft, T. Perez, and C. McCool, “DeepFruits: A Fruit Detection System Using Deep Neural Networks,” Sensors, vol. 16, no. 8, p. 1222, 2016.
[79] I. Sa, M. Popovic, R. Khanna, Z. Chen, P. Lottes, F. Liebisch, J. Nieto, C. Stachniss, A. Walter, and R. Siegwart, “WeedMap: A Large-Scale Semantic Weed Mapping Framework Using Aerial Multispectral Imaging and Deep Neural Network for Precision Farming,” Remote Sensing, vol. 10, no. 9, p. 1423, 2018.
[80] D. Schunck, F. Magistri, R. A. Rosu, A. Cornelißen, N. Chebrolu, S. Paulus, J. Léon, S. Behnke, C. Stachniss, H. Kuhlmann, and L. Klingbeil, “Pheno4D: A spatio-temporal dataset of maize and tomato plant point clouds for phenotyping and advanced plant analysis,” PLOS ONE, vol. 16, no. 8, pp. 1–18, 2021.
[81] D. Steininger, A. Trondl, G. Croonen, J. Simon, and V. Widhalm, “The CropAndWeed dataset: A multi-modal learning approach for efficient crop and weed manipulation,” in Proc. of the IEEE Winter Conf. on Applications of Computer Vision (WACV), 2023.
[82] H. Storm, S. Seidel, L. Klingbeil, F. Ewert, H. Vereecken, W. Amelung, S. Behnke, M. Bennewitz, J. Börner, T. Döring, J. Gall, A.-K. Mahlein, C. McCool, U. Rascher, S. Wrobel, A. Schnepf, C. Stachniss, and H. Kuhlmann, “Research Priorities to Leverage Smart Digital Technologies for Sustainable Crop Production,” European Journal of Agronomy, vol. 156, p. 127178, 2024.
[83] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for Semantic Segmentation,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2021.
[84] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang, “High-Resolution Representations for Labeling Pixels and Regions,” arXiv preprint:1904.04514, 2019.
[85] S. A. Taghanaki, K. Abhishek, J. P. Cohen, J. Cohen-Adad, and G. Hamarneh, “Deep semantic segmentation of natural and medical images: a review,” Artificial Intelligence Review, vol. 54, no. 1, pp. 137–178, 2021.
[86] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders, “Segmentation As Selective Search for Object Recognition,” in Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV), 2011.
[87] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. of the Conf. on Neural Information Processing Systems (NeurIPS), 2017.
[88] P. Viola and M. J. Jones, “Robust Real-time Object Detection,” Intl. Journal of Computer Vision (IJCV), vol. 57, pp. 137–154, 2001.
[89] A. Walter, R. Khanna, P. Lottes, C. Stachniss, R. Siegwart, J. Nieto, and F. Liebisch, “Flourish - a robotic approach for automation in crop management,” in Proc. of the Intl. Conf. on Precision Agriculture, 2018.
[90] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv preprint:2207.02696, 2022.
[91] H. Wang, R. Luo, M. Maire, and G. Shakhnarovich, “Pixel Consensus Voting for Panoptic Segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.
[92] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, “SOLO: Segmenting Objects by Locations,” in Proc. of the Europ. Conf. on Computer Vision (ECCV), 2020.
[93] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “SOLOv2: Dynamic and Fast Instance Segmentation,” in Proc. of the Conf. on Neural Information Processing Systems (NeurIPS), 2020.
[94] J. Weyler, F. Magistri, P. Seitz, J. Behley, and C. Stachniss, “In-Field Phenotyping Based on Crop Leaf and Plant Instance Segmentation,” in Proc. of the IEEE Winter Conf. on Applications of Computer Vision (WACV), 2022.
[95] J. Weyler, A. Milioto, T. Falck, J. Behley, and C. Stachniss, “Joint Plant Instance Detection and Leaf Count Estimation for In-Field Plant Phenotyping,” IEEE Robotics and Automation Letters (RA-L), vol. 6, no. 2, pp. 3599–3606, 2021.
[96] W. Winterhalter, F. V. Fleckenstein, C. Dornhege, and W. Burgard, “Crop Row Detection on Tiny Plants With the Pattern Hough Transform,” IEEE Robotics and Automation Letters (RA-L), vol. 3, no. 4, pp. 3394–3401, 2018.
[97] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers,” in Proc. of the Conf. on Neural Information Processing Systems (NeurIPS), 2021.
[98] Q. Yu, H. Wang, D. Kim, S. Qiao, M. Collins, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022.
[99] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as Points,” arXiv preprint:1904.07850v2, 2019.

Jan Weyler is a PhD student in Engineering at the Photogrammetry & Robotics Lab at the University of Bonn, Germany. He obtained his B.Sc. in 2015 and his M.Sc. degree in Geodesy and Geoinformation in 2019 from the University of Bonn, Germany. His research focuses on vision-based semantic scene understanding for agricultural robots.

Federico Magistri has been a Ph.D. student at the Photogrammetry & Robotics Lab at the University of Bonn, Germany, since November 2019. He received his M.Sc. in Artificial Intelligence and Robotics from “La Sapienza” University of Rome, Italy, with a thesis on Swarm Robotics for Precision Agriculture in collaboration with the National Research Council of Italy and the Wageningen University and Research, Netherlands.

Elias Marks is a PhD student in Engineering at the Photogrammetry & Robotics Lab at the University of Bonn, Germany. He obtained his B.Sc. degree in Robotics and Automation from the Hochschule Heilbronn, Germany, in 2018 and received his M.Sc. degree in Artificial Intelligence and Robotics at University La Sapienza in Rome, Italy, in 2021. His research focuses on plant modeling for phenotyping based on image data.

Yue Linn Chong is a Ph.D. student in Engineering at the Photogrammetry & Robotics Lab at the University of Bonn, Germany. She completed her B.Eng. in Mechanical Engineering at the National University of Singapore in 2017 and her M.Sc. in Mechanical Engineering at the same university in 2020. Her research focuses on unsupervised learning using generative models.

Matteo Sodano has been a PhD student in Engineering at the Photogrammetry & Robotics Lab at the University of Bonn since January 2021. He obtained his M.Sc. degree in Control Engineering in 2020. His research centers on perception and segmentation, with a focus on novel object discovery.

Gianmarco Roggiolani is a Ph.D. candidate in the Photogrammetry & Robotics Lab at the University of Bonn, Germany. He obtained his B.Sc. degree in Computer and Automatic Engineering in 2018 and received his M.Sc. degree in Artificial Intelligence and Robotics in 2021, both from the Sapienza University of Rome, Italy. His research focuses on self-supervised techniques to improve the performance of vision-based learning systems in agricultural robotics.

Nived Chebrolu is a postdoctoral research associate at the Oxford Robotics Institute, University of Oxford, UK. His research interests are in developing robust localization and mapping techniques for field robotics applications. He obtained his Ph.D. from the University of Bonn in 2021, where he developed registration techniques for agricultural robotic applications. Before that, Nived received his M.Sc. in Robotics from Ecole Centrale de Nantes (ECN), France, and the University of Genoa, Italy, in 2015.

Cyrill Stachniss is a full professor at the University of Bonn, Germany, and is also affiliated with the University of Oxford, UK, and the Lamarr Institute for Machine Learning and AI, Germany. He is the Spokesperson of the DFG Cluster of Excellence PhenoRob at the University of Bonn. His research focuses on probabilistic techniques and learning approaches for mobile robotics, perception, and navigation. The main application areas of his research are agricultural and service robotics and self-driving cars.

Jens Behley received his Dipl.-Inform. in computer science in 2009 and his Ph.D. in computer science in 2014, both from the Dept. of Computer Science at the University of Bonn, Germany. Since 2016, he has been a postdoctoral researcher at the Photogrammetry & Robotics Lab at the University of Bonn, Germany. He finished his habilitation at the University of Bonn in 2023. His research interests lie in perception for autonomous vehicles, deep learning for semantic interpretation, and LiDAR-based SLAM.
