1. Introduction
With the increasing dependency on high-definition (HD) maps for the development of intelligent vehicle (IV) technology [1,2,3], it is now more important than ever to maintain an up-to-date, reliable, and accurate map [4,5]. The information provided by the HD map enables more-reliable planning and control decisions since optimal routes can be determined in an instant. In perception tasks, the drivable area and lane information can be utilized to determine the vehicle position by exploiting visual cues such as traffic signs, road markers, and lanes [6,7]. In recent years, researchers have built HD maps from aerial images [8]. However, the state-of-the-art method to create HD maps relies on specialized vehicles equipped with sophisticated sensors such as light detection and ranging (LiDAR) and real-time kinematic positioning with the global navigation satellite system (RTK-GNSS) [9]. These vehicles are expensive, which limits the number of mapping vehicles that can be dispatched on the road [10].
In our everyday lives, dynamic changes occur in the road environment. For example, lane lines can be removed when the road is coated with new asphalt; arrow markers on the road can differ due to local regulation changes; centerlines can be moved to expand the lane width. These changes add to the number and complexity of tasks that need to be performed by specialized mapping vehicles. Previously, whenever a map needed to be updated, the specialized mapping vehicle would update the map as a whole. As a result of the high workload during an update, the frequency of updates in the necessary areas is relatively low [11]. To avoid the ineffective allocation of resources and reduce the burden of performing these map updates, specialized mapping vehicles should only be dispatched to areas of the map that require updates [12]. In this paper, we aimed to identify these specific out-of-date areas by outsourcing the detection of changes in the road environment to mass-produced vehicles equipped with low-cost sensors such as monocular cameras and GNSS. A visualization of the monitoring system can be seen in Figure 1.
Conventional map-change-detection methods typically project map data, such as traffic poles, road markings, and lane lines, into the 2D image and perform a matching algorithm to determine whether a change has occurred [13]. Pannen et al. performed change detection using a boosted particle filter trained on datasets to update lane lines with crowdsourced data [14]. This approach utilized multiple sessions, or multiple traversals, which outperformed single-session change detection. The phrase ‘multi-session’ denotes multiple visits (more than one) by one or more vehicles to a specific area, while ‘single-session’ describes a one-time visit by a single vehicle to a particular area in order to determine the change.
In another study, a simultaneous localization and map change update (SLAMCU) method was proposed to perform the simultaneous localization and mapping (SLAM) algorithm and the HD map update at the same time [15]. This method relies on expensive LiDAR sensors to determine the change, which means the solution might not be viable in the near future given the high cost of LiDAR on the market. In 2020, Pannen et al. introduced a new method of performing map updates by utilizing density-based spatial clustering of applications with noise (DBSCAN) [16]. In this method, only the vectorized map data are considered, and the system does not differentiate between good and bad information. Ref. [17] proposed a novel method to detect map changes based on deep metric learning, which projects the existing map data into the image domain, directly compares them to the detected map element, and calculates a similarity score. Although this approach is a great breakthrough since it only uses a single frame in the change-monitoring algorithm, the system lacks robustness toward occlusion and false-detection cases.
In this study, we combined the advantages of single-session map monitoring, which is rich in semantic information, with those of multi-session map monitoring, where more information can be used to determine the map change. A confidence level is assigned to each map element to quantify how confident the system is about the existence of that map element [18]. For map elements with a confidence level above a predetermined threshold, map matching is performed on the vehicle side, and the results are sent to the server to monitor the HD map as a whole.
The foremost contributions of this paper are:
Formulating a map element confidence level model based on map element detection accuracy observed in a series of occupancy objects.
Proposing a systematic approach to create a multi-session HD map-monitoring system from the aggregation of multiple single-session HD map-monitoring results.
Conducting experiments based on real-world data. The results demonstrated the effectiveness and efficiency of the proposed method.
This paper is organized as follows: Section 2 discusses the related work in the extant literature. In Section 3, the single-session map-monitoring approach is introduced. The aggregation of single-session map-monitoring results to obtain a multi-session map-monitoring system is described in Section 4. The model is applied to a real-world scenario in Section 5, which also discusses the technical challenges encountered in the implementation. Finally, the conclusion of the study is drawn in Section 6.
3. Single-Session Monitoring
In this section, we describe the system model and the workflow of the single-session monitoring framework. This model is based on a reporting system on the vehicle side whose results are transferred to the server side. However, the final decision as to whether the map has changed should be based on the multi-session framework since it is richer in information. The whole framework can be seen in Figure 2.
3.1. Map Object Detection and Tracking
For map object detection, we used the classic region-based convolutional neural network (R-CNN) algorithm Mask R-CNN, because of its robustness in producing instance masks and bounding boxes with confidence values [27], for pole-like objects as well as traffic signs. Note that the network only produces segmentation results; thus, further post-processing steps are required. For pole-like objects, the center pixels of the top and bottom boundaries of the mask are defined as the control points. As for signs, the Hough transform was used to determine the control points from the outer contours. For lane line detection, we utilized the algorithm proposed in [28]. Similar to the previous objects, a further post-processing step is required to vectorize the instance segmentation results into a set of control points, which are separated according to the length between the points. It is important to note that the network was also fine-tuned with a labeled dataset collected in the same test area to enhance detection accuracy.
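To make the post-processing concrete, the following is a minimal sketch, in Python, of how a binary pole mask could be vectorized into the two control points described above. The function name and the NumPy-based interface are our own illustration, not the implementation used in the paper.

```python
import numpy as np

def pole_control_points(mask: np.ndarray):
    """Vectorize a binary pole mask into two control points.

    Following the post-processing described above, the center pixels of
    the top and bottom rows of the mask are taken as control points.
    Returns ((u_top, v_top), (u_bottom, v_bottom)) or None for an empty mask.
    """
    rows = np.where(mask.any(axis=1))[0]          # rows containing mask pixels
    if rows.size == 0:
        return None
    v_top, v_bottom = int(rows[0]), int(rows[-1])
    u_top = int(np.mean(np.where(mask[v_top])[0]))        # center column, top row
    u_bottom = int(np.mean(np.where(mask[v_bottom])[0]))  # center column, bottom row
    return (u_top, v_top), (u_bottom, v_bottom)
```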
After we were able to detect the map elements, the main challenge was to perform multi-object tracking for the different map elements. The time at which a map element is first detected is denoted as $t_s$, and the time at which the same map element is last detected is denoted as $t_e$. Considering that we are tracking many map elements, which are either vertically positioned, such as pole-like objects and signs, or horizontally positioned, such as lane markers, it is better to track the objects in the image space. Using the traditional method can lead to very large errors caused by perspective distortion. Furthermore, a wrong association between tracked map elements can lead to confidence errors, which might be significant for the monitoring result. Here, we used a state-of-the-art dense optical flow algorithm called the recurrent all-pairs field transform (RAFT) [29], which gives the corresponding displacement between tracked pixels of two consecutive frames. Given the detections at times $t-1$ and $t$, the distance between the detection at time $t$ and the pixel prediction from the previous frame can be calculated by warping the detection at $t-1$ with the optical flow estimate. The warping process uses the motion vector obtained from the optical flow estimation for every pixel to locate the position of each pixel in the next frame. The formula of the warping process can be seen in [30].
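As an illustration of the prediction step, the sketch below warps the control points of frame $t-1$ with a RAFT-style dense flow field. The `(H, W, 2)` flow layout and the function signature are assumptions for the example, not the paper's interface.

```python
import numpy as np

def warp_points(points: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Predict pixel positions in frame t by warping frame t-1 points.

    points: (N, 2) array of (u, v) pixel coordinates detected at t-1.
    flow:   (H, W, 2) dense optical flow from frame t-1 to frame t.
    Each point is moved by the motion vector at its own pixel location.
    """
    u = np.clip(points[:, 0].round().astype(int), 0, flow.shape[1] - 1)
    v = np.clip(points[:, 1].round().astype(int), 0, flow.shape[0] - 1)
    return points + flow[v, u]
```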
The map element association between two time frames is performed by finding the minimum Manhattan distance between the previous prediction and the detected map objects in the next time frame. When the predicted pixel locations of the vertical objects (poles and signs) fall outside of the image border, or those of the ground objects (lane lines) fall beyond the image horizon line, the tracking of that element is finished. A visualization of these limits can be seen in Figure 3.
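A minimal sketch of this association step follows. The paper specifies only the minimum-Manhattan-distance criterion; the greedy matching order and the gating threshold `gate` are our assumptions.

```python
import numpy as np

def associate(predicted: np.ndarray, detected: np.ndarray, gate: float = 30.0):
    """Associate warped predictions with detections of the next frame.

    predicted: (M, 2) warped control points from frame t-1.
    detected:  (N, 2) detected control points in frame t.
    Each detection is matched to the unused prediction with the minimum
    Manhattan (L1) distance, rejected if it exceeds `gate` pixels.
    Returns a list of (pred_idx, det_idx) pairs.
    """
    matches, used = [], set()
    for j, det in enumerate(detected):
        dists = np.abs(predicted - det).sum(axis=1)  # L1 distance to every prediction
        for i in np.argsort(dists):
            if dists[i] > gate:
                break                 # all remaining candidates are too far away
            if int(i) not in used:
                matches.append((int(i), j))
                used.add(int(i))
                break
    return matches
```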
When a detected map element is registered as tracked, we record its confidence value inside an occupancy object table in a time-series manner, as $O = \{o_{t_s}, o_{t_s+1}, \ldots, o_{t_e}\}$. It is important to note that, in the case of missed tracking, the occupancy table is also filled with an entry for that frame. The observation of the map element in the image plane is thus modeled as a series of occupancy objects, as shown in Figure 4. This occupancy object works like an occupancy grid, but instead of using the grid as a reference, we use the map element itself as the reference. These occupancy objects are used in the next step, when we calculate the existence confidence of the map element.
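As an illustration, the occupancy object can be represented as a simple per-element time series. The structure below is our sketch of it; in particular, recording a missed frame as a zero confidence is our assumption, since the exact entry is not specified.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OccupancyObject:
    """Time series of detection confidences for one tracked map element."""
    element_id: int
    t_start: int                                    # frame of first detection (t_s)
    confidences: List[float] = field(default_factory=list)

    def observe(self, conf: Optional[float]) -> None:
        # A missed tracking still appends an entry; 0.0 is our assumption.
        self.confidences.append(conf if conf is not None else 0.0)
```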
3.2. Bayesian Recursion Confidence
After we have collected the occupancy object of a map element, the next step is to estimate the state of the map element using multiple frame observations. The posterior probability of the map element is estimated from the series of observations collected in the occupancy object. It is given that the observation series is $o_{t_s:t_e} = \{o_{t_s}, \ldots, o_{t_e}\}$, and $E$ denotes the event that the map element exists.
For the observation of each frame, it is natural to have cases where missed detections and false detections occur. These two perceptual errors are also the main source of noise in $o_{t_s:t_e}$. Here, we introduced the missed detection probability $P_{md}$ and the false detection probability $P_{fd}$ of image perception. The former gives rise to:
$$P(o_t = 0 \mid E) = P_{md},$$
when missed detection or missed tracking happens, and the latter gives rise to:
$$P(o_t = 1 \mid \bar{E}) = P_{fd},$$
when false detection occurs. Then, $P(o_{t_s:t_e} \mid E)$ represents the probability that there is a map element observed in the map from $t_s$ to $t_e$,
where $f(\tau)$ is the probability density function of the map element change time and $F(\tau)$ is the cumulative distribution function of the $f(\tau)$ distribution.
According to the Bayesian inference formula, we have:
$$P(E \mid o_{t_s:t_e}) = \frac{P(o_{t_s:t_e} \mid E)\, P(E)}{P(o_{t_s:t_e} \mid E)\, P(E) + P(o_{t_s:t_e} \mid \bar{E})\, P(\bar{E})}.$$
The probability density function of the change time $\tau$ represents the statistical characteristics of the map elements in the macro sense, i.e., the frequency of map element changes. The time probability density of a typical event’s occurrence can be modeled by an exponential distribution function, which describes the probability distribution of the time interval $\tau$ from the current observation to the moment when the map element changes. In this paper, we simplified the modeling of the probability density function by assuming that the current change time of a map element is either 0 or that the element does not change ($\tau \to \infty$). That is, the prior reduces to $P(E) = p_0$, where $p_0$ represents the prior probability of the existence of a map feature.
However, since the end of the time series for each tracked element varies and is unknown in practice, a threshold on the number of times a map element is observed has to be satisfied in order to obtain a reliable confidence. In this paper, we call this threshold $N_{\min}$, and only when an element has more than $N_{\min}$ observations can the obtained posterior probability be used as the confidence to judge whether the map element exists in reality. We set this threshold value to 8 to satisfy the posterior probability and also set the probability threshold to 0.8 to judge that the existence is true.
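The recursion can be summarized by the following sketch, which reflects our reading of the update rule rather than the authors' code: each occupancy entry is treated as a binary observation and folded into the posterior using the $P_{md}$/$P_{fd}$ likelihoods, and the confidence is only trusted once the element has more than 8 observations. The default values of `p_md`, `p_fd`, and `prior` are placeholders.

```python
def existence_confidence(observations, prior=0.5, p_md=0.1, p_fd=0.05,
                         min_obs=8, threshold=0.8):
    """Bayesian recursive existence confidence for one map element.

    observations: iterable of 0/1 occupancy entries (1 = detected).
    p_md, p_fd:   missed/false detection probabilities (assumed values).
    Returns (exists, posterior); `exists` is None until min_obs is exceeded.
    """
    p = prior
    n = 0
    for o in observations:
        if o:   # P(o=1 | E) = 1 - p_md,  P(o=1 | not E) = p_fd
            num, alt = (1.0 - p_md) * p, p_fd * (1.0 - p)
        else:   # P(o=0 | E) = p_md,      P(o=0 | not E) = 1 - p_fd
            num, alt = p_md * p, (1.0 - p_fd) * (1.0 - p)
        p = num / (num + alt)   # posterior after this frame
        n += 1
    exists = (p > threshold) if n > min_obs else None
    return exists, p
```

With `min_obs = 8` and `threshold = 0.8`, this mirrors the two thresholds stated above.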
3.3. Map Reconstruction
After the confidence of the map element has been calculated, the next step is to reconstruct, in 3D space, the map elements that we believe to exist. The vehicle pose used in this paper was derived from the GNSS and the IMU. It is important to note that the reconstruction problems for lane-like elements and pole-like elements are different. Here, we briefly describe the approach we chose for each element type:
Lane reconstruction: The lanes were reconstructed from the projection to the bird’s-eye view (BEV) by inverse perspective mapping (IPM). Given the camera and installation poses $T_C^V$ for the vehicle, a 3D lane marking can be reconstructed in the local coordinate with the assumption of a flat road as:
$$p^V = T_C^V\,(\lambda\, \tilde{p}^I),$$
where $p^V$ denotes the 3D lane marking at a fixed road surface height in the vehicle local coordinate, $\tilde{p}^I$ denotes the 2D lane marking in the normalized 2D image plane obtained from the detection result, and the scale $\lambda$ is fixed by the road-height constraint. This process is performed to transform the information in the pixel coordinate into the vehicle coordinate. When the vehicle pose $T_t$ is defined as a 6-DoF pose $(x, y, z, \phi, \theta, \psi)$, we can map $p^V$ into the global coordinate $p^W$ by:
$$p^W = T_t\, p^V.$$
The lane ID separation is obtained from the tracking. When the tracking of a lane is finished, its lane ID is considered finished as well, and we increment the lane ID index by 1 (a minimal sketch of this flat-road back-projection is given after this list).
Poles’ reconstruction: Given the camera projection matrix and the camera pose $[R \mid t]$, the projection of the Plücker coordinates $\mathcal{L}^W = (\mathbf{n}^\top, \mathbf{v}^\top)^\top$ of a 3D line in the world coordinate to a 2D line $l$ in the image plane can be formulated as:
$$l = \mathcal{K} \begin{bmatrix} R & [t]_\times R \end{bmatrix} \mathcal{L}^W,$$
where $\mathcal{K}$ is the line projection intrinsic matrix and $[t]_\times$ is the skew-symmetric matrix of $t$. Here, we adopted the approach used in [31] to ensure a more-accurate reconstruction of the pole-like elements. The reconstruction result of this approach can be seen in Figure 5, where the poles along the linklet area of a vehicle are reconstructed.
Sign reconstruction: Here, we adopted the approach used in [31] to reconstruct the traffic sign by first estimating the plane model of the sign, $\pi = (\mathbf{n}^\top, d)^\top$. Then, the 3D positions of the 2D contour points can be generated given that the depth $\lambda_i$ of each control point $\tilde{p}_i$ is solved with the condition $\mathbf{n}^\top(\lambda_i \tilde{p}_i) + d = 0$, and its world coordinate can be obtained as:
$$P_i^W = T_v\,(\lambda_i \tilde{p}_i),$$
where $T_v$ is the vehicle pose. We also used the rectangular and circular template points as our sign shape templates, as in [31]. The information for each reconstructed sign is saved as a set of points and the corresponding class of the sign. This information will be used in the next step to determine changes in the map (see the ray-plane sketch after this list).
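To make the two geometric steps above concrete, the following sketches illustrate the flat-road back-projection used for lane reconstruction and the ray-plane intersection used for sign reconstruction. Both are minimal illustrations under the stated assumptions; the interfaces, frame conventions, and parameter names are ours, not the paper's.

```python
import numpy as np

def ipm_backproject(px, K, R_vc, t_vc, road_z=0.0):
    """Back-project a lane pixel onto the flat road plane z = road_z.

    px:          (u, v) pixel coordinate of a lane control point.
    K:           3x3 camera intrinsic matrix.
    R_vc, t_vc:  camera-to-vehicle rotation (3x3) and translation (3,)
                 of the installation pose.
    Returns the 3D point in the vehicle local coordinate.
    """
    ray_c = np.linalg.inv(K) @ np.array([px[0], px[1], 1.0])  # viewing ray, camera frame
    ray_v = R_vc @ ray_c                                       # ray in vehicle frame
    lam = (road_z - t_vc[2]) / ray_v[2]                        # scale fixed by road height
    return t_vc + lam * ray_v                                  # point on the road plane
```

And the ray-plane step for one sign contour point, assuming the plane model $(\mathbf{n}, d)$ has already been estimated in the camera frame:

```python
def sign_point_from_plane(px, K, n, d):
    """Intersect the viewing ray of a sign contour pixel with the sign plane.

    px: (u, v) pixel coordinate of a control point.
    K:  3x3 camera intrinsic matrix.
    n, d: plane model n^T X + d = 0 in the camera frame.
    Returns the 3D point in the camera frame.
    """
    ray = np.linalg.inv(K) @ np.array([px[0], px[1], 1.0])
    lam = -d / float(n @ ray)      # solve n^T (lam * ray) + d = 0 for the depth
    return lam * ray
```

Mapping either result into the global frame is then a single rigid-body transform with the vehicle pose.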
3.4. Map Matching
We propose an expected observation area that a vehicle will pass through on its trajectory. As shown in Figure 6, we define this area as an isosceles triangle with apex angle $\theta$, length $L$, and width $W$. Then, the corner points $P_l$ and $P_r$ in the world coordinate can be obtained. It is important to note that the matching in this process is performed in the world coordinate. As the vehicle drives along its trajectory, the points $P_l$ and $P_r$ sweep out the expected observation area. This area is important for the vehicle to determine the limit of its observation, since it is unrealistic to expect a vehicle to have a full observation of all the map elements it passes. The limit can be defined by the two point sets $\{P_l\}$ and $\{P_r\}$.
Through this model, we can determine whether an unmatched map element is out of the observation limit or no longer exists. Here, we denote the map elements that belong to the expected observation area inside the linklet area $e$ as $M_e$. The map elements that belong to $M_e$ satisfy the constraint of lying inside the area bounded by the point sets $\{P_l\}$ and $\{P_r\}$, as illustrated by the sketch below.
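As an illustration of this constraint, the sketch below tests whether a map element point lies inside one triangular slice of the expected observation area, with corners at the vehicle position (the apex) and the corresponding pair $P_l$, $P_r$. Decomposing the swept area into per-pose triangles is our simplification of the check.

```python
def in_observation_triangle(p, apex, p_l, p_r):
    """Check whether 2D point p lies inside the triangle (apex, p_l, p_r).

    Uses the sign of cross products: p is inside if it lies on the same
    side of all three triangle edges (or exactly on an edge).
    """
    def side(a, b, c):
        # z-component of the cross product (b - a) x (c - a)
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

    d1 = side(apex, p_l, p)
    d2 = side(p_l, p_r, p)
    d3 = side(p_r, apex, p)
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    return not (has_neg and has_pos)
```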
After we are able to find the map element sets that belong to the expected observation area, we can then match them with the list of detected map elements $D$. The matching process is performed via a point-based approach to simplify the calculation on the vehicle side. Given that all the lane line point sets from the map database are $M_{lane}$, the pole sets from the map database are $M_{pole}$, the sign sets from the map database are $M_{sign}$, the lane line detection points in the 3D world coordinate are $D_{lane}$, the pole points are $D_{pole}$, and the sign points are $D_{sign}$, then the map database set can be defined as:
$$M = \{M_{lane}, M_{pole}, M_{sign}\},$$
and the map element detection set can be defined as $D = \{D_{lane}, D_{pole}, D_{sign}\}$. The distance matching is performed by considering only the detection result.
Thus, we can obtain the three important parameters, which are: the map elements from the map database in the expected observation area, the map elements observed, and the map elements matched. All of this information is sent to the server as the result of single-session monitoring.
5. Experimental Results
5.1. Experimental Setup
We simulated multi-session monitoring by traversing the Beijing Yizhuang District multiple times. Each time, the distance traveled was between 10 and 20 km; therefore, several linklet areas were passed by the vehicle. Our experimental vehicle was equipped with a monocular camera with 1080p resolution recording at 10 fps. We also equipped the vehicle with an IMU sensor, gyroscope, and wheel encoder running at 100 Hz. Although the vehicle was equipped with a Hesai 40P LiDAR sensor, the main localization tool used was GNSS-RTK to obtain high-precision positioning. The vehicle’s sensor details can be seen in Table 1, and the vehicle itself can be seen in Figure 8.
The whole Yizhuang District area was divided into 106 linklet areas, and we collected 20 sequences of data in this area, which will henceforth be referred to as the Yizhuang dataset. The data consist of raw images and GNSS data, complete with timestamp information, and are used to evaluate the performance of our proposed method. To find changes on the map, we first removed some of the map data randomly to check whether our system is able to effectively and accurately detect the linklet areas that contain these errors. Furthermore, we checked the reconstruction results to further validate the map changes.
5.2. Evaluation Metric
In order to measure the accuracy of the multi-session monitoring system, we used the sensitivity (true positive rate (TPR)) and specificity (true negative rate (TNR)) metrics from the confusion matrix that we generated:
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},$$
where TPR is the true positive rate, TP is the number of true positives, FN is the number of false negatives, TNR is the true negative rate, TN is the number of true negatives, and FP is the number of false positives.
The TP and TN values correspond to correct predictions as to whether or not there has been a change in the map, while the FP and FN values correspond to wrong predictions. We counted a TP when our reconstructed data differed from the modified map data and matched the change we had previously made to the original map; this indicates that our monitoring algorithm successfully detected the change. Conversely, we counted an FN when our reconstructed data matched the modified map data despite there being a real-world change. For example, if a pole exists in reality and we have removed it from our map data, but our reconstructed data also do not show the pole, the reconstruction aligns with the modified map. In this case, the algorithm incorrectly perceives no change, hence a false negative.
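As an illustration, the decision logic for a single linklet can be written as the following sketch; the boolean inputs and the naming are ours, not the paper's.

```python
def classify_linklet(change_detected: bool, change_injected: bool) -> str:
    """Confusion-matrix label for one linklet decision.

    change_detected: the monitoring system reports a map change.
    change_injected: a change was actually made to the map data.
    """
    if change_detected and change_injected:
        return "TP"   # real change, correctly detected
    if change_detected and not change_injected:
        return "FP"   # no change, but one was reported
    if not change_detected and change_injected:
        return "FN"   # real change missed
    return "TN"       # no change, correctly reported as unchanged
```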
5.3. Monitoring Accuracy
Out of 106 linklet areas, we changed 40 linklet areas by either removing or shifting map elements in the world coordinate. It is important to note that, in these 40 linklet areas, we randomly changed the lane line, pole, and traffic sign data. This process induced map errors into the dataset. Here, we validate the results of our monitoring system through the detection of these changes in the map data; a visualization of this map can be seen in Figure 9. We provide two types of monitoring accuracy: map-element-specific monitoring accuracy and linklet monitoring accuracy. The first determines the elementwise specificity of our change detection; the second provides a bigger picture of the overall result of our system.
5.3.1. Map-Element-Specific Monitoring Accuracy
In this subsection, we analyze the specificity of our monitoring algorithm. Figure 10 shows the number of observations for each reconstructed map element in each of the linklet areas. It can be seen that some linklets have no observations whatsoever, because not all the linklet areas were passed by the vehicle. Here, we can see that, in most cases, the values for the ground truth and the reconstructed map elements do not differ much. The association value is also close to the value of the changed map, which indicates that our method was able to find the changes in the linklet. Given that the vehicle can observe most of the information the map provides, yet is able to reconstruct additional details, this implies that the map lacks sufficient information, hinting at a need for updates or changes.
Additionally, we discovered that the ground truth map data were occasionally incomplete for certain objects such as signs and poles. Take, for instance, the data for signs: the map only accounted for rectangular signs, which suggests a need for map updates. This issue is illustrated in Figure 11, where our system’s ability to identify such map inaccuracies is evident. However, to maintain the accuracy of our experiment, we excluded these instances from our analysis, as including erroneous ground truth data would significantly skew the measured accuracy of our method.
5.3.2. Linklet Monitoring Accuracy
After determining the map-element-specific changes, we can calculate the accuracy of our overall system by finding the changes we created. As we can see in Table 2, our system managed to find 26 of the 27 changes that were made. For the remaining 35 linklet areas that had not experienced changes, our system gave the correct decision for 31 of them. Overall, we were able to achieve above 90% in the TPR and close to 90% in the TNR.
The cases where our system detected false changes were mostly caused by topology problems with the lane lines. The ID assignment of the reconstructed lane lines did not match the IDs from the map data, causing the system to infer a map change because of the low number of matched IDs. However, after manually determining the start and end of the lane lines, the impact of this problem could be minimized.
5.4. Time Efficiency
To evaluate the runtime performance of the proposed algorithm, the single-session system was tested on an i7-6700 CPU and an NVIDIA Quadro V100 GPU. The GPU was mainly used for the map element detection and object tracking modules, and the CPU was used for the rest of the system. The multi-session system was tested on an i7-9850H CPU for all of its modules. However, it is important to note that we did not consider the cost of information transmission since it is heavily dependent on the signal coverage of the area. That said, considering the type and size of the transmitted information, which was about 8 to 14 KB per image, we can safely assume that the transmission time will be very short under normal circumstances. The average processing times can be seen in Table 3.
From the timing data, we can see that the processing bottlenecks lie in the map element detection and object tracking modules because of their neural network computation. The rest of the algorithms are lightweight compared to these two modules. This result demonstrates that our system is efficient in terms of computational requirements.
5.5. Challenging Cases
As shown in Figure 12, we found that the most-challenging cases in this experiment mostly involved the division of the lane line IDs of the map at road intersection areas, where a higher number of lane lines can be found. Here, the different lane line colors represent different lane line IDs. This problem directly correlates with how the topology of the lanes is defined, which might differ from the topology definition of the reconstruction result. This problem inflates the measured change even when there is little to no change in the map. Therefore, we performed manual adjustments of the lane topology in the local map. It is important to address this problem in the future to automate the whole process.