CN111797688A - Visual SLAM method based on optical flow and semantic segmentation - Google Patents
- Publication number
- CN111797688A (application CN202010488128.2A)
- Authority
- CN
- China
- Prior art keywords
- dynamic
- area
- semantic segmentation
- optical flow
- point
- Prior art date: 2020-06-02
- Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/757—Matching configurations of points or features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
Abstract
The invention belongs to the technical field of visual space positioning and discloses a visual SLAM method based on optical flow and semantic segmentation, comprising the following steps: segmenting the input image information with a semantic segmentation network to obtain a static region and a predicted dynamic region; performing feature tracking on the static region and the predicted dynamic region with a sparse optical flow method; judging the type of each feature point in the input image information and removing the dynamic feature points; and taking the feature point set with the dynamic feature points removed as tracking data, inputting it into ORB-SLAM for processing, and outputting the pose result. The method addresses the poor tracking and positioning performance of SLAM in dynamic environments and can obtain trajectory information with high pose accuracy in such environments.
Description
Technical Field
The invention relates to the technical field of visual space positioning, in particular to a visual SLAM method based on optical flow and semantic segmentation.
Background
SLAM is a key technology in the field of intelligent mobile robots. Visual SLAM uses a camera as its main sensor, and a camera can provide richer information than other types of sensors, so visual SLAM has been widely studied in recent years. However, achieving accurate tracking and localization in dynamic scenes has always been a significant challenge for SLAM systems.
In a real scene, dynamic objects can introduce erroneous data into the camera motion calculation, leading to tracking failure or incorrect tracking. Several methods have been proposed to address this problem. One is the traditional robust estimation method, RANSAC, which treats dynamic information as outliers to be removed and retains the static information so that tracking and motion calculation can succeed; however, when moving objects occupy most of the scene, the method fails because too little usable data can be extracted. Another approach integrates additional sensors, using the complementary data of multiple sensors as a compensation strategy for tracking and motion calculation. However, this approach is uneconomical in terms of equipment cost, computational cost, and the like, and in practice it usually amounts to adding more cameras.
Existing methods are therefore not satisfactory for SLAM applications. With the application of deep learning to semantic segmentation, object detection, and related tasks in recent years, new solutions have become available for handling the influence of moving objects in dynamic scenes.
Visual SLAM can be divided into two types: feature-based methods and direct methods. Feature-based methods achieve tracking and positioning by matching point pairs through feature descriptors and minimizing the reprojection error, and they remain fairly robust to geometric noise; however, extracting feature points is time-consuming. Direct methods optimize the pose for tracking by minimizing an error based on the gray-scale invariance assumption; they perform better than feature-based methods in low-texture environments and have a lower time cost, but the overall algorithm is less robust. Neither the feature-point method nor the direct method can handle the problems caused by common dynamic objects, whose erroneous data associations reduce the accuracy of the computed pose.
Disclosure of Invention
The embodiments of the present application solve the problem of poor SLAM tracking and positioning performance in dynamic environments by providing a visual SLAM method based on optical flow and semantic segmentation.
The embodiment of the application provides a visual SLAM method based on optical flow and semantic segmentation, which comprises the following steps:
step 1, segmenting input image information by adopting a semantic segmentation network to obtain a static region and a predicted dynamic region;
step 2, performing feature tracking on the static area and the predicted dynamic area using a sparse optical flow method;
step 3, judging the types of the feature points in the input image information, and removing the dynamic feature points;
step 4, inputting the feature point set with the dynamic feature points removed into ORB-SLAM as tracking data for processing, and outputting a pose result.
Preferably, the input image information in step 1 is one of input data corresponding to a monocular camera, input data corresponding to a binocular camera, and input data corresponding to a depth camera;
in step 3, the types of the feature points in the input data corresponding to the monocular camera are judged through the epipolar constraint, and the types of the feature points in the input data corresponding to the binocular camera or the depth camera are judged through the reprojection error.
Preferably, the step 1 comprises the following substeps:
step 1.1, selecting a data set to train a Mask R-CNN network to obtain a trained semantic segmentation network; the data set includes multiple types of data as potential moving objects;
step 1.2, inputting the image information into the trained semantic segmentation network to complete image segmentation and obtain the static area A_s and the predicted motion area A_m.
Preferably, the step 2 comprises the following substeps:
step 2.1, using a sparse optical flow method to perform feature extraction and matching on the static area A_s and the predicted motion area A_m, obtaining a static matching point pair set and a predicted motion matching point pair set;
and 2.2, solving the pose based on the SLAM running model.
Preferably, the pose solving based on the SLAM running model in step 2.2 comprises:

at time k, projecting the j-th landmark y_j to the current frame to obtain the projection position h(ξ_k, y_j), and obtaining the corresponding observation model:

z_{k,j} = h(ξ_k, y_j) + v_{k,j}

wherein h(·) denotes the nonlinear model projecting the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j};

according to the observation model, establishing an error model from the reprojection error formed by the projection position and the corresponding pixel coordinates:

e_{k,j} = z_{k,j} - h(ξ_k, y_j)

wherein e_{k,j} denotes the difference between the position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k;

converting the error model into a nonlinear least-squares problem: all camera poses ξ and landmarks y are taken as the quantity x to be optimized; with the tracking duration m and the total number of landmarks n, the loss function is established as

J(x) = Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}

wherein J(·) denotes the loss function;

and the optimized pose is obtained by solving the loss function.
Preferably, in step 3, if the input image information is the input data corresponding to a binocular camera or a depth camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:

obtaining, by reprojection, a first offset vector set corresponding to the static matching point pair set; using a weighted-average method, calculating from the static area A_s and the first offset vector set the first offset vector weights T_i and the mean φ_s of the first offset vector weights;

obtaining, by reprojection, a second offset vector set corresponding to the predicted motion matching point pair set; using the weighted-average method, calculating from the predicted motion area A_m and the second offset vector set the second offset vector weights T_j;

judging the type of each feature point in the predicted motion area A_m according to the second offset vector weights T_j and the mean φ_s of the first offset vector weights;

judging whether the predicted motion area is a dynamic area: if the number of feature points judged to be dynamic in the predicted motion area exceeds a first threshold, the predicted motion area is marked as a dynamic area, and all feature points marked as belonging to the dynamic area are removed.
Preferably, the first offset vector set corresponding to the static matching point pair set is obtained as follows:

the rotation and translation in matrix form of the optimized pose obtained in step 2 are R and t, and the camera intrinsic matrix is K; a matching point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to the three-dimensional space point P_i, and P_i is projected to the current frame to obtain the projected coordinates

q̂_i = (1/Z_i) K (R P_i + t)

wherein q̂_i denotes the pixel position of the space point P_i projected into the current frame, Z_i is the depth of P_i in the current camera frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i;

the position offset vector corresponding to a matching point pair in the static area A_s is expressed as V_i = q_i - q̂_i, and the first set of position offset vectors corresponding to the n matching point pairs is expressed as V_state = {V_1, V_2, …, V_n}.
Preferably, the first offset vector weights T_i and the mean φ_s of the first offset vector weights are obtained as follows:

for each first offset vector V_i, its angle θ_i and modulus l_i are calculated, and the weight T_i is formed from the angle and the modulus; φ_s denotes the mean of the first offset vector weights;

the type of each feature point in the predicted motion area A_m is judged as follows:

T_j is compared with φ_s: if T_j is greater than φ_s, the feature point is judged to be a dynamic feature point; otherwise, it is judged to be a static feature point.
Preferably, in step 3, if the input image information is the input data corresponding to a monocular camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:

obtaining the fundamental matrix F from the rotation R and the translation t corresponding to the optimized pose obtained in step 2:

F = K^{-T} t^∧ R K^{-1}

obtaining the epipolar line Fp_k = [x, y, z]^T from the fundamental matrix F and from the static matching point pair set and the predicted motion matching point pair set obtained in step 2;

obtaining the feature point D value from the epipolar line;

presetting a second threshold η, and judging the type of each feature point from its D value and the second threshold η;

judging whether the predicted motion area is a dynamic area: if the number of feature points judged to be dynamic in the predicted motion area exceeds the first threshold, the predicted motion area is marked as a dynamic area, and all feature points marked as belonging to the dynamic area are removed.
Preferably, the feature point D value is calculated as

D = |q_k^T F p_k| / sqrt(x^2 + y^2)

wherein p_k and q_k are used in homogeneous form, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point to the epipolar line Fp_k;

if the feature point D value is greater than the second threshold η, the point p_k is judged to be a dynamic feature point; otherwise, it is judged to be a static feature point.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
in the embodiments of the present application, the input image information is first segmented with a semantic segmentation network to obtain a static area and a predicted dynamic area; feature tracking is then performed on the static area and the predicted dynamic area with a sparse optical flow method; next, the types of the feature points in the input image information are judged and the dynamic feature points are removed; finally, the feature point set with the dynamic feature points removed is input into ORB-SLAM as tracking data for processing, and the pose result is output. The method extracts dynamic objects, judges them, and removes their influence by combining semantic segmentation with an optical flow method, and applies the remaining static feature points, free of dynamic influence, to the subsequent SLAM system, so that trajectory information with high pose accuracy can finally be obtained in dynamic environments. Compared with traditional methods for handling dynamic environments, this method effectively identifies and eliminates the influence of dynamic-object features and the resulting loss of pose accuracy.
Drawings
In order to more clearly illustrate the technical solution in the present embodiment, the drawings needed in the description of the embodiment are briefly introduced below. It is apparent that the drawings described below show only one embodiment of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is an overall flowchart of a visual SLAM method based on optical flow and semantic segmentation according to an embodiment of the present invention.
Detailed Description
The embodiment provides a visual SLAM method based on optical flow and semantic segmentation, which mainly comprises the following steps:
Step 1, segmenting the input image information with a semantic segmentation network to obtain a static region and a predicted dynamic region.
Step 2, performing feature tracking on the static region and the predicted dynamic region with a sparse optical flow method.
Step 3, judging the types of the feature points in the input image information and removing the dynamic feature points.
Step 4, inputting the feature point set with the dynamic feature points removed into ORB-SLAM as tracking data for processing, and outputting the pose result.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The embodiment provides a visual SLAM method based on optical flow and semantic segmentation, as shown in fig. 1, including the following steps:
Step 1, segmenting the input image information (data) using a Mask R-CNN network, distinguishing static and dynamic objects, and obtaining a static area and a predicted dynamic area.
Step 1.1, selecting a data set to train the Mask R-CNN network; 20 categories of the COCO data set are selected as potential moving objects, for example: people, bicycles, buses, boats, birds, cats, dogs, etc.
Step 1.2, reading the data corresponding to a monocular camera, a binocular camera, or an RGB-D depth camera and inputting it into the network. In the trained semantic segmentation network, the format of the input image is m × n × 3 and the format of the output result is m × n × l, where m × n is the image size, 3 is the number of image channels (RGB), and l is the number of training categories selected in step 1.1 (i.e., 20). Semantic segmentation is completed by marking the regions belonging to the 20 selected categories of possible moving objects, and the image segmentation yields the static area A_s and the predicted motion area A_m. If the segmentation does not produce A_m, it is considered that no dynamic area exists in the data, processing of dynamic areas is unnecessary, the data can be handled by conventional ORB-SLAM, and the flow proceeds directly to step 4.
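The patent text does not give an implementation for this segmentation step; the following is a minimal sketch of how the static/predicted-dynamic split could be produced with an off-the-shelf Mask R-CNN. The torchvision model, the listed COCO class ids, and the 0.5 score/mask thresholds are illustrative assumptions, not part of the disclosed method.

```python
# Minimal sketch (assumptions noted above): split a frame into a static mask A_s and a
# predicted-dynamic mask A_m with a pretrained Mask R-CNN.
import numpy as np
import torch
import torchvision

# A few COCO ids treated as potential movers (person, bicycle, bus, boat, bird, cat, dog)
DYNAMIC_CLASS_IDS = {1, 2, 6, 9, 16, 17, 18}

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def segment_dynamic(frame_rgb: np.ndarray):
    """Return (A_s, A_m): boolean masks for the static and predicted-dynamic areas."""
    img = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([img])[0]
    h, w = frame_rgb.shape[:2]
    A_m = np.zeros((h, w), dtype=bool)
    for label, score, mask in zip(out["labels"], out["scores"], out["masks"]):
        if score.item() > 0.5 and label.item() in DYNAMIC_CLASS_IDS:
            A_m |= mask[0].numpy() > 0.5
    A_s = ~A_m  # everything not predicted dynamic is treated as static
    return A_s, A_m
```

If A_m comes back empty, the frame can be passed to conventional ORB-SLAM directly, as described above.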
Step 2, performing feature tracking on the static area and the predicted dynamic area using a sparse optical flow method.
After the data preprocessing is complete, a lightweight algorithm is used for tracking. Its essence is to retain the feature extraction and tracking functions on the basis of ORB-SLAM tracking, replacing the feature-point method with an optical flow method and removing the local-optimization and keyframe-decision modules, thereby completing feature extraction, matching, and pose solving.
Step 2.1, for the segmented data, the Lucas-Kanade optical flow method is applied to the static area A_s and the predicted motion area A_m for feature extraction and matching. For the current frame, the set of all feature points P is obtained, and matching yields the static matching point pair set Pmatch_s = {(p_i, q_i), i = 1, 2, 3, …, n} and the predicted motion matching point pair set Pmatch_m = {(p_j, q_j), j = 1, 2, 3, …, m}, where p_i and q_i denote the i-th statically matched pixel point pair of the previous and current frames, p_j and q_j denote the j-th predicted-motion matched pixel point pair of the previous and current frames, and n and m denote the numbers of static-area and predicted-motion-area matching point pairs, respectively.
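As an illustration of this step, the sketch below uses OpenCV's pyramidal Lucas-Kanade optical flow to produce matched point pairs inside one region mask; the detector and optical-flow parameters are assumptions chosen for illustration.

```python
# Minimal sketch: Lucas-Kanade sparse optical flow matching inside a region mask,
# applied separately to A_s and A_m to form Pmatch_s and Pmatch_m.
import cv2
import numpy as np

def track_region(prev_gray, curr_gray, region_mask):
    """Return matched pixel pairs (p, q): p in the previous frame, q in the current frame."""
    mask8 = region_mask.astype(np.uint8) * 255
    p = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01,
                                minDistance=7, mask=mask8)
    if p is None:
        return np.empty((0, 2)), np.empty((0, 2))
    q, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p, None,
                                               winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    return p.reshape(-1, 2)[ok], q.reshape(-1, 2)[ok]

# e.g. Pmatch_s = track_region(prev_gray, curr_gray, A_s)
#      Pmatch_m = track_region(prev_gray, curr_gray, A_m)
```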
Step 2.2, solving the pose based on the SLAM running model. At time k, the j-th landmark y_j is projected to the current frame to obtain the projection position h(ξ_k, y_j), and the corresponding observation model is obtained:

z_{k,j} = h(ξ_k, y_j) + v_{k,j}

where h(·) denotes the nonlinear model projecting the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j}.

According to the observation model, an error model can be established from the reprojection error formed by the projection position and the corresponding pixel coordinates:

e_{k,j} = z_{k,j} - h(ξ_k, y_j)

where e_{k,j} denotes the difference between the position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k.

Step 2.3, converting the error model into a nonlinear least-squares problem: all camera poses ξ and landmarks y are taken as the quantity x to be optimized; with the tracking duration m and the total number of landmarks n, the following loss function is established:

J(x) = Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}

where J(·) denotes the loss function, k indexes the tracking time, j indexes the j-th landmark, e_{k,j} is the error of step 2.2, and Q_{k,j} is the covariance of the Gaussian noise.

The optimized camera pose is obtained by solving this loss function.
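A compact way to picture the minimization of J(x) is the pose-only variant below, which stacks the reprojection errors for one frame and hands them to a generic least-squares solver; the rotation-vector parameterization and the unit covariance are simplifying assumptions, as the patent does not prescribe a particular solver.

```python
# Minimal sketch: pose-only minimization of the reprojection-error loss for one frame.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(x, K, landmarks, observations):
    """Stacked errors e_j = z_j - h(xi, y_j); x = [rotation vector, translation]."""
    R = Rotation.from_rotvec(x[:3]).as_matrix()
    t = x[3:]
    P_cam = landmarks @ R.T + t            # landmarks in the current camera frame
    proj = P_cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]      # pinhole projection to pixel coordinates
    return (observations - proj).ravel()

def solve_pose(K, landmarks, observations, x0=np.zeros(6)):
    res = least_squares(residuals, x0, args=(K, landmarks, observations), method="lm")
    return res.x                           # optimized pose [rotvec, t]
```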
Step 3, judging the types of the feature points in the input image information and removing the dynamic feature points.

The input image information is one of the input data corresponding to a monocular camera, a binocular camera, or a depth camera. Different dynamic-feature-point judgment and processing methods are designed for the different sensor types corresponding to the different input data types.

The RGB-D depth camera and binocular camera types are processed by the same method, entering step 3.1; the monocular camera type jumps directly to step 3.4.
Step 3.1 addresses the RGB-D and binocular systems. The Lie-algebra form of the initial pose obtained after the optimization in step 2 is ξ, the corresponding rotation and translation are R and t, and the camera intrinsic matrix K is known. A matched pixel point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to the three-dimensional space point P_i, and P_i is projected to the current frame to obtain the projected coordinates

q̂_i = (1/Z_i) K (R P_i + t)

where q̂_i denotes the pixel position of the space point P_i projected into the current frame, Z_i is the depth of P_i in the current camera frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i.

If there were no error, q̂_i would coincide with q_i; however, because of noise and other influences, a position offset may exist for both the static feature points and the predicted motion points.

The position offset vector corresponding to a static matching point pair is expressed as V_i = q_i - q̂_i, and the set of offset vectors of the n matching point pairs, V_state = {V_1, V_2, …, V_n}, is the first offset vector set corresponding to the static matching point pair set.
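The offset vectors themselves can be computed directly from the optimized pose, as in the short sketch below (a numpy illustration under the same pinhole-projection assumption; not taken verbatim from the patent).

```python
# Minimal sketch: reprojection offsets V_i = q_i - q_hat_i for matched points with
# known 3-D positions (RGB-D / stereo case).
import numpy as np

def offset_vectors(K, R, t, P_prev, q_curr):
    """P_prev: Nx3 space points of the previous-frame features; q_curr: Nx2 matched pixels."""
    P_cam = P_prev @ R.T + t               # transform into the current camera frame
    proj = P_cam @ K.T
    q_hat = proj[:, :2] / proj[:, 2:3]     # projected pixel coordinates q_hat_i
    return q_curr - q_hat                  # offset vectors V_i

# V_state = offset_vectors(K, R, t, P_static, q_static)   # first offset vector set
# V_other = offset_vectors(K, R, t, P_motion, q_motion)   # second offset vector set
```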
Step 3.2, using a weighted-average method and the static matching point pairs obtained in step 2.1 to describe the static area A_s: for the static area A_s and the first offset vector set V_state, the angle θ_i and modulus l_i of each offset error vector are calculated, and from them the weight T_i is obtained.

Then the mean φ_s of the offset vector weights is calculated.
Step 3.3, obtaining by reprojection the second offset vector set corresponding to the predicted motion matching point pair set, and using the weighted-average method to calculate, from the predicted motion area A_m and the second offset vector set, the second offset vector weights T_j.

That is, with reference to steps 3.1 and 3.2, the second offset vector set V_other of the predicted motion area A_m is obtained, together with the angle and modulus of each offset error and the corresponding second offset vector weight T_j; T_j is then compared with φ_s: if T_j is greater than φ_s, the feature point is judged to be dynamic; otherwise, it is judged to be static.

In this way, each vector in V_other completes the dynamic/static judgment of the corresponding predicted feature point. After this step is completed, the flow jumps to step 3.6.
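The exact weighting formula for T_i is not reproduced in this text, so the sketch below uses a normalized combination of each offset vector's modulus and its angular deviation from the static region's mean offset direction purely as a stand-in; only the comparison T_j > φ_s follows the rule stated above.

```python
# Minimal sketch: stand-in offset-vector weights (the patent's exact formula is not
# reproduced here) and the dynamic/static comparison against the static mean phi_s.
import numpy as np

def offset_stats(V_state):
    """Mean modulus and mean direction of the static offset vectors."""
    mean_len = np.linalg.norm(V_state, axis=1).mean()
    mean_dir = np.arctan2(V_state[:, 1].mean(), V_state[:, 0].mean())
    return mean_len, mean_dir

def offset_weights(V, mean_len, mean_dir, eps=1e-9):
    """Weight of each offset vector from its modulus and angle (illustrative stand-in)."""
    lengths = np.linalg.norm(V, axis=1)
    angles = np.arctan2(V[:, 1], V[:, 0])
    dtheta = np.abs(np.arctan2(np.sin(angles - mean_dir), np.cos(angles - mean_dir)))
    return 0.5 * lengths / (mean_len + eps) + 0.5 * dtheta / np.pi

# usage (V_state, V_other from the previous sketch):
# mean_len, mean_dir = offset_stats(V_state)
# T_i = offset_weights(V_state, mean_len, mean_dir); phi_s = T_i.mean()
# T_j = offset_weights(V_other, mean_len, mean_dir)
# is_dynamic = T_j > phi_s   # T_j > phi_s -> dynamic feature point
```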
Step 3.4 addresses the monocular system. The Lie-algebra form of the initial pose obtained after the optimization in step 2 is ξ, and the rotation and translation corresponding to ξ are R and t; the fundamental matrix F of the current motion is then obtained:

F = K^{-T} t^∧ R K^{-1}

where t^∧ denotes the skew-symmetric matrix of t. Meanwhile, the set of all feature-matched point pairs between the two frames (including the static-area matching point pairs and the predicted-motion-area matching point pairs), Pmatch = {(p_k, q_k), k = 1, 2, 3, …, n}, was obtained in step 2.1, where n denotes the total number of matched point pairs; combined with the fundamental matrix F, the epipolar line Fp_k = [x, y, z]^T is obtained.

Here p_k and q_k are used in homogeneous form, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point to the epipolar line Fp_k.
Step 3.5, for the point pairs in the set Pmatch, the feature point D value is calculated in turn as

D = |q_k^T F p_k| / sqrt(x^2 + y^2)

a threshold η is set, and the type of each feature point is judged from its D value and the threshold η: if D is greater than η, the point is judged to be a dynamic feature point; otherwise, it is judged to be a static feature point.

In testing, setting η to 5 was found to give stable results, so η = 5 is preferably used in this solution.
Step 3.6, the predicted motion areas extracted in step 1.2 are judged according to the set of dynamic feature points: when most of the feature points in a predicted motion area (for example, more than 80%) are determined to be dynamic feature points, the area is determined to be a dynamic area, and all feature points marked as belonging to the dynamic area are then removed.
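Step 3.6 is essentially a majority vote per region; a sketch is given below, where region_ids (which segmented region each predicted-motion feature belongs to) is assumed bookkeeping not spelled out in the patent text, and the 0.8 ratio follows the 80% example above.

```python
# Minimal sketch: mark a predicted motion region as dynamic when most of its feature
# points were judged dynamic, then drop every feature of that region.
import numpy as np

def remove_dynamic_regions(points, region_ids, is_dynamic, ratio=0.8):
    """Return the feature points that survive after dynamic regions are removed."""
    keep = np.ones(len(points), dtype=bool)
    for rid in np.unique(region_ids):
        in_region = region_ids == rid
        if is_dynamic[in_region].mean() > ratio:   # region dominated by dynamic points
            keep &= ~in_region                     # remove all its feature points
    return points[keep]
```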
Step 4, the set P_e obtained by removing the motion feature points from all feature points (including the feature points corresponding to the static area and the predicted dynamic area) is used as the tracking data. At this point the set P_e is free of the influence of dynamic features; it is then input into the traditional ORB-SLAM framework for processing and the pose result is output.

Step 4.1, using the set P_e, completing local map creation and pose optimization in tracking.

Step 4.2, performing loop-closure detection.

Step 4.3, outputting the pose result.
In summary, the SLAM method provided by the invention, which combines semantic segmentation and an optical flow method to extract dynamic objects, judge them, and remove their influence, uses a semantic segmentation network to effectively segment potential dynamic objects, then uses a sparse optical flow method to complete stable feature tracking, then judges and removes dynamic feature points through the epipolar constraint of matched point pairs and the distribution difference of reprojection errors, and applies the static feature points free of dynamic influence to the subsequent SLAM system, finally obtaining trajectory information with higher pose accuracy in dynamic environments. Compared with traditional methods for handling dynamic environments, the method effectively identifies and eliminates the influence of dynamic-object features and the resulting loss of pose accuracy.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to examples, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and such modifications and substitutions shall be covered by the claims of the present invention.
Claims (10)
1. A visual SLAM method based on optical flow and semantic segmentation is characterized by comprising the following steps:
step 1, segmenting input image information by adopting a semantic segmentation network to obtain a static region and a predicted dynamic region;
step 2, performing feature tracking on the static area and the predicted dynamic area using a sparse optical flow method;
step 3, judging the types of the feature points in the input image information, and removing the dynamic feature points;
step 4, inputting the feature point set with the dynamic feature points removed into ORB-SLAM as tracking data for processing, and outputting a pose result.
2. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 1 wherein the input image information in step 1 is one of input data corresponding to a monocular camera, input data corresponding to a binocular camera, input data corresponding to a depth camera;
in step 3, the types of the feature points in the input data corresponding to the monocular camera are judged through the epipolar constraint, and the types of the feature points in the input data corresponding to the binocular camera or the depth camera are judged through the reprojection error.
3. The optical flow and semantic segmentation based visual SLAM method as defined in claim 1 wherein step 1 comprises the sub-steps of:
step 1.1, selecting a data set to train a Mask R-CNN network to obtain a trained semantic segmentation network; the data set includes multiple types of data as potential moving objects;
step 1.2, inputting the image information into the trained semantic segmentation network to complete image segmentation and obtain the static area A_s and the predicted motion area A_m.
4. The optical flow and semantic segmentation based visual SLAM method of claim 3 wherein said step 2 comprises the sub-steps of:
step 2.1, using a sparse optical flow method to perform feature extraction and matching on the static area A_s and the predicted motion area A_m, obtaining a static matching point pair set and a predicted motion matching point pair set;
and 2.2, solving the pose based on the SLAM running model.
5. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 4, wherein the pose solving based on the SLAM running model in step 2.2 comprises:

at time k, projecting the j-th landmark y_j to the current frame to obtain the projection position h(ξ_k, y_j), and obtaining the corresponding observation model:

z_{k,j} = h(ξ_k, y_j) + v_{k,j}

wherein h(·) denotes the nonlinear model projecting the landmark under a known pose transformation, z_{k,j} denotes the pixel coordinates of landmark y_j in the current frame, and v_{k,j} ~ N(0, Q_{k,j}) denotes Gaussian noise with mean 0 and covariance Q_{k,j};

according to the observation model, establishing an error model from the reprojection error formed by the projection position and the corresponding pixel coordinates:

e_{k,j} = z_{k,j} - h(ξ_k, y_j)

wherein e_{k,j} denotes the difference between the position of landmark y_j in the current frame and its projected position, and ξ_k denotes the Lie-algebra form of the pose transformation between the two frames at time k;

converting the error model into a nonlinear least-squares problem: taking all camera poses ξ and landmarks y as the quantity x to be optimized, with the tracking duration m and the total number of landmarks n, and establishing the loss function

J(x) = Σ_{k=1}^{m} Σ_{j=1}^{n} e_{k,j}^T Q_{k,j}^{-1} e_{k,j}

wherein J(·) denotes the loss function;

and obtaining the optimized pose by solving the loss function.
6. The visual SLAM method based on optical flow and semantic segmentation as set forth in claim 5, wherein, in step 3, if the input image information is the input data corresponding to a binocular camera or a depth camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:

obtaining, by reprojection, a first offset vector set corresponding to the static matching point pair set; using a weighted-average method, calculating from the static area A_s and the first offset vector set the first offset vector weights T_i and the mean φ_s of the first offset vector weights;

obtaining, by reprojection, a second offset vector set corresponding to the predicted motion matching point pair set; using the weighted-average method, calculating from the predicted motion area A_m and the second offset vector set the second offset vector weights T_j;

judging the type of each feature point in the predicted motion area A_m according to the second offset vector weights T_j and the mean φ_s of the first offset vector weights;

judging whether the predicted motion area is a dynamic area: if the number of feature points judged to be dynamic in the predicted motion area exceeds a first threshold, marking the predicted motion area as a dynamic area and removing all feature points marked as belonging to the dynamic area.
7. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 6, wherein the first offset vector set corresponding to the static matching point pair set is obtained as follows:

the rotation and translation in matrix form of the optimized pose obtained in step 2 are R and t, and the camera intrinsic matrix is K; a matching point pair p_i and q_i(x_i, y_i) of the previous frame and the current frame corresponds to the three-dimensional space point P_i, and P_i is projected to the current frame to obtain the projected coordinates

q̂_i = (1/Z_i) K (R P_i + t)

wherein q̂_i denotes the pixel position of the space point P_i projected into the current frame, Z_i is the depth of P_i in the current camera frame, x_i, y_i denote the pixel coordinates of q_i, and x̂_i, ŷ_i denote the pixel coordinates of q̂_i; the position offset vector corresponding to a matching point pair in the static area A_s is expressed as V_i = q_i - q̂_i, and the first offset vector set corresponding to the n matching point pairs is expressed as V_state = {V_1, V_2, …, V_n}.
8. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 7, wherein the first offset vector weights T_i and the mean φ_s of the first offset vector weights are obtained as follows:

for each first offset vector V_i, calculating its angle θ_i and modulus l_i, and forming the weight T_i from the angle and the modulus; φ_s denotes the mean of the first offset vector weights;

and the type of each feature point in the predicted motion area A_m is judged as follows:

T_j is compared with φ_s: if T_j is greater than φ_s, the feature point is judged to be a dynamic feature point; otherwise, it is judged to be a static feature point.
9. The visual SLAM method based on optical flow and semantic segmentation as set forth in claim 5, wherein, in step 3, if the input image information is the input data corresponding to a monocular camera, judging the types of the feature points in the input image information and removing the dynamic feature points comprises the following substeps:

obtaining the fundamental matrix F from the rotation R and the translation t corresponding to the optimized pose obtained in step 2:

F = K^{-T} t^∧ R K^{-1}

obtaining the epipolar line Fp_k = [x, y, z]^T from the fundamental matrix F and from the static matching point pair set and the predicted motion matching point pair set obtained in step 2;

obtaining the feature point D value from the epipolar line;

presetting a second threshold η, and judging the type of each feature point from its D value and the second threshold η;

judging whether the predicted motion area is a dynamic area: if the number of feature points judged to be dynamic in the predicted motion area exceeds the first threshold, marking the predicted motion area as a dynamic area and removing all feature points marked as belonging to the dynamic area.
10. The visual SLAM method based on optical flow and semantic segmentation as claimed in claim 9, wherein the feature point D value is calculated as

D = |q_k^T F p_k| / sqrt(x^2 + y^2)

wherein p_k and q_k are used in homogeneous form, Fp_k denotes the epipolar line in epipolar geometry, x, y, z are the vector parameters of the epipolar line, and D denotes the distance from the matched point to the epipolar line Fp_k;

and if the feature point D value is greater than the second threshold η, the point p_k is judged to be a dynamic feature point; otherwise, it is judged to be a static feature point.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010488128.2A | 2020-06-02 | 2020-06-02 | Visual SLAM method based on optical flow and semantic segmentation |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111797688A | 2020-10-20 |

Family ID: 72806020
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-10-20 |