Relative pose of three calibrated and partially calibrated cameras from four points using virtual correspondences
Abstract
We study challenging problems of estimating the relative pose of three cameras and propose novel efficient solutions to the configurations (1) of four points in three calibrated cameras (the 4p3v problem), and (2) of four points in three cameras with unknown shared focal length (the 4p3vf problem). Our solutions are based on the simple idea of generating one or two additional virtual point correspondences in two views by using the information from the locations of the input correspondences. We generate such correspondences using a very simple and efficient strategy, where the new points are the mean points of three corresponding input points. The new solvers are efficient and easy to implement, since they are based on existing efficient minimal solvers, i.e., the well-known 5-point and 6-point relative pose solvers and the P3P solver. Extensive experiments on real data show that our solvers achieve state-of-the-art results. We also present a simple network that can improve the precision of the mean-point correspondences, showing the potential to learn better virtual point correspondences.
1 Introduction
Camera geometry estimation is crucial in many computer vision applications, e.g., visual navigation [48], Structure-from-Motion [52], augmented reality [5], self-driving cars [17], and visual localization [47]. Due to noise and outliers in the input correspondences, the predominant way for camera geometry estimation is to use a hypothesis-and-test framework, e.g., RANSAC [15, 8, 45, 2]. For RANSAC-like methods, using as few (ideally the minimal number of) correspondences as possible for estimation is important since the number of RANSAC iterations (and, thus, its run-time) grows exponentially with the number of correspondences required for the model estimation.
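The dependence of the RANSAC iteration count on the sample size can be illustrated with the standard stopping criterion. The sketch below is not taken from the paper; the function name, the inlier ratio, and the confidence value are illustrative assumptions.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int,
                      confidence: float = 0.99) -> int:
    """Standard RANSAC bound: number of random samples needed so that, with
    probability `confidence`, at least one sample contains only inliers."""
    p_all_inliers = inlier_ratio ** sample_size
    if p_all_inliers <= 0.0:
        raise ValueError("inlier_ratio must be positive")
    if p_all_inliers >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers))


if __name__ == "__main__":
    # Illustrative inlier ratio of 50%; each extra correspondence in the
    # sample roughly doubles the required number of iterations.
    for s in (4, 5, 6, 7):
        print(s, ransac_iterations(0.5, s))
```

For a fixed inlier ratio, the probability of drawing an all-inlier sample shrinks geometrically with the sample size, which is why a solver that needs fewer correspondences is attractive inside RANSAC.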
Minimal camera geometry problems often result in complex systems of polynomial equations. Efficient algebraic methods helped to solve many previously unsolved problems [55, 4, 28, 25, 24, 54]. Still, they fail to generate efficient and/or numerically stable solutions for some problems. In this paper, we study such challenging problems of estimating the relative pose of three cameras. These problems have received attention for a long time [18, 44, 30, 33, 1]. However, due to their complexity, they are still not considered fully solved. There are no efficient and practical solutions for most of the minimal configurations of point and/or line correspondences [22]. One such configuration that is particularly interesting is the notoriously difficult configuration of four points in three views [44, 34]. This configuration is minimal for cameras with an unknown shared focal length, i.e., the 4p3vf problem, and provides one more constraint than minimal for calibrated cameras, i.e., the 4p3v problem.
State-of-the-art algebraic and numerical methods are known to fail in generating efficient and numerically stable solutions to the 4p3v and 4p3vf problems. The existing methods for solving these problems are either too slow for practical applications [7, 9] or are only approximate [19, 35]. By solving only for one (a few) solutions from the 272 solutions of the 4p3v problem [19], and by discretely sampling the space of potential solutions [35], the existing 4p3v methods can often fail, i.e., the returned solution can be, in general, arbitrarily far from the geometrically correct solution. To decrease the failure rate, both methods [19, 35] require a lot of tuning and are not easy to re-implement. (Footnote 1: For the solver proposed in [35], there is no publicly available implementation. The publicly available implementation of the solver from [19] is quite complex and requires a non-negligible effort to run.)
In this paper, we propose a novel approach for solving the 4p3v and 4p3vf problems. Our solutions are based on the simple idea of generating new approximate point correspondence(s) between two of the three views. (Footnote 2: Note that similarly to the state-of-the-art solutions [35, 19], our solutions are only approximate. However, as we show in the experiments, they provide good initialization for local optimization [8] and outperform [19].) Such approximate correspondences are generated using only the locations of the original input point correspondences, without any information about the image itself (e.g., appearance or local features). Consequently, the new correspondences do not need to correspond to any physical 3D points in the scene. Thus, we call them virtual correspondences. Using virtual correspondences, we can efficiently solve the 4p3v and 4p3vf problems by first estimating the relative pose of two cameras from five/six correspondences using efficient 5pt/6pt solvers [34, 56], and then registering the third camera using a P3P solver [39]. We call these combinations the 5pt+P3P and 6pt+P3P solvers.
Based on this idea, we propose a group of novel solvers for the 4p3v and 4p3vf problems. These solvers, called M-based solvers (4p3v(M),4p3vf(M)), use the mean points of three corresponding points detected in two views and, potentially, points in their vicinity ((M)-solvers) as new virtual point correspondences. To compensate for noise in these virtual correspondences, our solvers refine the solutions on the original four points in three views using just a few iterations of Levenberg-Marquardt refinement. While conceptually very simple and efficient, the novel solvers achieve state-of-the-art results on real data.
The contributions of the paper are as follows:
• For the well-known challenging 4p3v problem for calibrated cameras, we propose novel M-based solvers. These solvers generate one or more additional virtual point correspondences in two views as the mean points of three corresponding points and refine the approximate solution on the original four points in three views. The new solvers achieve state-of-the-art results in terms of accuracy on real data. Compared to state-of-the-art 4p3v solvers [19, 35], which are non-trivial and difficult to re-implement such that they are numerically stable and fast, our new solutions can be easily implemented using existing efficient implementations of the 5pt solver [34] and the P3P solver [39]. The source code of our solvers will be publicly available.
• We provide efficient solutions to the 4p3vf problem for cameras with an unknown shared focal length. For each instance, our novel solvers generate two virtual correspondences to solve the problem via the efficient 6pt [56] and P3P [39] solvers. Our solutions are significantly faster than the existing homotopy-continuation solutions [7, 9]. Our solvers show the potential of virtual correspondences to be applied to other camera geometry problems.
• We present preliminary results for a simple network that can improve the precision of the mean-point correspondences. While our current versions of the learning-based (L-based) solvers do not provide sufficient improvement of virtual correspondences to produce a visible improvement after the refinement on all four correspondences, the proposed network shows the potential to learn better virtual point correspondences.
• To the best of our knowledge, we are the first to extensively evaluate solutions to the 4p3v and 4p3vf problems on a large variety of real-world scenes and within state-of-the-art RANSAC frameworks, and to compare them to the baseline minimal 5pt+P3P and 6pt+P3P solvers on such data.
2 Related work
Estimating the relative pose of three cameras from a minimal number of point and line correspondences is known to be an extremely challenging problem.
For three uncalibrated cameras, 6 point correspondences are necessary to estimate the trifocal tensor, with a solution known for a long time [43, 57]. Solutions to three minimal combinations of points and lines are presented in [36]. The minimal configuration using 9 lines is much more challenging and was solved only recently [27]. However, the final solver is far from practical since it runs for 17.8s.
For calibrated cameras, the configuration that attracts most of the attention is the configuration of four points in three views (the 4p3v problem). Note that this is not a minimal configuration since it generates 12 constraints for 11 degrees-of-freedom (DoF) (see also Section 3). The 4p3v problem is known to be extremely difficult to solve. Several papers present mostly theoretical results [30, 33, 1]. For four triplets of exact points without noise, it is shown that the 4p3v problem has, in general, a unique solution [18, 44].
To the best of our knowledge, there are only two reasonably efficient solutions to the 4p3v problem reported in the literature. The first solver [35] is based on a one-dimensional exhaustive search. It performs a sweep of a -degree curve of possible epipoles. For each potential epipole, it computes the relative pose of two cameras, registers the third camera using three triangulated points, and finally extracts the solution minimizing the reprojection error of the fourth point in the third view. Evaluation of the solver on one potential epipole is fast. Yet, to obtain reasonably precise and stable results, usually, 1,000 candidates need to be evaluated and even then, refinement at multiple local minima is required to improve the precision. The runtimes reported for this solver were depending on the number of points searched. There is no publicly available implementation for this solver and it is not easy to re-implement. As such, the literature does not compare against the solver in experiments. As an upper bound of the performance of [35], we compare with an oracle version using the true epipole in the supplementary material (SM).
The second efficient solver to the 4p3v problem was published only recently [19]. In this paper, the authors first transform the 4p3v problem into a minimal problem by considering a line passing through the last correspondence in the third view. The resulting system of equations is solved using an efficient Homotopy continuation (HC) method [14, 53]. To avoid computing large numbers of spurious solutions, an MLP-based classifier is trained. For a given problem , it selects one or several starting problem-solution pairs (so-called anchors), such that the geometrically meaningful/correct solution of can be obtained by HC starting from this anchor. This strategy is fast, running on average per solution. However, it has a high failure rate. The success rate of the 4p3v solver reported in [19] on two test datasets and data without noise is . [19] do not show results for a real scenario, i.e., a RANSAC-like framework with noisy data. Providing such an evaluation, we show that our much simpler solvers outperform [19].
Solutions to the 4p3v problem for orthographic and scaled orthographic views were presented in [59, 32]. In [32], the author suggested an iterative approach for updating to perspective views, but reported results only on a few synthetic instances. According to our own experiments, the update does not work on real data with general perspective cameras, even after spending months on this issue.
Minimal configurations of points and lines in three calibrated views were studied in [13, 22, 12], aiming to classify and derive the number of solutions for different configurations. Solutions to two minimal configurations of combinations of points and lines (Chicago and Cleveland), were proposed in [14] and solved using a HC method [53]. Due to their complexity, the solvers are not practical,
Recently, the HC method was used to solve the 4p3vf problem for cameras with an unknown shared focal length [7, 9]. The running times of the CPU variants of the proposed solvers range from to . Efficient GPU implementations run to . These times are still too slow for practical applications. Due to slow run-times and their dependency on a GPU, we are not comparing with the GPU solvers on real data. A GPU HC method was also used to solve minimal problems of four points/six lines in three views for a generalized camera in [11].
In our solutions, we generate virtual correspondences. Virtual matches are also used in the literature on affine correspondences (AC) [38, 40, 41, 3]. There, points are sampled based on the affine feature geometry to generate point correspondences from affine ones. In our scenario, we are only given point correspondences, without associated feature geometry, and we predict additional point matches.
3 Estimating the relative pose of three cameras
Let cameras observe a set of 3D points . For each point , let be the number of cameras that observe it. A necessary condition for a relative pose problem of calibrated cameras to be minimal is [14]
(1) |
Let denote a sample of 3D points and let denote a subset of with cardinality . A configuration of points in views that satisfies the constraint (1) of a minimal problem is three points visible in all three cameras and two additional points visible in two of the three cameras. We will call this configuration .
The configuration of four points visible in all three cameras, i.e., the configuration , generates an over-constrained problem. In this case, we have one more constraint than DoF, i.e., in Eq. (1), we have . A minimal solution would need to drop one constraint, e.g., by considering only a line passing through one of the points in the third view [19] or by considering a “half” point correspondence. Since, in practice, we always have full correspondences and sampling one less point in one view leads to an under-constrained problem, the configuration , is usually considered “minimal”.
For cameras with an unknown common focal length, we have one more DoF. This means that, for , the right-hand side of equation (1) becomes , resulting in and being minimal configurations.
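The counting argument behind condition (1) can be checked numerically. The sketch below assumes the usual bookkeeping: each observation of a point contributes two constraints, each 3D point adds three unknowns, and the camera poses contribute six DoF per camera minus seven for the similarity gauge, plus one for a shared unknown focal length. The function name and argument layout are hypothetical; the counts agree with the "12 constraints for 11 DoF" stated for the 4p3v problem.

```python
def constraint_balance(views_per_point, num_cameras=3, shared_unknown_focal=False):
    """Return (net constraints, degrees of freedom) for a point configuration.

    views_per_point: list with, for each 3D point, the number of cameras
    that observe it (assumed counting, see the lead-in above).
    """
    n = len(views_per_point)
    # Two image constraints per observation minus three unknowns per 3D point.
    constraints = 2 * sum(views_per_point) - 3 * n
    # Six pose DoF per camera, minus seven for the similarity ambiguity.
    dof = 6 * num_cameras - 7 + (1 if shared_unknown_focal else 0)
    return constraints, dof


if __name__ == "__main__":
    print(constraint_balance([3, 3, 3, 3]))                             # four points in three views
    print(constraint_balance([3, 3, 3, 2, 2]))                          # three full points + two two-view points
    print(constraint_balance([3, 3, 3, 3], shared_unknown_focal=True))  # 4p3vf
```

Under this counting, four points in three calibrated views are over-constrained by one, while the same configuration is balanced once a shared focal length is unknown.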
3.1 Calibrated cameras
In this section, we describe solutions for three calibrated cameras. We start with one baseline minimal solution for the minimal configuration, followed by novel solutions for the “minimal” configuration.
5pt+P3P solver: The 5pt+P3P solver first estimates the relative pose of two cameras from 5 image point correspondences using the efficient 5pt solver [34]. Next, the three points visible in all three views are triangulated. Finally, the third camera is registered using the three 2D-3D point correspondences and the well-known efficient P3P solver [39].
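The triangulation step between the two-view estimate and the P3P registration can be sketched as follows. The text does not state which triangulation method is used; a linear (DLT) two-view triangulation in normalized image coordinates is assumed here, and the function name is hypothetical.

```python
import numpy as np


def triangulate_dlt(P1, P2, x1, x2):
    """Linear two-view triangulation (assumed method, not taken from the paper).

    P1, P2: 3x4 camera matrices [R | t] in normalized coordinates.
    x1, x2: corresponding 2D points (normalized image coordinates).
    Returns the 3D point in the frame in which P1 and P2 are expressed.
    """
    P1, P2 = np.asarray(P1, float), np.asarray(P2, float)
    # Each view contributes two linear equations in the homogeneous 3D point.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

The three triangulated points, paired with their observations in the third view, form the 2D-3D correspondences that a P3P solver needs.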
This solver is straightforward and it is based on efficient existing solvers [34, 39]. This solver appears in the literature [13, 34, 46, 35]. Nister et al. [35] showed that the 5pt+P3P solver performs better than their dedicated 4p3v solver. However, in the most recent works [14, 19] that study the relative pose problem for three calibrated cameras, the 5pt+P3P solver is not discussed and is not used as a baseline for comparison. To our knowledge, the performance of this solver on real data and within state-of-the-art RANSAC frameworks in the context of the 4p3v problem has not been extensively studied. Our paper fills this gap in the literature.
Motivated by the efficient 5pt+P3P solver and the minimal configuration, which, compared to the configuration, leads to significantly simpler systems of polynomial equations, we next describe novel solvers to the calibrated 4p3v problem. They efficiently solve the configuration by generating a virtual point correspondence in the first two views, and solving the resulting problem using the 5pt+P3P solver.
4p3v(M) solver: Our first solver is based on a simple observation: If we fix the point in the first view to be the mean of three points , in this view, then the mean point of the corresponding three points in the second view usually has a small epipolar error w.r.t. the ground truth relative pose. Thus is usually a good approximation of a correct correspondence.
This can be considered a surprising observation, since 4 (or actually 3) points in two views define an infinite number of camera poses that can observe these points. However, the reason this mean-point strategy works in practice comes from several simple facts and observations. (1) To generate a good correspondence, we only require that the point in the second view be reasonably close to the epipolar line defined by the mean point in the first view, i.e., the 2D point does not need to correspond to one particular 3D point with a given depth. (Footnote 3: By fixing a point in one view, we are defining an epipolar line in the second view. Any of the points on this line (corresponding to 3D points with different depths) is in correspondence with the point in the first view.) (2) The epipolar line defined by passes through the triangle defined by the corresponding three points in the second image. Thus, the maximum distance of in the second image from the epipolar line is bounded by the maximum distance of from (for a proof, see SM). (3) For practical applications, when used in RANSAC, it is not necessary that each triplet generates a good correspondence . Samples with a high level of noise in the mean-point correspondence are filtered inside RANSAC. (Footnote 4: This property was also used in the HC solver [19], which completely fails for many samples. These samples are filtered within RANSAC.) On a large number of different scenes, we observed that even if some image pairs have triplets of points that generate very noisy mean-point correspondences, there are usually enough triplets for which the noise in is reasonably small to provide good estimates. (4) Four point correspondences in two views usually fix the space of possible poses such that the correspondence, even if noisy, often generates a pose that is not very far from the ground truth pose.
Such a pose is usually sufficient as an initialization of non-linear optimization on the original four points in three views and subsequent local optimization on detected inliers. We support our observations by experiments on a large amount of synthetic and real data (see Sec. 4 and SM).
Motivated by these observations, our 4p3v(M) solver uses the mean points of three corresponding points detected in two views as a new point correspondence. The 4p3v problem is then solved using the 5pt+P3P solver.
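The construction of the virtual correspondence is simple enough to state in a few lines. The following is a minimal sketch with a hypothetical function name; the resulting pair of points would be appended to the four real correspondences before calling a 5pt solver.

```python
import numpy as np


def mean_point_correspondence(pts1, pts2):
    """Virtual correspondence from three matched points in two views.

    pts1, pts2: arrays of shape (3, 2) holding the same three matches in the
    first and the second view. Returns the mean point in each view.
    """
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    if pts1.shape != (3, 2) or pts2.shape != (3, 2):
        raise ValueError("expected three 2D points per view")
    return pts1.mean(axis=0), pts2.mean(axis=0)
```

Only point locations enter the computation, so no image appearance or feature descriptors are required.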
4p3v(M) solver: While the mean point correspondence used in the 4p3v(M) solver can provide a good approximation of a correct correspondence, as mentioned, it can also be noisy. In the 4p3v(M) solver, we thus, in addition to the mean point of three points in the second image, generate two additional points. These points are (1) if the longest dimension of the triangle is in the x-direction or (2) if it is in the y-direction. All three points, i.e., , , and are placed in correspondence with the mean point . The 4p3v(M) solver in the first step calls the 5pt solver [34] three times, with the correspondence being either , , or . The results of these three 5pt solvers are collected to create hypotheses for the relative pose of the first two cameras inside RANSAC. The shift is selected relative to the size of the triangle .
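The shifted candidates can be sketched as below. Two details are assumptions of this sketch rather than statements of the paper: the "longest dimension" is measured on the axis-aligned bounding box of the triangle, and the shift is a fixed fraction `rel_shift` of that extent (the value 0.1 is a placeholder, not the value selected in the experiments).

```python
import numpy as np


def shifted_virtual_points(tri2, rel_shift=0.1):
    """Candidate virtual points in the second view.

    tri2: (3, 2) array with the three triangle vertices in the second image.
    Returns a (3, 2) array: the mean point, and the mean point shifted by
    +delta and -delta along the longer axis of the triangle.
    """
    tri2 = np.asarray(tri2, dtype=float)
    mean = tri2.mean(axis=0)
    extent = tri2.max(axis=0) - tri2.min(axis=0)  # bounding-box size (x, y)
    axis = int(np.argmax(extent))                 # 0: x-direction, 1: y-direction
    delta = np.zeros(2)
    delta[axis] = rel_shift * extent[axis]        # shift relative to triangle size
    return np.stack([mean, mean + delta, mean - delta])
```

Each of the three returned points would be paired with the mean point in the first view, giving three 5pt solver calls per sample.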
point in the third view: The 4p3v(M) and 4p3v(M) solvers are actually solving the configuration , i.e., they are not using information from the point in the third view, i.e., point . The information from can be used in the solver for the configuration in two different ways:

Filtering (+F): can be used to filter out geometrically infeasible solutions returned by the 5pt+P3P solver that is used inside M-based solvers. The 5pt+P3P solver returns multiple solutions that can be evaluated w.r.t. . Since the returned solutions can be affected by larger noise in the mean-point correspondence, we do not simply select the solution with the smallest error on , but we keep all solutions that have an epipolar error on smaller than twice the threshold used inside RANSAC. Our experiments show that this filtering can improve the speed of the proposed solvers. However, as a trade-off, there is a small drop in the precision of the solvers since, in some cases, geometrically correct solutions are filtered.

Refinement (+R): can be used to refine the solutions returned by the 5pt+P3P solver used inside M-based solvers. These solutions have zero error on the original four points and the mean-point correspondence in the first two views, but can have a large error on . Rather, we want solutions that minimize the epipolar errors on the original four points in the three views. Note that is an overconstrained configuration and thus for noisy data, there is, in general, no solution with zero error on all four points in the three views. We perform refinement of the poses by minimizing the epipolar errors of all original four points in three views using Levenberg-Marquardt optimization (LM), initialized using the solutions from the M-point solvers. Experiments with different numbers of iterations show that two iterations are usually sufficient to obtain an improvement (see SM).
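The filtering rule can be sketched as follows. The exact error measure and the view pair on which it is evaluated are not specified in the text; this sketch assumes normalized image coordinates, candidate poses (R, t) of the third camera relative to the first, and the symmetric epipolar distance. All function names are hypothetical.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def symmetric_epipolar_distance(E, x1, x2):
    """Mean of the point-to-epipolar-line distances in both views."""
    x1h = np.append(np.asarray(x1, float), 1.0)
    x2h = np.append(np.asarray(x2, float), 1.0)
    l2 = E @ x1h      # epipolar line of x1 in the second view
    l1 = E.T @ x2h    # epipolar line of x2 in the first view
    r = abs(x2h @ l2)
    return 0.5 * (r / np.hypot(l2[0], l2[1]) + r / np.hypot(l1[0], l1[1]))


def filter_by_fourth_point(poses, x4_first, x4_third, ransac_threshold):
    """Keep every candidate whose error on the fourth point is below twice
    the RANSAC threshold, instead of keeping only the single best one."""
    kept = []
    for R, t in poses:
        E = skew(np.asarray(t, float)) @ np.asarray(R, float)
        if symmetric_epipolar_distance(E, x4_first, x4_third) < 2.0 * ransac_threshold:
            kept.append((R, t))
    return kept
```

Keeping all candidates under a loose bound, rather than the arg-min, reflects the point made above that noise in the mean-point correspondence can perturb an otherwise correct solution.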
4p3v(L) solvers: We also experiment with learning-based 4p3v(L) solvers, which, instead of using the mean point correspondence, use a neural network to predict the virtual correspondence. In the network, we want to directly use the information from all four points in three views. We use the fact that four triplets of points, in general, define a unique relative pose of three calibrated cameras. We train a network that, given such four corresponding points in three views and a fixed 5th point in the first view, predicts a corresponding 5th point in the second view. We fix the 5th point to the mean point as also defined in the M-based solvers. The network actually learns a shift of the mean point in the second view to be in correspondence with . We use a lightweight architecture with a backbone of shared MLP layers, similar to [6], and the Sampson error as a loss function. Details on the architecture, loss function, and training data are in the SM. 4p3v(L) solvers can be defined similarly to 4p3v(M) solvers (see SM).
Our experiments show that the proposed network can, in general, improve the precision of the mean-point correspondence , resulting in better performance than the baseline 4p3v(M). However, as shown in Sec. 4, after adding the refinement (+R), the difference between the 4p3v(M)+R and 4p3v(L)+R solvers is negligible. Still, the proposed network can be seen as a first step towards a method that can learn better virtual point correspondences.
3.2 Partially calibrated cameras
To show the potential of the proposed mean-point strategy, we applied this idea to the very challenging 4p3vf problem of estimating the relative pose of three cameras with an unknown shared focal length from four correspondences. Our novel solvers for the 4p3vf problem follow the approach of generating virtual correspondences used in our 4p3v solvers for calibrated cameras. The idea is to transform the extremely complex 4p3vf problem into the problem solved by the efficient 6pt+P3P solver. The 6pt+P3P solver first estimates the unknown focal length and the relative pose of the first two cameras using the efficient 6pt solver [56] and then registers the third camera using the P3P solver [39]. This means that, in contrast to the 4p3v solvers presented in Section 3.1 that generate only one virtual correspondence (+ potentially additional shifted versions of this correspondence), our novel 4p3vf solvers need to generate two virtual correspondences to obtain six correspondences in the first two views.
In the 4p3vf(M) solver, we generate the two new virtual correspondences using the mean points of two different triplets of corresponding points in the first two cameras. Similarly to the calibrated case, we also test 4p3vf(L) and (+R), (+F), and -based variants of 4p3vf solvers. More details on all 4p3vf solvers can be found in the SM.
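Extending four correspondences to the six needed by a 6pt solver can be sketched as below. Which two triplets are used is not specified in the text, so the default index choice here is an arbitrary assumption, as is the function name.

```python
import numpy as np


def two_virtual_correspondences(pts1, pts2, triplets=((0, 1, 2), (0, 1, 3))):
    """Append two mean-point virtual correspondences to four real ones.

    pts1, pts2: (4, 2) arrays with the four matches in the first two views.
    triplets: two index triplets (assumed choice) whose mean points are used.
    Returns two (6, 2) arrays: four real plus two virtual correspondences.
    """
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    virt1 = np.stack([pts1[list(t)].mean(axis=0) for t in triplets])
    virt2 = np.stack([pts2[list(t)].mean(axis=0) for t in triplets])
    return np.vstack([pts1, virt1]), np.vstack([pts2, virt2])
```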
4 Experiments
We extensively evaluated the proposed solvers on a large variety of synthetic and real data to test their robustness to noise and outliers and assess their performance inside state-of-the-art RANSAC frameworks [2, 26]. We compare our novel solvers with the homotopy continuation 4p3v(HC) solver [19] and the 5pt+P3P and 6pt+P3P baseline minimal solvers for the and configurations for calibrated cameras and for cameras with an unknown shared focal length, respectively. To obtain upper bounds for the precision that can be achieved by our proposed solvers, we also consider ‘Oracle’ versions (denoted (O)), where we use correct correspondence(s), i.e., correspondences that satisfy the epipolar constraint for the ground truth relative pose of the first two cameras, as the / virtual correspondence between these cameras. Without publicly available code for the 4p3v(N) solver [35], we tested only its ‘Oracle’ version. Instead of doing a one-dimensional search over the degree curve of possible epipoles, we use the ground truth epipole. Since this solver performs worse than our ‘Oracles’, we report it only in the SM.
Experimental setup. To obtain feature correspondences, we use SuperPoint [10] features with the LightGlue [31] matcher. We extract at most 2048 features per image. We perform matching for all three image pairs and keep only those matches that were consistently matched across all three views. We performed evaluation within two RANSAC frameworks: PoseLib [26] and GC-RANSAC [2]. For the 5pt solver, we use [34] and for the P3P solver, we use [39]. In PoseLib, we perform LO [8] using LM optimization. In GC-RANSAC we perform LO using non-minimal solvers for fitting models to larger-than-minimal samples. We tested different shifts for our -based solvers (for the ablation study, see SM) and selected .
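Keeping only matches that agree across all three image pairs can be sketched as a cycle-consistency check. The text only states that matches must be consistent across the three views; the dictionary representation and the function name below are assumptions.

```python
def consistent_triplets(m12, m23, m13):
    """Return feature-index triplets (i, j, k) that agree across all pairs.

    m12, m23, m13: dicts mapping a feature index in the first image of the
    pair to the matched feature index in the second image of the pair.
    """
    triplets = []
    for i, j in m12.items():
        k = m23.get(j)
        # Keep the track only if going 1->2->3 lands on the same feature
        # as the direct 1->3 match.
        if k is not None and m13.get(i) == k:
            triplets.append((i, j, k))
    return triplets
```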
Evaluation measures. Inspired by [20], we define the pose error as , where and are the angular errors of rotation and translation for pair in degrees [20]. We also report AUC values [20] at different thresholds for the pose error. We include results for an alternative pose error definition which includes and in the SM.
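The angular errors and an AUC over pose errors can be computed along the following lines. This is a sketch of one common convention (area under the cumulative error curve, normalized by the threshold) and may differ in detail from the protocol of [20]; how the per-pair errors are combined into the single pose error is not reproduced here. Function names are hypothetical.

```python
import numpy as np


def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation between estimate and ground truth."""
    cos = (np.trace(np.asarray(R_est).T @ np.asarray(R_gt)) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def translation_error_deg(t_est, t_gt):
    """Angle between translation directions (scale is unobservable)."""
    t_est, t_gt = np.asarray(t_est, float), np.asarray(t_gt, float)
    cos = t_est @ t_gt / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def pose_auc(errors, thresholds):
    """Area under the recall-vs-error curve up to each threshold, in [0, 1]."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        e = np.concatenate([errors[:last], [t]])
        r = np.concatenate([recall[:last], [recall[last - 1]]])
        area = np.sum(0.5 * (r[1:] + r[:-1]) * np.diff(e))  # trapezoid rule
        aucs.append(float(area / t))
    return aucs
```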
Mean-point strategy. The first experiments aim to support our idea of selecting a virtual point correspondence as the mean points of three corresponding points in two images.
Scene | AVG (°) | MED (°) | perc. (°) |
Brandenburg Gate | 23.1 20.9 / 19.5 22.9 | 18.0 / 12.5 | 8.9 / 4.7 |
Buckingham Palace | 25.7 22.3 / 22.2 23.4 | 19.4 / 14.8 | 9.0 / 5.7 |
Colosseum Exterior | 27.5 22.3 / 20.9 25.2 | 22.4 / 12.4 | 10.4 / 4.1 |
Grand Place Brussels | 25.3 22.0 / 21.7 23.5 | 19.5 / 15.1 | 9.4 / 5.9 |
Notre Dame Front Facade | 27.9 23.0 / 20.2 26.5 | 23.2 / 12.0 | 11.5 / 4.2 |
Palace of Westminster | 22.5 21.8 / 19.2 24.3 | 16.7 / 11.2 | 7.1 / 2.7 |
Pantheon Exterior | 28.8 21.2 / 24.5 21.9 | 24.3 / 19.0 | 13.0 / 7.9 |
Reichstag | 18.5 23.2 / 17.2 25.3 | 12.2 / 9.6 | 5.5 / 3.6 |
Sacre Coeur | 23.8 23.6 / 17.1 25.3 | 17.1 / 8.1 | 7.5 / 2.2 |
St Peters Square | 22.8 21.0 / 21.1 24.2 | 17.3 / 14.1 | 8.7 / 6.0 |
Taj Mahal | 18.1 24.7 / 15.7 23.7 | 10.5 / 7.8 | 4.4 / 2.4 |
Temple Nara Japan | 27.0 25.1 / 23.6 26.8 | 20.7 / 15.4 | 9.9 / 5.7 |
Trevi Fountain | 30.3 22.7 / 20.7 23.5 | 25.9 / 12.3 | 12.0 / 4.2 |
The accuracy of the mean point correspondence depends on a large number of variables, including the depths of the points w.r.t. the cameras, the angle under which the triangle formed by the three points is observed, the shape and size of the triangle, the type of motion, etc. A detailed analysis of all these factors, e.g., through synthetic experiments, is out of the scope of this paper. We thus study the accuracy of the mean point correspondences, and of the resulting estimated poses on real-world data. We only consider pairs instead of triplets as we create virtual correspondences in two views.
For the following experiments, we sample 100 four-tuples of point correspondences, obtained as SuperPoint+LightGlue matches consistent with the ground truth relative pose, i.e. inliers, for each image pair in scenes from the PhotoTourism dataset [20]. We use the first three correspondences to define the triangles in both images.
In our first experiment, we establish correspondences between the mean of the triangle in one image and various points in the triangle in the second image. We express points in the second triangle via their barycentric coordinates and uniformly sample barycentric coordinates , such that (ensuring points inside the triangle). The 3rd coordinate is given as . For each correspondence, we measure the symmetric epipolar error w.r.t. the ground truth pose, the translation and rotation errors, and the percentage of inliers consistent with the pose obtained with the 5pt solver applied on the virtual and the four real correspondences. Note that we are thus considering a 4-point-relative-pose problem.
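The sampling of candidate points inside the second triangle can be sketched as follows. The grid resolution `steps` and the function names are assumptions; the construction only relies on the standard fact that non-negative barycentric coordinates summing to one describe points inside (or on the boundary of) the triangle.

```python
import numpy as np


def barycentric_grid(steps=10):
    """Uniform grid of barycentric coordinates (a, b, c) with a + b + c = 1
    and all coordinates non-negative."""
    coords = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            # Integer bookkeeping avoids small negative values from rounding.
            coords.append((i / steps, j / steps, (steps - i - j) / steps))
    return np.array(coords)


def barycentric_to_point(tri, bary):
    """Map barycentric coordinates to a 2D point; tri has shape (3, 2)."""
    return np.asarray(bary, dtype=float) @ np.asarray(tri, dtype=float)
```

The coordinates (1/3, 1/3, 1/3) map to the mean point of the triangle, which is the point used by the M-based solvers.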
Fig. 2 shows the results of this experiment on scene Sacre Coeur. It can be seen that the optimum of studied metrics is reached around the mean point of the triangles. To suppress the effect of discrete sampling, for each metric, we fit a 2D Gaussian distribution and report the mean value (in barycentric coordinates) as numbers in brackets in the caption of the figure. The mean values of the 2D Gaussians are very close to the mean point of the triangles, which has barycentric coordinates . This clearly validates our approach of using mean point correspondences. The results for more real scenes and synthetic scenes are in the SM. For all these scenes, we observed a similar behavior.
In our second experiment, we establish the virtual correspondence between the mean points of the triangles. We compare the accuracy of the relative poses obtained by applying the 5pt solver on one virtual and four real correspondences (denoted as the 4pt(M) solver) with the accuracy obtained by the 5pt solver on five real correspondences. Tab. 1 shows the results of this comparison. As can be seen, the 4pt(M) solver is not as accurate as the 5pt solver, which is not surprising given that the virtual correspondence is inherently noisier than the 5th real correspondence used by the 5pt solver. While there is a large gap on some scenes (Colosseum, Notre Dame, Trevi), the gap is noticeably smaller on others (Reichstag, St. Peters), showing that the performance of our solver is scene-dependent. Overall, the gap is not too large, showing that the idea of using a virtual mean-point correspondence is viable. Further, note that the solvers are applied outside of RANSAC and that local optimization inside RANSAC usually compensates for less accurate pose estimates.
Noise experiments. We next test the accuracy of our solvers and the state-of-the-art algorithms w.r.t. increasing image noise. We establish correspondences by projecting 3D points into the images and then add increasing amounts of normally distributed noise to the projections. Due to the approximate nature of our virtual correspondences, our novel solvers return non-zero errors for zero noise. However, at noise levels , these solvers return comparable or even better results than the 5pt+P3P/6pt+P3P solvers. This again shows that our predicted virtual correspondences are good approximations of real correspondences. The recent state-of-the-art HC solver [19] is failing in about 50% of the instances for noiseless data. Its median errors are thus significantly larger than those of the other solvers. See the SM for detailed results of the experiment.
Experiments on real data. We test the solvers on all scenes from the PhotoTourism dataset [51, 20] which provide ground truth poses and intrinsics via a COLMAP [49] reconstruction. In the results, we do not include the St. Peter’s Square scene that we used for the validation of and the number of refinement iterations (see SM). We also include results for the Cambridge Landmarks dataset [21] (except the Street scene, which is commonly not used due to issues with its ground truth). For PhotoTourism, we use the images in their original resolution. For Cambridge Landmarks, we resize the images so that the larger side is 800 px. For each scene, we sample 5,000 random image triplets with at least 10 matches and with at least overlap [20].
Phototourism [20] | ||||||
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (ms) |
4p3v(HC) [19] | 7.17 | 2.34 | 52.74 | 66.63 | 77.86 | 76.45 |
5pt+P3P | 5.99 | 2.00 | 57.31 | 70.54 | 80.81 | 105.50 |
4p3v(M) | 7.17 | 2.49 | 50.96 | 65.46 | 77.32 | 76.77 |
4p3v(M)+R | 6.39 | 2.00 | 56.92 | 69.90 | 80.17 | 74.94 |
4p3v(M)+R+F | 6.59 | 2.07 | 55.90 | 69.06 | 79.56 | 30.54 |
4p3v(M) | 6.39 | 2.19 | 54.70 | 68.58 | 79.59 | 172.06 |
4p3v(M)+R | 5.97 | 1.89 | 58.65 | 71.43 | 81.35 | 189.21 |
4p3v(M)+R+F | 6.15 | 1.97 | 57.42 | 70.47 | 80.69 | 75.78 |
4p3v(L) | 7.12 | 2.50 | 51.00 | 65.57 | 77.42 | 376.31 |
4p3v(L)+R | 6.35 | 2.00 | 56.88 | 69.88 | 80.15 | 297.88 |
4p3v(O) | 5.82 | 1.90 | 58.91 | 71.84 | 81.70 | 58.22 |
4p3v(O)+R | 5.73 | 1.80 | 60.23 | 72.75 | 82.21 | 86.30 |
4p3v(O)+R+F | 5.72 | 1.82 | 59.97 | 72.58 | 82.13 | 36.36 |
Cambridge Landmarks [21] | ||||||
4p3v(HC) [19] | 9.69 | 3.31 | 40.96 | 58.84 | 72.83 | 64.49 |
5pt+P3P | 8.16 | 3.05 | 43.79 | 61.61 | 75.30 | 48.53 |
4p3v(M) | 9.61 | 3.42 | 39.71 | 58.08 | 72.54 | 38.33 |
4p3v(M)+R | 8.77 | 3.11 | 42.98 | 60.90 | 74.59 | 34.84 |
4p3v(M)+R+F | 9.03 | 3.17 | 42.31 | 60.18 | 74.03 | 16.51 |
4p3v(M) | 8.75 | 3.21 | 42.11 | 60.37 | 74.38 | 81.21 |
4p3v(M)+R | 8.32 | 3.01 | 44.17 | 62.04 | 75.60 | 84.98 |
4p3v(M)+R+F | 8.47 | 3.08 | 43.21 | 61.22 | 75.03 | 38.45 |
4p3v(L) | 9.58 | 3.44 | 39.79 | 58.17 | 72.62 | 232.13 |
4p3v(L)+R | 8.75 | 3.09 | 42.94 | 60.86 | 74.60 | 177.06 |
4p3v(O) | 8.62 | 3.07 | 43.58 | 61.65 | 75.30 | 30.43 |
4p3v(O)+R | 8.39 | 2.94 | 44.80 | 62.54 | 75.93 | 34.68 |
4p3v(O)+R+F | 8.36 | 2.97 | 44.64 | 62.38 | 75.82 | 16.41 |
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (ms) |
6pt+P3P | 11.16 | 3.53 | 38.44 | 56.79 | 70.89 | 11.72 |
4p3vf(M) | 18.85 | 4.20 | 33.77 | 49.88 | 62.64 | 12.67 |
4p3vf(M)+R | 19.53 | 4.30 | 33.23 | 49.08 | 61.71 | 14.74 |
4p3vf(M)+R+F | 19.87 | 4.38 | 32.83 | 48.62 | 61.28 | 12.85 |
4p3vf(M) | 18.83 | 4.29 | 33.18 | 49.26 | 62.14 | 20.23 |
4p3vf(M)+R | 18.49 | 4.24 | 33.56 | 49.77 | 62.70 | 21.76 |
4p3vf(M)+R+F | 19.09 | 4.33 | 33.04 | 49.07 | 61.91 | 19.98 |
4p3vf(L) | 20.50 | 4.47 | 32.49 | 47.91 | 60.27 | 138.26 |
4p3vf(L)+R | 20.82 | 4.54 | 32.10 | 47.47 | 59.83 | 141.27 |
4p3vf(Linit)+R | 14.94 | 3.79 | 36.21 | 53.56 | 66.98 | 259.63 |
4p3vf(O) | 11.72 | 3.50 | 38.63 | 56.91 | 70.86 | 12.31 |
4p3vf(O)+R | 11.77 | 3.51 | 38.69 | 56.94 | 70.89 | 14.06 |
Tab. 2 shows results for calibrated cameras when using early termination in PoseLib RANSAC at a confidence threshold. We provide similar results for GC-RANSAC in the SM. As can be seen, with refinement (+R), all of our solvers outperform the state-of-the-art HC-based 4p3v(HC) solver [19] in terms of accuracy. Using filtering (+F) improves the run-time of RANSAC at the cost of a decrease in pose accuracy. Still, our 4p3v(M)+R+F solver outperforms 4p3v(HC) in terms of both accuracy and run-time. The 4p3v(M) and 4p3v(M)+R solvers clearly improve upon the 4p3v(M) solvers, albeit at an increased run-time without filtering. With filtering, 4p3v(M)+R+F is similarly efficient as 4p3v(M)+R at a (slightly) higher accuracy. The 4p3v(L) solvers slightly improve upon the 4p3v(M) solvers. While they produce more accurate virtual correspondences, refinement in the solvers (+R) compensates for the less accurate initial poses provided by the 4p3v(M) solvers, resulting in similar results for 4p3v(M)+R and 4p3v(L)+R. Still, the results for the 4p3v(L) solvers show a direction for improvement, namely learning to predict virtual correspondences. The results for the 4p3v(O) solvers show that there is room for improvement. The results for more variants of L-based solvers, (, +F, etc.) are in the SM.
The method of [19] and our solvers address a different configuration for the three-view-pose estimation problem than the 5pt+P3P solver ( vs. ). Based on limited experiments, [35] concluded that 5pt+P3P outperforms their solver. [19] did not compare to 5pt+P3P. Tab. 2 rectifies this omission, showing that 4p3v(HC) is clearly less accurate than 5pt+P3P, while not being consistently faster. In contrast, our 4p3v(M)+R, 4p3v(M)+R+F, and 4p3v(M)+R+F perform comparably to 5pt+P3P at lower run-times. To the best of our knowledge, we are the first to show that solvers for the configuration are practically relevant.
We also investigate the speed-accuracy trade-off of the solvers by running PoseLib RANSAC for a set of fixed numbers of iterations. Runtimes are reported for 1 core of a 2 GHz Intel Xeon Gold 6338 CPU. As shown in Fig. 3, our 4p3v(M)+R+F provides a better speed-accuracy trade-off than 5pt+P3P on Phototourism and a slightly worse trade-off on Cambridge Landmarks. On both datasets 4p3v(M)+R+F performs better than 4p3v(HC). Fig. 3(c) shows results for the Reichstag scene, where also other variants of our method provide a better trade-off than 5pt+P3P. Fig. 3(d) shows results for King’s College, where 4p3v(M)+R+F outperforms 5pt+P3P. These results show the practical viability of our solvers in a time-constrained setting. Results for more scenes are in the SM.
Fig. 4 shows the potential of our solvers to handle scenarios with low inlier ratios. We synthetically remove outlier matches based on ground truth pose information and replace them with outliers distributed uniformly at random such that the inlier ratio is fixed to for all image triplets. Under the lower inlier ratio, our methods perform better compared to 5pt+P3P, while 5pt+P3P performs very similarly to 4p3v(M)+R+F for the higher inlier ratio scenario in the presented Brandenburg Gate scene. More scenes are in SM.
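The advantage of smaller samples at low inlier ratios follows from the standard RANSAC iteration bound. The sketch below is a simplified illustration under stated assumptions: it treats all correspondences as sharing a single inlier ratio, compares generic sample sizes of 4 and 5, and uses a hypothetical confidence of 0.99. It does not model the actual sampling scheme of any of the evaluated solvers.

```python
import math

def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Number of random samples needed so that, with the given confidence,
    at least one sample consists of inliers only (standard RANSAC bound)."""
    p_good = inlier_ratio ** sample_size  # probability of an all-inlier sample
    if p_good <= 0.0:
        return math.inf
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# Compare a 4-point and a 5-point sample for a few inlier ratios.
for ratio in (0.8, 0.5, 0.25):
    print(ratio, ransac_iterations(ratio, 4), ransac_iterations(ratio, 5))
```

The gap between the two sample sizes widens as the inlier ratio drops, which is consistent with the trend reported for the low-inlier-ratio scenario.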
Shared unknown focal length case. Tab. 3 shows results for the three-view-relative-pose problem for cameras with a shared unknown focal length. It compares our solvers (solving the configuration) with the 6pt+P3P solver (solving the configuration). While our 4p3vf solvers do not reach the accuracy of the 6pt+P3P solver, the results still show that our approach based on generating two virtual correspondences leads to practically useful solvers. To the best of our knowledge, ours are the first practical solvers for the configuration.
Limitations. As shown above, the accuracy of our solvers is scene dependent (this weakness also applies to [19], since the scene needs to be similar enough to the training scenes for the MLP-based classifier to work well). In addition, we observed that the performance of our approaches drops when the overlap between the images in a triplet is small. In this case, the correspondences form small triangles. This leads to unstable configurations as the distance between the points is relatively small compared to the noise in the points (especially in the virtual points). While correspondences found in a small image region cause the same issues for the 5pt solver, 5pt+P3P is more robust in scenarios with small overlap, as its correspondence has a chance to be farther away from the other correspondences than the virtual correspondences used in our solvers. Properly taking the uncertainty of the virtual points into account, i.e., propagating their uncertainty into the uncertainty of the estimated poses and using this uncertainty during inlier counting [16], is a promising direction to handle the fact that our solvers produce less accurate poses. Creating more accurate correspondences, e.g., by training better networks, is an alternative. At the moment, we use the first three correspondences to create the virtual correspondence. More sophisticated selection criteria, e.g., trying to maximize the size of the formed triangles, could also improve performance.
5 Conclusion
In this paper, we consider the highly challenging problems of relative pose estimation of three calibrated and partially calibrated cameras from four correspondences. We propose a novel and easy-to-implement framework that solves these problems using existing efficient solvers by simply predicting a / point correspondence(s). We propose several solvers based on this framework, one simply using mean coordinates of input points (M-based solvers) and one using a trained predictor (L-based solvers), with multiple variants. Extensive experiments show that our solvers achieve state-of-the-art performance on real data for the challenging configuration of four points in three views.
6 Acknowledgements
C. T. was supported by the Czech Science Foundation (GAČR) JUNIOR STAR Grant (No. 22-23183M). V. K. was supported by the project no. 1/0373/23. and the TERAIS project, a Horizon-Widera-2021 program of the European Union under the Grant agreement number 101079338. (Part of the) Research results was obtained using the computational resources procured in the national project National competence centre for high performance computing (project code: 311070AKF2) funded by European Regional Development Fund, EU Structural Funds Informatization of society, Operational Program Integrated Infrastructure. D. B. was supported by the ETH postdoc fellowship. Z. B. H. was supported by the grant KEGA 004UK-4/2024 “DICH: Digitalization of Cultural Heritage”. T. S. was supported by the EU Horizon 2020 project RICAIP (grant agreement No. 857306) and the European Regional Development Fund under project IMPACT (No. CZ.02.1.01/0.0/0.0/15_003/0000468). Z. K. was supported by the Czech Science Foundation (GAČR) JUNIOR STAR Grant (No. 22-23183M).
Supplementary Material
This supplementary material provides additional details and experimental results promised in the main paper: Sec. 7 discusses M-solvers, provides the proof on the bound of the epipolar error mentioned in the main paper, and experiments supporting the choice of mean point correspondences (see Sec. 3.1 in the main paper). Sec. 8 provides details on L-based solvers, namely information about the used network architecture and our training process, as well as about the -variants of the L-based solvers, including 4p3vf(Linit)+R, used in the paper (see Sec. 3.1 and Tab. 3 of the main paper). Sec. 9 provides details on the experiments mentioned in the main paper, namely experiments with different oracle solvers (see Sec. 4 of the main paper), namely ablations for the choice of the number of refinement iterations (Sec. 3.1 of the main paper) and the choice of (see Sec. 4 in the main paper), experiments with an additional evaluation measure (see Sec. 4 of the main paper), noise experiments (see Sec. 4 of the main paper), results with GC-RANSAC (see Sec. 4 of the main paper), and detailed plots over multiple scenes (see Sec. 4 of the main paper).
7 Using Mean Point Correspondences
Proof of the bounds on the epipolar error. While the mean point correspondence used in the M-based solvers can provide a good approximation of a correct correspondence, such a correspondence can be noisy. Note that all state-of-the-art 4p3v solvers (including 4p3v(HC) [19] and 4p3v(N) [35]) rely on certain approximations without establishing theoretical proofs to quantify their accuracy. In contrast, the error of our virtual correspondence is bounded: As mentioned in the main paper, it can be proven that the error of the virtual correspondence is bounded by the maximum distance of the mean point from the vertices of the triangle . Here we provide a simple proof.
Lemma 1.
Let us assume two cameras with camera centers and that observe 3D points and (see Figure 5 for an illustration). Let and be the projections of these 3D points in camera 1 and camera 2, respectively. Let be the mean point of the points and let be the essential matrix between these two cameras, i.e., a matrix that satisfies . Then the epipolar line passes through the triangle .
Proof.
The camera center and the 3D points and form a tetrahedron (see Figure 5). The projections in the first camera lie at the edges of this tetrahedron . The ray from the camera center through the mean point thus lies inside the tetrahedron and intersects the plane defined by 3D points and in a point that lies inside the triangle defined by .
The camera center and the 3D points and form a tetrahedron . Again, the projections lie at the edges of the tetrahedron . The ray passing through the camera center and the 3D point lies inside the tetrahedron and thus intersects the image plane of the second camera at a point that lies inside the triangle defined by the points . By construction, the projection of into the second camera lies on the epipolar line . Therefore, the epipolar line which is a line connecting this point and the epipole , passes through the triangle . ∎
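It follows from Lemma 1 that, since the epipolar line passes through the triangle, the distance of the mean point in the second view to the epipolar line is at most the maximum distance of that mean point to the vertices of the triangle. The reason is that the epipolar line contains a point of the triangle, and no point of a triangle is farther from an interior point than the farthest vertex.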
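The bound can be checked numerically. The following sketch is an illustration under explicit assumptions, not part of the original proof: it uses normalized image coordinates, places the first camera at the origin, draws a moderate random relative pose so that all three 3D points stay in front of both cameras, and uses the convention E = [t]_x R for a second camera with X2 = R X1 + t. All function names are hypothetical.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rotation_from_axis_angle(axis, angle):
    """Rodrigues' formula for a rotation about `axis` by `angle` radians."""
    K = skew(axis / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def check_bound(rng):
    """Return (distance of mean point to epipolar line, max vertex distance)."""
    # Three 3D points in front of camera 1 (identity pose).
    X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 6.0], size=(3, 3))
    # Moderate relative motion keeps the points in front of camera 2 as well.
    R = rotation_from_axis_angle(rng.normal(size=3), rng.uniform(-0.3, 0.3))
    t = rng.uniform(-1.0, 1.0, size=3)
    X2 = X @ R.T + t
    x1 = X[:, :2] / X[:, 2:3]      # projections in view 1
    x2 = X2[:, :2] / X2[:, 2:3]    # projections in view 2
    m1 = np.append(x1.mean(axis=0), 1.0)  # homogeneous mean point, view 1
    m2 = np.append(x2.mean(axis=0), 1.0)  # homogeneous mean point, view 2
    line = skew(t) @ R @ m1               # epipolar line of m1 in view 2
    dist_to_line = abs(line @ m2) / np.hypot(line[0], line[1])
    max_vertex_dist = np.linalg.norm(x2 - m2[:2], axis=1).max()
    return dist_to_line, max_vertex_dist

rng = np.random.default_rng(0)
results = [check_bound(rng) for _ in range(1000)]
print(all(d <= b + 1e-12 for d, b in results))
```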
Experiments validating the use of mean point correspondences. The error of the relative poses estimated with virtual correspondences depends on many aspects, e.g., the baseline and the view angles of the cameras w.r.t. the three points used to compute the mean points, the depth of these points, the size and shape of the triangles defined by the three points, the type of motion of the cameras, the level of noise in the correspondences, etc. Isolating the impact of the individual aspects, e.g., through experiments on synthetic data, is highly non-trivial (e.g., how to generate realistic synthetic scenarios that allow conclusions to generalize to real-world scenarios) and analysing the co-dependencies between different aspects on the overall performance seems to need a paper on its own.
In the main paper, we thus presented results on real-world scenes, without trying to isolate individual factors (see Figure 2 and Table 1 in the main paper). Fig. 2 in the main paper showed results obtained by establishing correspondences between the mean of the triangle in one image and various points in the triangle in the second image. We expressed points in the second triangle via their barycentric coordinates and uniformly sample barycentric coordinates , such that (ensuring points inside the triangle). The 3rd coordinate is given as . For each correspondence, we measured the symmetric epipolar error w.r.t. the ground truth pose, translation and rotation errors, and the percentage of inliers consistent with the pose obtained with the 5pt solver applied on the virtual and the four real correspondences (denoted as the 4pt(M) solver). Fig. 2 in the main paper showed results for the Sacre Coeur scene from the PhotoTourism dataset [20]. Here, Fig. 6 shows the same statistics for two more scenes from the PhotoTourism dataset, St. Peter’s Square (top row) and Temple Nara Japan (bottom row). As with Fig. 2 in the main paper, for each metric, we fit a 2D Gaussian distribution and report the mean value (in barycentric coordinates) as numbers in brackets in the caption of the figure. As can be seen, the same conclusion can be drawn from Fig. 6 as from Fig. 2 in the main paper: The optima of the studied metrics are reached very close to the mean point of the triangles, which has barycentric coordinates .
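The sampling of candidate points inside the second triangle can be sketched as follows. This is an illustrative sketch only: the regular grid and its step size are assumptions (the sampling density used in the experiments is not stated here), and the function names are hypothetical.

```python
import numpy as np

def barycentric_grid(step=0.1):
    """Regular grid of barycentric coordinates (a, b, c) with a, b, c >= 0
    and a + b + c = 1, i.e., points inside or on the border of a triangle."""
    n = int(round(1.0 / step))
    coords = []
    for i in range(n + 1):
        for j in range(n + 1 - i):
            # The third coordinate is determined by the first two.
            coords.append((i / n, j / n, (n - i - j) / n))
    return np.array(coords)

def points_in_triangle(vertices, bary):
    """Map (N, 3) barycentric coordinates to (N, 2) image points for a
    triangle given by a (3, 2) array of vertices."""
    return np.asarray(bary, dtype=float) @ np.asarray(vertices, dtype=float)
```

The mean point of the triangle corresponds to all three barycentric coordinates being equal, so `points_in_triangle(v, [[1/3, 1/3, 1/3]])` returns the vertex mean.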
In addition to the experiments on scenes from the PhotoTourism dataset, Fig. 7 shows results for the symmetric epipolar error on (left) synthetic data and (right) the Shop Facade scene from the Cambridge Landmarks dataset [21]. For the synthetic experiment, we generated 10k scenes with known ground truth parameters. In each scene, the three 3D points were randomly distributed within a cube of size . Each 3D point was projected into two cameras. The orientations and positions of the cameras were selected at random such that they looked towards the origin from a random distance, varying from to , from the scene.
Fig. 7 shows the average symmetric epipolar error as a function of the barycentric coordinates of the point in the second image. The 2D Gaussian distribution fitted to the results on synthetic scenes has mean . The distribution of errors for the Shop Facade scene is very similar to that for the synthetic data, with the minimum at . In both cases, the means of the fitted Gaussian distributions are very close to the mean of the triangle (which has barycentric coordinates ). All the above-mentioned experiments validate our approach of using mean point correspondences.
AVG (∘) | MED (∘) | AUC@5∘ | @10∘ | @20∘ | Time (s) | |
5pt | 5.04 | 0.89 | 63.71 | 74.45 | 83.11 | 0.04 |
4pt(M) | 5.53 | 0.93 | 61.48 | 72.19 | 81.30 | 0.03 |
4pt(M) | 5.21 | 0.90 | 61.80 | 72.33 | 81.27 | 0.02 |
4pt(O) | 4.40 | 0.81 | 65.30 | 75.88 | 84.43 | 0.03 |
Tab. 1 in the main paper compares the relative pose accuracy achieved by the 4pt(M) solver with the accuracy of the classical 5pt relative pose solver. As can be seen, the 4pt(M) solver is not as accurate as the 5pt solver, as the 5th correspondence used by the 4pt(M) solver (the mean point correspondence) is significantly noisier than the one used by the 5pt solver. Tab. 4 shows pose accuracy results obtained by running the 4pt(M) (and its variant 4pt(M)) and the 5pt solvers inside GC-RANSAC on a total of 9,900 image pairs from the Sacre Coeur and St. Peter’s Square datasets. While the individual poses returned by both 4-point solvers are less accurate, RANSAC (and in particular local optimization inside RANSAC) can compensate for this, leading to comparable performance. This experiment not only validates the mean-point strategy, but also shows that virtual correspondences can be used to solve minimal problems from sub-minimal samples.
Tab. 4 also shows results for the oracle variant 4pt(O) of our solvers. As can be seen, there is still some space for improvement when predicting the 5th virtual correspondence, e.g., using a learning-based method. While this can also bring an improvement for two-views, such a learning-based method has a higher potential for improvement for the 4p3v problem, where the information from four points in three views fixes the pose of the input cameras that observe these points.
The method based on virtual correspondences can, in theory, be applied to any camera geometry problem. However, we see larger promise in relative pose problems, where it is sufficient to find one 2D point that is sufficiently close to the epipolar line. For absolute pose solvers, a virtual correspondence would need to be close to a 2D point instead of an epipolar line.
8 L-based Solvers
Network architecture and training details. As described in Sec. 3.1 of the main paper, our 4p3v(L)/4p3vf(L) solvers rely on a neural network to predict a virtual correspondence. The following provides details on the network architecture and the training process.
We use a lightweight architecture with a backbone of shared MLP layers, similar to [6], so that each triplet of correspondences is processed independently. The input to our network is a matrix of 4 point correspondences where the row contains the and coordinates of the point in three views, i.e., the correspondence. In estimating the epipolar geometry, the order of the point correspondences does not matter. Thus, our network is designed to be permutation invariant on that input axis. The input is normalized as follows: We apply a rotation matrix to the points in each view independently, so that the mean point of the first three points is sent to , and the fourth point is sent to . Let , and be the mean points of the three corresponding points in three views. We aim to predict the corresponding point of in the second view. Let us denote this predicted point by , i.e., our virtual correspondence in the first two views will be . As suggested by the 4p3v(M) solver, should be close to . Thus, we use the mean point as the initialization of our network and predict a shift from .
We use a simple MLP-based backbone, similar to [6, 42]. More precisely, our input consists of four 6-vectors, of the and coordinates of four point correspondences in three views. The first part of the model is a shared MLP 3-block of dimensions 6, 32, and 32, exporting a 32-dimensional feature for every correspondence. Then we apply a channel-wise max pooling aggregation, whose result is concatenated to the end of each 32-dimensional feature. This results in a 64-dimensional vector for each correspondence. These vectors are passed into another shared MLP 3-block of 64, 64, and 64 dimensions. We aggregate the resulting vectors via a max pooling function to get a 64-dimensional vector encoding, which eventually passes through an MLP 3-block to reduce the dimension gradually from 64, to 32, and finally to 2, which will be the prediction of the and coordinates in the second camera. In all MLPs, in the first 2 layers of the blocks, we use a Leaky ReLU activation function [58] with slope 0.01. As for the last layer of the MLPs, in the first two blocks, we use a ReLU activation function, while in the last MLP we use a activation, since we want the output to be in the range . For a visualization of the architecture, see Figure 8.
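To make the tensor shapes concrete, the following NumPy-only forward pass mirrors the described data flow with random weights. It is a sketch under several assumptions and not the trained model: each 3-block is read as three linear layers whose output widths are the listed dimensions, the final activation is not recoverable from the text and is replaced by a hypothetical `tanh` stand-in, and all function names are illustrative. The original network is implemented in PyTorch.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def relu(x):
    return np.maximum(x, 0.0)

def init_block(rng, dims):
    """Random weights for a 3-layer block; dims = (input, out1, out2, out3)."""
    return [(rng.normal(scale=0.1, size=(dims[i], dims[i + 1])),
             np.zeros(dims[i + 1])) for i in range(3)]

def run_block(x, params, last_activation):
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        # Leaky ReLU after the first two layers, block-specific activation last.
        x = leaky_relu(x) if k < 2 else last_activation(x)
    return x

def init_network(rng):
    # Layer widths follow one possible reading of the text (an assumption).
    return {"block1": init_block(rng, (6, 6, 32, 32)),
            "block2": init_block(rng, (64, 64, 64, 64)),
            "head": init_block(rng, (64, 64, 32, 2))}

def predict_shift(corrs, net):
    """corrs: (4, 6) array, one row per correspondence in three views."""
    f = run_block(corrs, net["block1"], relu)             # (4, 32), shared MLP
    pooled = f.max(axis=0, keepdims=True)                 # (1, 32), channel-wise max
    f = np.concatenate([f, np.repeat(pooled, len(f), axis=0)], axis=1)  # (4, 64)
    f = run_block(f, net["block2"], relu)                 # (4, 64), shared MLP
    g = f.max(axis=0)                                     # (64,), global encoding
    # tanh is a hypothetical stand-in for the final activation.
    return run_block(g, net["head"], np.tanh)             # (2,), predicted shift
```

Because every per-correspondence layer is shared and aggregation uses max pooling, reordering the four input rows leaves the output unchanged.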
We used a simple network architecture to show the promise of 4p3v(L)/4p3vf(L) solvers, namely that learning can produce more accurate point correspondences. The experiments shown in the paper verify this behavior. We believe that more advanced network architectures (e.g., using a set transformer architecture [29]) have the potential to improve the results even more and reduce the gap between the 4p3v/4p3vf solvers and the oracle 4p3v(O)/4p3vf(O) solvers.
Our loss function is the Sampson error of the prediction to the epipolar line of in the view:
(2) |
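For reference, the textbook form of the (squared) Sampson error of a point pair with respect to an essential matrix can be written as below. This is the standard first-order approximation of the geometric error, given as a sketch: the symbols and the exact form used in Eq. (2) are not recoverable here, so the argument names `E`, `x1`, and `x2` are assumptions.

```python
import numpy as np

def sampson_error(E, x1, x2):
    """Squared Sampson error of the pair (x1, x2) w.r.t. E.

    E:  (3, 3) essential matrix with x2_h^T E x1_h = 0 for exact matches.
    x1: (2,) point in the first view, x2: (2,) point in the second view,
        both in normalized image coordinates.
    """
    x1h = np.append(np.asarray(x1, dtype=float), 1.0)
    x2h = np.append(np.asarray(x2, dtype=float), 1.0)
    Ex1 = E @ x1h        # epipolar line of x1 in the second view
    Etx2 = E.T @ x2h     # epipolar line of x2 in the first view
    numerator = (x2h @ Ex1) ** 2
    denominator = Ex1[0] ** 2 + Ex1[1] ** 2 + Etx2[0] ** 2 + Etx2[1] ** 2
    return numerator / denominator
```

In a training setup of the kind described, `x2` would be the predicted virtual point and `E` the ground-truth essential matrix between the first two views.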
For both training and validation, we use synthetic data. Our synthetic dataset contains 1M input instances. 70% of the dataset is used for training while the remaining 30% is used for validating the performance of the network. We generate 10K 3D points inside a 10 × 10 × 10 cube, and to generate each instance, we pick 4 random points and project them to 3 cameras with random rotations and translations that view the cube from a random distance between 20 and 50 units.
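A data generator along these lines could look as follows. This is a minimal sketch and not the authors' generator: it assumes a cube of side length 10 centred at the origin, identity intrinsics (normalized image coordinates), and cameras whose optical axis points at the origin with a random roll. The function names are hypothetical.

```python
import numpy as np

def look_at_camera(rng, d_min=20.0, d_max=50.0):
    """Random camera (R, t) looking at the origin from a distance in
    [d_min, d_max]; maps world points X to camera points R @ X + t."""
    direction = rng.normal(size=3)
    center = direction / np.linalg.norm(direction) * rng.uniform(d_min, d_max)
    z = -center / np.linalg.norm(center)       # optical axis towards the origin
    x = np.cross(rng.normal(size=3), z)        # random roll around the axis
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                    # rows are the camera axes
    return R, -R @ center

def make_instance(rng, points):
    """Pick 4 random 3D points and project them into 3 random cameras.
    Returns a (4, 6) array: x and y coordinates in views 1, 2, and 3."""
    X = points[rng.choice(len(points), size=4, replace=False)]
    views = []
    for _ in range(3):
        R, t = look_at_camera(rng)
        Xc = X @ R.T + t
        views.append(Xc[:, :2] / Xc[:, 2:3])   # perspective division
    return np.concatenate(views, axis=1)

rng = np.random.default_rng(0)
points = rng.uniform(-5.0, 5.0, size=(10_000, 3))  # 10K points in the cube
instance = make_instance(rng, points)
print(instance.shape)
```

With these assumptions, every point has positive depth in every camera, since the cameras are at least 20 units from the origin while the cube extends less than 9 units from it.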
The network is implemented in PyTorch [37], and we use the Adam optimizer [23] for the training. We train it in batches of 1024 input instances, with a fixed learning rate of 1e-5. In our experiments, the network converges in about 30 epochs.
4p3v(L), 4p3vf(L), and 4p3vf(Linit) solvers. Similar to the 4p3v(M) solver, we try to compensate for potential noise in the prediction returned by the network by running the 5pt solver for the first two views three times for three different virtual correspondences. We test two variants: (1) In the 4p3v(L) and 4p3vf(L) solvers, we add a shift to the predicted point , similar to the 4p3v(M) solver. In this way, we generate two additional virtual correspondences. (2) In the 4p3v(Linit) solver, we add a shift to the initialization of the network. Thus, we run the network three times with three initializations, namely , , and . Each initialization affects the normalization of the points in the 2nd view, in the sense that a different point (the initialization) is sent to (0, 0). This leads to a different network input and thus to a different predicted virtual correspondence.
9 Experiments
This section provides more details on the experiments presented in Sec. 4 of the main paper. More precisely, Sec. 9.7 provides details and experiments on the oracle solvers discussed in the main paper. Sec. 9.1 provides ablation studies for the number of iterations of the refinement strategy (+R) and for the choice of . Sec. 9.2 provides details on the noise experiments summarized in the main paper. Sec. 9.3 provides experiments with an evaluation measure taking the relative pose error between the 2nd and 3rd camera in a triplet into account. Sec. 9.4 presents results on the PhotoTourism dataset obtained with GC-RANSAC. Sec. 9.5 presents additional experiments with adding outliers to image triplets (similar to the experiments presented in Fig. 4 in the main paper). Sec. 9.6 presents the additional details on some experiments promised in Sec. 4 of the main paper. Finally, Sec. 9.8 measures the run-times of the different solvers considered in this work.
9.1 Ablation studies
Estimator | Shift | AVG | MED | AUC@5 | @10 | @20 |
4p3v(M) | 0.2 | 6.33 | 3.89 | 36.88 | 57.46 | 74.74 |
4p3v(M) | 0.1 | 6.28 | 3.82 | 37.50 | 57.88 | 74.95 |
4p3v(M) | 0.09 | 6.29 | 3.77 | 37.98 | 58.27 | 75.18 |
4p3v(M) | 0.08 | 6.31 | 3.72 | 38.32 | 58.53 | 75.36 |
4p3v(M) | 0.07 | 6.21 | 3.64 | 39.29 | 59.16 | 75.69 |
4p3v(M) | 0.06 | 6.18 | 3.68 | 39.16 | 59.18 | 75.76 |
4p3v(M) | 0.05 | 6.07 | 3.65 | 39.47 | 59.37 | 75.93 |
4p3v(M) | 0.04 | 5.99 | 3.62 | 39.57 | 59.52 | 76.01 |
4p3v(M) | 0.03 | 6.04 | 3.64 | 39.44 | 59.36 | 75.92 |
4p3v(M) | 0.02 | 6.18 | 3.68 | 38.74 | 58.73 | 75.42 |
4p3v(M) | 0.01 | 6.30 | 3.81 | 37.89 | 57.94 | 74.90 |
4p3v(M) | 0.005 | 6.39 | 3.88 | 36.62 | 56.94 | 74.36 |
4p3v(M) | 0.001 | 6.65 | 4.03 | 35.57 | 55.83 | 73.50 |
4p3v(M)+R | 0.2 | 5.89 | 3.40 | 41.40 | 60.73 | 76.71 |
4p3v(M)+R | 0.1 | 5.80 | 3.32 | 42.48 | 61.72 | 77.28 |
4p3v(M)+R | 0.09 | 5.71 | 3.31 | 42.37 | 61.62 | 77.32 |
4p3v(M)+R | 0.08 | 5.68 | 3.29 | 42.86 | 61.95 | 77.42 |
4p3v(M)+R | 0.07 | 5.65 | 3.25 | 43.25 | 62.27 | 77.68 |
4p3v(M)+R | 0.06 | 5.57 | 3.28 | 43.12 | 62.07 | 77.56 |
4p3v(M)+R | 0.05 | 5.59 | 3.24 | 43.23 | 62.18 | 77.66 |
4p3v(M)+R | 0.04 | 5.64 | 3.19 | 43.60 | 62.43 | 77.66 |
4p3v(M)+R | 0.03 | 5.66 | 3.22 | 43.68 | 62.49 | 77.82 |
4p3v(M)+R | 0.02 | 5.65 | 3.21 | 43.24 | 62.11 | 77.56 |
4p3v(M)+R | 0.01 | 5.79 | 3.29 | 42.54 | 61.75 | 77.18 |
4p3v(M)+R | 0.005 | 5.90 | 3.38 | 42.03 | 61.24 | 76.91 |
4p3v(M)+R | 0.001 | 6.09 | 3.46 | 40.80 | 59.99 | 75.98 |
4p3v(M)+R+F | 0.2 | 6.02 | 3.60 | 39.44 | 59.19 | 75.79 |
4p3v(M)+R+F | 0.1 | 6.00 | 3.50 | 40.69 | 60.27 | 76.36 |
4p3v(M)+R+F | 0.09 | 5.92 | 3.43 | 40.83 | 60.40 | 76.51 |
4p3v(M)+R+F | 0.08 | 5.91 | 3.44 | 41.10 | 60.61 | 76.60 |
4p3v(M)+R+F | 0.07 | 5.93 | 3.39 | 41.58 | 60.80 | 76.72 |
4p3v(M)+R+F | 0.06 | 5.79 | 3.38 | 41.85 | 61.01 | 76.91 |
4p3v(M)+R+F | 0.05 | 5.77 | 3.32 | 42.08 | 61.07 | 76.95 |
4p3v(M)+R+F | 0.04 | 5.79 | 3.35 | 42.00 | 61.13 | 76.94 |
4p3v(M)+R+F | 0.03 | 5.73 | 3.33 | 42.46 | 61.50 | 77.22 |
4p3v(M)+R+F | 0.02 | 5.81 | 3.33 | 42.12 | 61.01 | 76.80 |
4p3v(M)+R+F | 0.01 | 5.86 | 3.41 | 41.12 | 60.56 | 76.56 |
4p3v(M)+R+F | 0.005 | 5.96 | 3.51 | 40.64 | 60.12 | 76.19 |
4p3v(M)+R+F | 0.001 | 6.20 | 3.53 | 39.95 | 59.08 | 75.52 |
Validation of . We tested our -based solvers for different values of and measured their performance. In general, there is no common value of the shift that leads to the best results on all datasets. This is expected since the precision of the mean-point correspondence depends on many different factors, e.g., the viewing angles of the cameras, the type of the motion, the depth and spatial distributions of the 3D points, etc. We set the value for and the total number of refinement iterations by evaluating their effects on the St. Peter’s Square scene from the PhotoTourism dataset [20], which we used for validation and excluded from the other PhotoTourism results in the paper. Tab. 5 shows how the different settings of the scale of the shift affect the accuracy of the -based solvers. Based on these experiments we use as it provides the best results for the 4p3v(M) solver and is also close to the optimal value for its variants. The optimal value of this parameter may be scene-dependent and could potentially be set by using learning-based approaches.
Inner refinement validation. We also perform validation of the total number of LM steps in the inner refinement (+R) shown in Fig. 9. We chose the value of 2 for other experiments as it provides the best speed-accuracy trade-off across a range of RANSAC iterations. However, we note that other settings may have very similar performance.
Phototourism [20] | ||||||
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (ms) |
4p3v(HC) [19] | 11.41 | 3.89 | 37.90 | 53.76 | 68.08 | 76.45 |
5pt+P3P | 9.88 | 3.41 | 41.36 | 57.36 | 71.13 | 105.50 |
4p3v(M) | 11.65 | 4.21 | 35.38 | 51.73 | 66.77 | 76.77 |
4p3v(M)+R | 10.52 | 3.47 | 41.01 | 56.62 | 70.25 | 74.94 |
4p3v(M)+R+F | 10.79 | 3.58 | 40.09 | 55.73 | 69.56 | 30.54 |
4p3v(M) | 10.44 | 3.72 | 38.82 | 55.07 | 69.51 | 172.06 |
4p3v(M)+R | 9.84 | 3.27 | 42.62 | 58.30 | 71.74 | 189.21 |
4p3v(M)+R+F | 10.14 | 3.40 | 41.45 | 57.24 | 70.92 | 75.78 |
4p3v(L) | 11.56 | 4.21 | 35.32 | 51.77 | 66.84 | 376.31 |
4p3v(L)+R | 10.48 | 3.47 | 41.02 | 56.60 | 70.25 | 297.88 |
4p3v(Linit)+R | 9.86 | 3.28 | 42.60 | 58.26 | 71.71 | 730.90 |
4p3v(O) | 9.64 | 3.29 | 42.60 | 58.54 | 72.09 | 58.22 |
4p3v(O)+R | 9.50 | 3.13 | 44.01 | 59.67 | 72.81 | 86.30 |
4p3v(O)+R+F | 9.45 | 3.16 | 43.82 | 59.55 | 72.76 | 36.36 |
Cambridge Landmarks [21] | ||||||
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (ms) |
4p3v(HC) [19] | 15.12 | 5.51 | 24.50 | 43.42 | 61.10 | 64.49 |
5pt+P3P | 13.38 | 5.17 | 25.94 | 45.38 | 63.16 | 48.53 |
4p3v(M) | 15.31 | 5.68 | 22.93 | 42.06 | 60.26 | 38.33 |
4p3v(M)+R | 14.28 | 5.29 | 25.37 | 44.57 | 62.33 | 34.84 |
4p3v(M)+R+F | 14.60 | 5.39 | 24.89 | 43.97 | 61.71 | 16.51 |
4p3v(M) | 14.18 | 5.36 | 24.72 | 44.12 | 62.15 | 81.21 |
4p3v(M)+R | 13.67 | 5.11 | 26.17 | 45.56 | 63.33 | 84.98 |
4p3v(M)+R+F | 13.89 | 5.23 | 25.43 | 44.78 | 62.70 | 38.45 |
4p3v(L) | 15.24 | 5.72 | 22.98 | 42.15 | 60.35 | 232.13 |
4p3v(L)+R | 14.23 | 5.27 | 25.33 | 44.49 | 62.30 | 177.06 |
4p3v(Linit)+R | 13.63 | 5.10 | 26.11 | 45.58 | 63.35 | 395.43 |
4p3v(O) | 13.85 | 5.22 | 25.61 | 45.19 | 63.11 | 30.43 |
4p3v(O)+R | 13.51 | 5.07 | 26.60 | 46.05 | 63.76 | 34.68 |
4p3v(O)+R+F | 13.47 | 5.10 | 26.47 | 45.98 | 63.71 | 16.41 |
9.2 Noise experiments
We tested the performance of our solvers and the state-of-the-art algorithms w.r.t. increasing image noise. We used the SfM model of the botanical garden scene (randomly selected from all scenes) from the ETH3D dataset [50] to obtain instances of 5/6 points in three views by identifying images in the scene that share 3D points. Perfect noise-free correspondences are generated by projecting the 3D points into the images. We then add increasing amounts of normally distributed noise to these correspondences. We generated more than 9k instances, but show only 1k results per plot to avoid clutter. Note that the 4p3v(HC) solver was trained on the ETH3D dataset while our L-based solvers were trained on purely synthetic data. For the noise experiments we also test the joint 5PC solver, which operates on samples of 5 point correspondences, by estimating the essential matrices independently, using the 5pt solver. Notice that, because the two essential matrices are estimated independently of each other, the scales of the two translations are independent of each other. In contrast, the other solvers estimate the scale of the translation of the third camera relative to the scale of the translation between the first two cameras.
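The generation of noise-free and noisy correspondences can be sketched as follows. The snippet is a minimal illustration assuming a pinhole camera with intrinsics `K` and pose `(R, t)`, with the noise standard deviation `sigma` given in pixels; the function names and the example intrinsics are hypothetical and not taken from the experiments.

```python
import numpy as np

def project(K, R, t, X):
    """Project (N, 3) world points into pixel coordinates of a pinhole
    camera with intrinsics K and pose (R, t)."""
    Xc = X @ R.T + t            # world -> camera coordinates
    uv = Xc @ K.T               # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]

def add_gaussian_noise(points, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise with standard deviation sigma."""
    return points + rng.normal(scale=sigma, size=points.shape)

# Example with hypothetical intrinsics and a single 3D point.
K = np.array([[800.0, 0.0, 400.0], [0.0, 800.0, 300.0], [0.0, 0.0, 1.0]])
X = np.array([[0.0, 0.0, 5.0]])
clean = project(K, np.eye(3), np.zeros(3), X)
noisy = add_gaussian_noise(clean, sigma=1.0, rng=np.random.default_rng(0))
```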
The results for increasing noise in the image points are shown in Figs. 10 and 11. The results are represented by the boxplot function which shows the 25% to 75% quantiles as a box with a horizontal line at median. Crosses show data beyond 1.5 times the interquartile range. Let be the error of the estimated relative rotation between cameras and , computed as the angle in the axis-angle representation of and let be the error of the estimated translation computed as the angle between the two unit vectors corresponding to the translations [20]. Fig. 10 shows boxplots of pose errors measured in the same way as in our experiments in the main paper (cf. Sec. 4 in the main paper), i.e., as , for the calibrated 4p3v problem (left), and the partially calibrated 4p3vf problem (right). The errors are zoomed into an interesting interval and are shown as functions of varying noise from to .
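The two angular errors described above can be computed as in the following sketch. The rotation error is the angle of the relative rotation between ground truth and estimate, and the translation error is the angle between the two translation directions. Combining them by taking the maximum is a common choice and an assumption here, since the exact expression is not recoverable from the text; the function names are illustrative.

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Angle (degrees) of the rotation R_gt^T R_est in axis-angle form."""
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """Angle (degrees) between the two translation directions."""
    a = t_gt / np.linalg.norm(t_gt)
    b = t_est / np.linalg.norm(t_est)
    return np.degrees(np.arccos(np.clip(a @ b, -1.0, 1.0)))

def pose_error_deg(R_gt, t_gt, R_est, t_est):
    # Assumption: the pose error is the larger of the two angular errors.
    return max(rotation_error_deg(R_gt, R_est),
               translation_error_deg(t_gt, t_est))
```

Clipping the cosine guards against values slightly outside [-1, 1] caused by floating-point rounding.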
Due to the approximate nature of the virtual correspondences, our newly proposed M-based and L-based solvers exhibit non-zero errors for zero noise. However, at noise levels , our -based solvers (both M and L), and for the calibrated case even the pure 4p3v(M) and 4p3v(L) solvers, return comparable or even better results than the 5pt+P3P and 6pt+P3P solvers. Note that the 5pt+P3P and 6pt+P3P solvers sample one/two more points (real correspondences) in the first two cameras, and these points are affected only by the considered noise. This shows that our predicted virtual correspondences are good approximations to real correspondences. For the calibrated case, the recent state-of-the-art solver [19] is failing in about 50% of the instances for noiseless data, even though the solver was trained on the ETH3D dataset. Thus, the median errors are significantly larger than the median errors of the remaining solvers.
The rotation and translation errors in the first two views for both the calibrated case (top row) and the partially calibrated case (bottom row) are shown in Fig. 11. For the partially calibrated case, our new solvers generate two approximate virtual correspondences in the first two views. Therefore, the 4p3vf(M) and 4p3vf(L) solvers have slightly larger errors than the 6pt+P3P solver for all considered noise levels. However, similarly to the pose errors in Fig. 10, at larger noise levels our offset-based solvers (both M and L) return comparable or even better results in the first two views than the 5pt [34] and 6pt [56] solvers, here represented by the results of the 5pt+P3P and 6pt+P3P solvers. For the calibrated case, even the pure 4p3v(M) and 4p3v(L) solvers, without the offset, perform comparably to the 5pt solver. This shows an interesting potential of using our solvers for two-view relative pose estimation from sub-minimal samples.
Phototourism [20] | ||||||
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (s) |
4p3v(HC) [19] | 5.23 | 1.89 | 43.36 | 62.83 | 76.97 | 2.95 |
5pt+P3P | 5.00 | 1.85 | 43.99 | 63.35 | 77.39 | 2.78 |
4p3v(M) | 5.11 | 1.94 | 43.03 | 62.87 | 77.15 | 2.23 |
4p3v(M)+R | 5.07 | 1.91 | 43.30 | 63.14 | 77.32 | 2.47 |
4p3v(M)+R+F | 5.05 | 1.89 | 43.41 | 63.24 | 77.39 | 2.42 |
4p3v(M) | 5.02 | 1.92 | 43.24 | 63.15 | 77.44 | 2.25 |
4p3v(M)+R | 4.96 | 1.90 | 43.57 | 63.48 | 77.67 | 2.53 |
4p3v(M)+R+F | 5.00 | 1.89 | 43.51 | 63.38 | 77.56 | 2.41 |
4p3v(L) | 5.46 | 1.93 | 42.82 | 62.07 | 76.25 | 2.88 |
4p3v(L)+R | 5.05 | 1.91 | 43.28 | 63.19 | 77.42 | 2.50 |
4p3v(O)+R | 4.70 | 1.81 | 44.73 | 64.50 | 78.44 | 2.38 |
4p3v(NO)+R | 4.24 | 1.74 | 46.01 | 65.97 | 79.74 | 2.39 |
Phototourism [20] | ||||||
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (ms) |
4p3v(HC) [19] | 11.41 | 3.89 | 37.90 | 53.76 | 68.08 | 76.45 |
5pt+P3P | 9.88 | 3.41 | 41.36 | 57.36 | 71.13 | 105.50 |
4p3v(M) | 11.65 | 4.21 | 35.38 | 51.73 | 66.77 | 76.77 |
4p3v(M)+R | 10.52 | 3.47 | 41.01 | 56.62 | 70.25 | 74.94 |
4p3v(M)+R+F | 10.79 | 3.58 | 40.09 | 55.73 | 69.56 | 30.54 |
4p3v(M) | 10.44 | 3.72 | 38.82 | 55.07 | 69.51 | 172.06 |
4p3v(M)+R | 9.84 | 3.27 | 42.62 | 58.30 | 71.74 | 189.21 |
4p3v(M)+R+F | 10.14 | 3.40 | 41.45 | 57.24 | 70.92 | 75.78 |
4p3v(L) | 11.56 | 4.21 | 35.32 | 51.77 | 66.84 | 376.31 |
4p3v(L)+R | 10.48 | 3.47 | 41.02 | 56.60 | 70.25 | 297.88 |
4p3v(Linit)+R | 9.86 | 3.28 | 42.60 | 58.26 | 71.71 | 730.90 |
4p3v(O) | 9.64 | 3.29 | 42.60 | 58.54 | 72.09 | 58.22 |
4p3v(O)+R | 9.50 | 3.13 | 44.01 | 59.67 | 72.81 | 86.30 |
4p3v(O)+R+F | 9.45 | 3.16 | 43.82 | 59.55 | 72.76 | 36.36 |
Cambridge Landmarks [21] | ||||||
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (ms) |
4p3v(HC) [19] | 15.12 | 5.51 | 24.50 | 43.42 | 61.10 | 64.49 |
5pt+P3P | 13.38 | 5.17 | 25.94 | 45.38 | 63.16 | 48.53 |
4p3v(M) | 15.31 | 5.68 | 22.93 | 42.06 | 60.26 | 38.33 |
4p3v(M)+R | 14.28 | 5.29 | 25.37 | 44.57 | 62.33 | 34.84 |
4p3v(M)+R+F | 14.60 | 5.39 | 24.89 | 43.97 | 61.71 | 16.51 |
4p3v(M) | 14.18 | 5.36 | 24.72 | 44.12 | 62.15 | 81.21 |
4p3v(M)+R | 13.67 | 5.11 | 26.17 | 45.56 | 63.33 | 84.98 |
4p3v(M)+R+F | 13.89 | 5.23 | 25.43 | 44.78 | 62.70 | 38.45 |
4p3v(L) | 15.24 | 5.72 | 22.98 | 42.15 | 60.35 | 232.13 |
4p3v(L)+R | 14.23 | 5.27 | 25.33 | 44.49 | 62.30 | 177.06 |
4p3v(Linit)+R | 13.63 | 5.10 | 26.11 | 45.58 | 63.35 | 395.43 |
4p3v(O) | 13.85 | 5.22 | 25.61 | 45.19 | 63.11 | 30.43 |
4p3v(O)+R | 13.51 | 5.07 | 26.60 | 46.05 | 63.76 | 34.68 |
4p3v(O)+R+F | 13.47 | 5.10 | 26.47 | 45.98 | 63.71 | 16.41 |
9.3 Alternative evaluation measure
For the evaluation in the main paper, we defined the pose error in terms of the angular errors of rotation and translation, in degrees, for pairs of cameras. Because the relative scale of the two translations is recovered, the 4p3v problem also determines the relative rotation and translation of the remaining camera pair. We therefore also present results for the pose error defined as
(3) |
The results equivalent to Tab. 2 from the main paper using this pose error definition are presented in Tab. 6. A speed-accuracy comparison equivalent to Fig. 3 in the main paper is presented in Fig. 12. The overall comparison of the methods remains the same under both the metric used in the main paper and the alternative described in this section.
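The tables report the AUC of the pose error at thresholds of 5, 10, and 20 degrees. The text does not spell out how the AUC is computed, so the following sketch is an assumption: it follows a commonly used convention in which the recall-versus-error curve is integrated up to each threshold and normalized by the threshold. The function name `pose_auc` and the example errors are hypothetical.

```python
def pose_auc(errors, thresholds):
    """Area under the recall-vs-error curve, normalized per threshold."""
    xs = [0.0] + sorted(errors)
    n = len(errors)
    ys = [0.0] + [(i + 1) / n for i in range(n)]
    aucs = []
    for thr in thresholds:
        # Number of curve samples strictly below the threshold.
        last = sum(1 for x in xs if x < thr)
        # Close the curve at the threshold with the last reached recall.
        px = xs[:last] + [thr]
        py = ys[:last] + [ys[last - 1]]
        # Trapezoidal integration of the piecewise-linear curve.
        area = sum((px[i + 1] - px[i]) * (py[i + 1] + py[i]) / 2.0 for i in range(len(px) - 1))
        aucs.append(area / thr)
    return aucs

auc_2, auc_4 = pose_auc([3.0, 1.0], thresholds=[2.0, 4.0])
```

The function returns values in [0, 1]; multiplying by 100 gives percentages on the scale used in the tables.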
9.4 GC-RANSAC
We also evaluated our proposed solvers and compared them against the state-of-the-art solvers within the GC-RANSAC [2] framework. Results for all scenes of the PhotoTourism dataset [20] are presented in Tab. 7.
[Figure panels: St. Mary’s Church [21] and Sacre Coeur [20]]
9.5 Outlier experiments
Fig. 4 in the main paper presented results obtained by synthetically removing outlier matches based on ground truth pose information and replacing them with outliers distributed uniformly at random to reach a given inlier ratio for all image triplets. Here we provide additional plots for two scenes: St. Mary’s Church from Cambridge Landmarks [21] and Sacre Coeur from Phototourism [20]. The results for the speed-accuracy evaluation are shown in Fig. 13. Consistent with the results in the main paper, these graphs show the potential for our M-based solvers to perform better than the 5pt+P3P baseline in scenarios with low inlier ratios.
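The construction of a match set with a prescribed inlier ratio can be sketched as follows. This is an illustration under assumptions: the inliers are taken to be already filtered with the ground truth pose, outliers are drawn uniformly inside an image of assumed size, and the function name `pad_with_outliers` is hypothetical.

```python
import random

def pad_with_outliers(inliers, inlier_ratio, width, height, rng):
    """Append uniformly random triplet matches until inliers make up inlier_ratio."""
    n_in = len(inliers)
    # Number of outliers such that n_in / (n_in + n_out) is close to inlier_ratio.
    n_out = round(n_in * (1.0 - inlier_ratio) / inlier_ratio)

    def random_point():
        return (rng.uniform(0.0, width), rng.uniform(0.0, height))

    # Each outlier is a triplet of unrelated points, one per view.
    outliers = [tuple(random_point() for _ in range(3)) for _ in range(n_out)]
    matches = list(inliers) + outliers
    rng.shuffle(matches)
    return matches

rng = random.Random(0)
# Placeholder inliers: the same pixel in all three views, for illustration only.
inliers = [((float(i), float(i)),) * 3 for i in range(20)]
matches = pad_with_outliers(inliers, inlier_ratio=0.25, width=640.0, height=480.0, rng=rng)
```

With 20 inliers and a target ratio of 0.25, the sketch adds 60 random triplets, giving 80 matches in total.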
9.6 Detailed experiments on Real Data
Results for individual scenes. Fig. 3 in the main paper shows results on all PhotoTourism scenes (except St. Peter’s Square), the 5 Cambridge Landmarks scenes we consider, and one individual scene from each dataset. In Fig. 14, we provide results for the accuracy-speed trade-off evaluation for more evaluation scenes on both PhotoTourism [20] and Cambridge Landmarks [21]. As discussed and shown in the main paper, the performance of our M-based and L-based solvers is scene-dependent. This can also be seen in Fig. 14, where for some scenes, the 4p3v(M)+R and 4p3v(M)+R solvers perform worse than the 5pt+P3P solver (Shop Facade, Palace of Westminster, Trevi Fountain). However, for the majority of the scenes, our solvers perform similarly to 5pt+P3P (Old Hospital) or even outperform 5pt+P3P (Great Court, Buckingham Palace, Colosseum Exterior, Grand Place Brussels, Notre Dame, Pantheon Exterior, Taj Mahal, Temple Nara). Overall, the results validate the practical viability of our solvers in a time-constrained setting.
Results for L-based solvers. In Tab. 9 we provide the same results as shown in Tab. 3 in the main paper, including also more variants for L-based solvers, i.e. 4p3v(L)+R+F, 4p3v(L), 4p3v(L)+R, 4p3v(L)+R+F, 4p3v(Linit), 4p3v(Linit)+R, and 4p3v(Linit)+R+F.
Phototourism [20] | ||||||
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (ms) |
4p3v(HC) [19] | 7.17 | 2.34 | 52.74 | 66.63 | 77.86 | 76.45 |
5pt+P3P | 5.99 | 2.00 | 57.31 | 70.54 | 80.81 | 105.50 |
4p3v(M) | 7.17 | 2.49 | 50.96 | 65.46 | 77.32 | 76.77 |
4p3v(M)+R | 6.39 | 2.00 | 56.92 | 69.90 | 80.17 | 74.94 |
4p3v(M)+R+F | 6.59 | 2.07 | 55.90 | 69.06 | 79.56 | 30.54 |
4p3v(M) | 6.39 | 2.19 | 54.70 | 68.58 | 79.59 | 172.06 |
4p3v(M)+R | 5.97 | 1.89 | 58.65 | 71.43 | 81.35 | 189.21 |
4p3v(M)+R+F | 6.15 | 1.97 | 57.42 | 70.47 | 80.69 | 75.78 |
4p3v(L) | 7.12 | 2.50 | 51.00 | 65.57 | 77.42 | 376.31 |
4p3v(L)+R | 6.35 | 2.00 | 56.88 | 69.88 | 80.15 | 297.88 |
4p3v(L)+R+F | 6.55 | 2.07 | 55.88 | 69.08 | 79.55 | 255.93 |
4p3v(L) | 6.39 | 2.19 | 54.70 | 68.58 | 79.59 | 172.05 |
4p3v(L)+R | 5.97 | 1.89 | 58.65 | 71.43 | 81.35 | 189.28 |
4p3v(L)+R+F | 6.15 | 1.97 | 57.42 | 70.47 | 80.69 | 75.76 |
4p3v(Linit) | 5.97 | 1.91 | 58.36 | 71.25 | 81.26 | 679.80 |
4p3v(Linit)+R | 5.97 | 1.89 | 58.63 | 71.41 | 81.34 | 730.90 |
4p3v(Linit)+R+F | 6.04 | 1.92 | 58.11 | 70.97 | 81.02 | 605.86 |
4p3v(O) | 5.82 | 1.90 | 58.91 | 71.84 | 81.70 | 58.22 |
4p3v(O)+R | 5.73 | 1.80 | 60.23 | 72.75 | 82.21 | 86.30 |
4p3v(O)+R+F | 5.72 | 1.82 | 59.97 | 72.58 | 82.13 | 36.36 |
Cambridge Landmarks [21] | ||||||
Estimator | AVG | MED | AUC@5 | @10 | @20 | Runtime (ms) |
4p3v(HC) [19] | 9.69 | 3.31 | 40.96 | 58.84 | 72.83 | 64.49 |
5pt+P3P | 8.16 | 3.05 | 43.79 | 61.61 | 75.30 | 48.53 |
4p3v(M) | 9.61 | 3.42 | 39.71 | 58.08 | 72.54 | 38.33 |
4p3v(M)+R | 8.77 | 3.11 | 42.98 | 60.90 | 74.59 | 34.84 |
4p3v(M)+R+F | 9.03 | 3.17 | 42.31 | 60.18 | 74.03 | 16.51 |
4p3v(M) | 8.75 | 3.21 | 42.11 | 60.37 | 74.38 | 81.21 |
4p3v(M)+R | 8.32 | 3.01 | 44.17 | 62.04 | 75.60 | 84.98 |
4p3v(M)+R+F | 8.47 | 3.08 | 43.21 | 61.22 | 75.03 | 38.45 |
4p3v(L) | 9.58 | 3.44 | 39.79 | 58.17 | 72.62 | 232.13 |
4p3v(L)+R | 8.75 | 3.09 | 42.94 | 60.86 | 74.60 | 177.06 |
4p3v(L)+R+F | 8.92 | 3.17 | 42.28 | 60.19 | 74.08 | 163.95 |
4p3v(L) | 8.75 | 3.21 | 42.11 | 60.37 | 74.38 | 81.20 |
4p3v(L)+R | 8.32 | 3.01 | 44.17 | 62.04 | 75.60 | 84.89 |
4p3v(L)+R+F | 8.47 | 3.08 | 43.21 | 61.22 | 75.03 | 38.49 |
4p3v(Linit) | 8.43 | 3.05 | 43.80 | 61.70 | 75.34 | 379.49 |
4p3v(Linit)+R | 8.28 | 3.00 | 44.26 | 62.10 | 75.62 | 395.43 |
4p3v(Linit)+R+F | 8.33 | 3.04 | 43.82 | 61.71 | 75.35 | 349.45 |
4p3v(O) | 8.62 | 3.07 | 43.58 | 61.65 | 75.30 | 30.43 |
4p3v(O)+R | 8.39 | 2.94 | 44.80 | 62.54 | 75.93 | 34.68 |
4p3v(O)+R+F | 8.36 | 2.97 | 44.64 | 62.38 | 75.82 | 16.41 |
Solver | 5pt+P3P | 4p3v(HC) | 4p3v(M) | 4p3v(M) | 4p3v(L) | 4p3v(L) | 4p3v(Linit) |
Time (s) | 77.90 | 66.06 | 83.92 | 218.71 | 450.26 | 511.28 | 1130.31 |
Solver | 6pt+P3P | 4p3vf(M) | 4p3vf(M) | 4p3vf(L) | 4p3vf(L) | 4p3vf(Linit) |
Time (s) | 106.67 | 117.28 | 295.87 | 758.59 | 953.34 | 2162.77 |
9.7 Oracle solvers
We first provide more details on the implementation of our oracle version of the 4p3v(N) solver [35], i.e., the 4p3v(NO) solver. In the 4p3v(NO) solver, instead of performing a one-dimensional search over the curve of possible epipoles, we provide the solver with the ground truth epipole. To simulate the effect of sampling four points for this solver inside RANSAC, instead of using the second epipole and the epipolar line homography to get the essential matrix, as suggested in the implementation details of [35], we use the second way suggested there to obtain the essential matrix, i.e., their 3pt+ep solver. However, we feed the 3pt+ep solver with four points and use SVD instead of the null space. The rest of the solver performs the triangulation and registers the last camera using the P3P solver [39]. This is identical to the original 4p3v(N) solver. Remember that the original 4p3v(N) solver needs to perform these evaluations for each search step on the curve of possible epipoles [35]. Moreover, this solver has several sources of errors, e.g., the curve is affected by noise; sparse sampling of the points on the curve introduces additional, potentially large noise in the epipole; and the reprojections of the fourth image point in the third view, traced out by sweeping through the curve of possible epipoles, generate complex curves in the third view, with the reprojection cost function having many local minima. In [35], it was reported that the solver can fail even for exact data and 1000 search points followed by refinement at multiple local minima. Therefore, we expect the original 4p3v(N) solver to perform much worse than the “oracle” 4p3v(NO) solver.
To obtain upper bounds for the precision that can be achieved by our proposed solvers, we consider the following “oracle” solvers for the real experiments: the 4p3v(O)/4p3vf(O) solvers use one/two correct correspondences, i.e., correspondences that satisfy the epipolar constraint for the ground truth relative pose of the first two cameras, as the virtual correspondences between these cameras. Then the 5pt+P3P/6pt+P3P solver is applied to estimate the relative pose of three cameras. The 4p3v(O)/4p3vf(O) solvers thus indicate the maximum precision that our solvers could reach if they were able to predict or infer precise virtual correspondences from the coordinates of the four input correspondences.
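One way to obtain a correspondence that satisfies the epipolar constraint for a given ground truth pose is sketched below. How the oracle correspondences are actually generated is not specified in the text, so the construction here (back-projecting a point of the first view at an assumed depth and projecting it into the second view) is only an assumption, and the function names and numeric values are hypothetical. Normalized image coordinates are used.

```python
import math

def mat_vec(M, v):
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

def mat_mul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)] for r in range(3)]

def skew(t):
    """Cross-product matrix [t]_x."""
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def oracle_correspondence(x1, depth, R, t):
    """Return x2 such that (x1, x2) is consistent with the relative pose (R, t)."""
    # Back-project x1 to the assumed depth in the first camera frame.
    X = [depth * x1[0], depth * x1[1], depth]
    # Move the point into the second camera frame and project it.
    RX = mat_vec(R, X)
    Xc2 = [RX[i] + t[i] for i in range(3)]
    return (Xc2[0] / Xc2[2], Xc2[1] / Xc2[2], 1.0)

# Assumed ground-truth pose: 5 degree rotation about the y-axis plus a translation.
a = math.radians(5.0)
R_gt = [[math.cos(a), 0.0, math.sin(a)], [0.0, 1.0, 0.0], [-math.sin(a), 0.0, math.cos(a)]]
t_gt = [1.0, 0.1, 0.2]
E_gt = mat_mul(skew(t_gt), R_gt)

x1 = (0.1, -0.2, 1.0)
x2 = oracle_correspondence(x1, depth=4.0, R=R_gt, t=t_gt)
Ex1 = mat_vec(E_gt, x1)
residual = sum(x2[r] * Ex1[r] for r in range(3))  # x2^T E x1, zero up to round-off
```

Any depth that places the point in front of both cameras yields a valid pair; different depths slide the second point along the same epipolar line.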
Tab. 7 compares the 4p3v(NO) solver with the 4p3v(O) solver, with several variants of our M-based and L-based solvers, and with the 4p3v(HC) [19] and 5pt+P3P approaches. Results are shown over all scenes from the PhotoTourism dataset using GC-RANSAC. The 4p3v(NO) solver performs slightly better than 4p3v(O) in terms of pose accuracy. (Footnote 6: We observed that when naively applying 4p3v(NO) in RANSAC, RANSAC tends to terminate too early, resulting in reduced pose accuracy. This resulted in the statement in the main paper that the 4p3v(NO) oracle performed worse than our oracles. In order to obtain results comparable with 4p3v(O), we had to adapt RANSAC.) While the pose accuracy of 4p3v(NO) provides an upper bound on the performance of the 4p3v(N) solver [35], the run-time observed for 4p3v(NO) is not indicative of the run-time of 4p3v(N). As highlighted above and in the main paper, the 4p3v(N) solver needs to evaluate up to thousands of epipole estimates and is thus significantly slower than its oracle variant (which only evaluates a single epipole). In contrast, our M-based solvers have a run-time comparable to 4p3v(O). In addition, the epipoles used by 4p3v(N) can be rather noisy, depending on how densely the curve is sampled. Even for a rather dense sampling, which increases the run-time of the 4p3v(N) solver, [35] report that their solver often has problems with local minima. The results of the 4p3v(O) solver show that there is room for improvement for our M-based and L-based solvers by developing approaches for generating more accurate virtual correspondences.
9.8 Solver run-times
In this section, we present run-times of the proposed solvers as well as the state-of-the-art solvers for the relative pose problem of three calibrated/partially calibrated cameras. While the main paper reports run-time results for full RANSAC-based estimation, we now report the run-times of the individual solvers outside of RANSAC. To measure the run-times of the solvers, we calculated the average run-time of each solver on more than 10k instances of the Sacre Coeur scene of the PhotoTourism dataset [51]. For calibrated cameras, the run-times are reported in Table 10, and for partially calibrated cameras in Table 11. The experiments were performed on an Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz. The average run-times of the L-based solvers are higher, because we run the network on the CPU and without batching. In general, the implementations of all proposed solvers are not optimized for speed, and we still see room for speeding them up.
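The per-solver timing protocol can be sketched as follows. This is a minimal illustration, assuming a solver is exposed as a callable that takes one problem instance; `average_runtime_s` and `dummy_solver` are hypothetical names, and the dummy solver merely stands in for a real minimal solver.

```python
import time

def average_runtime_s(solver, instances):
    """Average wall-clock time (seconds) of one solver call over all instances."""
    start = time.perf_counter()
    for instance in instances:
        solver(instance)
    elapsed = time.perf_counter() - start
    return elapsed / len(instances)

calls = []

def dummy_solver(instance):
    # Placeholder workload; a real solver would return pose hypotheses.
    calls.append(instance)
    return sum(instance)

avg = average_runtime_s(dummy_solver, [(1.0, 2.0, 3.0)] * 100)
```

Timing the whole loop and dividing by the number of instances avoids the timer overhead of measuring each call separately, which matters for solvers that run in microseconds.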
References
- [1] Chris Aholt and Luke Oeding. The ideal of the trifocal variety. Mathematics of Computation, 83(289):2553–2574, 2014.
- [2] D. Barath and J. Matas. Graph-Cut RANSAC. In Conference on Computer Vision and Pattern Recognition, 2018.
- [3] Daniel Barath and Chris Sweeney. Relative pose solvers using monocular depth. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 4037–4043. IEEE, 2022.
- [4] Martin Bujnak, Zuzana Kukelova, and Tomas Pajdla. A general solution to the p4p problem for camera with unknown focal length. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
- [5] Robert Castle, Georg Klein, and David W. Murray. Video-rate localization in multiple maps for wearable augmented reality. In ISWC, 2008.
- [6] Luca Cavalli, Marc Pollefeys, and Daniel Barath. NeFSAC: neurally filtered minimal samples. In European Conference on Computer Vision, pages 351–366. Springer, 2022.
- [7] Chiang-Heng Chien, Hongyi Fan, Ahmad Abdelfattah, Elias Tsigaridas, Stanimire Tomov, and Benjamin Kimia. Gpu-based homotopy continuation for minimal problems in computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15765–15776, June 2022.
- [8] Ondřej Chum, Jiří Matas, and Josef Kittler. Locally optimized ransac. In Pattern Recognition, pages 236–243. Springer Berlin Heidelberg, 2003.
- [9] Andrea Porfiri Dal Cin, Timothy Duff, Luca Magri, and Tomas Pajdla. Minimal perspective autocalibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5064–5073, June 2024.
- [10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
- [11] Yaqing Ding, Chiang-Heng Chien, Viktor Larsson, Karl Åström, and Benjamin Kimia. Minimal solutions to generalized three-view relative pose problem. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8156–8164, October 2023.
- [12] Timothy Duff, Kathlen Kohn, Anton Leykin, and Tomas Pajdla. Plmp-point-line minimal problems in complete multi-view visibility. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1675–1684, 2019.
- [13] Timothy Duff, Kathlén Kohn, Anton Leykin, and Tomás Pajdla. Plp - point-line minimal problems under partial visibility in three views. In ECCV (26), volume 12371 of Lecture Notes in Computer Science, pages 175–192. Springer, 2020.
- [14] Ricardo Fabbri, Timothy Duff, Hongyi Fan, Margaret H. Regan, David da Costa de Pinho, Elias P. Tsigaridas, Charles W. Wampler, Jonathan D. Hauenstein, Peter J. Giblin, Benjamin B. Kimia, Anton Leykin, and Tomás Pajdla. TRPLP - trifocal relative pose from lines at points. In CVPR, pages 12070–12080. Computer Vision Foundation / IEEE, 2020.
- [15] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
- [16] Wolfgang Förstner and Bernhard P. Wrobel. Photogrammetric Computer Vision: Statistics, Geometry, Orientation and Reconstruction. Geometry and Computing, 11. Springer International Publishing, Cham, 1st edition, 2016.
- [17] Christian Häne, Lionel Heng, Gim Hee Lee, Friedrich Fraundorfer, Paul Furgale, Torsten Sattler, and Marc Pollefeys. 3d visual perception for self-driving cars using a multi-camera system: Calibration, mapping, localization, and obstacle detection. Image and Vision Computing, 68:14–27, 2017.
- [18] R.J. Holt and A.N. Netravali. Uniqueness of solutions to three perspective views of four points. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(3):303–307, 1995.
- [19] Petr Hruby, Timothy Duff, Anton Leykin, and Tomas Pajdla. Learning to solve hard minimal problems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5532–5542, June 2022.
- [20] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 2020.
- [21] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
- [22] J. Kileel. Minimal problems for the calibrated trifocal variety. SIAM Journal on Applied Algebra and Geometry, 1(1):575–598, 2017.
- [23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
- [24] Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla. Real-time solution to the absolute pose problem with unknown radial distortion and focal length. In Proceedings of the IEEE International Conference on Computer Vision, pages 2816–2823, 2013.
- [25] Z. Kukelova and T. Pajdla. A minimal solution to the autocalibration of radial distortion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), 2007.
- [26] Viktor Larsson. PoseLib - Minimal Solvers for Camera Pose Estimation, 2020.
- [27] Viktor Larsson, Kalle Åström, and Magnus Oskarsson. Efficient solvers for minimal problems by syzygy-based reduction. In Computer Vision and Pattern Recognition (CVPR), 2017.
- [28] Viktor Larsson, Torsten Sattler, Zuzana Kukelova, and Marc Pollefeys. Revisiting radial distortion absolute pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1062–1071, 2019.
- [29] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 3744–3753, 2019.
- [30] S. Leonardos, R. Tron, and K. Daniilidis. A metric parametrization for trifocal tensors with non-colinear pinholes. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
- [31] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In International Conference on Computer Vision (ICCV), 2023.
- [32] H. Christopher Longuet-Higgins. A method of obtaining the relative positions of 4 points from 3 perspective projections. In Peter Mowforth, editor, BMVC91, pages 86–94, London, 1991. Springer London.
- [33] Evgeniy V. Martyushev. On some properties of calibrated trifocal tensors. Journal of Mathematical Imaging and Vision, 58(2):321–332, 2017.
- [34] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, June 2004.
- [35] D. Nistér and F. Schaffalitzky. Four points in two or three calibrated views: Theory and practice. International Journal of Computer Vision, 67(2):211–231, 2006.
- [36] Magnus Oskarsson, Andrew Zisserman, and Kalle Åström. Minimal projective reconstruction for combinations of points and lines in three views. Image and Vision Computing, 22(10):777–785, 2004. British Machine Vision Computing 2002.
- [37] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS Autodiff Workshop, 2017.
- [38] Michal Perdoch, Jiri Matas, and Ondrej Chum. Epipolar geometry from two correspondences. In 18th International Conference on Pattern Recognition (ICPR’06), volume 4, pages 215–219. IEEE, 2006.
- [39] Mikael Persson and Klas Nordberg. Lambda twist: An accurate fast robust perspective three point (p3p) solver. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
- [40] James Pritts, Ondrej Chum, and Jiri Matas. Approximate models for fast and accurate epipolar geometry estimation. In 28th International Conference on Image and Vision Computing New Zealand, IVCNZ 2013, Wellington, New Zealand, November 27-29, 2013, pages 106–111. IEEE, 2013.
- [41] James Pritts, Zuzana Kukelova, Viktor Larsson, and Ondrej Chum. Radially-distorted conjugate translations. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1993–2001. IEEE Computer Society, 2018.
- [42] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- [43] Long Quan. Invariants of six points and projective reconstruction from three uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 17(1):34–46, 1995.
- [44] L. Quan, B. Triggs, and B. Mourrain. Some results on minimal euclidean reconstruction from four points. Journal of Mathematical Imaging and Vision, 24(3):341–348, 2006.
- [45] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm. USAC: A universal framework for random sample consensus. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):2022–2038, 2013.
- [46] Volker Rodehorst. Evaluation of the metric trifocal tensor for relative three-view orientation. In Klaus Gürlebeck and Tom Lahmer, editors, Digital Proceedings, International Conference on the Applications of Computer Science and Mathematics in Architecture and Civil Engineering : July 20 - 22 2015, Bauhaus-University Weimar, 2017.
- [47] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016. (To appear).
- [48] D. Scaramuzza and F. Fraundorfer. Visual odometry [tutorial]. IEEE Robot. Automat. Mag., 18(4):80–92, 2011.
- [49] Johannes L. Schönberger and Jan-Michael Frahm. Structure-From-Motion Revisited. In CVPR, 2016.
- [50] Thomas Schops, Johannes L. Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-View Stereo Benchmark With High-Resolution Images and Multi-Camera Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- [51] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM siggraph 2006 papers, pages 835–846. 2006.
- [52] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. International Journal of Computer Vision, 80(2):189–210, Nov. 2008.
- [53] Andrew J Sommese and Charles W Wampler. The numerical solution of systems of polynomials arising in engineering and science. World Scientific Publishing Co. Pte. Ltd, Singapore, 2005.
- [54] H. Stewenius, C. Engels, and D. Nistér. Recent developments on direct relative orientation. ISPRS J. of Photogrammetry and Remote Sensing, 60:284–294, 2006.
- [55] H. Stewenius, D. Nister, F. Kahl, and F. Schaffalitzky. A minimal solution for relative pose with unknown focal length. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), 2005.
- [56] H. Stewénius, D. Nistér, M. Oskarsson, and K. Åström. Solutions to minimal generalized relative pose problems. In Workshop on Omnidirectional Vision, Beijing China, OCT 2005.
- [57] P. H. S. Torr and A. Zisserman. Robust parameterization and computation of the trifocal tensor. Image and Vision Computing, 15:591–605, 1997.
- [58] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
- [59] G. Xu and N. Sugimoto. A linear algorithm for motion from three weak perspective images using euler angles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(1):54–57, 1999.