Relative pose of three calibrated and partially calibrated cameras from four points using virtual correspondences

Charalambos Tzamos$^{1}$  Viktor Kocur$^{2}$  Daniel Barath$^{3}$  Zuzana Berger Haladová$^{2}$
Torsten Sattler$^{4}$  Zuzana Kukelova$^{1}$
$^{1}$Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague
{tzamocha, kukelzuz}@fel.cvut.cz
$^{2}$Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava
{viktor.kocur, haladova}@fmph.uniba.sk
$^{3}$ETH Zürich, Computer Vision and Geometry Group, Switzerland
[email protected]
$^{4}$Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague
[email protected]
Abstract

We study challenging problems of estimating the relative pose of three cameras and propose novel efficient solutions to the configurations (1) of four points in three calibrated cameras (the 4p3v problem), and (2) of four points in three cameras with unknown shared focal length (the 4p3vf problem). Our solutions are based on the simple idea of generating one or two additional virtual point correspondences in two views by using the information from the locations of the input correspondences. We generate such correspondences using a very simple and efficient strategy, where the new points are the mean points of three corresponding input points. The new solvers are efficient and easy to implement, since they are based on existing efficient minimal solvers, i.e., the well-known 5-point and 6-point relative pose solvers and the P3P solver. Extensive experiments on real data show that our solvers achieve state-of-the-art results. We also present a simple network that can improve the precision of the mean-point correspondences, showing the potential to learn better virtual point correspondences.

1 Introduction

Camera geometry estimation is crucial in many computer vision applications, e.g., visual navigation [48], Structure-from-Motion [52], augmented reality [5], self-driving cars [17], and visual localization [47]. Due to noise and outliers in the input correspondences, the predominant way for camera geometry estimation is to use a hypothesis-and-test framework, e.g., RANSAC [15, 8, 45, 2]. For RANSAC-like methods, using as few (ideally the minimal number of) correspondences as possible for estimation is important since the number of RANSAC iterations (and, thus, its run-time) grows exponentially with the number of correspondences required for the model estimation.
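To make the dependence on the sample size concrete, the following minimal sketch uses the standard textbook estimate for the number of RANSAC samples, $\log(1-p)/\log(1-w^{s})$, where $w$ is the inlier ratio, $s$ is the sample size, and $p$ is the desired confidence. The formula and the function name `ransac_iterations` are illustrative assumptions; they are not taken from the implementation evaluated in this paper.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int,
                      confidence: float = 0.99) -> int:
    """Samples needed so that, with probability `confidence`,
    at least one drawn sample consists of inliers only.

    Textbook estimate (an assumption here, not the paper's code):
        log(1 - confidence) / log(1 - inlier_ratio ** sample_size)
    """
    if not 0.0 < inlier_ratio < 1.0:
        raise ValueError("inlier_ratio must lie strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # Probability that one random sample contains only inliers.
    p_all_inliers = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers))


if __name__ == "__main__":
    # Compare sample sizes 4, 5, and 6 at a fixed inlier ratio.
    for s in (4, 5, 6):
        print(s, ransac_iterations(0.5, s))
```

Under this estimate and an inlier ratio of 0.5, moving from a four-point to a five-point sample roughly doubles the required number of samples, which is why solvers that need fewer correspondences are attractive inside RANSAC.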

Figure 1: Visualization of the 4p3v problem and our solution that is based on generating new virtual correspondence(s) between two views. This is done using coordinates of input point correspondences. Then the 4p3v problem is solved by existing efficient minimal solvers, i.e., the 5pt [34] and the P3P solvers [39].

Minimal camera geometry problems often result in complex systems of polynomial equations. Efficient algebraic methods helped to solve many previously unsolved problems [55, 4, 28, 25, 24, 54]. Still, they fail to generate efficient and/or numerically stable solutions for some problems. In this paper, we study such challenging problems of estimating the relative pose of three cameras. These problems have received attention for a long time [18, 44, 30, 33, 1]. However, due to their complexity, they are still not considered fully solved. There are no efficient and practical solutions for most of the minimal configurations of point and/or line correspondences [22]. One such configuration that is particularly interesting is the notoriously difficult configuration of four points in three views [44, 34]. This configuration is minimal for cameras with an unknown shared focal length, i.e., the 4p3vf problem, and provides one more constraint than minimal for calibrated cameras, i.e., the 4p3v problem.

State-of-the-art algebraic and numerical methods are known to fail in generating efficient and numerically stable solutions to the 4p3v and 4p3vf problems. The existing methods for solving these problems are either too slow for practical applications [7, 9] or are only approximate [19, 35]. Because they solve for only one or a few of the 272 solutions of the 4p3v problem [19], or discretely sample the space of potential solutions [35], the existing 4p3v methods can often fail, i.e., the returned solution can be, in general, arbitrarily far from the geometrically correct solution. To decrease the failure rate, both methods [19, 35] require a lot of tuning and are not easy to re-implement.$^{1}$

$^{1}$For the solver proposed in [35], there is no publicly available implementation. The publicly available implementation of the solver from [19] is quite complex and requires a non-negligible effort to run.

In this paper, we propose a novel approach for solving the 4p3v and 4p3vf problems. Our solutions are based on the simple idea of generating new approximate point correspondence(s) between two of the three views.$^{2}$ Such approximate correspondences are generated using only the locations of the original input point correspondences, without any information about the image itself (e.g., appearance or local features). Consequently, the new correspondences do not need to correspond to any physical 3D points in the scene. Thus, we call them virtual correspondences. Using virtual correspondences, we can efficiently solve the 4p3v and 4p3vf problems by first estimating the relative pose of two cameras from five/six correspondences using efficient 5pt/6pt solvers [34, 56], and then registering the third camera using a P3P solver [39]. We call these combinations the 5pt+P3P and 6pt+P3P solvers.

$^{2}$Note that similarly to the state-of-the-art solutions [35, 19], our solutions are only approximate. However, as we show in the experiments, they provide good initialization for local optimization [8] and outperform [19].

Based on this idea, we propose a group of novel solvers for the 4p3v and 4p3vf problems. These solvers, called M-based solvers (4p3v(M), 4p3vf(M)), use the mean points of three corresponding points detected in two views and, potentially, points in their vicinity ((M$\pm\delta$)-solvers) as new virtual point correspondences. To compensate for noise in these virtual correspondences, our solvers refine the solutions on the original four points in three views using just a few iterations of Levenberg-Marquardt refinement. While conceptually very simple and efficient, the novel solvers achieve state-of-the-art results on real data.

The contributions of the paper are as follows:

  • For the well-known challenging 4p3v problem for calibrated cameras, we propose novel M-based solvers. These solvers generate additional virtual point correspondence(s) in two views as the mean points of three corresponding points and refine the approximate solution on the original four points in three views. The new solvers achieve state-of-the-art results in terms of accuracy on real data. Compared to state-of-the-art 4p3v solvers [19, 35], which are non-trivial and difficult to re-implement in a numerically stable and fast way, our new solutions can be easily implemented using existing efficient implementations of the 5pt solver [34] and the P3P solver [39]. The source code of our solvers will be publicly available.

  • We provide efficient solutions to the 4p3vf problem for cameras with an unknown shared focal length. Our novel solvers generate for each instance two virtual correspondences to solve the problem via the efficient 6pt [34] and the P3P [39] solvers. Our solutions are significantly faster than the existing homotopy-continuation solutions [7, 9]. Our solvers show the potential of virtual correspondences to be applied to other camera geometry problems.

  • We present preliminary results for a simple network that can improve the precision of the mean-point correspondences. While our current versions of the learning-based (L-based) solvers do not provide sufficient improvement of virtual correspondences to produce a visible improvement after the refinement on all four correspondences, the proposed network shows the potential to learn better virtual point correspondences.

  • To the best of our knowledge, we are the first to extensively evaluate solutions to the 4p3v and 4p3vf problems on a large variety of real-world scenes and within state-of-the-art RANSAC frameworks, and to compare them to the baseline minimal 5pt+P3P and 6pt+P3P solvers on such data.

2 Related work

Estimating the relative pose of three cameras from a minimal number of point and line correspondences is known as an extremely challenging problem.

For three uncalibrated cameras, 6 point correspondences are necessary to estimate the trifocal tensor, with a solution known for a long time [43, 57]. Solutions to three minimal combinations of points and lines are presented in [36]. The minimal configuration using 9 lines is much more challenging and was solved only recently [27]. However, the final solver is far from practical since it runs for 17.8 s.

For calibrated cameras, the configuration that attracts most of the attention is the configuration of four points in three views (the 4p3v problem). Note that this is not a minimal configuration since it generates 12 constraints for 11 degrees-of-freedom (DoF) (see also Section 3). The 4p3v problem is known to be extremely difficult to solve. Several papers present mostly theoretical results [30, 33, 1]. For four triplets of exact points without noise, it is shown that the 4p3v problem has, in general, a unique solution [18, 44].

To the best of our knowledge, there are only two reasonably efficient solutions to the 4p3v problem reported in the literature. The first solver [35] is based on a one-dimensional exhaustive search. It performs a sweep of a $10^{th}$-degree curve of possible epipoles. For each potential epipole, it computes the relative pose of two cameras, registers the third camera using three triangulated points, and finally extracts the solution minimizing the reprojection error of the fourth point in the third view. Evaluation of the solver on one potential epipole is fast. Yet, to obtain reasonably precise and stable results, usually 1,000 candidates need to be evaluated, and even then, refinement at multiple local minima is required to improve the precision. The runtimes reported for this solver were 1 to 12 ms, depending on the number of points searched. There is no publicly available implementation of this solver and it is not easy to re-implement. As such, the literature does not compare against the solver in experiments. As an upper bound of the performance of [35], we compare with an oracle version using the true epipole in the supplementary material (SM).

The second efficient solver to the 4p3v problem was published only recently [19]. In this paper, the authors first transform the 4p3v problem into a minimal problem by considering a line passing through the last correspondence in the third view. The resulting system of equations is solved using an efficient homotopy continuation (HC) method [14, 53]. To avoid computing large numbers of spurious solutions, an MLP-based classifier is trained. For a given problem $p$, it selects one or several starting problem-solution pairs (so-called anchors), such that the geometrically meaningful/correct solution of $p$ can be obtained by HC starting from this anchor. This strategy is fast, running in 16.3 $\mu s$ on average per solution. However, it has a high failure rate. The success rate of the 4p3v solver reported in [19] on two test datasets and data without noise is 26.3%. The authors of [19] do not show results for a real scenario, i.e., a RANSAC-like framework with noisy data. Providing such an evaluation, we show that our much simpler solvers outperform [19].

Solutions to the 4p3v problem for orthographic and scaled orthographic views were presented in [59, 32]. In [32], the author suggested an iterative approach for updating to perspective views, but reported results only on a few synthetic instances. According to our own experiments, the update does not work on real data with general perspective cameras, even after spending months on this issue.

Minimal configurations of points and lines in three calibrated views were studied in [13, 22, 12], aiming to classify and derive the number of solutions for different configurations. Solutions to two minimal configurations of combinations of points and lines (Chicago and Cleveland) were proposed in [14] and solved using an HC method [53]. Due to their complexity, the solvers are not practical.

Recently, the HC method was used to solve the 4p3vf problem for cameras with an unknown shared focal length [7, 9]. The running times of the CPU variants of the proposed solvers range from 250 ms to 1456 ms. Efficient GPU implementations run in 16.7 ms to 154 ms. These times are still too slow for practical applications. Due to slow run-times and their dependency on a GPU, we are not comparing with the GPU solvers on real data. A GPU HC method was also used to solve minimal problems of four points/six lines in three views for a generalized camera in [11].

In our solutions, we generate virtual correspondences. Virtual matches are also used in the literature on affine correspondences (AC) [38, 40, 41, 3]. There, points are sampled based on the affine feature geometry to generate point correspondences from affine ones. In our scenario, we are only given point correspondences, without associated feature geometry, and we predict additional point matches.

3 Estimating the relative pose of three cameras

Let $N$ cameras observe a set of 3D points $\mathcal{P}$. For each point $\mathbf{P}\in\mathcal{P}$, let $C_{\mathbf{P}}>1$ be the number of cameras that observe it. A necessary condition for a relative pose problem of $N$ calibrated cameras to be minimal is [14]

$$\sum_{\mathbf{P}\in\mathcal{P}}(2C_{\mathbf{P}}-3)=6N-7\enspace. \qquad (1)$$

Let $S_{m}$ denote a sample of $m$ 3D points and let $S_{m}^{k}$ denote a subset of $S_{m}$ with cardinality $k$. A configuration of points in $N=3$ views that satisfies the constraint (1) of a minimal problem is three points visible in all three cameras and two additional points visible in two of the three cameras. We will call this configuration $(S_{5},S_{5},S_{5}^{3})$.

The configuration of four points visible in all three cameras, i.e., the configuration $(S_{4},S_{4},S_{4})$, generates an over-constrained problem. In this case, we have one more constraint than DoF, i.e., in Eq. (1), we have $12>11$. A minimal solution would need to drop one constraint, e.g., by considering only a line passing through one of the points in the third view [19] or by considering a “half” point correspondence. Since, in practice, we always have full correspondences and sampling one less point in one view leads to an under-constrained problem, the configuration $(S_{4},S_{4},S_{4})$ is usually considered “minimal”.

For cameras with an unknown common focal length, we have one more DoF. This means that, for $N=3$, the right-hand side of equation (1) becomes $18-6=12$, resulting in $(S_{6},S_{6},S_{6}^{3})$ and $(S_{4},S_{4},S_{4})$ being minimal configurations.
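The counting argument around Eq. (1) can be checked with a few lines of code. The sketch below encodes each configuration as a list with one entry per 3D point, giving the number of views that observe it; the helper names `constraint_count` and `degrees_of_freedom` are hypothetical and introduced only for illustration.

```python
def constraint_count(views_per_point):
    """Left-hand side of Eq. (1): sum of (2 * C_P - 3) over all points."""
    if any(c < 2 for c in views_per_point):
        raise ValueError("every point must be observed by at least two cameras")
    return sum(2 * c - 3 for c in views_per_point)


def degrees_of_freedom(num_cameras, shared_unknown_focal=False):
    """Right-hand side of Eq. (1): 6N - 7 for calibrated cameras,
    plus one DoF when a shared focal length is unknown."""
    return 6 * num_cameras - 7 + (1 if shared_unknown_focal else 0)


# One entry per 3D point: in how many of the three views it is observed.
CONFIGURATIONS = {
    "(S5,S5,S5^3)": [3, 3, 3, 2, 2],
    "(S4,S4,S4)": [3, 3, 3, 3],
    "(S6,S6,S6^3)": [3, 3, 3, 2, 2, 2],
}

if __name__ == "__main__":
    for name, views in CONFIGURATIONS.items():
        print(name,
              "constraints:", constraint_count(views),
              "calibrated DoF:", degrees_of_freedom(3),
              "shared-focal DoF:", degrees_of_freedom(3, True))
```

With these counts, $(S_{5},S_{5},S_{5}^{3})$ balances the calibrated case (11 against 11), $(S_{4},S_{4},S_{4})$ yields 12 constraints and is therefore over-constrained for calibrated cameras but balanced for the shared-focal-length case, and $(S_{6},S_{6},S_{6}^{3})$ also yields 12.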

3.1 Calibrated cameras

In this section, we describe solutions for three calibrated cameras. We start with one baseline minimal solution for the minimal $(S_{5},S_{5},S_{5}^{3})$ configuration, followed by novel solutions for the “minimal” $(S_{4},S_{4},S_{4})$ configuration.

5pt+P3P solver:  The 5pt+P3P solver first estimates the relative pose of two cameras from 5 image point correspondences using the efficient 5pt solver [34]. Next, the three points visible in all three views are triangulated. Finally, the third camera is registered using the three 2D-3D point correspondences and the well-known efficient P3P solver [39].

This solver is straightforward and it is based on efficient existing solvers [34, 39]. This solver appears in the literature [13, 34, 46, 35]. Nister et al. [35] showed that the 5pt+P3P solver performs better than their dedicated 4p3v solver. However, in the most recent works [14, 19] that study the relative pose problem for three calibrated cameras, the 5pt+P3P solver is not discussed and is not used as a baseline for comparison. To our knowledge, the performance of this solver on real data and within state-of-the-art RANSAC frameworks in the context of the 4p3v problem has not been extensively studied. Our paper fills this gap in the literature.

Motivated by the efficient 5pt+P3P solver and the minimal $(S_{5},S_{5},S_{5}^{3})$ configuration, which, compared to the $(S_{4},S_{4},S_{4})$ configuration, leads to significantly simpler systems of polynomial equations, we next describe novel solvers to the calibrated 4p3v problem. They efficiently solve the $(S_{4},S_{4},S_{4})$ configuration by generating a virtual point correspondence in the first two views, and solving the resulting $(S_{5},S_{5},S_{5}^{3})$ problem using the 5pt+P3P solver.

4p3v(M) solver:  Our first solver is based on a simple observation: If we fix the $5^{\text{th}}$ point in the first view to be the mean $\mathbf{m}^{1}$ of three points $\left\{\mathbf{x}_{i}^{1},\mathbf{x}_{j}^{1},\mathbf{x}_{k}^{1}\right\},\;i,j,k\in\left\{1,\dots,4\right\}$, in this view, then the mean point $\mathbf{m}^{2}$ of the corresponding three points $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$ in the second view usually has a small epipolar error w.r.t. the ground truth relative pose. Thus, $\mathbf{m}^{1}\leftrightarrow\mathbf{m}^{2}$ is usually a good approximation of a correct correspondence.

This can be considered a surprising observation, since 4 (or actually 3) points in two views define an infinite number of camera poses that can observe these points. However, the reason this mean-point strategy works in practice comes from several simple facts and observations. (1) To generate a good correspondence, we only require that the point in the second view be reasonably close to the epipolar line defined by the mean point $\mathbf{m}^{1}$ in the first view, i.e., the 2D point does not need to correspond to one particular 3D point with a given depth.$^{3}$ (2) The epipolar line defined by $\mathbf{m}^{1}$ passes through the triangle defined by the corresponding three points $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$ in the second image. Thus, the maximum distance of $\mathbf{m}^{2}$ in the second image from the epipolar line is bounded by the maximum distance of $\mathbf{m}^{2}$ from $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$ (for a proof, see SM). (3) For practical applications, when used in RANSAC, it is not necessary that each triplet $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$ generates a good correspondence $\mathbf{m}^{1}\leftrightarrow\mathbf{m}^{2}$. Samples with a high level of noise in the mean-point correspondence are filtered inside RANSAC.$^{4}$ On a large number of different scenes, we observed that even if some image pairs have triplets of points that generate very noisy mean-point correspondences, there are usually enough triplets for which the noise in $\mathbf{m}^{2}$ is reasonably small to provide good estimates. (4) Four point correspondences in two views usually fix the space of possible poses such that the $5^{th}$ correspondence, even if noisy, often generates a pose that is not very far from the ground truth pose. Such a pose is usually sufficient as an initialization of non-linear optimization on the original four points in three views and subsequent local optimization on detected inliers. We support our observations by experiments on a large amount of synthetic and real data (see Sec. 4 and SM).

$^{3}$By fixing a point in one view, we are defining an epipolar line in the second view. Any of the points on this line (corresponding to 3D points with different depths) is in correspondence with the point in the first view.

$^{4}$This property was also used in the HC solver [19], which completely fails for many samples. These samples are filtered within RANSAC.

Motivated by these observations, our 4p3v(M) solver uses the mean points of three corresponding points detected in two views as a new $5^{\text{th}}$ point correspondence. The 4p3v problem is then solved using the 5pt+P3P solver.
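The construction of the mean-point correspondence can be sketched in a few lines. The code below is an illustration only: the function names are hypothetical, and enumerating all four triplets of the four input correspondences is an assumption made for the example, since the text above does not state which triplet is selected.

```python
from itertools import combinations


def mean_point(points):
    """Arithmetic mean of a list of 2D points given as (x, y) tuples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mean_point_correspondences(view1, view2):
    """For every triplet (i, j, k) of input correspondences, return
    (triplet, m1, m2), where m1 and m2 are the mean points of the
    triplet in the first and the second view, respectively."""
    if len(view1) != len(view2):
        raise ValueError("both views must contain the same number of points")
    result = []
    for triplet in combinations(range(len(view1)), 3):
        m1 = mean_point([view1[i] for i in triplet])
        m2 = mean_point([view2[i] for i in triplet])
        result.append((triplet, m1, m2))
    return result


if __name__ == "__main__":
    # Hypothetical normalized image coordinates of four correspondences.
    x_view1 = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (9.0, 9.0)]
    x_view2 = [(1.0, 0.0), (4.0, 0.0), (1.0, 3.0), (8.0, 9.0)]
    for triplet, m1, m2 in mean_point_correspondences(x_view1, x_view2):
        print(triplet, m1, "<->", m2)
```

Each returned pair $\mathbf{m}^{1}\leftrightarrow\mathbf{m}^{2}$ would then be appended to the four input correspondences in the first two views before calling a 5pt solver.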

4p3v(M$\pm\delta$) solver:  While the mean point correspondence used in the 4p3v(M) solver can provide a good approximation of a correct correspondence, as mentioned, it can also be noisy. In the 4p3v(M$\pm\delta$) solver, we thus, in addition to the mean point $\mathbf{m}^{2}=\left[x,y\right]$ of three points $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$ in the second image, generate two additional points. These points are (1) $\mathbf{m}^{2}_{\pm\delta}=\left[x\pm\delta,y\right]$ if the longest dimension of the triangle $\mathcal{T}^{2}=\Delta\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$ is in the x-direction or (2) $\mathbf{m}^{2}_{\pm\delta}=\left[x,y\pm\delta\right]$ if it is in the y-direction. All three points, i.e., $\mathbf{m}^{2}$, $\mathbf{m}^{2}_{-\delta}$, and $\mathbf{m}^{2}_{+\delta}$, are placed in correspondence with the mean point $\mathbf{m}^{1}$. The 4p3v(M$\pm\delta$) solver in the first step calls the 5pt solver [34] three times, with the $5^{th}$ correspondence being either $\mathbf{m}^{1}\leftrightarrow\mathbf{m}^{2}$, $\mathbf{m}^{1}\leftrightarrow\mathbf{m}^{2}_{-\delta}$, or $\mathbf{m}^{1}\leftrightarrow\mathbf{m}^{2}_{+\delta}$. The results of these three 5pt solvers are collected to create hypotheses for the relative pose of the first two cameras inside RANSAC. The shift $\delta$ is selected relative to the size of the triangle $\mathcal{T}^{2}$.
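A minimal sketch of how the three candidate points in the second image could be generated is given below. Two points are assumptions of this example and are not specified in the text: the "longest dimension" of the triangle is read as the larger side of its axis-aligned bounding box, and $\delta$ is set to a fraction `rel_shift` of that side (the default value 0.1 is a hypothetical placeholder).

```python
def shifted_mean_points(triangle2, rel_shift=0.1):
    """Return (m, m_minus, m_plus) for a triangle in the second image.

    triangle2: three (x, y) tuples.
    rel_shift: hypothetical fraction of the longest bounding-box side
               used as the shift delta (an assumption of this sketch).
    """
    if len(triangle2) != 3:
        raise ValueError("expected exactly three points")
    xs = [p[0] for p in triangle2]
    ys = [p[1] for p in triangle2]
    m = (sum(xs) / 3.0, sum(ys) / 3.0)
    extent_x = max(xs) - min(xs)
    extent_y = max(ys) - min(ys)
    if extent_x >= extent_y:
        # Longest dimension along x: shift the mean point horizontally.
        delta = rel_shift * extent_x
        return m, (m[0] - delta, m[1]), (m[0] + delta, m[1])
    # Longest dimension along y: shift the mean point vertically.
    delta = rel_shift * extent_y
    return m, (m[0], m[1] - delta), (m[0], m[1] + delta)


if __name__ == "__main__":
    m, m_minus, m_plus = shifted_mean_points([(0.0, 0.0), (6.0, 0.0), (0.0, 3.0)])
    print(m, m_minus, m_plus)
```

Each of the three returned points would be paired with $\mathbf{m}^{1}$ and passed to a separate 5pt solver call, so one sample yields three sets of two-view hypotheses.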

$\mathbf{4^{th}}$ point in the third view:  The 4p3v(M) and 4p3v(M$\pm\delta$) solvers actually solve the configuration $(S_4, S_4, S_4^3)$, i.e., they do not use the information from the $4^{th}$ point in the third view, $\mathbf{x}^3_4$. This information can be used in a solver for the $(S_4, S_4, S_4)$ configuration in two different ways:

Filtering (+F): $\mathbf{x}^3_4$ can be used to filter out geometrically infeasible solutions returned by the 5pt+P3P solver that is used inside the M-based solvers. The 5pt+P3P solver returns multiple solutions that can be evaluated w.r.t. $\mathbf{x}^3_4$. Since the returned solutions can be affected by larger noise in the mean-point correspondence, we do not simply select the solution with the smallest error on $\mathbf{x}^3_4$; instead, we keep all solutions whose epipolar error on $\mathbf{x}^3_4$ is smaller than twice the threshold used inside RANSAC. Our experiments show that this filtering can improve the speed of the proposed solvers. However, as a trade-off, there is a small drop in the precision of the solvers since, in some cases, geometrically correct solutions are filtered out.

Refinement (+R): $\mathbf{x}^3_4$ can be used to refine the solutions returned by the 5pt+P3P solver used inside the M-based solvers. These solutions have zero error on the original four points and the mean-point correspondence in the first two views, but can have a large error on $\mathbf{x}^3_4$. Rather, we want solutions that minimize the epipolar errors on the original four points in the three views. Note that $(S_4, S_4, S_4)$ is an overconstrained configuration and thus, for noisy data, there is in general no solution with zero error on all four points in the three views. We refine the poses by minimizing the epipolar errors of all four original points in the three views using Levenberg-Marquardt (LM) optimization, initialized with the solutions from the M-point solvers. Experiments with different numbers of iterations show that two iterations are usually sufficient to obtain an improvement (see SM).
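The filtering rule (+F) can be sketched as below. This is a simplified illustration under stated assumptions: it measures only the point-to-epipolar-line distance of $\mathbf{x}^3_4$ w.r.t. the first view, uses normalized image coordinates, and assumes the pose convention $X_3 = \mathtt{R} X_1 + \mathbf{t}$. The function names are hypothetical and the exact error used by the authors may differ.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x of a 3-vector."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential(R, t):
    """Essential matrix for X_b = R X_a + t, so that x_b^T E x_a = 0."""
    return skew(t) @ R

def epipolar_distance(E, xa, xb):
    """Distance of xb from the epipolar line E xa (normalized coordinates)."""
    xa_h = np.append(np.asarray(xa, dtype=float), 1.0)
    xb_h = np.append(np.asarray(xb, dtype=float), 1.0)
    line = E @ xa_h
    return abs(xb_h @ line) / np.hypot(line[0], line[1])

def filter_solutions(solutions, x4_view1, x4_view3, ransac_threshold):
    """Keep every (R13, t13) whose error on the 4th point is below 2x the threshold."""
    kept = []
    for R13, t13 in solutions:
        err = epipolar_distance(essential(R13, t13), x4_view1, x4_view3)
        if err < 2.0 * ransac_threshold:
            kept.append((R13, t13))
    return kept
```

Keeping all solutions under the relaxed threshold, rather than only the best one, mirrors the reasoning in the text: the candidate poses inherit noise from the virtual correspondence, so the smallest error on a single point is not a reliable selector.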

4p3v(L) solvers:  We also experiment with learning-based 4p3v(L) solvers, which, instead of using the mean-point correspondence, use a neural network to predict the virtual correspondence. In the network, we want to directly use the information from all four points in three views. We use the fact that four triplets of points, in general, define a unique relative pose of three calibrated cameras. We train a network that, given four such corresponding points in three views and a fixed $5^{\text{th}}$ point in the first view, predicts the corresponding $5^{\text{th}}$ point in the second view. We fix the $5^{\text{th}}$ point to the mean point $\mathbf{m}^{1}$, as defined in the M-based solvers. The network thus learns a shift of the mean point $\mathbf{m}^{2}$ in the second view that brings it into correspondence with $\mathbf{m}^{1}$. We use a lightweight architecture with a backbone of shared MLP layers, similar to [6], and the Sampson error as a loss function. Details on the architecture, loss function, and training data are in the SM. 4p3v(L$\pm\delta$) solvers can be defined similarly to the 4p3v(M$\pm\delta$) solvers (see SM).
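For reference, the Sampson error named above as the loss can be written in its standard textbook form as follows. How exactly it enters training is described only in the SM; using the predicted point as `x2`, the mean point $\mathbf{m}^{1}$ as `x1`, and an essential matrix from the ground truth pose is an assumption of this sketch.

```python
import numpy as np

def sampson_error(E, x1, x2):
    """First-order (Sampson) approximation of the squared geometric epipolar error.

    E: 3x3 essential matrix with x2_h^T E x1_h = 0 for a perfect correspondence.
    x1, x2: 2D points in normalized image coordinates.
    """
    x1_h = np.append(np.asarray(x1, dtype=float), 1.0)
    x2_h = np.append(np.asarray(x2, dtype=float), 1.0)
    Ex1 = E @ x1_h        # epipolar line of x1 in the second image
    Etx2 = E.T @ x2_h     # epipolar line of x2 in the first image
    residual = x2_h @ Ex1
    denom = Ex1[0] ** 2 + Ex1[1] ** 2 + Etx2[0] ** 2 + Etx2[1] ** 2
    return residual ** 2 / denom
```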

Our experiments show that the proposed network can, in general, improve the precision of the mean-point correspondence $\mathbf{m}^{1}\leftrightarrow\mathbf{m}^{2}$, resulting in better performance than the baseline 4p3v(M). However, as shown in Sec. 4, after adding the refinement (+R), the difference between the 4p3v(M)+R and 4p3v(L)+R solvers is negligible. Still, the proposed network can be seen as a first step towards a method that can learn better virtual point correspondences.

3.2 Partially calibrated cameras

To show the potential of the proposed mean-point strategy, we applied this idea to the very challenging 4p3vf problem of estimating the relative pose of three cameras with an unknown shared focal length from four correspondences. Our novel solvers for the 4p3vf problem follow the approach of generating virtual correspondences used in our 4p3v solvers for calibrated cameras. The idea is to transform the extremely complex 4p3vf problem into the problem solved by the efficient 6pt+P3P solver. The 6pt+P3P solver first estimates the unknown focal length and the relative pose of the first two cameras using the efficient 6pt solver [56] and then registers the third camera using the P3P solver [39]. This means that, in contrast to the 4p3v solvers presented in Section 3.1 that generate only one virtual correspondence (+ potentially additional shifted versions of this correspondence), our novel 4p3vf solvers need to generate two virtual correspondences to obtain six correspondences in the first two views.

In the 4p3vf(M) solver, we generate the two new virtual correspondences using the mean points of two different triplets of corresponding points in the first two cameras. Similarly to the calibrated case, we also test 4p3vf(L) and the (+R), (+F), and $\delta$-based variants of the 4p3vf solvers. More details on all 4p3vf solvers can be found in the SM.
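A minimal sketch of how four correspondences in the first two views can be extended to six is given below. The choice of the two triplets, here indices (0, 1, 2) and (1, 2, 3), and the function name are assumptions of this example; the selection actually used is described in the SM.

```python
import numpy as np

def add_virtual_correspondences(x1, x2, triplets=((0, 1, 2), (1, 2, 3))):
    """Append one mean-point correspondence per triplet to the four real ones.

    x1, x2: (4, 2) arrays of corresponding points in views 1 and 2.
    Returns two (4 + len(triplets), 2) arrays.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    v1 = np.array([x1[list(t)].mean(axis=0) for t in triplets])
    v2 = np.array([x2[list(t)].mean(axis=0) for t in triplets])
    return np.vstack([x1, v1]), np.vstack([x2, v2])
```

The six resulting correspondences are what a 6pt focal-length solver expects as input for the first two views.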

4 Experiments

We extensively evaluated the proposed solvers on a large variety of synthetic and real data to test their robustness to noise and outliers and to assess their performance inside state-of-the-art RANSAC frameworks [2, 26]. We compare our novel solvers with the homotopy continuation 4p3v(HC) solver [19] and with the 5pt+P3P and 6pt+P3P baseline minimal solvers for the $(S_5, S_5, S_5^3)$ and $(S_6, S_6, S_6^3)$ configurations, for calibrated cameras and cameras with an unknown shared focal length, respectively. To obtain upper bounds for the precision that can be achieved by our proposed solvers, we also consider ‘Oracle’ versions (denoted (O)), where we use correct correspondence(s), i.e., correspondences that satisfy the epipolar constraint for the ground truth relative pose of the first two cameras, as the $5^{\text{th}}$/$6^{\text{th}}$ virtual correspondence between these cameras. Since no code is publicly available for the 4p3v(N) solver [35], we tested only its ‘Oracle’ version: instead of doing a one-dimensional search over the $10^{\text{th}}$ degree curve of possible epipoles, we use the ground truth epipole. Since this solver performs worse than our ‘Oracles’, we report it only in the SM.

Experimental setup.  To obtain feature correspondences, we use SuperPoint [10] features with the LightGlue [31] matcher. We extract at most 2048 features per image. We perform matching for all three image pairs and keep only those matches that were consistently matched across all three views. We performed the evaluation within two RANSAC frameworks: PoseLib [26] and GC-RANSAC [2]. For the 5pt solver, we use [34] and for the P3P solver, we use [39]. In PoseLib, we perform LO [8] using LM optimization. In GC-RANSAC, we perform LO using non-minimal solvers for fitting models to larger-than-minimal samples. We tested different shifts for our $\delta$-based solvers (for the ablation study, see SM) and selected $\delta = 0.04 \cdot \texttt{(longest triangle dim.)}$.
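The cross-view consistency check on the matches can be sketched as follows. This is an illustration only: it assumes that each pairwise matching result is available as a dictionary from feature index to feature index (the hypothetical names `m12`, `m23`, `m13`), which is not necessarily how the actual pipeline stores matches.

```python
def consistent_triplets(m12, m23, m13):
    """Return (i, j, k) feature-index triplets matched consistently in all three pairs.

    m12, m23, m13: dicts mapping a feature index in the first image of the
    pair to its matched feature index in the second image of the pair.
    """
    triplets = []
    for i, j in sorted(m12.items()):
        k = m23.get(j)
        # Keep the track only if 1 -> 2 -> 3 agrees with the direct 1 -> 3 match.
        if k is not None and m13.get(i) == k:
            triplets.append((i, j, k))
    return triplets
```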

Evaluation measures.  Inspired by [20], we define the pose error as $\max\left(0.5(\mathtt{R}_{err}^{12}+\mathtt{R}_{err}^{13}),\ 0.5(\mathbf{t}_{err}^{12}+\mathbf{t}_{err}^{13})\right)$, where $\mathtt{R}_{err}^{ij}$ and $\mathbf{t}_{err}^{ij}$ are the angular errors of rotation and translation for pair $ij$ in degrees [20]. We also report AUC values [20] at different thresholds for the pose error. In the SM, we include results for an alternative pose error definition that also includes $\mathtt{R}_{err}^{23}$ and $\mathbf{t}_{err}^{23}$.
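The pose error defined above can be computed as in the following sketch. The angular error formulas are the usual ones (rotation angle of $\mathtt{R}_{gt}^\top \mathtt{R}_{est}$ and angle between translation directions); whether [20] additionally handles the sign ambiguity of the translation is not covered here, and the dictionary layout is an assumption of this example.

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Angle (degrees) of the relative rotation R_gt^T R_est."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """Angle (degrees) between two translation directions."""
    cos = np.dot(t_gt, t_est) / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_error(gt, est):
    """gt, est: dicts mapping the pairs '12' and '13' to (R, t)."""
    pairs = ("12", "13")
    r = [rotation_error_deg(gt[p][0], est[p][0]) for p in pairs]
    t = [translation_error_deg(gt[p][1], est[p][1]) for p in pairs]
    # Maximum of the mean rotation error and the mean translation error.
    return max(0.5 * sum(r), 0.5 * sum(t))
```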

Figure 2: Left to right: Distribution of the average symmetric epipolar error (0.3319, 0.3308); rotation error (0.3373, 0.3347); translation error (0.3325, 0.3382); and percentage of inliers gathered (0.3377, 0.3354), as a function of the barycentric coordinates of the triangle in the second image w.r.t. the mean point of the corresponding triangle in the first image on 465k four-tuples of correspondences from scene Sacre Coeur from the PhotoTourism dataset [20]. For each metric, we fit a 2D Gaussian distribution and report the mean in brackets.

Mean-point strategy.  The first experiments aim to support our idea of selecting a virtual point correspondence as the mean points of three corresponding points in two images.

| Scene | AVG (°) | MED (°) | $20^{th}$ perc. (°) |
|---|---|---|---|
| Brandenburg Gate | 23.1 ± 20.9 / 19.5 ± 22.9 | 18.0 / 12.5 | 18.9 / 4.7 |
| Buckingham Palace | 25.7 ± 22.3 / 22.2 ± 23.4 | 19.4 / 14.8 | 19.0 / 5.7 |
| Colosseum Exterior | 27.5 ± 22.3 / 20.9 ± 25.2 | 22.4 / 12.4 | 10.4 / 4.1 |
| Grand Place Brussels | 25.3 ± 22.0 / 21.7 ± 23.5 | 19.5 / 15.1 | 19.4 / 5.9 |
| Notre Dame Front Facade | 27.9 ± 23.0 / 20.2 ± 26.5 | 23.2 / 12.0 | 11.5 / 4.2 |
| Palace of Westminster | 22.5 ± 21.8 / 19.2 ± 24.3 | 16.7 / 11.2 | 17.1 / 2.7 |
| Pantheon Exterior | 28.8 ± 21.2 / 24.5 ± 21.9 | 24.3 / 19.0 | 13.0 / 7.9 |
| Reichstag | 18.5 ± 23.2 / 17.2 ± 25.3 | 12.2 / 19.6 | 15.5 / 3.6 |
| Sacre Coeur | 23.8 ± 23.6 / 17.1 ± 25.3 | 17.1 / 18.1 | 17.5 / 2.2 |
| St Peters Square | 22.8 ± 21.0 / 21.1 ± 24.2 | 17.3 / 14.1 | 18.7 / 6.0 |
| Taj Mahal | 18.1 ± 24.7 / 15.7 ± 23.7 | 10.5 / 17.8 | 14.4 / 2.4 |
| Temple Nara Japan | 27.0 ± 25.1 / 23.6 ± 26.8 | 20.7 / 15.4 | 19.9 / 5.7 |
| Trevi Fountain | 30.3 ± 22.7 / 20.7 ± 23.5 | 25.9 / 12.3 | 12.0 / 4.2 |

Table 1: Comparing the accuracy of the 4pt(M) / 5pt solvers.

The accuracy of the mean point correspondence depends on a large number of variables, including the depths of the points w.r.t. the cameras, the angle under which the triangle formed by the three points is observed, the shape and size of the triangle, the type of motion, etc. A detailed analysis of all these factors, e.g., through synthetic experiments, is out of the scope of this paper. We thus study the accuracy of the mean point correspondences, and of the resulting estimated poses on real-world data. We only consider pairs instead of triplets as we create virtual correspondences in two views.

For the following experiments, we sample 100 four-tuples of point correspondences for each image pair in scenes from the PhotoTourism dataset [20]. The correspondences are SuperPoint+LightGlue matches that are consistent with the ground truth relative pose, i.e., inliers. We use the first three correspondences to define the triangles in both images.

In our first experiment, we establish correspondences between the mean of the triangle in one image and various points in the triangle in the second image. We express points in the second triangle via their barycentric coordinates and uniformly sample $19\times 19$ barycentric coordinates $(a,b)\in[0,1]^{2}$ such that $a+b\leq 1$ (ensuring that the points lie inside the triangle). The third coordinate is given as $c=1-a-b$. For each correspondence, we measure the symmetric epipolar error w.r.t. the ground truth pose, the translation and rotation errors, and the percentage of inliers consistent with the pose obtained with the 5pt solver applied to the virtual and the four real correspondences. Note that we are thus considering a 4-point relative pose problem.
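The barycentric sampling can be sketched as follows. This is one possible reading of the description: a regular $19\times 19$ grid on $[0,1]^2$ including the endpoints, from which only the pairs with $a+b\leq 1$ are kept. The function names and the assignment of $(a, b, c)$ to the three vertices are assumptions of this example.

```python
import numpy as np

def barycentric_grid(n=19):
    """Regular grid of barycentric coordinates (a, b, c) with a + b <= 1.

    Integer indices are used for the test a + b <= 1 to avoid
    floating-point round-off on the triangle boundary.
    """
    idx = [(i, j) for i in range(n) for j in range(n) if i + j <= n - 1]
    ab = np.array(idx, dtype=float) / (n - 1)
    c = 1.0 - ab.sum(axis=1, keepdims=True)
    return np.hstack([ab, c])

def barycentric_to_points(coords, triangle):
    """Map (K, 3) barycentric coordinates to 2D points of a (3, 2) triangle."""
    return coords @ np.asarray(triangle, dtype=float)
```

With `n=19` the grid step is $1/18$, so the mean point $(1/3, 1/3, 1/3)$ is itself one of the sampled locations.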

Fig. 2 shows the results of this experiment on the scene Sacre Coeur. It can be seen that the optimum of the studied metrics is reached around the mean point of the triangles. To suppress the effect of discrete sampling, for each metric, we fit a 2D Gaussian distribution and report the mean value (in barycentric coordinates) as numbers in brackets in the caption of the figure. The mean values of the 2D Gaussians are very close to the mean point of the triangles, which has barycentric coordinates $(0.\bar{3}, 0.\bar{3})$. This clearly validates our approach of using mean-point correspondences. The results for more real scenes and synthetic scenes are in the SM. For all these scenes, we observed a similar behavior.

In our second experiment, we establish the virtual correspondence between the mean points of the triangles. We compare the accuracy of the relative poses obtained by applying the 5pt solver on one virtual and four real correspondences (denoted as the 4pt(M) solver) with the accuracy obtained by the 5pt solver on five real correspondences. Tab. 1 shows the results of this comparison. As can be seen, the 4pt(M) solver is not as accurate as the 5pt solver, which is not surprising given that the virtual correspondence is inherently noisier than the 5th real correspondence used by the 5pt solver. While there is a large gap on some scenes (Colosseum, Notre Dame, Trevi), the gap is noticeably smaller on others (Reichstag, St. Peters), showing that the performance of our solver is scene-dependent. Overall, the gap is not too large, showing that the idea of using a virtual mean-point correspondence is viable. Further, note that the solvers are applied outside of RANSAC and that local optimization inside RANSAC usually compensates for less accurate pose estimates.

Legend: 4p3v(M), 4p3v(M)+R, 4p3v(M)+R+F, 4p3v(M$\pm\delta$), 4p3v(M$\pm\delta$)+R, 4p3v(M$\pm\delta$)+R+F, 4p3v(HC), 5pt+P3P, 4p3v(O)+R+F
Panels: (a) Phototourism [20]; (b) Cambridge Landmarks [21]; (c) Reichstag [20]; (d) King’s College [21]
Figure 3: Speed-accuracy trade-off on (a) all scenes from Phototourism [20] except St. Peter’s Square, (b) 5 Cambridge Landmarks [21] scenes, (c) the Reichstag scene from [20], and (d) the King’s College scene from [21]. We report the AUC@10 of the pose error and vary the number of PoseLib RANSAC iterations (100, 200, 500, 1000, 2000, 5000, 10000). Runtimes are averaged over all image triplets.

Panels: (a) original matches; (b) w/ synth. outliers
Figure 4: Results on Brandenburg Gate. We show results using (a) only original triplet matches and (b) adding synthetic outliers to reach a 40% inlier ratio. The legend is provided in Fig. 3.

Noise experiments.  We next test the accuracy of our solvers and the state-of-the-art algorithms w.r.t. increasing image noise. We establish correspondences by projecting 3D points into the images and then add increasing amounts of normally distributed noise to the projections. Due to the approximate nature of our virtual correspondences, our novel solvers return non-zero errors for zero noise. However, at noise levels $\geq 2$ px, these solvers return comparable or even better results than the 5pt+P3P/6pt+P3P solvers. This again shows that our predicted virtual correspondences are good approximations of real correspondences. The recent state-of-the-art HC solver [19] fails in about 50% of the instances for noiseless data. Its median errors are thus significantly larger than those of the other solvers. See the SM for detailed results of this experiment.

Experiments on real data.  We test the solvers on all scenes from the PhotoTourism dataset [51, 20], which provide ground truth poses and intrinsics via a COLMAP [49] reconstruction. In the results, we do not include the St. Peter’s Square scene, which we used for the validation of $\delta$ and the number of refinement iterations (see SM). We also include results for the Cambridge Landmarks dataset [21] (except the Street scene, which is commonly not used due to issues with its ground truth). For PhotoTourism, we use the images in their original resolution. For Cambridge Landmarks, we resize the images so that the larger side is 800 px. For each scene, we sample 5,000 random image triplets with at least 10 matches and with at least $10\%$ overlap [20].

| Estimator | AVG (°) ↓ | MED (°) ↓ | AUC@5 ↑ | AUC@10 ↑ | AUC@20 ↑ | Runtime (ms) ↓ |
|---|---|---|---|---|---|---|
| Phototourism [20] | | | | | | |
| 4p3v(HC) [19] | 7.17 | 2.34 | 52.74 | 66.63 | 77.86 | 176.45 |
| 5pt+P3P | 5.99 | 2.00 | 57.31 | 70.54 | 80.81 | 105.50 |
| 4p3v(M) | 7.17 | 2.49 | 50.96 | 65.46 | 77.32 | 176.77 |
| 4p3v(M)+R | 6.39 | 2.00 | 56.92 | 69.90 | 80.17 | 174.94 |
| 4p3v(M)+R+F | 6.59 | 2.07 | 55.90 | 69.06 | 79.56 | 130.54 |
| 4p3v(M$\pm\delta$) | 6.39 | 2.19 | 54.70 | 68.58 | 79.59 | 172.06 |
| 4p3v(M$\pm\delta$)+R | 5.97 | 1.89 | 58.65 | 71.43 | 81.35 | 189.21 |
| 4p3v(M$\pm\delta$)+R+F | 6.15 | 1.97 | 57.42 | 70.47 | 80.69 | 175.78 |
| 4p3v(L) | 7.12 | 2.50 | 51.00 | 65.57 | 77.42 | 376.31 |
| 4p3v(L)+R | 6.35 | 2.00 | 56.88 | 69.88 | 80.15 | 297.88 |
| 4p3v(O) | 5.82 | 1.90 | 58.91 | 71.84 | 81.70 | 158.22 |
| 4p3v(O)+R | 5.73 | 1.80 | 60.23 | 72.75 | 82.21 | 186.30 |
| 4p3v(O)+R+F | 5.72 | 1.82 | 59.97 | 72.58 | 82.13 | 136.36 |
| Cambridge Landmarks [21] | | | | | | |
| 4p3v(HC) [19] | 9.69 | 3.31 | 40.96 | 58.84 | 72.83 | 164.49 |
| 5pt+P3P | 8.16 | 3.05 | 43.79 | 61.61 | 75.30 | 148.53 |
| 4p3v(M) | 9.61 | 3.42 | 39.71 | 58.08 | 72.54 | 138.33 |
| 4p3v(M)+R | 8.77 | 3.11 | 42.98 | 60.90 | 74.59 | 134.84 |
| 4p3v(M)+R+F | 9.03 | 3.17 | 42.31 | 60.18 | 74.03 | 116.51 |
| 4p3v(M$\pm\delta$) | 8.75 | 3.21 | 42.11 | 60.37 | 74.38 | 181.21 |
| 4p3v(M$\pm\delta$)+R | 8.32 | 3.01 | 44.17 | 62.04 | 75.60 | 184.98 |
| 4p3v(M$\pm\delta$)+R+F | 8.47 | 3.08 | 43.21 | 61.22 | 75.03 | 138.45 |
| 4p3v(L) | 9.58 | 3.44 | 39.79 | 58.17 | 72.62 | 232.13 |
| 4p3v(L)+R | 8.75 | 3.09 | 42.94 | 60.86 | 74.60 | 177.06 |
| 4p3v(O) | 8.62 | 3.07 | 43.58 | 61.65 | 75.30 | 130.43 |
| 4p3v(O)+R | 8.39 | 2.94 | 44.80 | 62.54 | 75.93 | 134.68 |
| 4p3v(O)+R+F | 8.36 | 2.97 | 44.64 | 62.38 | 75.82 | 116.41 |

Table 2: Results for different solvers implemented in the PoseLib framework [26] on all scenes from the PhotoTourism [20] and 5 scenes from the Cambridge Landmarks [21] datasets. We mark the best and second best results (excluding oracle solvers). Reported runtimes are for the whole RANSAC.

| Estimator | AVG (°) ↓ | MED (°) ↓ | AUC@5 ↑ | AUC@10 ↑ | AUC@20 ↑ | Runtime (ms) ↓ |
|---|---|---|---|---|---|---|
| 6pt+P3P | 11.16 | 3.53 | 38.44 | 56.79 | 70.89 | 111.72 |
| 4p3vf(M) | 18.85 | 4.20 | 33.77 | 49.88 | 62.64 | 112.67 |
| 4p3vf(M)+R | 19.53 | 4.30 | 33.23 | 49.08 | 61.71 | 114.74 |
| 4p3vf(M)+R+F | 19.87 | 4.38 | 32.83 | 48.62 | 61.28 | 112.85 |
| 4p3vf(M$\pm\delta$) | 18.83 | 4.29 | 33.18 | 49.26 | 62.14 | 120.23 |
| 4p3vf(M$\pm\delta$)+R | 18.49 | 4.24 | 33.56 | 49.77 | 62.70 | 121.76 |
| 4p3vf(M$\pm\delta$)+R+F | 19.09 | 4.33 | 33.04 | 49.07 | 61.91 | 119.98 |
| 4p3vf(L) | 20.50 | 4.47 | 32.49 | 47.91 | 60.27 | 138.26 |
| 4p3vf(L)+R | 20.82 | 4.54 | 32.10 | 47.47 | 59.83 | 141.27 |
| 4p3vf(L$\pm\delta$init)+R | 14.94 | 3.79 | 36.21 | 53.56 | 66.98 | 259.63 |
| 4p3vf(O) | 11.72 | 3.50 | 38.63 | 56.91 | 70.86 | 112.31 |
| 4p3vf(O)+R | 11.77 | 3.51 | 38.69 | 56.94 | 70.89 | 114.06 |

Table 3: Results for the 4p3vf problem on 5 scenes from Cambridge Landmarks [21]. $\delta$ and (+R) do not always need to improve performance due to the early stopping criterion in RANSAC and being far from the optimum. In this case, the $\delta$-version of the L-based solver performs the best for the 4p3vf problem (for more details on the 4p3vf(L$\pm\delta$init)+R solver, see SM).

Tab. 2 shows results for calibrated cameras when using early termination in PoseLib RANSAC at a $0.9999$ confidence threshold. We provide similar results for GC-RANSAC in the SM. As can be seen, with refinement (+R), all of our solvers outperform the state-of-the-art HC-based 4p3v(HC) solver [19] in terms of accuracy. Using filtering (+F) improves the run-time of RANSAC at the cost of a decrease in pose accuracy. Still, our 4p3v(M)+R+F solver outperforms 4p3v(HC) in terms of both accuracy and run-time. The 4p3v(M$\pm\delta$) and 4p3v(M$\pm\delta$)+R solvers clearly improve upon the 4p3v(M) solvers, albeit at an increased run-time without filtering. With filtering, 4p3v(M$\pm\delta$)+R+F is similarly efficient as 4p3v(M)+R at a (slightly) higher accuracy. The 4p3v(L) solvers slightly improve upon the 4p3v(M) solvers. While they produce more accurate virtual correspondences, refinement in the solvers (+R) compensates for the less accurate initial poses provided by the 4p3v(M) solvers, resulting in similar results for 4p3v(M)+R and 4p3v(L)+R. Still, the results for the 4p3v(L) solvers show a direction for improvement, namely learning to predict virtual correspondences. The results for the 4p3v(O) solvers show that there is room for improvement. The results for more variants of L-based solvers ($\delta$, +F, etc.) are in the SM.
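To illustrate why a smaller minimal sample matters under early termination, the sketch below evaluates the standard textbook bound on the number of RANSAC iterations needed to draw at least one all-inlier sample with a given confidence. The exact stopping rule implemented in PoseLib may differ, and the sample sizes of 4 and 5 triplet correspondences are assumptions of this example.

```python
import math

def ransac_iterations(inlier_ratio, sample_size, confidence=0.9999):
    """Textbook iteration bound N = log(1 - p) / log(1 - w^s)."""
    p_good = inlier_ratio ** sample_size  # probability of an all-inlier sample
    if p_good <= 0.0:
        return math.inf
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# Compare a 4-correspondence sample with a 5-correspondence sample.
n4 = ransac_iterations(0.4, 4)
n5 = ransac_iterations(0.4, 5)
```

Under this bound, the number of required iterations grows quickly with the sample size at low inlier ratios, which is the regime in which four-point solvers are expected to help.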

Method [19] and our solvers solve a different configuration of the three-view pose estimation problem than the 5pt+P3P solver ($(S_4, S_4, S_4)$ vs. $(S_5, S_5, S_5^3)$). Based on limited experiments, [35] concluded that 5pt+P3P outperforms their solver. [19] did not compare to 5pt+P3P. Tab. 2 rectifies this omission, showing that 4p3v(HC) is clearly less accurate than 5pt+P3P, while not being consistently faster. In contrast, our 4p3v(M)+R, 4p3v(M)+R+F, and 4p3v(M$\pm\delta$)+R+F perform comparably to 5pt+P3P at faster run-times. To the best of our knowledge, we are the first to show that solvers for the $(S_4, S_4, S_4)$ configuration are practically relevant.

We also investigate the speed-accuracy trade-off of the solvers by running PoseLib RANSAC for a set of fixed numbers of iterations. Runtimes are reported for 1 core of a 2 GHz Intel Xeon Gold 6338 CPU. As shown in Fig. 3, our 4p3v(M)+R+F provides a better speed-accuracy trade-off than 5pt+P3P on Phototourism and a slightly worse trade-off on Cambridge Landmarks. On both datasets 4p3v(M)+R+F performs better than 4p3v(HC). Fig. 3(c) shows results for the Reichstag scene, where also other variants of our method provide a better trade-off than 5pt+P3P. Fig. 3(d) shows results for King’s College, where 4p3v(M)+R+F outperforms 5pt+P3P. These results show the practical viability of our solvers in a time-constrained setting. Results for more scenes are in the SM.

Fig. 4 shows the potential of our solvers to handle scenarios with low inlier ratios. We synthetically remove outlier matches based on ground truth pose information and replace them with outliers distributed uniformly at random such that the inlier ratio is fixed to $0.4$ for all image triplets. Under the lower inlier ratio, our methods perform better than 5pt+P3P, while 5pt+P3P performs very similarly to 4p3v(M)+R+F in the higher inlier ratio scenario on the presented Brandenburg Gate scene. Results for more scenes are in the SM.
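The construction of such a fixed-inlier-ratio match set can be sketched as follows, assuming the inlier triplets have already been selected using the ground truth poses. The array layout, the `image_sizes` argument, and the function name are hypothetical and serve only to illustrate the counting.

```python
import numpy as np

def add_synthetic_outliers(inliers, image_sizes, inlier_ratio=0.4, seed=0):
    """Append uniformly random triplet matches until the inlier ratio is reached.

    inliers: (N, 3, 2) array of matched points in the three views.
    image_sizes: three (width, height) pairs, one per view.
    Returns the combined (N + M, 3, 2) array and the number M of outliers.
    """
    inliers = np.asarray(inliers, dtype=float)
    n_in = len(inliers)
    # Solve n_in / (n_in + n_out) = inlier_ratio for n_out.
    n_out = int(round(n_in * (1.0 - inlier_ratio) / inlier_ratio))
    rng = np.random.default_rng(seed)
    sizes = np.asarray(image_sizes, dtype=float)  # shape (3, 2)
    outliers = rng.uniform(0.0, 1.0, size=(n_out, 3, 2)) * sizes
    return np.concatenate([inliers, outliers], axis=0), n_out
```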

Shared unknown focal length case.  Tab. 3 shows results for the three-view relative pose problem for cameras with a shared unknown focal length. It compares our solvers (solving the $(S_4, S_4, S_4^3)$ configuration) with the 6pt+P3P solver (solving the $(S_6, S_6, S_6^3)$ configuration). While our 4p3vf solvers do not reach the accuracy of the 6pt+P3P solver, the results still show that our approach based on generating two virtual correspondences leads to practically useful solvers. To the best of our knowledge, ours are the first practical solvers for the $(S_4, S_4, S_4^3)$ configuration.

Limitations.  As shown above, the accuracy of our solvers is scene dependent. [Footnote 5: This weakness also applies to [19], since the scene needs to be similar enough to the training scenes for the MLP-based classifier to work well.] In addition, we observed that the performance of our approaches drops when the overlap between the images in a triplet is small. In this case, the correspondences form small triangles. This leads to unstable configurations as the distance between the points is relatively small compared to the noise in the points (especially in the virtual points). While correspondences found in a small image region cause the same issues for the 5pt solver, 5pt+P3P is more robust in scenarios with small overlap, as its $5^{th}$ correspondence has a chance to be farther away from the other correspondences than the virtual correspondences used in our solvers. Properly taking the uncertainty of the virtual points into account, i.e., propagating their uncertainty into the uncertainty of the estimated poses and using this uncertainty during inlier counting [16], is a promising direction to handle the fact that our solvers produce less accurate poses. Creating more accurate correspondences, e.g., by training better networks, is an alternative. At the moment, we use the first three correspondences to create the virtual correspondence. More sophisticated selection criteria, e.g., trying to maximize the size of the formed triangles, could also improve performance.

5 Conclusion

In this paper, we consider the highly challenging problems of relative pose estimation of three calibrated and partially calibrated cameras from four correspondences. We propose a novel and easy-to-implement framework that solves these problems using existing efficient solvers by simply predicting the $5^{\text{th}}$/$6^{\text{th}}$ point correspondence(s). We propose several solvers based on this framework, one simply using mean coordinates of input points (M-based solvers) and one using a trained predictor (L-based solvers), with multiple variants. Extensive experiments show that our solvers achieve state-of-the-art performance on real data for the challenging configuration of four points in three views.

6 Acknowledgements

C. T. was supported by the Czech Science Foundation (GAČR) JUNIOR STAR Grant (No. 22-23183M). V. K. was supported by the project no. 1/0373/23. and the TERAIS project, a Horizon-Widera-2021 program of the European Union under the Grant agreement number 101079338. (Part of the) Research results was obtained using the computational resources procured in the national project National competence centre for high performance computing (project code: 311070AKF2) funded by European Regional Development Fund, EU Structural Funds Informatization of society, Operational Program Integrated Infrastructure. D. B. was supported by the ETH postdoc fellowship. Z. B. H. was supported by the grant KEGA 004UK-4/2024 “DICH: Digitalization of Cultural Heritage”. T. S. was supported by the EU Horizon 2020 project RICAIP (grant agreement No. 857306) and the European Regional Development Fund under project IMPACT (No. CZ.02.1.01/0.0/0.0/15_003/0000468). Z. K. was supported by the Czech Science Foundation (GAČR) JUNIOR STAR Grant (No. 22-23183M).

Supplementary Material

This supplementary material provides additional details and experimental results promised in the main paper: Sec. 7 discusses M-solvers, provides the proof of the bound on the epipolar error mentioned in the main paper, and presents experiments supporting the choice of mean-point correspondences (see Sec. 3.1 in the main paper). Sec. 8 provides details on L-based solvers, namely information about the used network architecture and our training process, as well as about the $\delta$-variants of the L-based solvers, including 4p3vf(L$\pm\delta$init)+R, used in the paper (see Sec. 3.1 and Tab. 3 of the main paper). Sec. 9 provides details on the experiments mentioned in the main paper, namely experiments with different oracle solvers (see Sec. 4 of the main paper), ablations for the choice of the number of refinement iterations (Sec. 3.1 of the main paper) and the choice of $\delta$ (see Sec. 4 in the main paper), experiments with an additional evaluation measure (see Sec. 4 of the main paper), noise experiments (see Sec. 4 of the main paper), results with GC-RANSAC (see Sec. 4 of the main paper), and detailed plots over multiple scenes (see Sec. 4 of the main paper).

7 Using Mean Point Correspondences

Proof of the bounds on the epipolar error.  While the mean point correspondence used in the M-based solvers can provide a good approximation of a correct correspondence, such a correspondence can be noisy. Note that all state-of-the-art 4p3v solvers (including 4p3v(HC) [19] and 4p3v(N) [35]) rely on certain approximations without establishing theoretical proofs to quantify their accuracy. In contrast, the error of our virtual correspondence is bounded: As mentioned in the main paper, it can be proven that the error of the virtual correspondence $\mathbf{m}^{1}\leftrightarrow\mathbf{m}^{2}$ is bounded by the maximum distance of the mean point $\mathbf{m}^{2}$ from the vertices of the triangle $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$. Here we provide a simple proof.

Figure 5: Illustration of the considered situation.
Lemma 1.

Let us assume two cameras with camera centers $\mathbf{C}^{1}$ and $\mathbf{C}^{2}$ that observe 3D points $\mathbf{X}_{i},\mathbf{X}_{j},$ and $\mathbf{X}_{k}$ (see Figure 5 for an illustration). Let $\left\{\mathbf{x}_{i}^{1},\mathbf{x}_{j}^{1},\mathbf{x}_{k}^{1}\right\}$ and $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$ be the projections of these 3D points in camera 1 and camera 2, respectively. Let $\mathbf{m}^{1}$ be the mean point of the points $\left\{\mathbf{x}_{i}^{1},\mathbf{x}_{j}^{1},\mathbf{x}_{k}^{1}\right\}$ and let $\mathbf{E}$ be the essential matrix between these two cameras, i.e., a matrix that satisfies ${\mathbf{x}_{l}^{2}}^{\top}\mathbf{E}\mathbf{x}_{l}^{1}=0,\;l\in\left\{i,j,k\right\}$. Then the epipolar line $\mathbf{E}\mathbf{m}^{1}$ passes through the triangle $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$.

Proof.

The camera center $\mathbf{C}^{1}$ and the 3D points $\mathbf{X}_{i},\mathbf{X}_{j},$ and $\mathbf{X}_{k}$ form a tetrahedron $T^{1}$ (see Figure 5). The projections $\left\{\mathbf{x}_{i}^{1},\mathbf{x}_{j}^{1},\mathbf{x}_{k}^{1}\right\}$ in the first camera lie on the edges of this tetrahedron $T^{1}$. The ray from the camera center $\mathbf{C}^{1}$ through the mean point $\mathbf{m}^{1}$ thus lies inside the tetrahedron $T^{1}$ and intersects the plane defined by the 3D points $\mathbf{X}_{i},\mathbf{X}_{j},$ and $\mathbf{X}_{k}$ in a point $\mathbf{M}$ that lies inside the triangle defined by $\left\{\mathbf{X}_{i},\mathbf{X}_{j},\mathbf{X}_{k}\right\}$.

The camera center $\mathbf{C}^{2}$ and the 3D points $\mathbf{X}_{i},\mathbf{X}_{j},$ and $\mathbf{X}_{k}$ form a tetrahedron $T^{2}$. Again, the projections $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$ lie on the edges of the tetrahedron $T^{2}$. The ray passing through the camera center $\mathbf{C}^{2}$ and the 3D point $\mathbf{M}$ lies inside the tetrahedron $T^{2}$ and thus intersects the image plane of the second camera at a point that lies inside the triangle defined by the points $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$. By construction, the projection of $\mathbf{M}$ into the second camera lies on the epipolar line $\mathbf{E}\mathbf{m}^{1}$. Therefore, the epipolar line $\mathbf{E}\mathbf{m}^{1}$, which is the line connecting this point and the epipole $\mathbf{e}^{2}$, passes through the triangle $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$. ∎

It follows from Lemma 1 that, since the epipolar line $\mathbf{E}\mathbf{m}^{1}$ passes through the triangle $\left\{\mathbf{x}_{i}^{2},\mathbf{x}_{j}^{2},\mathbf{x}_{k}^{2}\right\}$, it contains a point $\mathbf{p}$ of this triangle, and the distance of the mean point $\mathbf{m}^{2}$ to the epipolar line is at most $\lVert\mathbf{m}^{2}-\mathbf{p}\rVert$. The distance from $\mathbf{m}^{2}$ to a point of the triangle is maximized at one of the vertices, so the distance of $\mathbf{m}^{2}$ to the epipolar line $\mathbf{E}\mathbf{m}^{1}$ is bounded by the maximum distance of $\mathbf{m}^{2}$ to the vertices of the triangle.
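
The bound can also be checked numerically. The following is a minimal sketch under stated assumptions: calibrated image coordinates, the first camera at the origin, the convention $\mathbf{E}=[\mathbf{t}]_{\times}\mathtt{R}$ for a second camera that observes $\mathtt{R}\mathbf{X}+\mathbf{t}$, and all three points in front of both cameras. The helper names are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def rotation(axis, angle):
    """Rodrigues formula for a rotation by `angle` about `axis`."""
    k = skew(np.asarray(axis, float) / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)

def epipolar_distance_and_bound(X, R, t):
    """X: (3, 3) array, one 3D point per row, in the frame of camera 1.
    Returns the distance of m^2 to the line E m^1 and the maximum
    distance of m^2 to the triangle vertices in the second image."""
    x1 = X[:, :2] / X[:, 2:3]             # projections in camera 1
    X2 = X @ R.T + t                      # points in the frame of camera 2
    x2 = X2[:, :2] / X2[:, 2:3]           # projections in camera 2
    m1 = np.append(x1.mean(axis=0), 1.0)  # homogeneous mean point, view 1
    m2 = x2.mean(axis=0)                  # mean point, view 2
    line = skew(t) @ R @ m1               # epipolar line E m^1
    dist = abs(line[:2] @ m2 + line[2]) / np.hypot(line[0], line[1])
    bound = np.linalg.norm(x2 - m2, axis=1).max()
    return dist, bound

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 3),
                     rng.uniform(-1, 1, 3),
                     rng.uniform(4, 8, 3)])
R = rotation(rng.normal(size=3), 0.2)
t = np.array([1.0, 0.1, 0.0])
dist, bound = epipolar_distance_and_bound(X, R, t)
print(dist, bound)
```

In this setting the first returned value should never exceed the second, which is exactly the statement derived from Lemma 1.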

Figure 6: Left to right: Distribution of the average symmetric epipolar error (top: 0.3337, 0.3327) (bottom: 0.3355, 0.3290); rotation error (top: 0.3373, 0.3349) (bottom: 0.3261, 0.3496); translation error (top: 0.3336, 0.3417) (bottom: 0.3213, 0.3515); and percentage of inliers gathered (top: 0.3266, 0.3434) (bottom: 0.3198, 0.3552), as a function of the barycentric coordinates of the triangle in the second image w.r.t. the mean point of the corresponding triangle in the first image on 485k four-tuples of correspondences from scene (top) St. Peter’s Square, (bottom) Temple Nara Japan from the PhotoTourism dataset [20]. For each metric, we fit a 2D Gaussian distribution and report the mean of the distribution in brackets.

Experiments validating the use of mean point correspondences.  The error of the relative poses estimated with virtual correspondences depends on many aspects, e.g., the baseline and the view angles of the cameras w.r.t. the three points used to compute the mean points, the depth of these points, the size and shape of the triangles defined by the three points, the type of motion of the cameras, the level of noise in the correspondences, etc. Isolating the impact of the individual aspects, e.g., through experiments on synthetic data, is highly non-trivial (e.g., how to generate realistic synthetic scenarios that allow conclusions to generalize to real-world scenarios) and analysing the co-dependencies between different aspects on the overall performance seems to need a paper on its own.

In the main paper, we thus presented results on real-world scenes, without trying to isolate individual factors (see Figure 2 and Table 1 in the main paper). Fig. 2 in the main paper showed results obtained by establishing correspondences between the mean of the triangle in one image and various points in the triangle in the second image. We expressed points in the second triangle via their barycentric coordinates and uniformly sampled $19\times 19$ barycentric coordinates $(a,b)\in[0,1]^{2}$ such that $a+b\leq 1$ (ensuring that the points lie inside the triangle). The third coordinate is given as $c=1-a-b$. For each correspondence, we measured the symmetric epipolar error w.r.t. the ground truth pose, the translation and rotation errors, and the percentage of inliers consistent with the pose obtained with the 5pt solver applied to the virtual and the four real correspondences (denoted as the 4pt(M) solver). Fig. 2 in the main paper showed results for the Sacre Coeur scene from the PhotoTourism dataset [20]. Here, Fig. 6 shows the same statistics for two more scenes from the PhotoTourism dataset, St. Peter’s Square (top row) and Temple Nara Japan (bottom row). As with Fig. 2 in the main paper, for each metric, we fit a 2D Gaussian distribution and report the mean value (in barycentric coordinates) as numbers in brackets in the caption of the figure. As can be seen, the same conclusion can be drawn from Fig. 6 as from Fig. 2 in the main paper: The optima of the studied metrics are reached very close to the mean point of the triangles, which has barycentric coordinates $(0.\bar{3},0.\bar{3})$.
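
The sampling of candidate points can be sketched as follows. This is a minimal illustration, assuming a regular $19\times 19$ grid over $[0,1]^{2}$ from which the pairs with $a+b\leq 1$ are kept; the function names and the assignment of weights to triangle vertices are hypothetical choices.

```python
import numpy as np

def barycentric_grid(n=19):
    """Regular n x n grid of (a, b) in [0, 1]^2, restricted to a + b <= 1.
    Returns rows (a, b, c) with c = 1 - a - b."""
    vals = np.linspace(0.0, 1.0, n)
    a, b = np.meshgrid(vals, vals, indexing="ij")
    keep = a + b <= 1.0 + 1e-9  # tolerance for floating-point round-off
    a, b = a[keep], b[keep]
    return np.column_stack([a, b, 1.0 - a - b])

def triangle_points(vertices, weights):
    """Map barycentric weights (N, 3) to 2D points for triangle vertices (3, 2)."""
    return weights @ vertices

weights = barycentric_grid()
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy triangle
candidates = triangle_points(triangle, weights)
```

With 19 samples per axis the grid spacing is $1/18$, so the mean point with $a=b=1/3=6/18$ is itself one of the sampled candidates.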

Figure 7: Distribution of the average symmetric epipolar error as a function of barycentric coordinates of the point in the second image w.r.t. the mean point of three points in the first image, for (left) synthetic data and (right) real data in the form of the Shop Facade scene from the Cambridge Landmarks dataset [21].

In addition to the experiments on scenes from the PhotoTourism dataset, Fig. 7 shows results for the symmetric epipolar error on (left) synthetic data and (right) the Shop Facade scene from the Cambridge Landmarks dataset [21]. For the synthetic experiment, we generated 10k scenes with known ground truth parameters. In each scene, the three 3D points were randomly distributed within a cube of size $10\times 10\times 10$. Each 3D point was projected into two cameras. The orientations and positions of the cameras were selected at random such that they looked towards the origin from a random distance between $20$ and $50$ from the scene.

Fig. 7 shows the average symmetric epipolar error as a function of the barycentric coordinates of the point in the second image. The 2D Gaussian distribution fitted to the results on synthetic scenes has mean $\mu=(0.3319,0.3308)$. The distribution of errors for the Shop Facade scene is very similar to the synthetic data with the minimum at $(0.333,0.333)$. In both cases, the means of the fitted Gaussian distributions are very close to the mean of the triangle (which has barycentric coordinates $(0.\bar{3},0.\bar{3})$). All the above-mentioned experiments clearly validate our approach of using mean point correspondences.

| Estimator | AVG ($^{\circ}$) $\downarrow$ | MED ($^{\circ}$) $\downarrow$ | AUC@5 $\uparrow$ | @10 | @20 | Time (s) $\downarrow$ |
|---|---|---|---|---|---|---|
| 5pt | 5.04 | 0.89 | 63.71 | 74.45 | 83.11 | 0.04 |
| 4pt(M) | 5.53 | 0.93 | 61.48 | 72.19 | 81.30 | 0.03 |
| 4pt(M$\pm\delta$) | 5.21 | 0.90 | 61.80 | 72.33 | 81.27 | 0.02 |
| 4pt(O) | 4.40 | 0.81 | 65.30 | 75.88 | 84.43 | 0.03 |

Table 4: The average and median pose errors in degrees, and Area Under the recall Curve (AUC), thresholded at 5$^{\circ}$, 10$^{\circ}$, and 20$^{\circ}$, as well as the average run-time (in seconds) on 9,900 image pairs from two scenes, Sacre Coeur and St. Peter’s Square, from the PhotoTourism dataset [20].

Tab. 2 in the main paper compares the relative pose accuracy achieved by the 4pt(M) solver with the accuracy of the classical 5pt relative pose solver. As can be seen, the 4pt(M) solver is not as accurate as the 5pt solver, as the 5th correspondence used by the 4pt(M) solver (the mean point correspondence) is significantly noisier than the one used by the 5pt solver. Tab. 4 shows pose accuracy results obtained by running the 4pt(M) solver (and its variant 4pt(M$\pm\delta$)) and the 5pt solver inside GC-RANSAC on a total of 9,900 image pairs from the Sacre Coeur and St. Peter’s Square scenes. While the individual poses returned by both 4-point solvers are less accurate, RANSAC (and in particular local optimization inside RANSAC) can compensate for this, leading to comparable performance. This experiment not only validates the mean-point strategy, but also shows that virtual correspondences can be used to solve minimal problems from sub-minimal samples.
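
AUC values of the kind reported in Tab. 4 are commonly obtained from per-pair pose errors by integrating the recall curve up to each threshold and normalizing by the threshold. The sketch below shows this widely used computation; that it matches the exact evaluation protocol used for the table is an assumption, and `pose_auc` is a hypothetical helper name.

```python
import numpy as np

def pose_auc(errors, thresholds):
    """Area under the recall-vs-error curve, normalized to [0, 1] per threshold.
    errors: per-pair pose errors in degrees; thresholds: positive values in degrees."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])  # start the curve at (0, 0)
    recall = np.concatenate([[0.0], recall])
    aucs = []
    for thr in thresholds:
        last = np.searchsorted(errors, thr)   # number of curve points below thr
        e = np.concatenate([errors[:last], [thr]])
        r = np.concatenate([recall[:last], [recall[last - 1]]])
        area = np.sum(0.5 * (r[1:] + r[:-1]) * (e[1:] - e[:-1]))  # trapezoid rule
        aucs.append(area / thr)
    return aucs

example = pose_auc([0.8, 1.5, 3.0, 12.0, 40.0], thresholds=[5.0, 10.0, 20.0])
```

The returned values lie in $[0,1]$; multiplying them by 100 gives percentages comparable to the AUC columns of the tables.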

Tab. 4 also shows results for the oracle variant 4pt(O) of our solvers. As can be seen, there is still some room for improvement when predicting the 5th virtual correspondence, e.g., using a learning-based method. While this can also bring an improvement for two views, such a learning-based method has a higher potential for improvement for the 4p3v problem, where the information from four points in three views fixes the pose of the input cameras that observe these points.

The method based on virtual correspondences can theoretically be applied to any camera geometry problem. However, we see larger promise in relative pose problems, where it is sufficient to find one 2D point that is sufficiently close to the epipolar line. For absolute pose solvers, a virtual correspondence will need to be close to a 2D point instead of an epipolar line.

8 L-based Solvers

Figure 8: The architecture of our network. With LR we denote the Leaky ReLU activation function [58]. The input consists of four 6-vectors, one for each correspondence. Each 6-vector contains the $x$ and $y$ coordinates of a correspondence in three views. First, the input passes through a shared MLP block, so that each 6-vector is processed independently. Then, the output feature vectors of dimension 32 are aggregated using a channel-wise max pooling function, and the result is concatenated at the end of each feature vector. Then, there is one more block of shared MLPs, the results of which are aggregated again by max pooling to get a single 64-vector. We reduce the dimension of this vector by passing it through an MLP block, the last layer of which has 2 nodes and a $\tanh$ activation function.

Network architecture and training details.  As described in Sec. 3.1 of the main paper, our 4p3v(L)/4p3vf(L) solvers rely on a neural network to predict a virtual correspondence. The following provides details on the network architecture and the training process.

We use a lightweight architecture with a backbone of shared MLP layers, similar to [6], so that each correspondence (a triplet of corresponding points) is processed independently. The input to our network is a $4\times 6$ matrix of 4 point correspondences where the $i^{th}$ row contains the $x$ and $y$ coordinates of the $i^{th}$ point in three views, i.e., the $i^{th}$ correspondence. In estimating the epipolar geometry, the order of the point correspondences does not matter. Thus, our network is designed to be permutation invariant on that input axis. The input is normalized as follows: We apply a rotation matrix to the points in each view independently, so that the mean point of the first three points is sent to $(0,0)$, and the fourth point is sent to $(0,y)$. Let $\mathbf{m}^{1},\mathbf{m}^{2}$, and $\mathbf{m}^{3}$ be the mean points of the three corresponding points in three views. We aim to predict the corresponding point of $\mathbf{m}^{1}$ in the second view. Let us denote this predicted point by $\tilde{\mathbf{m}}^{2}$, i.e., our $5^{th}$ virtual correspondence in the first two views will be $\mathbf{m}^{1}\leftrightarrow\tilde{\mathbf{m}}^{2}$. As suggested by the 4p3v(M) solver, $\tilde{\mathbf{m}}^{2}$ should be close to $\mathbf{m}^{2}$. Thus, we use the mean point $\mathbf{m}^{2}$ as the initialization of our network and predict a shift from $\mathbf{m}^{2}$.
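
One possible realization of the per-view normalization is sketched below. It assumes calibrated image coordinates, composes a rotation that brings the ray of the mean point onto the optical axis with a rotation about the optical axis that places the fourth point on the positive $y$-axis, and uses hypothetical function names; the sign convention for $y$ is also an assumption.

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix that maps unit vector a onto unit vector b (Rodrigues)."""
    v = np.cross(a, b)
    s, c = np.linalg.norm(v), float(a @ b)
    if s < 1e-12:
        return np.eye(3)  # already aligned (opposite vectors are not handled)
    k = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + k + (k @ k) * ((1.0 - c) / s**2)

def normalize_view(points):
    """points: (4, 2) calibrated coordinates of the four input points in one view.
    Returns the rotated (4, 2) coordinates and the applied 3x3 rotation."""
    rays = np.column_stack([points, np.ones(4)])
    mean_ray = rays[:3].mean(axis=0)
    # Step 1: rotate the ray of the mean point onto the optical axis, i.e. to (0, 0).
    r1 = rotation_between(mean_ray / np.linalg.norm(mean_ray),
                          np.array([0.0, 0.0, 1.0]))
    # Step 2: rotate about the optical axis so the fourth point lands on the y-axis.
    q = r1 @ rays[3]
    angle = np.arctan2(q[0], q[1])
    cs, sn = np.cos(angle), np.sin(angle)
    r2 = np.array([[cs, -sn, 0.0], [sn, cs, 0.0], [0.0, 0.0, 1.0]])
    rot = r2 @ r1
    out = rays @ rot.T
    return out[:, :2] / out[:, 2:3], rot

pts = np.array([[0.10, 0.20], [0.30, -0.10], [-0.20, 0.05], [0.25, 0.30]])
normalized, rot = normalize_view(pts)
```

Because the transformation is a pure rotation of the viewing rays, it corresponds to re-orienting the camera and does not change the underlying geometry.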

We use a simple MLP-based backbone, similar to [6, 42]. More precisely, our input consists of four 6-vectors of the $x$ and $y$ coordinates of four point correspondences in three views. The first part of the model is a shared MLP 3-block of dimensions 6, 32, and 32, exporting a 32-dimensional feature for every correspondence. Then we apply a channel-wise max pooling aggregation, the result of which is concatenated at the end of each 32-dimensional feature. This results in a 64-dimensional vector for each correspondence; these vectors are passed into another shared MLP 3-block of 64, 64, and 64 dimensions. We aggregate those vectors via a max pooling function to get a 64-dimensional vector encoding, which eventually passes through an MLP 3-block to reduce the dimension gradually from 64, to 32, and finally to 2, which will be the prediction of the $x$ and $y$ coordinates in the second camera. In all MLPs, in the first 2 layers of the blocks, we use a Leaky ReLU activation function [58] with slope 0.01. As for the last layer of the MLPs, in the first two blocks, we use a ReLU activation function, while in the last MLP we use a $\tanh$ activation, since we want the output to be in the range $(-1,1)$. For a visualization of the architecture, see Figure 8.
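
To make the data flow and the tensor shapes concrete, the following is a minimal NumPy sketch of a forward pass with randomly initialized (untrained) weights. It assumes that each MLP 3-block consists of three fully connected layers whose output widths are the listed dimensions, which is one reading of the description above. All names are hypothetical, and the sketch is not the trained PyTorch model.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def relu(x):
    return np.maximum(x, 0.0)

def init_block(rng, in_dim, widths):
    """Random weights for a block of three fully connected layers."""
    layers = []
    for out_dim in widths:
        w = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
        layers.append((w, np.zeros(out_dim)))
        in_dim = out_dim
    return layers

def run_block(x, layers, last_activation):
    """Leaky ReLU after the first two layers, last_activation after the third."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        x = leaky_relu(x) if i < 2 else last_activation(x)
    return x

def predict_shift(correspondences, params):
    """correspondences: (4, 6) array, one row per correspondence."""
    f = run_block(correspondences, params["block1"], relu)         # (4, 32)
    pooled = f.max(axis=0, keepdims=True)                          # (1, 32)
    f = np.concatenate([f, np.repeat(pooled, len(f), axis=0)], 1)  # (4, 64)
    f = run_block(f, params["block2"], relu)                       # (4, 64)
    code = f.max(axis=0)                                           # (64,)
    return run_block(code, params["block3"], np.tanh)              # (2,)

rng = np.random.default_rng(0)
params = {
    "block1": init_block(rng, 6, (6, 32, 32)),
    "block2": init_block(rng, 64, (64, 64, 64)),
    "block3": init_block(rng, 64, (64, 32, 2)),
}
shift = predict_shift(rng.normal(size=(4, 6)), params)
```

Since both aggregation steps use max pooling over the correspondence axis, permuting the rows of the input leaves the output unchanged, which is the permutation invariance required above.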

We used a simple network architecture to show the promise of 4p3v(L)/4p3vf(L) solvers, namely that learning can produce more accurate point correspondences. The experiments shown in the paper verify this behavior. We believe that more advanced network architectures (e.g., using a set transformer architecture [29]) have the potential to improve the results even more and reduce the gap between the 4p3v/4p3vf solvers and the oracle 4p3v(O)/4p3vf(O) solvers.

Our loss function is the Sampson error $\mathcal{L}_{S}$ of the prediction to the epipolar line of $\mathbf{m}^{1}$ in the $2^{\text{nd}}$ view:

$$\mathcal{L}_{S}=\frac{{\tilde{\mathbf{m}}^{2}}^{\top}\mathbf{E}\mathbf{m}^{1}}{\sqrt{\lVert\mathbf{E}\mathbf{m}^{1}\rVert^{2}+\lVert\mathbf{E}^{\top}\tilde{\mathbf{m}}^{2}\rVert^{2}}}\enspace.\tag{2}$$
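
A direct NumPy transcription of Eq. (2) is sketched below, assuming homogeneous 3-vectors for both points. The optional flag `first_two_only` is an addition of this sketch: many implementations of the Sampson error restrict the two norms in the denominator to the first two components of each epipolar line, whereas the default follows Eq. (2) as written. The function name is hypothetical.

```python
import numpy as np

def sampson_loss(E, m1, m2_pred, first_two_only=False):
    """Signed Sampson-type error of the correspondence m1 <-> m2_pred.
    E: 3x3 essential matrix; m1, m2_pred: homogeneous 3-vectors."""
    numerator = float(m2_pred @ E @ m1)
    line2 = E @ m1          # epipolar line of m1 in the second view
    line1 = E.T @ m2_pred   # epipolar line of m2_pred in the first view
    if first_two_only:
        line2, line1 = line2[:2], line1[:2]
    return numerator / np.sqrt(line2 @ line2 + line1 @ line1)

# Toy example: identity rotation, translation t = (1, 0, 0), so E = [t]_x.
E = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
m1 = np.array([0.05, -0.025, 1.0])      # projection of X = (0.2, -0.1, 4) in view 1
m2_exact = np.array([0.3, -0.025, 1.0]) # projection of the same point in view 2
m2_noisy = np.array([0.3, 0.0, 1.0])    # prediction off the epipolar line
loss_exact = sampson_loss(E, m1, m2_exact)
loss_noisy = sampson_loss(E, m1, m2_noisy)
```

The value is signed; for training, its absolute value or square would typically be minimized, which is again an assumption about details not spelled out here.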

For both training and validation, we use synthetic data. Our synthetic dataset contains 1M input instances. 70% of the dataset is used for training while the remaining 30% is used for validating the performance of the network. We generate 10K 3D points inside a $10\times 10\times 10$ cube, and to generate each instance, we pick 4 random points and project them to 3 cameras with random rotations and translations that view the cube from a random distance between 20 and 50 units.
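
The generation of one input instance can be sketched as follows. The sketch assumes a cube centered at the origin, cameras that look at the origin with a random roll, unit focal length, and a column layout $(x, y)$ per view; these choices and all names are hypothetical and only illustrate the procedure described above.

```python
import numpy as np

def random_camera(rng, min_dist=20.0, max_dist=50.0):
    """Random camera looking at the origin; returns (R, C) with X_cam = R @ (X - C)."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    center = rng.uniform(min_dist, max_dist) * direction
    z = -direction                       # optical axis points towards the origin
    x = np.cross(rng.normal(size=3), z)  # random roll about the optical axis
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z]), center

def project(points, R, center):
    """Pinhole projection with unit focal length (calibrated coordinates)."""
    cam = (points - center) @ R.T
    return cam[:, :2] / cam[:, 2:3]

def make_instance(rng, cloud):
    """One instance: 4 random 3D points seen by 3 random cameras, shape (4, 6)."""
    pts = cloud[rng.choice(len(cloud), size=4, replace=False)]
    views = [project(pts, *random_camera(rng)) for _ in range(3)]
    return np.concatenate(views, axis=1)

rng = np.random.default_rng(0)
cloud = rng.uniform(-5.0, 5.0, size=(10_000, 3))  # 10K points in a 10 x 10 x 10 cube
instance = make_instance(rng, cloud)
```

Because the camera distance is at least 20 while every point lies within about 8.7 units of the origin, all points have positive depth in every sampled camera.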

The network is implemented in PyTorch [37], and we use the Adam optimizer [23] for the training. We train it in batches of 1024 input instances, with a fixed learning rate of 1e-5. In our experiments, the network converges in about 30 epochs.

4p3v(L$\pm\delta$), 4p3vf(L$\pm\delta$), and 4p3vf(L$\pm\delta$init) solvers.  Similar to the 4p3v(M$\pm\delta$) solver, we try to compensate for potential noise in the prediction $\tilde{\mathbf{m}}^{2}$ returned by the network by running the 5pt solver for the first two views three times for three different virtual correspondences. We test two variants: (1) In the 4p3v(L$\pm\delta$) and 4p3vf(L$\pm\delta$) solvers, we add a shift $\pm\delta$ to the predicted point $\tilde{\mathbf{m}}^{2}$, similar to the 4p3v(M$\pm\delta$) solver. In this way, we generate two additional virtual correspondences. (2) In the 4p3vf(L$\pm\delta$init) solver, we add a shift $\pm\delta$ to the initialization $\mathbf{m}^{2}$ of the network. Thus, we run the network three times with three initializations, namely $\mathbf{m}^{2}$, $\mathbf{m}^{2}_{+\delta}$, and $\mathbf{m}^{2}_{-\delta}$. Each initialization affects the normalization of the points in the 2nd view, in the sense that a different point (the initialization) is sent to $(0,0)$, leading to a different input and thus to a different predicted virtual correspondence.

9 Experiments

This section provides more details on the experiments presented in Sec. 4 of the main paper. More precisely, Sec. 9.7 provides details and experiments on the oracle solvers discussed in the main paper. Sec. 9.1 provides ablation studies for the number of iterations of the refinement strategy (+R) and for the choice of $\delta$. Sec. 9.2 provides details on the noise experiments summarized in the main paper. Sec. 9.3 provides experiments with an evaluation measure taking the relative pose error between the 2nd and 3rd camera in a triplet into account. Sec. 9.4 presents results on the PhotoTourism dataset obtained with GC-RANSAC. Sec. 9.5 presents additional experiments with adding outliers to image triplets (similar to the experiments presented in Fig. 4 in the main paper). Sec. 9.6 presents the additional details on some experiments promised in Sec. 4 of the main paper. Finally, Sec. 9.8 measures the run-times of the different solvers considered in this work.

9.1 Ablation studies

| Estimator | $\delta$ | AVG ($^{\circ}$) $\downarrow$ | MED ($^{\circ}$) $\downarrow$ | AUC@5 $\uparrow$ | @10 $\uparrow$ | @20 $\uparrow$ |
|---|---|---|---|---|---|---|
| 4p3v(M$\pm\delta$) | 0.2 | 6.33 | 3.89 | 36.88 | 57.46 | 74.74 |
| | 0.1 | 6.28 | 3.82 | 37.50 | 57.88 | 74.95 |
| | 0.09 | 6.29 | 3.77 | 37.98 | 58.27 | 75.18 |
| | 0.08 | 6.31 | 3.72 | 38.32 | 58.53 | 75.36 |
| | 0.07 | 6.21 | 3.64 | 39.29 | 59.16 | 75.69 |
| | 0.06 | 6.18 | 3.68 | 39.16 | 59.18 | 75.76 |
| | 0.05 | 6.07 | 3.65 | 39.47 | 59.37 | 75.93 |
| | 0.04 | 5.99 | 3.62 | 39.57 | 59.52 | 76.01 |
| | 0.03 | 6.04 | 3.64 | 39.44 | 59.36 | 75.92 |
| | 0.02 | 6.18 | 3.68 | 38.74 | 58.73 | 75.42 |
| | 0.01 | 6.30 | 3.81 | 37.89 | 57.94 | 74.90 |
| | 0.005 | 6.39 | 3.88 | 36.62 | 56.94 | 74.36 |
| | 0.001 | 6.65 | 4.03 | 35.57 | 55.83 | 73.50 |
| 4p3v(M$\pm\delta$)+R | 0.2 | 5.89 | 3.40 | 41.40 | 60.73 | 76.71 |
| | 0.1 | 5.80 | 3.32 | 42.48 | 61.72 | 77.28 |
| | 0.09 | 5.71 | 3.31 | 42.37 | 61.62 | 77.32 |
| | 0.08 | 5.68 | 3.29 | 42.86 | 61.95 | 77.42 |
| | 0.07 | 5.65 | 3.25 | 43.25 | 62.27 | 77.68 |
| | 0.06 | 5.57 | 3.28 | 43.12 | 62.07 | 77.56 |
| | 0.05 | 5.59 | 3.24 | 43.23 | 62.18 | 77.66 |
| | 0.04 | 5.64 | 3.19 | 43.60 | 62.43 | 77.66 |
| | 0.03 | 5.66 | 3.22 | 43.68 | 62.49 | 77.82 |
| | 0.02 | 5.65 | 3.21 | 43.24 | 62.11 | 77.56 |
| | 0.01 | 5.79 | 3.29 | 42.54 | 61.75 | 77.18 |
| | 0.005 | 5.90 | 3.38 | 42.03 | 61.24 | 76.91 |
| | 0.001 | 6.09 | 3.46 | 40.80 | 59.99 | 75.98 |
| 4p3v(M$\pm\delta$)+R+F | 0.2 | 6.02 | 3.60 | 39.44 | 59.19 | 75.79 |
| | 0.1 | 6.00 | 3.50 | 40.69 | 60.27 | 76.36 |
| | 0.09 | 5.92 | 3.43 | 40.83 | 60.40 | 76.51 |
| | 0.08 | 5.91 | 3.44 | 41.10 | 60.61 | 76.60 |
| | 0.07 | 5.93 | 3.39 | 41.58 | 60.80 | 76.72 |
| | 0.06 | 5.79 | 3.38 | 41.85 | 61.01 | 76.91 |
| | 0.05 | 5.77 | 3.32 | 42.08 | 61.07 | 76.95 |
| | 0.04 | 5.79 | 3.35 | 42.00 | 61.13 | 76.94 |
| | 0.03 | 5.73 | 3.33 | 42.46 | 61.50 | 77.22 |
| | 0.02 | 5.81 | 3.33 | 42.12 | 61.01 | 76.80 |
| | 0.01 | 5.86 | 3.41 | 41.12 | 60.56 | 76.56 |
| | 0.005 | 5.96 | 3.51 | 40.64 | 60.12 | 76.19 |
| | 0.001 | 6.20 | 3.53 | 39.95 | 59.08 | 75.52 |
Table 5: Evaluation of the effects of the scale of the $\delta$ shift on the St. Peter’s Square scene from PhotoTourism [20].

Validation of $\delta$.  We tested our $\delta$-based solvers for different values of $\delta$ and measured their performance. In general, there is no common value of the $\delta$ shift that leads to the best results on all datasets. This is expected since the precision of the mean-point correspondence depends on many different factors, e.g., the viewing angles of the cameras, the type of the motion, the depth and spatial distributions of the 3D points, etc. We set the value for $\delta$ and the total number of refinement iterations by evaluating their effects on the St. Peter’s Square scene from the PhotoTourism dataset [20], which we used for validation and did not include in the other PhotoTourism results in the paper. Tab. 5 shows how the different settings of the scale of the $\delta$ shift affect the accuracy of the $\delta$-based solvers. Based on these experiments, we use $\delta=0.04$ as it provides the best results for the 4p3v(M$\pm\delta$) solver and is also close to the optimal value for its variants. The optimal choice of the $\delta$ parameter may be scene-dependent and could potentially be set by using learning-based approaches.

Figure 9: Evaluation of the effects of the number of inner refinement (+R) iterations within the 4p3v(M)+R+F solver on the St. Peter’s Square scene from PhotoTourism [20]. Shown is the speed-accuracy evaluation with different numbers of PoseLib RANSAC iterations.
Refer to caption
Refer to caption
Figure 10: Noise experiment showing the pose error measured as max(0.5(𝚁err12+𝚁err13),0.5(𝐭err12+𝐭err13))max0.5superscriptsubscript𝚁𝑒𝑟𝑟12superscriptsubscript𝚁𝑒𝑟𝑟130.5superscriptsubscript𝐭𝑒𝑟𝑟12superscriptsubscript𝐭𝑒𝑟𝑟13\text{max}\left(0.5(\mathtt{R}_{err}^{12}+\mathtt{R}_{err}^{13}),0.5(\mathbf{t% }_{err}^{12}+\mathbf{t}_{err}^{13})\right)max ( 0.5 ( typewriter_R start_POSTSUBSCRIPT italic_e italic_r italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT + typewriter_R start_POSTSUBSCRIPT italic_e italic_r italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT ) , 0.5 ( bold_t start_POSTSUBSCRIPT italic_e italic_r italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT + bold_t start_POSTSUBSCRIPT italic_e italic_r italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT ) ), for the calibrated 4p3v problem (left) and the partial calibrated 4p3vf problem (right) as a function of the noise scale in pixels. Here, 𝚁ijsubscript𝚁𝑖𝑗\mathtt{R}_{ij}typewriter_R start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT and 𝚝ijsubscript𝚝𝑖𝑗\mathtt{t}_{ij}typewriter_t start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT are the relative rotation and translation of the ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT and jthsuperscript𝑗thj^{\text{th}}italic_j start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT views, respectively.

Inner refinement validation.  We also perform validation of the total number of LM steps in the inner refinement (+R) shown in Fig. 9. We chose the value of 2 for other experiments as it provides the best speed-accuracy trade-off across a range of RANSAC iterations. However, we note that other settings may have very similar performance.

Figure 11: Noise experiment showing $\mathtt{R}_{err}^{12}$ (left) and $\mathbf{t}_{err}^{12}$ (right) as functions of the noise scale in pixels for the calibrated 4p3v problem (top) and the partially calibrated 4p3vf problem (bottom). Here $\mathtt{R}_{12}$ and $\mathbf{t}_{12}$ are the relative rotation and translation between the first two views.
Phototourism [20]
Estimator AVG ($^\circ$) $\downarrow$ MED ($^\circ$) $\downarrow$ AUC@5 $\uparrow$ AUC@10 $\uparrow$ AUC@20 $\uparrow$ Runtime (ms) $\downarrow$
4p3v(HC) [19] 11.41 3.89 37.90 53.76 68.08 176.45
5pt+P3P 9.88 3.41 41.36 57.36 71.13 105.50
4p3v(M) 11.65 4.21 35.38 51.73 66.77 176.77
4p3v(M)+R 10.52 3.47 41.01 56.62 70.25 174.94
4p3v(M)+R+F 10.79 3.58 40.09 55.73 69.56 130.54
4p3v(M$\pm\delta$) 10.44 3.72 38.82 55.07 69.51 172.06
4p3v(M$\pm\delta$)+R 9.84 3.27 42.62 58.30 71.74 189.21
4p3v(M$\pm\delta$)+R+F 10.14 3.40 41.45 57.24 70.92 175.78
4p3v(L) 11.56 4.21 35.32 51.77 66.84 376.31
4p3v(L)+R 10.48 3.47 41.02 56.60 70.25 297.88
4p3v(L$\pm\delta$init)+R 9.86 3.28 42.60 58.26 71.71 730.90
4p3v(O) 9.64 3.29 42.60 58.54 72.09 158.22
4p3v(O)+R 9.50 3.13 44.01 59.67 72.81 186.30
4p3v(O)+R+F 9.45 3.16 43.82 59.55 72.76 136.36
Cambridge Landmarks [21]
Estimator AVG ($^\circ$) $\downarrow$ MED ($^\circ$) $\downarrow$ AUC@5 $\uparrow$ AUC@10 $\uparrow$ AUC@20 $\uparrow$ Runtime (ms) $\downarrow$
4p3v(HC) [19] 15.12 5.51 24.50 43.42 61.10 164.49
5pt+P3P 13.38 5.17 25.94 45.38 63.16 148.53
4p3v(M) 15.31 5.68 22.93 42.06 60.26 138.33
4p3v(M)+R 14.28 5.29 25.37 44.57 62.33 134.84
4p3v(M)+R+F 14.60 5.39 24.89 43.97 61.71 116.51
4p3v(M$\pm\delta$) 14.18 5.36 24.72 44.12 62.15 181.21
4p3v(M$\pm\delta$)+R 13.67 5.11 26.17 45.56 63.33 184.98
4p3v(M$\pm\delta$)+R+F 13.89 5.23 25.43 44.78 62.70 138.45
4p3v(L) 15.24 5.72 22.98 42.15 60.35 232.13
4p3v(L)+R 14.23 5.27 25.33 44.49 62.30 177.06
4p3v(L$\pm\delta$init)+R 13.63 5.10 26.11 45.58 63.35 395.43
4p3v(O) 13.85 5.22 25.61 45.19 63.11 130.43
4p3v(O)+R 13.51 5.07 26.60 46.05 63.76 134.68
4p3v(O)+R+F 13.47 5.10 26.47 45.98 63.71 116.41
Table 6: Results for different solvers implemented in the PoseLib framework [26] on all scenes from the PhotoTourism [20] and 5 scenes from the Cambridge Landmarks [21] datasets with the alternative definition of pose error (3). We mark the best and second best results (excluding oracle solvers). Reported runtimes are for the whole RANSAC.
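The AUC@5, AUC@10 and AUC@20 columns summarize the distribution of per-triplet pose errors up to a threshold in degrees. The text does not spell out the exact formula, so the following is only a sketch of one common convention: the area under the cumulative error curve, normalized by the threshold and reported in percent. The function name `pose_auc` and the trapezoidal integration are assumptions.

```python
def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Normalized area (in percent) under the cumulative pose-error curve.

    errors: non-negative pose errors in degrees, one per image triplet.
    """
    errs = sorted(errors)
    n = len(errs)
    # Cumulative curve: fraction of triplets with error <= x, starting at (0, 0).
    xs = [0.0] + errs
    ys = [0.0] + [(i + 1) / n for i in range(n)]
    aucs = []
    for t in thresholds:
        # Keep curve points up to the threshold, then close the curve at t.
        k = sum(1 for x in xs if x <= t)
        cx = xs[:k] + [t]
        cy = ys[:k] + [ys[k - 1]]
        # Trapezoidal integration of the truncated curve.
        area = sum((cx[i + 1] - cx[i]) * (cy[i + 1] + cy[i]) / 2.0
                   for i in range(len(cx) - 1))
        aucs.append(100.0 * area / t)
    return aucs


# Hypothetical pose errors in degrees for five image triplets.
example_aucs = pose_auc([1.0, 2.0, 4.0, 12.0, 45.0])
```

Under this convention, a method whose errors are all zero scores 100 at every threshold, and errors above the threshold contribute nothing.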

9.2 Noise experiments

We tested the performance of our solvers and the state-of-the-art algorithms w.r.t. increasing image noise. We used the SfM model of the botanical garden scene (randomly selected from all scenes) from the ETH3D dataset [50] to obtain instances of 5/6 points in three views by identifying images in the scene that share 3D points. Perfect noise-free correspondences are generated by projecting the 3D points into the images. We then add increasing amounts of normally distributed noise to these correspondences. We generated more than 9k instances, but show only 1k results per plot to avoid clutter. Note that the 4p3v(HC) solver was trained on the ETH3D dataset while our L-based solvers were trained on purely synthetic data. For the noise experiments we also test the joint 5PC solver, which operates on samples of 5 point correspondences and estimates the essential matrices $\mathtt{E}_{12}$ and $\mathtt{E}_{13}$ independently using the 5pt solver. Because the two essential matrices are estimated independently of each other, the scales of the two translations are also independent of each other. In contrast, the other solvers estimate the scale of the translation of the third camera relative to the scale of the translation between the first two cameras.
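The data generation described above can be sketched as follows: 3D points are projected with a pinhole model to obtain noise-free image points, and zero-mean Gaussian noise with a given standard deviation in pixels is then added. The camera parameters (`focal`, `principal`), the function names, and the assumption that the points are already expressed in the camera frame are illustrative choices, not details taken from the experiments.

```python
import random


def project(point3d, focal=1000.0, principal=(500.0, 500.0)):
    """Pinhole projection of a 3D point given in the camera frame (Z > 0)."""
    x, y, z = point3d
    return (focal * x / z + principal[0], focal * y / z + principal[1])


def add_noise(points2d, sigma_px, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma_px (pixels)."""
    return [(u + rng.gauss(0.0, sigma_px), v + rng.gauss(0.0, sigma_px))
            for u, v in points2d]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
points3d = [(0.0, 0.0, 5.0), (1.0, -0.5, 4.0), (-0.8, 0.3, 6.0), (0.2, 0.9, 7.0)]
clean = [project(p) for p in points3d]

# One noisy copy of the correspondences per noise level, from 0px to 4px.
noisy_by_level = {sigma: add_noise(clean, sigma, rng) for sigma in (0.0, 1.0, 2.0, 3.0, 4.0)}
```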

The results for increasing noise in the image points are shown in Figs. 10 and 11. The results are shown as boxplots, where the box spans the 25% to 75% quantiles and a horizontal line marks the median. Crosses show data beyond 1.5 times the interquartile range. Let $\mathtt{R}_{err}^{ij}$ be the error of the estimated relative rotation between cameras $i$ and $j$, computed as the angle in the axis-angle representation of $\mathtt{R}_{ij}^{-1}\mathtt{R}_{ij}^{\mathtt{GT}}$, and let $\mathbf{t}_{err}^{ij}$ be the error of the estimated translation, computed as the angle between the two unit vectors corresponding to the translations [20]. Fig. 10 shows boxplots of pose errors measured in the same way as in our experiments in the main paper (cf. Sec. 4 in the main paper), i.e., as $\text{max}\left(0.5(\mathtt{R}_{err}^{12}+\mathtt{R}_{err}^{13}),0.5(\mathbf{t}_{err}^{12}+\mathbf{t}_{err}^{13})\right)$, for the calibrated 4p3v problem (left), and the partially calibrated 4p3vf problem (right). The errors are zoomed into an interesting interval and are shown as functions of varying noise from $0$px to $4$px.
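For readers who want to reproduce the boxplot statistics without a plotting library, the quantities described above can be computed as in the sketch below. The quantile interpolation rule differs between boxplot implementations; the "inclusive" method used here is an assumption and not necessarily the one behind Figs. 10 and 11.

```python
import statistics


def boxplot_summary(errors):
    """Quartiles, median, and points beyond 1.5 times the interquartile range."""
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [e for e in errors if e < lower or e > upper]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical pose errors in degrees; the last value is drawn as a cross.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```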

Due to the approximate nature of the virtual correspondences, our newly proposed M-based and L-based solvers exhibit non-zero errors for zero noise. However, at noise levels $\geq 1$px, our $\delta$-based solvers (both M and L), and for the calibrated case even the pure 4p3v(M) and 4p3v(L) solvers, return comparable or even better results than the 5pt+P3P and 6pt+P3P solvers. Note that the 5pt+P3P and 6pt+P3P solvers sample one/two more points (real correspondences) in the first two cameras, and these points are affected only by the considered noise. This shows that our predicted virtual correspondences are good approximations to real correspondences. For the calibrated case, the recent state-of-the-art solver [19] fails in about 50% of the instances for noiseless data, even though the solver was trained on the ETH3D dataset. Thus, its median errors are significantly larger than the median errors of the remaining solvers.

The rotation and translation errors in the first two views, i.e., $\mathtt{R}_{err}^{12}$ and $\mathbf{t}_{err}^{12}$, for both the calibrated (top row) and the partially calibrated case (bottom row) are shown in Fig. 11. For the partially calibrated case, our new solvers generate two approximate virtual correspondences in the first two views. Therefore, the 4p3vf(M) and 4p3vf(L) solvers have slightly larger errors than the 6pt+P3P solver for all considered noise levels. However, similarly to the pose errors in Fig. 10, at noise levels $\geq 2$px our $\delta$-based solvers (both M and L) return comparable or even better results in the first two views than the 5pt [34] and 6pt [56] solvers, here represented by the results of the 5pt+P3P and 6pt+P3P solvers. For the calibrated case, even the pure 4p3v(M) and 4p3v(L) solvers, without the offset $\delta$, perform comparably to the 5pt solver. This shows an interesting potential of using our solvers for two-view relative pose estimation problems by solving these problems from sub-minimal samples.

(a) Phototourism [20]
(b) Cambridge Landmarks [21]
Figure 12: Speed-accuracy trade-off on (a) all scenes from Phototourism [20] except St. Peter’s Square, (b) 5 Cambridge Landmarks [21] scenes. We report the AUC@10 using the alternative definition of pose error (3). We vary the number of PoseLib RANSAC iterations (100, 200, 500, 1000, 2000, 5000, 10000). Runtimes are averaged over all image triplets. Legend is the same as in Fig. 14.
Phototourism [20]
Estimator AVG ($^\circ$) $\downarrow$ MED ($^\circ$) $\downarrow$ AUC@5 $\uparrow$ AUC@10 $\uparrow$ AUC@20 $\uparrow$ Runtime (s) $\downarrow$
4p3v(HC) [19] 5.23 1.89 43.36 62.83 76.97 2.95
5pt+P3P 5.00 1.85 43.99 63.35 77.39 2.78
4p3v(M) 5.11 1.94 43.03 62.87 77.15 2.23
4p3v(M)+R 5.07 1.91 43.30 63.14 77.32 2.47
4p3v(M)+R+F 5.05 1.89 43.41 63.24 77.39 2.42
4p3v(M$\pm\delta$) 5.02 1.92 43.24 63.15 77.44 2.25
4p3v(M$\pm\delta$)+R 4.96 1.90 43.57 63.48 77.67 2.53
4p3v(M$\pm\delta$)+R+F 5.00 1.89 43.51 63.38 77.56 2.41
4p3v(L) 5.46 1.93 42.82 62.07 76.25 2.88
4p3v(L)+R 5.05 1.91 43.28 63.19 77.42 2.50
4p3v(O)+R 4.70 1.81 44.73 64.50 78.44 2.38
4p3v(NO)+R 4.24 1.74 46.01 65.97 79.74 2.39
Table 7: Results for different methods implemented in the GC-RANSAC framework [2] for all scenes from the PhotoTourism [20] dataset. We mark the best and second best results (excluding oracle solvers).

9.3 Alternative evaluation measure

For the evaluation in the main paper, we defined the pose error as $\text{max}\left(0.5(\mathtt{R}_{err}^{12}+\mathtt{R}_{err}^{13}),0.5(\mathbf{t}_{err}^{12}+\mathbf{t}_{err}^{13})\right)$, where $\mathtt{R}_{err}^{ij}$ and $\mathbf{t}_{err}^{ij}$ are the angular errors of rotation and translation for camera pair $ij$ in degrees. The 4p3v problem also includes the estimation of $\mathtt{R}_{23}$ and $\mathbf{t}_{23}$ since the relative scale of $\mathbf{t}_{12}$ and $\mathbf{t}_{13}$ is recovered. We therefore also present results for the pose error defined as

$$P_{err}=\text{max}\left(\mathtt{R}_{err}^{12},\mathtt{R}_{err}^{13},\mathtt{R}_{err}^{23},\mathbf{t}_{err}^{12},\mathbf{t}_{err}^{13},\mathbf{t}_{err}^{23}\right)\enspace. \tag{3}$$

The results equivalent to Tab. 2 from the main paper using this pose error definition are presented in Tab. 6. A speed-accuracy comparison equivalent to Fig. 3 in the main paper is presented in Fig. 12. The overall comparison of the methods remains the same under both the metric used in the main paper and the alternative described in this section.
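The two pose error definitions can be written down compactly. The sketch below computes the rotation error as the angle of the residual rotation $\mathtt{R}^{-1}\mathtt{R}^{\mathtt{GT}}$ and the translation error as the angle between translation directions, and then combines them with the main-paper metric and with the alternative metric (3). Function names and the representation of rotations as nested lists are illustrative assumptions.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the residual rotation R_est^{-1} R_gt.

    For rotation matrices, trace(R_est^T R_gt) equals the sum of elementwise products.
    """
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_deg(t_est, t_gt):
    """Angle (degrees) between the two translation directions."""
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norms = math.sqrt(sum(a * a for a in t_est)) * math.sqrt(sum(b * b for b in t_gt))
    cos_angle = max(-1.0, min(1.0, dot / norms))
    return math.degrees(math.acos(cos_angle))


def pose_error_main(r12, r13, t12, t13):
    """Metric used in the main paper: max of the averaged rotation and translation errors."""
    return max(0.5 * (r12 + r13), 0.5 * (t12 + t13))


def pose_error_alt(r12, r13, r23, t12, t13, t23):
    """Alternative metric (3): max over all six pairwise errors."""
    return max(r12, r13, r23, t12, t13, t23)


def rot_z(angle_deg):
    """Helper for the example: rotation about the z-axis."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
r_err = rotation_error_deg(rot_z(30.0), identity)
t_err = translation_error_deg((1.0, 0.0, 0.0), (0.0, 2.0, 0.0))
```

Since (3) takes the maximum over individual pairwise errors rather than over averages, and additionally includes the pair 23, it is never smaller than the main-paper metric for the same estimate.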

9.4 GC-RANSAC

We also evaluated our proposed solvers and compared them against the state-of-the-art solver in the GC-RANSAC [2] framework. Results for all scenes of the PhotoTourism dataset [20] are presented in Tab. 7.

Legend: 4p3v(M), 4p3v(M)+R, 4p3v(M)+R+F, 4p3v(M$\pm\delta$), 4p3v(M$\pm\delta$)+R, 4p3v(M$\pm\delta$)+R+F, 4p3v(HC), 5pt+P3P, 4p3v(O)+R+F

St. Mary’s Church [21]: (a) Original Matches, (b) Synth - 0.6 inlier ratio, (c) Synth - 0.4 inlier ratio, (d) Synth - 0.2 inlier ratio

Sacre Coeur [20]: (e) Original Matches, (f) Synth - 0.6 inlier ratio, (g) Synth - 0.4 inlier ratio, (h) Synth - 0.2 inlier ratio

Figure 13: Outlier experiments for the St. Mary’s Church scene from Cambridge Landmarks [21] (a-d) and the Sacre Coeur scene from Phototourism [20] (e-h). We show the speed-accuracy trade-off evaluation (see Fig. 14) for the original matches (a,e) and for a synthetic scenario where we kept the matches that are inliers w.r.t. the ground truth poses and added randomly distributed noisy matches to obtain the desired inlier ratio (b-d,f-h).

9.5 Outlier experiments

Fig. 4 in the main paper presented results obtained by synthetically removing outlier matches based on ground truth pose information and replacing them with outliers distributed uniformly at random to reach a given inlier ratio for all image triplets. Here we provide additional plots for two scenes: St. Mary’s Church from Cambridge Landmarks [21] and Sacre Coeur from Phototourism [20]. The results for the speed-accuracy evaluation are shown in Fig. 13. Consistent with the results in the main paper, these graphs show the potential for our M-based solvers to perform better than the 5pt+P3P baseline in scenarios with low inlier ratios.
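The construction of a match set with a prescribed inlier ratio can be sketched as follows. Given the matches that are inliers w.r.t. the ground truth poses, enough uniformly random three-view matches are appended so that the inliers make up the desired fraction. The image size, the function name, and the representation of a match as three 2D points are assumptions made for this sketch.

```python
import random


def pad_with_outliers(inlier_matches, inlier_ratio, width, height, rng):
    """Append uniformly random three-view matches to reach the given inlier ratio."""
    n_in = len(inlier_matches)
    # Solve n_in / (n_in + n_out) = inlier_ratio for n_out.
    n_out = round(n_in * (1.0 - inlier_ratio) / inlier_ratio)

    def random_point():
        return (rng.uniform(0.0, width), rng.uniform(0.0, height))

    outliers = [(random_point(), random_point(), random_point()) for _ in range(n_out)]
    matches = list(inlier_matches) + outliers
    rng.shuffle(matches)  # avoid any ordering that separates inliers from outliers
    return matches


rng = random.Random(0)
# Hypothetical inliers: 20 matches, each a triple of image points (one per view).
inliers = [((10.0 * i, 5.0 * i), (10.0 * i + 1.0, 5.0 * i), (10.0 * i + 2.0, 5.0 * i))
           for i in range(20)]
matches_04 = pad_with_outliers(inliers, 0.4, width=1024.0, height=768.0, rng=rng)
```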

Legend: 4p3v(M), 4p3v(M)+R, 4p3v(M)+R+F, 4p3v(M$\pm\delta$), 4p3v(M$\pm\delta$)+R, 4p3v(M$\pm\delta$)+R+F, 4p3v(HC), 5pt+P3P, 4p3v(O)+R+F

Panels: (a) Great Court, (b) Shop Facade, (c) Old Hospital, (d) Buckingham Palace, (e) Colosseum Exterior, (f) Grand Place Brussels, (g) Notre Dame, (h) Palace of Westminster, (i) Pantheon Exterior, (j) Taj Mahal, (k) Temple Nara, (l) Trevi Fountain

Figure 14: Results for individual scenes from the Cambridge Landmarks [21] (a-c) and Phototourism [20] (d-l) datasets which were not presented in the main paper or Fig. 13. We report the AUC@10 of the pose error and vary the number of PoseLib RANSAC iterations (100, 200, 500, 1000, 2000, 5000, 10000). Runtimes are averaged over all image triplets.

9.6 Detailed experiments on Real Data

Results for individual scenes.  Fig. 3 in the main paper shows results on all PhotoTourism scenes (except St. Peter’s Square), the 5 Cambridge Landmarks scenes we consider, and one individual scene from each dataset. In Fig. 14, we provide results for the accuracy-speed trade-off evaluation for more evaluation scenes on both PhotoTourism [20] and Cambridge Landmarks [21]. As discussed and shown in the main paper, the performance of our M-based and L-based solvers is scene-dependent. This can also be seen in Fig. 14, where for some scenes, the 4p3v(M)+R and 4p3v(M$\pm\delta$)+R solvers perform worse than the 5pt+P3P solver (Shop Facade, Palace of Westminster, Trevi Fountain). However, for the majority of the scenes, our solvers perform similarly to 5pt+P3P (Old Hospital) or even outperform 5pt+P3P (Great Court, Buckingham Palace, Colosseum Exterior, Grand Place Brussels, Notre Dame, Pantheon Exterior, Taj Mahal, Temple Nara). Overall, the results validate the practical viability of our solvers in a time-constrained setting.

Results for L-based solvers.  In Tab. 9 we provide the same results as shown in Tab. 3 in the main paper, including also more variants for L-based solvers, i.e., 4p3v(L)+R+F, 4p3v(L$\pm\delta$), 4p3v(L$\pm\delta$)+R, 4p3v(L$\pm\delta$)+R+F, 4p3v(L$\pm\delta$init), 4p3v(L$\pm\delta$init)+R, and 4p3v(L$\pm\delta$init)+R+F.

Phototourism [20]
Estimator AVG ($^\circ$) $\downarrow$ MED ($^\circ$) $\downarrow$ AUC@5 $\uparrow$ AUC@10 $\uparrow$ AUC@20 $\uparrow$ Runtime (ms) $\downarrow$
4p3v(HC) [19] 7.17 2.34 52.74 66.63 77.86 176.45
5pt+P3P 5.99 2.00 57.31 70.54 80.81 105.50
4p3v(M) 7.17 2.49 50.96 65.46 77.32 176.77
4p3v(M)+R 6.39 2.00 56.92 69.90 80.17 174.94
4p3v(M)+R+F 6.59 2.07 55.90 69.06 79.56 130.54
4p3v(M$\pm\delta$) 6.39 2.19 54.70 68.58 79.59 172.06
4p3v(M$\pm\delta$)+R 5.97 1.89 58.65 71.43 81.35 189.21
4p3v(M$\pm\delta$)+R+F 6.15 1.97 57.42 70.47 80.69 175.78
4p3v(L) 7.12 2.50 51.00 65.57 77.42 376.31
4p3v(L)+R 6.35 2.00 56.88 69.88 80.15 297.88
4p3v(L)+R+F 6.55 2.07 55.88 69.08 79.55 255.93
4p3v(L$\pm\delta$) 6.39 2.19 54.70 68.58 79.59 172.05
4p3v(L$\pm\delta$)+R 5.97 1.89 58.65 71.43 81.35 189.28
4p3v(L$\pm\delta$)+R+F 6.15 1.97 57.42 70.47 80.69 175.76
4p3v(L$\pm\delta$init) 5.97 1.91 58.36 71.25 81.26 679.80
4p3v(L$\pm\delta$init)+R 5.97 1.89 58.63 71.41 81.34 730.90
4p3v(L$\pm\delta$init)+R+F 6.04 1.92 58.11 70.97 81.02 605.86
4p3v(O) 5.82 1.90 58.91 71.84 81.70 158.22
4p3v(O)+R 5.73 1.80 60.23 72.75 82.21 186.30
4p3v(O)+R+F 5.72 1.82 59.97 72.58 82.13 136.36
Cambridge Landmarks [21]
Estimator AVG ($^\circ$) $\downarrow$ MED ($^\circ$) $\downarrow$ AUC@5 $\uparrow$ AUC@10 $\uparrow$ AUC@20 $\uparrow$ Runtime (ms) $\downarrow$
4p3v(HC) [19] 9.69 3.31 40.96 58.84 72.83 164.49
5pt+P3P 8.16 3.05 43.79 61.61 75.30 148.53
4p3v(M) 9.61 3.42 39.71 58.08 72.54 138.33
4p3v(M)+R 8.77 3.11 42.98 60.90 74.59 134.84
4p3v(M)+R+F 9.03 3.17 42.31 60.18 74.03 116.51
4p3v(M$\pm\delta$) 8.75 3.21 42.11 60.37 74.38 181.21
4p3v(M$\pm\delta$)+R 8.32 3.01 44.17 62.04 75.60 184.98
4p3v(M$\pm\delta$)+R+F 8.47 3.08 43.21 61.22 75.03 138.45
4p3v(L) 9.58 3.44 39.79 58.17 72.62 232.13
4p3v(L)+R 8.75 3.09 42.94 60.86 74.60 177.06
4p3v(L)+R+F 8.92 3.17 42.28 60.19 74.08 163.95
4p3v(L$\pm\delta$) 8.75 3.21 42.11 60.37 74.38 181.20
4p3v(L$\pm\delta$)+R 8.32 3.01 44.17 62.04 75.60 184.89
4p3v(L$\pm\delta$)+R+F 8.47 3.08 43.21 61.22 75.03 138.49
4p3v(L$\pm\delta$init) 8.43 3.05 43.80 61.70 75.34 379.49
4p3v(L$\pm\delta$init)+R 8.28 3.00 44.26 62.10 75.62 395.43
4p3v(L$\pm\delta$init)+R+F 8.33 3.04 43.82 61.71 75.35 349.45
4p3v(O) 8.62 3.07 43.58 61.65 75.30 130.43
4p3v(O)+R 8.39 2.94 44.80 62.54 75.93 134.68
4p3v(O)+R+F 8.36 2.97 44.64 62.38 75.82 116.41
Table 9: Results for different solvers implemented in the PoseLib framework [26] on all scenes from the PhotoTourism [20] and 5 scenes from the Cambridge Landmarks [21] datasets including all evaluated variants of the proposed solvers. We mark the best and second best results (excluding oracle solvers). Reported runtimes are for the whole RANSAC.
Solver 5pt+P3P 4p3v(HC) 4p3v(M) 4p3v(M$\pm\delta$) 4p3v(L) 4p3v(L$\pm\delta$) 4p3v(L$\pm\delta$init)
Time ($\mu$s) 77.90 66.06 83.92 218.71 450.26 511.28 1130.31
Table 10: The average run-time, averaged over more than 10k instances of the Sacre Coeur scene of the PhotoTourism dataset [51], of the solvers for the calibrated case.
Solver 6pt+P3P 4p3vf(M) 4p3vf(M$\pm\delta$) 4p3vf(L) 4p3vf(L$\pm\delta$) 4p3vf(L$\pm\delta$init)
Time ($\mu$s) 106.67 117.28 295.87 758.59 953.34 2162.77
Table 11: The average run-time, averaged over more than 10k instances of the Sacre Coeur scene of the PhotoTourism dataset [51], of the solvers for the partially calibrated case.

9.7 Oracle solvers

We first provide more details on the implementation of our oracle version of the 4p3v(N) solver [35], i.e., the 4p3v(NO) solver. In the 4p3v(NO) solver, instead of performing a one-dimensional search over the $10^{\text{th}}$ degree curve of possible epipoles, we provide the solver with the ground truth epipole. To simulate the effect of sampling four points for this solver inside RANSAC, instead of using the second epipole and the epipolar line homography to get the essential matrix $\mathtt{E}$, as suggested in the implementation details of [35], we use the second suggested way of obtaining $\mathtt{E}$, i.e., using their 3pt+ep solver. However, we feed the 3pt+ep solver with four points and use SVD instead of the null space. The rest of the solver performs the triangulation and registers the last camera using the P3P solver [39]. This is identical to the original 4p3v(N) solver. Recall that the original 4p3v(N) solver needs to call these evaluations for each search step on the $10^{\text{th}}$ degree curve of possible epipoles (usually $40{\times}$ to $1000{\times}$ [35]). Moreover, this solver has several sources of errors, e.g., the $10^{\text{th}}$ degree curve is affected by noise; sparse sampling of the points on the curve introduces additional, potentially large noise in the epipole; and the reprojections of the fourth image point in the third view, traced out by sweeping through the curve of possible epipoles, generate complex curves in the third view, with the reprojection cost function having many local minima. In [35], it was reported that even for exact data and 1000 search points followed by refinement at multiple local minima, the failure rate of the solver is $\approx 3\%$. Therefore, we expect the original 4p3v(N) solver to perform much worse than the “oracle” 4p3v(NO) solver.

To obtain upper bounds for the precision that can be achieved by our proposed solvers we consider the following “oracle” solvers for real experiments: The 4p3v(O)/4p3vf(O) solvers use correct correspondence(s), i.e., correspondence(s) that satisfy the epipolar constraint for the ground truth relative pose of the first two cameras, as the $5^{\text{th}}$/$6^{\text{th}}$ virtual correspondence between these cameras. Then the 5pt+P3P/6pt+P3P solver is applied to estimate the relative pose of three cameras. The 4p3v(O)/4p3vf(O) solvers thus indicate the maximum precision that our solvers could reach if they were able to predict or infer a precise $5^{\text{th}}$/$6^{\text{th}}$ correspondence from the coordinates of the four input correspondences.
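A correspondence that satisfies the epipolar constraint for a given ground truth pose can be generated in several ways, and the text above does not fix one. As a minimal sketch under that caveat, the code below back-projects a point from the first view to a chosen depth, transfers it to the second view with the ground truth $(\mathtt{R},\mathbf{t})$, and checks the residual $\mathbf{x}_2^\top\mathtt{E}\,\mathbf{x}_1$ with $\mathtt{E}=[\mathbf{t}]_\times\mathtt{R}$. The depth value and all function names are hypothetical.

```python
import math


def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def cross_matrix(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v = t x v."""
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]


def essential_from_pose(R, t):
    return mat_mul(cross_matrix(t), R)


def oracle_correspondence(R, t, x1, depth):
    """Transfer x1 = (x, y, 1) to the second view using the ground truth pose.

    The depth is a free choice here (an assumption of this sketch); any positive
    depth that keeps the point in front of the second camera yields a valid match.
    """
    X1 = [depth * c for c in x1]
    X2 = [a + b for a, b in zip(mat_vec(R, X1), t)]
    return x1, [X2[0] / X2[2], X2[1] / X2[2], 1.0]


def epipolar_residual(E, x1, x2):
    return sum(a * b for a, b in zip(x2, mat_vec(E, x1)))


# Hypothetical ground truth pose: small rotation about the y-axis plus a translation.
c, s = math.cos(0.1), math.sin(0.1)
R_gt = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
t_gt = [1.0, 0.0, 0.1]

x1, x2 = oracle_correspondence(R_gt, t_gt, [0.2, -0.1, 1.0], depth=5.0)
residual = epipolar_residual(essential_from_pose(R_gt, t_gt), x1, x2)
```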

Tab. 7 compares the 4p3v(NO) solver with the 4p3v(O) solver, several variants of our M-based and L-based solvers, and the 4p3v(HC) [19] and 5pt+P3P approaches. Results are shown over all scenes from the PhotoTourism dataset using GC-RANSAC. The 4p3v(NO) solver performs slightly better than 4p3v(O) in terms of pose accuracy. (Footnote 6: We observed that when naively applying 4p3v(NO) in RANSAC, RANSAC tends to terminate too early, resulting in reduced pose accuracy. This resulted in the statement in the main paper that the 4p3v(NO) oracle performed worse than our oracles. In order to obtain results comparable with 4p3v(O), we had to adapt RANSAC.) While the pose accuracy of 4p3v(NO) provides an upper bound on the performance of the 4p3v(N) solver [35], the run-time observed for 4p3v(NO) is not indicative of the run-time of 4p3v(N). As highlighted above and in the main paper, the 4p3v(N) solver needs to evaluate up to thousands of epipole estimates and is thus significantly slower than its oracle variant (which only evaluates a single epipole). In contrast, our M-based solvers have a run-time comparable to 4p3v(O). In addition, the epipoles used by 4p3v(N) can be rather noisy, depending on how densely the curve is sampled. Even for a rather dense sampling, which increases the run-time of the 4p3v(N) solver, [35] report that their solver often has problems with local minima. The results of the 4p3v(O) solver show that there is room for improvement for our M-based and L-based solvers by developing approaches that generate more accurate virtual correspondences.

9.8 Solver run-times

In this section, we present run-times of the proposed solvers as well as the state-of-the-art solvers for the relative pose problem of three calibrated/partially calibrated cameras. While the main paper reports run-time results for full RANSAC-based estimation, we now report the run-times of the individual solvers outside of RANSAC. To measure the run-times of the solvers, we calculated the average run-time of each solver on more than 10k instances of the Sacre Coeur scene of the PhotoTourism dataset [51]. For calibrated cameras, the run-times are reported in Table 10, and for partially calibrated cameras in Table 11. The experiments were performed on an Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz. The average run-times of the L-based solvers are higher, because we run the network on the CPU and without batching. In general, the implementations of all proposed solvers are not optimized for speed, and we still see room for speeding them up.
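The per-solver timing protocol described above can be sketched as follows. This is only an illustrative harness under stated assumptions: `average_runtime_us`, the `warmup` parameter, and the dummy solver are hypothetical names that do not come from our code base, and the actual solvers are not invoked here.

```python
import statistics
import time

def average_runtime_us(solver, instances, warmup=10):
    """Average wall-clock time of solver(instance) in microseconds.

    A few warm-up calls are made first so that one-time costs
    (caches, lazy initialization) do not distort the average.
    """
    for inst in instances[:warmup]:
        solver(inst)
    timings = []
    for inst in instances:
        start = time.perf_counter()
        solver(inst)
        timings.append((time.perf_counter() - start) * 1e6)
    return statistics.mean(timings)

# Hypothetical stand-in for a minimal solver: each "instance" is a list
# of numbers and the "solver" simply sums them.
dummy_instances = [[float(i), 1.0, 2.0, 3.0] for i in range(200)]
avg_us = average_runtime_us(sum, dummy_instances)
```

Timing each call individually, rather than the whole loop, keeps per-instance data loading out of the measurement, at the cost of some timer overhead for very fast solvers.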

References

  • [1] Chris Aholt and Luke Oeding. The ideal of the trifocal variety. Mathematics of Computation, 83(289):2553–2574, 2014.
  • [2] D. Barath and J. Matas. Graph-Cut RANSAC. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [3] Daniel Barath and Chris Sweeney. Relative pose solvers using monocular depth. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 4037–4043. IEEE, 2022.
  • [4] Martin Bujnak, Zuzana Kukelova, and Tomas Pajdla. A general solution to the p4p problem for camera with unknown focal length. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  • [5] Robert Castle, Georg Klein, and David W. Murray. Video-rate localization in multiple maps for wearable augmented reality. In ISWC, 2008.
  • [6] Luca Cavalli, Marc Pollefeys, and Daniel Barath. NeFSAC: neurally filtered minimal samples. In European Conference on Computer Vision, pages 351–366. Springer, 2022.
  • [7] Chiang-Heng Chien, Hongyi Fan, Ahmad Abdelfattah, Elias Tsigaridas, Stanimire Tomov, and Benjamin Kimia. GPU-based homotopy continuation for minimal problems in computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15765–15776, June 2022.
  • [8] Ondřej Chum, Jiří Matas, and Josef Kittler. Locally optimized RANSAC. In Pattern Recognition, pages 236–243. Springer Berlin Heidelberg, 2003.
  • [9] Andrea Porfiri Dal Cin, Timothy Duff, Luca Magri, and Tomas Pajdla. Minimal perspective autocalibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5064–5073, June 2024.
  • [10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
  • [11] Yaqing Ding, Chiang-Heng Chien, Viktor Larsson, Karl Åström, and Benjamin Kimia. Minimal solutions to generalized three-view relative pose problem. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8156–8164, October 2023.
  • [12] Timothy Duff, Kathlén Kohn, Anton Leykin, and Tomas Pajdla. PLMP - point-line minimal problems in complete multi-view visibility. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1675–1684, 2019.
  • [13] Timothy Duff, Kathlén Kohn, Anton Leykin, and Tomas Pajdla. PL$_1$P - point-line minimal problems under partial visibility in three views. In ECCV (26), volume 12371 of Lecture Notes in Computer Science, pages 175–192. Springer, 2020.
  • [14] Ricardo Fabbri, Timothy Duff, Hongyi Fan, Margaret H. Regan, David da Costa de Pinho, Elias P. Tsigaridas, Charles W. Wampler, Jonathan D. Hauenstein, Peter J. Giblin, Benjamin B. Kimia, Anton Leykin, and Tomás Pajdla. TRPLP - trifocal relative pose from lines at points. In CVPR, pages 12070–12080. Computer Vision Foundation / IEEE, 2020.
  • [15] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
  • [16] Wolfgang Förstner and Bernhard P. Wrobel. Photogrammetric Computer Vision: Statistics, Geometry, Orientation and Reconstruction. Geometry and Computing, 11. Springer International Publishing, Cham, 1st edition, 2016.
  • [17] Christian Häne, Lionel Heng, Gim Hee Lee, Friedrich Fraundorfer, Paul Furgale, Torsten Sattler, and Marc Pollefeys. 3d visual perception for self-driving cars using a multi-camera system: Calibration, mapping, localization, and obstacle detection. Image and Vision Computing, 68:14–27, 2017.
  • [18] R.J. Holt and A.N. Netravali. Uniqueness of solutions to three perspective views of four points. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(3):303–307, 1995.
  • [19] Petr Hruby, Timothy Duff, Anton Leykin, and Tomas Pajdla. Learning to solve hard minimal problems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5532–5542, June 2022.
  • [20] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 2020.
  • [21] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
  • [22] J. Kileel. Minimal problems for the calibrated trifocal variety. SIAM Journal on Applied Algebra and Geometry, 1(1):575–598, 2017.
  • [23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [24] Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla. Real-time solution to the absolute pose problem with unknown radial distortion and focal length. In Proceedings of the IEEE International Conference on Computer Vision, pages 2816–2823, 2013.
  • [25] Z. Kukelova and T. Pajdla. A minimal solution to the autocalibration of radial distortion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), 2007.
  • [26] Viktor Larsson. PoseLib - Minimal Solvers for Camera Pose Estimation, 2020.
  • [27] Viktor Larsson, Kalle Åström, and Magnus Oskarsson. Efficient solvers for minimal problems by syzygy-based reduction. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • [28] Viktor Larsson, Torsten Sattler, Zuzana Kukelova, and Marc Pollefeys. Revisiting radial distortion absolute pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1062–1071, 2019.
  • [29] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 3744–3753, 2019.
  • [30] S. Leonardos, R. Tron, and K. Daniilidis. A metric parametrization for trifocal tensors with non-colinear pinholes. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
  • [31] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In International Conference on Computer Vision (ICCV), 2023.
  • [32] H. Christopher Longuet-Higgins. A method of obtaining the relative positions of 4 points from 3 perspective projections. In Peter Mowforth, editor, BMVC91, pages 86–94, London, 1991. Springer London.
  • [33] Evgeniy V. Martyushev. On some properties of calibrated trifocal tensors. Journal of Mathematical Imaging and Vision, 58(2):321–332, 2017.
  • [34] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, June 2004.
  • [35] D. Nistér and F. Schaffalitzky. Four points in two or three calibrated views: Theory and practice. International Journal of Computer Vision, 67(2):211–231, 2006.
  • [36] Magnus Oskarsson, Andrew Zisserman, and Kalle Åström. Minimal projective reconstruction for combinations of points and lines in three views. Image and Vision Computing, 22(10):777–785, 2004. British Machine Vision Computing 2002.
  • [37] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS Autodiff Workshop, 2017.
  • [38] Michal Perdoch, Jiri Matas, and Ondrej Chum. Epipolar geometry from two correspondences. In 18th International Conference on Pattern Recognition (ICPR’06), volume 4, pages 215–219. IEEE, 2006.
  • [39] Mikael Persson and Klas Nordberg. Lambda twist: An accurate fast robust perspective three point (p3p) solver. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [40] James Pritts, Ondrej Chum, and Jiri Matas. Approximate models for fast and accurate epipolar geometry estimation. In 28th International Conference on Image and Vision Computing New Zealand, IVCNZ 2013, Wellington, New Zealand, November 27-29, 2013, pages 106–111. IEEE, 2013.
  • [41] James Pritts, Zuzana Kukelova, Viktor Larsson, and Ondrej Chum. Radially-distorted conjugate translations. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1993–2001. IEEE Computer Society, 2018.
  • [42] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [43] Long Quan. Invariants of six points and projective reconstruction from three uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 17(1):34–46, 1995.
  • [44] L. Quan, B. Triggs, and B. Mourrain. Some results on minimal euclidean reconstruction from four points. Journal of Mathematical Imaging and Vision, 24(3):341–348, 2006.
  • [45] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm. USAC: A universal framework for random sample consensus. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):2022–2038, 2013.
  • [46] Volker Rodehorst. Evaluation of the metric trifocal tensor for relative three-view orientation. In Klaus Gürlebeck and Tom Lahmer, editors, Digital Proceedings, International Conference on the Applications of Computer Science and Mathematics in Architecture and Civil Engineering : July 20 - 22 2015, Bauhaus-University Weimar, 2017.
  • [47] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016. (To appear).
  • [48] D. Scaramuzza and F. Fraundorfer. Visual odometry [tutorial]. IEEE Robot. Automat. Mag., 18(4):80–92, 2011.
  • [49] Johannes L. Schönberger and Jan-Michael Frahm. Structure-From-Motion Revisited. In CVPR, 2016.
  • [50] Thomas Schops, Johannes L. Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-View Stereo Benchmark With High-Resolution Images and Multi-Camera Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [51] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3D. In ACM SIGGRAPH 2006 Papers, pages 835–846. 2006.
  • [52] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. International Journal of Computer Vision, 80(2):189–210, Nov. 2008.
  • [53] Andrew J. Sommese and Charles W. Wampler. The numerical solution of systems of polynomials arising in engineering and science. World Scientific Publishing Co. Pte. Ltd, Singapore, 2005.
  • [54] H. Stewenius, C. Engels, and D. Nistér. Recent developments on direct relative orientation. ISPRS J. of Photogrammetry and Remote Sensing, 60:284–294, 2006.
  • [55] H. Stewenius, D. Nister, F. Kahl, and F. Schaffalitzky. A minimal solution for relative pose with unknown focal length. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), 2005.
  • [56] H. Stewénius, D. Nistér, M. Oskarsson, and K. Åström. Solutions to minimal generalized relative pose problems. In Workshop on Omnidirectional Vision, Beijing China, OCT 2005.
  • [57] P. H. S. Torr and A. Zisserman. Robust parameterization and computation of the trifocal tensor. Image and Vision Computing, 15:591–605, 1997.
  • [58] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
  • [59] G. Xu and N. Sugimoto. A linear algorithm for motion from three weak perspective images using euler angles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(1):54–57, 1999.