Wasserstein Dictionary Learning: Optimal Transport-Based Unsupervised Nonlinear Dictionary Learning

Published: 01 January 2018

Abstract

This paper introduces a new nonlinear dictionary learning method for histograms in the probability simplex. The method leverages optimal transport theory, in the sense that our aim is to reconstruct histograms using so-called displacement interpolations (a.k.a. Wasserstein barycenters) between dictionary atoms; such atoms are themselves synthetic histograms in the probability simplex. Our method simultaneously estimates such atoms and, for each datapoint, the vector of weights that can optimally reconstruct it as an optimal transport barycenter of such atoms. Our method is computationally tractable thanks to the addition of an entropic regularization to the usual optimal transportation problem, leading to an approximation scheme that is efficient, parallel, and simple to differentiate. Both atoms and weights are learned using a gradient-based descent method. Gradients are obtained by automatic differentiation of the generalized Sinkhorn iterations that yield barycenters with entropic smoothing. Because of its formulation relying on Wasserstein barycenters instead of the usual matrix product between dictionary and codes, our method allows for nonlinear relationships between atoms and the reconstruction of input data. We illustrate its application in several different image processing settings.
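
The reconstruction step described in the abstract amounts to computing an entropy-regularized Wasserstein barycenter of the dictionary atoms. As a minimal illustrative sketch only, and not the authors' implementation, the NumPy snippet below runs the iterative Bregman projection scheme of Benamou et al. for such barycenters, which the generalized Sinkhorn iterations mentioned above build on; the function name sinkhorn_barycenter, the toy cost matrix, and the regularization value eps are assumptions made for this example.

    import numpy as np

    def sinkhorn_barycenter(D, lam, C, eps=1e-2, n_iter=100):
        """Entropy-regularized Wasserstein barycenter of the columns of D.

        D   : (n, S) array, each column a histogram (atom) in the simplex
        lam : (S,) array of nonnegative barycentric weights summing to 1
        C   : (n, n) ground cost matrix between histogram bins
        eps : entropic regularization strength
        """
        Ker = np.exp(-C / eps)              # Gibbs kernel of the ground cost
        n, S = D.shape
        b = np.ones((n, S))                 # one Sinkhorn scaling per atom
        for _ in range(n_iter):
            phi = Ker.T @ (D / (Ker @ b))   # marginals of the current couplings
            p = np.exp(np.log(phi) @ lam)   # weighted geometric mean = barycenter
            b = p[:, None] / phi            # rescale each coupling toward the barycenter
        return p

    # Toy usage: interpolate between two 1-D Gaussian histograms on a regular grid.
    n = 60
    x = np.linspace(0, 1, n)
    C = (x[:, None] - x[None, :]) ** 2
    a1 = np.exp(-(x - 0.25) ** 2 / 0.01); a1 /= a1.sum()
    a2 = np.exp(-(x - 0.75) ** 2 / 0.01); a2 /= a2.sum()
    bary = sinkhorn_barycenter(np.stack([a1, a2], axis=1), np.array([0.5, 0.5]), C)

For small eps the kernel entries underflow in double precision, so practical implementations stabilize these iterations in the log domain; differentiating such a loop with an automatic differentiation framework, as the abstract describes, then yields the gradients with respect to both the atoms and the weights that drive the learning.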

Published In

SIAM Journal on Imaging Sciences, Volume 11, Issue 1
EISSN: 1936-4954
DOI: 10.1137/sjisbi.11.1

Publisher

Society for Industrial and Applied Mathematics, United States

Author Tags

1. optimal transport
2. Wasserstein barycenter
3. dictionary learning

MSC Codes

1. 33F05
2. 49M99
3. 65D99
4. 90C08

Qualifiers

• Research-article
