Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

Published: 01 December 2010

Abstract

We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.
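
To make the procedure concrete, here is a minimal sketch of greedy layer-wise pretraining with denoising autoencoders in plain NumPy. It is not the authors' implementation: masking noise, sigmoid units with tied weights, and a cross-entropy reconstruction loss are one common instantiation of the model, and every layer size, corruption level, and learning rate below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """One layer: corrupt the input, encode it, decode it, and minimize
    cross-entropy reconstruction error against the *uncorrupted* input."""

    def __init__(self, n_visible, n_hidden, corruption=0.3, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)
        self.corruption = corruption
        self.lr = lr

    def corrupt(self, x):
        # Masking noise: randomly zero out a fraction of the input components.
        mask = rng.random(x.shape) > self.corruption
        return x * mask

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_hid)

    def decode(self, h):
        # Tied weights: the decoder reuses W transposed.
        return sigmoid(h @ self.W.T + self.b_vis)

    def train_step(self, x):
        x_tilde = self.corrupt(x)
        h = self.encode(x_tilde)
        z = self.decode(h)
        # Gradients of the cross-entropy reconstruction loss.
        dz = z - x                          # (batch, n_visible)
        dh = (dz @ self.W) * h * (1 - h)    # (batch, n_hidden)
        grad_W = x_tilde.T @ dh + dz.T @ h  # encoder and decoder contributions
        self.W -= self.lr * grad_W / len(x)
        self.b_hid -= self.lr * dh.mean(axis=0)
        self.b_vis -= self.lr * dz.mean(axis=0)

def train_stack(data, layer_sizes, epochs=10, batch=20):
    """Greedy layer-wise pretraining: train each denoising autoencoder on the
    clean output of the layer below, then pass its clean encoding upward."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        da = DenoisingAutoencoder(inputs.shape[1], n_hidden)
        for _ in range(epochs):
            for i in range(0, len(inputs), batch):
                da.train_step(inputs[i:i + batch])
        layers.append(da)
        inputs = da.encode(inputs)  # uncorrupted representation for the next layer
    return layers

# Toy usage on random binary "images"; real experiments would use image or
# digit data and many more training updates.
X = (rng.random((500, 64)) > 0.5).astype(float)
stack = train_stack(X, layer_sizes=[32, 16])
```

After pretraining, the stacked encoders would typically initialize a feed-forward network that is fine-tuned with supervised backpropagation, or the top-level representations would be fed to a separate classifier such as an SVM, as described in the abstract.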




Published In

The Journal of Machine Learning Research, Volume 11 (3/1/2010), 3637 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org

Publication History

Published: 01 December 2010
Published in JMLR Volume 11

