DOI: 10.1145/2463676.2465338
tutorial

Machine learning for big data

Published: 22 June 2013

Abstract

Statistical machine learning has undergone a phase transition from a purely academic endeavor to one of the main drivers of modern commerce and science. Moreover, recent results such as those on tera-scale learning [1] and on very large neural networks [2] suggest that scale is an important ingredient in high-quality modeling. This tutorial introduces current applications, techniques and systems with the aim of cross-fertilizing research between the database and machine learning communities.
The tutorial covers current large-scale applications of machine learning, their computational models, and the workflows used to build them. Building on this foundation, the bulk of the tutorial presents the current state of the art in systems support and identifies critical gaps in it. The tutorial closes with two sets of open research questions: better systems support for the already established use cases of machine learning, and support for recent advances in machine learning research.

References

[1] A. Agarwal, O. Chapelle, M. Dudik and J. Langford, "A Reliable Effective Terascale Linear Learning System," arXiv.org, 2012.
[2] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, A. Senior, P. Tucker, K. Yang and A. Ng, "Large Scale Distributed Deep Networks," in Advances in Neural Information Processing Systems, 2013.
[3] The Apache Project, "Apache Hadoop NextGen MapReduce (YARN)," The Apache Project. [Online]. Available: http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html.
[4] B. Hindman, A. Konwinski, M. Zaharia and I. Stoica, "A Common Substrate for Cluster Computing," in HotCloud, 2009.
[5] M. Kearns, "Efficient noise-tolerant learning from statistical queries," Journal of the ACM, pp. 392–401, 1998.
[6] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski and A. Y. Ng, "Map-Reduce for Machine Learning on Multicore," in Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007.
[7] The Apache Foundation, "Apache Pig," 11 12 2012. [Online]. Available: http://pig.apache.org/.
[8] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107–113, 2008.
[9] The Apache Mahout Project, "Apache Mahout," 17 9 2012. [Online]. Available: http://mahout.apache.org/. [Accessed 17 9 2012].
[10] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in USENIX NSDI, San Jose, CA, 2012.
[11] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, Indianapolis, Indiana, USA, 2010.
[12] The Apache Software Foundation, "Apache Giraph." [Online]. Available: http://giraph.apache.org/.
[13] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola and J. M. Hellerstein, "Distributed GraphLab: a framework for machine learning and data mining in the cloud," Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716–727, April 2012.
[14] G. Hinton, "Learning Multiple Layers of Representation," Trends in Cognitive Sciences, vol. 11, pp. 428–434, 2007.
[15] G. E. Hinton, S. Osindero and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1526–1554, 2006.
[16] J. Markoff, "How Many Computers to Identify a Cat? 16,000," The New York Times, 25 June 2012. [Online]. Available: http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html. [Accessed 11 December 2012].
[17] A. Smola, A. Ahmed and M. Weimer, "WWW 2012 Tutorial: New Templates for Scalable Data Analysis," June 2012. [Online]. Available: http://www2012.wwwconference.org/program/tutorials/ and http://cs.markusweimer.com/2012/04/06/www-2012-tutorial-new-templates-for-scalable-data-analysis/.
[18] G. Dror, N. Koenigstein, Y. Koren and M. Weimer, "The Yahoo! Music Dataset and KDD-Cup'11," in Proceedings of KDDCup 2011, San Diego, CA, 2011.


    Published In

    SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
    June 2013
    1322 pages
    ISBN: 9781450320375
    DOI: 10.1145/2463676

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. big data
    2. databases
    3. machine learning

    Qualifiers

    • Tutorial

    Conference

    SIGMOD/PODS'13

    Acceptance Rates

    SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
