research-article
Open access

The Lean Data Scientist: Recent Advances Toward Overcoming the Data Bottleneck

Published: 20 January 2023

Abstract

A taxonomy of the methods used to obtain quality datasets enhances existing resources.

Supplementary Material

PDF File (p92-shani-supp.pdf)
Supplemental material.

      Published In

      Communications of the ACM, Volume 66, Issue 2
      February 2023
      104 pages
      ISSN: 0001-0782
      EISSN: 1557-7317
      DOI: 10.1145/3581931
      Editor: James Larus
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article
      • Popular
      • Refereed

      Cited By

      • (2024) Wearable Data From Subjects Playing Super Mario, Taking University Exams, or Performing Physical Exercise Help Detect Acute Mood Disorder Episodes via Self-Supervised Learning: Prospective, Exploratory, Observational Study. JMIR mHealth and uHealth 12 (e55094). DOI: 10.2196/55094. Online publication date: 17-Jul-2024
      • (2024) Human-Centered AI (Also) for Humanistic Management. Humanism in Marketing (225-255). DOI: 10.1007/978-3-031-67155-5_11. Online publication date: 26-Oct-2024
      • (2023) Human-in-the-Loop Machine Learning for the Treatment of Pancreatic Cancer. 2023 International Joint Conference on Neural Networks (IJCNN) (1-9). DOI: 10.1109/IJCNN54540.2023.10191456. Online publication date: 18-Jun-2023
      • (2023) Occupational models from 42 million unstructured job postings. Patterns 4:7 (100757). DOI: 10.1016/j.patter.2023.100757. Online publication date: Jul-2023
      • (2023) A visual-semantic approach for building content-based recommender systems. Information Systems 117:C. DOI: 10.1016/j.is.2023.102243. Online publication date: 1-Jul-2023
      • (2023) Addressing the data bottleneck in medical deep learning models using a human-in-the-loop machine learning approach. Neural Computing and Applications 36:5 (2597-2616). DOI: 10.1007/s00521-023-09197-2. Online publication date: 21-Nov-2023
