The amount of data available to train modern machine learning systems has been increasing rapidly, to the point that we now use, e.g., the entirety of publicly available text data to train state-of-the-art (SoTA) large language models (LLMs), and interaction data from billions of users to train SoTA recommender systems. Training such large machine learning systems on such large datasets entails high (i) computational runtime, (ii) economic cost, and (iii) carbon footprint, all of which we aim to minimize.
While a large body of literature develops "model-centric" techniques to better model a given dataset, in this thesis we take a "data-centric" viewpoint: we are interested in techniques that summarize a given training dataset such that models trained on the data summary are as effective as models trained on the much larger original dataset. In addition to being more efficient overall, data-efficient techniques further aim to improve the trained model's quality by stripping away low-quality and noisy sources of information in the original dataset.
More specifically, we develop techniques from two distinct data summarization paradigms: (i) data pruning (a.k.a. coreset construction) techniques, which sample the most relevant portions of the dataset using various grounded heuristics, and (ii) data distillation techniques, which generate synthetic data points that summarize the underlying information in the dataset and are optimized end-to-end using meta-learning. We restrict our scope to training (i) language models on textual datasets, and (ii) recommender systems on user-item interaction datasets.
By pushing the frontier of data-efficient training of machine learning systems, we believe our research can contribute to the practical success of such widely deployed systems, as well as provide the research community with a better understanding on which to build future work.