Showing 1–21 of 21 results for author: Cafarella, M

  1. arXiv:2409.02038  [pdf, other]

    cs.CL cs.AI cs.DB

    BEAVER: An Enterprise Benchmark for Text-to-SQL

    Authors: Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker

    Abstract: Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this env…

    Submitted 3 September, 2024; originally announced September 2024.

  2. arXiv:2406.11784  [pdf, other]

    cs.CL cs.AI

    MDCR: A Dataset for Multi-Document Conditional Reasoning

    Authors: Peter Baile Chen, Yi Zhang, Chunwei Liu, Sejal Gupta, Yoon Kim, Michael Cafarella

    Abstract: The same real-life questions posed to different individuals may lead to different answers based on their unique situations. For instance, whether a student is eligible for a scholarship depends on eligibility conditions, such as major or degree required. ConditionalQA was proposed to evaluate models' capability of reading a document and answering eligibility questions, considering unmentioned cond…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2405.14696  [pdf, other]

    cs.CL cs.AI cs.DB

    A Declarative System for Optimizing AI Workloads

    Authors: Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baile Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Gerardo Vitagliano

    Abstract: A long-standing goal of data management systems has been to build systems which can compute quantitative insights over large corpora of unstructured data in a cost-effective manner. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or metrics from image and video corpora. Today's models can accomplish these tasks with high accuracy…

    Submitted 29 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 29 pages, 9 figures

    ACM Class: H.2.3; I.2.5

  4. arXiv:2310.00749  [pdf, other]

    cs.DB cs.LG

    SEED: Domain-Specific Data Curation With Large Language Models

    Authors: Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, Michael Cafarella

    Abstract: Data curation tasks that prepare data for analytics are critical for turning data into actionable insights. However, due to the diverse requirements of applications in different domains, generic off-the-shelf tools are typically insufficient. As a result, data scientists often have to develop domain-specific solutions tailored to both the dataset and the task, e.g. writing domain-specific code or…

    Submitted 24 April, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: preprint, 20 pages, 4 figures

  5. arXiv:2305.08741  [pdf, other]

    cs.DB

    Causal Data Integration

    Authors: Brit Youngmann, Michael Cafarella, Babak Salimi, Anna Zeng

    Abstract: Causal inference is fundamental to empirical scientific discoveries in natural and social sciences; however, in the process of conducting causal inference, data management problems can lead to false discoveries. Two such problems are (i) not having all attributes required for analysis, and (ii) misidentifying which attributes are to be included in the analysis. Analysts often only have access to p…

    Submitted 15 May, 2023; originally announced May 2023.

  6. arXiv:2212.14161  [pdf, other]

    cs.DB cs.DC cs.SE

    Transactions Make Debugging Easy

    Authors: Qian Li, Peter Kraft, Michael Cafarella, Çağatay Demiralp, Goetz Graefe, Christos Kozyrakis, Michael Stonebraker, Lalith Suresh, Matei Zaharia

    Abstract: We propose TROD, a novel transaction-oriented framework for debugging modern distributed web applications and online services. Our critical insight is that if applications store all state in databases and only access state transactionally, TROD can use lightweight always-on tracing to track the history of application state changes and data provenance, and then leverage the captured traces and tran…

    Submitted 28 December, 2022; originally announced December 2022.

    Comments: CIDR'23

  7. arXiv:2210.02943  [pdf, other]

    cs.DB

    On Explaining Confounding Bias

    Authors: Brit Youngmann, Michael Cafarella, Yuval Moskovitch, Babak Salimi

    Abstract: When analyzing large datasets, analysts are often interested in the explanations for surprising or unexpected results produced by their queries. In this work, we focus on aggregate SQL queries that expose correlations in the data. A major challenge that hinders the interpretation of such queries is confounding bias, which can lead to an unexpected correlation. We generate explanations in terms of…

    Submitted 6 October, 2022; originally announced October 2022.

  8. arXiv:2208.13068  [pdf, other]

    cs.DB cs.DC

    Apiary: A DBMS-Integrated Transactional Function-as-a-Service Framework

    Authors: Peter Kraft, Qian Li, Kostis Kaffes, Athinagoras Skiadopoulos, Deeptaanshu Kumar, Danny Cho, Jason Li, Robert Redmond, Nathan Weckwerth, Brian Xia, Peter Bailis, Michael Cafarella, Goetz Graefe, Jeremy Kepner, Christos Kozyrakis, Michael Stonebraker, Lalith Suresh, Xiangyao Yu, Matei Zaharia

    Abstract: Developers increasingly use function-as-a-service (FaaS) platforms for data-centric applications that perform low-latency and transactional operations on data, such as for microservices or web serving. Unfortunately, existing FaaS platforms support these applications poorly because they physically and logically separate application logic, executed in cloud functions, from data management, done in…

    Submitted 30 June, 2023; v1 submitted 27 August, 2022; originally announced August 2022.

    Comments: 14 pages, 13 figures, 3 tables. Preprint

  9. arXiv:2208.06497  [pdf, other]

    cs.DB

    SeeSaw: Interactive Ad-hoc Search Over Image Databases

    Authors: Oscar Moll, Manuel Favela, Samuel Madden, Vijay Gadepally, Michael Cafarella

    Abstract: As image datasets become ubiquitous, the problem of ad-hoc searches over image data is increasingly important. Many high-level data tasks in machine learning, such as constructing datasets for training and testing object detectors, imply finding ad-hoc objects or scenes within large image datasets as a key sub-problem. New foundational visual-semantic embeddings trained on massive web datasets suc…

    Submitted 14 September, 2023; v1 submitted 12 August, 2022; originally announced August 2022.

    Comments: SIGMOD 2024 camera ready

  10. arXiv:2103.13428  [pdf, other]

    cs.CV cs.LG

    TagMe: GPS-Assisted Automatic Object Annotation in Videos

    Authors: Songtao He, Favyen Bastani, Mohammad Alizadeh, Hari Balakrishnan, Michael Cafarella, Tim Kraska, Sam Madden

    Abstract: Training high-accuracy object detection models requires large and diverse annotated datasets. However, creating these datasets is time-consuming and expensive since it relies on human annotators. We design, implement, and evaluate TagMe, a new approach for automatic object annotation in videos that uses GPS data. When the GPS trace of an object is available, TagMe matches the object's motion from…

    Submitted 24 March, 2021; originally announced March 2021.

    Comments: https://people.csail.mit.edu/songtao/tagme.html

  11. arXiv:2103.01986  [pdf, other]

    cs.DB cs.IR

    Technical Report on Data Integration and Preparation

    Authors: El Kindi Rezig, Michael Cafarella, Vijay Gadepally

    Abstract: AI application developers typically begin with a dataset of interest and a vision of the end analytic or insight they wish to gain from the data at hand. Although these are two very important components of an AI workflow, one often spends the first few weeks (sometimes months) in the phase we refer to as data conditioning. This step typically includes tasks such as figuring out how to prepare data…

    Submitted 2 March, 2021; originally announced March 2021.

  12. arXiv:2007.11112  [pdf, other]

    cs.OS cs.AR cs.DB cs.DC cs.NI

    DBOS: A Proposal for a Data-Centric Operating System

    Authors: Michael Cafarella, David DeWitt, Vijay Gadepally, Jeremy Kepner, Christos Kozyrakis, Tim Kraska, Michael Stonebraker, Matei Zaharia

    Abstract: Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state shou…

    Submitted 21 July, 2020; originally announced July 2020.

  13. arXiv:2004.13645  [pdf, other]

    cs.CL

    Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data

    Authors: Alana Marzoev, Samuel Madden, M. Frans Kaashoek, Michael Cafarella, Jacob Andreas

    Abstract: Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets can be the most challenging part of the development process. We address this problem by introducing a general purpose technique for "simulation-to-real" transfer in language understanding problems with a delimited set of target behaviors, making it possible to develop m…

    Submitted 28 April, 2020; originally announced April 2020.

  14. Duoquest: A Dual-Specification System for Expressive SQL Queries

    Authors: Christopher Baik, Zhongjun Jin, Michael Cafarella, H. V. Jagadish

    Abstract: Querying a relational database is difficult because it requires users to know both the SQL language and be familiar with the schema. On the other hand, many users possess enough domain familiarity or expertise to describe their desired queries by alternative means. For such users, two major alternatives to writing SQL are natural language interfaces (NLIs) and programming-by-example (PBE). Both of…

    Submitted 16 March, 2020; originally announced March 2020.

    Comments: Technical Report, 16 pages. Shorter version to be published in SIGMOD 2020

  15. arXiv:1812.07658  [pdf, other]

    cs.DB

    Demonstration of a Multiresolution Schema Mapping System

    Authors: Zhongjun Jin, Christopher Baik, Michael Cafarella, H. V. Jagadish, Yuze Lou

    Abstract: Enterprise databases usually contain large and complex schemas. Authoring complete schema mapping queries in this case requires deep knowledge about the source and target schemas and is thereby very challenging to programmers. Sample-driven schema mapping allows the user to describe the schema mapping using data records. However, real data records are still harder to specify than other useful insi…

    Submitted 18 December, 2018; originally announced December 2018.

    Comments: 4 pages, 5 figures, CIDR 2019

    Journal ref: 9th Biennial Conference on Innovative Data Systems Research (CIDR 2019)

  16. Physical Representation-based Predicate Optimization for a Visual Analytics Database

    Authors: Michael R. Anderson, Michael Cafarella, German Ros, Thomas F. Wenisch

    Abstract: Querying the content of images, video, and other non-textual data sources requires expensive content extraction methods. Modern extraction techniques are based on deep convolutional neural networks (CNNs) and can classify objects within images with astounding accuracy. Unfortunately, these methods are slow: processing a single image can take about 10 milliseconds on modern GPU-based hardware. As m…

    Submitted 27 February, 2019; v1 submitted 11 June, 2018; originally announced June 2018.

    Comments: Camera-ready version of the paper submitted to ICDE 2019, In Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE 2019)

    Journal ref: Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE 2019), 1466-1477

  17. arXiv:1803.00701  [pdf, other]

    cs.DB

    CLX: Towards verifiable PBE data transformation

    Authors: Zhongjun Jin, Michael Cafarella, H. V. Jagadish, Sean Kandel, Michael Minar, Joseph M. Hellerstein

    Abstract: Effective data analytics on data collected from the real world usually begins with a notoriously expensive pre-processing step of data transformation and wrangling. Programming By Example (PBE) systems have been proposed to automatically infer transformations using simple examples that users provide as hints. However, an important usability issue - verification - limits the effective use of such P…

    Submitted 12 August, 2019; v1 submitted 1 March, 2018; originally announced March 2018.

    Comments: 16 pages

  18. Database Learning: Toward a Database that Becomes Smarter Every Time

    Authors: Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, Barzan Mozafari

    Abstract: In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distributio…

    Submitted 28 March, 2017; v1 submitted 15 March, 2017; originally announced March 2017.

    Comments: This manuscript is an extended report of the work published in ACM SIGMOD conference 2017

  19. arXiv:1510.03921  [pdf, other]

    cs.DB

    Visualization-Aware Sampling for Very Large Databases

    Authors: Yongjoo Park, Michael Cafarella, Barzan Mozafari

    Abstract: Interactive visualizations are crucial in ad hoc data exploration and analysis. However, with the growing number of massive datasets, generating visualizations in interactive timescales is increasingly challenging. One approach for improving the speed of the visualization tool is via data reduction in order to reduce the computational overhead, but at a potential cost in visualization accuracy. Co…

    Submitted 23 January, 2017; v1 submitted 13 October, 2015; originally announced October 2015.

    Journal ref: Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE, 2016

  20. arXiv:1506.01461  [pdf, other]

    cs.SI physics.soc-ph

    Link-Prediction Enhanced Consensus Clustering for Complex Networks

    Authors: Matthew Burgess, Eytan Adar, Michael Cafarella

    Abstract: Many real networks that are inferred or collected from data are incomplete due to missing edges. Missing edges can be inherent to the dataset (Facebook friend links will never be complete) or the result of sampling (one may only have access to a portion of the data). The consequence is that downstream analyses that consume the network will often yield less accurate results than if the edges were c…

    Submitted 4 June, 2015; originally announced June 2015.

  21. arXiv:1104.3217  [pdf]

    cs.DB cs.DC

    Automatic Optimization for MapReduce Programs

    Authors: Eaman Jahani, Michael J. Cafarella, Christopher Ré

    Abstract: The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational database to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc.), but existing syste…

    Submitted 16 April, 2011; originally announced April 2011.

    Comments: VLDB2011

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 6, pp. 385-396 (2011)