People, computers, and the hot mess of real data

JM Hellerstein - Proceedings of the 22nd ACM SIGKDD International …, 2016 - dl.acm.org
In practice, end-to-end data analysis is rarely a cleanly engineered process. Acquiring data can be tricky. Data assessment, wrangling and feature extraction are time-consuming and subjective. Models and algorithms used to derive data products are highly contextualized by time-varying properties of data sources, code and application needs. All of these issues would ideally benefit from an organizational view, but are often driven by individual users. Viewed holistically, both agile analytics and the establishment of analytic pipelines involve interactions between people, computation and infrastructure. In this talk I'll share some anecdotes from our research, user studies, and field experience with companies (Trifacta, Captricity), as well as an emerging open-source project (Ground).