DEW2020: Data Excellence Workshop 2020
Virtual, Amsterdam, Netherlands, October 26, 2020
Conference website: http://eval.how/dew2020
Submission link: https://easychair.org/conferences/?conf=dew2020
Abstract registration deadline: October 15, 2020
Submission deadline: October 15, 2020
Human-annotated data is crucial for operationalizing empirical ways of evaluating, comparing, and assessing the progress of ML/AI research. As human-annotated data represents the compass that the entire ML/AI community relies on, the human computation (HCOMP) research community has a multiplicative effect on the progress of the field. Optimizing the cost, size, and speed of collecting data has attracted significant attention from HCOMP and related research communities. In the first-to-market rush with data, aspects of maintainability, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection.
Panos Ipeirotis [Ipeirotis, 2010], one of the founders of the HCOMP research community, warned us that crowdsourced data collection platforms had the structural properties of a market for lemons [Akerlof, 1970]. Due to uncertainty about the notion of quality, the market focuses on price, resulting in an equilibrium state where the good sellers are priced out of the market and only lemons remain. The focus on the scale, speed, and cost of building datasets has come at the expense of quality, which is nebulous and often circularly defined, since the annotators are the source of both the data and the ground truth [Riezler, 2014]. While dataset quality is still everyone's top concern, the ways in which it is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], and a lack of documentation and replication of data [Katsuno et al., 2019].
Currently, data excellence happens organically, thanks to appropriate support, expertise, diligence, commitment, pride, community, etc. However, data excellence is more than maintaining a minimum standard for the ways in which we collect, publish, or assess our data. It's a metascientific enterprise of recognizing what's important in the long term for science.
Goal of the Workshop
Decades of catastrophic failures in high-stakes software projects (e.g., explosions of billion-dollar spacecraft in the 1960s and 1990s due to floating-point overflows and missing hyphens in source code) have burnt the vital importance of upfront investments in software engineering excellence into our collective memory. It was through careful post-hoc analysis of these kinds of disasters that software engineering matured as a field and achieved a more robust understanding of the costs and benefits that come with processes like systematic code reviews, standards like coding conventions and design patterns, infrastructure for debugging and experimentation, as well as incentive structures that prioritize careful quality control over hasty roll-outs.
With the rise of artificial intelligence, human-labeled data has increasingly become the fuel and compass of AI-based software systems. However, an analogous framework for excellence in data engineering does not yet exist, raising the risk that similarly disastrous catastrophes will arise from the use of inadequate datasets in AI-based software systems.
This workshop aims to leverage lessons learned from decades of excellence in software engineering to inspire and enable an analogous framework for data excellence in AI systems.
The outcomes of this workshop will be:
- defining properties and metrics of data excellence
- gathering and reviewing various case studies of data excellence and data catastrophes
- building a catalog of best practices and incentive structures for data excellence
- discussing the cost-benefit trade-off for investments in data excellence
- gathering and reviewing empirical and theoretical methodologies for the reliability, validity, maintainability, and fidelity of data
Submission Guidelines
We invite authors to submit short position papers (2 pages) addressing one or more of the topic areas. Papers will be peer-reviewed by the program committee, and accepted papers will be presented as lightning talks during the workshop.
List of Topics
- Maintainability: Maintaining data at scale, e.g., the Knowledge Graph [Bollacker et al., 2008], poses challenges similar to maintaining software, or potentially knowledge engineering, at scale. What lessons for maintaining data at scale could we learn or adapt from software and knowledge engineering at scale? Data engineering often refers to data munging tasks, yet it is far more important and challenging than has been appreciated thus far. The fastest way to collect data isn't the most maintainable or reusable. By analogy with software, data should be well documented [Gebru et al., 2018], have clear owners/maintainers, should not fork or replicate existing data without a documented rationale, and should follow shared best practices.
- Reliability: Reliability captures internal aspects of the validity of the data, such as consistency, replicability, and reproducibility. Irreproducible data lets us draw whatever conclusions we want to draw, while giving us the facade of being data-driven, when it is dangerously hunch-driven. So surely we want data to be reproducible [Paritosh, 2012]. A minimal sketch of one common consistency check, inter-annotator agreement, follows this list.
- Validity: Validity tells us how well the data helps us explain things related to the phenomena it captures, e.g., via correlation between the data and external measures. Education research explores whether grades are valid by studying their correlations with external indicators of student outcomes. One thing is clear: you cannot validate data by itself!
- Fidelity: Users of data often assume that a dataset accurately and comprehensively represents the phenomenon, which is almost never the case. For example, a sentiment corpus whose sentences were sampled from Wikipedia might not work as well for other languages or for different types of text, such as news articles. Various ways of sampling from larger corpora can affect the fidelity of a dataset: temporal splitting can introduce bias if not done right in cases such as signal processing or sequential learning, and user-based splitting that does not keep each user's data separated is another potential source of bias, e.g., when data from the same user appears in both the training and test sets (see the splitting sketch after this list). Additionally, the choice of a specific grounding in prior research can also influence the fidelity of a dataset, specifically with respect to the metrics used.
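As an illustration of the reliability topic above, the following minimal sketch quantifies annotation consistency with raw agreement and Cohen's kappa. It assumes scikit-learn is available, the two annotators' labels are hypothetical, and it is only one common way to check consistency, not a prescribed workflow.

    # Illustrative only: quantifying annotation consistency, one facet of
    # reliability, via inter-annotator agreement. Labels are hypothetical.
    from sklearn.metrics import cohen_kappa_score

    # Two annotators labeling the same ten items (e.g., sentiment).
    annotator_a = ["pos", "neg", "neu", "pos", "pos", "neg", "neu", "pos", "neg", "neu"]
    annotator_b = ["pos", "neg", "pos", "pos", "neu", "neg", "neu", "pos", "neg", "neg"]

    # Raw agreement: fraction of items on which the annotators agree.
    raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
    # Cohen's kappa corrects raw agreement for agreement expected by chance.
    kappa = cohen_kappa_score(annotator_a, annotator_b)

    print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")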
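Similarly, for the fidelity point about user-based splitting, here is a minimal sketch of a group-aware train/test split that keeps all of a user's examples on one side of the boundary. The toy examples, labels, and user ids are hypothetical, and scikit-learn's GroupShuffleSplit is just one way to implement such a split.

    # Illustrative only: a group-aware split that prevents data from the same
    # (hypothetical) user from landing in both the training and test sets.
    from sklearn.model_selection import GroupShuffleSplit

    examples = [f"text_{i}" for i in range(8)]
    labels = [0, 1, 0, 1, 1, 0, 1, 0]
    users = ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"]  # author of each example

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(splitter.split(examples, labels, groups=users))

    # No user appears on both sides, so per-user quirks cannot leak into evaluation.
    assert {users[i] for i in train_idx}.isdisjoint({users[i] for i in test_idx})
    print("train users:", sorted({users[i] for i in train_idx}))
    print("test users: ", sorted({users[i] for i in test_idx}))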
Committees
Program Committee (TBC)
Organizing Committee
- Praveen Paritosh, Google
- Matthew Lease, Amazon and UT Austin
- Mike Schaekermann, Amazon and University of Waterloo
- Lora Aroyo, Google
Invited Speakers (TBC)
Publication
We plan to publish (upon the authors' agreement) a summary of the DEW2020 workshop, including the position statements and keynotes, in the AAAI AI Magazine.
Venue
The workshop is co-located with the HCOMP 2020 conference.
Contact
All questions about submissions should be emailed to the organizers.