
DataPerf:

Benchmarks for Data-Centric AI Development

Mark Mazumder1 , Colby Banbury1 , Xiaozhe Yao2 , Bojan Karlaš2 , William Gaviria Rojas3 ,
Sudnya Diamos3 , Greg Diamos4 , Lynn He5 , Douwe Kiela6 , David Jurado7 , David Kanter7 ,
Rafael Mosquera7 , Juan Ciro7 , Lora Aroyo9 , Bilge Acun8 , Sabri Eyuboglu10 , Amirata Ghorbani10 ,
Emmett Goodman10 , Tariq Kane3,9 , Christine R. Kirkpatrick11 , Tzu-Sheng Kuo12 , Jonas Mueller13 ,
Tristan Thrush6 , Joaquin Vanschoren14 , Margaret Warren15 , Adina Williams8 , Serena Yeung10 ,
Newsha Ardalani8 , Praveen Paritosh9 , Ce Zhang2 , James Zou10 , Carole-Jean Wu8 , Cody Coleman3 ,
Andrew Ng4,5,10 , Peter Mattson9 , and Vijay Janapa Reddi1
1 Harvard University, 2 ETH Zurich, 3 Coactive.AI, 4 Landing AI, 5 DeepLearning.AI, 6 Hugging Face,
7 MLCommons, 8 Meta, 9 Google, 10 Stanford University, 11 San Diego Supercomputer Center, UC San Diego,
12 Carnegie Mellon University, 13 Cleanlab, 14 TU Eindhoven, 15 Institute for Human and Machine Cognition

Abstract

Machine learning (ML) research has generally focused on models, while the most
prominent datasets have been employed for everyday ML tasks without regard for
the breadth, difficulty, and faithfulness of these datasets to the underlying problem.
Neglecting the fundamental importance of datasets has caused major problems in-
volving data cascades in real-world applications and saturation of dataset-driven
criteria for model quality, hindering research growth. To solve this problem, we
present DataPerf, a benchmark package for evaluating ML datasets and algorithms for working
with datasets. We intend it to enable the “data ratchet,” in which training
sets will aid in evaluating test sets on the same problems, and vice versa. Such a
feedback-driven strategy will generate a virtuous loop that will accelerate develop-
ment of data-centric AI. The MLCommons Association will maintain DataPerf.

1 Introduction
Machine learning (ML) research has focused more on creating better models than on creating bet-
ter datasets. We have seen massive progress in ML model architectures driven by datasets that
serve as benchmarks to measure model performance. In this way, large public datasets such as Im-
ageNet [Deng et al., 2009], Freebase [Bollacker et al., 2008], Switchboard [Godfrey et al., 1992]
and SQuAD [Rajpurkar et al., 2016] have provided compasses for ML research. Because of an over-
whelming focus on benchmarking model performance, researchers eagerly adopt the largest existing
dataset without fully considering its breadth, difficulty, and fidelity to the underlying problem.
Better data quality and data excellence [Aroyo et al., 2022] are important and becoming increasingly
necessary to avoid data cascades in the real-world. As models leave the lab to enter the wild, they
exhibit performance discrepancies leading to reduced accuracy, persistent fairness and bias issues
[Buolamwini and Gebru, 2018, Denton et al., 2020, Mehrabi et al., 2021], challenges in areas such
as health [Wilkinson et al., 2020], data cascades [Sambasivan et al., 2021] and data reuse [Koch
et al., 2021]—problems often resulting not from the model but the data that trained the models.
In conventional model-centric ML, the term “benchmark” often means a standard, fixed dataset for
model comparisons and performance measurements. For example, ImageNet is a benchmark for
image classification models such as ResNet. Although this paradigm was useful for advancing the
field of model design, prior work shows that such benchmarks are saturating. That is, models are
attaining perfect or “human-level” performance according to these benchmarks [Kiela et al., 2021], as
Figure 1 shows. This saturation raises two questions: First, is ML research making real progress on
the underlying capabilities, or is it just overfitting to the benchmark datasets or suffering from data
artifacts? A growing body of literature explores this question and the evidence supporting benchmark
limitations [Weissenborn et al., 2017, Gururangan et al., 2018, Poliak et al., 2018, Tsuchiya, 2018,
Ribeiro et al., 2018, Belinkov et al., 2019, Geva et al., 2019, Wallace et al., 2019]. Second, how
should benchmarks evolve to push the frontier of ML research?

Figure 1: ML-benchmark saturation relative to human performance (black line) [Kiela et al., 2021].
With these questions in mind, we designed DataPerf to rapidly improve the training and test data
for model benchmarking and, consequently, ML models. Competition between rapidly evolving
ML solutions that combine proprietary models and datasets will drive ML progress. For a specific
problem, multiple submitters develop competing solutions. In parallel, cross-organizational groups
define test sets that serve as challenges for those solutions. Holistically, as ML solutions improve
and converge on a given problem, the test sets will become more complete and harder,
leading to development of even better solutions. We term this constructive feedback-driven com-
petition the “data ratchet.” DataPerf aims to provide fast-moving data ratchets for the most critical
ML problems, such as vision and speech to text. It will aid in developing training sets and test sets
for the same problems. The training sets can aid in evaluation of the test sets and vice versa in an
ongoing cycle. Aside from boosting the ability to build high-quality training and test sets, the data
ratchet also provides favorable system-level consequences. Processing large amounts of data re-
quires tremendous resources [Kuchnik et al., 2022]. Also, as is increasingly apparent, data pipelines
are a crucial bottleneck in model training [Mohan et al., 2020]. Prior work shows that for many
ML jobs, the input data pipeline produces data slower than the models can consume it [Murray et al.,
2021]. But by enhancing data quality, we can reduce the time spent on the data-centric
activities that contribute to the AI tax [Richins et al., 2021, 2020], eventually reducing training times.
To understand the scope, quality and limitations of datasets and accelerate subsequent improve-
ments, this paper defines the DataPerf benchmark suite, which is a collection of tasks, metrics and
rules. We take today’s complex data-centric development pipelines and abstract a subset of concrete
tasks that we believe are the main bottlenecks. Figure 2 illustrates one such pipeline. To develop
high-quality ML applications, a user often relies on a collection of data-centric operations to im-
prove data quality and repeated data-centric iterations to refine these operations strategically, given
the errors a model makes. DataPerf’s goal is to capture the primary stages of such a data-centric
pipeline to improve ML data quality. Benchmark examples include data debugging, data valuation,
training- and test-set creation, and selection algorithms covering a range of ML applications.
DataPerf is a scientific instrument to systematically measure the quality of training and test datasets
on a variety of ML tasks and to measure the quality of algorithms for constructing such datasets.
Its benchmarks, metrics, rules, leaderboards and challenges are designed to accelerate data-centric
AI. The aim is to foster data and model benchmarking so development of an ML solution for one

[Figure 2: pipeline diagram. Data-centric operations (data parsing, data acquisition, data augmentation,
data representation, data cleaning, and data quality assessment) feed training- and testing-data selection
for an ML model; error discovery and debugging drive data-centric iterations that produce new training
and testing data.]

Figure 2: To develop high-quality ML applications, a user often relies on a collection of data-centric
operations to improve data quality and repeated data-centric iterations to refine these operations
strategically, given the errors a model makes. DataPerf aims to benchmark all of the key stages of
such a data-centric pipeline to improve ML data quality.

data-centric problem (e.g., training-set selection) feeds into the development of an ML solution for
another data-centric problem (e.g., test-set selection). We introduce speech and vision benchmarks
designed to evaluate dataset creation and selection. Each one evaluates a specific artifact under test,
such as a dataset or algorithm, using a particular method and metric. These benchmarks build on the
contributions of prior work, such as CATS4ML [Aroyo et al., 2021b], Dynabench [Kiela et al., 2021]
and MLPerf [Reddi et al., 2020, Mattson et al., 2020] and we welcome additional contributions.1

2 DataPerf

DataPerf’s goal is to make building, maintaining and evaluating datasets easier, cheaper, and more
repeatable. This section describes the types of benchmarks DataPerf embodies. The tasks range
from test-set creation to data-valuation algorithms. We will add more benchmarks as they receive
acknowledgment from MLCommons2 , a nonprofit organization supported by more than 50 member
companies and academics to improve ML through benchmarks, open data, and recommendations.
MLCommons hosts the MLPerf benchmark suites [Mattson et al., 2020, Reddi et al., 2020], which
emerged from a collaboration between academia and industry to allow fair and relevant comparisons
of ML systems. Building on this momentum, we want to enable similar progress in ML datasets.

2.1 The Data-centric AI Challenge

The DataPerf movement began with an early benchmark—a data-centric AI challenge—to better un-
derstand how to construct data-centric AI benchmarks. Traditionally, contestants in ML challenges
must train a high-accuracy model given a fixed dataset. This model-centric approach is ubiqui-
tous and has accelerated ML development, but it has neglected the surrounding infrastructure of
real-world ML [Sculley et al., 2015]. To draw more attention to other parts of the ML pipeline,
we flipped the conventional format by creating the Data-Centric AI (DCAI) competition [Ng et al.,
2021], inviting competitors to focus on optimizing accuracy by improving a dataset given a fixed
model. The only constraint was the size of the submitted dataset: submitters received an
initial training dataset to improve through data-centric strategies such as removing inaccurate labels,
adding instances that illustrate edge cases, and using data augmentation. The competition, inspired
by MNIST, focused on classification of Roman-numeral digits. Just by iterating on the dataset, par-
ticipants increased the baseline accuracy from 64.4% to 85.8%; human-level performance (HLP)
was 90.2%. We learned several lessons from the 2,500 submissions and applied them to DataPerf:

1 Join us at dataperf.org.
2 https://www.mlcommons.org/

1. Common data pipelines. Successful entries followed a similar procedure: i) picking seed
photos, ii) augmenting them, iii) training a new model, iv) assessing model errors, and v)
slicing groups of images with comparable mistakes to seed the next round of (i). We believe more competitions
will lead to further establishment and refinement of generalizable and effective practices.
2. Automated methods won. We had anticipated successful participants would discover and
remedy labeling problems, but data selection and augmentation strategies performed best.
3. Novel dataset optimizations. Examples of successful tactics include automated methods
for (i) recognizing noisy images and labels, (ii) identifying mislabeled images, (iii) defin-
ing explicit labeling rules for confusing images, (iv) correcting class imbalance, and (v)
selecting and enhancing images from the long tail of classes. We believe that with the right
set of challenges and ML tasks in place, other novel data-centric optimizations will emerge.
4. New methods emerged. In addition to conventional evaluation criteria (the highest per-
formance on common metrics), we created a separate category which evaluated how in-
novative the technique was. This encouraged participants to explore and introduce novel
systematic techniques with potential impact beyond the leaderboard.
5. New supporting infrastructure is needed. The unconventional competition format neces-
sitated a technology that simultaneously supported a customized competition pipeline and
ample storage and training time. We used CodaLab, which provided the platform customi-
sation required for the Data-centric AI competition. However, it quickly became evident
that platforms and competitions will need to grow complementary functions to specifically
support the unique needs for data-centric AI development. Moreover, the competition was
computationally expensive to run. Therefore, we need a more efficient way to train the
models on data. Computational power, memory, and bandwidth are all major limitations.

These five lessons influenced DataPerf’s design. We intend to publish insights from the DataPerf
challenges and incorporate them into future challenges. By doing so, we can aid the entire industry.

2.2 Benchmark Suite

Building on the lessons from the data-centric AI challenge, we assembled the next major set of tasks
for the benchmark suite. We looked at the entire data-centric development pipeline and abstracted
a subset of concrete tasks that are bottlenecks today. Specifically, these tasks cover the end-to-end
process of engineering data for ML, including both training and test datasets, as Figure 3 shows. The
benchmark tasks include (1) Training set creation, (2) Test set creation, (3) Data selection, (4) Data
debugging, (5) Data valuation, and (6) Slice discovery. In the following sections, we provide a high
level overview of the DataPerf benchmark suite. For the specifics about each benchmark, refer to
Table 1. Also, visit http://github.com/mlcommons/dataperf for the latest updates.

Training Set Creation


Generating data, augmenting data and other data-centric development techniques can transform
small datasets into valuable training sets, but finding the right combination of methods can be
painstaking and error prone. The set of possible techniques results in an open-ended problem with
a combinatorially large set of solutions. The challenge is to create a pipeline that expands a limited
dataset into one that represents the real world. This type of benchmark aims to measure a novel
training dataset by training various models and measuring the resulting accuracy. Most ML com-
petitions ask participants to build a high-accuracy model given a fixed dataset. Here, we invert the
traditional format and ask submitters to improve a dataset given a fixed model.
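As a minimal illustration of this inverted format (a sketch only; the model choice, data shapes, and function names are our assumptions, not the official DataPerf harness), the snippet below freezes the model configuration so that the submitted training set is the only lever a participant controls:

```python
# Sketch of a fixed-model, variable-dataset evaluation: the model and its
# hyperparameters stay frozen, and only the submitted training data changes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def score_training_set(X_train, y_train, X_test, y_test):
    """Train the benchmark-defined model on a submitted training set and report accuracy."""
    model = LogisticRegression(max_iter=1000, random_state=0)  # frozen configuration
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for a submitted training set and a hidden test set.
    X_train, y_train = rng.normal(size=(200, 16)), rng.integers(0, 2, size=200)
    X_test, y_test = rng.normal(size=(100, 16)), rng.integers(0, 2, size=100)
    print(f"score (accuracy): {score_training_set(X_train, y_train, X_test, y_test):.3f}")
```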

Test Set Creation


Conceptually, a Test Dataset benchmark measures a novel test dataset, or adversarial test data, by
evaluating if it is (1) labeled incorrectly by a variety of models, (2) labeled correctly by humans,
and (3) novel relative to other existing test data. The purpose of this type of benchmark is to foster
innovation in the way we sample data for test sets and to discover how data properties influence ML
performance with respect to accuracy, reliability, fairness, diversity and reproducibility. Test dataset
benchmarks measure a set of adversarial test data examples.
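The scoring idea, summarized later in Table 1, can be sketched as follows (an illustration under assumed data structures, not the official evaluation code): an item earns credit only when models mislabel it while humans label it correctly, and that credit is shared across all submissions containing the item.

```python
# Sketch of the test-set-creation metric: an item earns credit only if models
# mislabel it while humans label it correctly, and that credit is shared among
# all submissions that contain the same item.
from collections import Counter

def score_submission(submission, model_preds, human_labels, true_labels, item_counts):
    """submission: item ids; item_counts: how many submissions include each item."""
    score = 0.0
    for item in submission:
        fools_models = model_preds[item] != true_labels[item]     # models get it wrong
        human_correct = human_labels[item] == true_labels[item]   # humans get it right
        if fools_models and human_correct:
            score += 1.0 / item_counts[item]                      # shared credit
    return score

# Toy example: only item "a" qualifies, and it appears in two submissions.
true_labels  = {"a": 1, "b": 0, "c": 1}
model_preds  = {"a": 0, "b": 0, "c": 0}
human_labels = {"a": 1, "b": 0, "c": 0}
item_counts  = Counter({"a": 2, "b": 1, "c": 1})
print(score_submission(["a", "b", "c"], model_preds, human_labels, true_labels, item_counts))  # 0.5
```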

Figure 3: DataPerf design overview illustrating the data-centric ML engineering process. The
middle box represents the model-centric approach to measuring performance; the left and right
boxes are DataPerf’s focus.

Selection Algorithm
Collecting large amounts of data has become straightforward, but creating valuable training sets
from that data can be cumbersome and resource intensive. Naively processing the data wastes valu-
able computational and labeling resources because the data is often redundant and heavily skewed.
This challenge tasks participants with algorithmically identifying and selecting the most informative
examples from a dataset to use for training. The selection-algorithm benchmark then evaluates the
quality of the algorithmic methods for curating datasets (e.g., active learning or core-set selection
techniques) by training a fixed set of models and testing them on held-out data.
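One simple core-set-style baseline, shown purely to make the task concrete (it is not an official DataPerf baseline, and the use of precomputed feature embeddings is an assumption), is greedy k-center selection, which repeatedly adds the example farthest from everything selected so far:

```python
# Sketch of greedy k-center (core-set) selection over feature embeddings.
import numpy as np

def k_center_greedy(X, budget, seed=0):
    """Return indices of `budget` examples that roughly cover the embedding space."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]               # arbitrary starting point
    dists = np.linalg.norm(X - X[selected[0]], axis=1)   # distance to the selected set
    while len(selected) < budget:
        nxt = int(np.argmax(dists))                       # farthest remaining example
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return selected

X = np.random.default_rng(1).normal(size=(1000, 32))      # stand-in embeddings
subset = k_center_greedy(X, budget=50)
print(f"{len(subset)} examples selected for the fixed training pipeline")
```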

Debugging Algorithm
Training datasets can contain data errors [Northcutt et al., 2021, Li et al., 2021], such as missing
or corrupted values and incorrect labels. Repairing these errors is costly and often involves human
labor. It is therefore useful to manage the tradeoff between the cost of repair and the expected
benefits [Koh and Liang, 2017, Northcutt et al., 2017, Karlaš et al., 2022, Ghorbani and Zou, 2019,
Jia et al., 2019]. Given a fixed budget of data examples that can be repaired, the challenge is to
select a subset of training examples that, after repair, yield the biggest performance improvement
in the trained model. The benchmark provides a “dirty” dataset that contains data errors (e.g.,
incorrect labels). The error source is task-specific but can range from random noise to a process
that resembles real-world data pipelines. The task is to select the subset of samples that should be
repaired. These samples are then repaired automatically by replacing them with the original, error-
free samples from a hidden dataset. Once the repairs are applied, we check whether the required
model quality threshold has been reached. If so, we score the debugging algorithm on the basis
of the selected subset size. Otherwise, we increase the number of selected samples and repeat the
process until the threshold is reached.
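The sketch below is one reading of this evaluation loop (the fixed model, step size, and representation of the selection as a priority ranking are assumptions): repair a growing prefix of the ranking, retrain, and stop as soon as the quality threshold is met.

```python
# Sketch of the debugging-benchmark loop: repair the top-ranked samples, retrain
# the fixed model, and grow the repaired subset until the quality threshold is met.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_debugger(ranking, X, y_dirty, y_clean, X_test, y_test, threshold, step=10):
    """`ranking`: training indices ordered most-suspect-first; returns the repair budget used."""
    n_repaired = 0
    while n_repaired <= len(ranking):
        y_repaired = y_dirty.copy()
        fix = np.asarray(ranking[:n_repaired], dtype=int)
        y_repaired[fix] = y_clean[fix]                  # replace with hidden clean labels
        model = LogisticRegression(max_iter=1000).fit(X, y_repaired)
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc >= threshold:                            # threshold reached:
            return n_repaired                           # score by selected-subset size
        n_repaired += step                              # otherwise repair more samples
    return len(ranking)
```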

Valuation Algorithm
Conceptually, there is a data market in which data acquirers and data providers buy and sell datasets.
Assessing the data quality is crucial. Data acquirers need ways to estimate an unlabeled dataset’s
incremental value, before acquisition and labeling, relative to some labeled dataset [Ghorbani and
Zou, 2019, Jia et al., 2019]. This benchmark will measure the quality of an algorithm that estimates
the relative value of a new dataset by measuring the difference between estimated accuracy and
the true accuracy of a model trained on the union of the two datasets. The true accuracy will be
calculated with the true labels during the evaluation of the participants’ algorithms.

Table 1: DataPerf benchmark types.
Benchmark Type | Benchmark Method | Benchmark Metric
Training Set Creation | Replace given training set with a novel training set | Accuracy of models trained on the novel training set
Test Set Creation | Select a fixed number of additional test-data items from the supplemental set | Number of submitted test-data items incorrectly labeled by models and correctly labeled by humans, where credit for each item is divided by the number of submissions containing that item
Selection Algorithm | Replace training set with a subset | Accuracy of models trained on the subset
Debugging Algorithm | Identify labeling errors in a version of the training set that contains some corrupted labels | Accuracy of trained models after identified labels are corrected
Slicing Algorithm | Divide training set into semantically coherent slices | Fraction of data assigned to the correct slice
Valuation Algorithm | Estimate accuracy improvement from training on set A to training on set A + set B (where B lacks labels at time of estimate) | Absolute difference between predicted accuracy and actual accuracy
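A minimal sketch of this valuation metric (with an assumed fixed model; the actual harness and data formats may differ) compares the submitted estimate against the accuracy obtained after training on set A together with the newly labeled set B:

```python
# Sketch of the valuation metric: absolute gap between the estimated accuracy and
# the accuracy of a model actually trained on set A plus the newly labeled set B.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def valuation_error(estimated_acc, X_a, y_a, X_b, y_b_hidden, X_test, y_test):
    """y_b_hidden holds B's true labels, revealed only during evaluation."""
    X_union = np.concatenate([X_a, X_b])
    y_union = np.concatenate([y_a, y_b_hidden])
    model = LogisticRegression(max_iter=1000).fit(X_union, y_union)
    true_acc = accuracy_score(y_test, model.predict(X_test))
    return abs(estimated_acc - true_acc)               # lower is better
```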

Slice Discovery Algorithm

ML models that achieve high overall accuracy often make systematic errors on important data sub-
groups (or slices). For instance, models trained to detect collapsed lungs in chest X-rays have been
shown to make predictions based on the presence of chest drains, a common treatment device. As a
result, these models frequently make prediction errors in cases without chest drains, a critical data
slice where false negatives could be life-threatening. Identifying underperforming slices is challeng-
ing when working with high-dimensional inputs (e.g., images and audio) where data slices are often
unlabeled [Eyuboglu et al., 2022b]. This benchmark evaluates automated slice discovery algorithms
that mine unstructured data for underperforming slices. The benchmark measures how closely the
top-k examples in each slice match the top-k examples in a ground truth slice, and it adds newly
discovered useful slices into the ground truth.
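As a rough illustration of that matching criterion (the benchmark's exact matching and averaging rules may differ), a top-k overlap score can be computed as follows:

```python
# Sketch of a top-k overlap score between a discovered slice and a ground-truth slice.
def topk_overlap(predicted_ids, ground_truth_ids, k=10):
    """Fraction of the top-k predicted slice members found in the top-k ground-truth members."""
    return len(set(predicted_ids[:k]) & set(ground_truth_ids[:k])) / k

# Toy example: 7 of the 10 highest-ranked examples fall in the ground-truth slice.
print(topk_overlap(list(range(10)), list(range(3, 20)), k=10))  # 0.7
```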

2.3 Benchmark Types × Tasks

To capture the wide range of ML uses, the DataPerf suite is a cross-product of the benchmark types
we described previously and specific ML tasks that evaluate the datasets or algorithms under test.
Example tasks include image classification, keyword identification and natural-language processing
(NLP). Table 2 shows the benchmark matrix. We propose such a benchmark matrix for three main
reasons. First, each column embodies a data ratchet for a specific problem in the form of a training
set benchmark and a test set benchmark. Recall that in a data ratchet, the training sets are used to
evaluate the test sets and vice versa so they improve each other in an ongoing cycle. Second, as
long as it is model-agnostic, the same data-centric algorithm can be submitted to all algorithmic
benchmarks in the same row to demonstrate generality. Third, pragmatically, rules and
infrastructure developed to support one benchmark may be leveraged for other challenges.

Table 2: The cross-product matrix of benchmark types and tasks that we will define in DataPerf.
Each cell is a specific benchmark; e.g., the “Training Dataset: Image Classification” benchmark
evaluates a training set constructed for the image-classification ML task.

                      Machine Learning Tasks
Benchmark Types       Image Classification | Roman-Numeral OCR | Keyword Identification | Natural-Language Processing | ...
Training Dataset      ...                  | ...               | ...                    | ...                         | ...
Test Dataset          ...                  | ...               | ...                    | ...                         | ...
Selection Algorithm   ...                  | ...               | ...                    | ...                         | ...
Debugging Algorithm   ...                  | ...               | ...                    | ...                         | ...
Slicing Algorithm     ...                  | ...               | ...                    | ...                         | ...
Valuation Algorithm   ...                  | ...               | ...                    | ...                         | ...
...                   ...                  | ...               | ...                    | ...                         | ...

3 Competitions, Challenges and Leaderboards

DataPerf will use leaderboards and challenges to encourage constructive competition, identify the
best ideas, and inspire the next generation of concepts for building and optimizing datasets. A
leaderboard is a public summary of benchmark results; it helps to quickly identify state-of-the-art
approaches. A challenge is a public contest to achieve the best result on a leaderboard in a fixed time.
Challenges motivate rapid progress through recognition, awards, and/or prizes. We are interested in
benchmarks related to dataset and algorithm quality. We will host the leaderboard and challenges
on an augmented Dynabench system supported by MLCommons.
We provide three example benchmarks (“reference implementations”) to demonstrate how DataPerf
operates. These reference implementations highlight the benchmark’s most important elements but
do not guarantee their optimal performance. Rather, they serve as benchmark-compliant baselines
and hence are a starting point for code reference.

3.1 Selection for Speech

DataPerf v0.5 includes a dataset-selection-algorithm challenge with a speech-centric focus. The
objective of the speech-selection task is to develop a selection algorithm that chooses the most
effective training samples from a vast (and noisy) corpus of spoken words. The selected training
set is then used to train a collection of fixed keyword-detection models, and the algorithm
is evaluated on the basis of the resulting models’ classification accuracy on the evaluation set.

Use-Case Rationale Keyword spotting (KWS) is a ubiquitous speech-classification task that bil-
lions of devices perform. A KWS model detects a limited vocabulary of keywords. Production
examples include the wakeword interfaces for Google Voice Assistant, Siri and Alexa. Because
of growing demand for virtual assistants, the scale of KWS datasets has increased dramatically in
recent years to cover more words in more languages. One such dataset is the Multilingual Spo-
ken Words Corpus [Mazumder et al., 2021b] (MSWC), a large and growing audio dataset of over
340,000 spoken words in 50 languages. Collectively, these languages represent more than five bil-
lion people. The scale of MSWC was the result of automatically extracting word-length audio clips
from crowdsourced data. But owing to the errors in the generation process and in the source data,
some of the samples are incorrect. For instance, they may be too noisy, may be missing part of
the target sample (e.g., “weathe-” instead of “weather”), or may contain part of an adjacent word
(e.g., “time to” instead of “time”). Additionally, since the dataset derives from continuous speech,
classes are imbalanced and roughly follow the frequency distribution in natural language. Widening
the scope of KWS systems to encompass more target words and languages requires automatic
data-selection algorithms.

Figure 4: System design for the DataPerf speech benchmark’s full launch.

Benchmark Definition Participants in this challenge must submit a training-set-selection algorithm.
The algorithm will select the fewest possible data samples to train a suite of five-target
keyword-spotting models. The model suite consists of SVC and logistic-regression classifiers, which
output one of six categories (five target classes and one “unknown” class). The input to the clas-
sifier will be 1,024-dimensional vectors of embedding representations generated from a pretrained
keyword-feature extractor [Mazumder et al., 2021a]. Participants may only define training samples
used by the model suite; all other configuration parameters will remain constant for all submitters to
emphasize the importance of selecting the most-informative samples.
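A sketch of this fixed evaluation is shown below; the scikit-learn classifier settings and the function signature are assumptions for illustration, and `selected_ids` is the only input a submitter controls:

```python
# Sketch of the fixed speech evaluation: the submitter controls only `selected_ids`;
# the classifier suite and the 1,024-dimensional embeddings stay fixed.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_selection(selected_ids, embeddings, labels, X_eval, y_eval):
    """embeddings: (N, 1024) pretrained keyword-feature vectors; labels cover 6 classes."""
    X_train, y_train = embeddings[selected_ids], labels[selected_ids]
    scores = {}
    for name, clf in [("svc", SVC()), ("logreg", LogisticRegression(max_iter=1000))]:
        clf.fit(X_train, y_train)
        scores[name] = accuracy_score(y_eval, clf.predict(X_eval))
    return scores
```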
The benchmark contains a public set of target classes and a concealed set. In conventional model-centric
benchmarking, a concealed test set withholds a number of samples to evaluate a trained model’s
generalizability. In data-selection benchmarking, by contrast, the concealed set is a new set of training
data from which to select; it may have a set of classes distinct from those of the public set. This concealed set de-
fines a new group of target words that may be in a different language than the public test set. As
with standard model-centric test sets, the purpose of data-centric concealed sets is to evaluate the
generalizability of a technique or algorithm.

Benchmark-System Design For the benchmark’s initial beta release, participants must submit
a training set that they have algorithmically selected from the specified training data. The Dyn-
abench platform evaluates this training set automatically [Kiela et al., 2021] and posts the result
to a live scoreboard. In addition, submissions link to the selection algorithm’s implementation to
ensure some early-stage reproducibility. For its full-scale release, DataPerf will allow participants
to directly submit their algorithm for testing of its generalizability. Figure 4 illustrates the speech-
selection benchmark for the beta release. Submitters may develop and iterate their selection algo-
rithm on their own machine using datasets and an evaluation script from DataPerf. Once they have a
satisfactory implementation, they submit a containerized version of their selection algorithm to the
Dynabench server, where the benchmark infrastructure will first rerun the open test set on the selec-
tion container to check whether the submission is valid and then, if so, run the selection algorithm
on the hidden test set to produce an official score. That score will appear on a live leaderboard.
This system design solves two problems we identified in the data-centric AI challenge (Section 2.1).
First, by enabling offline development we can restrict the number of official submissions in a given
timespan, thereby capping the computational requirements for running the benchmark. Second, by
having participants submit their selection algorithm, we can run their submission on a new set of tar-
get words (potentially in a new language) to test for generalizability and to prevent overoptimization
on the public test set.

3.2 Selection for Vision

The selection-for-vision challenge invites participants to design novel data-centric AI techniques for
selecting the data used to train image classifiers.

Use-Case Rationale Large datasets have been critical to many ML achievements, but they create
problems. Massive datasets are cumbersome and expensive to work with, especially when they
contain unstructured data such as images, videos and speech. Careful data selection can mitigate
some of the difficulties by focusing on the most valuable examples. By using a more data-centric
approach that emphasizes quality rather than quantity, we can ease the training of ML models, which
is both costly and time consuming.
The vision-selection-algorithm benchmark evaluates binary classification of visual concepts (e.g.,
“monster truck” or “jean jacket”) in unlabeled images. Familiar production examples of similar
models include automatic labeling services by Amazon Rekognition, Google Cloud Vision API and
Azure Cognitive Services. Successful approaches to this challenge will enable image classification
of long-tail concepts where discovery of high-value data is critical, a major step towards the
democratization of computer vision.

Benchmark Definition The task is to design a data-selection strategy that chooses the best training
examples from a large pool of training images. Imagine, for example, creating a subset of the
Open Images Dataset V6 training set that maximizes the mean average precision (mAP) for a set
of concepts (“cupcake,” “hawk” and “sushi”). Because the images are unlabeled, we provide a set
of positive examples for each classification task that participants can use to search for images that
contain the target concepts.
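One naive selection heuristic, shown purely for illustration (it is not an official baseline, and the use of precomputed image embeddings is an assumption), ranks the unlabeled pool by cosine similarity to the mean embedding of the provided positive examples:

```python
# Sketch of a nearest-to-positives selection heuristic over image embeddings.
import numpy as np

def select_by_similarity(unlabeled_emb, positive_emb, budget):
    """Return indices of the `budget` unlabeled images closest to the positives."""
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    prototype = normalize(positive_emb.mean(axis=0))      # concept prototype
    sims = normalize(unlabeled_emb) @ prototype            # cosine similarity
    return np.argsort(-sims)[:budget]                      # most similar first

rng = np.random.default_rng(0)
chosen = select_by_similarity(rng.normal(size=(5000, 512)),   # unlabeled pool embeddings
                              rng.normal(size=(20, 512)),     # provided positive examples
                              budget=100)
print(chosen[:5])
```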

Benchmark-System Design For the initial release, participants must submit a training set for
each classification task in addition to a description of the selection method they used to obtain the
training sets. The training sets will undergo automatic evaluation on the Dynabench servers. For the
benchmark’s full-scale release, participants must submit their algorithm in the same manner as for
the speech-selection benchmark (Section 3.1).

3.3 Debugging for Vision

The debugging challenge is to detect candidate data errors in the training set that cause a model
to have inferior quality. The aim is to assist a user in prioritizing which data to inspect, correct
and clean. A debugging method’s purpose is to identify the most detrimental data points from a
potentially noisy training set. After inspecting and correcting the selected data points, the cleaned
dataset is used to train a new classification model. Evaluation of the debugging approach is based
on the number of data points it must correct to attain a certain accuracy.

Use-Case Rationale In recent years, the size of ML datasets has exploded. The Open Images
Dataset V6, for instance, has 59 million image-level labels. Such datasets are annotated either
manually or using ML. Unfortunately, noise is unavoidable and can originate from both human
annotators and algorithms. ML models trained on these noisy datasets will lose accuracy in addition
to facing other dangers such as unfairness.
Even though dataset cleaning is a solution, it is costly and time-consuming and typically involves
human review. Consequently, examining and sanitizing the full dataset is often impractical. A
data-centric method that focuses human attention and cleaning efforts on the most important data
elements is necessary to save time, money and labor.

Benchmark Definition The debugging task is based on binary image classification. For each task,
participants receive a noisy training set (i.e., some labels are inaccurate) and a validation set
with correct labels. They must provide a debugging approach that assigns a priority value (harmfulness)
to each training-set item. Items are then examined and rectified in priority order, so that by the end of
a trial all training data will have been examined and rectified. Each time a new item is examined, a
classification model is trained on the dataset cleaned so far, and its accuracy on a hidden test set is
computed. The challenge then returns a score indicating the algorithm’s effectiveness.

The image datasets are from the Open Images Dataset [Kuznetsova et al., 2020], with two important
considerations: The first is that the number of data points should be sufficient to permit random
selection of samples for the training, validation and test sets. The second consideration is that the
number of discrepancies between the machine-generated label and the human-verified label varies
by task; the challenges thus reflect varying classification complexity. We therefore introduce two
types of noise into the training set’s human-verified labels: (1) some labels are arbitrarily inverted, and
(2) machine-generated labels are substituted for some human-verified labels to imitate the noise from
algorithmic labeling.
We use a 2,048-dimensional vector of embedding representations built by a pretrained image-feature
extractor as the classifier’s input data. Participants may only prioritize each training sample used by
the classifier; all other configurations are fixed for all submissions.
A concealed test set is used to evaluate the performance of the trained classification model on each
task. In contrast to prior challenges, the statistic used to evaluate each submission is not the test set’s
accuracy. The objective of the debugging challenge is to determine which debugging method pro-
duces sufficient accuracy while analyzing the fewest data points. Therefore, the assessment metric
for the debugging challenge is defined as the proportion of inspections required for a submission to
achieve 95% of the accuracy of the classifier trained on the cleaned training set.
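A compact sketch of this metric follows; the retraining callback is an assumption that stands in for the fixed benchmark classifier described above:

```python
# Sketch of the debugging metric: the fraction of items that must be inspected, in
# the submitted priority order, before accuracy reaches 95% of the fully cleaned accuracy.
def inspection_fraction(priority, retrain_and_score, clean_accuracy):
    """`retrain_and_score(inspected_ids)` retrains the fixed classifier and returns test accuracy."""
    target = 0.95 * clean_accuracy
    for n_inspected in range(len(priority) + 1):
        if retrain_and_score(priority[:n_inspected]) >= target:
            return n_inspected / len(priority)          # lower is better
    return 1.0
```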

Benchmark System Design Participants of this challenge develop and validate their algorithms
on their own machines using the dataset and evaluation framework provided by DataPerf. Once they
are satisfied with their implementation, they submit a containerized version to the server, similar
to the speech example (Section 3.1). On the server, the uploaded implementation will be rerun on
several hidden tasks, and the average score will be posted to a leaderboard.

4 Open and Closed Divisions


DataPerf provides two submission categories for challenge and leaderboard participants: the open
division and the closed division. In the open division, we want to see creative ideas that use any
algorithm to tackle a problem, including solutions that involve a manual component (i.e., human-in-the-loop).
In the closed division, we seek to ensure rigorous reproducibility and an equitable comparison of
similar approaches. For instance, in the selection-algorithm challenge (Section 2.2), submitters in the
open division must include a description of the technique they developed along with their training set,
and we then evaluate the training set’s quality. In the closed division, participants submit a description
of their technique, the training set they created, and the code they created or used to construct the
training set. We then review the results and rate the training set. In addition, the submitted code is
executed to validate that it indeed creates the specified training set. Consequently, the closed division
does not allow solutions with a manual component. These two categories accommodate submitters who
may not wish to divulge their methods because they contain proprietary logic as well as those in the
open-source community who wish to share their solutions and facilitate rapid adoption.

5 Technologies Leveraged and Related Research


Datasets and benchmarks are crucial for the development of machine learning methods but also
require their own publishing and reviewing guidelines, such as proper descriptions of how the data
was collected, whether they show intrinsic bias, and whether they will remain accessible. For bench-
marks, reproducibility, interpretability and design are key factors that require careful consideration.
One goal of DataPerf is to inspire and enable the evaluation of rapidly evolving submissions to this
track. DataPerf builds upon prior benchmarks, challenges, tracks, workshops, and organizations,
described in more detail below, and adopts several of the best practices from these prior works.
There is a strong and growing focus on datasets and benchmarks in the ML community. Burgeoning
research venues34 have emerged to focus on the science and engineering of data for machine learning
[Aroyo et al., 2021a, Raji et al., 2021, Paullada et al., 2021]. These communities need support and
better tools for evolving the datasets upon which the whole field depends. In 2021, the conference on
Neural Information Processing Systems (NeurIPS) launched the new Datasets and Benchmarks track5
as a novel venue for exceptional work in creating high-quality datasets, insightful benchmarks, and
discussions on how to improve dataset development and data-oriented work more broadly.

3 https://www.eval.how/, https://sites.google.com/view/sedl-workshop
4 https://datacentricai.org/
DCBench [Eyuboglu et al., 2022a] is a benchmark for algorithms used to construct and analyze
machine learning datasets. It comprises a diverse set of tasks, each corresponding to one step in
a broader machine learning pipeline. For example, machine learning practitioners will commonly
spend some time cleaning input features, so DCBench includes a task for algorithms that select
training data points for cleaning. The tasks in DCBench are supported by a standard Python API that
facilitates downloading problems, running evaluations, and comparing methods. DCBench provides
the basis for the debugging algorithm and slicing algorithm benchmarks used in DataPerf.
The Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML) Data Challenge [Aroyo
et al., 2021b] aims to raise the bar for ML evaluation sets and to find as many examples as possible
that are confusing or otherwise problematic for algorithms to process, starting with image classifi-
cation. Many evaluation datasets contain items that are easy to evaluate, e.g., photos with a subject
that is easy to identify. Thus they miss the natural ambiguity of real-world context. The absence of
ambiguous real-world examples in evaluation undermines the ability to test machine learning per-
formance reliably, which makes ML models prone to developing “weak spots.” CATS4ML relies
on human skills and intuition to spot new data examples about which ML is confident but misclas-
sified. An open CATS4ML challenge that asks participants to submit misclassified samples from
the Google Open Images dataset was unveiled at HCOMP 20206 . Participants generated 15,000
adversarial examples, two thirds of which were independently validated to fool state-of-the-art
computer vision algorithms into making incorrect predictions. While performance on standard test
splits approaches 90%, the performance of the state-of-the-art on the CATS4ML adversarial test set
is 0%, by design. We use this very hard test set to motivate participants to build training datasets
that could help improve performance on these adversarial examples. This is an example of the data
ratchet (Section 1), where one challenge leads to the improvement of another benchmark task.
An interesting side consequence of the CATS4ML challenge was that, because it required humans to
study photographs and their proposed target labels so closely, far more information about the nature
of the original training data was exposed to organizers and participants. As vision categorization has
become more advanced, it has become much more apparent that models frequently produce ambiguous,
redundant, and seemingly conflicting classifications with almost equal confidence, through no fault of
the models themselves. Thus the challenge, while producing interesting evaluation data, also showed
that new advances in the field will come less from iterative refinement of models than from a much
more serious, data-centric approach to the quality of the training data.
Dynabench7 is the platform we use in DataPerf for dynamic data collection and benchmarking that
challenges existing ML benchmarking dogma by embracing dynamic dataset generation [Kiela et al.,
2021]. Benchmarks for machine learning solutions based on static datasets have well-known issues:
they saturate quickly, are susceptible to overfitting, contain exploitable annotator artifacts and have
unclear or imperfect evaluation metrics. In essence, the Dynabench platform is a scientific experi-
ment: is it possible to make faster progress if data is collected dynamically, with humans and models
in the loop, rather than in the old-fashioned static way? DataPerf adopts Dynabench as the underly-
ing platform that facilitates our leaderboards and challenges, as described previously.
To ensure innovations in academia translate into real-world impact, systems researchers traditionally
rely on benchmarking for everything from CPU speed to a cell phone’s battery life. MLPerf [Mattson
et al., 2020, Reddi et al., 2020], DAWNBench [Coleman et al., 2017] and similar benchmarks [Gao
et al., 2018, Zhu et al., 2018, Tang et al., 2021] have laid the groundwork for industry-strength
standards. Enabling conversion of ML improvements in the lab into solutions that work in the real
world requires similar standards for measuring dataset quality as well as algorithms for curating the
datasets. The MLPerf benchmark suites [Mattson et al., 2020, Reddi et al., 2020] closely parallel
what we are trying here with DataPerf. MLPerf defines clear rules for measuring speed and power
consumption across various systems, spanning from datacenter-scale ML systems that consume
megawatts of power to tiny embedded ML systems that consume only microwatts of power. The
MLPerf benchmark suites jump-started a virtuous cycle of healthy and transparent competition that
drove rapid improvements in ML performance. The MLCommons consortium hosts the DataPerf
working group to evolve the benchmarks and supports the leaderboards and challenges.

5 https://blog.neurips.cc/2021/04/07/announcing-the-neurips-2021-datasets-and-benchmarks-track/
6 https://ai.googleblog.com/2021/02/uncovering-unknown-unknowns-in-machine.html
7 https://dynabench.org/

6 Conclusion and Future Work


The purpose of DataPerf is to improve machine learning by expanding the horizon of AI research
from models to datasets. Through DataPerf we seek to enhance best practices for the development
of ML datasets. Benchmarking datasets systematically is vital because what gets measured gets
improved. DataPerf seeks to measure the quality of training and test datasets for a variety of machine
learning application cases. In addition, it permits us to advance the quality of the procedures used to
produce such datasets. Through leaderboards and challenges, DataPerf encourages the community to
address real-world data problems. Community members are urged to join DataPerf in order to shape
the future of AI and ML. To get involved with DataPerf, please visit https://dataperf.org.

References
L. Aroyo, M. Lease, P. Paritosh, and M. Schaekermann. Data excellence for ai: Why should you
care. arXiv preprint arXiv:2111.10391, 2021a.
L. Aroyo, P. Paritosh, S. Ibtasam, D. Bansal, K. Rong, and K. Wong. Adversarial test set for image
classification: Lessons learned from cats4ml data challenge. Under review, 2021b.
L. Aroyo, M. Lease, P. Paritosh, and M. Schaekermann. Data excellence for ai: why should you
care? Interactions, 29(2):66–69, 2022.
Y. Belinkov, A. Poliak, S. Shieber, B. Van Durme, and A. Rush. Don’t take the premise for
granted: Mitigating artifacts in natural language inference. In Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, pages 877–891, Florence, Italy, July
2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1084. URL https:
//aclanthology.org/P19-1084.
K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created
graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD
international conference on Management of data, pages 1247–1250, 2008.
J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial
gender classification. In Conference on fairness, accountability and transparency, pages 77–91.
Proceedings of Machine Learning Research, 2018.
C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and
M. Zaharia. Dawnbench: An end-to-end deep learning benchmark and competition. Training,
100(101):102, 2017.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages
248–255. Ieee, 2009.
E. Denton, A. Hanna, R. Amironesei, A. Smart, H. Nicole, and M. K. Scheuerman. Bringing the
people back in: Contesting benchmark machine learning datasets. abs/2007.07399, 2020. URL
https://arxiv.org/abs/2007.07399.
S. Eyuboglu, B. Karlaš, C. Ré, C. Zhang, and J. Zou. Dcbench: A benchmark for data-centric ai
systems. In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine
Learning, DEEM ’22, New York, NY, USA, 2022a. Association for Computing Machinery. ISBN
9781450393751. doi: 10.1145/3533028.3533310. URL https://doi.org/10.1145/
3533028.3533310.
S. Eyuboglu, M. Varma, K. Saab, J.-B. Delbrouck, C. Lee-Messer, J. Dunnmon, J. Zou, and C. Ré.
Domino: Discovering systematic errors with cross-modal embeddings, 2022b. URL https:
//arxiv.org/abs/2203.14960.

W. Gao, C. Luo, L. Wang, X. Xiong, J. Chen, T. Hao, Z. Jiang, F. Fan, M. Du, Y. Huang, et al.
Aibench: towards scalable and comprehensive datacenter ai benchmarking. In International
Symposium on Benchmarking, Measuring and Optimization, pages 3–9. Springer, 2018.
M. Geva, Y. Goldberg, and J. Berant. Are we modeling the task or the annotator? an investigation
of annotator bias in natural language understanding datasets. arXiv preprint arXiv:1908.07898,
2019.
A. Ghorbani and J. Zou. Data shapley: Equitable valuation of data for machine learning. In
International Conference on Machine Learning, pages 2242–2251, 2019.
J. Godfrey, E. Holliman, and J. McDaniel. Switchboard: telephone speech corpus for research and
development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics,
Speech, and Signal Processing, pages 517–520, 1992.
S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith. An-
notation artifacts in natural language inference data. In Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana,
June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2017. URL
https://aclanthology.org/N18-2017.
R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gürel, B. Li, C. Zhang, D. Song, and
C. J. Spanos. Towards efficient data valuation based on the shapley value. In K. Chaudhuri and
M. Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial
Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages
1167–1176. PMLR, 16–18 Apr 2019. URL https://proceedings.mlr.press/v89/
jia19a.html.
B. Karlaš, D. Dao, M. Interlandi, B. Li, S. Schelter, W. Wu, and C. Zhang. Data debugging with
shapley importance over end-to-end machine learning pipelines, 2022. URL https://arxiv.
org/abs/2204.11131.
D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh,
P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts,
and A. Williams. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 4110–4124, Online, June 2021. Association for Computa-
tional Linguistics. doi: 10.18653/v1/2021.naacl-main.324. URL https://aclanthology.
org/2021.naacl-main.324.
B. Koch, E. Denton, A. Hanna, and J. G. Foster. Reduced, reused and recycled: The life of a dataset
in machine learning research, 2021.
P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In D. Precup
and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning,
volume 70 of Proceedings of Machine Learning Research, pages 1885–1894. PMLR, 06–11 Aug
2017. URL https://proceedings.mlr.press/v70/koh17a.html.
M. Kuchnik, A. Klimovic, J. Simsa, V. Smith, and G. Amvrosiadis. Plumber: Diagnosing and
removing performance bottlenecks in machine learning data pipelines. Proceedings of Machine
Learning and Systems, 4:33–51, 2022.
A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov,
M. Malloci, A. Kolesnikov, et al. The open images dataset v4. International Journal of Computer
Vision, 128(7):1956–1981, 2020.
P. Li, X. Rao, J. Blase, Y. Zhang, X. Chu, and C. Zhang. Cleanml: A study for evaluating the im-
pact of data cleaning on ml classification tasks. In 2021 IEEE 37th International Conference on
Data Engineering (ICDE), pages 13–24, Los Alamitos, CA, USA, apr 2021. IEEE Computer Soci-
ety. doi: 10.1109/ICDE51399.2021.00009. URL https://doi.ieeecomputersociety.
org/10.1109/ICDE51399.2021.00009.

P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei,
P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang,
D. Kang, D. Kanter, N. Kumar, J. Liao, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost,
V. Janapa Reddi, T. Robie, T. St John, C.-J. Wu, L. Xu, C. Young, and M. Zaharia. Mlperf training
benchmark. In Proceedings of Machine Learning and Systems, volume 2, 2020.
M. Mazumder, C. Banbury, J. Meyer, P. Warden, and V. J. Reddi. Few-shot keyword spotting in any
language. arXiv preprint arXiv:2104.01454, 2021a.
M. Mazumder, S. Chitlangia, C. Banbury, Y. Kang, J. M. Ciro, K. Achorn, D. Galvez, M. Sabini,
P. Mattson, D. Kanter, et al. Multilingual spoken words corpus. In Thirty-fifth Conference on
Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021b.
N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in
machine learning. ACM Computing Surveys (CSUR), 54(6):1–35, 2021.
J. Mohan, A. Phanishayee, A. Raniwala, and V. Chidambaram. Analyzing and mitigating data stalls
in dnn training. arXiv preprint arXiv:2007.06775, 2020.
D. G. Murray, J. Simsa, A. Klimovic, and I. Indyk. tf. data: A machine learning data processing
framework. arXiv preprint arXiv:2101.12127, 2021.
A. Ng, L. He, and D. Laird. Data-Centric AI Competition, 2021. URL https://
https-deeplearning-ai.github.io/data-centric-comp/.
C. G. Northcutt, T. Wu, and I. L. Chuang. Learning with confident examples: Rank pruning for ro-
bust classification with noisy labels, 2017. URL https://arxiv.org/abs/1705.01936.
C. G. Northcutt, A. Athalye, and J. Mueller. Pervasive label errors in test sets destabilize machine
learning benchmarks. In Proceedings of the 35th Conference on Neural Information Processing
Systems Track on Datasets and Benchmarks, December 2021.
A. Paullada, I. D. Raji, E. M. Bender, E. Denton, and A. Hanna. Data and its (dis)contents: A
survey of dataset development and use in machine learning research. Patterns, 2(11):100336,
2021. ISSN 2666-3899. doi: https://doi.org/10.1016/j.patter.2021.100336. URL https://
www.sciencedirect.com/science/article/pii/S2666389921001847.
A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme. Hypothesis only baselines
in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and
Computational Semantics, pages 180–191, New Orleans, Louisiana, June 2018. Association for
Computational Linguistics. doi: 10.18653/v1/S18-2023. URL https://aclanthology.
org/S18-2023.
I. D. Raji, E. M. Bender, A. Paullada, E. Denton, and A. Hanna. AI and the everything in the whole
wide world benchmark. abs/2111.15366, 2021. URL https://arxiv.org/abs/2111.
15366.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine compre-
hension of text. arXiv preprint arXiv:1606.05250, 2016.
V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson, M. Breughe,
M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick,
J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar, D. Lee, J. Liao,
A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan,
D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu,
G. Yuan, A. Zhong, P. Zhang, and Y. Zhou. Mlperf inference benchmark. In Proceedings of the
ACM/IEEE Annual International Symposium on Computer Architecture, 2020.
M. T. Ribeiro, S. Singh, and C. Guestrin. Semantically equivalent adversarial rules for de-
bugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 856–865, Melbourne, Australia,
July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1079. URL
https://aclanthology.org/P18-1079.

D. Richins, D. Doshi, M. Blackmore, A. T. Nair, N. Pathapati, A. Patel, B. Daguman, D. Dobri-
jalowski, R. Illikkal, K. Long, et al. Missing the forest for the trees: End-to-end ai application
performance in edge data centers. In 2020 IEEE International Symposium on High Performance
Computer Architecture (HPCA), pages 515–528. IEEE, 2020.
D. Richins, D. Doshi, M. Blackmore, A. T. Nair, N. Pathapati, A. Patel, B. Daguman, D. Dobri-
jalowski, R. Illikkal, K. Long, et al. Ai tax: The hidden cost of ai data center applications. ACM
Transactions on Computer Systems (TOCS), 37(1-4):1–32, 2021.
N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, and L. M. Aroyo. “everyone wants
to do the model work, not the data work”: Data cascades in high-stakes ai. In proceedings of the
2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.
D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F.
Crespo, and D. Dennison. Hidden technical debt in machine learning systems. Advances in neural
information processing systems, 28, 2015.
F. Tang, W. Gao, J. Zhan, C. Lan, X. Wen, L. Wang, C. Luo, Z. Cao, X. Xiong, Z. Jiang,
et al. Aibench training: Balanced industry-standard ai training benchmarking. In 2021 IEEE
International Symposium on Performance Analysis of Systems and Software (ISPASS), pages
24–35. IEEE, 2021.
M. Tsuchiya. Performance impact caused by hidden bias of training data for recognizing textual
entailment. In Proceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Associa-
tion (ELRA). URL https://aclanthology.org/L18-1239.
E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh. Universal adversarial triggers for
attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China, Nov. 2019. Association for
Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.
org/D19-1221.
D. Weissenborn, G. Wiese, and L. Seiffe. Making neural QA as simple as possible but not sim-
pler. In R. Levy and L. Specia, editors, Proceedings of the 21st Conference on Computational
Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 271–
280. Association for Computational Linguistics, 2017. doi: 10.18653/v1/K17-1028. URL
https://doi.org/10.18653/v1/K17-1028.
J. Wilkinson, K. F. Arnold, E. J. Murray, M. van Smeden, K. Carr, R. Sippy, M. de Kamps, A. Beam,
S. Konigorski, C. Lippert, et al. Time to reality check the promises of machine learning-powered
precision medicine. The Lancet Digital Health, 2020.
H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhimenko. Tbd:
Benchmarking and analyzing deep neural network training. arXiv preprint arXiv:1803.06905,
2018.

