The Lean Data Scientist: Recent Advances Toward Overcoming the Data Bottleneck

A taxonomy of the methods used to obtain quality datasets enhances existing resources.

By Chen Shani, Jonathan Zarecki, and Dafna Shahaf

Posted Feb 1 2023

Obtaining data has become the key bottleneck in many machine-learning (ML) applications. The rise of deep learning has further exacerbated this issue. Although high-quality ML models are finally making the transition from expensive-to-develop, highly specialized code to something more like a commodity, these models involve millions (or even billions) of parameters and require massive amounts of data to train. Thus, the dominant paradigm in ML today is to create a new (large) dataset whenever facing a novel task. In fact, there are now entire conferences dedicated to the creation of new data resources (for example, the International Conference on Language Resources and Evaluation or resource papers at CIKM).

Key Insights

Recent machine learning algorithms are increasingly data-hungry. A widespread approach is to construct large, task-specific datasets, which is inefficient and sometimes infeasible.
Many ways to tackle this data bottleneck problem have been proposed, but they are scattered across different subfields.
We present a practitioner-centric taxonomy of these methods. We distill each method’s main assumptions and explain when it is useful, in hopes of encouraging more efficient use of resources as well as uncovering novel research directions.

While this approach resulted in significant advances, it suffers from a major caveat, as collecting large, high-quality datasets is often very demanding in terms of time and human resources. For several tasks, such as rare disease detection, large datasets are nearly infeasible to construct.

Although many workarounds to this data-bottleneck problem have been suggested, they are scattered across many different subfields, which are often unaware of one another. There exist many method-specific and domain-specific surveys, but broader, big-picture surveys are difficult to find. The closest in spirit to our work is Roh et al.,³³ which focuses more on the data management point of view and the early stages of the pipeline.

In this article, we aim to bring order to this area. Our main contribution is a simple yet comprehensive taxonomy of ways to tackle the data bottleneck. We survey major research directions and organize them into a taxonomy in a way designed to be useful for practitioners choosing between different approaches. The emphasis here is not on covering methods in depth; rather, we discuss the main ideas behind various methods, the assumptions they make and their underlying concepts. For each topic, we mention several important or interesting works, and refer the interested reader to surveys where possible.

We wish to first raise awareness of the methods that already exist, to encourage more efficient use of data. In addition (and perhaps more importantly), we hope the organization of the taxonomy will also reveal gaps in current techniques and suggest novel directions of research that could inspire the creation of new, less data-hungry learning methods.

Taxonomy

A note on scope. The data-bottleneck problem is widespread across the field of machine learning. It is especially crucial in supervised learning but applies to unsupervised paradigms as well. In this work we focus on the supervised, unsupervised, and semi-supervised settings. Reinforcement learning is generally beyond the scope of this article, although some of the methods we present are applicable to it.

We start with a high-level view of our taxonomy, depicted in Figure 1. We first make the distinction between cases where data (X) is hard to collect, and cases where labels (Y) pose the difficulty. For example, collecting a dataset of patients with rare diseases is challenging due to the condition’s rarity. In contrast, it is relatively easy to collect a large dataset of unlabeled images for an image segmentation task, but annotation is slow and costly.

Figure 1. Flowchart of the taxonomy for ways to tackle the data bottleneck.

If obtaining data is the main obstacle, we identify three major approaches:

Add examples: Generate more examples from available data (for example, through data augmentation).
Use additional information on existing data: Increase the dimensionality of X in a manner that can assist the learner (for example, curriculum learning).
Use models encoding relevant knowledge: Instead of learning from scratch, take advantage of models trained in a different yet relevant setup (for example, transfer learning).

If unlabeled data is abundant but labels are difficult to obtain, we identify two main approaches:

Acquire labels efficiently: Label examples that should heavily contribute to the learning process.
Weak labeling: Use proxy labels, either making assumptions about label distribution (for example, semi-supervised learning) or about the labeling process (for example, data programming), or using external (noisy) supervision signals correlated with the true labels (incidental supervision).

We note these approaches may also be combined. For example, one might add more examples and increase the dimensionality of the data. Here, we follow the taxonomy and elaborate on the different approaches and best practices.

Obstacle: Missing Data

Quite often, data is hard (or impossible) to obtain. In the following, we survey some of the main methods from the left branch in Figure 1: obtaining more examples efficiently, adding informative dimensions to existing data, or taking advantage of related tasks.

Add examples. This category focuses on methods for obtaining more examples.

Dataset repurposing

Use a preexisting dataset for a new purpose.

Dataset repurposing is perhaps the most obvious method to add data and is mentioned here for the sake of completeness. The idea is to use a preexisting dataset for a different task than it was originally constructed for.

For example, ImageNet was originally made and used for classification, but later was reused for image generation.⁴⁵ Similarly, the MS-COCO image captioning dataset was reused for training visually grounded word embeddings.²⁰

Data repurposing also includes transformations on existing datasets. For example, consider inpainting, the process of restoring lost parts of an image based on the surrounding information. Inpainting is done using various preexisting datasets such as CelebA, Places2, and ImageNet,³⁹ where the same image is split into both X and Y (sometimes in more than one way).
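To make the idea of splitting one image into both X and Y concrete, here is a minimal sketch. It assumes an image is a plain list of rows of pixel values; the function name `make_inpainting_pair` and the square-patch masking scheme are illustrative choices, not taken from any of the cited works.

```python
def make_inpainting_pair(image, top, left, size, fill=0):
    """Split one image into an (X, Y) training pair for inpainting.

    X is the image with a square patch blanked out;
    Y is the original content of that patch.
    """
    # Y: the pixels the model should learn to restore.
    target = [row[left:left + size] for row in image[top:top + size]]
    # X: a copy of the image with the patch overwritten by `fill`.
    masked = [list(row) for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = fill
    return masked, target


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
x, y = make_inpainting_pair(image, top=0, left=1, size=2)
```

Choosing a different patch position for the same image yields another (X, Y) pair, which is one way a single image can be split "in more than one way."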

Of course, it is also possible to re-purpose a dataset created with no machine-learning task in mind at all: for example, Bertero and Fung⁶ used a dataset of TV sitcoms for a supervised humor detection task, with recorded laughter serving as labels.

Data augmentation

Perform transformations on X to enlarge the dataset.

Data augmentation is a common approach for generating more data; it artificially inflates the training set by applying modifications. This method’s initial goal was to prevent overfitting.

Data augmentation often employs vicinal risk minimization (VRM).⁴⁸ In VRM, human knowledge is needed to define a neighborhood around each example in the training data, and virtual examples are drawn from this vicinity distribution. It is easiest to demonstrate this idea in the field of computer vision; there, common augmentations are geometric transformations such as flipping, cropping, scaling, and rotating (see Figure 2). The idea is to make the classifier invariant to change in position and orientation. Similarly, photometric transformations amend the color channels to make the classifier invariant to change in lighting and color.

Figure 2. Examples for common data augmentation manipulations of images as presented by Taylor and Nitschke.⁴⁰
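The geometric transformations mentioned above can be sketched in a few lines. This is an illustrative sketch only: it assumes an image is a list of rows, and the helper names (`horizontal_flip`, `rotate90`, `random_crop`, `augment`) are hypothetical rather than part of any particular library.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2-D image (a list of rows)."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate a 2-D image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def random_crop(image, height, width, rng):
    """Cut a random height x width window out of the image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


def augment(dataset):
    """Return the original (image, label) pairs plus flipped and rotated copies.

    Every variant keeps the label of the example it was derived from.
    """
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.append((horizontal_flip(image), label))
        out.append((rotate90(image), label))
    return out


rng = random.Random(0)
tiny = [[1, 2], [3, 4]]
bigger = augment([(tiny, "dog")])
patch = random_crop([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 2, 2, rng)
```

Whether a given transformation really preserves the label depends on the task (a rotated "6" may read as a "9"), which is where the human knowledge required by VRM enters.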

Data augmentation leads to improved generalization, especially with small datasets³ or when the dataset is unbalanced (as an alternative to undersampling, which is data-inefficient).

Augmentation methods have seen a recent surge of interest. Recent advances include methods that jointly train a model for generating augmentations,²⁸ and methods that learn which augmentations best fit the data.⁷ For example, AutoAugment⁷ randomly chooses a sub-policy of batch transformation and searches for the one that yields the highest validation accuracy.

Beyond human-defined transformations, recent methods suggested using pretrained generative adversarial networks (GANs) to create new examples. Interestingly, the generated data points do not have to be interpretable by humans. For example, Mixup⁵⁹ trains a neural network on convex combinations of pairs of examples and their interpolated labels, treating it as “noisy” training data.
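The convex combination used by Mixup can be written down directly. The sketch below assumes inputs are flat feature lists and labels are one-hot lists; drawing the mixing coefficient from a Beta(alpha, alpha) distribution follows the Mixup paper, while the function names and the way examples are paired are illustrative.

```python
import random


def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two examples and their one-hot labels with weight lam in [0, 1]."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def mixup_batch(xs, ys, alpha, rng):
    """Mix every example with a randomly chosen partner from the same batch."""
    partners = list(range(len(xs)))
    rng.shuffle(partners)
    mixed = []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)  # mixing coefficient
        mixed.append(mixup_pair(xs[i], ys[i], xs[j], ys[j], lam))
    return mixed


x, y = mixup_pair([0.0, 2.0], [1.0, 0.0], [4.0, 6.0], [0.0, 1.0], lam=0.25)
batch = mixup_batch([[0.0], [1.0], [2.0]],
                    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                    alpha=0.4, rng=random.Random(0))
```

The blended label is a soft distribution over classes, which is why the resulting points need not look meaningful to a human.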

More information on existing data. Instead of adding new data points, this set of methods focuses on adding dimensions to existing points.

Multimodal learning

Integrate associated information on X from multiple modalities.

Multimodal learning attempts to enrich the input to the learning algorithm, giving the learner access to more than one modality of X; for example, an image accompanied by its caption. Multimodal learning is intuitive and resembles how infants learn (that is, when children see new objects, the experience is often accompanied by additional semantic information). The main challenges of multimodal learning are obtaining rich input and effectively integrating it into the model.

Although the term “multimodal learning” is recent, many earlier works combined information from different modalities.¹¹,²²,⁴¹ These works, and more recent ones, show the promise of this method as an effective way to reduce data requirements and improve generalization.

Moreover, multimodal learning is also often used when the number of data points is extremely small, and in particular, few-, one-, and zero-shot learning (when only a few target-specific labeled examples exist for the learning process; thus, the learner must understand new concepts using only a handful of examples). For example, Visotsky et al.⁵¹ used multimodal learning for few-shot learning by integrating additional per-sample information, in this case a list of objects appearing in the input image (see Figure 3). Schwartz et al.³⁷ demonstrated that it is possible to outperform previous state-of-the-art results on the popular miniImageNet and CUB few-shot learning benchmarks by combining images with multiple and richer semantics (category labels, attributes, and natural language descriptions).

Figure 3. An illustration of the learning setup used by Visotsky et al.⁵¹

Curriculum learning

Present examples to the learner according to a predetermined order, usually based on difficulty.

In curriculum learning, the learner is exposed to examples using a predetermined curriculum, where examples are usually sorted in increasing order of difficulty. Meta-data on X is needed to determine its place in the learning process.

The motivation behind curriculum learning comes from humans, as teachers tend to start by teaching simpler concepts (for example, learning to ride a bicycle with training wheels first). Thus, curriculum learning attempts to augment training examples with a difficulty score, often corresponding to typicality.

Given the difficulty score, the algorithm starts with a set of simple data points and gradually increases the difficulty of training examples throughout the learning process. This progression enables the model to learn the broad concept on a few easy examples and later refine the concept with more difficult ones. In Figure 4, for example, the photos of dogs in the top row are more typical and should be easier for a classifier to recognize.

Figure 4. Typical versus non-typical images of dogs are considered to be easy versus hard, respectively, in a dogs-versus-cats classification task.
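The easy-to-hard progression can be sketched as a generator of growing training subsets. This is one simple scheduling choice among many; the function name `curriculum_stages`, the equal-sized stages, and the use of string length as a stand-in difficulty score are all assumptions made for illustration.

```python
def curriculum_stages(examples, difficulty, num_stages):
    """Yield growing training subsets, easiest examples first.

    `difficulty` maps an example to a score (lower means easier).
    The final stage always contains the full dataset.
    """
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, num_stages + 1):
        cutoff = max(1, len(ordered) * stage // num_stages)
        yield ordered[:cutoff]


# Toy usage: string length stands in for a real difficulty score.
stages = list(curriculum_stages(["a", "abc", "ab", "abcd"], len, num_stages=2))
```

In a real training loop, the model would be trained for some number of steps on each yielded subset before moving on to the next.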

Curriculum learning has been shown to improve performance while decreasing the number of examples needed for convergence.¹⁷ For example, Zaremba and Sutskever⁵⁸ showed how curriculum improves learning for the task of predicting the output of Python code without executing it.

A major caveat of curriculum learning is the inherent need for a difficulty-label estimator. Human labeling of difficulty can be very demanding, perhaps even more than standard annotation. In practice, the difficulty of each example is often learned by a teacher model, which may have access to related training data.¹⁷

A related concept is self-paced learning (SPL).¹⁹ Intuitively, the curriculum in SPL is determined by the student’s abilities, rather than being fixed by the teacher. Instead of heuristically designing a difficulty measure, SPL introduces a regularizer into the learning objective, with the goal of optimizing a curriculum for the model itself. This makes SPL broadly applicable.

Argumentation-based machine learning

Use experts’ local knowledge to restrict the search space.

Argumentation-based machine learning (ABML) is a method to constrain the search space using experts’ local knowledge.²⁶ In a nutshell, in ABML the learner attempts to find if-then rules to explain argumented examples in a rule induction process. The learner starts by finding a rule, adding it to a set of rules and removing all training data points that are covered by that rule. This process is repeated until all examples are removed. ABML’s main advantage is the use of expert knowledge to justify specific examples, which is often easier than explaining global phenomena.

For example, Možina et al.²⁶ used ABML for medical records of deceased patients, where they used a physician’s reasoning for the cause of death to limit the search space.

ABML is perhaps less popular than the other methods in this section. Nevertheless, if expert local knowledge is available, ABML is a powerful way to integrate partial prior knowledge. Moreover, the induced hypothesis should make more sense to an expert, as it must be consistent with the input arguments.

Models encoding relevant knowledge. Here, we go beyond the classical pipeline of training a model for a task; we present models that can take advantage of other, related tasks.

Multi-task learning

Co-learn multiple tasks simultaneously to enhance cross task similarities for better generalization.

Multi-task learning (MTL) is a prominent area of research where one attempts to train on multiple different (yet related) tasks simultaneously. These multiple tasks are solved concurrently, exploiting commonalities and differences across them.

It has been shown that challenging the learner to solve multiple problems at the same time results in better generalization and better performance on each individual task.³⁶ Indeed, MTL is successfully used in both vision and NLP. The key factors for this success in the absence of a large dataset are: it is an implicit data augmentation method, based on cross-task commonalities; it enables unraveling cross-task and feature correlations; and encouraging a classifier to also perform well on a slightly different task is a better regularization than uninformed regularizers (for example, enforcing small weights, as in typical L2 regularization).

As an example, consider the case of spam-filtering. Quite often, data from an individual user is insufficient for training a model. Intuitively, different people have different distributions of features that distinguish spam from legitimate email. For example, email messages in Russian are probably spam for English speakers, but not for Russian speakers. However, inter-user commonalities can be utilized to solve this problem (for example, text related to money transfer is probably spam). To build upon these similarities, Attenberg et al.⁴ created an MTL-based spam-filter, treating each individual user as one distinct but related classification task and training a model across the different users.

A more recent example of MTL is the T5 model (see Figure 5).²⁹ This model achieves state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned to a variety of downstream tasks. T5 receives the task at hand as part of its input and thus allows the use of the same model, loss function, and hyperparameters for any NLP task.

Figure 5. Multi-task paradigm as presented by Raffel et al.²⁹

MTL implementations can be divided into two main categories: hard versus soft parameter sharing of the hidden layers, where hard parameter sharing is more commonly used. In the hard type, the hidden layers are shared between all tasks while keeping several task-specific output layers. Baxter⁵ showed that the risk of overfitting the shared parameters is an order N (the number of tasks) smaller than the risk of overfitting the task-specific parameters (the output layers). In soft parameter sharing, each task has its own model and parameters. The distance between the models’ parameters is then regularized to encourage them to be similar (enhancing cross-task similarity), as done by Duong et al.¹⁰
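The structural difference between the two sharing schemes can be sketched without a deep-learning framework. The code below shows only forward passes and a penalty term, not training; the class and function names are hypothetical, and the squared L2 distance is an assumed choice of distance for the soft-sharing penalty.

```python
def dense(weights, bias, x):
    """Apply a fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


class HardSharedModel:
    """Hard sharing: one hidden layer for all tasks, one output head per task."""

    def __init__(self, shared_w, shared_b, heads):
        self.shared_w = shared_w
        self.shared_b = shared_b
        self.heads = heads  # task name -> (weights, bias)

    def predict(self, task, x):
        hidden = relu(dense(self.shared_w, self.shared_b, x))  # shared
        head_w, head_b = self.heads[task]                      # task-specific
        return dense(head_w, head_b, hidden)


def soft_sharing_penalty(params_a, params_b, strength):
    """Soft sharing: penalize the squared distance between two tasks' parameters."""
    return strength * sum((a - b) ** 2 for a, b in zip(params_a, params_b))


model = HardSharedModel(
    shared_w=[[1.0, 0.0], [0.0, 1.0]],
    shared_b=[0.0, 0.0],
    heads={"spam": ([[1.0, 1.0]], [0.0]), "topic": ([[1.0, -1.0]], [0.5])},
)
spam_score = model.predict("spam", [2.0, -3.0])
topic_score = model.predict("topic", [2.0, -3.0])
penalty = soft_sharing_penalty([1.0, 2.0], [1.0, 4.0], strength=0.5)
```

In hard sharing, gradients from every task update the same hidden layer; in soft sharing, the penalty would be added to the sum of the per-task losses.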

Transfer learning

Transfer knowledge gained while solving one problem to a different yet related problem.

Transfer learning is a widely used, highly effective way to integrate prior knowledge. It mirrors human learning: people never approach a new problem tabula rasa, but rather with rich experience of somewhat similar problems and their solutions.⁴²

The idea is to use preexisting models trained on related tasks. These pre-trained models are usually used as an initialization for fine-tuning using a small dataset for the task at hand. Thus, significantly fewer task-specific examples are needed for convergence.

Another beneficial side effect is the use of the model’s initial wide domain knowledge, compared to initialization with random weights. In other words, the model starts the fine-tuning phase with some relevant world knowledge.

For example, models trained on ImageNet have been transferred to medical imaging tasks, including inspecting chest x-rays⁵⁴ and retinal fundus images.⁸ The idea is that a network trained on a large and diverse dataset of images captures universal visual features such as curves and edges in its early layers (similar to the primary visual cortex of humans and many other mammals, a Nobel prize winning discovery^a). Despite the difference between the images in ImageNet and those in the downstream tasks, these features are relevant for many vision tasks. Therefore, this approach significantly decreases the size of labeled task-specific data needed.

In NLP, the commonly used pre-trained model BERT achieves state-of-the-art results in various tasks.⁹ Pretraining such models is often done in a self-supervised manner, where different parts of the input are masked, and the learner’s goal is to predict the masked parts. For example, given a sentence, it is possible to iterate over it, masking a different word each time, to create various examples.
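The one-word-at-a-time masking described above is easy to sketch. Note that this follows the simplified description in the text; it is not BERT's actual procedure, which masks a random subset of tokens. The function name and the whitespace tokenization are assumptions.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Create one (masked sentence, hidden word) pair per word position."""
    tokens = sentence.split()
    examples = []
    for i, word in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((" ".join(masked), word))
    return examples


pairs = masked_examples("the cat sat")
```

Each pair is a self-supervised training example: the input comes with its own label, so no human annotation is needed.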

Fine-tuning in deep networks is usually done either by adding an untrained last layer and training the new model on the small task-specific dataset or by taking the output embeddings of the next to last layer. Another possible fine-tuning technique is to train the whole network with a relatively small learning rate; that is, perform small changes on the already-decent weights (as a heuristic, about 10 times smaller than the learning rate used for pretraining). Fine-tuning can also be done by freezing the weights of the first few layers of the pretrained model. The motivation behind this technique is that the first layers capture universal features that would probably also be relevant to the new task. Thus, freezing them during fine-tuning should keep the captured information that is relevant for both the original and the new tasks.
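As a minimal sketch of the first strategy (keeping pretrained layers frozen and training only a new output layer), consider the toy example below. The function `frozen_features` is a hypothetical stand-in for the early layers of a real pretrained network, and the logistic output layer trained by plain stochastic gradient descent is an assumption chosen for brevity.

```python
import math


def frozen_features(x):
    """Stand-in for pretrained early layers; never updated during fine-tuning."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_head(data, steps=200, lr=0.5):
    """Train only a new logistic output layer on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        for x, y in data:
            h = frozen_features(x)
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * hi for wi, hi in zip(w, h)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    h = frozen_features(x)
    return int(sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) >= 0.5)


# A tiny task-specific dataset: only four labeled examples.
small_dataset = [([1.0, 1.0], 1), ([2.0, 0.5], 1),
                 ([-1.0, -1.0], 0), ([-0.5, -2.0], 0)]
w, b = fine_tune_head(small_dataset)
```

Because only the two head weights and the bias are trained, a handful of examples suffices here; training the whole network with a small learning rate would instead update `frozen_features` as well.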

To conclude, transfer learning is a powerful tool for both reducing the amount of task-specific data needed and improving models’ performance.

Meta learning

Improve the learning algorithm by generalizing based on experience from multiple learning episodes.

Meta-learning (also known as “learning to learn”) is a recent subfield of machine learning,¹² focusing on designing models that can learn new tasks or adapt to new environments rapidly, with only a few training examples. It is based on creating a meta-learner that has wide prior knowledge regarding the relevant topic(s). Meta learning is also inspired by human learning. For example, people who know how to ride a bicycle are more likely to quickly learn to ride a motorcycle.

Note that while meta learning can often be meaningfully combined with MTL systems, their objectives are different. While MTL aims to solve all training tasks, meta learning aims to use the training tasks for solving new tasks with small data. Thus, meta learning is about creating models with prior experience that can quickly adapt to new tasks. Specifically, the meta-learner gradually learns meta-knowledge across tasks, which can be generalized to a new task using little task-specific information.

There are three common approaches to meta-learning: metric-based (similar to nearest-neighbor algorithms), optimization-based (meta-gradients optimizing), and model-based (no assumptions about data distribution).

As an example of a metric-based approach, Vinyals et al.⁵⁰ proposed a framework that explicitly learns from a given support set to minimize a loss over a batch. The result is a model that learns to map a small, labeled support set and an unlabeled example to its label, obviating the need for fine-tuning to adapt to new class types. They then showed the superiority of this method in both vision and NLP tasks.
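The general flavor of metric-based prediction from a support set can be sketched as follows. This is not the method of Vinyals et al.: in their framework the embeddings are learned, whereas this sketch assumes raw, non-zero feature vectors and simply weights each support label by a softmax over cosine similarities. All names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_with_support(support, query):
    """Label a query using a small labeled support set; no fine-tuning involved.

    Each support example votes for its label with a weight given by a
    softmax over its similarity to the query.
    """
    weights = [math.exp(cosine(vec, query)) for vec, _ in support]
    total = sum(weights)
    scores = {}
    for (_, label), weight in zip(support, weights):
        scores[label] = scores.get(label, 0.0) + weight / total
    return max(scores, key=scores.get), scores


support_set = [([1.0, 0.0], "cat"), ([0.9, 0.1], "cat"), ([0.0, 1.0], "dog")]
label, scores = classify_with_support(support_set, [1.0, 0.2])
```

Adapting to new classes only requires swapping in a different support set, which is what removes the need for fine-tuning.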

A well-known work in the optimization-based line of research is model-agnostic meta-learning (MAML), which is a general optimization algorithm, compatible with any gradient descent-based model.¹² It uses a meta-loss specifically designed to induce quick changes when fine-tuned on new tasks and is based on N-gradients (where N is the total number of tasks).

In the model-based line of research, Munkhdalai and Yu²⁷ presented MetaNet, a meta-learning model designed specifically for rapid generalization across tasks. The rapid generalization of MetaNet relies on “fast weights”, which are parameters of the network with a smaller timescale for changes than the regular gradient-based weight changes. This Hebbian short-term plasticity maintains a dynamically changing short-term memory of the recent history of the units’ activities in the network, as opposed to the standard slow recurrent connectivity. This model outperforms various other recurrent models across several tasks.

Obstacle: Missing Labels

We now turn our attention to the second major branch in Figure 1, where unlabeled data is abundant, but there are few labels (or no labels at all). This setting is common in practice because unlabeled data is often much easier to obtain than labeled data. In this section we cover two main approaches. The first deals with ways to acquire labels efficiently, and the other uses weak labels.

Active learning

Generate examples which are close to the decision boundary. These examples should contribute to the learning process more than random examples.

Acquiring labels efficiently. When more labels are needed but annotation is costly, an immediate question would be how to acquire new labeled data efficiently. The prime example of this is active learning, in which the learner can iteratively query an oracle (information source) to label new data points.³² These queries can include unlabeled examples either from the dataset or new ex-nihilo data points, often ones that are close to the decision boundary. The rationale is that not all examples contribute equally to the learning process: diverse examples that are difficult for the learner to classify might be especially useful and could decrease the number of data points needed for learning.

There are many methods to determine which data points from the training set should be queried next. Common objectives include picking examples which will change the current model the most, examples which the current model is least certain about, or diverse examples that resemble the data distribution. For example, Hacohen et al.¹⁶ recently showed that in the presence of little data it is most beneficial to present the model with typical examples (compared to scenarios with more data, in which it is best to use examples that are close to the decision boundary).
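One of the objectives listed above, querying the examples the current model is least certain about, can be sketched with predictive entropy as the uncertainty measure. The function names and the toy one-dimensional model are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(pool, predict_proba, budget):
    """Pick the `budget` unlabeled points the current model is least sure about."""
    ranked = sorted(pool, key=lambda x: entropy(predict_proba(x)), reverse=True)
    return ranked[:budget]


def toy_model(x):
    """Hypothetical current model: a 1-D classifier with boundary at x = 0."""
    p = 1.0 / (1.0 + math.exp(-x))
    return [p, 1.0 - p]


to_label = select_queries([-3.0, -0.1, 0.2, 4.0], toy_model, budget=2)
```

The selected points are the ones nearest the decision boundary; they would be sent to the oracle, added to the labeled set, and the model retrained before the next round of queries.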

When generating new examples (rather than selecting unlabeled ones from the training set), it is important to remember that humans will be the ones labeling them. We wish to point out that while data augmentation modifies the input but keeps its label (as discussed earlier), active learning generates examples without labels. Thus, the generation algorithm should keep the new points interpretable, that is, ensure they have a clear label.¹⁴ For example, Zarecki and Markovitch⁵⁷ automatically transformed sentences’ sentiment by replacing key words that bring them closer to the classification boundary (while keeping their syntax).

Recent approaches use GANs to generate new examples, either from scratch (and label them),⁶⁰ or by modifying an existing example (while attempting to preserve the label).⁴³ Both scenarios update the learner and the GAN model simultaneously after labeling a new example.

Importantly, the GAN approaches are more expressive than transformation-based approaches, but the result is often less interpretable. Figure 6 shows an example of modified images from Tran et al.⁴³ Note that while the MNIST examples (handwritten digits) have relatively clear labels, the CIFAR10 examples (tiny images in ten classes such as airplane, dog, and ship) are not as easy to label.

Figure 6. Images generated by the GAN transformation approach for “near-miss” examples.⁴³

A note on gamification. Active learning is the dominant paradigm for reducing the number of annotations needed. However, a different approach to label efficiently is to reduce the cost of annotations. A notable example is gamification—applying gaming mechanics to non-gaming environments, to make tasks more enjoyable and give annotators a non-monetary incentive to provide labels. The challenge in gamification is often to design the game to create the right incentive. This is far from trivial, and requires knowledge of game design, motivational psychology, and an understanding of the target group.²⁵ Ignorance of the complexity involved in gamification often results in modest outcomes.

The seminal work of Von Ahn and Dabbish⁵² demonstrated a two-player game for image labeling, where the players gain points for describing an image using the exact same term. The researchers famously estimated that if users were to play the game at the same rate as other popular online games, most images on the Web could be labeled (for free) within only a few months. Another example is the unfun.me corpus used in humor research. This corpus was constructed via an online game where players change satirical headlines into serious ones with minimal edits.⁵⁵

Weak labeling. If we cannot obtain labels efficiently, we could choose to obtain noisy labels as a proxy. In vision, this is sometimes referred to as “automatic image annotation.” We cover two main types of noisy labels here.

Assumptions on P(Y = y | X = x).

Semi-supervised learning

Harness information regarding P(X = x) to reduce labeling requirements by integrating labeled and non-labeled examples in the learning process.

Semi-supervised learning (SSL) is a very large and active area of research, and we do not profess to cover all of it; for a recent survey on SSL, we refer the reader to van Engelen and Hoos.⁴⁶

SSL estimates the distribution P(X = x) from a large amount of unlabeled data and makes strong assumptions about the relation between P(X = x) and P(Y = y | X = x) to reduce the number of labeled examples needed.⁵⁶ Typically, these assumptions take the following forms:

Smoothness: Points that are close to each other are more likely to share a label. More formally, every two adjacent samples x, x’ should have similar labels.
Cluster-ability: Data tend to form discrete clusters where points belonging to the same cluster are more likely to share a label. Thus, the decision boundary can only pass through low-density areas in the feature space.
Manifold: Data lies approximately on a manifold of a much lower dimension than the input space. Thus, when considering low-dimensional manifolds of the input space, any data points on the same manifold should have the same label.

All three assumptions can be seen as different definitions of inter-point similarity: the smoothness assumption defines it as proximity in the input space, the cluster-ability assumption holds that high-density areas contain similar data points, and the manifold assumption states that points that lie on the same low-dimensional manifold are similar.

Another important distinction in SSL is between inductive and transductive methods. The former yield a classification model to predict the label of a new example, as in supervised learning (f: X → Y). The latter do not yield such a model, but instead directly provide predictions. Transductive approaches are usually graph-based, while the inductive approaches can be further divided into unsupervised preprocessing, intrinsically semi-supervised, and wrapper methods.⁴⁶

One popular way of using the unsupervised preprocessing approach is to use the knowledge on P(X = x) to extract useful features in a lower dimension than the original dimension of X and thus reduce the learning complexity. This includes learning a representation using an auto-encoder model⁴⁹ or applying a dimensionality reduction method like PCA.¹

Under the inductive approach, it is also possible to use an intrinsically semi-supervised model like semi-supervised SVM, which changes the optimization target to find a decision boundary with maximal margin from both labeled and unlabeled points (for example, using SVM).⁴⁷ This can also be applied to neural networks by adding a form of regularization over the unlabeled data.³⁰

In wrapper methods, a model is initially trained from the available set of examples.⁴⁴ It then makes predictions on the unlabeled dataset. The model’s pseudo-labels are added as labeled data for the next iteration of supervised learning. This process is repeated until convergence.
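The wrapper loop can be sketched with a deliberately simple base learner. The sketch assumes one-dimensional inputs, at least two classes, and a nearest-centroid classifier; accepting only pseudo-labels whose margin exceeds `min_margin` is a common safeguard added here as an assumption, not something the description above requires.

```python
def fit_centroids(labeled):
    """Supervised step: compute the mean input value of each class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_margin(centroids, x):
    """Return the nearest class and how much closer it is than the runner-up."""
    dists = sorted((abs(x - c), y) for y, c in centroids.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def self_train(labeled, unlabeled, min_margin, max_rounds=10):
    """Repeatedly pseudo-label confident points and retrain on the enlarged set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:  # nothing new was added, so the loop has converged
            break
        labeled.extend(confident)
        unlabeled = remaining
    return fit_centroids(labeled), labeled


centroids, final_labeled = self_train(
    labeled=[(0.0, "a"), (10.0, "b")],
    unlabeled=[1.0, 2.0, 9.0, 8.0, 5.2],
    min_margin=2.0,
)
```

In this toy run the ambiguous point 5.2 never receives a pseudo-label, which illustrates the risk the threshold guards against: a wrong pseudo-label would be treated as ground truth in every later round.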

Data programming

Integrate multiple weak heuristics regarding the labeling process f: X → Y to create noisy labels.

Data programming is a paradigm for the programmatic creation of training sets. In data programming, users express weak supervision strategies or domain heuristics as labeling functions (LFs), which are programs that label subsets of the data. Importantly, LFs are imprecise and can contradict each other, resulting in noisy labels. By explicitly representing the labeling process f: X → Y as a generative model, data programming aims to “denoise” the generated training set.

For example, in spam-detection, potential LFs would return “spam” if the email contains a URL or a money transfer request, and “no-spam” if coming from someone in your contact list. These functions alone achieve poor performance; however, like ensemble methods (where a group of weak learners comes together to form a strong one with superior accuracy), the strength of data programming is in the combination of many weak heuristics.
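The spam heuristics above can be written as labeling functions in a few lines. This sketch does not use any data programming library: the email format (a dictionary with `sender` and `body`), the `CONTACTS` set, and all function names are hypothetical. It combines votes by simple majority, which is a stand-in for the generative model that data programming actually uses to weight and denoise the LFs.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1

# Hypothetical contact list used by one of the heuristics.
CONTACTS = {"friend@example.com"}


def lf_contains_url(email):
    body = email["body"]
    return SPAM if "http://" in body or "https://" in body else ABSTAIN


def lf_money_transfer(email):
    return SPAM if "money transfer" in email["body"].lower() else ABSTAIN


def lf_known_sender(email):
    return HAM if email["sender"] in CONTACTS else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_url, lf_money_transfer, lf_known_sender]


def majority_vote(email, lfs=LABELING_FUNCTIONS):
    """Combine LF outputs; abstain when no LF fires or when the vote is tied."""
    votes = [v for v in (lf(email) for lf in lfs) if v != ABSTAIN]
    spam, ham = votes.count(SPAM), votes.count(HAM)
    if spam == ham:
        return ABSTAIN
    return SPAM if spam > ham else HAM


noisy_label = majority_vote({
    "sender": "stranger@unknown.example",
    "body": "Click https://a.example to confirm your money transfer",
})
```

Each function labels only the subset of emails it recognizes and abstains otherwise, so conflicts and gaps are expected rather than exceptional.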

A popular system for data programming is Snorkel.³¹ It applies the (noisy) LFs to the data and estimates their accuracy and correlations, using only their agreements and disagreements. This information is then used to reweight and combine LF predictions to output probabilistic noise-aware training labels. This process is presented in Figure 7.

Figure 7. Illustration of Snorkel’s pipeline.

Expectation regularization

Using prior knowledge regarding the proportion of the different labels in sub-groups of the data to create noisy labels.

Prior knowledge regarding the proportion of labels in various subgroups of the data makes it possible to automatically create noisy labels, in a process called expectation regularization (learning from label proportions).⁵³

This estimation process relies on uniform convergence properties of the expectation operator. It uses empirical means of the sub-groups to approximate expectations with respect to a group’s distribution. The latter is then used to compute expectations with respect to a given label, and finally, the conditional means on the label distribution are used to estimate the conditional group means.

A recent development in this area is ballpark learning, which relaxes the assumption of known label proportions, assuming instead soft constraints on proportions within and between groups of instances (for example, “the percentage of spam in emails mentioning a certain word is between k_low and k_high”, or “emails containing a link have at least k% more spam than emails without links”).¹⁸ Ballpark learning learns a model that labels individual instances while satisfying these soft, noisy constraints.
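To make the two kinds of constraints concrete, the sketch below scores a candidate labeling against a within-group bound and a between-group difference. It is only one plausible way to express such soft constraints as penalties, not the formulation of the cited work; the function names, the predicted probabilities, and the thresholds are all assumptions.

```python
def group_proportion(probs, members):
    """Predicted fraction of positive (spam) instances within a group."""
    return sum(probs[i] for i in members) / len(members)

def bound_penalty(proportion, k_low, k_high):
    """Zero inside [k_low, k_high]; grows linearly with the violation."""
    return max(0.0, k_low - proportion) + max(0.0, proportion - k_high)

def difference_penalty(prop_a, prop_b, k):
    """Zero when group A's proportion exceeds group B's by at least k."""
    return max(0.0, k - (prop_a - prop_b))

# Hypothetical per-email spam probabilities produced by some model.
probs = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
with_link, without_link = [0, 1, 2], [3, 4, 5]

p_link = group_proportion(probs, with_link)
p_no_link = group_proportion(probs, without_link)

# "Between 50% and 90% of emails with a link are spam."
within = bound_penalty(p_link, k_low=0.5, k_high=0.9)
# "Emails with a link have at least 30 points more spam than those without."
between = difference_penalty(p_link, p_no_link, k=0.3)
print(within + between)  # 0.0
```

A learner could add such penalties to its training objective, so that a model violating the stated bounds is pushed back toward them without ever seeing an individual label.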

Noisy Supervision from External Datasets. It is sometimes possible to take advantage of preexisting datasets to get a noisy supervision signal.

Distant supervision

Use a preexisting database to collect examples for the desired relation. These examples are then used to automatically generate labeled training data.

Distant supervision is a popular method to use existing datasets. In distant supervision, a model is learned given a labeled training set, as in “standard” supervised ML, but the training data is weakly labeled (that is, labeled automatically, based on heuristics or rules).

For example, Mintz et al.²⁴ used Freebase, a large, unlabeled, semantic database, to provide distant supervision for relation extraction. The intuition is that any sentence that contains a pair of entities with a known Freebase relation is likely to express that relation in some way.²³ For example, each sentence in which “Barack Obama” and “Michelle Obama” both appear can be extracted as a positive example for the marriage relation. Due to the potentially large number of sentences that contain a given entity pair, it is possible to extract and combine noisy features for the labeling process. Based on these semantic signals, Mintz et al.²⁴ were able to use 116 million unlabeled instances.
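The labeling heuristic can be sketched as follows. The tiny dictionary stands in for a knowledge base such as Freebase, entity mentions are found by plain substring matching, and the relation names and sentences are invented for illustration; a real pipeline would use entity recognition and far larger resources.

```python
from itertools import permutations

# Hypothetical stand-in for a knowledge base of (subject, object) -> relation.
KNOWN_RELATIONS = {
    ("Barack Obama", "Michelle Obama"): "married_to",
    ("Barack Obama", "Honolulu"): "born_in",
}
ENTITIES = sorted({entity for pair in KNOWN_RELATIONS for entity in pair})

def distant_label(sentences):
    """Label every sentence that mentions an entity pair with a known relation."""
    examples = []
    for sentence in sentences:
        mentioned = [e for e in ENTITIES if e in sentence]
        for pair in permutations(mentioned, 2):
            relation = KNOWN_RELATIONS.get(pair)
            if relation is not None:
                examples.append((sentence, pair, relation))
    return examples

sentences = [
    "Barack Obama and Michelle Obama attended the ceremony.",
    "Barack Obama was born in Honolulu.",
    "Michelle Obama wrote a memoir.",
]
training_set = distant_label(sentences)
for sentence, pair, relation in training_set:
    print(relation, "<-", sentence)
```

The first sentence is labeled `married_to` even though it says nothing about marriage, which is the kind of label noise that distant supervision accepts in exchange for scale.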

Incidental supervision

Exploit weak signals that exist in data independently of the task at hand.

The incidental supervision framework is based on the idea that informative cues for a task could exist in datasets that were not constructed with this task in mind. For example, suppose we want to infer gender from first names. One could use Wikipedia, which was not created for this task. The incidental signal would be pronouns and other gender indicators appearing in the first paragraph of Wikipedia pages about people with that first name. This signal is correlated with the task at hand and, together with other signals and inferences, could be used for supervision, reducing the need for annotations.
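A toy version of this pronoun signal is sketched below. The pages are invented placeholders rather than real Wikipedia content, the pronoun list is deliberately small, and counting pronouns is only one of many cues that could be combined.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately small mapping from pronouns to a gender cue.
PRONOUNS = {"she": "female", "her": "female",
            "he": "male", "his": "male", "him": "male"}

def pronoun_signal(text):
    """Count gendered pronouns in a piece of text."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in PRONOUNS:
            counts[PRONOUNS[token]] += 1
    return counts

def aggregate_by_first_name(pages):
    """Pool pronoun counts over all pages that share a first name."""
    totals = defaultdict(Counter)
    for full_name, first_paragraph in pages:
        first_name = full_name.split()[0]
        totals[first_name].update(pronoun_signal(first_paragraph))
    # Names with no signal at all are left out rather than guessed.
    return {name: c.most_common(1)[0][0] for name, c in totals.items() if c}

# Invented (title, first paragraph) pairs standing in for biography pages.
pages = [
    ("Maria Example", "She is a physicist. Her work focuses on optics."),
    ("Maria Sample", "She won a national award."),
    ("John Example", "He is a painter known for his murals."),
    ("Alex Example", "A musician from a small town."),
]
print(aggregate_by_first_name(pages))  # {'Maria': 'female', 'John': 'male'}
```

No page was annotated for this task; the supervision comes entirely from text that happens to carry the cue.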

Incidental supervision does not assume knowledge about the labeling process.³⁴ Moreover, incidental signals can be noisy, partial, or only weakly correlated with the target task, and still be used to provide supervision and facilitate learning. Note that the notion of supervision here is different from that of distant supervision: In distant supervision, the model learns in the standard supervised learning way, but the training set is labeled automatically, based on heuristics. In incidental supervision, a complete training set might never exist.

Context-sensitive spelling and grammar checking is a task that has been relying on incidental supervision for over 20 years now.¹³ Under the assumption that most edited textual resources (books, newspapers, Wikipedia) do not contain many spelling and grammar errors, these methods generate contextual representations for words, punctuation marks and phenomena such as agreements. These representations are then used to identify mistakes and correct them in a context-sensitive manner.³⁵

An unintentional example of the power of incidental signals comes from image processing, where the task of gender detection based on the iris texture was solved with great accuracy (over 80% in most papers and an impressive score of 99.5% reported by Al-rashed and Berbar²). However, it was later discovered that most models did not detect a person’s gender; rather, they detected the use of cosmetic mascara, which is a much easier task and is indeed correlated with the original assignment.²¹ Thus, although unintentionally, this finding emphasized the potential of using incidental cues.

Conclusion

The dominant paradigm in ML today is creating large, task-specific datasets (often using crowdsourcing). In this review we devise a taxonomy of alternative ways to tackle the data bottleneck problem. The taxonomy aims to bring order to the various methods suggested across different subfields, and to make it easier to identify underlying assumptions and potential new directions. Identifying assumptions is essential for breaking them, and breaking assumptions is an established technique for encouraging creativity and innovation.

For example, surveying the taxonomy, several common assumptions that stand out are that samples tend to be representative of the data, that we have information about X and Y conjointly, and that each example has exactly one correct label. This raises the prospect of new learning settings (for example, what if we only have knowledge about the distributions of data points P (X = x) and labels P (Y = y), separately?), and of new ways to aggregate multiple (correct but different) labels.

We note that our taxonomy covers widely diverse techniques that make very different assumptions. Ultimately, we expect that choosing a technique will often boil down to what the practitioner has access to (that is, which assumptions are met). For example, in multitask learning the practitioner possesses labeled data not only for their task but also for several related tasks; in data programming, they have no (or very few) labels for their task but possess some partial knowledge about the labeling process; in curriculum learning, they know something about the hardness of data points; and so on.

We further wish to point out that it is not always obvious whether a method’s assumptions are met in practice, or to estimate which method is better suited for a specific use case. The answer might depend on many factors, such as the inherent difficulty of the concept one wishes to learn, biases in the data, or the manual effort needed to obtain high-quality input for the different methods. For example, in methods using weak labeling, the tradeoff between implementation speed and accuracy for different weak labels is often not clear in advance.

In addition to the inherent difficulty of collecting large datasets, we note there are growing concerns about such datasets, including environmental costs, financial costs, opportunity costs, and more.³⁸ We also note that large datasets are still prone to fitting artifacts, and that several recent methods have attempted to address the recurring challenges of the annotation artifacts and human biases found in many existing datasets.¹⁵

In conclusion, ML has made tremendous progress using large datasets, but they are not a panacea for all problems. Our hope is that this paper will encourage a rethinking of current annotation-heavy approaches.

Acknowledgment. This work was supported by a grant from the Israel Ministry of Science and Technology and by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant no. 852686, SIAM).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/3551635

February 2023 Issue

Published: February 1, 2023

Vol. 66 No. 2

Pages: 92-102
