Delving Into Deep Imbalanced Regression
Yuzhe Yang 1 Kaiwen Zha 1 Ying-Cong Chen 1 Hao Wang 2 Dina Katabi 1
1 MIT Computer Science & Artificial Intelligence Laboratory. 2 Department of Computer Science, Rutgers University. Correspondence to: Yuzhe Yang <[email protected]>.
Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

[...] distributions, where certain target values have significantly fewer observations. Existing techniques for dealing with imbalanced data focus on targets with categorical indices, i.e., different classes. However, many tasks involve continuous targets, where hard boundaries between classes do not exist. We define Deep Imbalanced Regression (DIR) as learning from such imbalanced data with continuous targets, dealing with potential missing data for certain target values, and generalizing to the entire target range. Motivated by the intrinsic difference between categorical and continuous label space, we propose distribution smoothing for both labels and features, which explicitly acknowledges the effects of nearby targets, and calibrates both label and learned feature distributions. We curate and benchmark large-scale DIR datasets from common real-world tasks in computer vision, natural language processing, and healthcare domains. Extensive experiments verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for practical imbalanced regression problems. Code and data are available at: https://github.com/YyzHarry/imbalanced-regression.

[Figure 1 plot not recoverable from this extract. Axes: number of samples vs. continuous target value, with a region marked "Missing data".]
Figure 1. Deep Imbalanced Regression (DIR) aims to learn from imbalanced data with continuous targets, tackle potential missing data for certain regions, and generalize to the entire target range.

1. Introduction

Data imbalance is ubiquitous and inherent in the real world. Rather than preserving an ideal uniform distribution over each category, the data often exhibit skewed distributions with a long tail (Buda et al., 2018; Liu et al., 2019), where certain target values have significantly fewer observations. This phenomenon poses great challenges for deep recognition models, and has motivated many prior techniques for addressing data imbalance (Cao et al., 2019; Cui et al., 2019; Huang et al., 2019; Liu et al., 2019; Tang et al., 2020).

Existing solutions for learning from imbalanced data, however, focus on targets with categorical indices, i.e., the targets are different classes. Yet many real-world tasks involve continuous and even infinite target values. For example, in vision applications, one needs to infer the age of different people based on their visual appearances, where age is a continuous target and can be highly imbalanced. Treating different ages as distinct classes is unlikely to yield the best results because it does not take advantage of the similarity between people with nearby ages. Similar issues happen in medical applications, since many health metrics, including heart rate, blood pressure, and oxygen saturation, are continuous and often have skewed distributions across patient populations.

In this work, we systematically investigate Deep Imbalanced Regression (DIR) arising in real-world settings (see Fig. 1). We define DIR as learning continuous targets from natural imbalanced data, dealing with potentially missing data for certain target values, and generalizing to a test set that is balanced over the entire range of continuous target values. This definition is analogous to the class imbalance problem (Liu et al., 2019), but focuses on the continuous setting.

DIR brings new challenges distinct from its classification counterpart. First, given continuous (potentially infinite) target values, the hard boundaries between classes no longer exist, causing ambiguity when directly applying traditional imbalanced classification methods such as re-sampling and re-weighting. Moreover, continuous labels inherently possess a meaningful distance between targets, which has implications for how we should interpret data imbalance. For example, say two target labels $t_1$ and $t_2$ have a small number of observations in training data. However, $t_1$ is in a highly represented neighborhood (i.e., there are many samples in the range $[t_1 - \Delta, t_1 + \Delta]$), while $t_2$ is in a weakly represented neighborhood. In this case, $t_1$ does not suffer from the same level of imbalance as $t_2$. Finally, unlike classification, certain target values may have no data at all, which motivates the need for target extrapolation & interpolation.

In this paper, we propose two simple yet effective methods for addressing DIR: label distribution smoothing (LDS) and feature distribution smoothing (FDS). A key idea underlying both approaches is to leverage the similarity between nearby targets by employing a kernel distribution to perform explicit distribution smoothing in the label and feature spaces. Both techniques can be easily embedded into existing deep networks and allow optimization in an end-to-end fashion. We verify that our techniques not only successfully calibrate for the intrinsic underlying imbalance, but also provide large and consistent gains when combined with other methods.

To support practical evaluation of imbalanced regression, we curate and benchmark large-scale DIR datasets for common real-world tasks in computer vision, natural language processing, and healthcare. They range from single-value prediction such as age, text similarity score, and health condition score, to dense-value prediction such as depth. We further set up benchmarks for proper DIR performance evaluation.

Our contributions are as follows:

• We formally define the DIR task as learning from imbalanced data with continuous targets, and generalizing to the entire target range. DIR provides thorough and unbiased evaluation of learning algorithms in practical settings.
• We develop two simple, effective, and interpretable algorithms for DIR, LDS and FDS, which exploit the similarity between nearby targets in both label and feature space.
• We curate benchmark DIR datasets in different domains: computer vision, natural language processing, and healthcare. We set up strong baselines as well as benchmarks for proper DIR performance evaluation.
• Extensive experiments on large-scale DIR datasets verify the consistent and superior performance of our strategies.

2. Related Work

Imbalanced Classification. Much prior work has focused on the imbalanced classification problem (also referred to as long-tailed recognition (Liu et al., 2019)). Past solutions can be divided into data-based and model-based solutions. Data-based solutions either over-sample the minority class or under-sample the majority (Chawla et al., 2002; García & Herrera, 2009; He et al., 2008). For example, SMOTE generates synthetic samples for minority classes by linearly interpolating samples in the same class (Chawla et al., 2002). Model-based solutions include re-weighting or adjusting the loss function to compensate for class imbalance (Cao et al., 2019; Cui et al., 2019; Dong et al., 2019; Huang et al., 2016; 2019), and leveraging relevant learning paradigms, including transfer learning (Yin et al., 2019), metric learning (Zhang et al., 2017), meta-learning (Shu et al., 2019), and two-stage training (Kang et al., 2020). Recent studies have also discovered that semi-supervised learning and self-supervised learning lead to better imbalanced classification results (Yang & Xu, 2020). In contrast to this past work, we identify the limitations of applying class imbalance methods to regression problems, and introduce new techniques particularly suitable for learning continuous target values.

Imbalanced Regression. Regression over imbalanced data is not as well explored. Most of the work on this topic is a direct adaptation of the SMOTE algorithm to regression scenarios (Branco et al., 2017; 2018; Torgo et al., 2013). Synthetic samples are created for pre-defined rare target regions by either directly interpolating both inputs and targets (Torgo et al., 2013), or using Gaussian noise augmentation (Branco et al., 2017). A bagging-based ensemble method that incorporates multiple data pre-processing steps has also been introduced (Branco et al., 2018). However, these methods have several intrinsic drawbacks. First, they fail to take the distance between targets into account, and rather heuristically divide the dataset into rare and frequent sets, then plug in classification-based methods. Moreover, modern data is of extremely high dimension (e.g., images and physiological signals); linear interpolation of two samples of such data does not lead to meaningful new synthetic samples. Our methods are intrinsically different from past work in their approach. They can be combined with existing methods to improve their performance, as we show in Sec. 4. Further, our approaches are tested on large-scale real-world datasets in computer vision, NLP, and healthcare.

3. Methods

Problem Setting. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a training set, where $x_i \in \mathbb{R}^d$ denotes the input and $y_i \in \mathbb{R}$ is the label, which is a continuous target. We introduce an additional structure for the label space $\mathcal{Y}$, where we divide $\mathcal{Y}$ into $B$ groups (bins) with equal intervals, i.e., $[y_0, y_1), [y_1, y_2), \ldots, [y_{B-1}, y_B)$. Throughout the paper, we use $b \in \mathcal{B}$ to denote the group index of the target value, where $\mathcal{B} = \{1, \ldots, B\} \subset \mathbb{Z}^+$ is the index space. In practice, the defined bins reflect a minimum resolution we care for grouping data in a regression task. For instance, in age estimation, we could define $\delta y \triangleq y_{b+1} - y_b = 1$, showing a minimum age difference of 1 is of interest. Finally, we denote $z = f(x; \theta)$ as the feature for $x$, where $f(x; \theta)$ is parameterized by a deep neural network model with parameter $\theta$. The final prediction $\hat{y}$ is given by a regression function $g(\cdot)$ that operates over $z$.
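As a small illustration of the binning in the problem setting, the following sketch maps continuous targets to bin indices $b \in \{1, \ldots, B\}$ with equal-width bins. The function name `assign_bins` and its arguments are hypothetical, and clipping values that fall outside the half-open range (including a target exactly equal to the upper bound) into the nearest bin is an assumption of this sketch rather than something the text specifies.

```python
import numpy as np

def assign_bins(y, y_min, y_max, num_bins):
    """Map continuous targets to 1-based bin indices over equal-width bins.

    Bin b covers [y_min + (b - 1) * width, y_min + b * width).
    """
    y = np.asarray(y, dtype=float)
    width = (y_max - y_min) / num_bins          # delta_y in the text
    idx = np.floor((y - y_min) / width).astype(int) + 1
    # Assumption: out-of-range targets are folded into the first/last bin.
    return np.clip(idx, 1, num_bins)

# Age estimation with a resolution of one year over [0, 100).
ages = [0.0, 0.9, 1.0, 29.5, 99.9, 100.0]
bins = assign_bins(ages, y_min=0.0, y_max=100.0, num_bins=100)
```

Here ages 0.0 and 0.9 share the first bin, while 1.0 starts the second one, matching the half-open intervals $[y_{b-1}, y_b)$.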
[Figure 2 plots not recoverable from this extract. Two panels, each showing the number of samples (top, ×10^2) and the test error (bottom) over label values 0 to 100: (a) categorical label space (class index), Pearson correlation −0.76; (b) continuous label space (age), Pearson correlation −0.47.]
3.1. Label Distribution Smoothing

We start by showing an example to demonstrate the difference between classification and regression when imbalance comes into the picture.

Motivating Example. We employ two datasets: (1) CIFAR-100 (Krizhevsky et al., 2009), which is a 100-class classification dataset, and (2) the IMDB-WIKI dataset (Rothe et al., 2018), which is a large-scale image dataset for age estimation from visual appearance. The two datasets have intrinsically different label spaces: CIFAR-100 exhibits a categorical label space where the target is the class index, while IMDB-WIKI has a continuous label space where the target is age. We limit the age range to 0 ∼ 99 so that the two datasets have the same label range, and subsample them to simulate data imbalance, while ensuring they have exactly the same label density distribution (Fig. 2). We make both test sets balanced. We then train a plain ResNet-50 model on the two datasets, and plot their test error distributions.

We observe from Fig. 2(a) that the error distribution correlates with the label density distribution. Specifically, the test error as a function of class index has a high negative Pearson correlation with the label density distribution (i.e., −0.76) in the categorical label space. The phenomenon is expected, as majority classes with more samples are better learned than minority classes. Interestingly, however, as Fig. 2(b) shows, the error distribution is very different for IMDB-WIKI with continuous label space, even when the label density distribution is the same as CIFAR-100. In particular, the error distribution is much smoother and no longer correlates well with the label density distribution (−0.47).

The reason why this example is interesting is that all imbalanced learning methods, directly or indirectly, operate by compensating for the imbalance in the empirical label density distribution. This works well for class imbalance, but for continuous labels the empirical density does not accurately reflect the imbalance as seen by the neural network. Hence, compensating for data imbalance based on empirical label density is inaccurate for the continuous label space.

LDS for Imbalanced Data Density Estimation. The above example shows that, in the continuous case, the empirical label distribution does not reflect the real label density distribution. This is because of the dependence between data samples at nearby labels (e.g., images of close ages). In fact, there is a significant literature in statistics on how to estimate the expected density in such cases (Parzen, 1962). Thus, Label Distribution Smoothing (LDS) advocates the use of kernel density estimation to learn the effective imbalance in datasets that correspond to continuous targets.

LDS convolves a symmetric kernel with the empirical density distribution to extract a kernel-smoothed version that accounts for the overlap in information of data samples of nearby labels. A symmetric kernel is any kernel that satisfies $k(y, y') = k(y', y)$ and $\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0$, $\forall y, y' \in \mathcal{Y}$. Note that a Gaussian or a Laplacian kernel is a symmetric kernel, while $k(y, y') = y y'$ is not. The symmetric kernel characterizes the similarity between target values $y'$ and any $y$ w.r.t. their distance in the target space. Thus, LDS computes the effective label density distribution as:

$$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy, \tag{1}$$

where $p(y)$ is the number of appearances of label $y$ in the training data, and $\tilde{p}(y')$ is the effective density of label $y'$.

Fig. 3 illustrates LDS and how it smooths the label density distribution. Further, it shows that the resulting label density computed by LDS correlates well with the error distribution (−0.83). This demonstrates that LDS captures the real imbalance that affects regression problems.

Now that the effective label density is available, techniques for addressing class imbalance problems can be directly adapted to the DIR context. For example, a straightforward adaptation can be the cost-sensitive re-weighting method, where we re-weight the loss function by multiplying it by the inverse of the LDS estimated label density for each target. We show in Sec. 4 that LDS can be seamlessly incorporated with a wide range of techniques to boost DIR performance.
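The LDS computation in Eq. (1) and the inverse-density re-weighting described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the Gaussian window size and bandwidth, the zero-padding at the ends of the label range, the rescaling of weights to mean 1, and the toy histogram are all choices made for this example. Bins are 0-based here for array indexing.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=2.0):
    """Symmetric, normalised Gaussian window over neighbouring bins (size should be odd)."""
    half = (size - 1) // 2
    offsets = np.arange(-half, half + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()

def lds_effective_density(bin_counts, kernel):
    """Discrete analogue of Eq. (1): convolve the empirical label histogram with the kernel."""
    counts = np.asarray(bin_counts, dtype=float)
    # mode="same" keeps one value per bin; bins beyond the label range count as empty.
    return np.convolve(counts, kernel, mode="same")

def inverse_density_weights(effective_density, bin_index, eps=1e-8):
    """Per-sample loss weights proportional to 1 / effective density, rescaled to mean 1."""
    w = 1.0 / (effective_density[bin_index] + eps)
    return w / w.mean()

def weighted_mse(pred, target, weights):
    """Cost-sensitive squared error: each sample's loss is scaled by its weight."""
    return float(np.mean(weights * (pred - target) ** 2))

# Toy histogram over 5 bins; bin 2 has no training samples at all.
counts = np.array([100, 80, 0, 2, 90])
kernel = gaussian_kernel(size=3, sigma=1.0)
eff = lds_effective_density(counts, kernel)

sample_bins = np.array([0, 1, 3, 4])   # 0-based bin index of four training samples
weights = inverse_density_weights(eff, sample_bins)
loss = weighted_mse(np.array([1.0, 2.0, 3.0, 4.0]),
                    np.array([1.5, 2.0, 2.0, 4.0]), weights)
```

In this toy case the empty bin 2 receives a non-zero effective density from its neighbours, and the nearly empty bin 3 is treated as better represented than its raw count of 2 suggests, because it sits next to a bin with 90 samples. This mirrors the $t_1$ versus $t_2$ argument in the introduction.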
[...] and group features with the same target value in the same bin. We then compute the feature statistics (i.e., mean and variance) with respect to the data in each bin, which we denote as $\{\mu_b, \sigma_b\}_{b=1}^{B}$. To visualize the similarity between feature statistics, we select an anchor bin $b_0$, and calculate the cosine similarity of the feature statistics between $b_0$ and all other bins. The results are summarized in Fig. 4 for $b_0 = 30$. The figure also shows the regions with different data densities using the colors purple, yellow, and pink.

Figure 4. Feature statistics similarity for age 30. Top: Cosine similarity of the feature mean at a particular age w.r.t. its value at the anchor age. Bottom: Cosine similarity of the feature variance at a particular age w.r.t. its value at the anchor age. The color of the background refers to the data density in a particular target range. The figure shows that nearby ages have close similarities; however, it also shows that there is unjustified similarity between images at ages 0 to 6 and age 30, due to data imbalance.

Fig. 4 shows that the feature statistics around $b_0 = 30$ are highly similar to their values at $b_0 = 30$. Specifically, the cosine similarities of the feature mean and feature variance for all bins between age 25 and 35 are within a few percent of their values at age 30 (the anchor age). Further, the similarity gets higher for tighter ranges around the anchor. Note that bin 30 falls in the high-shot region. In fact, it is among the few bins that have the most samples. So, the figure confirms the intuition that when there is enough data, and for continuous targets, the feature statistics are similar to nearby bins. Interestingly, the figure also shows the problem with regions that have very few data samples, like the age range 0 to 6 years (shown in pink). Note that the mean and variance in this range show unexpectedly high similarity to age 30. In fact, it is shocking that the feature statistics at age 30 are more similar to age 1 than age 17. This unjustified similarity is due to data imbalance. Specifically, since there are not enough images for ages 0 to 6, this range inherits its priors from the range with the maximum amount of data, which is the range around age 30.

FDS Algorithm. Inspired by these observations, we propose feature distribution smoothing (FDS), which performs distribution smoothing on the feature space, i.e., transfers the feature statistics between nearby target bins. This procedure aims to calibrate the potentially biased estimates of feature distribution, especially for underrepresented target values (e.g., medium- and few-shot groups) in training data.

FDS is performed by first estimating the statistics of each bin. Without loss of generality, we substitute variance with covariance to reflect also the relationship between the various feature elements within $z$:

$$\mu_b = \frac{1}{N_b} \sum_{i=1}^{N_b} z_i, \tag{2}$$

$$\Sigma_b = \frac{1}{N_b - 1} \sum_{i=1}^{N_b} (z_i - \mu_b)(z_i - \mu_b)^{\top}, \tag{3}$$

where $N_b$ is the total number of samples in the $b$-th bin. Given the feature statistics, we employ again a symmetric kernel $k(y_b, y_{b'})$ to smooth the distribution of the feature mean and covariance over the target bins $\mathcal{B}$. This results in a smoothed version of the statistics:

$$\tilde{\mu}_b = \sum_{b' \in \mathcal{B}} k(y_b, y_{b'})\, \mu_{b'}, \tag{4}$$

$$\tilde{\Sigma}_b = \sum_{b' \in \mathcal{B}} k(y_b, y_{b'})\, \Sigma_{b'}. \tag{5}$$

With both $\{\mu_b, \Sigma_b\}$ and $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$, we then follow the standard whitening and re-coloring procedure (Sun et al., 2016) to calibrate the feature representation for each input sample:

$$\tilde{z} = \tilde{\Sigma}_b^{\frac{1}{2}}\, \Sigma_b^{-\frac{1}{2}} (z - \mu_b) + \tilde{\mu}_b. \tag{6}$$

We integrate FDS into deep networks by inserting a feature calibration layer after the final feature map. To train the model, we employ a momentum update of the running statistics $\{\mu_b, \Sigma_b\}$ across each epoch. Correspondingly, the smoothed statistics $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ are updated across different epochs but fixed within each training epoch. The momentum update, which performs an exponential moving average (EMA) of running statistics, results in more stable and accurate estimations of the feature statistics during training. The calibrated features $\tilde{z}$ are then passed to the final regression function and used to compute the loss.

[Figure 5 diagram not recoverable from this extract. Diagram labels: "Statistics", "EMA across epoch", "Calibration".]
Figure 5. Feature distribution smoothing (FDS) introduces a feature calibration layer that uses kernel smoothing to smooth the distributions of feature mean and covariance over the target space.

We note that FDS can be integrated with any neural network model, as well as any past work on improving label imbalance. In Sec. 4, we integrate FDS with a variety of prior techniques for addressing data imbalance, and demonstrate that it consistently improves performance.

4. Benchmarking DIR

Datasets. We curate five DIR benchmarks that span computer vision, natural language processing, and healthcare. Fig. 6 shows the label density distribution of these datasets, and their level of imbalance.

• IMDB-WIKI-DIR (age): We construct IMDB-WIKI-DIR using the IMDB-WIKI dataset (Rothe et al., 2018), which contains 523.0K face images and the corresponding ages. We filter out unqualified images, and manually construct balanced validation and test sets over the supported ages. The length of each bin is 1 year, with a minimum age of 0 and a maximum age of 186. The number of images per bin varies between 1 and 7149, exhibiting significant data imbalance. Overall, the curated dataset has 191.5K images for training, and 11.0K images for validation and testing.

• AgeDB-DIR (age): AgeDB-DIR is constructed in a similar manner from the AgeDB dataset (Moschoglou et al., 2017). It contains 12.2K images for training, with a minimum age of 0 and a maximum age of 101, and a maximum bin density of 353 images and minimum bin density of 1. The validation and test sets are balanced with 2.1K images.

• STS-B-DIR (text similarity score): We construct STS-B-DIR from the Semantic Textual Similarity Benchmark (Cer et al., 2017; Wang et al., 2018), which is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is annotated by multiple annotators with an averaged [...]

• NYUD2-DIR (depth): We create NYUD2-DIR based on the NYU Depth Dataset V2 (Nathan Silberman & Fergus, 2012), which provides images and depth maps for different indoor scenes. The depth maps have an upper bound of 10 meters and we set the bin length as 0.1 meter. Following standard practices (Bhat et al., 2020; Hu et al., 2019), we use 50K images for training and 654 images for testing. We randomly select 9357 test pixels for each bin to make the test set balanced.

• SHHS-DIR (health condition score): We create SHHS-DIR based on the SHHS dataset (Quan et al., 1997), which contains full-night Polysomnography (PSG) from 2651 subjects. Available PSG signals include Electroencephalography (EEG), Electrocardiography (ECG), and breathing signals (airflow, abdomen, and thorax), which are used as inputs. The dataset also includes the 36-Item Short Form Health Survey (SF-36) (Ware Jr & Sherbourne, 1992) for each subject, where a General Health score is extracted. The score is used as the target value with a minimum score of 0 and maximum of 100.

Network Architectures. We employ ResNet-50 (He et al., 2016) as our backbone network for IMDB-WIKI-DIR and AgeDB-DIR. Following (Wang et al., 2018), we adopt the same BiLSTM + GloVe word embeddings baseline for STS-B-DIR. For NYUD2-DIR, we use the ResNet-50-based encoder-decoder architecture introduced in (Hu et al., 2019). Finally, for SHHS-DIR, we use the same CNN-RNN architecture with ResNet block for PSG signals as in (Wang et al., 2019).

Baselines. Since the literature has only a few proposals for DIR, in addition to past work on imbalanced regression (Branco et al., 2017; Torgo et al., 2013), we adapt a few imbalanced classification methods for regression, and propose a strong set of baselines. Below, we describe the baselines, and how we can combine LDS with each method. FDS can be directly integrated with any baseline as a calibration layer, as described in Sec. 3.2.

• Vanilla model: We use the term VANILLA to denote a model that does not include any technique for dealing with imbalanced data. To combine the vanilla model with LDS, we re-weight the loss function by multiplying it by the inverse of the LDS estimated density for each target bin.

• Synthetic samples: We choose existing methods for imbalanced regression, including SMOTER (Torgo et al., 2013) and SMOGN (Branco et al., 2017). SMOTER first defines frequent and rare regions using the original label density, and creates synthetic samples for pre-defined rare [...]
[Figure 6 plots not recoverable from this extract. Five histograms of the number of training samples: IMDB-WIKI-DIR (×10^3, over age), AgeDB-DIR (×10^2, over age), STS-B-DIR (×10^2, over similarity score), NYUD2-DIR (×10^8, over depth in m), and SHHS-DIR (×10^2, over SF-36 score).]
Figure 6. Overview of training set label distribution for five DIR datasets. They range from single-value prediction such as age, textual similarity score, and health condition score, to dense-value prediction such as depth estimation. More details are provided in Appendix B.
[Fragment of a benchmarking table; its caption and column headers are missing from this extract.]

INV                        14.39  11.84  13.12  16.02    9.34   7.73   8.49  11.20
INV + LDS                  14.14  11.66  12.77  16.05    9.26   7.64   8.18  11.32
INV + FDS                  13.91  11.12  12.29  15.53    8.94   6.91   7.79  10.65
INV + LDS + FDS            13.76  11.12  12.18  15.07    8.70   6.94   7.60  10.18
OURS (BEST) VS. VANILLA    +1.60  +1.41  +1.80  +1.87   +1.93  +1.13  +1.99  +2.02

Table 3. Benchmarking results on STS-B-DIR.

Metrics                        MSE ↓                           Pearson correlation (%) ↑
Shot                           All    Many   Med.   Few        All    Many   Med.   Few
VANILLA                        0.974  0.851  1.520  0.984      74.2   72.0   62.7   75.2
SMOTER (Torgo et al., 2013)    1.046  0.924  1.542  1.154      72.6   69.3   65.3   70.6
SMOGN (Branco et al., 2017)    0.990  0.896  1.327  1.175      73.2   70.4   65.5   69.2
SMOGN + LDS                    0.962  0.880  1.242  1.155      74.0   71.5   65.2   69.8
SMOGN + FDS                    0.987  0.945  1.101  1.153      73.0   69.6   68.5   69.9
SMOGN + LDS + FDS              0.950  0.851  1.327  1.095      74.6   72.1   65.9   71.7
FOCAL-R + FDS                  0.920  0.855  1.169  1.008      75.1   72.6   66.4   74.7
FOCAL-R + LDS + FDS            0.940  0.849  1.358  0.916      74.9   72.2   66.3   77.3
OURS (BEST) VS. VANILLA        +.071  +.049  +.419  +.068      +1.8   +2.0   +5.8   +2.1

[Figure 7 plots not recoverable from this extract. Panels: label distribution (number of samples, ×10^3) with extrapolation and interpolation regions marked, and absolute MAE gains; the legend includes the medium-shot region.]
Figure 7. The absolute MAE gains of LDS + FDS over the vanilla model, on a curated subset of IMDB-WIKI-DIR with certain target values having no training data. We establish notable performance gains w.r.t. all regions, especially for extrapolation & interpolation.

[...] regions. The advantage is even more profound under Pearson correlation, which is commonly used for this NLP task.

Inferring Depth: NYUD2-DIR. For NYUD2-DIR, which is a dense regression task, we verify from Table 4 that adding LDS and FDS significantly improves the results. We note that the vanilla model can inevitably overfit to the many-shot regions during training. FDS and LDS help alleviate this effect, and generalize better to all regions, with minor degradation in the many-shot region but significant boosts for other regions.

Inferring Health Score: SHHS-DIR. Table 5 reports the results on SHHS-DIR. Since SMOTER and SMOGN are not directly applicable to this medical data, we skip them for this dataset. The results again confirm the effectiveness of both FDS and LDS when applied for real-world imbalanced regression tasks, where by combining FDS and LDS we often get the highest gains over all tested regions.

4.2. Further Analysis

Extrapolation & Interpolation. In real-world DIR tasks, certain target values can have no data at all (e.g., see SHHS-DIR and STS-B-DIR in Fig. 6). This motivates the need for target extrapolation and interpolation. We curate a subset from the training set of IMDB-WIKI-DIR, which has no [...]
[Figure 8 plots not recoverable from this extract. Panels (a) and (b) plot the mean and variance cosine similarity against target value (age) for anchor target 0, with many-shot, medium-shot, and few-shot regions marked; panel (c) plots the average L1 distance against training epoch.]
(a) Feature statistics similarity for age 0, without FDS. (b) Feature statistics similarity for age 0, with FDS. (c) Statistics change.
Figure 8. Analysis on how FDS works. (a) & (b) Feature statistics similarity for anchor age 0, using model trained without and with FDS. (c) L1 distance between the running statistics $\{\mu_b, \Sigma_b\}$ and the smoothed statistics $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ during training.
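To make the running and smoothed statistics compared in Fig. 8 concrete, the FDS steps of Sec. 3.2 (per-bin statistics, kernel smoothing, whitening and re-coloring, and the EMA update) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: all function names are hypothetical, bins are 0-based, bins with fewer than two samples fall back to a zero mean and identity covariance, the kernel is renormalised at the ends of the bin range, and the momentum value 0.9 is an arbitrary choice.

```python
import numpy as np

def bin_statistics(z, bins, num_bins):
    """Per-bin feature mean and covariance, in the spirit of Eqs. (2)-(3)."""
    d = z.shape[1]
    mu = np.zeros((num_bins, d))
    sigma = np.tile(np.eye(d), (num_bins, 1, 1))  # assumption: identity for sparse bins
    for b in range(num_bins):
        zb = z[bins == b]
        if len(zb) > 0:
            mu[b] = zb.mean(axis=0)
        if len(zb) > 1:
            sigma[b] = np.cov(zb, rowvar=False)   # divides by N_b - 1
    return mu, sigma

def smooth_statistics(mu, sigma, kernel):
    """Kernel-smooth mean and covariance across neighbouring bins, as in Eqs. (4)-(5)."""
    num_bins = mu.shape[0]
    half = (len(kernel) - 1) // 2
    mu_s, sigma_s = np.zeros_like(mu), np.zeros_like(sigma)
    for b in range(num_bins):
        total = 0.0
        for j, k in enumerate(kernel):
            b2 = b + j - half
            if 0 <= b2 < num_bins:
                mu_s[b] += k * mu[b2]
                sigma_s[b] += k * sigma[b2]
                total += k
        mu_s[b] /= total        # assumption: renormalise the kernel at the boundaries
        sigma_s[b] /= total
    return mu_s, sigma_s

def sym_power(mat, power, eps=1e-6):
    """Matrix power of a symmetric PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, eps, None)
    return (vecs * vals ** power) @ vecs.T

def calibrate(z, b, mu, sigma, mu_s, sigma_s):
    """Eq. (6): whiten with the bin statistics, re-color with the smoothed ones."""
    return sym_power(sigma_s[b], 0.5) @ sym_power(sigma[b], -0.5) @ (z - mu[b]) + mu_s[b]

def ema_update(running, current, momentum=0.9):
    """Exponential moving average of the running statistics across epochs."""
    return momentum * running + (1.0 - momentum) * current

# Synthetic features only, to exercise the functions.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
bin_ids = rng.integers(0, 4, size=200)
mu, sigma = bin_statistics(features, bin_ids, num_bins=4)
mu_s, sigma_s = smooth_statistics(mu, sigma, kernel=np.array([0.25, 0.5, 0.25]))
z_bin0 = features[bin_ids == 0]
z_cal = np.stack([calibrate(zi, 0, mu, sigma, mu_s, sigma_s) for zi in z_bin0])
running_mu = ema_update(running=np.zeros_like(mu), current=mu)
```

After calibration, the features of a bin have the smoothed mean and covariance of that bin, so a bin whose raw statistics were estimated from few samples borrows strength from its neighbours. When the smoothed and raw statistics coincide, the calibration reduces to the identity map.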
[...] label distributions, LDS and FDS consistently boost the performance across all regions compared to the vanilla model, with relative MAE gains ranging from 8.8% to 12.4%.

Comparisons to imbalanced classification methods (Appendix E.6). Finally, to gain more insights on the intrinsic difference between imbalanced classification & imbalanced regression problems, we directly apply existing imbalanced classification schemes on several appropriate DIR datasets, and show empirical comparisons with imbalanced regression approaches. We demonstrate in Appendix E.6 that LDS and FDS outperform imbalanced classification schemes by a large margin, where the errors for few-shot regions can be reduced by up to 50% to 60%. Interestingly, the results also show that imbalanced classification schemes often perform worse than even the vanilla regression model, which confirms that regression requires different approaches for data imbalance than simply applying classification methods. We note that imbalanced classification methods could fail on regression problems for several reasons. First, they ignore the similarity between data samples that are close w.r.t. the continuous target. Moreover, classification cannot extrapolate or interpolate in the continuous label space, and is therefore unable to deal with missing data in certain target regions.

5. Conclusion

We introduce the DIR task that learns from natural imbalanced data with continuous targets, and generalizes to the entire target range. We propose two simple and effective algorithms for DIR that exploit the similarity between nearby targets in both label and feature spaces. Extensive results on five curated large-scale real-world DIR benchmarks confirm the superior performance of our methods. Our work fills the gap in benchmarks and techniques for practical DIR tasks.

References

Bhat, S. F., Alhashim, I., and Wonka, P. Adabins: Depth estimation using adaptive bins. arXiv preprint arXiv:2011.14141, 2020.

Branco, P., Torgo, L., and Ribeiro, R. P. Smogn: a pre-processing approach for imbalanced regression. In First international workshop on learning with imbalanced domains: Theory and applications, pp. 36–50. PMLR, 2017.

Branco, P., Torgo, L., and Ribeiro, R. P. Rebagg: Resampled bagging for imbalanced regression. In Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, pp. 67–81. PMLR, 2018.

Buda, M., Maki, A., and Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.

Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. In NeurIPS, 2019.

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation, pp. 1–14, 2017.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.

Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Class-balanced loss based on effective number of samples. In CVPR, 2019.

Dong, Q., Gong, S., and Zhu, X. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1367–1381, Jun 2019.

Eigen, D., Puhrsch, C., and Fergus, R. Depth map prediction from a single image using a multi-scale deep network. NeurIPS, 2014.

García, S. and Herrera, F. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary computation, 17(3):275–306, 2009.

Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., Peters, M., Schmitz, M., and Zettlemoyer, L. S. Allennlp: A deep semantic natural language processing platform. 2017.

He, H., Bai, Y., Garcia, E. A., and Li, S. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In IEEE international joint conference on neural networks, pp. 1322–1328, 2008.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hu, J., Ozay, M., Zhang, Y., and Okatani, T. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In WACV, 2019.

Huang, C., Li, Y., Change Loy, C., and Tang, X. Learning deep representation for imbalanced classification. In CVPR, 2016.

Huang, C., Li, Y., Chen, C. L., and Tang, X. Deep imbalanced learning for face recognition and attribute prediction. IEEE transactions on pattern analysis and machine intelligence, 2019.
Delving into Deep Imbalanced Regression
Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, Sun, B., Feng, J., and Saenko, K. Return of frustratingly
J., and Kalantidis, Y. Decoupling representation and easy domain adaptation. In Proceedings of the AAAI
classifier for long-tailed recognition. ICLR, 2020. Conference on Artificial Intelligence, volume 30, 2016.
Kingma, D. P. and Ba, J. Adam: A method for stochastic Tang, K., Huang, J., and Zhang, H. Long-tailed classifica-
optimization. arXiv preprint arXiv:1412.6980, 2014. tion by keeping the good and removing the bad momen-
tum causal effect. In NeurIPS, 2020.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers
of features from tiny images. 2009. Torgo, L., Ribeiro, R. P., Pfahringer, B., and Branco, P.
Smote for regression. In Portuguese conference on artifi-
Lei, T., Zhang, Y., Wang, S. I., Dai, H., and Artzi, Y. Simple cial intelligence, pp. 378–389. Springer, 2013.
recurrent units for highly parallelizable recurrence. In
EMNLP, pp. 4470–4481, 2018. Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas,
I., Lopez-Paz, D., and Bengio, Y. Manifold mixup: Better
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. representations by interpolating hidden states. In Interna-
Focal loss for dense object detection. In ICCV, pp. 2980– tional Conference on Machine Learning, 2019.
2988, 2017.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and
Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. Bowman, S. R. Glue: A multi-task benchmark and analy-
Large-scale long-tailed recognition in an open world. In sis platform for natural language understanding. EMNLP
CVPR, 2019. 2018, pp. 353, 2018.
Loper, E. and Bird, S. Nltk: The natural language toolkit. Wang, H., Mao, C., He, H., Zhao, M., Jaakkola, T. S., and
arXiv preprint cs/0205028, 2002. Katabi, D. Bidirectional inference networks: A class of
deep bayesian networks for health profiling. In Proceed-
Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., ings of the AAAI Conference on Artificial Intelligence, pp.
Kotsia, I., and Zafeiriou, S. Agedb: The first manually 766–773, 2019.
collected, in-the-wild age database. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Ware Jr, J. E. and Sherbourne, C. D. The mos 36-item short-
Recognition Workshop, volume 2, pp. 5, 2017. form health survey (sf-36): I. conceptual framework and
item selection. Medical care, pp. 473–483, 1992.
Nathan Silberman, Derek Hoiem, P. K. and Fergus, R. In-
door segmentation and support inference from rgbd im- Yang, Y. and Xu, Z. Rethinking the value of labels for
ages. In ECCV, 2012. improving class-imbalanced learning. In NeurIPS, 2020.
Parzen, E. On estimation of a probability density function Yin, X., Yu, X., Sohn, K., Liu, X., and Chandraker, M.
and mode. The annals of mathematical statistics, 33(3): Feature transfer learning for face recognition with under-
1065–1076, 1962. represented data. In In Proceeding of IEEE Computer
Vision and Pattern Recognition, Long Beach, CA, June
Pennington, J., Socher, R., and Manning, C. D. Glove: 2019.
Global vectors for word representation. In Proceedings
of the 2014 conference on empirical methods in natural Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D.
language processing (EMNLP), pp. 1532–1543, 2014. mixup: Beyond empirical risk minimization. In ICLR,
2018.
Quan, S. F., Howard, B. V., Iber, C., Kiley, J. P., Nieto, F. J.,
O’Connor, G. T., Rapoport, D. M., Redline, S., Robbins, Zhang, X., Fang, Z., Wen, Y., Li, Z., and Qiao, Y. Range
J., Samet, J. M., et al. The sleep heart health study: design, loss for deep face recognition with long-tailed training
rationale, and methods. Sleep, 20(12):1077–1085, 1997. data. In ICCV, 2017.
Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and
Meng, D. Meta-weight-net: Learning an explicit mapping
for sample weighting. arXiv preprint arXiv:1902.07379,
2019.