Delving Into Deep Imbalanced Regression

The document introduces Deep Imbalanced Regression (DIR), a framework for learning from imbalanced data with continuous targets, addressing challenges such as missing data and the need for generalization across the entire target range. It proposes two methods, Label Distribution Smoothing (LDS) and Feature Distribution Smoothing (FDS), to effectively calibrate label and feature distributions by leveraging similarities between nearby targets. The authors benchmark DIR datasets across various domains and demonstrate the superior performance of their methods through extensive experiments.

Delving into Deep Imbalanced Regression

Yuzhe Yang 1 Kaiwen Zha 1 Ying-Cong Chen 1 Hao Wang 2 Dina Katabi 1

1 MIT Computer Science & Artificial Intelligence Laboratory. 2 Department of Computer Science, Rutgers University. Correspondence to: Yuzhe Yang <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Real-world data often exhibit imbalanced distributions, where certain target values have significantly fewer observations. Existing techniques for dealing with imbalanced data focus on targets with categorical indices, i.e., different classes. However, many tasks involve continuous targets, where hard boundaries between classes do not exist. We define Deep Imbalanced Regression (DIR) as learning from such imbalanced data with continuous targets, dealing with potential missing data for certain target values, and generalizing to the entire target range. Motivated by the intrinsic difference between categorical and continuous label space, we propose distribution smoothing for both labels and features, which explicitly acknowledges the effects of nearby targets, and calibrates both label and learned feature distributions. We curate and benchmark large-scale DIR datasets from common real-world tasks in computer vision, natural language processing, and healthcare domains. Extensive experiments verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for practical imbalanced regression problems. Code and data are available at: https://github.com/YyzHarry/imbalanced-regression.

1. Introduction

Data imbalance is ubiquitous and inherent in the real world. Rather than preserving an ideal uniform distribution over each category, the data often exhibit skewed distributions with a long tail (Buda et al., 2018; Liu et al., 2019), where certain target values have significantly fewer observations. This phenomenon poses great challenges for deep recognition models, and has motivated many prior techniques for addressing data imbalance (Cao et al., 2019; Cui et al., 2019; Huang et al., 2019; Liu et al., 2019; Tang et al., 2020).

[Figure 1: plot titled "Deep Imbalanced Regression". x-axis: continuous target value; y-axis: number of samples. Annotations: "Imbalanced distribution", "Missing data".]
Figure 1. Deep Imbalanced Regression (DIR) aims to learn from imbalanced data with continuous targets, tackle potential missing data for certain regions, and generalize to the entire target range.

Existing solutions for learning from imbalanced data, however, focus on targets with categorical indices, i.e., the targets are different classes. However, many real-world tasks involve continuous and even infinite target values. For example, in vision applications, one needs to infer the age of different people based on their visual appearances, where age is a continuous target and can be highly imbalanced. Treating different ages as distinct classes is unlikely to yield the best results because it does not take advantage of the similarity between people with nearby ages. Similar issues happen in medical applications since many health metrics, including heart rate, blood pressure, and oxygen saturation, are continuous and often have skewed distributions across patient populations.

In this work, we systematically investigate Deep Imbalanced Regression (DIR) arising in real-world settings (see Fig. 1). We define DIR as learning continuous targets from naturally imbalanced data, dealing with potentially missing data for certain target values, and generalizing to a test set that is balanced over the entire range of continuous target values. This definition is analogous to the class imbalance problem (Liu et al., 2019), but focuses on the continuous setting.

DIR brings new challenges distinct from its classification counterpart. First, given continuous (potentially infinite) target values, the hard boundaries between classes no longer exist, causing ambiguity when directly applying traditional imbalanced classification methods such as re-sampling and re-weighting. Moreover, continuous labels inherently possess a meaningful distance between targets, which has implications for how we should interpret data imbalance. For example, say two target labels $t_1$ and $t_2$ have a small number of observations in training data. However, $t_1$ is in a highly represented neighborhood (i.e., there are many samples in the range $[t_1 - \Delta, t_1 + \Delta]$), while $t_2$ is in a weakly represented neighborhood. In this case, $t_1$ does not suffer from the same level of imbalance as $t_2$. Finally, unlike classification, certain target values may have no data at all, which motivates the need for target extrapolation & interpolation.

In this paper, we propose two simple yet effective methods for addressing DIR: label distribution smoothing (LDS) and feature distribution smoothing (FDS). A key idea underlying both approaches is to leverage the similarity between nearby targets by employing a kernel distribution to perform explicit distribution smoothing in the label and feature spaces. Both techniques can be easily embedded into existing deep networks and allow optimization in an end-to-end fashion. We verify that our techniques not only successfully calibrate for the intrinsic underlying imbalance, but also provide large and consistent gains when combined with other methods.

To support practical evaluation of imbalanced regression, we curate and benchmark large-scale DIR datasets for common real-world tasks in computer vision, natural language processing, and healthcare. They range from single-value prediction such as age, text similarity score, health condition score, to dense-value prediction such as depth. We further set up benchmarks for proper DIR performance evaluation.

Our contributions are as follows:

• We formally define the DIR task as learning from imbalanced data with continuous targets, and generalizing to the entire target range. DIR provides thorough and unbiased evaluation of learning algorithms in practical settings.
• We develop two simple, effective, and interpretable algorithms for DIR, LDS and FDS, which exploit the similarity between nearby targets in both label and feature space.
• We curate benchmark DIR datasets in different domains: computer vision, natural language processing, and healthcare. We set up strong baselines as well as benchmarks for proper DIR performance evaluation.
• Extensive experiments on large-scale DIR datasets verify the consistent and superior performance of our strategies.

2. Related Work

Imbalanced Classification. Much prior work has focused on the imbalanced classification problem (also referred to as long-tailed recognition (Liu et al., 2019)). Past solutions can be divided into data-based and model-based solutions: Data-based solutions either over-sample the minority class or under-sample the majority (Chawla et al., 2002; García & Herrera, 2009; He et al., 2008). For example, SMOTE generates synthetic samples for minority classes by linearly interpolating samples in the same class (Chawla et al., 2002). Model-based solutions include re-weighting or adjusting the loss function to compensate for class imbalance (Cao et al., 2019; Cui et al., 2019; Dong et al., 2019; Huang et al., 2016; 2019), and leveraging relevant learning paradigms, including transfer learning (Yin et al., 2019), metric learning (Zhang et al., 2017), meta-learning (Shu et al., 2019), and two-stage training (Kang et al., 2020). Recent studies have also discovered that semi-supervised learning and self-supervised learning lead to better imbalanced classification results (Yang & Xu, 2020). In contrast to this past work, we identify the limitations of applying class imbalance methods to regression problems, and introduce new techniques particularly suitable for learning continuous target values.

Imbalanced Regression. Regression over imbalanced data is not as well explored. Most of the work on this topic is a direct adaptation of the SMOTE algorithm to regression scenarios (Branco et al., 2017; 2018; Torgo et al., 2013). Synthetic samples are created for pre-defined rare target regions by either directly interpolating both inputs and targets (Torgo et al., 2013), or using Gaussian noise augmentation (Branco et al., 2017). A bagging-based ensemble method that incorporates multiple data pre-processing steps has also been introduced (Branco et al., 2018). However, there exist several intrinsic drawbacks for these methods. First, they fail to take the distance between targets into account, and rather heuristically divide the dataset into rare and frequent sets, then plug in classification-based methods. Moreover, modern data is of extremely high dimension (e.g., images and physiological signals); linear interpolation of two samples of such data does not lead to meaningful new synthetic samples. Our methods are intrinsically different from past work in their approach. They can be combined with existing methods to improve their performance, as we show in Sec. 4. Further, our approaches are tested on large-scale real-world datasets in computer vision, NLP, and healthcare.

3. Methods

Problem Setting. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a training set, where $x_i \in \mathbb{R}^d$ denotes the input and $y_i \in \mathbb{R}$ is the label, which is a continuous target. We introduce an additional structure for the label space $\mathcal{Y}$, where we divide $\mathcal{Y}$ into $B$ groups (bins) with equal intervals, i.e., $[y_0, y_1), [y_1, y_2), \ldots, [y_{B-1}, y_B)$. Throughout the paper, we use $b \in \mathcal{B}$ to denote the group index of the target value, where $\mathcal{B} = \{1, \ldots, B\} \subset \mathbb{Z}^+$ is the index space. In practice, the defined bins reflect a minimum resolution we care for grouping data in a regression task. For instance, in age estimation, we could define $\delta y \triangleq y_{b+1} - y_b = 1$, showing a minimum age difference of 1 is of interest. Finally, we denote $z = f(x; \theta)$ as the feature for $x$, where $f(x; \theta)$ is parameterized by a deep neural network model with parameter $\theta$. The final prediction $\hat{y}$ is given by a regression function $g(\cdot)$ that operates over $z$.

[Figure 2: two panels. (a) CIFAR-100 (subsampled), x-axis: categorical label space (class index), Pearson correlation: -0.76. (b) IMDB-WIKI (subsampled), x-axis: continuous label space (age), Pearson correlation: -0.47. Top row: # of samples; bottom row: test error.]
Figure 2. Comparison on the test error distribution (bottom) using same training label distribution (top) on two different datasets: (a) CIFAR-100, a classification task with categorical label space. (b) IMDB-WIKI, a regression task with continuous label space.

[Figure 3: x-axis: continuous label space (age); y-axis: test error.]
Figure 3. Label distribution smoothing (LDS) convolves a symmetric kernel with the empirical label density to estimate the effective label density distribution that accounts for the continuity of labels.

3.1. Label Distribution Smoothing

We start by showing an example to demonstrate the difference between classification and regression when imbalance comes into the picture.

Motivating Example. We employ two datasets: (1) CIFAR-100 (Krizhevsky et al., 2009), which is a 100-class classification dataset, and (2) the IMDB-WIKI dataset (Rothe et al., 2018), which is a large-scale image dataset for age estimation from visual appearance. The two datasets have intrinsically different label space: CIFAR-100 exhibits categorical label space where the target is class index, while IMDB-WIKI has a continuous label space where the target is age. We limit the age range to 0 ∼ 99 so that the two datasets have the same label range, and subsample them to simulate data imbalance, while ensuring they have exactly the same label density distribution (Fig. 2). We make both test sets balanced. We then train a plain ResNet-50 model on the two datasets, and plot their test error distributions.

We observe from Fig. 2(a) that the error distribution correlates with label density distribution. Specifically, the test error as a function of class index has a high negative Pearson correlation with the label density distribution (i.e., −0.76) in the categorical label space. The phenomenon is expected, as majority classes with more samples are better learned than minority classes. Interestingly however, as Fig. 2(b) shows, the error distribution is very different for IMDB-WIKI with continuous label space, even when the label density distribution is the same as CIFAR-100. In particular, the error distribution is much smoother and no longer correlates well with the label density distribution (−0.47).

The reason why this example is interesting is that all imbalanced learning methods, directly or indirectly, operate by compensating for the imbalance in the empirical label density distribution. This works well for class imbalance, but for continuous labels the empirical density does not accurately reflect the imbalance as seen by the neural network. Hence, compensating for data imbalance based on empirical label density is inaccurate for the continuous label space.

LDS for Imbalanced Data Density Estimation. The above example shows that, in the continuous case, the empirical label distribution does not reflect the real label density distribution. This is because of the dependence between data samples at nearby labels (e.g., images of close ages). In fact, there is a significant literature in statistics on how to estimate the expected density in such cases (Parzen, 1962). Thus, Label Distribution Smoothing (LDS) advocates the use of kernel density estimation to learn the effective imbalance in datasets that corresponds to continuous targets.

LDS convolves a symmetric kernel with the empirical density distribution to extract a kernel-smoothed version that accounts for the overlap in information of data samples of nearby labels. A symmetric kernel is any kernel that satisfies: $k(y, y') = k(y', y)$ and $\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0$, $\forall y, y' \in \mathcal{Y}$. Note that a Gaussian or a Laplacian kernel is a symmetric kernel, while $k(y, y') = y y'$ is not. The symmetric kernel characterizes the similarity between target values $y'$ and any $y$ w.r.t. their distance in the target space. Thus, LDS computes the effective label density distribution as:

$$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy, \tag{1}$$

where $p(y)$ is the number of appearances of label of $y$ in the training data, and $\tilde{p}(y')$ is the effective density of label $y'$.

Fig. 3 illustrates LDS and how it smooths the label density distribution. Further, it shows that the resulting label density computed by LDS correlates well with the error distribution (−0.83). This demonstrates that LDS captures the real imbalance that affects regression problems.

Now that the effective label density is available, techniques for addressing class imbalance problems can be directly adapted to the DIR context. For example, a straightforward adaptation can be the cost-sensitive re-weighting method, where we re-weight the loss function by multiplying it by the inverse of the LDS estimated label density for each target. We show in Sec. 4 that LDS can be seamlessly incorporated with a wide range of techniques to boost DIR performance.
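The effective density in Eq. (1) and the inverse re-weighting built on it can be sketched over discrete bins as follows. This is a minimal illustration under stated assumptions: the integral is replaced by a sum over bin centers, a Gaussian kernel is used, and the weights are rescaled to have mean 1. The function names, the kernel width `sigma`, and the rescaling are choices made for this example, not details given in the text.

```python
import math


def gaussian_kernel(y, y_prime, sigma=2.0):
    """Symmetric kernel k(y, y') that depends only on the distance |y - y'|."""
    return math.exp(-((y - y_prime) ** 2) / (2.0 * sigma ** 2))


def lds_effective_density(counts, bin_centers, sigma=2.0):
    """Discrete analogue of Eq. (1): p~(y') = sum over y of k(y, y') * p(y)."""
    return [
        sum(gaussian_kernel(y, y_prime, sigma) * p
            for y, p in zip(bin_centers, counts))
        for y_prime in bin_centers
    ]


def inverse_density_weights(density, eps=1e-8):
    """Per-bin loss weights proportional to 1 / density, rescaled to mean 1."""
    raw = [1.0 / (d + eps) for d in density]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# Bins 1 and 8 both hold a single sample, but bin 1 sits next to a dense region.
counts = [0, 1, 50, 100, 50, 1, 0, 0, 1, 0]
centers = [float(b) for b in range(len(counts))]
smoothed = lds_effective_density(counts, centers, sigma=1.0)
weights = inverse_density_weights(smoothed)
```

In this toy histogram the two single-sample bins end up with very different effective densities, so the isolated bin 8 receives a much larger loss weight than bin 1. An empty bin such as bin 0 still gets a nonzero effective density from its neighbors.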

3.2. Feature Distribution Smoothing

We are motivated by the intuition that continuity in the target space should create a corresponding continuity in the feature space. That is, if the model works properly and the data is balanced, one expects the feature statistics corresponding to nearby targets to be close to each other.

Motivating Example. We use an illustrative example to highlight the impact of data imbalance on feature statistics in DIR. Again, we use a plain model trained on the images in the IMDB-WIKI dataset to infer a person's age from visual appearance. We focus on the learned feature space, i.e., $z$. We use a minimum bin size of 1, i.e., $y_{b+1} - y_b = 1$, and group features with the same target value in the same bin. We then compute the feature statistics (i.e., mean and variance) with respect to the data in each bin, which we denote as $\{\mu_b, \sigma_b\}_{b=1}^{B}$. To visualize the similarity between feature statistics, we select an anchor bin $b_0$, and calculate the cosine similarity of the feature statistics between $b_0$ and all other bins. The results are summarized in Fig. 4 for $b_0 = 30$. The figure also shows the regions with different data densities using the colors purple, yellow, and pink.

[Figure 4: two stacked plots. x-axis: target value (age); y-axes: mean cosine similarity (top) and variance cosine similarity (bottom). Legend: anchor target (30), other targets, many-shot region, medium-shot region, few-shot region.]
Figure 4. Feature statistics similarity for age 30. Top: Cosine similarity of the feature mean at a particular age w.r.t. its value at the anchor age. Bottom: Cosine similarity of the feature variance at a particular age w.r.t. its value at the anchor age. The color of the background refers to the data density in a particular target range. The figure shows that nearby ages have close similarities; however, it also shows that there is unjustified similarity between images at ages 0 to 6 and age 30, due to data imbalance.

Fig. 4 shows that the feature statistics around $b_0 = 30$ are highly similar to their values at $b_0 = 30$. Specifically, the cosine similarity of the feature mean and feature variance for all bins between age 25 and 35 are within a few percent from their values at age 30 (the anchor age). Further, the similarity gets higher for tighter ranges around the anchor. Note that bin 30 falls in the many-shot region. In fact, it is among the few bins that have the most samples. So, the figure confirms the intuition that when there is enough data, and for continuous targets, the feature statistics are similar to nearby bins. Interestingly, the figure also shows the problem with regions that have very few data samples, like the age range 0 to 6 years (shown in pink). Note that the mean and variance in this range show unexpectedly high similarity to age 30. In fact, it is shocking that the feature statistics at age 30 are more similar to age 1 than age 17. This unjustified similarity is due to data imbalance. Specifically, since there are not enough images for ages 0 to 6, this range thus inherits its priors from the range with the maximum amount of data, which is the range around age 30.

FDS Algorithm. Inspired by these observations, we propose feature distribution smoothing (FDS), which performs distribution smoothing on the feature space, i.e., transfers the feature statistics between nearby target bins. This procedure aims to calibrate the potentially biased estimates of feature distribution, especially for underrepresented target values (e.g., medium- and few-shot groups) in training data.

FDS is performed by first estimating the statistics of each bin. Without loss of generality, we substitute variance with covariance to reflect also the relationship between the various feature elements within $z$:

$$\mu_b = \frac{1}{N_b} \sum_{i=1}^{N_b} z_i, \tag{2}$$

$$\Sigma_b = \frac{1}{N_b - 1} \sum_{i=1}^{N_b} (z_i - \mu_b)(z_i - \mu_b)^{\top}, \tag{3}$$

where $N_b$ is the total number of samples in the $b$-th bin. Given the feature statistics, we employ again a symmetric kernel $k(y_b, y_{b'})$ to smooth the distribution of the feature mean and covariance over the target bins $\mathcal{B}$. This results in a smoothed version of the statistics:

$$\tilde{\mu}_b = \sum_{b' \in \mathcal{B}} k(y_b, y_{b'})\, \mu_{b'}, \tag{4}$$

$$\tilde{\Sigma}_b = \sum_{b' \in \mathcal{B}} k(y_b, y_{b'})\, \Sigma_{b'}. \tag{5}$$

With both $\{\mu_b, \Sigma_b\}$ and $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$, we then follow the standard whitening and re-coloring procedure (Sun et al., 2016) to calibrate the feature representation for each input sample:

$$\tilde{z} = \tilde{\Sigma}_b^{\frac{1}{2}} \Sigma_b^{-\frac{1}{2}} (z - \mu_b) + \tilde{\mu}_b. \tag{6}$$

We integrate FDS into deep networks by inserting a feature calibration layer after the final feature map. To train the model, we employ a momentum update of the running statistics $\{\mu_b, \Sigma_b\}$ across each epoch. Correspondingly, the smoothed statistics $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ are updated across different epochs but fixed within each training epoch. The momentum update, which performs an exponential moving average (EMA) of running statistics, results in more stable and accurate estimations of the feature statistics during training.
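The FDS steps in Eqs. (2) to (6) can be sketched in plain Python under a simplifying assumption: the covariance $\Sigma_b$ is treated as diagonal (one variance per feature dimension), so whitening and re-coloring reduce to an element-wise rescale and shift. The kernel weights are normalized to sum to one and empty bins are skipped when smoothing. Both are assumptions made for this sketch, as are all function names, the 0-based bin indices, and the default hyper-parameters.

```python
import math


def bin_statistics(features, bins, num_bins):
    """Per-bin mean and per-dimension variance (diagonal form of Eqs. 2-3)."""
    dim = len(features[0])
    means = [[0.0] * dim for _ in range(num_bins)]
    variances = [[1.0] * dim for _ in range(num_bins)]
    counts = [0] * num_bins
    for b in range(num_bins):
        members = [z for z, zb in zip(features, bins) if zb == b]
        counts[b] = len(members)
        if not members:
            continue  # empty bin: placeholder statistics, ignored when smoothing
        means[b] = [sum(z[d] for z in members) / len(members) for d in range(dim)]
        if len(members) > 1:  # unbiased variance needs at least two samples
            variances[b] = [
                sum((z[d] - means[b][d]) ** 2 for z in members) / (len(members) - 1)
                for d in range(dim)
            ]
    return means, variances, counts


def smooth_statistics(stats, counts, sigma=1.0):
    """Gaussian-kernel average of per-bin vectors over non-empty bins (Eqs. 4-5)."""
    smoothed = []
    for b in range(len(stats)):
        weights = [
            math.exp(-((b - b2) ** 2) / (2.0 * sigma ** 2)) if counts[b2] > 0 else 0.0
            for b2 in range(len(stats))
        ]
        total = sum(weights)
        if total == 0.0:  # no populated bin within kernel reach
            smoothed.append(list(stats[b]))
            continue
        smoothed.append([
            sum(w * s[d] for w, s in zip(weights, stats)) / total
            for d in range(len(stats[b]))
        ])
    return smoothed


def calibrate(z, b, means, variances, s_means, s_variances, eps=1e-6):
    """Diagonal form of Eq. 6: whiten with bin statistics, re-color with smoothed ones."""
    return [
        math.sqrt((s_variances[b][d] + eps) / (variances[b][d] + eps))
        * (z[d] - means[b][d]) + s_means[b][d]
        for d in range(len(z))
    ]


def ema_update(running, current, momentum=0.9):
    """Momentum (EMA) update of running per-bin statistics across epochs."""
    return [
        [momentum * r + (1.0 - momentum) * c for r, c in zip(run_b, cur_b)]
        for run_b, cur_b in zip(running, current)
    ]


# Toy 1-D features: bin 0 is few-shot, bins 1 and 2 have more data.
features = [[10.0], [0.0], [2.0], [2.0], [4.0]]
bins = [0, 1, 1, 2, 2]
means, variances, counts = bin_statistics(features, bins, num_bins=3)
s_means = smooth_statistics(means, counts, sigma=1.0)
s_vars = smooth_statistics(variances, counts, sigma=1.0)
z_tilde = calibrate([10.0], 0, means, variances, s_means, s_vars)
```

In the toy data, the single sample in bin 0 is pulled toward the statistics of its better-populated neighbors. A full-covariance version would replace the element-wise square roots with matrix square roots.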

The calibrated features $\tilde{z}$ are then passed to the final regression function and used to compute the loss.

We note that FDS can be integrated with any neural network model, as well as any past work on improving label imbalance. In Sec. 4, we integrate FDS with a variety of prior techniques for addressing data imbalance, and demonstrate that it consistently improves performance.

[Figure 5: diagram of the FDS calibration layer. Labels: "Statistics", "Calibration", "EMA across epoch".]
Figure 5. Feature distribution smoothing (FDS) introduces a feature calibration layer that uses kernel smoothing to smooth the distributions of feature mean and covariance over the target space.

4. Benchmarking DIR

Datasets. We curate five DIR benchmarks that span computer vision, natural language processing, and healthcare. Fig. 6 shows the label density distribution of these datasets, and their level of imbalance.

• IMDB-WIKI-DIR (age): We construct IMDB-WIKI-DIR using the IMDB-WIKI dataset (Rothe et al., 2018), which contains 523.0K face images and the corresponding ages. We filter out unqualified images, and manually construct balanced validation and test set over the supported ages. The length of each bin is 1 year, with a minimum age of 0 and a maximum age of 186. The number of images per bin varies between 1 and 7149, exhibiting significant data imbalance. Overall, the curated dataset has 191.5K images for training, 11.0K images for validation and testing.

• AgeDB-DIR (age): AgeDB-DIR is constructed in a similar manner from the AgeDB dataset (Moschoglou et al., 2017). It contains 12.2K images for training, with a minimum age of 0 and a maximum age of 101, and maximum bin density of 353 images and minimum bin density of 1. The validation and test set are balanced with 2.1K images.

• STS-B-DIR (text similarity score): We construct STS-B-DIR from the Semantic Textual Similarity Benchmark (Cer et al., 2017; Wang et al., 2018), which is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is annotated by multiple annotators with an averaged continuous similarity score from 0 to 5. From the original training set of 7.2K pairs, we create a training set with 5.2K pairs, and balanced validation set and test set of 1K pairs each. The length of each bin is 0.1.

• NYUD2-DIR (depth): We create NYUD2-DIR based on the NYU Depth Dataset V2 (Nathan Silberman & Fergus, 2012), which provides images and depth maps for different indoor scenes. The depth maps have an upper bound of 10 meters and we set the bin length as 0.1 meter. Following standard practices (Bhat et al., 2020; Hu et al., 2019), we use 50K images for training and 654 images for testing. We randomly select 9357 test pixels for each bin to make the test set balanced.

• SHHS-DIR (health condition score): We create SHHS-DIR based on the SHHS dataset (Quan et al., 1997), which contains full-night Polysomnography (PSG) from 2651 subjects. Available PSG signals include Electroencephalography (EEG), Electrocardiography (ECG), and breathing signals (airflow, abdomen, and thorax), which are used as inputs. The dataset also includes the 36-Item Short Form Health Survey (SF-36) (Ware Jr & Sherbourne, 1992) for each subject, where a General Health score is extracted. The score is used as the target value with a minimum score of 0 and maximum of 100.

Network Architectures. We employ ResNet-50 (He et al., 2016) as our backbone network for IMDB-WIKI-DIR and AgeDB-DIR. Following (Wang et al., 2018), we adopt the same BiLSTM + GloVe word embeddings baseline for STS-B-DIR. For NYUD2-DIR, we use the ResNet-50-based encoder-decoder architecture introduced in (Hu et al., 2019). Finally, for SHHS-DIR, we use the same CNN-RNN architecture with ResNet block for PSG signals as in (Wang et al., 2019).

Baselines. Since the literature has only a few proposals for DIR, in addition to past work on imbalanced regression (Branco et al., 2017; Torgo et al., 2013), we adapt a few imbalanced classification methods for regression, and propose a strong set of baselines. Below, we describe the baselines, and how we can combine LDS with each method. For FDS, it can be directly integrated with any baseline as a calibration layer, as described in Sec. 3.2.

• Vanilla model: We use the term VANILLA to denote a model that does not include any technique for dealing with imbalanced data. To combine the vanilla model with LDS, we re-weight the loss function by multiplying it by the inverse of the LDS estimated density for each target bin.

• Synthetic samples: We choose existing methods for imbalanced regression, including SMOTER (Torgo et al., 2013) and SMOGN (Branco et al., 2017). SMOTER first defines frequent and rare regions using the original label density, and creates synthetic samples for pre-defined rare
regions by linearly interpolating both inputs and targets. SMOGN further adds Gaussian noise to SMOTER. We note that LDS can be directly used for a better estimation of label density when dividing the target space.

• Error-aware loss: Inspired by the Focal loss (Lin et al., 2017) for classification, we propose a regression version called Focal-R, where the scaling factor is replaced by a continuous function that maps the absolute error into [0, 1]. Precisely, Focal-R loss based on L1 distance can be written as $\frac{1}{n} \sum_{i=1}^{n} \sigma(|\beta e_i|)^{\gamma} e_i$, where $e_i$ is the L1 error for the $i$-th sample, $\sigma(\cdot)$ is the Sigmoid function, and $\beta$, $\gamma$ are hyper-parameters. To combine Focal-R with LDS, we multiply the loss with the inverse frequency of the estimated label density.

• Two-stage training: Following (Kang et al., 2020) where feature and classifier are decoupled and trained in two stages, we propose a regression version called regressor re-training (RRT), where in the first stage we train the encoder normally, and in the second stage freeze the encoder and re-train the regressor $g(\cdot)$ with inverse re-weighting. When adding LDS, the re-weighting in the second stage is based on the label density estimated through LDS.

• Cost-sensitive re-weighting: Since we divide the target space into finite bins, classic re-weighting methods can be directly plugged in. We adopt two re-weighting schemes based on the label distribution: inverse-frequency weighting (INV) and its square-root weighting variant (SQINV). When combining with LDS, instead of using the original label density, we use the LDS estimated target density.

Evaluation Process and Metrics. Following (Liu et al., 2019), we divide the target space into three disjoint subsets: many-shot region (bins with over 100 training samples), medium-shot region (bins with 20∼100 training samples), and few-shot region (bins with under 20 training samples), and report results on these subsets, as well as overall performance. We also refer to regions with no training samples as zero-shot, and investigate the ability of our techniques to generalize to zero-shot regions in Sec. 4.2. For metrics, we use common metrics for regression, such as the mean absolute error (MAE), mean squared error (MSE), and Pearson correlation. We further propose another metric, called error Geometric Mean (GM), defined as $(\prod_{i=1}^{n} e_i)^{\frac{1}{n}}$, for better prediction fairness.

[Figure 6: five histograms of # of samples against the target. Panels: IMDB-WIKI-DIR (age), AgeDB-DIR (age), STS-B-DIR (similarity score), NYUD2-DIR (depth, m), SHHS-DIR (SF-36 score).]
Figure 6. Overview of training set label distribution for five DIR datasets. They range from single-value prediction such as age, textual similarity score, and health condition score, to dense-value prediction such as depth estimation. More details are provided in Appendix B.

4.1. Main Results

We report the main results in this section for all DIR datasets. All training details, hyper-parameter settings, and additional results are provided in Appendix C and D.

Inferring Age from Images: IMDB-WIKI-DIR & AgeDB-DIR. We report the performance of different methods in Table 1 and 2, respectively. For each dataset, we group the baselines into four sections to reflect their different strategies. First, as both tables indicate, when applied to modern high-dimensional data like images, SMOTER and SMOGN can actually degrade the performance in comparison to the vanilla model. Moreover, within each group, adding either LDS, FDS, or both leads to performance gains, while LDS + FDS often achieves the best results. Finally, when compared to the vanilla model, using our LDS and FDS maintains or slightly improves the performance overall and on the many-shot regions, while substantially boosting the performance for the medium-shot and few-shot regions.

Inferring Text Similarity Score: STS-B-DIR. Table 3 shows the results, where similar observations can be made on STS-B-DIR. Again, both SMOTER and SMOGN perform worse than the vanilla model. In contrast, both LDS and FDS consistently and substantially improve the results for various methods, especially in medium- and few-shot re-

Table 1. Benchmarking results on IMDB-WIKI-DIR.

                               MAE ↓                            GM ↓
Method                         All     Many    Med.    Few      All     Many    Med.    Few
VANILLA                        8.06    7.23    15.12   26.33    4.57    4.17    10.59   20.46
SMOTER (Torgo et al., 2013)    8.14    7.42    14.15   25.28    4.64    4.30    9.05    19.46
SMOGN (Branco et al., 2017)    8.03    7.30    14.02   25.93    4.63    4.30    8.74    20.12
SMOGN + LDS                    8.02    7.39    13.71   23.22    4.63    4.39    8.71    15.80
SMOGN + FDS                    8.03    7.35    14.06   23.44    4.65    4.33    8.87    16.00
SMOGN + LDS + FDS              7.97    7.38    13.22   22.95    4.59    4.39    7.84    14.94
FOCAL-R                        7.97    7.12    15.14   26.96    4.49    4.10    10.37   21.20
FOCAL-R + LDS                  7.90    7.10    14.72   25.84    4.47    4.09    10.11   19.14
FOCAL-R + FDS                  7.96    7.14    14.71   26.06    4.51    4.12    10.16   19.56
FOCAL-R + LDS + FDS            7.88    7.10    14.08   25.75    4.47    4.11    9.32    18.67
RRT                            7.81    7.07    14.06   25.13    4.35    4.03    8.91    16.96
RRT + LDS                      7.79    7.08    13.76   24.64    4.34    4.02    8.72    16.92
RRT + FDS                      7.65    7.02    12.68   23.85    4.31    4.03    7.58    16.28
RRT + LDS + FDS                7.65    7.06    12.41   23.51    4.31    4.07    7.17    15.44
SQINV                          7.87    7.24    12.44   22.76    4.47    4.22    7.25    15.10
SQINV + LDS                    7.83    7.31    12.43   22.51    4.42    4.19    7.00    13.94
SQINV + FDS                    7.83    7.23    12.60   22.37    4.42    4.20    6.93    13.48
SQINV + LDS + FDS              7.78    7.20    12.61   22.19    4.37    4.12    7.39    12.61
OURS (BEST) VS. VANILLA        +0.41   +0.21   +2.71   +4.14    +0.26   +0.15   +3.66   +7.85
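The Focal-R loss, the error Geometric Mean, and the shot-region split described in this section can be sketched as below. This is a minimal illustration that follows the formulas as written in the text. The function names, the default values of `beta` and `gamma`, the optional per-sample `weights` argument (standing in for inverse LDS density weights), the small `eps` added before the logarithm, and the handling of bins with exactly 20 or 100 samples are all assumptions made for this example.

```python
import math


def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def focal_r_l1(preds, targets, beta=0.2, gamma=1.0, weights=None):
    """Focal-R on L1 error: (1/n) * sum_i sigma(|beta * e_i|)^gamma * e_i.

    `weights` optionally multiplies each term, e.g. by an inverse
    LDS-estimated label density (hypothetical argument for this sketch).
    """
    n = len(preds)
    if weights is None:
        weights = [1.0] * n
    total = 0.0
    for p, t, w in zip(preds, targets, weights):
        e = abs(p - t)  # L1 error of one sample
        total += w * (sigmoid(abs(beta * e)) ** gamma) * e
    return total / n


def geometric_mean_error(preds, targets, eps=1e-10):
    """Error Geometric Mean: (prod_i e_i)^(1/n), computed in log space."""
    logs = [math.log(abs(p - t) + eps) for p, t in zip(preds, targets)]
    return math.exp(sum(logs) / len(logs))


def shot_region(train_count):
    """Assign a bin to many / medium / few shot by its training sample count."""
    if train_count > 100:
        return "many"
    if train_count >= 20:
        return "medium"
    return "few"  # under 20 samples; empty (zero-shot) bins also land here


gm = geometric_mean_error([1.0, 3.0], [0.0, 1.0])  # errors 1 and 2
loss = focal_r_l1([2.0, 5.0], [0.0, 1.0])
```

Computing the geometric mean in log space avoids overflow or underflow from multiplying many errors together. The `eps` term only guards against a zero error and slightly perturbs the result.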
Delving into Deep Imbalanced Regression

Table 2. Benchmarking results on AgeDB-DIR. Table 4. Benchmarking results on NYUD2-DIR.

Metrics MAE ↓ GM ↓ Metrics RMSE ↓ δ1 ↑
Shot All Many Med. Few All Many Med. Few Shot All Many Med. Few All Many Med. Few
VANILLA 7.77 6.62 9.55 13.67 5.05 4.23 7.01 10.75 VANILLA 1.477 0.591 0.952 2.123 0.677 0.777 0.693 0.570
S MOTE R (Torgo et al., 2013) 8.16 7.39 8.65 12.28 5.21 4.65 5.69 8.49 VANILLA + LDS 1.387 0.671 0.913 1.954 0.672 0.701 0.706 0.630
SMOGN (Branco et al., 2017) 8.26 7.64 9.01 12.09 5.36 4.90 6.19 8.44 VANILLA + FDS 1.442 0.615 0.940 2.059 0.681 0.760 0.695 0.596
SMOGN + LDS 7.96 7.44 8.64 11.77 5.03 4.68 5.69 7.98 VANILLA + LDS + FDS 1.338 0.670 0.851 1.880 0.705 0.730 0.764 0.655
SMOGN + FDS 8.06 7.52 8.75 11.89 5.02 4.66 5.63 8.02
SMOGN + LDS + FDS 7.90 7.32 8.51 11.19 4.98 4.64 5.41 7.35 O URS ( BEST ) VS . VANILLA +.139 -.024 +.101 +.243 +.028 -.017 +.071 +.085

F OCAL -R 7.64 6.68 9.22 13.00 4.90 4.26 6.39 9.52

F OCAL -R + LDS 7.56 6.67 8.82 12.40 4.82 4.27 5.87 8.83 Table 5. Benchmarking results on SHHS-DIR.
F OCAL -R + FDS 7.65 6.89 8.70 11.92 4.83 4.32 5.89 8.04
F OCAL -R + LDS + FDS 7.47 6.69 8.30 12.55 4.71 4.25 5.36 8.59 Metrics MAE ↓ GM ↓
RRT 7.74 6.98 8.79 11.99 5.00 4.50 5.88 8.63 Shot All Many Med. Few All Many Med. Few
RRT + LDS 7.72 7.00 8.75 11.62 4.98 4.54 5.71 8.27 VANILLA 15.36 12.47 13.98 16.94 10.63 8.04 9.59 12.20
RRT + FDS 7.70 6.95 8.76 11.86 4.82 4.32 5.83 8.08
RRT + LDS + FDS 7.66 6.99 8.60 11.32 4.80 4.42 5.53 6.99 F OCAL -R 14.67 11.70 13.69 17.06 9.98 7.93 8.85 11.95
F OCAL -R + LDS 14.49 12.01 12.43 16.57 9.98 7.89 8.59 11.40
SQI NV 7.81 7.16 8.80 11.20 4.99 4.57 5.73 7.77
F OCAL -R + FDS 14.18 11.06 13.56 15.99 9.45 6.95 8.81 11.13
SQI NV + LDS 7.67 6.98 8.86 10.89 4.85 4.39 5.80 7.45
F OCAL -R + LDS + FDS 14.02 11.08 12.24 15.49 9.32 7.18 8.10 10.39
SQI NV + FDS 7.69 7.10 8.86 9.98 4.83 4.41 5.97 6.29
SQI NV + LDS + FDS 7.55 7.01 8.24 10.79 4.72 4.36 5.45 6.79 RRT 14.78 12.43 14.01 16.48 10.12 8.05 9.71 11.96
RRT + LDS 14.56 12.08 13.44 16.45 9.89 7.85 9.18 11.82
O URS ( BEST ) VS . VANILLA +0.30 -0.05 +1.31 +3.69 +0.34 -0.02 +1.65 +4.46
RRT + FDS 14.36 11.97 13.33 16.08 9.74 7.54 9.20 11.31
RRT + LDS + FDS 14.33 11.96 12.47 15.92 9.63 7.35 8.74 11.17

Table 3. Benchmarking results on STS-B-DIR. I NV 14.39 11.84 13.12 16.02 9.34 7.73 8.49 11.20
I NV + LDS 14.14 11.66 12.77 16.05 9.26 7.64 8.18 11.32
Metrics MSE ↓ Pearson correlation (%) ↑ I NV + FDS 13.91 11.12 12.29 15.53 8.94 6.91 7.79 10.65
Shot All Many Med. Few All Many Med. Few I NV + LDS + FDS 13.76 11.12 12.18 15.07 8.70 6.94 7.60 10.18

VANILLA 0.974 0.851 1.520 0.984 74.2 72.0 62.7 75.2 O URS ( BEST ) VS . VANILLA +1.60 +1.41 +1.80 +1.87 +1.93 +1.13 +1.99 +2.02

S MOTE R (Torgo et al., 2013) 1.046 0.924 1.542 1.154 72.6 69.3 65.3 70.6
1e3
SMOGN (Branco et al., 2017) 0.990 0.896 1.327 1.175 73.2 70.4 65.5 69.2
SMOGN + LDS 0.962 0.880 1.242 1.155 74.0 71.5 65.2 69.8
3 Extrapolate Interpolate Extrapolate
# of samples

SMOGN + FDS 0.987 0.945 1.101 1.153 73.0 69.6 68.5 69.9
SMOGN + LDS + FDS 0.950 0.851 1.327 1.095 74.6 72.1 65.9 71.7
2

F OCAL -R 0.951 0.843 1.425 0.957 74.6 72.3 61.8 76.4

F OCAL -R + LDS 0.930 0.807 1.449 0.993 75.7 73.9 62.4 75.4 1

F OCAL -R + FDS 0.920 0.855 1.169 1.008 75.1 72.6 66.4 74.7
F OCAL -R + LDS + FDS 0.940 0.849 1.358 0.916 74.9 72.2 66.3 77.3 0
Label distribution Medium-shot region
Absolute MAE Gains

RRT 0.964 0.842 1.503 0.978 74.5 72.4 62.3 75.4 15

Absolute MAE gains Few-shot region
RRT + LDS 0.916 0.817 1.344 0.945 75.7 73.5 64.1 76.6 Many-shot region Zero-shot region
10
RRT + FDS 0.929 0.857 1.209 1.025 74.9 72.1 67.2 74.0
RRT + LDS + FDS 0.903 0.806 1.323 0.936 76.0 73.8 65.2 76.7 5

I NV 1.005 0.894 1.482 1.046 72.8 70.3 62.5 73.2 0

I NV + LDS 0.914 0.819 1.319 0.955 75.6 73.4 63.8 76.2 −5

I NV + FDS 0.927 0.851 1.225 1.012 75.0 72.4 66.6 74.2 0 20 40 60 80 100
I NV + LDS + FDS 0.907 0.802 1.363 0.942 76.0 74.0 65.2 76.6 Target value (Age)

O URS ( BEST ) VS . VANILLA +.071 +.049 +.419 +.068 +1.8 +2.0 +5.8 +2.1 Figure 7. The absolute MAE gains of LDS + FDS over the vanilla
model, on a curated subset of IMDB-WIKI-DIR with certain target
values having no training data. We establish notable performance
gions. The advantage is even more profound under Pearson gains w.r.t. all regions, especially for extrapolation & interpolation.
correlation, which is commonly used for this NLP task.
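The per-region evaluation behind Tables 2 to 5 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' evaluation code: the function names, the unit bin width, the small `eps` floor inside the logarithm, and the choice to report zero-shot bins separately from few-shot bins are assumptions made for this sketch, and it takes e_i to be the absolute error of sample i.

```python
import numpy as np


def shot_masks(train_labels, test_labels, bin_width=1.0, many=100, few=20):
    """Boolean masks over test samples, keyed by the training count of each sample's bin."""
    train_bins = np.floor(np.asarray(train_labels, dtype=float) / bin_width).astype(int)
    test_bins = np.floor(np.asarray(test_labels, dtype=float) / bin_width).astype(int)
    offset = min(train_bins.min(), test_bins.min())
    size = max(train_bins.max(), test_bins.max()) - offset + 1
    counts = np.bincount(train_bins - offset, minlength=size)
    n = counts[test_bins - offset]  # training samples in each test sample's bin
    return {
        "many": n > many,                    # over 100 training samples
        "medium": (n >= few) & (n <= many),  # 20 to 100 training samples
        "few": (n > 0) & (n < few),          # under 20 training samples
        "zero": n == 0,                      # assumption: reported separately
    }


def mae(pred, target):
    return float(np.mean(np.abs(pred - target)))


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


def gmean_error(pred, target, eps=1e-10):
    """Error geometric mean (prod_i e_i)^(1/n), computed in log space for stability."""
    errors = np.abs(pred - target)
    # eps is an assumption of this sketch: it keeps log() finite for exact predictions.
    return float(np.exp(np.mean(np.log(np.maximum(errors, eps)))))


def pearson(pred, target):
    return float(np.corrcoef(pred, target)[0, 1])


def evaluate(pred, target, masks):
    """MAE and GM overall and for every non-empty shot region."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    report = {"all": {"mae": mae(pred, target), "gm": gmean_error(pred, target)}}
    for name, mask in masks.items():
        if mask.any():
            report[name] = {
                "mae": mae(pred[mask], target[mask]),
                "gm": gmean_error(pred[mask], target[mask]),
            }
    return report
```

Because GM multiplies the per-sample errors, a handful of very large errors in sparse regions moves it more than it moves MAE, which is one way to read the "prediction fairness" motivation. Calling `evaluate(pred, target, shot_masks(train_labels, target))` returns one entry per non-empty region.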
Inferring Depth: NYUD2-DIR. For NYUD2-DIR, which is a dense regression task, we verify from Table 4 that adding LDS and FDS significantly improves the results. We note that the vanilla model can inevitably overfit to the many-shot regions during training. FDS and LDS help alleviate this effect, and generalize better to all regions, with minor degradation in the many-shot region but significant boosts for other regions.

Inferring Health Score: SHHS-DIR. Table 5 reports the results on SHHS-DIR. Since SMOTER and SMOGN are not directly applicable to this medical data, we skip them for this dataset. The results again confirm the effectiveness of both FDS and LDS when applied for real-world imbalanced regression tasks, where by combining FDS and LDS we often get the highest gains over all tested regions.

4.2. Further Analysis

Extrapolation & Interpolation. In real-world DIR tasks, certain target values can have no data at all (e.g., see SHHS-DIR and STS-B-DIR in Fig. 6). This motivates the need for target extrapolation and interpolation. We curate a subset from the training set of IMDB-WIKI-DIR, which has no
training data in certain regions (Fig. 7), but evaluate on the original test set for zero-shot generalization analysis.

As Table 6 shows, compared to the vanilla model, LDS and FDS can both improve the results not only on regions that have data, but also achieve larger gains on those without data. Specifically, substantial improvements are established for both target interpolation and extrapolation, where interpolation enjoys larger boosts.

Table 6. Interpolation & extrapolation results on the curated subset of IMDB-WIKI-DIR. Using LDS and FDS, the generalization results on zero-shot regions can be consistently improved.

Metrics                      MAE ↓                                GM ↓
Shot                         All    w/ data  Interp.  Extrap.     All    w/ data  Interp.  Extrap.
VANILLA                      11.72  9.32     16.13    18.19       7.44   5.33     14.41    16.74
VANILLA + LDS                10.54  8.31     14.14    17.38       6.50   4.67     12.13    15.36
VANILLA + FDS                11.40  8.97     15.83    18.01       7.18   5.12     14.02    16.48
VANILLA + LDS + FDS          10.27  8.11     13.71    17.02       6.33   4.55     11.71    15.13
OURS (BEST) VS. VANILLA      +1.45  +1.21    +2.42    +1.17       +1.11  +0.78    +2.70    +1.61

We further visualize the absolute MAE gains of our method over the vanilla model in Fig. 7. Our method provides a comprehensive treatment to the many, medium, few, as well as zero-shot regions, achieving remarkable performance gains.

Understanding FDS. We investigate how FDS influences the feature statistics. In Fig. 8(a) and 8(b) we plot the similarity of the feature statistics for anchor age 0, using models trained without and with FDS. As the figure indicates, since age 0 lies in the few-shot region, the feature statistics can have a large bias, i.e., age 0 shares large similarity with region 40 ∼ 80 as in Fig. 8(a). In contrast, when FDS is added, the statistics are better calibrated, resulting in a high similarity only in its neighborhood, and a gradually decreasing similarity score as the target value becomes larger. We further visualize the L1 distance between the running statistics $\{\mu_b, \Sigma_b\}$ and the smoothed statistics $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ during training in Fig. 8(c). Interestingly, the average L1 distance becomes smaller and gradually diminishes as the training evolves, indicating that the model learns to generate features that are more accurate even without smoothing, and finally the smoothing module can be removed during inference. We provide more results for different anchor ages in Appendix E.7, where similar effects can be observed.

[Figure 8 panels: (a) feature statistics similarity for age 0, without FDS; (b) feature statistics similarity for age 0, with FDS; (c) statistics change. Panels (a) and (b) plot mean and variance cosine similarity against target value (Age), with the anchor target (0), other targets, and the many-shot, medium-shot, and few-shot regions marked; panel (c) plots the average L1 distance against training epoch.]

Figure 8. Analysis on how FDS works. (a) & (b) Feature statistics similarity for anchor age 0, using models trained without and with FDS. (c) L1 distance between the running statistics $\{\mu_b, \Sigma_b\}$ and the smoothed statistics $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ during training.

Ablation: Kernel type for LDS & FDS (Appendix E.1). We study the effects of different kernel types for LDS and FDS when applying distribution smoothing. We select three different kernel types, i.e., Gaussian, Laplacian, and Triangular kernel, and evaluate their influences on both LDS and FDS. In general, all kernel types lead to notable gains (e.g., 3.7% ∼ 6.2% relative MSE gains on STS-B-DIR), with the Gaussian kernel often delivering the best results.

Ablation: Different regression loss functions (Appendix E.2). We investigate the influence of different training loss functions on LDS and FDS. We select three common losses used for regression tasks, i.e., L1 loss, MSE loss, and the Huber loss (also referred to as smoothed L1 loss). We find that similar results are obtained for all losses, indicating that both LDS and FDS are robust to different loss functions.

Ablation: Hyper-parameter for LDS & FDS (Appendix E.3). We investigate the effects of hyper-parameters on both LDS and FDS. As we mainly employ the Gaussian kernel for distribution smoothing, we extensively study different choices of the kernel size l and standard deviation σ. Interestingly, we find LDS and FDS are surprisingly robust to different hyper-parameters in a given range, and obtain similar gains. For example, on STS-B-DIR with l ∈ {5,9,15} and σ ∈ {1,2,3}, overall MSE gains range from 3.3% to 6.2%, with l = 5 and σ = 2 exhibiting the best results.

Ablation: Robustness to diverse skewed label densities (Appendix E.4). We curate different imbalanced distributions for IMDB-WIKI-DIR by combining different numbers of disjoint skewed Gaussian distributions over the target space, with potential missing data in certain target regions, and evaluate the robustness of FDS and LDS to the distribution change. We verify that even under different imbalanced
label distributions, LDS and FDS consistently boost the performance across all regions compared to the vanilla model, with relative MAE gains ranging from 8.8% to 12.4%.

Comparisons to imbalanced classification methods (Appendix E.6). Finally, to gain more insights on the intrinsic difference between imbalanced classification & imbalanced regression problems, we directly apply existing imbalanced classification schemes on several appropriate DIR datasets, and show empirical comparisons with imbalanced regression approaches. We demonstrate in Appendix E.6 that LDS and FDS outperform imbalanced classification schemes by a large margin, where the errors for few-shot regions can be reduced by up to 50% to 60%. Interestingly, the results also show that imbalanced classification schemes often perform worse than even the vanilla regression model, which confirms that regression requires different approaches for data imbalance than simply applying classification methods. We note that imbalanced classification methods could fail on regression problems for several reasons. First, they ignore the similarity between data samples that are close w.r.t. the continuous target. Moreover, classification cannot extrapolate or interpolate in the continuous label space, and is therefore unable to deal with missing data in certain target regions.

5. Conclusion

We introduce the DIR task that learns from natural imbalanced data with continuous targets, and generalizes to the entire target range. We propose two simple and effective algorithms for DIR that exploit the similarity between nearby targets in both label and feature spaces. Extensive results on five curated large-scale real-world DIR benchmarks confirm the superior performance of our methods. Our work fills the gap in benchmarks and techniques for practical DIR tasks.

References

Bhat, S. F., Alhashim, I., and Wonka, P. Adabins: Depth estimation using adaptive bins. arXiv preprint arXiv:2011.14141, 2020.

Branco, P., Torgo, L., and Ribeiro, R. P. Smogn: a pre-processing approach for imbalanced regression. In First international workshop on learning with imbalanced domains: Theory and applications, pp. 36–50. PMLR, 2017.

Branco, P., Torgo, L., and Ribeiro, R. P. Rebagg: Resampled bagging for imbalanced regression. In Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, pp. 67–81. PMLR, 2018.

Buda, M., Maki, A., and Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.

Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. In NeurIPS, 2019.

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation, pp. 1–14, 2017.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.

Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Class-balanced loss based on effective number of samples. In CVPR, 2019.

Dong, Q., Gong, S., and Zhu, X. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1367–1381, Jun 2019.

Eigen, D., Puhrsch, C., and Fergus, R. Depth map prediction from a single image using a multi-scale deep network. NeurIPS, 2014.

García, S. and Herrera, F. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary computation, 17(3):275–306, 2009.

Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., Peters, M., Schmitz, M., and Zettlemoyer, L. S. Allennlp: A deep semantic natural language processing platform. 2017.

He, H., Bai, Y., Garcia, E. A., and Li, S. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In IEEE international joint conference on neural networks, pp. 1322–1328, 2008.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hu, J., Ozay, M., Zhang, Y., and Okatani, T. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In WACV, 2019.

Huang, C., Li, Y., Change Loy, C., and Tang, X. Learning deep representation for imbalanced classification. In CVPR, 2016.

Huang, C., Li, Y., Chen, C. L., and Tang, X. Deep imbalanced learning for face recognition and attribute prediction. IEEE transactions on pattern analysis and machine intelligence, 2019.

Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., and Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. ICLR, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Lei, T., Zhang, Y., Wang, S. I., Dai, H., and Artzi, Y. Simple recurrent units for highly parallelizable recurrence. In EMNLP, pp. 4470–4481, 2018.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In ICCV, pp. 2980–2988, 2017.

Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. Large-scale long-tailed recognition in an open world. In CVPR, 2019.

Loper, E. and Bird, S. Nltk: The natural language toolkit. arXiv preprint cs/0205028, 2002.

Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., and Zafeiriou, S. Agedb: The first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, volume 2, pp. 5, 2017.

Nathan Silberman, Derek Hoiem, P. K. and Fergus, R. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.

Parzen, E. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.

Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.

Quan, S. F., Howard, B. V., Iber, C., Kiley, J. P., Nieto, F. J., O'Connor, G. T., Rapoport, D. M., Redline, S., Robbins, J., Samet, J. M., et al. The sleep heart health study: design, rationale, and methods. Sleep, 20(12):1077–1085, 1997.

Rothe, R., Timofte, R., and Gool, L. V. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.

Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. Meta-weight-net: Learning an explicit mapping for sample weighting. arXiv preprint arXiv:1902.07379, 2019.

Sun, B., Feng, J., and Saenko, K. Return of frustratingly easy domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Tang, K., Huang, J., and Zhang, H. Long-tailed classification by keeping the good and removing the bad momentum causal effect. In NeurIPS, 2020.

Torgo, L., Ribeiro, R. P., Pfahringer, B., and Branco, P. Smote for regression. In Portuguese conference on artificial intelligence, pp. 378–389. Springer, 2013.

Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Lopez-Paz, D., and Bengio, Y. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, 2019.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Glue: A multi-task benchmark and analysis platform for natural language understanding. EMNLP 2018, pp. 353, 2018.

Wang, H., Mao, C., He, H., Zhao, M., Jaakkola, T. S., and Katabi, D. Bidirectional inference networks: A class of deep bayesian networks for health profiling. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 766–773, 2019.

Ware Jr, J. E. and Sherbourne, C. D. The mos 36-item short-form health survey (sf-36): I. conceptual framework and item selection. Medical care, pp. 473–483, 1992.

Yang, Y. and Xu, Z. Rethinking the value of labels for improving class-imbalanced learning. In NeurIPS, 2020.

Yin, X., Yu, X., Sohn, K., Liu, X., and Chandraker, M. Feature transfer learning for face recognition with under-represented data. In Proceeding of IEEE Computer Vision and Pattern Recognition, Long Beach, CA, June 2019.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Zhang, X., Fang, Z., Wen, Y., Li, Z., and Qiao, Y. Range loss for deep face recognition with long-tailed training data. In ICCV, 2017.
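
As a supplementary illustration of the re-weighting baselines used in the experiments (INV and SQINV computed from the LDS estimated target density) and of the three kernel types compared in the ablation, the following NumPy sketch may help. It is not the authors' implementation: the function names, the integer bin indices used as labels, the zero-padded convolution at the range boundaries, the small floor on the density, and the normalization of the weights to mean 1 are all assumptions of this sketch.

```python
import numpy as np


def kernel_window(kind="gaussian", size=5, sigma=2.0):
    """Symmetric smoothing window of odd length `size`, normalized to sum to 1."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = (size - 1) // 2
    x = np.arange(-half, half + 1, dtype=float)
    if kind == "gaussian":
        w = np.exp(-0.5 * (x / sigma) ** 2)
    elif kind == "laplacian":
        w = np.exp(-np.abs(x) / sigma)
    elif kind == "triangular":
        w = (half + 1) - np.abs(x)
    else:
        raise ValueError(f"unknown kernel type: {kind}")
    return w / w.sum()


def lds_density(labels, num_bins, kind="gaussian", size=5, sigma=2.0):
    """Empirical label histogram convolved with the kernel (zero-padded at the edges)."""
    if num_bins < size:
        raise ValueError("num_bins must be at least the kernel size")
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=num_bins).astype(float)
    return np.convolve(counts, kernel_window(kind, size, sigma), mode="same")


def lds_weights(labels, num_bins, sqrt=True, **kernel_args):
    """Per-sample weights: inverse (INV) or square-root inverse (SQINV) of the smoothed density."""
    labels = np.asarray(labels, dtype=int)  # assumption: labels are already bin indices
    density = lds_density(labels, num_bins, **kernel_args)
    if sqrt:
        density = np.sqrt(density)
    weights = 1.0 / np.maximum(density[labels], 1e-8)
    return weights * len(weights) / weights.sum()  # assumption: rescale to mean 1
```

With `sqrt=False` the weights follow the INV variant and with `sqrt=True` the SQINV variant. Because the histogram is smoothed before inversion, a bin with no training samples that sits next to populated bins still receives a non-zero density estimate, which is the property the extrapolation and interpolation analysis relies on.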
