A Survey on Deep Tabular Learning
1 Introduction
Tabular data, which consists of rows and columns representing structured information [1, 2],
is the most commonly used data format in many industries, including healthcare, finance, and
transportation.
Fig. 1. Progression of Tabular Deep Learning Models, 2020–2024: from VIME, NON, Net-DNF, and TabTransformer, through NAMs, NBM, SPAM, TAC, Hopular, TabPFN, GReaT, PTab, TabLLM, TabDDPM, and GANDALF, to TabTranSELU, MambaTab, TP-BERTa, SwitchTab, CARTE, BiSHop, and LF-Transformer. (* denotes no paper found, but the model has won several Kaggle competitions; see https://www.kaggle.com/competitions/tabular-playground-series-apr-2021/discussion/230013)
Unlike unstructured data such as images and text, tabular data directly represents heterogeneous data types
within tables, providing rich context but complicating feature representation and model training.
Understanding how to handle these varied data types is critical to improving the performance of
deep learning models on tabular data.
Tabular data can also be represented in two different formats, 1D and 2D, as shown in Figure 2 below. In 1D tabular data, each row represents a sample and columns
represent specific features, making it easy to process and analyze. This format is ideal for traditional
machine learning tasks, as each column follows a specific data type and the structure is fixed. For
example, in transportation safety datasets, each row could represent an individual crash event,
and the columns might include features like vehicle speed, crash time, or road conditions. The
simplicity of this structure makes it highly useful in various fields.
[Fig. 2: 1D versus 2D tabular data formats: in the 1D format each row is one sample, while in the 2D format each sample is itself a table of rows and columns.]
In contrast, 2D tabular data provides a more complex format where each sample can be repre-
sented by a table, with multiple rows and columns within each table. This format is often used for
tasks that require deeper relational analysis, such as tracking patient health over time or analyzing
transportation data across different regions and times. 2D tabular data is also more flexible, incor-
porating diverse data types, including timestamps or unstructured data, like text or images, within
each table. This additional complexity makes it suitable for applications in areas such as healthcare
and transportation, where temporal and multi-dimensional data are critical.
Some of these data types are explained below:
• Binary Data: Binary data, a type of categorical data with two possible values (such as "Yes/No"), is often represented as 0 or 1 in deep learning models [12].
• Numerical Data: Numerical data, representing continuous or discrete variables (e.g., age, vehicle speed), is common in predictive modeling, especially in transportation safety [13]. Deep learning models can consume it directly, but preprocessing, such as scaling or standardization, is critical for performance (see the preprocessing sketch after this list). Advanced techniques, such as numerical embeddings, help capture non-linear relationships and interactions in the data.
• Timestamps: Timestamps provide essential temporal information in systems like traffic
management. Preprocessing involves extracting features such as day, month, or hour to
capture temporal patterns for deep learning models [14].
• Text Data: Text data in tabular formats, such as crash descriptions, presents challenges
for deep learning models. Methods like TF-IDF [15] and word embeddings (e.g., Word2Vec, GloVe) convert text into numerical vectors [16, 17].
Advanced models like transformers (e.g., BERT) capture context-aware embeddings [18].
• Image Data: In multi-modal datasets, image data is sometimes embedded in tables, such as in
autonomous driving, where road images are paired with tabular data. Convolutional Neural
Networks (CNNs) process images, but integrating image features with tabular data requires
feature fusion techniques. Hybrid models like TabTransformer use attention mechanisms to
merge image and tabular data, enhancing predictive performance [19].
• Hyperlinks: Hyperlinks, though uncommon in traditional tabular datasets, are increasingly
used in web data applications or web documents [20]. When tables include URLs, advanced
preprocessing is required to extract metadata or context from the linked pages, often using
NLP models to incorporate this information into the feature set.
• Video Data: Video data in tabular formats provides valuable temporal information for do-
mains like autonomous driving and traffic management. Keyframes from videos are processed
using 3D-CNNs or Recurrent Neural Networks (RNNs) to capture spatial and temporal fea-
tures, which are then integrated with tabular data to improve model predictions, such as in
crash prediction models where video features enhance the understanding of road conditions
and driver behavior [21, 22].
• Emoji: Emojis, common in social media and messaging platforms, enhance communication by visually conveying emotions or objects [23] but pose challenges for encoding sentiment. Deep learning models use character-level or emoji embeddings to map them to sentiment vectors, enabling effective interpretation alongside other data types.
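To make the preprocessing steps above concrete, the following sketch combines several of them in one pipeline: scaling numerical columns, one-hot encoding a binary column, extracting hour and month from a timestamp, and vectorizing free text with TF-IDF. It is a minimal illustration using pandas and scikit-learn; the crash-event column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical crash-event table containing the data types discussed above.
df = pd.DataFrame({
    "vehicle_speed": [62.0, 45.5, 80.1],           # numerical
    "airbag_deployed": ["Yes", "No", "Yes"],       # binary
    "crash_time": pd.to_datetime(
        ["2023-01-05 08:15", "2023-02-11 17:40", "2023-03-20 23:05"]),
    "description": ["rear-end on wet road",        # free text
                    "side impact at junction",
                    "single vehicle, icy curve"],
})

# Timestamps: extract coarse temporal features before modeling.
df["hour"] = df["crash_time"].dt.hour
df["month"] = df["crash_time"].dt.month

pre = ColumnTransformer([
    ("num", StandardScaler(), ["vehicle_speed", "hour", "month"]),
    ("bin", OneHotEncoder(drop="if_binary"), ["airbag_deployed"]),
    ("txt", TfidfVectorizer(), "description"),     # text -> sparse TF-IDF vectors
])
X = pre.fit_transform(df)
print(X.shape)   # one numeric feature matrix, ready for a deep learning model
```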
Tabular data, composed of rows and columns, lacks the spatial or sequential structure found in
images and text, making it difficult to apply traditional deep learning models like CNNs, which rely
on spatial coherence. Unlike images or sequences, tabular data is permutation-invariant: reordering columns or rows does not change feature relationships, and deep learning models struggle without the inductive biases that machine learning models like XGBoost and Random Forests possess. Machine learning models
excel in handling heterogeneous feature types, non-local interactions, and small, high-dimensional
datasets, where deep learning models often overfit and fail to generalize.
To address the limitations of traditional deep learning models when applied to tabular data,
recent advancements have led to the development of specialized architectures such as TabNet,
TabTransformer, and SAINT. These models introduce mechanisms like attention layers, feature
embeddings, and hybrid architectures to dynamically focus on the most relevant features, improving
their ability to handle the complexity of heterogeneous tabular data. For instance, TabNet [24]
employs sequential attention mechanisms for instance-wise feature selection, while TabTransformer
[19] uses self-attention layers to capture feature dependencies more effectively than CNNs. SAINT
[25] enhances this approach by incorporating intersample attention, enabling the model to capture
relationships between data rows. Moreover, models such as TabTranSELU [26] and GNN4TDL [27]
are designed to efficiently manage both categorical and numerical features by employing hybrid
structures and regularization techniques, which help mitigate overfitting and improve generalization.
These innovations have enabled deep learning models to rival or surpass traditional machine learning methods in tasks involving tabular data, including fraud detection and predictive analytics.
Additionally, novel techniques such as transforming tabular data into image-like structures [2,
28], employing multi-view representation learning, and extracting schemas from tabular data
[29] further contribute to overcoming the challenges posed by the absence of inherent spatial
relationships in tabular datasets.
By leveraging these advancements, recent tabular deep learning models not only address the
unique challenges of tabular data but also offer significant improvements in performance, in-
terpretability, and scalability compared to both traditional deep learning and machine learning
approaches. These innovations demonstrate the growing potential for deep learning in handling
complex, non-spatial data across a wide range of real-world applications.
Similarly, Hellerstein [37] addresses the inherent challenges of working with tabular data,
particularly when it lacks a grid-like structure typically seen in other data types such as images or
text. The study focuses on automating the transformation of unstructured tables into tidy, relational
forms suitable for analysis. It also introduces the idea that clean data tables can be considered as
grids of cells, somewhat analogous to pixels in an image, where adjacent rows and columns may
exhibit patterns. While deep learning models excel in pattern recognition in image grids, detecting
such patterns in tabular data is far more difficult due to the diversity in how tables are structured
and the absence of explicit spatial relationships. Ucar et al. [1] discussed the challenges posed
by the lack of inherent spatial structure in tabular data. While image data benefits from spatial
coherence (e.g., neighboring pixels are spatially correlated), and text or audio from semantic and
temporal structures, tabular data lacks such clear patterns. This makes it difficult to apply common
augmentation techniques like cropping or rotation, which are highly effective in domains such
as image processing. To overcome these limitations, the authors propose the SubTab framework,
which divides input features of tabular data into subsets, analogous to feature bagging or image
cropping, to generate different views of the data.
By reconstructing full data from these subsets, the framework forces the model to learn better
representations of tabular data in a self-supervised setting, despite the absence of grid-like structure.
This approach enables the model to discover patterns and relationships within tabular data that are
not immediately apparent, and the results demonstrate that SubTab can achieve state-of-the-art
performance on various datasets.
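The core SubTab idea, reconstructing all features from a subset, can be sketched in a few lines of PyTorch. This is a simplified illustration of the feature-subsetting principle, not the authors' implementation, which uses multiple overlapping subsets and additional contrastive and distance losses.

```python
import torch
import torch.nn as nn

class SubsetAutoencoder(nn.Module):
    """Reconstruct the FULL feature vector from one feature subset (SubTab-style)."""
    def __init__(self, n_features: int, subset_size: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(subset_size, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x_subset):
        return self.decoder(self.encoder(x_subset))

n_features, subset_size = 20, 10
x = torch.randn(32, n_features)                      # a batch of tabular rows
idx = torch.randperm(n_features)[:subset_size]       # one random feature subset ("view")
model = SubsetAutoencoder(n_features, subset_size)
loss = nn.functional.mse_loss(model(x[:, idx]), x)   # reconstruct all features from the subset
loss.backward()                                      # self-supervised training signal
```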
In an effort to refine this approach, Wang and Sun [38] introduced TransTab, as shown in Figure
3 and Figure 4 below, a model that uses transformers to encode tabular data by treating rows
(samples) and columns (features) as sequences. Figure 3 illustrates TransTab’s ability to handle
tasks like transfer learning, feature incremental learning, and zero-shot inference, demonstrating its
adaptability across different tabular data tasks. Figure 4 details the framework, where categorical,
binary, and numerical features are tokenized and processed through a gated transformer with
multi-head attention, enabling efficient learning of feature interactions. This structured approach
allows TransTab to handle variable-column tables and facilitates knowledge transfer, even across
tables with different structures, enabling more effective learning and generalization across tasks
and domains. The model focuses on learning generalizable representations, which can be applied
to different datasets, overcoming the limitations imposed by the nonspatial nature of tabular data.
The contextualization of columns and cells in TransTab introduces a structured way to interpret
relationships within tabular data, enabling more effective learning and generalization across tasks
and domains.
In a similar effort Ghorbani et al. [39] introduce the Feature Vectors method, which generates
feature embeddings that capture both the importance and semantic relationships between features.
Drawing inspiration from word embeddings in NLP, where words that frequently co-occur in the
same context share similar embeddings, the authors apply a similar approach to features in tabular
data. However, given that tabular data lacks natural co-occurrence structures, the authors propose
using decision trees to extract co-occurrence relationships between features. By analyzing decision
paths in tree-based models, they are able to create feature embeddings that preserve semantic
relationships despite the nonspatial nature of tabular data. Also, Geisler and Binnig [40] discussed
the challenges of applying existing model explanation methods, such as Local Interpretable Model-
agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), to tabular data. These
methods, originally designed for data with spatial or temporal relationships, often fall short when
applied to tabular data due to the absence of clear spatial or sequential patterns. To overcome this,
they propose the Quest framework, which generates explanations in the form of relational queries
specifically tailored for tabular data. Quest uses surrogate models and query-driven explanations
to address the unique structure of tabular data, offering a more semantically rich and intuitive
understanding of model behavior. By focusing on relational query predicates, Quest is able to
explain not only why a model produced a particular output but also why it did not produce an
alternative output.
[Fig. 3: Four usage scenarios supported by TransTab, contrasted with existing fixed-table pretrain + finetune pipelines: (1) transfer learning across tables, (2) feature incremental learning, (3) pretrain + finetune, and (4) zero-shot inference.]
[Fig. 4: The TransTab framework: tokenized features are encoded into a representation Z that feeds a projector trained with supervised, self-supervised contrastive, and supervised contrastive objectives.]
Moving in the same direction, Zhu et al. [2] explain that CNNs excel when applied to data with
spatial or temporal relationships, such as the arrangement of pixels in images or the sequential
nature of text, allowing them to capture local patterns through convolutions. However, the absence
of these structures in tabular data presents a significant challenge for CNN-based modeling. To
address this challenge, the authors propose a novel algorithm called the Image Generator for
Tabular Data (IGTD), which transforms tabular data into image-like structures by assigning tabular
features to pixel positions while preserving feature relationships. This transformation introduces a
form of spatial relationship in the data, allowing CNNs to process tabular data more effectively.
The study demonstrates that these image representations help CNNs capture feature relationships
and improve predictive performance compared to models trained on raw tabular data. The IGTD
method addresses the lack of spatial or sequential dependencies in tabular data by creating artificial
spatial relationships, making it more compatible with deep learning models designed for structured
data.
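The general idea of giving tabular features artificial spatial positions can be illustrated with a toy transformation: order features so that correlated ones sit near each other, then reshape each row into a small image. Note that IGTD itself optimizes the feature-to-pixel assignment iteratively by matching feature-distance and pixel-distance ranks; the greedy ordering below is only a stand-in for that idea.

```python
import numpy as np

def features_to_image(X: np.ndarray, side: int):
    """Toy stand-in for IGTD: order features so correlated ones sit near each
    other, then reshape each row into a (side x side) image for a CNN.
    (IGTD itself optimizes the assignment iteratively; this shows only the idea.)"""
    assert X.shape[1] == side * side
    corr = np.abs(np.corrcoef(X, rowvar=False))      # feature-feature similarity
    order = np.argsort(corr.sum(axis=0))[::-1]       # greedy: most-connected features first
    return X[:, order].reshape(-1, 1, side, side)    # (N, channels, H, W) for a CNN

X = np.random.rand(100, 64)          # 100 samples, 64 features -> 8x8 images
images = features_to_image(X, 8)
print(images.shape)                  # (100, 1, 8, 8)
```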
[Fig. 5: Transfer learning with data augmentation: a CNN pre-trained on a large source dataset (C1–Cn) is migrated to a smaller target dataset, with global average pooling and a softmax output layer.]
Another widely used remedy is transfer learning, which has been successful in mitigating overfitting in small datasets for computer vision
and NLP. For example, Zhao [43] proposes using transfer learning combined with data augmentation
to tackle overfitting in small datasets as shown in Figure 5. By pre-training a CNN on large datasets
like ImageNet, and then fine-tuning it on smaller datasets, the model can leverage previously
learned representations to improve performance on the target task. Jain et al. [41] extend this
concept to tabular data by converting tabular datasets into image representations using techniques
like IGTD and SuperTML. These methods allow deep learning models, particularly CNNs, to be
applied to tabular data by transforming it into an image-like format that can take advantage of
pre-trained models, thus reducing overfitting. Horenko [45] introduces the entropy-optimal scalable probabilistic approximation algorithm to breach the overfitting barrier at a fraction of
the computational cost. Badger [46] demonstrates the effectiveness of small language models in
processing tabular data without extensive preprocessing, achieving record classification accuracy.
Another promising solution is the HyperTab method proposed by Wydmański et al. [47], which
uses a hypernetwork-based approach to build an ensemble of neural networks specialized for
small tabular datasets. A general HyperTab structure is shown in Figure 6. By employing feature
subsetting as a form of data augmentation, HyperTab virtually increases the number of training
samples without altering the number of parameters. This approach allows the model to generalize
better by preventing overfitting, particularly on small datasets.
[Fig. 6: General HyperTab structure: a binary feature mask (e.g., 1 0 0 1 0 1 over X1–X6) selects a feature subset and is fed to a hypernetwork, which generates the weights of the target network.]
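A toy version of the hypernetwork principle behind HyperTab might look as follows: a binary feature-subset mask is fed to a hypernetwork, which emits the weights of a small target network applied to the masked features. This is a deliberately minimal sketch (a linear target network and a single mask), not the published architecture.

```python
import torch
import torch.nn as nn

class HyperTabSketch(nn.Module):
    """Toy hypernetwork: a binary feature mask is mapped to the weights of a
    small target network that scores the selected features (HyperTab-style)."""
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.n = n_features
        # Hypernetwork emits the weights + bias of a linear target network.
        self.hyper = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features + 1))

    def forward(self, x, mask):
        params = self.hyper(mask)                 # generated target-network parameters
        w, b = params[:self.n], params[self.n]
        return (x * mask) @ w + b                 # target net applied to masked features

model = HyperTabSketch(n_features=6)
x = torch.randn(8, 6)
mask = torch.tensor([1., 0., 0., 1., 0., 1.])     # one feature subset of the ensemble
print(model(x, mask).shape)                       # torch.Size([8])
```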
Despite the rise of deep learning, traditional machine learning models are still preferred in certain cases today. Additionally, their faster training and deployment times make them ideal for applications where real-time decisions are necessary. These classical approaches were well suited to small-scale tabular datasets but limited
to classification and regression tasks. However, these traditional models are not without their
limitations. For instance, Clark et al. [49] point out that logistic regression models can encounter
problems such as complete and quasi-complete separation, where the model either perfectly or
nearly perfectly separates the data. This can lead to extremely large or infinite coefficient estimates,
making statistical inference unreliable. Moreover, logistic regression is particularly sensitive to small
sample sizes, especially when dealing with low-frequency categorical variables, which can worsen
separation issues. To counter this, covariates are often removed, or categories merged, but such
actions can lead to oversimplification and a reduction in the model’s predictive power. Similarly,
Carreras et al. [50] highlight several drawbacks of SVMs, particularly in the soft margin variant.
These include the risk of overfitting, increased computational complexity when feature selection
is involved, and the non-convex nature of the optimization problem, which complicates finding
optimal solutions. Additionally, achieving a complete Pareto front in multi-objective optimization
is difficult due to new parameters leading to similar solutions despite weight variations.
Expanding on the strengths and limitations of traditional models like SVMs and logistic regression,
models like Decision Trees, Naive Bayes, and early Neural Networks also relied on manual feature
engineering, requiring domain expertise to select relevant features [51]. This process, while labor-
intensive, allowed these models to perform effectively on smaller datasets. Abrar and Samad [52]
emphasize that while fully connected deep neural networks have become popular in recent years,
traditional machine learning models like gradient boosting trees still outperform deep learning
models in many cases, particularly when dealing with tabular data containing uncorrelated variables.
This study highlights that traditional models like gradient boosting trees are superior when deep
models fail, especially in datasets that lack the strong correlations often found in real-world data.
| Model (Year) [Source] | Architecture | Training Efficiency | Main Features |
|---|---|---|---|
| VIME (2020) [58] | Neural network + masked and feature-vector estimation | Self- and semi-supervised learning; Moderate | Self-supervised learning and contextual embedding |
| NON (2020) [59] | Field-wise + across-field + operation fusion network | Supervised learning; Moderate | Network-on-network model |
| Net-DNF (2020) [60] | Affine literals + conjunction + output layer | Supervised learning; Moderate | Structure based on disjunctive normal form |
| TabTransformer (2020) [61] | Transformer-based architecture + contextual embeddings | Supervised and semi-supervised learning; Moderate | Transformer network for categorical data |
| TabNet (2019) [24] | Sequential attention + sparse feature selection + feature transformer | Supervised learning; Moderate | Sequential attention structure |
| NODE (2019) [62] | Differentiable oblivious decision trees (ODT) | Supervised learning; Moderate | Differentiable decisions made with classic decision trees via the entmax transformation |
| DeepGBM (2019) [63] | Hybrid approach integrating GBDT with NN | Supervised learning; Moderate to high | Two DNNs distill knowledge from decision trees |
| SuperTML (2019) [64] | CNN-based | Supervised learning; Moderate | Transforms tabular data into images for CNNs |
| xDeepFM (2018) [65] | Hybrid neural network | Supervised learning; Moderate to high | Embedding layer + compressed interaction network + DNN |
| TabNN (2018) [66] | Automatic feature grouping + recursive encoder with shared embedding | Supervised learning; Moderate to high | DNNs based on feature groups distilled from GBDT |
| RLN (2018) [67] | Regularization mechanism + sparse network | Supervised learning; Moderate | Hyperparameter regularization scheme |
| DeepFM (2017) [68] | Factorization machines + deep neural networks | Supervised learning; Moderate to high | Combines low- and high-order feature interactions + shared embedding layer |
| Wide&Deep (2016) [69] | Memorization (wide component) and generalization (deep component) | Supervised learning; High | Cross-product feature transformations for memorization + embedding layer for categorical features |
Multilayer perceptrons (MLPs) remain a common baseline for tabular data, though they have received less research focus compared to more advanced methods. While MLPs are effective, they are often surpassed by specialized or more sophisticated architectures in handling the complexities of tabular data.
Early applications of shallow neural networks, particularly FCNs, to tabular data often under-
performed compared to specialized models like GBDTs. However, recent studies have shown that
with proper tuning and architectural enhancements, neural networks can rival or surpass GBDTs.
Chen et al. [74] emphasized the efficiency of shallow networks in handling unordered tabular data,
while Erichson et al. [75] demonstrated their competitiveness in tasks like fluid dynamics, where
fast training and regularization were key advantages. Rubachev et al. [76] further noted that with
optimized tuning and techniques like unsupervised pre-training, shallow networks can close the
performance gap with GBDTs, although this gain is context-dependent. Fiedler [77] introduced
structural innovations such as Leaky Gates and Ghost Batch Norm, which significantly enhanced
MLPs for tabular data, enabling them to outperform GBDTs in several cases. Figure 7 shows the
original and the modified MLP+ model. These advancements demonstrate that shallow networks, when effectively optimized, can meet the unique challenges of tabular data and compete with traditional models.
[Fig. 7: The original MLP (left) and the modified MLP+ (right): embedded categorical and non-categorical inputs pass through Linear, (Ghost) Batch Norm, activation, and dropout layers, with multiple heads combined by a weighted average before the loss.]
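Ghost Batch Norm, one of the modifications shown in Figure 7, is straightforward to sketch: batch statistics are computed over small "virtual" batches rather than the full batch, which acts as a regularizer. A minimal PyTorch version, assuming a fixed virtual batch size, is shown below.

```python
import torch
import torch.nn as nn

class GhostBatchNorm(nn.Module):
    """Ghost Batch Norm: apply BatchNorm over small 'virtual' chunks of the
    batch, which adds regularization (as used in the MLP+ variants)."""
    def __init__(self, num_features: int, virtual_batch_size: int = 16):
        super().__init__()
        self.vbs = virtual_batch_size
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x):
        # Split the batch into virtual batches and normalize each separately.
        chunks = x.chunk(max(1, x.size(0) // self.vbs), dim=0)
        return torch.cat([self.bn(c) for c in chunks], dim=0)

gbn = GhostBatchNorm(num_features=10, virtual_batch_size=16)
print(gbn(torch.randn(64, 10)).shape)   # torch.Size([64, 10])
```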
These findings align with the broader consensus that the standard architecture of FCNs often
lacks the inductive biases necessary for effectively modeling the complexities of tabular data, such
as categorical variables, missing data, and imbalanced datasets. Specialized neural networks are
frequently required to address these challenges. However, Grinsztajn et al. [9] offer a more optimistic
view, demonstrating that shallow FCNs, like MLPs, can remain competitive when combined with
regularization techniques to mitigate overfitting and generalization issues. They further suggest
that even simple architectures, such as ResNet, can match the performance of more advanced
models, indicating that, with proper modifications, shallow networks can still play a valuable role
in handling tabular data.
[Figure: TabNet's encoder-decoder architecture: sequential decision steps (Step 1, Step 2) with fully connected layers, ghost batch normalization, and aggregation produce an encoded representation and per-feature attributes.]
3.3.1 TabNet. TabNet's sequential attention masks can be aggregated to explain feature importance and quantify overall contributions [79]. The newer InterpreTabNet builds on
this by improving feature attribution methods, further enhancing interpretability, and making the
model’s decisions more transparent at both local and global levels [78].
TabNet excels in working with raw tabular data without the need for extensive preprocessing or
manual feature engineering, which is often required by traditional models like GBDTs. Its end-to-
end learning capabilities allow TabNet to directly process raw data, simplifying workflows while
maintaining high performance. Additionally, TabNet introduces self-supervised learning, a novel
feature for tabular data, where it can pre-train on unlabeled data using masked feature prediction
to improve performance on supervised tasks, particularly when labeled data is scarce. Evaluated
on various datasets, TabNet has been shown to outperform or match state-of-the-art models,
including GBDTs, in both classification and regression tasks. For example, in facies classification,
it achieves superior accuracy compared to traditional tree-based models and other deep learning
architectures like 1D-CNNs and MLPs [79]. Its flexible architecture, which incorporates sequential
feature transformers and attention mechanisms, enhances generalization across different domains,
while the use of sparse attention ensures interpretability, addressing a key limitation of traditional
deep learning models.
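In practice, TabNet is commonly used through the community pytorch-tabnet package, which exposes a scikit-learn-style interface. The snippet below, on synthetic data, assumes that package is installed; n_d, n_a, and n_steps control the width of the decision and attention layers and the number of sequential decision steps.

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier  # pip install pytorch-tabnet

# Synthetic stand-in data: 10 numeric features, binary target.
X_train = np.random.rand(800, 10).astype(np.float32)
y_train = np.random.randint(0, 2, 800)
X_valid = np.random.rand(200, 10).astype(np.float32)
y_valid = np.random.randint(0, 2, 200)

clf = TabNetClassifier(n_d=8, n_a=8, n_steps=3)  # decision/attention width, # of steps
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], max_epochs=10, patience=5)

print(clf.predict(X_valid)[:5])     # class predictions
print(clf.feature_importances_)     # global, mask-derived feature importances
```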
TabNet, despite its innovations like interpretability and sparse attention, is often outperformed
by XGBoost across various datasets, requiring more hyperparameter tuning and showing less
consistent results [80]. Additionally, TabNet’s training time is significantly longer, making it less
practical for quick iterations or real-time applications [81]. It is also prone to overfitting on smaller
datasets due to its complex architecture, especially when not tuned correctly.
3.3.2 Neural Oblivious Decision Ensembles (NODE). NODE was proposed to address the specific challenges of applying deep learning to tabular data, which has traditionally been dominated by tree-based models like GBDT. Popov et al. [62] identified the key limitations of
deep learning models in handling tabular data, primarily due to their inability to outperform
GBDTs consistently. To bridge this gap, NODE was introduced as a deep learning architecture that
generalizes ensembles of oblivious decision trees, offering end-to-end gradient-based optimization
and multi-layer hierarchical representation learning. This design allows NODE to capture complex
feature interactions within tabular data, a task where traditional deep learning models often fall
short. One of NODE’s key innovations is the use of differentiable oblivious decision trees, where
splitting decisions are made through the entmax transformation, allowing for soft, gradient-based
feature selection. This approach makes the decision-making process more flexible and differentiable,
unlike conventional decision trees that rely on hard splits.
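A single differentiable "oblivious" split can be sketched as follows: an entmax distribution softly (and sparsely) selects which feature to split on, and a smooth threshold function replaces the hard comparison. This is a simplified illustration assuming the entmax package; NODE composes many such splits into oblivious trees and uses an entmoid rather than a plain sigmoid for thresholding.

```python
import torch
from entmax import entmax15   # pip install entmax (assumed available)

def soft_oblivious_split(x, feature_logits, threshold, temperature=1.0):
    """One differentiable 'oblivious' split (NODE-style sketch):
    entmax yields a sparse, soft choice of the split feature, and a sigmoid
    (stand-in for NODE's entmoid) gives a soft left/right routing decision."""
    feat_weights = entmax15(feature_logits, dim=-1)        # sparse feature selector
    chosen = x @ feat_weights                              # soft-selected feature value
    return torch.sigmoid((chosen - threshold) / temperature)  # P(route right)

x = torch.randn(8, 5)                         # batch of 8 rows, 5 features
logits = torch.randn(5, requires_grad=True)   # learnable feature-choice logits
p_right = soft_oblivious_split(x, logits, threshold=torch.tensor(0.0))
p_right.sum().backward()                      # gradients flow through the split
```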
Additionally, NODE’s multi-layer architecture is designed to capture both shallow and deep
interactions within tabular data, effectively functioning as a deep, fully differentiable GBDT model
trained end-to-end via backpropagation [62]. The architecture of NODE stacks multiple layers of dif-
ferentiable oblivious decision trees, which enables NODE to outperform existing tree-based models
in many tasks. Furthermore, NODE enhances computational efficiency by allowing pre-computation
of feature selectors, significantly speeding up inference without sacrificing accuracy. Joseph [82]
explored NODE within the PyTorch Tabular framework, which simplifies deep learning for tabular
data by offering a unified API that integrates both NODE and TabNet. This framework addresses the
complexity of training deep learning models compared to traditional machine learning libraries like
Scikit-learn, making advanced models more accessible for practitioners and researchers. Fayaz et al.
[80] compared NODE, TabNet, and XGBoost, noting that while NODE introduces key innovations
such as handling mixed data types and data imbalance, it often requires more hyperparameter
tuning than XGBoost. However, combining NODE with XGBoost enhances performance, showing
NODE’s strength in complementing traditional models for tabular data.
[Figure: Timeline of influential tabular deep learning publications, 2019–2024 (e.g., Semek 2019, Arik 2019, Raschka 2020, Ruff 2021, Borisov 2021, Grinsztajn 2021, Cai 2021, Karim 2022, Achtibat 2022, Liu 2022, Zhu 2023, Fuhl 2023, Shirkavand 2023, Jun 2024).]
Building on these developments, Table 2 presents a timeline of major models introduced during
this period, detailing their architectures and key performance traits. These models highlight the
significant breakthroughs in tabular deep learning, from hybrid architectures to advanced attention
mechanisms, which have propelled performance and scalability forward.
Table 2. Major tabular deep learning models introduced in 2021–2022.

| Model (Year) [Source] | Architecture | Training Efficiency | Main Features |
|---|---|---|---|
| TabLLM (2022) [84] | Large language models | Few-shot supervised learning; Moderate | Serializes tabular data into natural language strings |
| TabDDPM (2022) [85] | Multinomial diffusion + Gaussian diffusion | Supervised learning; Moderate to high | Multinomial and Gaussian diffusion to handle categorical and numerical features |
| PTab (2022) [86] | Pre-trained language model architecture | Supervised and self-supervised learning; N/A | Three-stage training strategy (modality transformation, masked-language fine-tuning, and classification fine-tuning) |
| GANDALF (2022) [87] | Gated feature learning units (GFLUs) | Supervised learning; High | GFLUs with learnable feature masks and hierarchical gating mechanisms |
| ARM-Net (2021) [88] | Exponential neurons + gated attention mechanism + sparse softmax | Supervised learning; High | Adaptive relational modeling with multi-head gated attention network |
| NPT (2021) [89] | Attention-based NN + datapoints + attributes | Self-supervised learning; Moderate to low | Processes the entire dataset at once, using attention between data points |
| SAINT (2021) [25] | Hybrid architecture with both self-attention and intersample attention mechanisms | Self-supervised contrastive learning + supervised learning; Moderate | Attention over both rows and columns |
| Regularized DNNs (2021) [90] | Plain multilayer perceptron | Supervised learning; Moderate to high | A "cocktail" of regularization techniques |
| Boost-GNN (2021) [91] | GBDT + GNN | Semi-supervised learning; Moderate to high | GNN on top of decision trees from the GBDT algorithm |
| DNN2LR (2021) [92] | DNN + LR | Supervised learning; High | Calculates cross-feature fields with DNNs for LR |
| IGTD (2021) [2] | CNN-based neural network | Supervised learning; High | Transforms tabular data into images for CNNs |
| SCARF (2021) [93] | Encoder + pre-train head network | Self-supervised contrastive, semi-supervised, and fully supervised learning; N/A | Self-supervised contrastive learning with random feature corruption |
4.1 TabTransformer
The TabTransformer model introduces significant advancements in tabular deep learning by lever-
aging attention mechanisms and hybrid architectures to address the unique challenges posed by
tabular data [19]. At its core, TabTransformer employs multi-head self-attention layers adapted from
the Transformer architecture, traditionally used in NLP, to capture complex feature interactions and
dependencies across the dataset as seen in Figure 10. This attention mechanism enables the model
to effectively capture relationships between features, making it particularly useful for datasets with
numerous categorical variables.
The TabTransformer architecture combines transformer layers with MLP components, forming
a hybrid structure optimized for tabular data. Categorical features are embedded using a column
embedding layer, which transforms each category into a dense, learnable representation. These
embeddings are passed through Transformer layers, which aggregate contextual information from
other features to capture interdependencies.
[Fig. 10: TabTransformer architecture: column embeddings pass through N Transformer blocks (multi-head attention, add & norm, feed-forward), are concatenated with continuous features, and feed a multi-layer perceptron trained against the loss.]
The contextualized categorical features are then
concatenated with continuous features and processed through the MLP for final prediction. This
design leverages the strengths of both contextual learning for categorical data and traditional
MLP benefits for continuous data. Additionally, TabTransformer incorporates masked language
modeling and replaced token detection, enabling it to pre-train on large amounts of unlabeled data,
thus improving performance in low-labeled data scenarios and making it effective for real-world
applications.
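The forward pass described above can be condensed into a short PyTorch sketch: embed each categorical column, contextualize the embeddings with a Transformer encoder, concatenate with the continuous features, and predict with an MLP. This is a minimal illustration of the architecture, omitting the column-embedding details and the pre-training objectives of the published model.

```python
import torch
import torch.nn as nn

class TabTransformerSketch(nn.Module):
    """Minimal TabTransformer-style model: embed categorical columns, contextualize
    them with a Transformer encoder, concatenate with continuous features, then MLP."""
    def __init__(self, cardinalities, n_cont, d=32, n_heads=4, n_layers=2):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(c, d) for c in cardinalities])
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlp = nn.Sequential(
            nn.Linear(d * len(cardinalities) + n_cont, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x_cat, x_cont):
        tokens = torch.stack([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        ctx = self.encoder(tokens)                        # contextualized categorical tokens
        flat = ctx.flatten(1)                             # (B, n_cat * d)
        return self.mlp(torch.cat([flat, x_cont], dim=1))

model = TabTransformerSketch(cardinalities=[10, 5, 7], n_cont=4)
x_cat = torch.randint(0, 5, (16, 3))                      # 3 categorical columns
x_cont = torch.randn(16, 4)                               # 4 continuous columns
print(model(x_cat, x_cont).shape)                         # torch.Size([16, 1])
```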
Recent advancements in TabTransformer models, such as the self-supervised TabTransformer
introduced by Vyas [94], further refine this architecture by leveraging MLM in a pre-training phase to
learn from unlabeled data. This self-supervised approach enhances the model’s ability to generalize
by capturing intricate feature dependencies through self-attention mechanisms. By combining
Transformer layers with MLP for final prediction, the model handles mixed data types and smaller
dataset sizes effectively. However, trade-offs exist: while the model demonstrates strong performance gains, particularly in semi-supervised settings, the reliance on masked language modeling pre-training increases computational overhead, potentially limiting scalability. Interpretability remains
moderate, with attention scores providing insights into feature importance, though the model is
less interpretable than traditional models like GBDT.
Another significant advancement is the GatedTabTransformer, introduced by Cholakov and Kolev
[95], which enhances the original TabTransformer by incorporating a gated multi-layer perceptron.
This modification improves the model’s ability to capture cross-token interactions using spatial
gating units. The GatedTabTransformer boosts performance by approximately 1 percent in AUROC
compared to the standard TabTransformer, especially in binary classification tasks. However, this
comes at the cost of increased computational complexity due to the additional processing required
for the spatial gating units. While the model shows improved performance, its scalability and interpretability remain limited compared to simpler models like MLPs or GBDTs.
Therefore, while TabTransformer models offer notable improvements in handling tabular data
through attention mechanisms and hybrid architectures, they present trade-offs in terms of perfor-
mance, scalability, and interpretability. Recent variations like the self-supervised TabTransformer
and GatedTabTransformer demonstrate the potential of these models to outperform traditional
approaches, though at the cost of higher computational demands.
4.2 FT-Transformer
The FT-Transformer model, as presented by Gorishniy et al. [96], introduces a novel approach to
addressing the challenges inherent in tabular data by leveraging attention mechanisms, hybrid
architectures, and transformer-based methodologies. The model adapts the attention mechanism,
originally designed for tasks like NLP, to process tabular data. In this context, the attention mecha-
nism allows the model to capture complex relationships between heterogeneous features, including
both numerical and categorical data as shown in Figure 11. By using attention to dynamically
prioritize certain features, the model effectively models interactions that are often difficult to detect
in traditional tabular data approaches.
[Fig. 11: The FT-Transformer feature tokenizer: each categorical feature x_i^(cat) is mapped to a d-dimensional token via a learned lookup W_i^(cat) and bias b_i^(cat), while numerical features are multiplied by learned d-dimensional vectors, yielding one token per feature.]
In addition to attention, the FT-Transformer employs a hybrid architecture that integrates feature
tokenization. This process transforms both numerical and categorical features into embeddings,
which are then processed through layers of the Transformer architecture. The result is a model that
is highly flexible and capable of handling diverse types of tabular data, a crucial advantage for tasks
where tabular data can vary widely in feature types and distributions. This hybrid design bridges
traditional feature encoding methods with the robust learning capabilities of Transformer-based
approaches, enabling better generalization across different datasets.
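The feature tokenizer is the distinctive ingredient and is easy to sketch: each numerical feature is multiplied by a learned d-dimensional vector (plus a bias), and each categorical feature is looked up in an embedding table, so every feature becomes one token for the Transformer. A minimal PyTorch version follows; the [CLS] token and the Transformer backbone are omitted.

```python
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    """FT-Transformer-style tokenizer sketch: every numerical feature becomes a
    learned vector scaled by its value (plus bias); every categorical feature
    becomes an embedding. All features end up as d-dimensional tokens."""
    def __init__(self, n_num, cardinalities, d=32):
        super().__init__()
        self.w_num = nn.Parameter(torch.randn(n_num, d))   # one vector per numeric feature
        self.b_num = nn.Parameter(torch.zeros(n_num, d))
        self.cat_emb = nn.ModuleList([nn.Embedding(c, d) for c in cardinalities])

    def forward(self, x_num, x_cat):
        num_tokens = x_num.unsqueeze(-1) * self.w_num + self.b_num        # (B, n_num, d)
        cat_tokens = torch.stack(
            [e(x_cat[:, i]) for i, e in enumerate(self.cat_emb)], dim=1)  # (B, n_cat, d)
        return torch.cat([num_tokens, cat_tokens], dim=1)  # token sequence for a Transformer

tok = FeatureTokenizer(n_num=4, cardinalities=[10, 6])
tokens = tok(torch.randn(16, 4), torch.randint(0, 6, (16, 2)))
print(tokens.shape)   # torch.Size([16, 6, 32])
```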
Recent studies have demonstrated the effectiveness of the FT-Transformer across various applica-
tions. In the domain of heart failure prognosis, the FT-Transformer outperformed traditional models
like Random Forest and Logistic Regression by capturing the non-linear interactions between
medical features, such as demographic and clinical data [97]. The use of attention mechanisms
allowed the model to dynamically prioritize important health indicators, leading to more accurate
predictions. Similarly, in intrusion detection systems, the FT-Transformer showed superior accuracy
in identifying network anomalies by processing the highly structured nature of network traffic data
[98]. The hybrid architecture seamlessly integrated categorical and numerical features, improving
the model’s ability to detect both known and unknown threats. Additionally, advancements like
stacking multiple transformer layers have been employed to further enhance the model’s capacity
to capture long-range dependencies within the data, making it even more effective in complex tasks
[99]. While the FT-Transformer model demonstrates improved performance over other models,
such as ResNet and MLP, particularly on various tabular tasks, it comes with certain trade-offs. In
terms of interpretability, the model’s complexity poses challenges. Traditional models like GBDT
offer clearer interpretability, as their decision-making processes are more transparent. In contrast,
the FT-Transformer’s reliance on attention mechanisms and deep layers makes it harder to explain,
although the attention scores do provide some insight into feature importance. Furthermore, the
model’s scalability is another consideration; the computational demands of Transformer-based
models, especially the quadratic scaling of the attention mechanism with the number of features,
can become a limitation when applied to extensive datasets. Despite these limitations, the FT-
Transformer’s ability to generalize across diverse datasets makes it a promising model for tabular
data analysis, offering significant advancements in predictive performance.
Building on these advancements, we present a performance and log-loss comparison between
TabNet and FT-Transformer. As shown in Figure 12, the FT-Transformer consistently demonstrates
superior performance as the number of random search iterations increases, while the log-loss for
both models decreases at different rates. This comparison highlights FT-Transformer’s enhanced
generalization capabilities over TabNet, particularly in larger search spaces. While this figure
provides an illustrative example of performance differences, unlike the previous survey on tabular
deep learning [7], we have not offered a comparison of all tabular deep learning models, as a
comprehensive evaluation across multiple models and diverse datasets is beyond the scope of this
current survey. Future research should aim to conduct more extensive performance evaluations to
thoroughly examine the strengths and limitations of these models.
4.3 DeepGBM
The DeepGBM model represents an innovative approach to addressing the challenges of tabular data
in deep learning, leveraging a combination of advanced techniques such as attention mechanisms,
hybrid architectures, and knowledge distillation [63]. While the model does not explicitly employ
traditional attention mechanisms, it incorporates feature importance from GBDT, a method that
allows the model to prioritize certain features over others. This process mimics attention by
directing the model’s focus to the most informative features rather than treating all inputs equally.
By emphasizing the most relevant features, DeepGBM enhances its ability to handle both sparse
categorical and dense numerical data, a crucial requirement in tabular data tasks.
Recent advancements in tabular deep learning further underscore DeepGBM’s role in combining
neural networks with GBDT to achieve improved performance. In particular, the model’s hybrid
architecture utilizes CatNN to handle sparse categorical features through embeddings and factor-
ization machines, and GBDT2NN to convert the outputs of GBDT into a neural network format
optimized for dense numerical features [100]. Figure 13 shows the structure of DeepGBM. This
integration allows DeepGBM to leverage the strengths of both model types, overcoming limitations
in traditional approaches that struggle to process mixed feature types in a unified framework.
[Fig. 13: The structure of DeepGBM, combining CatNN (for sparse categorical features) and GBDT2NN (for dense numerical features).]
Though DeepGBM does not directly implement transformer models, it adopts ideas from
transformer-based architectures, particularly in the form of knowledge distillation. By distilling
the knowledge gained from GBDT trees into a neural network, including not just the predictions
but also the tree structure and feature importance, DeepGBM retains the benefits of GBDT while
enhancing its learning capacity [101]. This mirrors how transformers use distillation to simplify
complex models while preserving performance.
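The distillation idea can be sketched with a toy regression task: fit a GBDT, then train a neural network to match the GBDT's outputs rather than the raw labels. Note that DeepGBM's GBDT2NN also distills tree structure and leaf indices; the sketch below shows only the simplest output-matching form.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import GradientBoostingRegressor

# Toy GBDT -> NN distillation in the spirit of GBDT2NN: the network is trained
# to reproduce the tree ensemble's outputs, transferring its learned function.
X = np.random.rand(1000, 8).astype(np.float32)
y = X[:, 0] * 2 + np.sin(X[:, 1] * 6)                 # synthetic target

gbdt = GradientBoostingRegressor(n_estimators=100).fit(X, y)
soft_targets = torch.tensor(gbdt.predict(X), dtype=torch.float32).unsqueeze(1)

student = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
xb = torch.tensor(X)
for step in range(500):                # distill: match the GBDT, not the raw labels
    opt.zero_grad()
    loss = nn.functional.mse_loss(student(xb), soft_targets)
    loss.backward()
    opt.step()
```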
The trade-offs in DeepGBM between performance, interpretability, and scalability reflect broader
challenges in tabular deep learning. DeepGBM achieves higher accuracy by combining GBDT and
neural networks but sacrifices some interpretability, as the added complexity of the neural network
component reduces the transparency typically associated with tree-based models. Scalability is also
a challenge, as the neural network elements require greater computational resources. However,
models like WindTunnel have shown that this approach can boost accuracy while maintaining some
of the structural benefits of the original GBDT [101]. These trade-offs must be carefully balanced
depending on the application, as DeepGBM excels in performance and efficiency, particularly for
large-scale and real-time applications.
Fig. 14. (a) The DANets Abstract Layer (ABSTLAY), with learnable masks for feature grouping and abstraction; (b) an i-th Basic Block with dropout [102]
DANets also incorporate hybrid architectures that blend feature grouping and hierarchical
abstraction processes, similar to CNNs, but adapted for the unique structure of tabular data. The
introduction of the Abstract Layer (ABSTLAY) as seen in Figure 14 enables the model to group
correlated features and abstract higher-level representations through successive layers. Additionally,
shortcut paths are employed, allowing raw features to be reintroduced at higher levels of the
network, ensuring that critical information is retained and enhancing the model’s robustness,
particularly in deeper architectures. This design is similar to ResNet-style connections, where
residual pathways prevent information loss and degradation in deeper networks, thus boosting
performance.
DANets incorporate transformer-inspired ideas through the use of dynamic weighting and
attention-like mechanisms, allowing the model to selectively focus on important features during
the feature selection and abstraction processes. Although not a direct application of transformer
models, these methods improve the handling of tabular data and boost performance, making DANets
superior to traditional models like XGBoost and neural networks such as TabNet. However, this
performance comes at the cost of reduced interpretability. While attention-based feature selection
offers insights into the significance of specific features, the complexity of hierarchical abstraction
obscures the decision-making process, making it less transparent than simpler models like de-
cision trees. To address scalability, DANets utilize structure re-parameterization, which reduces
computational complexity during inference, allowing deeper networks without overwhelming
computational costs. Despite the performance boost from deeper architectures, the study notes
that additional depth yields diminishing returns due to the limited feature space in tabular data.
However, the complexity introduced by intersample attention makes it less intuitive to interpret
than simpler models. Lastly, while SAINT and SAINTENS scale well across large datasets, the
computational demands of the attention mechanisms, especially intersample attention, can make
these models more resource-intensive, particularly in larger datasets.
Fig. 15. Overview of TaBERT's Method for Jointly Learning Representations of Natural Language Utterances and Table Schemas, Using an Example From WikiTableQuestions [105]. (A) Content snapshot from the input table; (B) per-row encoding with a BERT transformer and cell-wise pooling; (C) vertical self-attention over aligned row encodings.
In terms of architecture, TaBERT uses a hybrid approach known as content snapshots to reduce
computational overhead. Instead of encoding all rows in a table, which would be costly, TaBERT
selects a subset of rows that are most relevant to the natural language query. This allows the model
to retain key information necessary for effective joint reasoning between text and tables while
reducing the burden of processing unnecessary data. However, this comes with a trade-off: while
content snapshots help scale the model to larger tables, there is a risk of losing critical information
if the selected rows do not adequately represent the table’s full structure and content.
[Figure: The TabTranSELU architecture: categorical embeddings and positionally encoded numerical inputs, with Gaussian noise (0.2 std), pass through a self-attention layer and feed-forward network with SELU activations, followed by a final dense layer.]
The model also employs a hybrid architecture, adapting the traditional Transformer design for
tabular data by simplifying its structure. Instead of utilizing the full stack of encoder and decoder
layers, as seen in NLP tasks, TabTranSELU uses just a single encoder and decoder layer. This
reduction in complexity helps tailor the architecture to the specific needs of tabular data without
sacrificing performance. Moreover, the model integrates elements of both neural networks and transformers.
| Model (Year) [Source] | Architecture | Training Efficiency | Main Features |
|---|---|---|---|
| MambaTab (2024) [3] | Mamba block + final prediction layer | Supervised and self-supervised learning; High | Structured state-space models + feature incremental learning |
| TP-BERTa (2024) [107] | Transformer-based, using relative magnitude tokenization + intra-feature attention | Supervised learning; Moderate | Transforms scalar numerical values into discrete tokens; integrates feature name-value pairs |
| SwitchTab (2024) [6] | Asymmetric encoder-decoder | Self-supervised learning; Moderate to high | Asymmetric encoder-decoder structure to decouple mutual and salient features, leveraging feature corruption |
| CARTE (2024) [108] | Graph NN architecture using graph-attention layers | Self-supervised learning; Moderate | Transforms each row of tabular data into a graph representation |
| BiSHop (2024) [109] | Hopfield-based framework | Supervised learning; Moderate to high | Bi-directional sparse Hopfield modules process tabular data column-wise and row-wise, using tabular embeddings for categorical and numerical features |
| LF-Transformer (2024) [106] | Column-wise transformer + row-wise transformer + latent factor embedding | Supervised learning; Moderate | Column-wise and row-wise attention, latent factor embeddings, and matrix factorization to capture feature interactions |
| TabTranSELU (2024) [26] | Self-attention + SELU activation + masked layer | Supervised learning; N/A | Applies positional encoding to numerical data and replaces ReLU with SELU activation |
| TabR (2023) [110] | Feed-forward NN augmented with a retrieval-based mechanism | Supervised learning; Moderate | Retrieval-augmented mechanism using L2-based nearest neighbors with a feed-forward NN |
| HYTREL (2023) [111] | Hypergraph structure-aware transformer (HyperTrans) | Self-supervised learning; High | Transforms tabular data into hypergraphs |
| ReConTab (2023) [5] | Asymmetric autoencoder | Self-supervised and semi-supervised learning; N/A | Transformer-based asymmetric autoencoder with feature corruption |
| GNN4TDL (2023) [27] | Graph neural network | Supervised learning; N/A | Transforms tabular data into graph structures using feature embeddings |
| Trompt (2023) [112] | Prompt learning | Supervised learning; Moderate | Prompt-inspired learning to derive sample-specific feature importances by combining column and prompt embeddings |
| XTab (2023) [113] | Transformer-based | Self-supervised learning; Moderate to high | Cross-table pretraining, data-specific featurizers, and embedding layers for categorical and numerical features |
Expanding the scope of transformer models, MambaTab integrates structured state-space mod-
els with feature incremental learning, capturing long-range dependencies in tabular data more
efficiently than standard self-attention mechanisms [3]. MambaTab’s ability to adapt to evolving
feature sets enhances its scalability, but it sacrifices interpretability, lacking the attention mecha-
nisms that explain feature importance in models like TabNet. SwitchTab employs an asymmetric
encoder-decoder architecture that decouples mutual and salient features through separate pro-
jectors, improving feature representation in tabular data [6]. By using feature corruption-based
methods, SwitchTab enhances performance and interpretability, but its complexity affects scalabil-
ity, making it less efficient for very large datasets. Context Aware Representation of Table Entries
(CARTE) also utilizes advanced architectures, combining a Graph Neural Network (GNN) with
graph-attention layers to represent each table row as a graphlet, enabling the model to capture com-
plex contextual relationships across tables [108]. CARTE excels in transfer learning and performs
well on heterogeneous datasets, although its graph-attention mechanisms reduce interpretability
and scalability with large datasets.
In the realm of tokenization and prompt-based models, TP-BERTa stands out by applying Relative
Magnitude Tokenization (RMT) to transform scalar numerical values into discrete tokens, effectively
treating numerical data as words in a language model framework [107]. Additionally, its Intra-
Feature Attention (IFA) module unifies feature names and values into a coherent representation,
reducing feature interference and enhancing prediction accuracy. However, this deep integration
impacts interpretability compared to more straightforward models like gradient-boosted decision
trees. Trompt employs prompt-inspired learning to derive sample-specific feature importance
through the use of column and prompt embeddings, which tailor the relevance of features for
each data instance [112]. While Trompt boosts performance, especially for highly variable tabular
datasets, the abstract nature of its embeddings compromises interpretability and adds complexity.
Several other models combine innovative mechanisms with existing architectures to address
tabular data challenges. TabR integrates a retrieval-augmented mechanism that utilizes L2-based
nearest neighbors along with a feed-forward neural network, enhancing local learning by retrieving
relevant context from the training data [110]. While this method significantly improves predictive
accuracy, it introduces computational overhead during training, affecting scalability. BiSHop lever-
ages Bi-directional Sparse Hopfield Modules to process tabular data both column-wise and row-wise,
capturing intra-feature and inter-feature interactions [109]. Its specialized tabular embeddings and
learnable sparsity provide strong performance but at the cost of reduced interpretability and higher
computational requirements, limiting its application to larger datasets.
Finally, Hypergraph-enhanced Tabular Data Representation Learning (HYTREL) addresses struc-
tural challenges in tabular data using a Hypergraph structure-aware transformer, representing
tables as hypergraphs to capture complex cell, row, and column relationships [111]. This enables
HYTREL to preserve critical structural properties and perform exceptionally well on tasks like
column annotation and table similarity prediction, though the complexity of hypergraphs reduces
interpretability. TabLLM, a novel approach, serializes tabular data into natural language strings
to allow large language models (LLMs) to process it as they would with text [84]. While effective
in zero-shot and few-shot learning scenarios, TabLLM faces scalability issues and interpretability
challenges due to the high computational demands of LLMs and their abstract representation of
tabular data.
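Serialization itself is simple to illustrate: each row is rendered as a short natural-language passage followed by the prediction question. The sketch below shows one plausible template; the column names and question are hypothetical, and TabLLM evaluates several such serialization strategies.

```python
def serialize_row(row: dict, question: str) -> str:
    """TabLLM-style serialization sketch: turn one tabular row into a natural
    language string that an LLM can consume for zero-/few-shot prediction."""
    facts = ". ".join(f"The {k.replace('_', ' ')} is {v}" for k, v in row.items())
    return f"{facts}. {question}"

row = {"age": 54, "vehicle_speed": 72, "road_condition": "wet"}  # hypothetical features
prompt = serialize_row(row, "Was this crash severe? Yes or no?")
print(prompt)
# The age is 54. The vehicle speed is 72. The road condition is wet. Was this crash severe? Yes or no?
```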
TabTransformer's self-attention performs implicit feature selection, where the most relevant features are dynamically emphasized based on their interactions with others, resulting in improved performance across datasets [61]. Figure 17 further exemplifies
this by demonstrating how Multi-Headed Self-Attention (MHSA) is applied between both features
and samples in a tabular deep-learning model. By focusing attention first on the relationships
between features and then between different samples, the model improves its ability to generalize
and capture complex feature interactions, enhancing accuracy in tabular data processing.
Fig. 17. Feature and Sample Attention Using MHSA to Optimize Tabular Data Classification and Generalization [114]
Building on this, SAINT introduces both self-attention and intersample attention mechanisms to
further refine feature selection. The self-attention mechanism in SAINT focuses on interactions
between features within a single data point, dynamically selecting important features based on
their relationships [25]. This is similar to TabNet’s emphasis on instance-specific feature selection
but extends beyond by capturing deeper interdependencies between features, improving model
adaptability and performance on heterogeneous datasets. SAINT’s novel intersample attention
adds another layer of sophistication by enabling data points to attend to other samples within
the dataset. This allows SAINT to better handle noisy or missing features by borrowing relevant
information from similar samples, a capability that is particularly useful in real-world datasets
where data quality may vary. This cross-sample attention mechanism significantly enhances feature
selection, making the model more robust to incomplete or corrupt data compared to traditional
models like GBDTs and MLPs.
TabNet [24], TabTransformer [19], and SAINT [25] all offer significant advances in interpretability. TabNet
operates at both local and global levels, enabling users to understand which features contribute
to individual predictions while also providing a broader view of the overall model behavior. This
transparency makes TabNet particularly useful in understanding model decisions for specific
samples. Similarly, SAINT improves interpretability through its attention-based structure. In SAINT,
attention maps highlight which features and samples are being prioritized during prediction,
making it easier to trace the model’s decision-making process and visualize feature importance.
TabTransformer also enhances interpretability by generating contextual embeddings that cluster
semantically similar features together in the embedding space. This clustering facilitates easier
visualization and interpretation of feature relationships, making the model more transparent.
In terms of feature selection, TabNet integrates attention directly into the learning process,
optimizing feature selection and model training simultaneously. Unlike traditional methods like
forward selection or Lasso regularization, which apply uniform selection across the dataset, TabNet’s
instance-wise selection adapts to the specific needs of each sample, resulting in more compact
feature representations and a reduced risk of overfitting. InterpreTabNet, an improvement over
TabNet, further boosts these capabilities with the MLP-Attentive Transformer and the Entmax
activation function, leading to more precise feature selection [78]. Similarly, TabTransformer’s
multi-head self-attention mechanism enables the model to dynamically capture feature interactions
across the dataset. By attending to all other features, it efficiently selects the most critical ones
while disregarding irrelevant data, which enhances the model’s robustness against noisy or missing
data. SAINT extends this concept by leveraging intersample attention, which allows features
to interact across different samples. This mechanism not only improves feature selection but
also provides a way for the model to learn from multiple data points simultaneously, enhancing
its resilience to missing or noisy data. SAINT’s feature encoding method, which projects both
categorical and continuous features into a shared embedding space, also outperforms traditional
encoding techniques by allowing the model to learn from all feature types in a unified manner.
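As a hedged illustration of instance-wise selection, the sketch below mimics the spirit of TabNet's attentive transformer in PyTorch; for self-containment it uses softmax where TabNet uses sparsemax (and InterpreTabNet the Entmax activation) to obtain genuinely sparse masks, and all sizes are illustrative.

import torch
import torch.nn as nn

class AttentiveSelector(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.fc = nn.Linear(n_features, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One mask per row: each sample receives its own feature-importance
        # distribution, i.e. instance-wise feature selection.
        mask = torch.softmax(self.fc(x), dim=-1)  # TabNet uses sparsemax/entmax here
        return x * mask                           # re-weighted features

x = torch.randn(8, 5)                             # 8 samples, 5 features
selected = AttentiveSelector(5)(x)

Because the mask is a function of the input, two different samples can end up emphasizing entirely different feature subsets, unlike uniform methods such as Lasso.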
Both TabNet and TabTransformer, along with SAINT, showcase notable advancements in han-
dling tabular data through their attention mechanisms, offering robustness, adaptability, and
transparency. TabNet’s attention-driven approach enhances gradient propagation and generaliza-
tion, while TabTransformer excels in handling noisy and missing data, making both models suitable
for real-world applications where data imperfections are common. SAINT builds on these strengths
by introducing intersample attention, which allows the model to learn from relationships between
samples, further enhancing its ability to handle complex data distributions. Additionally, TabTrans-
former and SAINT’s pre-training on unlabeled data in semi-supervised learning scenarios allows
them to refine feature representations, contributing to improved performance when compared to
models relying exclusively on labeled data.
Hybrid tree-based architectures such as NODE and DeepGBM have likewise improved performance on tasks traditionally dominated by simpler models like GBDT, boosting the effectiveness of tabular data predictions.
However, both architectures introduce challenges related to increased complexity. NODE’s dif-
ferentiable decision trees and multi-layer structures add computational overhead, making training
more resource-intensive compared to simpler GBDT models. Similarly, DeepGBM’s distillation pro-
cess, which involves learning leaf embeddings and managing multiple trees, introduces additional
computational cost. Both models require careful hyperparameter tuning to optimize performance,
which can make them more difficult to use in practice. Parameters like the number of layers, tree
depth, tree groups, and output dimensions must be adjusted meticulously to avoid overfitting and
ensure optimal learning. These complexities increase the training time and resource demands of both NODE and DeepGBM. When implemented effectively, both models can achieve inference efficiency comparable to GBDTs, but their training tends to be longer due to the additional layers of differentiable optimization and knowledge distillation.
6 Training Strategies
6.1 Data Augmentation
Data augmentation techniques, such as SMOTE, GAN-based methods, and variational autoencoders
(VAEs), have demonstrated varying degrees of effectiveness in improving the performance of deep
learning models on tabular data, particularly in addressing class imbalance and small dataset issues.
SMOTE, one of the classic techniques, has been widely used to oversample the minority class
by generating synthetic samples [117]. It does this by interpolating between existing data points
in the feature space, which helps mitigate the class imbalance problem and can enhance model
performance in imbalanced datasets. However, while SMOTE performs well with continuous features, it struggles with categorical variables, as noted in experiments using datasets like Breast Cancer and Credit Card Fraud [117]. The technique may struggle to maintain feature distributions when
dealing with categorical data, leading to less realistic synthetic samples that may not fully capture
the complexity of the original dataset. Wang and Pai [118] similarly note that SMOTE, although
effective for initial data expansion, does not generate sufficiently diverse and realistic data, limiting
its utility for more complex datasets.
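For reference, a minimal SMOTE example with the imbalanced-learn library is shown below; the synthetic dataset and parameters are illustrative rather than drawn from the cited experiments.

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# SMOTE interpolates between a minority sample and its k nearest neighbours,
# which is why it suits continuous features better than raw categorical codes.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))    # classes now balanced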
GAN-based methods, particularly Conditional Tabular GAN (CTGAN) and the Wasserstein Conditional GAN with Gradient Penalty (WCGAN-GP), have emerged as more advanced techniques for tabular data
augmentation. These methods have demonstrated better performance than traditional techniques
like SMOTE, especially when working with mixed-type tabular data containing both continuous
and categorical features. Camino et al. [119] highlight the advantages of using GANs over SMOTE
for minority class oversampling, emphasizing that GANs can generate more realistic and diverse
samples. However, they also point out challenges specific to tabular data, such as difficulty in
handling discrete outputs and mode collapse, where the GAN fails to generate a sufficiently varied
dataset. Jeong et al. [120] introduce BAMTGAN, a variation of GANs, which incorporates a similarity
loss to ensure the generated data maintains the original distribution and avoids mode collapse.
Despite improvements, the challenge of balancing sample diversity and realism persists.
CTGAN addresses several challenges inherent to tabular data, such as non-Gaussian and multimodal distributions, by introducing mode-specific normalization and using a conditional generator to manage class imbalance [121]. These mechanisms allow CTGAN to generate more realistic and varied synthetic data while preserving the underlying data distributions. Sauber-Cole and Khoshgoftaar [122] offer a broad survey on the use of GANs to address class imbalance in tabular data. GANs are praised for generating realistic minority class samples and improving model performance on imbalanced datasets. However, significant challenges remain, such as mode collapse, where GANs fail to capture the diversity of the minority class, and difficulty maintaining realistic feature distributions, especially for categorical data. Despite these issues, Sauber-Cole and Khoshgoftaar [122] highlight Wasserstein and Conditional GANs as promising solutions for overcoming these limitations.
WCGAN-GP further improves the stability of GAN training by mitigating issues like vanishing
gradients and mode collapse, which are common problems in standard GAN architectures. Com-
pared to SMOTE, WCGAN-GP has been shown to produce synthetic data that better preserves
data patterns and relationships, ultimately leading to better model performance and higher privacy
protection [123]. Hybrid approaches that combine SMOTE with GAN-based methods address chal-
lenges faced by standalone models. Wang and Pai [118] introduced a hybrid model using SMOTE
to augment small datasets, followed by WCGAN-GP to generate diverse and realistic synthetic
data. This combination leverages SMOTE’s statistical consistency and WCGAN-GP’s ability to
prevent overfitting, producing high-quality data while maintaining feature distribution, making it
an effective solution for tabular data augmentation.
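As a sketch of this workflow, the example below fits CTGAN from the open-source ctgan package (assumed installed) to a toy mixed-type table; the column names, generated data, and epoch count are illustrative assumptions, not those of the cited studies.

import numpy as np
import pandas as pd
from ctgan import CTGAN    # assumed dependency: pip install ctgan

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(18, 70, size=1000),                       # continuous
    "income": rng.normal(50_000, 15_000, size=1000),              # continuous
    "defaulted": rng.choice(["yes", "no"], size=1000, p=[0.2, 0.8]),  # categorical
})

# Mode-specific normalization handles the continuous columns; the conditional
# generator is driven by the declared discrete column(s).
model = CTGAN(epochs=10)
model.fit(data, discrete_columns=["defaulted"])
synthetic = model.sample(500)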
VAEs are another promising approach for data augmentation, particularly for continuous data.
VAEs regularize the latent space to generate smooth and realistic data distributions, and they have
been effective in augmenting tabular datasets [117]. However, they tend to struggle with mixed-type
data and categorical features, where maintaining the original feature distribution becomes more
challenging. Additionally, VAEs are prone to a phenomenon known as posterior collapse, where
the latent space collapses into a narrow range, reducing the variability of the generated samples
and leading to unrealistic outputs, particularly for minority classes.
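To ground this, a minimal tabular VAE sketch in PyTorch follows (continuous features only, illustrative sizes); it shows the reparameterization step and the reconstruction-plus-KL objective in which posterior collapse appears as the KL term shrinking toward zero.

import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    def __init__(self, n_features: int, latent: int = 4):
        super().__init__()
        self.enc = nn.Linear(n_features, 2 * latent)   # emits mu and log-variance
        self.dec = nn.Linear(latent, n_features)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

model = TabularVAE(n_features=8)
x = torch.randn(64, 8)                       # a batch of continuous features
recon, mu, logvar = model(x)
recon_loss = ((recon - x) ** 2).mean()
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
loss = recon_loss + kl                       # posterior collapse shows up as kl -> 0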
One of the main challenges across these techniques is the difficulty in maintaining the original
feature distributions, especially for continuous features and imbalanced categorical columns. While
SMOTE works well for continuous data, it often falls short in handling categorical data. GAN-
based approaches, such as CTGAN, use specific normalization techniques to address this issue,
but even these advanced methods are not immune to challenges like mode collapse, where the
model generates synthetic data lacking variability. GANs also require significant computational
resources and careful tuning of hyperparameters to avoid these issues during training. Despite
these challenges, GAN-based techniques, particularly WCGAN-GP, have demonstrated superior
performance in generating high-quality, realistic synthetic data compared to traditional methods
like SMOTE, making them a valuable tool for augmenting tabular datasets.
6.2 Cross-validation
Cross-validation is a crucial technique to ensure the generalization of deep learning models,
especially for tabular data, where model overfitting and data imbalances can significantly affect
performance. Richetti et al. [124] emphasize the importance of cross-validation, particularly for
smaller datasets where its role in preventing overfitting is more pronounced. In this context, k-fold
cross-validation emerges as a popular method, with the authors applying an 8-fold cross-validation
approach to achieve robust error measurements across different data partitions. Similarly, Zhu et al.
[2] applied tenfold cross-validation in their study on CNNs for tabular data transformed into image
representations. The study highlights how k-fold cross-validation ensures generalization even after
the transformation process, preventing overfitting, especially in limited or imbalanced datasets.
Both studies underscore that while k-fold cross-validation offers robust performance evaluation,
increasing the number of folds, such as from 5-fold to 10-fold, introduces higher computational
costs without a proportional improvement in the accuracy of the performance estimate.
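A minimal scikit-learn example of k-fold evaluation is given below; the model and the ten-fold setting are illustrative stand-ins for the protocols used in [2, 124].

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Ten folds: each sample is held out exactly once; more folds raise cost
# roughly linearly without a proportional gain in estimate accuracy.
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")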
Wilimitis and Walsh [125] provide a comparative analysis of cross-validation methods, focusing
on the trade-offs between computational efficiency and model performance. They examine the
commonly used 5-fold cross-validation alongside other variants like repeated k-fold cross-validation,
finding that while more folds can slightly improve model evaluation, they also escalate computational
demands. The study also explores nested cross-validation, a more unbiased method for performance
estimation, particularly useful in healthcare models. However, the significant computational costs
of nested cross-validation are highlighted due to its repeated training cycles during hyperparameter
tuning. This mirrors the findings of Richetti et al. [124], who noted that methods like leave-one-
out cross-validation (LOOCV) can be computationally impractical for larger datasets due to their
repeated iterations.
Ullah et al. [126] expand on these ideas by discussing the use of stratified k-fold cross-validation,
particularly for handling class imbalances in deep learning models for tabular data. By maintaining
consistent class proportions in each fold, stratified cross-validation improves generalization, espe-
cially when working with imbalanced datasets, a key concern echoed in the previous two studies.
Ullah et al. [126] also discuss the challenges of using LOOCV, which, despite providing unbiased
performance estimates, comes with a high computational cost, especially for larger datasets. Nested
cross-validation is similarly praised for its precision in mitigating data leakage during hyperparam-
eter tuning but is noted for its quadratic time complexity, making it a computationally intensive
choice.
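The sketch below illustrates, under assumed grid values and fold counts, how stratified folds and nested cross-validation combine in scikit-learn; the repeated inner-loop training makes the computational cost noted above explicit.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # estimation

# Stratification keeps the 80/20 class ratio in every fold; nesting the search
# inside the outer loop avoids leaking tuning information into the estimate.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"max_depth": [3, 6, None]}, cv=inner)
scores = cross_val_score(search, X, y, cv=outer)   # many repeated fits: the cost
print(scores.mean())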
These insights collectively suggest that while cross-validation techniques such as stratified
k-fold and nested cross-validation are vital for improving the robustness of deep learning models
on tabular data, they must be chosen carefully. The choice depends on balancing accuracy and
computational efficiency, where simpler methods like k-fold cross-validation are more scalable,
while more complex methods like nested cross-validation, although more precise, come with
significant computational trade-offs.
6.3 Transfer Learning
Self-supervised pre-training, such as contrastive learning, has also proven effective, particularly in scenarios where labeled data is scarce. These methods enable models to learn useful features without extensive labeling, making them highly suitable for domains with limited labeled datasets. Moreover, El-Melegy et al. [130]
introduced a novel approach that transforms tabular data into image-like formats, allowing the
use of CNNs traditionally designed for image tasks. Coupled with GAN-based sampling, which
generates synthetic data to balance datasets, this approach enables effective learning from small,
sparse datasets. In summary, while transfer learning shows substantial promise in tabular data, its
effectiveness is still hindered by challenges such as feature heterogeneity, catastrophic forgetting,
and dataset imbalance. However, advancements like transformer-based architectures, progressive
learning, and GAN-based data augmentation offer solutions to these challenges. As these methods
continue to evolve, transfer learning will likely become a more robust and widely applicable tool
for tasks involving tabular data.
7 Future Directions
As deep learning models continue to evolve for tabular data, two key areas stand out for future
exploration: explainability and self-supervised learning. While current models offer impressive
predictive capabilities, their lack of transparency remains a significant challenge in high-stakes fields
like transportation engineering and healthcare. Enhancing model explainability and interpretability
through advanced techniques like SHAP, LIME, and integrated gradients is essential for building
trust and understanding in these models. Additionally, the growing field of self-supervised learning
(SSL) offers significant potential to leverage vast amounts of unlabeled tabular data, improving
model performance without relying on extensive labeled datasets. This section examines these
promising directions and their potential impact on the future of tabular deep learning.
7.1 Explainability
Explainability techniques such as SHAP and LIME have been applied, for instance, to multimodal clinical models that combine admission notes and tabular data [132]. The combined use of these techniques allows for greater transparency by identifying the contributions of both text-based and structured data features. While this enhances
trust among clinicians, the complexity of these explanations poses challenges for broader adoption.
Simplifying these techniques to make them more accessible to non-technical users is necessary for
their wider application in high-stakes environments such as transportation safety and healthcare.
Several studies emphasize the need for further refinement of these explainability techniques
to improve their practical application. Dastile and Celik [133] applied SHAP in cancer prediction
models and found that while SHAP enhanced the interpretability of the model, its computational
demands made real-time application challenging. The authors suggest optimizing SHAP or devel-
oping more efficient explainability methods that retain interpretability while reducing resource
consumption, especially in scenarios where real-time decision-making is critical. Similarly, Tran and
Byeon [134] used SHAP in a hybrid LightGBM–TabPFN model to predict dementia in Parkinson’s
disease patients. SHAP provided valuable insights into the feature contributions, improving the
model’s interpretability in a clinical setting. However, the study also highlights the need for further
development of causality-driven explanations, integrating domain expertise to increase trust and
applicability in medical environments. In summary, while SHAP, LIME, and Integrated Gradients
have significantly improved the interpretability of deep learning models for tabular data, further
development is needed to enhance their computational efficiency, stability, and accessibility for
real-world applications where trust and transparency are crucial.
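As a brief illustration, the snippet below computes SHAP values for a gradient-boosting model with the shap library; the dataset and model are illustrative stand-ins for the clinical models in [133, 134].

import shap    # assumed dependency: pip install shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

explainer = shap.Explainer(model, X)   # dispatches to a fast tree explainer here
shap_values = explainer(X)             # per-sample, per-feature contributions
shap.plots.beeswarm(shap_values)       # global summary of feature effects

For tree ensembles this is fast, but for deep tabular models exact attribution is far more expensive, which is the computational bottleneck the cited studies highlight.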
7.2 Self-Supervised Learning
[Figure: schematic of a subset-based self-supervised objective in which a decoder D reconstructs features x̄ from overlapping feature subsets, with panel (ii) showing the reconstruction loss between the reconstructed features x̄ and the original features x.]
By pre-training on unlabeled data, such models learn effective representations without heavy reliance on labels, improving generalization across tasks.
Another approach to improving SSL’s efficacy on tabular data is through the careful design
of reconstruction-based tasks that leverage feature subsets. Zheng et al. [138] apply the SubTab
framework, where different subsets of features are used to reconstruct the full input, helping
address the issue of heterogeneity in tabular data, where not all features contribute equally to the
prediction task. This approach is echoed in VIME [58], which introduces two pretext tasks—feature
vector estimation and mask vector estimation—that focus on reconstructing original data from
masked and corrupted versions. These pretext tasks help the model learn robust representations by
encouraging it to handle missing or noisy data effectively. Similarly, Masked Encoding for Tabular data (MET) [139] builds on this by incorporating masked encoding inspired by transformer models and
using adversarial training during the reconstruction process. This adversarial component forces the
model to recover features even in the presence of perturbations, making the learned representations
more robust.
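A minimal sketch of such a masked-reconstruction pretext task, loosely following VIME's two pretext heads (our simplification, with illustrative sizes and corruption rate), is given below.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_features, p_mask = 10, 0.3
encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
recon_head = nn.Linear(32, n_features)      # "feature vector estimation"
mask_head = nn.Linear(32, n_features)       # "mask vector estimation"

x = torch.randn(64, n_features)
mask = (torch.rand_like(x) < p_mask).float()
shuffled = x[torch.randperm(x.size(0))]     # corrupt masked cells with values
x_tilde = (1 - mask) * x + mask * shuffled  # drawn from other rows (VIME-style)

h = encoder(x_tilde)
loss = F.mse_loss(recon_head(h), x) + \
       F.binary_cross_entropy_with_logits(mask_head(h), mask)
loss.backward()

Recovering both the original values and the corruption mask forces the encoder to model dependencies between features, which is what makes the learned representation robust to missing or noisy inputs.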
All these studies highlight the importance of leveraging unlabeled data in SSL, particularly in sce-
narios where labeled data is scarce. Leveraging large amounts of unlabeled data through techniques
such as consistency regularization, as seen in VIME and MET, can significantly enhance model
generalization even with limited labeled data. Wang et al. [135] emphasized that SSL techniques
must make effective use of unlabeled data to ensure transferability across tasks, especially since
tabular data often originates from diverse sources with varying feature distributions. This aligns
with the findings of Chitlangia et al. [137], where Manifold Mixup’s use of latent space perturba-
tions helps generate meaningful augmentations without relying on input-level transformations.
In SubTab, multi-view learning enables the model to capture different perspectives of the data,
extracting more robust representations from unlabeled data.
8 Conclusion
This survey reviewed the progress in deep learning models designed for tabular data, traditionally
a challenging domain for deep learning. While classical models like GBDTs have long dominated
tabular data tasks, new architectures such as TabNet, SAINT, and TabTransformer have introduced
attention mechanisms and feature embeddings to better handle the complexities of heterogeneous
features, high dimensionality, and non-local interactions. These models have made significant
strides in enhancing interpretability and performance, with innovations like TabNet’s sequential
attention and SAINT’s intersample attention, which dynamically capture relationships between
features and rows of data.
However, challenges remain, particularly regarding computational efficiency and the risk of
overfitting on smaller datasets. While models like TabTransformer and SAINT are computationally
intensive, techniques like Mixup, CutMix, and regularization methods have been developed to ad-
dress overfitting. Recent advancements, including hybrid models like TabTranSELU and GNN4TDL,
have expanded the range of applications across many research domains. IGTD has further advanced how deep learning models can transform tabular data into image-like representations for better performance.
One limitation of this survey is the lack of a detailed performance comparison across different
models and datasets. Future research should focus on conducting more rigorous evaluations of
tabular deep learning models on diverse datasets to gain a deeper understanding of their relative
strengths and weaknesses. Alongside performance comparisons, further studies should aim to
enhance the scalability and adaptability of these models, particularly in handling smaller or noisier
datasets. Techniques such as transfer learning and self-supervised learning hold promise, as they
allow models to benefit from large amounts of unlabeled data. Additionally, improving model
interpretability and reducing computational costs will be crucial to broadening the applicability of
deep tabular learning across industries like healthcare, finance, transportation, and infrastructure.
Acknowledgments
We extend our sincere gratitude to Khaled Aly Abousabaa for his assistance in preparing the images
for this study. We also thank our colleagues at Texas State University for their valuable guidance
throughout this work.
References
[1] Talip Ucar, Ehsan Hajiramezanali, and Lindsay Edwards. Subtab: Subsetting features of tabular data for self-supervised
representation learning. Advances in Neural Information Processing Systems, 34:18853–18865, 2021.
[2] Yitan Zhu, Thomas Brettin, Fangfang Xia, Alexander Partin, Maulik Shukla, Hyunseung Yoo, Yvonne A Evrard,
James H Doroshow, and Rick L Stevens. Converting tabular data into images for deep learning with convolutional
neural networks. Scientific reports, 11(1):11325, 2021.
[3] Md Atik Ahamed and Qiang Cheng. Mambatab: A simple yet effective approach for handling tabular data. arXiv
preprint arXiv:2401.08867, 2024.
[4] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language models are realistic
tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
[5] Suiyao Chen, Jing Wu, Naira Hovakimyan, and Handong Yao. Recontab: Regularized contrastive representation
learning for tabular data. arXiv preprint arXiv:2310.18541, 2023.
[6] Jing Wu, Suiyao Chen, Qi Zhao, Renat Sergazinov, Chen Li, Shengjie Liu, Chongchao Zhao, Tianpei Xie, Hanqing
Guo, Cheng Ji, et al. Switchtab: Switched autoencoders are effective tabular learners. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 38, pages 15924–15933, 2024.
[7] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural
networks and tabular data: A survey. IEEE transactions on neural networks and learning systems, 2022.
[8] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90,
2022.
[9] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on
typical tabular data? Advances in neural information processing systems, 35:507–520, 2022.
[10] Yuqian Wu, Hengyi Luo, and Raymond ST Lee. Deep feature embedding for tabular data. arXiv preprint
arXiv:2408.17162, 2024.
[11] Yury Gorishniy, Ivan Rubachev, and Artem Babenko. On embeddings for numerical features in tabular deep learning.
Advances in Neural Information Processing Systems, 35:24991–25004, 2022.
[12] David Collett. Modelling binary data. CRC press, 2002.
[13] Scott L Zeger and Kung-Yee Liang. Longitudinal data analysis for discrete and continuous outcomes. Biometrics,
pages 121–130, 1986.
[14] Curtis E Dyreson and Richard T Snodgrass. Timestamp semantics and representation. Information Systems, 18(3):143–
166, 1993.
[15] Ari Aulia Hakim, Alva Erwin, Kho I Eng, Maulahikmah Galinium, and Wahyu Muliady. Automated document
classification for news article in bahasa indonesia based on term frequency inverse document frequency (tf-idf)
approach. In 2014 6th international conference on information technology and electrical engineering (ICITEE), pages 1–4.
IEEE, 2014.
[16] Tomas Mikolov. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[17] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543,
2014.
[18] Xinlong Li, Xingyu Fu, Guangluan Xu, Yang Yang, Jiuniu Wang, Li Jin, Qing Liu, and Tianyuan Xiang. Enhancing bert
representation with context-aware embedding for aspect-based sentiment analysis. IEEE Access, 8:46868–46876, 2020.
[19] Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. Tabtransformer: Tabular data modeling using
contextual embeddings, 2020.
[20] Zhengyi Ma, Zhicheng Dou, Wei Xu, Xinyu Zhang, Hao Jiang, Zhao Cao, and Ji-Rong Wen. Pre-training for ad-hoc
retrieval: hyperlink is also you need. In Proceedings of the 30th ACM International Conference on Information &
Knowledge Management, pages 1212–1221, 2021.
[21] Zhenbo Lu, Wei Zhou, Shixiang Zhang, and Chen Wang. A new video-based crash detection method: Balancing speed
and accuracy using a feature fusion deep learning framework. Journal of advanced transportation, 2020(1):8848874,
2020.
[22] Boutheina Maaloul, Abdelmalik Taleb-Ahmed, Smail Niar, Naim Harb, and Carlos Valderrama. Adaptive video-based
algorithm for accident detection on highways. In 2017 12th IEEE International Symposium on Industrial Embedded
Systems (SIES), pages 1–6. IEEE, 2017.
[23] VN Durga Pavithra Kollipara, VN Hemanth Kollipara, and M Durga Prakash. Emoji prediction from twitter data using
deep learning approach. In 2021 Asian conference on innovation in technology (ASIANCON), pages 1–6. IEEE, 2021.
[24] Sercan Ö Arik and Tomas Pfister. Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI
conference on artificial intelligence, volume 35, pages 6679–6687, 2021.
[25] Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C Bayan Bruss, and Tom Goldstein. Saint: Improved
neural networks for tabular data via row attention and contrastive pre-training. arXiv preprint arXiv:2106.01342, 2021.
[26] Yuchen Mao. Tabtranselu: A transformer adaptation for solving tabular data. Applied and Computational Engineering,
51:81–88, 2024.
[27] Cheng-Te Li, Yu-Che Tsai, and Jay Chiehen Liao. Graph neural networks for tabular data learning. In 2023 IEEE 39th
International Conference on Data Engineering (ICDE), pages 3589–3592. IEEE, 2023.
[28] Shriyank Somvanshi, Gian Antariksa, and Subasish Das. Enhanced balanced-generative adversarial networks to
predict pedestrian injury types. Available at SSRN 4847615, 2024.
[29] Marco D Adelfio and Hanan Samet. Schema extraction for tabular data on the web. Proceedings of the VLDB Endowment,
6(6):421–432, 2013.
[30] Laith Alzubaidi, Jinglan Zhang, Amjad J Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, José Santamaría,
Mohammed A Fadhel, Muthana Al-Amidie, and Laith Farhan. Review of deep learning: concepts, cnn architectures,
challenges, applications, future directions. Journal of big Data, 8:1–74, 2021.
[31] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[33] Qinghua Zheng, Zhen Peng, Zhuohang Dang, Linchao Zhu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. Deep tabular data
modeling with dual-route structure-adaptive graph networks. IEEE Transactions on Knowledge and Data Engineering,
35(9):9715–9727, 2023.
[34] Antonio Briola, Yuanrong Wang, Silvia Bartolucci, and Tomaso Aste. Homological convolutional neural networks.
arXiv preprint arXiv:2308.13816, 2023.
[35] Tennison Liu, Zhaozhi Qian, Jeroen Berrevoets, and Mihaela van der Schaar. Goggle: Generative modelling for tabular
data by learning relational structure. In The Eleventh International Conference on Learning Representations, 2023.
[36] Lun Du, Fei Gao, Xu Chen, Ran Jia, Junshan Wang, Jiang Zhang, Shi Han, and Dongmei Zhang. Tabularnet: A neural
network architecture for understanding semantic structures of tabular data. In Proceedings of the 27th ACM SIGKDD
Conference on Knowledge Discovery & Data Mining, pages 322–331, 2021.
[37] Joseph M Hellerstein. Learning to restructure tables automatically. ACM SIGMOD Record, 53(1):75–75, 2024.
[38] Zifeng Wang and Jimeng Sun. Transtab: Learning transferable tabular transformers across tables. Advances in Neural
Information Processing Systems, 35:2902–2915, 2022.
[39] Amirata Ghorbani, Dina Berenbaum, Maor Ivgi, Yuval Dafna, and James Y Zou. Beyond importance scores: Interpreting
tabular ml by visualizing feature semantics. Information, 13(1):15, 2021.
[40] Nadja Geisler and Carsten Binnig. Introducing quest: a query-driven framework to explain classification models on
tabular data. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–4, 2022.
[41] Vanshika Jain, Meghansh Goel, and Kshitiz Shah. Deep learning on small tabular dataset: Using transfer learning
and image classification. In International Conference on Artificial Intelligence and Speech Technology, pages 555–568.
Springer, 2021.
[42] Georgia Koppe, Andreas Meyer-Lindenberg, and Daniel Durstewitz. Deep learning for small and big data in psychiatry.
Neuropsychopharmacology, 46(1):176–190, 2021.
[43] Wei Zhao. Research on the deep learning of the small sample data based on transfer learning. In AIP conference
proceedings, volume 1864. AIP Publishing, 2017.
[44] Lorenzo Brigato and Luca Iocchi. A close look at deep learning with small data. In 2020 25th international conference
on pattern recognition (ICPR), pages 2490–2497. IEEE, 2021.
[45] Illia Horenko. On a scalable entropic breaching of the overfitting barrier for small data problems in machine learning.
Neural Computation, 32(8):1563–1579, 2020.
[46] Benjamin L. Badger. Small language models for tabular data, 2022.
[47] Witold Wydmański, Oleksii Bulenok, and Marek Śmieja. Hypertab: Hypernetwork approach for deep learning on
small tabular datasets. In 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA),
pages 1–9. IEEE, 2023.
[48] Rajat Singh and Srikanta Bedathur. Embeddings for tabular data: A survey, 2023.
[49] Robert G Clark, Wade Blanchard, Francis KC Hui, Ran Tian, and Haruka Woods. Dealing with complete separation
and quasi-complete separation in logistic regression for linguistic data. Research Methods in Applied Linguistics,
2(1):100044, 2023.
[50] Daniel Valero-Carreras, Javier Alcaraz, and Mercedes Landete. Comparing two svm models through different metrics
based on the confusion matrix. Computers & Operations Research, 152:106131, 2023.
[51] Nitin Chauhan and Krishna Singh. A review on conventional machine learning vs deep learning. In 2018 International conference on computing, power and communication technologies (GUCON), pages 347–352. IEEE, 2018.
[52] Sakib Abrar. Deep model intervention for representation learning of tabular data. Master’s thesis, Tennessee State
University, 2023.
[53] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on
tabular data?, 2022.
[54] Sheikh Amir Fayaz, Majid Zaman, Sameer Kaul, and Muheet Ahmed Butt. Is deep learning on tabular data enough?
an assessment. International Journal of Advanced Computer Science and Applications, 13(4), 2022.
[55] André Schidler and Stefan Szeider. Sat-based decision tree learning for large data sets. Journal of Artificial Intelligence
Research, 80:875–918, 2024.
[56] Vinícius G Costa and Carlos E Pedreira. Recent advances in decision trees: An updated survey. Artificial Intelligence
Review, 56(5):4765–4800, 2023.
[57] Matan Marudi, Irad Ben-Gal, and Gonen Singer. A decision tree-based method for ordinal classification problems.
IISE Transactions, 56(9):960–974, 2024.
[58] Jinsung Yoon, Yao Zhang, James Jordon, and Mihaela Van der Schaar. Vime: Extending the success of self-and
semi-supervised learning to tabular domain. Advances in Neural Information Processing Systems, 33:11033–11043,
2020.
[59] Yuanfei Luo, Hao Zhou, Wei-Wei Tu, Yuqiang Chen, Wenyuan Dai, and Qiang Yang. Network on network for tabular
data classification in real-world applications. In Proceedings of the 43rd International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 2317–2326, 2020.
[60] Liran Katzir, Gal Elidan, and Ran El-Yaniv. Net-dnf: Effective deep modeling of tabular data. In International conference
on learning representations, 2020.
[61] Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. Tabtransformer: Tabular data modeling using
contextual embeddings. arXiv preprint arXiv:2012.06678, 2020.
[62] Sergei Popov, Stanislav Morozov, and Artem Babenko. Neural oblivious decision ensembles for deep learning on
tabular data. arXiv preprint arXiv:1909.06312, 2019.
[63] Guolin Ke, Zhenhui Xu, Jia Zhang, Jiang Bian, and Tie-Yan Liu. Deepgbm: A deep learning framework distilled by
gbdt for online prediction tasks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, pages 384–394, 2019.
[64] Baohua Sun, Lin Yang, Wenhan Zhang, Michael Lin, Patrick Dong, Charles Young, and Jason Dong. Supertml:
Two-dimensional word embedding for the precognition on structured tabular data. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
[65] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. xdeepfm: Com-
bining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD
international conference on knowledge discovery & data mining, pages 1754–1763, 2018.
[66] Guolin Ke, Jia Zhang, Zhenhui Xu, Jiang Bian, and Tie-Yan Liu. Tabnn: A universal neural network solution for
tabular data. 2018.
[67] Ira Shavitt and Eran Segal. Regularization learning networks: deep learning for tabular datasets. Advances in Neural
Information Processing Systems, 31, 2018.
[68] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a factorization-machine based
neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.
[69] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg
Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st
workshop on deep learning for recommender systems, pages 7–10, 2016.
[70] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need, 2021.
[71] Ami Abutbul, Gal Elidan, Liran Katzir, and Ran El-Yaniv. Dnf-net: A neural architecture for tabular data. arXiv
preprint arXiv:2006.06465, 2020.
[72] Nitin Kumar Chauhan and Krishna Singh. A review on conventional machine learning vs deep learning. In 2018
International conference on computing, power and communication technologies (GUCON), pages 347–352. IEEE, 2018.
[73] Sakib Abrar and Manar D Samad. Perturbation of deep autoencoder weights for model compression and classification
of tabular data. Neural Networks, 156:160–169, 2022.
[74] Jintai Chen, Jiahuan Yan, Qiyuan Chen, Danny Chen, Jian Wu, and Jimeng Sun. Excelformer: Can a dnn be a sure bet
for tabular prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, 2024.
[75] N Benjamin Erichson, Lionel Mathelin, Zhewei Yao, Steven L Brunton, Michael W Mahoney, and J Nathan Kutz.
Shallow neural networks for fluid flow reconstruction with limited sensors. Proceedings of the Royal Society A,
476(2238):20200097, 2020.
[76] Ivan Rubachev, Artem Alekberov, Yury Gorishniy, and Artem Babenko. Revisiting pretraining objectives for tabular
deep learning. arXiv preprint arXiv:2207.03208, 2022.
[77] James Fiedler. Simple modifications to improve tabular neural networks. arXiv preprint arXiv:2108.03214, 2021.
[78] Shiyun Wa, Xinai Lu, and Minjuan Wang. Stable and interpretable deep learning for tabular data: Introducing
interpretabnet with the novel interprestability metric. arXiv preprint arXiv:2310.02870, 2023.
[79] Viet-Cuong Ta, Thi-Linh Hoang, Ngoc-San Doan, Van-Thang Nguyen, Nuong Nguyen Dieu, Thi Thanh Thuy Pham,
and Nam Nguyen Dang. Tabnet efficiency for facies classification and learning feature embedding from well log data.
Petroleum Science and Technology, pages 1–16, 2023.
[80] Sheikh Amir Fayaz, Majid Zaman, Sameer Kaul, and Muheet Ahmed Butt. Is deep learning on tabular data enough?
an assessment. International Journal of Advanced Computer Science and Applications, 13(4), 2022.
[81] Aleksandra Lewandowska. Xgboost meets tabnet in predicting the costs of forwarding contracts. In 2022 17th
Conference on Computer Science and Intelligence Systems (FedCSIS), pages 417–420. IEEE, 2022.
[82] Manu Joseph. Pytorch tabular: A framework for deep learning with tabular data, 2021.
[83] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller. Explainable AI:
interpreting, explaining and visualizing deep learning, volume 11700. Springer Nature, 2019.
[84] Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm:
Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence
and Statistics, pages 5549–5581. PMLR, 2023.
[85] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. Tabddpm: Modelling tabular data with
diffusion models. In International Conference on Machine Learning, pages 17564–17579. PMLR, 2023.
[86] Guang Liu, Jie Yang, and Ledell Wu. Ptab: Using the pre-trained language model for modeling tabular data. arXiv
preprint arXiv:2209.08060, 2022.
[87] Manu Joseph and Harsh Raj. Gate: Gated additive tree ensemble for tabular classification and regression. arXiv
preprint arXiv:2207.08548, 2022.
[88] Shaofeng Cai, Kaiping Zheng, Gang Chen, HV Jagadish, Beng Chin Ooi, and Meihui Zhang. Arm-net: Adaptive
relation modeling network for structured data. In Proceedings of the 2021 International Conference on Management of
Data, pages 207–220, 2021.
[89] Jannik Kossen, Neil Band, Clare Lyle, Aidan N Gomez, Thomas Rainforth, and Yarin Gal. Self-attention between
datapoints: Going beyond individual input-output pairs in deep learning. Advances in Neural Information Processing Systems, 34, 2021.
[114] Shourav B Rabbani, Ivan V Medri, and Manar D Samad. Attention versus contrastive learning of tabular data–a
data-centric benchmarking. arXiv preprint arXiv:2401.04266, 2024.
[115] Sajad Darabi, Shayan Fazeli, Ali Pazoki, Sriram Sankararaman, and Majid Sarrafzadeh. Contrastive mixup: Self-and
semi-supervised learning for tabular domain. arXiv preprint arXiv:2108.12296, 2021.
[116] Karim Lounici, Katia Meziani, and Benjamin Riu. Muddling label regularization: Deep learning for tabular datasets.
arXiv preprint arXiv:2106.04462, 2021.
[117] Pedro Machado, Bruno Fernandes, and Paulo Novais. Benchmarking data augmentation techniques for tabular data.
In International Conference on Intelligent Data Engineering and Automated Learning, pages 104–112. Springer, 2022.
[118] Winston Wang and Tun-Wen Pai. Enhancing small tabular clinical trial dataset through hybrid data augmentation:
combining smote and wcgan-gp. Data, 8(9):135, 2023.
[119] Ramiro Camino, Christian Hammerschmidt, and Radu State. Minority class oversampling for tabular data with deep
generative models. arXiv preprint arXiv:2005.03773, 2020.
[120] Jueun Jeong, Hanseok Jeong, and Han-Joon Kim. Bamtgan: A balanced augmentation technique for tabular data. In
2023 9th International Conference on Applied System Innovation (ICASI), pages 205–207. IEEE, 2023.
[121] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using
conditional gan. Advances in neural information processing systems, 32, 2019.
[122] Rick Sauber-Cole and Taghi M Khoshgoftaar. The use of generative adversarial networks to alleviate class imbalance
in tabular data: a survey. Journal of Big Data, 9(1):98, 2022.
[123] Susan McKeever and Manhar Singh Walia. Synthesising tabular datasets using wasserstein conditional gans with
gradient penalty (wcgan-gp). Technological University Dublin, 2020.
[124] Jonathan Richetti, Foivos I Diakogianis, Asher Bender, André F Colaço, and Roger A Lawes. A methods guideline for
deep learning for tabular data in agriculture with a case study to forecast cereal yield. Computers and Electronics in
Agriculture, 205:107642, 2023.
[125] Drew Wilimitis and Colin G Walsh. Practical considerations and applied examples of cross-validation for model
development and evaluation in health care: tutorial. JMIR AI, 2:e49023, 2023.
[126] Ihsan Ullah, Andre Rios, Vaibhav Gala, and Susan Mckeever. Explaining deep learning models for tabular data using
layer-wise relevance propagation. Applied Sciences, 12(1):136, 2021.
[127] Roman Levin, Valeriia Cherepanova, Avi Schwarzschild, Arpit Bansal, C Bayan Bruss, Tom Goldstein, Andrew Gordon
Wilson, and Micah Goldblum. Transfer learning with deep tabular models. arXiv preprint arXiv:2206.15306, 2022.
[128] Mohammadreza Iman, Hamid Reza Arabnia, and Khaled Rasheed. A review of deep transfer learning and recent
advancements. Technologies, 11(2):40, 2023.
[129] Qixuan Jin and Talip Ucar. Benchmarking tabular representation models in transfer learning settings. In NeurIPS
2023 Second Table Representation Learning Workshop, 2023.
[130] Moumen El-Melegy, Ahmed Mamdouh, Samia Ali, Mohamed Badawy, Mohamed Abou El-Ghar, Norah Saleh Alghamdi,
and Ayman El-Baz. Prostate cancer diagnosis via visual representation of tabular data and deep transfer learning.
Bioengineering, 11(7):635, 2024.
[131] Junkang An, Yiwan Zhang, and Inwhee Joe. Specific-input lime explanations for tabular data based on deep learning
models. Applied Sciences, 13(15):8782, 2023.
[132] Zhenyue Gao, Xiaoli Liu, Yu Kang, Pan Hu, Xiu Zhang, Wei Yan, Muyang Yan, Pengming Yu, Qing Zhang, Wendong
Xiao, et al. Improving the prognostic evaluation precision of hospital outcomes for heart failure using admission
notes and clinical tabular data: Multimodal deep learning model. Journal of Medical Internet Research, 26:e54363, 2024.
[133] Xolani Dastile and Turgay Celik. Making deep learning-based predictions for credit scoring explainable. IEEE Access,
9:50426–50440, 2021.
[134] Vinh Quang Tran and Haewon Byeon. Predicting dementia in parkinson’s disease on a small tabular dataset using
hybrid lightgbm–tabpfn and shap. Digital Health, 10:20552076241272585, 2024.
[135] Wei-Yao Wang, Wei-Wei Du, Derek Xu, Wei Wang, and Wen-Chih Peng. A survey on self-supervised learning for
non-sequential tabular data. arXiv preprint arXiv:2402.01204, 2024.
[136] Ehsan Hajiramezanali, Nathaniel Lee Diamant, Gabriele Scalia, and Max W Shen. Stab: Self-supervised learning for
tabular data. In NeurIPS 2022 First Table Representation Workshop, 2022.
[137] Sharad Chitlangia, Anand Muralidhar, and Rajat Agarwal. Self supervised pre-training for large scale tabular data.
2022.
[138] Xuan Zheng, Xiuli Ma, Yanliang Jin, Dongsheng Gu, and Rui Wang. Tabular-based self-supervised learning approach
for encrypted traffic classification. Journal of Electronic Imaging, 32(4):043032–043032, 2023.
[139] Kushal Majmundar, Sachin Goyal, Praneeth Netrapalli, and Prateek Jain. Met: Masked encoding for tabular data.
arXiv preprint arXiv:2206.08564, 2022.