A Survey on Deep Tabular Learning
1 Introduction
Tabular data, which consists of rows and columns representing structured information [1, 2],
is the most commonly used data format in many industries, including healthcare, finance, and
transportation.
Fig. 1. Progression of Tabular Deep Learning Models, 2020–2024: from VIME, NON, Net-DNF, and TabTransformer, through NAMs, NBM, SPAM, TAC, Hopular, TabPFN, GReaT, PTab, TabLLM, TabDDPM, and GANDALF, to TabTranSELU, MambaTab, TP-BERTa, SwitchTab, CARTE, BiSHop, and LF-Transformer. (* denotes no paper found, but the model has won several Kaggle competitions; see https://www.kaggle.com/competitions/tabular-playground-series-apr-2021/discussion/230013)
Unlike unstructured data such as images and text, tabular data directly represents heterogeneous data types
within tables, providing rich context but complicating feature representation and model training.
Understanding how to handle these varied data types is critical to improving the performance of
deep learning models on tabular data.
Tabular data can also be represented in two different formats, 1D and 2D, as shown in Figure 2 below. In 1D tabular data, each row represents a sample and columns
represent specific features, making it easy to process and analyze. This format is ideal for traditional
machine learning tasks, as each column follows a specific data type and the structure is fixed. For
example, in transportation safety datasets, each row could represent an individual crash event,
and the columns might include features like vehicle speed, crash time, or road conditions. The
simplicity of this structure makes it highly useful in various fields.
[Fig. 2: 1D versus 2D tabular data formats: in the 1D format each row is one sample, while in the 2D format each sample is itself a table of rows and columns.]
In contrast, 2D tabular data provides a more complex format where each sample can be repre-
sented by a table, with multiple rows and columns within each table. This format is often used for
tasks that require deeper relational analysis, such as tracking patient health over time or analyzing
transportation data across different regions and times. 2D tabular data is also more flexible, incor-
porating diverse data types, including timestamps or unstructured data, like text or images, within
each table. This additional complexity makes it suitable for applications in areas such as healthcare
and transportation, where temporal and multi-dimensional data are critical.
Some of these data types are explained below:
• Binary Data: Binary data, a type of categorical data with two possible values (such as "Yes/No"), is often represented as 0 or 1 in deep learning models [12].
• Numerical Data: Numerical data, representing continuous or discrete variables (e.g., age, vehicle speed), is common in predictive modeling, especially in transportation safety [13]. Deep learning models can consume it directly, but preprocessing, such as scaling or standardization, is critical for performance (see the preprocessing sketch after this list). Advanced techniques, such as numerical embeddings, help capture non-linear relationships and interactions in the data.
• Timestamps: Timestamps provide essential temporal information in systems like traffic
management. Preprocessing involves extracting features such as day, month, or hour to
capture temporal patterns for deep learning models [14].
• Text Data: Text data in tabular formats, such as crash descriptions, presents challenges
for deep learning models. Methods like TF-IDF [15] and word embeddings (e.g., Word2Vec, GloVe) convert text into numerical vectors [16, 17].
Advanced models like transformers (e.g., BERT) capture context-aware embeddings [18].
• Image Data: In multi-modal datasets, image data is sometimes embedded in tables, such as in
autonomous driving, where road images are paired with tabular data. Convolutional Neural
Networks (CNNs) process images, but integrating image features with tabular data requires
feature fusion techniques. Hybrid models like TabTransformer use attention mechanisms to
merge image and tabular data, enhancing predictive performance [19].
• Hyperlinks: Hyperlinks, though uncommon in traditional tabular datasets, are increasingly
used in web data applications or web documents [20]. When tables include URLs, advanced
preprocessing is required to extract metadata or context from the linked pages, often using
NLP models to incorporate this information into the feature set.
• Video Data: Video data in tabular formats provides valuable temporal information for do-
mains like autonomous driving and traffic management. Keyframes from videos are processed
using 3D-CNNs or Recurrent Neural Networks (RNNs) to capture spatial and temporal fea-
tures, which are then integrated with tabular data to improve model predictions, such as in
crash prediction models where video features enhance the understanding of road conditions
and driver behavior [21, 22].
• Emoji: Emojis, common in social media and messaging platforms, enhance communication by visually conveying emotions or objects [23] but pose challenges for encoding sentiment. Deep learning models use character-level or emoji embeddings to map them to sentiment vectors, enabling effective interpretation alongside other data types.
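To make the preprocessing steps above concrete, the following sketch combines several of them in one pipeline: scaling numerical columns, one-hot encoding a binary column, extracting hour and month from a timestamp, and vectorizing free text with TF-IDF. It is a minimal illustration using pandas and scikit-learn; the crash-event column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical crash-event table containing the data types discussed above.
df = pd.DataFrame({
    "vehicle_speed": [62.0, 45.5, 80.1],           # numerical
    "airbag_deployed": ["Yes", "No", "Yes"],       # binary
    "crash_time": pd.to_datetime(
        ["2023-01-05 08:15", "2023-02-11 17:40", "2023-03-20 23:05"]),
    "description": ["rear-end on wet road",        # free text
                    "side impact at junction",
                    "single vehicle, icy curve"],
})

# Timestamps: extract coarse temporal features before modeling.
df["hour"] = df["crash_time"].dt.hour
df["month"] = df["crash_time"].dt.month

pre = ColumnTransformer([
    ("num", StandardScaler(), ["vehicle_speed", "hour", "month"]),
    ("bin", OneHotEncoder(drop="if_binary"), ["airbag_deployed"]),
    ("txt", TfidfVectorizer(), "description"),     # text -> sparse TF-IDF vectors
])
X = pre.fit_transform(df)
print(X.shape)   # one numeric feature matrix, ready for a deep learning model
```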
Tabular data, composed of rows and columns, lacks the spatial or sequential structure found in
images and text, making it difficult to apply traditional deep learning models like CNNs, which rely
on spatial coherence. Unlike images or sequences, tabular data is permutation-invariant: reordering columns or rows does not change feature relationships, and deep learning models struggle without the inductive biases that machine learning models like XGBoost and Random Forests possess. Machine learning models
excel in handling heterogeneous feature types, non-local interactions, and small, high-dimensional
datasets, where deep learning models often overfit and fail to generalize.
To address the limitations of traditional deep learning models when applied to tabular data,
recent advancements have led to the development of specialized architectures such as TabNet,
TabTransformer, and SAINT. These models introduce mechanisms like attention layers, feature
embeddings, and hybrid architectures to dynamically focus on the most relevant features, improving
their ability to handle the complexity of heterogeneous tabular data. For instance, TabNet [24]
employs sequential attention mechanisms for instance-wise feature selection, while TabTransformer
[19] uses self-attention layers to capture feature dependencies more effectively than CNNs. SAINT
[25] enhances this approach by incorporating intersample attention, enabling the model to capture
relationships between data rows. Moreover, models such as TabTranSELU [26] and GNN4TDL [27]
are designed to efficiently manage both categorical and numerical features by employing hybrid
structures and regularization techniques, which help mitigate overfitting and improve generalization.
These innovations have enabled deep learning models to rival or surpass traditional machine learning methods in tasks involving tabular data, including fraud detection and predictive analytics.
Additionally, novel techniques such as transforming tabular data into image-like structures [2,
28], employing multi-view representation learning, and extracting schemas from tabular data
[29] further contribute to overcoming the challenges posed by the absence of inherent spatial
relationships in tabular datasets.
By leveraging these advancements, recent tabular deep learning models not only address the
unique challenges of tabular data but also offer significant improvements in performance, in-
terpretability, and scalability compared to both traditional deep learning and machine learning
approaches. These innovations demonstrate the growing potential for deep learning in handling
complex, non-spatial data across a wide range of real-world applications.
Similarly, Hellerstein [37] addresses the inherent challenges of working with tabular data,
particularly when it lacks a grid-like structure typically seen in other data types such as images or
text. The study focuses on automating the transformation of unstructured tables into tidy, relational
forms suitable for analysis. It also introduces the idea that clean data tables can be considered as
grids of cells, somewhat analogous to pixels in an image, where adjacent rows and columns may
exhibit patterns. While deep learning models excel in pattern recognition in image grids, detecting
such patterns in tabular data is far more difficult due to the diversity in how tables are structured
and the absence of explicit spatial relationships. Ucar et al. [1] discussed the challenges posed
by the lack of inherent spatial structure in tabular data. While image data benefits from spatial
coherence (e.g., neighboring pixels are spatially correlated), and text or audio from semantic and
temporal structures, tabular data lacks such clear patterns. This makes it difficult to apply common
augmentation techniques like cropping or rotation, which are highly effective in domains such
as image processing. To overcome these limitations, the authors propose the SubTab framework,
which divides input features of tabular data into subsets, analogous to feature bagging or image
cropping, to generate different views of the data.
By reconstructing full data from these subsets, the framework forces the model to learn better
representations of tabular data in a self-supervised setting, despite the absence of grid-like structure.
This approach enables the model to discover patterns and relationships within tabular data that are
not immediately apparent, and the results demonstrate that SubTab can achieve state-of-the-art
performance on various datasets.
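The core SubTab idea, reconstructing all features from a subset, can be sketched in a few lines of PyTorch. This is a simplified illustration of the feature-subsetting principle, not the authors' implementation, which uses multiple overlapping subsets and additional contrastive and distance losses.

```python
import torch
import torch.nn as nn

class SubsetAutoencoder(nn.Module):
    """Reconstruct the FULL feature vector from one feature subset (SubTab-style)."""
    def __init__(self, n_features: int, subset_size: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(subset_size, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x_subset):
        return self.decoder(self.encoder(x_subset))

n_features, subset_size = 20, 10
x = torch.randn(32, n_features)                      # a batch of tabular rows
idx = torch.randperm(n_features)[:subset_size]       # one random feature subset ("view")
model = SubsetAutoencoder(n_features, subset_size)
loss = nn.functional.mse_loss(model(x[:, idx]), x)   # reconstruct all features from the subset
loss.backward()                                      # self-supervised training signal
```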
In an effort to refine this approach, Wang and Sun [38] introduced TransTab, as shown in Figure
3 and Figure 4 below, a model that uses transformers to encode tabular data by treating rows
(samples) and columns (features) as sequences. Figure 3 illustrates TransTab’s ability to handle
tasks like transfer learning, feature incremental learning, and zero-shot inference, demonstrating its
adaptability across different tabular data tasks. Figure 4 details the framework, where categorical,
binary, and numerical features are tokenized and processed through a gated transformer with
multi-head attention, enabling efficient learning of feature interactions. This structured approach
allows TransTab to handle variable-column tables and facilitates knowledge transfer, even across
tables with different structures, enabling more effective learning and generalization across tasks
and domains. The model focuses on learning generalizable representations, which can be applied
to different datasets, overcoming the limitations imposed by the nonspatial nature of tabular data.
The contextualization of columns and cells in TransTab introduces a structured way to interpret
relationships within tabular data, enabling more effective learning and generalization across tasks
and domains.
In a similar effort Ghorbani et al. [39] introduce the Feature Vectors method, which generates
feature embeddings that capture both the importance and semantic relationships between features.
Drawing inspiration from word embeddings in NLP, where words that frequently co-occur in the
same context share similar embeddings, the authors apply a similar approach to features in tabular
data. However, given that tabular data lacks natural co-occurrence structures, the authors propose
using decision trees to extract co-occurrence relationships between features. By analyzing decision
paths in tree-based models, they are able to create feature embeddings that preserve semantic
relationships despite the nonspatial nature of tabular data. Also, Geisler and Binnig [40] discussed
the challenges of applying existing model explanation methods, such as Local Interpretable Model-
agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), to tabular data. These
methods, originally designed for data with spatial or temporal relationships, often fall short when
applied to tabular data due to the absence of clear spatial or sequential patterns. To overcome this,
they propose the Quest framework, which generates explanations in the form of relational queries
specifically tailored for tabular data. Quest uses surrogate models and query-driven explanations
to address the unique structure of tabular data, offering a more semantically rich and intuitive
understanding of model behavior. By focusing on relational query predicates, Quest is able to
explain not only why a model produced a particular output but also why it did not produce an
alternative output.
[Fig. 3: Four usage scenarios supported by TransTab, contrasted with existing fixed-table pretrain + finetune pipelines: (1) transfer learning across tables, (2) feature incremental learning, (3) pretrain + finetune, and (4) zero-shot inference.]
[Fig. 4: The TransTab framework: tokenized features are encoded into a representation Z that feeds a projector trained with supervised, self-supervised contrastive, and supervised contrastive objectives.]
Moving in the same direction, Zhu et al. [2] explain that CNNs excel when applied to data with
spatial or temporal relationships, such as the arrangement of pixels in images or the sequential
nature of text, allowing them to capture local patterns through convolutions. However, the absence
of these structures in tabular data presents a significant challenge for CNN-based modeling. To
address this challenge, the authors propose a novel algorithm called the Image Generator for
Tabular Data (IGTD), which transforms tabular data into image-like structures by assigning tabular
features to pixel positions while preserving feature relationships. This transformation introduces a
form of spatial relationship in the data, allowing CNNs to process tabular data more effectively.
The study demonstrates that these image representations help CNNs capture feature relationships
and improve predictive performance compared to models trained on raw tabular data. The IGTD
method addresses the lack of spatial or sequential dependencies in tabular data by creating artificial
spatial relationships, making it more compatible with deep learning models designed for structured
data.
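The general idea of giving tabular features artificial spatial positions can be illustrated with a toy transformation: order features so that correlated ones sit near each other, then reshape each row into a small image. Note that IGTD itself optimizes the feature-to-pixel assignment iteratively by matching feature-distance and pixel-distance ranks; the greedy ordering below is only a stand-in for that idea.

```python
import numpy as np

def features_to_image(X: np.ndarray, side: int):
    """Toy stand-in for IGTD: order features so correlated ones sit near each
    other, then reshape each row into a (side x side) image for a CNN.
    (IGTD itself optimizes the assignment iteratively; this shows only the idea.)"""
    assert X.shape[1] == side * side
    corr = np.abs(np.corrcoef(X, rowvar=False))      # feature-feature similarity
    order = np.argsort(corr.sum(axis=0))[::-1]       # greedy: most-connected features first
    return X[:, order].reshape(-1, 1, side, side)    # (N, channels, H, W) for a CNN

X = np.random.rand(100, 64)          # 100 samples, 64 features -> 8x8 images
images = features_to_image(X, 8)
print(images.shape)                  # (100, 1, 8, 8)
```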
[Fig. 5: Transfer learning with data augmentation: a CNN pre-trained on a large source dataset (C1–Cn) is migrated to a smaller target dataset, with global average pooling and a softmax output layer.]
Another widely used remedy is transfer learning, which has been successful in mitigating overfitting in small datasets for computer vision
and NLP. For example, Zhao [43] proposes using transfer learning combined with data augmentation
to tackle overfitting in small datasets as shown in Figure 5. By pre-training a CNN on large datasets
like ImageNet, and then fine-tuning it on smaller datasets, the model can leverage previously
learned representations to improve performance on the target task. Jain et al. [41] extend this
concept to tabular data by converting tabular datasets into image representations using techniques
like IGTD and SuperTML. These methods allow deep learning models, particularly CNNs, to be
applied to tabular data by transforming it into an image-like format that can take advantage of
pre-trained models, thus reducing overfitting. Horenko [45] introduces the entropy-optimal scalable probabilistic approximation algorithm to breach the overfitting barrier at a fraction of
the computational cost. Badger [46] demonstrates the effectiveness of small language models in
processing tabular data without extensive preprocessing, achieving record classification accuracy.
Another promising solution is the HyperTab method proposed by Wydmański et al. [47], which
uses a hypernetwork-based approach to build an ensemble of neural networks specialized for
small tabular datasets. A general HyperTab structure is shown in Figure 6. By employing feature
subsetting as a form of data augmentation, HyperTab virtually increases the number of training
samples without altering the number of parameters. This approach allows the model to generalize
better by preventing overfitting, particularly on small datasets.
[Fig. 6: General HyperTab structure: a binary feature mask (e.g., 1 0 0 1 0 1 over X1–X6) selects a feature subset and is fed to a hypernetwork, which generates the weights of the target network.]
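A toy version of the hypernetwork principle behind HyperTab might look as follows: a binary feature-subset mask is fed to a hypernetwork, which emits the weights of a small target network applied to the masked features. This is a deliberately minimal sketch (a linear target network and a single mask), not the published architecture.

```python
import torch
import torch.nn as nn

class HyperTabSketch(nn.Module):
    """Toy hypernetwork: a binary feature mask is mapped to the weights of a
    small target network that scores the selected features (HyperTab-style)."""
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.n = n_features
        # Hypernetwork emits the weights + bias of a linear target network.
        self.hyper = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features + 1))

    def forward(self, x, mask):
        params = self.hyper(mask)                 # generated target-network parameters
        w, b = params[:self.n], params[self.n]
        return (x * mask) @ w + b                 # target net applied to masked features

model = HyperTabSketch(n_features=6)
x = torch.randn(8, 6)
mask = torch.tensor([1., 0., 0., 1., 0., 1.])     # one feature subset of the ensemble
print(model(x, mask).shape)                       # torch.Size([8])
```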
Despite the rise of deep learning, traditional machine learning models are still preferred in certain cases today. Additionally, their faster training and deployment times make them ideal for applications where real-time decisions are necessary. These classical approaches were well suited to small-scale tabular datasets but limited
to classification and regression tasks. However, these traditional models are not without their
limitations. For instance, Clark et al. [49] point out that logistic regression models can encounter
problems such as complete and quasi-complete separation, where the model either perfectly or
nearly perfectly separates the data. This can lead to extremely large or infinite coefficient estimates,
making statistical inference unreliable. Moreover, logistic regression is particularly sensitive to small
sample sizes, especially when dealing with low-frequency categorical variables, which can worsen
separation issues. To counter this, covariates are often removed, or categories merged, but such
actions can lead to oversimplification and a reduction in the model’s predictive power. Similarly,
Carreras et al. [50] highlight several drawbacks of SVMs, particularly in the soft margin variant.
These include the risk of overfitting, increased computational complexity when feature selection
is involved, and the non-convex nature of the optimization problem, which complicates finding
optimal solutions. Additionally, achieving a complete Pareto front in multi-objective optimization
is difficult due to new parameters leading to similar solutions despite weight variations.
Expanding on the strengths and limitations of traditional models like SVMs and logistic regression,
models like Decision Trees, Naive Bayes, and early Neural Networks also relied on manual feature
engineering, requiring domain expertise to select relevant features [51]. This process, while labor-
intensive, allowed these models to perform effectively on smaller datasets. Abrar and Samad [52]
emphasize that while fully connected deep neural networks have become popular in recent years,
traditional machine learning models like gradient boosting trees still outperform deep learning
models in many cases, particularly when dealing with tabular data containing uncorrelated variables.
This study highlights that traditional models like gradient boosting trees are superior when deep
models fail, especially in datasets that lack the strong correlations often found in real-world data.
| Model (Year) [Source] | Architecture | Training Efficiency | Main Features |
|---|---|---|---|
| VIME (2020) [58] | Neural network + masked and feature-vector estimation | Self- and semi-supervised learning; Moderate | Self-supervised learning and contextual embedding |
| NON (2020) [59] | Field-wise + across-field + operation fusion network | Supervised learning; Moderate | Network-on-network model |
| Net-DNF (2020) [60] | Affine literals + conjunction + output layer | Supervised learning; Moderate | Structure based on disjunctive normal form |
| TabTransformer (2020) [61] | Transformer-based architecture + contextual embeddings | Supervised and semi-supervised learning; Moderate | Transformer network for categorical data |
| TabNet (2019) [24] | Sequential attention + sparse feature selection + feature transformer | Supervised learning; Moderate | Sequential attention structure |
| NODE (2019) [62] | Differentiable oblivious decision trees (ODT) | Supervised learning; Moderate | Differentiable decisions made with classic decision trees via the entmax transformation |
| DeepGBM (2019) [63] | Hybrid approach integrating GBDT with NN | Supervised learning; Moderate to high | Two DNNs distill knowledge from decision trees |
| SuperTML (2019) [64] | CNN-based | Supervised learning; Moderate | Transforms tabular data into images for CNNs |
| xDeepFM (2018) [65] | Hybrid neural network | Supervised learning; Moderate to high | Embedding layer + compressed interaction network + DNN |
| TabNN (2018) [66] | Automatic feature grouping + recursive encoder with shared embedding | Supervised learning; Moderate to high | DNNs based on feature groups distilled from GBDT |
| RLN (2018) [67] | Regularization mechanism + sparse network | Supervised learning; Moderate | Hyperparameter regularization scheme |
| DeepFM (2017) [68] | Factorization machines + deep neural networks | Supervised learning; Moderate to high | Combines low- and high-order feature interactions + shared embedding layer |
| Wide&Deep (2016) [69] | Memorization (wide component) and generalization (deep component) | Supervised learning; High | Cross-product feature transformations for memorization + embedding layer for categorical features |
Multilayer perceptrons (MLPs) remain a common baseline for tabular data, though they have received less research focus compared to more advanced methods. While MLPs are effective, they are often surpassed by specialized or more sophisticated architectures in handling the complexities of tabular data.
Early applications of shallow neural networks, particularly FCNs, to tabular data often under-
performed compared to specialized models like GBDTs. However, recent studies have shown that
with proper tuning and architectural enhancements, neural networks can rival or surpass GBDTs.
Chen et al. [74] emphasized the efficiency of shallow networks in handling unordered tabular data,
while Erichson et al. [75] demonstrated their competitiveness in tasks like fluid dynamics, where
fast training and regularization were key advantages. Rubachev et al. [76] further noted that with
optimized tuning and techniques like unsupervised pre-training, shallow networks can close the
performance gap with GBDTs, although this gain is context-dependent. Fiedler [77] introduced
structural innovations such as Leaky Gates and Ghost Batch Norm, which significantly enhanced
MLPs for tabular data, enabling them to outperform GBDTs in several cases. Figure 7 shows the
original and the modified MLP+ model. These advancements demonstrate that shallow networks, when effectively optimized, can meet the unique challenges of tabular data and compete with traditional models.
[Fig. 7: The original MLP (left) and the modified MLP+ (right): embedded categorical and non-categorical inputs pass through Linear, (Ghost) Batch Norm, activation, and dropout layers, with multiple heads combined by a weighted average before the loss.]
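Ghost Batch Norm, one of the modifications shown in Figure 7, is straightforward to sketch: batch statistics are computed over small "virtual" batches rather than the full batch, which acts as a regularizer. A minimal PyTorch version, assuming a fixed virtual batch size, is shown below.

```python
import torch
import torch.nn as nn

class GhostBatchNorm(nn.Module):
    """Ghost Batch Norm: apply BatchNorm over small 'virtual' chunks of the
    batch, which adds regularization (as used in the MLP+ variants)."""
    def __init__(self, num_features: int, virtual_batch_size: int = 16):
        super().__init__()
        self.vbs = virtual_batch_size
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x):
        # Split the batch into virtual batches and normalize each separately.
        chunks = x.chunk(max(1, x.size(0) // self.vbs), dim=0)
        return torch.cat([self.bn(c) for c in chunks], dim=0)

gbn = GhostBatchNorm(num_features=10, virtual_batch_size=16)
print(gbn(torch.randn(64, 10)).shape)   # torch.Size([64, 10])
```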
These findings align with the broader consensus that the standard architecture of FCNs often
lacks the inductive biases necessary for effectively modeling the complexities of tabular data, such
as categorical variables, missing data, and imbalanced datasets. Specialized neural networks are
frequently required to address these challenges. However, Grinsztajn et al. [9] offer a more optimistic
view, demonstrating that shallow FCNs, like MLPs, can remain competitive when combined with
regularization techniques to mitigate overfitting and generalization issues. They further suggest
that even simple architectures, such as ResNet, can match the performance of more advanced
models, indicating that, with proper modifications, shallow networks can still play a valuable role
in handling tabular data.
[Figure: TabNet's encoder-decoder architecture: sequential decision steps (Step 1, Step 2) with fully connected layers, ghost batch normalization, and aggregation produce an encoded representation and per-feature attributes.]
3.3.1 TabNet. TabNet's sequential attention masks can be aggregated to explain feature importance and quantify overall contributions [79]. The newer InterpreTabNet builds on
this by improving feature attribution methods, further enhancing interpretability, and making the
model’s decisions more transparent at both local and global levels [78].
TabNet excels in working with raw tabular data without the need for extensive preprocessing or
manual feature engineering, which is often required by traditional models like GBDTs. Its end-to-
end learning capabilities allow TabNet to directly process raw data, simplifying workflows while
maintaining high performance. Additionally, TabNet introduces self-supervised learning, a novel
feature for tabular data, where it can pre-train on unlabeled data using masked feature prediction
to improve performance on supervised tasks, particularly when labeled data is scarce. Evaluated
on various datasets, TabNet has been shown to outperform or match state-of-the-art models,
including GBDTs, in both classification and regression tasks. For example, in facies classification,
it achieves superior accuracy compared to traditional tree-based models and other deep learning
architectures like 1D-CNNs and MLPs [79]. Its flexible architecture, which incorporates sequential
feature transformers and attention mechanisms, enhances generalization across different domains,
while the use of sparse attention ensures interpretability, addressing a key limitation of traditional
deep learning models.
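In practice, TabNet is commonly used through the community pytorch-tabnet package, which exposes a scikit-learn-style interface. The snippet below, on synthetic data, assumes that package is installed; n_d, n_a, and n_steps control the width of the decision and attention layers and the number of sequential decision steps.

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier  # pip install pytorch-tabnet

# Synthetic stand-in data: 10 numeric features, binary target.
X_train = np.random.rand(800, 10).astype(np.float32)
y_train = np.random.randint(0, 2, 800)
X_valid = np.random.rand(200, 10).astype(np.float32)
y_valid = np.random.randint(0, 2, 200)

clf = TabNetClassifier(n_d=8, n_a=8, n_steps=3)  # decision/attention width, # of steps
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], max_epochs=10, patience=5)

print(clf.predict(X_valid)[:5])     # class predictions
print(clf.feature_importances_)     # global, mask-derived feature importances
```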
TabNet, despite its innovations like interpretability and sparse attention, is often outperformed
by XGBoost across various datasets, requiring more hyperparameter tuning and showing less
consistent results [80]. Additionally, TabNet’s training time is significantly longer, making it less
practical for quick iterations or real-time applications [81]. It is also prone to overfitting on smaller
datasets due to its complex architecture, especially when not tuned correctly.
3.3.2 Neural Oblivious Decision Ensembles (NODE). NODE was proposed to address the specific challenges of applying deep learning to tabular data, which has traditionally been dominated by tree-based models like GBDT. Popov et al. [62] identified the key limitations of
deep learning models in handling tabular data, primarily due to their inability to outperform
GBDTs consistently. To bridge this gap, NODE was introduced as a deep learning architecture that
generalizes ensembles of oblivious decision trees, offering end-to-end gradient-based optimization
and multi-layer hierarchical representation learning. This design allows NODE to capture complex
feature interactions within tabular data, a task where traditional deep learning models often fall
short. One of NODE’s key innovations is the use of differentiable oblivious decision trees, where
splitting decisions are made through the entmax transformation, allowing for soft, gradient-based
feature selection. This approach makes the decision-making process more flexible and differentiable,
unlike conventional decision trees that rely on hard splits.
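A single differentiable "oblivious" split can be sketched as follows: an entmax distribution softly (and sparsely) selects which feature to split on, and a smooth threshold function replaces the hard comparison. This is a simplified illustration assuming the entmax package; NODE composes many such splits into oblivious trees and uses an entmoid rather than a plain sigmoid for thresholding.

```python
import torch
from entmax import entmax15   # pip install entmax (assumed available)

def soft_oblivious_split(x, feature_logits, threshold, temperature=1.0):
    """One differentiable 'oblivious' split (NODE-style sketch):
    entmax yields a sparse, soft choice of the split feature, and a sigmoid
    (stand-in for NODE's entmoid) gives a soft left/right routing decision."""
    feat_weights = entmax15(feature_logits, dim=-1)        # sparse feature selector
    chosen = x @ feat_weights                              # soft-selected feature value
    return torch.sigmoid((chosen - threshold) / temperature)  # P(route right)

x = torch.randn(8, 5)                         # batch of 8 rows, 5 features
logits = torch.randn(5, requires_grad=True)   # learnable feature-choice logits
p_right = soft_oblivious_split(x, logits, threshold=torch.tensor(0.0))
p_right.sum().backward()                      # gradients flow through the split
```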
Additionally, NODE’s multi-layer architecture is designed to capture both shallow and deep
interactions within tabular data, effectively functioning as a deep, fully differentiable GBDT model
trained end-to-end via backpropagation [62]. The architecture of NODE stacks multiple layers of dif-
ferentiable oblivious decision trees, which enables NODE to outperform existing tree-based models
in many tasks. Furthermore, NODE enhances computational efficiency by allowing pre-computation
of feature selectors, significantly speeding up inference without sacrificing accuracy. Joseph [82]
explored NODE within the PyTorch Tabular framework, which simplifies deep learning for tabular
data by offering a unified API that integrates both NODE and TabNet. This framework addresses the
complexity of training deep learning models compared to traditional machine learning libraries like
Scikit-learn, making advanced models more accessible for practitioners and researchers. Fayaz et al.
[80] compared NODE, TabNet, and XGBoost, noting that while NODE introduces key innovations
such as handling mixed data types and data imbalance, it often requires more hyperparameter
tuning than XGBoost. However, combining NODE with XGBoost enhances performance, showing
NODE’s strength in complementing traditional models for tabular data.
[Figure: Timeline of influential tabular deep learning publications, 2019–2024 (e.g., Semek 2019, Arik 2019, Raschka 2020, Ruff 2021, Borisov 2021, Grinsztajn 2021, Cai 2021, Karim 2022, Achtibat 2022, Liu 2022, Zhu 2023, Fuhl 2023, Shirkavand 2023, Jun 2024).]
Building on these developments, Table 2 presents a timeline of major models introduced during
this period, detailing their architectures and key performance traits. These models highlight the
significant breakthroughs in tabular deep learning, from hybrid architectures to advanced attention
mechanisms, which have propelled performance and scalability forward.
Table 2. Major tabular deep learning models introduced in 2021–2022.

| Model (Year) [Source] | Architecture | Training Efficiency | Main Features |
|---|---|---|---|
| TabLLM (2022) [84] | Large language models | Few-shot supervised learning; Moderate | Serializes tabular data into natural language strings |
| TabDDPM (2022) [85] | Multinomial diffusion + Gaussian diffusion | Supervised learning; Moderate to high | Multinomial and Gaussian diffusion to handle categorical and numerical features |
| PTab (2022) [86] | Pre-trained language model architecture | Supervised and self-supervised learning; N/A | Three-stage training strategy (modality transformation, masked-language fine-tuning, and classification fine-tuning) |
| GANDALF (2022) [87] | Gated feature learning units (GFLUs) | Supervised learning; High | GFLUs with learnable feature masks and hierarchical gating mechanisms |
| ARM-Net (2021) [88] | Exponential neurons + gated attention mechanism + sparse softmax | Supervised learning; High | Adaptive relational modeling with multi-head gated attention network |
| NPT (2021) [89] | Attention-based NN + datapoints + attributes | Self-supervised learning; Moderate to low | Processes the entire dataset at once, using attention between data points |
| SAINT (2021) [25] | Hybrid architecture with both self-attention and intersample attention mechanisms | Self-supervised contrastive learning + supervised learning; Moderate | Attention over both rows and columns |
| Regularized DNNs (2021) [90] | Plain multilayer perceptron | Supervised learning; Moderate to high | A "cocktail" of regularization techniques |
| Boost-GNN (2021) [91] | GBDT + GNN | Semi-supervised learning; Moderate to high | GNN on top of decision trees from the GBDT algorithm |
| DNN2LR (2021) [92] | DNN + LR | Supervised learning; High | Calculates cross-feature fields with DNNs for LR |
| IGTD (2021) [2] | CNN-based neural network | Supervised learning; High | Transforms tabular data into images for CNNs |
| SCARF (2021) [93] | Encoder + pre-train head network | Self-supervised contrastive, semi-supervised, and fully supervised learning; N/A | Self-supervised contrastive learning with random feature corruption |
4.1 TabTransformer
The TabTransformer model introduces significant advancements in tabular deep learning by lever-
aging attention mechanisms and hybrid architectures to address the unique challenges posed by
tabular data [19]. At its core, TabTransformer employs multi-head self-attention layers adapted from
the Transformer architecture, traditionally used in NLP, to capture complex feature interactions and
dependencies across the dataset as seen in Figure 10. This attention mechanism enables the model
to effectively capture relationships between features, making it particularly useful for datasets with
numerous categorical variables.
The TabTransformer architecture combines transformer layers with MLP components, forming
a hybrid structure optimized for tabular data. Categorical features are embedded using a column
embedding layer, which transforms each category into a dense, learnable representation. These
embeddings are passed through Transformer layers, which aggregate contextual information from
other features to capture interdependencies.
[Fig. 10: TabTransformer architecture: column embeddings pass through N Transformer blocks (multi-head attention, add & norm, feed-forward), are concatenated with continuous features, and feed a multi-layer perceptron trained against the loss.]
The contextualized categorical features are then
concatenated with continuous features and processed through the MLP for final prediction. This
design leverages the strengths of both contextual learning for categorical data and traditional
MLP benefits for continuous data. Additionally, TabTransformer incorporates masked language
modeling and replaced token detection, enabling it to pre-train on large amounts of unlabeled data,
thus improving performance in low-labeled data scenarios and making it effective for real-world
applications.
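The forward pass described above can be condensed into a short PyTorch sketch: embed each categorical column, contextualize the embeddings with a Transformer encoder, concatenate with the continuous features, and predict with an MLP. This is a minimal illustration of the architecture, omitting the column-embedding details and the pre-training objectives of the published model.

```python
import torch
import torch.nn as nn

class TabTransformerSketch(nn.Module):
    """Minimal TabTransformer-style model: embed categorical columns, contextualize
    them with a Transformer encoder, concatenate with continuous features, then MLP."""
    def __init__(self, cardinalities, n_cont, d=32, n_heads=4, n_layers=2):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(c, d) for c in cardinalities])
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlp = nn.Sequential(
            nn.Linear(d * len(cardinalities) + n_cont, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x_cat, x_cont):
        tokens = torch.stack([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        ctx = self.encoder(tokens)                        # contextualized categorical tokens
        flat = ctx.flatten(1)                             # (B, n_cat * d)
        return self.mlp(torch.cat([flat, x_cont], dim=1))

model = TabTransformerSketch(cardinalities=[10, 5, 7], n_cont=4)
x_cat = torch.randint(0, 5, (16, 3))                      # 3 categorical columns
x_cont = torch.randn(16, 4)                               # 4 continuous columns
print(model(x_cat, x_cont).shape)                         # torch.Size([16, 1])
```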
Recent advancements in TabTransformer models, such as the self-supervised TabTransformer
introduced by Vyas [94], further refine this architecture by leveraging MLM in a pre-training phase to
learn from unlabeled data. This self-supervised approach enhances the model’s ability to generalize
by capturing intricate feature dependencies through self-attention mechanisms. By combining
Transformer layers with MLP for final prediction, the model handles mixed data types and smaller
dataset sizes effectively. However, trade-offs exist: while the model demonstrates strong performance gains, particularly in semi-supervised settings, the reliance on masked language modeling pre-training increases computational overhead, potentially limiting scalability. Interpretability remains
moderate, with attention scores providing insights into feature importance, though the model is
less interpretable than traditional models like GBDT.
Another significant advancement is the GatedTabTransformer, introduced by Cholakov and Kolev
[95], which enhances the original TabTransformer by incorporating a gated multi-layer perceptron.
This modification improves the model’s ability to capture cross-token interactions using spatial
gating units. The GatedTabTransformer boosts performance by approximately 1 percent in AUROC
compared to the standard TabTransformer, especially in binary classification tasks. However, this
comes at the cost of increased computational complexity due to the additional processing required
for the spatial gating units. While the model shows improved performance, its scalability and interpretability remain limited compared to simpler models like MLPs or GBDTs.
Therefore, while TabTransformer models offer notable improvements in handling tabular data
through attention mechanisms and hybrid architectures, they present trade-offs in terms of perfor-
mance, scalability, and interpretability. Recent variations like the self-supervised TabTransformer
and GatedTabTransformer demonstrate the potential of these models to outperform traditional
approaches, though at the cost of higher computational demands.
4.2 FT-Transformer
The FT-Transformer model, as presented by Gorishniy et al. [96], introduces a novel approach to
addressing the challenges inherent in tabular data by leveraging attention mechanisms, hybrid
architectures, and transformer-based methodologies. The model adapts the attention mechanism,
originally designed for tasks like NLP, to process tabular data. In this context, the attention mecha-
nism allows the model to capture complex relationships between heterogeneous features, including
both numerical and categorical data as shown in Figure 11. By using attention to dynamically
prioritize certain features, the model effectively models interactions that are often difficult to detect
in traditional tabular data approaches.
[Fig. 11: The FT-Transformer feature tokenizer: each categorical feature x_i^(cat) is mapped to a d-dimensional token via a learned lookup W_i^(cat) and bias b_i^(cat), while numerical features are multiplied by learned d-dimensional vectors, yielding one token per feature.]
In addition to attention, the FT-Transformer employs a hybrid architecture that integrates feature
tokenization. This process transforms both numerical and categorical features into embeddings,
which are then processed through layers of the Transformer architecture. The result is a model that
is highly flexible and capable of handling diverse types of tabular data, a crucial advantage for tasks
where tabular data can vary widely in feature types and distributions. This hybrid design bridges
traditional feature encoding methods with the robust learning capabilities of Transformer-based
approaches, enabling better generalization across different datasets.
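The feature tokenizer is the distinctive ingredient and is easy to sketch: each numerical feature is multiplied by a learned d-dimensional vector (plus a bias), and each categorical feature is looked up in an embedding table, so every feature becomes one token for the Transformer. A minimal PyTorch version follows; the [CLS] token and the Transformer backbone are omitted.

```python
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    """FT-Transformer-style tokenizer sketch: every numerical feature becomes a
    learned vector scaled by its value (plus bias); every categorical feature
    becomes an embedding. All features end up as d-dimensional tokens."""
    def __init__(self, n_num, cardinalities, d=32):
        super().__init__()
        self.w_num = nn.Parameter(torch.randn(n_num, d))   # one vector per numeric feature
        self.b_num = nn.Parameter(torch.zeros(n_num, d))
        self.cat_emb = nn.ModuleList([nn.Embedding(c, d) for c in cardinalities])

    def forward(self, x_num, x_cat):
        num_tokens = x_num.unsqueeze(-1) * self.w_num + self.b_num        # (B, n_num, d)
        cat_tokens = torch.stack(
            [e(x_cat[:, i]) for i, e in enumerate(self.cat_emb)], dim=1)  # (B, n_cat, d)
        return torch.cat([num_tokens, cat_tokens], dim=1)  # token sequence for a Transformer

tok = FeatureTokenizer(n_num=4, cardinalities=[10, 6])
tokens = tok(torch.randn(16, 4), torch.randint(0, 6, (16, 2)))
print(tokens.shape)   # torch.Size([16, 6, 32])
```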
Recent studies have demonstrated the effectiveness of the FT-Transformer across various applica-
tions. In the domain of heart failure prognosis, the FT-Transformer outperformed traditional models
like Random Forest and Logistic Regression by capturing the non-linear interactions between
medical features, such as demographic and clinical data [97]. The use of attention mechanisms
allowed the model to dynamically prioritize important health indicators, leading to more accurate
predictions. Similarly, in intrusion detection systems, the FT-Transformer showed superior accuracy
in identifying network anomalies by processing the highly structured nature of network traffic data
[98]. The hybrid architecture seamlessly integrated categorical and numerical features, improving
the model’s ability to detect both known and unknown threats. Additionally, advancements like
stacking multiple transformer layers have been employed to further enhance the model’s capacity
to capture long-range dependencies within the data, making it even more effective in complex tasks
[99]. While the FT-Transformer model demonstrates improved performance over other models,
such as ResNet and MLP, particularly on various tabular tasks, it comes with certain trade-offs. In
terms of interpretability, the model’s complexity poses challenges. Traditional models like GBDT
offer clearer interpretability, as their decision-making processes are more transparent. In contrast,
the FT-Transformer’s reliance on attention mechanisms and deep layers makes it harder to explain,
although the attention scores do provide some insight into feature importance. Furthermore, the
model’s scalability is another consideration; the computational demands of Transformer-based
models, especially the quadratic scaling of the attention mechanism with the number of features,
can become a limitation when applied to extensive datasets. Despite these limitations, the FT-
Transformer’s ability to generalize across diverse datasets makes it a promising model for tabular
data analysis, offering significant advancements in predictive performance.
Building on these advancements, we present a performance and log-loss comparison between
TabNet and FT-Transformer. As shown in Figure 12, the FT-Transformer consistently demonstrates
superior performance as the number of random search iterations increases, while the log-loss for
both models decreases at different rates. This comparison highlights FT-Transformer’s enhanced
generalization capabilities over TabNet, particularly in larger search spaces. While this figure
provides an illustrative example of performance differences, unlike the previous survey on tabular
deep learning [7], we have not offered a comparison of all tabular deep learning models, as a
comprehensive evaluation across multiple models and diverse datasets is beyond the scope of this
current survey. Future research should aim to conduct more extensive performance evaluations to
thoroughly examine the strengths and limitations of these models.
4.3 DeepGBM
The DeepGBM model represents an innovative approach to addressing the challenges of tabular data
in deep learning, leveraging a combination of advanced techniques such as attention mechanisms,
hybrid architectures, and knowledge distillation [63]. While the model does not explicitly employ
traditional attention mechanisms, it incorporates feature importance from GBDT, a method that
allows the model to prioritize certain features over others. This process mimics attention by
directing the model’s focus to the most informative features rather than treating all inputs equally.
By emphasizing the most relevant features, DeepGBM enhances its ability to handle both sparse
categorical and dense numerical data, a crucial requirement in tabular data tasks.
Recent advancements in tabular deep learning further underscore DeepGBM’s role in combining
neural networks with GBDT to achieve improved performance. In particular, the model’s hybrid
architecture utilizes CatNN to handle sparse categorical features through embeddings and factor-
ization machines, and GBDT2NN to convert the outputs of GBDT into a neural network format
optimized for dense numerical features [100]. Figure 13 shows the structure of DeepGBM. This
integration allows DeepGBM to leverage the strengths of both model types, overcoming limitations
in traditional approaches that struggle to process mixed feature types in a unified framework.
[Fig. 13: The structure of DeepGBM, combining CatNN (for sparse categorical features) and GBDT2NN (for dense numerical features).]
Though DeepGBM does not directly implement transformer models, it adopts ideas from
transformer-based architectures, particularly in the form of knowledge distillation. By distilling
the knowledge gained from GBDT trees into a neural network, including not just the predictions
but also the tree structure and feature importance, DeepGBM retains the benefits of GBDT while
enhancing its learning capacity [101]. This mirrors how transformers use distillation to simplify
complex models while preserving performance.
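The distillation idea can be sketched with a toy regression task: fit a GBDT, then train a neural network to match the GBDT's outputs rather than the raw labels. Note that DeepGBM's GBDT2NN also distills tree structure and leaf indices; the sketch below shows only the simplest output-matching form.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import GradientBoostingRegressor

# Toy GBDT -> NN distillation in the spirit of GBDT2NN: the network is trained
# to reproduce the tree ensemble's outputs, transferring its learned function.
X = np.random.rand(1000, 8).astype(np.float32)
y = X[:, 0] * 2 + np.sin(X[:, 1] * 6)                 # synthetic target

gbdt = GradientBoostingRegressor(n_estimators=100).fit(X, y)
soft_targets = torch.tensor(gbdt.predict(X), dtype=torch.float32).unsqueeze(1)

student = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
xb = torch.tensor(X)
for step in range(500):                # distill: match the GBDT, not the raw labels
    opt.zero_grad()
    loss = nn.functional.mse_loss(student(xb), soft_targets)
    loss.backward()
    opt.step()
```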
The trade-offs in DeepGBM between performance, interpretability, and scalability reflect broader
challenges in tabular deep learning. DeepGBM achieves higher accuracy by combining GBDT and
neural networks but sacrifices some interpretability, as the added complexity of the neural network
component reduces the transparency typically associated with tree-based models. Scalability is also
a challenge, as the neural network elements require greater computational resources. However,
models like WindTunnel have shown that this approach can boost accuracy while maintaining some
of the structural benefits of the original GBDT [101]. These trade-offs must be carefully balanced
depending on the application, as DeepGBM excels in performance and efficiency, particularly for
large-scale and real-time applications.
Fig. 14. (a) The DANets Abstract Layer (ABSTLAY), with learnable masks for feature grouping and abstraction; (b) an i-th Basic Block with dropout [102]
DANets also incorporate hybrid architectures that blend feature grouping and hierarchical
abstraction processes, similar to CNNs, but adapted for the unique structure of tabular data. The
introduction of the Abstract Layer (ABSTLAY) as seen in Figure 14 enables the model to group
correlated features and abstract higher-level representations through successive layers. Additionally,
shortcut paths are employed, allowing raw features to be reintroduced at higher levels of the
network, ensuring that critical information is retained and enhancing the model’s robustness,
particularly in deeper architectures. This design is similar to ResNet-style connections, where
residual pathways prevent information loss and degradation in deeper networks, thus boosting
performance.
DANets incorporate transformer-inspired ideas through the use of dynamic weighting and
attention-like mechanisms, allowing the model to selectively focus on important features during
the feature selection and abstraction processes. Although not a direct application of transformer
models, these methods improve the handling of tabular data and boost performance, making DANets
superior to traditional models like XGBoost and neural networks such as TabNet. However, this
performance comes at the cost of reduced interpretability. While attention-based feature selection
offers insights into the significance of specific features, the complexity of hierarchical abstraction
obscures the decision-making process, making it less transparent than simpler models like de-
cision trees. To address scalability, DANets utilize structure re-parameterization, which reduces
computational complexity during inference, allowing deeper networks without overwhelming
computational costs. Despite the performance boost from deeper architectures, the study notes
that additional depth yields diminishing returns due to the limited feature space in tabular data.
However, the complexity introduced by intersample attention makes it less intuitive to interpret
than simpler models. Lastly, while SAINT and SAINTENS scale well across large datasets, the
computational demands of the attention mechanisms, especially intersample attention, can make
these models more resource-intensive, particularly in larger datasets.
Fig. 15. Overview of TaBERT's Method for Jointly Learning Representations of Natural Language Utterances and Table Schemas, Using an Example From WikiTableQuestions [105]. (A) Content snapshot from the input table; (B) per-row encoding with a BERT transformer and cell-wise pooling; (C) vertical self-attention over aligned row encodings.
In terms of architecture, TaBERT uses a hybrid approach known as content snapshots to reduce
computational overhead. Instead of encoding all rows in a table, which would be costly, TaBERT
selects a subset of rows that are most relevant to the natural language query. This allows the model
to retain key information necessary for effective joint reasoning between text and tables while
reducing the burden of processing unnecessary data. However, this comes with a trade-off: while
content snapshots help scale the model to larger tables, there is a risk of losing critical information
if the selected rows do not adequately represent the table’s full structure and content.
[Figure: The TabTranSELU architecture: categorical embeddings and positionally encoded numerical inputs, with Gaussian noise (0.2 std), pass through a self-attention layer and feed-forward network with SELU activations, followed by a final dense layer.]
The model also employs a hybrid architecture, adapting the traditional Transformer design for
tabular data by simplifying its structure. Instead of utilizing the full stack of encoder and decoder
layers, as seen in NLP tasks, TabTranSELU uses just a single encoder and decoder layer. This
reduction in complexity helps tailor the architecture to the specific needs of tabular data without
sacrificing performance. Moreover, the model integrates elements of both neural networks and transformers.
| Model (Year) [Source] | Architecture | Training Efficiency | Main Features |
|---|---|---|---|
| MambaTab (2024) [3] | Mamba block + final prediction layer | Supervised and self-supervised learning; High | Structured state-space models + feature incremental learning |
| TP-BERTa (2024) [107] | Transformer-based, using relative magnitude tokenization + intra-feature attention | Supervised learning; Moderate | Transforms scalar numerical values into discrete tokens; integrates feature name-value pairs |
| SwitchTab (2024) [6] | Asymmetric encoder-decoder | Self-supervised learning; Moderate to high | Asymmetric encoder-decoder structure to decouple mutual and salient features, leveraging feature corruption |
| CARTE (2024) [108] | Graph NN architecture using graph-attention layers | Self-supervised learning; Moderate | Transforms each row of tabular data into a graph representation |
| BiSHop (2024) [109] | Hopfield-based framework | Supervised learning; Moderate to high | Bi-directional sparse Hopfield modules process tabular data column-wise and row-wise, using tabular embeddings for categorical and numerical features |
| LF-Transformer (2024) [106] | Column-wise transformer + row-wise transformer + latent factor embedding | Supervised learning; Moderate | Column-wise and row-wise attention, latent factor embeddings, and matrix factorization to capture feature interactions |
| TabTranSELU (2024) [26] | Self-attention + SELU activation + masked layer | Supervised learning; N/A | Applies positional encoding to numerical data and replaces ReLU with SELU activation |
| TabR (2023) [110] | Feed-forward NN augmented with a retrieval-based mechanism | Supervised learning; Moderate | Retrieval-augmented mechanism using L2-based nearest neighbors with a feed-forward NN |
| HYTREL (2023) [111] | Hypergraph structure-aware transformer (HyperTrans) | Self-supervised learning; High | Transforms tabular data into hypergraphs |
| ReConTab (2023) [5] | Asymmetric autoencoder | Self-supervised and semi-supervised learning; N/A | Transformer-based asymmetric autoencoder with feature corruption |
| GNN4TDL (2023) [27] | Graph neural network | Supervised learning; N/A | Transforms tabular data into graph structures using feature embeddings |
| Trompt (2023) [112] | Prompt learning | Supervised learning; Moderate | Prompt-inspired learning to derive sample-specific feature importances by combining column and prompt embeddings |
| XTab (2023) [113] | Transformer-based | Self-supervised learning; Moderate to high | Cross-table pretraining, data-specific featurizers, and embedding layers for categorical and numerical features |
Expanding the scope of transformer models, MambaTab integrates structured state-space mod-
els with feature incremental learning, capturing long-range dependencies in tabular data more
efficiently than standard self-attention mechanisms [3]. MambaTab’s ability to adapt to evolving
feature sets enhances its scalability, but it sacrifices interpretability, lacking the attention mecha-
nisms that explain feature importance in models like TabNet. SwitchTab employs an asymmetric
encoder-decoder architecture that decouples mutual and salient features through separate pro-
jectors, improving feature representation in tabular data [6]. By using feature corruption-based
methods, SwitchTab enhances performance and interpretability, but its complexity affects scalabil-
ity, making it less efficient for very large datasets. Context Aware Representation of Table Entries
(CARTE) also utilizes advanced architectures, combining a Graph Neural Network (GNN) with
graph-attention layers to represent each table row as a graphlet, enabling the model to capture com-
plex contextual relationships across tables [108]. CARTE excels in transfer learning and performs
well on heterogeneous datasets, although its graph-attention mechanisms reduce interpretability
and scalability with large datasets.
In the realm of tokenization and prompt-based models, TP-BERTa stands out by applying Relative
Magnitude Tokenization (RMT) to transform scalar numerical values into discrete tokens, effectively
treating numerical data as words in a language model framework [107]. Additionally, its Intra-
Feature Attention (IFA) module unifies feature names and values into a coherent representation,
reducing feature interference and enhancing prediction accuracy. However, this deep integration
impacts interpretability compared to more straightforward models like gradient-boosted decision
trees. Trompt employs prompt-inspired learning to derive sample-specific feature importance
through the use of column and prompt embeddings, which tailor the relevance of features for
each data instance [112]. While Trompt boosts performance, especially for highly variable tabular
datasets, the abstract nature of its embeddings compromises interpretability and adds complexity.
Several other models combine innovative mechanisms with existing architectures to address
tabular data challenges. TabR integrates a retrieval-augmented mechanism that utilizes L2-based
nearest neighbors along with a feed-forward neural network, enhancing local learning by retrieving
relevant context from the training data [110]. While this method significantly improves predictive
accuracy, it introduces computational overhead during training, affecting scalability. BiSHop lever-
ages Bi-directional Sparse Hopfield Modules to process tabular data both column-wise and row-wise,
capturing intra-feature and inter-feature interactions [109]. Its specialized tabular embeddings and
learnable sparsity provide strong performance but at the cost of reduced interpretability and higher
computational requirements, limiting its application to larger datasets.
Finally, Hypergraph-enhanced Tabular Data Representation Learning (HYTREL) addresses struc-
tural challenges in tabular data using a Hypergraph structure-aware transformer, representing
tables as hypergraphs to capture complex cell, row, and column relationships [111]. This enables
HYTREL to preserve critical structural properties and perform exceptionally well on tasks like
column annotation and table similarity prediction, though the complexity of hypergraphs reduces
interpretability. TabLLM, a novel approach, serializes tabular data into natural language strings
to allow large language models (LLMs) to process it as they would with text [84]. While effective
in zero-shot and few-shot learning scenarios, TabLLM faces scalability issues and interpretability
challenges due to the high computational demands of LLMs and their abstract representation of
tabular data.
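Serialization itself is simple to illustrate: each row is rendered as a short natural-language passage followed by the prediction question. The sketch below shows one plausible template; the column names and question are hypothetical, and TabLLM evaluates several such serialization strategies.

```python
def serialize_row(row: dict, question: str) -> str:
    """TabLLM-style serialization sketch: turn one tabular row into a natural
    language string that an LLM can consume for zero-/few-shot prediction."""
    facts = ". ".join(f"The {k.replace('_', ' ')} is {v}" for k, v in row.items())
    return f"{facts}. {question}"

row = {"age": 54, "vehicle_speed": 72, "road_condition": "wet"}  # hypothetical features
prompt = serialize_row(row, "Was this crash severe? Yes or no?")
print(prompt)
# The age is 54. The vehicle speed is 72. The road condition is wet. Was this crash severe? Yes or no?
```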
TabTransformer's self-attention performs implicit feature selection, where the most relevant features are dynamically emphasized based on their interactions with others, resulting in improved performance across datasets [61]. Figure 17 further exemplifies
this by demonstrating how Multi-Headed Self-Attention (MHSA) is applied between both features
and samples in a tabular deep-learning model. By focusing attention first on the relationships
between features and then between different samples, the model improves its ability to generalize
and capture complex feature interactions, enhancing accuracy in tabular data processing.
Fig. 17. Feature and Sample Attention Using MHSA to Optimize Tabular Data Classification and Generalization [114]
Building on this, SAINT introduces both self-attention and intersample attention mechanisms to
further refine feature selection. The self-attention mechanism in SAINT focuses on interactions
between features within a single data point, dynamically selecting important features based on
their relationships [25]. This is similar to TabNet’s emphasis on instance-specific feature selection
but extends beyond by capturing deeper interdependencies between features, improving model
adaptability and performance on heterogeneous datasets. SAINT’s novel intersample attention
adds another layer of sophistication by enabling data points to attend to other samples within
the dataset. This allows SAINT to better handle noisy or missing features by borrowing relevant
information from similar samples, a capability that is particularly useful in real-world datasets
where data quality may vary. This cross-sample attention mechanism significantly enhances feature
selection, making the model more robust to incomplete or corrupt data compared to traditional
models like GBDTs and MLPs.
TabNet [24], TabTransformer [19], and SAINT [25] all offer significant advances in interpretability. TabNet
operates at both local and global levels, enabling users to understand which features contribute
to individual predictions while also providing a broader view of the overall model behavior. This
transparency makes TabNet particularly useful in understanding model decisions for specific
samples. Similarly, SAINT improves interpretability through its attention-based structure. In SAINT,
attention maps highlight which features and samples are being prioritized during prediction,
making it easier to trace the model’s decision-making process and visualize feature importance.
TabTransformer also enhances interpretability by generating contextual embeddings that cluster
semantically similar features together in the embedding space. This clustering facilitates easier
visualization and interpretation of feature relationships, making the model more transparent.
In terms of feature selection, TabNet integrates attention directly into the learning process,
optimizing feature selection and model training simultaneously. Unlike traditional methods like
forward selection or Lasso regularization, which apply uniform selection across the dataset, TabNet’s
instance-wise selection adapts to the specific needs of each sample, resulting in more compact
feature representations and a reduced risk of overfitting. InterpreTabNet, an improvement over
TabNet, further boosts these capabilities with the MLP-Attentive Transformer and the Entmax
activation function, leading to more precise feature selection [78]. Similarly, TabTransformer’s
multi-head self-attention mechanism enables the model to dynamically capture feature interactions
across the dataset. By attending to all other features, it efficiently selects the most critical ones
while disregarding irrelevant data, which enhances the model’s robustness against noisy or missing
data. SAINT extends this concept by leveraging intersample attention, which allows features
to interact across different samples. This mechanism not only improves feature selection but
also provides a way for the model to learn from multiple data points simultaneously, enhancing
its resilience to missing or noisy data. SAINT’s feature encoding method, which projects both
categorical and continuous features into a shared embedding space, also outperforms traditional
encoding techniques by allowing the model to learn from all feature types in a unified manner.
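As a hedged illustration of instance-wise selection, the sketch below mimics the spirit of TabNet's attentive transformer in PyTorch; for self-containment it uses softmax where TabNet uses sparsemax (and InterpreTabNet the Entmax activation) to obtain genuinely sparse masks, and all sizes are illustrative.

import torch
import torch.nn as nn

class AttentiveSelector(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.fc = nn.Linear(n_features, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One mask per row: each sample receives its own feature-importance
        # distribution, i.e. instance-wise feature selection.
        mask = torch.softmax(self.fc(x), dim=-1)  # TabNet uses sparsemax/entmax here
        return x * mask                           # re-weighted features

x = torch.randn(8, 5)                             # 8 samples, 5 features
selected = AttentiveSelector(5)(x)

Because the mask is a function of the input, two different samples can end up emphasizing entirely different feature subsets, unlike uniform methods such as Lasso.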
Both TabNet and TabTransformer, along with SAINT, showcase notable advancements in han-
dling tabular data through their attention mechanisms, offering robustness, adaptability, and
transparency. TabNet’s attention-driven approach enhances gradient propagation and generaliza-
tion, while TabTransformer excels in handling noisy and missing data, making both models suitable
for real-world applications where data imperfections are common. SAINT builds on these strengths
by introducing intersample attention, which allows the model to learn from relationships between
samples, further enhancing its ability to handle complex data distributions. Additionally, TabTrans-
former and SAINT’s pre-training on unlabeled data in semi-supervised learning scenarios allows
them to refine feature representations, contributing to improved performance when compared to
models relying exclusively on labeled data.
Hybrid tree-based architectures such as NODE and DeepGBM have likewise improved performance on tasks traditionally dominated by simpler models like GBDT, boosting the effectiveness of tabular data predictions.
However, both architectures introduce challenges related to increased complexity. NODE’s dif-
ferentiable decision trees and multi-layer structures add computational overhead, making training
more resource-intensive compared to simpler GBDT models. Similarly, DeepGBM’s distillation pro-
cess, which involves learning leaf embeddings and managing multiple trees, introduces additional
computational cost. Both models require careful hyperparameter tuning to optimize performance,
which can make them more difficult to use in practice. Parameters like the number of layers, tree
depth, tree groups, and output dimensions must be adjusted meticulously to avoid overfitting and
ensure optimal learning. These complexities increase the training time and resource demands of both NODE and DeepGBM. When implemented effectively, both models can achieve inference efficiency comparable to GBDTs, but their training tends to be longer due to the additional layers of differentiable optimization and knowledge distillation.
6 Training Strategies
6.1 Data Augmentation
Data augmentation techniques, such as SMOTE, GAN-based methods, and variational autoencoders
(VAEs), have demonstrated varying degrees of effectiveness in improving the performance of deep
learning models on tabular data, particularly in addressing class imbalance and small dataset issues.
SMOTE, one of the classic techniques, has been widely used to oversample the minority class
by generating synthetic samples [117]. It does this by interpolating between existing data points
in the feature space, which helps mitigate the class imbalance problem and can enhance model
performance in imbalanced datasets. However, while SMOTE performs well with continuous features, it struggles with categorical variables, as noted in experiments using datasets like Breast Cancer and Credit Card Fraud [117]. The technique may struggle to maintain feature distributions when
dealing with categorical data, leading to less realistic synthetic samples that may not fully capture
the complexity of the original dataset. Wang and Pai [118] similarly note that SMOTE, although
effective for initial data expansion, does not generate sufficiently diverse and realistic data, limiting
its utility for more complex datasets.
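For reference, a minimal SMOTE example with the imbalanced-learn library is shown below; the synthetic dataset and parameters are illustrative rather than drawn from the cited experiments.

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# SMOTE interpolates between a minority sample and its k nearest neighbours,
# which is why it suits continuous features better than raw categorical codes.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))    # classes now balanced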
GAN-based methods, particularly Conditional Tabular GAN (CTGAN) and the Wasserstein Conditional GAN with Gradient Penalty (WCGAN-GP), have emerged as more advanced techniques for tabular data
augmentation. These methods have demonstrated better performance than traditional techniques
like SMOTE, especially when working with mixed-type tabular data containing both continuous
and categorical features. Camino et al. [119] highlight the advantages of using GANs over SMOTE
for minority class oversampling, emphasizing that GANs can generate more realistic and diverse
samples. However, they also point out challenges specific to tabular data, such as difficulty in
handling discrete outputs and mode collapse, where the GAN fails to generate a sufficiently varied
dataset. Jeong et al. [120] introduce BAMTGAN, a variation of GANs, which incorporates a similarity
loss to ensure the generated data maintains the original distribution and avoids mode collapse.
Despite improvements, the challenge of balancing sample diversity and realism persists.
CTGAN addresses several challenges inherent to tabular data, such as non-Gaussian and multimodal distributions, by introducing mode-specific normalization and using a conditional generator to manage class imbalance [121]. These mechanisms allow CTGAN to generate more realistic and varied synthetic data while preserving the underlying data distributions. Sauber-Cole and Khoshgoftaar [122] offer a broad survey on the use of GANs to address class imbalance in tabular data. GANs are praised for generating realistic minority class samples and improving model performance on imbalanced datasets. However, significant challenges remain, such as mode collapse, where GANs fail to capture the diversity of the minority class, and difficulty maintaining realistic feature distributions, especially for categorical data. Despite these issues, Sauber-Cole and Khoshgoftaar [122] highlight Wasserstein and Conditional GANs as promising solutions for overcoming these limitations.
WCGAN-GP further improves the stability of GAN training by mitigating issues like vanishing
gradients and mode collapse, which are common problems in standard GAN architectures. Com-
pared to SMOTE, WCGAN-GP has been shown to produce synthetic data that better preserves
data patterns and relationships, ultimately leading to better model performance and higher privacy
protection [123]. Hybrid approaches that combine SMOTE with GAN-based methods address chal-
lenges faced by standalone models. Wang and Pai [118] introduced a hybrid model using SMOTE
to augment small datasets, followed by WCGAN-GP to generate diverse and realistic synthetic
data. This combination leverages SMOTE’s statistical consistency and WCGAN-GP’s ability to
prevent overfitting, producing high-quality data while maintaining feature distribution, making it
an effective solution for tabular data augmentation.
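As a sketch of this workflow, the example below fits CTGAN from the open-source ctgan package (assumed installed) to a toy mixed-type table; the column names, generated data, and epoch count are illustrative assumptions, not those of the cited studies.

import numpy as np
import pandas as pd
from ctgan import CTGAN    # assumed dependency: pip install ctgan

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(18, 70, size=1000),                       # continuous
    "income": rng.normal(50_000, 15_000, size=1000),              # continuous
    "defaulted": rng.choice(["yes", "no"], size=1000, p=[0.2, 0.8]),  # categorical
})

# Mode-specific normalization handles the continuous columns; the conditional
# generator is driven by the declared discrete column(s).
model = CTGAN(epochs=10)
model.fit(data, discrete_columns=["defaulted"])
synthetic = model.sample(500)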
VAEs are another promising approach for data augmentation, particularly for continuous data.
VAEs regularize the latent space to generate smooth and realistic data distributions, and they have
been effective in augmenting tabular datasets [117]. However, they tend to struggle with mixed-type
data and categorical features, where maintaining the original feature distribution becomes more
challenging. Additionally, VAEs are prone to a phenomenon known as posterior collapse, where
the latent space collapses into a narrow range, reducing the variability of the generated samples
and leading to unrealistic outputs, particularly for minority classes.
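To ground this, a minimal tabular VAE sketch in PyTorch follows (continuous features only, illustrative sizes); it shows the reparameterization step and the reconstruction-plus-KL objective in which posterior collapse appears as the KL term shrinking toward zero.

import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    def __init__(self, n_features: int, latent: int = 4):
        super().__init__()
        self.enc = nn.Linear(n_features, 2 * latent)   # emits mu and log-variance
        self.dec = nn.Linear(latent, n_features)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

model = TabularVAE(n_features=8)
x = torch.randn(64, 8)                       # a batch of continuous features
recon, mu, logvar = model(x)
recon_loss = ((recon - x) ** 2).mean()
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
loss = recon_loss + kl                       # posterior collapse shows up as kl -> 0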
One of the main challenges across these techniques is the difficulty in maintaining the original
feature distributions, especially for continuous features and imbalanced categorical columns. While
SMOTE works well for continuous data, it often falls short in handling categorical data. GAN-
based approaches, such as CTGAN, use specific normalization techniques to address this issue,
but even these advanced methods are not immune to challenges like mode collapse, where the
model generates synthetic data lacking variability. GANs also require significant computational
resources and careful tuning of hyperparameters to avoid these issues during training. Despite
these challenges, GAN-based techniques, particularly WCGAN-GP, have demonstrated superior
performance in generating high-quality, realistic synthetic data compared to traditional methods
like SMOTE, making them a valuable tool for augmenting tabular datasets.
6.2 Cross-validation
Cross-validation is a crucial technique to ensure the generalization of deep learning models,
especially for tabular data, where model overfitting and data imbalances can significantly affect
performance. Richetti et al. [124] emphasize the importance of cross-validation, particularly for
smaller datasets where its role in preventing overfitting is more pronounced. In this context, k-fold
cross-validation emerges as a popular method, with the authors applying an 8-fold cross-validation
approach to achieve robust error measurements across different data partitions. Similarly, Zhu et al.
[2] applied tenfold cross-validation in their study on CNNs for tabular data transformed into image
representations. The study highlights how k-fold cross-validation ensures generalization even after
the transformation process, preventing overfitting, especially in limited or imbalanced datasets.
Both studies underscore that while k-fold cross-validation offers robust performance evaluation,
increasing the number of folds, such as from 5-fold to 10-fold, introduces higher computational
costs without a proportional improvement in the accuracy of the performance estimate.
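A minimal scikit-learn example of k-fold evaluation is given below; the model and the ten-fold setting are illustrative stand-ins for the protocols used in [2, 124].

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Ten folds: each sample is held out exactly once; more folds raise cost
# roughly linearly without a proportional gain in estimate accuracy.
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")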
Wilimitis and Walsh [125] provide a comparative analysis of cross-validation methods, focusing
on the trade-offs between computational efficiency and model performance. They examine the
commonly used 5-fold cross-validation alongside other variants like repeated k-fold cross-validation,
finding that while more folds can slightly improve model evaluation, they also escalate computational
demands. The study also explores nested cross-validation, a more unbiased method for performance
estimation, particularly useful in healthcare models. However, the significant computational costs
of nested cross-validation are highlighted due to its repeated training cycles during hyperparameter
tuning. This mirrors the findings of Richetti et al. [124], who noted that methods like leave-one-
out cross-validation (LOOCV) can be computationally impractical for larger datasets due to their
repeated iterations.
Ullah et al. [126] expand on these ideas by discussing the use of stratified k-fold cross-validation,
particularly for handling class imbalances in deep learning models for tabular data. By maintaining
consistent class proportions in each fold, stratified cross-validation improves generalization, espe-
cially when working with imbalanced datasets, a key concern echoed in the previous two studies.
Ullah et al. [126] also discuss the challenges of using LOOCV, which, despite providing unbiased
performance estimates, comes with a high computational cost, especially for larger datasets. Nested
cross-validation is similarly praised for its precision in mitigating data leakage during hyperparam-
eter tuning but is noted for its quadratic time complexity, making it a computationally intensive
choice.
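The sketch below illustrates, under assumed grid values and fold counts, how stratified folds and nested cross-validation combine in scikit-learn; the repeated inner-loop training makes the computational cost noted above explicit.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # estimation

# Stratification keeps the 80/20 class ratio in every fold; nesting the search
# inside the outer loop avoids leaking tuning information into the estimate.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"max_depth": [3, 6, None]}, cv=inner)
scores = cross_val_score(search, X, y, cv=outer)   # many repeated fits: the cost
print(scores.mean())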
These insights collectively suggest that while cross-validation techniques such as stratified
k-fold and nested cross-validation are vital for improving the robustness of deep learning models
on tabular data, they must be chosen carefully. The choice depends on balancing accuracy and
computational efficiency, where simpler methods like k-fold cross-validation are more scalable,
while more complex methods like nested cross-validation, although more precise, come with
significant computational trade-offs.
6.3 Transfer Learning
Self-supervised pre-training, such as contrastive learning, has also proven effective, particularly in scenarios where labeled data is scarce. These methods enable models to learn useful features without extensive labeling, making them highly suitable for domains with limited labeled datasets. Moreover, El-Melegy et al. [130]
introduced a novel approach that transforms tabular data into image-like formats, allowing the
use of CNNs traditionally designed for image tasks. Coupled with GAN-based sampling, which
generates synthetic data to balance datasets, this approach enables effective learning from small,
sparse datasets. In summary, while transfer learning shows substantial promise in tabular data, its
effectiveness is still hindered by challenges such as feature heterogeneity, catastrophic forgetting,
and dataset imbalance. However, advancements like transformer-based architectures, progressive
learning, and GAN-based data augmentation offer solutions to these challenges. As these methods
continue to evolve, transfer learning will likely become a more robust and widely applicable tool
for tasks involving tabular data.
7 Future Directions
As deep learning models continue to evolve for tabular data, two key areas stand out for future
exploration: explainability and self-supervised learning. While current models offer impressive
predictive capabilities, their lack of transparency remains a significant challenge in high-stakes fields
like transportation engineering and healthcare. Enhancing model explainability and interpretability
through advanced techniques like SHAP, LIME, and integrated gradients is essential for building
trust and understanding in these models. Additionally, the growing field of self-supervised learning
(SSL) offers significant potential to leverage vast amounts of unlabeled tabular data, improving
model performance without relying on extensive labeled datasets. This section examines these
promising directions and their potential impact on the future of tabular deep learning.
7.1 Explainability
Explainability techniques such as SHAP and LIME have been applied, for instance, to multimodal clinical models that combine admission notes and tabular data [132]. The combined use of these techniques allows for greater transparency by identifying the contributions of both text-based and structured data features. While this enhances
trust among clinicians, the complexity of these explanations poses challenges for broader adoption.
Simplifying these techniques to make them more accessible to non-technical users is necessary for
their wider application in high-stakes environments such as transportation safety and healthcare.
Several studies emphasize the need for further refinement of these explainability techniques
to improve their practical application. Dastile and Celik [133] applied SHAP in cancer prediction
models and found that while SHAP enhanced the interpretability of the model, its computational
demands made real-time application challenging. The authors suggest optimizing SHAP or devel-
oping more efficient explainability methods that retain interpretability while reducing resource
consumption, especially in scenarios where real-time decision-making is critical. Similarly, Tran and
Byeon [134] used SHAP in a hybrid LightGBM–TabPFN model to predict dementia in Parkinson’s
disease patients. SHAP provided valuable insights into the feature contributions, improving the
model’s interpretability in a clinical setting. However, the study also highlights the need for further
development of causality-driven explanations, integrating domain expertise to increase trust and
applicability in medical environments. In summary, while SHAP, LIME, and Integrated Gradients
have significantly improved the interpretability of deep learning models for tabular data, further
development is needed to enhance their computational efficiency, stability, and accessibility for
real-world applications where trust and transparency are crucial.
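As a brief illustration, the snippet below computes SHAP values for a gradient-boosting model with the shap library; the dataset and model are illustrative stand-ins for the clinical models in [133, 134].

import shap    # assumed dependency: pip install shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

explainer = shap.Explainer(model, X)   # dispatches to a fast tree explainer here
shap_values = explainer(X)             # per-sample, per-feature contributions
shap.plots.beeswarm(shap_values)       # global summary of feature effects

For tree ensembles this is fast, but for deep tabular models exact attribution is far more expensive, which is the computational bottleneck the cited studies highlight.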
7.2 Self-Supervised Learning
[Figure: schematic of a subset-based self-supervised objective in which a decoder D reconstructs features x̄ from overlapping feature subsets, with panel (ii) showing the reconstruction loss between the reconstructed features x̄ and the original features x.]
By pre-training on unlabeled data, such models learn effective representations without heavy reliance on labels, improving generalization across tasks.
Another approach to improving SSL’s efficacy on tabular data is through the careful design
of reconstruction-based tasks that leverage feature subsets. Zheng et al. [138] apply the SubTab
framework, where different subsets of features are used to reconstruct the full input, helping
address the issue of heterogeneity in tabular data, where not all features contribute equally to the
prediction task. This approach is echoed in VIME [58], which introduces two pretext tasks—feature
vector estimation and mask vector estimation—that focus on reconstructing original data from
masked and corrupted versions. These pretext tasks help the model learn robust representations by
encouraging it to handle missing or noisy data effectively. Similarly, Masked Encoding for Tabular data (MET) [139] builds on this by incorporating masked encoding inspired by transformer models and
using adversarial training during the reconstruction process. This adversarial component forces the
model to recover features even in the presence of perturbations, making the learned representations
more robust.
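A minimal sketch of such a masked-reconstruction pretext task, loosely following VIME's two pretext heads (our simplification, with illustrative sizes and corruption rate), is given below.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_features, p_mask = 10, 0.3
encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
recon_head = nn.Linear(32, n_features)      # "feature vector estimation"
mask_head = nn.Linear(32, n_features)       # "mask vector estimation"

x = torch.randn(64, n_features)
mask = (torch.rand_like(x) < p_mask).float()
shuffled = x[torch.randperm(x.size(0))]     # corrupt masked cells with values
x_tilde = (1 - mask) * x + mask * shuffled  # drawn from other rows (VIME-style)

h = encoder(x_tilde)
loss = F.mse_loss(recon_head(h), x) + \
       F.binary_cross_entropy_with_logits(mask_head(h), mask)
loss.backward()

Recovering both the original values and the corruption mask forces the encoder to model dependencies between features, which is what makes the learned representation robust to missing or noisy inputs.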
All these studies highlight the importance of leveraging unlabeled data in SSL, particularly in sce-
narios where labeled data is scarce. Leveraging large amounts of unlabeled data through techniques
such as consistency regularization, as seen in VIME and MET, can significantly enhance model
generalization even with limited labeled data. Wang et al. [135] emphasized that SSL techniques
must make effective use of unlabeled data to ensure transferability across tasks, especially since
tabular data often originates from diverse sources with varying feature distributions. This aligns
with the findings of Chitlangia et al. [137], where Manifold Mixup’s use of latent space perturba-
tions helps generate meaningful augmentations without relying on input-level transformations.
In SubTab, multi-view learning enables the model to capture different perspectives of the data,
extracting more robust representations from unlabeled data.
8 Conclusion
This survey reviewed the progress in deep learning models designed for tabular data, traditionally
a challenging domain for deep learning. While classical models like GBDTs have long dominated
tabular data tasks, new architectures such as TabNet, SAINT, and TabTransformer have introduced
attention mechanisms and feature embeddings to better handle the complexities of heterogeneous
features, high dimensionality, and non-local interactions. These models have made significant
strides in enhancing interpretability and performance, with innovations like TabNet’s sequential
attention and SAINT’s intersample attention, which dynamically capture relationships between
features and rows of data.
However, challenges remain, particularly regarding computational efficiency and the risk of
overfitting on smaller datasets. While models like TabTransformer and SAINT are computationally
intensive, techniques like Mixup, CutMix, and regularization methods have been developed to ad-
dress overfitting. Recent advancements, including hybrid models like TabTranSELU and GNN4TDL,
have expanded the range of applications across many research domains. IGTD has further advanced how deep learning models can transform tabular data into image-like representations for better performance.
One limitation of this survey is the lack of a detailed performance comparison across different
models and datasets. Future research should focus on conducting more rigorous evaluations of
tabular deep learning models on diverse datasets to gain a deeper understanding of their relative
strengths and weaknesses. Alongside performance comparisons, further studies should aim to
enhance the scalability and adaptability of these models, particularly in handling smaller or noisier
datasets. Techniques such as transfer learning and self-supervised learning hold promise, as they
allow models to benefit from large amounts of unlabeled data. Additionally, improving model
interpretability and reducing computational costs will be crucial to broadening the applicability of
deep tabular learning across industries like healthcare, finance, transportation, and infrastructure.
Acknowledgments
We extend our sincere gratitude to Khaled Aly Abousabaa for his assistance in preparing the images
for this study. We also thank our colleagues at Texas State University for their valuable guidance
throughout this work.
References
[1] Talip Ucar, Ehsan Hajiramezanali, and Lindsay Edwards. Subtab: Subsetting features of tabular data for self-supervised
representation learning. Advances in Neural Information Processing Systems, 34:18853–18865, 2021.
[2] Yitan Zhu, Thomas Brettin, Fangfang Xia, Alexander Partin, Maulik Shukla, Hyunseung Yoo, Yvonne A Evrard,
James H Doroshow, and Rick L Stevens. Converting tabular data into images for deep learning with convolutional
neural networks. Scientific reports, 11(1):11325, 2021.
[3] Md Atik Ahamed and Qiang Cheng. Mambatab: A simple yet effective approach for handling tabular data. arXiv
preprint arXiv:2401.08867, 2024.
[4] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language models are realistic
tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
[5] Suiyao Chen, Jing Wu, Naira Hovakimyan, and Handong Yao. Recontab: Regularized contrastive representation
learning for tabular data. arXiv preprint arXiv:2310.18541, 2023.
[6] Jing Wu, Suiyao Chen, Qi Zhao, Renat Sergazinov, Chen Li, Shengjie Liu, Chongchao Zhao, Tianpei Xie, Hanqing
Guo, Cheng Ji, et al. Switchtab: Switched autoencoders are effective tabular learners. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 38, pages 15924–15933, 2024.
[7] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural
networks and tabular data: A survey. IEEE transactions on neural networks and learning systems, 2022.
[8] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90,
2022.
[9] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on
typical tabular data? Advances in neural information processing systems, 35:507–520, 2022.
[10] Yuqian Wu, Hengyi Luo, and Raymond ST Lee. Deep feature embedding for tabular data. arXiv preprint
arXiv:2408.17162, 2024.
[11] Yury Gorishniy, Ivan Rubachev, and Artem Babenko. On embeddings for numerical features in tabular deep learning.
Advances in Neural Information Processing Systems, 35:24991–25004, 2022.
[12] David Collett. Modelling binary data. CRC press, 2002.
[13] Scott L Zeger and Kung-Yee Liang. Longitudinal data analysis for discrete and continuous outcomes. Biometrics,
pages 121–130, 1986.
[14] Curtis E Dyreson and Richard T Snodgrass. Timestamp semantics and representation. Information Systems, 18(3):143–
166, 1993.
[15] Ari Aulia Hakim, Alva Erwin, Kho I Eng, Maulahikmah Galinium, and Wahyu Muliady. Automated document
classification for news article in bahasa indonesia based on term frequency inverse document frequency (tf-idf)
approach. In 2014 6th international conference on information technology and electrical engineering (ICITEE), pages 1–4.
IEEE, 2014.
[16] Tomas Mikolov. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[17] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543,
2014.
[18] Xinlong Li, Xingyu Fu, Guangluan Xu, Yang Yang, Jiuniu Wang, Li Jin, Qing Liu, and Tianyuan Xiang. Enhancing bert
representation with context-aware embedding for aspect-based sentiment analysis. IEEE Access, 8:46868–46876, 2020.
[19] Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. Tabtransformer: Tabular data modeling using
contextual embeddings, 2020.
[20] Zhengyi Ma, Zhicheng Dou, Wei Xu, Xinyu Zhang, Hao Jiang, Zhao Cao, and Ji-Rong Wen. Pre-training for ad-hoc
retrieval: hyperlink is also you need. In Proceedings of the 30th ACM International Conference on Information &
Knowledge Management, pages 1212–1221, 2021.
[21] Zhenbo Lu, Wei Zhou, Shixiang Zhang, and Chen Wang. A new video-based crash detection method: Balancing speed
and accuracy using a feature fusion deep learning framework. Journal of advanced transportation, 2020(1):8848874,
2020.
[22] Boutheina Maaloul, Abdelmalik Taleb-Ahmed, Smail Niar, Naim Harb, and Carlos Valderrama. Adaptive video-based
algorithm for accident detection on highways. In 2017 12th IEEE International Symposium on Industrial Embedded
Systems (SIES), pages 1–6. IEEE, 2017.
[23] VN Durga Pavithra Kollipara, VN Hemanth Kollipara, and M Durga Prakash. Emoji prediction from twitter data using
deep learning approach. In 2021 Asian conference on innovation in technology (ASIANCON), pages 1–6. IEEE, 2021.
[24] Sercan Ö Arik and Tomas Pfister. Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI
conference on artificial intelligence, volume 35, pages 6679–6687, 2021.
[25] Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C Bayan Bruss, and Tom Goldstein. Saint: Improved
neural networks for tabular data via row attention and contrastive pre-training. arXiv preprint arXiv:2106.01342, 2021.
[26] Yuchen Mao. Tabtranselu: A transformer adaptation for solving tabular data. Applied and Computational Engineering,
51:81–88, 2024.
[27] Cheng-Te Li, Yu-Che Tsai, and Jay Chiehen Liao. Graph neural networks for tabular data learning. In 2023 IEEE 39th
International Conference on Data Engineering (ICDE), pages 3589–3592. IEEE, 2023.
[28] Shriyank Somvanshi, Gian Antariksa, and Subasish Das. Enhanced balanced-generative adversarial networks to
predict pedestrian injury types. Available at SSRN 4847615, 2024.
[29] Marco D Adelfio and Hanan Samet. Schema extraction for tabular data on the web. Proceedings of the VLDB Endowment,
6(6):421–432, 2013.
[30] Laith Alzubaidi, Jinglan Zhang, Amjad J Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, José Santamaría,
Mohammed A Fadhel, Muthana Al-Amidie, and Laith Farhan. Review of deep learning: concepts, cnn architectures,
challenges, applications, future directions. Journal of big Data, 8:1–74, 2021.
[31] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[33] Qinghua Zheng, Zhen Peng, Zhuohang Dang, Linchao Zhu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. Deep tabular data
modeling with dual-route structure-adaptive graph networks. IEEE Transactions on Knowledge and Data Engineering,
35(9):9715–9727, 2023.
[34] Antonio Briola, Yuanrong Wang, Silvia Bartolucci, and Tomaso Aste. Homological convolutional neural networks.
arXiv preprint arXiv:2308.13816, 2023.
[35] Tennison Liu, Zhaozhi Qian, Jeroen Berrevoets, and Mihaela van der Schaar. Goggle: Generative modelling for tabular
data by learning relational structure. In The Eleventh International Conference on Learning Representations, 2023.
[36] Lun Du, Fei Gao, Xu Chen, Ran Jia, Junshan Wang, Jiang Zhang, Shi Han, and Dongmei Zhang. Tabularnet: A neural
network architecture for understanding semantic structures of tabular data. In Proceedings of the 27th ACM SIGKDD
Conference on Knowledge Discovery & Data Mining, pages 322–331, 2021.
[37] Joseph M Hellerstein. Learning to restructure tables automatically. ACM SIGMOD Record, 53(1):75–75, 2024.
[38] Zifeng Wang and Jimeng Sun. Transtab: Learning transferable tabular transformers across tables. Advances in Neural
Information Processing Systems, 35:2902–2915, 2022.
[39] Amirata Ghorbani, Dina Berenbaum, Maor Ivgi, Yuval Dafna, and James Y Zou. Beyond importance scores: Interpreting
tabular ml by visualizing feature semantics. Information, 13(1):15, 2021.
[40] Nadja Geisler and Carsten Binnig. Introducing quest: a query-driven framework to explain classification models on
tabular data. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–4, 2022.
[41] Vanshika Jain, Meghansh Goel, and Kshitiz Shah. Deep learning on small tabular dataset: Using transfer learning
and image classification. In International Conference on Artificial Intelligence and Speech Technology, pages 555–568.
Springer, 2021.
[42] Georgia Koppe, Andreas Meyer-Lindenberg, and Daniel Durstewitz. Deep learning for small and big data in psychiatry.
Neuropsychopharmacology, 46(1):176–190, 2021.
[43] Wei Zhao. Research on the deep learning of the small sample data based on transfer learning. In AIP conference
proceedings, volume 1864. AIP Publishing, 2017.
[44] Lorenzo Brigato and Luca Iocchi. A close look at deep learning with small data. In 2020 25th international conference
on pattern recognition (ICPR), pages 2490–2497. IEEE, 2021.
[45] Illia Horenko. On a scalable entropic breaching of the overfitting barrier for small data problems in machine learning.
Neural Computation, 32(8):1563–1579, 2020.
[46] Benjamin L. Badger. Small language models for tabular data, 2022.
[47] Witold Wydmański, Oleksii Bulenok, and Marek Śmieja. Hypertab: Hypernetwork approach for deep learning on
small tabular datasets. In 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA),
pages 1–9. IEEE, 2023.
[48] Rajat Singh and Srikanta Bedathur. Embeddings for tabular data: A survey, 2023.
[49] Robert G Clark, Wade Blanchard, Francis KC Hui, Ran Tian, and Haruka Woods. Dealing with complete separation
and quasi-complete separation in logistic regression for linguistic data. Research Methods in Applied Linguistics,
2(1):100044, 2023.
[50] Daniel Valero-Carreras, Javier Alcaraz, and Mercedes Landete. Comparing two svm models through different metrics
based on the confusion matrix. Computers & Operations Research, 152:106131, 2023.
[51] Nitin Chauhan and Krishna Singh. A review on conventional machine learning vs deep learning. In 2018 International conference on computing, power and communication technologies (GUCON), pages 347–352. IEEE, 2018.
[52] Sakib Abrar. Deep model intervention for representation learning of tabular data. Master’s thesis, Tennessee State
University, 2023.
[53] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on
tabular data?, 2022.
[54] Sheikh Amir Fayaz, Majid Zaman, Sameer Kaul, and Muheet Ahmed Butt. Is deep learning on tabular data enough?
an assessment. International Journal of Advanced Computer Science and Applications, 13(4), 2022.
[55] André Schidler and Stefan Szeider. Sat-based decision tree learning for large data sets. Journal of Artificial Intelligence
Research, 80:875–918, 2024.
[56] Vinícius G Costa and Carlos E Pedreira. Recent advances in decision trees: An updated survey. Artificial Intelligence
Review, 56(5):4765–4800, 2023.
[57] Matan Marudi, Irad Ben-Gal, and Gonen Singer. A decision tree-based method for ordinal classification problems.
IISE Transactions, 56(9):960–974, 2024.
[58] Jinsung Yoon, Yao Zhang, James Jordon, and Mihaela Van der Schaar. Vime: Extending the success of self-and
semi-supervised learning to tabular domain. Advances in Neural Information Processing Systems, 33:11033–11043,
2020.
[59] Yuanfei Luo, Hao Zhou, Wei-Wei Tu, Yuqiang Chen, Wenyuan Dai, and Qiang Yang. Network on network for tabular
data classification in real-world applications. In Proceedings of the 43rd International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 2317–2326, 2020.
[60] Liran Katzir, Gal Elidan, and Ran El-Yaniv. Net-dnf: Effective deep modeling of tabular data. In International conference
on learning representations, 2020.
[61] Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. Tabtransformer: Tabular data modeling using
contextual embeddings. arXiv preprint arXiv:2012.06678, 2020.
[62] Sergei Popov, Stanislav Morozov, and Artem Babenko. Neural oblivious decision ensembles for deep learning on
tabular data. arXiv preprint arXiv:1909.06312, 2019.
[63] Guolin Ke, Zhenhui Xu, Jia Zhang, Jiang Bian, and Tie-Yan Liu. Deepgbm: A deep learning framework distilled by
gbdt for online prediction tasks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, pages 384–394, 2019.
[64] Baohua Sun, Lin Yang, Wenhan Zhang, Michael Lin, Patrick Dong, Charles Young, and Jason Dong. Supertml:
Two-dimensional word embedding for the precognition on structured tabular data. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
[65] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. xdeepfm: Com-
bining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD
international conference on knowledge discovery & data mining, pages 1754–1763, 2018.
[66] Guolin Ke, Jia Zhang, Zhenhui Xu, Jiang Bian, and Tie-Yan Liu. Tabnn: A universal neural network solution for
tabular data. 2018.
[67] Ira Shavitt and Eran Segal. Regularization learning networks: deep learning for tabular datasets. Advances in Neural
Information Processing Systems, 31, 2018.
[68] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a factorization-machine based
neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.
[69] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg
Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st
workshop on deep learning for recommender systems, pages 7–10, 2016.
[70] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need, 2021.
[71] Ami Abutbul, Gal Elidan, Liran Katzir, and Ran El-Yaniv. Dnf-net: A neural architecture for tabular data. arXiv
preprint arXiv:2006.06465, 2020.
[72] Nitin Kumar Chauhan and Krishna Singh. A review on conventional machine learning vs deep learning. In 2018
International conference on computing, power and communication technologies (GUCON), pages 347–352. IEEE, 2018.
[73] Sakib Abrar and Manar D Samad. Perturbation of deep autoencoder weights for model compression and classification
of tabular data. Neural Networks, 156:160–169, 2022.
[74] Jintai Chen, Jiahuan Yan, Qiyuan Chen, Danny Chen, Jian Wu, and Jimeng Sun. Excelformer: Can a dnn be a sure bet
for tabular prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, 2024.
[75] N Benjamin Erichson, Lionel Mathelin, Zhewei Yao, Steven L Brunton, Michael W Mahoney, and J Nathan Kutz.
Shallow neural networks for fluid flow reconstruction with limited sensors. Proceedings of the Royal Society A,
476(2238):20200097, 2020.
[76] Ivan Rubachev, Artem Alekberov, Yury Gorishniy, and Artem Babenko. Revisiting pretraining objectives for tabular
deep learning. arXiv preprint arXiv:2207.03208, 2022.
[77] James Fiedler. Simple modifications to improve tabular neural networks. arXiv preprint arXiv:2108.03214, 2021.
[78] Shiyun Wa, Xinai Lu, and Minjuan Wang. Stable and interpretable deep learning for tabular data: Introducing
interpretabnet with the novel interprestability metric. arXiv preprint arXiv:2310.02870, 2023.
[79] Viet-Cuong Ta, Thi-Linh Hoang, Ngoc-San Doan, Van-Thang Nguyen, Nuong Nguyen Dieu, Thi Thanh Thuy Pham,
and Nam Nguyen Dang. Tabnet efficiency for facies classification and learning feature embedding from well log data.
Petroleum Science and Technology, pages 1–16, 2023.
[80] Sheikh Amir Fayaz, Majid Zaman, Sameer Kaul, and Muheet Ahmed Butt. Is deep learning on tabular data enough?
an assessment. International Journal of Advanced Computer Science and Applications, 13(4), 2022.
[81] Aleksandra Lewandowska. Xgboost meets tabnet in predicting the costs of forwarding contracts. In 2022 17th
Conference on Computer Science and Intelligence Systems (FedCSIS), pages 417–420. IEEE, 2022.
[82] Manu Joseph. Pytorch tabular: A framework for deep learning with tabular data, 2021.
[83] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller. Explainable AI:
interpreting, explaining and visualizing deep learning, volume 11700. Springer Nature, 2019.
[84] Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm:
Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence
and Statistics, pages 5549–5581. PMLR, 2023.
[85] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. Tabddpm: Modelling tabular data with
diffusion models. In International Conference on Machine Learning, pages 17564–17579. PMLR, 2023.
[86] Guang Liu, Jie Yang, and Ledell Wu. Ptab: Using the pre-trained language model for modeling tabular data. arXiv
preprint arXiv:2209.08060, 2022.
[87] Manu Joseph and Harsh Raj. Gate: Gated additive tree ensemble for tabular classification and regression. arXiv
preprint arXiv:2207.08548, 2022.
[88] Shaofeng Cai, Kaiping Zheng, Gang Chen, HV Jagadish, Beng Chin Ooi, and Meihui Zhang. Arm-net: Adaptive
relation modeling network for structured data. In Proceedings of the 2021 International Conference on Management of
Data, pages 207–220, 2021.
[89] Jannik Kossen, Neil Band, Clare Lyle, Aidan N Gomez, Thomas Rainforth, and Yarin Gal. Self-attention between
datapoints: Going beyond individual input-output pairs in deep learning. Advances in Neural Information Processing Systems, 34, 2021.
[114] Shourav B Rabbani, Ivan V Medri, and Manar D Samad. Attention versus contrastive learning of tabular data–a
data-centric benchmarking. arXiv preprint arXiv:2401.04266, 2024.
[115] Sajad Darabi, Shayan Fazeli, Ali Pazoki, Sriram Sankararaman, and Majid Sarrafzadeh. Contrastive mixup: Self-and
semi-supervised learning for tabular domain. arXiv preprint arXiv:2108.12296, 2021.
[116] Karim Lounici, Katia Meziani, and Benjamin Riu. Muddling label regularization: Deep learning for tabular datasets.
arXiv preprint arXiv:2106.04462, 2021.
[117] Pedro Machado, Bruno Fernandes, and Paulo Novais. Benchmarking data augmentation techniques for tabular data.
In International Conference on Intelligent Data Engineering and Automated Learning, pages 104–112. Springer, 2022.
[118] Winston Wang and Tun-Wen Pai. Enhancing small tabular clinical trial dataset through hybrid data augmentation:
combining smote and wcgan-gp. Data, 8(9):135, 2023.
[119] Ramiro Camino, Christian Hammerschmidt, and Radu State. Minority class oversampling for tabular data with deep
generative models. arXiv preprint arXiv:2005.03773, 2020.
[120] Jueun Jeong, Hanseok Jeong, and Han-Joon Kim. Bamtgan: A balanced augmentation technique for tabular data. In
2023 9th International Conference on Applied System Innovation (ICASI), pages 205–207. IEEE, 2023.
[121] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using
conditional gan. Advances in neural information processing systems, 32, 2019.
[122] Rick Sauber-Cole and Taghi M Khoshgoftaar. The use of generative adversarial networks to alleviate class imbalance
in tabular data: a survey. Journal of Big Data, 9(1):98, 2022.
[123] Susan McKeever and Manhar Singh Walia. Synthesising tabular datasets using wasserstein conditional gans with
gradient penalty (wcgan-gp). Technological University Dublin, 2020.
[124] Jonathan Richetti, Foivos I Diakogianis, Asher Bender, André F Colaço, and Roger A Lawes. A methods guideline for
deep learning for tabular data in agriculture with a case study to forecast cereal yield. Computers and Electronics in
Agriculture, 205:107642, 2023.
[125] Drew Wilimitis and Colin G Walsh. Practical considerations and applied examples of cross-validation for model
development and evaluation in health care: tutorial. JMIR AI, 2:e49023, 2023.
[126] Ihsan Ullah, Andre Rios, Vaibhav Gala, and Susan Mckeever. Explaining deep learning models for tabular data using
layer-wise relevance propagation. Applied Sciences, 12(1):136, 2021.
[127] Roman Levin, Valeriia Cherepanova, Avi Schwarzschild, Arpit Bansal, C Bayan Bruss, Tom Goldstein, Andrew Gordon
Wilson, and Micah Goldblum. Transfer learning with deep tabular models. arXiv preprint arXiv:2206.15306, 2022.
[128] Mohammadreza Iman, Hamid Reza Arabnia, and Khaled Rasheed. A review of deep transfer learning and recent
advancements. Technologies, 11(2):40, 2023.
[129] Qixuan Jin and Talip Ucar. Benchmarking tabular representation models in transfer learning settings. In NeurIPS
2023 Second Table Representation Learning Workshop, 2023.
[130] Moumen El-Melegy, Ahmed Mamdouh, Samia Ali, Mohamed Badawy, Mohamed Abou El-Ghar, Norah Saleh Alghamdi,
and Ayman El-Baz. Prostate cancer diagnosis via visual representation of tabular data and deep transfer learning.
Bioengineering, 11(7):635, 2024.
[131] Junkang An, Yiwan Zhang, and Inwhee Joe. Specific-input lime explanations for tabular data based on deep learning
models. Applied Sciences, 13(15):8782, 2023.
[132] Zhenyue Gao, Xiaoli Liu, Yu Kang, Pan Hu, Xiu Zhang, Wei Yan, Muyang Yan, Pengming Yu, Qing Zhang, Wendong
Xiao, et al. Improving the prognostic evaluation precision of hospital outcomes for heart failure using admission
notes and clinical tabular data: Multimodal deep learning model. Journal of Medical Internet Research, 26:e54363, 2024.
[133] Xolani Dastile and Turgay Celik. Making deep learning-based predictions for credit scoring explainable. IEEE Access,
9:50426–50440, 2021.
[134] Vinh Quang Tran and Haewon Byeon. Predicting dementia in parkinson’s disease on a small tabular dataset using
hybrid lightgbm–tabpfn and shap. Digital Health, 10:20552076241272585, 2024.
[135] Wei-Yao Wang, Wei-Wei Du, Derek Xu, Wei Wang, and Wen-Chih Peng. A survey on self-supervised learning for
non-sequential tabular data. arXiv preprint arXiv:2402.01204, 2024.
[136] Ehsan Hajiramezanali, Nathaniel Lee Diamant, Gabriele Scalia, and Max W Shen. Stab: Self-supervised learning for
tabular data. In NeurIPS 2022 First Table Representation Workshop, 2022.
[137] Sharad Chitlangia, Anand Muralidhar, and Rajat Agarwal. Self supervised pre-training for large scale tabular data.
2022.
[138] Xuan Zheng, Xiuli Ma, Yanliang Jin, Dongsheng Gu, and Rui Wang. Tabular-based self-supervised learning approach
for encrypted traffic classification. Journal of Electronic Imaging, 32(4):043032–043032, 2023.
[139] Kushal Majmundar, Sachin Goyal, Praneeth Netrapalli, and Prateek Jain. Met: Masked encoding for tabular data.
arXiv preprint arXiv:2206.08564, 2022.