How To Win A Data Science Competition

Uploaded by

ksafuture1

How to Win a Data Science Competition
Mohammed Eltayeb

Table of Contents
1. Introduction to Kaggle
2. Exploring AI Competitions
3. Mastering AI Competitions: Understand Your Toolkit
4. Choosing the Best Models
5. Practical Tips and Strategies
Introduction to Kaggle
Welcome aboard, Kagglers
Kaggle.com
The biggest AI community!

● Hosts AI competitions, datasets, notebooks, models, tutorials and a lot more.
● The best place to learn the practical side of AI.
● Ranks and tiers.
Why should you compete on Kaggle?
● Improve your practical AI skills.
● Learn how to apply the state of the art.
● See the implications of the decisions you make while building a model (i.e. learn what overfitting really means; by the way, have you heard about shake-up?).
● Compare yourself against the best in the world (you may even be able to beat them).
● Win some prizes.
● Fun :)
Exploring AI Competitions
Have you heard about shake-up?
What are AI comps?
● You will be given a real-world task coming from any discipline (e.g. medicine, finance, sports, astronomy, ...).
● You should gather some domain knowledge about the topic, read about it, then choose the model that best solves the problem.
What are AI comps?
● Dataset:
○ You will be given two datasets for the task (or asked to gather one yourself, lol).
○ Training and testing datasets.
○ You should train your model on the training data, then make predictions for the testing data.
○ These predictions are the submission file for the competition.

[Diagram: train data → train & validate the model → model ready to predict; test data → predict → results (y_pred), usually a CSV file]
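
The train, predict and submit loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the rows, the column names `id` and `price`, and the mean-value "model" are made up for the example, and the real file layout is defined by each competition.

```python
import csv
import io

# Hypothetical training rows: (wingspan_cm, price). Values are illustrative only.
train = [(125.5, 10.0), (200.0, 98.1), (220.3, 110.2), (136.0, 154.0)]
# Hypothetical test rows: (row_id, wingspan_cm). The target is hidden from us.
test = [(101, 130.0), (102, 210.0)]


def fit_mean_baseline(rows):
    """'Train' the simplest possible model: remember the mean target."""
    targets = [y for _, y in rows]
    return sum(targets) / len(targets)


def predict(model_mean, rows):
    """Predict the stored mean for every test row."""
    return [(row_id, model_mean) for row_id, _ in rows]


def to_submission_csv(predictions):
    """Serialize predictions as CSV text with an assumed 'id,price' header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "price"])
    for row_id, y_pred in predictions:
        writer.writerow([row_id, round(y_pred, 2)])
    return buffer.getvalue()


model = fit_mean_baseline(train)
submission = to_submission_csv(predict(model, test))
print(submission)
```

A mean predictor like this is also a reasonable first baseline to compare later models against.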
Example of Dataset samples
Index  Weight (g)  Wingspan (cm)  Price?  Back color  Species
1      100.1       125.5          10      Brown       Buteo jamaicensis
2      3000.7      200.0          98.1    Gray        Sagittarius serpentarius
3      3300.0      220.3          110.2   Gray        Sagittarius serpentarius
4      4100.0      136.0          154     Black       Gavia immer
5      3.0         11.0           2       Green       Calothorax lucifer


Leaderboard
● Leaderboard:
○ Oh bro, have you heard about shake-up?
○ There are 2 leaderboards: public and private.
○ Public LB: visible while the competition is running.
○ Private LB: when the competition finishes, the private LB is revealed and the final ranking depends on it.
○ But why do we need this split of the test data?

[Figure: the test-set Price column split into a public leaderboard part (13495, 16500, 16500) and a private leaderboard part (13950, 17450, 15250)]
Leaderboard
You might overfit the leaderboard
● Leaderboard:
○ We need the private LB to drop solutions that overfitted to the public LB.
○ This is to ensure your solution is generalizable, working and reliable.
○ When many people overfit the public LB and the private LB rankings end up all over the place, we call this a shake-up.
○ But how do you avoid overfitting? By having a robust validation (more details next).

                     Price-True  Price_pred
Public Leaderboard   13495       13490
                     16500       16562
                     16500       165315
Private Leaderboard  13950       35134
                     17450       20518
                     15250       29516
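
A small sketch of why the split matters, assuming mean absolute error as the metric and made-up numbers (not the values from the slide). The same test predictions are scored twice, once on the public rows and once on the private rows; `public_mask` is a hypothetical name for the hidden assignment of rows to the two parts.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def leaderboard_scores(y_true, y_pred, public_mask):
    """Return (public_score, private_score) for one submission."""
    public = [(t, p) for t, p, m in zip(y_true, y_pred, public_mask) if m]
    private = [(t, p) for t, p, m in zip(y_true, y_pred, public_mask) if not m]
    public_score = mean_absolute_error(*zip(*public))
    private_score = mean_absolute_error(*zip(*private))
    return public_score, private_score


# Hypothetical hidden targets; the first three rows feed the public LB.
y_true = [100, 200, 300, 400, 500, 600]
public_mask = [True, True, True, False, False, False]

# Submission tuned until the public score looked perfect.
overfit_pred = [100, 201, 299, 450, 420, 700]
# Submission with a consistent, moderate error everywhere.
robust_pred = [110, 190, 310, 390, 510, 590]

overfit_public, overfit_private = leaderboard_scores(y_true, overfit_pred, public_mask)
robust_public, robust_private = leaderboard_scores(y_true, robust_pred, public_mask)

print("overfit:", overfit_public, overfit_private)
print("robust: ", robust_public, robust_private)
```

The overfitted submission ranks first on the public part and last on the private part; when this happens to many teams at once, the final ranking reshuffles.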
AI Competitions Goals
● Performance: The current state of the art is the baseline for the competition, so in many cases you have to beat the SOTA to win (or maybe not).
● Efficiency: There are many constraints in the competitions (e.g. limited GPUs, limited inference time, ...).
● Handle Big Data: Sometimes datasets can go up to 700GB. How can you decrease the data size with minimal loss of information so you can handle it on your device? How do you load it into your model fast enough to finish within a reasonable time? How do you run inference later on this amount of data?
AI Competitions Goals
● Ideal Dataset: Sometimes the competition dataset is not enough to win. You have to search for extra datasets on the internet or check pretrained models.
● Avoid overfitting: Having a robust validation split makes your model robust against shake-ups (I hope, lol).
What are AI comps?
● Let’s explore it a bit:

Kaggle: Your Home for Data Science

Mastering AI Competitions
Understand Your Toolkit

Let’s start with a quick review of data types…

● Text
● Tabular Data
● Images
● Time Series Data
● Videos
● Waves (e.g. Audio)
Understand Your Toolkit: EDA
● Exploratory Data Analysis (EDA) means exploring your data by looking at raw samples, statistics or plots to gain useful insights about your task.
● Why do we need it?
○ Get comfortable with the data.
○ Determine how to approach the problem.
○ Determine the best cross-validation (CV) scheme.
○ Determine the most important features (How can you do that?).
○ Spot any strange behaviour in the features’ distributions or in their correlations with each other.
○ Discover serious problems with the data (e.g. leakage).
○ Find magic features.
Understand Your Toolkit: EDA
● Good EDA could be the key for winning a competition.
● Examples:
○ AMP®-Parkinson's Disease Progression Prediction | Kaggle
○ Google Research - Identify Contrails to Reduce Global Warming | Kaggle
Understand Your Toolkit: EDA
● Tools & Techniques:
○ Plots: Ordinary EDA.
○ t-SNE/UMAP: Visualize high-dimensional data. Link
○ Tree models: Detect important features. Link
○ LOFO/SHAP: Better feature selection tools. Link
○ Adversarial Validation: Detect distribution shift. Link
○ …
[Figure: t-SNE short movie showing how the NN behaves during training. Reference: American Express - Default Prediction | Kaggle]
Understand Your Toolkit: Data Preprocessing
● Tabular/Time-series Data:
○ Encoding (Label, One-Hot, Frequency, ...)
○ Feature Engineering (Aggregations, Lags, Differences, ...). Link
○ Feature Selection (Trees, LOFO, SHAP, ...)
○ Normalization (maybe not?)
○ …
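
To make the encoding and feature-engineering bullets concrete, here is a minimal pure-Python sketch. The helper names are hypothetical, the colour column is taken from the example dataset earlier in the deck, and the numeric series is made up; in practice these steps are usually done with a DataFrame library.

```python
from collections import Counter


def label_encode(values):
    """Map each category to an integer id, in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """One 0/1 column per category (categories sorted alphabetically)."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return rows, categories


def frequency_encode(values):
    """Replace each category by its relative frequency in the column."""
    counts = Counter(values)
    return [counts[value] / len(values) for value in values]


def lag_feature(series, lag):
    """Value of the series `lag` steps earlier; None where it is unknown."""
    return [series[i - lag] if i >= lag else None for i in range(len(series))]


def difference_feature(series):
    """Change from the previous step; None for the first element."""
    return [None] + [series[i] - series[i - 1] for i in range(1, len(series))]


colors = ["Brown", "Gray", "Gray", "Black", "Green"]
print(label_encode(colors)[0])       # integer ids
print(one_hot_encode(colors)[0][0])  # one-hot row for "Brown"
print(frequency_encode(colors))      # share of rows with the same colour

demand = [10, 12, 15, 11]            # made-up time series
print(lag_feature(demand, 1))
print(difference_feature(demand))
```

Any statistic used for encoding (such as the frequencies) should be computed on the training folds only, otherwise the validation score can leak information.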
Understand Your Toolkit: Data Preprocessing
● Images/text data:
○ Image normalization
○ Augmentations
○ Text cleaning
○ Text tokenization
○ …
Understand Your Toolkit: Validation
● The most important step at the beginning of any project/competition is choosing the correct train/validation split.
● Choosing an incorrect split leads to wrong evaluation, which makes all the following steps useless.
Understand Your Toolkit: Validation
● Types of Data Split:
○ Hold-out fold: Splits data into a single training set and a single validation set.
○ KFold: Divides data into k subsets and uses each subset in turn as the validation set while training on the remaining k-1 subsets.
○ StratifiedKFold: Similar to KFold, but ensures that each fold maintains the same proportion of class labels as the entire dataset.
○ GroupKFold: Splits data into k folds while keeping all samples of a group in the same fold, so a group never appears in both training and validation.
○ TimeSeriesSplit: Divides data into training and validation sets in a manner that respects the temporal order, suitable for time series data.
○ Their Combinations…Link
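
The splits above can be sketched as plain index generators. This is an illustrative pure-Python version with hypothetical function names, not the scikit-learn implementation: there is no shuffling, and groups are assigned to folds greedily by size.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx); every sample is validated exactly once."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def group_k_fold_indices(groups, k):
    """Yield (train_idx, valid_idx) where no group is split across the two."""
    fold_of_group, fold_load = {}, [0] * k
    # Largest groups first, each into the currently least loaded fold.
    for group, size in sorted(Counter(groups).items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = fold_load.index(min(fold_load))
        fold_of_group[group] = target
        fold_load[target] += size
    for fold in range(k):
        valid = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train, valid


def time_series_split(n_samples, n_splits):
    """Yield expanding-window splits: always train on the past, validate on the future."""
    test_size = n_samples // (n_splits + 1)
    for test_start in range(n_samples - n_splits * test_size, n_samples, test_size):
        yield list(range(test_start)), list(range(test_start, test_start + test_size))


patients = ["a", "a", "b", "b", "c", "c", "c"]  # hypothetical group labels
for train_idx, valid_idx in group_k_fold_indices(patients, 2):
    print("train", train_idx, "valid", valid_idx)
```

A grouped split fits data where rows from the same patient, user or recording would otherwise leak between training and validation.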
Understand Your Toolkit: Validation
● How do you know whether your score is good or not?

Make a baseline, then compare your subsequent scores against it.
Understand Your Toolkit: Optimizers
● Optimizers: Algorithms used to minimize your loss and update your model parameters.
● Types: SGD, SGD with Momentum, RMSprop, Adam, AdamW, etc.
● The best one is …?
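
As a toy illustration of what an optimizer does, the sketch below minimizes the made-up loss f(w) = (w - 3)^2 with plain SGD and with SGD plus momentum. The learning rate, momentum coefficient and step counts are arbitrary assumptions chosen for the example.

```python
def grad(w):
    """Gradient of the toy loss f(w) = (w - 3) ** 2."""
    return 2.0 * (w - 3.0)


def sgd(w, lr, steps):
    """Plain gradient descent: step against the gradient."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def sgd_momentum(w, lr, steps, beta=0.9):
    """Gradient descent with a velocity term that accumulates past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)
        w -= lr * velocity
    return w


print(sgd(0.0, lr=0.1, steps=100))
print(sgd_momentum(0.0, lr=0.1, steps=500))
```

With these particular settings momentum oscillates around the minimum and needs more steps than plain SGD, which is one reason the "best" optimizer depends on the problem and on the other hyperparameters.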
Understand Your Toolkit: Schedulers
● Schedulers: Decay (decrease) the learning rate over time.
● Why? It makes training much more stable.
● Types: Linear, Step decay, Exponential, Cosine, Polynomial, Reduce-on-plateau, etc.
● The best one is …?
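
Two of the schedules listed above, written as plain functions of the training step. This is a sketch with hypothetical function names and an optional linear warm-up; deep learning frameworks ship their own scheduler classes, which are what you would normally use.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def exponential_lr(step, base_lr, gamma):
    """Multiply the learning rate by gamma after every step."""
    return base_lr * gamma ** step


total = 100
for step in (0, 9, 10, 55, 100):
    print(step, cosine_lr(step, total, base_lr=2e-5, warmup_steps=10))
```

The same function works per step or per epoch; only the meaning of `step` and `total_steps` changes.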
Understand Your Toolkit: Training Stability
● To make training stable, these components have to work together:
○ Optimizer
○ Scheduler
○ Batch size
○ Initial Learning Rate
○ Warm-up steps
○ Number of Epochs
Understand Your Toolkit: Efficiency
● Data Processing:
○ cuDF: pandas-like DataFrames on GPU
○ Polars: multi-threaded DataFrame library (a pandas alternative)
● ML Algorithms:
○ cuML: sklearn-like algorithms on GPU
● DL Algorithms:
○ ONNX: Speeds up inference on CPU
○ Mixed Precision / Quantization: Smaller model size
○ Knowledge Distillation: A small student model tries to learn from (replicate) a big teacher model.
○ …
Understand Your Toolkit: Ensembling
● Why use only one model? Let’s use several, then take a weighted average of their predictions!
● Types of Ensembling:
○ Averaging (or Blending)
○ Bagging
○ Stacking
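
A minimal sketch of weighted averaging, assuming two models whose validation predictions are already available. The numbers and the grid search over a single blend weight are illustrative only.

```python
def weighted_average(predictions, weights):
    """Blend several prediction lists into one using the given weights."""
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def best_blend_weight(y_true, pred_a, pred_b, grid=11):
    """Try weights 0.0, 0.1, ..., 1.0 for model A and keep the best one."""
    best_weight, best_score = None, None
    for k in range(grid):
        w = k / (grid - 1)
        score = mse(y_true, weighted_average([pred_a, pred_b], [w, 1.0 - w]))
        if best_score is None or score < best_score:
            best_weight, best_score = w, score
    return best_weight, best_score


# Hypothetical validation targets and two models with opposite biases.
y_valid = [1.0, 2.0, 3.0, 4.0]
model_a = [1.2, 2.2, 3.2, 4.2]  # always a bit too high
model_b = [0.8, 1.8, 2.8, 3.8]  # always a bit too low

weight, score = best_blend_weight(y_valid, model_a, model_b)
print(weight, score, mse(y_valid, model_a), mse(y_valid, model_b))
```

Blending helps most when the models make different errors; the weights should be tuned on validation (out-of-fold) predictions, never on the leaderboard.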
Choosing the Best Models
It’s all about the assumptions.
Revisit Data Types…

Is there a higher-level way to look at these types of data?

● Tabular Data
● Time Series Data
● Waves (e.g. Audio)
● Text
● Images
● Videos

Revisit Data Types…

● Classify the data types based on their properties (assumptions):

○ Tabular (non-sequential): Order of features doesn’t matter.
○ Sequence: Order does matter.
○ Neighborhood: Each feature has a relation with its neighbors.
○ Graph: Generalized Neighborhood.
○ Sequential Decision-making: Sequence + feedback influencing
future decisions.
Data Assumptions
● Classify the data types based on their properties (assumptions):
○ Tabular (non-sequential): Tabular Data.
○ Sequence: Time-series data, text data, waves data.
○ Neighborhood: Images.
○ Graph: Graph-based data (e.g. molecular structures,...)
○ Sequential Decision-making: Decision-making environments (e.g.
playing games, robotics, …)
○ Combination between them: Videos (Sequence+Neighborhood)
Models Assumptions
● Match the models to the data properties (assumptions):
○ Tabular (non-sequential): Machine Learning models.
○ Sequence: Sequence models (LSTM/Transformer/1D CNN)
○ Neighborhood: CNNs (1D CNN/2D CNN/3D CNN)
○ Graph: Graph NN
○ Sequential Decision-making: Reinforcement Learning
○ Combination between them: e.g. for videos, CNN + Sequence model
Models & Data Assumptions
● So, to figure out the right model, you should know the properties/assumptions of your dataset, then choose the models that match them.
Practical Tips and Strategies
It’s all about the assumptions.
ML Models:
● ML models are the SOTA on tabular data.
● Types of models:
○ Linear Models.
○ Kernel Models.
○ Tree Models.
ML Models:
● Linear Models: Linear Regression / Logistic Regression
○ Best when there is a linear relationship between the features and the target variable.
○ The most stable models.
○ Widely used in time-series forecasting because of the strong relationship between the target and its lags.
○ Best Versions: Ridge, ElasticNet, Lasso (in sklearn).
○ Forecasting variants: ARIMA, Facebook Prophet, ...
ML Models:
● Kernel Models: SVM
○ Best when there is a very large number of features.
○ Quite slow with more than ~5K samples (How to solve this?).
○ Sometimes used as a head on top of embeddings extracted from other models, because it can handle high dimensionality. Link
○ Best Versions: SVM (in sklearn), SVM (in cuML, very fast).
ML Models:
● Tree Models:
○ Generally the best ML algorithms.
○ Types:
■ Decision Trees
■ Random Forest
■ Gradient Boosting (GBDT): XGBoost, LightGBM (fast on CPU), CatBoost (best baseline)
ML Models:
● Examples:
○ American Express - Default Prediction | Kaggle
○ ISIC 2024 - Skin Cancer Detection with 3D-TBP | Kaggle
○ Enefit - Predict Energy Behavior of Prosumers | Kaggle
DL Models:
● DL models are the SOTA in everything except tabular data (time series is in between, but mostly ML is better).
NLP Tasks
● Best Models: Transformers
● Need fine-tuning: DeBERTa, RoBERTa, ...
● Zero-shot: recent LLMs
● Best Library: “transformers” by Hugging Face.
NLP Tasks
● NLP models are already quite powerful. Therefore, the improvements on the LB will be small.
● How to improve your results?
○ Clean your data properly.
○ Tune your hyperparameters. They have a big effect here.
○ Find models that were pre-trained on data similar to your task (you can search on Hugging Face).
○ Ensembling (averaging, stacking).
○ Manipulate the architecture and the embeddings. Link Link
○ Data-based ideas.
○ Check previous solutions.
NLP Tasks
● Notes:
○ If the task is regression, disable dropout to get more stable training. Link
○ Transformers need very low learning rates (e.g. 1e-5, 2e-5).
○ Schedulers and warm-up are widely used in NLP.
○ NLP models are usually fine-tuned for only 1-3 epochs before they start overfitting. (Why?)
○ LLMs are mostly trained for <= 1 epoch.
○ The scheduler should be updated per step, not per epoch.
○ NLP models are huge (e.g. xsmall DeBERTa has ~70M params).
○ Recently, a very interesting pattern appeared where DeBERTa could be beaten by LLMs in classification IF the data came from LLMs.
NLP Tasks

● Examples:
○ Feedback Prize - English Language Learning | Kaggle
○ Learning Agency Lab - Automated Essay Scoring 2.0 | Kaggle
○ LMSYS - Chatbot Arena Human Preference Predictions | Kaggle
○ AI Mathematical Olympiad - Progress Prize 1 | Kaggle
○ Kaggle - LLM Science Exam | Kaggle
● Some good baselines:
○ AES-2 | Multi-class Classification [Train] (kaggle.com)
○ [Training] Gemma-2 9b 4-bit QLoRA fine-tuning (kaggle.com)
Computer Vision
● Best Models:
○ EfficientNet
○ ResNet
○ ViTs - Swin Transformers
○ ConvNeXt
● Best Library:
○ Models: timm, Torchvision.
○ Augmentations: Albumentations, Torchvision.
Computer Vision Tasks
● Classification: EfficientNet, ResNet, ViT, ...
● Segmentation (2D/2.5D/3D): Unet-based archs, Mask RCNN, ...
● Detection: DETR (DEtection TRansformer), YOLO, EfficientDet, Faster RCNN, ...
● How to improve your results? Everything from NLP, plus:
○ Augmentations
○ TTA (Test-time augmentations).
○ Data-based ideas.
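
Test-time augmentation can be sketched as "predict on several views of the same image and average". In the example below the image is a tiny nested list and `toy_model` is a hypothetical stand-in for a trained network, chosen so that its output changes under a horizontal flip.

```python
def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vflip(image):
    """Mirror the image top to bottom."""
    return image[::-1]


def toy_model(image):
    """Hypothetical 'classifier': mean brightness of the left half of the image."""
    left = [value for row in image for value in row[: len(row) // 2]]
    return sum(left) / len(left)


def predict_with_tta(model, image, transforms):
    """Average the model output over the original image and its augmented views."""
    views = [image] + [transform(image) for transform in transforms]
    scores = [model(view) for view in views]
    return sum(scores) / len(scores)


image = [[0.0, 1.0],
         [0.0, 1.0]]

print(toy_model(image))                          # single-view prediction
print(predict_with_tta(toy_model, image, [hflip]))  # averaged over 2 views
```

Only use transforms that keep the label valid; for example, flipping is not safe when left and right carry meaning in the task.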
Computer Vision Tasks
● How to choose the appropriate augmentations? Error analysis:
○ Train a baseline.
○ Make predictions on your validation data.
○ Inspect the worst-predicted images; they should point you to the problems the model is facing.
○ Examples:
■ Failure with very small objects: Scale augmentation.
■ Failure with different colors / environment: Color augmentations.
■ Failure with rotated images: Rotation augmentations.
■ Failure with blurry images: Noise augmentations.
■ …etc
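
The inspection step above amounts to sorting validation samples by their error. A minimal sketch, assuming a regression-style absolute error and hypothetical file names and scores:

```python
def worst_predictions(sample_ids, y_true, y_pred, top_k=3):
    """Return the top_k samples with the largest absolute error, worst first."""
    records = [
        {"id": sid, "true": t, "pred": p, "error": abs(t - p)}
        for sid, t, p in zip(sample_ids, y_true, y_pred)
    ]
    records.sort(key=lambda record: record["error"], reverse=True)
    return records[:top_k]


# Hypothetical validation set: 1.0 means "object present".
ids = ["img_001.png", "img_002.png", "img_003.png", "img_004.png"]
labels = [1.0, 0.0, 1.0, 1.0]
scores = [0.9, 0.2, 0.1, 0.6]

for record in worst_predictions(ids, labels, scores, top_k=2):
    print(record["id"], record["true"], record["pred"], round(record["error"], 2))
```

Opening the listed files and looking for a shared property (tiny objects, rotation, unusual colours) suggests which augmentation to try next.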
Computer Vision Tasks
● Notes:
○ Smaller models can beat bigger models in many cases.
○ Computer vision models are small. Training is fast compared to NLP.
○ You will need a somewhat larger learning rate for a CNN-based model (e.g. 1e-4, 1e-3) and a small one for a Transformer-based model (e.g. 1e-5, 2e-5).
○ You may need a large number of epochs (e.g. 5-200). It depends on the task.
○ Using the wrong augmentations can decrease performance significantly.
○ The scheduler is based on epochs, not steps.
Computer Vision Tasks
● Examples:
○ Google Research - Identify Contrails to Reduce Global Warming | Kaggle
○ RSNA 2022 Cervical Spine Fracture Detection | Kaggle
○ Vesuvius Challenge - Ink Detection | Kaggle
○ RSNA Screening Mammography Breast Cancer Detection | Kaggle
● Resources:
○ Kaggle Days Paris 2022_Philipp Singer & Yauhen Babakhin_Practical Tips for Deep Transfer Learning - YouTube
Waves
● Best Models:
○ 1D-CNN (e.g. WaveNet, ...)
○ 2D-CNN (e.g. EfficientNet, ...) (What?)

Waves can be converted into an image representation called a spectrogram. Then we fit a CNN on it.
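
To show what "converting a wave into an image" means, here is a naive pure-Python spectrogram: the signal is cut into overlapping frames, each frame is windowed, and the magnitude of its discrete Fourier transform becomes one column of the image. The sample rate, tone frequency and frame sizes are assumptions for the example; real pipelines typically use an FFT-based audio library and often a mel scale.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of the DFT for the non-negative frequency bins of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_size, hop):
    """Return a time x frequency matrix of magnitudes (one row per frame)."""
    # Hann window reduces leakage caused by cutting the signal into frames.
    window = [0.5 * (1.0 - math.cos(2.0 * math.pi * i / (frame_size - 1)))
              for i in range(frame_size)]
    rows = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        rows.append(dft_magnitudes([x * w for x, w in zip(frame, window)]))
    return rows


sample_rate = 64  # samples per second (made up)
tone_hz = 8       # frequency of the test tone
signal = [math.sin(2.0 * math.pi * tone_hz * t / sample_rate) for t in range(128)]

spec = spectrogram(signal, frame_size=32, hop=16)
bin_width_hz = sample_rate / 32
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), "frames x", len(spec[0]), "bins")
print("peak frequency per frame:", [b * bin_width_hz for b in peak_bins])
```

The resulting matrix can be treated like a single-channel image and passed to a 2D CNN.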
Waves
● Fitting on spectrograms is the SOTA.
Waves
● Examples:
○ BirdCLEF 2023: Pytorch Lightning-Training w/ cMAP (kaggle.com)
○ HMS - Harmful Brain Activity Classification | Kaggle
Proteins & DNA Sequences
● Per-character sequences.
● Can be approached using NLP models.
● Sometimes they can be converted to 3D images, then approached with a 3D CNN.
● Can be approached using Graph NNs as well.
Proteins & DNA Sequences
● Examples:
○ Novozymes Enzyme Stability Prediction | Kaggle
○ NeurIPS 2024 - Predict New Medicines with BELKA | Kaggle
○ Stanford Ribonanza RNA Folding | Kaggle
AI & Security
● CTF, but for AI.
● Learn about adversarial attacks, LLM jailbreaking and some random stuff.
● Examples:
○ AI Village Capture the Flag @ DEFCON31 | Kaggle
○ AI Village Capture the Flag @ DEFCON | Kaggle
Final Notes: How do you approach a new competition?
● Start by reading the discussions and understanding the domain knowledge, then summarize everything important.
● Read the baselines, do EDA and choose the best validation.
● If you have time, build your own baseline from scratch, then replicate all the ideas from the best public notebooks in yours. Otherwise, choose a public notebook and build on it.
● Then add your own ideas.
Final Notes: What is the difference between Kaggle competitions and real-world projects?
● Problem Definition.
● Data Availability.
● Model Deployment.
● Evaluation Metrics.
● Resources and time constraints.
Conclusion
● OK, you know the best models. Then what? Do you just apply them to win?
● ⇒ The real work is in the data part.
● Have fun and keep learning!

(Don’t forget to eat and sleep, by the way.)

Thanks for attending!
