HOW IS BIG DATA HANDLED IN
KAGGLE?
17CP006-LEENANCI PARMAR
17CP012-DHRUVI LAD
INTRODUCTION TO KAGGLE
• Kaggle is a crowdsourced data analysis competition platform.
• Businesses bring their data problems, and Kaggle hosts them as competitions.
• Scientists and programmers compete to come up with the best solution.
• In 2017, Google acquired Kaggle.
HOW KAGGLE WORKS?
• Kaggle prepares the data and a description of the problem.
• Participants compete against each other to produce the best models. Work is
shared publicly through Kaggle Kernels.
• Submissions are made through Kaggle Kernels, via manual upload, or using
the Kaggle API.
• Thus, a main function of Kaggle is to make data publicly available.
HOW KAGGLE WORKS? (CONTINUED...)
• Kaggle supports a variety of dataset publication formats.
• CSV, JSON, SQLite
• In addition to these formats, Kaggle also supports BigQuery.
WHAT IS BIGQUERY?
• BigQuery is a cloud-based data warehouse from Google that lets users query
and analyze large amounts of read-only data. Using a SQL-like syntax,
BigQuery runs queries on billions of rows of data in a matter of seconds.
• It is an iPaaS (integration platform as a service) that supports any combination
of on-premises data, cloud data, and application integration scenarios.
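BigQuery's SQL-like syntax can be illustrated with a short query against a public dataset (a sketch: `bigquery-public-data.hacker_news.full` is a real public table, but running the query requires a BigQuery client and credentials, so here we only construct the query text):

```python
# Hedged illustration of BigQuery's Standard SQL syntax.
# The query is only built as a string here, not executed.
query = """
SELECT type, COUNT(*) AS n
FROM `bigquery-public-data.hacker_news.full`
GROUP BY type
ORDER BY n DESC
"""
print("SELECT" in query)  # True
```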
FEATURES OF BIGQUERY
• The main component of BigQuery is the Dremel query engine.
• Google had huge amounts of unstructured data, such as images, videos, log files,
and books.
• All of this data needed to be queried, and MapReduce was designed for that purpose.
• However, its batch-processing approach made it less than ideal for instant
querying.
• Dremel, on the other hand, was able to perform interactive querying on
billions of records in seconds.
ARCHITECTURE OF
BIGQUERY
DREMEL’S FEATURES AND
CHARACTERISTICS:
• Tree architecture:
• It uses tree architecture, which means that it treats a query as an
execution tree.
• Execution trees break an SQL query into pieces and then reassemble the
results for faster performance. Slots (or leaves) read billions of rows of
data and perform computations on them while the mixers (or branches)
aggregate the results.
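The split-and-reassemble idea behind the execution tree can be sketched in plain Python (a toy illustration, not the real engine: each "slot" scans one shard of the data, and a "mixer" aggregates the partial results):

```python
# Toy sketch of tree-style execution: leaves compute partial results,
# a branch (mixer) aggregates them.
data = list(range(1, 101))          # pretend these are billions of rows
shards = [data[i::4] for i in range(4)]  # split the scan across 4 slots

partials = [sum(shard) for shard in shards]  # leaves work independently
total = sum(partials)                        # mixer aggregates the results
print(total)  # 5050
```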
• Columnar databases:
• Another reason for its incredibly fast performance is its use of a columnar data
storage format instead of the traditional row-based storage.
• Columnar databases allow for better compression due to the homogenous nature
of data stored within columns. In this design, only the required columns are pulled
out, making it an ideal choice for huge databases with billions of rows.
• Data sorting and aggregation operations are also easier with columnar databases
when compared to relational databases. This makes columnar databases more
suitable for intensive data analysis and the parallel processing approach employed
in Dremel’s tree architecture.
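The benefit of pulling out only the required columns can be sketched with a toy example (illustrative only, not BigQuery internals: the same records laid out row-wise and column-wise):

```python
# Row-oriented layout: one dict per record.
rows = [
    {"id": 1, "country": "IN", "amount": 10},
    {"id": 2, "country": "US", "amount": 25},
    {"id": 3, "country": "IN", "amount": 5},
]

# Column-oriented layout: one homogeneous list per column.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# A query like SELECT SUM(amount) only has to scan the "amount" column,
# not every field of every row.
total = sum(columns["amount"])
print(total)  # 40
```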
• Nested data storage:
• Join-based queries can be time-consuming in normalized databases,
and this challenge only gets worse in large databases.
• So Dremel opts for a different approach and permits the storage
of nested or repeated data using the data type — RECORD.
• This feature gives Dremel the capability to maintain relationships
between data inside a table. Nested data can be loaded from JSON files
or other source formats into tables.
• Columnar and nested data storage are ideal for querying semi-
structured and unstructured data, which constitute an important part
of the big data universe.
• Repetition level: the level of nesting in the field path at which the repetition occurs.
• Definition level: how many optional/repeated fields in the field path have been defined.
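Nested (RECORD-like) data loaded from JSON can be pictured as follows (a sketch; the field names are illustrative, not from a real BigQuery schema):

```python
import json

# A single row holding a repeated nested field, as loaded from JSON.
record = json.loads("""
{"name": "Alice",
 "addresses": [{"city": "Pune", "zip": "411001"},
               {"city": "Surat", "zip": "395003"}]}
""")

# The repeated "addresses" field keeps the one-to-many relationship
# inside one row, avoiding a join with a separate address table.
cities = [a["city"] for a in record["addresses"]]
print(cities)  # ['Pune', 'Surat']
```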
IMPLEMENTATION IN KAGGLE:
• To load a dataset, you first need to generate a dataset reference that points
BigQuery to it.
• When working with BigQuery from Kaggle, the project name is always
bigquery-public-data.
• The method client.dataset is named as if it returns a dataset, but it actually
gives us a dataset reference.
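A minimal sketch of this workflow, assuming the Kaggle kernel environment (where google-cloud-bigquery is pre-installed and credentials are provided automatically; hacker_news is used as an example public dataset):

```python
# The fully-qualified path a dataset reference points to is just
# "<project>.<dataset>"; we build it locally for illustration.
PROJECT = "bigquery-public-data"
DATASET = "hacker_news"  # example public dataset

def dataset_path(project, dataset):
    """Build the fully-qualified path a dataset reference points to."""
    return f"{project}.{dataset}"

# Inside a Kaggle kernel you would then run (not executed here):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   ref = client.dataset(DATASET, project=PROJECT)  # a reference, not the data
#   dataset = client.get_dataset(ref)               # fetches the actual dataset
print(dataset_path(PROJECT, DATASET))  # bigquery-public-data.hacker_news
```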
COMPARISON BETWEEN THE TWO
BigQuery:
• Query service for large datasets
• Ad hoc, trial-and-error interactive querying of large datasets for quick
analysis and troubleshooting
• Very fast response (seconds)

MapReduce:
• Programming model for processing large datasets
• Batch processing of large datasets for time-consuming data conversion or
aggregation
• Not very fast (takes minutes to days)
WHY BIGQUERY?
• Analyzing data becomes a faster process with BigQuery, even on really large
datasets. BigQuery also has several tiers to support scalable massive data
storage and query processing.
• This turns the user’s workflow into a more seamless process instead of the
previous fragmented practice, where data storage, querying, cleaning, and
analysis would take place across several tools and platforms.
• BigQuery ML is a set of extensions to the SQL language that lets users easily
create, train, and evaluate machine learning models, and assess their predictive
performance, in minutes.
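BigQuery ML training is expressed directly in SQL with a CREATE MODEL statement (a hedged sketch: the dataset, table, and column names below are illustrative, and the statement is only constructed here, not run):

```python
# BigQuery ML extends SQL with CREATE MODEL; model_type='logistic_reg'
# is one of the supported model types. Names are placeholders.
create_model = """
CREATE MODEL `my_dataset.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT label, feature_1, feature_2
FROM `my_dataset.training_table`
"""
print("CREATE MODEL" in create_model)  # True
```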
REFERENCES:
• https://www.kaggle.com/dansbecker/getting-started-with-sql-and-bigquery
• https://towardsdatascience.com/want-to-use-bigquery-read-this-fab36822830
• https://www.kaggle.com/docs/datasets