HOW IS BIG DATA HANDLED IN
KAGGLE?
17CP006-LEENANCI PARMAR
17CP012-DHRUVI LAD
INTRODUCTION TO KAGGLE
• Kaggle is a crowdsourced data analysis competition platform.
• Businesses bring their data problems, and Kaggle hosts them as competitions.
• Scientists and programmers compete to come up with the best solution.
• In 2017, Google acquired Kaggle.
HOW KAGGLE WORKS?
• Kaggle prepares the data and a description of the problem.
• Participants compete against each other to produce the best models. Work is
shared publicly through Kaggle Kernels.
• Submissions are made through Kaggle Kernels, via manual upload, or using
the Kaggle API.
• Thus, a main function of Kaggle is to make data publicly available.
HOW KAGGLE WORKS? (CONTINUED...)
• Kaggle supports a variety of dataset publication formats.
• CSV, JSON, SQLite
• In addition to these formats, Kaggle also supports BigQuery.
WHAT IS BIGQUERY?
• BigQuery is a cloud-based data warehouse from Google that lets users query
and analyze large amounts of read-only data. Using a SQL-like syntax,
BigQuery runs queries on billions of rows of data in a matter of seconds.
• It is an iPaaS (integration platform as a service) that supports any combination
of on-premises data, cloud data, and application integration scenarios.
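BigQuery's SQL-like syntax can be illustrated with a short query against a public dataset (a sketch: `bigquery-public-data.hacker_news.full` is a real public table, but running the query requires a BigQuery client and credentials, so here we only construct the query text):

```python
# Hedged illustration of BigQuery's Standard SQL syntax.
# The query is only built as a string here, not executed.
query = """
SELECT type, COUNT(*) AS n
FROM `bigquery-public-data.hacker_news.full`
GROUP BY type
ORDER BY n DESC
"""
print("SELECT" in query)  # True
```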
FEATURES OF BIGQUERY
• The main component of BigQuery is the Dremel query engine.
• Google had huge amounts of unstructured data, such as images, videos, log files,
and books.
• All of this data needed to be queried, and MapReduce was designed for that purpose.
• However, its batch-processing approach made it less than ideal for instant
querying.
• Dremel, on the other hand, was able to perform interactive querying on
billions of records in seconds.
ARCHITECTURE OF
BIGQUERY
DREMEL’S FEATURES AND
CHARACTERISTICS:
• Tree architecture:
• It uses tree architecture, which means that it treats a query as an
execution tree.
• Execution trees break an SQL query into pieces and then reassemble the
results for faster performance. Slots (or leaves) read billions of rows of
data and perform computations on them while the mixers (or branches)
aggregate the results.
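The split-and-reassemble idea behind the execution tree can be sketched in plain Python (a toy illustration, not the real engine: each "slot" scans one shard of the data, and a "mixer" aggregates the partial results):

```python
# Toy sketch of tree-style execution: leaves compute partial results,
# a branch (mixer) aggregates them.
data = list(range(1, 101))          # pretend these are billions of rows
shards = [data[i::4] for i in range(4)]  # split the scan across 4 slots

partials = [sum(shard) for shard in shards]  # leaves work independently
total = sum(partials)                        # mixer aggregates the results
print(total)  # 5050
```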
• Columnar databases:
• Another reason for its incredibly fast performance is its use of a columnar data
storage format instead of the traditional row-based storage.
• Columnar databases allow for better compression due to the homogenous nature
of data stored within columns. In this design, only the required columns are pulled
out, making it an ideal choice for huge databases with billions of rows.
• Data sorting and aggregation operations are also easier with columnar databases
when compared to relational databases. This makes columnar databases more
suitable for intensive data analysis and the parallel processing approach employed
in Dremel’s tree architecture.
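The benefit of pulling out only the required columns can be sketched with a toy example (illustrative only, not BigQuery internals: the same records laid out row-wise and column-wise):

```python
# Row-oriented layout: one dict per record.
rows = [
    {"id": 1, "country": "IN", "amount": 10},
    {"id": 2, "country": "US", "amount": 25},
    {"id": 3, "country": "IN", "amount": 5},
]

# Column-oriented layout: one homogeneous list per column.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# A query like SELECT SUM(amount) only has to scan the "amount" column,
# not every field of every row.
total = sum(columns["amount"])
print(total)  # 40
```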
• Nested data storage:
• Join-based queries can be time-consuming in normalized databases,
and this challenge only gets worse in large databases.
• So Dremel opts for a different approach and permits the storage
of nested or repeated data using the data type — RECORD.
• This feature gives Dremel the capability to maintain relationships
between data inside a table. Nested data can be loaded from JSON files
or other source formats into tables.
• Columnar and nested data storage are ideal for querying semi-
structured and unstructured data, which constitute an important part
of the big data universe.
• Repetition level: the level of nesting in the field path at which the repetition occurs.
• Definition level: how many optional/repeated fields in the field path have been defined.
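Nested (RECORD-like) data loaded from JSON can be pictured as follows (a sketch; the field names are illustrative, not from a real BigQuery schema):

```python
import json

# A single row holding a repeated nested field, as loaded from JSON.
record = json.loads("""
{"name": "Alice",
 "addresses": [{"city": "Pune", "zip": "411001"},
               {"city": "Surat", "zip": "395003"}]}
""")

# The repeated "addresses" field keeps the one-to-many relationship
# inside one row, avoiding a join with a separate address table.
cities = [a["city"] for a in record["addresses"]]
print(cities)  # ['Pune', 'Surat']
```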
IMPLEMENTATION IN KAGGLE:
• To load a dataset, you first need to generate a dataset reference that points
BigQuery to it.
• When working with BigQuery from Kaggle, the project name is always
bigquery-public-data.
• The method client.dataset is named as if it returns a dataset, but it actually
gives us a dataset reference.
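A minimal sketch of this workflow, assuming the Kaggle kernel environment (where google-cloud-bigquery is pre-installed and credentials are provided automatically; hacker_news is used as an example public dataset):

```python
# The fully-qualified path a dataset reference points to is just
# "<project>.<dataset>"; we build it locally for illustration.
PROJECT = "bigquery-public-data"
DATASET = "hacker_news"  # example public dataset

def dataset_path(project, dataset):
    """Build the fully-qualified path a dataset reference points to."""
    return f"{project}.{dataset}"

# Inside a Kaggle kernel you would then run (not executed here):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   ref = client.dataset(DATASET, project=PROJECT)  # a reference, not the data
#   dataset = client.get_dataset(ref)               # fetches the actual dataset
print(dataset_path(PROJECT, DATASET))  # bigquery-public-data.hacker_news
```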
COMPARISON BETWEEN THE TWO
BigQuery:
• Query service for large datasets
• Ad hoc, trial-and-error interactive querying of large datasets for quick
analysis and troubleshooting
• Very fast response (seconds)

MapReduce:
• Programming model for processing large datasets
• Batch processing of large datasets for time-consuming data conversion or
aggregation
• Not very fast (takes minutes to days)
WHY BIGQUERY?
• Analyzing data becomes a faster process with BigQuery, even on really large
datasets. BigQuery also has several tiers to support scalable massive data
storage and query processing.
• This turns the user’s workflow into a more seamless process instead of the
previous fragmented practice, where data storage, querying, cleaning, and
analysis would take place across several tools and platforms.
• BigQuery ML is a set of extensions to the SQL language that lets users easily
create, train, and evaluate machine learning models, and assess their predictive
performance, in minutes.
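BigQuery ML training is expressed directly in SQL with a CREATE MODEL statement (a hedged sketch: the dataset, table, and column names below are illustrative, and the statement is only constructed here, not run):

```python
# BigQuery ML extends SQL with CREATE MODEL; model_type='logistic_reg'
# is one of the supported model types. Names are placeholders.
create_model = """
CREATE MODEL `my_dataset.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT label, feature_1, feature_2
FROM `my_dataset.training_table`
"""
print("CREATE MODEL" in create_model)  # True
```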
REFERENCES:
• https://www.kaggle.com/dansbecker/getting-started-with-sql-and-bigquery
• https://towardsdatascience.com/want-to-use-bigquery-read-this-fab36822830
• https://www.kaggle.com/docs/datasets