DataEngg Day2

The document provides an overview of data engineering fundamentals, including the CAP theorem, various types of NoSQL databases, and data warehousing concepts. It discusses the Extract-Transform-Load (ETL) process, differences between traditional ETL and Hadoop ELT, and the importance of data storage in the data engineering lifecycle. Additionally, it covers OLTP vs OLAP systems and data lake vs data warehouse distinctions.

Uploaded by

Arohi Patile
Copyright: © All Rights Reserved

Fundamentals of Data Engineering

Trainer: Pradnyaa S Dindorkar

Sunbeam Infotech www.sunbeaminfo.com


CAP Theorem
• Consistency - Data is consistent after an operation. After an update operation, all clients see the same data.
• Availability - The system is always on (i.e. service guarantee), with no downtime.
• Partition Tolerance - The system continues to function even when the communication among the servers is unreliable.
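
The trade-off behind these three properties can be sketched in a few lines. This is a toy model, not how any real database works: the Replica class, the write function and the "CP"/"AP" mode names are all hypothetical, and the partition is simulated with a flag. It shows that while the servers cannot talk to each other, a system either refuses the write (stays consistent) or accepts it on one server (stays available).

```python
class Replica:
    """One copy of the data (hypothetical toy class)."""

    def __init__(self):
        self.value = None


def write(replicas, value, partitioned, mode):
    """Write `value` through the first replica.

    partitioned=True means the first replica cannot reach the others.
    mode="CP": refuse the write rather than let replicas diverge.
    mode="AP": accept the write locally and stay available.
    """
    if not partitioned:
        for r in replicas:           # normal case: every replica is updated
            r.value = value
        return True
    if mode == "CP":
        return False                 # consistent, but not available
    replicas[0].value = value        # available, but replicas now disagree
    return True


cp = [Replica(), Replica()]
ap = [Replica(), Replica()]
write(cp, "v1", partitioned=False, mode="CP")
write(ap, "v1", partitioned=False, mode="AP")

# Now the network is partitioned and a client tries to write "v2".
cp_ok = write(cp, "v2", partitioned=True, mode="CP")
ap_ok = write(ap, "v2", partitioned=True, mode="AP")

print(cp_ok, [r.value for r in cp])   # False ['v1', 'v1']
print(ap_ok, [r.value for r in ap])   # True ['v2', 'v1']
```

In the CP case all clients still see the same data but the write is rejected; in the AP case the service answers but two clients may read different values until the replicas are reconciled.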

NoSQL Databases

• Key-value databases - e.g. Redis, DynamoDB, Riak
  • Several are based on the design of Amazon’s Dynamo database.
  • Keys are unique and values can be of any type, e.g. JSON, BLOB, etc.
  • Implemented as a big distributed hash table for fast searching (e.g. name -> key, looked up as a key-value pair).
• Wide Column databases - e.g. HBase, Cassandra, Bigtable, …
  • Values of a column are stored contiguously.
  • Better performance while accessing a few columns and for aggregations (count, sum, avg, min, max), e.g. counting the rows where city name = pune.
  • Good for data warehousing, business intelligence, CRM, ...
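
A minimal sketch of the two ideas above, under simplifying assumptions. The node names, the put/get helpers and the sample rows are hypothetical. Picking a node with hash modulo node count is the simplest possible scheme (real key-value stores typically use more elaborate partitioning), and the column layout is modelled with plain Python lists.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # hypothetical cluster


def node_for(key: str) -> str:
    """Pick the node that owns a key by hashing the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Key-value store: each node holds a plain dict (a hash table).
cluster = {node: {} for node in NODES}


def put(key, value):
    cluster[node_for(key)][key] = value


def get(key):
    return cluster[node_for(key)].get(key)


put("name", "a")
put("city", {"name": "pune"})    # values can be of any type
print(get("name"), get("city"))

# Column-oriented layout: one list per column, so an aggregation
# reads only the column it needs and skips the others.
columns = {
    "roll": [1, 2, 3, 4],
    "city": ["pune", "mumbai", "pune", "pune"],
}
pune_count = columns["city"].count("pune")
print(pune_count)   # 3
```

A lookup hashes the key once and goes straight to one node, which is why searching by key is fast; the count touches only the city column, never the roll column.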

NoSQL Databases

• Graph databases - e.g. Neo4j, Titan, …
  • A graph is a collection of vertices and edges.
  • Excellent performance while dealing with all relations of an entity (irrespective of the size of the data).
• Document-oriented databases - e.g. MongoDB, CouchDB, …
  • A document contains semi-structured data as key-value pairs (fields), in JSON or XML.
  • The document schema is flexible; documents are added to a collection for processing.

Document – flexible schema
RDBMS     Document (NoSQL)
Table     collection
row       document
column    fields
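
The mapping above can be illustrated without any database at all. In this sketch a collection is just a Python list and a document is a dict; the find helper and the sample student fields are hypothetical and only mimic what a document database provides.

```python
import json

# A "collection" is modelled as a plain list of documents (dicts).
students = []

students.append({"roll": 1, "name": "a", "batch": "OM51"})
# Flexible schema: this document has an extra nested field and no batch.
students.append({"roll": 2, "name": "b", "contact": {"city": "pune"}})


def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]


print(json.dumps(find(students, roll=2)))

# Each document carries its own set of fields.
fields_per_doc = [sorted(doc) for doc in students]
print(fields_per_doc)
```

Unlike rows in an RDBMS table, the two documents do not share the same columns, yet both live in the same collection and can be queried together.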

NoSQL Databases

• Search databases – e.g. Elasticsearch, Solr, Lucene, …
  • For faster search – text search, log analysis.
  • Indexed; exact/fuzzy matches, anomaly detection, analytics.
• Time series databases – e.g. InfluxDB, Druid, …
  • Values are organized by time, like stock market, weather, …
  • Optimized for retrieval, statistical processing, …
  • Used for measurement data (weather, …) and event-based data (accidents, …).
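
As a rough illustration of "values organized by time" and statistical processing, the sketch below averages measurements per hour. The readings are made-up temperature values and hourly_mean is a hypothetical helper; a real time series database does this kind of grouping internally and far more efficiently.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical temperature readings: (timestamp, value).
readings = [
    (datetime(2024, 1, 1, 9, 5), 20.0),
    (datetime(2024, 1, 1, 9, 40), 22.0),
    (datetime(2024, 1, 1, 10, 15), 25.0),
    (datetime(2024, 1, 1, 10, 45), 27.0),
]


def hourly_mean(points):
    """Group points by the hour they fall in and average each group."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}


result = hourly_mean(readings)
for hour, avg in result.items():
    print(hour.isoformat(), avg)
```

The two readings in the 09:00 hour average to 21.0 and the two in the 10:00 hour to 26.0.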

Data warehousing
• A data warehouse is a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
• Data warehousing is the process of transforming data into information and making it available to users in a timely enough manner to make a difference.
Extract – Transform – Load

• Extracting: Extract data from sources into a staging area.
• Conditioning: Convert data types to fit the warehouse.
• Householding: Group similar data.
• Enrichment: Add relevant data from external sources.
• Scoring: Compute the probability of an event.
• Scrubbing: Clean the data: find duplicates and missing data.
• Merging: Merge data from various sources.
• De-normalize: Duplicate data to reduce joins.
• Loading: Load data into warehouse models like Star, Snowflake, Galaxy.
• Delta Updating: Upload data incrementally.
• Partitioning: Divide the data into logical parts to improve performance.
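
A few of the steps above (scrubbing, conditioning and delta updating) can be sketched on in-memory rows. Everything here is an assumption made for illustration: the staging rows, the field names and the three helper functions are hypothetical, and a real pipeline would read from source systems and write to warehouse tables.

```python
# Hypothetical rows extracted from a source into a staging list.
staging = [
    {"roll": "1", "name": "a", "marks": "81"},
    {"roll": "2", "name": "b", "marks": ""},      # missing value
    {"roll": "1", "name": "a", "marks": "81"},    # duplicate
    {"roll": "3", "name": "c", "marks": "67"},
]


def scrub(rows):
    """Scrubbing: drop duplicates and rows with missing marks."""
    seen, clean = set(), []
    for row in rows:
        if row["roll"] in seen or row["marks"] == "":
            continue
        seen.add(row["roll"])
        clean.append(row)
    return clean


def condition(rows):
    """Conditioning: convert strings to the types the warehouse expects."""
    return [{"roll": int(r["roll"]), "name": r["name"],
             "marks": int(r["marks"])} for r in rows]


def load_delta(warehouse, rows):
    """Delta updating: append only rows whose roll is not loaded yet."""
    loaded = {r["roll"] for r in warehouse}
    new_rows = [r for r in rows if r["roll"] not in loaded]
    warehouse.extend(new_rows)
    return len(new_rows)


warehouse = [{"roll": 1, "name": "a", "marks": 81}]   # loaded earlier
added = load_delta(warehouse, condition(scrub(staging)))
print(added, warehouse)
```

Only roll 3 is added: roll 2 is dropped for missing data, the repeated roll 1 is dropped as a duplicate, and the surviving roll 1 is already in the warehouse.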

De-normalize: Duplicate data to reduce joins.

Normalized (Students join Batches)

Batches
id  batch-n
1   OM51
2   PM31
3   PM32
4   OC06

Students
roll  name  batchid
1     a     1
2     b     2
3     c     1
4     d     2
5     e     1
6     f     3
7     g     3

De-normalized

Students
roll  name  batch
1     a     OM51
2     b     PM31
3     c     OM51
4     d     PM31
5     e     OM51
6     f     PM32
7     g     PM32
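
The same example can be reproduced in a few lines, using the Batches and Students rows from this slide. Holding the tables as a dict and a list of tuples is an assumption made for brevity.

```python
batches = {1: "OM51", 2: "PM31", 3: "PM32", 4: "OC06"}   # id -> batch name

students = [   # normalized: (roll, name, batchid)
    (1, "a", 1), (2, "b", 2), (3, "c", 1), (4, "d", 2),
    (5, "e", 1), (6, "f", 3), (7, "g", 3),
]

# De-normalize: resolve the join once and store the batch name in each row.
denormalized = [(roll, name, batches[batch_id])
                for roll, name, batch_id in students]

for row in denormalized:
    print(*row)

# Queries on the de-normalized rows no longer need the Batches table.
om51 = [name for _, name, batch in denormalized if batch == "OM51"]
print(om51)   # ['a', 'c', 'e']
```

The price is duplication: the batch name OM51 is now stored three times instead of once.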

DWH Schemas

• A DWH schema is how data is stored in tables in the warehouse for efficient processing of the data.
• A fact table stores metrics, measurements, or facts about business processes (numbers).
• Dimension tables are used to store data attributes or dimensions (numbers and strings, e.g. city, state, cn, pin, region).
• Star schema: A single fact table and a few dimension tables (de-normalized) – simple design.
• Snowflake schema: A single fact table and connected dimension/sub-dimension tables (normalized).
• Galaxy or Fact-Constellation schema: Multiple fact tables (fact-1, fact-2, …) mapped to multiple dimension/sub-dimension tables.
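
A minimal star schema can be tried out with Python's built-in sqlite3 module. The table names (fact_sales, dim_location), the columns and the rows are hypothetical; SQLite stands in for a real warehouse only so that the sketch runs anywhere.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension table: descriptive attributes (hypothetical columns).
    CREATE TABLE dim_location (
        location_id INTEGER PRIMARY KEY,
        city TEXT,
        state TEXT
    );
    -- Fact table: numeric measures plus a key into each dimension.
    CREATE TABLE fact_sales (
        location_id INTEGER REFERENCES dim_location(location_id),
        amount REAL
    );
    INSERT INTO dim_location VALUES (1, 'pune', 'MH'), (2, 'mumbai', 'MH');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 30.0);
""")

# A typical star-schema query: join the fact table to a dimension
# and aggregate the measure per dimension attribute.
totals = con.execute("""
    SELECT d.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_location AS d ON d.location_id = f.location_id
    GROUP BY d.city
    ORDER BY d.city
""").fetchall()
print(totals)   # [('mumbai', 30.0), ('pune', 150.0)]
```

In a snowflake schema the state column would move out of dim_location into its own sub-dimension table, at the cost of one more join per query.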

OLTP (Database) vs OLAP (Data warehouse)
                OLTP (Database)                           OLAP (Data warehouse)
Full form       Online Transaction Processing             Online Analytical Processing
Modeled to      run the business                          analyze/optimize the business
Data            Detailed/transactional real-time data     Summarized/refined snapshot data
                (normalized)                              (redundant)
Optimized for   Transaction performance                   Analytical query performance
Operations      Read/Write operations (DML, DQL)          Mostly Read operations (DQL: sum, count, avg)
Scope           Isolated data (application specific)      Integrated data (from all sources)
Size            Limited data (100 MB to 100 GB)           Huge data (100 GB to a few TB)
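
The difference in workload can be seen even on one small table. This sketch uses Python's built-in sqlite3 module with a hypothetical orders table; in practice the two workloads run on separate systems, and a single in-memory database is used here only to keep the example self-contained.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
)

# OLTP style: short read/write transactions on individual rows (DML).
with con:   # commits on success, rolls back on error
    con.execute("INSERT INTO orders VALUES (1, 'pune', 100.0)")
    con.execute("INSERT INTO orders VALUES (2, 'pune', 40.0)")
    con.execute("INSERT INTO orders VALUES (3, 'mumbai', 70.0)")
    con.execute("UPDATE orders SET amount = 50.0 WHERE id = 2")

# OLAP style: read-mostly queries that summarize many rows (DQL).
summary = con.execute("""
    SELECT city, COUNT(*), SUM(amount), AVG(amount)
    FROM orders
    GROUP BY city
    ORDER BY city
""").fetchall()
print(summary)   # [('mumbai', 1, 70.0, 70.0), ('pune', 2, 150.0, 75.0)]
```

The transactional statements each touch one row and must be fast and safe; the analytical query scans every row but only reads.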
Data lake vs Data warehouse vs Data mart
[Figure: Extract from RDBMS, NoSQL and 3rd-party sources; raw data (like a staging area); Transform; star, snowflake and galaxy schemas; department-level DWH. Annotations on the figure: 18% IND @ 20 Oct, 21% USA @ 20 Dec, 11% Africa.]

Data engineering

• Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.
• A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.
• Sources shown on the slide: RDBMS, NoSQL, sensor, mobile, social media.

Traditional ETL vs Hadoop ELT

Traditional ETL
• ETL stands for Extract, Transform and Load.
• The ETL process typically extracts data from the source/transactional systems, transforms it to fit the model of the data warehouse and finally loads it into the data warehouse.
• The transformation process involves cleansing, enriching and applying transformations to create the desired output.
• Data is usually dumped to a staging area after extraction.

Hadoop ELT
• ELT stands for Extract, Load and Transform.
• As opposed to loading just the transformed data into the target systems, the ELT process loads the entire data into the data lake. This results in faster load times.
• The load process can also perform some basic validations and data cleansing rules.
• The data is then transformed for analytical reporting as per demand.
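
The only difference between the two approaches is the order of the last two steps, which a short sketch makes visible. The raw records and the transform function are hypothetical, and plain lists stand in for the data warehouse and the data lake.

```python
source = [" Pune ", "MUMBAI", "pune", ""]   # hypothetical raw records


def transform(records):
    """Cleanse: trim, lower-case, and drop empty records."""
    return [r.strip().lower() for r in records if r.strip()]


# ETL: transform first, then load only the transformed data.
warehouse = []
warehouse.extend(transform(source))

# ELT: load everything as-is, transform later when a report needs it.
data_lake = []
data_lake.extend(source)                # fast load, raw data is kept
report_input = transform(data_lake)     # transformed on demand

print(warehouse)      # ['pune', 'mumbai', 'pune']
print(data_lake)      # [' Pune ', 'MUMBAI', 'pune', '']
print(report_input)   # ['pune', 'mumbai', 'pune']
```

Both paths end with the same cleaned data, but the data lake still holds the original records, so a different transformation can be applied later without extracting again.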

Data storage

• Data storage is related to multiple stages in the data engineering life cycle, i.e. ingestion, transformation and serving.
• Storage needs to be selected based on factors such as read/write requirements, speed, durability, consistency, availability, scalability, fault tolerance, …
• Storage tradeoffs
  • Local storage vs Distributed storage
  • Strong consistency vs Eventual consistency
• Storage options are: File storage, Local disk storage, Network attached storage (NAS), Cloud file systems (S3/Blob), Block storage, RAID, Storage area network (SAN), Object storage, HDFS (Hadoop), Streaming storage.

Q: Which of the following is not a type of NoSQL database?
A: Graph
B: Document
C: Text
D: Search

Q: Which of the following is a type of NoSQL database?
A: Row
B: Column
C: Collection
D: Table

Q: Which of the following is not a valid data warehouse schema?
A: Star
B: Snowfall
C: Snowflake
D: Galaxy

Q: Which of the following works on real-time data?
A: OLAP
B: OLTP
C: OLPP
D: OTLP

Q: Analytical queries are performed in ____________.
A: OLAP
B: OLTP
C: OLPP
D: OTLP

Q: In the data engineering life cycle, pulling data is known as ________ data.
A: Popping
B: Ingesting
C: Serving
D: Analysing

Thank you!
Pradnyaa S Dindorkar <[email protected]>
