Week2 - Data - Formats 3

The document discusses various data formats used in the data engineering lifecycle, emphasizing the impact of format choice on storage efficiency and data exchange. It covers the creation of analog and digital data, and details several common data formats such as CSV, XML, and JSON, along with their pros and cons. Additionally, it introduces tools and services from AWS that facilitate data handling and migration.

Data Formats and Computation

Data Formats in the Data Engineering Lifecycle
• Often, we have no or weak control over the data source format.
• Data formats can be changed during the ingestion phase.
• The decision on data formats affects storage efficiency, exchangeability, etc.
How is data created?
• Analog data
  • Creation occurs in the real world, e.g. vocal speech, sign language, writing on paper, playing an instrument.
  • Transient: if it is not recorded, it is gone.
• Digital data
  • Conversion from analog data, e.g. speech to text, digital photos.
  • Native product of a digital system, e.g. credit card transactions, online ordering.
AWS services for data ingestion and migration
• AWS Data Exchange: makes it easy for AWS customers to find, subscribe to, and use third-party data in the AWS Cloud.
• Amazon AppFlow: automates data flow between software-as-a-service (SaaS) applications and AWS services.
• AWS Fargate: a serverless compute engine for hosting Docker containers without having to provision, manage, and scale servers.
• AWS Transfer for SFTP: enables the transfer of files to and from Amazon S3 and on-premises servers using the Secure File Transfer Protocol (SFTP).
• Amazon Kinesis Data Firehose: a fully managed service for delivering real-time streaming data to destinations such as S3, Redshift, and Elasticsearch Service.
• AWS DataSync: automates and accelerates moving data between on-premises storage and AWS storage services.
• AWS Database Migration Service: a service provided by AWS that helps users migrate data from one location to another.
• A file is a collection of data stored on a computer or other electronic device. It can contain text, images, videos, audio, or other types of information, and is typically saved with a specific file format, such as .txt, .jpg, .mp4, etc. At the lowest level, a file is a sequence of 0s and 1s.
• Binary files typically contain a sequence of bytes, or ordered groupings of eight bits. When creating a custom file format for a program, a developer arranges these bytes into a format that stores the necessary information for the application.
• Text files are more restrictive than binary files since they can only contain textual data.

Common Data Formats
• CSV / TSV
• XML
• JSON
• Avro
• Parquet
Text-based Formats

CSV / TSV (Comma- / Tab-Separated Values)
Pros
• Among the most ubiquitous file formats, supported by many applications and connectors.
• Easy to read and modify.
• Excellent compression ratio.
Cons
• No support for null values: there is no good value to represent NULL, so a null and an empty string are indistinguishable.
• No support for native binary data.
• Any entry has the potential to break the file.
• Poor support for structured metadata.
Ecosystems
• Supported by a wide range of applications.
• One of the most popular formats because of its simplicity.
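The null-value limitation above can be shown with Python's standard csv module. This is a minimal sketch on hypothetical data: a field left empty and a field written as an explicit empty string produce the same parsed result, so the null/empty distinction is lost.

```python
import csv
import io

# Hypothetical CSV: row 1 leaves "nickname" empty (intended as null),
# row 2 writes it as an explicit empty string "".
raw = 'name,nickname\r\nAda,\r\nGrace,""\r\n'

rows = list(csv.DictReader(io.StringIO(raw)))

# Both come back as '' -- CSV cannot distinguish null from empty string.
print(rows[0]["nickname"] == rows[1]["nickname"])  # True
```

Formats with typed values (JSON, Avro) avoid this ambiguity by having a distinct null type.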
Load CSVs leveraging the processing engines

CSVs tend to be natively supported for loading data into databases. The slide's example used Postgres (the example itself did not survive extraction).
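Since the Postgres example is missing, here is a comparable sketch using Python's stdlib sqlite3 in place of Postgres. The table and column names are hypothetical; the point is the pattern of bulk-loading parsed CSV rows into a database engine.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a file on disk.
raw = "id,quantity\n1,3\n2,5\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, quantity INTEGER)")

reader = csv.reader(io.StringIO(raw))
next(reader)  # skip the header row
conn.executemany("INSERT INTO transactions VALUES (?, ?)", reader)

total = conn.execute("SELECT SUM(quantity) FROM transactions").fetchone()[0]
print(total)  # 8
```

With Postgres the same bulk load is typically done server-side (its COPY facility) rather than row by row, which is much faster for large files.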
Load CSVs leveraging connectors

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
XML (eXtensible Markup Language)
• Built to separate data from HTML.
Pros
• Flexible schema, allowing the embedding of complex types; explicitly designed to support Unicode.
• Schema validation, performed using a DTD (Document Type Definition).
• Some specific parsers enable a form of streaming, e.g. SAX.
Cons
• Very verbose, which can lead to larger file sizes.
• Not splittable.
Ecosystems
• A large historical presence in companies, which tends to decrease over time.
How does XML handle binary data?
• Base64 encoding: binary data can be converted to a Base64 string, which is a text-safe encoding method. The Base64-encoded string can then be included in an XML document. This is the most common approach.
• Hexadecimal encoding: another option is to represent binary data as a hexadecimal string. However, this is less space-efficient than Base64 and is less commonly used.
• External references: instead of embedding binary data, you can store it in a separate file and use an XML tag to reference the file path or URI.
How does XML handle binary data?
• Base64 encoding: binary data can be converted to a Base64 string, which is a text-safe encoding method. The Base64-encoded string can then be included in an XML document. This is the most common approach.

<file>
  <name>example.jpg</name>
  <data>
    iVBORw0KGgoAAAANSUhEUgAAAAUA...
  </data>
</file>
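The Base64 round trip above can be sketched with Python's stdlib base64 and xml.etree.ElementTree. The element names mirror the slide's example; the payload bytes are made up for illustration.

```python
import base64
import xml.etree.ElementTree as ET

payload = b"\x00\x01\x02hello"  # hypothetical binary data

# Encode the bytes as a text-safe Base64 string and embed it in XML.
file_el = ET.Element("file")
ET.SubElement(file_el, "name").text = "example.jpg"
ET.SubElement(file_el, "data").text = base64.b64encode(payload).decode("ascii")
xml_bytes = ET.tostring(file_el)

# Round trip: parse the XML and decode back to the original bytes.
parsed = ET.fromstring(xml_bytes)
recovered = base64.b64decode(parsed.find("data").text)
print(recovered == payload)  # True
```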
How does XML handle binary data?
• Hexadecimal encoding: another option is to represent binary data as a hexadecimal string. However, this is less space-efficient than Base64 and is less commonly used.

<file>
  <name>example.txt</name>
  <data>48656C6C6F</data>
</file>

48656C6C6F is the hexadecimal representation of the ASCII bytes of "Hello".
How does XML handle binary data?
• External references: instead of embedding binary data, you can store it in a separate file and use an XML tag to reference the file path or URI.

<file>
  <name>example.jpg</name>
  <path>/path/to/example.jpg</path>
</file>

A drawback: it is hard to enforce that the referenced path and the binary data stay consistent.
Parsing XMLs using Python (ElementTree and lxml)
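The slide's parsing example did not survive extraction; here is a minimal sketch using the stdlib xml.etree.ElementTree on a made-up document (lxml, a third-party package, offers a largely compatible API with more features).

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document for illustration.
xml_doc = """
<catalog>
  <book id="b1"><title>Data Formats</title><price>25.00</price></book>
  <book id="b2"><title>Computation</title><price>30.00</price></book>
</catalog>
"""

root = ET.fromstring(xml_doc)
# Walk the tree, reading attributes and child-element text.
books = [(b.get("id"), b.find("title").text, float(b.find("price").text))
         for b in root.findall("book")]
print(books)  # [('b1', 'Data Formats', 25.0), ('b2', 'Computation', 30.0)]
```

ElementTree loads the whole document into memory; for very large XML files, a streaming parser such as SAX (mentioned earlier) avoids that cost.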
Working with XML using SOAP
• SOAP: Simple Object Access Protocol. Requests and responses are XML envelopes exchanged over HTTP/HTTPS.

Request:

<?xml version="1.0"?>
<soap:Envelope
  xmlns:soap="http://www.w3.org/2003/05/soap-envelope/"
  soap:encodingStyle="http://www.w3.org/2003/05/soap-encoding">
  <soap:Body>
    <m:GetPrice xmlns:m="https://www.w3schools.com/prices">
      <m:Item>Apples</m:Item>
    </m:GetPrice>
  </soap:Body>
</soap:Envelope>

Response:

<?xml version="1.0"?>
<soap:Envelope
  xmlns:soap="http://www.w3.org/2003/05/soap-envelope/"
  soap:encodingStyle="http://www.w3.org/2003/05/soap-encoding">
  <soap:Body>
    <m:GetPriceResponse xmlns:m="https://www.w3schools.com/prices">
      <m:Price>1.90</m:Price>
    </m:GetPriceResponse>
  </soap:Body>
</soap:Envelope>
JSON (JavaScript Object Notation)
A text format similar to XML, but in a much simpler and more compact form.
Pros
• Simple syntax; can be opened by any text editor.
• Flexible schema, supporting several types: string, number, object, array, Boolean, null.
• A binary variation (BSON) supports other native types, such as Datetime.
• Compressible.
Cons
• Not splittable.
• Limited metadata support.
Ecosystems
• Privileged format for web applications.
• Widely used by NoSQL DBs such as MongoDB, Couchbase, etc.
JSON (JavaScript Object Notation)
• Arrays: lists represented by square brackets, with values separated by commas. They can contain mixed data types, i.e., a single array can hold strings, Booleans, and numbers.
  • E.g.: [1, 2, 7.8, 5, 9, 10]; ["red", "yellow", "green"]; [8, "hello", null, true]
• Objects: JSON dictionaries enclosed in curly brackets. In objects, keys and values are separated by a colon ':' and pairs are separated by commas. Keys must be strings; values can be of any JSON type.
  • E.g.: {"red" : 1, "yellow" : 2, "green" : 3}
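The array and object examples above map directly onto Python's stdlib json module, which is why JSON feels so much like a Python dictionary. A minimal sketch using the slide's own literals:

```python
import json

# Arrays can mix types; JSON null maps to Python's None.
arr = json.loads('[8, "hello", null, true]')
print(arr)  # [8, 'hello', None, True]

# Object keys are strings; values may be any JSON type.
obj = json.loads('{"red" : 1, "yellow" : 2, "green" : 3}')
print(obj["green"])  # 3

# Serializing back to text round-trips the same structure.
print(json.loads(json.dumps(obj)) == obj)  # True
```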
XML vs. JSON
Binary Formats

What is Serialization?
• Serialization is the process of converting an object into a stream of bytes, in order to store the object or transmit it to memory, a database, or a file.
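A minimal sketch of serialization using Python's stdlib pickle (one serializer among many; Avro, shown next, is a cross-language alternative). The record contents are hypothetical.

```python
import pickle

record = {"userName": "Martin", "favorite_number": 1337}

# Serialize: object -> stream of bytes (could be written to a file or socket).
blob = pickle.dumps(record)
print(type(blob))  # <class 'bytes'>

# Deserialize: byte stream -> an equal object.
restored = pickle.loads(blob)
print(restored == record)  # True
```

Note that pickle is Python-specific; formats like JSON and Avro exist precisely so that serialized data can be exchanged between different systems and languages.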
Deserialization
• Deserialization is the reverse process: reconstructing the object from the stream of bytes.
AVRO
• A row-based storage format which is widely used for serialization.
• Avro stores its schema in JSON format, making the schema easy to read and interpret by any program.
• The data itself is stored in a binary format, making it compact and efficient.

Row-based storage format:

Name                 | Age
Pierre-Simon Laplace | 77
John von Neumann     | 53
AVRO
Pros
• Splittable and compressible.
• Good for data exchange.
• Strong support for schema evolution (reader and writer schemas can evolve at different times and independently).
• Avro schemas are defined in JSON, easy to read and parse.
• The data is always accompanied by its schema, which makes data processing much easier.
Cons
• Data is not human-readable: a text editor is not enough, a library is required.
Ecosystem
• Widely used in many applications (Kafka, Spark, etc.)
AVRO binary encoding

test_schema = '''{
  "namespace": "example.avro",
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favorite_number", "type": ["null", "long"]},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}'''

As text, len("{'userName': 'Martin', 'favorite_number': 1337, 'interests': ['daydreaming', 'hacking']}") = 88 bytes (1 byte = 8 bits).

The same record in Avro binary encoding is only 32 bytes:

0c 4d 61 72 74 69 6e 02 f2 14 04 16 64 61 79 64 72 65 61 6d 69 6e 67 0e 68 61 63 6b 69 6e 67 00
AVRO binary encoding

Using the same test_schema, the record
{'userName': 'Ben', 'favorite_number': None, 'interests': ['sleeping', 'swimming']}
encodes to:

06 42 65 6e 00 04 10 73 6c 65 65 70 69 6e 67 10 73 77 69 6d 6d 69 6e 67 00

Byte by byte:
• 06 (0000 0110) → length of userName: 3; 42 65 6e → "Ben"
• 00 (0000 0000) → union branch 0: favorite_number is null
• 04 (0000 0100) → 2: the interests array block contains 2 items
• 10 (0001 0000) → length of the 1st interest: 8; 73 6c 65 65 70 69 6e 67 → "sleeping"
• 10 (0001 0000) → length of the 2nd interest: 8; 73 77 69 6d 6d 69 6e 67 → "swimming"
• 00 → end of the array
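The integers above (lengths, the long 1337) are encoded using Avro's zigzag mapping followed by a variable-length encoding. This minimal sketch (not a full Avro library) reproduces the byte patterns shown on the slides:

```python
def zigzag(n: int) -> int:
    """Map signed -> unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    """Encode an unsigned int in 7-bit groups, least-significant first;
    the high bit of each byte signals 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_long(n: int) -> bytes:
    return varint(zigzag(n))

print(encode_long(6).hex())     # 0c  -> length of "Martin"
print(encode_long(1337).hex())  # f214 -> favorite_number 1337
print(encode_long(8).hex())     # 10  -> length of "sleeping"
```

This is why small values take a single byte while 1337 takes two (f2 14), matching the hex dumps above.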
Parquet
• A column-oriented data storage format from the Apache Hadoop ecosystem, with excellent performance for reading and querying analytical workloads.
• The Parquet file format is very popular with Spark data engineers and data scientists.
• Optimized for the write once, read many (WORM) paradigm.
Parquet
Pros
• Splittable files.
• Organized by column, allowing better compression, as data is more homogeneous.
• High efficiency for OLAP workloads.
• Supports schema evolution.
Cons
• Restricted to batch processing.
• Not human-readable.
• Hard to apply updates, unless you delete and recreate the file.
Ecosystems
• Efficient analysis for BI (Business Intelligence).
• Very fast to read for processing engines such as Spark.
• Commonly used along with Spark, Impala, Arrow and Drill.
Parquet vs. Avro

                   Parquet                                          Avro
Storage            Column-based storage format                      Row-based storage format
Read, Write        Read: faster. Write: slower than Avro            Read: slower than Parquet. Write: faster
                   (due to better compression)
Schema evolution   Supports schema evolution, append-only           Supports schema evolution: modifying and appending
Use cases          Analytical queries. Write once, read many        ETL, where we scan the complete data.
                   times; suitable for read-intensive jobs          Optimized for write operations
Considerations for choosing a file format
• Text vs. binary
  • Text-based file formats are easier to use.
  • Text-based files can be read by humans, who can also modify the file content with a text editor.
  • Binary file formats require a tool or a library to be created and consumed.
  • Binary files provide better performance by optimizing the data serialization.
• Data type
  • Some formats don't allow the declaration of multiple types of data, e.g. distinguishing a number from a string, or a null value from the string "null".
  • Scalar types hold a "single" value (e.g. integer, boolean, string, null, ...).
  • Complex types are a compound of scalar types (e.g. arrays, objects, ...).
  • Once encoded in binary form, the storage gain is significant. For example, the string "1234" uses 4 bytes of storage, while 1234 stored as a 16-bit binary integer requires only 2 bytes.
Considerations for choosing a file format
• Schema enforcement
  • The schema can be associated with the data, or left to the consumer, who is assumed to know and understand how to interpret the data.
  • It can be provided with the data or separately.
• Schema evolution support
  • Schema evolution allows updating the schema used to write new data while maintaining backward compatibility with the schema of the old data.
• Row and column storage
  • Row-based storage supports adding data easily and quickly.
  • Row-based storage is preferred where the entire row of data needs to be accessed or processed simultaneously.
  • Row-based storage is commonly used for Online Transactional Processing (OLTP), which usually processes CRUD (Create, Read (Select), Update and Delete) operations at the record level.
  • Column-based storage is useful for analytics queries that examine only a subset of columns over very large datasets. It is used for Online Analytical Processing (OLAP), an approach designed to quickly answer analytics queries involving multiple dimensions.
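The row-vs-column trade-off can be sketched in a few lines of Python on hypothetical data (the same names used in the Avro table earlier): row layout makes whole-record access trivial, column layout makes single-column aggregation trivial.

```python
# Hypothetical records.
rows = [("Laplace", 77), ("von Neumann", 53)]

# Row-based layout: each record's fields are contiguous (OLTP-friendly).
row_store = list(rows)

# Column-based layout: each column's values are contiguous (OLAP-friendly).
col_store = {"name": [r[0] for r in rows], "age": [r[1] for r in rows]}

# OLTP-style access: fetch one whole record.
print(row_store[0])         # ('Laplace', 77)

# OLAP-style access: aggregate one column without touching the others.
print(sum(col_store["age"]))  # 130
```

The column layout also groups homogeneous values together, which is exactly what makes columnar formats like Parquet compress so well.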
Considerations for choosing a file format
• Splittable
  • If a file can be easily divided into several pieces, we can use distributed file systems, such as Hadoop HDFS.
  • Otherwise, it is the user's responsibility to partition the data into several files to enable parallelism.
• Compression
  • Column-based storage brings a higher compression ratio and better performance compared to row-based storage, because similar pieces of data are stored together.
  • External compression mostly applies to text file formats; binary file formats typically include compression in their definition.
• Ecosystem
  • It is often considered good practice to respect the usage and culture of an ecosystem when choosing between multiple alternatives.
  • For example, when choosing between Parquet and ORC with Hive, it is recommended to use the former on Cloudera platforms and the latter on Hortonworks platforms. Chances are that the integration will be smoother and the documentation and support from the community will be much more helpful.
Discussions
How do you rank these formats (CSV, XML, JSON, Avro) in terms of
• Storage efficiency
  • Avro > CSV > JSON > XML
  • XML is the most verbose: every tag name is stored twice (opening and closing). JSON repeats the column names in every row, while a CSV stores only the data, with the field names appearing once.
• Scalability (how much an increase in data size impacts the system)
  • Avro > CSV > JSON > XML
  • Avro is splittable; JSON and XML are not.
• Ease of use
  • (JSON, CSV) > XML > Avro
Moving data takes time
• You may have learned a lot about the time complexity of algorithms.
• Computing takes time, but moving data does as well.

The Memory Hierarchy
Example
• Transaction dataset: 2,709,550 records, 167 MB
• Query: find the total sum of the quantity column
Performance Limits
• What are the limits for that query (167 MB)?

                  Bandwidth   Query Time
1 Gbit Ethernet   125 MB/s    1.34 s
rotating disk     200 MB/s    0.835 s
SATA SSD          500 MB/s    0.334 s
USB-C             1 GB/s      0.167 s
PCIe SSD          2 GB/s      0.0835 s
DRAM              20 GB/s     0.00835 s

• This completely ignores CPU cost.
• CPU cost is not so relevant when the data is stored on disks, but very relevant for DRAM.
Python Implementation
>> sum.py

total = 0
with open('transactions_big.csv') as f:
    for line in f:
        total += float(line.split(',')[5])
print(total)

Timing the script with the Unix time command:
• 1.80s: total amount of time (in CPU-seconds) that the command spent in user mode.
• 0.06s: amount of time (in CPU-seconds) that the process spent in kernel mode.
• 98%: the percentage of CPU that was allocated to the process.
• 1.896s: total process running time (I/O + CPU).
• Note: 1.80 / 1.896 = 98.1%
Move data or computation?

First Situation: Move Data
To process large volumes of data that are geographically distributed, we traditionally need to transfer all the data to a single data center, so that it can be processed in a centralized fashion.
First Situation: Move Data
However,
• It may not be practical to move user data across country boundaries, due to legal reasons or privacy concerns.
• The cost, in both bandwidth and time, of moving large volumes of data across geo-distributed data centers may become prohibitive as the amount of data grows exponentially.
Second Situation: Move Computation
• Rather than transferring data across data centers, it may be a better design to move computation tasks to where the data is, so that data can be processed locally within the same data center.
• The fundamental objective, in general, is to minimize job completion times in big data processing applications by placing the tasks at their respective best possible data centers.
Move data vs. move computation
If we move the data (167 MB):
• Network bandwidth: 100 MB/s, so it takes 1.67 s to move the data.
• Computation (CPU) time: 0.4 s, i.e. a computation speed of 417 MB/s.
• Total time:
  t_md = size_of_data * (1 / speed_of_computation + 1 / network_speed)
       = 167 MB * (1 / (417 MB/s) + 1 / (100 MB/s)) = 2.07 s
Move data vs. move computation
If we move the computation:
• The performance is determined by latency and the number of nodes.
• t_mc = size_of_program / network_speed + size_of_data / number_of_nodes / speed_of_computation
• Assuming the size of the program is 1 MB, for 10 nodes:
  t_mc = 1 MB / (100 MB/s) + 167 MB / 10 / (417 MB/s) = 0.05 s
• So moving the computation is worthwhile if the problem is large enough to amortize the latency.
t_md = size_of_data * (1 / speed_of_computation + 1 / network_speed)

t_mc = size_of_program / network_speed + size_of_data / number_of_nodes / speed_of_computation

Assume the speed of computation is 1 GB/s and the speed of the network is 100 MB/s: the larger the data, the more moving computation wins over moving data. (The slide's comparison plot did not survive extraction.)
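The two formulas above can be evaluated directly; this sketch plugs in the numbers used on the previous slides (167 MB of data, a 1 MB program, 100 MB/s network, 417 MB/s computation, 10 nodes) and reproduces the 2.07 s vs 0.05 s comparison.

```python
DATA_MB = 167
PROGRAM_MB = 1        # assumed program size from the slide
NET_MB_S = 100        # network speed
COMPUTE_MB_S = 417    # speed of computation (167 MB / 0.4 s)

def t_move_data(data_mb: float) -> float:
    """Time to ship the data and process it centrally."""
    return data_mb * (1 / COMPUTE_MB_S + 1 / NET_MB_S)

def t_move_computation(data_mb: float, nodes: int) -> float:
    """Time to ship the program and process the data in parallel on-site."""
    return PROGRAM_MB / NET_MB_S + data_mb / nodes / COMPUTE_MB_S

print(round(t_move_data(DATA_MB), 2))             # 2.07
print(round(t_move_computation(DATA_MB, 10), 2))  # 0.05
```

Because t_move_data grows linearly with data size while t_move_computation divides the data term by the number of nodes, moving the computation wins by an ever-larger margin as the data grows.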
Summary
• Data types
  • There is no absolute answer for the BEST file format. We need to weigh multiple factors, like compatibility with applications/systems, use cases, etc.
  • It is very important to understand different file formats when working in data-related domains, especially when designing a data warehouse with complex integrations across many systems.
• Move data vs. move computation
  • We can achieve significant speed-ups by moving computation to the data, especially in the big data era.