Week2 - Data - Formats 3

The document discusses various data formats used in the data engineering lifecycle, emphasizing the impact of format choice on storage efficiency and data exchange. It covers the creation of analog and digital data, and details several common data formats such as CSV, XML, and JSON, along with their pros and cons. Additionally, it introduces tools and services from AWS that facilitate data handling and migration.

Data Formats and Computation

Data Formats in the Data Engineering Lifecycle
• Often, we have no or weak control over the data source format.
• Data formats can be changed during the ingestion phase.
• The decision on data formats affects storage efficiency, exchangeability, etc.
How is data created?
• Analog data
  • Creation occurs in the real world, e.g. vocal speech, sign language, writing on paper, playing an instrument.
  • Transient: if it is not recorded, it is gone.
• Digital data
  • Conversion from analog data, e.g. speech to text, digital photos.
  • Native product of a digital system, e.g. credit card transactions, online ordering.
AWS services for data ingestion and migration
• AWS Data Exchange: makes it easy for AWS customers to find, subscribe to, and use third-party data in the AWS Cloud.
• Amazon AppFlow: automates data flow between software-as-a-service (SaaS) applications and AWS services.
• AWS Fargate: a serverless compute engine for hosting Docker containers without having to provision, manage, and scale servers.
• AWS Transfer for SFTP: enables the transfer of files to and from Amazon S3 and on-premises servers using the Secure File Transfer Protocol (SFTP).
• Amazon Kinesis Data Firehose: a fully managed service for delivering real-time streaming data to destinations such as S3, Redshift, and Elasticsearch Service.
• AWS DataSync: automates and accelerates moving data between on-premises storage and AWS storage services.
• AWS Database Migration Service: a service provided by AWS that helps users migrate data from one location to another.
• A file is a collection of data stored on a computer or other electronic device. It can contain text, images, videos, audio, or other types of information, and is typically saved with a specific file format, such as .txt, .jpg, .mp4, etc. At the lowest level, a file is a sequence of 0s and 1s.
• Binary files typically contain a sequence of bytes, or ordered groupings of eight bits. When creating a custom file format for a program, a developer arranges these bytes into a format that stores the necessary information for the application.
• Text files are more restrictive than binary files since they can only contain textual data.

Common Data Formats
• CSV / TSV
• XML
• JSON
• Avro
• Parquet
Text-based Formats

CSV / TSV (Comma- / Tab-Separated Values)
Pros
• Among the most ubiquitous file formats, supported by many applications and connectors.
• Easy to read and modify.
• Excellent compression ratio.
Cons
• No support for null values: there is no good value to represent NULL, so a null and an empty string are indistinguishable.
• No support for native binary data.
• Any entry has the potential to break the file.
• Poor support for structured metadata.
Ecosystems
• Supported by a wide range of applications.
• One of the most popular formats because of its simplicity.
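The null-value limitation above can be shown with Python's standard csv module. This is a minimal sketch on hypothetical data: a field left empty and a field written as an explicit empty string produce the same parsed result, so the null/empty distinction is lost.

```python
import csv
import io

# Hypothetical CSV: row 1 leaves "nickname" empty (intended as null),
# row 2 writes it as an explicit empty string "".
raw = 'name,nickname\r\nAda,\r\nGrace,""\r\n'

rows = list(csv.DictReader(io.StringIO(raw)))

# Both come back as '' -- CSV cannot distinguish null from empty string.
print(rows[0]["nickname"] == rows[1]["nickname"])  # True
```

Formats with typed values (JSON, Avro) avoid this ambiguity by having a distinct null type.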
Load CSVs leveraging the processing engines

CSVs tend to be natively supported for loading data into databases. The slide's example used Postgres (the example itself did not survive extraction).
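Since the Postgres example is missing, here is a comparable sketch using Python's stdlib sqlite3 in place of Postgres. The table and column names are hypothetical; the point is the pattern of bulk-loading parsed CSV rows into a database engine.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a file on disk.
raw = "id,quantity\n1,3\n2,5\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, quantity INTEGER)")

reader = csv.reader(io.StringIO(raw))
next(reader)  # skip the header row
conn.executemany("INSERT INTO transactions VALUES (?, ?)", reader)

total = conn.execute("SELECT SUM(quantity) FROM transactions").fetchone()[0]
print(total)  # 8
```

With Postgres the same bulk load is typically done server-side (its COPY facility) rather than row by row, which is much faster for large files.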
Load CSVs leveraging connectors

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
XML (eXtensible Markup Language)
• Built to separate data from HTML.
Pros
• Flexible schema, allowing the embedding of complex types; explicitly designed to support Unicode.
• Schema validation, performed using a DTD (Document Type Definition).
• Some specific parsers enable a form of streaming, e.g. SAX.
Cons
• Very verbose, which can lead to larger file sizes.
• Not splittable.
Ecosystems
• A large historical presence in companies, which tends to decrease over time.
How does XML handle binary data?
• Base64 encoding: binary data can be converted to a Base64 string, which is a text-safe encoding method. The Base64-encoded string can then be included in an XML document. This is the most common approach.
• Hexadecimal encoding: another option is to represent binary data as a hexadecimal string. However, this is less space-efficient than Base64 and is less commonly used.
• External references: instead of embedding binary data, you can store it in a separate file and use an XML tag to reference the file path or URI.
How does XML handle binary data?
• Base64 encoding: binary data can be converted to a Base64 string, which is a text-safe encoding method. The Base64-encoded string can then be included in an XML document. This is the most common approach.

<file>
  <name>example.jpg</name>
  <data>
    iVBORw0KGgoAAAANSUhEUgAAAAUA...
  </data>
</file>
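The Base64 round trip above can be sketched with Python's stdlib base64 and xml.etree.ElementTree. The element names mirror the slide's example; the payload bytes are made up for illustration.

```python
import base64
import xml.etree.ElementTree as ET

payload = b"\x00\x01\x02hello"  # hypothetical binary data

# Encode the bytes as a text-safe Base64 string and embed it in XML.
file_el = ET.Element("file")
ET.SubElement(file_el, "name").text = "example.jpg"
ET.SubElement(file_el, "data").text = base64.b64encode(payload).decode("ascii")
xml_bytes = ET.tostring(file_el)

# Round trip: parse the XML and decode back to the original bytes.
parsed = ET.fromstring(xml_bytes)
recovered = base64.b64decode(parsed.find("data").text)
print(recovered == payload)  # True
```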
How does XML handle binary data?
• Hexadecimal encoding: another option is to represent binary data as a hexadecimal string. However, this is less space-efficient than Base64 and is less commonly used.

<file>
  <name>example.txt</name>
  <data>48656C6C6F</data>
</file>

48656C6C6F is the hexadecimal representation of the ASCII bytes of "Hello".
How does XML handle binary data?
• External references: instead of embedding binary data, you can store it in a separate file and use an XML tag to reference the file path or URI.

<file>
  <name>example.jpg</name>
  <path>/path/to/example.jpg</path>
</file>

A drawback: it is hard to enforce that the referenced path and the binary data stay consistent.
Parsing XMLs using Python (ElementTree and lxml)
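The slide's parsing example did not survive extraction; here is a minimal sketch using the stdlib xml.etree.ElementTree on a made-up document (lxml, a third-party package, offers a largely compatible API with more features).

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document for illustration.
xml_doc = """
<catalog>
  <book id="b1"><title>Data Formats</title><price>25.00</price></book>
  <book id="b2"><title>Computation</title><price>30.00</price></book>
</catalog>
"""

root = ET.fromstring(xml_doc)
# Walk the tree, reading attributes and child-element text.
books = [(b.get("id"), b.find("title").text, float(b.find("price").text))
         for b in root.findall("book")]
print(books)  # [('b1', 'Data Formats', 25.0), ('b2', 'Computation', 30.0)]
```

ElementTree loads the whole document into memory; for very large XML files, a streaming parser such as SAX (mentioned earlier) avoids that cost.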
Working with XML using SOAP
• SOAP: Simple Object Access Protocol. Requests and responses are XML envelopes exchanged over HTTP/HTTPS.

Request:

<?xml version="1.0"?>
<soap:Envelope
  xmlns:soap="http://www.w3.org/2003/05/soap-envelope/"
  soap:encodingStyle="http://www.w3.org/2003/05/soap-encoding">
  <soap:Body>
    <m:GetPrice xmlns:m="https://www.w3schools.com/prices">
      <m:Item>Apples</m:Item>
    </m:GetPrice>
  </soap:Body>
</soap:Envelope>

Response:

<?xml version="1.0"?>
<soap:Envelope
  xmlns:soap="http://www.w3.org/2003/05/soap-envelope/"
  soap:encodingStyle="http://www.w3.org/2003/05/soap-encoding">
  <soap:Body>
    <m:GetPriceResponse xmlns:m="https://www.w3schools.com/prices">
      <m:Price>1.90</m:Price>
    </m:GetPriceResponse>
  </soap:Body>
</soap:Envelope>
JSON (JavaScript Object Notation)
A text format similar to XML, but in a much simpler and more compact form.
Pros
• Simple syntax; can be opened by any text editor.
• Flexible schema, supporting several types: string, number, object, array, Boolean, null.
• A binary variation (BSON) supports other native types, such as Datetime.
• Compressible.
Cons
• Not splittable.
• Limited metadata support.
Ecosystems
• Privileged format for web applications.
• Widely used by NoSQL DBs such as MongoDB, Couchbase, etc.
JSON (JavaScript Object Notation)
• Arrays: lists represented by square brackets, with values separated by commas. They can contain mixed data types, i.e., a single array can hold strings, Booleans, and numbers.
  • E.g.: [1, 2, 7.8, 5, 9, 10]; ["red", "yellow", "green"]; [8, "hello", null, true]
• Objects: JSON dictionaries enclosed in curly brackets. In objects, keys and values are separated by a colon ':' and pairs are separated by commas. Keys must be strings; values can be of any JSON type.
  • E.g.: {"red" : 1, "yellow" : 2, "green" : 3}
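The array and object examples above map directly onto Python's stdlib json module, which is why JSON feels so much like a Python dictionary. A minimal sketch using the slide's own literals:

```python
import json

# Arrays can mix types; JSON null maps to Python's None.
arr = json.loads('[8, "hello", null, true]')
print(arr)  # [8, 'hello', None, True]

# Object keys are strings; values may be any JSON type.
obj = json.loads('{"red" : 1, "yellow" : 2, "green" : 3}')
print(obj["green"])  # 3

# Serializing back to text round-trips the same structure.
print(json.loads(json.dumps(obj)) == obj)  # True
```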
XML vs. JSON
Binary Formats

What is Serialization?
• Serialization is the process of converting an object into a stream of bytes, in order to store the object or transmit it to memory, a database, or a file.
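A minimal sketch of serialization using Python's stdlib pickle (one serializer among many; Avro, shown next, is a cross-language alternative). The record contents are hypothetical.

```python
import pickle

record = {"userName": "Martin", "favorite_number": 1337}

# Serialize: object -> stream of bytes (could be written to a file or socket).
blob = pickle.dumps(record)
print(type(blob))  # <class 'bytes'>

# Deserialize: byte stream -> an equal object.
restored = pickle.loads(blob)
print(restored == record)  # True
```

Note that pickle is Python-specific; formats like JSON and Avro exist precisely so that serialized data can be exchanged between different systems and languages.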
Deserialization
• Deserialization is the reverse process: reconstructing the object from the stream of bytes.
AVRO
• A row-based storage format which is widely used for serialization.
• Avro stores its schema in JSON format, making the schema easy to read and interpret by any program.
• The data itself is stored in a binary format, making it compact and efficient.

Row-based storage format:

Name                 | Age
Pierre-Simon Laplace | 77
John von Neumann     | 53
AVRO
Pros
• Splittable and compressible.
• Good for data exchange.
• Strong support for schema evolution (reader and writer schemas can evolve at different times and independently).
• Avro schemas are defined in JSON, easy to read and parse.
• The data is always accompanied by its schema, which makes data processing much easier.
Cons
• Data is not human-readable: a text editor is not enough, a library is required.
Ecosystem
• Widely used in many applications (Kafka, Spark, etc.)
AVRO binary encoding

test_schema = '''{
  "namespace": "example.avro",
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favorite_number", "type": ["null", "long"]},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}'''

As text, len("{'userName': 'Martin', 'favorite_number': 1337, 'interests': ['daydreaming', 'hacking']}") = 88 bytes (1 byte = 8 bits).

The same record in Avro binary encoding is only 32 bytes:

0c 4d 61 72 74 69 6e 02 f2 14 04 16 64 61 79 64 72 65 61 6d 69 6e 67 0e 68 61 63 6b 69 6e 67 00
AVRO binary encoding

Using the same test_schema, the record
{'userName': 'Ben', 'favorite_number': None, 'interests': ['sleeping', 'swimming']}
encodes to:

06 42 65 6e 00 04 10 73 6c 65 65 70 69 6e 67 10 73 77 69 6d 6d 69 6e 67 00

Byte by byte:
• 06 (0000 0110) → length of userName: 3; 42 65 6e → "Ben"
• 00 (0000 0000) → union branch 0: favorite_number is null
• 04 (0000 0100) → 2: the interests array block contains 2 items
• 10 (0001 0000) → length of the 1st interest: 8; 73 6c 65 65 70 69 6e 67 → "sleeping"
• 10 (0001 0000) → length of the 2nd interest: 8; 73 77 69 6d 6d 69 6e 67 → "swimming"
• 00 → end of the array
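The integers above (lengths, the long 1337) are encoded using Avro's zigzag mapping followed by a variable-length encoding. This minimal sketch (not a full Avro library) reproduces the byte patterns shown on the slides:

```python
def zigzag(n: int) -> int:
    """Map signed -> unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    """Encode an unsigned int in 7-bit groups, least-significant first;
    the high bit of each byte signals 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_long(n: int) -> bytes:
    return varint(zigzag(n))

print(encode_long(6).hex())     # 0c  -> length of "Martin"
print(encode_long(1337).hex())  # f214 -> favorite_number 1337
print(encode_long(8).hex())     # 10  -> length of "sleeping"
```

This is why small values take a single byte while 1337 takes two (f2 14), matching the hex dumps above.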
Parquet
• A column-oriented data storage format from the Apache Hadoop ecosystem, with excellent performance for reading and querying analytical workloads.
• The Parquet file format is very popular with Spark data engineers and data scientists.
• Optimized for the write once, read many (WORM) paradigm.
Parquet
Pros
• Splittable files.
• Organized by column, allowing better compression, as data is more homogeneous.
• High efficiency for OLAP workloads.
• Supports schema evolution.
Cons
• Restricted to batch processing.
• Not human-readable.
• Hard to apply updates, unless you delete and recreate the file.
Ecosystems
• Efficient analysis for BI (Business Intelligence).
• Very fast to read for processing engines such as Spark.
• Commonly used along with Spark, Impala, Arrow and Drill.
Parquet vs. Avro

                   Parquet                                          Avro
Storage            Column-based storage format                      Row-based storage format
Read, Write        Read: faster. Write: slower than Avro            Read: slower than Parquet. Write: faster
                   (due to better compression)
Schema evolution   Supports schema evolution, append-only           Supports schema evolution: modifying and appending
Use cases          Analytical queries. Write once, read many        ETL, where we scan the complete data.
                   times; suitable for read-intensive jobs          Optimized for write operations
Considerations for choosing a file format
• Text vs. binary
  • Text-based file formats are easier to use.
  • Text-based files can be read by humans, who can also modify the file content with a text editor.
  • Binary file formats require a tool or a library to be created and consumed.
  • Binary files provide better performance by optimizing the data serialization.
• Data type
  • Some formats don't allow the declaration of multiple types of data, e.g. distinguishing a number from a string, or a null value from the string "null".
  • Scalar types hold a "single" value (e.g. integer, boolean, string, null, ...).
  • Complex types are a compound of scalar types (e.g. arrays, objects, ...).
  • Once encoded in binary form, the storage gain is significant. For example, the string "1234" uses 4 bytes of storage, while 1234 stored as a 16-bit binary integer requires only 2 bytes.
Considerations for choosing a file format
• Schema enforcement
  • The schema can be associated with the data, or left to the consumer, who is assumed to know and understand how to interpret the data.
  • It can be provided with the data or separately.
• Schema evolution support
  • Schema evolution allows updating the schema used to write new data while maintaining backward compatibility with the schema of the old data.
• Row and column storage
  • Row-based storage supports adding data easily and quickly.
  • Row-based storage is preferred where the entire row of data needs to be accessed or processed simultaneously.
  • Row-based storage is commonly used for Online Transactional Processing (OLTP), which usually processes CRUD (Create, Read (Select), Update and Delete) operations at the record level.
  • Column-based storage is useful for analytics queries that examine only a subset of columns over very large datasets. It is used for Online Analytical Processing (OLAP), an approach designed to quickly answer analytics queries involving multiple dimensions.
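The row-vs-column trade-off can be sketched in a few lines of Python on hypothetical data (the same names used in the Avro table earlier): row layout makes whole-record access trivial, column layout makes single-column aggregation trivial.

```python
# Hypothetical records.
rows = [("Laplace", 77), ("von Neumann", 53)]

# Row-based layout: each record's fields are contiguous (OLTP-friendly).
row_store = list(rows)

# Column-based layout: each column's values are contiguous (OLAP-friendly).
col_store = {"name": [r[0] for r in rows], "age": [r[1] for r in rows]}

# OLTP-style access: fetch one whole record.
print(row_store[0])         # ('Laplace', 77)

# OLAP-style access: aggregate one column without touching the others.
print(sum(col_store["age"]))  # 130
```

The column layout also groups homogeneous values together, which is exactly what makes columnar formats like Parquet compress so well.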
Considerations for choosing a file format
• Splittable
  • If a file can be easily divided into several pieces, we can use distributed file systems, such as Hadoop HDFS.
  • Otherwise, it is the user's responsibility to partition the data into several files to enable parallelism.
• Compression
  • Column-based storage brings a higher compression ratio and better performance compared to row-based storage, because similar pieces of data are stored together.
  • External compression mostly applies to text file formats; binary file formats typically include compression in their definition.
• Ecosystem
  • It is often considered good practice to respect the usage and culture of an ecosystem when choosing between multiple alternatives.
  • For example, when choosing between Parquet and ORC with Hive, it is recommended to use the former on Cloudera platforms and the latter on Hortonworks platforms. Chances are that the integration will be smoother and the documentation and support from the community will be much more helpful.
Discussions
How do you rank these formats (CSV, XML, JSON, Avro) in terms of
• Storage efficiency
  • Avro > CSV > JSON > XML
  • XML is the most verbose: every tag name is stored twice (opening and closing). JSON repeats the column names in every row, while a CSV stores only the data, with the field names appearing once.
• Scalability (how much an increase in data size impacts the system)
  • Avro > CSV > JSON > XML
  • Avro is splittable; JSON and XML are not.
• Ease of use
  • (JSON, CSV) > XML > Avro
Moving data takes time
• You may have learned a lot about the time complexity of algorithms.
• Computing takes time, but moving data does as well.

The Memory Hierarchy
Example
• Transaction dataset: 2,709,550 records, 167 MB
• Query: find the total sum of the quantity column
Performance Limits
• What are the limits for that query (167 MB)?

                  Bandwidth   Query Time
1 Gbit Ethernet   125 MB/s    1.34 s
rotating disk     200 MB/s    0.835 s
SATA SSD          500 MB/s    0.334 s
USB-C             1 GB/s      0.167 s
PCIe SSD          2 GB/s      0.0835 s
DRAM              20 GB/s     0.00835 s

• This completely ignores CPU cost.
• CPU cost is not so relevant when the data is stored on disks, but very relevant for DRAM.
Python Implementation
>> sum.py

total = 0
with open('transactions_big.csv') as f:
    for line in f:
        total += float(line.split(',')[5])
print(total)

Timing the script with the Unix time command:
• 1.80s: total amount of time (in CPU-seconds) that the command spent in user mode.
• 0.06s: amount of time (in CPU-seconds) that the process spent in kernel mode.
• 98%: the percentage of CPU that was allocated to the process.
• 1.896s: total process running time (I/O + CPU).
• Note: 1.80 / 1.896 = 98.1%
Move data or computation?

First Situation: Move Data
To process large volumes of data that are geographically distributed, we traditionally need to transfer all the data to a single data center, so that it can be processed in a centralized fashion.
First Situation: Move Data
However,
• It may not be practical to move user data across country boundaries, due to legal reasons or privacy concerns.
• The cost, in both bandwidth and time, of moving large volumes of data across geo-distributed data centers may become prohibitive as the amount of data grows exponentially.
Second Situation: Move Computation
• Rather than transferring data across data centers, it may be a better design to move computation tasks to where the data is, so that data can be processed locally within the same data center.
• The fundamental objective, in general, is to minimize job completion times in big data processing applications by placing the tasks at their respective best possible data centers.
Move data vs. move computation
If we move the data (167 MB):
• Network bandwidth: 100 MB/s, so it takes 1.67 s to move the data.
• Computation (CPU) time: 0.4 s, i.e. a computation speed of 417 MB/s.
• Total time:
  t_md = size_of_data * (1 / speed_of_computation + 1 / network_speed)
       = 167 MB * (1 / (417 MB/s) + 1 / (100 MB/s)) = 2.07 s
Move data vs. move computation
If we move the computation:
• The performance is determined by latency and the number of nodes.
• t_mc = size_of_program / network_speed + size_of_data / number_of_nodes / speed_of_computation
• Assuming the size of the program is 1 MB, for 10 nodes:
  t_mc = 1 MB / (100 MB/s) + 167 MB / 10 / (417 MB/s) = 0.05 s
• So moving the computation is worthwhile if the problem is large enough to amortize the latency.
t_md = size_of_data * (1 / speed_of_computation + 1 / network_speed)

t_mc = size_of_program / network_speed + size_of_data / number_of_nodes / speed_of_computation

Assume the speed of computation is 1 GB/s and the speed of the network is 100 MB/s: the larger the data, the more moving computation wins over moving data. (The slide's comparison plot did not survive extraction.)
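The two formulas above can be evaluated directly; this sketch plugs in the numbers used on the previous slides (167 MB of data, a 1 MB program, 100 MB/s network, 417 MB/s computation, 10 nodes) and reproduces the 2.07 s vs 0.05 s comparison.

```python
DATA_MB = 167
PROGRAM_MB = 1        # assumed program size from the slide
NET_MB_S = 100        # network speed
COMPUTE_MB_S = 417    # speed of computation (167 MB / 0.4 s)

def t_move_data(data_mb: float) -> float:
    """Time to ship the data and process it centrally."""
    return data_mb * (1 / COMPUTE_MB_S + 1 / NET_MB_S)

def t_move_computation(data_mb: float, nodes: int) -> float:
    """Time to ship the program and process the data in parallel on-site."""
    return PROGRAM_MB / NET_MB_S + data_mb / nodes / COMPUTE_MB_S

print(round(t_move_data(DATA_MB), 2))             # 2.07
print(round(t_move_computation(DATA_MB, 10), 2))  # 0.05
```

Because t_move_data grows linearly with data size while t_move_computation divides the data term by the number of nodes, moving the computation wins by an ever-larger margin as the data grows.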
Summary
• Data types
  • There is no absolute answer for the BEST file format. We need to weigh multiple factors, like compatibility with applications/systems, use cases, etc.
  • It is very important to understand different file formats when working in data-related domains, especially when designing a data warehouse with complex integrations across many systems.
• Move data vs. move computation
  • We can achieve significant speed-ups by moving computation to the data, especially in the big data era.