
• Cassandra stores the data in an in-memory structure, the memtable (RAM), when the initial write request arrives from the client. Concurrently, the write is also appended to the commit log (disk), which is durable even if the node loses power.

• The data from the memtable (RAM) is flushed to SSTables (disk), and a partition index is also created that points to the location of the data on disk. Flushing from the memtable to SSTables happens when a configurable memtable threshold is reached or when the commit log threshold commitlog_total_space_in_mb is exceeded.

• The data is written to SSTables, which are immutable: when the memtable is flushed, the data already in SSTables is not overwritten; instead a new file is created. A partition may be stored across multiple SSTables so that it can be searched easily (a simplified sketch of this write path follows Fig 2.6).

Fig 2.6: Cassandra write operation (a client write goes to the memtable in memory and to the commit log on disk; the memtable is later flushed to an SSTable and its index on disk)
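To make the write path concrete, the following is a minimal, illustrative Java sketch of the durability logic described above: append to a commit log, buffer the write in a sorted memtable, and flush to an immutable SSTable when a threshold is exceeded. It is a conceptual model only, not Cassandra's actual implementation; the class and method names (SimpleWritePath, flushToSSTable, flushThresholdBytes) are invented for illustration.

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual sketch of the write path: commit log first, then memtable,
// then flush to an immutable SSTable once a size threshold is crossed.
// All names here are illustrative, not Cassandra internals.
public class SimpleWritePath {
    private final StringBuilder commitLog = new StringBuilder();       // stands in for the on-disk commit log
    private NavigableMap<String, String> memtable = new TreeMap<>();   // sorted in-memory structure
    private final long flushThresholdBytes;
    private long memtableBytes = 0;

    public SimpleWritePath(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    public void write(String partitionKey, String value) {
        // 1. Append to the commit log so the write survives a crash.
        commitLog.append(partitionKey).append('=').append(value).append('\n');
        // 2. Apply the write to the memtable (RAM).
        memtable.put(partitionKey, value);
        memtableBytes += partitionKey.length() + value.length();
        // 3. Flush when the configured threshold is exceeded.
        if (memtableBytes >= flushThresholdBytes) {
            flushToSSTable();
        }
    }

    private void flushToSSTable() {
        // An SSTable is immutable: a brand-new sorted file is written and
        // never modified afterwards (here it is simply printed).
        for (Map.Entry<String, String> e : memtable.entrySet()) {
            System.out.println("sstable-entry " + e.getKey() + " -> " + e.getValue());
        }
        memtable = new TreeMap<>();   // start a fresh memtable
        memtableBytes = 0;
    }
}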

Steps to read in Cassandra:

The Cassandra read operation goes through different stages to find the exact data, starting from the data present in the memtable (RAM) through to the data present in the SSTable (disk) files.
The following steps are followed to read data from Cassandra.

• The read request is made by the client.

• The requested data is first checked in the memtable (RAM). If the requested data is present, it is read from the memtable and merged with the SSTable (disk) files to send the final data to the client.

• If the row cache is enabled, it is checked to find the data.

• The Bloom filters, loaded in heap memory, are checked to find the SSTable files that may store the requested partition data. Bloom filters are probabilistic and can return false positives. If the Bloom filter does not identify an SSTable file, Cassandra further checks the partition key cache.

• The partition key cache stores the partition index in heap memory, and the partition index of the data is searched there. If the partition key is present in the partition key cache, Cassandra goes to the compression offset map to find the location on disk that has the data. If the partition key is not present in the partition key cache, the partition summary is searched to find the user-requested data.

• The partition index stores the partition key of the data, which is used with the compression offset map to find the exact location on disk where the data is stored.

• The compression offset map holds the exact location of the data, using the partition key to locate it. Once the compression offset map indicates the location where the data is stored, the remaining work is to fetch the data and return it to the user (a simplified sketch of this lookup order follows Fig 2.7).

Fig 2.7: Cassandra read operation (a read request consults the memtable and row cache in memory, then the Bloom filter, partition key cache, partition summary, compression offsets, partition index, and finally the SSTable data on disk before returning the result set)
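The following Java sketch shows the order of checks described above as a single method. The interfaces (Memtable, RowCache, BloomFilter, PartitionKeyCache, SSTable) are invented here purely to make the lookup order explicit; they are not Cassandra classes, and a real read also merges memtable and SSTable versions of a row.

import java.util.Optional;

// Illustrative-only sketch of the read lookup order described above.
public class SimpleReadPath {
    interface Memtable          { Optional<String> get(String key); }
    interface RowCache          { Optional<String> get(String key); }
    interface BloomFilter       { boolean mightContain(String key); }   // probabilistic: false positives possible
    interface PartitionKeyCache { Optional<Long> indexOffset(String key); }
    interface SSTable           { String readAt(long diskOffset); long offsetFromSummaryAndIndex(String key); }

    public String read(String key, Memtable memtable, RowCache rowCache,
                       BloomFilter bloom, PartitionKeyCache keyCache, SSTable sstable) {
        // 1. Memtable first (the newest data lives in RAM).
        Optional<String> inMem = memtable.get(key);
        if (inMem.isPresent()) return inMem.get();

        // 2. Row cache, if enabled.
        Optional<String> cached = rowCache.get(key);
        if (cached.isPresent()) return cached.get();

        // 3. Bloom filter: skip this SSTable entirely if it definitely lacks the key.
        if (!bloom.mightContain(key)) return null;

        // 4. Partition key cache -> compression offsets; otherwise partition summary -> partition index.
        long offset = keyCache.indexOffset(key)
                              .orElseGet(() -> sstable.offsetFromSummaryAndIndex(key));

        // 5. Fetch the data from disk at the resolved offset and return it.
        return sstable.readAt(offset);
    }
}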


2.7 CASSANDRA APPLICATIONS AND EXAMPLES
Cassandra can be used for different types of applications:
1. Messaging: Cassandra is a great database that can handle a big amount of data. It is preferred by companies that provide mobile phone and messaging services. These companies have a huge amount of data, so Cassandra is best for them.

2. Handle high-speed applications: Cassandra can handle high-speed data, so it is a great database for applications where data is coming at very high speed from different devices or sensors.
3. Product catalogs and retail apps: Cassandra is used by many retailers for durable shopping cart protection and fast product catalog input and output.
4. Social media analytics and recommendation engines: Cassandra is a great database for many online companies and social media providers for analysis of and recommendations to their customers.
The list of companies using Cassandra is growing. These companies include:

• Twitter is using Cassandra for analytics. In a much-publicized blog post, Ryan King, Twitter's primary Cassandra engineer, explained that Twitter had decided against using Cassandra as its primary store for tweets, as originally planned, but would instead use it in production for several different things: for real-time analytics, for geolocation and places of interest data, and for data mining over the entire user store.
• Mahalo uses it for its primary near-time data store.

• Facebook still uses it for inbox search, though they are using a proprietary fork.

• Digg uses it for its primary near-time data store.

• Rackspace uses it for its cloud service, monitoring, and logging.

• Reddit uses it as a persistent cache.


• Cloudkick uses it for monitoring statistics and analytics.

• Ooyala uses it to store and serve near real-time video analytics data.

• SimpleGeo uses it as the main data store for its real-time location infrastructure.
• Onespot uses it for a subset of its main data store.

• Cassandra is also being used by Cisco and Platform64, Comcast, and bee.tv for personalized television streaming to the Web and to mobile devices.

• The largest known Cassandra installation is at Facebook, where they have more than 150TB of data on more than 100 machines.

• Many more companies are currently evaluating Cassandra for production use in different projects, and a services company called Riptano, cofounded by Jonathan Ellis, the Apache Project Chair for Cassandra, was started in April of 2010.
Example: Sample Application for Hotel App Design

[Design diagram: the model defines the column families <<CF>>Hotel (row key #hotelID; columns name, phone, address, city, state, zip), <<CF>>HotelByCity (row key #city:state:hotelID; one column per hotelID), <<CF>>Guest (row key #phone; columns fname, lname, email), <<CF>>Reservation (row key #resID; columns hotelID, roomID, phone, name, arrive, depart, rate, ccNum), and the super column families <<SCF>>PointOfInterest (super column #hotelID, row key #poiName; columns desc, phone), <<SCF>>RoomAvailability (super column #hotelID, row key date) and <<SCF>>Room (super column #hotelID, row key #roomID; columns num, type, rate, coffee, tv, hottub, ...).]

Fig 2.8: The hotel search represented with Cassandra's model



• In this design, some of the tables, such as Hotel and Guest, are transferred directly to column families.

• Other tables, such as PointOfInterest, have been denormalized into a super column family.

• Because there is no SQL in Cassandra, an index has been created in the form of the HotelByCity column family.

• The room and amenities are combined into a single column family, Room.

• Columns such as type and rate will have corresponding values; other columns, such as hottub, will just use the presence of the column name itself as the value, and be otherwise empty.
Example: Schema definition in cassandra.yaml

keyspaces:
    - name: Hotelier
      replica_placement_strategy: org.apache.cassandra.locator.RackUnawareStrategy
      replication_factor: 1
      column_families:
        - name: Hotel
          compare_with: UTF8Type

        - name: HotelByCity
          compare_with: UTF8Type

        - name: Guest
          compare_with: BytesType

        - name: Reservation
          compare_with: TimeUUIDType

        - name: PointOfInterest
          column_type: Super
          compare_with: UTF8Type
          compare_subcolumns_with: UTF8Type

        - name: Room
          column_type: Super
          compare_with: BytesType
          compare_subcolumns_with: BytesType

        - name: RoomAvailability
          column_type: Super
          compare_with: BytesType
          compare_subcolumns_with: BytesType

2.8 CASSANDRA CLIENTS

2.8.1 Basic Client API
• In Cassandra version 0.6 and earlier, Thrift served as the foundation for the entire client API.
• With version 0.7, Avro started being supported due to certain limitations of the Thrift interface and the fact that Thrift development is no longer particularly active.
• For example, there are a number of bugs in Thrift that have remained open for over a year, and the Cassandra committers wanted to provide a client layer that is more active and receives more attention.

2.8.2 Thrift
• Thrift is the driver-level interface; it provides the API for client implementations in a wide variety of languages.
• Thrift is actually an RPC protocol or API unified with a code generation tool, and the purpose of using Thrift in Cassandra is that it facilitates easy access to the database across programming languages.
• Thrift is a code generation library for clients in C++, C#, Erlang, Haskell, Java, Objective C/Cocoa, OCaml, Perl, PHP, Python, Ruby, Smalltalk, and Squeak.
• Its goal is to provide an easy way to support efficient RPC calls in a wide variety of popular languages, without requiring the overhead of something like SOAP. (A minimal connection sketch in Java is given after the feature list below.)
The design of Thrift offers the following features:

• Language-independent types: Because types are defined in a language-neutral manner using the definition file, they can be shared between different languages. For example, a C++ struct can be exchanged with a Python dictionary.

• Common transport interface: The same application code can be used whether you are using disk files, in-memory data, or a streaming socket.

• Protocol independence: Thrift encodes and decodes the data types for use across protocols.

• Versioning support: The data types are capable of being versioned to support updates to the client API.
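As a rough illustration of what driver-level Thrift access looks like from Java, the sketch below opens a framed Thrift connection to a node and calls two of the Cassandra Thrift service's read-only describe methods. It assumes the Thrift-generated org.apache.cassandra.thrift bindings of the 0.7 era and a node listening on the default RPC port 9160; the exact method set is version-dependent, and real applications normally use a higher-level client library instead of raw Thrift.

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Minimal sketch of a raw Thrift client session (assumes 0.7-era generated bindings).
public class ThriftClientSketch {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();                                   // connect to the node's Thrift RPC port
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));

        // describe_* calls are simple, side-effect-free ways to verify the connection.
        System.out.println("Cluster: " + client.describe_cluster_name());
        System.out.println("API version: " + client.describe_version());

        transport.close();
    }
}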
2.8.2.1 Exceptions
There are several exceptions that can be thrown from the client interface that you might see on occasion.

The following is a list of basic exceptions (a short handling sketch in Java follows the list):

• AuthenticationException: The user has invalid credentials or does not exist.
• AuthorizationException: The user exists but has not been granted access to this keyspace.
• ConfigurationException: This is thrown when the class that loads the database descriptor can't find the configuration file, or if the configuration is invalid. This can happen if you forgot to specify a partitioner or endpoint snitch for your keyspace, used a negative integer for a value that only accepts a positive integer, and so forth. This exception is not thrown from the Thrift interface.
• InvalidRequestException: The user request is improperly formed. This might mean that you've asked for data from a keyspace or column family that doesn't exist, or that you haven't included all required parameters for the given request.
• NotFoundException: The user requested a column that does not exist.
• TException: This can occur when different Thrift versions are mixed and matched with server versions. This exception is not thrown from the Thrift interface, but is part of Thrift itself. TExceptions are uncaught, unexpected exceptions that bubble up from the server and terminate the current Thrift call. They are not used as application exceptions, which you must define yourself.
• TimedOutException: The response is taking longer than the configured limit, which by default is 10 seconds. This typically happens because the server is overloaded with requests, the node has failed but this failure has not yet been detected, or a very large amount of data has been requested.
• UnavailableException: Not all of the Cassandra replicas that are required to meet the quorum for a read or write are available. This exception is not thrown from the Thrift interface.
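The sketch below shows one way a Java client might catch several of these exceptions around a Thrift call. It is illustrative only: the Query interface and doQuery() are hypothetical placeholders standing in for whatever get/insert call the application actually makes against the cluster, and the retry policy is deliberately left out.

import org.apache.cassandra.thrift.InvalidRequestException;
import org.apache.cassandra.thrift.NotFoundException;
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;
import org.apache.thrift.TException;

// Illustrative exception handling around a client call; doQuery() is a
// hypothetical placeholder for an actual request against the cluster.
public class ClientExceptionHandling {
    interface Query {
        String doQuery() throws InvalidRequestException, NotFoundException,
                                UnavailableException, TimedOutException, TException;
    }

    public static String runSafely(Query query) {
        try {
            return query.doQuery();
        } catch (NotFoundException e) {
            return null;                                   // requested column simply does not exist
        } catch (InvalidRequestException e) {
            throw new IllegalArgumentException("Malformed request: " + e.getMessage(), e);
        } catch (UnavailableException | TimedOutException e) {
            // Not enough replicas, or the node is overloaded: a real client would retry or back off.
            throw new RuntimeException("Cluster temporarily unable to serve the request", e);
        } catch (TException e) {
            // Transport-level/unexpected Thrift failure (e.g., mismatched Thrift versions).
            throw new RuntimeException("Thrift transport error", e);
        }
    }
}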
TWO MARKS QUESTIONS AND ANSWERS

1. What is NoSQL?
A NoSQL database is a non-relational Data Management System that does not require a fixed schema. NoSQL stands for "Not Only SQL" or "non-SQL". NoSQL database technology stores information in JSON documents instead of the columns and rows used by relational databases. NoSQL databases are widely used in real-time web applications and big data because their main advantages are high scalability and high availability.

2. What are the features of NoSQL?
1. Availability
2. Flexibility
3. Scalability
4. Distributed
5. Highly functional
6. High performance

3. What are the types of NoSQL databases?
1. Document databases: store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types, including things like strings, numbers, booleans, arrays, or objects.
2. Key-value databases: are a simpler type of database where each item contains keys and values.
3. Wide-column stores or column family data stores: store data in tables, rows, and dynamic columns.
4. Graph databases: store data in nodes and edges. Nodes typically store information about people, places, and things, while edges store information about the relationships between the nodes.
4. Difference between SQL and NoSQL?

• SQL databases are primarily called RDBMS or relational databases; NoSQL databases are primarily called non-relational or distributed databases.
• SQL databases are table-based databases; NoSQL databases can be document based, key-value pairs, or graph databases.
• SQL databases have a fixed schema; NoSQL databases have a dynamic schema for unstructured data.
• SQL databases are vertically scalable (scale up with a larger server); NoSQL databases are horizontally scalable (scale out across commodity servers).
• SQL databases support ACID (atomicity, consistency, isolation, durability) transactions; NoSQL databases follow CAP (consistency, availability, partition tolerance).
• SQL requires joins; NoSQL does not require joins.
• SQL databases are best suited for complex queries; NoSQL databases are not so good for complex queries.
• SQL offerings are a mix of open-source (like Postgres and MySQL) and commercial (like Oracle Database); NoSQL offerings are open-source.

5. List the advantages of NoSQL

1. NoSQL databases simplify application development, particularly for interactive real-time web applications, such as those using a REST API and web services.
2. These databases provide flexibility for data that has not been normalized, which requires a flexible data model, or has different properties for different data entities.
3. They offer scalability for larger data sets, which are common in analytics and artificial intelligence (AI) applications.
4. NoSQL databases are better suited for cloud, mobile, social media and big data requirements.
5. They are designed for specific use cases and are easier to use than general-purpose relational or SQL databases for those types of applications.
6. List the disadvantages of NoSQL

1. Each NoSQL database has its own syntax for querying and managing data.
2. Lack of a rigid database schema and constraints removes the data integrity safeguards that are built into relational and SQL database systems.
3. A schema with some sort of structure is required in order to use the data.
4. Because most NoSQL databases use the eventual consistency model, they do not provide the same level of data consistency as SQL databases.
5. The data will not be consistent, which means they are not well-suited for transactions that require immediate integrity, such as banking and ATM transactions.
6. There are no comprehensive industry standards as with relational and SQL DBMS offerings.
7. Lack of ACID properties.
8. Lack of JOINs.
7. Difference between Cassandra and MySQL

1. Apache Cassandra is a type of NoSQL database; MySQL is a type of relational database.
2. The Apache Software Foundation developed Cassandra and released it in July 2008; MySQL was released in May 1995 and is now developed by Oracle.
3. Apache Cassandra is written in Java; MySQL is written in C and C++.
4. Cassandra does not provide the ACID properties (it provides only the AID properties); MySQL provides the ACID properties.
5. A read operation in Cassandra takes O(1) complexity; a read operation in MySQL takes O(log(n)) complexity.
6. There is no foreign key in Cassandra, so it does not provide the concept of referential integrity; MySQL has foreign keys, so it supports the concept of referential integrity.
8. Difference between Cassandra and RDBMS

• Cassandra handles unstructured data; an RDBMS handles structured data.
• Cassandra provides a flexible schema to store the data; an RDBMS provides a fixed schema to store the data.
• In Cassandra, keyspaces are used to store the tables and the keyspace is the outermost container of Cassandra; in an RDBMS, databases are used to store the tables and the database is the outermost container of the relational database management system.
• In Cassandra, tables are represented as nested key-value pairs; in an RDBMS, tables are represented as arrays of rows and columns.
• In Cassandra, entities are represented through tables and columns; in an RDBMS, entities are represented through tables.
• In Cassandra, relationships are represented by collections; in an RDBMS, joining is supported by the concept of the foreign key join.
• In Cassandra, a column is a storage unit; in an RDBMS, an attribute of the table is represented through a column.
• In Cassandra, rows are represented as the replication unit; in an RDBMS, rows are represented as the actual data.

9. What is Apache Cassandra?

Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
10. What is CQLSH? And why is it used?
Cassandra cqlsh is a command-line shell for the Cassandra Query Language (CQL) that enables users to communicate with the database. By using Cassandra cqlsh, you can do the following things:
• Define a schema
• Insert data, and
• Execute a query

11. What are Clusters in Cassandra?

The outermost structure in Cassandra is the cluster. A cluster is a container for keyspaces. It is sometimes called the ring, because Cassandra assigns data to nodes in the cluster by arranging them in a ring. Each node holds a replica for a different range of data.
12. What is a Keyspace in Cassandra?

A keyspace is the outermost container for data in Cassandra. Like a relational database, a keyspace has a name and a set of attributes that define keyspace-wide behaviour. The keyspace is used to group column families together.
13. What is a Column Family?

A column family is a container for an ordered collection of rows, each of which is itself an ordered collection of columns. We can freely add any column to any column family at any time, depending on your needs. The comparator value indicates how columns will be sorted when they are returned to you in a query.
14. What is a Row in Cassandra? And what are the different elements of it?
A row is a collection of sorted columns. It is the smallest unit that stores related data in Cassandra. Any component of a row can store data or metadata.
The different elements/parts of a row are:
• Row key
• Column keys
• Column values
15. Name some features of Apache Cassandra.
• High scalability
• High fault tolerance
• Flexible data storage
• Easy data distribution
• Tunable consistency
• Efficient writes
• Cassandra Query Language
16. List some of the components of Cassandra.
Some components of Cassandra are:
1. Table
2. Node
3. Cluster
4. Data Centre
5. Memtable
6. SSTable
7. Commit log
8. Bloom Filter
17. Write some advantages of Cassandra.
These are the advantages of Cassandra:
• Since data can be replicated to several nodes, Cassandra is fault tolerant.
• Cassandra can handle a large set of data.
• Cassandra provides high scalability.
18. Define commit log.
It is a mechanism that is used to recover data in case the database crashes. Every operation that is carried out is saved in the commit log. Using this, the data can be recovered.
19. Define composite key.
Composite keys include the row key and column name. They are used to define a column family with a concatenation of data of different types.
20. Define SSTable.
SSTable is a Sorted String Table. It is a data file that accepts regularly flushed memtables.
21. What is memtable?
A memtable is an in-memory/write-back cache space containing content in key and column format. In a memtable, data is sorted by key, and each ColumnFamily has a distinct memtable that retrieves column data via key. It stores the writes until it is full, and is then flushed out.
22. How is the SSTable different from other relational tables?
SSTables do not allow any further addition and removal of data items once written. For each SSTable, Cassandra creates three separate files: a partition index, a partition summary and a Bloom filter.
23. What is data replication in Cassandra?
Data replication is an electronic copying of data from a database in one computer or server to a database in another so that all users can share the same level of information. Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. The replication strategy decides the nodes where replicas are placed.

UNIT III

MAPREDUCE APPLICATIONS

MapReduce workflows - unit tests with MRUnit - test data and local tests - anatomy of MapReduce job run - classic MapReduce - YARN - failures in classic MapReduce and YARN - job scheduling - shuffle and sort - task execution - MapReduce types - input formats - output formats.

3.1 MAPREDUCE WORKFLOWS

MapReduce is a programming model and software framework that allows for the processing of large data sets in parallel across multiple computers.
It was originally developed by Google, and is commonly used in big data processing applications.
The MapReduce model breaks down a large dataset into smaller chunks, and then processes those chunks in parallel across a distributed computing environment.
The MapReduce model is based on two key functions:

• Map function: This function takes in a set of input data and maps it to a set of intermediate key-value pairs.

• Reduce function: This function takes in the intermediate key-value pairs and combines them to produce the final output.

Here are some of the key concepts related to MapReduce.

Job:

• A Job in the context of Hadoop MapReduce is the unit of work to be performed as requested by the client/user.

• The information associated with the Job includes the data to be processed (input data), the MapReduce logic/program/algorithm, and any other relevant configuration information necessary to execute the Job.
Task:
• Hadoop MapReduce divides a Job into multiple sub-jobs known as Tasks.

• These tasks can be run independently of each other on various nodes across the cluster.

• There are primarily two types of Tasks - Map Tasks and Reduce Tasks.

Job Tracker:
• Just like the storage (HDFS), the computation (MapReduce) also works in a master-slave / master-worker fashion.

• A JobTracker node acts as the Master and is responsible for scheduling / executing Tasks on appropriate nodes, coordinating the execution of tasks, sending the information for the execution of tasks, getting the results back after the execution of each task, re-executing the failed Tasks, and monitoring / maintaining the overall progress of the Job.

• Since a Job consists of multiple Tasks, a Job's progress depends on the status / progress of the Tasks associated with it. There is only one JobTracker node per Hadoop Cluster.

TaskTracker:
• A TaskTracker node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker.

• There is no restriction on the number of TaskTracker nodes that can exist in a Hadoop Cluster.

• A TaskTracker receives the information necessary for execution of a Task from the JobTracker, executes the Task, and sends the results back to the JobTracker.

Map()
• The Map Task in MapReduce is performed using the Map() function.

• This part of MapReduce is responsible for processing one or more chunks of data and producing the output results.

Reduce()
• The next part/component/stage of the MapReduce programming model is the Reduce() function.
• This part of MapReduce is responsible for consolidating the results produced by each of the Map() functions/tasks.

Data Locality

• MapReduce tries to place the data and the compute as close as possible. First, it tries to put the compute on the same node where the data resides; if that cannot be done (due to reasons like the compute on that node being down, or the compute on that node performing some other computation), then it tries to put the compute on the node nearest to the respective data node(s) which contains the data to be processed.

• This feature of MapReduce is called "Data Locality".

The following diagram shows the logical flow of a MapReduce programming model.

[Figure: input data stored on HDFS is divided into input splits, each processed by a mapper; the mapper outputs go through shuffling and sorting to the reducers, whose output data is stored on HDFS.]

Fig 3.1: MapReduce workflow

The stages depicted above are:

• Input: This is the input data / file to be processed.

• Split: Hadoop splits the incoming data into smaller pieces called "splits".

• Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.

• Combine: This is an optional step and is used to improve performance by reducing the amount of data transferred across the network. The combiner is the same as the reduce step and is used for aggregating the output of the map() function before it is passed to the subsequent steps.

• Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.

• Reduce: This step is used to aggregate the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.

• Output: Finally, the output of the reduce step is written to a file in HDFS.

Here's an example of using MapReduce to count the frequency of each word in an input text. The text is, "This is an apple. Apple is red in color."
[Figure: the input "This is an apple. Apple is red in color." is divided into two input splits; each mapper emits key-value pairs such as (This, 1), (is, 1), (an, 1), (apple, 1); after shuffling and sorting, the reducer produces the final counts (This, 1), (is, 2), (an, 1), (apple, 2), (red, 1), (in, 1), (color, 1).]

Fig 3.2: MapReduce word count

• The input data is divided into multiple segments, then processed in parallel to reduce processing time. In this case, the input data will be divided into two input splits so that work can be distributed over all the map nodes.

• The Mapper counts the number of times each word occurs in its input split, emitting key-value pairs where the key is the word and the value is the frequency.

• For the first input split, it generates 4 key-value pairs: (This, 1); (is, 1); (an, 1); (apple, 1); and for the second, it generates 5 key-value pairs: (Apple, 1); (is, 1); (red, 1); (in, 1); (color, 1).

• It is followed by the shuffle phase, in which the values are grouped by keys in the form of key-value pairs. Here we get a total of 7 groups of key-value pairs.

• The same reducer is used for all key-value pairs with the same key.

• All the words present in the data are combined into a single output in the reducer phase. The output shows the frequency of each word.

• Here in the example, we get the final output of key-value pairs as (This, 1); (is, 2); (an, 1); (apple, 2); (red, 1); (in, 1); (color, 1).

• The record writer writes the output key-value pairs from the reducer into the output files, and the final output data is by default stored on HDFS.

• The MapReduce framework provides automatic parallelization and fault tolerance, allowing for the efficient processing of large data sets.

• It is commonly used in distributed computing systems such as Apache Hadoop and has become an important tool for processing big data in many industries. A minimal Java sketch of the word count Mapper and Reducer is shown below.
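The following is a minimal sketch of the word count job described above, written against the standard org.apache.hadoop.mapreduce API. The class layout, including the WordCount.WordCountMapper and WordCount.WordCountReducer names that the MRUnit tests in the next section refer to, is an assumption made here for illustration rather than code taken from the text.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count Mapper and Reducer, matching the names used by the MRUnit tests in Section 3.2.
public class WordCount {

    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each word and emit the total.
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}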

3.2 UNIT TESTS WITH MRUNIT

MRUnit is a Java library that provides a testing framework for MapReduce programs.

With MRUnit, you can write unit tests for your MapReduce jobs to verify their correctness and ensure that they are functioning as expected.
Here are the steps to write unit tests with MRUnit:

• Set up your test environment: To use MRUnit, you will need to include the MRUnit library in your Java project. You can do this by adding the MRUnit dependency to your project's build file.

• Define your input and expected output: In your unit test, you will need to define the input data for your MapReduce job and the expected output.

• Write your test case: Use the MRUnit APIs to set up your test case. You can use the MapDriver or ReduceDriver classes to test your Map or Reduce functions, respectively. You can also use the MapReduceDriver class to test your entire MapReduce job.

• Run your test case: Run your test case using the runTest() method and verify that the output matches the expected output.
Here is an example of a unit test for a simple word count MapReduce job using MRUnit:

// Assumed imports for the example (MRUnit 1.x with the new MapReduce API):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void testMapper() throws IOException {
        MapDriver<LongWritable, Text, Text, IntWritable> mapDriver = new MapDriver<>();
        mapDriver.withMapper(new WordCount.WordCountMapper());
        mapDriver.withInput(new LongWritable(0), new Text("hello world hello"));
        mapDriver.withOutput(new Text("hello"), new IntWritable(1));
        mapDriver.withOutput(new Text("world"), new IntWritable(1));
        mapDriver.withOutput(new Text("hello"), new IntWritable(1));
        mapDriver.runTest();
    }

    @Test
    public void testReducer() throws IOException {
        ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver = new ReduceDriver<>();
        reduceDriver.withReducer(new WordCount.WordCountReducer());
        List<IntWritable> values = new ArrayList<>();
        values.add(new IntWritable(1));
        values.add(new IntWritable(1));
        reduceDriver.withInput(new Text("hello"), values);
        reduceDriver.withOutput(new Text("hello"), new IntWritable(2));
        reduceDriver.runTest();
    }

    @Test
    public void testMapReduce() throws IOException {
        MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver = new MapReduceDriver<>();
        mapReduceDriver.withMapper(new WordCount.WordCountMapper());
        mapReduceDriver.withReducer(new WordCount.WordCountReducer());
        mapReduceDriver.withInput(new LongWritable(0), new Text("hello world hello"));
        mapReduceDriver.withOutput(new Text("hello"), new IntWritable(2));
        mapReduceDriver.withOutput(new Text("world"), new IntWritable(1));
        mapReduceDriver.runTest();
    }
}

In the above example, three test cases are defined for the Mapper, the Reducer, and the entire MapReduce job. Each sets up the input data and the expected output using the MRUnit APIs and then runs the test using the runTest() method.
3.3 TEST DATA AND LOCAL TESTS
• Test data and local tests are two key components of MapReduce development
that are used to ensure the correctness of a MapReduce program.
• In the context of Hadoop MapReduce, "test data" refers to the data that is
used for testing MapReduce jobs during development and debugging.

• This data is typically small-scale and carefully crafted to cover different scenarios and edge cases to thoroughly test the MapReduce logic.

• On the other hand, "local test" refers to the practice of running MapReduce jobs in local mode, also known as "local mode" or "local execution".

• Local mode allows developers to run MapReduce jobs on their local machine
without the need for a Hadoop cluster, using the test data as input.

• The combination of test data and local test allows developers to thoroughly
test their MapReduce logic in a controlled environment before deploying it
to a production Hadoop cluster.

• By running MapReduce jobs in local mode with test data, developers can
validate the correctness and performance of their MapReduce logic, identify
and fix any issues or bugs, and ensure that the output of the MapReduce job
meets the expected results.

• Local test with test data is an essential part of the Map Reduce development
and debugging process, as it helps developers catch and fix issues early in
the development cycle, leading to more robust and reliable MapReduce jobs
when deployed in a production Hadoop cluster.
Here is a detailed look at these two concepts:

Test data:

• Test data is a set of input data that is used to test a MapReduce program.
• The test data should cover a range of possible scenarios and edge cases to
ensure that the program handles all possible inputs correctly.
• The test data should be representative of real-world scenarios that the program
is expected to handle.
• To ensure that the Map Reduce program works as expected, developers must
define the expected output for each input scenario.
• By comparing the output of the program with the expected output, developers
can verify that the program is working correctly.
• In MapReduce, test data can be generated in various ways.
• One approach is to use sample data from the production environment.
• Another approach is to generate synthetic data that simulates real-world
scenarios.

Local tests:
• Local tests are tests that are run on the developer's local machine to test a
MapReduce program in isolation.

• Local tests can be run using a testing framework like JUnit, MRUnit, or
Hadoop MiniCluster.
• These tests help catch issues early in the development process and ensure program correctness before deploying the program in a production environment.
• Local tests can be used to test individual MapReduce functions, as well as the entire MapReduce program.

• By running tests in isolation, developers can ensure that each function or


module works as expected before integrating it into the larger program.
• Local tests also help developers catch issues early before they become more
difficult and expensive to fix.

• The local test environment should simulate the production environment as


closely as possible.

• This helps ensure that the program works correctly when deployed in the production environment. A sketch of running a job in Hadoop's local mode is given below.
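To run a job in local mode as described above, the Hadoop configuration can point the framework and the filesystem at the local machine. The sketch below shows one way to do this with the standard Configuration and Job classes; the property keys mapreduce.framework.name and fs.defaultFS are the usual Hadoop 2.x keys, and the WordCount classes and the testdata/ paths are assumptions carried over from the earlier example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Runs the word count job entirely on the local machine against local test data,
// with no Hadoop cluster required.
public class LocalTestRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");    // use the local job runner instead of YARN
        conf.set("fs.defaultFS", "file:///");             // read and write the local filesystem

        Job job = Job.getInstance(conf, "wordcount-local-test");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("testdata/input"));     // small, crafted test data
        FileOutputFormat.setOutputPath(job, new Path("testdata/output"));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}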

Benefits of using test data and local tests in MapReduce development:

• Early bug detection: Test data and local tests help catch issues early in the development process, before the program is deployed in a production environment. This reduces the cost and time required for debugging and fixing issues later on.

• Improved code quality: Test data and local tests promote good coding practices and improve code quality. By testing the code in isolation, developers can ensure that each function or module works as expected before integrating it into the larger program.

• Faster feedback: Running local tests allows developers to get immediate feedback on the correctness of their code. This helps catch issues early and allows developers to make changes quickly.

• Easier refactoring: Test data and local tests make it easier to make changes to the code without introducing new issues. By running the tests after each code change, developers can ensure that the changes did not break any existing functionality.

Test data and local tests are essential components of MapReduce development that help ensure program correctness, catch issues early, improve code quality, and facilitate code changes.
The steps involved in using test data and local tests in MapReduce development:

Define the test data:

• The first step is to define the test data that will be used to test the MapReduce program.

• This involves identifying a range of possible input scenarios and generating sample data that represents these scenarios.

• The test data should be diverse and cover both typical and edge cases.

Define the expected output:

• After defining the test data, the next step is to define the expected output for each input scenario.

• This involves running the MapReduce program on the test data and verifying the output to ensure that it matches the expected result.

Write unit tests:

• The next step is to write unit tests for individual MapReduce functions or modules.

• Unit tests are typically written using a testing framework like JUnit or MRUnit.

• These tests help ensure that each function or module works as expected in isolation.

Write integration tests:

• After writing unit tests, the next step is to write integration tests that test the entire MapReduce program.

• Integration tests help ensure that all components of the program work together as expected.

• These tests can be run locally using a testing framework like Hadoop MiniCluster.

Automate the testing process:

• To make testing more efficient, it's important to automate the testing process using a continuous integration (CI) system like Jenkins or Travis CI.

• This way, tests can be run automatically after each code change, and developers can get immediate feedback on the correctness of their code.

Run end-to-end tests:

• Finally, it's important to conduct end-to-end testing to verify that the MapReduce program works correctly in a production-like environment.

• This involves running the program on a larger dataset in a Hadoop cluster and verifying that the output matches the expected result.

Using test data and local tests in MapReduce development involves defining the test data, writing unit and integration tests, automating the testing process, and running end-to-end tests. This process helps ensure program correctness, catch issues early, and improve code quality.

3.4 ANATOMY OF MAPREDUCE JOB RUN

The anatomy of a MapReduce job run refers to the series of steps that are involved in executing a MapReduce job on a Hadoop cluster. These steps include:

JOB SUBMISSION
• The job submission phase involves specifying the input and output paths, the number of map and reduce tasks, and any additional configuration settings.

• The input data is typically stored in the Hadoop Distributed File System (HDFS) or another storage system that Hadoop can access.

• The job is then submitted to the Hadoop cluster, and the Hadoop JobTracker assigns available resources to the job. A minimal driver that performs this submission is sketched below.
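As a concrete view of the submission step, the sketch below configures the input and output paths and the number of reduce tasks and then submits the job. It reuses the WordCount classes from Section 3.1 as an assumption; job.submit() hands the job to the cluster and returns immediately, whereas waitForCompletion() would block and report progress.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of the job submission phase: input/output paths, task configuration,
// and handing the job over to the cluster.
public class SubmitWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up cluster settings from the classpath
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                        // the number of map tasks follows from the input splits

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data, typically already in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet

        job.submit();                                    // asynchronous hand-off to the cluster
        System.out.println("Submitted job " + job.getJobID());
    }
}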

JOB INITIALIZATION

• Once the job is submitted, the JobTracker initializes the job by allocating resources and creating task trackers on available nodes.

• Task trackers are responsible for executing map and reduce tasks and reporting their progress to the JobTracker.

• The JobTracker also assigns map and reduce tasks to task trackers based on the available resources and the input data size.

MAP PHASE

• In the map phase, the input data is divided into smaller chunks and assigned to individual map tasks.

• Each map task processes its assigned input data and generates key-value pairs as output.
You might also like