• Cassandra stores the data in an in-memory structure, the memtable (RAM), when the initial write request arrives from the client. Concurrently, the write is also appended to the commit log (disk), which is permanent even if the power goes off for the node.
• The data is then written to SSTables, which are immutable: when the memtable is flushed, data is not overwritten in existing SSTables; instead a new file is created. A partition may be stored across multiple SSTables so that it can still be searched easily.
[Figure: Cassandra write path - a write is stored in the memtable in memory and appended to the commit log on disk; the memtable is periodically flushed to an SSTable (with its index) on disk.]
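The following is a minimal, illustrative Java sketch of the write-path ordering described above (durable commit-log append, memtable update, flush to a new immutable file). The class and method names are invented for illustration only and do not correspond to Cassandra's actual internal APIs.

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the write path: append to a commit log for durability, update the
// in-memory memtable, and flush to a brand-new immutable "SSTable" file when the
// memtable grows large. Existing SSTable files are never rewritten.
public class ToyWritePath {
    private final TreeMap<String, String> memtable = new TreeMap<>(); // sorted by key
    private final FileWriter commitLog;
    private int flushCount = 0;

    public ToyWritePath() throws IOException {
        this.commitLog = new FileWriter("commitlog.txt", true); // append mode
    }

    public void write(String key, String value) throws IOException {
        commitLog.write(key + "=" + value + "\n"); // 1. durable append to the commit log
        commitLog.flush();
        memtable.put(key, value);                  // 2. update the memtable in RAM
        if (memtable.size() >= 1000) {
            flushMemtable();                       // 3. flush when the memtable is full
        }
    }

    private void flushMemtable() throws IOException {
        try (FileWriter sstable = new FileWriter("sstable-" + (flushCount++) + ".txt")) {
            for (Map.Entry<String, String> entry : memtable.entrySet()) {
                sstable.write(entry.getKey() + "=" + entry.getValue() + "\n");
            }
        }
        memtable.clear();
    }
}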
• If the row cache is enabled, it is checked first to find the requested data.
• The Bloom filters, which are loaded in heap memory, are checked to find the SSTable files that may contain the requested partition data. Since a Bloom filter works on a probabilistic function, it can return false positives. If the Bloom filter does not rule out an SSTable file, Cassandra further checks the partition key cache.
• The partition key cache stores part of the partition index in heap memory, and the partition key of the requested data is searched in it. If the partition key is present in the partition key cache, Cassandra goes to the compression offset map to find the location on disk that has the data. If the partition key is not present in the partition key cache, the partition summary is searched to find the user-requested data.
• The partition index stores the partition keys of the data; the entry found there is used in the compression offset map to find the exact location on disk where the data is stored.
• The compression offset map holds the exact location of the data and is looked up using the partition key. Once the compression offset map indicates the location where the data is stored, the remaining step is to fetch the data and return it to the user.
[Figure: Cassandra read path - a read request consults the Bloom filter, partition key cache, partition summary, partition index, and compression offsets to locate the data on disk and return the result set.]
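A minimal Java sketch of the lookup order described above, using toy in-memory stand-ins for the caches and indexes; none of the names correspond to Cassandra's real internal classes, and the partition summary step is folded into the index lookup for brevity.

import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Toy model of the read path: row cache -> Bloom filter -> partition key cache
// -> partition summary/index -> compression offset map -> data on disk.
public class ToyReadPath {
    Map<String, String> rowCache;        // partition key -> cached row (optional)
    Predicate<String> bloomFilter;       // may return false positives, never false negatives
    Map<String, Long> partitionKeyCache; // partition key -> position in the partition index
    Map<String, Long> partitionIndex;    // full partition index (summary omitted here)
    Map<Long, Long> compressionOffsets;  // index position -> offset of the data on disk

    public Optional<String> read(String partitionKey) {
        // 1. The row cache, if enabled, can answer immediately.
        if (rowCache != null && rowCache.containsKey(partitionKey)) {
            return Optional.of(rowCache.get(partitionKey));
        }
        // 2. The Bloom filter says "definitely not in this SSTable" or "maybe here".
        if (!bloomFilter.test(partitionKey)) {
            return Optional.empty();
        }
        // 3. The partition key cache, falling back to the partition summary/index.
        Long indexPosition = partitionKeyCache.get(partitionKey);
        if (indexPosition == null) {
            indexPosition = partitionIndex.get(partitionKey);
        }
        if (indexPosition == null) {
            return Optional.empty(); // the Bloom filter gave a false positive
        }
        // 4. The compression offset map gives the exact location on disk.
        long diskOffset = compressionOffsets.get(indexPosition);
        return Optional.of(readFromDisk(diskOffset));
    }

    private String readFromDisk(long offset) {
        return "row-at-" + offset; // placeholder for the actual disk read
    }
}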
2. Handle high speed Applications: Cassandra can handle high-speed data, so it is a great database for applications where data is coming at very high speed from different devices or sensors.
3. Product Catalogs and retail apps: Cassandra is used by many retailers for durable shopping cart protection and fast product catalog input and output.
4. Social Media Analytics and recommendation engine: Cassandra is a great database for many online companies and social media providers for analysis of, and recommendations to, their customers.
The list of companies using Cassandra is growing. These companies include:
• Twitter is using Cassandra for analytics. In a much-publicized blog post, Ryan King, Twitter's primary Cassandra engineer, explained that Twitter had decided against using Cassandra as its primary store for tweets, as originally planned, but would instead use it in production for several different things: for real-time analytics, for geolocation and places of interest data, and for data mining over the entire user store.
• Mahalo uses it for its primary near-time data store.
• Facebook still uses it for inbox search, though they are using a proprietary fork.
• Digg uses it for its primary near-time data store.
• Rackspace uses it for its cloud service, monitoring, and logging.
[Figure: Relational hotel data model with attributes such as hotelID, roomID, phone, name, arrive, depart, rate, and ccNum.]
• In this design, some of the tables, such as Hotel and Guest, are transferred directly to column families.
• Other tables, such as PointOfInterest, have been denormalized into a super column family.
• The room and amenities are combined into a single column family, Room.
• Columns such as type and rate will have corresponding values; other columns, such as hot tub, will just use the presence of the column name itself as the value, and be otherwise empty.
Example: Schema definition in cassandra.yaml
keyspaces:
  - name: Hotelier
    replica_placement_strategy: org.apache.cassandra.locator.RackUnawareStrategy
    replication_factor: 1
    column_families:
      - name: Hotel
        compare_with: UTF8Type
      - name: Guest
        compare_with: BytesType
      - name: Room
        column_type: Super
        compare_with: BytesType
        compare_subcolumns_with: BytesType
      - name: RoomAvailability
        column_type: Super
        compare_with: BytesType
        compare_subcolumns_with: BytesType
2.8.2 Thrift
• Thrift is the driver-level interface; it provides the API for client implementations in a wide variety of languages.
• Thrift is actually an RPC protocol, or API, unified with a code generation tool, and the purpose of using Thrift in Cassandra is that it facilitates easy access to the database (DB) across programming languages.
• Thrift is a code generation library for clients in C++, C#, Erlang, Haskell, Java, Objective C/Cocoa, OCaml, Perl, PHP, Python, Ruby, Smalltalk, and Squeak.
• Its goal is to provide an easy way to support efficient RPC calls in a wide variety of popular languages, without requiring the overhead of something like SOAP.
The design of Thrift offers the following features:
• Language-independent types: Because types are defined in a language-neutral manner using the definition file, they can be shared between different languages. For example, a C++ struct can be exchanged with a Python dictionary.
• Common transport interface: The same application code can be used whether you are using disk files, in-memory data, or a streaming socket.
• Protocol independence: Thrift encodes and decodes the data types for use across protocols.
• Versioning support: The data types are capable of being versioned to support updates to the client API.
2.8.2.1. Exceptions
There are several exceptions that can be thrown from the client interface that
you might see on occasion.
2. What are the features of NoSQL?
1. Availability
2. Flexibility
3. Scalability
4. Distributed
5. Highly functional
6. High performance
3. What are the types of NoSQL databases?
1. Document databases: store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types, including things like strings, numbers, booleans, arrays, or objects.
2. Key-value databases: are a simpler type of database where each item contains keys and values.
3. Wide-column stores or Column Family data stores: store data in tables, rows, and dynamic columns.
4. Graph databases: store data in nodes and edges. Nodes typically store information about people, places, and things, while edges store information about the relationships between the nodes.
4. Difference between SQL and NoSQL?
1. SQL databases are primarily called RDBMS or relational databases, while NoSQL databases are primarily called non-relational or distributed databases.
2. NoSQL databases provide flexibility for data that has not been normalized, which requires a flexible data model, or that has different properties for different data entities.
3. They offer scalability for larger data sets, which are common in analytics and artificial intelligence (AI) applications.
4. NoSQL databases are better suited for cloud, mobile, social media and big data requirements.
5. They are designed for specific use cases and are easier to use than general-purpose relational or SQL databases for those types of applications.
6. List the disadvantages of NoSQL
1. Each NoSQL database has its own syntax for querying and managing data.
2. Lack of a rigid database schema and constraints removes the data integrity safeguards that are built into relational and SQL database systems.
3. A schema with some sort of structure is required in order to use the data.
4. Because most NoSQL databases use the eventual consistency model, they do not provide the same level of data consistency as SQL databases.
5. The data will not be consistent, which means they are not well-suited for transactions that require immediate integrity, such as banking and ATM transactions.
6. There are no comprehensive industry standards as with relational and SQL DBMS offerings.
7. Lack of ACID properties.
8. Lack of JOINs.
7. Difference between Cassandra and MySQL

S.No. | Cassandra | MySQL
1 | Apache Cassandra is a type of NoSQL database. | It is a type of Relational Database.
2 | Apache Software Foundation developed Cassandra and released it in July 2008. | MySQL was developed by MySQL AB (now owned by Oracle) and first released in May 1995.
3 | Apache Cassandra is written in Java. | MySQL is written in C and C++.
4 | Cassandra does not provide ACID properties. It only provides the AID properties. | It provides ACID properties.
5 | Read operation in Cassandra takes O(1) complexity. | Read operation in MySQL takes O(log(n)) complexity.
6 | There is no foreign key in Cassandra. As a result, it does not provide the concept of Referential Integrity. | MySQL has a foreign key, so it supports the concept of Referential Integrity.
8. Difference between Cassandra and RDBMS

Cassandra | RDBMS
Cassandra handles unstructured data. | RDBMS handles structured data.
In Cassandra, the Keyspaces are used to store the tables, and the keyspace is the outermost container of Cassandra. | In RDBMS, the databases are used to store the tables, and the database is the outermost container of the relational database management system.
In Cassandra, the tables are represented as nested key-value pairs. | The tables are represented as arrays of rows and columns.
In Cassandra, the entities are represented through tables and columns. | In RDBMS, the entities are represented through tables.
In Cassandra, the relationship is represented by collections. | Joining in RDBMS is supported by the concept of the foreign key join.
In Cassandra, a column is a storage unit. | The attribute of the table is represented through a column.
In Cassandra, the rows are represented as the replication unit. | In RDBMS, the rows are represented as the actual data.
• Row Key
• Column Keys
• Column Values
15. Name some features of Apache Cassandra.
• High scalability
• High fault tolerance
• Flexible data storage
• Easy data distribution
• Tunable consistency
• Efficient writes
• Cassandra Query Language
16. List some of the components of Cassandra.
Some components of Cassandra are:
1. Table
2. Node
3. Cluster
4. Data Centre
5. Memtable
6. SSTable
7. Commit log
8. Bloom Filter
17. Write some advantages of Cassandra.
These are the advantages of Cassandra:
• Since data can be replicated to several nodes, Cassandra is fault tolerant.
• Cassandra can handle a large set of data.
• Cassandra provides high scalability.
18. Define commit log.
It is a mechanism that is used to recover data in case the database crashes. Every operation that is carried out is saved in the commit log. Using this, the data can be recovered.
19. Define composite key.
Composite keys include a row key and a column name. They are used to define a column family with a concatenation of data of different types.
20. Define SSTable.
SSTable stands for Sorted String Table. It is a data file that accepts regularly flushed memtables.
21. What is memtable?
A memtable is an in-memory/write-back cache space containing content in key and column format. In a memtable, data is sorted by key, and each ColumnFamily has a distinct memtable that retrieves column data via the key. It stores writes until it is full, and is then flushed out.
22. How is an SSTable different from other relational tables?
SSTables do not allow any further addition or removal of data items once written. For each SSTable, Cassandra creates three separate files: a partition index, a partition summary, and a Bloom filter.
23. What is data replication in Cassandra?
Data replication is the electronic copying of data from a database on one computer or server to a database on another, so that all users can share the same level of information. Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. The replication strategy decides the nodes where replicas are placed.
UNIT III
MAP Reduce Applications
• Map function: This function takes in a set of input data and maps it to a set of intermediate key-value pairs.
• The information associated with a Job includes the data to be processed (input data), the MapReduce logic/program/algorithm, and any other relevant configuration information necessary to execute the Job.
Task:
• Hadoop MapReduce divides a Job into multiple sub-jobs known as Tasks.
• These tasks can be run independently of each other on various nodes across the cluster.
• There are primarily two types of Tasks - Map Tasks and Reduce Tasks.
Job Tracker:
• Just like the storage (HDFS), the computation (MapReduce) also works in a master-slave / master-worker fashion.
• A JobTracker node acts as the Master and is responsible for scheduling / executing Tasks on appropriate nodes, coordinating the execution of tasks, sending the information needed for the execution of tasks, getting the results back after the execution of each task, re-executing the failed Tasks, and monitoring / maintaining the overall progress of the Job.
• Since a Job consists of multiple Tasks, a Job's progress depends on the status / progress of the Tasks associated with it. There is only one JobTracker node per Hadoop Cluster.
TaskTracker:
• A TaskTracker node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker.
• There is no restriction on the number of TaskTracker nodes that can exist in a Hadoop Cluster.
• A TaskTracker receives the information necessary for the execution of a Task from the JobTracker, executes the Task, and sends the results back to the JobTracker.
Map()
• The Map Task in MapReduce is performed using the Map() function.
• This part of MapReduce is responsible for processing one or more chunks of data and producing the output results.
Reduce()
• The next part/component/stage of the MapReduce programming model is the Reduce() function.
• This part of MapReduce is responsible for consolidating the results produced by each of the Map() functions/tasks.
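To make these two components concrete, here is a minimal skeleton of how they are expressed against Hadoop's Mapper and Reducer base classes. The class names and the Object type parameters are placeholders chosen for illustration; a real job would use concrete Writable types such as Text and IntWritable.

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Skeleton of the two user-supplied components of a MapReduce job.
public class MapReduceSkeleton {

    public static class MyMapper extends Mapper<Object, Object, Object, Object> {
        @Override
        protected void map(Object key, Object value, Context context)
                throws IOException, InterruptedException {
            // Process one record of an input chunk and emit intermediate key-value pairs.
            context.write(key, value);
        }
    }

    public static class MyReducer extends Reducer<Object, Object, Object, Object> {
        @Override
        protected void reduce(Object key, Iterable<Object> values, Context context)
                throws IOException, InterruptedException {
            // Consolidate all intermediate values that share the same key.
            for (Object value : values) {
                context.write(key, value);
            }
        }
    }
}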
Data Locality
• MapReduce tries to place the data and the compute as close together as possible. First, it tries to put the compute on the same node where the data resides; if that cannot be done (due to reasons such as the compute on that node being down, or that node performing some other computation), it then tries to put the compute on the node nearest to the respective data node(s) containing the data to be processed.
The following diagram shows the logical flow of a MapReduce programming model.
[Figure: Logical flow of the MapReduce programming model - input data stored on HDFS is divided into input splits, each split is processed by a mapper, the mapper outputs are shuffled and sorted, a reducer aggregates them, and the output data is stored on HDFS.]
• Split: Hadoop splits the incoming data into smaller pieces called "splits".
• Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
• Combine: This is an optional step and is used to improve performance by reducing the amount of data transferred across the network. The combiner is essentially the same as the reduce step and is used for aggregating the output of the map() function before it is passed to the subsequent steps.
• Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.
• Reduce: This step is used to aggregate the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
• Output: Finally, the output of the reduce step is written to a file in HDFS.
Here's an example of using MapReduce to count the frequency of each word in an input text. The text is, "This is an apple. Apple is red in color."
[Figure: Word-count example - the input text is divided into two input splits, each mapper emits (word, 1) pairs, the pairs are shuffled and sorted by key, and the reducer produces the final output: This-1, is-2, an-1, apple-2, red-1, in-1, color-1.]
• The Mapper counts the number of times each word occurs in its input split, producing key-value pairs where the key is the word and the value is the frequency.
• For the first input split, it generates 4 key-value pairs: This, 1; is, 1; an, 1; apple, 1; and for the second, it generates 5 key-value pairs: apple, 1; is, 1; red, 1; in, 1; color, 1.
• This is followed by the shuffle phase, in which the values are grouped by key in the form of key-value pairs. Here we get a total of 7 groups of key-value pairs.
• The same reducer is used for all key-value pairs with the same key.
• All the words present in the data are combined into a single output in the reducer phase. The output shows the frequency of each word.
• Here in the example, we get the final output of key-value pairs as This, 1; is, 2; an, 1; apple, 2; red, 1; in, 1; color, 1.
• The record writer writes the output key-value pairs from the reducer into the output files, and the final output data is by default stored on HDFS.
• The MapReduce framework provides automatic parallelization and fault tolerance, allowing for the efficient processing of large data sets.
• It is commonly used in distributed computing systems such as Apache Hadoop and has become an important tool for processing big data in many industries.
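The unit tests in the next section refer to WordCount.WordCountMapper and WordCount.WordCountReducer classes. A sketch of how such a word-count job might be implemented is shown below; the exact class structure is an assumption inferred from the test code, not code given in this text.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Emits (word, 1) for every token in an input line.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word after the shuffle & sort phase.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}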
3.2 UNIT TESTS WITH MRUNIT
With MRUnit, you can write unit tests for your MapReduce jobs to verify their
correctness and ensure that they are functioning as expected.
Here are the steps to write unit tests with MRUnit:
• Set up your test environment: To use MRUnit, you will need to include
the MRUnit library in your Java project. You can do this by adding the
MRUnit dependency to your project's build file.
• Define your input and expected output: In your unit test, you will need to
define the input data for your MapReduce job and the expected output.
• Write your test case: Use the MRUnit APIs to set up your test case. You can use the MapDriver or ReduceDriver classes to test your Map or Reduce functions, respectively. You can also use the MapReduceDriver class to test your entire MapReduce job.
• Run your test case: Run your test case using the runTest() method and verify that the output matches the expected output.
Here is an example of a unit test for a simple word count MapReduce job using
MRUnit:
@Test
public void testMapper() {
    MapDriver<LongWritable, Text, Text, IntWritable> mapDriver = new MapDriver<>();
    mapDriver.withMapper(new WordCount.WordCountMapper());
    mapDriver.withInput(new LongWritable(0), new Text("hello world hello"));
    mapDriver.withOutput(new Text("hello"), new IntWritable(1));
    mapDriver.withOutput(new Text("world"), new IntWritable(1));
    mapDriver.withOutput(new Text("hello"), new IntWritable(1));
    mapDriver.runTest();
}
@Test
public void testReducer() {
    ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver = new ReduceDriver<>();
    reduceDriver.withReducer(new WordCount.WordCountReducer());
    List<IntWritable> values = new ArrayList<>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    reduceDriver.withInput(new Text("hello"), values);
    reduceDriver.withOutput(new Text("hello"), new IntWritable(2));
    reduceDriver.runTest();
}
@Test
public void testMapReduce() {
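    // The body of this test is not reproduced in the text; the following is a
    // plausible sketch that exercises the whole pipeline using the same
    // hypothetical WordCount classes as the tests above.
    MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>
            mapReduceDriver = new MapReduceDriver<>();
    mapReduceDriver.withMapper(new WordCount.WordCountMapper());
    mapReduceDriver.withReducer(new WordCount.WordCountReducer());
    mapReduceDriver.withInput(new LongWritable(0), new Text("hello world hello"));
    // MapReduceDriver sorts the intermediate keys, so the expected outputs are
    // declared in key order.
    mapReduceDriver.withOutput(new Text("hello"), new IntWritable(2));
    mapReduceDriver.withOutput(new Text("world"), new IntWritable(1));
    mapReduceDriver.runTest();
}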
In the above example, three test cases are defined for the Mapper, the Reducer, and the entire MapReduce job. Each one sets up the input data and the expected output using the MRUnit APIs and then runs the test using the runTest() method.
3.3 TEST DATA AND LOCAL TESTS
• Test data and local tests are two key components of MapReduce development
that are used to ensure the correctness of a MapReduce program.
• In the context of Hadoop MapReduce, "test data" refers to the data that is
used for testing MapReduce jobs during development and debugging.
• On the other hand, "local test" refers to the practice of running MapReduce jobs in local mode, also known as "local execution".
• Local mode allows developers to run MapReduce jobs on their local machine
without the need for a Hadoop cluster, using the test data as input.
• The combination of test data and local test allows developers to thoroughly
test their MapReduce logic in a controlled environment before deploying it
to a production Hadoop cluster.
• By running MapReduce jobs in local mode with test data, developers can
validate the correctness and performance of their MapReduce logic, identify
and fix any issues or bugs, and ensure that the output of the MapReduce job
meets the expected results.
• Local test with test data is an essential part of the MapReduce development
and debugging process, as it helps developers catch and fix issues early in
the development cycle, leading to more robust and reliable MapReduce jobs
when deployed in a production Hadoop cluster.
Here is a detailed look at these two concepts:
Test data:
• Test data is a set of input data that is used to test a MapReduce program.
• The test data should cover a range of possible scenarios and edge cases to
ensure that the program handles all possible inputs correctly.
• The test data should be representative of real-world scenarios that the program
is expected to handle.
• To ensure that the MapReduce program works as expected, developers must
define the expected output for each input scenario.
• By comparing the output of the program with the expected output, developers
can verify that the program is working correctly.
• In MapReduce, test data can be generated in various ways.
• One approach is to use sample data from the production environment.
• Another approach is to generate synthetic data that simulates real-world
scenarios.
Local tests:
• Local tests are tests that are run on the developer's local machine to test a
MapReduce program in isolation.
• Local tests can be run using a testing framework like JUnit, MRUnit, or
Hadoop MiniCluster.
• These tests help catch issues early in the development process and ensure
program correctness before deploying the program in a production
environment.
• Local tests can be used to test individual MapReduce functions, as well as
the entire MapReduce program.
• This helps ensure that the program works correctly when deployed in the
production environment.
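As a rough illustration of what running in local mode involves, the configuration sketch below forces a job to execute inside a single local JVM against the local filesystem, so no Hadoop cluster or HDFS is needed. The property names shown are standard Hadoop settings; the helper class itself is invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Builds a Job that runs entirely on the developer's machine: map and reduce
// tasks execute in the local JVM, and paths resolve against the local filesystem.
public class LocalTestConfig {
    public static Job newLocalJob(String name) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // local runner, no cluster needed
        conf.set("fs.defaultFS", "file:///");          // local filesystem, no HDFS needed
        // Mapper/reducer classes and the test-data input/output paths are then set
        // on the returned Job exactly as they would be for a cluster run.
        return Job.getInstance(conf, name);
    }
}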
Benefits of using test data and local tests in MapReduce development:
• Early bug detection: Test data and local tests help catch issues early in the development process, before the program is deployed in a production environment. This reduces the cost and time required for debugging and fixing issues later on.
• Improved code quality: Test data and local tests promote good coding practices and improve code quality. By testing the code in isolation, developers can ensure that each function or module works as expected before integrating it into the larger program.
• Easier refactoring: Test data and local tests make it easier to make changes to the code without introducing new issues. By running the tests after each code change, developers can ensure that the changes did not break any existing functionality.
Test data and local tests are essential components of MapReduce development that help ensure program correctness, catch issues early, improve code quality, and facilitate code changes.
The steps involved in using test data and local tests in MapReduce development:
Define the test data
• The first step is to define the test data that will be used to test the MapReduce
program.
• This involves identifying a range of possible input scenarios and generating sample data that represents these scenarios.
• The test data should be diverse and cover both typical and edge cases.
Define the expected output
• After defining the test data, the next step is to define the expected output for each input scenario.
JOB SUBMISSION
• The job submission phase involves specifying the input and output paths, the number of map and reduce tasks, and any additional configuration settings.
• The input data is typically stored in the Hadoop Distributed File System (HDFS) or another storage system that Hadoop can access.
• The job is then submitted to the Hadoop cluster, and the Hadoop JobTracker assigns available resources to the job.
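A sketch of what the job submission step can look like from the client side, using the standard Hadoop Job API. The WordCount classes and the /user/hadoop paths are illustrative assumptions carried over from the earlier examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // MapReduce logic and any additional configuration settings.
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2); // number of reduce tasks (map tasks follow the input splits)

        // Input and output paths, typically on HDFS.
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));

        // Submit the job to the cluster and wait for it to complete.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}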
JOB INITIALIZATION
• Once the job is submitted, the JobTracker initializes the job by allocating resources and creating task trackers on available nodes.
• Task trackers are responsible for executing map and reduce tasks and reporting their progress to the JobTracker.
• The JobTracker also assigns map and reduce tasks to task trackers based on the available resources and the input data size.
MAP PHASE
• In the map phase, the input data is divided into smaller chunks and assigned to individual map tasks.
• Each map task processes its assigned input data and generates key-value pairs as output.