MCA - BigData Notes
UNIT - I
5 Marks
15 marks
UNIT - II
5 Marks
3. Write steps to find the most popular elements using decaying windows.
15 marks
UNIT - III
5 Marks
15 marks
2. Explain the MapReduce data flow with a single reduce task and with multiple reduce tasks.
3. Define HDFS. Describe namenode, datanode and block. Explain HDFS operations in detail.
4. Write in detail the concept of developing the Map Reduce Application.
UNIT - IV
5 Marks
1. What are the different types of Hadoop configuration files? Discuss.
15 marks
UNIT - V
5 Marks
15 marks
UNIT V – FRAMEWORKS
Applications on Big Data Using Pig and Hive – Data processing operators in
TEXT BOOKS
2. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly.
A big data platform works to wrangle this amount of information, storing it in a manner that is organized and understandable enough to extract useful insights. Big data platforms utilize a combination of data management hardware and software tools to aggregate data on a massive scale, usually on the cloud.
One of the major challenges of conventional systems was the uncertainty of the Data
Management Landscape.
Fundamental challenges
Big data has revolutionized the way businesses operate, but it has also presented a number of challenges for conventional systems. Here are some of the challenges faced by conventional systems in handling big data:
Big data is a term used to describe the large amount of data that can be stored and analyzed by computers. Big data is often used in business, science and government. Big Data has been around for several years now, but it is only recently that people have started realizing how important it is for businesses to use this technology in order to improve their operations and provide better services to customers. Many companies have already started using big data analytics tools because they realize how much potential there is in utilizing these systems effectively.
However, while there are many benefits associated with using such systems - including faster processing times as well as increased accuracy - there are also some challenges involved with implementing them correctly.
Challenges of Conventional System in big data
● Scalability
● Speed
● Storage
● Data Integration
● Security
Scalability
A common problem with conventional systems is that they can't scale. As the amount of data increases, so does the time it takes to process and store it. This can cause bottlenecks and system crashes, which are not ideal for businesses looking to make quick decisions based on their data.
Conventional systems also lack flexibility in how they handle new types of information -- for example, if you want to add another column (columns are like fields) or row (rows are like records), you often cannot do so without rewriting much of your code from scratch.
Speed
Speed is a critical component of any data processing system. Speed is important because it allows you to:
● Process and analyze your data faster, which means you can make better-informed
decisions about how to proceed with your business.
● Make more accurate predictions about future events based on past performance.
Storage
The amount of data being created and stored is growing exponentially, with estimates that it reached 44 zettabytes by 2020. That's a lot of storage space!
The problem with conventional systems is that they don't scale well as you add more data. This leads to huge amounts of wasted storage space and lost information due to corruption or security breaches.
Data Integration
The challenges of conventional systems in big data are numerous. Data integration is one of the biggest challenges, as it requires a lot of time and effort to combine different sources into a single database. This is especially true when you're trying to integrate data from multiple sources with different schemas and formats.
Another challenge is errors and inaccuracies in analysis due to a lack of understanding of what exactly happened during an event or transaction. For example, if there was an error while transferring money from one bank account to another, there would be no way for us to know what actually happened unless someone tells us about it later on (which may not happen).
Security
Security is a major challenge for enterprises that depend on conventional systems to process and store their data. Traditional databases are designed to be accessed by trusted users within an organization, but this makes it difficult to ensure that only authorized people have access to sensitive information.
Security measures such as firewalls, passwords and encryption help protect against unauthorized access and attacks by hackers who want to steal data or disrupt operations. But these security measures have limitations: they're expensive; they require constant monitoring and maintenance; they can slow down performance if implemented too extensively; and they often don't prevent breaches altogether because there's always some way around them (such as through phishing emails).
Conventional systems are not equipped for big data. They were designed for a different era, when the volume of information was much smaller and more manageable. Now that we're dealing with huge amounts of data, conventional systems are struggling to keep up. Conventional systems are also expensive and time-consuming to maintain; they require constant maintenance and upgrades in order to meet new demands from users who want faster access speeds and more features than ever before.
Because of the five V's of Big Data, Big Data and analytics technologies enable your organisation to become more competitive and grow. This, when combined with specialised solutions for its analysis, such as an Intelligent Data Lake, adds a great deal of value to a corporation. Let's get started:
The five V's of Big Data are widely used to describe its characteristics; a problem is usually treated as a big data problem if it meets these five criteria:
● Volume
● Value
● Velocity
● Veracity
● Variety
Volume capacity
One of the characteristics of big data is its enormous capacity. According to the above description, it is "data that cannot be controlled by existing general technology," although it appears that many people believe the amount of data ranges from several terabytes to several petabytes.
The volume of data refers to the size of the data sets that must be examined and managed, which are now commonly in the terabyte and petabyte ranges. The sheer volume of data necessitates processing methods that are separate and distinct from standard storage and processing capabilities. In other words, the data sets in Big Data are too vast to be processed by a standard laptop or desktop CPU. A high-volume data set would include all credit card transactions in Europe on a given day.
Value
The most important "V" from a financial perspective, the value of big data typically stems from insight exploration and information processing, which leads to more efficient functioning, bigger and more powerful client relationships, and other clear and quantifiable financial gains.
This refers to the value that big data can deliver, and it is closely related to what enterprises can do with the data they collect. The ability to extract value from big data is required, as the value of big data increases considerably based on the insights that can be gleaned from it. Companies can obtain and analyze the data using the same big data techniques, but how they derive value from that data should be unique to them.
Variety type
Big Data is very massive due to its diversity. Big Data originates from a wide range of sources and is often classified as one of three types: structured, semi-structured, or unstructured data. The multiplicity of data kinds usually necessitates specialised processing skills and algorithms. CCTV audio and video recordings generated at many points around a city are an example of a high-variety data set.
Big data may not always refer to structured data that is typically managed in a company's core system. Unstructured data includes text, sound, video, log files, location information, sensor information, and so on. Of course, some of this unstructured data has been around for a while. Going forward, efforts are being made to analyse this information and extract usable knowledge from it, rather than merely accumulating it.
Veracity
The quality of the data being studied is referred to as its veracity. High-quality data contains a large number of records that are useful for analysis and contribute significantly to the total findings. Data of low veracity, on the other hand, comprises a significant percentage of useless data; the non-valuable portion of these data sets is referred to as noise. Data from a medical experiment or trial is an example of a high-veracity data set.
Efforts to value big data are pointless if they do not result in business value. Big data can and will be utilised in a broad range of circumstances in the future. To make big data efforts high-value initiatives and consistently acquire the value that businesses should seek, it is not enough to introduce new tools and services; operations and services must also be rebuilt around strategic measures.
To reveal meaningful information, high-volume, high-velocity, and high-variety data must be processed using advanced tools (analytics and algorithms). Because of these data properties, the knowledge area concerned with the storage, processing, and analysis of huge data collections has been dubbed Big Data.
Unstructured data analysis has gained popularity in recent years as a form of big data analysis. However, some forms of unstructured data are suited to data analysis while others are not. Here, we discuss unstructured data with and without regularity, as well as the link between structured and unstructured data.
Big data is a set of data consisting of structured and unstructured data, of which unstructured data is stored in its native format. In addition, although nothing is processed until it is used, unstructured data has the advantage of being highly flexible and versatile, because it can be processed relatively freely at the time it is used. It is also easy for humans to recognize and understand as it is.
Structured data
RDBs such as Oracle, PostgreSQL, and MySQL can be said to be databases for storing
structured data.
Semi-structured data
Semi-structured data is data that falls between the structured and unstructured categories. When categorised loosely, it is classed as unstructured data, but it is distinguished by the ability to be handled as structured data as soon as it is processed, since the structure of the information that specifies certain qualities is defined.
It is not clearly structured with columns and rows, yet it is a manageable piece of data because it is layered and includes regular elements. Examples include .csv and .tsv files. While .csv is referred to as a CSV file, the point at which elements are divided and organised by comma separation is an intermediate form that may be viewed as structured data.
Semi-structured data, on the other hand, lacks a set format like structured data and maintains data through the combination of data and tags.
Another distinguishing aspect is that data structures are nested. Semi-structured data formats include the XML and JSON formats.
Google Cloud Platform offers NoSQL databases such as Cloud Firestore and Cloud Bigtable for working with semi-structured data.
Unstructured data
One of the components of big data, real-time, provides you an advantage over your competition. Real-time performance entails the rapid processing of enormous amounts of data as well as the quick analysis of data that is continually flowing.
Big data contains a component called Veracity (accuracy), and it is distinguished by the availability of real-time data. Real-time skills allow us to discover market demands rapidly and use them in marketing and management strategies to build accurate enterprises. Immediate responsiveness to ever-changing markets gives you a competitive advantage over your competitors.
This is a disadvantage for customers rather than firms attempting to increase the accuracy of marketing, etc. by utilising big data, but if these issues grow and legal constraints get stronger, the area of use may be limited. Companies that use big data must be prepared to handle data responsibly in compliance with the Personal Information Protection Act and other regulatory standards.
NATURE OF DATA
To understand the nature of data, we must recall: what are data? And what are the functions that data should perform on the basis of its classification?
The first point in this is that data should have specific items (values or facts), which must be identified.
Secondly, specific items of data must be organised into a meaningful form.
Thirdly, data should have the functions to perform.
Furthermore, the nature of data can be understood on the basis of the class to which it
belongs.
We have seen that in sciences there are six basic types within which there exist fifteen different classes of data. However, these are not mutually exclusive. There is a large measure of cross-classification, e.g., all quantitative data are numerical data, and most data are quantitative data.
Graphic and symbolic data: Graphic and symbolic data are modes of presentation. They enable users to grasp data by visual perception. The nature of data, in these cases, is graphic. Likewise, it is possible to determine the nature of data in social sciences also.
Enumerative data: Most data in social sciences are enumerative in nature. However, they are refined with the help of statistical techniques to make them more meaningful; they are then known as statistical data. This explains the use of different scales of measurement whereby they are graded.
Descriptive data: All qualitative data in sciences can be descriptive in nature. These can be in the form of definitive statements. All cataloguing and indexing data are bibliographic, whereas all management data such as books acquired, books lent, visitors served and photocopies supplied are non-bibliographic.
Having seen the nature of data, let us now examine the properties, which the data should
ideally possess.
Sampling Distributions
Sampling distribution refers to studying the randomly chosen samples to understand the
variations in the outcome expected to be derived.
Sampling distribution in statistics represents the probability of varied outcomes when a study is conducted. It is also known as the finite-sample distribution. In the process, users collect samples randomly but from one chosen population. A population is a group of people having the same attribute used for random sample collection in terms of statistics.
Sampling distribution of the mean, sampling distribution of proportion, and the T-distribution are three major types of finite-sample distribution.
Re-Sampling
Resampling involves the selection of randomized cases, with replacement, from the original data sample, in such a manner that each sample drawn has the same number of cases as the original data sample. Because of the replacement, the samples drawn by the resampling method may contain repeated cases.
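A minimal sketch of bootstrap resampling in Python (the data values, the number of resamples, and the use of numpy are illustrative assumptions, not part of the notes):

import numpy as np

rng = np.random.default_rng(42)
original = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2])   # original data sample

# Draw resamples with replacement, each the same size as the original sample
boot_means = [rng.choice(original, size=len(original), replace=True).mean()
              for _ in range(1000)]

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error of the mean:", np.std(boot_means, ddof=1))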
Statistical Inference
Statistical inference is defined as the procedure of analyzing the results and making conclusions from data subject to random variation. The two main applications of statistical inference are hypothesis testing and confidence intervals. Statistical inference is the technique of making decisions about the parameters of a population that relies on random sampling. It enables us to assess the relationship between dependent and independent variables. The idea of statistical inference is to estimate the uncertainty or sample-to-sample variation. It enables us to deliver a range of values for the true value of something in the population. The components used for making a statistical inference are:
● Sample Size
● Variability in the sample
● Size of the observed difference
The two most important types of statistical inference that are primarily used are (a small confidence-interval sketch follows the list below):
● Confidence Interval
● Hypothesis testing
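As a small illustration of the confidence-interval idea, the sketch below computes an approximate 95% interval for a population mean (it assumes a large-enough sample for the normal approximation and the 1.96 multiplier to be reasonable; the data values are made up):

import numpy as np

data = np.array([52, 48, 55, 50, 47, 53, 49, 51, 54, 50])   # made-up sample

mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(len(data))   # standard error of the mean

# Approximate 95% confidence interval for the population mean
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")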
Importance of Statistical Inference
Statistical inference is significant to examine the data properly. To make an effective solution, accurate data analysis is important to interpret the results of the research. Inferential statistics is used in the future prediction for varied observations in different fields. It enables us to make inferences about the data. It also helps us to deliver a probable range of values for the true value of something in the population.
In statistics, prediction error refers to the difference between the predicted values made by some model and the actual values.
1. Linear regression: Used to predict the value of some continuous response variable.
We typically measure the prediction error of a linear regression model with a metric known as RMSE, which stands for root mean squared error.
2. Logistic regression: Used to predict the value of some binary response variable.
One common way to measure the prediction error of a logistic regression model is with a metric known as the total misclassification rate. Both metrics are computed in the short sketch below.
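A short sketch computing both metrics (the predicted and actual values are made up for illustration):

import numpy as np

# Linear regression: RMSE between actual and predicted continuous values
actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.8, 5.4, 2.0, 6.5])
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(f"RMSE = {rmse:.3f}")

# Logistic regression: total misclassification rate for binary labels
actual_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted_labels = np.array([1, 0, 0, 1, 0, 1, 1, 0])
misclassification_rate = np.mean(actual_labels != predicted_labels)
print(f"misclassification rate = {misclassification_rate:.3f}")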
Introduction to Streams Concepts – Stream Data Model and Architecture – Stream Computing – Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Oneness in a Window – Decaying Window – Real time Analytics Platform (RTAP) Applications – Case Studies – Real Time Sentiment Analysis, Stock Market Predictions.
Stream Processing
Stream processing is a method of data processing that involves continuously processing data in real time as it is generated, rather than processing it in batches. In stream processing, data is processed incrementally and in small chunks as it arrives, making it possible to analyze and act on data in real time.
Stream processing is particularly useful in scenarios where data is generated rapidly, such as in the case of IoT devices or financial markets, where it is important to detect anomalies or patterns in data quickly. Stream processing can also be used for real-time data analytics, machine learning, and other applications where real-time data processing is required.
There are several popular stream processing frameworks, including Apache Flink,
Apache Kafka, Apache Storm, and Apache Spark Streaming. These frameworks
provide tools for building and deploying stream processing pipelines, and they can
handle large volumes of data with low latency and high throughput.
Mining data streams refers to the process of extracting useful insights and patterns from continuous and rapidly changing data streams in real time. Data streams are typically high-volume and high-velocity, making it challenging to analyze them using traditional data mining techniques.
Mining data streams requires specialized algorithms that can handle the dynamic nature of data streams, as well as the need for real-time processing. These algorithms typically use techniques such as sliding windows, online learning, and incremental processing to adapt to changing data patterns over time.
Applications of mining data streams include fraud detection, network intrusion
detection, predictive maintenance, and real-time recommendation systems. Some
popular algorithms for mining data streams include Frequent Pattern Mining (FPM),
clustering, decision trees, and neural networks.
Mining data streams also requires careful consideration of the computational resources required to process the data in real time. As a result, many mining data stream algorithms are designed to work with limited memory and processing power, making them well-suited for deployment on edge devices or in cloud-based architectures.
Introduction to Streams Concepts
In computer science, a stream refers to a sequence of data elements that are continuously generated or received over time. Streams can be used to represent a wide range of data, including audio and video feeds, sensor data, and network packets.
1. Data Source: A stream's data source is the place where the data is generated or received. This can include sensors, databases, network connections, or other sources.
2. Data Sink: A stream's data sink is the place where the data is consumed or stored. This can include databases, data lakes, visualization tools, or other destinations.
3. Streaming Data Processing: This refers to the process of continuously processing data as it arrives in a stream. This can involve filtering, aggregation, transformation, or analysis of the data.
4. Stream Processing Frameworks: These are software tools that provide an environment for building and deploying stream processing applications. Popular stream processing frameworks include Apache Flink, Apache Kafka, and Apache Spark Streaming.
5. Real-time Data Processing: This refers to the ability to process data as soon as it is generated or received. Real-time data processing is often used in applications that require immediate action, such as fraud detection or monitoring of critical systems.
Overall, streams are a powerful tool for processing and analyzing large volumes of data in real time, enabling a wide range of applications in fields such as finance, healthcare, and the Internet of Things.
Stream Data Model and Architecture
The architecture of a stream processing system typically involves three main
components: data sources, stream processing engines, and data sinks.
1. Data sources: The data sources are the components that generate the events that make up the stream. These can include sensors, log files, databases, and other data sources.
2. Stream processing engines: The stream processing engines are the components responsible for processing the data in real time. These engines typically use a variety of algorithms and techniques to filter, transform, aggregate, and analyze the stream of events.
3. Data sinks: The data sinks are the components that receive the output of the stream processing engines. These can include databases, data lakes, visualization tools, and other data destinations.
The architecture of a stream processing system can be distributed or centralized, depending on the requirements of the application. In a distributed architecture, the stream processing engines are distributed across multiple nodes, allowing for increased scalability and fault tolerance. In a centralized architecture, the stream processing engines run on a single node, which can simplify deployment and management.
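As a minimal illustration of this source -> processing engine -> sink pattern, here is a small Python sketch built from generators; the names (sensor_source, threshold_filter, print_sink) and the random readings are illustrative assumptions, not part of any particular framework:

import random
import time

def sensor_source(n=10):
    # Data source: emits (timestamp, reading) events
    for _ in range(n):
        yield time.time(), random.uniform(0.0, 100.0)

def threshold_filter(events, limit=80.0):
    # Processing engine: keeps only readings above a threshold
    for ts, value in events:
        if value > limit:
            yield ts, value

def print_sink(events):
    # Data sink: here we just print, but this could write to a database
    for ts, value in events:
        print(f"{ts:.0f}: high reading {value:.1f}")

if __name__ == "__main__":
    print_sink(threshold_filter(sensor_source()))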
Stream computing is the process of computing and analyzing data streams in real time. It involves continuously processing data as it is generated, rather than processing it in batches. Stream computing is particularly useful for scenarios where data is generated rapidly and needs to be analyzed quickly.
Stream computing involves a set of techniques and tools for processing and analyzing data streams, including:
1. Stream processing frameworks: These are software tools that provide an environment for building and deploying stream processing applications. Popular stream processing frameworks include Apache Flink, Apache Kafka, and Apache Storm.
2. Stream processing algorithms: These are specialized algorithms that are designed to handle the dynamic and rapidly changing nature of data streams. These algorithms use techniques such as sliding windows, online learning, and incremental processing to adapt to changing data patterns over time.
3. Real-time data analytics: This involves using stream computing techniques to perform real-time analysis of data streams, such as detecting anomalies, predicting future trends, and identifying patterns.
4. Machine learning: Machine learning algorithms can also be used in stream computing to continuously learn from the data stream and make predictions in real time.
Stream computing is becoming increasingly important in fields such as finance, healthcare, and the Internet of Things (IoT), where large volumes of data are generated and need to be processed and analyzed in real time. It enables businesses and organizations to make more informed decisions based on real-time insights, leading to better operational efficiency and improved customer experiences.
Sampling Data in a Stream
There are various sampling techniques that can be used for stream data, including:
1. Random sampling: This involves selecting data points from the stream at random intervals. Random sampling can be used to obtain a representative sample of the entire stream.
2. Systematic sampling: This involves selecting data points at regular intervals, such as every tenth or hundredth data point. Systematic sampling can be useful when the stream has a regular pattern or periodicity.
3. Cluster sampling: This involves dividing the stream into clusters and selecting data points from each cluster. Cluster sampling can be useful when there are multiple sub-groups within the stream.
4. Stratified sampling: This involves dividing the stream into strata or sub-groups based on some characteristic, such as location or time of day. Stratified sampling can be useful when there are significant differences between the sub-groups.
When sampling data in a stream, it is important to ensure that the sample is representative of the entire stream. This can be achieved by selecting a sample size that is large enough to capture the variability of the stream and by using appropriate sampling techniques.
Sampling data in a stream can be used in various applications, such as monitoring and quality control, statistical analysis, and machine learning. By reducing the amount of data that needs to be processed in real time, sampling can help improve the efficiency and scalability of stream processing systems.
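One standard way to keep a fixed-size random sample that remains representative as the stream grows is reservoir sampling; the sketch below is a minimal illustration (the function name and the sample size k are illustrative):

import random

def reservoir_sample(stream, k=100):
    # Keep a uniform random sample of k items from a stream of unknown length
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)   # replace existing items with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Example: sample 5 values from a simulated stream of 10,000 readings
print(reservoir_sample(range(10_000), k=5))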
Filtering Streams
Filtering streams refers to the process of selecting a subset of data from a data stream based on certain criteria. This process is often used in stream processing systems to reduce the amount of data that needs to be processed and to focus on the relevant data.
There are various filtering techniques that can be used for stream data, including:
3. Machine learning-based filtering: This involves using machine learning algorithms to automatically classify data points in the stream based on past observations. This can be useful in applications such as anomaly detection or predictive maintenance.
When filtering streams, it is important to consider the trade-off between the amount of data being filtered and the accuracy of the filtering process. Too much filtering can result in valuable data being discarded, while too little filtering can result in a large volume of irrelevant data being processed.
Filtering streams can be useful in various applications, such as monitoring and
surveillance, real-time analytics, and Internet of Things (IoT) data processing. By
reducing the amount of data that needs to be processed and analyzed in real-time,
filtering can help improve the efficiency and scalability of stream processing systems.
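A common building block for filtering streams is the Bloom filter, which cheaply tests whether an item has (probably) been seen before. The sketch below is a simplified illustration; the bit-array size, the number of hashes, and the use of Python's built-in hash are assumptions made for brevity:

class BloomFilter:
    # Probabilistic set membership: no false negatives, a small rate of false positives
    def __init__(self, size=10_000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # derive several hash positions from one item by salting with the index
        return [hash((i, item)) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Filter a stream, keeping only elements not (probably) seen before
seen = BloomFilter()
for element in ["a", "b", "a", "c", "b", "d"]:
    if not seen.might_contain(element):
        seen.add(element)
        print("new element:", element)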
Counting Distinct Elements in a Stream
Counting distinct elements in a stream refers to the process of counting the number of unique items in a continuous and rapidly changing data stream. This is an important operation in stream processing because it can help detect anomalies, identify trends, and provide insights into the data stream.
There are various techniques for counting distinct elements in a stream, including:
2. Approximate counting: This involves using probabilistic algorithms such as the Flajolet-Martin algorithm or the HyperLogLog algorithm to estimate the number of distinct elements in a data stream. These algorithms use a small amount of memory to provide an approximate count with a known level of accuracy.
3. Sampling: This involves selecting a subset of the data stream and counting the distinct elements in the sample. This can be useful when the data stream is too large to be processed in real time or when exact or approximate counting techniques are not feasible.
Counting distinct elements in a stream can be useful in various applications, such as social media analytics, fraud detection, and network traffic monitoring. By providing real-time insights into the data stream, counting distinct elements can help businesses and organizations make more informed decisions and improve operational efficiency.
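As a rough sketch of the approximate-counting idea, the Flajolet-Martin estimator below uses the maximum number of trailing zero bits seen in hashed items; real implementations average many hash functions (or use HyperLogLog) to reduce the error, so treat this as an illustration only:

import hashlib

def trailing_zeros(n):
    # Number of trailing zero bits in n (0 is treated as having 32 zeros)
    if n == 0:
        return 32
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def flajolet_martin(stream):
    # Rough estimate of the number of distinct elements in a stream
    max_zeros = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_zeros = max(max_zeros, trailing_zeros(h))
    return (2 ** max_zeros) / 0.77351   # phi correction factor

stream = [1, 2, 3, 2, 1, 4, 5, 4, 6, 7, 8, 8, 9, 10]
print("approximate distinct count:", round(flajolet_martin(stream)))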
Estimating Moments
In statistics, moments are numerical measures that describe the shape, central tendency, and variability of a probability distribution. They are calculated as functions of the random variables of the distribution, and they can provide useful insights into the underlying properties of the data.
There are different types of moments, but two of the most commonly used are the mean (the first moment) and the variance (the second moment). The mean represents the central tendency of the data, while the variance measures its spread or variability.
To estimate the moments of a distribution from a sample of data, you can use the following formulas:
x̄ = (1/n) Σ x_i (sample mean)
s^2 = (1/(n-1)) Σ (x_i - x̄)^2 (sample variance)
where n is the sample size, x_i are the individual observations, x̄ is the sample mean, and s^2 is the sample variance.
These formulas provide estimates of the population moments based on the sample data. The larger the sample size, the more accurate the estimates will be. However, it's important to note that these formulas only work well for certain types of distributions (e.g., the normal distribution), and for other types of distributions, different formulas may be required.
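A minimal Python sketch of these estimates (numpy's ddof=1 gives the n-1 denominator used in the sample-variance formula above; the sample values are made up):

import numpy as np

sample = np.array([2.1, 3.4, 1.8, 4.0, 2.9, 3.3])

mean = sample.mean()            # first moment: (1/n) * sum of x_i
variance = sample.var(ddof=1)   # second moment: s^2 with the n-1 denominator

print(f"sample mean = {mean:.3f}, sample variance = {variance:.3f}")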
Counting Oneness in a Window
Counting the number of times a number appears exactly once (oneness) in a window of a given size in a sequence is a common problem in computer science and data analysis. Here's one way you could approach this problem:
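The notes do not include the approach itself, so here is a minimal sketch assuming a simple sliding window and a counter of frequencies (the function and variable names are illustrative):

from collections import Counter

def count_oneness(sequence, window_size):
    # For each full window, count how many values appear exactly once in it
    results = []
    counts = Counter()
    for i, value in enumerate(sequence):
        counts[value] += 1
        if i >= window_size:                 # slide: drop the value leaving the window
            old = sequence[i - window_size]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        if i >= window_size - 1:             # window is full: record the oneness count
            results.append(sum(1 for c in counts.values() if c == 1))
    return results

print(count_oneness([1, 2, 2, 3, 1, 4, 4, 5], window_size=4))   # [2, 2, 4, 2, 2]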
Decaying Window
A decaying window is a common technique used in time-series analysis and signal processing to give more weight to recent observations while gradually reducing the importance of older observations. This can be useful when the underlying data-generating process is changing over time, and more recent observations are more relevant for predicting future values.
Here's one way you could implement a decaying window in Python using an
exponentially weighted moving average (EWMA):
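The listing itself is missing from these notes; the following is a reconstruction that matches the description in the next paragraphs (a minimal sketch: the function name decaying_window_average, the pandas-based rolling approach, and the default parameters are assumptions):

import numpy as np
import pandas as pd

def decaying_window_average(data, window_size=10, decay_rate=0.9):
    # Weight for position i in the window: decay_rate ** (window_size - i),
    # so the most recent observation gets the largest weight
    weights = np.array([decay_rate ** (window_size - i) for i in range(1, window_size + 1)])
    weights = weights / weights.sum()          # normalize so the weights sum to one
    series = pd.Series(data)
    # Apply the weighted average over each rolling window of the data
    return series.rolling(window_size).apply(lambda w: np.dot(w, weights), raw=True)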
The function first creates a series of weights using the decay rate and the window size. The weights are calculated using the formula decay_rate^(window_size - i), where i is the index of the weight in the series. This gives more weight to recent observations and less weight to older observations.
Next, the function normalizes the weights so that they sum to one. This ensures that the weighted average is a proper average.
Finally, the function applies the rolling function to the data using the window size and a custom lambda function that calculates the weighted average of the window using the weights.
Real-Time Analytics Platform (RTAP) Applications
Overall, real-time analytics platforms (RTAPs) can be applied in various industries and domains where real-time monitoring and analysis of data is critical to achieving business objectives. By providing insights into streaming data as it happens, RTAPs can help businesses make faster and more informed decisions.
Case Study: Real-Time Sentiment Analysis
Real-time sentiment analysis is a powerful tool for businesses that want to monitor and respond to customer feedback in real time. Here are some case studies of companies that have successfully implemented real-time sentiment analysis:
1. Airbnb: The popular home-sharing platform uses real-time sentiment analysis to monitor customer feedback and respond to complaints. Airbnb's customer service team uses the platform to monitor social media and review sites for mentions of the brand, and to track sentiment over time. By analyzing this data in real time, Airbnb can quickly respond to complaints and improve the customer experience.
2. Coca-Cola: Coca-Cola uses real-time sentiment analysis to monitor social media for mentions of the brand and to track sentiment over time. The company's marketing team uses this data to identify trends and to create more targeted marketing campaigns. By analyzing real-time sentiment data, Coca-Cola can quickly respond to changes in consumer sentiment and adjust its marketing strategy accordingly.
3. Ford: Ford uses real-time sentiment analysis to monitor customer feedback on social media and review sites. The company's customer service team uses this data to identify issues and to respond to complaints in real time. By analyzing real-time sentiment data, Ford can quickly identify and address customer concerns, improving the overall customer experience.
Overall, real-time sentiment analysis is a powerful tool for businesses that want to monitor and respond to customer feedback in real time. By analyzing real-time sentiment data, businesses can quickly identify issues and respond to changes in customer sentiment, improving the overall customer experience.
Case Study: Stock Market Predictions
Predicting stock market performance is a challenging task, but there have been several reported case studies of companies using machine learning and artificial intelligence to make predictions. Here are some examples of stock market prediction case studies:
1. Kavout: Kavout is a Seattle-based fintech company that uses artificial intelligence and machine learning to predict stock performance. The company's system uses a combination of fundamental and technical analysis to generate buy and sell recommendations for individual stocks. Kavout's AI algorithms have outperformed traditional investment strategies and consistently outperformed the S&P 500 index.
2. Sentient Technologies: Sentient Technologies is a San Francisco-based AI startup that uses deep learning to predict stock market performance. The company's system uses a combination of natural language processing, image recognition, and genetic algorithms to analyze market data and generate investment strategies. Sentient's AI algorithms have consistently outperformed the S&P 500 index and other traditional investment strategies.
3. Quantiacs: Quantiacs is a California-based investment firm that uses machine learning to develop trading algorithms. The company's system uses machine learning algorithms to analyze market data and generate trading strategies. Quantiacs' trading algorithms have consistently outperformed traditional investment strategies and have delivered returns that are significantly higher than the S&P 500 index.
4. Kensho Technologies: Kensho Technologies is a Massachusetts-based fintech company that uses artificial intelligence to predict stock market performance. The company's system uses natural language processing and machine learning algorithms to analyze news articles, social media feeds, and other data sources to identify patterns and generate investment recommendations. Kensho's AI algorithms have consistently outperformed the S&P 500 index and other traditional investment strategies.
5. AlphaSense: AlphaSense is a New York-based fintech company that uses natural language processing and machine learning to analyze financial data. The company's system uses machine learning algorithms to identify patterns in financial data and generate investment recommendations. AlphaSense's AI algorithms have consistently outperformed traditional investment strategies and have delivered returns that are significantly higher than the S&P 500 index.
Overall, these case studies demonstrate the potential of machine learning and artificial intelligence to make accurate predictions in the stock market. By analyzing large volumes of data and identifying patterns, these systems can generate investment strategies that outperform traditional methods. However, it is important to note that the stock market is inherently unpredictable, and past performance is not necessarily indicative of future results.
Unit III - Hadoop
History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug Cutting, explains how the name came about:
"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term."
With growing data velocity the data size easily outgrows the storage limit of a
machine. A solution would be to store the data across a network of machines. Such
filesystems are called distributed filesystems. Since data is stored across a network all the
complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is a unique design that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware. Let's elaborate on the terms:
● Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
● Streaming Data Access Pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
● Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
1. NameNode (Master Node):
○ Manages all the slave nodes and assigns work to them.
○ It executes filesystem namespace operations like opening, closing, and renaming files and directories.
○ It should be deployed on reliable hardware with a high configuration, not on commodity hardware.
2. DataNode (Slave Node):
○ Actual worker nodes, which do the actual work like reading, writing, and processing.
○ They also perform creation, deletion, and replication upon instruction from the master.
○ They can be deployed on commodity hardware.
● Namenodes:
○ Run on the master node.
○ Store metadata (data about data) like file paths, the number of blocks, block IDs, etc.
○ Require a high amount of RAM.
○ Store metadata in RAM for fast retrieval, i.e., to reduce seek time, though a persistent copy of it is kept on disk.
● DataNodes:
○ Run on slave nodes.
○ Require a large amount of storage (disk) capacity, as the data is actually stored here.
Data storage in HDFS: Now let’s see how the data is stored in a distributed manner.
Let's assume that a 100 TB file is inserted. The master node (namenode) will first divide the file into blocks (10 TB each in this simplified example; the default block size is 128 MB in Hadoop 2.x and above). Then these blocks are stored across different datanodes (slave nodes). Datanodes (slave nodes) replicate the blocks among themselves, and the information about which blocks they contain is sent to the master. The default replication factor is 3, meaning that for each block 3 replicas are created (including itself). In hdfs-site.xml we can increase or decrease the replication factor, i.e., we can edit this configuration there.
Why divide a file into blocks? Answer: Let's assume that we don't divide; now it is very difficult to store a 100 TB file on a single machine. Even if we store it, each read and write operation on that whole file is going to incur a very high seek time. But if we have multiple blocks of size 128 MB, then it becomes easy to perform various read and write operations on them compared to doing it on the whole file at once. So we divide the file to have faster data access, i.e., to reduce seek time.
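A back-of-the-envelope sketch of what this division means in practice (the file size, block size, and replication factor simply mirror the example above):

# Rough block and storage arithmetic for a 100 TB file in HDFS
file_size_tb = 100
block_size_mb = 128          # HDFS default in Hadoop 2.x and above
replication_factor = 3       # HDFS default

file_size_mb = file_size_tb * 1024 * 1024
num_blocks = file_size_mb // block_size_mb
raw_storage_tb = file_size_tb * replication_factor

print(f"blocks needed: {num_blocks:,}")                      # 819,200 blocks
print(f"raw storage with replication: {raw_storage_tb} TB")  # 300 TB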
● HeartBeat: It is the signal that a datanode continuously sends to the namenode. If the namenode doesn't receive a heartbeat from a datanode, it will consider it dead.
● Balancing: If a datanode crashes, the blocks present on it will be gone too, and those blocks will be under-replicated compared to the remaining blocks. Here the master node (namenode) will give a signal to the datanodes containing replicas of those lost blocks to replicate them, so that the overall distribution of blocks is balanced.
● Replication: It is done by the datanodes.
Note: No two replicas of the same block are present on the same datanode.
Features:
Limitations: Though HDFS provides many features there are some areas where it doesn’t
work well.
● Low latency data access: Applications that require low-latency access to data, i.e., in the range of milliseconds, will not work well with HDFS, because HDFS is designed keeping in mind that we need high throughput of data even at the cost of latency.
● Small file problem: Having lots of small files will result in lots of seeks and lots of movement from one datanode to another datanode to retrieve each small file; this whole process is a very inefficient data access pattern.
Components of Hadoop
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of
Hadoop.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
3. Hadoop YARN - Hadoop YARN is a resource management unit of Hadoop.
Hadoop HDFS
Data is stored in a distributed manner in HDFS. There are two components of HDFS - the name node and the data node. While there is only one name node, there can be multiple data nodes. HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise version of a server costs roughly $10,000 per terabyte for the full processor. In case you need to buy 100 of these enterprise version servers, it will go up to a million dollars.
Hadoop enables you to use commodity machines as your data nodes. This way, you don't have to spend millions of dollars just on your data nodes. However, the name node is always an enterprise server.
Features of HDFS
Master and slave nodes form the HDFS cluster. The name node is called the master, and the data nodes are called the slaves.
The name node is responsible for the workings of the data nodes. It also stores the metadata.
The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of the data node.
Consider that 30 TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes. In the accompanying figure, the blue, grey, and red data are replicated among the three data nodes.
Replication of the data is performed three times by default. It is done this way, so if a
commodity machine fails, you can replace it with a new machine that has the same data.
Let us now focus on Hadoop MapReduce.
2. Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.
The code that processes the data is sent to the data, rather than the data to the code. This code is usually very small in comparison to the data itself; you only need to send a few kilobytes' worth of code to perform a heavy-duty process on the computers holding the data.
The input dataset is first split into chunks of data. In this example, the input has three lines of text with three separate entities - "bus car train," "ship ship train," "bus ship car." The dataset is then split into three chunks, based on these entities, and processed in parallel.
In the map phase, the data is assigned a key and a value of 1. In this case, we have one bus, one car, one ship, and one train.
These key-value pairs are then shuffled and sorted together based on their keys. At the reduce phase, the aggregation takes place, and the final output is obtained.
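As a sketch of these phases, here is a small pure-Python simulation of the map, shuffle-and-sort, and reduce steps for the three lines above (an illustration of the idea only, not Hadoop code):

from collections import defaultdict

lines = ["bus car train", "ship ship train", "bus ship car"]

# Map phase: emit (word, 1) for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group the values for each key together
grouped = defaultdict(list)
for key, value in sorted(mapped):
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)   # {'bus': 2, 'car': 2, 'ship': 3, 'train': 2}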
Hadoop YARN is the next concept we shall focus on.
Hadoop YARN
● Hadoop YARN acts like an OS to Hadoop. It is a resource management layer that sits on top of HDFS.
● It is responsible for managing cluster resources to make sure you don't overload one machine.
● It performs job scheduling to make sure that the jobs are scheduled in the right place.
Suppose a client machine wants to run a query or fetch some code for data analysis. This job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management.
In the node section, each of the nodes has its own node manager. These node managers manage the nodes and monitor the resource usage in the node. The containers contain a collection of physical resources, which could be RAM, CPU, or hard drives. Whenever a job request comes in, the application master requests the container from the node manager. Once the node manager gets the resource, it goes back to the Resource Manager.
Hadoop is an open-source framework that provides distributed storage and processing of large datasets. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a distributed file system that allows data to be stored across multiple machines, while MapReduce is a programming model that enables large-scale distributed data processing.
To analyze data with Hadoop, you first need to store your data in HDFS. This can be done by using the Hadoop command line interface or through a web-based graphical interface like Apache Ambari or Cloudera Manager.
Hadoop also provides a number of other tools for analyzing data, including Apache Hive, Apache Pig, and Apache Spark. These tools provide higher-level abstractions that simplify the process of data analysis.
Apache Hive provides a SQL-like interface for querying data stored in HDFS. It translates SQL queries into MapReduce jobs, making it easier for analysts who are familiar with SQL to work with Hadoop.
Apache Pig is a high-level scripting language that enables users to write data processing
pipelines that are translated into MapReduce jobs. Pig provides a simpler syntax than
MapReduce, making it easier to write and maintain data processing code.
Apache Spark is a distributed computing framework that provides a fast and flexible way to process large amounts of data. It provides APIs for working with data in various ways, including SQL, machine learning, and graph processing.
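For example, a word count over a file in HDFS can be written in a few lines with Spark's Python API (PySpark); the HDFS paths below are placeholders:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read lines from a placeholder HDFS path and count the words
lines = spark.read.text("hdfs:///user/student/input.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///user/student/wordcount_output")
spark.stop()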
In summary, Hadoop provides a powerful framework for analyzing large amounts of data. By storing data in HDFS and using MapReduce or other tools like Apache Hive, Apache Pig, or Apache Spark, you can perform distributed data processing and gain insights from your data that would be difficult or impossible to obtain using traditional data analysis tools.
Once your data is stored in HDFS, you can use MapReduce to perform distributed data processing. MapReduce breaks down the data processing into two phases: the map phase and the reduce phase.
In the map phase, the input data is divided into smaller chunks and processed independently by multiple mapper nodes in parallel. The output of the map phase is a set of key-value pairs.
In the reduce phase, the key-value pairs produced by the map phase are aggregated and processed by multiple reducer nodes in parallel. The output of the reduce phase is typically a summary of the input data, such as a count or an average.
Scaling Out
You've seen how MapReduce works for small inputs; now it's time to take a bird's-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem. However, to scale out, we need to store the data in a distributed filesystem, typically HDFS (which you'll learn about in the next chapter), to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data. Let's see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced when the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.
On the other hand, if splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created.
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization because it doesn't use valuable cluster bandwidth. Sometimes, however, all three nodes hosting the HDFS block replicas for a map task's input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer. The three possibilities are illustrated in the figure.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So storing it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
Reduce tasks don't have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. In the present example, we have a single reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. As explained, for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.
The number of reduce tasks is not governed by the size of the input, but instead is specified independently. In "The Default MapReduce Job" on page 227, you will see how to choose the number of reduce tasks for a given job.
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
The data flow for the general case of multiple reduce tasks is illustrated in the image below. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as "the shuffle," as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time.
MapReduce data flow with multiple reduce tasks
Finally, it's also possible to have zero reduce tasks. This can be appropriate when you don't need the shuffle because the processing can be carried out entirely in parallel. In this case, the only off-node data transfer is when the map tasks write to HDFS (see figure).
Hadoop Streaming
It is a utility or feature that comes with a Hadoop distribution that allows developers or programmers to write the Map-Reduce program using different programming languages like Ruby, Perl, Python, C++, etc. We can use any language that can read from standard input (STDIN), like keyboard input, and write using standard output (STDOUT). We all know the Hadoop framework is completely written in Java, but programs for Hadoop do not necessarily need to be coded in the Java programming language. The Hadoop Streaming feature has been available since Hadoop version 0.14.1.
In the above example image, we can see that the flow shown in a dotted block is a basic MapReduce job. In that, we have an Input Reader which is responsible for reading the input data and produces the list of key-value pairs. We can read data in .csv format, in delimiter format, from a database table, image data (.jpg, .png), audio data, etc. The only requirement to read all these types of data is that we have to create a particular input format for that data with these input readers. The input reader contains the complete logic about the data it is reading. Suppose we want to read an image; then we have to specify the logic in the input reader so that it can read that image data and finally generate key-value pairs for that image data.
If we are reading image data then we can generate a key-value pair for each pixel, where the key will be the location of the pixel and the value will be its color value (0-255 for a colored image). Now this list of key-value pairs is fed to the Map phase, and the Mapper will work on each of these key-value pairs of each pixel and generate some intermediate key-value pairs, which are then fed to the Reducer after shuffling and sorting; the final output produced by the reducer is then written to HDFS. This is how a simple Map-Reduce job works.
Now let's see how we can use different languages like Python, C++, and Ruby with Hadoop for execution. We can run an arbitrary language by running it as a separate process. For that, we will create our external mapper and run it as an external separate process. These external map processes are not part of the basic MapReduce flow. This external mapper will take input from STDIN and produce output to STDOUT. As the key-value pairs are passed to the internal mapper, the internal mapper process will send these key-value pairs to the external mapper, where we have written our code in some other language, like Python, with the help of STDIN. Now these external mappers process the key-value pairs and generate intermediate key-value pairs with the help of STDOUT and send them back to the internal mappers.
Similarly, the Reducer does the same thing. Once the intermediate key-value pairs are processed through the shuffle and sorting process, they are fed to the internal reducer, which sends these pairs to the external reducer process (working separately with the help of STDIN), gathers the output generated by the external reducers with the help of STDOUT, and finally the output is stored in HDFS.
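For instance, a word-count mapper and reducer written as external Python processes could look like the sketch below; in practice these would be two separate files (named mapper.py and reducer.py here only for illustration) passed to the hadoop-streaming jar through its -mapper and -reducer options:

# mapper.py - reads lines from STDIN and emits "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py - reads the sorted "word<TAB>count" pairs from STDIN and sums them per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")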
In this section, we dig into Hadoop's FileSystem class: the API for interacting with one of Hadoop's filesystems. Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems. This is very useful when testing your program, for example, because you can rapidly run tests using data stored on the local filesystem.
InputStream in = null;
try {
    // open the stream for the given URL (e.g. an hdfs:// URL) and copy it to standard output
    in = new URL(args[0]).openStream();
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
There's a little bit more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop.
As the previous section explained, sometimes it is impossible to set a URLStreamHandlerFactory for your application. In this case, you will need to use the FileSystem API to open an input stream for a file.
A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File object, since its semantics are too closely tied to the local filesystem). You can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.
FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use (HDFS in this case). There are several static factory methods for getting a FileSystem instance: one takes just a Configuration, one takes a URI and a Configuration, and one additionally takes a user name.
A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as conf/core-site.xml. The first method returns the default filesystem (as specified in the file conf/core-site.xml, or the default local filesystem if not specified there). The second uses the given URI's scheme and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI. The third retrieves the filesystem as the given user.
Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly:
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
FSDataInputStream
The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:

package org.apache.hadoop.fs;
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like
FSDataInputStream, has a method for querying the current position in the file:
package org.apache.hadoop.fs;
● Run on the full dataset and, if it fails, debug it using Hadoop debugging tools.
● Do profiling to tune the performance of the program.
The first stage in the development of a MapReduce application is the Mapper class. Here, the RecordReader processes each input record and generates the respective key-value pair. Hadoop's Mapper stores this intermediate data on the local disk.
Driver code
The major component in a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names.
Debugging a MapReduce Application
For the process of debugging, log files are essential. Log files can be found on the local filesystem of each TaskTracker, and if JVM reuse is enabled, each log accumulates the entire JVM run. Anything written to standard output or error is directed to the relevant log file.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
● The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
● The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.
The reduced task is always performed after the map job.
Input Phase − Here we have a RecordReader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
Combiner − A combiner is a type of local Reducer that groupssimilardatafromthemap
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
appliesauser-definedcodetoaggregatethevaluesinasmallscopeofonemapper.Itisnota
part of the main MapReduce algorithm; it is optional.
ShuffleandSort−TheReducertaskstartswiththeShuffleandSortstep.Itdownloadsthe
grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list.Thedatalistgroupsthe
equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer−TheReducertakesthegroupedkey-valuepaireddataasinputandrunsaReducer
function oneachoneofthem.Here,thedatacanbeaggregated,filtered,andcombinedina
number of ways, and it requires a wide range of processing. Once theexecutionisover,it
gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record writer.
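A sketch of the corresponding Reducer for the hypothetical word-count job used in the earlier Mapper and driver sketches; it aggregates the grouped values for each key, as the Reducer phase above describes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {   // iterate over the grouped values for this key
      sum += value.get();
    }
    result.set(sum);
    context.write(key, result);          // emit the final key-value pair
  }
}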
Advantage of MapReduce
Limitations Of MapReduce
Job Submission :
● The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it.
● Having submitted the job, waitForCompletion polls the job’s progress once per
second and reports the progress to the console if it has changed since the last report.
● When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
● Asks the resource manager for a new application ID, used for the MapReduce job ID.
● Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
● Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
● Copies the resources needed to run the job, including the job JAR file, the
configuration file, and the computed input splits, to the shared filesystem in a
directory named after the job ID.
● Submits the job by calling submitApplication() on the resource manager.
Job Initialization :
● When the resource manager receives a call to its submitApplication() method, it hands off the request to the YARN scheduler.
● The scheduler allocates a container, and the resource manager then launches the application master's process there, under the node manager's management.
● The application master for MapReduce jobs is a Java application whose main class is MRAppMaster.
● It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, as it will receive progress and completion reports from the tasks.
● It retrieves the input splits computed in the client from the shared filesystem.
● It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job), as shown in the snippet below.
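For example, the number of reduce tasks might be set in the driver as follows (the value 4 is illustrative):

job.setNumReduceTasks(4);   // equivalent to setting the mapreduce.job.reduces property to 4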
Task Assignment:
● If the job does not qualify for running as an uber task, then the application master
requests containers for all the map and reduce tasks in the job from the resource
manager .
● Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start.
● Requests for reduce tasks are not made until 5% of map tasks have completed.
Job Scheduling
Later on, the ability to set a job's priority was added, via the mapred.job.priority property or the setJobPriority() method on JobClient (both of which take one of the values VERY_HIGH, HIGH, NORMAL, LOW, or VERY_LOW). When the job scheduler is choosing the next job to run, it selects the one with the highest priority. However, with the FIFO scheduler, priorities do not support preemption, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled.
MapReduce in Hadoop comes with a choice of schedulers. The default in MapReduce is the original FIFO queue-based scheduler, and there are also multiuser schedulers called the Fair Scheduler and the Capacity Scheduler.
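The scheduler in use is selected in yarn-site.xml. A minimal sketch for switching to the Fair Scheduler follows (for the Capacity Scheduler, the value would be org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>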
Capacity Scheduler
Advantage:
● Best for working with Multiple clients or priority jobs in a Hadoop cluster
● Maximizes throughput in the Hadoop cluster
Disadvantage:
The Fair Scheduler is very similar to the Capacity Scheduler. The priority of the job is kept in consideration. With the help of the Fair Scheduler, YARN applications can share the resources in a large Hadoop cluster, and these resources are maintained dynamically, so there is no need for prior capacity. The resources are distributed in such a manner that all applications within a cluster get an equal amount of time. The Fair Scheduler makes scheduling decisions on the basis of memory, but we can configure it to work with CPU as well.
As noted, it is similar to the Capacity Scheduler, but the major thing to notice is that in the Fair Scheduler, whenever a high-priority job arises in the same queue, the task is processed in parallel by replacing some portion of the already dedicated slots.
Advantages:
● Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager.
● The task is executed by a Java application whose main class is YarnChild. Before it can run the task, it localizes the resources that the task needs, including the job configuration and JAR file, and any files from the distributed cache.
● Finally, it runs the map or reduce task.
Streaming:
● Streaming runs special map and reduce tasks for the purpose of launching the user
supplied executable and communicating with it.
● The Streaming task communicates with the process (which may be written in any
language) using standard input and output streams.
● During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
● From the node manager's point of view, it is as if the child ran the map or reduce code itself (a typical Streaming invocation is sketched below).
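A typical Streaming invocation looks like the following (the JAR location and the cat/wc executables are illustrative; any executable that reads standard input and writes standard output can be used):

% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input/sample.txt \
    -output output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc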
● MapReduce jobs are long-running batch jobs, taking anything from tens of seconds to hours to run.
● A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code).
● When a task is running, it keeps track of its progress (i.e., the proportion of the task completed).
● For map tasks, this is the proportion of the input that has been processed.
● For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed.
It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle.
● As the map or reduce task runs, the child process communicates with its parent
application master through the umbilical interface.
● The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job, every three seconds over the
umbilical interface.
How status updates are propagated through the MapReduce System
Job Completion:
● When the application master receives a notification that the last task for a job is
complete, it changes the status for the job to Successful.
● Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from waitForCompletion().
● Finally, on job completion, the application master and the task containers clean up their working state, and the OutputCommitter's commitJob() method is called.
● Job information is archived by the job history server to enable later interrogation by users if desired.
Task execution
Once the resource manager's scheduler assigns resources to the task for a container on a particular node, the container is started up by the application master by contacting the node manager. The task, whose main class is YarnChild, is executed by a Java application.
It localizes the resources that the task needs before it can run the task. This includes the job configuration, any files from the distributed cache, and the JAR file. It finally runs the map or the reduce task. Any kind of bug in the user-defined map and reduce functions (or even in YarnChild) doesn't affect the node manager, as YarnChild runs in a dedicated JVM, so the node manager can't be affected by a crash or hang.
Each task can perform setup and commit actions, which are run in the same JVM as the task itself. These are determined by the OutputCommitter for the job. For file-based jobs, the commit action moves the task output from its initial position to its final location. When speculative execution is enabled, the commit protocol ensures that only one of the duplicate tasks is committed and the other one is aborted.
What does Streaming mean?
From the node manager's point of view, it is as if the child process ran the map or reduce code itself. MapReduce jobs can take anything from tens of seconds to hours to run, which is why they are long-running batch jobs. Because this can be a significant length of time, it is important for the user to get feedback on how the job is progressing. Each job, including its tasks, has a status, including the state of the job or task, the values of the job's counters, the progress of maps and reduces, and a description or status message. These statuses change over the course of the job.
When a task is running, it keeps track of its progress (i.e., the proportion of the task completed). For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed.
Process involved
In Hadoop, there are various MapReduce types for InputFormat that are used for various
purposes. Let us now look at the MapReduce types of InputFormat:
FileInputFormat
It serves as the foundation for all file-based InputFormats. FileInputFormat also provides the input directory, which contains the location of the data files. When we start a MapReduce job, FileInputFormat returns a path with files to read. This InputFormat will read all files. Then it divides these files into one or more InputSplits.
TextInputFormat
It is the standard InputFormat. Each line of each input file is treated as a separate record by this InputFormat. It does not parse anything. TextInputFormat is suitable for raw data or line-based records, such as log files. Hence:
● Key: It is the byte offset of the beginning of the line within the file.
● Value: It is the contents of the line. It does not include line terminators.
KeyValueTextInputFormat
It is comparable to TextInputFormat. Each line of input is also treated as a separate record by this InputFormat. While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat divides the line into key and value at a tab character ('\t'). Hence:
● Key: It is everything up to the tab character.
● Value: It is the remaining part of the line after the tab character.
SequenceFileInputFormat
It is an input format for reading sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They are block-compressed and support direct serialization and deserialization of a variety of data types. Hence, key and value are both user-defined.
SequenceFileAsTextInputFormat
It is a subtype of SequenceFileInputFormat. The sequence file keys and values are converted to Text objects using this format; it converts them by calling toString() on them. As a result, SequenceFileAsTextInputFormat converts sequence files into text-based input for streaming.
NLineInputFormat
It is a variant of TextInputFormat in which the keys are the line's byte offset and the values are the line's contents. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number is determined by the size of the split and the length of the lines. If we want our mapper to receive a fixed number of lines of input, we use NLineInputFormat.
Assuming N=2, each split has two lines. As a result, the first two key-value pairs are distributed to one mapper, and the next two key-value pairs are given to another mapper, as configured in the snippet below.
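To use NLineInputFormat in a job, the input format and the number of lines per split are set in the driver (N=2 here, matching the example above):

import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 2);   // each mapper receives two lines of input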
DBInputFormat
Using JDBC, this InputFormat reads data from a relational database. It also loads small datasets, which might be used to join with large datasets from HDFS using multiple inputs. Hence, the key is a LongWritable and the value is a DBWritable.
The output format classes work in the opposite direction to their corresponding input format classes. TextOutputFormat, for example, is the default output format that writes records as plain-text files, although keys and values can be of any type, as they are converted to strings by calling the toString() method. The tab character separates the key and the value, but this can be changed by modifying the separator property of the text output format.
DBOutputFormat handles the output formats for relational databases and HBase. It sends the reduce output to a SQL table.
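A sketch of configuring the output format and the key-value separator in the driver (the comma separator is illustrative):

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

job.setOutputFormatClass(TextOutputFormat.class);   // the default output format
job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");  // replace the tab separator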
Features of MapReduce
Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a large number of nodes in a cluster. This allows it to handle massive datasets, making it suitable for Big Data applications.
Fault Tolerance
MapReduce automatically re-executes failed tasks on other nodes, so a job can complete successfully even when individual nodes or tasks fail.
Data Locality
MapReduce takes advantage of data locality by processing data on the same node where it is stored, minimizing data movement across the network and improving overall performance.
Simplicity
The MapReduce programming model abstracts away many complexities associated with distributed computing, allowing developers to focus on their data-processing logic rather than low-level details.
Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make storing and processing extensive data sets very economical.
Parallel Programming
Tasks are divided under the programming model to allow the simultaneous execution of independent operations. Because these distributed tasks can be performed by multiple processors in parallel, programs run faster and each job is easier to handle.
UNIT IV
HADOOP ENVIRONMENT
A Hadoop cluster is a combined group of unconventional units. These units are connected to a dedicated server which is used for working as a sole data-organizing source. It works as a centralized unit throughout the working process. In simple terms, it is a common type of cluster which is present for the computational task. This cluster is helpful in distributing the workload for analyzing data. The workload over a Hadoop cluster is distributed among several other nodes, which are working together to process data. It can be explained by considering the following terms:
1. Distributed Data Processing: In distributed data processing, the map gets reduced and scrutinized from a large amount of data. A job tracker is assigned for all the functionalities. Apart from the job tracker, there is a data node and task tracker. All these play a huge role in processing the data.
2. Distributed Data Storage: It allows storing a huge amount of data in terms of the name node and secondary name node. Both of these nodes have a data node and task tracker.
How does Hadoop Cluster Makes Working so Easy?
It plays an important role in collecting and analyzing the data in a proper way. It is useful in performing a number of tasks, which brings ease to any task.
● Add nodes: It is easy to add nodes in the cluster to help in other functional areas. Without the nodes, it is not possible to scrutinize the data from unstructured units.
● Data Analysis: This special type of cluster is compatible with parallel computation to analyze the data.
● Fault tolerance: The data stored in any node may become unreliable, so the cluster creates a copy of the data that is present on other nodes.
While working with a Hadoop cluster, it is important to understand its architecture, as follows:
Advantages:
This section describes how to install and configure a basic Hadoop cluster from scratch using the Apache Hadoop distribution on a Unix operating system. It provides background information on the things you need to think about when setting up Hadoop. For a production installation, most users and operators should consider one of the Hadoop cluster management tools.
Installing Java
Hadoop runs on both Unix and Windows operating systems, and requires Java to be installed. For a production installation, you should select a combination of operating system, Java, and Hadoop that has been certified by the vendor of the Hadoop distribution you are using. There is also a page on the Hadoop wiki that lists combinations that community members have run with success.
It's good practice to create dedicated Unix user accounts to separate the Hadoop processes from each other, and from other services running on the same machine. The HDFS, MapReduce, and YARN services are usually run as separate users, named hdfs, mapred, and yarn, respectively. They all belong to the same hadoop group.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the distribution in a sensible location, such as /usr/local (/opt is another standard choice; note that Hadoop should not be installed in a user's home directory, as that may be an NFS-mounted directory):
% cd /usr/local
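The unpack command itself typically looks like this (the hadoop-x.y.z version placeholder is illustrative):

% sudo tar xzf hadoop-x.y.z.tar.gz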
You also need to change the owner of the Hadoop files to be the hadoop user and group:
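For example (again with an illustrative version placeholder):

% sudo chown -R hadoop:hadoop hadoop-x.y.z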
It’s convenient to put the Hadoop binaries on the shell path too:
% export HADOOP_HOME=/usr/local/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Configuring SSH
The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the cluster. Note that the control scripts are optional—cluster-wide operations can be performed by other mechanisms, too, such as a distributed shell or dedicated Hadoop management applications. To work seamlessly, SSH needs to be set up to allow passwordless login for the hdfs and yarn users from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair and place it in an NFS location that is shared across the cluster.
Even though we want passwordless logins, keys without passphrases are not considered good practice (it's OK to have an empty passphrase when running a local pseudo-distributed cluster, as described in Appendix A), so we specify a passphrase when prompted for one. We use ssh-agent to avoid the need to enter a password for each connection.
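A typical key-generation command, assuming RSA keys and the file location described below, is:

% ssh-keygen -t rsa -f ~/.ssh/id_rsa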
The private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key is stored in a file with the same name but with .pub appended, ~/.ssh/id_rsa.pub.
Next, we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the machines in the cluster that we want to connect to. If the users' home directories are stored on an NFS filesystem, the keys can be shared across the cluster by typing the following (first as hdfs and then as yarn):
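A common way to do this is:

% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys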
If the home directory is not shared using NFS, the public keys will need to be shared by some other means (such as ssh-copy-id). Test that you can SSH from the master to a worker machine by making sure ssh-agent is running, and then run ssh-add to store your passphrase. You should be able to SSH to a worker without entering the passphrase again.
To set up and install Hadoop in pseudo-distributed mode, follow the steps given below. Let's discuss them one by one.
Step 1: Download Binary Package :
http://hadoop.apache.org/releases.html
For reference, you can check the file save to the folder as follows.
C:\BigData
Open Git Bash, change directory (cd) to the folder where you saved the binary package, and then unzip it as follows.
$ cd C:\BigData
MINGW64: C:\BigData
Next, go to this GitHub repo and download the bin folder as a zip, as shown below. Extract the zip and copy all the files present under the bin folder to C:\BigData\hadoop-3.1.2\bin. Replace the existing files as well.
● Go to C:/BigData/hadoop-3.1.2 and create a folder 'data'. Inside the 'data' folder create two folders, 'datanode' and 'namenode'. Your files on HDFS will reside under the datanode folder.
● Set Hadoop Environment Variables
● Hadoop requires the following environment variables to be set.
HADOOP_HOME="C:\BigData\hadoop-3.1.2"
HADOOP_BIN="C:\BigData\hadoop-3.1.2\bin"
If you don't have Java 1.8 installed, then you'll have to download and install it first. If the JAVA_HOME environment variable is already set, then check whether the path has any spaces in it (e.g., C:\Program Files\Java\...). Spaces in the JAVA_HOME path will lead to issues. There is a trick to get around this: replace 'Program Files' with 'Progra~1' in the variable value. Ensure that the version of Java is 1.8 and that JAVA_HOME points to JDK 1.8.
Now that we have set the environment variables, we need to validate them. Open a new Windows Command Prompt and run an echo command on each variable to confirm they are assigned the desired values.
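For example:

> echo %HADOOP_HOME%
> echo %HADOOP_BIN%
> echo %JAVA_HOME%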
If the variables are not initialized yet, then it is likely because you are testing them in an old session. Make sure you have opened a new command prompt to test them.
Once the environment variables are set up, we need to configure Hadoop by editing the following configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
After editing core-site.xml, you need to set the replication factor and the location of the namenode and datanodes. Open C:\BigData\hadoop-3.1.2\etc\hadoop\hdfs-site.xml and add the below content within the <configuration> </configuration> tags.
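A minimal sketch of that hdfs-site.xml content, assuming a single-node setup and the data folders created earlier:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>C:\BigData\hadoop-3.1.2\data\namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>C:\BigData\hadoop-3.1.2\data\datanode</value>
</property>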
Step 7: Edit core-site.xml
Finally, let's configure properties for the MapReduce framework. Open C:\BigData\hadoop-3.1.2\etc\hadoop\mapred-site.xml and add the below content inside the <configuration> </configuration> tags. If you don't see mapred-site.xml, then open the mapred-site.xml.template file and rename it to mapred-site.xml.
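A minimal sketch of that mapred-site.xml content:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>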
Check if the C:\BigData\hadoop-3.1.2\etc\hadoop\slaves file is present; if it's not, then create one, add localhost in it, and save it.
To format the NameNode, open a new Windows Command Prompt and run the command below. It might give you a few warnings; ignore them.
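The format command is typically:

> hdfs namenode -format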
Open another Windows Command Prompt; make sure to run it as Administrator to avoid permission errors. When opened, execute the start-all.cmd command. Since we have added %HADOOP_HOME%\sbin to the PATH variable, you can run this command from any folder. If you haven't done so, then go to the %HADOOP_HOME%\sbin folder and run the command.
Four new windows with cmd terminals will open, one for each of the following daemon processes:
● namenode
● datanode
● node manager
● resource manager
Don’t close these windows, minimize them. Closing the windows will terminate the
daemons. You can run them in the background if you don’t like to see these windows.
Finally, let's monitor how the Hadoop daemons are doing. You can also use the web UI for a wide range of administrative and monitoring purposes. Open your browser and get started.
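Assuming a local single-node Hadoop 3.x setup with default ports, the web UIs are usually reachable at:

NameNode web UI: http://localhost:9870
YARN ResourceManager web UI: http://localhost:8088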
HDFS (Hadoop Distributed File System) is utilized for storage in a Hadoop cluster. It is mainly designed for working on commodity hardware devices (devices that are inexpensive), working on a distributed file system design. HDFS is designed in such a way that it believes more in storing the data in large chunks of blocks rather than storing small data blocks. HDFS in Hadoop provides fault tolerance and high availability to the storage layer and the other devices present in that Hadoop cluster.
As we all know, Hadoop works on the MapReduce algorithm, which is a master-slave architecture; HDFS has a NameNode and DataNodes that work in a similar pattern.
1.NameNode(Master)
2.DataNode(Slave)
1. NameNode: The NameNode works as a master in a Hadoop cluster that guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e., nothing but the data about the data. Metadata can be the transaction logs that keep track of the user's activity in a Hadoop cluster.
Metadata can also be the name of the file, its size, and the information about the location (block number, block IDs) of the DataNode that the NameNode stores to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
As our NameNode is working as a master, it should have high RAM or processing power in order to maintain or guide all the slaves in a Hadoop cluster. The NameNode receives heartbeat signals and block reports from all the slaves, i.e., the DataNodes.
2. DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can be from 1 to 500 or even more than that. The more DataNodes your Hadoop cluster has, the more data can be stored, so it is advised that the DataNodes should have a high storage capacity to store a large number of file blocks. A DataNode performs operations like creation, deletion, etc. according to the instructions provided by the NameNode.
Objectives and Assumptions Of HDFS
1. System Failure: As a Hadoop cluster consists of lots of nodes that are commodity hardware, node failure is possible, so a fundamental goal of HDFS is to handle this failure problem and recover from it.
2. Maintaining Large Datasets: As HDFS handles files of size ranging from GB to PB, HDFS has to be able to deal with these very large datasets on a single cluster.
3. Moving Data is Costlier than Moving the Computation: If the computational operation is performed near the location where the data is present, then it is quite faster and the overall throughput of the system can be increased, along with minimizing the network congestion, which is a good assumption.
4. Portable Across Various Platforms: HDFS possesses portability, which allows it to switch across diverse hardware and software platforms.
5. Simple Coherency Model: A Hadoop Distributed File System needs a write-once-read-many access model for files. A file once written and then closed should not be changed; only data can be appended. This assumption helps us to minimize the data coherency issue. MapReduce fits perfectly with such a file model.
6. Scalability: HDFS is designed to be scalable as data storage requirements increase over time. It can easily scale up or down by adding or removing nodes to the cluster. This helps to ensure that the system can handle large amounts of data without compromising performance.
7. Security: HDFS provides several security mechanisms to protect data stored on the cluster. It supports authentication and authorization mechanisms to control access to data, encryption of data in transit and at rest, and data integrity checks to detect any tampering or corruption.
8. Data Locality: HDFS aims to move the computation to where the data resides rather than moving the data to the computation. This approach minimizes network traffic and enhances performance by processing data on local nodes.
9. Cost-Effective: HDFS can run on low-cost commodity hardware, which makes it a cost-effective solution for large-scale data processing. Additionally, the ability to scale up or down as required means that organizations can start small and expand over time, reducing upfront costs.
10. Support for Various File Formats: HDFS is designed to support a wide range of file formats, including structured, semi-structured, and unstructured data. This makes it easier to store and process different types of data using a single system, simplifying data management and reducing costs.
Hdfs administration:
HDFS administration and MapReduce administration both come under Hadoop administration.
● HDFS administration: It includes monitoring the HDFS file structure, locations, and updated files (see the example commands below).
● MapReduce administration: It includes monitoring the list of applications, the configuration of nodes, and application status.
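For example, two commonly used HDFS administration commands are (the path is illustrative):

% hdfs dfsadmin -report        # summary of datanodes, capacity, and remaining space
% hdfs fsck /user/hadoop       # check the health of the files under a given path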
Hadoop Benchmarks
Hadoop comes with several benchmarks that you can run very easily with minimal setup cost. Benchmarks are packaged in the tests JAR file, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments:
Most of the benchmarks show usage instructions when invoked with no arguments. For
example:
% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
TestDFSIO
TestDFSIO.1.7
Missing arguments.
Hadoop comes with a MapReduce program called TeraSort that does a total sort of its input. It is very useful for benchmarking HDFS and MapReduce together, as the full input dataset is transferred through the shuffle. The three steps are: generate some random data, perform the sort, then validate the results.
First, we generate some random data using teragen (found in the examples JAR file, not the tests one). It runs a map-only job that generates a specified number of rows of binary data. Each row is 100 bytes long, so to generate one terabyte of data using 1,000 maps, run the following (10t is short for 10 trillion):
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
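For reference, complete teragen and terasort invocations typically take the following form (the map count, the 10t row count, and the directory names are illustrative):

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen -Dmapreduce.job.maps=1000 10t random-data

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort random-data sorted-data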
The overall execution time of the sort is the metric we are interested in, but it's instructive to watch the job's progress via the web UI (http://resource-manager-host:8088/), where you can get a feel for how long each phase of the job takes.
As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
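A typical teravalidate invocation (directory names again illustrative) is:

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teravalidate sorted-data report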
This command runs a short MapReduce job that performs a series of checks on the sorted data to check whether the sort is accurate. Any errors can be found in the report/ output directory.
Other benchmarks
There are many more Hadoop benchmarks, but the following are widely used:
• TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel.
• MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to TeraSort, as it checks whether small job runs are responsive.
• SWIM, or the Statistical Workload Injector for MapReduce, is a repository of real-life MapReduce workloads that you can use to generate representative test workloads for your system.
• TPCx-HS is a standardized benchmark based on TeraSort from the Transaction Processing Performance Council.
Hadoop on AWS
Amazon Elastic Map/Reduce (EMR) is a managed service that allows you to process and
analyze large datasets using the latest versions of big data processing frameworks such as
Apache Hadoop, Spark, HBase, and Presto, on fully customizable clusters.
● Ability to launch Amazon EMR clusters in minutes, with no need to manage node
configuration, cluster setup, Hadoop configuration or cluster tuning.
● Simple and predictable pricing— flat hourly rate for every instance-hour, with the
ability to leverage low-cost spot Instances.
● Ability to provision one, hundreds, or thousands of compute instances to process data at any scale.
● Amazon provides the EMR File System (EMRFS) to run clusters on demand based on persistent HDFS data in Amazon S3. When the job is done, users can terminate the cluster and store the data in Amazon S3, paying only for the actual time the cluster was running.
Hadoop on Azure
Azure HDInsight is a managed, open-source analytics service in the cloud. HDInsight allows users to leverage open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, and more, running them in the Azure cloud environment.
Azure HDInsight is a cloud distribution of Hadoop components. It makes it easy and cost-effective to process massive amounts of data in a customizable environment. HDInsight supports a broad range of scenarios such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.
● Read and write data stored in Azure Blob Storage and configure several Blob Storage accounts.
● Implement the standard Hadoop FileSystem interface for a hierarchical view.
● Choose between block blobs to support common use cases like MapReduce and page blobs for continuous write use cases like an HBase write-ahead log.
● Use wasb scheme-based URLs to reference file system paths, with or without SSL
encrypted access.
● Set up HDInsight as a data source in a MapReduce job or a sink.
Google Dataproc is a fully managed cloud service for running Apache Hadoop and Spark clusters. It provides enterprise-grade security, governance, and support, and can be used for general-purpose data processing, analytics, and machine learning.
Dataproc uses Cloud Storage (GCS) data for processing and stores it in GCS, Bigtable, or BigQuery. You can use this data for analysis in your notebook and send logs to Cloud Monitoring and Logging.
****************
UNIT V – FRAMEWORKS
Applications on Big Data Using Pig and Hive – Data processing operators in
Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets of data, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Features of Pig
Apache Pig comes with the following features −
Listed below are the major differences between Apache Pig and MapReduce.
● Apache Pig is a high-level language, whereas MapReduce is low level and rigid.
● Performing a Join operation in Apache Pig is pretty simple, whereas it is quite difficult in MapReduce to perform a Join operation between datasets.
● Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig, whereas exposure to Java is a must to work with MapReduce.
● Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent, whereas MapReduce will require almost 20 times more lines to perform the same task.
● There is no need for compilation in Apache Pig; on execution, every Apache Pig operator is converted internally into a MapReduce job, whereas MapReduce jobs have a long compilation process.
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping. Apache Pig is used −
Parser
Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical
optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
The MapReduce jobs are then submitted to Hadoop in a sorted order. Finally, these MapReduce jobs are executed on Hadoop, producing the desired results.
After downloading the Apache Pig software, install it in your Linux environment by
following the steps given below.
Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory in the user named Hadoop.)
Step 2
Step 3
Move the contents of the pig-0.15.0-src.tar.gz file to the Pig directory created earlier, as shown below.
Configure Apache Pig
After installing Apache Pig, we have to configure it. To configure it, we need to edit two files − .bashrc and pig.properties.
.bashrc file
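Typical entries added to the .bashrc file look like the following (the installation paths are illustrative and should match your own layout):

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf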
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.
Verify the installation of Apache Pig by typing the version command. If the installation is
successful, you will get the version of Apache Pig as shown below.
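For example (assuming the pig binary is on the PATH):

$ pig -version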
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter, we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types, general and relational operators, and Pig Latin UDFs.
As discussed in the previous chapters, the data model of Pig is fully nested. A Relation is the outermost structure of the Pig Latin data model. And it is a bag where −
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')as
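With a schema added, the full statement might look like this (the field names and types are illustrative):

grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);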
int − Represents a signed 32-bit integer. Example: 8
long − Represents a signed 64-bit integer. Example: 5L
float − Represents a signed 32-bit floating point. Example: 5.5F
double − Represents a 64-bit floating point. Example: 10.5
Datetime − Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
Biginteger − Represents a Java BigInteger. Example: 60708090709
Bigdecimal − Represents a Java BigDecimal. Example: 185.98376256272893883
Complex Types
Bag − A collection of tuples. Example: {(raju,30),(Mohhammad,45)}
Null Values
A null can be an unknown value or a non-existent value. It is used as a placeholder for
optional values. These nulls can occur naturally or can be the result of an operation.
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.
+ (Addition) − Adds values on either side of the operator. Example: a + b will give 30.
− (Subtraction) − Subtracts the right-hand operand from the left-hand operand. Example: a − b will give −10.
* (Multiplication) − Multiplies values on either side of the operator. Example: a * b will give 200.
/ (Division) − Divides the left-hand operand by the right-hand operand. Example: b / a will give 2.
% (Modulus) − Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a will give 0.
? : (Bincond) − Evaluates the Boolean operators. It has three operands: variable x = (expression) ? value1 if true : value2 if false. Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20, and if a != 1 the value of b is 30.
>= (Greater than or equal to) − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true. Example: (a >= b) is not true.
The following table describes the type construction operators of Pig Latin.
() (Tuple constructor) − This operator is used to construct a tuple. Example: (Raju, 30)
{} (Bag constructor) − This operator is used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
Operator Description
LOAD To Load the data from the file system (local/HDFS) into a
relation.
Filtering
ORDER To arrange a relation in a sorted order based on one or more
fields (ascending or descending).
Diagnostic Operators
EXPLAIN To view the logical, physical, or MapReduce execution plans to compute a relation.
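A short sketch of these operators in use, continuing the illustrative Student_data relation loaded earlier:

grunt> filtered = FILTER Student_data BY city == 'Chennai';
grunt> ordered  = ORDER filtered BY id ASC;
grunt> DUMP ordered;
grunt> EXPLAIN ordered;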
Hive :
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Benefits :
○ Medical
○ Sports
○ Web
○ Oil and petroleum
○ E-commerce
Hive Architecture
The following architecture explains the flow of submission of query into Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:
● Thrift Server - It is a cross-language service provider platform that serves requests from all those programming languages that support Thrift.
● JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
● ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
Hive Services
● Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
● Hive Web User Interface - The Hive Web UI is just an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
● Hive MetaStore - It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata of columns and their type information, the serializers and deserializers which are used to read and write data, and the corresponding HDFS files where the data is stored.
● Hive Server - It is referred to as the Apache Thrift Server. It accepts requests from different clients and provides them to the Hive Driver.
● Hive Driver - It receives queries from different sources like the web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.
● Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
● Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
HiveQL
Hive's SQL dialect, called HiveQL, is a mixture of SQL-92, MySQL, and Oracle's SQL dialect. The level of SQL-92 support has improved over time, and will likely continue to get better. HiveQL also provides features from later SQL standards, such as window functions (also known as analytic functions) from SQL:2003. Some of Hive's non-standard extensions to SQL were inspired by MapReduce, such as multitable inserts and the TRANSFORM, MAP, and REDUCE clauses.
Data Types
Integer Types
Decimal Type
Date/Time Types
TIMESTAMP
● It supports traditional UNIX timestamp with optional nanosecond precision.
● As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
● As a floating-point numeric type, it is interpreted as a UNIX timestamp in seconds with decimal precision.
● As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)
DATES
The Date value is used to specify a particular year, month, and day, in the form YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type lies between 0000-01-01 and 9999-12-31.
String Types
STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
Varchar
The varchar is a variable-length type whose length lies between 1 and 65535; this specifies the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length string type.
Struct − It is similar to a C struct or an object where fields are accessed using the "dot" notation. Example: struct('James','Roy')
Array − It is a collection of similar types of values that are indexable using zero-based integers. Example: array('James','Roy')
In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain multiple tables within a database, where a unique name is assigned to each table. Hive also provides a default database with the name default.
● Hive also allows assigning properties to the database in the form of key-value pairs.
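For example (the database names and the property are illustrative):

hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> CREATE DATABASE IF NOT EXISTS financials WITH DBPROPERTIES ('creator' = 'admin');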
HiveQL - Operators
The HiveQL operators facilitate performing various arithmetic and relational operations. Here, we are going to execute such operations on the records of the below table:
Example of Operators in Hive
Let's create a table and load the data into it by using the following steps: -
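A minimal sketch of those steps, matching the employee table used in the examples below (the file path is illustrative):

hive> CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, salary FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt' INTO TABLE employee;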
In Hive, the arithmetic operator accepts any numeric type. The commonly used arithmetic
operators are: -
Operators Description
A / B This is used to divide A and B and returns the quotient of the operands.
● Let's see an example to increase the salary of each employee by 50.
1. hive> select id, name, salary + 50 from employee;
● Let's see an example to decrease the salary of each employee by 50.
1. hive> select id, name, salary - 50 from employee;
● Let's see an example to find out the 10% salary of each employee.
1. hive> select id, name, (salary * 10) /100 from employee;
Relational Operators in Hive
In Hive, the relational operators are generally used with clauses like Join and Having to
compare the existing records. The commonly used relational operators are: -
Operator Description
A <> B, A !=B It returns null if A or B is null; true if A is not equal to B, otherwise false.
A<B It returns null if A or B is null; true if A is less than B, otherwise false.
A>B It returns null if A or B is null; true if A is greater than B, otherwise false.
A<=B It returns null if A or B is null; true if A is less than or equal to B, otherwise false.
A>=B It returns null if A or B is null; true if A is greater than or equal to B, otherwise false.
A IS NOT NULL It returns false if A evaluates to null, otherwise true.
Examples of Relational Operator in Hive
● Let's see an example to fetch the details of the employee having salary>=25000.
1. hive> select * from employee where salary >= 25000;
● Let's see an example to fetch the details of the employee having salary<25000.
1. hive> select * from employee where salary < 25000;
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.
HDFS: It provides high latency batch processing; no concept of batch processing.
HBase: It provides low latency access to single rows from billions of records (random access).
HDFS: It provides only sequential access of data.
HBase: HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
Column-oriented databases are those that store data tables as sections of columns of data,
rather than as rows of data. Shortly, they will have column families.
HBase: It is built for wide tables. HBase is horizontally scalable.
RDBMS: It is thin and built for small tables. Hard to scale.
Features of HBase
● Apache HBase is used to have random, real-time read/write access to Big Data.
● It hosts very large tables on top of clusters of commodity hardware.
● Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts upon the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
HBase - Architecture
In HBase, tables are split into regions and are served by the region servers. Regions are vertically divided by column families into "Stores". Stores are saved as files in HDFS. Shown below is the architecture of HBase.
Master Server
● Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
● Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
● Maintains the state of the cluster by negotiating the load balancing.
● Is responsible for schema changes and other metadata operations such as creation of tables and column families.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
When we take a deeper look into the region server, it contains regions and stores, as shown below:
The store contains the memory store and HFiles. The memstore is just like a cache memory. Anything that is entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as blocks, and the memstore is flushed.
Zookeeper
● Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
● Zookeeper has ephemeral nodes representing different region servers. Master servers use these nodes to discover available servers.
● In addition to availability, the nodes are also used to track server failures or network partitions.
● Clients communicate with region servers via zookeeper.
● In pseudo and standalone modes, HBase itself will take care of zookeeper.
Architecture of ZooKeeper
Take a look at the following diagram. It depicts the "Client-Server Architecture" of ZooKeeper.
Each one of the components that is a part of the ZooKeeper architecture has been explained in the following table.
Part − Description
Client − Clients, one of the nodes in our distributed application cluster, access information from the server. For a particular time interval, every client sends a message to the server to let the server know that the client is alive. Similarly, the server sends an acknowledgement when a client connects. If there is no response from the connected server, the client automatically redirects the message to another server.
Server − Server, one of the nodes in our ZooKeeper ensemble, provides all the services to clients. Gives an acknowledgement to the client to inform it that the server is alive.
Ensemble − Group of ZooKeeper servers. The minimum number of nodes that is required to form an ensemble is 3.
Leader − Server node which performs automatic recovery if any of the connected nodes failed. Leaders are elected on service startup.
Hierarchical Namespace
The following diagram depicts the tree structure of the ZooKeeper file system used for memory representation. A ZooKeeper node is referred to as a znode. Every znode is identified by a name and separated by a sequence of path (/).
● In the diagram, first you have a root znode separated by "/". Under root, you have two logical namespaces, config and workers.
● The config namespace is used for centralized configuration management and the workers namespace is used for naming.
● Under the config namespace, each znode can store up to 1 MB of data. This is similar to the UNIX file system except that the parent znode can store data as well. The main purpose of this structure is to store synchronized data and describe the metadata of the znode. This structure is called the ZooKeeper Data Model.
Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides the metadata of a znode. It consists of version number, action control list (ACL), timestamp, and data length.
● Persistence znode − A persistence znode is alive even after the client which created that particular znode is disconnected. By default, all znodes are persistent unless otherwise specified.
● Ephemeral znode − Ephemeral znodes are active until the client is alive. When a client gets disconnected from the ZooKeeper ensemble, the ephemeral znodes get deleted automatically. For this reason, ephemeral znodes are not allowed to have children. If an ephemeral znode is deleted, then the next suitable node will fill its position. Ephemeral znodes play an important role in leader election.
● Sequential znode − Sequential znodes can be either persistent or ephemeral. When a new znode is created as a sequential znode, ZooKeeper sets the path of the znode by attaching a 10-digit sequence number to the original name. For example, if a znode with path /myapp is created as a sequential znode, ZooKeeper will change the path to /myapp0000000001 and set the next sequence number as 0000000002. If two sequential znodes are created concurrently, then ZooKeeper never uses the same number for each znode. Sequential znodes play an important role in locking and synchronization.
Sessions
Sessions are very important for the operation of ZooKeeper. Requests in a session are executed in FIFO order. Once a client connects to a server, the session will be established and a session id is assigned to the client.
The client sends heartbeats at a particular time interval to keep the session valid. If the ZooKeeper ensemble does not receive heartbeats from a client for more than the period (session timeout) specified at the start of the service, it decides that the client died.
Session timeouts are usually represented in milliseconds. When a session ends for any reason, the ephemeral znodes created during that session also get deleted.
Watches
Watches are a simple mechanism for the client to get notifications about changes in the ZooKeeper ensemble. Clients can set watches while reading a particular znode. Watches send a notification to the registered client for any change to the znode on which the client registers.
Znode changes are modifications of data associated with the znode or changes in the znode's children. Watches are triggered only once. If a client wants a notification again, it must be done through another read operation. When a connection session expires, the client will be disconnected from the server and the associated watches are also removed.
SQOOP
Sqoop has several features, which makes it helpful in the Big Data world:
1.Parallel Import/Export
Sqoop uses the YARN framework to import and export data. This provides fault
tolerance on top of parallelism.
Sqoop enables us to import the results returned from an SQL query into HDFS.
Sqoop provides connectors for multiple RDBMSs, such as MySQL and Microsoft SQL Server.
Sqoop can load the entire table or parts of the table with a single command.
Sqoop Architecture
Now, let’s dive deep into the architecture of Sqoop, step by step:
1. The client submits the import/ export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors help to work with a range of accessible databases.
3. Multiple mappers perform map tasks to load the data on to HDFS.
Sqoop Import
It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.
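A typical import command looks like the following (the connection string, credentials, table, and target directory are illustrative):

$ sqoop import \
  --connect jdbc:mysql://localhost/company_db \
  --username sqoop_user -P \
  --table employees \
  --target-dir /user/hadoop/employees \
  -m 4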
Sqoop Export
Let's now have a look at a few of the arguments used in Sqoop export:
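A typical export command, again with illustrative names, shows the most common arguments:

$ sqoop export \
  --connect jdbc:mysql://localhost/company_db \
  --username sqoop_user -P \
  --table employee_summary \
  --export-dir /user/hadoop/summary \
  --input-fields-terminated-by ','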
After understanding Sqoop import and export, the next section in this Sqoop tutorial is the processing that takes place in Sqoop.
Sqoop Processing
3. It uses mappers to slice the incoming data into multiple formats and loads the data in HDFS.
4.Exports data back into the RDBMS while ensuring that the schema of the data in the
database is maintained.
********************