
Big Data

Dr Umesh Kumar Pandey


Introduction to Big Data
• We live in the information age.
• Data consists of raw, unprocessed facts and figures.
• Many researchers identify 5 Vs that
characterize data as big data: volume,
velocity, variety, veracity and value.
V’s of Big Data
• Volume
• Velocity
• Variety
• Veracity
• Value
What makes data big
• Does size alone make data big?
• One way to define it:
– Pen-and-paper methods are insufficient.
– Existing algorithms and hardware are unable to
process the data.
– Processing takes too much time to produce a result.
Types of Digital Data
• Criteria: Place of generation
– Social Media
– Machine Data
– Transactional Data
• Criteria: Structuredness
– Structured
– Semi-structured
– Unstructured
History of Big Data Innovation
• Development of storing mechanism
– Flat file
– DBMS
– RDBMS
– OODBMS
– Data Warehousing etc.
– Each faced a number of problems in dealing with data.
History of Big Data Innovation
• Today's requirements
– Fast access
– Sufficient and efficient data organization methods
• Challenges faced in the past
– Real-time data assimilation
– Variety of documents
– Huge volume
– Processing of data
Development of Big Data
• First phase: structured storage (DBMS,
RDBMS, data warehouses), statistical tools, data
analytics.
• Second phase: semi-structured storage, web
analytics, web intelligence, social media
analytics.
• Third phase: mobile and sensor data; location-
aware, person-centered data analytics.
Introduction to Big Data Platforms
• Big data platforms utilize the cloud and
aggregate datasets from different sources:
– Google Cloud
– Microsoft Azure
– Amazon web service
– Cloudera
– Tableau
– MongoDB
– IBM Cloud
Types of Data Sources (Criteria: Origin)
• Internal sources: originate within the
enterprise and help in running the business
(referred to as primary data).
• External sources: originate from the external
environment of the organization (referred to
as secondary data sources).
Types of Data Sources (Criteria: Structuredness)
• Structured
• Semi-structured
• Unstructured
Properties of Structured Data
• Predefined format / fixed schema
• Fixed number of fields
• Query and report against predefined data
types
• Stored in flat files, relational databases, or
multidimensional databases
Properties of Semi-structured data
• Flexible schema
• Web database
• Key (token) based value access
Properties of Unstructured Data
• No Schema
• Variety of data
• Challenges
– Storing and arranging data in different sets or
formats
– Combining and linking unstructured data
– Cost in terms of storage space and human
resources
Drivers of Big Data
• Traditional solutions fail to satisfy modern
market needs.
• Society interacts with digital platforms.
• Technology equipment has become cheaper.
• Bigger data storage and high computing
facilities are available to people.
• Cloud computing services are available.
• Sensor-based devices have become popular.
• People engage in data-driven innovation and
decision making for competitive advantage.
Drivers of Big Data
• Three contributing factors:
• Sophisticated consumers: use statistics and
access social media for opinions
• Automation: automated data collection
• Monetization: making money to increase the
profit of the business
Big Data Architecture
Big Data Applications
• Tracking customers' spending and shopping
behavior
• Recommender systems
• Smart traffic systems
• Self-driving cars
• Virtual personal assistants
• Healthcare
Big Data Applications
• Education Sector
• Energy Sector
• Media and Entertainment Sector
Big Data Technology Components
• Data sources (databases, data lakes, data
warehouses, and social media platforms holding
unstructured data that is massive in volume)
• Data storage (scalability, distributed file
systems, NoSQL databases)
• Batch processing (waits for a particular quantity
of raw data before performing an ETL job to
filter, aggregate, and prepare massive volumes
for data analytics; used when data freshness is
not a concern)
Big Data Technology Components
• Stream processing (continuous flow of data,
necessary for real-time data analytics)
• Machine learning (identifies patterns in large,
complex datasets; extracts valuable insights for
improved decision making, enhanced customer
experiences and increased business efficiency)
• Analytics and reporting
Big Data Features: Security
• Challenges:
– Controlled data access
– Data Availability
– Performance
– Liability
Big Data Ethics
• Identity should remain private.
• Data confidentiality and restricted use.
• The use of data must be transparent.
• It should not manipulate human will.
• It should not be used to reinforce unfair
institutional biases.
Big Data Analytics
• Descriptive
• Predictive
• Prescriptive
• Diagnostic
Big Data Analytics Process
• Understanding the problem
• Acquisition
• Transforming and preprocessing
• Data enrichment
• Processing
• Evaluation/analytics
• Mining and iteration
Advantages of Big Data Analytics
• Procurement
• Product development
• Manufacturing
• Distribution
• Marketing
• Price management
• Merchandising
Advantages of Big Data Analytics
• Sales
• Inventory Management
• Human resources
Challenges of conventional system
• Outdated data and inability to operationalize
• Management of large-volume databases
• Big data integration and preparation issues
• Lack of scalability in conventional systems
• Conventional systems are centrally maintained.
• Multiple concurrent write transactions are
possible only to some extent.
• Controlled and managed from a single point.
Challenges of conventional system
• Recovery and maintenance are periodic.
Intelligent Big Data Analytics
• Intelligent big data analytics supports
management by taking into account the main
managerial functions: planning, organizing,
leading and controlling.
Analytics Tools
• SAS
• SPSS
• Python
• R
• Apache Hadoop
Reporting Vs. Analysis
• Reporting helps in monitoring the data
whereas analysis interprets the data.
• Reporting includes building, configuring,
consolidating, organizing, formatting and
summarizing, whereas analysis consists of
questioning, examining, interpreting,
comparing, and confirming.
Reporting Vs. Analysis
• In reporting, outputs come in the form of
canned reports, dashboards, and alerts,
whereas analysis draws on information to
further probe and answer business questions.
• Reporting involves repetitive tasks, whereas
analysis requires a more custom approach.
• Reports are just the numbers required for
analysis; without analysis, decisions cannot
be taken.
Unit 2
History of Hadoop
• The name Hadoop was given by project creator
Doug Cutting, after his kid's stuffed yellow
elephant.
• Objective: build a web search engine from
scratch.
• The Nutch project started in 2002, and a working
crawling and search system quickly emerged
(but its architecture wouldn't scale to the
billions of pages on the web).
History of Hadoop
• The Google File System (GFS) was described
by Google in 2003. It solved the storage
needs for the very large files produced as
part of the web crawl and indexing process.
• In 2004, work started on an open-source
implementation, the Nutch Distributed File
System (NDFS).
• In 2004, Google introduced MapReduce.
• In 2005, Nutch developers ported Nutch's
algorithms to run on MapReduce and NDFS.
History of Hadoop
• In 2006, Hadoop became an independent
subproject of Lucene.
• In February 2008, Yahoo! announced that its
production search index was being generated
by Hadoop.
• Yahoo! search components:
– Crawler: downloads pages from web servers.
– WebMap: builds a graph of the known web.
– Indexer: builds a reverse index to the best pages.
– Runtime: answers users' queries.
History of Hadoop
• In 2005, the infrastructure for the WebMap,
named Dreadnaught, was similar to MapReduce
but more flexible and less structured.
• In 2008, Hadoop was made its own top-level
project at Apache.
• In April 2008, Hadoop sorted an entire terabyte
in 209 seconds using a 910-node cluster.
History of Hadoop
• In November 2008, Google reported that its
MapReduce implementation sorted 1 TB in 68
seconds.
• In April 2009, Yahoo! reported that Hadoop
sorted 1 TB in 62 seconds.
• In 2014, Databricks reported that a 207-node
Spark cluster sorted 100 TB in 1,406 seconds,
a rate of 4.2 TB per minute.
Distributed Computing
• A distributed system is a network of
autonomous computers that communicate
with each other in order to achieve a goal.
• Distributed systems are independent and do
not physically share memory or processors.
Distributed Computing
• They communicate with each other using
messages, pieces of information transferred from
one computer to another over a network.
• Messages can communicate many things:
computers can tell other computers to execute a
procedure with particular arguments, they can
send and receive packets of data, or send signals
that tell other computers to behave a certain way.
Parallel Computing
• Despite the explosion in processor speed,
individual computers aren't able to keep up
with the scale of data becoming available.
• To circumvent physical and mechanical
constraints on individual processor speed,
manufacturers have turned to another
solution: multiple processors.
Parallel Computing
• All of them can share the same data, but the
work will proceed in parallel.
• In a shared memory model, different
processes may execute different statements,
but any statement can affect the shared
environment.
• All the methods for synchronization,
serialization and protecting shared data use
locks and semaphores.
Parallel Computing
• Locks, also known as mutexes (short for
mutual exclusions), are shared objects that are
commonly used to signal that shared state is
being read or modified.
• Semaphores are signals used to protect access
to limited resources. They are similar to locks,
except that they can be acquired multiple
times up to a limit.
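A minimal Python sketch of these two primitives, using the standard threading module (the counter and pool names are illustrative, not from the notes):

import threading

counter = 0                      # shared state
lock = threading.Lock()          # mutex guarding the counter
pool = threading.Semaphore(2)    # at most 2 threads may hold the resource

def increment(n):
    global counter
    for _ in range(n):
        with lock:               # only one thread mutates counter at a time
            counter += 1

def use_resource(name):
    with pool:                   # take one of the 2 slots; released on exit
        print(name, "is using the limited resource")

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
threads += [threading.Thread(target=use_resource, args=("t%d" % i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # always 40000, thanks to the lock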
Apache Hadoop
• Apache Hadoop stands as a pioneer,
revolutionizing the way we understand and
use data.
• Apache Hadoop, an open-source software
framework, is designed to store and process
vast amounts of data across clusters of
computers.
Apache Hadoop
• It provides:
– Scalability
– A reliable infrastructure
– Distributed processing
– Fault tolerance
– Data locality
• It plays a backbone role in building data strategies.
Hadoop Distributed File System
• HDFS is the storage unit
• Store data across a distributed environment.
• Follows a master-slave architecture
• Key components are:
– Name Node (Master Server)
– Data Node (Slave Server)
Name Node
• Manages the overall file system
• Manages the file system metadata
• Keeps track of the directory tree of all files in
the file system
• Maintains the health of DataNodes
• Coordinates file reads/writes
• Controls the processing plan, for example by
executing operations like opening, closing, and
renaming files and directories
NameNode
• The NameNode knows the following:
• For every DataNode:
– its name, rack, capacity and health
• For every file:
– its name, replicas, type, size, timestamp, location
and health
• Data on a failed DataNode is accessed from its
replicas on other DataNodes.
• Each DataNode sends a message (a heartbeat) to
the NameNode periodically; otherwise it is
assumed to be dead.
NameNode
• Ensures data is spread evenly across the
DataNodes in the cluster
• Balances the storage and computing load
• Limits the extent of loss from the failure of a
node
• Optimizes the networking load
• Stores each piece of data on three nodes: two
on the same rack and one on another rack
Data Node
• Stores data blocks in its local storage
• Slave node
• Follows the block protocol under the direction of
the NameNode
• Has no awareness of the distributed file system;
manages its own local file system
• Stores and retrieves data blocks as per the
NameNode's instructions
• Reports back to the NameNode periodically with
lists of the blocks it is storing
• DataNodes talk to each other to rebalance data,
to move copies around and to keep the
replication of data high
HDFS Design Goals
• Hardware failure management
• Handling huge volumes with fast read, write
and operation performance
• High speed / low latency
• Handling high variety
• Plug and play, for easy accessibility
• Network efficiency, by minimizing network
bandwidth usage and data movement
Block System
• A block of data is the fundamental storage unit
in HDFS.
• Data is stored in segments (blocks) on multiple
machines.
• All storage capacities and file sizes are measured
in blocks.
• Block size ranges from 16 to 128 MB; the default
size is 64 MB (so, for example, a 1 GB file is
stored as 16 blocks).
• Every file is organized as a consecutively
numbered sequence of blocks.
HDFS Ensures Integrity
• Write once, read many
• Never updates data in place
• Only one client can write to or append to a file
• No concurrent updates are allowed
• If data is lost or corrupted, or part of a disk
fails, a new healthy replica is copied from other
DataNodes
• A checksum algorithm is applied to all data
written to HDFS
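As an illustration of the checksum idea (HDFS uses CRC-based checksums; this pure-Python sketch is not the HDFS implementation):

import zlib

block = b"some block of file data"
stored_checksum = zlib.crc32(block)   # computed when the block is written

# later, on read, the checksum is recomputed and compared
def verify(data, checksum):
    return zlib.crc32(data) == checksum

print(verify(block, stored_checksum))               # True: block is healthy
print(verify(block + b"corrupt", stored_checksum))  # False: fetch a replica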
Components of Hadoop
• Hadoop Distributed File System: stores data on
cluster machines, providing very high
aggregate bandwidth across the cluster
• MapReduce: a programming model for large-
scale data processing
• YARN: a resource management platform for
managing computing resources
Block Caching
• A DataNode normally reads blocks from disk.
• For frequently accessed files, blocks may be
explicitly cached in the DataNode's memory.
• The job scheduler can take advantage of cached
blocks for increased read performance.
• Users or applications instruct the NameNode
which files to cache (and for how long) by adding
a cache directive to a cache pool.
• Cache pools are an administrative grouping for
managing cache permissions and resource usage.
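As a sketch of how a directive is added (the pool and file names here are illustrative), pools and directives are managed from the command line with the hdfs cacheadmin tool:
• %hdfs cacheadmin -addPool analytics-pool
• %hdfs cacheadmin -addDirective -path /user/hot/lookup.csv -pool analytics-pool -ttl 1d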
HDFS Federation
• HDFS federation, introduced in the 2.x release,
allows a cluster to scale by adding NameNodes.
• Each NameNode manages a portion of the file
system.
• Under federation, each NameNode manages a
namespace volume, which is made up of the
metadata for the namespace, and a block pool
containing all the blocks for the files in the
namespace.
• Namespace volumes are independent of each
other.
• To access a federated HDFS cluster, clients use
client-side mount tables to map file paths to
NameNodes.
HDFS High Availability
• The combination of replicating namenode
metadata on multiple filesystems and using the
secondary namenode to create checkpoints
protects against data loss, but it does not provide
high availability of the filesystem.
• To recover from a failed namenode, an
administrator starts a new primary namenode
with one of the filesystem metadata replicas and
configures datanodes and clients to use this new
namenode.
HDFS High Availability
• The new namenode is not able to serve
requests until it has
– (i) loaded its namespace image into memory,
– (ii) replayed its edit log, and
– (iii) received enough block reports from the
datanodes to leave safe mode.
Data Format
• In memory, data is stored in contiguous chunks.
• On storage devices, data is stored in pages,
sectors and blocks.
• When data must scale, certain access patterns
are better suited to row-oriented storage and
others to column-oriented storage.
Data Format
• In a row-oriented database, these chunks each
contain a row, consisting of its column values
as a tuple.
• In a column-oriented database, these chunks
contain only the values in each row that
belong to one column.
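A minimal Python sketch (the table contents are illustrative) contrasting the two layouts for the same three-row table:

# Row-oriented: each chunk holds one complete tuple.
rows = [
    ("alice", 34, "delhi"),
    ("bob",   29, "pune"),
    ("carol", 41, "delhi"),
]

# Column-oriented: each chunk holds the values of one column.
columns = {
    "name": ["alice", "bob", "carol"],
    "age":  [34, 29, 41],
    "city": ["delhi", "pune", "delhi"],
}

# Whole-record access favors the row layout:
print(rows[1])                                    # one contiguous read

# An aggregate over a single column favors the column layout:
print(sum(columns["age"]) / len(columns["age"]))  # reads only "age"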
Row-Oriented Database
• Traditional storage method for OLTP
• Performs well for single transactions
• Better for inserting, updating or deleting small
amounts of data
• Data is partitioned horizontally.
• Beneficial when most or all of the values in a
record are accessed
• Good for index range scans
Row-Oriented Database
• Creating many indices creates many copies of
the data.
• Writes data quickly.
• Indexing can drastically improve query
response time.
Column-Oriented Database
• Better for OLAP applications
• Partitioned vertically and stored contiguously
by column
• Partial reads are efficient, because a lower
volume of data is loaded and only the relevant
data is read.
• Compression is efficient, because columns
have uniform types.
Column-Oriented Database
• Better for arbitrary access patterns
• Good for big data
• Good for read-heavy workloads
Analyzing Data with Hadoop
• Hadoop provides parallel processing with the
help of a map function and a reduce function.
• Each phase has key-value pairs as input and
output, the types of which may be chosen by
the programmer.
• Programmers specify two functions, a map
function and a reduce function, for performing
any task.
• Data input processing happens in the input
phase.
Analyzing Data with Hadoop
• The map function is just a data preparation
phase.
• The output of the map function is processed by
the MapReduce framework before being sent to
the reduce function.
• This processing sorts and groups the key-value
pairs by key.
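A minimal pure-Python sketch of the map, sort/group, reduce model (word count is the usual illustration; no Hadoop is involved here):

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # data preparation: emit a (word, 1) pair for every word
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # aggregation: sum all counts grouped under one key
    return (word, sum(counts))

lines = ["Big data needs big tools",
         "big clusters process big data"]

# map phase
pairs = [kv for line in lines for kv in map_fn(line)]

# the framework's shuffle: sort, then group the pairs by key
pairs.sort(key=itemgetter(0))

# reduce phase
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_fn(word, (count for _, count in group)))
# prints ('big', 4), ('clusters', 1), ('data', 2), ...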
Scaling Out
• Hadoop runs a job by dividing it into tasks, of
which there are two types: map tasks and reduce
tasks.
• The tasks are scheduled using YARN and run on
nodes in the cluster.
• If a task fails, it is automatically rescheduled to
run on a different node.
• With many splits, the time taken to process each
split is small compared to the time to process the
whole input, but each split adds management
overhead.
• Hadoop tries to run each map task on the node
where the input data resides in HDFS.
• This is called the data locality optimization.
Combiner Function
• A combiner function can be run on the map
output, on the node that produced it.
• The combiner function's output forms the input
to the reduce function.
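Continuing the word-count sketch above, a combiner is in effect a local reduce run on one mapper's output, so fewer pairs cross the network (a sketch; combine_fn is just the reduce logic applied per node):

# Output of one map task, before the combiner:
node_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]

def combine_fn(pairs):
    # local reduce: sum the counts per key on the mapper's own node
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    return list(totals.items())

print(combine_fn(node_output))  # [('big', 3), ('data', 1)] - 2 pairs, not 4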
Hadoop Streaming
• Hadoop provides an API to MapReduce that
allows you to write your map and reduce
functions in languages other than Java.
• Hadoop Streaming uses Unix standard streams
as the interface between Hadoop and your
program.
• Streaming is naturally suited to text processing:
it processes input line by line and writes lines
to standard output.
• A map output key-value pair is written as a
tab-delimited line.
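A minimal streaming word count in Python (the file names mapper.py and reducer.py are illustrative): each script reads lines on standard input and writes tab-delimited key-value lines on standard output.

#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")

#!/usr/bin/env python3
# reducer.py - stdin arrives sorted by key, so equal words are adjacent
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))

Because Streaming is just standard streams, the pair can be tested locally with a Unix pipeline before running under Hadoop:
• %cat input.txt | python3 mapper.py | sort | python3 reducer.py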
Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface
to Hadoop MapReduce.
• Apache Hadoop provides an adapter layer called
Pipes which allows C++ application code to be
used in MapReduce programs.
• Unlike Streaming, which uses standard input and
output to communicate with the map and reduce
code, Pipes uses sockets as the channel over
which the tasktracker communicates with the
process running the C++ map or reduce function.
Scaling Up vs. Scaling Out
• There are two approaches:
• Scaling up (vertical scaling): increase the
capacity of a single system (a more powerful
processor, more memory); a short-term solution
• Scaling out (horizontal scaling): add servers
for parallel computing; a long-term solution
that is more difficult to build but extremely
effective
Command Line Interface
• There are two properties that need to be set in
the pseudo-distributed configuration.
• The first is fs.default.name, set to
hdfs://localhost/, which is used to set the
default file system for Hadoop.
• File systems are specified by a URI (uniform
resource identifier); here the hdfs scheme
configures Hadoop to use HDFS.
Command Line Interface
• HDFS uses this property to determine the host
and port of the HDFS NameNode. For localhost,
the default HDFS port of 8020 is used. HDFS
clients use this property to find the NameNode
and connect to it.
Command Line Interface
• The second property, dfs.replication, is set to 1
so that HDFS doesn't replicate file system blocks
by the default factor of three.
• When running with a single datanode, HDFS
can't replicate blocks to three datanodes, and it
would continually warn about under-replicated
blocks; this setting removes the warning.
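As a sketch (using the Hadoop 1.x-era property names that these notes follow), the two properties would appear in the configuration files roughly as:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>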
Basic File Operations
• %hadoop fs -help
• %hadoop fs -cp input/docs/q.txt
hdfs://localhost/user/q.txt
• %hadoop fs -mkdir books
• %hadoop fs -ls
• %hadoop fs -cat <path-of-file>
• %hadoop fs -mv <source> <destination>
• %hadoop fs -rm <path-of-file>
Interfaces
• Hadoop is written in Java, so most Hadoop file
system interactions are mediated through the
Java API.
