Big Data
Large amounts of data, on the order of petabytes or exabytes, that are difficult to process using current database management tools or traditional data processing applications.
The main 5 V’s of big data
Volume
• Tens of thousands of IoT sensors and thousands of cameras are placed around a
massive farm, and this number will not change soon. The audio and video streams
will be of high quality.
Velocity
• It is anticipated that millions of data points will be captured and transmitted per
second from these devices.
Variety
• IoT sensors capture environmental conditions such as temperature, humidity,
and light level, while cameras capture audio and video data.
Veracity
• The sensors and cameras cannot be verified/authenticated in real time because the
overhead is too high.
Value
• The system will use AI to analyze this data and make real-time decisions about the
operation of the automated fans.
Two approaches to scaling up Big Data systems
Vertical scaling – enlarge a single machine (limited in capacity and expensive)
Horizontal scaling – use many commodity machines to form a computer cluster or grid.
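The horizontal-scaling idea can be sketched in plain Python (a toy illustration, not tied to any specific framework, and with hypothetical function names): split a large job into partitions, let each "machine" process its own partition independently, then combine the partial results.

```python
# Toy sketch of horizontal scaling: partition the work, process each
# partition independently (as separate commodity machines would), combine.

def partition(data, n_workers):
    """Split the data into roughly equal chunks, one per worker."""
    size = (len(data) + n_workers - 1) // n_workers
    return [data[i:i + size] for i in range(0, len(data), size)]

def worker_sum(chunk):
    """Work done independently on one commodity machine."""
    return sum(chunk)

def distributed_sum(data, n_workers=4):
    # In a real cluster, each partition would run on a separate machine.
    partial_results = [worker_sum(chunk) for chunk in partition(data, n_workers)]
    return sum(partial_results)  # combine step

print(distributed_sum(range(1, 101)))  # 5050
```

Adding more workers only changes how the data is partitioned, not the result, which is why commodity clusters can grow incrementally.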
Features of Hadoop
• Storage unit: Hadoop Distributed File System (HDFS)
• Replication of data (redundancy)
• MapReduce (splits the data into parts for parallel processing)
• MapReduce enables load balancing and saves time
• YARN (manages containers and provides fault tolerance)

Features of Spark
• Resilient Distributed Datasets (RDDs, held in RAM)
• Up to 100 times faster than Hadoop, and more efficient
• Spark Core (coordinates data processing across multiple computers, maintaining efficiency and smoothness)
• Spark Streaming (processes real-time data)
• Spark SQL (query data sets directly with SQL)
• Spark ML (trains large-scale machine learning models)
• Cluster manager (handles the Spark driver process and executors)
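The MapReduce model mentioned above can be illustrated with a word count, the classic example. This is a single-process sketch in plain Python, not the real Hadoop API: the map phase emits (key, value) pairs from each input split, the shuffle phase groups values by key (which the framework would do across nodes), and the reduce phase aggregates each group.

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs from one input split.
def map_phase(split):
    return [(word, 1) for word in split.split()]

# Shuffle phase: group values by key, as the framework does between nodes.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values for each key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data big", "data cluster"]
pairs = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle(pairs)))  # {'big': 2, 'data': 2, 'cluster': 1}
```

Because each split is mapped independently and each key is reduced independently, both phases parallelize naturally across a cluster.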
Weaknesses of Hadoop
• Relies on storing data on disk
• Data processing is slow
• Batch processing: each batch must wait for the previous batch to complete before the results are combined into the final output
• Does not use RAM for processing

Weaknesses of Spark
• High memory consumption can lead to resource exhaustion
• Since Hadoop was introduced first, Spark is less mature than Hadoop
• When a data set does not fit in memory it spills to disk, but Spark's disk handling is weaker because Spark is designed mainly for in-memory processing
• Inefficient disk usage can lead to poor resource utilization
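The batch-versus-streaming contrast above can be sketched in plain Python (a toy illustration with hypothetical function names, not the real Hadoop or Spark APIs): a batch job emits nothing until the whole batch finishes, while Spark Streaming's micro-batch style emits partial results as each small batch completes.

```python
def batch_process(records):
    """Hadoop-style batch: nothing is emitted until the whole batch completes."""
    return [r.upper() for r in records]

def micro_batch_process(stream, batch_size=2):
    """Spark-Streaming-style micro-batches: results are emitted incrementally."""
    for i in range(0, len(stream), batch_size):
        yield [r.upper() for r in stream[i:i + batch_size]]

events = ["temp:21", "humidity:40", "light:300", "temp:22"]
print(batch_process(events))             # one result, after everything finishes
for out in micro_batch_process(events):  # partial results arrive as batches complete
    print(out)
```

The smaller the micro-batch, the lower the latency before the first result appears, at the cost of more per-batch overhead.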
Technologies for Big Data
Distributed File Systems – HDFS, Google File System (GFS)
Distributed/Parallel Programming (MapReduce Model)
NoSQL database – MongoDB, Cassandra
Large Scale Machine Learning – Deep learning
Data warehouses/ Data Lakes