BIG DATA
(2021-2022, I-SEMESTER)
Big Data Architecture
By
Dr. Tene Ramakrishnudu
Assistant Professor
Department of Computer Science & Engineering
National Institute of Technology (NIT), Warangal, TS, India
Outline
❖Big Data Architecture
23-09-2021 RK-CSE-NITW 2
Big data system architectures
Source: [2]
Big data architectures: Data sources
❖Data sources: All big data solutions start with one or more data sources.
❖Examples include:
▪ Data storage: file systems, RDBMS, NoSQL stores, etc.
▪ Archive: scanned documents, customer correspondence records, student admission and assessment records
▪ Public web: Wikipedia, compliance, regulatory, and weather data, etc.
▪ Sensor data: buildings, cars, smart electric meters
▪ Machine logs: event logs, clickstream logs
▪ Social media: Facebook posts, Twitter tweets
▪ Business apps: ERP, CRM, HR
▪ Media: video, audio, images
▪ Docs: CSV, PDF, XLS, PPT, etc.
Big data architectures: Data storage
❖Data storage:
▪ Data for batch processing operations is typically stored in a distributed file store.
▪ A distributed file store can hold high volumes of large files in various formats.
▪ This kind of store is often called a data lake.
▪ Options for implementing this storage include
o Azure Data Lake Store or
o blob containers in Azure Storage.
Big data architectures: Real-time message ingestion
❖Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.
▪ This might be a simple data store, where incoming messages are dropped into a folder for processing.
▪ However, many solutions need a message ingestion store to act as a buffer for messages and to support
o scale-out processing,
o reliable delivery, and
o other message queuing semantics.
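The buffering role described above can be illustrated with a minimal sketch. It uses Python's standard `queue.Queue` as an in-memory stand-in for a message ingestion store; a real deployment would use a service such as those listed on the next slide. The function name `ingest_and_process`, the worker count, and the buffer size are assumptions made for illustration only.

```python
import queue
import threading


def ingest_and_process(messages, num_workers=3, buffer_size=100):
    """Push messages through a bounded buffer consumed by several workers.

    Hypothetical helper: the bounded queue plays the role of the ingestion
    buffer, and the worker threads stand in for scale-out processing.
    """
    buffer = queue.Queue(maxsize=buffer_size)  # producers block when it is full
    processed = []                             # messages handled by any worker
    lock = threading.Lock()
    stop = object()                            # sentinel that shuts a worker down

    def consumer():
        while True:
            message = buffer.get()
            try:
                if message is stop:
                    return
                with lock:
                    processed.append(message)
            finally:
                buffer.task_done()             # acknowledge the message

    workers = [threading.Thread(target=consumer) for _ in range(num_workers)]
    for worker in workers:
        worker.start()
    for message in messages:
        buffer.put(message)                    # ingestion side
    for _ in workers:
        buffer.put(stop)                       # one sentinel per worker
    buffer.join()                              # wait until everything is acknowledged
    for worker in workers:
        worker.join()
    return processed


result = ingest_and_process([f"reading-{i}" for i in range(20)])
```

Because several workers drain the same buffer, the order of `result` is not guaranteed; only the set of delivered messages is.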
▪ This portion of a streaming architecture is often referred to as
stream buffering.
▪ Options include
o Azure Event Hubs,
o Azure IoT Hub, and
o Kafka.
Big data architectures: Batch processing
❖Batch processing:
❖Because the data sets are so large, a big data solution often must process data files using long-running batch jobs to
▪ filter,
▪ aggregate, and
▪ otherwise prepare the data for analysis.
❖Usually these jobs involve
▪ reading source files,
▪ processing them, and
▪ writing the output to new files.
❖Options include
o running U-SQL jobs in Azure Data Lake Analytics,
o using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or
o using Java, Scala, or Python programs in an HDInsight Spark cluster.
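The read, process, and write pattern of a batch job can be sketched on a single machine with the Python standard library. This is only an illustration of the pattern, not of any of the cluster technologies listed above; the column names `region` and `amount`, the file names, and the helper `run_batch_job` are all hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def run_batch_job(source_dir, output_file, min_amount=0.0):
    """Read every CSV in source_dir, filter and aggregate rows, write a new file."""
    totals = defaultdict(float)
    for path in sorted(Path(source_dir).glob("*.csv")):      # read source files
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                amount = float(row["amount"])
                if amount >= min_amount:                      # filter
                    totals[row["region"]] += amount           # aggregate
    with Path(output_file).open("w", newline="") as handle:   # write output to a new file
        writer = csv.writer(handle)
        writer.writerow(["region", "total_amount"])
        for region in sorted(totals):
            writer.writerow([region, totals[region]])
    return dict(totals)


def demo():
    """Create two small source files in a temporary folder and run the job."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source"
        source.mkdir()
        (source / "day1.csv").write_text("region,amount\nsouth,10\nnorth,5\n")
        (source / "day2.csv").write_text("region,amount\nsouth,2.5\nnorth,-1\n")
        output = Path(tmp) / "summary.csv"
        totals = run_batch_job(source, output)
        return totals, output.read_text()


totals, summary_text = demo()
```

The source files are never modified; the job only produces a new summary file, which mirrors the "writing the output to new files" step.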
Big data architectures: Stream processing
❖Stream processing: After capturing real-time messages, the solution must process them by
▪ filtering,
▪ aggregating, and
▪ otherwise preparing the data for analysis.
❖The processed stream data is then written to an output sink.
❖Azure Stream Analytics provides a managed stream processing service.
❖Apache streaming technologies such as Storm and Spark Streaming can also be used in an HDInsight cluster.
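The filter, aggregate, and write-to-sink steps can be sketched with a plain Python generator. The example below counts events per fixed (tumbling) time window, which is one common aggregation; the window size, the event tuples, and the list used as an output sink are assumptions for illustration and do not reflect the API of any of the services named above.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=10):
    """Yield (window_start, {key: count}) for each window.

    Assumes events arrive as (timestamp, key) pairs in timestamp order.
    """
    current_start = None
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is not None and window_start != current_start:
            yield current_start, dict(counts)   # the previous window is closed
            counts = defaultdict(int)
        current_start = window_start
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)       # flush the last open window


# Hypothetical captured messages: (timestamp in seconds, event type).
events = [(1, "click"), (3, "heartbeat"), (4, "view"), (9, "click"),
          (12, "click"), (25, "view")]

# Filter out event types that are not needed for analysis.
wanted = (event for event in events if event[1] in {"click", "view"})

sink = []                                       # stand-in for an output sink
for window in tumbling_window_counts(wanted):
    sink.append(window)
```

Each window is emitted as soon as an event from a later window arrives, so results reach the sink incrementally rather than after the whole stream has been read.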
Big data architectures: Analytical data store
❖Analytical data store:
❖Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.
❖The analytical data store used to serve these queries can be a Kimball-style relational data warehouse.
❖Alternatively, the data could be presented through
▪ a low-latency NoSQL technology such as HBase, or
▪ an interactive Hive database that provides a metadata abstraction over data files in the distributed data store.
❖Azure SQL Data Warehouse provides a managed service for large-scale, cloud-based data warehousing.
❖HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.
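A Kimball-style warehouse organizes data into fact tables (measurements) and dimension tables (descriptive attributes). The following minimal sketch uses Python's built-in `sqlite3` module as a small stand-in for a relational warehouse; the table names `fact_sales` and `dim_product` and the sample rows are hypothetical.

```python
import sqlite3


def build_store():
    """Create an in-memory database with one dimension table and one fact table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            category   TEXT NOT NULL
        );
        CREATE TABLE fact_sales (
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            amount     REAL NOT NULL
        );
        INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
        INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 20.0);
        """
    )
    return conn


def sales_by_category(conn):
    """Typical analytical query: join facts to a dimension and aggregate."""
    query = """
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON f.product_id = d.product_id
        GROUP BY d.category
        ORDER BY d.category
    """
    return conn.execute(query).fetchall()


conn = build_store()
report_rows = sales_by_category(conn)
conn.close()
```

Analytical tools issue queries of this shape against the store: the fact table supplies the numbers and the dimension table supplies the grouping attribute.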
Big data architectures: Analysis and reporting
❖Analysis and reporting:
❖The goal of most big data solutions is to provide insights into the data through analysis and reporting.
❖To empower users to analyze the data, the architecture may include a data modeling layer, such as
▪ a multidimensional OLAP cube or
▪ a tabular data model in Azure Analysis Services.
❖It might also support self-service BI, using modeling and visualization technologies.
❖Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts.
❖Many Azure services support analytical notebooks, such as
▪ Jupyter, enabling these users to leverage their existing skills with Python or R.
❖For large-scale data exploration, R Server can be used, either standalone or with Spark.
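The idea behind a multidimensional OLAP cube is to precompute a measure for every combination of dimensions, so that reports at any level of detail become simple lookups. The sketch below shows that idea in plain Python on a tiny data set; it is not how Azure Analysis Services works internally, and the helper `build_cube` and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Return {dimension subset: {key tuple: total}} for every subset of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)   # () means the grand total
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


rows = [
    {"region": "south", "year": 2020, "sales": 10.0},
    {"region": "south", "year": 2021, "sales": 15.0},
    {"region": "north", "year": 2021, "sales": 5.0},
]

cube = build_cube(rows, ("region", "year"), "sales")

grand_total = cube[()][()]                               # all sales
sales_by_region = cube[("region",)]                      # roll-up over years
south_2021 = cube[("region", "year")][("south", 2021)]   # most detailed cell
```

The number of dimension subsets grows exponentially with the number of dimensions, which is one reason real modeling layers choose carefully which aggregations to materialize.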
Big data architectures: Orchestration
❖Orchestration: Most big data solutions consist of
▪ repeated data processing operations,
▪ encapsulated in workflows, that
o transform source data,
o move data between multiple sources and sinks,
o load the processed data into an analytical data store, or
o push the results straight to a report or dashboard.
❖To automate these workflows, you can use an orchestration technology such as
▪ Azure Data Factory or
▪ Apache Oozie and Sqoop.
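At its core, an orchestrator runs the steps of a workflow in an order that respects their dependencies. The sketch below shows this with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). It does not use or imitate the APIs of the tools named above; the task names, the shared `context` dictionary, and the helper `run_workflow` are assumptions for illustration.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run each task after all of its prerequisites.

    tasks: name -> callable taking a shared context dict
    dependencies: name -> set of names that must run first
    """
    context = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](context)
    return order, context


def extract(context):
    context["raw"] = [3, -1, 4, -2]                  # move data from a source


def transform(context):
    context["clean"] = [x for x in context["raw"] if x >= 0]


def load(context):
    context["store"] = sum(context["clean"])         # load into an analytical store


def report(context):
    context["report"] = f"total={context['store']}"  # push results to a report


tasks = {"extract": extract, "transform": transform, "load": load, "report": report}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

order, context = run_workflow(tasks, dependencies)
```

Production orchestrators add scheduling, retries, and monitoring on top of this ordering logic.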
References
❖[1] Bill Franks, “Taming the Big Data Tidal Wave”
❖[2] Zoiner Tejada, “Big data architectures”, 2018
❖[3] Min Chen, Shiwen Mao, Yin Zhang, Victor C.M. Leung, “Big Data: Related Technologies, Challenges and Future Prospects”, Springer, 2014.
Thank You