Module 1: Foundations of Data Engineering (NCMPE51)
Mrs. Priya R L
Faculty In-charge for Data Engineering
Department of Computer Engineering
VES Institute of Technology, Mumbai
Department Elective: Data Engineering – NCMPE51 (Autonomy)
Agenda
● Course Schema
● Course Objectives
● Course Outcomes
● Text / Reference Books
● Assessment Methods
● Prerequisite
● Module 1: Introduction to Data Engineering
● Role, Importance, and Challenges.
● Data Engineering vs. Data Science vs. Data Analytics
Course Schema
Course Objectives
● Understand the core concepts and principles of Data Engineering.
● Understand various data storage and retrieval technologies.
● Learn to design and implement stream and batch data processing
pipelines.
● Design applications using Apache Spark for big data processing
and machine learning.
● Design data pipeline orchestration and cloud-based data
engineering applications.
● Understand importance of data quality, security, and compliance
in data engineering.
Course Outcomes
● Understand the principles of Data Engineering.
● Work with various data sources and storage systems.
● Design, build, and deploy both batch and stream data processing
pipelines.
● Design and implement algorithms using Apache Spark for big data
analysis and machine learning tasks.
● Understand cloud-based data engineering services and their
applications.
● Apply Data Engineering principles to design and implement real-world
applications.
Text / Reference Books
Text Books
● "Designing Data-Intensive Applications" by Martin Kleppmann.
● "Data Engineering Cookbook" by Andreas Kretz
● "Fundamentals of Data Engineering" by Joe Reis and Matt Housley
● "Spark: The Definitive Guide" by Bill Chambers and Matei Zaharia
Reference Books
● "Graph Algorithms: Practical Examples in Apache Spark & Neo4j" by Mark
Needham and Amy E. Hodler
● "Data Pipelines with Apache Airflow" by Bas P. Harenslak and Julian Rutger de
Ruiter
● "Learning Amazon Web Services (AWS): A Hands-On Guide to the
Fundamentals of AWS Cloud" by Mark Wilkins
● "Advanced Analytics with Spark" by Sandy Ryza, Uri Laserson, Sean Owen, and
Josh Wills
Prerequisite
● Discrete Structures,
● DBMS,
● Java / Python
Assessment Methods
● Internal Assessments : 40 Marks
○ Mid Term Test 1 – 20 Marks (After completion of 50% syllabus)
○ Continuous Assessments – 20 Marks
● End Semester Examination : 60 Marks
● DE Lab - Term Work : 25 Marks
○ Experiments based on Mini Project Topic: 15 Marks
○ Term Work Assignments: 10 Marks
Continuous Assessment Rubrics
Module 1 : Introduction to Data Engineering
Source: “Fundamentals of Data Engineering" by Joe Reis and Matt Housley
Data Engineer Vs. Data Analyst Vs. Data Scientist
What is Data Engineering?
● Data Engineering (DE) is a set of operations aimed at creating interfaces and
mechanisms for the flow of and access to information.
● Data engineering involves a wide range of tasks, including data modeling, data
integration, data transformation, data quality, and data governance.
● The goal is to provide a reliable and efficient data infrastructure that supports
the organization’s data-driven decision-making processes
Why study Data Engineering?
❖ Data engineers make reliable data available to the organization by doing things like:
● Accessing, collecting, auditing, and cleaning data from applications and
systems into a usable state
● Creating and maintaining efficient databases
● Building data pipelines
● Monitoring and managing all the data systems (scalability, security, etc.)
● Implementing data scientists’ output in a scalable manner
Role of Data Engineering
Data engineering involves the development of architectures and frameworks that
facilitate the collection, storage, processing, and retrieval of data. The primary
responsibilities of data engineers include:
● Building Data Pipelines: Data engineers design and implement pipelines
that automate the extraction, transformation, and loading (ETL) of data
from various sources into a central repository. This process ensures that
data is readily available for analysis and reporting.
● Data Integration: Integrating data from disparate sources, such as
databases, APIs, and external files, is a fundamental task.
● Database Management: Managing databases involves designing schema
structures, ensuring data integrity, and optimizing performance. Data engineers
work with relational databases (SQL) and NoSQL databases to store and manage
data efficiently.
● Data Quality and Governance: Ensuring the accuracy, consistency, and reliability
of data is vital. Data engineers implement data validation checks, error handling
mechanisms, and governance policies to maintain high-quality data standards.
● Scalability and Performance: As data volumes grow, systems must scale
accordingly. Data engineers optimize data architectures to handle large datasets
and ensure high performance, even under heavy loads.
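The pipeline, integration, and data quality responsibilities listed above can be illustrated with a minimal ETL sketch. It uses only the Python standard library. The sample CSV text, the `orders` table, and the function names `extract`, `transform`, and `load` are hypothetical and chosen purely for illustration; a production pipeline would read from real sources and write to a managed database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (stands in for a file or API response).
RAW_CSV = """order_id,customer,amount
1,alice,120.50
2,bob,
3,carol,not_a_number
4,dave,75.00
"""


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: validate each row and return (clean_rows, rejected_rows)."""
    clean, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            # Data quality check: the amount must be numeric.
            rejected.append(row)
            continue
        clean.append((int(row["order_id"]), row["customer"].title(), amount))
    return clean, rejected


def load(conn, rows):
    """Load: write validated rows into a central table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
clean_rows, rejected_rows = transform(extract(RAW_CSV))
load(conn, clean_rows)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(f"loaded={len(clean_rows)} rejected={len(rejected_rows)} total={total}")
```

Rows that fail validation are set aside rather than silently dropped, so they can be inspected later; this is one simple way to combine error handling with a quality check.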
Importance of Data Engineering
● Efficient Data Management: Without a well-designed data pipeline, organizations
would struggle with data silos and inefficient data retrieval and
processing.
● Enhanced Data Quality: Data engineers implement rigorous data quality controls
and validation processes to ensure that the data used for analysis is accurate and
reliable.
● Scalability and Flexibility: As businesses grow, their data needs evolve. Data
engineers design scalable systems that can accommodate increasing volumes of
data without sacrificing performance.
● Accelerated Analytics and Insights: Effective data engineering accelerates the
process of data analysis. By creating well-structured data pipelines and
storage solutions, data engineers enable analysts and data scientists to
access and analyze data quickly.
● Cost Efficiency: Well-engineered data systems can lead to significant cost
savings. By optimizing data storage and processing, data engineers help
reduce infrastructure costs and improve resource utilization.
Challenges of Data Engineering
While data engineering is crucial, it comes with its own set of challenges:
● Complexity of Data Sources
● Data Security and Privacy
● Evolving Technologies
● Performance Optimization
Data Engineering vs. Data Science vs. Data Analytics
Data Science | Data Engineering | Data Analytics
A Data Scientist focuses on a futuristic display of data. | A Data Engineer focuses on improving data consumption techniques continuously. | A Data Analyst focuses on the present technical analysis of data.
A Data Scientist is primarily focused on analyzing and interpreting data. | Data Engineers are responsible for building and maintaining the infrastructure and tools needed to collect and store large amounts of data. | A Data Analyst is primarily focused on analyzing and interpreting data.
A Data Scientist's role is to apply supervised / unsupervised learning to data, and to classify and regress data. | A Data Engineer's role is to build data in an appropriate format. A Data Engineer works at the back end. | A Data Analyst performs data cleaning, organizes raw data, and analyzes and visualizes data to interpret the analysis.
Data Scientists make heavy use of neural networks and machine learning for continuous regression analysis. | A Data Engineer uses optimized ML algorithms to maintain data and make data available in the most appropriate manner. | A Data Analyst uses programming skills, data manipulation, and data visualization.
Skills needed: Programming (Python, R), Machine Learning (Scikit-learn, TensorFlow), Data Visualization (Matplotlib, Seaborn), Big Data (Spark, Hadoop), SQL/NoSQL, Cloud Platforms (AWS, Google Cloud), Communication Skills. | Skills needed: Programming (Python, Java), ETL & Data Modeling, Big Data Technologies (Spark, Hadoop), SQL/NoSQL, Data Storage (Redshift, BigQuery), Cloud Services (AWS, Azure), Data Pipeline Tools (Airflow). | Skills needed: Programming (Python, SQL), Data Manipulation (Pandas), Data Visualization (Tableau, Power BI), Statistical Analysis, Reporting Tools (Excel, Google Sheets), Business Acumen.
Agenda
● Data Lifecycle: Ingestion, Storage, Processing, Analysis, and Visualization
● Python for Data Engineering: Fundamentals, Libraries (Pandas, NumPy)
Data Engineering Lifecycle
Data Engineering Lifecycle: Tools / Technologies
Stages of DE Lifecycle | Tools / Technologies
Ingestion | Kafka, Flume, Apache NiFi, Logstash
Cleaning/Transform | Python (Pandas), Spark, dbt
Storage | S3, HDFS, PostgreSQL, Snowflake, BigQuery
Orchestration | Apache Airflow, Prefect
Governance | Apache Atlas, Collibra
Serving | Looker, Power BI, APIs
Monitoring | Prometheus, Grafana, Datadog
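Orchestration tools such as Apache Airflow run pipeline tasks so that each task starts only after the tasks it depends on have finished. The sketch below imitates that idea with the standard-library `graphlib` module (Python 3.9+). It is not Airflow code: the task names, the `DEPENDENCIES` mapping, and the `run_pipeline` helper are hypothetical and only mirror the lifecycle stages in the table.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "ingest": set(),
    "clean": {"ingest"},
    "store": {"clean"},
    "serve": {"store"},
    "monitor": {"store"},
}


def run_pipeline(dependencies, actions):
    """Run every task once, in an order that respects the dependencies."""
    executed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()  # call the function registered for this task
        executed.append(task)
    return executed


log = []
# Placeholder actions: each one only records that its task ran.
ACTIONS = {
    name: (lambda name=name: log.append(f"ran {name}")) for name in DEPENDENCIES
}

order = run_pipeline(DEPENDENCIES, ACTIONS)
print(order)
```

The first three tasks always run as ingest, clean, store; `serve` and `monitor` both depend only on `store`, so either may come first. Real orchestrators add scheduling, retries, and monitoring on top of this dependency ordering.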
Python for Data Engineering: Fundamentals
• Why Python for Data Engineering?
• Python is a popular choice for data engineering due to its versatility,
readability, and the vast ecosystem of libraries designed for data
manipulation and processing.
• Its simple syntax makes it easier to learn and use for tasks like data
extraction, transformation, and loading (ETL).
• Python has a rich ecosystem of libraries.
• It is well suited to data manipulation, automation, and pipeline development.
Python for Data Engineering: Libraries (NumPy)
Efficient handling of numerical arrays and matrices
Key Features:
● Multi-dimensional array objects (ndarray)
● Fast mathematical operations
● Broadcasting and vectorization
● Useful for matrix computations and scientific tasks
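The features listed above can be seen in a short sketch. The sensor readings and weights are made-up values used only for illustration.

```python
import numpy as np

# 2-D ndarray: rows are sensors, columns are readings in Celsius (made-up values).
readings = np.array([[20.0, 21.5, 23.0],
                     [18.0, 19.0, 22.0]])

# Vectorization: one expression converts every element, with no Python loop.
fahrenheit = readings * 9 / 5 + 32

# Broadcasting: the (2, 1) column of row means is stretched across the 3 columns.
row_means = readings.mean(axis=1, keepdims=True)
centered = readings - row_means

# Matrix computation: (2, 3) @ (3,) gives one weighted average per sensor.
weights = np.array([0.2, 0.3, 0.5])
weighted = readings @ weights

print(readings.shape, readings.dtype)
print(fahrenheit)
print(centered)
print(weighted)
```

Because these operations run in compiled code over whole arrays, they are typically much faster than equivalent Python loops on large datasets.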
Python for Data Engineering: Libraries (Pandas)
Powerful tools to handle structured data (like tables)
Key Features:
• DataFrame and Series objects
• Read/write CSV, Excel, JSON, SQL
• Data cleaning and transformation
• Grouping, filtering, joining datasets
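A minimal sketch of these features follows. The order and customer data are hypothetical, and `io.StringIO` stands in for a CSV file path so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV export; io.StringIO stands in for a file on disk.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,101,250.0\n"
    "2,102,\n"
    "3,101,100.0\n"
    "4,103,75.5\n"
)
orders = pd.read_csv(orders_csv)  # DataFrame: a table of labelled columns

customers = pd.DataFrame(
    {"customer_id": [101, 102, 103], "city": ["Mumbai", "Pune", "Pune"]}
)

# Cleaning: drop rows whose amount is missing (order 2).
clean = orders.dropna(subset=["amount"])

# Filtering: keep orders of at least 100.
large = clean[clean["amount"] >= 100]

# Joining: attach each customer's city to their orders.
joined = clean.merge(customers, on="customer_id", how="left")

# Grouping: total amount per city (the result is a Series indexed by city).
totals = joined.groupby("city")["amount"].sum()

print(large)
print(totals)
```

Each column of a DataFrame, such as `orders["amount"]`, is a Series. The same read, clean, join, and aggregate pattern applies to Excel, JSON, and SQL sources through the corresponding `pd.read_*` functions.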
Question Set
o Describe the core responsibilities of a Data Engineer within an organization. How do these
responsibilities contribute to the overall data strategy?
o What technical skills and non-technical attributes (e.g., problem-solving, communication) are
most crucial for a successful Data Engineer?
o How has the role of a Data Engineer evolved with the rise of big data and cloud computing?
o Explain why Data Engineering is considered the "backbone" of data-driven organizations.
What are the consequences of poor or absent data engineering?
o In what ways does robust data engineering directly impact the success of Data Science and
Data Analytics initiatives?
o Discuss the business value that efficient and reliable data pipelines bring to an enterprise.
o Identify and elaborate on at least three significant challenges faced by Data Engineers today
(e.g., data quality, scalability, data governance, security, real-time processing). Provide examples
of how these challenges can impact data projects.
o Discuss the complexities involved in integrating data from disparate sources and ensuring data
consistency.
o Clearly articulate the primary differences between Data Engineering, Data Science, and Data
Analytics.
o How do the outputs of a Data Engineer serve as inputs for a Data Scientist or Data Analyst?
Provide concrete examples.
o Describe the "Data Ingestion" / "Data Storage" phase. What are the different types of data
storage solutions (e.g., data warehouses, data lakes, NoSQL databases)?
o Detail the "Data Processing" phase. What are the common types of data transformations (e.g.,
cleaning, normalization, aggregation, enrichment)?
o Differentiate between batch processing and real-time/stream processing. When would you use
one over the other?
o How does the quality of data from the previous stages impact the accuracy and reliability of the
analysis?
o Discuss the data engineering life cycle.
o Why is Python a preferred language for Data Engineering compared to others? Discuss its
advantages.
o Explain what a Pandas DataFrame is and why it is a powerful data structure for tabular data.
o Provide examples of common data manipulation tasks a Data Engineer would
perform using Pandas (e.g., reading/writing data, filtering, grouping, merging,
handling missing values).
o How does Pandas contribute to data cleaning and transformation in the
'Processing' phase of the data lifecycle?
o What is the primary role of NumPy in the data engineering ecosystem?
o Discuss the use of Pandas DataFrames in data engineering.