Roadmap to become a DATA ENGINEER
2023
January
Basics of Programming
Operators, Variables & Data Types in Python
Conditional Statements & Looping Constructs
Data Structures in Python
Writing custom Functions
Standard Libraries in Python
Regular Expressions
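As a small illustration of how the January topics fit together, the sketch below combines a custom function, a loop-free use of a standard-library data structure, and a regular expression. The function name `word_counts` and the sample sentence are made up for this example; it is one possible exercise, not a prescribed one.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Return how often each word appears in `text`, ignoring case."""
    # The regular expression keeps runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Counter is a dict subclass from the standard library.
    return Counter(words)


counts = word_counts("Spark and Hive, and Kafka!")
if counts["and"] > 1:  # conditional statement on a looked-up value
    print("'and' is repeated", counts["and"], "times")
```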
February
Fundamentals of Computing
Shell Scripting
Working with APIs
Data Structures in Python
Git and GitHub
March
Relational Databases
Basic Querying in SQL
Keys in SQL
Joins in SQL
Subqueries in SQL
Window Functions
Normalisation
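The March topics can be practised without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `customers` and `orders` tables and their rows are invented for illustration. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3


def ranked_orders():
    """Join two tables and rank each customer's orders by amount."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,          -- primary key
            name TEXT NOT NULL
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),  -- foreign key
            amount      REAL NOT NULL
        );
        INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
        INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 120.0), (3, 2, 75.0);
    """)
    rows = conn.execute("""
        SELECT c.name,
               o.amount,
               RANK() OVER (PARTITION BY c.id ORDER BY o.amount DESC) AS amount_rank
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        ORDER BY c.name, amount_rank
    """).fetchall()
    conn.close()
    return rows
```

Keeping customer names in their own table and referring to them by key from `orders` is also a small instance of normalisation.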
April
Cloud Computing Fundamentals with AWS
Basics of AWS
IAM users and IAM Roles
AWS EC2
Lambda Functions on AWS
AWS S3
API Gateway
AWS VPC
AWS RDS and Aurora
May
Data Processing with Apache Spark
Spark architecture
RDDs in Spark
Working with Spark DataFrames
Understand Spark Execution
Broadcast and Accumulators
Spark SQL
June
Fundamentals of Computing
Overview of the Hadoop Ecosystem
Understand MapReduce architecture
Understand the working of YARN
Work with Hadoop on the cloud with AWS EMR
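To get a feel for the MapReduce model before touching a cluster, the three phases can be imitated in plain Python. This is only a single-process sketch of the idea, not Hadoop code; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real Hadoop job the map and reduce calls run on different machines and the shuffle moves data across the network; here everything happens in one process.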
July
Data Warehousing with Apache Hive
Hive Query Language
Managed vs External tables
Partitioning and Bucketing
Types of File formats
SerDes in Hive
August
Ingesting streaming data with Apache Kafka
Learn Kafka architecture
Learn about Producers and Consumers
Create topics in Kafka
Ingest streaming data on cloud with AWS Kinesis
September
Process streaming data with Spark Streaming
DStreams
Stateless vs Stateful transformations
Checkpointing
Structured Streaming
October
Advanced Programming
OOP concepts
Understand recursive functions
Unit testing
Integration testing
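A recursive function paired with a unit test makes a compact exercise for the October topics. The sketch below uses the standard `unittest` module; `flatten` and `FlattenTests` are names invented for this example.

```python
import unittest


def flatten(nested):
    """Recursively flatten arbitrarily nested lists into one flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            # Recursive case: flatten the inner list first.
            flat.extend(flatten(item))
        else:
            # Base case: a plain value is kept as-is.
            flat.append(item)
    return flat


class FlattenTests(unittest.TestCase):
    """Unit tests: each method checks one behaviour in isolation."""

    def test_nested_lists(self):
        self.assertEqual(flatten([1, [2, [3, 4]], 5]), [1, 2, 3, 4, 5])

    def test_empty_list(self):
        self.assertEqual(flatten([]), [])
```

Saved to a file, the tests can be run with `python -m unittest <module name>`.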
November
NoSQL
CAP theorem
Documents and Collections
CRUD operations
Different types of operators
Aggregation Pipeline
Sharding and Replication
December
Workflow Scheduling
DAGs
Task dependencies
Operators
Scheduling
Branching
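The idea behind DAGs and task dependencies can be tried out with the standard library alone before picking up a workflow scheduler. The sketch below uses `graphlib.TopologicalSorter` (Python 3.9 or newer) and is not tied to any scheduling tool; the task names in `PIPELINE` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
}


def run_order(dependencies):
    """Return one valid execution order for a DAG of task dependencies."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(PIPELINE)
print(order)
```

`validate` and `transform` depend only on `extract`, so either may come first; the only guarantee is that every task appears after all of its dependencies.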