Certified Data Analyst
Session 1
Excel, Power BI, SQL Server, Tableau, Python
Tools we will be using during the session
• Interactive Quizzes and Polls
• Class Assignments and Projects
• Supplementary Reading Materials
• Discussion Forums and Peer Interaction
• Videos and Multimedia Content
• Progress Tracking
• Certificates of Completion
• Live Sessions/Webinars
• Personalized Learning
What we will cover in today’s session
• What is Data Analysis?
• Modern Data Ecosystem
• Enterprise Applications
• Types of Enterprise Data Analysis
• Data Analysis vs. Data Analytics
• Enterprise Data Analysis Use Cases
• Real World Data Science Use Cases
• Data Science in Different Industries
• Data Analysis Process
• Types of Data
• Data Source Types
• Data Repositories
• Data Visualization
• RDBMS vs. NoSQL Repositories
• ETL & Data Pipeline
• Big Data
• Key Players in Data Analytics / Science
What is Data Analysis?
• Data Collection
• Data Cleaning
• Data Transformation
• Data Analysis
• Data Visualization
• Data Communication
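The six stages above can be sketched end to end on a tiny dataset. This is only an illustration using the Python standard library; the sample records, field names, and helper functions are made up for this example and are not part of the course material.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data: in practice this would come from a file, database, or API.
RAW_CSV = """order_id,city,amount
1,Karachi,1200
2,Lahore,
3,karachi,800
3,karachi,800
4,Lahore,500
"""

def collect(raw_text):
    """Data collection: read the raw rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def clean(rows):
    """Data cleaning: drop rows with missing amounts and duplicate order ids."""
    seen, cleaned = set(), []
    for row in rows:
        if not row["amount"] or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append(row)
    return cleaned

def transform(rows):
    """Data transformation: fix types and standardize city names."""
    return [{"city": r["city"].title(), "amount": float(r["amount"])} for r in rows]

def analyze(rows):
    """Data analysis: total sales per city."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["city"]] += r["amount"]
    return dict(totals)

def visualize(totals):
    """Data visualization: a plain-text bar chart (one '#' per 100 units)."""
    lines = [f"{city:<8} {'#' * int(total // 100)}" for city, total in sorted(totals.items())]
    return "\n".join(lines)

def communicate(totals):
    """Data communication: a one-sentence summary of the finding."""
    top_city = max(totals, key=totals.get)
    return f"{top_city} has the highest sales: {totals[top_city]:.0f}"

totals = analyze(transform(clean(collect(RAW_CSV))))
print(visualize(totals))
print(communicate(totals))
```

In the tools used in this course, the same stages would typically be spread across SQL Server or Excel (collection and cleaning), Python (transformation and analysis), and Power BI or Tableau (visualization).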
Modern Data Ecosystem
What is Ecosystem?
• Google - Forest ecosystem
• Google - Coral reef ecosystem
• Google - Congo rainforest ecosystem
Modern Data Ecosystem
A data ecosystem is a complex system of technologies and processes that
organizations use to collect, store, process, and analyze data.
Sources
• Databases (Relational, NoSQL)
• Flat Files
• APIs
• Streams
• Logs
Ingestion
• Database queries
• Batch process
• API / Web services
• Stream processing
• Log aggregation
ETL / ELT
• Cleaning
• Converting
• Formatting
• Filtering
• Aggregating
• Normalizing
• Enriching
Data Warehousing
• Landing
• Staging
• Warehouse
• Mart
Data Lake
• Landing Area
• Raw Zone
• Processed Zone
• Curated Zone
BI
• Reporting
• Dashboards
Additionally, a Data Ecosystem Consists of
• Cloud Computing
• Big Data Technologies
• Machine Learning and Artificial Intelligence
• Data Science
Enterprise Applications
Types of applications commonly needed to manage restaurants
• Point of Sale (POS)
• Inventory Management
• Delivery and Online Ordering
• Reservation and Table Management
• HRMS (Human Resource Management System)
• Accounting
Types of applications commonly needed by a manufacturing concern
• Enterprise Resource Planning (ERP)
• Supply Chain Management (SCM)
• Manufacturing Execution Systems (MES)
• Product Lifecycle Management (PLM)
• Quality Management System (QMS)
• Customer Relationship Management (CRM)
• Inventory Management
• Human Resource Management (HRM)
Analyze a Banking System
https://www.meezanbank.com/organizational-chart/
• Shariah Board and Audit Committee: Compliance Software, Audit Management Systems
• Human Resources, Learning & Development Group: Human Resources Information System (HRIS)
• Finance: Financial Management Software
• Risk Management Group: Risk Analysis Tools
• Operations Group: Core Banking Software
• Consumer Finance Group: Loan Origination and Management Systems
Types of Enterprise Data Analysis
Types of Data Analysis
• Descriptive Analysis
• Diagnostic Analysis
• Predictive Analysis
• Prescriptive Analysis
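A small sketch may help separate the four types. It follows the commonly used reading that descriptive analysis asks what happened, diagnostic asks why, predictive asks what is likely to happen next, and prescriptive asks what to do about it. The sales figures, ad spend, stock level, and restocking rule below are all invented for illustration.

```python
from statistics import mean

# Hypothetical monthly sales (units) and ad spend for the same four months.
sales = [100, 120, 140, 160]
ad_spend = [10, 12, 14, 16]

# Descriptive: what happened?
average_sales = mean(sales)

# Diagnostic: why did it happen? Here, check how closely sales follow ad spend.
def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

sales_vs_ads = correlation(ad_spend, sales)

# Predictive: what is likely to happen? Fit a straight-line trend over the months.
months = list(range(len(sales)))
m_mean, s_mean = mean(months), mean(sales)
slope = (
    sum((m - m_mean) * (s - s_mean) for m, s in zip(months, sales))
    / sum((m - m_mean) ** 2 for m in months)
)
intercept = s_mean - slope * m_mean
next_month_forecast = intercept + slope * len(sales)

# Prescriptive: what should we do? An invented rule: restock if the forecast exceeds stock.
current_stock = 150
action = "restock" if next_month_forecast > current_stock else "hold"

print(average_sales, round(sales_vs_ads, 2), next_month_forecast, action)
```

A strong correlation alone does not prove that ad spend caused the sales; real diagnostic work would look at more factors than this toy example does.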
Data Analytics vs. Data Analysis
• Analysis - detailed examination of the elements or structure of
something
• Analytics - the systematic computational analysis of data or statistics
Data Science
A few Real-World Data Science Examples
• Netflix uses data science to recommend movies and TV shows to its users.
• Amazon uses data science to personalize its product recommendations and predict customer demand.
• Google uses data science to improve its search results and develop new products and services, such as Google Translate and Google Maps.
• Facebook uses data science to target ads to its users and improve the overall user experience.
• Tesla uses data science to develop self-driving cars and improve the performance of its electric vehicles.
A range of other tasks Data Science is doing
• Healthcare: Data science is being used to develop new drugs and treatments.
• Finance: Data science is being used to detect fraud, predict market trends, and assess risk.
• Retail: Data science is being used to personalize product recommendations, optimize supply chains, and improve customer loyalty.
• Education: Data science is being used to personalize learning, identify students at risk of falling behind, and improve the efficiency of educational systems.
• Environment: Data science is being used to monitor climate change, predict natural disasters, and develop sustainable solutions.
Data Analysis vs. Data Science
• Basic Data Analysis: Excel
• Advanced Data Analysis: Python
• Basic Data Visualization & Reporting: Excel
• Advanced Data Visualization & Reporting: Power BI & Tableau
• Data Collection: Python & SQL Server
• Data Integration: Python & SQL Server
• Data Storage: SQL Server
• Data Processing & Transformation: Python
• AI Algorithm: Python
• Build Model to Predict: Python
• Generate Hypothesis: Python
• Deploy ML Models: Python
• Statistical Patterns & Analysis: Python
Types of Data
• Structured
• Semi-structured
• Unstructured
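The three types can be told apart by how much work it takes to get fields out of them. The sketch below uses made-up sample data: a CSV snippet stands in for structured data, a JSON snippet for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row has the same fields (like a table).
structured = "id,name,balance\n1,Ali,2500\n2,Sara,4000\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but records may differ in shape.
semi_structured = '[{"id": 1, "name": "Ali", "tags": ["vip"]}, {"id": 2, "name": "Sara"}]'
records = json.loads(semi_structured)

# Unstructured: free text with no predefined fields; structure has to be inferred.
unstructured = "Customer Ali called on Monday and asked about his loan of 2500."
numbers_in_text = re.findall(r"\d+", unstructured)

print(rows[1]["name"])          # a column lookup works on every row
print("tags" in records[1])     # the second record simply has no "tags" key
print(numbers_in_text)          # values are pulled out by pattern matching
```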
Data Source Types
• RDBMS: SQL, Oracle, MySQL
• Flat files: XML, CSV, TXT, JSON
• Spreadsheet: Excel, Google Sheets
• XML: Structured and hierarchical data
• APIs: Google Maps, Stock Markets
• Web Scraping: Screen Scraping, Web harvesting, Web data extraction
• RSS: Really Simple Syndication, RSS Streams
• Data Streams: Data from instruments, IoT, GPS, Websites, Social Platforms
• NoSQL Databases: MongoDB, NoSQL, Amazon DynamoDB
Which databases top companies are using
• Twitter: Twitter uses a combination of technologies like MySQL, Manhattan (Twitter’s real-time, multi-tenant distributed database), and Hadoop for data processing and analytics. Their approach to handling massive real-time data and analytics is quite instructive.
• Amazon: Amazon's use of DynamoDB, a NoSQL database service, and Redshift, a data warehousing solution, is key to their operations. Amazon's database strategies offer a great example of e-commerce and cloud computing data management.
• LinkedIn: LinkedIn uses a mix of traditional RDBMS like Oracle, and distributed systems like Apache Kafka for stream processing and VoltDB for real-time analytics. Their approach to handling professional networking data is unique.
• Netflix: Netflix is a prime example of using cloud-based databases efficiently, primarily using Amazon DynamoDB and Cassandra. They also use a variety of other tools for big data analytics and personalization algorithms.
• Spotify: Spotify’s use of Google Cloud Bigtable and Apache Cassandra for its massive music database and user data analytics provides insights into media streaming and user behavior analysis.
• Airbnb: Airbnb utilizes Amazon RDS and MySQL for database needs, showing how tech in the sharing economy works, especially regarding handling global listings and user data.
• Uber: Uber’s use of technologies like MySQL, PostgreSQL, and MongoDB, along with their proprietary database technologies, is integral to their real-time ride-sharing operations.
Data Repositories
• Databases
• Data Warehouses
• Data Marts
• Data Lakes
• NoSQL Databases
Data Marts, Data Lakes, ETL, and Data Pipelines
• Data Warehouse: Analysis-ready, single source of truth
• Data Mart: Subset of the data warehouse, built specifically for a
particular business function, purpose, or community of users; isolated
security, isolated performance
• Data Lake: Pool of raw data, identified by unique identifier and tagged
with metadata; based on use case; retains all source data without
exclusions; holds all types of data sources and types; supports
predictive analytics and advanced analytics
• ETL: Extract, Transform, and Load process
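The "data mart as a subset of the data warehouse" idea can be sketched with SQLite from the Python standard library. This is a simplification: the table and view names are hypothetical, and a real data mart is often a separate physical store with its own security and performance settings rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny stand-in for the warehouse: one table covering every department.
conn.execute("CREATE TABLE warehouse_sales (region TEXT, department TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("North", "Retail", 100.0), ("South", "Retail", 250.0), ("North", "Corporate", 900.0)],
)

# Hypothetical data mart: only the rows and columns the retail team needs.
conn.execute(
    "CREATE VIEW retail_mart AS "
    "SELECT region, amount FROM warehouse_sales WHERE department = 'Retail'"
)

retail_rows = conn.execute("SELECT region, amount FROM retail_mart ORDER BY region").fetchall()
print(retail_rows)
```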
Data Visualization
RDBMS vs. NoSQL
RDBMS
• Based on the relational model introduced by E.F. Codd.
• Characteristics:
• Data Structure: Organizes data into tables (or "relations") of rows and
columns.
• Data Integrity: Ensures accuracy and consistency of data using constraints.
• Data Retrieval: Uses SQL (Structured Query Language) for querying data.
• Primary Key: Unique identifier for a record in a table.
• Foreign Key: A column that refers to the primary key of another table and
is used to establish relationships between tables.
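The characteristics above (tables, constraints, SQL retrieval, primary and foreign keys) can be seen together in a short sketch. It uses SQLite through Python's built-in sqlite3 module as a stand-in for any RDBMS; the customers and orders tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Data structure: two tables of rows and columns, each with a primary key.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
           amount REAL NOT NULL
       )"""
)
conn.execute("INSERT INTO customers VALUES (1, 'Ali')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.5)")

# Data retrieval with SQL: join the two tables through the key relationship.
joined = conn.execute(
    "SELECT c.name, o.amount FROM orders o JOIN customers c ON c.customer_id = o.customer_id"
).fetchall()

# Data integrity: an order pointing at a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (11, 999, 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(joined, orphan_rejected)
```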
RDBMS
• Benefits:
• Data Integrity: Ensures data is accurate and consistent.
• Flexibility: Allows complex queries and operations.
• Reduced Redundancy: Relationships between tables avoid storing the same data more than once.
• Security: Provides mechanisms to restrict unauthorized access.
• Scalability: Can handle large amounts of data efficiently.
• ACID Compliant: Ensures accuracy and reliability.
• Popular RDBMS Examples:
• Oracle
• MySQL
• Microsoft SQL Server
• PostgreSQL
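To make the ACID point concrete, the sketch below shows atomicity: a transfer between two accounts either applies both updates or neither. It uses SQLite via Python's sqlite3 module as a stand-in for the products listed above; the accounts table and the transfer function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "A", "B", 30)       # succeeds
failed = transfer(conn, "A", "B", 500)  # would overdraw A, so the whole transfer is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer, the credit to B has already been executed when the debit from A violates the CHECK constraint; the rollback removes the credit as well, which is the "all or nothing" behaviour that atomicity refers to.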
RDBMS
• Popular Cloud-based RDBMS Examples:
• Amazon RDS
• Google Cloud SQL
• IBM Db2 on Cloud
• Oracle Cloud
• Azure SQL
RDBMS
• Use Cases
• OLTP (Online Transaction Processing)
• OLAP (Online Analytical Processing)
• Limitation
• Does not work well with semi-structured and unstructured data
NoSQL
• Stands for "Not Only SQL".
• Database systems designed to handle large volumes of structured and
unstructured data more effectively than relational databases.
• Runs as a distributed system scaled across multiple data centers
• Why NoSQL?
• Scalability: Designed for large-scale data distribution and horizontal scaling.
• Flexibility: Schema-less data models that can evolve with changing requirements.
• Performance: Optimized for specific types of queries and large data operations.
• A NoSQL database can store structured, semi-structured, or unstructured
data
Common Types of NoSQL Databases
• Key-value store
• Used for: Session, Preferences, Real-time Recommendations, Targeted Ads,
In-memory data Caching
• Not a great fit for: querying data on specific data values, relationships
between data, or needing multiple unique keys
• Some popular key-value stores: Redis, Memcached, DynamoDB
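The key-value model can be pictured as a dictionary with optional expiry, which is why it suits sessions and caching. The class below is a toy, in-memory sketch and not the API of Redis, Memcached, or DynamoDB; its name and methods are invented for illustration.

```python
import time

class KeyValueStore:
    """A toy in-memory key-value store with optional expiry, for illustration only."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds=None):
        expires_at = time.monotonic() + ttl_seconds if ttl_seconds is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # expired entries are removed lazily on read
            return default
        return value

store = KeyValueStore()
store.set("session:42", {"user": "ali", "theme": "dark"}, ttl_seconds=60)
session = store.get("session:42")  # fast lookup by key
missing = store.get("session:99")  # unknown key returns the default
print(session, missing)
```

Note what is absent: there is no way to ask for "all sessions with a dark theme" without scanning every value, which matches the limitation about querying on specific data values.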
Common Types of NoSQL Databases
• Document based
• Stores each record and its associated data as a single document
• Preferred for: eCommerce Platforms, Medical Records Storage, CRM
Platforms, and Analytics Platforms
• Not a great fit for: running complex search queries, performing
multi-operation transactions
• A few popular platforms: MongoDB, DocumentDB, CouchDB, Cloudant
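A document store keeps each record and everything associated with it in one document, commonly JSON-like. The sketch below imitates that with plain Python dictionaries; the collection, field names, and find helper are hypothetical and do not reflect any particular product's API.

```python
import json

# Each customer and everything associated with it lives in one document.
documents = {
    "cust-1": {"name": "Ali", "orders": [{"item": "laptop", "qty": 1}], "notes": "prefers email"},
    "cust-2": {"name": "Sara", "orders": []},  # documents need not share the same fields
}

def find(collection, predicate):
    """Return ids of documents matching a predicate (a full scan in this toy version)."""
    return [doc_id for doc_id, doc in collection.items() if predicate(doc)]

with_orders = find(documents, lambda doc: len(doc["orders"]) > 0)
as_json = json.dumps(documents["cust-1"], sort_keys=True)

print(with_orders)
print(as_json)
```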
Common Types of NoSQL Databases
• Column based
• Data is stored in cells grouped as columns instead of rows
• Great for: systems with heavy write requests, such as storing time-series
data, weather data, and IoT data
• Not a great fit for: running complex queries, changing querying patterns frequently
• Popular Platforms: Cassandra, HBase
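The difference between keeping data by row and keeping it by column can be shown with plain Python structures. This is a simplified picture with made-up sensor readings; real systems such as Cassandra and HBase add column families, partitioning by row key, and distribution across nodes, none of which is modelled here.

```python
# Row-oriented layout: each record is kept together.
row_store = [
    {"sensor": "s1", "ts": 1, "temp": 21.5},
    {"sensor": "s1", "ts": 2, "temp": 21.7},
    {"sensor": "s2", "ts": 1, "temp": 19.0},
]

# Column-oriented layout: values of the same column are kept together.
column_store = {name: [row[name] for row in row_store] for name in row_store[0]}

def append_reading(store, sensor, ts, temp):
    """A write appends one value to the end of each column."""
    store["sensor"].append(sensor)
    store["ts"].append(ts)
    store["temp"].append(temp)

append_reading(column_store, "s2", 2, 19.4)

# Reading one column touches only that column's values.
average_temp = sum(column_store["temp"]) / len(column_store["temp"])
print(column_store["ts"], round(average_temp, 2))
```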
Common Types of NoSQL Databases
• Graph based
• Graphical model to represent data, useful for visualizing, analyzing, and
finding connections between data
• Great for: Working with connected data like Social Networks, Product
Recommendations, Network Diagrams, Fraud Detection, Access Management
• Not great for: processing high volumes of transactions
• Popular Platforms: Neo4j, Cosmos DB
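Connected data is usually explored by walking relationships. The sketch below stores a made-up social network as an adjacency list and shows two typical graph questions: how far apart two people are, and which friends of friends could be suggested. The names and helper functions are hypothetical; a graph database would express these as queries rather than Python loops.

```python
from collections import deque

# Hypothetical social graph as an adjacency list: person -> set of friends.
friends = {
    "Ali": {"Sara", "Omar"},
    "Sara": {"Ali", "Zara"},
    "Omar": {"Ali", "Zara"},
    "Zara": {"Sara", "Omar", "Hina"},
    "Hina": {"Zara"},
}

def degrees_of_separation(graph, start, goal):
    """Breadth-first search: hops on the shortest path, or None if unreachable."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        person, hops = queue.popleft()
        if person == goal:
            return hops
        for neighbour in graph[person] - seen:
            seen.add(neighbour)
            queue.append((neighbour, hops + 1))
    return None

def suggest_friends(graph, person):
    """Friends of friends who are not already friends with the person."""
    candidates = set().union(*(graph[f] for f in graph[person]))
    return sorted(candidates - graph[person] - {person})

print(degrees_of_separation(friends, "Ali", "Hina"))
print(suggest_friends(friends, "Ali"))
```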
Key Differences between RDBMS and NoSQL
RDBMS
• Rigid schema
• Expensive to maintain
• Supports ACID
• Mature and well-documented
NoSQL
• Schema-agnostic
• Designed for low-cost hardware
• Typically not ACID compliant
• Relatively newer technology
ETL
• Extract: Batch processing, i.e., large chunks of data moved from source to
target as a batch; stream processing, i.e., data pulled from a real-time source
• Transform: Standardize Dates, Remove Duplicates, Filter Data, Enrich
Data, Establish Relationships, Apply Business Rules, and Data
Validation
• Load: Initial Loading, Incremental Loading, Full-refresh
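The three steps can be sketched as three small functions. This is a minimal batch example under stated assumptions: the source data, the two date formats (ISO and day/month/year), the rule that rows without a country are invalid, and the customers table are all invented for illustration.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical source extract; dates arrive in two different formats.
SOURCE = """id,signup_date,country
1,2024-01-05,PK
2,05/02/2024,PK
2,05/02/2024,PK
3,2024-03-10,
"""

def extract(text):
    """Extract: read a batch of rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def standardize_date(value):
    """Try each assumed format and return an ISO date string."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

def transform(rows):
    """Transform: remove duplicates, filter invalid rows, standardize dates."""
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen or not row["country"]:
            continue
        seen.add(row["id"])
        out.append((int(row["id"]), standardize_date(row["signup_date"]), row["country"]))
    return out

def load(conn, rows):
    """Load: write rows to the target; re-running replaces rows with the same id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    with conn:
        conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE)))
loaded = conn.execute("SELECT * FROM customers ORDER BY id").fetchall()
print(loaded)
```

Because the load step replaces rows by primary key, running it again with new or changed rows behaves like a simple incremental load, while dropping and recreating the table first would correspond to a full refresh.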
Data Pipeline
• The entire journey of moving data from one system to another
• ETL is a subset of a data pipeline
• The destination is typically a data lake
Understanding Big Data
Big Data
• Velocity
• Volume
• Variety
• Veracity
• Value
Big Data
• Hadoop: Distributed Storage and Processing of Big Data (HDFS)
• Hive: Data Warehouse for Data Query and Analysis
• Spark: Distributed Data Analytics tool to perform Analytics in Real-time
Steps in Data Analysis Process
• Identify the problem
• Collect the data
• Prepare the data
• Clean the data
• Analyze the data
• Interpret the results
• Communicate the results
Key Players in Data Analytics & Data Science field
• Data Engineer
• Data Analyst
• Data Scientist
• Data Architects
• Business Intelligence Analyst
• Machine Learning Engineer
• Data Governance & Ethics
• Business Analyst
Languages of Data Professionals
• Query Languages: DML, DDL, QL, Stored Procedures (SPs)
• Programming Language: R, Python, Java
• Shell Programming Language: Unix / Linux Shell Scripting, PowerShell
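The difference between DDL and DML in the query languages bullet is easiest to see side by side. The sketch below runs a few SQL statements through Python's sqlite3 module; the employees table is hypothetical. Stored procedures are not shown because SQLite does not support them, whereas SQL Server does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL (Data Definition Language): defines or changes structure.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("ALTER TABLE employees ADD COLUMN department TEXT")

# DML (Data Manipulation Language): changes the rows inside that structure.
conn.execute("INSERT INTO employees (name, salary, department) VALUES ('Ali', 1000, 'Finance')")
conn.execute("INSERT INTO employees (name, salary, department) VALUES ('Sara', 1500, 'Risk')")
conn.execute("UPDATE employees SET salary = salary + 100 WHERE department = 'Finance'")
conn.execute("DELETE FROM employees WHERE name = 'Sara'")

# Query: reads data without changing it.
remaining = conn.execute("SELECT name, salary, department FROM employees").fetchall()
print(remaining)
```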
What we have learned in today’s session
• What is Data Analysis?
• Modern Data Ecosystem
• Enterprise Applications
• Types of Enterprise Data Analysis
• Data Analysis vs. Data Analytics
• Enterprise Data Analysis Use Cases
• Real World Data Science Use Cases
• Data Science in Different Industries
• Data Analysis Process
• Types of Data
• Data Source Types
• Data Repositories
• Data Visualization
• RDBMS vs. NoSQL Repositories
• ETL & Data Pipeline
• Big Data
• Key Players in Data Analytics / Science
Thank you!