Introduction to Data and Big Data

JAIDITYA
• Data are bits of information collected to inform decisions and analytics. Modern organizations
capture data continuously from digital sources. Ensuring data is accurate and complete during
collection is crucial; otherwise, “analysis won’t be accurate”.

• Big Data refers to datasets so large and complex that traditional tools struggle to process them. It is
commonly characterized by the “3Vs”: Volume (sheer size of data), Velocity (speed of generation),
and Variety (heterogeneous types of data). Emerging Big Data technologies help firms extract
insights from high-volume streaming or batch data.

Overview of Using Data

• Organizations rely on data-driven decision making across industries. For example, healthcare
providers analyze patient records for treatment planning, financial firms monitor transactions to
detect fraud, and IT companies track system logs for performance analytics. Collecting relevant data
allows businesses to “analyze past strategies and stay informed on what needs to change”.

• High-quality data allows personalized services and predictive analytics (e.g. recommending
products to customers based on purchase history). In healthcare, analyzing electronic health
records and medical images can improve diagnoses; in finance, analyzing market and customer data
supports better risk management and investment strategies.

• Data-driven insights help organizations adapt to customer needs and market trends. However, decisions based on
inaccurate or incomplete data can mislead an organization, so robust collection and management
practices are essential.

Types of Data (Structured / Unstructured)

• Structured Data: Rigid format with a fixed schema (tables with rows and columns). Easy for
computers and users to query using SQL. Examples include relational databases of customer
information (names, transaction amounts, dates) or spreadsheets. For instance, a bank’s transaction
report with customer IDs and balances in columns is structured.

• Unstructured Data: No fixed schema or format; often text, images, audio, or video. Examples
include doctors’ notes, social media posts, or video files. These data require specialized processing
(e.g. NLP or image analysis) because “Unstructured data has no fixed schema and can have complex
formats”. Many Big Data applications (like processing social media or sensor streams) deal heavily
with unstructured inputs.

• Semi-Structured Data: In between; no fixed schema but contains markers (metadata) to separate
elements. Common formats include JSON, XML, or CSV files. These often serve as intermediaries
(e.g. web APIs output JSON). Such data “uses metadata (tags) to identify specific data
characteristics”. An email is a semi-structured example: it has structured headers (sender, subject)
and unstructured body text.
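
The contrast is easy to see in code. Below is a minimal Python sketch that parses a made-up semi-structured JSON record (as a web API might return it) and flattens it into a structured row; the field names (customer_id, transaction, notes) are illustrative, not from any particular system.

```python
import json

# A hypothetical semi-structured record: tags (keys) describe each value,
# but there is no fixed relational schema.
raw = '''
{
  "customer_id": "C102",
  "transaction": {"amount": 250.75, "currency": "USD", "date": "2024-05-01"},
  "notes": "Customer asked about the loyalty programme."
}
'''

record = json.loads(raw)  # parse the JSON text into Python dictionaries

# Flatten the nested structure into a fixed set of columns, i.e. a
# structured row that could be inserted into a relational table.
row = {
    "customer_id": record["customer_id"],
    "amount": record["transaction"]["amount"],
    "currency": record["transaction"]["currency"],
    "date": record["transaction"]["date"],
}
print(row)

# The free-text "notes" field stays unstructured; it would need text
# processing (e.g. NLP) rather than a simple column lookup.
print(record["notes"])
```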

Types of Data (Qualitative / Quantitative)

• Quantitative Data: Numerical values that can be measured. Examples: patient blood pressure
readings, stock prices, or sales figures. Machine learning algorithms and charts (e.g. histograms) are often
used to analyze quantitative data and find trends.
• Qualitative (Categorical) Data: Descriptive or categorical values. Examples: patient gender,
customer survey feedback, or transaction categories. Categorical data often require grouping or
encoding before analysis.

• Real-world datasets often mix both. For example, healthcare records might include quantitative lab
results and qualitative symptom descriptions. Structured data can include both quantitative fields (like
revenue figures) and qualitative fields (dates, names). Understanding each type helps choose proper
analysis techniques.
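
A short Python sketch of the distinction, using a hypothetical patient table: quantitative columns can be summarized numerically as-is, while qualitative columns are typically grouped or one-hot encoded before modelling.

```python
import pandas as pd

# Hypothetical mixed dataset: quantitative lab values plus qualitative fields.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "blood_pressure": [120, 135, 118, 142],   # quantitative (numeric)
    "gender": ["F", "M", "F", "M"],           # qualitative (categorical)
    "symptom": ["headache", "fatigue", "headache", "dizziness"],
})

# Quantitative columns can be summarised directly.
print(df["blood_pressure"].describe())

# Qualitative columns are usually grouped or encoded before modelling.
print(df.groupby("gender")["blood_pressure"].mean())          # grouping
encoded = pd.get_dummies(df, columns=["gender", "symptom"])   # one-hot encoding
print(encoded.head())
```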

Data Collection

• Definition: Data collection is the systematic process of gathering information about subjects of
interest. It can be manual (surveys, interviews) or automated (sensor logs, transactions). It’s “crucial
to ensure data is complete during the collection phase”, because incomplete or illegal data
gathering undermines the validity of the analysis.

• Sources of Data:

o Transactions & Logs: Point-of-sale records, banking transactions, web server logs (common
in finance/IT).

o Sensors & IoT: Wearable health monitors, environmental sensors, RFID scanners (common in
healthcare and manufacturing).

o Surveys & Forms: Patient questionnaires, market research surveys.

o Social Media and Public Data: Tweets, posts, public datasets.

• Best Practices: Define clear objectives and methods before collecting. For example, if predicting
patient outcomes, identify which clinical measurements to gather and how (e.g. regular vitals from
monitoring devices). Good practices include ensuring ethical consent and data privacy. Accurate,
relevant data collection lays the foundation for all downstream analysis.
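
As a rough illustration of automated collection, the Python sketch below appends time-stamped vitals readings to a CSV log; read_vitals and the device name monitor-01 are placeholders for a real monitoring-device API, not part of any actual system.

```python
import csv
import random
import time
from datetime import datetime, timezone

def read_vitals():
    """Placeholder for a real monitoring-device API; returns simulated vitals."""
    return {"heart_rate": random.randint(55, 100), "spo2": random.randint(94, 100)}

# Append time-stamped readings so every record carries the context
# (when, from which device) needed for later validation.
with open("vitals_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for _ in range(5):                               # collect five samples
        vitals = read_vitals()
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),  # collection timestamp
            "monitor-01",                            # device identifier
            vitals["heart_rate"],
            vitals["spo2"],
        ])
        time.sleep(1)                                # fixed sampling interval
```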

Data Management

• Definition: Data management is the practice of ingesting, storing, organizing, and maintaining data
so it can be accessed and used. This covers database administration, data governance (policies and
standards), backup, security, and monitoring data quality.

• Key Activities:

o Storage: Choosing appropriate systems (relational databases, NoSQL, data lakes) for
different data types.

o Organization: Structuring data through schemas, catalogs, and metadata so it is searchable
and consistent.

o Governance: Policies for data access, privacy, and compliance (especially critical in
healthcare/finance).

o Lifecycle Management: Archiving or deleting old data according to retention rules.

• Importance: Effective data management ensures information is accurate, available to authorized
users, and secure, enabling reliable business operations and decision-making. For example, in a
hospital, managing patient records with clear standards prevents duplication or loss; in finance,
managing trade data with audit trails supports regulatory compliance. In short, good data
management “helps drive decision-making” by delivering trusted information.

Big Data: Volume & Velocity

• Volume: Refers to the massive scale of data. Big Data deals with terabytes to petabytes (or more)
of information. Modern businesses generate data at unprecedented rates (e.g. a social network’s
daily posts, or IoT sensor streams). For instance, billions of credit card transactions or sensor
readings may need storage. The key challenge is storing and processing such large volumes
cost-effectively.

• Velocity: Refers to the speed of data generation and processing. Data may arrive in real-time or
near-real-time. Big Data technologies must handle rapid input (e.g. streaming logs, live health
monitor feeds) and often require real-time analysis or alerts. If data streams in faster than it is
processed, valuable insights may be lost. Thus, systems need to be engineered for high-throughput
and low-latency data handling.

• Example: Online retail websites see thousands of customer interactions per minute. Capturing
clickstreams and purchases at high velocity allows real-time personalization. In finance, stock
exchange data ticks at sub-second intervals. Efficiently handling volume and velocity enables
applications like fraud detection and live system monitoring.
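
One common way to cope with velocity is to process events in small batches and keep only rolling aggregates rather than every raw record. The Python sketch below simulates this with a made-up clickstream generator; a production system would read from a real message queue or stream processor instead.

```python
import random
from collections import deque

def click_stream():
    """Simulated high-velocity event source (stand-in for a real message queue)."""
    while True:
        yield {"user": random.randint(1, 100), "amount": random.random() * 50}

window = deque(maxlen=1000)          # keep only the most recent events
events = click_stream()

for _ in range(5):                   # process five micro-batches
    batch = [next(events) for _ in range(200)]   # drain a small batch
    window.extend(batch)
    # Compute a rolling metric instead of storing every raw event,
    # which keeps processing ahead of the arrival rate.
    avg = sum(e["amount"] for e in window) / len(window)
    print(f"rolling average order value: {avg:.2f}")
```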

Big Data: Variety, Veracity, & Value

• Variety: Big Data comes in diverse formats and sources: structured tables, unstructured text,
images, videos, sensor readings, etc. Modern analytics must integrate all these. For example, a
smart city project might combine traffic sensor data (time-series), weather reports (structured), and
social media posts (text). Variety adds complexity: “data is heterogeneous, meaning it can come
from many different sources and can be structured, unstructured, or semi-structured”.

• Veracity: Data accuracy and quality. Big datasets often contain errors, noise, or incomplete entries.
“Big data can be messy, noisy, and error-prone,” which makes quality control difficult. Ensuring
veracity means rigorous cleaning and validation. For instance, inconsistent patient records or
duplicate customer entries must be resolved. Higher veracity (trustworthiness) of data leads to
more reliable analytics.

• Value: The ultimate goal is actionable insight. Big Data technologies must help extract useful
information. Not all collected data has equal value, so organizations must identify and analyze the
most relevant data. High-value insights might come from combining previously siloed datasets.
Example: a bank might merge account history with social media sentiment to improve credit
scoring. Capturing big data without finding its value yields no benefit.

• Summary: Altogether, Big Data’s “Vs” highlight challenges: we must store huge volumes, process
them quickly, handle diverse formats, ensure data quality, and extract meaningful results. When
managed well, Big Data analytics can uncover new opportunities and efficiencies across healthcare,
finance, IT, etc.
Data Quality
AMOGH
• Definition: Data quality refers to how well data is “fit for use” in its intended context. Key
dimensions include accuracy (correctness relative to real-world), completeness (no missing values
in critical fields), consistency (uniform format across datasets), timeliness (up-to-date), and validity
(meeting business rules).

• Why It Matters: High-quality data leads to reliable analyses and decisions; low-quality (dirty) data
leads to flawed outcomes. As one expert notes, “bad data almost certainly guarantees flawed
conclusions and misguided decisions”. In healthcare, using inaccurate patient data can result in
wrong diagnoses; in finance, errors in transaction data can skew risk models. Even a small error rate
can have large effects (nearly half of new data records may contain errors).

• Dimensions Examples:

o Accuracy: Data fields should match reality (e.g. correct patient address).

o Completeness: All necessary data points are present (e.g. lab results filled for every patient
record).

o Consistency: Data formatted uniformly (dates in one format, names spelled consistently).

o Timeliness: Data is updated as needed (e.g. up-to-date stock prices).

• Industry Focus: Regulated fields especially demand quality. For example, patient data in healthcare
must be “complete, accurate, and available when required”, else treatment plans suffer. Finance
similarly requires precise data for compliance and reporting. Ensuring data quality often involves
cleansing steps (detecting duplicates, validating entries) during data preparation.
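
A few of these dimensions can be checked with very little code. The Python sketch below, using pandas and a made-up patient extract, measures completeness (missing values), consistency (date format), and uniqueness (duplicate identifiers).

```python
import pandas as pd

# Hypothetical patient extract used to illustrate simple quality checks.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "lab_result": [5.4, None, 7.1, 6.0],
    "visit_date": ["2024-01-03", "2024/01/05", "2024-01-05", "2024-01-09"],
})

# Completeness: share of missing values per column.
print(df.isna().mean())

# Consistency: do all dates follow the same ISO format (YYYY-MM-DD)?
iso_ok = df["visit_date"].str.match(r"\d{4}-\d{2}-\d{2}")
print(df.loc[~iso_ok, "visit_date"])          # rows needing reformatting

# Uniqueness: duplicate patient identifiers that must be resolved.
print(df[df.duplicated(subset="patient_id", keep=False)])
```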

Missing or Incomplete Data

• Definition: Missing data refers to blank or null fields in a dataset. It happens when values are not
recorded or lost (e.g. patient skips a survey question, or a sensor malfunctions). YData summarizes:
“Missing Data refers to the absence of certain values in observations or features within a dataset”.

• Impacts: Gaps in data can bias analysis and weaken models. Many algorithms cannot handle nulls,
and simply dropping rows with missing values can reduce sample size and skew results. For
instance, if high-risk patient records disproportionately lack data, a model may underpredict
complications. Unaddressed missing data makes results unreliable.

• Handling Strategies: Common approaches include:

o Case Deletion: Remove records (rows) with missing fields if they are few. This is simple but
can lose information and introduce bias.

o Imputation: Fill missing values with estimated ones (mean/median, regression prediction, or
model-based methods like KNN). This retains data but introduces uncertainty.

o Model-Based Methods: Use statistical algorithms (e.g. Expectation-Maximization) to infer
likely values.

o Use of Algorithms That Tolerate Nulls: Some tree-based models can work around missing
entries without explicit imputation.
• Best Practice: Carefully choose a strategy based on context. For example, if only 1% of entries are
missing at random, simple deletion might suffice. If many values are missing in a non-random way,
imputation or more advanced methods are needed. Always document how missing data were
handled to assess any bias introduced.
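
To make the trade-offs concrete, the Python sketch below applies case deletion, mean imputation, and KNN-based imputation to a small made-up table; it assumes pandas and scikit-learn are available.

```python
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical table with gaps in both columns.
df = pd.DataFrame({
    "age": [34, 51, None, 45, 29],
    "blood_pressure": [120, None, 135, 142, None],
})

# Case deletion: simple, but discards partially complete records.
dropped = df.dropna()

# Mean imputation: fills each gap with the column average.
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# Model-based imputation: estimates each gap from similar rows (KNN).
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(dropped.shape)                       # fewer rows survive deletion
print(mean_filled.isna().sum().sum())      # 0 remaining gaps
print(knn_filled.isna().sum().sum())       # 0 remaining gaps
```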

Data Profiling

• Definition: Data profiling is the process of examining and summarizing source data to understand
its structure, content, and quality. It is usually done early in a project to discover what the data
contains before deep analysis.

• Key Activities:

o Descriptive Statistics: Compute min, max, average, median, frequency counts, number of
nulls, etc. for each field. For example, finding the percentage of missing values or outliers in a
column.

o Pattern Analysis: Detect patterns or distributions (e.g. are phone numbers consistently
formatted?).

o Anomaly Detection: Identify inconsistent formats or invalid values.

o Relationship Discovery: Find possible keys (unique identifiers) or foreign key candidates
between tables.

• Purpose: Profiling reveals data issues before heavy processing. It can uncover errors (e.g. duplicate
records, invalid ranges) and help define cleansing rules for ETL. According to one guide, profiling is
crucial in data warehousing and BI projects because it “uncover[s] data quality issues in data
sources, and what needs to be corrected in ETL”. In migration projects, it may reveal new
requirements for the target system.

• Examples: A finance team profiling customer data might find that 10% of account records have
missing emails, or that ZIP codes are sometimes 4-digit instead of 5. Knowing this helps plan data
cleaning steps.
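
A basic profiling pass can be done with pandas alone. The sketch below assumes a hypothetical customers.csv file with a zip_code column, and reports per-column statistics, missing-value rates, a simple format check, and unique-key candidates.

```python
import pandas as pd

df = pd.read_csv("customers.csv")            # hypothetical source file

# Descriptive statistics per column (min, max, mean, quartiles, counts).
print(df.describe(include="all"))

# Completeness: percentage of missing values in each field.
print((df.isna().mean() * 100).round(1))

# Pattern analysis: how many ZIP codes are not exactly 5 digits?
if "zip_code" in df.columns:
    bad_zips = ~df["zip_code"].astype(str).str.fullmatch(r"\d{5}")
    print(f"non-5-digit ZIP codes: {bad_zips.sum()}")

# Relationship discovery: which columns could serve as a unique key?
print(df.nunique() == len(df))
```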

Data Preparation

• Cleaning: Remove or correct erroneous data. This includes de-duplicating records, fixing
inconsistent formats, and resolving obvious errors (e.g., a height field entered as 600 meters instead
of 6.00). It also involves handling missing data as discussed earlier.

• Transformation: Convert data into a suitable format. Examples: normalizing numeric scales (e.g. z-
scores), encoding categorical variables as numbers, deriving new features (like BMI from
weight/height). Transformations prepare data for analysis or machine learning.

• Integration: Combine data from different sources into a unified form. For instance, joining customer
demographics with purchase history from another system. This often involves aligning field names,
units, and schemas.

• Goal: After preparation, data should be consistent, accurate, and analysis-ready. Well-prepared
data feeds into analytics or ML models smoothly. For example, before training a model, a team
might scale all financial metrics to thousands of dollars and encode categorical fields (e.g.
“male/female” to 0/1).
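
A compact Python sketch of these three steps, using two made-up source tables: integration via a join on customer_id, cleaning via de-duplication, and transformation via scaling and simple categorical encoding (the male/female mapping mirrors the example above and is purely illustrative).

```python
import pandas as pd

# Hypothetical raw extracts from two different source systems.
demographics = pd.DataFrame({"customer_id": [1, 2, 3],
                             "gender": ["male", "female", "female"]})
purchases = pd.DataFrame({"customer_id": [1, 2, 3],
                          "revenue": [1200.0, 860.0, 450.0]})

# Integration: join the sources on a shared key.
df = demographics.merge(purchases, on="customer_id")

# Cleaning: drop duplicate records.
df = df.drop_duplicates()

# Transformation: scale revenue to thousands and encode the categorical field.
df["revenue_k"] = df["revenue"] / 1000
df["gender_code"] = df["gender"].map({"male": 0, "female": 1})

print(df)
```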

Data Exploration (EDA)

• Definition: Exploratory Data Analysis (EDA) involves visually and statistically examining a prepared
dataset to uncover patterns, anomalies, or relationships. It is an open-ended, iterative process to
understand what the data can reveal.

• Purpose: EDA helps formulate hypotheses and guides modeling. Analysts look at distributions of
each variable, relationships between variables (using scatter plots, correlation matrices), and
identify outliers or trends. According to IBM, EDA is used to “summarize [data’s] main
characteristics, often employing data visualization methods”.

• Common Techniques:

o Univariate analysis: histograms or boxplots of single variables to check distribution, skew,
and outliers.

o Bivariate/multivariate analysis: scatterplots, heatmaps, or pivot tables to see how pairs of
variables relate.

o Summary statistics: mean, median, standard deviation, quartiles.

o Dimensionality reduction (like PCA) and clustering to visualize high-dimensional data.

• Outcome: Insights from EDA inform which models or transformations to apply. For example, seeing
that two variables are strongly correlated might suggest combining them or dropping one.
Anomalies detected (like a strangely high salary entry) might indicate data errors to correct. In
short, EDA lets data scientists check assumptions and ensure models will be appropriate.
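
A minimal EDA pass in Python might look like the sketch below; the file prepared_data.csv and the salary column are hypothetical, and pandas plus matplotlib are assumed.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("prepared_data.csv")        # hypothetical prepared dataset

# Univariate: summary statistics and histograms of each numeric column.
print(df.describe())
df.hist(figsize=(10, 6))

# Bivariate/multivariate: correlations between numeric variables.
print(df.select_dtypes("number").corr())

# Flag suspicious outliers, e.g. salaries far above the rest of the data.
if "salary" in df.columns:
    threshold = df["salary"].mean() + 3 * df["salary"].std()
    print(df[df["salary"] > threshold])

plt.show()
```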

ETL (Extract, Transform, Load)

• Concept: ETL is a three-step process to gather data for analysis. First, Extract raw data from various
source systems (databases, APIs, logs). Second, Transform the data by cleaning, converting formats,
aggregating, and applying business rules. Third, Load the processed data into a target repository
(often a data warehouse or data lake).

• Purpose: To consolidate and prepare data for business intelligence, reporting, or machine learning.
ETL ensures data from disparate sources is combined into a consistent, usable form. For example, an
ETL job might pull transactional data from a sales system (extract), adjust currency units and drop
irrelevant fields (transform), and then load the refined data into a central reporting database (load).

• Key Features:

o Uses business rules during Transform to standardize data (e.g. calculating total order value,
formatting dates).

o Often scheduled in batches or triggered by events.

o Handles both structured and semi-structured data sources; big data ETL tools can process
large volumes in distributed systems.
• Example: An online retailer extracts customer orders and website clickstream logs, transforms them
by joining on customer ID and summarizing daily behavior, then loads the result into a warehouse
for analysis. This allows analysts to run queries like “What is the average order value for customers
who viewed a product page twice before buying?”

• Benefits: ETL pipelines make data readily available for analytics and BI. They can “combine data
from multiple sources into a large, central repository called a data warehouse”. For instance, by
ETLing POS (point-of-sale) data, an analysis might forecast demand and optimize inventory.
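
Below is a minimal ETL sketch in Python. The source file orders.csv, its amount_eur and internal_flag columns, the fixed exchange rate, and the SQLite file standing in for the warehouse are all assumptions for illustration; real pipelines typically use dedicated ETL tooling and scheduling.

```python
import sqlite3

import pandas as pd

# Extract: pull raw order data from a source system (a CSV stands in here).
orders = pd.read_csv("orders.csv")                     # hypothetical extract

# Transform: apply business rules -- convert currency, drop unused fields,
# and summarise daily spend per customer.
orders["amount_usd"] = orders["amount_eur"] * 1.08     # assumed fixed rate
daily = (orders
         .drop(columns=["internal_flag"])
         .groupby(["customer_id", "order_date"], as_index=False)["amount_usd"]
         .sum())

# Load: write the refined table into the reporting database (data warehouse).
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_orders", conn, if_exists="replace", index=False)
```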

Data Warehousing & Summary

• Data Warehouse: A data warehouse is a centralized repository designed for analysis and reporting.
It stores integrated data (often historical and aggregated) from multiple operational sources. Data in
a warehouse is typically organized into subject-oriented schemas (e.g. tables for sales, customers,
time). For example, a hospital data warehouse might combine patient demographics, treatments,
and billing into one system for analytics.

• Data warehouse software is optimized for read queries and can span various storage hardware to
handle large datasets. Unlike transactional databases (OLTP), warehouses support OLAP (Online
Analytical Processing), enabling complex queries and dashboards.

• Use Cases: Analysts use data warehouses for business intelligence, trend analysis, and reporting. In
finance, a data warehouse might hold all trade and account data over years to allow querying risk
exposures. In IT, a warehouse of log data can help spot system-wide patterns.
