Introduction to Data and Big Data

JAIDITYA
• Data are bits of information collected to inform decisions and analytics. Modern organizations
capture data continuously from digital sources. Ensuring data is accurate and complete during
collection is crucial; otherwise, “analysis won’t be accurate”.

• Big Data refers to datasets so large and complex that traditional tools struggle to process them. It is
commonly characterized by the “3Vs”: Volume (sheer size of data), Velocity (speed of generation),
and Variety (heterogeneous types of data). Emerging Big Data technologies help firms extract
insights from high-volume streaming or batch data.

Overview of Using Data

• Organizations rely on data-driven decision making across industries. For example, healthcare
providers analyze patient records for treatment planning, financial firms monitor transactions to
detect fraud, and IT companies track system logs for performance analytics. Collecting relevant data
allows businesses to “analyze past strategies and stay informed on what needs to change”.

• High-quality data allows personalized services and predictive analytics (e.g. recommending
products to customers based on purchase history). In healthcare, analyzing electronic health
records and medical images can improve diagnoses; in finance, analyzing market and customer data
supports better risk management and investment strategies.

• Data-driven insights help organizations adapt to customer needs and market trends. However, decisions based on
inaccurate or incomplete data can mislead an organization, so robust collection and management
practices are essential.

Types of Data (Structured / Unstructured)

• Structured Data: Rigid format with a fixed schema (tables with rows and columns). Easy for
computers and users to query using SQL. Examples include relational databases of customer
information (names, transaction amounts, dates) or spreadsheets. For instance, a bank’s transaction
report with customer IDs and balances in columns is structured.

• Unstructured Data: No fixed schema or format; often text, images, audio, or video. Examples
include doctors’ notes, social media posts, or video files. These data require specialized processing
(e.g. NLP or image analysis) because “Unstructured data has no fixed schema and can have complex
formats”. Many Big Data applications (like processing social media or sensor streams) deal heavily
with unstructured inputs.

• Semi-Structured Data: In between; no fixed schema but contains markers (metadata) to separate
elements. Common formats include JSON, XML, or CSV files. These often serve as intermediaries
(e.g. web APIs output JSON). Such data “uses metadata (tags) to identify specific data
characteristics”. An email is a semi-structured example: it has structured headers (sender, subject)
and unstructured body text.
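
The contrast is easy to see in code. Below is a minimal Python sketch that parses a made-up semi-structured JSON record (as a web API might return it) and flattens it into a structured row; the field names (customer_id, transaction, notes) are illustrative, not from any particular system.

```python
import json

# A hypothetical semi-structured record: tags (keys) describe each value,
# but there is no fixed relational schema.
raw = '''
{
  "customer_id": "C102",
  "transaction": {"amount": 250.75, "currency": "USD", "date": "2024-05-01"},
  "notes": "Customer asked about the loyalty programme."
}
'''

record = json.loads(raw)  # parse the JSON text into Python dictionaries

# Flatten the nested structure into a fixed set of columns, i.e. a
# structured row that could be inserted into a relational table.
row = {
    "customer_id": record["customer_id"],
    "amount": record["transaction"]["amount"],
    "currency": record["transaction"]["currency"],
    "date": record["transaction"]["date"],
}
print(row)

# The free-text "notes" field stays unstructured; it would need text
# processing (e.g. NLP) rather than a simple column lookup.
print(record["notes"])
```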

Types of Data (Qualitative / Quantitative)

• Quantitative Data: Numerical values that can be measured. Examples: patient blood pressure
readings, stock prices, or sales figures. Machine learning algorithms and charts (e.g. histograms) are often
used to analyze quantitative data and find trends.
• Qualitative (Categorical) Data: Descriptive or categorical values. Examples: patient gender,
customer survey feedback, or transaction categories. Categorical data often require grouping or
encoding before analysis.

• Real-world datasets often mix both. For example, healthcare records might include quantitative lab
results and qualitative symptom descriptions. Structured data can include both quantitative fields (like
revenue figures) and qualitative fields (dates, names). Understanding each type helps choose proper
analysis techniques.
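
A short Python sketch of the distinction, using a hypothetical patient table: quantitative columns can be summarized numerically as-is, while qualitative columns are typically grouped or one-hot encoded before modelling.

```python
import pandas as pd

# Hypothetical mixed dataset: quantitative lab values plus qualitative fields.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "blood_pressure": [120, 135, 118, 142],   # quantitative (numeric)
    "gender": ["F", "M", "F", "M"],           # qualitative (categorical)
    "symptom": ["headache", "fatigue", "headache", "dizziness"],
})

# Quantitative columns can be summarised directly.
print(df["blood_pressure"].describe())

# Qualitative columns are usually grouped or encoded before modelling.
print(df.groupby("gender")["blood_pressure"].mean())          # grouping
encoded = pd.get_dummies(df, columns=["gender", "symptom"])   # one-hot encoding
print(encoded.head())
```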

Data Collection

• Definition: Data collection is the systematic process of gathering information about subjects of
interest. It can be manual (surveys, interviews) or automated (sensor logs, transactions). It’s “crucial
to ensure data is complete during the collection phase”, because incomplete or illegal data
gathering undermines the validity of the analysis.

• Sources of Data:

o Transactions & Logs: Point-of-sale records, banking transactions, web server logs (common
in finance/IT).

o Sensors & IoT: Wearable health monitors, environmental sensors, RFID scanners (common in
healthcare and manufacturing).

o Surveys & Forms: Patient questionnaires, market research surveys.

o Social Media and Public Data: Tweets, posts, public datasets.

• Best Practices: Define clear objectives and methods before collecting. For example, if predicting
patient outcomes, identify which clinical measurements to gather and how (e.g. regular vitals from
monitoring devices). Good practices include ensuring ethical consent and data privacy. Accurate,
relevant data collection lays the foundation for all downstream analysis.
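
As a rough illustration of automated collection, the Python sketch below appends time-stamped vitals readings to a CSV log; read_vitals and the device name monitor-01 are placeholders for a real monitoring-device API, not part of any actual system.

```python
import csv
import random
import time
from datetime import datetime, timezone

def read_vitals():
    """Placeholder for a real monitoring-device API; returns simulated vitals."""
    return {"heart_rate": random.randint(55, 100), "spo2": random.randint(94, 100)}

# Append time-stamped readings so every record carries the context
# (when, from which device) needed for later validation.
with open("vitals_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for _ in range(5):                               # collect five samples
        vitals = read_vitals()
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),  # collection timestamp
            "monitor-01",                            # device identifier
            vitals["heart_rate"],
            vitals["spo2"],
        ])
        time.sleep(1)                                # fixed sampling interval
```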

Data Management

• Definition: Data management is the practice of ingesting, storing, organizing, and maintaining data
so it can be accessed and used. This covers database administration, data governance (policies and
standards), backup, security, and monitoring data quality.

• Key Activities:

o Storage: Choosing appropriate systems (relational databases, NoSQL, data lakes) for
different data types.

o Organization: Structuring data through schemas, catalogs, and metadata so it is searchable
and consistent.

o Governance: Policies for data access, privacy, and compliance (especially critical in
healthcare/finance).

o Lifecycle Management: Archiving or deleting old data according to retention rules.

• Importance: Effective data management ensures information is accurate, available to authorized
users, and secure, enabling reliable business operations and decision-making. For example, in a
hospital, managing patient records with clear standards prevents duplication or loss; in finance,
managing trade data with audit trails supports regulatory compliance. In short, good data
management “helps drive decision-making” by delivering trusted information.

Big Data: Volume & Velocity

• Volume: Refers to the massive scale of data. Big Data deals with terabytes to petabytes (or more)
of information. Modern businesses generate data at unprecedented rates (e.g. a social network’s
daily posts, or IoT sensor streams). For instance, billions of credit card transactions or sensor
readings may need storage. The key challenge is storing and processing such large volumes
cost-effectively.

• Velocity: Refers to the speed of data generation and processing. Data may arrive in real-time or
near-real-time. Big Data technologies must handle rapid input (e.g. streaming logs, live health
monitor feeds) and often require real-time analysis or alerts. If data streams in faster than it is
processed, valuable insights may be lost. Thus, systems need to be engineered for high-throughput
and low-latency data handling.

• Example: Online retail websites see thousands of customer interactions per minute. Capturing
clickstreams and purchases at high velocity allows real-time personalization. In finance, stock
exchange data ticks at sub-second intervals. Efficiently handling volume and velocity enables
applications like fraud detection and live system monitoring.
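
One common way to cope with velocity is to process events in small batches and keep only rolling aggregates rather than every raw record. The Python sketch below simulates this with a made-up clickstream generator; a production system would read from a real message queue or stream processor instead.

```python
import random
from collections import deque

def click_stream():
    """Simulated high-velocity event source (stand-in for a real message queue)."""
    while True:
        yield {"user": random.randint(1, 100), "amount": random.random() * 50}

window = deque(maxlen=1000)          # keep only the most recent events
events = click_stream()

for _ in range(5):                   # process five micro-batches
    batch = [next(events) for _ in range(200)]   # drain a small batch
    window.extend(batch)
    # Compute a rolling metric instead of storing every raw event,
    # which keeps processing ahead of the arrival rate.
    avg = sum(e["amount"] for e in window) / len(window)
    print(f"rolling average order value: {avg:.2f}")
```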

Big Data: Variety, Veracity, & Value

• Variety: Big Data comes in diverse formats and sources: structured tables, unstructured text,
images, videos, sensor readings, etc. Modern analytics must integrate all these. For example, a
smart city project might combine traffic sensor data (time-series), weather reports (structured), and
social media posts (text). Variety adds complexity: “data is heterogeneous, meaning it can come
from many different sources and can be structured, unstructured, or semi-structured”.

• Veracity: Data accuracy and quality. Big datasets often contain errors, noise, or incomplete entries.
“Big data can be messy, noisy, and error-prone,” which makes quality control difficult. Ensuring
veracity means rigorous cleaning and validation. For instance, inconsistent patient records or
duplicate customer entries must be resolved. Higher veracity (trustworthiness) of data leads to
more reliable analytics.

• Value: The ultimate goal is actionable insight. Big Data technologies must help extract useful
information. Not all collected data has equal value, so organizations must identify and analyze the
most relevant data. High-value insights might come from combining previously siloed datasets.
Example: a bank might merge account history with social media sentiment to improve credit
scoring. Capturing big data without finding its value yields no benefit.

• Summary: Altogether, Big Data’s “Vs” highlight challenges: we must store huge volumes, process
them quickly, handle diverse formats, ensure data quality, and extract meaningful results. When
managed well, Big Data analytics can uncover new opportunities and efficiencies across healthcare,
finance, IT, etc.
Data Quality
AMOGH
• Definition: Data quality refers to how well data is “fit for use” in its intended context. Key
dimensions include accuracy (correctness relative to real-world), completeness (no missing values
in critical fields), consistency (uniform format across datasets), timeliness (up-to-date), and validity
(meeting business rules).

• Why It Matters: High-quality data leads to reliable analyses and decisions; low-quality (dirty) data
leads to flawed outcomes. As one expert notes, “bad data almost certainly guarantees flawed
conclusions and misguided decisions”. In healthcare, using inaccurate patient data can result in
wrong diagnoses; in finance, errors in transaction data can skew risk models. Even a small error rate
can have large effects (nearly half of new data records may contain errors).

• Dimensions Examples:

o Accuracy: Data fields should match reality (e.g. correct patient address).

o Completeness: All necessary data points are present (e.g. lab results filled for every patient
record).

o Consistency: Data formatted uniformly (dates in one format, names spelled consistently).

o Timeliness: Data is updated as needed (e.g. up-to-date stock prices).

• Industry Focus: Regulated fields especially demand quality. For example, patient data in healthcare
must be “complete, accurate, and available when required”, else treatment plans suffer. Finance
similarly requires precise data for compliance and reporting. Ensuring data quality often involves
cleansing steps (detecting duplicates, validating entries) during data preparation.
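
A few of these dimensions can be checked with very little code. The Python sketch below, using pandas and a made-up patient extract, measures completeness (missing values), consistency (date format), and uniqueness (duplicate identifiers).

```python
import pandas as pd

# Hypothetical patient extract used to illustrate simple quality checks.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "lab_result": [5.4, None, 7.1, 6.0],
    "visit_date": ["2024-01-03", "2024/01/05", "2024-01-05", "2024-01-09"],
})

# Completeness: share of missing values per column.
print(df.isna().mean())

# Consistency: do all dates follow the same ISO format (YYYY-MM-DD)?
iso_ok = df["visit_date"].str.match(r"\d{4}-\d{2}-\d{2}")
print(df.loc[~iso_ok, "visit_date"])          # rows needing reformatting

# Uniqueness: duplicate patient identifiers that must be resolved.
print(df[df.duplicated(subset="patient_id", keep=False)])
```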

Missing or Incomplete Data

• Definition: Missing data refers to blank or null fields in a dataset. It happens when values are not
recorded or lost (e.g. patient skips a survey question, or a sensor malfunctions). YData summarizes:
“Missing Data refers to the absence of certain values in observations or features within a dataset”.

• Impacts: Gaps in data can bias analysis and weaken models. Many algorithms cannot handle nulls,
and simply dropping rows with missing values can reduce sample size and skew results. For
instance, if high-risk patient records disproportionately lack data, a model may underpredict
complications. Unaddressed missing data makes results unreliable.

• Handling Strategies: Common approaches include:

o Case Deletion: Remove records (rows) with missing fields if they are few. This is simple but
can lose information and introduce bias.

o Imputation: Fill missing values with estimated ones (mean/median, regression prediction, or
model-based methods like KNN). This retains data but introduces uncertainty.

o Model-Based Methods: Use statistical algorithms (e.g. Expectation-Maximization) to infer
likely values.

o Use of Algorithms That Tolerate Nulls: Some tree-based models can work around missing
entries without explicit imputation.
• Best Practice: Carefully choose a strategy based on context. For example, if only 1% of entries are
missing at random, simple deletion might suffice. If many values are missing in a non-random way,
imputation or more advanced methods are needed. Always document how missing data were
handled to assess any bias introduced.
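
To make the trade-offs concrete, the Python sketch below applies case deletion, mean imputation, and KNN-based imputation to a small made-up table; it assumes pandas and scikit-learn are available.

```python
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical table with gaps in both columns.
df = pd.DataFrame({
    "age": [34, 51, None, 45, 29],
    "blood_pressure": [120, None, 135, 142, None],
})

# Case deletion: simple, but discards partially complete records.
dropped = df.dropna()

# Mean imputation: fills each gap with the column average.
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# Model-based imputation: estimates each gap from similar rows (KNN).
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(dropped.shape)                       # fewer rows survive deletion
print(mean_filled.isna().sum().sum())      # 0 remaining gaps
print(knn_filled.isna().sum().sum())       # 0 remaining gaps
```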

Data Profiling

• Definition: Data profiling is the process of examining and summarizing source data to understand
its structure, content, and quality. It is usually done early in a project to discover what the data
contains before deep analysis.

• Key Activities:

o Descriptive Statistics: Compute min, max, average, median, frequency counts, number of
nulls, etc. for each field. For example, finding the percentage of missing values or outliers in a
column.

o Pattern Analysis: Detect patterns or distributions (e.g. are phone numbers consistently
formatted?).

o Anomaly Detection: Identify inconsistent formats or invalid values.

o Relationship Discovery: Find possible keys (unique identifiers) or foreign key candidates
between tables.

• Purpose: Profiling reveals data issues before heavy processing. It can uncover errors (e.g. duplicate
records, invalid ranges) and help define cleansing rules for ETL. According to one guide, profiling is
crucial in data warehousing and BI projects because it “uncover[s] data quality issues in data
sources, and what needs to be corrected in ETL”. In migration projects, it may reveal new
requirements for the target system.

• Examples: A finance team profiling customer data might find that 10% of account records have
missing emails, or that ZIP codes are sometimes 4-digit instead of 5. Knowing this helps plan data
cleaning steps.
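
A basic profiling pass can be done with pandas alone. The sketch below assumes a hypothetical customers.csv file with a zip_code column, and reports per-column statistics, missing-value rates, a simple format check, and unique-key candidates.

```python
import pandas as pd

df = pd.read_csv("customers.csv")            # hypothetical source file

# Descriptive statistics per column (min, max, mean, quartiles, counts).
print(df.describe(include="all"))

# Completeness: percentage of missing values in each field.
print((df.isna().mean() * 100).round(1))

# Pattern analysis: how many ZIP codes are not exactly 5 digits?
if "zip_code" in df.columns:
    bad_zips = ~df["zip_code"].astype(str).str.fullmatch(r"\d{5}")
    print(f"non-5-digit ZIP codes: {bad_zips.sum()}")

# Relationship discovery: which columns could serve as a unique key?
print(df.nunique() == len(df))
```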

Data Preparation

• Cleaning: Remove or correct erroneous data. This includes de-duplicating records, fixing
inconsistent formats, and resolving obvious errors (e.g., a height field entered as 600 meters instead
of 6.00). It also involves handling missing data as discussed earlier.

• Transformation: Convert data into a suitable format. Examples: normalizing numeric scales (e.g. z-
scores), encoding categorical variables as numbers, deriving new features (like BMI from
weight/height). Transformations prepare data for analysis or machine learning.

• Integration: Combine data from different sources into a unified form. For instance, joining customer
demographics with purchase history from another system. This often involves aligning field names,
units, and schemas.

• Goal: After preparation, data should be consistent, accurate, and analysis-ready. Well-prepared
data feeds into analytics or ML models smoothly. For example, before training a model, a team
might scale all financial metrics to thousands of dollars and encode categorical fields (e.g.
“male/female” to 0/1).
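
A compact Python sketch of these three steps, using two made-up source tables: integration via a join on customer_id, cleaning via de-duplication, and transformation via scaling and simple categorical encoding (the male/female mapping mirrors the example above and is purely illustrative).

```python
import pandas as pd

# Hypothetical raw extracts from two different source systems.
demographics = pd.DataFrame({"customer_id": [1, 2, 3],
                             "gender": ["male", "female", "female"]})
purchases = pd.DataFrame({"customer_id": [1, 2, 3],
                          "revenue": [1200.0, 860.0, 450.0]})

# Integration: join the sources on a shared key.
df = demographics.merge(purchases, on="customer_id")

# Cleaning: drop duplicate records.
df = df.drop_duplicates()

# Transformation: scale revenue to thousands and encode the categorical field.
df["revenue_k"] = df["revenue"] / 1000
df["gender_code"] = df["gender"].map({"male": 0, "female": 1})

print(df)
```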

Data Exploration (EDA)

• Definition: Exploratory Data Analysis (EDA) involves visually and statistically examining a prepared
dataset to uncover patterns, anomalies, or relationships. It is an open-ended, iterative process to
understand what the data can reveal.

• Purpose: EDA helps formulate hypotheses and guides modeling. Analysts look at distributions of
each variable, relationships between variables (using scatter plots, correlation matrices), and
identify outliers or trends. According to IBM, EDA is used to “summarize [data’s] main
characteristics, often employing data visualization methods”.

• Common Techniques:

o Univariate analysis: histograms or boxplots of single variables to check distribution, skew,
and outliers.

o Bivariate/multivariate analysis: scatterplots, heatmaps, or pivot tables to see how pairs of
variables relate.

o Summary statistics: mean, median, standard deviation, quartiles.

o Dimensionality reduction (like PCA) and clustering to visualize high-dimensional data.

• Outcome: Insights from EDA inform which models or transformations to apply. For example, seeing
that two variables are strongly correlated might suggest combining them or dropping one.
Anomalies detected (like a strangely high salary entry) might indicate data errors to correct. In
short, EDA lets data scientists check assumptions and ensure models will be appropriate.
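
A minimal EDA pass in Python might look like the sketch below; the file prepared_data.csv and the salary column are hypothetical, and pandas plus matplotlib are assumed.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("prepared_data.csv")        # hypothetical prepared dataset

# Univariate: summary statistics and histograms of each numeric column.
print(df.describe())
df.hist(figsize=(10, 6))

# Bivariate/multivariate: correlations between numeric variables.
print(df.select_dtypes("number").corr())

# Flag suspicious outliers, e.g. salaries far above the rest of the data.
if "salary" in df.columns:
    threshold = df["salary"].mean() + 3 * df["salary"].std()
    print(df[df["salary"] > threshold])

plt.show()
```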

ETL (Extract, Transform, Load)

• Concept: ETL is a three-step process to gather data for analysis. First, Extract raw data from various
source systems (databases, APIs, logs). Second, Transform the data by cleaning, converting formats,
aggregating, and applying business rules. Third, Load the processed data into a target repository
(often a data warehouse or data lake).

• Purpose: To consolidate and prepare data for business intelligence, reporting, or machine learning.
ETL ensures data from disparate sources is combined into a consistent, usable form. For example, an
ETL job might pull transactional data from a sales system (extract), adjust currency units and drop
irrelevant fields (transform), and then load the refined data into a central reporting database (load).

• Key Features:

o Uses business rules during Transform to standardize data (e.g. calculating total order value,
formatting dates).

o Often scheduled in batches or triggered by events.

o Handles both structured and semi-structured data sources; big data ETL tools can process
large volumes in distributed systems.
• Example: An online retailer extracts customer orders and website clickstream logs, transforms them
by joining on customer ID and summarizing daily behavior, then loads the result into a warehouse
for analysis. This allows analysts to run queries like “What is the average order value for customers
who viewed a product page twice before buying?”

• Benefits: ETL pipelines make data readily available for analytics and BI. They can “combine data
from multiple sources into a large, central repository called a data warehouse”. For instance, by
ETLing POS (point-of-sale) data, an analysis might forecast demand and optimize inventory.
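
Below is a minimal ETL sketch in Python. The source file orders.csv, its amount_eur and internal_flag columns, the fixed exchange rate, and the SQLite file standing in for the warehouse are all assumptions for illustration; real pipelines typically use dedicated ETL tooling and scheduling.

```python
import sqlite3

import pandas as pd

# Extract: pull raw order data from a source system (a CSV stands in here).
orders = pd.read_csv("orders.csv")                     # hypothetical extract

# Transform: apply business rules -- convert currency, drop unused fields,
# and summarise daily spend per customer.
orders["amount_usd"] = orders["amount_eur"] * 1.08     # assumed fixed rate
daily = (orders
         .drop(columns=["internal_flag"])
         .groupby(["customer_id", "order_date"], as_index=False)["amount_usd"]
         .sum())

# Load: write the refined table into the reporting database (data warehouse).
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_orders", conn, if_exists="replace", index=False)
```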

Data Warehousing & Summary

• Data Warehouse: A data warehouse is a centralized repository designed for analysis and reporting.
It stores integrated data (often historical and aggregated) from multiple operational sources. Data in
a warehouse is typically organized into subject-oriented schemas (e.g. tables for sales, customers,
time). For example, a hospital data warehouse might combine patient demographics, treatments,
and billing into one system for analytics.

• Data warehouse software is optimized for read queries and can span various storage hardware to
handle large datasets. Unlike transactional databases (OLTP), warehouses support OLAP (Online
Analytical Processing), enabling complex queries and dashboards.

• Use Cases: Analysts use data warehouses for business intelligence, trend analysis, and reporting. In
finance, a data warehouse might hold all trade and account data over years to allow querying risk
exposures. In IT, a warehouse of log data can help spot system-wide patterns.
