DMV U4 RK

Uploaded by

Sai Dahiwal


Unit 4 - Data Visualization and Data Wrangling rohan_kaitake

Data Wrangling

Definition:
The process of cleaning, transforming, and organizing raw data into a usable format for analysis.

# Steps in Data Wrangling

1. Data Collection / Understanding the Dataset

o Get familiar with data structure, format, and types.

o Use exploratory functions like .info(), .describe(), .head(), and .tail().

2. Data Cleaning: Handling Missing Data

o Identify: Detect missing values using .isnull() or .isna().

o Handle:

▪ Remove rows/columns: df.dropna().

▪ Impute values: df.fillna(value).

3. Data Cleaning: Removing Duplicates

o Identify duplicates using .duplicated().

o Remove them using .drop_duplicates().

4. Data Transformation

o Standardization: Converting data into a consistent format.

▪ Example: Converting dates to YYYY-MM-DD.

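o Normalization: Rescaling numeric data to a common scale.

▪ Example: scikit-learn's MinMaxScaler maps values to a fixed range such as [0, 1]; StandardScaler rescales to zero mean and unit variance.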

5. Data Formatting / Data Type Conversion

o Convert data to appropriate types using .astype().

▪ Example: df['column'] = df['column'].astype('int').

6. Handling Outliers

o Detect using statistical methods (e.g., z-score, IQR).

o Handle:

▪ Remove them.

▪ Replace with median or mean.

7. Feature Engineering

o Create new features based on existing ones.

▪ Example: Extracting year from a datetime column.

o One-hot encoding categorical variables: pd.get_dummies().

8. Data Integration

o Combine multiple datasets into one:

▪ Merge: pd.merge().

▪ Concatenate: pd.concat().

▪ Join: .join().

9. Filtering and Subsetting

o Filter data based on conditions using .loc[] or .query().

o Subset relevant columns using df[['col1', 'col2']].

10. Validation and Verification

o Ensure data integrity by checking for inconsistencies or errors.

o Use assertions or manually verify data quality.
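The steps above can be sketched end to end on a tiny table. This is only an illustration: the column names and values are made up, the outlier rule is the IQR method with the usual 1.5 multiplier, and the scaling is done by hand with the min-max formula rather than with MinMaxScaler.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing age, one salary outlier.
raw = pd.DataFrame({
    "name": ["Asha", "Ben", "Ben", "Chen", "Dev", "Eva"],
    "joined": ["2021-01-15", "2021-03-02", "2021-03-02",
               "2022-07-30", "2022-11-11", "2023-02-05"],
    "age": ["25", "31", "31", None, "27", "29"],
    "salary": [50000, 52000, 52000, 51000, 49000, 500000],
})

# Steps 2-3: remove duplicate rows, then impute the missing age with the median.
df = raw.drop_duplicates().copy()
df["age"] = pd.to_numeric(df["age"])
df["age"] = df["age"].fillna(df["age"].median())

# Step 5: convert columns to appropriate types.
df["age"] = df["age"].astype("int")
df["joined"] = pd.to_datetime(df["joined"])

# Step 6: detect outliers with the IQR rule and drop them.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[in_range].copy()

# Step 4: min-max scaling of salary to the range [0, 1].
span = clean["salary"].max() - clean["salary"].min()
clean["salary_scaled"] = (clean["salary"] - clean["salary"].min()) / span

# Step 7: create a new feature from the datetime column.
clean["join_year"] = clean["joined"].dt.year

# Step 10: validate the result with assertions.
assert clean["age"].notna().all()
assert not clean.duplicated().any()
```

Median imputation and dropping outliers are just one choice for each step; replacing outliers with the median, as listed under step 6, would keep the row count unchanged.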

Combining and Merging Datasets

Combining and merging are crucial operations for integrating data from multiple sources.

1. Merging Datasets (Join Operations)

Key Function: pd.merge()

• Combines datasets based on common columns or indices.

• Types of joins:

o Inner Join: Retains only matching rows.

o Outer Join: Retains all rows from both datasets.

o Left Join: Retains all rows from the left dataset.

o Right Join: Retains all rows from the right dataset.

2. Concatenating Datasets

Key Function: pd.concat()

• Stacks datasets vertically (by rows) or horizontally (by columns).

Example:

Combining and Merging Datasets

Operation | Dataset 1 | Dataset 2 | Result
Merging (Inner) | ID: 1, Alice; ID: 2, Bob; ID: 3, Charlie | ID: 2, Score: 85; ID: 3, Score: 90; ID: 4, Score: 75 | ID: 2, Bob, 85; ID: 3, Charlie, 90
Concatenation | ID: 1, Alice; ID: 2, Bob | ID: 3, Charlie; ID: 4, David | ID: 1, Alice; ID: 2, Bob; ID: 3, Charlie; ID: 4, David

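The merge and concatenation example can be reproduced in pandas roughly as follows. The column names ID, Name, and Score are assumptions taken from the values shown in the example.

```python
import pandas as pd

# Inner merge: keep only IDs present in both tables.
names = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
scores = pd.DataFrame({"ID": [2, 3, 4], "Score": [85, 90, 75]})
inner = pd.merge(names, scores, on="ID", how="inner")

# Outer merge: keep every ID from either table; missing cells become NaN.
outer = pd.merge(names, scores, on="ID", how="outer")

# Concatenation: stack two tables with the same columns by rows.
first = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
second = pd.DataFrame({"ID": [3, 4], "Name": ["Charlie", "David"]})
stacked = pd.concat([first, second], ignore_index=True)

print(inner)
print(stacked)
```

Changing how to "left" or "right" gives the other two join types listed above.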
3. Reshaping and Pivoting

Reshaping: Transform data format (e.g., wide to long or vice versa).

• Key Function: pd.melt() for wide-to-long and pd.pivot() for long-to-wide.

Pivoting: Rearrange data to create summary tables.

• Key Function: pd.pivot_table().

Example :

Operation | Input Table | Result
Wide to Long (Melt) | ID: 1, Math: 90, Science: 95; ID: 2, Math: 85, Science: 88 | ID: 1, Subject: Math, Score: 90; ID: 1, Subject: Science, Score: 95; ID: 2, Subject: Math, Score: 85; ID: 2, Subject: Science, Score: 88
Long to Wide (Pivot) | ID: 1, Subject: Math, Score: 90; ID: 1, Subject: Science, Score: 95; ID: 2, Subject: Math, Score: 85; ID: 2, Subject: Science, Score: 88 | ID: 1, Math: 90, Science: 95; ID: 2, Math: 85, Science: 88

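A short sketch of the same reshaping in pandas, using the values from the example. The mean used in the pivot table is just one possible aggregation.

```python
import pandas as pd

wide = pd.DataFrame({"ID": [1, 2], "Math": [90, 85], "Science": [95, 88]})

# Wide to long: one row per (ID, Subject) pair.
long = pd.melt(wide, id_vars="ID", var_name="Subject", value_name="Score")
long = long.sort_values(["ID", "Subject"]).reset_index(drop=True)

# Long to wide: one column per subject again.
back = pd.pivot(long, index="ID", columns="Subject", values="Score").reset_index()
back.columns.name = None  # drop the leftover "Subject" label on the column axis

# Pivot table: a summary (here the mean score) per subject.
summary = pd.pivot_table(long, index="Subject", values="Score", aggfunc="mean")

print(long)
print(back)
print(summary)
```

pd.pivot() only rearranges values and fails if an (index, column) pair repeats, while pd.pivot_table() aggregates repeated pairs.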
Matplotlib: Overview and Key Functions for Data Visualization

Matplotlib is a powerful Python library used for creating static, interactive, and animated visualizations. It
provides functions to create various types of plots like line, scatter, bar, histogram, and pie charts, making it
essential for data analysis and presentation. With customization options, support for 3D plotting, and
compatibility with other libraries, Matplotlib is highly versatile for data visualization tasks.

Matplotlib provides an extensive suite of tools for:

• Basic and advanced visualizations.

• Customizing and styling plots.

• Saving and presenting data effectively.

Import syntax: import matplotlib.pyplot as plt

1. Basic Plotting Functions

These functions are used to create basic visualizations:

Function Description

plot() Creates a simple line plot.

scatter() Generates a scatter plot to visualize relationships between two variables.

bar() Draws a bar plot, useful for categorical data.

hist() Creates a histogram to show data distribution.

boxplot() Displays a box plot to show data spread and identify outliers. (e.g., plt.boxplot())

pie() Makes a pie chart to show proportions. (e.g., plt.pie())

2. Customization Functions

Enhance and format your plots with these customization tools:

Function Description

xlabel() Labels the x-axis.

ylabel() Labels the y-axis.

title() Adds a title to the plot.

legend() Displays a legend to describe plot elements.

grid() Adds grid lines to the plot for better readability.

xlim(), ylim() Set the limits for the x-axis and y-axis.
3. Styling and Formatting

Customize the appearance of plots using these options:

Function/Parameter Description

style.use() Applies a predefined style (e.g., ggplot; the seaborn styles are named seaborn-v0_8-* from Matplotlib 3.6 onward).

color Sets line or marker color.

linewidth Adjusts the thickness of plot lines.

linestyle Customizes line style (e.g., dashed, dotted).

alpha Adjusts the transparency of plot elements.
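A minimal sketch that puts the basic, customization, and styling functions together. The data values and labels are made up, and the Agg backend is selected only so the script also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for on-screen use
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
months = [1, 2, 3, 4, 5]
sales = [10, 14, 9, 17, 21]

plt.figure()
# Basic plot with styling parameters.
plt.plot(months, sales, color="tab:blue", linewidth=2,
         linestyle="--", alpha=0.8, label="Sales")

# Customization functions.
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.title("Monthly sales")
plt.legend()
plt.grid(True)
plt.xlim(0, 6)

ax = plt.gca()  # handle to the current axes, useful for inspecting the result
```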

4. Advanced Plotting Functions

For more complex visualizations, Matplotlib offers advanced tools:

Function Description

subplots() Creates multiple plots in a single figure for comparisons.

imshow() Displays images or heatmaps for matrix data.

contour() Generates contour plots to visualize 3D data in 2D.

3D Plotting Functions like plot_surface() and plot_wireframe() in mpl_toolkits.mplot3d are used for 3D data.

5. Saving and Displaying Plots

Function Description

show() Displays the plot on the screen.

savefig() Saves the plot as an image file (e.g., PNG, JPG, PDF).
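The advanced and saving functions can be combined as in the sketch below. The data and the file name example_plots.png are assumptions, and the file goes to the system temporary directory. plt.show() is omitted because the Agg backend has no window to display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for on-screen use
import matplotlib.pyplot as plt

# Hypothetical values with one high point.
values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]

# Two plots side by side in a single figure.
fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
left.hist(values, bins=5)
left.set_title("Histogram")
right.boxplot(values)
right.set_title("Box plot")
fig.tight_layout()

# Save the figure as a PNG file, then release it.
out_path = os.path.join(tempfile.gettempdir(), "example_plots.png")
fig.savefig(out_path)
plt.close(fig)
```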
Effective Graph Layout Techniques

Technique | Description | Example Use
Force-Directed Layout | Simulates physical forces (e.g., attraction and repulsion) to position nodes and edges naturally. | Social network analysis.
Tree Layout | Represents hierarchical data with a tree structure. | Organizational charts or file systems.
Circular Layout | Arranges nodes in a circular pattern to highlight cycles or group relationships. | Visualizing feedback loops.
Geographical Layout | Maps data onto real-world geographic locations. | Traffic or weather data visualization.
Layered (Hierarchical) | Places nodes in layers, showing dependencies or flows clearly. | Flowcharts or decision trees.
Cluster Layout | Groups nodes into clusters based on similarity or relationships. | Communities in a network.

Key Considerations for Effective Graph Visualization

1. Simplicity: Avoid clutter by showing only necessary information.

2. Color and Size: Use them to highlight key nodes or edges.

3. Interactivity: Add zoom and filters for better exploration.

4. Readability: Ensure labels and connections are clear and distinct.

Force-Directed Layout (Simplified)

A Force-Directed Layout is a graph visualization technique that arranges nodes and edges naturally, as if
physical forces are acting on them.

How It Works

1. Attraction: Nodes connected by edges are drawn closer together.

2. Repulsion: Nodes that are not connected push away from each other.

3. Balance: The algorithm adjusts positions until the forces stabilize, creating a visually appealing and clear
graph.
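The three steps can be sketched in plain Python as below. This is a simplified illustration, not a production layout algorithm. It assumes a common choice of force functions (repulsion k^2 / distance, attraction distance^2 / k), applies repulsion to every pair of nodes (a widespread variant), and uses a shrinking step size to reach balance. The function name, parameters, and the example graph are hypothetical.

```python
import math
import random


def _direction(pos, a, b):
    """Unit vector pointing from b to a, plus the distance between them."""
    dx = pos[a][0] - pos[b][0]
    dy = pos[a][1] - pos[b][1]
    dist = math.hypot(dx, dy) or 1e-9  # guard against coincident nodes
    return dx / dist, dy / dist, dist


def force_directed_layout(nodes, edges, iterations=200, k=1.0, seed=0):
    rng = random.Random(seed)
    # Start from random positions.
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    step = 0.1  # largest move allowed per iteration

    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}

        # Repulsion: each pair of nodes pushes apart (strength k^2 / distance).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                ux, uy, dist = _direction(pos, a, b)
                f = k * k / dist
                force[a][0] += ux * f
                force[a][1] += uy * f
                force[b][0] -= ux * f
                force[b][1] -= uy * f

        # Attraction: nodes joined by an edge pull together (strength distance^2 / k).
        for a, b in edges:
            ux, uy, dist = _direction(pos, a, b)
            f = dist * dist / k
            force[a][0] -= ux * f
            force[a][1] -= uy * f
            force[b][0] += ux * f
            force[b][1] += uy * f

        # Balance: move each node along its net force, capped by the step size.
        for n in nodes:
            fx, fy = force[n]
            length = math.hypot(fx, fy) or 1e-9
            move = min(length, step)
            pos[n][0] += fx / length * move
            pos[n][1] += fy / length * move

        step *= 0.98  # shrink the step so the layout settles

    return pos


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]  # a simple path a-b-c-d
layout = force_directed_layout(nodes, edges)
for name, (x, y) in layout.items():
    print(name, round(x, 2), round(y, 2))
```

In the resulting layout, directly connected nodes such as a and b end up closer together than the two ends of the path, a and d.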

Key Features

• Natural Structure: Creates intuitive layouts for complex relationships.

• Interactive: Often used in dynamic or interactive visualizations.

Example Use Cases

• Social networks (showing friendships or connections).

• Knowledge graphs (representing related concepts).

• Network traffic (visualizing communication between devices).

Force-Directed Techniques in Multidimensional Scaling (MDS)

Force-Directed Techniques and Multidimensional Scaling (MDS) are popular methods for visualizing and
organizing complex data, particularly in graph structures and high-dimensional spaces.

Force-Directed Techniques

Aspect | Details
Definition | Force-directed techniques simulate physical forces to arrange nodes in a graph for clarity.
Key Components | Attraction: Nodes connected by edges attract each other. Repulsion: Unconnected nodes repel each other.
Algorithm Steps | 1. Initialize nodes randomly. 2. Apply attractive and repulsive forces iteratively. 3. Adjust positions until equilibrium is reached.
Advantages | Produces natural and intuitive layouts. Highlights clusters and relationships clearly.
Challenges | Computationally expensive for large datasets. May lead to overlapping nodes in dense graphs.

Multidimensional Scaling (MDS)

Aspect | Details
Definition | MDS reduces high-dimensional data to lower dimensions (2D or 3D) while preserving relative distances.
Key Components | Input: A distance matrix representing pairwise similarities or dissimilarities. Output: A low-dimensional plot of points.
Algorithm Steps | 1. Compute a similarity or distance matrix. 2. Minimize a stress function to position nodes while maintaining distances.
Advantages | Visualizes high-dimensional data effectively. Highlights relationships and patterns.
Challenges | Computationally expensive for very large datasets. May lose some data fidelity in reduction.

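The two MDS steps can be illustrated with NumPy. This sketch minimizes the raw stress, the sum over pairs of (embedded distance - target distance)^2, by plain gradient descent. That is one simple way to carry out the minimization, not the only one, and library implementations typically use other optimizers. The data, function names, learning rate, and step count are assumptions.

```python
import numpy as np


def pairwise_distances(points):
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(points, target):
    # Sum of squared distance errors, with each pair counted once.
    return ((pairwise_distances(points) - target) ** 2).sum() / 2


def mds_gradient_descent(target, dims=2, steps=500, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(target.shape[0], dims))  # random starting layout
    for _ in range(steps):
        diff = points[:, None, :] - points[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, 1.0)  # avoid dividing by zero on the diagonal
        coeff = (dist - target) / dist
        np.fill_diagonal(coeff, 0.0)
        # Gradient of the stress with respect to each point.
        grad = 2 * (coeff[:, :, None] * diff).sum(axis=1)
        points -= lr * grad
    return points


# Hypothetical 4-dimensional data: five points.
data = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
])

# Step 1: compute the distance matrix.
target = pairwise_distances(data)

# Step 2: minimize the stress to obtain 2D coordinates.
start = mds_gradient_descent(target, steps=0)  # the random starting layout
embedding = mds_gradient_descent(target)

print("stress before:", stress(start, target))
print("stress after:", stress(embedding, target))
```

The final stress is generally not zero: five points in four dimensions cannot always be placed in 2D with every distance preserved, which is the loss of fidelity mentioned under Challenges.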
Force-Directed Techniques in MDS

Force-directed techniques can be combined with MDS to visualize high-dimensional data in a graph-like
structure:

1. Step 1: Apply MDS to reduce high-dimensional data into 2D or 3D coordinates.

2. Step 2: Use force-directed algorithms to refine the layout, improving clarity and interpretability of
relationships.

3. Step 3: Fine-tune visualization by adjusting parameters like node size, edge length, or attraction forces.

Advantages of Combining Force-Directed Techniques with MDS

• Creates layouts that are both data-accurate (preserving distances) and visually appealing (structured
intuitively).

• Useful for datasets that are both high-dimensional and relational, such as social networks or gene
expression data.

Solution to Explain Force-Directed Techniques in MDS (7 Marks)

1. Introduction: Define force-directed techniques and MDS.

2. Mechanism: Explain the working of force-directed algorithms (attraction/repulsion forces).

3. MDS Process: Describe how MDS reduces high-dimensional data into lower dimensions.

4. Combination: Illustrate how force-directed techniques refine MDS outputs for clearer visualization.

5. Advantages: Mention the benefits of using these techniques together (e.g., preserving accuracy,
improving interpretability).

6. Applications: Highlight real-world use cases (e.g., social networks, hierarchical data).

7. Conclusion: Emphasize the effectiveness of this approach in simplifying and visualizing complex
datasets.
Bipartite Graphs #V.V.IMP

(Also called a bigraph: a type of graph in graph theory whose nodes form two sets.)

A bipartite graph is a type of graph where nodes (vertices) can be divided into two distinct sets, and edges
(connections) exist only between nodes from different sets. No edges are present within the same set.

Key Characteristics of Bipartite Graphs

1. Two Distinct Sets: Nodes are split into two groups, typically denoted as U and V.

2. No Intra-set Connections: Edges only occur between nodes in U and V, not within the same set.

3. No Odd-length Cycles: Bipartite graphs do not contain any cycles of odd length.

4. Two-Colorable: Nodes can be colored with two colors such that no two adjacent vertices share the
same color.

5. Visualization: Typically represented with nodes of one set on one side and nodes of the other on the
opposite side, with edges connecting them.

How to Identify a Bipartite Graph

1. Start with any vertex and assign it to one set (e.g., U).

2. Assign its neighbors to the other set (e.g., V).

3. Repeat this process, alternating sets, until every vertex is assigned (restart from any unassigned vertex if the graph is disconnected).

4. If any adjacent vertices are assigned to the same set, the graph is not bipartite.
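The identification procedure above amounts to a breadth-first two-coloring. A minimal sketch, assuming the graph is given as an adjacency dictionary that lists every edge in both directions (the function name and example graphs are hypothetical):

```python
from collections import deque


def is_bipartite(adjacency):
    """Return (True, assignment) if the graph is bipartite, else (False, {})."""
    side = {}
    for start in adjacency:          # covers disconnected graphs as well
        if start in side:
            continue
        side[start] = "U"            # step 1: put the first vertex in U
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in side:
                    # step 2: neighbors go to the opposite set
                    side[neighbor] = "V" if side[node] == "U" else "U"
                    queue.append(neighbor)
                elif side[neighbor] == side[node]:
                    # step 4: two adjacent vertices share a set
                    return False, {}
    return True, side


square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}

print(is_bipartite(square))    # even cycle: bipartite
print(is_bipartite(triangle))  # odd cycle: not bipartite
```

The triangle fails because it is a cycle of odd length, matching the third characteristic listed earlier.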

Why Bipartite Graphs are Important || Advantages

Bipartite graphs are widely used to solve complex problems in:

• Network theory for connectivity.

• Algorithm design for matching and optimization.

• Data analysis in diverse fields like social networking, bioinformatics, and recommendation systems.
Applications of Bipartite Graphs

Application | Description | Example
Recommender Systems | Links users to items like movies or products. | User-movie recommendation engines.
Matching Problems | Assign tasks to workers, students to courses, or stable marriage pairings. | Job matching, college admissions.
Social Networks | Represents relationships between users and groups. | User-group affiliations.
Biological Networks | Models protein-protein interactions or genes and their functions. | Protein-function mapping.
Web Mining | Represents relationships between keywords and documents. | Search engine keyword analysis.

TEXTBOOK DEFINITION
Hierarchical Indexing

Hierarchical Indexing (or multi-level indexing) is a feature in pandas that allows you to create multiple levels of
indices (row or column labels) in a dataset. It enables working with higher-dimensional data in a 1D or 2D
format, making data organization and access easier.

Key Features of Hierarchical Indexing

1. Multiple Levels: Data can have multiple index levels for rows or columns.

2. Efficient Access: Enables quick access to subsets of data based on one or more index levels.

3. Flexibility: Helps represent complex data structures compactly.

Example of Hierarchical Indexing

Dataset: Sales Data for Products Across Two Regions

Region City Product Sales

North Delhi A 100

North Delhi B 150

North Jaipur A 200

South Bangalore A 250

South Hyderabad B 300

Hierarchical Index Implementation

Region | City | Product | Sales
North | Delhi | A | 100
 | | B | 150
 | Jaipur | A | 200
South | Bangalore | A | 250
 | Hyderabad | B | 300

• Region and City are hierarchical row indices.

• Accessing data becomes easy:

o Example: To access sales for North-Delhi:

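df.loc[('North', 'Delhi')] (where df is the DataFrame indexed by Region and City)

o Result: the two Delhi rows (Product A with Sales 100, Product B with Sales 150).
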
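The sales example can be built and queried in pandas as in this sketch. The variable names are assumptions; the data is the table shown above.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["North", "North", "North", "South", "South"],
    "City": ["Delhi", "Delhi", "Jaipur", "Bangalore", "Hyderabad"],
    "Product": ["A", "B", "A", "A", "B"],
    "Sales": [100, 150, 200, 250, 300],
})

# Region and City become the two levels of the row index.
indexed = sales.set_index(["Region", "City"])

# Select by both levels: all rows for North-Delhi.
delhi = indexed.loc[("North", "Delhi"), :]

# Select by the first level only: total sales in the North region.
north_total = indexed.loc["North", "Sales"].sum()

# Aggregate by one index level.
by_region = indexed.groupby(level="Region")["Sales"].sum()

print(delhi)
print(north_total)
print(by_region)
```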
Advantages of Hierarchical Indexing

1. Efficient representation of multi-dimensional data.

2. Simplifies data slicing and aggregation by levels.

3. Makes working with grouped data intuitive.

Conclusion

Hierarchical indexing is a powerful tool for managing multi-dimensional datasets in pandas. By organizing data
into multiple index levels, it enhances readability, accessibility, and efficiency, making it especially useful for
analysis in fields like sales, finance, and research.

--------------------------------------------------------RK-------------------------------------------------------------------------

REVISION

For syllabus topics like Data Wrangling, Visualization, and Graph Representations, use this simple and
reusable template to create answers.

1. Definition/Introduction

Define the topic briefly.

• Example for Data Wrangling:

"Data wrangling is the process of cleaning, transforming, and organizing raw data into a usable format for
analysis."

2. Key Concepts/Features

List the main components or techniques.

• For Data Visualization:

o "Matplotlib: A fundamental library for static visualizations in Python."

o "Seaborn: Built on matplotlib for statistical plots."

o "Pandas: Provides easy plotting options for dataframes."

3. Example

Always include a concise, practical example.

• For Hierarchical Indexing:

"Hierarchical indexing organizes data across multiple levels, making subsets easy to access. E.g., in
sales data, regions and cities can be hierarchical indices for efficient grouping."
4. Applications

Discuss where the concept is applied.

• For Graph Layout Techniques:

"Used in social networks, biological networks, and visualizing complex relationships in data."

5. Advantages/Importance

Highlight why the topic is useful.

• For Combining and Merging Data Sets:

"Combining datasets simplifies analysis by integrating related information, improving the accuracy of insights."

Reusable Answer Structure

You can adapt this approach for:

1. Hierarchical Indexing: Definition → Features → Example → Applications.

2. Graph Layout Techniques: Types (Force-directed, Bipartite) → Examples → Uses (social networks,
recommender systems).

3. Visualization Tools: Define libraries (Matplotlib, Seaborn) → Basic functions → Use case examples.

------------------------------------------------------RK----------------------------------------------------------------------------
