DMV U4 RK

Uploaded by

Sai Dahiwal

Unit 4 - Data Visualization and Data Wrangling rohan_kaitake

Data Wrangling

Definition:
The process of cleaning, transforming, and organizing raw data into a usable format for analysis.

# Steps in Data Wrangling

1. Data Collection / Understanding the Dataset

o Get familiar with data structure, format, and types.

o Use exploratory functions like .info(), .describe(), .head(), and .tail().

2. Data Cleaning: Handling Missing Data

o Identify: Detect missing values using .isnull() or .isna().

o Handle:

▪ Remove rows/columns: df.dropna().

▪ Impute values: df.fillna(value).

3. Data Cleaning: Removing Duplicates

o Identify duplicates using .duplicated().

o Remove them using .drop_duplicates().

4. Data Transformation

o Standardization: Converting data into a consistent format.

▪ Example: Converting dates to YYYY-MM-DD.

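o Normalization: Scaling numeric data to a common scale.

▪ Example: Use scikit-learn's MinMaxScaler (rescales each feature to a fixed range, [0, 1] by default) or StandardScaler (rescales each feature to zero mean and unit variance).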

5. Data Formatting / Data Type Conversion

o Convert data to appropriate types using .astype().

▪ Example: df['column'] = df['column'].astype('int').

6. Handling Outliers

o Detect using statistical methods (e.g., z-score, IQR).

o Handle:

▪ Remove them.

▪ Replace with median or mean.


7. Feature Engineering

o Create new features based on existing ones.

▪ Example: Extracting year from a datetime column.

o One-hot encoding categorical variables: pd.get_dummies().

8. Data Integration

o Combine multiple datasets into one:

▪ Merge: pd.merge().

▪ Concatenate: pd.concat().

▪ Join: .join().

9. Filtering and Subsetting

o Filter data based on conditions using .loc[] or .query().

o Subset relevant columns using df[['col1', 'col2']].

10. Validation and Verification

o Ensure data integrity by checking for inconsistencies or errors.

o Use assertions or manually verify data quality.
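
The steps above can be sketched end to end on a small table. This is only an illustrative sketch: the data, the column names (name, joined, age, salary), and the choice of median imputation and the 1.5 * IQR rule are assumptions made for the example, not part of the original notes.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing age, one salary outlier.
raw = pd.DataFrame({
    "name": ["Asha", "Ben", "Ben", "Chen", "Dev", "Eli"],
    "joined": ["2023-01-15", "2023-02-20", "2023-02-20",
               "2023-03-05", "2023-04-11", "2023-05-30"],
    "age": ["25", "31", "31", None, "28", "30"],
    "salary": [50000.0, 52000.0, 52000.0, 51000.0, 49000.0, 500000.0],
})

# Step 1: understand the dataset.
raw.info()
print(raw.head())

# Step 2: count missing values per column.
print(raw.isna().sum())

# Step 3: remove exact duplicate rows.
df = raw.drop_duplicates().reset_index(drop=True)

# Steps 2 and 5: convert age to a numeric type, impute with the median, store as int.
df["age"] = pd.to_numeric(df["age"])
df["age"] = df["age"].fillna(df["age"].median()).astype("int")

# Step 4: parse the date strings (already YYYY-MM-DD here) into datetime values.
df["joined"] = pd.to_datetime(df["joined"], format="%Y-%m-%d")

# Step 6: flag salaries outside 1.5 * IQR and replace them with the median.
q1, q3 = df["salary"].quantile(0.25), df["salary"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
df.loc[is_outlier, "salary"] = df["salary"].median()

# Step 7: derive a new feature from an existing column.
df["join_year"] = df["joined"].dt.year

# Step 9: filter rows and keep only the relevant columns.
over_28 = df.query("age > 28")[["name", "age"]]

# Step 10: validate the cleaned data with assertions.
assert df.isna().sum().sum() == 0
assert not df.duplicated().any()
```

Whether to drop, impute, or cap a value depends on the dataset; the median is used here only because it is less sensitive to the outlier than the mean.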


Combining and Merging Datasets

Combining and merging are crucial operations for integrating data from multiple sources.

1. Merging Datasets (Join Operations)

Key Function: pd.merge()

• Combines datasets based on common columns or indices.

• Types of joins:

o Inner Join: Retains only matching rows.

o Outer Join: Retains all rows from both datasets.

o Left Join: Retains all rows from the left dataset.

o Right Join: Retains all rows from the right dataset.

2. Concatenating Datasets

Key Function: pd.concat()

• Stacks datasets vertically (by rows) or horizontally (by columns).

Example:

Combining and Merging Datasets

Operation | Dataset 1 | Dataset 2 | Result
Merging (Inner) | ID: 1, Alice; ID: 2, Bob; ID: 3, Charlie | ID: 2, Score: 85; ID: 3, Score: 90; ID: 4, Score: 75 | ID: 2, Bob, 85; ID: 3, Charlie, 90
Concatenation | ID: 1, Alice; ID: 2, Bob | ID: 3, Charlie; ID: 4, David | ID: 1, Alice; ID: 2, Bob; ID: 3, Charlie; ID: 4, David

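
The merging and concatenation example can be reproduced with a short pandas sketch. The column names (ID, Name, Score) and variable names are assumptions chosen to match the example rows.

```python
import pandas as pd

# Datasets from the merging example.
names = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
scores = pd.DataFrame({"ID": [2, 3, 4], "Score": [85, 90, 75]})

# Inner join: only IDs present in both datasets (2 and 3).
inner = pd.merge(names, scores, on="ID", how="inner")

# Outer join: all IDs from both datasets; missing cells become NaN.
outer = pd.merge(names, scores, on="ID", how="outer")

# Left and right joins keep every row of the left or right dataset.
left = pd.merge(names, scores, on="ID", how="left")
right = pd.merge(names, scores, on="ID", how="right")

# Datasets from the concatenation example, stacked vertically (by rows).
first = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
second = pd.DataFrame({"ID": [3, 4], "Name": ["Charlie", "David"]})
stacked = pd.concat([first, second], ignore_index=True)

print(inner)
print(stacked)
```

Passing axis=1 to pd.concat() would place the datasets side by side (by columns) instead.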
3. Reshaping and Pivoting

Reshaping: Transform data format (e.g., wide to long or vice versa).

• Key Function: pd.melt() for wide-to-long and pd.pivot() for long-to-wide.

Pivoting: Rearrange data to create summary tables.

• Key Function: pd.pivot_table().

Example :

Operation | Input Table | Result
Wide to Long (Melt) | ID: 1, Math: 90, Science: 95; ID: 2, Math: 85, Science: 88 | ID: 1, Subject: Math, Score: 90; ID: 1, Subject: Science, Score: 95; ID: 2, Subject: Math, Score: 85; ID: 2, Subject: Science, Score: 88
Long to Wide (Pivot) | ID: 1, Subject: Math, Score: 90; ID: 1, Subject: Science, Score: 95; ID: 2, Subject: Math, Score: 85; ID: 2, Subject: Science, Score: 88 | ID: 1, Math: 90, Science: 95; ID: 2, Math: 85, Science: 88

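
A minimal sketch of the same reshaping in pandas, using the values from the example. The variable names are assumptions, and the mean is just one possible aggregation for the summary table.

```python
import pandas as pd

# Wide table from the example: one row per ID, one column per subject.
wide = pd.DataFrame({"ID": [1, 2], "Math": [90, 85], "Science": [95, 88]})

# Wide to long: one row per (ID, Subject) pair.
long_df = pd.melt(wide, id_vars=["ID"], var_name="Subject", value_name="Score")
long_df = long_df.sort_values(["ID", "Subject"]).reset_index(drop=True)

# Long to wide: subjects become columns again.
back = pd.pivot(long_df, index="ID", columns="Subject", values="Score").reset_index()
back.columns.name = None  # drop the leftover "Subject" label on the column axis

# Summary table: average score per subject.
summary = pd.pivot_table(long_df, index="Subject", values="Score", aggfunc="mean")

print(long_df)
print(back)
print(summary)
```

pivot() requires each (index, columns) pair to be unique; pivot_table() aggregates duplicates, which is why it suits summary tables.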
Matplotlib: Overview and Key Functions for Data Visualization

Matplotlib is a powerful Python library used for creating static, interactive, and animated visualizations. It
provides functions to create various types of plots like line, scatter, bar, histogram, and pie charts, making it
essential for data analysis and presentation. With customization options, support for 3D plotting, and
compatibility with other libraries, Matplotlib is highly versatile for data visualization tasks.

Matplotlib provides an extensive suite of tools for:

• Basic and advanced visualizations.

• Customizing and styling plots.

• Saving and presenting data effectively.

Import syntax: import matplotlib.pyplot as plt

1. Basic Plotting Functions

These functions are used to create basic visualizations:

Function | Description
plot() | Creates a simple line plot.
scatter() | Generates a scatter plot to visualize relationships between two variables.
bar() | Draws a bar plot, useful for categorical data.
hist() | Creates a histogram to show data distribution.
boxplot() | Displays a box plot to show data spread and identify outliers (e.g., plt.boxplot()).
pie() | Makes a pie chart to show proportions (e.g., plt.pie()).

2. Customization Functions

Enhance and format your plots with these customization tools:

Function | Description
xlabel() | Labels the x-axis.
ylabel() | Labels the y-axis.
title() | Adds a title to the plot.
legend() | Displays a legend to describe plot elements.
grid() | Adds grid lines to the plot for better readability.
xlim(), ylim() | Set the limits for the x-axis and y-axis.

3. Styling and Formatting

Customize the appearance of plots using these options:

Function/Parameter | Description
style.use() | Applies a predefined style (e.g., ggplot; recent Matplotlib releases name the seaborn-derived styles with a seaborn-v0_8 prefix).
color | Sets line or marker color.
linewidth | Adjusts the thickness of plot lines.
linestyle | Customizes line style (e.g., dashed, dotted).
alpha | Adjusts the transparency of plot elements.
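
The basic, customization, and styling options can be combined in one short script. This is a sketch with made-up data (hours studied vs. score); the Agg backend is selected only so the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line for on-screen plots
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 71, 80]

plt.figure(figsize=(6, 4))

# Basic plotting: a dashed line plus semi-transparent scatter markers.
plt.plot(hours, scores, color="tab:blue", linewidth=2, linestyle="--", label="Trend")
plt.scatter(hours, scores, color="tab:red", alpha=0.7, label="Observations")

# Customization: labels, title, axis limits, grid, and legend.
plt.xlabel("Hours studied")
plt.ylabel("Score")
plt.title("Score vs. hours studied")
plt.xlim(0, 6)
plt.ylim(40, 90)
plt.grid(True)
plt.legend()

ax = plt.gca()  # handle to the current axes, useful for inspection
```

In an interactive session, calling plt.show() after these lines displays the figure.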

4. Advanced Plotting Functions

For more complex visualizations, Matplotlib offers advanced tools:

Function | Description
subplots() | Creates multiple plots in a single figure for comparisons.
imshow() | Displays images or heatmaps for matrix data.
contour() | Generates contour plots to visualize 3D data in 2D.
3D Plotting | Functions like plot_surface() and plot_wireframe() in mpl_toolkits.mplot3d are used for 3D data.

5. Saving and Displaying Plots

Function | Description
show() | Displays the plot on the screen.
savefig() | Saves the plot as an image file (e.g., PNG, JPG, PDF).
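
As a sketch of the advanced and saving functions together, the script below draws three panels with subplots() and writes the figure as PNG. The random data and the 5 x 5 matrix are placeholders, and the figure is saved to an in-memory buffer only to keep the example free of file paths.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line for on-screen plots
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: 200 random values and a small matrix for the heatmap.
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)
grid = np.outer(np.arange(5), np.arange(5))

# One figure with three side-by-side plots.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(values, bins=20, color="tab:green", alpha=0.8)
axes[0].set_title("Histogram")

axes[1].boxplot(values)
axes[1].set_title("Box plot")

image = axes[2].imshow(grid, cmap="viridis")
axes[2].set_title("Heatmap (imshow)")
fig.colorbar(image, ax=axes[2])

fig.tight_layout()

# Save as PNG; a file name such as "plots.png" could be passed instead of a buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
png_bytes = buffer.getvalue()
```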
Effective Graph Layout Techniques

Technique | Description | Example Use
Force-Directed Layout | Simulates physical forces (e.g., attraction and repulsion) to position nodes and edges naturally. | Social network analysis.
Tree Layout | Represents hierarchical data with a tree structure. | Organizational charts or file systems.
Circular Layout | Arranges nodes in a circular pattern to highlight cycles or group relationships. | Visualizing feedback loops.
Geographical Layout | Maps data onto real-world geographic locations. | Traffic or weather data visualization.
Layered (Hierarchical) | Places nodes in layers, showing dependencies or flows clearly. | Flowcharts or decision trees.
Cluster Layout | Groups nodes into clusters based on similarity or relationships. | Communities in a network.

Key Considerations for Effective Graph Visualization

1. Simplicity: Avoid clutter by showing only necessary information.

2. Color and Size: Use them to highlight key nodes or edges.

3. Interactivity: Add zoom and filters for better exploration.

4. Readability: Ensure labels and connections are clear and distinct.


Force-Directed Layout (Simplified)

A Force-Directed Layout is a graph visualization technique that arranges nodes and edges naturally, as if
physical forces are acting on them.

How It Works

1. Attraction: Nodes connected by edges are drawn closer together.

2. Repulsion: Nodes that are not connected push away from each other (many algorithms apply this repulsion between every pair of nodes).

3. Balance: The algorithm adjusts positions until the forces stabilize, creating a visually appealing and clear
graph.
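
The attraction, repulsion, and balance steps can be sketched in NumPy. This is a simplified spring model in the style of the Fruchterman-Reingold algorithm, not a reference implementation: the force formulas (repulsion k^2 / d between every pair, attraction d^2 / k along edges), the cooling schedule, and the function name force_directed_layout are assumptions chosen for the example.

```python
import numpy as np

def force_directed_layout(n_nodes, edges, iterations=200, seed=0):
    """Return an (n_nodes, 2) array of positions from a simple spring model."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_nodes, 2))  # random starting positions
    k = 1.0 / np.sqrt(n_nodes)                       # preferred edge length
    step = 0.1                                       # largest move allowed per iteration
    for _ in range(iterations):
        delta = pos[:, None, :] - pos[None, :, :]    # delta[i, j] = pos[i] - pos[j]
        dist = np.linalg.norm(delta, axis=-1)
        np.fill_diagonal(dist, 1.0)                  # delta is zero there, value is irrelevant
        dist = np.maximum(dist, 1e-9)

        # Repulsion: every pair of nodes pushes apart with magnitude k^2 / d.
        disp = (delta / dist[..., None] * (k * k / dist)[..., None]).sum(axis=1)

        # Attraction: each edge pulls its endpoints together with magnitude d^2 / k.
        for u, v in edges:
            pull = delta[u, v] / dist[u, v] * (dist[u, v] ** 2 / k)
            disp[u] -= pull
            disp[v] += pull

        # Balance: move along the net force, capped at `step`, then cool down.
        length = np.maximum(np.linalg.norm(disp, axis=1, keepdims=True), 1e-9)
        pos += disp / length * np.minimum(length, step)
        step *= 0.98
    return pos

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
pos = force_directed_layout(6, edges)
print(pos.round(2))
```

Computing all pairwise forces costs O(n^2) per iteration, which is one reason force-directed layouts become expensive on large graphs.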

Key Features

• Natural Structure: Creates intuitive layouts for complex relationships.

• Interactive: Often used in dynamic or interactive visualizations.

Example Use Cases

• Social networks (showing friendships or connections).

• Knowledge graphs (representing related concepts).

• Network traffic (visualizing communication between devices).


Force-Directed Techniques in Multidimensional Scaling (MDS)

Force-Directed Techniques and Multidimensional Scaling (MDS) are popular methods for visualizing and
organizing complex data, particularly in graph structures and high-dimensional spaces.

Force-Directed Techniques

Aspect | Details
Definition | Force-directed techniques simulate physical forces to arrange nodes in a graph for clarity.
Key Components | Attraction: Nodes connected by edges attract each other. Repulsion: Unconnected nodes repel each other.
Algorithm Steps | 1. Initialize nodes randomly. 2. Apply attractive and repulsive forces iteratively. 3. Adjust positions until equilibrium is reached.
Advantages | Produces natural and intuitive layouts. Highlights clusters and relationships clearly.
Challenges | Computationally expensive for large datasets. May lead to overlapping nodes in dense graphs.

Multidimensional Scaling (MDS)

Aspect | Details
Definition | MDS reduces high-dimensional data to lower dimensions (2D or 3D) while preserving relative distances.
Key Components | Input: A distance matrix representing pairwise similarities or dissimilarities. Output: A low-dimensional plot of points.
Algorithm Steps | 1. Compute a similarity or distance matrix. 2. Minimize a stress function to position nodes while maintaining distances.
Advantages | Visualizes high-dimensional data effectively. Highlights relationships and patterns.
Challenges | Computationally expensive for very large datasets. May lose some data fidelity in reduction.
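
The two MDS steps (compute a distance matrix, then minimize stress) can be sketched with plain gradient descent in NumPy. This is an illustrative assumption rather than a standard library routine: scikit-learn's MDS, for instance, uses the SMACOF algorithm instead. The stress here is the sum of squared differences between current and target distances, and the function names are made up for the example.

```python
import numpy as np

def stress(points, distances):
    """Sum over pairs of (current distance - target distance)^2."""
    current = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return float(((current - distances) ** 2).sum() / 2.0)

def mds_stress_descent(distances, n_components=2, iterations=500,
                       learning_rate=0.01, seed=0):
    """Place points so their pairwise distances approximate `distances`."""
    n = distances.shape[0]
    rng = np.random.default_rng(seed)
    points = rng.normal(scale=0.1, size=(n, n_components))  # random start
    for _ in range(iterations):
        delta = points[:, None, :] - points[None, :, :]
        current = np.linalg.norm(delta, axis=-1)
        np.fill_diagonal(current, 1.0)       # delta is zero there, value is irrelevant
        current = np.maximum(current, 1e-9)
        error = current - distances
        # Gradient of the stress with respect to each point.
        grad = 2.0 * ((error / current)[..., None] * delta).sum(axis=1)
        points -= learning_rate * grad
    return points

# Step 1: distance matrix for the four corners of a unit square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
target = np.linalg.norm(square[:, None, :] - square[None, :, :], axis=-1)

# Step 2: minimize stress to recover a 2D arrangement with similar distances.
embedding = mds_stress_descent(target)
print(round(stress(embedding, target), 4))
```

Each update moves a pair of points together when they are too far apart and pushes them apart when they are too close, so the procedure behaves like a network of springs, which is the link between MDS and force-directed layouts.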
Force-Directed Techniques in MDS

Force-directed techniques can be combined with MDS to visualize high-dimensional data in a graph-like
structure:

1. Apply MDS to reduce high-dimensional data into 2D or 3D coordinates.

2. Use force-directed algorithms to refine the layout, improving clarity and interpretability of
relationships.

3. Fine-tune the visualization by adjusting parameters like node size, edge length, or attraction forces.

Advantages of Combining Force-Directed Techniques with MDS

• Creates layouts that are both data-accurate (preserving distances) and visually appealing (structured
intuitively).

• Useful for datasets that are both high-dimensional and relational, such as social networks or gene
expression data.

Solution to Explain Force-Directed Techniques in MDS (7 Marks)

1. Introduction: Define force-directed techniques and MDS.

2. Mechanism: Explain the working of force-directed algorithms (attraction/repulsion forces).

3. MDS Process: Describe how MDS reduces high-dimensional data into lower dimensions.

4. Combination: Illustrate how force-directed techniques refine MDS outputs for clearer visualization.

5. Advantages: Mention the benefits of using these techniques together (e.g., preserving accuracy,
improving interpretability).

6. Applications: Highlight real-world use cases (e.g., social networks, hierarchical data).

7. Conclusion: Emphasize the effectiveness of this approach in simplifying and visualizing complex
datasets.
Bipartite Graphs #V.V.IMP

(Also called a bigraph: a type of graph in graph theory that consists of two sets of nodes.)

A bipartite graph is a type of graph where nodes (vertices) can be divided into two distinct sets, and edges
(connections) exist only between nodes from different sets. No edges are present within the same set.

Key Characteristics of Bipartite Graphs

1. Two Distinct Sets: Nodes are split into two groups, typically denoted as U and V.

2. No Intra-set Connections: Edges only occur between nodes in U and V, not within the same set.

3. No Odd-length Cycles: Bipartite graphs do not contain any cycles of odd length.

4. Two-Colorable: Nodes can be colored with two colors such that no two adjacent vertices share the
same color.

5. Visualization: Typically represented with nodes of one set on one side and nodes of the other on the
opposite side, with edges connecting them.

How to Identify a Bipartite Graph

1. Start with any vertex and assign it to one set (e.g., U).

2. Assign its neighbors to the other set (e.g., V).

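3. Repeat this process for every newly assigned vertex, placing its neighbors in the opposite set, until all vertices are assigned (start again from any unassigned vertex if the graph is disconnected).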

4. If any adjacent vertices are assigned to the same set, the graph is not bipartite.
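
The identification procedure is a breadth-first two-coloring. A minimal sketch, assuming the graph is given as a dictionary that maps each vertex to a list of its neighbors (the function name two_color and the sample graphs are made up for the example):

```python
from collections import deque

def two_color(adjacency):
    """Return {vertex: 0 or 1} if the graph is bipartite, otherwise None."""
    side = {}
    for start in adjacency:              # restart in every connected component
        if start in side:
            continue
        side[start] = 0                  # put the starting vertex in set U (0)
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            for neighbor in adjacency[vertex]:
                if neighbor not in side:
                    side[neighbor] = 1 - side[vertex]   # opposite set
                    queue.append(neighbor)
                elif side[neighbor] == side[vertex]:
                    return None          # adjacent vertices in the same set
    return side

# A 4-cycle is bipartite; a triangle (odd cycle) is not.
square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["a", "c"]}
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}

print(two_color(square))
print(two_color(triangle))
```

Vertices mapped to 0 form set U and vertices mapped to 1 form set V, which is also a valid two-coloring of the graph.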

Why Bipartite Graphs are Important || Advantages

Bipartite graphs are widely used to solve complex problems in:

• Network theory for connectivity.

• Algorithm design for matching and optimization.

• Data analysis in diverse fields like social networking, bioinformatics, and recommendation systems.
Applications of Bipartite Graphs

Application | Description | Example
Recommender Systems | Links users to items like movies or products. | User-movie recommendation engines.
Matching Problems | Assign tasks to workers, students to courses, or stable marriage pairings. | Job matching, college admissions.
Social Networks | Represents relationships between users and groups. | User-group affiliations.
Biological Networks | Models protein-protein interactions or genes and their functions. | Protein-function mapping.
Web Mining | Represents relationships between keywords and documents. | Search engine keyword analysis.

TEXTBOOK DEFINITION
Hierarchical Indexing

Hierarchical Indexing (or multi-level indexing) is a feature in pandas that allows you to create multiple levels of
indices (row or column labels) in a dataset. It enables working with higher-dimensional data in a 1D or 2D
format, making data organization and access easier.

Key Features of Hierarchical Indexing

1. Multiple Levels: Data can have multiple index levels for rows or columns.

2. Efficient Access: Enables quick access to subsets of data based on one or more index levels.

3. Flexibility: Helps represent complex data structures compactly.

Example of Hierarchical Indexing

Dataset: Sales Data for Products Across Two Regions

Region | City      | Product | Sales
North  | Delhi     | A       | 100
North  | Delhi     | B       | 150
North  | Jaipur    | A       | 200
South  | Bangalore | A       | 250
South  | Hyderabad | B       | 300

Hierarchical Index Implementation

Region | City      | Product | Sales
North  | Delhi     | A       | 100
       |           | B       | 150
       | Jaipur    | A       | 200
South  | Bangalore | A       | 250
       | Hyderabad | B       | 300

• Region and City are hierarchical row indices.

• Accessing data becomes easy:

o Example: To access sales for North-Delhi:


Sales.loc[('North', 'Delhi')]

o Result: the two North-Delhi rows (Product A with Sales 100 and Product B with Sales 150).
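
The sales example can be rebuilt as a runnable sketch. It assumes the table is held in a DataFrame named sales (lowercase, unlike the snippet above) with Region and City set as the two index levels.

```python
import pandas as pd

# Sales data from the example, with Region and City as hierarchical row indices.
sales = pd.DataFrame({
    "Region": ["North", "North", "North", "South", "South"],
    "City": ["Delhi", "Delhi", "Jaipur", "Bangalore", "Hyderabad"],
    "Product": ["A", "B", "A", "A", "B"],
    "Sales": [100, 150, 200, 250, 300],
}).set_index(["Region", "City"]).sort_index()

# Select by both levels: all rows for North-Delhi.
north_delhi = sales.loc[("North", "Delhi")]

# Select by the outer level only: every city in the North region.
north = sales.loc["North"]

# Aggregate by one index level: total sales per region.
region_totals = sales.groupby(level="Region")["Sales"].sum()

print(north_delhi)
print(region_totals)
```

Sorting the index first keeps label-based lookups on a MultiIndex efficient.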
Advantages of Hierarchical Indexing

1. Efficient representation of multi-dimensional data.

2. Simplifies data slicing and aggregation by levels.

3. Makes working with grouped data intuitive.

Conclusion

Hierarchical indexing is a powerful tool for managing multi-dimensional datasets in pandas. By organizing data
into multiple index levels, it enhances readability, accessibility, and efficiency, making it especially useful for
analysis in fields like sales, finance, and research.

--------------------------------------------------------RK-------------------------------------------------------------------------

REVISION

For syllabus topics like Data Wrangling, Visualization, and Graph Representations, use this simple and
reusable template to create answers.

1. Definition/Introduction

Define the topic briefly.

• Example for Data Wrangling:


"Data wrangling is the process of cleaning, transforming, and organizing raw data into a usable format for
analysis."

2. Key Concepts/Features

List the main components or techniques.

• For Data Visualization:

o "Matplotlib: A fundamental library for static visualizations in Python."

o "Seaborn: Built on matplotlib for statistical plots."

o "Pandas: Provides easy plotting options for dataframes."

3. Example

Always include a concise, practical example.

• For Hierarchical Indexing:


"Hierarchical indexing organizes data across multiple levels, making subsets easy to access. E.g., in
sales data, regions and cities can be hierarchical indices for efficient grouping."
4. Applications

Discuss where the concept is applied.

• For Graph Layout Techniques:


"Used in social networks, biological networks, and visualizing complex relationships in data."

5. Advantages/Importance

Highlight why the topic is useful.

• For Combining and Merging Data Sets:


"Combining datasets simplifies analysis by integrating related information, improving accuracy in insights."

Reusable Answer Structure

You can adapt this approach for:

1. Hierarchical Indexing: Definition → Features → Example → Applications.

2. Graph Layout Techniques: Types (Force-directed, Bipartite) → Examples → Uses (social networks,
recommender systems).

3. Visualization Tools: Define libraries (Matplotlib, Seaborn) → Basic functions → Use case examples.

------------------------------------------------------RK----------------------------------------------------------------------------
