Unit II - Final

This document covers the fundamentals of NumPy, SciPy, and Pandas for data handling in Python, focusing on array operations, matrix operations, and statistical functions. It introduces key concepts such as array creation, slicing, indexing, and various mathematical functions provided by NumPy. Additionally, it highlights the importance of these libraries in data science and machine learning applications.

Notes: Data Analytics with Python (B.Sc. AI and ML SY)
UNIT II: NumPy, SciPy and Pandas for Data Handling


2.1 Introduction to NumPy: Arrays, Array Operations
2.2 NumPy Functions: arange, linspace, reshape, slicing, indexing, etc.
2.3 Matrix Operations, Statistical Functions in NumPy
2.4 Introduction to SciPy: Sub-packages
2.5 Basics of Pandas: Series and DataFrame
2.6 DataFrame Operations: Reading, Writing, Indexing, Filtering
2.7 Handling Missing Values and Duplicates
2.8 Data Aggregation, Grouping, and Sorting
2.9 Merging, Joining and Concatenating DataFrames
2.10 Data Cleaning and Transformation using Pandas

2.1 INTRODUCTION TO NUMPY: ARRAYS AND ARRAY OPERATIONS

NumPy (Numerical Python) is a Python library for numerical computing, built around working with arrays.

It also has functions for linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can
use it freely.
Python lists can serve the purpose of arrays, but they are slow to process.
NumPy is often used along with packages such as SciPy (Scientific Python)
and Matplotlib (a plotting library).
NumPy aims to provide an array object that is up to 50x faster than traditional Python
lists.
The array object in NumPy is called ndarray; it comes with many supporting functions
that make working with it easy.
Arrays are used very frequently in data science, where speed and memory use are
important.
NumPy is written partially in Python, but most of the parts that require fast
computation are written in C or C++.

Prepared By : Dr. D. R. Somwanshi (M.Sc. CS. M.Phil., P.hD, NET), COCSIT Latur

It provides:

• Fast and efficient array operations
• Tools for mathematical, logical, and statistical operations
• Multidimensional array support (n-dimensional arrays)

If you have Python and PIP already installed on a system, then installation of NumPy is
very easy.

NumPy is the foundation of data science and machine learning in Python.

Real-Life Applications of NumPy

1. Data analysis with Pandas (built on NumPy)
2. Machine learning with scikit-learn, TensorFlow
3. Image processing with OpenCV
4. Simulation and numerical research
5. Signal and audio processing

Installing NumPy

pip install numpy

NumPy Arrays

What is an Array?

A NumPy array is a grid of values (all of the same data type) indexed by a tuple of
non-negative integers.
It’s similar to a Python list but much faster and more powerful.

import numpy as np

arr = np.array([1, 2, 3, 4, 5])


print(arr)
# Output: [1 2 3 4 5]

Types of Arrays

Type        Example
1D array    np.array([1, 2, 3])
2D array    np.array([[1, 2], [3, 4]])
3D array    np.array([[[1], [2]], [[3], [4]]])

Creating Arrays

np.zeros((2, 3))       # 2x3 array of zeros
np.ones((3, 3))        # 3x3 array of ones
np.eye(3)              # 3x3 identity matrix
np.arange(0, 10, 2)    # Values from 0 up to (not including) 10 in steps of 2
np.linspace(0, 1, 5)   # 5 evenly spaced values from 0 to 1 (both included)

Array Operations

NumPy allows vectorized operations, meaning operations are applied element-wise
without loops.
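As a small illustration of what "without loops" means, the sketch below computes the same element-wise sum twice: once with an explicit Python loop over lists and once with a single NumPy expression. The variable names and values are made up for this example.

```python
import numpy as np

# Two small inputs (illustrative values only)
x_list = [1, 2, 3, 4]
y_list = [10, 20, 30, 40]

# Loop version: add the elements one pair at a time
loop_result = []
for x_val, y_val in zip(x_list, y_list):
    loop_result.append(x_val + y_val)

# Vectorized version: one expression, applied to every element
x = np.array(x_list)
y = np.array(y_list)
vector_result = x + y

print(loop_result)     # [11, 22, 33, 44]
print(vector_result)   # [11 22 33 44]
```

Both versions give the same numbers; the vectorized form is shorter and lets NumPy do the looping in compiled code.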

Arithmetic Operations

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b) # [5 7 9]
print(a * b) # [ 4 10 18]
print(a ** 2) # [1 4 9]

Scalar Operations

print(a + 5) # [6 7 8]
print(a * 2) # [2 4 6]

Comparison and Boolean Arrays

print(a > 2) # [False False True]

Matrix Operations

A = np.array([[1, 2], [3, 4]])


B = np.array([[2, 0], [1, 3]])

print(A + B) # Element-wise addition


print(np.dot(A, B)) # Matrix multiplication
print(A.T) # Transpose of matrix

Array Attributes

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.shape) # (2, 3)
print(arr.ndim) # 2
print(arr.size) # 6
print(arr.dtype) # int32 (depends on system)

Indexing and Slicing

arr = np.array([10, 20, 30, 40, 50])
print(arr[1:4]) # [20 30 40]

For 2D arrays:

matrix = np.array([[1, 2, 3], [4, 5, 6]])


print(matrix[1, 2]) # 6
print(matrix[:, 1]) # [2 5] → All rows, column 1
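The row and column positions can each take a slice, so the same notation also extracts whole rows, whole columns, or a rectangular block. A short sketch using the same matrix (the variable names first_row, last_col and block are only for illustration):

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

first_row = matrix[0]          # whole first row
last_col = matrix[:, -1]       # last column of every row
block = matrix[0:2, 1:3]       # rows 0-1, columns 1-2

print(first_row)   # [1 2 3]
print(last_col)    # [3 6]
print(block)
# [[2 3]
#  [5 6]]
```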

2.2 NUMPY FUNCTIONS: ARANGE, LINSPACE, RESHAPE, SLICING, INDEXING, ETC.

1. np.arange()

arange() is used to create an array with regularly spaced values. It works like the
Python range() function but returns a NumPy array.

Syntax:

np.arange(start, stop, step)

• start: Starting value (default is 0)
• stop: Stop before this value
• step: Interval between values (default is 1)

Example:

import numpy as np
arr = np.arange(1, 10, 2)
print(arr)

Output:

[1 3 5 7 9]

2. np.linspace()

linspace() creates an array of evenly spaced numbers between a start and stop value.
Useful when you know how many points you want, not the step.

Syntax:

np.linspace(start, stop, num)

• start: Starting value
• stop: Ending value
• num: Number of values to generate

Example:

arr = np.linspace(0, 5, 6)
print(arr)

Output:

[0. 1. 2. 3. 4. 5.]
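The sketch below shows two optional arguments of np.linspace() that make the difference from arange() easier to see: retstep, which also returns the spacing, and endpoint, which controls whether the stop value is included. The variable names are chosen only for this example.

```python
import numpy as np

# linspace: ask for a number of points; both ends are included by default
points = np.linspace(0, 1, 5)
print(points)          # [0.   0.25 0.5  0.75 1.  ]

# retstep=True also returns the spacing that linspace chose
values, step = np.linspace(0, 1, 5, retstep=True)
print(step)            # 0.25

# endpoint=False leaves the stop value out, like arange does
open_end = np.linspace(0, 1, 5, endpoint=False)
print(open_end)        # [0.  0.2 0.4 0.6 0.8]
```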

3. np.reshape()

reshape() changes the shape of an existing array without changing the data. It returns
an array with the same data in a different layout; where possible this is a view that
shares memory with the original rather than a copy.

Syntax:

array.reshape(new_shape)

• new_shape: The new dimensions, e.g. (rows, columns), given as a tuple or as separate integers

Example:

arr = np.arange(6)
reshaped = arr.reshape(2, 3)
print(reshaped)

Output:

[[0 1 2]
 [3 4 5]]
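Two details of reshape() are worth trying out: a dimension given as -1 is worked out automatically, and the result usually shares its data with the original array. A minimal sketch, with illustrative variable names:

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to work out that dimension from the array size
auto = arr.reshape(3, -1)
print(auto.shape)      # (3, 4)

# For a contiguous array such as this one, reshape returns a view,
# so writing through the reshaped array also changes the original
auto[0, 0] = 99
print(arr[0])          # 99

# The total number of elements must stay the same
try:
    arr.reshape(5, 3)
except ValueError as err:
    print("reshape failed:", err)
```

If an independent array is needed, call .copy() on the reshaped result.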

4. Slicing in NumPy

Slicing allows you to extract specific portions of an array using index ranges.

Syntax:

array[start:stop:step]

• start: Starting index
• stop: Ending index (not included)
• step: Step size

Example:

arr = np.array([10, 20, 30, 40, 50])
print(arr[1:4]) # From index 1 to 3
print(arr[::-1]) # Reverse the array

Output:

[20 30 40]
[50 40 30 20 10]

5. Indexing in NumPy

Indexing is used to access specific elements of an array. NumPy supports several types
of indexing:

a. Basic Indexing

Access elements using index numbers.

arr = np.array([10, 20, 30])


print(arr[1]) # Output: 20

b. Boolean Indexing

Use a condition to filter values.

arr = np.array([5, 10, 15, 20])


print(arr[arr > 10]) # Output: [15 20]

c. Fancy Indexing

Use an array of indexes to pick specific elements.

arr = np.array([10, 20, 30, 40, 50])


print(arr[[0, 2, 4]]) # Output: [10 30 50]
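The three kinds of indexing can be combined in practice. The sketch below, using made-up values, shows a compound condition, assignment through a Boolean mask, and fancy indexing on the rows of a 2D array.

```python
import numpy as np

arr = np.array([5, 10, 15, 20, 25])

# Combine conditions with & (and) and | (or); each condition needs parentheses
middle = arr[(arr > 5) & (arr < 25)]
print(middle)      # [10 15 20]

# Boolean indexing can also be used to assign values
clipped = arr.copy()
clipped[clipped > 15] = 15
print(clipped)     # [ 5 10 15 15 15]

# Fancy indexing on a 2D array: pick rows 0 and 2
grid = np.array([[1, 2], [3, 4], [5, 6]])
picked_rows = grid[[0, 2]]
print(picked_rows)
# [[1 2]
#  [5 6]]
```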

6. Commonly Used NumPy Functions

Function             Description                                 Example
np.zeros((2, 3))     Creates a 2x3 array of zeros                [[0. 0. 0.], [0. 0. 0.]]
np.ones((2, 2))      Creates a 2x2 array of ones                 [[1. 1.], [1. 1.]]
np.eye(3)            Identity matrix of size 3x3                 [[1. 0. 0.], [0. 1. 0.], [0. 0. 1.]]
np.random.rand(2)    Generates 2 random values between 0 and 1   [0.34 0.89] (random)
np.sum(arr)          Sum of all array elements                   np.sum([1, 2, 3]) → 6
np.mean(arr)         Average (mean) of values                    np.mean([4, 6]) → 5.0
np.max(arr)          Maximum value in the array                  np.max([4, 9, 1]) → 9
np.min(arr)          Minimum value in the array                  np.min([4, 9, 1]) → 1

Summary

Function       Description
arange()       Creates array using a range and step
linspace()     Creates evenly spaced values between two points
reshape()      Changes shape of array (e.g., 1D to 2D)
slicing        Gets part of array using index range
indexing       Access or filter array elements
Other funcs    zeros(), ones(), eye(), sum(), etc.

2.3 MATRIX OPERATIONS, STATISTICAL FUNCTIONS IN NUMPY

Matrix Operations in NumPy

What is a Matrix?

A matrix is a 2D array of numbers arranged in rows and columns. In NumPy, matrices
are created using 2D arrays.

Example:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

1. Matrix Addition

Adds two matrices element-wise.

C = A + B
print(C)

Output:

[[ 6  8]
 [10 12]]

2. Matrix Subtraction

Subtracts elements of one matrix from another.

C = A - B
print(C)

Output:

[[-4 -4]
 [-4 -4]]

3. Matrix Multiplication (Element-wise)

Multiplies elements in the same position (NOT dot product).

C = A * B
print(C)

Output:

[[ 5 12]
 [21 32]]

4. Matrix Multiplication (Dot Product)

Multiplies matrices using the row-by-column rule of matrix multiplication.

C = A.dot(B)
# OR
C = np.matmul(A, B)
# OR, with the @ operator
C = A @ B
print(C)

Output:

[[19 22]
 [43 50]]

5. Transpose of a Matrix

Flips rows and columns.

print(A.T)

Output:

[[1 3]
 [2 4]]

6. Inverse of a Matrix

Defined only for square (n x n) matrices whose determinant is not zero.

A_inv = np.linalg.inv(A)
print(A_inv)
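A quick way to check an inverse is to multiply it by the original matrix and compare the result with the identity matrix. The sketch below does this and also shows what happens for a matrix that has no inverse; the matrix S and the name singular_error are made up for this example.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A)

# A matrix times its inverse should give the identity matrix
# (allclose allows for tiny floating-point rounding errors)
product = A @ A_inv
print(np.allclose(product, np.eye(2)))   # True

# A singular matrix (determinant 0) has no inverse
S = np.array([[1, 2], [2, 4]])
singular_error = None
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError as err:
    singular_error = err
    print("cannot invert:", err)
```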

7. Determinant of a Matrix

Gives a single number computed from a square matrix. The matrix has an inverse only when this number is not zero.

det = np.linalg.det(A)
print(det)

8. Identity Matrix

Matrix with 1s on the diagonal, 0s elsewhere.

I = np.eye(3)
print(I)

9. Diagonal of a Matrix

np.diag() extracts the diagonal of a 2D array, or builds a diagonal matrix when given a 1D array.

np.diag(A) # Gets diagonal → [1 4]

Statistical Functions in NumPy

These functions help analyze and summarize numeric data.

1. np.mean()

Finds the average value of elements.

arr = np.array([1, 2, 3, 4])


print(np.mean(arr)) # Output: 2.5

2. np.median()

Finds the middle value when data is sorted.

arr = np.array([10, 20, 30, 40])


print(np.median(arr)) # Output: 25.0

3. np.std()

Calculates the standard deviation (spread of data around the mean). By default NumPy
uses the population formula, which divides by n.

arr = np.array([10, 20, 30])
print(np.std(arr)) # Output: 8.16...

4. np.var()

Calculates variance (square of std. deviation).

print(np.var(arr)) # Output: 66.66...
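Many statistics courses use the sample formulas, which divide by n - 1 instead of n. In NumPy this is selected with the ddof argument, as the following sketch shows for the same array.

```python
import numpy as np

arr = np.array([10, 20, 30])

# Default (ddof=0): divide by n, the population formula
pop_var = np.var(arr)
print(pop_var)                # 66.66...

# ddof=1: divide by n - 1, the sample formula
sample_var = np.var(arr, ddof=1)
sample_std = np.std(arr, ddof=1)
print(sample_var)             # 100.0
print(sample_std)             # 10.0

# Variance is the square of the standard deviation
print(np.isclose(np.std(arr) ** 2, pop_var))   # True
```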

5. np.min() and np.max()

Finds the smallest and largest value.

arr = np.array([3, 6, 9])


print(np.min(arr)) # 3
print(np.max(arr)) # 9

6. np.sum()

Adds all values.

print(np.sum(arr)) # 18

7. np.percentile()

Finds the value below which a given % of data falls.

arr = np.array([10, 20, 30, 40])


print(np.percentile(arr, 50)) # Output: 25.0 (median)
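np.percentile() also accepts a list of percentages, which is a convenient way to get the quartiles in one call. The sketch below uses NumPy's default (linear interpolation) method; the names q1, q2, q3 and iqr are chosen for this example.

```python
import numpy as np

arr = np.array([10, 20, 30, 40])

# Several percentiles can be requested in one call
q1, q2, q3 = np.percentile(arr, [25, 50, 75])
print(q1, q2, q3)      # 17.5 25.0 32.5

# The interquartile range (IQR) is a common measure of spread
iqr = q3 - q1
print(iqr)             # 15.0
```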

8. Axis-Based Calculations

You can apply statistical functions row-wise (axis=1) or column-wise (axis=0) for 2D
arrays.

matrix = np.array([[1, 2], [3, 4]])


np.mean(matrix, axis=0) # Column-wise mean → [2.0 3.0]
np.mean(matrix, axis=1) # Row-wise mean → [1.5 3.5]
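The axis argument is easier to follow on a non-square array, where the row and column results have different lengths. A small sketch with made-up data:

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])   # 2 rows, 3 columns

col_totals = scores.sum(axis=0)
row_totals = scores.sum(axis=1)

print(col_totals)           # [5 7 9]   one total per column
print(row_totals)           # [ 6 15]   one total per row
print(scores.max(axis=0))   # [4 5 6]
print(scores.mean())        # 3.5       no axis: all elements
```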

2.4 INTRODUCTION TO SCIPY: SUB-PACKAGES

SciPy is a scientific computation library that uses NumPy underneath.
SciPy stands for Scientific Python.
It provides additional utility functions for optimization, statistics and signal processing.
Like NumPy, SciPy is open source, so we can use it freely.
SciPy was created by NumPy's creator, Travis Oliphant.
SciPy adds optimized versions of functions that are frequently used with NumPy and in
data science.
SciPy is predominantly written in Python, but a few segments are written in C.
Installation of SciPy
If you have Python and pip already installed on a system, then installing SciPy is
very easy.
Install it using this command:
pip install scipy
If this command fails, use a Python distribution that already has SciPy installed,
such as Anaconda (which also ships the Spyder IDE).

Import SciPy
Once SciPy is installed, import the SciPy module(s) you want to use in your application
with a statement of the form "from scipy import module":
from scipy import constants
The constants module is now imported and ready to use.
Example
How many cubic meters are in one liter:
from scipy import constants
print(constants.liter)
constants: SciPy offers a set of mathematical and physical constants; one of them is liter,
which holds the value of 1 liter in cubic meters.

Checking SciPy Version


The version string is stored under the __version__ attribute.
Example
import scipy
print(scipy.__version__)

Constants in SciPy
As SciPy is more focused on scientific implementations, it provides many built-in
scientific constants.
These constants can be helpful when you are working with Data Science.
PI is an example of a scientific constant.
Example
Print the constant value of PI:
from scipy import constants
print(constants.pi)
Constant Units
A list of all units under the constants module can be seen using the dir() function.
Example
List all constants:
from scipy import constants
print(dir(constants))
Unit Categories
The units are placed under these categories:
• Metric
• Binary
• Mass
• Angle
• Time
• Length
• Pressure
• Area
• Volume
• Speed
• Temperature
• Energy
• Power
• Force
Metric (SI) Prefixes:
Return the multiplier for the specified SI prefix (e.g. centi returns 0.01)

Example
from scipy import constants
print(constants.yotta) #1e+24
print(constants.zetta) #1e+21
print(constants.exa) #1e+18
print(constants.peta) #1000000000000000.0
print(constants.tera) #1000000000000.0
print(constants.giga) #1000000000.0
print(constants.mega) #1000000.0
print(constants.kilo) #1000.0
print(constants.hecto) #100.0
print(constants.deka) #10.0
print(constants.deci) #0.1
print(constants.centi) #0.01

print(constants.milli) #0.001
print(constants.micro) #1e-06
print(constants.nano) #1e-09
print(constants.pico) #1e-12
print(constants.femto) #1e-15
print(constants.atto) #1e-18
print(constants.zepto) #1e-21
Binary Prefixes:
Return the multiplier for the specified binary prefix (e.g. kibi returns 1024)
Example
from scipy import constants
print(constants.kibi) #1024
print(constants.mebi) #1048576
print(constants.gibi) #1073741824
print(constants.tebi) #1099511627776
print(constants.pebi) #1125899906842624
print(constants.exbi) #1152921504606846976
print(constants.zebi) #1180591620717411303424
print(constants.yobi) #1208925819614629174706176
Mass:
Return the specified unit in kg (e.g. gram returns 0.001)
Example
from scipy import constants
print(constants.gram) #0.001
print(constants.metric_ton) #1000.0
print(constants.grain) #6.479891e-05
print(constants.lb) #0.45359236999999997
print(constants.pound) #0.45359236999999997
print(constants.carat) #0.0002

Angle:
Return the specified unit in radians (e.g. degree returns 0.017453292519943295)
Example
from scipy import constants
print(constants.degree) #0.017453292519943295
print(constants.arcsecond) #4.84813681109536e-06
Time:
Return the specified unit in seconds (e.g. hour returns 3600.0)
Example
from scipy import constants
print(constants.minute) #60.0
print(constants.hour) #3600.0
print(constants.day) #86400.0

print(constants.week) #604800.0
print(constants.year) #31536000.0

Length:
Return the specified unit in meters (e.g. nautical_mile returns 1852.0)
Example
from scipy import constants
print(constants.inch) #0.0254
print(constants.foot) #0.30479999999999996
print(constants.yard) #0.9143999999999999
print(constants.mile) #1609.3439999999998
print(constants.mil) #2.5399999999999997e-05
print(constants.pt) #0.00035277777777777776
print(constants.point) #0.00035277777777777776

Pressure:
Return the specified unit in pascals (e.g. psi returns 6894.757293168361)
Example
from scipy import constants
print(constants.atm) #101325.0
print(constants.atmosphere) #101325.0
print(constants.bar) #100000.0

Area:
Return the specified unit in square meters(e.g. hectare returns 10000.0)
Example
from scipy import constants
print(constants.hectare) #10000.0
print(constants.acre) #4046.8564223999992
Volume:
Return the specified unit in cubic meters (e.g. liter returns 0.001)
Example
from scipy import constants
print(constants.liter) #0.001
print(constants.litre) #0.001
print(constants.gallon) #0.0037854117839999997

Speed:
Return the specified unit in meters per second (e.g. speed_of_sound returns 340.5)
Example
from scipy import constants
print(constants.kmh) #0.2777777777777778

print(constants.mph) #0.44703999999999994
print(constants.mach) #340.5

Temperature:
Return the specified unit in Kelvin (e.g. zero_Celsius returns 273.15)
Example
from scipy import constants
print(constants.zero_Celsius) #273.15
print(constants.degree_Fahrenheit) #0.5555555555555556
Energy:
Return the specified unit in joules (e.g. calorie returns 4.184)
Example
from scipy import constants
print(constants.eV) #1.602176634e-19 (older SciPy releases print 1.6021766208e-19)
print(constants.electron_volt) #1.602176634e-19
print(constants.calorie) #4.184

Power:
Return the specified unit in watts (e.g. horsepower returns 745.6998715822701)
Example
from scipy import constants
print(constants.hp) #745.6998715822701
print(constants.horsepower) #745.6998715822701
Force:
Return the specified unit in newton (e.g. kilogram_force returns 9.80665)
Example
from scipy import constants
print(constants.dyn) #1e-05
print(constants.dyne) #1e-05
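Because each unit constant holds the size of one unit expressed in SI units, multiplying by it converts a quantity to SI and dividing converts back. The sketch below applies this to a few made-up quantities; only constants listed above are used.

```python
from scipy import constants

# Each unit constant is "how many SI units are in one of these",
# so multiplying converts TO SI and dividing converts FROM SI.
marathon_miles = 26.2                          # illustrative value
marathon_m = marathon_miles * constants.mile   # miles -> meters
marathon_km = marathon_m / constants.kilo      # meters -> kilometers
print(round(marathon_km, 2))                   # 42.16

speed_kmh = 90
speed_ms = speed_kmh * constants.kmh           # km/h -> m/s
print(speed_ms)                                # about 25.0

seconds = 2 * constants.hour + 30 * constants.minute
print(seconds)                                 # 9000.0
```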

2.5 BASICS OF PANDAS: SERIES AND DATAFRAME

Pandas is an open-source Python library that provides high-performance data
manipulation and analysis tools built on its powerful data structures.
The name Pandas is derived from "Panel Data", an econometrics term for multidimensional data.

In 2008, developer Wes McKinney started developing pandas when he needed a
high-performance, flexible tool for data analysis.

Before Pandas, Python was used mainly for data preparation and contributed very
little to data analysis itself.

Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the
processing and analysis of data, regardless of the origin of the data: load, prepare,
manipulate, model, and analyze.
Python with Pandas is used in a wide range of academic and commercial fields,
including finance, economics, statistics and analytics.
Key Features of Pandas
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns of a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High-performance merging and joining of data.
• Time series functionality.

To install Pandas, use the following command:

pip install pandas

• If you install the Anaconda Python distribution, Pandas is installed by default.

Pandas has historically provided the following three data structures −

• Series
• DataFrame
• Panel

Note − Panel was deprecated and has been removed from recent pandas releases;
three-dimensional data is now usually held in a DataFrame with a MultiIndex. Panel is
described here for completeness.

These data structures are built on top of NumPy arrays, which means they are fast.
Dimension & Description
The best way to think of these data structures is that the higher-dimensional data
structure is a container of its lower-dimensional data structure. For example, a DataFrame
is a container of Series, and a Panel is a container of DataFrames.

Data Structure   Dimensions   Description
Series           1            1D labelled homogeneous array, size immutable.
DataFrame        2            General 2D labelled, size-mutable tabular structure with
                              potentially heterogeneously typed columns.
Panel            3            General 3D labelled, size-mutable array.

All Pandas data structures are value mutable (their values can be changed), and all
except Series are size mutable. Series is size immutable.
Note − DataFrame is widely used and is one of the most important data structures. Panel
is used much less.

Series
Series is a one-dimensional array-like structure with homogeneous data. For example,
the following series is a collection of integers 10, 23, 56, …

10   23   56   17   52   61   73   90   26   72

Key Points
• Homogeneous data
• Size Immutable
• Values of Data Mutable
DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For example,

Name    Age   Gender   Rating
Steve   32    Male     3.45
Lia     28    Female   4.6
Vin     45    Male     3.9
Katie   38    Female   2.78

The table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column
represents an attribute and each row represents a person.
Data Type of Columns
The data types of the four columns are as follows −

Column   Type
Name     String
Age      Integer
Gender   String
Rating   Float

Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
Panel
Panel is a three-dimensional data structure with heterogeneous data. It is hard to
show a panel graphically, but it can be pictured as a container of DataFrames.
Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, Python objects, etc.). The axis labels are collectively called the index.
pandas.Series
A pandas Series can be created using the following constructor −
pandas.Series(data, index, dtype, copy)
The parameters of the constructor are as follows −

Sr.No   Parameter   Description
1       data        Takes various forms such as an ndarray, a list, a dict or a constant (scalar).
2       index       Index values must be hashable and the same length as data.
                    Defaults to np.arange(n) if no index is passed.
3       dtype       The data type. If None, the data type is inferred.
4       copy        Whether to copy the input data. Default False.

A series can be created using various inputs like −
• Array
• Dict
• Scalar value or constant
Create an Empty Series
The most basic series that can be created is an empty Series. Passing a dtype explicitly
keeps the result the same across pandas versions.
Example
#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series(dtype="float64")
print(s)
Its output is as follows −
Series([], dtype: float64)
Create a Series from ndarray
If data is an ndarray, then the index passed must be of the same length. If no index is
passed, then by default the index will be range(n), where n is the array length, i.e.,
[0, 1, 2, ..., len(array)-1].
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Its output is as follows −
0    a
1    b
2    c
3    d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging from 0
to len(data)-1, i.e., 0 to 3.
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Its output is as follows −
100    a
101    b
102    c
103    d
dtype: object
We passed the index values here. Now we can see the customized indexed values in the
output.
Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary keys are used to
construct the index (recent pandas versions keep the keys in insertion order, while older
versions sorted them). If an index is passed, the values in data corresponding to the
labels in the index will be pulled out.
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Its output is as follows −
a    0.0
b    1.0
c    2.0
dtype: float64
Observe − Dictionary keys are used to construct index.
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)
Its output is as follows −
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
Observe − Index order is persisted and the missing element is filled with NaN (Not a
Number).
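Once a Series contains NaN, the missing entries can be located and replaced. The sketch below reuses the Series from Example 2 and uses isna() and fillna(); the fill value 0.0 is only an illustrative choice.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

# isna() marks the labels that had no matching key in the dict
missing_count = s.isna().sum()
missing_labels = s[s.isna()].index.tolist()
print(missing_count)      # 1
print(missing_labels)     # ['d']

# fillna() returns a new Series with the gaps replaced
filled = s.fillna(0.0)
print(filled['d'])        # 0.0
```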
Create a Series from Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match
the length of the index.
#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Its output is as follows −
0    5
1    5
2    5
3    5
dtype: int64
Accessing Data from Series with Position
Data in the series can be accessed similar to that in an ndarray.
Example 1
Retrieve the first element. As we already know, the counting starts from zero for the
array, which means the first element is stored at zeroth position and so on.
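import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first element by position
#(.iloc is position-based; plain s[0] on a labelled Series is deprecated in recent pandas)
print(s.iloc[0])
Its output is as follows −
1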
Example 2
Retrieve the first three elements in the Series. If a : is placed in front of an index, all
items from the start up to (not including) that index are extracted. If two indexes are
given with a : between them, the items between the two indexes (not including the stop
index) are extracted.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first three elements
print(s.iloc[:3])
Its output is as follows −
a    1
b    2
c    3
dtype: int64
Example 3
Retrieve the last three elements.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the last three elements
print(s.iloc[-3:])
Its output is as follows −
c    3
d    4
e    5
dtype: int64
Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.
Example 1
Retrieve a single element using index label value.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print(s['a'])
Its output is as follows −
1
Example 2
Retrieve multiple elements using a list of index label values.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements
print(s[['a','c','d']])
Its output is as follows −
a    1
c    3
d    4
dtype: int64
Example 3
If a label is not contained, an exception is raised.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#try to retrieve a label that does not exist
print(s['f'])
Its output is as follows −
KeyError: 'f'
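The explicit accessors .iloc (position) and .loc (label) make it clear which kind of lookup is intended, and there are simple ways to avoid the KeyError shown above. A short sketch on the same Series; the default value -1 is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s.iloc[0])        # 1   position-based
print(s.loc['a'])       # 1   label-based

# Label slices with .loc include BOTH end labels
label_slice = s.loc['b':'d'].tolist()
print(label_slice)      # [2, 3, 4]

# Ways to avoid a KeyError for a missing label
has_f = 'f' in s        # checks the index labels
fallback = s.get('f', -1)
print(has_f)            # False
print(fallback)         # -1
```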

DATAFRAME

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion
in rows and columns.
Features of DataFrame
• Columns can be of different types
• Size-mutable
• Labelled axes (rows and columns)
• Arithmetic operations can be performed on rows and columns
Structure
Let us assume that we are creating a DataFrame with student data.

You can think of it as an SQL table or a spreadsheet data representation.


pandas.DataFrame
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame(data, index, columns, dtype, copy)
The parameters of the constructor are as follows −

Sr.No   Parameter   Description
1       data        Takes various forms such as an ndarray, Series, map, list, dict, constant
                    or another DataFrame.
2       index       Optional row labels for the resulting frame. Defaults to np.arange(n) if
                    no index is passed.
3       columns     Optional column labels. Defaults to np.arange(n) if no column labels
                    are passed.
4       dtype       Data type to apply to the columns.
5       copy        Whether to copy the input data (by default, ndarray and DataFrame
                    inputs are not copied).
Create DataFrame
A pandas DataFrame can be created using various inputs like −
• Lists
• dict
• Series
• NumPy ndarrays
• Another DataFrame
In the subsequent sections of this chapter, we will see how to create a DataFrame using
these inputs.
Create an Empty DataFrame
The most basic DataFrame that can be created is an empty DataFrame.
Example
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Its output is as follows −
Empty DataFrame
Columns: []
Index: []
Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.
Example 1
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
Its output is as follows −
   0
0  1
1  2
2  3
3  4
4  5

Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Its output is as follows −
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
Example 3
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
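df = pd.DataFrame(data,columns=['Name','Age'])
# Convert only the numeric column. Passing dtype=float to the constructor
# applies to every column, and recent pandas versions raise an error
# because the names cannot be converted to float.
df = df.astype({'Age': float})
print(df)
Its output is as follows −
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
Note − Observe that astype changes the type of the Age column to floating point. The dtype parameter of the constructor is suitable only when every column can be converted to the requested type.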
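The list of inputs at the start of this section also mentions NumPy ndarrays and Series, which the examples above do not cover. A minimal sketch of both follows; the variable names and sample values are illustrative assumptions, not taken from these notes.

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array: each row of the array becomes a DataFrame row
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_arr = pd.DataFrame(arr, columns=['x', 'y'])
print(df_arr)

# From a dict of Series: the row index is the union of the Series indexes,
# and positions with no value are filled with NaN
s1 = pd.Series([10, 20], index=['a', 'b'])
s2 = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])
df_ser = pd.DataFrame({'first': s1, 'second': s2})
print(df_ser)
```

Because s1 has no entry for label 'c', the 'first' column holds NaN in that row.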

2.6 DATAFRAME OPERATIONS: READING, WRITING, INDEXING, FILTERING

A DataFrame can also be created from a dict of equal-length lists; the dict keys become the column labels.
Example 1
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
Its output is as follows −
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
Note − Observe the values 0,1,2,3. They are the default index assigned to the rows, equivalent to range(n).
Example 2
Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Its output is as follows −
        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
Note − Observe, the index parameter assigns an index to each row.
Read CSV Files using Pandas
A simple way to store large data sets is to use CSV (comma-separated values) files.
CSV files contain plain text and are a well-known format that can be read by most tools,
including Pandas.
In our examples we will be using a CSV file called 'data.csv'.
Example
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())

Tip: use to_string() to print the entire DataFrame.


If you have a large DataFrame with many rows, Pandas will print only the first 5 rows
and the last 5 rows:
Example
Print the DataFrame without the to_string() method:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
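The section title also mentions writing. As a hedged sketch, a DataFrame can be written back out with to_csv; here an in-memory text buffer stands in for a file so that the example does not depend on 'data.csv' existing. The sample data and the variable names are assumptions for illustration.

```python
import io
import pandas as pd

df_out = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]})

# index=False leaves the row labels out of the CSV text
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the same text back recovers the columns and values
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```

With a real file, a path is passed instead of the buffer, for example df_out.to_csv('output.csv', index=False), where 'output.csv' is a hypothetical file name.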

Column Addition
We will understand this by adding a new column to an existing data frame.
Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Add a new column to an existing DataFrame by assigning a new Series to a column label
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10,20,30],index=['a','b','c'])
print(df)
print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one']+df['three']
print(df)
Its output is as follows −
Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN
Column Deletion

Columns can be deleted or popped; let us take an example to understand how.
Example
# Using the previous DataFrame, we will delete columns
# using the del statement and the pop function
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
# using del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)
# using pop function
print("Deleting another column using POP function:")
df.pop('two')
print(df)
Its output is as follows −
Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
Row Selection, Addition, and Deletion
We will now understand row selection, addition and deletion through examples. Let us
begin with the concept of selection.
Selection by Label
Rows can be selected by passing a row label to the loc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
Its output is as follows −
one    2.0
two    2.0
Name: b, dtype: float64
The result is a Series whose labels are the column names of the DataFrame, and the Name of
the Series is the label with which it was retrieved.
Selection by integer location
Rows can be selected by passing an integer position to the iloc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
Its output is as follows −
one    3.0
two    3.0
Name: c, dtype: float64
Slice Rows
Multiple rows can be selected using ‘ : ’ operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
Its output is as follows −
   one  two
c  3.0    3
d  NaN    4
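Besides selecting rows by label or position, rows can be filtered by a condition, which is the filtering operation named in this section's title. The sketch below uses a boolean mask; it reuses the Name/Age values from the earlier dict example, and the age thresholds are arbitrary assumptions.

```python
import pandas as pd

df_people = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                          'Age': [28, 34, 29, 42]})

# A comparison on a column yields a boolean Series (the mask)
mask = df_people['Age'] > 30
older = df_people[mask]
print(older)

# Combine conditions with & (and) or | (or); each condition needs parentheses
in_range = df_people[(df_people['Age'] > 28) & (df_people['Age'] < 40)]
print(in_range)
```

The first filter keeps Jack and Ricky; the second keeps Jack and Steve. The original row labels are preserved in both results.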
Addition of Rows
Add new rows to a DataFrame using the pd.concat function, which places the new rows at the
end. (Older tutorials use DataFrame.append for this; that method was removed in pandas 2.0.)
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
print(df)
Its output is as follows −
   a  b
0  1  2
1  3  4
0  5  6
1  7  8
Deletion of Rows
Use an index label to delete or drop rows from a DataFrame. If the label is duplicated,
multiple rows will be dropped.
Notice that in the above example the labels are duplicated. Let us drop a label and see
how many rows get dropped.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
# Drop rows with label 0
df = df.drop(0)
print(df)
Its output is as follows −
   a  b
1  3  4
1  7  8
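If duplicate labels are not wanted, one option is to renumber the rows while combining the frames, so that each label identifies exactly one row. A minimal sketch, assuming the same two frames as in the example above (the variable names are illustrative):

```python
import pandas as pd

df_a = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df_b = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old labels and assigns 0..n-1
combined = pd.concat([df_a, df_b], ignore_index=True)

# Label 0 now refers to a single row, so only one row is dropped
remaining = combined.drop(0)
print(remaining)
```

Here three rows remain (labels 1, 2 and 3), whereas dropping label 0 from the frame with duplicated labels removed two rows.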

2.7 HANDLING MISSING VALUES AND DUPLICATES IN PANDAS

In real-world data, we often encounter:

• Missing values (NaN/null)
• Duplicate entries

These issues must be cleaned before analysis or machine learning.

1. Handling Missing Values (NaN)

Checking for Missing Values

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, None, 30, 22],
        'City': ['Pune', 'Mumbai', None, 'Delhi']}

df = pd.DataFrame(data)
print(df.isnull()) # Shows True for NaN
print(df.isnull().sum()) # Count of missing values per column

Removing Missing Values

Drop rows with any NaN:
df_cleaned = df.dropna()
Drop rows only if all values are NaN:
df_cleaned = df.dropna(how='all')
Drop columns with any NaN:
df_cleaned = df.dropna(axis=1)
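dropna also accepts parameters that make the removal more selective. The sketch below uses subset and thresh on a small made-up frame; the data, the column chosen for subset, and the threshold of 2 are assumptions for illustration.

```python
import pandas as pd

df_m = pd.DataFrame({'Name': ['Alice', 'Bob', None],
                     'Age': [25, None, None],
                     'City': ['Pune', 'Mumbai', 'Delhi']})

# Drop a row only when 'Age' is missing; gaps in other columns are ignored
age_known = df_m.dropna(subset=['Age'])
print(age_known)

# Keep rows that have at least 2 non-missing values
complete_enough = df_m.dropna(thresh=2)
print(complete_enough)
```

Only row 0 has a known Age, so age_known contains that row alone; rows 0 and 1 each have at least two values, so complete_enough keeps both and drops row 2.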

Filling Missing Values

Fill with a constant:
df_filled = df.fillna("Unknown")
Fill using forward fill (previous value):
df_filled = df.ffill()  # replaces fillna(method='ffill'), which is deprecated in recent pandas
Fill using backward fill (next value):
df_filled = df.bfill()  # replaces fillna(method='bfill'), which is deprecated in recent pandas
Fill with mean/median of column:
df['Age'] = df['Age'].fillna(df['Age'].mean())
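To make the difference between the fill strategies concrete, the sketch below applies each one to a small Series. The values are made up for illustration; ffill() and bfill() are the method forms of forward and backward fill.

```python
import pandas as pd

s = pd.Series([10.0, None, None, 40.0])

forward = s.ffill()               # each gap takes the previous known value
backward = s.bfill()              # each gap takes the next known value
with_mean = s.fillna(s.mean())    # each gap takes the mean of the known values

print(forward.tolist())
print(backward.tolist())
print(with_mean.tolist())
```

The mean is computed from the known values only (10 and 40), so both gaps are filled with 25.0 in the last case.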

2. Handling Duplicates

Check for Duplicates:

df.duplicated()

Remove Duplicate Rows:

df_no_duplicates = df.drop_duplicates()

Remove Duplicates Based on Specific Columns:

df_unique = df.drop_duplicates(subset=['Name'])
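Both duplicated() and drop_duplicates() take a keep parameter that controls which occurrence counts as the original. A short sketch on made-up data (the frame and variable names are assumptions):

```python
import pandas as pd

df_d = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                     'City': ['Pune', 'Mumbai', 'Pune']})

# Default keep='first': only later repeats are flagged
flag_first = df_d.duplicated()
print(flag_first.tolist())

# keep=False flags every row that has a duplicate, including the first one
flag_all = df_d.duplicated(keep=False)
print(flag_all.tolist())

# keep='last' retains the last occurrence instead of the first
kept_last = df_d.drop_duplicates(keep='last')
print(kept_last)
```

With keep='last' the rows labelled 1 and 2 survive, so the remaining 'Alice' row is the later one.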

Example Summary

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', None],
        'Age': [25, None, 25, 22],
        'City': ['Pune', 'Mumbai', 'Pune', 'Delhi']}

df = pd.DataFrame(data)

# Remove duplicates
df = df.drop_duplicates()

# Fill missing Name with 'Unknown'
df['Name'] = df['Name'].fillna('Unknown')

# Fill missing Age with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df)

Output:

      Name   Age    City
0    Alice  25.0    Pune
1      Bob  23.5  Mumbai
3  Unknown  22.0   Delhi

The missing Age is filled with 23.5, the mean of the ages that remain after the duplicate
row is dropped: (25 + 22) / 2.

2.8 DATA AGGREGATION, GROUPING, AND SORTING IN PANDAS

These are essential operations to summarize, organize, and sort data for analysis.

1. Data Aggregation in Pandas

Aggregation means applying summary functions like sum(), mean(), count(), min(),
max() on a DataFrame or Series.

Example:

import pandas as pd

data = {'Department': ['HR', 'HR', 'IT', 'IT', 'Finance'],
        'Salary': [25000, 27000, 30000, 32000, 28000]}

df = pd.DataFrame(data)

print(df['Salary'].sum())   # Total salary
print(df['Salary'].mean())  # Average salary
print(df['Salary'].max())   # Max salary

2. Grouping using groupby()

Grouping means splitting the data into groups and then applying aggregation
functions on each group.

Example:

grouped = df.groupby('Department')
print(grouped['Salary'].sum())
print(grouped['Salary'].mean())

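Output of the first print statement:

Department
Finance    28000
HR         52000
IT         62000
Name: Salary, dtype: int64

This shows the total salary per department. The second print statement shows the average
salary per department in the same layout.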

Common Aggregation Functions:

Function    Description
sum()       Total of values
mean()      Average
count()     Number of entries
min()       Minimum
max()       Maximum
std()       Standard deviation

Multiple Aggregations:

print(grouped['Salary'].agg(['sum', 'mean', 'max']))
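When several aggregations are computed at once, the result columns can be given descriptive names with named aggregation, where each keyword has the form output_name=(input_column, function). A minimal sketch using the Department/Salary data from this section; the output column names are assumptions chosen for readability.

```python
import pandas as pd

df_s = pd.DataFrame({'Department': ['HR', 'HR', 'IT', 'IT', 'Finance'],
                     'Salary': [25000, 27000, 30000, 32000, 28000]})

# Named aggregation: output_column=(input_column, function)
summary = df_s.groupby('Department').agg(
    total_salary=('Salary', 'sum'),
    average_salary=('Salary', 'mean'),
    headcount=('Salary', 'count'),
)
print(summary)
```

The result has one row per department and the three named columns, which is often easier to read than the default 'sum', 'mean', 'count' headers.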

3. Sorting in Pandas

Sorting means arranging rows in ascending or descending order using one or more
columns.

Example:

# Sort by Salary ascending
print(df.sort_values(by='Salary'))

# Sort by Department then Salary (descending)
print(df.sort_values(by=['Department', 'Salary'], ascending=[True, False]))

Combining Grouping and Sorting:

grouped = df.groupby('Department')['Salary'].sum()
sorted_grouped = grouped.sort_values(ascending=False)
print(sorted_grouped)

Full Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Department': ['IT', 'HR', 'IT', 'HR', 'Finance'],
        'Salary': [30000, 25000, 32000, 27000, 28000]}

df = pd.DataFrame(data)

# Group by Department
grouped = df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count'])

# Sort by total salary descending
grouped_sorted = grouped.sort_values(by='sum', ascending=False)

print(grouped_sorted)

Output:

              sum     mean  count
Department
IT          62000  31000.0      2
HR          52000  26000.0      2
Finance     28000  28000.0      1

2.9 MERGING, JOINING AND CONCATENATING DATAFRAMES

1. Concatenating DataFrames in Pandas

Concatenation means stacking DataFrames either vertically (row-wise) or
horizontally (column-wise).

Syntax:

pd.concat([df1, df2], axis=0)  # Vertical
pd.concat([df1, df2], axis=1)  # Horizontal

Example (Vertical):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

result = pd.concat([df1, df2])
print(result)

Output:

   ID     Name
0   1    Alice
1   2      Bob
0   3  Charlie
1   4    David

Use ignore_index=True to reset row numbers.
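The following sketch shows ignore_index=True in use, along with a horizontal concatenation (axis=1). The third frame, scores, and its Marks values are assumptions added only to have something to place side by side.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

# Vertical stacking with a fresh 0..n-1 index
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)

# Horizontal concatenation aligns rows by their index labels
scores = pd.DataFrame({'Marks': [85, 90]})
side_by_side = pd.concat([df1, scores], axis=1)
print(side_by_side)
```

Because horizontal concatenation matches rows by index label, frames with different indexes would produce NaN in the positions that do not line up.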

2. Merging DataFrames (Like SQL JOIN)

Merging combines rows based on common columns or indices, just like a relational
database join.

Syntax:

pd.merge(df1, df2, on='column_name', how='inner')

Example:

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [30000, 40000, 50000]})

merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)

Output:

   ID   Name  Salary
0   1  Alice   30000
1   2    Bob   40000

Merge Types:

how      Description
inner    Only matching rows from both
left     All rows from left, matched from right
right    All rows from right, matched from left
outer    All rows from both, fill with NaN if no match
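The merge types in the table can be compared on the same two frames used in the inner-join example. The sketch below shows a left and an outer merge; the indicator=True argument is an optional extra, not mentioned in these notes, that adds a _merge column recording where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [30000, 40000, 50000]})

# Left merge: every row of df1 is kept; ID 3 has no salary, so it gets NaN
left = pd.merge(df1, df2, on='ID', how='left')
print(left)

# Outer merge: IDs 1 to 4 all appear; _merge is 'both', 'left_only'
# or 'right_only' for each row
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(outer)
```

A right merge is the mirror image of the left merge: it would keep IDs 1, 2 and 4 and leave Name empty for ID 4.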

3. Joining DataFrames using .join()

.join() is used to combine DataFrames on index by default (or column if specified).

Example:

df1 = pd.DataFrame({'Name': ['Alice', 'Bob']}, index=[1, 2])
df2 = pd.DataFrame({'Salary': [30000, 40000]}, index=[1, 2])

joined = df1.join(df2)
print(joined)

Output:

    Name  Salary
1  Alice   30000
2    Bob   40000

Join with Key Column (when both DataFrames have an 'ID' column):

df1.set_index('ID').join(df2.set_index('ID'))
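The one-line pattern above needs frames that actually contain an 'ID' column. A runnable sketch with two such frames follows; the frame names and values are assumptions for illustration.

```python
import pandas as pd

df_names = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df_pay = pd.DataFrame({'ID': [1, 2], 'Salary': [30000, 40000]})

# Move the key column into the index on both sides, then join on the index
joined_on_id = df_names.set_index('ID').join(df_pay.set_index('ID'))
print(joined_on_id)

# reset_index() turns ID back into an ordinary column if needed
flat = joined_on_id.reset_index()
print(flat)
```

For a join on a shared column, pd.merge(df_names, df_pay, on='ID') gives the same rows without touching the index, so the choice between the two is largely a matter of style.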

Full Working Example:


import pandas as pd

# DataFrames
students = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
marks = pd.DataFrame({'ID': [1, 2, 4], 'Marks': [85, 90, 78]})

# Merging on ID
merge_inner = pd.merge(students, marks, on='ID', how='inner')
merge_outer = pd.merge(students, marks, on='ID', how='outer')

print("Inner Join:\n", merge_inner)
print("\nOuter Join:\n", merge_outer)

2.10 Data Cleaning and Transformation using Pandas

Data cleaning is the process of fixing or removing incorrect, corrupted, or
incomplete data from a dataset.

Data transformation means changing data format, structure, or values to make it
more usable for analysis.

Common Tasks in Data Cleaning & Transformation

1. Handling Missing Values

Fill missing values:

import pandas as pd

data = {'Name': ['Alice', 'Bob', None], 'Age': [25, None, 30]}
df = pd.DataFrame(data)

df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()}, inplace=True)
print(df)

Output:

      Name   Age
0    Alice  25.0
1      Bob  27.5
2  Unknown  30.0

Drop missing values:

df.dropna(inplace=True)

2. Removing Duplicates

data = {'ID': [1, 2, 2, 3], 'Name': ['A', 'B', 'B', 'C']}
df = pd.DataFrame(data)

df = df.drop_duplicates()
print(df)

Output:

   ID Name
0   1    A
1   2    B
3   3    C

3. Renaming Columns

df = df.rename(columns={'Name': 'Full_Name'})
print(df)

4. Changing Data Types

df['ID'] = df['ID'].astype(str)
print(df.dtypes)
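astype raises an error when a value cannot be converted. For columns that may contain stray text, pd.to_numeric with errors='coerce' is a common alternative that turns unparseable entries into NaN. The sketch below assumes a made-up Marks column with one non-numeric entry.

```python
import pandas as pd

df_t = pd.DataFrame({'Marks': ['85', '90', 'absent']})

# 'absent' cannot be parsed as a number, so it becomes NaN
df_t['Marks'] = pd.to_numeric(df_t['Marks'], errors='coerce')
print(df_t)
print(df_t.dtypes)
```

The resulting NaN can then be handled with the dropna or fillna techniques from section 2.7.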

5. String Cleaning (Remove Whitespace / Lowercase)

df = pd.DataFrame({'City': [' Pune ', 'MUMBAI', ' delhi ']})
df['City'] = df['City'].str.strip().str.lower()
print(df)

Output:

     City
0    pune
1  mumbai
2   delhi

6. Replacing Values

df['City'] = df['City'].replace('mumbai', 'Mumbai City')
print(df)
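Several values can be replaced in one call by passing a dict that maps old values to new ones. A short sketch on the cleaned City column; the replacement strings are assumptions for illustration.

```python
import pandas as pd

df_c = pd.DataFrame({'City': ['pune', 'mumbai', 'delhi']})

# Values that do not appear in the mapping are left unchanged
df_c['City'] = df_c['City'].replace({'mumbai': 'Mumbai City', 'pune': 'Pune City'})
print(df_c)
```

Here 'delhi' is not in the mapping, so it stays as it is.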

7. Creating New Columns

df = pd.DataFrame({'Math': [90, 80], 'Science': [85, 95]})
df['Total'] = df['Math'] + df['Science']
print(df)

Output:

   Math  Science  Total
0    90       85    175
1    80       95    175

8. Sorting

df = pd.DataFrame({'Name': ['C', 'A', 'B'], 'Marks': [78, 90, 85]})
df = df.sort_values(by='Marks', ascending=False)
print(df)

Prepared By : Dr. D. R. Somwanshi (M.Sc. CS. M.Phil., P.hD, NET), COCSIT Latur
