Unit II - Final
NumPy has functions for working with linear algebra, Fourier transforms, and
matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can
use it freely.
In Python we have lists that serve the purpose of arrays, but they are slow to process.
NumPy is often used along with packages like SciPy (Scientific Python)
and Matplotlib (plotting library).
NumPy aims to provide an array object that is up to 50x faster than traditional Python
lists.
The array object in NumPy is called ndarray. It provides many supporting functions
that make working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very
important.
NumPy is a Python library and is written partially in Python, but most of the parts that
require fast computation are written in C or C++.
It provides:
Prepared By : Dr. D. R. Somwanshi (M.Sc. CS. M.Phil., P.hD, NET), COCSIT Latur
Notes: Data Analytics with Python (B.Sc. AI and ML SY) UNIT II: NumPy, SciPy and Pandas for
Data Handling
Installing NumPy
If you have Python and pip already installed on a system, then installing NumPy is
very easy.
NumPy Arrays
What is an Array?
A NumPy array is a grid of values (all of the same data type) indexed by a tuple of non-
negative integers.
It’s similar to Python lists but much faster and more powerful.
import numpy as np
Types of Arrays
Type      Example
1D array  np.array([1, 2, 3])
2D array  np.array([[1, 2], [3, 4]])
3D array  np.array([[[1], [2]], [[3], [4]]])
Creating Arrays
np.eye(3) # 3x3 identity matrix
np.arange(0, 10, 2) # Values from 0 up to (but not including) 10 in steps of 2: [0 2 4 6 8]
np.linspace(0, 1, 5) # 5 evenly spaced values from 0 to 1 inclusive: [0. 0.25 0.5 0.75 1.]
Array Operations
Arithmetic Operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # [5 7 9]
print(a * b) # [ 4 10 18]
print(a ** 2) # [1 4 9]
Scalar Operations
print(a + 5) # [6 7 8]
print(a * 2) # [2 4 6]
Matrix Operations
arr = np.array([[1, 2, 3], [4, 5, 6]]) # example 2x3 array (values are illustrative)
print(arr.shape) # (2, 3)
print(arr.ndim) # 2
print(arr.size) # 6
print(arr.dtype) # int32 or int64 (depends on system)
arr = np.array([10, 20, 30, 40, 50])
print(arr[1:4]) # [20 30 40]
For 2D arrays:
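For a 2D array, a slice can be given for each axis, separated by a comma (rows first, then columns). The following is a minimal sketch; the array `grid` and its values are assumptions chosen only for illustration.

```python
import numpy as np

# Illustrative 3x4 array (values are an assumption)
grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

first_two_rows = grid[0:2]       # rows 0 and 1, all columns
middle_block = grid[0:2, 1:3]    # rows 0-1, columns 1-2
second_column = grid[:, 1]       # every row, column 1

print(first_two_rows)
print(middle_block)     # [[2 3]
                        #  [6 7]]
print(second_column)    # [ 2  6 10]
```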
1. np.arange()
arange() is used to create an array with regularly spaced values. It works like the
Python range() function but returns a NumPy array.
Syntax:
np.arange(start, stop, step)
Example:
import numpy as np
arr = np.arange(1, 10, 2)
print(arr)
Output:
[1 3 5 7 9]
2. np.linspace()
linspace() creates an array of evenly spaced numbers between a start and stop value.
Useful when you know how many points you want, not the step.
Syntax:
np.linspace(start, stop, num)
num: Number of values to generate
Example:
arr = np.linspace(0, 5, 6)
print(arr)
Output:
[0. 1. 2. 3. 4. 5.]
3. np.reshape()
reshape() changes the shape of an existing array without changing the data. It returns an
array with the same data in a different layout; where possible, the result is a view of
the original rather than a copy.
Syntax:
array.reshape(new_shape)
Example:
arr = np.arange(6)
reshaped = arr.reshape(2, 3)
print(reshaped)
Output:
[[0 1 2]
[3 4 5]]
4. Slicing in NumPy
Slicing allows you to extract specific portions of an array using index ranges.
Syntax:
array[start:stop:step]
Example:
arr = np.array([10, 20, 30, 40, 50])
print(arr[1:4]) # From index 1 to 3
print(arr[::-1]) # Reverse the array
Output:
[20 30 40]
[50 40 30 20 10]
5. Indexing in NumPy
Indexing is used to access specific elements of an array. NumPy supports several types
of indexing:
a. Basic Indexing
b. Boolean Indexing
c. Fancy Indexing
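The three kinds of indexing listed above can be sketched as follows. This is a minimal illustration on an assumed 1D array, not a complete tour of NumPy indexing.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# a. Basic indexing: a single position (negative values count from the end)
first = arr[0]     # 10
last = arr[-1]     # 50

# b. Boolean indexing: keep only the elements where the mask is True
above_25 = arr[arr > 25]     # [30 40 50]

# c. Fancy indexing: pick several positions with a list of integers
picked = arr[[0, 2, 4]]      # [10 30 50]

print(first, last)
print(above_25)
print(picked)
```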
Summary
Function Description
arange() Creates array using a range and step
linspace() Creates evenly spaced values between two points
reshape() Changes shape of array (e.g., 1D to 2D)
slicing Gets part of array using index range
indexing Access or filter array elements
Other funcs zeros(), ones(), eye(), sum(), etc.
What is a Matrix?
Example:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
1. Matrix Addition
C=A+B
print(C)
Output:
[[ 6 8]
[10 12]]
2. Matrix Subtraction
C=A-B
print(C)
Output:
[[-4 -4]
[-4 -4]]
3. Element-wise Multiplication
C = A * B
print(C)
Output:
[[ 5 12]
[21 32]]
4. Matrix Multiplication (Dot Product)
C = A.dot(B)
# OR
C = np.matmul(A, B)
print(C)
Output:
[[19 22]
[43 50]]
5. Transpose of a Matrix
print(A.T)
Output:
[[1 3]
[2 4]]
6. Inverse of a Matrix
A_inv = np.linalg.inv(A)
print(A_inv)
7. Determinant of a Matrix
det = np.linalg.det(A)
print(det)
8. Identity Matrix
I = np.eye(3)
print(I)
9. Diagonal of a Matrix
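One way to work with the diagonal of a matrix is `np.diag()`. The sketch below reuses the 2x2 matrix `A` from the earlier matrix examples; the names `diag`, `trace` and `D` are illustrative.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

diag = np.diag(A)      # main diagonal of a 2D array -> [1 4]
trace = np.trace(A)    # sum of the diagonal elements -> 5
D = np.diag([1, 4])    # a 1D input builds a diagonal matrix instead

print(diag)
print(trace)
print(D)
```

Note that `np.diag()` extracts the diagonal when given a 2D array and constructs a diagonal matrix when given a 1D sequence.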
1. np.mean()
2. np.median()
3. np.std()
arr = np.array([10, 20, 30])
print(np.std(arr)) # Output: 8.16...
4. np.var()
6. np.sum()
print(np.sum(arr)) # 60 (for arr = np.array([10, 20, 30]))
7. np.percentile()
8. Axis-Based Calculations
You can apply statistical functions row-wise (axis=1) or column-wise (axis=0) for 2D
arrays.
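A short sketch of axis-based statistics follows. The array `scores` and its values are assumptions made for illustration; `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

# Illustrative 2x3 array: 2 rows, 3 columns
scores = np.array([[10, 20, 30],
                   [40, 50, 60]])

col_means = np.mean(scores, axis=0)    # one mean per column -> [25. 35. 45.]
row_means = np.mean(scores, axis=1)    # one mean per row    -> [20. 50.]
row_sums = np.sum(scores, axis=1)      # one sum per row     -> [ 60 150]

# Without an axis argument, the whole array is treated as one set of values
overall_median = np.median(scores)     # 35.0
p50 = np.percentile(scores, 50)        # 50th percentile, same as the median

print(col_means, row_means, row_sums)
print(overall_median, p50)
```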
2.4 INTRODUCTION TO SCIPY: SUB-PACKAGES
Import SciPy
Once SciPy is installed, import the SciPy module(s) you want to use in your applications
by adding the from scipy import module statement:
from scipy import constants
Now we have imported the constants module from SciPy, and the application is ready to
use it:
Example
How many cubic meters are in one liter:
from scipy import constants
print(constants.liter)
constants: SciPy offers a set of mathematical and physical constants. One of them is
liter, which gives the size of 1 liter in cubic meters.
Constants in SciPy
As SciPy is more focused on scientific implementations, it provides many built-in
scientific constants.
These constants can be helpful when you are working with Data Science.
PI is an example of a scientific constant.
Example
Print the constant value of PI:
from scipy import constants
print(constants.pi)
Constant Units
A list of all units under the constants module can be seen using the dir() function.
Example
List all constants:
from scipy import constants
print(dir(constants))
Unit Categories
The units are placed under these categories:
Metric
Binary
Mass
Angle
Time
Length
Pressure
Area
Volume
Speed
Temperature
Energy
Power
Force
Metric (SI) Prefixes:
Return the multiplier for the specified SI prefix (e.g. centi returns 0.01)
Example
from scipy import constants
print(constants.yotta) #1e+24
print(constants.zetta) #1e+21
print(constants.exa) #1e+18
print(constants.peta) #1000000000000000.0
print(constants.tera) #1000000000000.0
print(constants.giga) #1000000000.0
print(constants.mega) #1000000.0
print(constants.kilo) #1000.0
print(constants.hecto) #100.0
print(constants.deka) #10.0
print(constants.deci) #0.1
print(constants.centi) #0.01
print(constants.milli) #0.001
print(constants.micro) #1e-06
print(constants.nano) #1e-09
print(constants.pico) #1e-12
print(constants.femto) #1e-15
print(constants.atto) #1e-18
print(constants.zepto) #1e-21
Binary Prefixes:
Return the specified unit in bytes (e.g. kibi returns 1024)
Example
from scipy import constants
print(constants.kibi) #1024
print(constants.mebi) #1048576
print(constants.gibi) #1073741824
print(constants.tebi) #1099511627776
print(constants.pebi) #1125899906842624
print(constants.exbi) #1152921504606846976
print(constants.zebi) #1180591620717411303424
print(constants.yobi) #1208925819614629174706176
Mass:
Return the specified unit in kg (e.g. gram returns 0.001)
Example
from scipy import constants
print(constants.gram) #0.001
print(constants.metric_ton) #1000.0
print(constants.grain) #6.479891e-05
print(constants.lb) #0.45359236999999997
print(constants.pound) #0.45359236999999997
print(constants.carat) #0.0002
Angle:
Return the specified unit in radians (e.g. degree returns 0.017453292519943295)
Example
from scipy import constants
print(constants.degree) #0.017453292519943295
print(constants.arcsecond) #4.84813681109536e-06
Time:
Return the specified unit in seconds (e.g. hour returns 3600.0)
Example
from scipy import constants
print(constants.minute) #60.0
print(constants.hour) #3600.0
print(constants.day) #86400.0
print(constants.week) #604800.0
print(constants.year) #31536000.0
Length:
Return the specified unit in meters (e.g. nautical_mile returns 1852.0)
Example
from scipy import constants
print(constants.inch) #0.0254
print(constants.foot) #0.30479999999999996
print(constants.yard) #0.9143999999999999
print(constants.mile) #1609.3439999999998
print(constants.mil) #2.5399999999999997e-05
print(constants.pt) #0.00035277777777777776
print(constants.point) #0.00035277777777777776
Pressure:
Return the specified unit in pascals (e.g. psi returns 6894.757293168361)
Example
from scipy import constants
print(constants.atm) #101325.0
print(constants.atmosphere) #101325.0
print(constants.bar) #100000.0
Area:
Return the specified unit in square meters(e.g. hectare returns 10000.0)
Example
from scipy import constants
print(constants.hectare) #10000.0
print(constants.acre) #4046.8564223999992
Volume:
Return the specified unit in cubic meters (e.g. liter returns 0.001)
Example
from scipy import constants
print(constants.liter) #0.001
print(constants.litre) #0.001
print(constants.gallon) #0.0037854117839999997
Speed:
Return the specified unit in meters per second (e.g. speed_of_sound returns 340.5)
Example
from scipy import constants
print(constants.kmh) #0.2777777777777778
print(constants.mph) #0.44703999999999994
print(constants.mach) #340.5
Temperature:
Return the specified unit in Kelvin (e.g. zero_Celsius returns 273.15)
Example
from scipy import constants
print(constants.zero_Celsius) #273.15
print(constants.degree_Fahrenheit) #0.5555555555555556
Energy:
Return the specified unit in joules (e.g. calorie returns 4.184)
Example
from scipy import constants
print(constants.eV) #1.602176634e-19 (older SciPy releases print 1.6021766208e-19)
print(constants.electron_volt) #1.602176634e-19 (older SciPy releases print 1.6021766208e-19)
print(constants.calorie) #4.184
Power:
Return the specified unit in watts (e.g. horsepower returns 745.6998715822701)
Example
from scipy import constants
print(constants.hp) #745.6998715822701
print(constants.horsepower) #745.6998715822701
Force:
Return the specified unit in newtons (e.g. kilogram_force returns 9.80665)
Example
from scipy import constants
print(constants.dyn) #1e-05
print(constants.dyne) #1e-05
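Because each unit constant is just the size of that unit in SI base units, conversions reduce to multiplication and division. The sketch below is one possible use; the quantities (5 miles, 2 hours, 25 degrees Celsius) are made-up inputs, and `convert_temperature` is the helper provided by `scipy.constants` for temperature scales.

```python
from scipy import constants

# Miles -> meters -> kilometers
miles = 5
km = miles * constants.mile / constants.kilo

# Hours -> seconds
seconds = 2 * constants.hour

# Temperature scales need an offset, so use the helper function
kelvin = constants.convert_temperature(25, 'Celsius', 'Kelvin')

print(km)
print(seconds)
print(kelvin)
```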
In 2008, developer Wes McKinney started developing pandas when he needed a
high-performance, flexible tool for data analysis.
Prior to Pandas, Python was mainly used for data preparation. It contributed very
little to data analysis.
Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the
processing and analysis of data, regardless of the origin of data — load, prepare,
manipulate, model, and analyze.
Python with Pandas is used in a wide range of academic and commercial fields,
including finance, economics, statistics, and analytics.
Key Features of Pandas
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.
These data structures are built on top of NumPy arrays, which means they are fast.
Dimension & Description
The best way to think of these data structures is that the higher dimensional data
structure is a container of its lower dimensional data structure. For example, DataFrame
is a container of Series, Panel is a container of DataFrame.
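Data Structure  Dimensions  Description
Series          1           One-dimensional labeled array with homogeneous data; size immutable.
DataFrame       2           Two-dimensional labeled table with potentially heterogeneous columns; size mutable.
Panel           3           Three-dimensional labeled container of DataFrames; size mutable.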
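All Pandas data structures are value mutable (their contents can be changed), and all
except Series are size mutable; Series is size immutable.
Note − DataFrame is widely used and is one of the most important data structures. Panel
was used much less; it was deprecated in pandas 0.20 and removed in pandas 0.25, so it is
not available in current releases.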
Series
Series is a one-dimensional array like structure with homogeneous data. For example,
the following series is a collection of integers 10, 23, 56, …
10 23 56 17 52 61 73 90 26 72
Key Points
Homogeneous data
Size Immutable
Values of Data Mutable
DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For example,
Name Age Gender Rating
The table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column
represents an attribute and each row represents a person.
Data Type of Columns
The data types of the four columns are as follows −
Column Type
Name String
Age Integer
Gender String
Rating Float
Key Points
Heterogeneous data
Size Mutable
Data Mutable
Panel
Panel is a three-dimensional data structure with heterogeneous data. It is hard to
represent the panel in graphical representation. But a panel can be illustrated as a
container of DataFrame.
Key Points
Heterogeneous data
Size Mutable
Data Mutable
Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index.
pandas.Series
A pandas Series can be created using the following constructor −
pandas.Series( data, index, dtype, copy)
The parameters of the constructor are as follows −
Sr.No Parameter & Description
1 data
data takes various forms like ndarray, list, constants
2 index
Index values must be hashable and the same length as data; duplicate labels are allowed.
Default np.arange(n) if no index is passed.
3 dtype
dtype is for data type. If None, data type will be inferred
4 copy
Copy data. Default False
A series can be created using various inputs like −
Array
Dict
Scalar value or constant
Create an Empty Series
The most basic Series that can be created is an empty Series.
Example
#import the pandas library and aliasing as pd
import pandas as pd
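s = pd.Series(dtype='float64') # explicit dtype; the default dtype of an empty Series depends on the pandas version
print(s)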
Its output is as follows −
Series([], dtype: float64)
Create a Series from ndarray
If data is an ndarray, then the index passed must be of the same length. If no index is
passed, then by default the index will be range(n), where n is the array length, i.e.,
[0, 1, 2, ..., len(array)-1].
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Its output is as follows −
0 a
1 b
2 c
3 d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging from 0
to len(data)-1, i.e., 0 to 3.
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Its output is as follows −
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed values in the
output.
Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary keys are used to
construct the index (in insertion order in current pandas versions; older versions sorted
them). If an index is passed, the values in data corresponding to the labels in the index
are pulled out.
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64
Observe − Dictionary keys are used to construct index.
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)
Its output is as follows −
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Observe − Index order is persisted and the missing element is filled with NaN (Not a
Number).
Create a Series from Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match
the length of index
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Its output is as follows −
0 5
1 5
2 5
3 5
dtype: int64
Accessing Data from Series with Position
Data in the series can be accessed similar to that in an ndarray.
Example 1
Retrieve the first element. As we already know, the counting starts from zero for the
array, which means the first element is stored at zeroth position and so on.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s.iloc[0]) # retrieve the first element by position
c 3
dtype: int64
Example 3
Retrieve the last three elements.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s.iloc[-3:]) # retrieve the last three elements by position
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
KeyError: 'f'
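A `KeyError` such as the one above is what pandas raises when a requested label is not in the index. The following sketch, using the same Series as the earlier examples, shows label-based access and one way to avoid the exception; the variable name `missing` is illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s['a'])                # single label -> 1
print(s[['a', 'c', 'd']])    # list of labels -> a smaller Series
print(s.iloc[0])             # by position -> 1

# A label that is not in the index raises KeyError
try:
    s['f']
except KeyError as err:
    print('KeyError:', err)

# .get() returns a default value instead of raising
missing = s.get('f', None)
print(missing)
```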
DATAFRAME
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion
in rows and columns.
Features of DataFrame
Potentially columns are of different types
Size – Mutable
Labelled axes (rows and columns)
Can Perform Arithmetic operations on rows and columns
Structure
Let us assume that we are creating a data frame with student’s data.
1 data
data takes various forms like ndarray, series, map, lists, dict, constants and also
another DataFrame.
2 index
Row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no
index is passed.
3 columns
Column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no
column labels are passed.
4 dtype
Data type of each column.
5 copy
Whether to copy data from the inputs. The default is False in older pandas releases;
check the documentation for the version in use.
Create DataFrame
A pandas DataFrame can be created using various inputs like −
Lists
dict
Series
Numpy ndarrays
Another DataFrame
In the subsequent sections of this chapter, we will see how to create a DataFrame using
these inputs.
Create an Empty DataFrame
The most basic DataFrame that can be created is an empty DataFrame.
Example
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Its output is as follows −
Empty DataFrame
Columns: []
Index: []
Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.
Example 1
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
Its output is as follows −
0
0 1
1 2
2 3
3 4
4 5
Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Its output is as follows −
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
Example 3
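import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
# Convert only the Age column; recent pandas versions raise an error when dtype=float
# is applied to the whole frame, because the Name column cannot be converted
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
print(df)
Its output is as follows −
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
Note − Observe that astype changes the type of the Age column to floating point.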
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
Its output is as follows (current pandas versions keep the column order of the dict) −
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
Note − Observe the values 0,1,2,3. They are the default index assigned to each using the
function range(n).
Example 2
Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Its output is as follows −
        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
Note − Observe, the index parameter assigns an index to each row.
Read CSV Files using Pandas
A simple way to store big data sets is to use CSV files (comma-separated values files).
CSV files contain plain text and are a well-known format that can be read by almost any
tool, including Pandas.
In our examples we will be using a CSV file called 'data.csv'.
Example
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
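To try `read_csv` without a file on disk, the CSV text can be wrapped in `io.StringIO`, which `read_csv` accepts in place of a path. This is a minimal sketch; the column names and rows in `csv_text` are hypothetical stand-ins for the contents of 'data.csv'.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first rows of the frame
print(df.shape)     # (2, 2): two rows, two columns
print(df.dtypes)    # column types inferred from the text
```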
Column Addition
We will understand this by adding a new column to an existing data frame.
Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Add a new column to an existing DataFrame by assigning a Series to a new column label
print("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
print("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
print(df)
Its output is as follows −
Adding a new column by passing as Series:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
Adding a new column using the existing columns in DataFrame:
one two three four
a 1.0 1 10.0 11.0
b 2.0 2 20.0 22.0
c 3.0 3 30.0 33.0
d NaN 4 NaN NaN
Column Deletion
Columns can be deleted or popped; let us take an example to understand how.
Example
# Using the previous DataFrame, we will delete a column
# using del function
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
# using the del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)
# using the pop method
print("Deleting another column using POP function:")
df.pop('two')
print(df)
Its output is as follows −
Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
Row Selection, Addition, and Deletion
We will now understand row selection, addition and deletion through examples. Let us
begin with the concept of selection.
Selection by Label
Rows can be selected by passing row label to a loc function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
Its output is as follows −
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And, the Name of
the series is the label with which it is retrieved.
Selection by integer location
Rows can be selected by passing integer location to an iloc function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
Its output is as follows −
one 3.0
two 3.0
Name: c, dtype: float64
Slice Rows
Multiple rows can be selected using ‘ : ’ operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
Its output is as follows −
one two
c 3.0 3
d NaN 4
Addition of Rows
Add new rows to a DataFrame using the pd.concat function, which appends the rows at the
end. (The older DataFrame.append method was removed in pandas 2.0.)
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
print(df)
Its output is as follows −
a b
0 1 2
1 3 4
0 5 6
1 7 8
Deletion of Rows
Use index label to delete or drop rows from a DataFrame. If label is duplicated, then
multiple rows will be dropped.
If you observe, in the above example, the labels are duplicate. Let us drop a label and will
see how many rows will get dropped.
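A minimal sketch of dropping a duplicated label follows. It rebuilds the two frames from the Addition of Rows example and uses `pd.concat` to combine them; the names `combined` and `remaining` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# Combining keeps the original row labels, so the index is 0, 1, 0, 1
combined = pd.concat([df, df2])

# Dropping label 0 removes every row that carries that label
remaining = combined.drop(0)
print(remaining)
```

Two rows are dropped here because the label 0 appears twice in the index.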
import pandas as pd
import pandas as pd
df = pd.DataFrame(data)
print(df.isnull()) # Shows True for NaN
print(df.isnull().sum()) # Count of missing values per column
Fill with a constant:
df_filled = df.fillna("Unknown")
Fill using forward fill (previous value):
df_filled = df.ffill() # fillna(method='ffill') is deprecated in recent pandas
Fill using backward fill (next value):
df_filled = df.bfill() # fillna(method='bfill') is deprecated in recent pandas
Fill with mean/median of column:
df['Age'] = df['Age'].fillna(df['Age'].mean())
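The detection and filling steps above can be tried end to end on a small frame. The sketch below uses hypothetical data (the names and ages are assumptions) with one missing value in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Dev'],
    'Age': [25.0, np.nan, 22.0, 31.0],
})

# Count the missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Fill text gaps with a constant and numeric gaps with the column mean
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
```

The mean is computed from the non-missing ages only, so the gap in Age is filled with (25 + 22 + 31) / 3 = 26.0.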
2. Handling Duplicates
df.duplicated()
df_no_duplicates = df.drop_duplicates()
df_unique = df.drop_duplicates(subset=['Name'])
Example Summary
import pandas as pd
df = pd.DataFrame(data)
# Remove duplicates
df = df.drop_duplicates()
print(df)
Output:
Name Age City
0 Alice 25.000000 Pune
1 Bob 24.000000 Mumbai
3 Unknown 22.000000 Delhi
These are essential operations to summarize, organize, and sort data for analysis.
Aggregation means applying summary functions like sum(), mean(), count(), min(),
max() on a DataFrame or Series.
Example:
import pandas as pd
df = pd.DataFrame(data)
Grouping means splitting the data into groups and then applying aggregation
functions on each group.
Example:
grouped = df.groupby('Department')
print(grouped['Salary'].sum())
print(grouped['Salary'].mean())
Output:
Department
Finance 28000
HR 52000
IT 62000
Function Description
sum() Total of values
mean() Average
count() Number of entries
min() Minimum
max() Maximum
std() Standard deviation
Multiple Aggregations:
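Several aggregation functions can be applied in one call by passing a list of names to `agg()`. The sketch below is an illustration only: the individual salary values are assumptions, chosen so that the department totals agree with the grouped output printed earlier in these notes.

```python
import pandas as pd

# Hypothetical employee records (individual salaries are assumed)
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR', 'Finance', 'IT'],
    'Salary': [25000, 30000, 27000, 28000, 32000],
})

# One row per department, one column per aggregation function
summary = df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count'])
print(summary)
```

The result is a DataFrame indexed by Department with the columns sum, mean and count.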
3. Sorting in Pandas
Sorting means arranging rows in ascending or descending order using one or more
columns.
Example:
grouped = df.groupby('Department')['Salary'].sum()
sorted_grouped = grouped.sort_values(ascending=False)
print(sorted_grouped)
Full Example:
import pandas as pd
df = pd.DataFrame(data)
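# Group by Department
grouped = df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count'])
# Sort by total salary, highest first (assumed sort key)
grouped_sorted = grouped.sort_values(by='sum', ascending=False)
print(grouped_sorted)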
Output:
2.9 MERGING, JOINING AND CONCATENATING DATAFRAMES
Syntax:
Example (Vertical):
import pandas as pd
Output:
ID Name
0 1 Alice
1 2 Bob
0 3 Charlie
1 4 David
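The vertical concatenation output above can be reproduced with a sketch like the following. The two input frames `df1` and `df2` are inferred from the printed rows and are therefore an assumption; `ignore_index=True` is shown as an option for renumbering the rows.

```python
import pandas as pd

# Input frames inferred from the printed output (assumed)
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

# Stack the rows; each frame keeps its own labels, giving 0, 1, 0, 1
stacked = pd.concat([df1, df2])
print(stacked)

# ignore_index=True renumbers the result 0..3
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)
```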
Merging combines rows based on common columns or indices, just like a relational
database join.
Syntax:
Example:
print(merged)
Output:
ID Name Salary
0 1 Alice 30000
1 2 Bob 40000
Merge Types:
how Description
inner Only matching rows from both
left All rows from left, matched from right
right All rows from right, matched from left
outer All rows from both, fill with NaN if no match
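The effect of each merge type in the table can be compared by running the same merge with every `how` value. This sketch uses two small frames, `students` and `marks`, where ID 3 exists only on the left and ID 4 only on the right; the dictionary `ids_by_how` is an illustrative helper.

```python
import pandas as pd

students = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
marks = pd.DataFrame({'ID': [1, 2, 4], 'Marks': [85, 90, 78]})

# Record which IDs survive each kind of merge
ids_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(students, marks, on='ID', how=how)
    ids_by_how[how] = merged['ID'].tolist()
    print(how, ids_by_how[how])
```

Rows without a match on the other side are kept only by the left, right or outer merges, and their missing columns are filled with NaN.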
Example:
joined = df1.join(df2)
print(joined)
Output:
Name Salary
1 Alice 30000
2 Bob 40000
df1.set_index('ID').join(df2.set_index('ID'))
# DataFrames
students = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
marks = pd.DataFrame({'ID': [1, 2, 4], 'Marks': [85, 90, 78]})
# Merging on ID
merge_inner = pd.merge(students, marks, on='ID', how='inner')
merge_outer = pd.merge(students, marks, on='ID', how='outer')
print("Inner Join:\n", merge_inner)
print("\nOuter Join:\n", merge_outer)
import pandas as pd
Output:
Name Age
0 Alice 25.0
1 Bob 27.5
2 Unknown 30.0
df.dropna(inplace=True)
2. Removing Duplicates
df = df.drop_duplicates()
print(df)
Output:
ID Name
0 1 A
1 2 B
3 3 C
3. Renaming Columns
df = df.rename(columns={'Name': 'Full_Name'})
print(df)
4. Changing Data Types
df['ID'] = df['ID'].astype(str)
print(df.dtypes)
Output:
City
0 pune
1 mumbai
2 delhi
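A lowercase City column like the one above is the kind of result produced by pandas string methods. The sketch below is an assumption about how such output could be obtained: the mixed-case input values are hypothetical, and `.str.strip()` and `.str.lower()` are applied through the Series `.str` accessor.

```python
import pandas as pd

# Hypothetical input with inconsistent case and stray spaces
df = pd.DataFrame({'City': ['Pune', 'MUMBAI', ' Delhi ']})

# Remove surrounding whitespace, then convert to lowercase
df['City'] = df['City'].str.strip().str.lower()
print(df)
```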
6. Replacing Values
Output:
8. Sorting
df = pd.DataFrame({'Name': ['C', 'A', 'B'], 'Marks': [78, 90, 85]})
df = df.sort_values(by='Marks', ascending=False)
print(df)