Informatics
Practices
DATA HANDLING
WITH PANDAS
NOTES For CLASS-XII
C.Shaaswathi
NumPy, Pandas and Matplotlib are three Python libraries for scientific and analytical use. These
libraries allow us to manipulate, transform and visualise data easily and efficiently.
NumPy, which stands for 'Numerical Python', is a package used for numerical data analysis and
scientific computing. It provides a multidimensional array object, along with functions and tools
for working with these arrays.
Pandas (PANel DAta) is a high-level Python library used for data manipulation and analysis.
It is built on packages like NumPy and Matplotlib and gives us a single, convenient place to do
most of our data analysis and visualisation work.
Pandas has three important data structures, namely Series, DataFrame and Panel, which make the
process of analysing data organised, effective and efficient.
The Matplotlib library in Python is used for plotting graphs and visualisation. Using Matplotlib,
with just a few lines of code we can generate publication quality plots, histograms, bar charts,
scatterplots, etc.
DATA STRUCTURE IN PANDAS
A data structure is a way of storing and organising data in a computer so that it can be accessed and
worked with in an appropriate way. It is a collection of data values. Pandas deals with 3 data structures,
namely:
1. Series - One-dimensional structure that stores homogeneous, mutable data.
2. DataFrame - Two-dimensional structure that stores heterogeneous, mutable data.
3. Panel - Three-dimensional structure (removed from recent versions of pandas).
SERIES
A Series is a one-dimensional, array-like structure with homogeneous data.
It consists of two arrays: one array is the INDEX and the other is the DATA.
The values in a Series are mutable, but the size of a Series is immutable.
CREATION OF SERIES
1) Syntax to create an empty Series: <seriesobject> = pandas.Series()
2) Syntax to create a Series: <seriesobject> = pandas.Series(data, index=idx)   (index is optional)
Data may be:
A sequence (Lists),
A NumPy ndarray,
Scalar value,
A Dictionary,
A mathematical expression/Function.
SYNTAX AND EXAMPLE OF CREATING A SERIES WITH
DIFFERENT DTYPES:
(A) SERIES FROM A LIST: (B) SERIES FROM SCALAR VALUE:
(C) SERIES FROM DICTIONARY: (D) SERIES FROM NUMPY NDARRAY:
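The four cases above can be sketched as follows. This is a minimal illustration only; the variable names and sample values are assumptions chosen for the example, not taken from the original notes.

```python
import numpy as np
import pandas as pd

# (A) From a list: the default index 0, 1, 2, ... is assigned.
s_list = pd.Series([10, 20, 30])

# (B) From a scalar value: the value is repeated for every index label.
s_scalar = pd.Series(5, index=['a', 'b', 'c'])

# (C) From a dictionary: keys become the index, values become the data.
s_dict = pd.Series({'India': 'NewDelhi', 'UK': 'London'})

# (D) From a NumPy ndarray, with a custom index.
s_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=['x', 'y', 'z'])

print(s_list, s_scalar, s_dict, s_array, sep='\n\n')
```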
ACCESSING ELEMENTS OF A SERIES
(A) INDEXING:
(B) SLICING:
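A short sketch of both ways of accessing elements, assuming a Series of capitals like the one used in the attribute examples further down. The variable names by_position and by_label are illustrative.

```python
import pandas as pd

s = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
              index=['India', 'USA', 'UK', 'France'])

# (A) Indexing: by label, or by position with iloc.
print(s['UK'])        # London
print(s.iloc[0])      # NewDelhi

# (B) Slicing: positional slices exclude the end position,
# label slices include the end label.
by_position = s.iloc[1:3]      # USA, UK
by_label = s['USA':'France']   # USA, UK, France
print(by_position)
print(by_label)
```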
MODIFYING SERIES ELEMENTS
(A) USING INDEX (B) USING SLICING
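Both ways of modifying elements might look like this. The Series and its values are assumed for illustration only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# (A) Using an index label: change one element.
s['b'] = 25

# (B) Using slicing: change several elements at once.
# The positional slice 2:4 covers positions 2 and 3 ('c' and 'd').
s.iloc[2:4] = 0

print(s)
```

Only the values change here; the number of elements stays the same.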
SERIES OBJECT ATTRIBUTES
name
  About:   assigns a name to the Series.
  Syntax:  <seriesobject>.name = 'name'
  Example: s.name = 'capitals'
           print(s)

index
  About:   (i) returns the index of the Series object.
           (ii) helps to rename the existing index.
  Syntax:  (i) <seriesobject>.index
           (ii) <seriesobject>.index = <new index list>
  Example: (i) print(s.index)
           (ii) s.index = [1, 2, 3, 4]
                print(s)
  Output:  (i) Index(['India', 'USA', 'UK', 'France'], dtype='object')
           (ii) 1    NewDelhi
                2    WashingtonDC
                3    London
                4    Paris
                dtype: object

index.name
  About:   assigns a name to the index of the Series.
  Syntax:  <seriesobject>.index.name = 'index name'
  Example: s.index.name = 'country'
           print(s)

values
  About:   returns the values in the Series as a NumPy array.
  Syntax:  <seriesobject>.values
  Example: print(s.values)
  Output:  ['NewDelhi' 'WashingtonDC' 'London' 'Paris']

size
  About:   returns the number of values in the Series object.
  Syntax:  <seriesobject>.size
  Example: print(s.size)
  Output:  4

hasnans
  About:   returns True if any NaN values exist in the Series, otherwise False.
  Syntax:  <seriesobject>.hasnans
  Example: print(s.hasnans)
  Output:  False

empty
  About:   returns True if the Series is empty (has no elements), otherwise False.
  Syntax:  <seriesobject>.empty
  Example: print(s.empty)
  Output:  False

dtype
  About:   returns the data type of the Series.
  Syntax:  <seriesobject>.dtype
  Example: print(s.dtype)
  Output:  object
SERIES OBJECT FUNCTIONS
(In the examples below, seriesTenTwenty holds the values 10 to 19 with the default index 0 to 9;
the loc[] example uses a separate Series S with the labels 'a', 'b' and 'c'.)

iloc
  About:   used for indexing based on position, i.e., by using the index number.
  Syntax:  <seriesobject>.iloc[]
  Example: print(seriesTenTwenty.iloc[2:4])
  Output:  2    12
           3    13
           dtype: int64

loc
  About:   used for indexing based on name, i.e., by using the row label/index name.
  Syntax:  <seriesobject>.loc[]
  Example: print(S.loc['a':'c'])
  Output:  a    10
           b    20
           c    10
           dtype: int64

head()
  About:   returns the first n members of the Series. If the value for n is not passed,
           n takes the default value 5 and the first five members are displayed.
  Syntax:  <seriesobject>.head()
  Example: print(seriesTenTwenty.head())

tail()
  About:   returns the last n members of the Series. If the value for n is not passed,
           n takes the default value 5 and the last five members are displayed.
  Syntax:  <seriesobject>.tail()
  Example: print(seriesTenTwenty.tail())

count()
  About:   returns the number of non-NaN values in the Series.
  Syntax:  <seriesobject>.count()
  Example: print(seriesTenTwenty.count())
  Output:  10

drop()
  About:   removes the elements with the given index labels.
  Syntax:  <seriesobject>.drop(index, inplace=True/False)
  Example: seriesTenTwenty.drop([2, 4, 6, 8], inplace=True)
           print(seriesTenTwenty)
  Output:  0    10
           1    11
           3    13
           5    15
           7    17
           9    19
           dtype: int64
MATHEMATICAL OPERATIONS ON SERIES
(A) ADDITION OF TWO SERIES (B) SUBTRACTION OF TWO SERIES
(C) MULTIPLICATION OF TWO SERIES (D) DIVISION OF TWO SERIES
Using fill_value=<number> in the add(), sub(), mul() and div() methods, we can replace the missing
values with that number before the operation is carried out. (not in the syllabus)
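The four operations and the effect of fill_value can be sketched as below. The two Series and their labels are assumed sample data; they were chosen so that some labels appear in only one Series.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Operators align on index labels; labels present in only one
# Series give NaN in the result.
total = s1 + s2            # 'a' and 'd' become NaN
difference = s1 - s2
product = s1 * s2
quotient = s1 / s2

# The method forms accept fill_value, used in place of a missing value.
total_filled = s1.add(s2, fill_value=0)

print(total)
print(total_filled)
```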
VECTOR OPERATIONS ON SERIES
Ex:1 Ex:2 Ex:3
All the other comparison and mathematical operators can be used in the same way.
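A vector operation applies one operator to every element of the Series at once. A minimal sketch, with assumed sample values:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

plus_two = s + 2     # adds 2 to every element
doubled = s * 2      # multiplies every element by 2
squared = s ** 2     # squares every element

print(plus_two.tolist())
print(doubled.tolist())
print(squared.tolist())
```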
COMPARING THE DATA
Ex:1
The same can be done with all the other relational operators, and the output will be in BOOLEAN VALUES.
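For instance, comparing a Series with a number gives one True/False value per element. The sample values here are assumed for illustration.

```python
import pandas as pd

s = pd.Series([5, 12, 8, 20])

result = s > 10          # element-wise comparison
print(result)            # False, True, False, True
print(s[result])         # keeps only the values greater than 10
```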
DATAFRAME
A DataFrame is a two-dimensional labelled data structure, like a table in MySQL.
It contains rows and columns, and therefore has both a row index and a column index.
Each column can have a different type of value such as numeric, string, boolean, etc., as in tables
of a database.
CREATION OF DATAFRAME:
Data may be:
A NumPy ndarray,
List of dictionary
Dictionary of list
Series
Dictionary of series
SYNTAX AND EXAMPLE OF CREATING A DATAFRAME WITH
DIFFERENT DTYPES:
(A) CREATION OF DATAFRAME FROM NUMPY NDARRAYS:
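One possible way to do this, assuming two small arrays and column labels made up for the example:

```python
import numpy as np
import pandas as pd

a1 = np.array([10, 20, 30])
a2 = np.array([40, 50, 60])

# Each array becomes one row; columns= supplies the column labels.
df = pd.DataFrame([a1, a2], columns=['A', 'B', 'C'])
print(df)
```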
(B) CREATION OF DATAFRAME FROM LIST OF DICTIONARIES:
Here, the dictionary keys are taken as column labels, and each dictionary in the list becomes one row. A key that is missing from a dictionary gives NaN in that row.
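A small sketch of this, reusing the Name/Age/City data from the attribute examples below. The third dictionary deliberately leaves out 'City' (an assumption made for the example) to show how a missing key appears.

```python
import pandas as pd

records = [
    {'Name': 'ram', 'Age': 20, 'City': 'ayodhya'},
    {'Name': 'sita', 'Age': 22, 'City': 'lanka'},
    {'Name': 'hanu', 'Age': 24},   # 'City' is missing here
]
df = pd.DataFrame(records)
print(df)
```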
(C) CREATION OF DATAFRAME FROM DICTIONARY OF LISTS:
The dictionary keys become column labels by default in a dataframe, and each list supplies the values of its column, one value per row.
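A minimal sketch using the same Name/Age/City data that appears in the attribute examples below:

```python
import pandas as pd

data = {
    'Name': ['ram', 'sita', 'hanu'],
    'Age': [20, 22, 24],
    'City': ['ayodhya', 'lanka', 'japali'],
}
df = pd.DataFrame(data)
print(df)
print(df.shape)   # (3, 3)
```

All the lists must have the same length, otherwise pandas raises an error.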
(D) CREATION OF DATAFRAME FROM SERIES:
(i) We can create a dataframe using a single series as shown below:
(ii) To create a dataframe using more than one series, we need to pass multiple series in the list as shown
below:
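Both cases can be sketched as follows. The Series s1 and s2 are assumed sample data; s2 is shorter on purpose to show what happens when labels do not match.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['a', 'b'])

# (i) A single Series becomes a single column.
df_one = pd.DataFrame(s1)

# (ii) A list of Series: each Series becomes one row, and the
# index labels of the Series become the column labels.
df_many = pd.DataFrame([s1, s2])

print(df_one)
print(df_many)
```

In df_many, the second row has NaN under 'c' because s2 has no value for that label.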
(E) CREATION OF DATAFRAME FROM DICTIONARY OF SERIES:
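A sketch of this case, assuming two Series of marks whose names and values are made up for the example:

```python
import pandas as pd

marks = {
    'IP': pd.Series([90, 85], index=['ram', 'sita']),
    'English': pd.Series([80, 75, 70], index=['ram', 'sita', 'hanu']),
}
# Keys become column labels; the row index is the union of the
# Series indexes, with NaN where a Series has no value for a label.
df = pd.DataFrame(marks)
print(df)
```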
DATAFRAME ATTRIBUTES
index
  About:   displays the index/row labels.
  Syntax:  <dataframeobject>.index
  Example: print(df.index)
  Output:  RangeIndex(start=0, stop=3, step=1)

columns
  About:   displays the column names/labels.
  Syntax:  <dataframeobject>.columns
  Example: print(df.columns)
  Output:  Index(['Name', 'Age', 'City'], dtype='object')

axes
  About:   displays both the index and the column names.
  Syntax:  <dataframeobject>.axes
  Example: print(df.axes)
  Output:  [RangeIndex(start=0, stop=3, step=1),
            Index(['Name', 'Age', 'City'], dtype='object')]

dtypes
  About:   displays the data type of each column.
  Syntax:  <dataframeobject>.dtypes
  Example: print(df.dtypes)
  Output:  Name    object
           Age      int64
           City    object
           dtype: object

size
  About:   returns the size of the df, which is the product of the no. of rows and columns.
  Syntax:  <dataframeobject>.size
  Example: print(df.size)
  Output:  9

values
  About:   displays a NumPy ndarray having all the values, without the axes labels.
  Syntax:  <dataframeobject>.values
  Example: print(df.values)
  Output:  [['ram' 20 'ayodhya']
            ['sita' 22 'lanka']
            ['hanu' 24 'japali']]

shape
  About:   returns the size of the df as the no. of rows and columns individually.
  Syntax:  <dataframeobject>.shape
  Example: print(df.shape)
  Output:  (3, 3)

ndim
  About:   returns the number of dimensions of the df.
  Syntax:  <dataframeobject>.ndim
  Example: print(df.ndim)
  Output:  2

empty
  About:   returns True if the df is empty (has no rows or no columns), otherwise False.
  Syntax:  <dataframeobject>.empty
  Example: print(df.empty)
  Output:  False

T (Transpose)
  About:   returns the transpose, i.e., rows as columns and columns as rows.
  Syntax:  <dataframeobject>.T
  Example: print(df.T)
  Output:              0      1       2
           Name      ram   sita    hanu
           Age        20     22      24
           City  ayodhya  lanka  japali

count()
  About:   this function returns the no. of non-null/non-missing values in each column
           (excludes NaN values).
  Syntax:  <dataframeobject>.count()
  Example: print(df.count())
  Output:  Name    3
           Age     3
           City    3
           dtype: int64

head()
  About:   returns the first n rows of the df. If the value for n is not passed,
           n takes the default value 5 and the first five rows are displayed.
  Syntax:  <dataframeobject>.head()
  Example: print(df.head(2))
  Output:     Name  Age     City
           0   ram   20  ayodhya
           1  sita   22    lanka

tail()
  About:   returns the last n rows of the df. If the value for n is not passed,
           n takes the default value 5 and the last five rows are displayed.
  Syntax:  <dataframeobject>.tail()
  Example: print(df.tail(2))
  Output:     Name  Age    City
           1  sita   22   lanka
           2  hanu   24  japali
OTHER IMPORTANT ATTRIBUTES OF DATAFRAME
Setting index of df
  About:   makes any of the columns the index.
  Syntax:  <dataframeobject>.set_index(<column_name>, inplace=True)
           column_name: the column which is to be set as the index
  Example: df.set_index('Country', inplace=True)
           print(df)

Resetting index of df
  About:   resets the index (undoes the above operation).
  Syntax:  <dataframeobject>.reset_index(inplace=True)
  Example: df.reset_index(inplace=True)
           print(df)

Retrieving & accessing rows/columns
  (i) iat
  About:   helps to access a single value using the row and column index numbers.
  Syntax:  <dataframeobject>.iat[row_number, column_number]
  Example: print(df.iat[1, 0])
  Output:  France

  (ii) at
  About:   helps to access a single value using the row and column labels.
  Syntax:  <dataframeobject>.at[row_label, column_label]
  Example: print(df.at[1, 'Capital'])
  Output:  Paris

  (iii) slicing
  About:   can be used to access many values/rows.
  Syntax:  <dataframeobject>[start:stop]
  Example: print(df[1:3])
OPERATION ON ROWS AND COLUMNS IN DATAFRAME
Adding/modifying a row/data
  (i) loc
  About:   the loc attribute helps us to add a new row/data.
  Syntax:  <dataframeobject>.loc[index] = [data]
  Example: df.loc[3] = ['pret', 97, 97]
           print(df)

  (ii) at
  About:   the at attribute helps us to add a new row/data.
  Syntax:  <dataframeobject>.at[row label, :] = [data]
  Example: df.at[3, :] = ['pret', 97, 97]
           print(df)
  Note:    at is meant for single values; if this form raises an error in your
           version of pandas, use loc instead.

Renaming a column
  About:   rename() helps us to rename a column.
  Syntax:  <dataframeobject>.rename(columns={old col_name: new col_name}, inplace=True)
  Example: df.rename(columns={'english': 'eng'}, inplace=True)
           print(df)

Renaming a row (row index/label)
  About:   rename() helps us to rename a row label.
  Syntax:  <dataframeobject>.rename({old row_name: new row_name}, axis=0, inplace=True)
  Example: df.rename({0: 'first'}, axis=0, inplace=True)
  Note:    axis=0 renames row labels; axis=1, as in
           df.rename({'IP': 'ip'}, axis=1, inplace=True), renames columns.

Adding a new column
  Syntax:  <dataframeobject>[new col_name] = [new data]
  Example: df['HPE'] = [100, 100, 100]
           print(df)

Adding a column in a specified position
  About:   insert() helps to add a new column in a specified position.
  Syntax:  <dataframeobject>.insert(n, new col_name, [new data])
           n: the index where the new column is to be inserted
  Example: df.insert(1, 'HPE', [100, 100, 100])
           print(df)

Selecting a column
  (i) using square brackets
  Syntax:  <dataframeobject>['column_name']
  Example: print(df['IP'])

  (ii) using dot notation
  Syntax:  <dataframeobject>.column_name
  Example: print(df.IP)

  (iii) using .iloc (this can be used for both rows & columns)
  (a) columns
  Syntax:  <dataframeobject>.iloc[:, column-index]
  Example: print(df.iloc[:, 1:3])
  (b) rows
  Syntax:  <dataframeobject>.iloc[row-index, :]
  Example: print(df.iloc[1:2, :])
  (c) both rows & columns
  Syntax:  <dataframeobject>.iloc[row-index, column-index]
  Example: print(df.iloc[1:2, 0:2])

  (iv) using .loc
  (a) column
  Syntax:  <dataframeobject>.loc[:, column-label]
  Example: print(df.loc[:, 'IP'])
  (b) rows
  Syntax:  <dataframeobject>.loc[row-label, :]
  Example: print(df.loc[1, :])
  (c) both rows & columns
  Syntax:  <dataframeobject>.loc[row-label, column-label]
  Example: print(df.loc[1, 'Name'])

Deleting a column
  (i) using the del keyword
  Syntax:  del <dataframeobject>[col_name]
  Example: del df['English']
           print(df)

  (ii) using the pop() method
  Syntax:  <dataframeobject>.pop(col_name)
  Example: print(df.pop('IP'))

  (iii) using drop() (this is used for both rows/data and columns)
  (a) rows/data
  Syntax:  <dataframeobject>.drop(index, axis=0, inplace=True/False)
           axis=0 is for removing rows and it is the default value
  Example: print(df.drop(1, axis=0))
  (b) columns
  Syntax:  <dataframeobject>.drop(col_name, axis=1, inplace=True/False)
           axis=1 is for removing columns
  Example: print(df.drop('Name', axis=1))

Sorting data
  (i) ascending order
  Syntax:  <dataframeobject>.sort_values(by=[col_name])
           ascending=True is not mandatory because it is the default value
  Example: print(df.sort_values(by=['IP']))

  (ii) descending order
  Syntax:  <dataframeobject>.sort_values(by=[col_name], ascending=False)
  Example: print(df.sort_values(by=['IP'], ascending=False))
ITERATION IN DATAFRAME
1. ITEMS(): FOR ITERATING THE COLUMNS 2. ITERROWS(): FOR ITERATING THE ROWS
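A minimal sketch of both kinds of iteration, on an assumed two-row DataFrame. The lists column_labels and row_names are only there to collect what the loops visit.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['ram', 'sita'], 'Age': [20, 22]})

# 1. items(): yields (column label, column as a Series) pairs.
column_labels = []
for label, column in df.items():
    column_labels.append(label)
    print(label, column.tolist())

# 2. iterrows(): yields (row label, row as a Series) pairs.
row_names = []
for row_label, row in df.iterrows():
    row_names.append(row['Name'])
    print(row_label, row['Name'], row['Age'])
```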
BOOLEAN INDEXING
Ex:1
The same can be done with all the other relational operators, and the output will be in BOOLEAN VALUES.
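Boolean indexing uses such a True/False result to pick rows. A short sketch, assuming a small Name/Age DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['ram', 'sita', 'hanu'], 'Age': [20, 22, 24]})

mask = df['Age'] > 20     # Boolean Series: False, True, True
older = df[mask]          # keeps only the rows where mask is True
print(mask)
print(older)

# The same selection with .loc, also picking a single column.
names = df.loc[df['Age'] > 20, 'Name']
print(names)
```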
CSV FILE
CSV stands for Comma Separated Values.
It is a simple file format used to store tabular data, such as a spreadsheet or database.
A CSV file stores tabular data in plain text.
Each line of the file is a data record.
Each record consists of one or more fields, which are separated by commas.
Python has a built-in module called csv.
ADVANTAGES OF CSV FORMAT
READING A CSV FILE (from excel to python)
1) Syntax to read a csv file: <dataframeobject> = pd.read_csv(<filepath>)
Steps to do it:
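One way to try read_csv without a file on disk is to read from a text buffer. In this sketch the CSV contents and the names in it are made-up sample data; with a real file, the path (for example 'PRAC.csv') is passed instead of the buffer.

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file, pass its path instead,
# e.g. pd.read_csv('PRAC.csv').
csv_text = "NAME,STD,SEC\nAsha,12,A\nRavi,12,B\nMeena,11,A\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Only selected columns and only the first two data rows.
part = pd.read_csv(io.StringIO(csv_text), usecols=['NAME', 'SEC'], nrows=2)
print(part)
```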
FUNCTIONS/ATTRIBUTES IN CSV FILE (while reading)
To return selected/specific columns
  Syntax:  <dataframeobject> = pd.read_csv(<filepath>, usecols=[col_name])
  Example: df = pd.read_csv('PRAC.csv', usecols=['NAME', 'SEC'])
           print(df)

To return selected/specific rows
  Syntax:  <dataframeobject> = pd.read_csv(<filepath>, nrows=no. of rows)
  Example: df = pd.read_csv('PRAC.csv', nrows=2)
           print(df)

Reading a csv without header
  Syntax:  <dataframeobject> = pd.read_csv(<filepath>, header=None/number)
           If header=None, the line holding the col_names is read as data as well,
           and the columns are numbered 0, 1, 2, ...
           If header=number, that row is used as the col_names and the rows after it
           are read as the data.
  Example: (i) None:
               df = pd.read_csv('PRAC.csv', header=None)
               print(df)
           (ii) Number:
               df = pd.read_csv('PRAC.csv', header=2)
               print(df)

Reading a csv with a column as the index
  Syntax:  <dataframeobject> = pd.read_csv(<filepath>, index_col=col_index number)
  Example: df = pd.read_csv('PRAC.csv', index_col=0)
           print(df)

Reading a csv with new column names
  Syntax:  <dataframeobject> = pd.read_csv(<filepath>, skiprows=1, names=[new col_names])
  Example: df = pd.read_csv('PRAC.csv', skiprows=1, names=['s_Name', 's_Std', 's_Sec'])
           print(df)
UPDATING/ MODIFYING IN A CSV FILE
Example:
WRITING A CSV FILE (from python to excel)
Output: (shown in excel/notepad)
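Updating and writing can be sketched together: change a value in the DataFrame, then save it with to_csv(). The data, the file name PRAC_updated.csv and the temporary folder are assumptions made for this example.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'NAME': ['Asha', 'Ravi'], 'STD': [12, 12], 'SEC': ['A', 'B']})

# Update a value in memory, then write the DataFrame back out.
df.loc[df['NAME'] == 'Ravi', 'SEC'] = 'C'

path = os.path.join(tempfile.mkdtemp(), 'PRAC_updated.csv')
df.to_csv(path, index=False)   # index=False leaves out the row index

# Read the file back to confirm what was written.
check = pd.read_csv(path)
print(check)
```

The saved file can then be opened in Excel or Notepad like any other CSV file.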
------------------------------------------------------------------------------------------------
SOME IMPORTANT DIFFERENCES:
1. Difference b/w Series and DataFrame.
Series | DataFrame
One-dimensional (1D) | Two-dimensional (2D)
Labelled array, similar to a single column or list | Tabular structure with rows and columns, similar to a spreadsheet or SQL table
Elements must be of the same data type (homogeneous) | Columns can have different data types (heterogeneous)
Values are mutable, size is immutable | Values and size are both mutable
2. Difference b/w Pandas and NumPy.
Pandas | NumPy
Series (1D labelled array), DataFrame (2D labelled table) | ndarray (N-dimensional array)
Data manipulation, analysis, and handling of structured/tabular data | Numerical computations, array operations, scientific computing
Columns can have different data types (heterogeneous) | Elements must be of the same data type (homogeneous)
Tabular data with labelled rows (index) and columns | Raw numerical arrays, matrices
3. Difference b/w .loc[] and .iloc[].
.loc[] | .iloc[]
Selects data based on the labels of rows and columns. | Selects data based on the integer positions of rows and columns, starting from 0.
When slicing, the end label is inclusive. | When slicing, the end position is exclusive.
Can be used with a boolean Series for conditional selection. | Does not accept a boolean Series, so conditional selection is usually written with .loc[].
Syntax: <seriesobject>.loc[] | Syntax: <seriesobject>.iloc[]
4. Difference b/w .at and .iat.
.at | .iat
Accesses a single value using the row label and column label. | Accesses a single value using the row index number and column index number.
Label-based indexing. | Position-based indexing.
Accepts string/integer labels. | Accepts only integer positions.
Syntax: .at[row_label, column_label] | Syntax: .iat[row_index, column_index]
5. Difference b/w size and count().
size | count()
Returns the total number of elements in the DataFrame/Series. | Returns the number of non-NaN values in each column/row.
NaN values are counted. | NaN values are excluded.
Syntax: <dataframeobject>.size | Syntax: <dataframeobject>.count()
6. Difference between drop() method with axis=0 and axis=1.
axis=0 | axis=1
Drops/removes rows from the DataFrame. | Drops/removes columns from the DataFrame.
axis=0 refers to rows. | axis=1 refers to columns.
Example: df.drop(2, axis=0) removes the row with index label 2. | Example: df.drop('colname', axis=1) removes the column labelled 'colname'.
7. Difference between shape and ndim attributes.
shape | ndim
Returns the dimensions of the object as a tuple (rows, columns). | Returns the number of dimensions of the object.
Returns the output as a tuple. | Returns the output as an integer (1 for Series, 2 for DataFrame).
8. Difference between del, pop() and drop().
del | pop() | drop()
Deletes a column permanently. | Removes a column and returns it. | Removes specified rows/columns.
Does not return anything. | Returns the deleted column as a Series. | Returns a new DataFrame (unless inplace=True).
Works only on columns. | Works only on columns. | Works on both rows and columns.
del df['colname'] | df.pop('colname') | For column: df.drop('colname', axis=1); for row: df.drop(index, axis=0)
9. Difference between head() and tail() methods.
head() | tail()
Returns the first n rows of a DataFrame/Series. | Returns the last n rows of a DataFrame/Series.
First 5 rows if n is not specified. | Last 5 rows if n is not specified.
Syntax: <dataframeobject>.head() | Syntax: <dataframeobject>.tail()