Pandas
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of fields, both academic and commercial, including finance, economics, statistics, and analytics. In this tutorial, we will learn the various features of Python Pandas and how to use them in practice.
Prior to Pandas, Python was mainly used for data munging and preparation; it contributed very little to data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data — load, prepare, manipulate, model, and analyze.
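The five steps above can be sketched end to end. This is a minimal illustration; the in-memory CSV text and the column names are invented for the example:

```python
import io
import pandas as pd

# Load: read raw data (an in-memory CSV stands in for a file on disk)
raw = io.StringIO("region,units\nEast,10\nWest,\nEast,30\n")
df = pd.read_csv(raw)

# Prepare: drop the row whose 'units' value is missing
df = df.dropna()

# Manipulate / analyze: total units per region
totals = df.groupby("region")["units"].sum()
print(totals)
```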
The standard Python distribution doesn't come bundled with the Pandas module. A lightweight alternative is to install Pandas using the popular Python package installer, pip.
If you install the Anaconda Python package, Pandas will be installed by default. The following distributions bundle it along with the SciPy stack −
Windows
• Anaconda (from https://www.continuum.io) is a free Python distribution for SciPy stack. It is also
available for Linux and Mac.
• Canopy (https://www.enthought.com/products/canopy/) is available as free as well as commercial
distribution with full SciPy stack for Windows, Linux and Mac.
• Python (x,y) is a free Python distribution with SciPy stack and Spyder IDE for Windows OS.
(Downloadable from http://python-xy.github.io/)
Linux
Package managers of respective Linux distributions are used to install one or more packages in SciPy stack.
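If none of these distributions is available, Pandas can also be installed directly with pip. A minimal session, assuming Python and pip are already on the PATH:

```shell
# Install pandas (NumPy is pulled in automatically as a dependency)
python -m pip install pandas

# Verify the installation by importing the package and printing its version
python -c "import pandas; print(pandas.__version__)"
```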
Pandas deals with the following three data structures −
• Series
• DataFrame
• Panel
These data structures are built on top of NumPy arrays, which makes them fast.

Data Structure   Dimensions   Description
Series           1            1D labeled, homogeneous array; size-immutable.
DataFrame        2            General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel            3            General 3D labeled, size-mutable array.

Mutability
All Pandas data structures are value mutable (can be changed) and except
Series all are size mutable. Series is size immutable.
Note − DataFrame is widely used and one of the most important data
structures. Panel is used much less.
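As a quick sketch of the first two structures (note that Panel has been removed from recent pandas releases, where 3D data is usually handled with a MultiIndex DataFrame instead):

```python
import pandas as pd

# Series: one-dimensional, homogeneous, size-immutable
s = pd.Series([10, 23, 56])

# DataFrame: two-dimensional, heterogeneous columns, size-mutable
df = pd.DataFrame({"Name": ["Steve"], "Age": [32]})

# A Series has 1 dimension, a DataFrame has 2
print(s.ndim, df.ndim)
```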
Series
Series is a one-dimensional array-like structure with homogeneous data. For example, the following series is a collection of the integers 10, 23, 56, …
10 23 56 17 52 61 73 90 26 72
Key Points
• Homogeneous data
• Size Immutable
• Values of Data Mutable
DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For
example,
The table represents the data of a sales team of an organization with their
overall performance rating. The data is represented in rows and columns.
Each column represents an attribute and each row represents a person.
Column Type
Name String
Age Integer
Gender String
Rating Float
Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
Panel
Panel is a three-dimensional data structure with heterogeneous data. It is
hard to represent the panel in graphical representation. But a panel can be
illustrated as a container of DataFrame.
Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
Python Pandas - Series
pandas.Series
A pandas Series can be created using the following constructor −
pandas.Series( data, index, dtype, copy )
The parameters of the constructor are as follows −
1 data
data takes various forms like ndarray, list, constants.
2 index
Index values must be unique and hashable, same length as data. Default np.arange(n) if no index is passed.
3 dtype
dtype is for data type. If None, data type will be inferred.
4 copy
Copy data. Default False.
A series can be created using various inputs like −
• Array
• Dict
• Scalar value or constant
Create an Empty Series
A basic series, which can be created, is an empty Series.
Example
#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series(dtype='float64')
print(s)
Its output is as follows −
Series([], dtype: float64)
Create a Series from ndarray
If data is an ndarray, then the index passed must be of the same length. If no index is passed, the default index will be range(n), where n is the array length.
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Its output is as follows −
0 a
1 b
2 c
3 d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging
from 0 to len(data)-1, i.e., 0 to 3.
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Its output is as follows −
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed
values in the output.
Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary keys are taken to construct the index.
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)
Its output is as follows −
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Observe − Index order is persisted and the missing element is filled with
NaN (Not a Number).
Create a Series from Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.
#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Its output is as follows −
0 5
1 5
2 5
3 5
dtype: int64
Accessing Data from Series with Position
Data in the series can be accessed similar to that in an ndarray.
Example 1
Retrieve the first element. As we already know, counting starts from zero for the array, which means the first element is stored at the zeroth position and so on.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the first element
print(s[0])
Its output is as follows −
1
Example 2
Retrieve the first three elements in the Series. If a : is inserted in front of an index, all items from that index onwards will be extracted. If two parameters (with : between them) are used, items between the two indexes (not including the stop index) are extracted.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the first three elements
print(s[:3])
Its output is as follows −
a 1
b 2
c 3
dtype: int64
Example 3
Retrieve the last three elements.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the last three elements
print(s[-3:])
Its output is as follows −
c 3
d 4
e 5
dtype: int64
Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.
Example 1
Retrieve a single element using the index label value.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve a single element
print(s['a'])
Its output is as follows −
1
Example 2
Retrieve multiple elements using a list of index label values.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve multiple elements
print(s[['a','c','d']])
Its output is as follows −
a 1
c 3
d 4
dtype: int64
Example 3
If a label is not contained in the index, an exception is raised.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve an element that is not present
print(s['f'])
Its output is as follows −
…
KeyError: 'f'
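If raising an exception is not desirable, Series.get() can be used instead; it returns None (or a supplied default) when the label is absent:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# No KeyError here: a missing label yields None or the given default
print(s.get('f'))             # None
print(s.get('f', 'absent'))   # absent
```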
Python Pandas - DataFrame
Features of DataFrame
• Potentially columns are of different types
• Size – Mutable
• Labeled axes (rows and columns)
• Can Perform Arithmetic operations on rows and columns
Structure
Let us assume that we are creating a data frame with student’s data.
pandas.DataFrame
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
The parameters of the constructor are as follows −
1 data
data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.
2 index
For the row labels, the Index to be used for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3 columns
For column labels, the optional default is np.arange(n). This is only true if no index is passed.
4 dtype
Data type of each column.
5 copy
Whether to copy the input data. Default False.
Create DataFrame
A pandas DataFrame can be created using various inputs like −
• Lists
• dict
• Series
• Numpy ndarrays
• Another DataFrame
Create an Empty DataFrame
A basic DataFrame, which can be created, is an empty DataFrame.
Example
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Its output is as follows −
Empty DataFrame
Columns: []
Index: []
Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.
Example 1
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
Its output is as follows −
0
0 1
1 2
2 3
3 4
4 5
Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Its output is as follows −
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
Example 3
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)
Its output is as follows −
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
Note − Observe, the dtype parameter changes the type of Age column to
floating point.
Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, the length of the index should equal the length of the arrays.
Example 1
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
Its output is as follows −
Age Name
0 28 Tom
1 34 Jack
2 29 Steve
3 42 Ricky
Note − Observe the values 0,1,2,3. They are the default index assigned to
each using the function range(n).
Example 2
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Its output is as follows −
Age Name
rank1 28 Tom
rank2 34 Jack
rank3 29 Steve
rank4 42 Ricky
Create a DataFrame from List of Dicts
A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.
Example 1
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
Its output is as follows −
a b c
0 1 2 NaN
1 5 10 20.0
Example 2
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
Its output is as follows −
a b c
first 1 2 NaN
second 5 10 20.0
Example 3
The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)
Its output is as follows −
#df1 output
a b
first 1 2
second 5 10
#df2 output
a b1
first 1 NaN
second 5 NaN
Note − Observe, df2 DataFrame is created with a column index other than the dictionary keys; thus, the NaN's are appended in place. Whereas df1 is created with column indices the same as the dictionary keys, so no NaN's are appended.
Create a DataFrame from Dict of Series
A dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed.
Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
Its output is as follows −
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Note − Observe, for the series one, there is no label 'd' passed; hence, for the d label, NaN is appended in the result.
Column Selection
We will understand this by selecting a column from the DataFrame.
Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])
Its output is as follows −
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
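Passing a list of labels instead of a single label selects several columns at once and returns a DataFrame rather than a Series. A small sketch, reusing the same dictionary d as above:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# A list of labels selects multiple columns; the result is a DataFrame
print(df[['one', 'two']])
```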
Column Addition
We will understand this by adding a new column to an existing data frame.
Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Adding a new column by passing a Series
df['three'] = pd.Series([10,20,30], index=['a','b','c'])
# Adding a new column using existing columns
df['four'] = df['one'] + df['three']
print(df)
Column Deletion
Columns can be deleted or popped; let us take an example to understand how.
Example
# Using the previous DataFrame, we will delete a column using the del function
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)
# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print(df)
Selection by Label
Rows can be selected by passing a row label to the loc function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
Its output is as follows −
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And,
the Name of the series is the label with which it is retrieved.
Selection by Integer Location
Rows can be selected by passing an integer location to the iloc function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
Its output is as follows −
one 3.0
two 3.0
Name: c, dtype: float64
Slice Rows
Multiple rows can be selected using the : operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
Its output is as follows −
one two
c 3.0 3
d NaN 4
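Note that positional slices like df[2:4] exclude the stop position, while label-based slices with loc include the stop label. A sketch using the same dictionary d as above:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Label slicing with .loc is inclusive of the stop label 'd'
print(df.loc['b':'d'])
```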
Addition of Rows
Add new rows to a DataFrame using the append function. This function will append the rows at the end.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df.append(df2)
print(df)
Its output is as follows −
a b
0 1 2
1 3 4
0 5 6
1 7 8
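append() was deprecated and later removed in pandas 2.0; on recent versions the same result is obtained with pd.concat:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# pd.concat stacks the frames vertically, keeping the original row labels
df = pd.concat([df, df2])
print(df)
```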
Deletion of Rows
Use the index label to delete or drop rows from a DataFrame. If you observe, in the above example, the labels are duplicated. Let us drop a label and see how many rows get dropped.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df.append(df2)
# Drop rows with label 0
df = df.drop(0)
print(df)
Its output is as follows −
a b
1 3 4
1 7 8
In the above example, two rows were dropped because those two contain
the same label 0.
Python Pandas - Panel
The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −
• items − axis=0; each item corresponds to a DataFrame contained inside.
• major_axis − axis=1; it is the index (rows) of each of the DataFrames.
• minor_axis − axis=2; it is the columns of each of the DataFrames.
The data parameter takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.
Create Panel
A Panel can be created using multiple ways like −
• From ndarrays
• From dict of DataFrames
From 3D ndarray
# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np
data = np.random.rand(2,4,5)
p = pd.Panel(data)
print(p)
Its output is as follows −
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
Create an Empty Panel
An empty Panel can be created using the Panel constructor with no arguments −
#creating an empty panel
import pandas as pd
p = pd.Panel()
print(p)
Its output is as follows −
<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None
Note − Observe the dimensions of the empty panel and the panel above; all the objects are different.
Selecting the Data from Panel
Using Items
Select the data from the panel using the items axis −
# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])
Its output is as follows −
0 1 2
0 0.488224 -0.128637 0.930817
1 0.417497 0.896681 0.576657
2 -2.775266 0.571668 0.290082
3 -0.400538 -0.144234 1.110535
We have two items, and we retrieved Item1. The result is a DataFrame with 4 rows and 3 columns, which are the major_axis and minor_axis dimensions.
Using major_axis
Data can be accessed using the method panel.major_xs(index) −
# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.major_xs(1))
Its output is as follows −
Item1 Item2
0 0.417497 0.748412
1 0.896681 -0.557322
2 0.576657 NaN
Using minor_axis
Data can be accessed using the method panel.minor_xs(index) −
# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.minor_xs(1))
Its output is as follows −
Item1 Item2
0 -0.128637 -1.047032
1 0.896681 -0.557322
2 0.571668 0.431953
3 -0.144234 1.302466
Python Pandas - Basic Functionality
By now, we have learnt about the three Pandas data structures and how to create them. We will mainly focus on DataFrame objects because of their importance in real-time data processing, and also discuss a few other data structures.
Series Basic Functionality
The following attributes and methods cover Series basic functionality.
1 axes
Returns a list of the row axis labels
2 dtype
Returns the dtype of the object.
3 empty
Returns True if series is empty.
4 ndim
Returns the number of dimensions of the underlying data, by
definition 1.
5 size
Returns the number of elements in the underlying data.
6 values
Returns the Series as ndarray.
7 head()
Returns the first n rows.
8 tail()
Returns the last n rows.
Let us now create a Series and see all the above tabulated attributes in operation.
Example
import pandas as pd
import numpy as np
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)
Its output is as follows −
0 0.967853
1 -0.148368
2 -1.395906
3 -1.758394
dtype: float64

axes
Returns the list of the labels of the series.
print ("The axes are:")
print(s.axes)

empty
Returns the Boolean value saying whether the object is empty or not. True indicates that the object is empty.
print ("Is the Object empty?")
print(s.empty)

ndim
Returns the number of dimensions of the object. By definition, a Series is a 1D data structure, so it returns 1.
s = pd.Series(np.random.randn(4))
print(s)
print ("The dimensions of the object:")
print(s.ndim)
Its output is as follows −
0 0.175898
1 0.166197
2 -0.609712
3 -1.377000
dtype: float64
The dimensions of the object:
1

size
Returns the size (length) of the series.
s = pd.Series(np.random.randn(2))
print(s)
print ("The size of the object:")
print(s.size)
Its output is as follows −
0 3.078058
1 -1.207803
dtype: float64
The size of the object:
2

values
Returns the actual data in the series as an array.
print ("The actual data series is:")
print(s.values)

head() & tail()
To view a small sample of a Series or the DataFrame object, use the head() and the tail() methods.
head() returns the first n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
print ("The first two rows of the data series:")
print(s.head(2))
tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
print ("The last two rows of the data series:")
print(s.tail(2))
DataFrame Basic Functionality
The following attributes and methods cover DataFrame basic functionality.
1 T
Transposes rows and columns.
2 axes
Returns a list with the row axis labels and column axis labels as the only members.
3 dtypes
Returns the dtypes in this object.
4 empty
True if NDFrame is entirely empty (no items) or if any of the axes are of length 0.
5 ndim
Number of axes / array dimensions.
6 shape
Returns a tuple representing the dimensionality of the DataFrame.
7 size
Number of elements in the NDFrame.
8 values
Numpy representation of NDFrame.
9 head()
Returns the first n rows.
10 tail()
Returns the last n rows.
Let us now create a DataFrame and see how the above-mentioned attributes operate.
Example
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print(df)

T (Transpose)
Returns the transpose of the DataFrame. The rows and columns will interchange.
print ("The transpose of the data frame is:")
print(df.T)

axes
Returns the list of row axis labels and column axis labels.
print ("Row axis labels and column axis labels are:")
print(df.axes)

dtypes
Returns the data type of each column.
print ("The data types of each column are:")
print(df.dtypes)

empty
Returns the Boolean value saying whether the object is empty or not; True indicates that the object is empty.
print ("Is the object empty?")
print(df.empty)

ndim
Returns the number of dimensions of the object. By definition, a DataFrame is a 2D object, so it returns 2.
print ("Our object is:")
print(df)
print ("The dimension of the object is:")
print(df.ndim)

shape
Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b), where a represents the number of rows and b represents the number of columns.
print ("The shape of the object is:")
print(df.shape)

size
Returns the number of elements in the DataFrame.
print ("The total number of elements in our object is:")
print(df.size)

values
Returns the actual data in the DataFrame as an ndarray.
print ("The actual data in our data frame is:")
print(df.values)

head() & tail()
To view a small sample of a DataFrame object, use the head() and tail() methods. head() returns the first n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
print ("The first two rows of the data frame are:")
print(df.head(2))
tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
print ("The last two rows of the data frame are:")
print(df.tail(2))
Python Pandas - Descriptive Statistics
Let us create a DataFrame and use this object throughout this chapter for all the operations.
Example
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).
# df is the DataFrame created in the Example above
print(df.sum())
Its output is as follows −
Age 382
Name TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating 44.92
dtype: object
axis=1
This syntax gives the output as shown below −
# df is the DataFrame created in the Example above
print(df.sum(axis=1))
Its output is as follows −
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64
mean()
Returns the average value.
# df is the DataFrame created in the Example above
print(df.mean())
Its output is as follows −
Age 31.833333
Rating 3.743333
dtype: float64
std()
Returns the standard deviation of the numerical columns.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.std())
Its output is as follows −
Age 9.232682
Rating 0.661628
dtype: float64
• Functions like sum() and cumsum() work with both numeric and character (or string) data elements without any error. Though in practice character aggregations are rarely used, these functions do not throw any exception.
• Functions like abs() and cumprod() throw an exception when the DataFrame contains character or string data, because such operations cannot be performed.
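To keep an aggregation away from the string columns entirely, the numeric_only parameter can be passed to functions such as sum() and mean(). A small sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James'], 'Age': [25, 26]})

# Only the numeric 'Age' column takes part in the sum;
# the string column is skipped instead of being concatenated
print(df.sum(numeric_only=True))
```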
Summarizing Data
The describe() function computes a summary of statistics pertaining to the
DataFrame columns.
# df is the DataFrame created in the Example above
print(df.describe())
Its output is as follows −
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
This function gives the mean, std and IQR values. It excludes the character columns and gives a summary about the numeric columns. 'include' is the argument used to pass information about which columns need to be considered for summarizing. It takes a list of values; by default, 'number'.
• object − summarizes String columns
• number − summarizes Numeric columns
• all − summarizes all columns together
Now, use the following statement in the program and check the output −
# df is the DataFrame created in the Example above
print(df.describe(include=['object']))
Its output is as follows −
Name
count 12
unique 12
top Ricky
freq 1
Now, use the following statement and check the output −
# df is the DataFrame created in the Example above
print(df.describe(include='all'))
Python Pandas - Function Application
Table-wise Function Application: pipe()
Custom operations can be performed by passing a function and the appropriate number of parameters as pipe arguments; the operation is then performed on the whole DataFrame. For example, add a value 2 to all the elements in the DataFrame.
The adder function adds two numeric values as parameters and returns the sum.
def adder(ele1,ele2):
   return ele1+ele2
We will now use the custom function to conduct an operation on the DataFrame −
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df.pipe(adder,2)
Let us see the full program −
import pandas as pd
import numpy as np
def adder(ele1,ele2):
   return ele1+ele2
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.pipe(adder,2))
Row or Column Wise Function Application: apply()
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method. By default, the operation performs column wise, taking each column as an array-like.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.apply(np.mean))
Its output is as follows −
col1 -0.288022
col2 1.044839
col3 -0.187009
dtype: float64
Example 2
By passing the axis parameter, the operation can be performed row wise.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.apply(np.mean,axis=1))
The output is a Series holding the mean of each row, indexed 0 to 4.
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.apply(lambda x: x.max() - x.min()))
The output is a Series giving, for each column, the difference between its maximum and minimum value.
Element Wise Function Application
Not all functions can be vectorized. The map() method on Series accepts any Python function taking a single value and returning a single value.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# Apply my custom function element wise on one column
print(df['col1'].map(lambda x:x*100))
The output is the col1 column with every value multiplied by 100.
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# applymap() applies the custom function element wise on the whole DataFrame
print(df.applymap(lambda x:x*100))
The output is the DataFrame with every element multiplied by 100.
Python Pandas - Reindexing
Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.
Example
import pandas as pd
import numpy as np
N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})
#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])
print(df_reindexed)
Its output is as follows −
A C B
0 2016-01-01 Low NaN
2 2016-01-03 High NaN
5 2016-01-06 Low NaN
Reindex to Align with Other Objects
You may wish to take an object and reindex its axes to be labeled the same as another object. Consider the following example to understand the same.
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
df1 = df1.reindex_like(df2)
print(df1)
Filling while ReIndexing
reindex() takes an optional parameter method, which is a filling method with values as follows −
• pad/ffill − Fill values forward
• bfill/backfill − Fill values backward
• nearest − Fill from the nearest index values
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
# Padding NAN's
print(df2.reindex_like(df1))
# Now fill the NAN's with the preceding values
print ("Data Frame with Forward Fill:")
print(df2.reindex_like(df1,method='ffill'))
Note − Observe, with forward fill, rows 2 to 5 are padded with the values from row 1 of df2.
Renaming
The rename() method allows you to relabel an axis based on some mapping
(a dict or Series) or an arbitrary function.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print(df1)
print ("After renaming the rows and columns:")
print(df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'}, index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))
Python Pandas - Iteration
The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the dict-like convention of iterating over the keys of the objects.
In short, basic iteration (for i in object) produces −
• Series − values
• DataFrame − column labels
• Panel − item labels
Iterating a DataFrame
Iterating a DataFrame gives column names. Let us consider the following
example to understand the same.
Example
import pandas as pd
import numpy as np
N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})
for col in df:
   print(col)
Its output is as follows −
A
C
D
x
y
To iterate over the rows of the DataFrame, we can use the following functions −
iteritems()
Iterates over each column as key, value pair with label as key and column
value as a Series object.
Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
for key,value in df.iteritems():
   print(key,value)
Its output is as follows −
col1 0 0.802390
1 0.324060
2 0.256811
3 0.839186
Name: col1, dtype: float64
col2 0 1.624313
1 -1.033582
2 1.796663
3 1.856277
Name: col2, dtype: float64
col3 0 -0.022142
1 -0.230820
2 1.160691
3 -0.830279
Name: col3, dtype: float64
iterrows()
iterrows() returns the iterator yielding each index value along with a series
containing the data in each row.
Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
   print(row_index,row)
Its output is as follows −
0 col1 1.529759
col2 0.762811
col3 -0.634691
Name: 0, dtype: float64
1 col1 -0.944087
col2 1.420919
col3 -0.507895
Name: 1, dtype: float64
2 col1 -0.077287
col2 -0.858556
col3 -0.663385
Name: 2, dtype: float64
3 col1 -1.638578
col2 0.059866
col3 0.493482
Name: 3, dtype: float64
Note − Because iterrows() iterates over the rows, it doesn't preserve the data type across the row. 0, 1, 2 are the row indices and col1, col2, col3 are column indices.
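The dtype behavior can be seen with a frame that mixes integers and floats; each row comes back as a single Series, so the integers are upcast. The column names here are invented for the sketch:

```python
import pandas as pd

df = pd.DataFrame({'ints': [1, 2], 'floats': [1.5, 2.5]})

# Every row Series carries one common dtype, so the int values
# are upcast to float64 alongside the float column
for _, row in df.iterrows():
    print(row.dtype)
```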
itertuples()
itertuples() method will return an iterator yielding a named tuple for each
row in the DataFrame. The first element of the tuple will be the row’s
corresponding index value, while the remaining values are the row values.
Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row in df.itertuples():
   print(row)
Note − Do not try to modify any object while iterating. Iterating is meant for reading, and the iterator returns a copy of the original object; thus, the changes will not reflect on the original object.
Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for index, row in df.iterrows():
   row['a'] = 10
print(df)
Observe − the printed DataFrame still holds the original values; the assignment modified only the row copies.
Python Pandas - Sorting
There are two kinds of sorting available in Pandas −
• By label
• By actual value
Let us consider an example with an unsorted DataFrame −
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print(unsorted_df)
Its output is as follows −
col2 col1
1 -2.063177 0.537527
4 0.142932 -0.684884
6 0.012667 -0.389340
2 -0.548797 1.848743
3 -1.044160 0.837381
5 0.385605 1.300185
9 1.031425 -1.002967
8 -0.407374 -0.435142
0 2.237453 -1.067139
7 -1.445831 -1.701035
In unsorted_df, the labels and the values are unsorted. Let us see how these
can be sorted.
By Label
Using the sort_index() method, by passing the axis arguments and the order
of sorting, DataFrame can be sorted. By default, sorting is done on row
labels in ascending order.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
sorted_df = unsorted_df.sort_index()
print(sorted_df)
Its output is as follows −
col2 col1
0 0.208464 0.627037
1 0.641004 0.331352
2 -0.038067 -0.464730
3 -0.638456 -0.021466
4 0.014646 -0.737438
5 -0.290761 -1.669827
6 -0.797303 -0.018737
7 0.525753 1.628921
8 -0.567031 0.775951
9 0.060724 -0.322425
Order of Sorting
By passing a Boolean value to the ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)
Sort the Columns
By passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0, i.e. sort by row. Let us consider the following example to understand the same.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
sorted_df = unsorted_df.sort_index(axis=1)
print(sorted_df)
Its output is as follows −
col1 col2
1 -0.291048 0.584890
4 0.851385 1.302561
6 0.954300 -0.202951
2 0.166609 -1.222295
3 -0.388659 -0.157915
5 -1.551250 -1.289321
9 0.374463 0.825697
8 0.510373 -1.699509
0 -0.061294 0.668444
7 0.622958 -0.581378
By Value
Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument which takes the column name of the DataFrame with which the values are to be sorted.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')
print(sorted_df)
Its output is as follows −
col1 col2
1 1 3
2 1 2
3 1 4
0 2 1
Observe that col1 is sorted, and the corresponding col2 values and the row
index move along with it; as a result, col2 looks unsorted.
The by argument also accepts a list of column names:
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by=['col1','col2'])
print(sorted_df)
col1 col2
2 1 2
1 1 3
3 1 4
0 2 1
Sorting Algorithm
sort_values() lets you choose the sorting algorithm through the kind
parameter: mergesort, heapsort or quicksort. Mergesort is the only stable
algorithm.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1', kind='mergesort')
print(sorted_df)
col1 col2
1 1 3
2 1 2
3 1 4
0 2 1
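The stability of mergesort is easiest to see through the tied keys: a stable sort keeps tied rows in their original relative order. A minimal sketch using the same data as above:

```python
import pandas as pd

# Rows 1, 2 and 3 all have col1 == 1; a stable sort (mergesort) must
# keep them in their original order (1, 2, 3), followed by row 0.
unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
stable = unsorted_df.sort_values(by='col1', kind='mergesort')
print(list(stable.index))   # [1, 2, 3, 0]
```

With quicksort or heapsort, the relative order of the three tied rows is not guaranteed.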
In this chapter, we will discuss the string operations with our basic
Series/Index. In the subsequent chapters, we will learn how to apply these
string functions on the DataFrame. These functions are accessed through the
str attribute of a Series/Index and automatically skip missing (NaN) values.
1 lower()
Converts strings in the Series/Index to lower case.
2 upper()
Converts strings in the Series/Index to upper case.
3 len()
Computes the length of each string.
4 strip()
Strips whitespace (including newlines) from each string in the
Series/Index from both sides.
5 split(' ')
Splits each string with the given pattern.
6 cat(sep=' ')
Concatenates the Series/Index elements with the given separator.
7 get_dummies()
Returns a DataFrame with one-hot encoded values.
8 contains(pattern)
Returns True for each element if the substring is contained in the
element, else False.
9 replace(a,b)
Replaces the value a with the value b.
10 repeat(value)
Repeats each element the specified number of times.
11 count(pattern)
Returns the count of occurrences of the pattern in each element.
12 startswith(pattern)
Returns True if the element in the Series/Index starts with the pattern.
13 endswith(pattern)
Returns True if the element in the Series/Index ends with the pattern.
14 find(pattern)
Returns the position of the first occurrence of the pattern (-1 if absent).
15 findall(pattern)
Returns a list of all occurrences of the pattern.
16 swapcase()
Swaps the case lower/upper.
17 islower()
Checks whether all characters in each string in the Series/Index are
lower case. Returns Boolean.
18 isupper()
Checks whether all characters in each string in the Series/Index are
upper case. Returns Boolean.
19 isnumeric()
Checks whether all characters in each string in the Series/Index are
numeric. Returns Boolean.
Let us now create a Series and see how all the above functions work.
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan,
   '1234', 'SteveSmith'])
print(s)
0 Tom
1 William Rick
2 John
3 Alber@t
4 NaN
5 1234
6 Steve Smith
dtype: object
lower()
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan,
   '1234', 'SteveSmith'])
print(s.str.lower())
0 tom
1 william rick
2 john
3 alber@t
4 NaN
5 1234
6 steve smith
dtype: object
upper()
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan,
   '1234', 'SteveSmith'])
print(s.str.upper())
0 TOM
1 WILLIAM RICK
2 JOHN
3 ALBER@T
4 NaN
5 1234
6 STEVE SMITH
dtype: object
len()
import pandas as pd
import numpy as np
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan,
   '1234', 'SteveSmith'])
print(s.str.len())
0 3.0
1 12.0
2 4.0
3 7.0
4 NaN
5 4.0
6 10.0
dtype: float64
strip()
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After Stripping:")
print(s.str.strip())
After Stripping:
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
split(pattern)
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("Split Pattern:")
print(s.str.split(' '))
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
Split Pattern:
0 [Tom, , , , , , , , , , ]
1 [, , , , , William, Rick]
2 [John]
3 [Alber@t]
dtype: object
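split() returns a list per element by default; passing expand=True instead spreads the pieces into the columns of a DataFrame, which is often easier to work with. A short sketch:

```python
import pandas as pd

s = pd.Series(['William Rick', 'Steve Smith'])
# expand=True returns a DataFrame with one column per split piece
parts = s.str.split(' ', expand=True)
print(parts)
```

The result has columns 0 and 1 holding the first and last names respectively.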
cat(sep=pattern)
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.cat(sep='_'))
Its output is as follows −
Tom _ William Rick_John_Alber@t
get_dummies()
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.get_dummies())
contains(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.contains(' '))
0 True
1 True
2 False
3 False
dtype: bool
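The one-hot encoding produced by get_dummies(), mentioned above, is easier to see with repeated values: each distinct string becomes its own indicator column. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'])
# one indicator column per unique value, 1 where that value occurs
dummies = s.str.get_dummies()
print(dummies)
```

The columns are the sorted unique values ('a', 'b'), and each row holds 1 in exactly one of them.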
replace(a,b)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After replacing @ with $:")
print(s.str.replace('@','$'))
repeat(value)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.repeat(2))
0 Tom Tom
1 William Rick William Rick
2 JohnJohn
3 Alber@tAlber@t
dtype: object
count(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.count('m'))
startswith(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that start with 'T':")
print(s.str.startswith('T'))
0 True
1 False
2 False
3 False
dtype: bool
endswith(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that end with 't':")
print(s.str.endswith('t'))
Its output is as follows −
0 False
1 False
2 False
3 True
dtype: bool
find(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.find('e'))
0 -1
1 -1
2 -1
3 3
dtype: int64
findall(pattern)
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.findall('e'))
0 []
1 []
2 []
3 [e]
dtype: object
An empty list ([]) indicates that the pattern does not occur in that element.
swapcase()
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.swapcase())
0 tOM
1 wILLIAM rICK
2 jOHN
3 aLBER@T
dtype: object
islower()
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.islower())
0 False
1 False
2 False
3 False
dtype: bool
isupper()
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.isupper())
0 False
1 False
2 False
3 False
dtype: bool
isnumeric()
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.isnumeric())
0 False
1 False
2 False
3 False
dtype: bool
In this chapter, we will discuss how to slice and dice the data and generally
get a subset of a pandas object.
The Python and NumPy indexing operators "[ ]" and attribute operator "."
provide quick and easy access to Pandas data structures across a wide
range of use cases. However, since the type of the data to be accessed isn’t
known in advance, directly using standard operators has some optimization
limits. For production code, we recommend that you take advantage of the
optimized pandas data access methods explained in this chapter.
Pandas now supports three types of multi-axes indexing; the three types are
mentioned in the following table −
1 .loc()
Label based
2 .iloc()
Integer based
3 .ix()
Both label and integer based (deprecated in later pandas versions)
.loc()
Pandas provide various methods to have purely label based indexing. When
slicing, the start bound is also included. Integers are valid labels, but they
refer to the label and not the position.
loc takes two single/list/range operator separated by ','. The first one
indicates the row and the second one indicates columns.
Example 1
# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A','B','C','D'])
# select all rows for a specific column
print(df.loc[:,'A'])
a 0.391548
b -0.070649
c -0.317212
d -2.162406
e 2.202797
f 0.613709
g 1.050559
h 1.122680
Name: A, dtype: float64
Example 2
# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A','B','C','D'])
# select all rows for a list of columns
print(df.loc[:,['A','C']])
A C
a 0.391548 0.745623
b -0.070649 1.620406
c -0.317212 1.448365
d -2.162406 -0.873557
e 2.202797 0.528067
f 0.613709 0.286414
g 1.050559 0.216526
h 1.122680 -1.621420
Example 3
# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A','B','C','D'])
# select a few rows for multiple columns
print(df.loc[['a','b','f','h'],['A','C']])
A C
a 0.391548 0.745623
b -0.070649 1.620406
f 0.613709 0.286414
h 1.122680 -1.621420
Example 4
# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A','B','C','D'])
# select a range of rows for all columns
print(df.loc['a':'h'])
A B C D
a 0.391548 -0.224297 0.745623 0.054301
b -0.070649 -0.880130 1.620406 1.419743
c -0.317212 -1.929698 1.448365 0.616899
d -2.162406 0.614256 -0.873557 1.093958
e 2.202797 -2.315915 0.528067 0.612482
f 0.613709 -0.157674 0.286414 -0.500517
g 1.050559 -2.272099 0.216526 0.928449
h 1.122680 0.324368 -1.621420 -0.741470
Example 5
# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
   index = ['a','b','c','d','e','f','g','h'], columns = ['A','B','C','D'])
# getting values with a boolean array
print(df.loc['a'] > 0)
A False
B True
C False
D False
Name: a, dtype: bool
.iloc()
Pandas provides various methods to get purely integer-based indexing. Like
Python and NumPy, this indexing is 0-based. The various access methods are
as follows −
• An integer
• A list of integers
• A range of values
Example 1
# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
# select the first four rows
print(df.iloc[:4])
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
# integer slicing
print(df.iloc[:4])
print(df.iloc[1:5, 2:4])
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
C D
1 -0.813012 0.631615
2 0.025070 0.230806
3 0.826977 -0.026251
4 1.423332 1.130568
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
# slicing through lists of values
print(df.iloc[[1, 3, 5], [1, 3]])
print(df.iloc[1:3, :])
print(df.iloc[:, 1:3])
B D
1 0.890791 0.631615
3 -1.284314 -0.026251
5 -0.512888 -0.518930
A B C D
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
B C
0 0.256239 -1.270702
1 0.890791 -0.813012
2 -0.531378 0.025070
3 -1.284314 0.826977
4 -0.460729 1.423332
5 -0.512888 0.581409
6 -1.204853 0.098060
7 -0.947857 0.641358
.ix()
Besides pure label-based and integer-based indexing, Pandas provides the
hybrid .ix() operator for selecting and subsetting an object. Note that .ix
was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc
in current code.
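Since .ix is gone from modern pandas, the selections in the examples that follow are written today with .loc (labels) or .iloc (positions). A sketch of the modern equivalents, using the same DataFrame shape as the examples:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                  columns=['A', 'B', 'C', 'D'])

# old: df.ix[:4]      -> first four rows by position
print(df.iloc[:4])
# old: df.ix[:, 'A']  -> column 'A' by label
print(df.loc[:, 'A'])
```

The rule of thumb: if you are selecting by position, use .iloc; by label, use .loc.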
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
# integer slicing
print(df.ix[:4])
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
# label slicing
print(df.ix[:,'A'])
0 0.699435
1 -0.685354
2 -0.783192
3 0.539042
4 -1.044209
5 -1.415411
6 1.062095
7 0.994204
Name: A, dtype: float64
Use of Notations
Getting values from a Pandas object with multi-axes indexing uses the
following notation −
Object      Indexers                                     Return Type
Series      s.loc[indexer]                               Scalar value
DataFrame   df.loc[row_index, col_index]                 Series object
Panel       p.loc[item_index, major_index, minor_index]  Panel slice
Note − .iloc() and .ix() accept the same indexing options and return types.
Let us now see how each operation can be performed on the DataFrame
object. We will use the basic indexing operator '[ ]' −
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
print(df['A'])
0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641
6 0.567721
7 -0.759399
Name: A, dtype: float64
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
print(df[['A','B']])
A B
0 -0.478893 -0.606311
1 0.391931 -0.949025
2 0.336825 0.093717
3 -1.055102 -0.012944
4 -0.165218 1.550310
5 -0.328641 -0.226363
6 0.567721 -0.312585
7 -0.759399 -0.372696
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
print(df[2:2])
Empty DataFrame
Columns: [A, B, C, D]
Index: []
Attribute Access
Columns can also be selected using the attribute operator '.'.
Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A','B','C','D'])
print(df.A)
0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641
6 0.567721
7 -0.759399
Name: A, dtype: float64
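Attribute access is convenient but limited: it only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method, so bracket notation remains the safe general form. A quick illustration (the column names here are chosen for this sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'mean': [3, 4], 'my col': [5, 6]})

print(df.A)            # works: plain identifier, no collision
print(type(df.mean))   # resolves to the mean() method, NOT the column
print(df['mean'])      # bracket access always returns the column
print(df['my col'])    # names with spaces require brackets
```

For this reason, production code typically prefers df['col'] over df.col.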
Percent_change
Series, DataFrames and Panels all have the function pct_change(). This
function compares every element with its prior element and computes the
percentage change.
import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])
print(s.pct_change())
df = pd.DataFrame(np.random.randn(5, 2))
print(df.pct_change())
0 NaN
1 1.000000
2 0.500000
3 0.333333
4 0.250000
5 -0.200000
dtype: float64
0 1
0 NaN NaN
1 -15.151902 0.174730
2 -0.746374 -1.449088
3 -3.582229 -3.165836
4 15.601150 -1.860434
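pct_change() also accepts a periods argument, which compares each element with the one that many positions earlier instead of the immediately preceding one. A small sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 4, 8])
# compare each value with the value two positions before it:
# 4/1 - 1 = 3.0, 8/2 - 1 = 3.0; the first two have no predecessor -> NaN
change = s.pct_change(periods=2)
print(change)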
Covariance
Covariance is applied on series data. The Series object has a method cov()
to compute the covariance between two Series objects. NA values are
excluded automatically.
Cov on Series
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1.cov(s2))
-0.12978405324
Applied on a DataFrame, cov() computes the covariance between all pairs of
columns.
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a','b','c','d','e'])
print(frame['a'].cov(frame['b']))
print(frame.cov())
-0.58312921152741437
a b c d e
a 1.780628 -0.583129 -0.185575 0.003679 -0.136558
b -0.583129 1.297011 0.136530 -0.523719 0.251064
c -0.185575 0.136530 0.915227 -0.053881 -0.058926
d 0.003679 -0.523719 -0.053881 1.521426 -0.487694
e -0.136558 0.251064 -0.058926 -0.487694 0.960761
Note − Observe that the cov between columns a and b in the first statement
is the same as the corresponding value returned by cov() on the DataFrame.
Correlation
Correlation shows the linear relationship between any two arrays of values
(Series). There are multiple methods to compute the correlation, such as
pearson (default), spearman and kendall.
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a','b','c','d','e'])
print(frame['a'].corr(frame['b']))
print(frame.corr())
-0.383712785514
a b c d e
a 1.000000 -0.383713 -0.145368 0.002235 -0.104405
b -0.383713 1.000000 0.125311 -0.372821 0.224908
c -0.145368 0.125311 1.000000 -0.045661 -0.062840
d 0.002235 -0.372821 -0.045661 1.000000 -0.403380
e -0.104405 0.224908 -0.062840 -0.403380 1.000000
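The method argument selects the coefficient; spearman, for example, ranks the data first, so any perfectly monotonic relationship scores 1.0 even when the pearson coefficient does not. A sketch:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([1.0, 4.0, 9.0, 16.0])   # monotone but non-linear

print(s1.corr(s2))                     # pearson: high, but below 1.0
print(s1.corr(s2, method='spearman'))  # 1.0 — the ranks agree exactly
```

Use spearman or kendall when the relationship is expected to be monotonic rather than strictly linear.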
Data Ranking
Data ranking produces a rank for each element in an array of elements. In
case of ties, it assigns the mean rank by default.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
print(s.rank())
a 1.0
b 3.5
c 2.0
d 3.5
e 5.0
dtype: float64
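rank() takes a method argument to control how ties are resolved: 'average' (the default shown above), 'min', 'max', 'first' and 'dense'. A short sketch of the difference:

```python
import pandas as pd

s = pd.Series([1, 2, 2])
print(s.rank().tolist())                # [1.0, 2.5, 2.5]  mean of tied ranks
print(s.rank(method='min').tolist())    # [1.0, 2.0, 2.0]  lowest rank in the tie
print(s.rank(method='first').tolist())  # [1.0, 2.0, 3.0]  ranked by appearance
```

An ascending=False argument is also available to rank from largest to smallest.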
Once the rolling, expanding and ewm objects are created, several methods
are available to perform aggregations on data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r)
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169
Rolling [window=3,min_periods=1,center=False,axis=0]
Applying Aggregation on a Whole DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
Apply Aggregation on a Single Column of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
2000-01-01 1.088512
2000-01-02 1.879182
2000-01-03 1.303660
2000-01-04 1.884801
2000-01-05 1.194699
2000-01-06 1.925393
2000-01-07 0.565208
2000-01-08 0.564129
2000-01-09 2.048458
2000-01-10 2.065750
Freq: D, Name: A, dtype: float64
Apply Aggregation on Multiple Columns of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r[['A','B']].aggregate(np.sum))
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
A B
2000-01-01 1.088512 -0.650942
2000-01-02 1.879182 -1.038796
2000-01-03 1.303660 -2.003821
2000-01-04 1.884801 -0.141119
2000-01-05 1.194699 0.010551
2000-01-06 1.925393 1.968551
2000-01-07 0.565208 0.032738
2000-01-08 0.564129 -0.759118
2000-01-09 2.048458 -1.820537
2000-01-10 2.065750 0.383357
Apply Multiple Functions on a Single Column of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate([np.sum, np.mean]))
Its output is as follows −
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
sum mean
2000-01-01 1.088512 1.088512
2000-01-02 1.879182 0.939591
2000-01-03 1.303660 0.434553
2000-01-04 1.884801 0.628267
2000-01-05 1.194699 0.398233
2000-01-06 1.925393 0.641798
2000-01-07 0.565208 0.188403
2000-01-08 0.564129 0.188043
2000-01-09 2.048458 0.682819
2000-01-10 2.065750 0.688583
Apply Multiple Functions on Multiple Columns of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r[['A','B']].aggregate([np.sum, np.mean]))
A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469
A B
sum mean sum mean
2000-01-01 1.088512 1.088512 -0.650942 -0.650942
2000-01-02 1.879182 0.939591 -1.038796 -0.519398
2000-01-03 1.303660 0.434553 -2.003821 -0.667940
2000-01-04 1.884801 0.628267 -0.141119 -0.047040
2000-01-05 1.194699 0.398233 0.010551 0.003517
2000-01-06 1.925393 0.641798 1.968551 0.656184
2000-01-07 0.565208 0.188403 0.032738 0.010913
2000-01-08 0.564129 0.188043 -0.759118 -0.253039
2000-01-09 2.048458 0.682819 -1.820537 -0.606846
2000-01-10 2.065750 0.688583 0.383357 0.127786
Apply Different Functions to Different Columns of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 4),
   index = pd.date_range('1/1/2000', periods=3),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate({'A': np.sum, 'B': np.mean}))
A B C D
2000-01-01 -1.575749 -1.018105 0.317797 0.545081
2000-01-02 -0.164917 -1.361068 0.258240 1.113091
2000-01-03 1.258111 1.037941 -0.047487 0.867371
A B
2000-01-01 -1.575749 -1.018105
2000-01-02 -1.740666 -1.189587
2000-01-03 -0.482555 -0.447078
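aggregate() is not limited to NumPy reductions; it also accepts arbitrary callables, so custom window statistics can be plugged in. A sketch with a hypothetical helper that computes the spread within each window:

```python
import pandas as pd

def window_range(x):
    # spread of the values inside one rolling window
    return x.max() - x.min()

df = pd.DataFrame({'A': [1.0, 3.0, 2.0, 8.0]})
r = df.rolling(window=2, min_periods=1)
# windows: [1] -> 0, [1,3] -> 2, [3,2] -> 1, [2,8] -> 6
result = r['A'].aggregate(window_range)
print(result)
```

The same callable can be combined with built-ins, e.g. r['A'].aggregate([window_range, 'sum']).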
Missing data is always a problem in real life scenarios. Areas like machine
learning and data mining face severe issues in the accuracy of their model
predictions because of poor quality of data caused by missing values. In
these areas, missing value treatment is a major point of focus to make their
models more accurate and valid.
Let us now see how we can handle missing values (say NA or NaN) using
Pandas.
# import the pandas library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
Using reindexing, we have created a DataFrame with missing values; in the
output, NaN means Not a Number.
Check for Missing Values
To make detecting missing values easier, Pandas provides the isnull() and
notnull() functions, which are also methods on Series and DataFrame objects.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].notnull())
Its output is as follows −
a True
b False
c True
d False
e True
f True
g False
h True
Name: one, dtype: bool
Calculations with Missing Data
• When summing data, NA will be treated as zero
• If the data are all NA, the sum is zero as well; pass min_count=1 to obtain NA instead
Example 1
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].sum())
2.02357685917
Example 2
import pandas as pd
import numpy as np

df = pd.DataFrame(np.nan, index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
# min_count=1 makes an all-NA sum return NaN (newer pandas otherwise returns 0)
print(df['one'].sum(min_count=1))
nan
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
Here, we are filling with value zero; instead we can also fill with any other
value.
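For instance, here is a minimal sketch (the column names and values are hypothetical) showing a scalar fill versus a per-column dict fill:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with missing values
df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 5.0, 6.0]})

# Fill every NaN with the same scalar
print(df.fillna(-1))

# Fill each column with a different value via a dict
print(df.fillna({'one': 0, 'two': df['two'].mean()}))
```

Passing a dict lets each column receive its own replacement, for example a column mean.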
• pad/ffill − Fill methods Forward
• bfill/backfill − Fill methods Backward
Example 1
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df.fillna(method='pad'))
print(df.fillna(method='backfill'))
Example 1
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# axis=1 drops every column that contains at least one missing value
print(df.dropna(axis=1))
Empty DataFrame
Columns: [ ]
Index: [a, b, c, d, e, f, g, h]
Example 1
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
   'two': [1000, 0, 30, 40, 50, 60]})
print(df.replace({1000: 10, 2000: 60}))
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
In many situations, we split the data into sets and we apply some
functionality on each subset. In the apply functionality, we can perform the
following operations −
• Aggregation − computing a summary statistic
• Transformation − performing some group-specific operation
• Filtration − discarding the data on some condition
Let us now create a DataFrame object and perform all the operations on it −
#import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df)
• obj.groupby('key')
• obj.groupby(['key1','key2'])
• obj.groupby(key,axis=1)
Let us now see how the grouping objects can be applied to the DataFrame
object
Example
# import the pandas library
import pandas as pd

# df is the IPL DataFrame created above
print(df.groupby('Team'))
View Groups
# import the pandas library
import pandas as pd

print(df.groupby('Team').groups)
Group by with multiple columns −

# import the pandas library
import pandas as pd

print(df.groupby(['Team', 'Year']).groups)
Iterating through Groups

With the groupby object in hand, we can iterate through it, getting the group name together with each sub-DataFrame −

# import the pandas library
import pandas as pd

grouped = df.groupby('Year')
for name, group in grouped:
   print(name)
   print(group)
2015
Points Rank Team Year
1 789 2 Riders 2015
3 673 3 Devils 2015
5 812 4 kings 2015
10 804 1 Royals 2015
2016
Points Rank Team Year
6 756 1 Kings 2016
8 694 2 Riders 2016
2017
Points Rank Team Year
7 788 1 Kings 2017
11 690 2 Riders 2017
By default, the groupby object has the same label name as the group name.
Select a Group
Using the get_group() method, we can select a single group.
# import the pandas library
import pandas as pd

grouped = df.groupby('Year')
print(grouped.get_group(2014))
Aggregations
An aggregated function returns a single aggregated value for each group.
Once the group by object is created, several aggregation operations can be
performed on the grouped data.
# import the pandas library
import pandas as pd
import numpy as np

grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))
Year
2014 795.25
2015 769.50
2016 725.00
2017 739.00
Name: Points, dtype: float64
Another way to see the size of each group is by applying the size() function
−
import pandas as pd
import numpy as np

grouped = df.groupby('Team')
print(grouped.size())
With grouped Series, you can also pass a list or dict of functions to do
aggregation with, and generate DataFrame as output −
# import the pandas library
import pandas as pd
import numpy as np

grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))
Transformations
Transformation on a group or a column returns an object that is indexed the
same as the one being grouped. Thus, the transform should return a
result that is the same size as that of a group chunk.
# import the pandas library
import pandas as pd
import numpy as np

grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std() * 10
print(grouped.transform(score))
Filtration
Filtration filters the data on a defined criterion and returns the subset of
data. The filter() function is used to filter the data.
import pandas as pd
import numpy as np

# keep only the teams that appear three or more times
print(df.groupby('Team').filter(lambda x: len(x) >= 3))
In the above filter condition, we are asking to return the teams which have
participated three or more times in IPL.
Python Pandas - Merging/Joining
Pandas provides a single function, merge, as the entry point for all standard
database join operations between DataFrame objects −
Let us now create two different DataFrames and perform the merging
operations on it.
# import the pandas library
import pandas as pd

left = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(left)
print(right)
Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
Merge Two DataFrames on a Key
import pandas as pd

left = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, on='id'))
Here is a summary of the how options and their SQL equivalent names −

Merge Method    SQL Equivalent
left            LEFT OUTER JOIN − Use keys from the left object
right           RIGHT OUTER JOIN − Use keys from the right object
outer           FULL OUTER JOIN − Use union of keys
inner           INNER JOIN − Use intersection of keys
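As a sketch of one of these options (frames shortened from the ones above, and the optional indicator flag added as an extra), an outer join keeps the keys from both sides:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'subject_id': ['sub1', 'sub2', 'sub4']})
right = pd.DataFrame({'id': [1, 2, 3], 'subject_id': ['sub2', 'sub4', 'sub3']})

# how='outer' keeps keys from both frames; indicator=True adds a
# '_merge' column showing each row's origin
result = pd.merge(left, right, on='subject_id', how='outer', indicator=True)
print(result)
```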
Left Join
import pandas as pd

left = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, on='subject_id', how='left'))
Inner Join

import pandas as pd

left = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, on='subject_id', how='inner'))
Concatenating Objects
The concat function does all of the heavy lifting of performing concatenation
operations along an axis. Let us create different objects and do
concatenation.
import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])
print(pd.concat([one, two]))
Suppose we wanted to associate specific keys with each of the pieces of the
chopped up DataFrame. We can do this by using the keys argument −
import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])
print(pd.concat([one, two], keys=['x', 'y']))
      Marks_scored    Name subject_id
x 1             98    Alex       sub1
  2             90     Amy       sub2
  3             87   Allen       sub4
  4             69   Alice       sub6
  5             78  Ayoung       sub5
y 1             89   Billy       sub2
  2             80   Brian       sub4
  3             79    Bran       sub3
  4             97   Bryce       sub6
  5             88   Betty       sub5
import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])
print(pd.concat([one, two], keys=['x', 'y'], ignore_index=True))
Observe that the index changes completely and the keys are also overridden.
If two objects need to be added along axis=1, then the new columns will be
appended.
import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])
print(pd.concat([one, two], axis=1))
A useful shortcut to concat is the append instance method on Series and
DataFrame. These methods actually predated concat. They concatenate
along axis=0, namely the index −
import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])
print(one.append(two))
append can also take a list of objects −

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])
print(one.append([two, one, two]))
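Note that append was deprecated and later removed in newer pandas releases; a minimal sketch of the equivalent concat call:

```python
import pandas as pd

one = pd.DataFrame({'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Marks_scored': [89, 80]}, index=[1, 2])

# pd.concat along the index is the drop-in replacement for one.append(two)
print(pd.concat([one, two]))
```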
Time Series
Pandas provides robust tools for working with time series data, especially
in the financial sector. While working with time series data, we
frequently come across the following −
• Generating sequence of time
• Convert the time series to different frequencies
import pandas as pd

# pd.datetime was removed in newer pandas; Timestamp.now() gives the current time
print(pd.Timestamp.now())
2017-05-11 06:10:13.393147
Create a TimeStamp
Time-stamped data is the most basic type of time series data that associates
values with points in time. For pandas objects, it means using the points in
time to create the index. Let’s take an example −
import pandas as pd

print(pd.Timestamp('2017-03-01'))
2017-03-01 00:00:00
It is also possible to convert integer or float epoch times. The default unit
for these is nanoseconds (since this is how Timestamps are stored).
However, often epochs are stored in another unit, which can be specified.
Let’s take another example −
import pandas as pd

print(pd.Timestamp(1587687255, unit='s'))
2020-04-24 00:14:15
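To illustrate the unit argument further, a small sketch converting the same epoch expressed in seconds and in milliseconds:

```python
import pandas as pd

# The same instant, expressed in two different epoch units
ts_s = pd.Timestamp(1587687255, unit='s')
ts_ms = pd.Timestamp(1587687255000, unit='ms')
print(ts_s)
print(ts_ms)
assert ts_s == ts_ms  # both represent 2020-04-24 00:14:15
```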
Create a Range of Time
import pandas as pd

# a range of times within a day, every 30 minutes
print(pd.date_range("11:00", "13:30", freq="30min").time)
Converting to Timestamps

To convert a Series of date-like objects, we use the to_datetime function −

import pandas as pd

print(pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None])))
0 2009-07-31
1 2010-01-10
2 NaT
dtype: datetime64[ns]
import pandas as pd

# a list-like input returns a DatetimeIndex instead of a Series
print(pd.to_datetime(['2009/11/23', '2010.12.31', None]))
Extending the time series, date functionalities play a major role in financial
data analysis. While working with date data, we will frequently come across
the following −
• Generating sequence of dates
• Convert the date series to different frequencies
Create a Range of Dates

Using the date_range() function with the periods parameter, we can create a
date series; by default, the frequency is days −

import pandas as pd

print(pd.date_range('1/1/2011', periods=5))
bdate_range() stands for business date range; unlike date_range(), it
excludes Saturday and Sunday −

import pandas as pd

print(pd.bdate_range('3/1/2017', periods=5))

Observe, after 3rd March, the date jumps to 6th March, excluding the 4th and
5th. Just check your calendar for the days.
date_range() can also take the start and end as parameters −

from datetime import datetime
import pandas as pd

start = datetime(2011, 1, 1)
end = datetime(2011, 1, 5)
print(pd.date_range(start, end))
Offset Aliases
A number of string aliases are given to useful common time series
frequencies. We will refer to these aliases as offset aliases.
• D − calendar day frequency
• W − weekly frequency
• MS − month start frequency
• A − annual (year) end frequency
• BA − business year end frequency
• T, min − minutely frequency
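A brief sketch using two of these aliases with date_range() (the start date is arbitrary):

```python
import pandas as pd

# 'MS' yields month-start dates; 'W' yields weekly (Sunday-anchored) dates
print(pd.date_range('2011-01-01', periods=3, freq='MS'))
print(pd.date_range('2011-01-01', periods=3, freq='W'))
```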
String
By passing a string literal, we can create a timedelta object.
import pandas as pd

print(pd.Timedelta('2 days 2 hours 15 minutes 30 seconds'))
2 days 02:15:30
Integer
By passing an integer value with a unit argument, we create a Timedelta
object.

import pandas as pd

print(pd.Timedelta(6, unit='h'))
0 days 06:00:00
Data Offsets
Data offsets such as weeks, days, hours, minutes, seconds, milliseconds,
microseconds, and nanoseconds can also be used in construction.

import pandas as pd

print(pd.Timedelta(days=2))
2 days 00:00:00
to_timedelta()
Using the top-level pd.to_timedelta, you can convert a scalar, array, list, or
series from a recognized timedelta format/ value into a Timedelta type. It
will construct Series if the input is a Series, a scalar if the input is scalar-
like, otherwise will output a TimedeltaIndex.
import pandas as pd

print(pd.to_timedelta('2 days'))
2 days 00:00:00
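A small sketch of the three behaviours described above:

```python
import pandas as pd

# Scalar in, Timedelta out
print(pd.to_timedelta('1 days 06:05:01'))

# Series in, timedelta64[ns] Series out
print(pd.to_timedelta(pd.Series(['1 days', '2 days'])))

# Other list-likes in, TimedeltaIndex out
print(pd.to_timedelta(['1 days', '2 days']))
```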
Operations
You can operate on Series/ DataFrames and construct timedelta64[ns] Series
through subtraction operations on datetime64[ns] Series, or Timestamps.
Let us now create a DataFrame with Timedelta and datetime objects and
perform some arithmetic operations on it −
import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame({'A': s, 'B': td})
print(df)
A B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
Addition Operations
import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame({'A': s, 'B': td})
df['C'] = df['A'] + df['B']
print(df)
A B C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05
Subtraction Operation
import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame({'A': s, 'B': td})
df['C'] = df['A'] + df['B']
df['D'] = df['C'] - df['B']
print(df)
           A      B          C          D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-02
2 2012-01-03 2 days 2012-01-05 2012-01-03
Often, real-world data includes text columns that are repetitive. Features
like gender, country, and codes are always repetitive. These are
examples of categorical data.
Categorical variables can take on only a limited, and usually fixed, number of
possible values. Besides the fixed length, categorical data might have an
order but cannot be used in numerical operations. Categorical is a Pandas
data type.
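One practical payoff of the category dtype is memory: each repeated string is stored only once as a category. A small sketch (the values are hypothetical):

```python
import pandas as pd

s_obj = pd.Series(['male', 'female'] * 5000)   # plain object dtype
s_cat = s_obj.astype('category')               # only two stored categories

print(s_obj.memory_usage(deep=True))
print(s_cat.memory_usage(deep=True))
assert s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True)
```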
Object Creation
Categorical object can be created in multiple ways. The different ways have
been described below −
category
By specifying the dtype as "category" in pandas object creation −
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
print(s)
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
The number of elements passed to the series object is four, but the
categories are only three. Observe the same in the output Categories.
pd.Categorical
Using the standard pandas Categorical constructor, we can create a category object −

import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print(cat)
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]
import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'])
print(cat)
[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]
Here, the second argument signifies the categories. Thus, any value which
is not present in the categories will be treated as NaN.
import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'], ordered=True)
print(cat)
[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
Logically, the order means that a is greater than b, and b is greater than c.
Description
Using the .describe() command on the categorical data, we get similar output
to a Series or DataFrame of the type string.
import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
print(df.describe())
print(df["cat"].describe())
Get the Properties of the Category

The categories attribute gives the categories of the object −

import pandas as pd
import numpy as np

s = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
print(s.categories)
The ordered attribute tells whether the categories have an order −

import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
print(cat.ordered)
False
Renaming the categories is done with the rename_categories() method −

import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])
print(s.cat.categories)
Using the add_categories() method, new categories can be appended −

import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
s = s.cat.add_categories([4])
print(s.cat.categories)
Using the remove_categories() method, unwanted categories can be removed −

import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
print("Original object:")
print(s)
print("After removal:")
print(s.cat.remove_categories("a"))
Original object:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
After removal:
0 NaN
1 b
2 c
3 NaN
dtype: category
Categories (2, object): [b, c]
Comparison of Categorical Data
• comparing equality (== and !=) to a list-like object (list, Series, array,
...) of the same length as the categorical data.
• all comparisons (==, !=, >, >=, <, and <=) of categorical data to
another categorical Series, when ordered==True and the categories
are the same.
• all comparisons of a categorical data to a scalar.
import pandas as pd

# ordered categoricals with the same categories can be compared elementwise
cat = pd.Series([1, 2, 3]).astype(pd.CategoricalDtype([1, 2, 3], ordered=True))
cat1 = pd.Series([2, 2, 2]).astype(pd.CategoricalDtype([1, 2, 3], ordered=True))
print(cat > cat1)
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10), columns=list('ABCD'))
df.plot()
Plotting methods allow a handful of plot styles other than the default line
plot. These methods can be provided as the kind keyword argument
to plot(). These include −
• bar or barh for bar plots
• hist for histogram
• box for boxplot
• area for area plots
• scatter for scatter plots
Bar Plot
Let us now see what a Bar Plot is by creating one. A bar plot can be created
in the following way −
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.bar()
To produce a stacked bar plot, pass stacked=True −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.bar(stacked=True)
To get horizontal bar plots, use the barh() method −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.barh(stacked=True)
Histograms
Histograms can be plotted using the plot.hist() method. We can specify
number of bins.
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randn(1000) + 1,
   'b': np.random.randn(1000),
   'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df.plot.hist(bins=20)
To plot different histograms for each column, use the following code −
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randn(1000) + 1,
   'b': np.random.randn(1000),
   'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df.hist(bins=20)
Box Plot

Box plots can be drawn using the DataFrame.plot.box() method to visualize
the distribution of values within each column.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box()
Area Plot
Area plot can be created using the Series.plot.area() or
the DataFrame.plot.area() methods.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area()
Scatter Plot

Scatter plot can be created using the DataFrame.plot.scatter() method.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b')
Pie Chart
Pie chart can be created using the DataFrame.plot.pie() method.
import pandas as pd
import numpy as np

df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot.pie(subplots=True)