SKILLATHON.CO
PYTHON FOR DATA ANALYSIS
© www.skillathon.co
Contents
NumPy
✔ Introduction
✔ Installation
✔ NumPy arrays
✔ How to create ndarrays?
✔ random() methods
✔ Shape of arrays
✔ Reshaping arrays
✔ Operations on arrays
✔ Arithmetic
✔ Broadcasting
Pandas
✔ Introduction
✔ Series
✔ DataFrames
✔ Missing data
✔ GroupBy
✔ Aggregate functions
✔ Merging, joining and concatenating
✔ Operations
✔ Data input and output
NumPy : Introduction
❑ NumPy stands for Numerical Python.
❑ Fundamental package for scientific computing in Python.
❑ Fast, since it has bindings to C libraries.
❑ Part of the SciPy stack.
❑ Many other libraries rely on NumPy as one of their building blocks.
Installation
❑ Installing the Anaconda distribution is highly recommended, so that all underlying dependencies stay in sync.
❑ If you have Anaconda, install NumPy by typing the following in the terminal or command prompt:
conda install numpy
❑ If you don't have Anaconda, type:
pip install numpy
NumPy arrays
❑ The ndarray is a fast n-dimensional array object whose elements all have the same type.
❑ Dimensions are called axes.
Note
✔ Indexing starts at 0.
✔ Unlike lists, arrays support broadcasting.
How to create numpy arrays
❑ To start using the NumPy package, we need to import it.
>>> import numpy as np   # the alias np saves typing
❑ NumPy arrays can be created directly with the np.array() function.
>>> arr1 = np.array([1, 2, 3])   # passing a simple list as the argument
>>> arr1
array([1, 2, 3])   # a 1-D array
>>> arr2 = np.array([[1, 2, 3], [2, 3, 4]])   # passing a nested list
>>> arr2
array([[1, 2, 3],
       [2, 3, 4]])   # a 2-D array
❑ Arrays of evenly stepped values can be generated quickly with the np.arange() function.
np.arange(start, stop, step)   # stop is excluded; step defaults to 1
❑ Example:
>>> a = np.arange(0, 5)   # generates array([0, 1, 2, 3, 4])
How to create numpy arrays (continued)
❑ To generate an array of zeros:
>>> np.zeros(shape)
❑ To generate an array of ones:
>>> np.ones(shape)
❑ To create an identity matrix of size n × n:
>>> np.eye(n)
❑ To create an array of evenly spaced points:
>>> np.linspace(start, stop, num)
linspace is similar to arange, but it takes the number of points instead of a step size, and it includes the stop value by default.
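The creation helpers above can be sketched side by side as follows. The shapes and ranges are arbitrary values chosen only for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))   # 2x3 array filled with 0.0
ones = np.ones(4)          # 1-D array of four 1.0 values
identity = np.eye(3)       # 3x3 identity matrix

# arange takes a step size and excludes the stop value.
stepped = np.arange(0, 1, 0.25)   # 0.0, 0.25, 0.5, 0.75

# linspace takes the number of points and includes the stop value by default.
spaced = np.linspace(0, 1, 5)     # 0.0, 0.25, 0.5, 0.75, 1.0

print(zeros.shape, ones.shape, identity.shape)
print(stepped)
print(spaced)
```

Note how arange stops before 1 while linspace ends exactly on it.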
Random Functions
❑ NumPy provides functions that generate arrays of random elements.
np.random.rand(d0, d1, ...) : Returns an array of the given dimensions filled with random numbers from a uniform distribution over [0, 1).
np.random.randn(d0, d1, ...) : Returns an array of the given dimensions sampled from the standard normal (Gaussian) distribution, with mean 0 and standard deviation 1.
np.random.randint(low, high, size) : Returns an array of the given size filled with random integers from the given range.
Note:
✔ In randint(), the lower limit is inclusive and the upper limit is exclusive.
✔ rand() and randn() take the dimensions as separate arguments, not as a tuple.
Random function (Examples)
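The original examples for this slide did not survive, so what follows is a minimal sketch of the three random functions rather than the deck's own code. The seed, shapes and range are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(0)  # assumption: fix the seed so repeated runs give the same numbers

uniform = np.random.rand(2, 3)           # 2x3 array, values in [0, 1)
normal = np.random.randn(4)              # 4 samples from the standard normal
integers = np.random.randint(1, 10, 5)   # 5 integers, each from 1 to 9

print(uniform.shape)   # (2, 3)
print(normal.shape)    # (4,)
print(integers)
```

Because the upper limit of randint() is exclusive, the value 10 never appears in `integers`.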
Array Shape
❑ The shape attribute gives the shape of a NumPy array.
>>> a = np.array([7, 2, 9, 10])
>>> a.shape
(4,)
>>> b = np.array([[2, 4, 6], [1, 3, 5]])
>>> b.shape
(2, 3)
Note:
✔ No parentheses after shape, since it is an attribute, not a method.
Reshaping Arrays
❑ The shape of an array can be changed.
❑ Using NumPy's reshape(), the dimensions of the given array can be changed, as long as the total number of elements stays the same.
❑ Example:
>>> a = np.random.rand(4, 4)   # 16 elements
>>> a.reshape(2, 2, 4)         # 2 * 2 * 4 = 16 elements
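As a small sketch of the constraint on reshaping, the example below reshapes a 16-element array several ways and shows what happens when the element count does not match. It uses np.arange instead of random data so the contents are easy to follow.

```python
import numpy as np

a = np.arange(16)        # 16 elements: 0, 1, ..., 15
b = a.reshape(4, 4)      # 4 rows, 4 columns
c = a.reshape(2, 2, 4)   # 2 * 2 * 4 is still 16 elements
d = a.reshape(2, -1)     # -1 lets NumPy infer the missing size: (2, 8)

print(b.shape, c.shape, d.shape)   # (4, 4) (2, 2, 4) (2, 8)

try:
    a.reshape(3, 5)      # 3 * 5 = 15 != 16, so this raises ValueError
except ValueError as err:
    print("cannot reshape:", err)
```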
Basic Operations :
❑ NumPy arrays provide methods for basic operations.
ndarray.max() : returns the largest element of the array.
>>> a = np.array([2, 4, 12, 83, 1])
>>> a.max()
83
ndarray.min() : returns the smallest element of the array.
>>> a.min()
1
ndarray.argmax() : returns the index of the largest element.
>>> a.argmax()
3
ndarray.argmin() : returns the index of the smallest element.
>>> a.argmin()
4
ndarray.sum() : returns the sum of the elements of the array.
>>> a.sum()
102
Basic Operations : statistics
❑ The mean, median and standard deviation can be computed directly with NumPy.
>>> a = np.array([1, 2, 3, 3])
>>> a.mean()   # mean of a
2.25
>>> np.median(a)   # median; arrays have no median() method, so use the np.median() function
2.5
>>> a.std()   # standard deviation
0.8291   # rounded
Element-wise operations
❑ Many arithmetic operations can be applied to NumPy arrays element by element.
❑ With scalars:
>>> a = np.array([1, 2, 3])
>>> a + 1   # adds 1 to each element
array([2, 3, 4])
>>> a ** 2   # squares each element
array([1, 4, 9])
❑ With another array:
>>> b = np.ones(3)   # generates array([1., 1., 1.])
>>> a + b
array([2., 3., 4.])
>>> a - b
array([0., 1., 2.])
>>> a * b   # element-wise product, not matrix multiplication; use np.dot(a, b) for that
array([1., 2., 3.])
Note: These operations are much faster than the equivalent loops in pure Python.
Element-wise operations : comparisons and logical operators
❑ Comparisons can be made between the elements of two arrays.
>>> a == b   # returns an array of Booleans
array([ True, False, False])
>>> a > b
array([False,  True,  True])
❑ Comparing two whole arrays:
>>> np.array_equal(a, b)   # returns a single Boolean
False
❑ Logical operations:
>>> a = np.array([1, 0, 0, 1], dtype=bool)
>>> b = np.array([0, 1, 0, 1], dtype=bool)
>>> np.logical_or(a, b)
array([ True,  True, False,  True])
>>> np.logical_and(a, b)
array([False, False, False,  True])
Broadcasting
❑ Broadcasting is useful when we want to do element-wise operations on NumPy arrays with different shapes.
❑ Operations on arrays of different sizes are possible if NumPy can stretch them to a common shape: this conversion is called broadcasting.
❑ It does this without making needless copies of data, which usually leads to efficient implementations.
Note:
✔ If both arrays are two-dimensional, then along each axis their sizes must either be equal or one of them must be 1.
Broadcasting : example
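The original example for this slide did not survive, so the following is a minimal sketch of broadcasting rather than the deck's own code. The array names and values are made up for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])     # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])     # shape (2, 1)

# The row is stretched down the two rows of the matrix.
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# The column is stretched across the three columns of the matrix.
print(matrix + col)
# [[101 102 103]
#  [204 205 206]]

# Both operands are stretched: (3,) and (2, 1) give (2, 3).
print((row + col).shape)   # (2, 3)
```

In each case the sizes along every axis are either equal or one of them is 1, which is what makes the shapes compatible.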
Pandas : Introduction
❑ One of the richest libraries in Python.
❑ Can be used to analyze and visualize data.
❑ Pandas provides two high-performance data structures:
Series : 1-D labeled vector
DataFrame : 2-D spreadsheet-like structure
❑ These data structures are fast because they are built on top of NumPy.
❑ SQL-like functionality: GroupBy, joining / merging, etc.
❑ Missing-data handling.
Series
❑ A Series is a one-dimensional object similar to an array, a list or a column in a table.
❑ Each item in the Series is assigned an index label.
❑ The index labels can be integers or strings.
❑ By default, the items receive the index labels 0 to n-1, where n is the number of items.
❑ Values can be heterogeneous.
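The points above can be sketched as follows, assuming pandas is imported under its usual alias pd. The values and labels are made up for illustration.

```python
import pandas as pd

# Default integer index: 0, 1, 2
s1 = pd.Series([10, 20, 30])

# Custom string index
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Mixed value types are allowed; the Series then has dtype object
s3 = pd.Series([1, "two", 3.0])

print(list(s1.index))   # [0, 1, 2]
print(s2["b"])          # 20
print(s3.dtype)         # object
```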
Series ( contd.)
❑ Dictionaries can be converted into Series; the keys become the index labels.
❑ To grab any value from a Series, use its index label.
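A minimal sketch of both points, using a made-up dictionary of prices:

```python
import pandas as pd

prices = {"apple": 30, "banana": 10, "cherry": 50}   # hypothetical data
s = pd.Series(prices)      # dict keys become the index labels

print(s["banana"])         # 10  (lookup by label)
print(s.loc["cherry"])     # 50  (explicit label-based lookup)
print(s.iloc[0])           # 30  (lookup by integer position)
```

Using loc and iloc makes it explicit whether a lookup is by label or by position, which matters once the labels are themselves integers.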
DataFrame
❑ A DataFrame is a tabular data structure made up of rows and columns, like a spreadsheet, a database table, or R's data.frame object.
❑ It can be thought of as a group of Series objects that share the same index.
❑ The most commonly used pandas object.
DataFrame (contd.) :
❑ To create a DataFrame, pd.DataFrame() is used.
❑ Like Series, DataFrame accepts many different kinds of input:
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame
Note:
✔ Along with the data, you can optionally pass index (row labels) and columns (column labels)
arguments.
✔ If axis labels are not passed, they will be constructed from the input data based on common
sense rules
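A few of the input kinds listed above can be sketched like this. The column names, row labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dict of lists: keys become column labels
df1 = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "score": [82, 91, 77]},
    index=["r1", "r2", "r3"],
)

# From a 2-D ndarray, with explicit row and column labels
df2 = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["x", "y"],
    columns=["a", "b", "c"],
)

# Without labels, pandas falls back to 0, 1, 2, ...
df3 = pd.DataFrame(np.zeros((2, 2)))

print(df1.shape)           # (3, 2)
print(list(df2.columns))   # ['a', 'b', 'c']
print(list(df3.columns))   # [0, 1]
```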
DataFrames : Columns and rows
❑ To select a column of a DataFrame, we simply write:
dataframe_name['Column_name']
dataframe_name[['Column_name_1', 'Column_name_2']]   # to select multiple columns
❑ To create a new column:
dataframe_name['New_column_name'] = values   # one value per row
❑ We can also remove any column from the DataFrame:
dataframe_name.drop('Column_name', axis=1, inplace=True)
Note: axis=1 tells drop() that the label refers to a column, and inplace=True makes the removal permanent; without it, drop() returns a new DataFrame and leaves the original unchanged.
❑ To select rows of a DataFrame by label, we use the loc attribute:
dataframe_name.loc['row_name']
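The column and row operations above can be sketched on a small DataFrame. The subject names and marks are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [80, 65], "physics": [72, 90]},
    index=["asha", "ben"],
)

math = df["math"]                 # one column, returned as a Series
both = df[["math", "physics"]]    # several columns, returned as a DataFrame

df["total"] = df["math"] + df["physics"]   # new column built from existing ones

# drop returns a new DataFrame unless inplace=True is passed
without_total = df.drop("total", axis=1)

row = df.loc["ben"]               # one row by label, returned as a Series
print(row["total"])               # 155
```

Here `df` still has its `total` column after the drop, because the result was assigned to a new name instead of using inplace=True.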
Handling Missing Data
❑ Datasets often contain missing values.
❑ Pandas provides some functions to deal with them.
df.dropna() : Returns a new object with the rows (or columns, depending on the given axis) that contain missing data dropped; the how argument selects whether any or all of the values must be missing.
df.fillna() : Fills NA/NaN values with the specified value.
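A minimal sketch of both functions on a small DataFrame with made-up values; filling with the column mean is just one possible strategy, shown here as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

rows_kept = df.dropna()              # drop rows containing any NaN
cols_kept = df.dropna(axis=1)        # drop columns containing any NaN
filled = df.fillna(0)                # replace every NaN with 0
mean_filled = df.fillna(df.mean())   # replace NaN with each column's mean

print(len(rows_kept))                # 1
print(cols_kept.shape)               # (3, 0)
print(mean_filled["a"].tolist())     # [1.0, 2.0, 3.0]
```

Both columns contain a NaN here, so dropping columns leaves none, which illustrates why dropping is not always the right choice.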
GroupBy
❑ The groupby() method groups the rows of a DataFrame by the values in one or more columns.
❑ After grouping, aggregate functions can be applied to each group for analysis.
❑ There are many aggregate functions available, like:
sum()
std()
mean()
min()
max()
describe()
Note: describe() is a good one to try first, since it already reports the count, mean, std (standard deviation), min, max, etc. of the numerical columns of the DataFrame.
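Grouping followed by aggregation can be sketched as follows. The team names and points are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "points": [10, 20, 5, 15, 25],
})

by_team = df.groupby("team")          # one group per distinct team value

totals = by_team["points"].sum()      # blue: 45, red: 30
averages = by_team["points"].mean()   # blue: 15.0, red: 15.0

print(totals)
print(averages)
print(by_team["points"].describe())   # count, mean, std, min, quartiles, max per group
```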
Merging and Concatenation
❑ Concatenation glues DataFrames together along an axis; their dimensions should match along the other axis.
❑ Pandas provides the pd.concat() function to concatenate.
❑ The merge function combines DataFrames using logic similar to joining SQL tables.
[Figure panels: Concat; Merging]
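The difference between concatenating and merging can be sketched with a few small DataFrames. The key and column names are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["k0", "k1", "k2"], "a": [1, 2, 3]})
right = pd.DataFrame({"key": ["k0", "k1", "k3"], "b": [10, 20, 30]})

# Concatenation: stack frames with the same columns on top of each other
more = pd.DataFrame({"key": ["k4"], "a": [4]})
stacked = pd.concat([left, more], ignore_index=True)
print(stacked.shape)           # (4, 2)

# Merge: SQL-style join on a shared column; the default is an inner join
inner = pd.merge(left, right, on="key")
print(inner["key"].tolist())   # ['k0', 'k1']

# An outer join keeps keys found in either frame and fills the gaps with NaN
outer = pd.merge(left, right, on="key", how="outer")
print(len(outer))              # 4
```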
Data Input and Output
❑ Using pandas we can read and write files of various formats, like:
.csv
.json
.xml
.html
And many more…
❑ Functions to read a file:
pd.read_csv('file_name')
pd.read_json('file_name')
pd.read_excel('file_name')
❑ Methods to write a DataFrame df to a file:
df.to_csv('file_name')
df.to_excel('file_name')
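A minimal round-trip sketch for CSV. To avoid touching the disk it writes to a string and reads back from an in-memory buffer, which is an assumption made for the example; with a real file you would pass a path instead. Note that the writers are methods of the DataFrame, while the readers are functions in the pd namespace.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ben"], "score": [82, 91]})   # made-up data

# to_csv is a DataFrame method; with no path it returns the CSV text
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv accepts a file path or any file-like object
restored = pd.read_csv(io.StringIO(csv_text))
print(restored["score"].tolist())   # [82, 91]
```

Passing index=False keeps the row labels out of the file, so the restored DataFrame has the same two columns as the original.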
Thanks!
Does anyone have any questions?