GVPCOEW-Pandas and Numpy For Data Analysis - DONE
Contents
Chapter-1 (Pandas)
1. What is Data Analysis
2. Where is Data Analysis used
3. Use Cases – Applications of Data Analytics
4. What is Pandas
5. Structural Perspective of Datatypes
6. What is a DataFrame
7. How to install pandas
8. Using iPython Notebook
9. Installing pandas with Anaconda (Steps)
10. What is Jupyter Notebook
11. How to use Anaconda
12. Program: Reading tabular data file (data with rows and columns) into pandas – demo on
read_table()
13. Program: To display 1st 7 rows only
14. Program: Explicitly specifying the delimiter symbol and solving header row problem
15. Program: Explicitly specifying the column names
16. Program: demo on read_csv()
17. Program: Demo on type() and selecting a pandas series from a DataFrame
18. Program: Creating a new series/column in a DataFrame
19. Program: Filtering rows of a pandas DataFrame by column value
20. Program: using the loc indexer – conditional lookup
21. Program: Applying multiple filter criteria to a pandas DataFrame
22. Program: Reading from a selection of columns
23. Program: Demo on iterrows()
24. Program: Demo on dropping a column or a row from display, also covers usage of “axis”
parameter
25. Program: Demo on mean() method
26. Program: Demo on “groupby” in pandas
27. Program: Checking for null values in a DataFrame
28. Program: To figure out whether there are any null values in a DataFrame
29. Program: To figure out how many null values are present in a DataFrame
30. Program: Dealing with null values/Missing data
31. Program: Concatenation of DataFrames
32. Program: Setting a column as Index
33. loc[] vs iloc[]
34. Program: Getting unique elements of a column in a DataFrame
35. Program: Getting count of unique elements of a column in a DataFrame
36. Program: Getting count of each unique element of a column in a DataFrame
BVRAJU 1
37. Program: Demo on apply() on a DataFrame
38. Program: Demo on apply() on a lambda expression
39. Program: Knowing column names of a DataFrame
40. Program: Sorting a DataFrame
Chapter-3: NumPy
1. Datatypes
2. Data Structures
3. Algorithms
4. Method of Memory Storage Per Data Structure
5. NumPy Arrays Vs Python Sequences(Lists, Tuples, Dictionaries & Sets)
6. Why use NumPy Arrays
7. NumPy Datatypes
8. Program: NumPy Demo1
9. Program: NumPy Demo2 – transforming the 1D array into 2D array
10. Program: NumPy avoids copies wherever possible – demo
11. Program: If we need a true copy of an array – demo
12. Program: To demonstrate that in NumPy arrays, the operations are propagated to the
individual elements.
13. Program: Indexing Demo1
14. Program: Indexing Demo2
15. Program: Array Indexing Demo3 – Generating arrays using arange()
16. Program: Generating arrays using arange() – Ex2
17. Program: To generate zeros
18. Program: To generate ones
19. Program: Demo on linspace()
20. Program: To generate identity matrix
21. Program: To generate random numbers between 0 to 1
22. Program: Generating random integers
23. Program: Finding index of max value and min value in an array
24. Program: Knowing datatype of an array
25. Program: Slicing / Indexing 2-D arrays
26. NumPy Operations
27. Program: Performing arithmetic operations between two 1-D arrays
28. Program: Applying a scalar value to a 1-D array
29. Program: Using Universal array Functions on 1-D arrays
30. Program: Comparing Arrays
31. Program: Demo on any() and all()
32. Program: Demo on logical_and(), and logical_or() functions
33. Program: Demo on where()
34. Program: To retrieve non-zero elements from an array
35. Program: Aliasing the arrays()
36. Program: Viewing arrays (shallow copying)
37. Program: Copying arrays (deep copying)
38. Program: Attributes of an Array – ndim
39. Program: Attributes of an array – shape
40. Program: Attributes of an array – size, itemsize, dtype, nbytes
41. Program: Demo on reshape()
42. Program: Demo on flatten()
43. Program: Matrix object in numpy – demo1
44. Program: Getting diagonal elements of a Matrix
45. Program: Finding Max and Min elements in a Matrix
46. Program: Finding Sum and Average of Matrix elements
47. Program: Product of Matrix elements
48. Program: Sorting of Matrix elements
49. Program: Transpose of a Matrix
50. Program: Addition of two matrices
51. Program: Multiplication of two matrices
Chapter-1: Pandas
Pandas – A Python data analytics library
Examples:
• A chain of hospitals (like Apollo Hospitals) holds data related to the medical reports, prescriptions and feedback of its patients.
• A bank with thousands of branches (like AXIS Bank) holds the transaction details of lakhs of customers.
• Stock exchanges hold share market data for thousands of companies.
• Mobile phone network companies (like Airtel) hold data on crores of customers, such as location movement data and voice data.
• Loan lending financial companies (like Bajaj Finserv) hold the loan repayment data of lakhs of customers.
• Supermarket/retail chains (like Big Bazaar, More, Reliance, Spencer’s, Shoppers Stop) hold the purchase details of lakhs of customers.
• E-shopping companies (like Flipkart) hold the online shopping data of lakhs of customers.
Such data is stored in a system called a “Data Warehouse” (an OLAP system). This data can be queried to answer the questions raised by the management of an organization. This is called Data Analysis/Data Analytics.
Note: A Data Warehouse is different from an RDBMS like Oracle, which is used for day-to-day business transactions with customers (an OLTP system).
Data analysis is a process of inspecting the data with the goal of discovering useful information, drawing
conclusions and supporting decision-making.
This is common in other domains too, such as crime case analysis by a police department, or analyzing the reasons for losing a state or national election.
1st 5 records:
Ans:
Step1.
Note: Notice that the value counts are in descending order
Step2.
Ans:
Ans:
Step1.
Step2.
Question4: What is the most common “Reason” (a new column created using the prefix of the title) for a 911 call?
Ans:
Step1.
Step2.
Data analysis is used in various fields and industries where data is collected and analyzed to gain insights and make business decisions:
• Hotel Business
• Health Care
• Automobiles Sales
• Social Sciences like Political Science
• Sports Analysis
• E-Commerce Sales Analysis
• Uber Data Analytics
• Etc.
Advantages offered by CitiBank to Customers:
• No joining fees
• No renewal fees
• No annual fee
• No add-on fees
• Every time the customer spends some money (say Rs. 100) with his/her credit card, the customer gets 2 JP (Jet Airways loyalty program) miles free. Additional benefits like extra baggage allowance are also provided.
Now what is the advantage for both these companies, Jet Airways and CitiBank, in giving away this free stuff?
• After analyzing the spending history of its customers, CitiBank has identified air travelers as a highly profitable customer segment.
• Now CitiBank is targeting them with more focused marketing strategies and providing higher customer satisfaction through more relevant offers.
• The more profitable customers for a credit card issuing bank are those who forget (due to a busy life or some other reason) to repay the spent amount within the billing time and pay interest on it.
This is a WIN-WIN situation for CitiBank, Jet Airways and the customer.
What is pandas
It’s an open-source Python library for data analysis, data manipulation and data visualization.
The name pandas is derived from "panel data" and is also a play on the phrase "Python data analysis".
Library Highlights
• A fast and efficient DataFrame object for data manipulation with integrated indexing;
• Tools for reading and writing data between in-memory data structures and different formats: CSV and
text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
• Intelligent data alignment and integrated handling of missing data: gain automatic label-based
alignment in computations and easily manipulate messy data into an orderly form;
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
• Aggregating or transforming data with a powerful group by engine allowing split-apply-combine
operations on data sets;
• High performance merging and joining of data sets;
• Python with pandas is in use in a wide variety of academic and commercial domains, including Finance,
Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
• etc.
Structured Data
Un-Structured Data
Semi-Structured Data
What is a DataFrame
At its core, working with pandas is very much like operating a headless version of a spreadsheet (like Excel) or a table in an RDBMS.
Most of the datasets you work with will be what are called DataFrames.
You may be familiar with this term already, as it is used in other languages (like “R”). If not, a DataFrame is most often just like a spreadsheet: columns and rows, that's all there is to it!
From here, we can use pandas to perform operations on our data sets very quickly. Each column of a DataFrame is known as a pandas Series.
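The relationship between a DataFrame and its columns can be sketched as follows. This is a minimal illustration; the student names and marks are invented sample data, not taken from the course datasets.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
students = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [82, 74, 91],
})

marks = students["marks"]      # one column of the DataFrame is a Series
print(students.shape)          # (rows, columns)
print(type(marks).__name__)    # name of the column's type
print(marks.max())             # largest value in the column
```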
Step1. From Windows Explorer, go to the Python installation directory, then go to the Scripts directory, and use the “pip” tool to install pandas (the command is: pip install pandas).
Until now we've worked with Python either directly via the interactive Python console, or by writing Python
programs using a text editor.
However, there are other ways to work with Python. IPython is a set of tools originally developed to make it
easier for scientists to work with Python and data.
Starting iPython Notebook:
Note: If we use “Anaconda” Software, it comes with a lot of built-in libraries like pandas. So, no need to install
anything explicitly.
Note: Anaconda is a Python distribution. It’s an alternative way to install pandas and its dependencies:
Note: Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for
Scientific computing and Data Science.
Step1.
Step2.
Step3.
Step4.
Latest:
After download, double click on it
What is Jupyter Notebook
The Jupyter Notebook is an open-source web application that allows you to create and share documents that
contain live code, equations, visualizations and narrative text.
From here onwards rest is same as done earlier.
Another way to open Jupyter Notebook (common on both Windows OS and Mac OS)
Step1.
Step2.
Step3.
Note: The Jupyter Notebook is an open-source web application
Step4. After coding each line, save it (press Ctrl + S or click the “Save” icon; it is auto-saved every 2 minutes anyway) and click the “run cell” icon, or press:
Shift + Enter: to execute the cell and move to the next cell
(OR)
Alt + Enter: to execute the cell and insert a new cell below
Note: To name the document, click on the “Untitled” section at the top and give it a name (the extension is .ipynb)
Note: To open it whenever required, use File → Open and select the document
Program1: Reading tabular data file (data with rows and columns) into pandas
(Exs: Excel spreadsheet, csv, tsv)
Step1.
Note: “as” gives the imported module an alias name (for example, “pd” for pandas)
Step2. Now code one more statement and click “run cell”:
Note: Make sure that internet connection is available for this dataset
Note:
We got the formatted data because the underlying data file is tab-separated, which is the separator that read_table() expects by default.
It also assumes that the 1st row is a header row.
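The defaults described above can be tried without a network connection. The sketch below uses a small in-memory text in place of a real file; the column names and values are assumptions made up for the example.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a real data file
raw = "order_id\titem\tprice\n1\tChips\t2.39\n2\tJuice\t3.39\n"

# sep defaults to a tab, and the first row is used as the header
orders = pd.read_table(io.StringIO(raw))
print(orders.columns.tolist())
print(orders.head(7))    # shows at most the first 7 rows (only 2 exist here)
```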
Program3: Explicitly specifying the delimiter symbol and solving header row problem
Step1.
Notice that everything is put into one column.
Notice that here the 1st row is not a header row, it is a data row. So, we need to tell pandas that there is no
header row.
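Both problems, and their fixes, can be reproduced with a small sketch. The pipe-separated text and the column names are assumptions chosen for illustration; only sep, header and names are the pandas parameters being demonstrated.

```python
import io
import pandas as pd

# Hypothetical pipe-separated data with no header row
raw = "1|24|M|technician\n2|53|F|other\n"

# Default settings: a tab is expected, so each line lands in a single column
# and the first data row is wrongly used as the header.
one_column = pd.read_table(io.StringIO(raw))

# Name the delimiter and say that there is no header row.
users = pd.read_table(io.StringIO(raw), sep="|", header=None)

# Optionally supply column names as well (these names are assumptions).
named = pd.read_table(io.StringIO(raw), sep="|", header=None,
                      names=["user_id", "age", "gender", "occupation"])
print(one_column.shape, users.shape)
print(named.head())
```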
Step3.
Program4: Explicitly specifying the column names
Note: myufoObject[‘City’] → selects a pandas Series (the data of one particular column) from the DataFrame
Note: Column names are case-sensitive. If we give type(ufo[‘city’]), it raises a KeyError
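A small sketch of both notes, using a hypothetical two-row stand-in for the ufo dataset:

```python
import pandas as pd

# Hypothetical stand-in for the ufo dataset
ufo = pd.DataFrame({"City": ["Ithaca", "Holyoke"], "State": ["NY", "CO"]})

city = ufo["City"]                # selecting one column gives a Series
print(type(city).__name__)

# Column names are case-sensitive: "city" does not exist, only "City"
try:
    ufo["city"]
    error_name = None
except KeyError as err:
    error_name = type(err).__name__
print(error_name)
```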
Note: The notebook has an auto-completion feature (similar to IntelliSense): just press Tab after typing a dot, and it helps us in selecting:
Program: Creating a new series/column in a dataframe
Step1.
Note: Notice that it is a tuple.
Step2. The requirement is to show only those movies whose duration is >= 200 minutes
The “loc” property is used to access a group of rows and columns by label(s) or a boolean array.
Note:
1st arg: which rows we want
2nd arg: which columns we want
Note: If we don’t mention any column then all columns are displayed.
Ex2:
The value we pass in will always be interpreted as the label. It will never be interpreted as the integer position
along the index. So, we specified which rows we want by using the label of that specific row, which is 2.
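The two arguments of loc, and label-based row lookup, can be sketched like this. The movies table here is a made-up three-row sample, not the dataset used in the original demo.

```python
import pandas as pd

# Hypothetical sample of a movies dataset
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration": [142, 202, 229],
    "genre": ["Crime", "Drama", "Action"],
})

# 1st argument: which rows (here a boolean condition); no 2nd argument: all columns
long_movies = movies.loc[movies["duration"] >= 200]

# 2nd argument: which columns
long_genres = movies.loc[movies["duration"] >= 200, "genre"]

# An integer passed to loc is treated as a row label, not a position
row2 = movies.loc[2]
print(long_movies)
print(row2["title"])
```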
Note: If we develop and run any pandas code using Python IDLE and on the Python Shell:
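Note: Notice that it is a single ampersand (&), not a double one (&&). Remember that we should not use Python’s “and” operator here, because it cannot combine boolean Series element by element. Each condition should also be wrapped in parentheses.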
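A minimal sketch of combining filter criteria, again on a made-up movies sample:

```python
import pandas as pd

# Hypothetical sample of a movies dataset
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration": [142, 202, 229],
    "genre": ["Crime", "Drama", "Action"],
})

# & means "both conditions"; each condition needs its own parentheses
long_dramas = movies[(movies["duration"] >= 200) & (movies["genre"] == "Drama")]

# | means "either condition"
long_or_crime = movies[(movies["duration"] >= 200) | (movies["genre"] == "Crime")]
print(long_dramas)
print(long_or_crime)
```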
Program: Reading from a selection of columns
Note: iterrows() gives both the index label and the row data (as a Series) for each row.
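A short sketch of looping with iterrows(), using a hypothetical two-row table:

```python
import pandas as pd

# Hypothetical stand-in for the ufo dataset
ufo = pd.DataFrame({"City": ["Ithaca", "Holyoke"], "State": ["NY", "CO"]})

pairs = []
for index, row in ufo.iterrows():
    # index is the row label; row is a Series holding that row's values
    pairs.append((index, row["City"], row["State"]))
print(pairs)
```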
Program: Demo on dropping a column or a row from display, also covers usage of “axis” parameter
For drop():
axis = 0 (row) (default)
axis = 1 (column)
Dropping a column:
Dropping a row:
Note: If you want to delete multiple columns at the same time, enter them in the form of a list.
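The drop() variants described above, sketched on a small made-up table. drop() returns a new DataFrame here, so the original stays unchanged.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "City": ["Ithaca", "Holyoke"],
    "State": ["NY", "CO"],
    "Time": ["22:00", "20:00"],
})

no_time = df.drop("Time", axis=1)                # drop one column
no_first_row = df.drop(0, axis=0)                # drop the row labelled 0
only_city = df.drop(["State", "Time"], axis=1)   # drop several columns with a list
print(no_time.columns.tolist())
print(only_city.columns.tolist())
```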
Now it shows the row-wise mean: for mean(), axis = 1 means the mean is computed across the columns, giving one value per row.
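The effect of the axis parameter on mean() can be checked on a tiny made-up table of marks:

```python
import pandas as pd

# Hypothetical marks for two students
scores = pd.DataFrame({"math": [80, 60], "physics": [90, 70]})

col_means = scores.mean()          # axis=0 (default): one mean per column
row_means = scores.mean(axis=1)    # axis=1: one mean per row
print(col_means)
print(row_means)
```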
Note: Recall that, “GROUP BY” Statement in SQL is used to arrange identical data into groups based on a
categorical column with the help of some functions.
Note: We use “groupby” when we have to analyze some pandas series by some category
Note: Had we not mentioned “beer_servings” then it would have displayed all the numerical columns
Note: Also try count(), mean()
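A minimal groupby sketch. The continent and servings values are invented; only the column name “beer_servings” comes from the notes above.

```python
import pandas as pd

# Hypothetical sample of a drinks-by-country dataset
drinks = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "beer_servings": [10, 30, 200, 300],
    "wine_servings": [0, 2, 100, 200],
})

# Analyze one column by category
beer_mean = drinks.groupby("continent")["beer_servings"].mean()

# Without naming a column, every remaining column is aggregated
all_max = drinks.groupby("continent").max()
print(beer_mean)
print(all_max)
```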
Ex3:
Creating a new DataFrame from a collection of Series
Step3. Create a dictionary
Note: Now we can drop any row with its label name (eg: df1.drop(“student2”))
Step1.
Step2.
Program: To figure out whether there are any null values in a DataFrame
Program: To figure out how many null values are present in a DataFrame
Note: Notice that the above command gives column wise count, instead if we want to count in the entire
DataFrame:
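One possible way to express these null checks, sketched on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

has_nulls = df.isnull().values.any()   # are there any nulls at all?
per_column = df.isnull().sum()         # null count per column
total = df.isnull().sum().sum()        # null count in the entire DataFrame
print(has_nulls, total)
print(per_column)
```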
(a).
(c). to do the operation along the columns:
Note: Notice that we have given the “thresh” (threshold) value as 2, which means: do not delete the row if it has at least 2 non-NaN values.
(OR)
(f). Filling with “mean” of the column
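A brief sketch of two of the options for missing data: dropping rows with a threshold, and filling a column with its mean. The DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [1.0, np.nan, 2.0, np.nan],
})

# Keep only rows that have at least 2 non-NaN values
kept = df.dropna(thresh=2)

# Fill the missing values of column "a" with the mean of that column
filled_a = df["a"].fillna(df["a"].mean())
print(kept)
print(filled_a)
```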
Case1: 100 students’ data from GITAM University and 100 students’ data from Andhra University. Here we join row-wise.
Case2: One survey group collected some information about IIT Bombay CSE students, like name, gender and phone. Another survey group collected some additional information about the same students, like address, +2 college name and native place. Here we join column-wise.
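The two cases can be sketched with pd.concat(). The tiny tables below are made-up stand-ins for the student data described above.

```python
import pandas as pd

# Case1 (hypothetical data): same columns, different students -> join row-wise
gitam = pd.DataFrame({"name": ["A", "B"], "marks": [70, 80]})
au = pd.DataFrame({"name": ["C", "D"], "marks": [65, 90]})
stacked = pd.concat([gitam, au], axis=0)               # row labels repeat
renumbered = pd.concat([gitam, au], ignore_index=True) # fresh labels 0..3

# Case2 (hypothetical data): same students, different columns -> join column-wise
basic = pd.DataFrame({"name": ["A", "B"], "gender": ["F", "M"]})
extra = pd.DataFrame({"address": ["Vizag", "Pune"]})
wide = pd.concat([basic, extra], axis=1)
print(stacked)
print(wide)
```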
Step1.
Step2. Let’s get data from one more CSV file
Note: Keep in mind that the labels should match along the other axis (for example, the column names when stacking rows); pandas fills any non-matching positions with NaN.
Note: The left-most values are row labels (the index), and after concatenation they may repeat. The integer positions used by iloc (0, 1, 2, …, 13) are implicit and always unique.
Note: Make sure that both the dataframes have equal number of columns.
Note: Notice the indices for rows, all the labels would be unique now. To set “RegNum” as index label:
Now we can say:
So, let’s use better datasets:
Program: Setting a column as Index
Step1. If we want to get the details of empid: 103
Note: Notice that we couldn’t get the row of empid: 103, because it is not an index, it is just a value in the
dataframe
To modify the DataFrame itself:
Now getting the row using empid value:
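A minimal sketch of this flow, assuming a small invented employee table with an “empid” column:

```python
import pandas as pd

# Hypothetical employee data
emp = pd.DataFrame({
    "empid": [101, 102, 103],
    "name": ["Anil", "Bina", "Chitra"],
})

# While empid is an ordinary column, it can only be matched by value
by_value = emp[emp["empid"] == 103]

# Make empid the index, modifying the DataFrame itself
emp.set_index("empid", inplace=True)

# Now loc can look the row up by its empid label
row = emp.loc[103]
print(row["name"])
```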
loc[] vs iloc[]
Difference1:
Passing an integer to iloc[] always gives exactly one record (it is a position), whereas passing an integer to loc[] might give multiple records (it is a label, and labels can repeat)
Step1.
Step2.
Step3.
Difference2:
Step1.
Step2.
Step3.
Note: However, if we use slicing concept, then, even iloc can give multiple rows/records
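The difference can be seen on a small DataFrame whose index labels repeat. The data is invented for the illustration.

```python
import pandas as pd

# Hypothetical data; note that the label 1 appears twice in the index
df = pd.DataFrame({"value": ["a", "b", "c", "d"]}, index=[1, 2, 1, 3])

by_label = df.loc[1]       # label lookup: both rows labelled 1
by_position = df.iloc[1]   # position lookup: exactly one row (the 2nd)
sliced = df.iloc[0:2]      # with a slice, iloc can return several rows
print(by_label)
print(by_position["value"])
```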
Note: Better use any column where the data is repeated
concat() is used to append one (or more) DataFrames one below the other (or sideways, depending on whether the axis option is set to 0 or 1).
merge() is used to combine two DataFrames on the basis of the values of common columns (like a join in SQL).
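A short sketch of merge() on a common column. The “regnum” tables are made-up examples.

```python
import pandas as pd

# Hypothetical tables sharing the column "regnum"
left = pd.DataFrame({"regnum": [1, 2, 3], "name": ["A", "B", "C"]})
right = pd.DataFrame({"regnum": [2, 3, 4], "city": ["Vizag", "Pune", "Delhi"]})

# By default only the regnum values present in both tables are kept
merged = pd.merge(left, right, on="regnum")
print(merged)
```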
Program: Demo on apply() with a lambda expression
For descending order:
Chapter-2: Pandas - Set2
Creating a Series with Dictionary, using index and values attributes of Series class
RangeIndex object
Note: Since the column name is not specified, it appears as a RangeIndex object.
Note:
Few more points on a DataFrame
• A DataFrame can be thought of as a collection of Series objects. Pandas DataFrames allow for data representation like Excel spreadsheets.
• A DataFrame consists of rows and columns. Rows are observations or instances. Columns are variables.
With Original Column names – assigned at the time of creation of the DataFrame from a Dictionary
Changing name of a Particular Column in a DataFrame
Note: If data is entered as shown above, the column name list is not recognized, so, we have to set the
columns ourselves and create a DataFrame.
Add a new Column with single value to a Dataframe
Using iloc indexer to update a value
Ex1:
Ex2: Another way
Sorting by index column
Sorting in descending order based on a column
Add a new column (“City”) by repeating some City names:
Sorting by two columns:
A MultiIndex lets the index of a pandas object have more than one level, so each row (or column) is identified by a combination of labels. It is a multi-level, or hierarchical, index object for pandas objects.
There are various constructor methods, such as MultiIndex.from_tuples, MultiIndex.from_frame, etc., which help us create a multi-level index from arrays, tuples, DataFrames, etc.
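As a minimal sketch of the idea, a two-level index can be built with MultiIndex.from_tuples. The level names and revenue figures are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical two-level index: (year, quarter)
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1")],
    names=["year", "quarter"],
)
sales = pd.DataFrame({"revenue": [10, 12, 15]}, index=index)

rows_2023 = sales.loc["2023"]                   # all rows of the outer label
one_value = sales.loc[("2024", "Q1"), "revenue"]  # one row by both labels
print(rows_2023)
print(one_value)
```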
Ex1:
Creating the data in the required format:
Then, accessing it using row labels and column names:
Ex2:
Grouping by multiple columns in a DataFrame
Pandas Data Pre-processing for Optimal Model Execution
Operations that we perform during Data Pre-Processing / Data Cleansing
Demo on Replacing the Missing Data – with the Mode
Finding Mode
The mode is the value that appears most often in a set of data values.
Step3. Then, we want to know the value which appears the most for the “embark_town” column
Another way:
Step4. Now replacing the null values with most frequent value
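Finding the mode and using it to replace nulls might look like the sketch below. The four-row table is an invented stand-in; only the column name “embark_town” comes from the steps above.

```python
import pandas as pd

# Hypothetical stand-in for the dataset with an "embark_town" column
titanic = pd.DataFrame(
    {"embark_town": ["Southampton", "Cherbourg", None, "Southampton"]}
)

# The mode: the value that appears most often
most_common = titanic["embark_town"].mode()[0]

# Another way to get the same value
also_most_common = titanic["embark_town"].value_counts().idxmax()

# Replace the null values with the most frequent value
titanic["embark_town"] = titanic["embark_town"].fillna(most_common)
print(most_common)
print(titanic)
```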
Note: ffill() is used to fill the missing values in the DataFrame. 'ffill' stands for 'forward fill': it propagates the last valid observation forward.
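Forward fill on a small made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps
readings = pd.Series([1.0, np.nan, np.nan, 4.0])

# Each gap takes the last valid value that came before it
filled = readings.ffill()
print(filled.tolist())
```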
Step2. To get the rows which contain null values on a particular column
Step4. Now use “ffill”
Find the duplicate values among the entire row data of the DataFrame
Step2. Know which are duplicate rows
Step3. Show the duplicate Rows
Step4. Remove the duplicate Rows – method 1
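The duplicate-handling steps can be sketched as follows, assuming a small invented table in which the third row repeats the first. This is one possible way to express the steps, not necessarily the exact calls used in the original demo.

```python
import pandas as pd

# Hypothetical data: row 2 is a duplicate of row 0
df = pd.DataFrame({"name": ["A", "B", "A"], "city": ["X", "Y", "X"]})

flags = df.duplicated()                 # True for each repeated row
dup_rows = df[df.duplicated()]          # show the duplicate rows
deduped = df.drop_duplicates()          # remove them
by_city = df.duplicated(subset=["city"])  # check one specific column only
print(flags.tolist())
print(deduped)
```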
Demo2 - Find the duplicate value in the specific column data of the DataFrame
Chapter-3: NumPy
Datatypes
Data Structures
A Data Structure refers to the organization, management and storage of data that enables efficient
access and modification.
How fast an element can be searched, depends also on the Data Structure being used.
In Python:
Note:
A sequential list is stored in contiguous memory locations, whereas the elements of a linked list are not stored in contiguous locations.
A direct access file is also known as a random access file or relative file organization.
Relationship between Index and Records
Algorithms
An algorithm is a step-by-step procedure for solving a computer-solvable problem, such as executing a calculation.
Method of Memory Storage Per Data Structure
Note: Show the storage and data access (with inserting an element) – Array Vs List
What is “NumPy”
• NumPy stands for Numerical Python.
• It is the fundamental package for scientific computing with Python.
• NumPy is a Python library used for working with arrays.
• Using the NumPy library, mathematical operations on arrays are executed very efficiently.
NumPy arrays are the main reason we use NumPy, and they come in two flavors:
• Vectors (1-d arrays)
• Matrices (2-d arrays)
Note:
Python lists are similar to arrays, but they are slower to process, because a list stores pointers to objects that can be scattered in memory. NumPy arrays, on the other hand, are stored in one contiguous block of memory, so their elements can be accessed very efficiently.
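A small sketch of the practical difference: with a list we loop over the elements ourselves, while a NumPy array applies the operation to the whole block at once. The marks are invented values.

```python
import numpy as np

marks = [40, 55, 70]                  # hypothetical values

# Python list: an explicit loop over the elements
bonus_list = [m + 5 for m in marks]

# NumPy array: one operation on the whole array
arr = np.array(marks)
bonus_arr = arr + 5

# The array's data sits in one contiguous block of memory
contiguous = arr.flags["C_CONTIGUOUS"]
print(bonus_list, bonus_arr, contiguous)
```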
NumPy Datatypes
Note: np.array() converts a normal Python list to an array
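Note: NumPy arrays can be created from the contents of dictionaries and sets as well, but convert them to a list first (for example, list(my_set) or list(my_dict.values())); passing a set or dictionary directly to np.array() gives a 0-dimensional object array.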
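A sketch of creating arrays from different Python containers. Note that a set or dictionary is converted to a list before being handed to np.array(); the sample values are arbitrary.

```python
import numpy as np

one_d = np.array([1, 2, 3])              # simple list -> 1-D array
two_d = np.array([[1, 2], [3, 4]])       # list of lists -> 2-D array

# A set or dict is converted to a list first
from_set = np.array(sorted({3, 1, 2}))
from_dict = np.array(list({"x": 1, "y": 2}.values()))

# Passing the set itself wraps it as a single object instead
wrapped = np.array({1, 2, 3})
print(one_d.ndim, two_d.shape, wrapped.ndim)
```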
Program: If we need a true copy of an array - demo
Note: If we use copy(), the values are copied into a new location (a different address), so the new array is independent of the original.
Note: Had we passed simple list we would have got 1-D array
Program: To demonstrate that in NumPy arrays, the operations are propagated to the individual elements.
Notice that all values > 50 are changed to 50 and all values < 15 are changed to 15.
Eg: salaries of employees must be between a given range only like between 10000 and 1 Lakh only.
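Both ideas can be sketched briefly: an arithmetic operation is applied to every element, and values can be limited to a range. The use of np.clip() here is an assumption about how the limits of 15 and 50 were applied in the original demo; the array values are invented.

```python
import numpy as np

arr = np.array([5, 20, 70, 45])    # hypothetical values

# The operation is propagated to each individual element
doubled = arr * 2

# Values below 15 become 15, values above 50 become 50
clipped = np.clip(arr, 15, 50)
print(doubled)
print(clipped)
```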
Program: Generating arrays using arange() – Ex2
Program: To generate ones
Note: linspace() generates evenly spaced values from start to end, based on the number of points we want
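For instance, assuming we want 5 points from 0 to 1 (both ends included):

```python
import numpy as np

# 5 evenly spaced points, including both the start and the end value
points = np.linspace(0, 1, 5)
print(points)
```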
Program: To generate random numbers between 0 to 1
Shortcut way:
Program: Knowing datatype of an array
NumPy Operations
Program: Comparing Arrays
Note: We can use the relational operators (< , >, <=, >= , == and !=) to compare the arrays of same size. These
operators compare the corresponding elements of the arrays and return another array with Boolean values.
Note: The any() can be used to determine if any one element of the array is True. Whereas the all() can be used
to determine whether all the elements in the array are True.
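A short sketch of element-wise comparison together with any() and all(), on two arbitrary arrays of the same size:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 2, 1])

equal = a == b        # compares the corresponding elements
greater = a > b

any_equal = equal.any()   # is at least one element True?
all_equal = equal.all()   # are all elements True?
print(equal, greater, any_equal, all_equal)
```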
Program: Demo on logical_and(), and logical_or() functions
Note: where() can be used to create a new array (subgroup) based on whether a given condition is True or False.
Note: To compare the corresponding elements of two arrays and retrieve the biggest elements.
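The logical functions and where() can be sketched together on two arbitrary arrays:

```python
import numpy as np

a = np.array([10, 25, 30])
b = np.array([20, 15, 30])

# Combine two conditions element by element
in_range = np.logical_and(a > 5, a < 30)
at_edges = np.logical_or(a < 15, a >= 30)

# where(condition, value_if_true, value_if_false)
even_or_zero = np.where(a % 2 == 0, a, 0)

# Take the bigger of the corresponding elements of a and b
bigger = np.where(a > b, a, b)
print(in_range, at_edges, even_or_zero, bigger)
```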
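Note: NumPy’s default integer type depends on the platform: it is int64 on most 64-bit systems (it was int32 on Windows before NumPy 2.0). If we want a specific size such as int16 or int64, then we have to say np.int16 or np.int64 respectively.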
arr[result]
b=a
This is a simple assignment that does not make any new copy. It means “b” is not a new array and memory is
not allocated to “b” elements. Here there is only one array referenced/pointed by two variables.
So, this is known as aliasing, means giving another name to the existing object. Hence any modifications made
to the array via one variable will reflect for the 2nd variable as well.
To prove that both point to the same memory locations, check their ids
id(a)
id(b)
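Aliasing in a runnable form, as a minimal sketch:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a                     # no copy: b is another name for the same array

b[0] = 99                 # a change made through b ...
first_of_a = a[0]         # ... is visible through a as well

same_object = id(a) == id(b)
print(first_of_a, same_object)
```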
Note: This is done by view(). It creates a new array object that shares the same data as the existing array; the elements are not copied. Because the data is shared, if the newly created array is modified, the original array also gets modified.
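A sketch of view() behaviour. The new array is a different object, but it uses the data of the original array:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a.view()                    # new array object over the same data

different_objects = id(a) != id(b)
shared = np.shares_memory(a, b)

b[0] = 99                       # modifying the view ...
first_of_a = a[0]               # ... also changes the original
print(different_objects, shared, first_of_a)
```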
To confirm that “a” and “b” are different array objects (even though they share the same data), check their ids
id(a)
id(b)
Note: Here we use the copy(). Here the new array would be independent and modifying one array won’t affect
the 2nd one.
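The same check for copy(), as a minimal sketch:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a.copy()                    # independent array with its own data

b[0] = 99                       # modifying the copy ...
first_of_a = a[0]               # ... leaves the original untouched
shared = np.shares_memory(a, b)
print(first_of_a, shared)
```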
To prove that they are stored in different memory locations, check their ids
id(a)
id(b)
Note: the “shape” attribute gives the shape of an array. The shape is a tuple listing the number of elements along each dimension. A dimension is also called an axis in NumPy.
Note: We can also change the shape using this “shape” attribute
Note: reshape() doesn’t modify the original array’s shape; it returns a new array object with the requested shape. Assigning to the “shape” attribute modifies the original array’s shape in place.
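The contrast between reshape() and the “shape” attribute can be sketched on a six-element array:

```python
import numpy as np

a = np.arange(6)                  # elements 0..5, shape (6,)

b = a.reshape(2, 3)               # a new array object with shape (2, 3)
shape_after_reshape = a.shape     # the original is still (6,)

a.shape = (3, 2)                  # assigning to shape changes a itself
print(shape_after_reshape, b.shape, a.shape)
```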
Note: the “size” attribute gives the total no. of elements in the array.
(b) itemsize attribute
Note: the “itemsize” attribute gives the memory size of an array element in bytes
the “dtype” attribute gives the datatype of the elements in the array
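The array attributes in one place, sketched on a 2 x 3 array whose datatype is stated explicitly so that the sizes are predictable:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(a.ndim)       # number of dimensions
print(a.shape)      # elements along each dimension
print(a.size)       # total number of elements
print(a.itemsize)   # bytes per element
print(a.nbytes)     # total bytes = size * itemsize
print(a.dtype)      # datatype of the elements
```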
Note: To mention the datatype explicitly while creating the array:
Program: Demo on reshape()
Note: reshape() is used to change the shape of an array. The new array should have the same no. of elements as
in the original array.
Program: Demo on flatten()
Note: In mathematics, a matrix represents a rectangular array of elements arranged in rows and columns. If a matrix has only 1 row, it is called a “row matrix”. If a matrix has only 1 column, then it is called a “column matrix”. A row matrix or a column matrix corresponds to a vector, but in numpy it is still stored as a 2-D object.
To work with matrices, numpy provides a special object called “matrix”. In numpy, a matrix is a specialized 2D array that retains its 2D nature through operations.
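A brief sketch of the matrix object, showing that it stays 2-D and that * means matrix multiplication for it. The element values are arbitrary.

```python
import numpy as np

m = np.matrix([[1, 2], [3, 4]])      # from a nested list
m2 = np.matrix("1 2; 3 4")           # from a string, rows separated by ';'

first_row = m[0]                     # still 2-D: shape (1, 2)
diagonal = m.diagonal()              # diagonal elements, also 2-D
product = m * m                      # matrix product, not element-wise
transposed = m.T
print(first_row.shape)
print(product)
```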
Another way (by directly passing the elements):
Yet Another way (by passing a string with elements with semicolons after each row):
Note: We can find diagonal even on a normal 2-D array
Program: Finding Max and Min elements in a Matrix
i.e., 28 * 80 * 162 (OR) 6 * 120 * 504
Step2.
Note: for sorting, the default is axis = -1 (the last axis), which for a 2-D matrix is axis = 1
Program: Multiplication of two matrices