GVPCOEW-Pandas and Numpy For Data Analysis - DONE

The document provides a comprehensive guide on using Python for data analysis, focusing on the Pandas and NumPy libraries. It covers fundamental concepts of data analysis, installation instructions, and practical programming examples for manipulating and analyzing data using DataFrames and arrays. The content is structured into chapters that detail various functionalities and applications of these libraries in real-world scenarios.
Python for Data Analysis

(Libraries used: Pandas & NumPy)

Contents
Chapter-1 (Pandas)
1. What is Data Analysis
2. Where is Data Analysis used
3. Use Cases – Applications of Data Analytics
4. What is Pandas
5. Structural Perspective of Datatypes
6. What is a DataFrame
7. How to install pandas
8. Using iPython Notebook
9. Installing pandas with Anaconda (Steps)
10. What is Jupyter Notebook
11. How to use Anaconda
12. Program: Reading tabular data file (data with rows and columns) into pandas – demo on
read_table()
13. Program: To display 1st 7 rows only
14. Program: Explicitly specifying the delimiter symbol and solving header row problem
15. Program: Explicitly specifying the column names
16. Program: demo on read_csv()
17. Program: Demo on type() and selecting a pandas series from a DataFrame
18. Program: Creating a new series/column in a DataFrame
19. Program: Filtering rows of a pandas DataFrame by column value
20. Program: using the loc indexer – conditional lookup
21. Program: Applying multiple filter criteria to a pandas DataFrame
22. Program: Reading from a selection of columns
23. Program: Demo on iterrows()
24. Program: Demo on dropping a column or a row from display, also covers usage of “axis”
parameter
25. Program: Demo on mean() method
26. Program: Demo on “groupby” in pandas
27. Program: Checking for null values in a DataFrame
28. Program: To figure out whether there are any null values in a DataFrame
29. Program: To figure out how many null values are present in a DataFrame
30. Program: Dealing with null values/Missing data
31. Program: Concatenation of DataFrames
32. Program: Setting a column as Index
33. loc() vs iloc()
34. Program: Getting unique elements of a column in a DataFrame
35. Program: Getting count of unique elements of a column in a DataFrame
36. Program: Getting count of each unique element of a column in a DataFrame

BVRAJU 1
37. Program: Demo on apply() on a DataFrame
38. Program: Demo on apply() on a lambda expression
39. Program: Knowing column names of a DataFrame
40. Program: Sorting a DataFrame

Chapter-2: Pandas – Set2


1. Creating a Series with Dictionary, using index and values attributes of Series class
2. RangeIndex object
3. Applying Lambda function on a Series
4. Change the column names of a DataFrame
5. Changing name of a Particular Column in a DataFrame
6. Setting Column names while creating a DataFrame
7. Changing Row Index
8. Renaming Row Index
9. Add a new Column with single value to a Dataframe
10. Add a new Row with same value to a Dataframe
11. Using iloc indexer to update a value
12. Sorting by index column
13. Sorting in descending order based on a column
14. Sorting using two columns
15. Hierarchical Indexing / Multi-Indexing
16. Grouping by multiple columns in a DataFrame
17. Pandas Data Pre-processing for Optimal Model Execution
18. Demo on Replacing the Missing Data – with the Mode
19. Demo on fillna()’s parameter: method = ‘ffill’
20. Demo1 on Processing Duplicate Data
21. Demo2 - Find the duplicate value in the specific column data of the DataFrame
22. Demo3 - Remove duplicate rows based on one or more columns

Chapter-3: NumPy
1. Datatypes
2. Data Structures
3. Algorithms
4. Method of Memory Storage Per Data Structure
5. NumPy Arrays Vs Python Sequences(Lists, Tuples, Dictionaries & Sets)
6. Why use NumPy Arrays
7. NumPy Datatypes
8. Program: NumPy Demo1
9. Program: NumPy Demo2 – transforming the 1D array into 2D array
10. Program: NumPy avoids copies wherever possible – demo
11. Program: If we need a true copy of an array – demo
12. Program: To demonstrate that in NumPy arrays, the operations are propagated to the
individual elements.
13. Program: Indexing Demo1
14. Program: Indexing Demo2
15. Program: Array Indexing Demo3 – Generating arrays using arange()

16. Program: Generating arrays using arange() – Ex2
17. Program: To generate zeros
18. Program: To generate ones
19. Program: Demo on linspace()
20. Program: To generate identity matrix
21. Program: To generate random numbers between 0 to 1
22. Program: Generating random integers
23. Program: Finding index of max value and min value in an array
24. Program: Knowing datatype of an array
25. Program: Slicing / Indexing 2-D arrays
26. NumPy Operations
27. Program: Performing arithmetic operations between two 1-D arrays
28. Program: Applying a scalar value to a 1-D array
29. Program: Using Universal array Functions on 1-D arrays
30. Program: Comparing Arrays
31. Program: Demo on any() and all()
32. Program: Demo on logical_and(), and logical_or() functions
33. Program: Demo on where()
34. Program: To retrieve non-zero elements from an array
35. Program: Aliasing the arrays()
36. Program: Viewing arrays (shallow copying)
37. Program: Copying arrays (deep copying)
38. Program: Attributes of an Array – ndim
39. Program: Attributes of an array – shape
40. Program: Attributes of an array – size, itemsize, dtype, nbytes
41. Program: Demo on reshape()
42. Program: Demo on flatten()
43. Program: Matrix object in numpy – demo1
44. Program: Getting diagonal elements of a Matrix
45. Program: Finding Max and Min elements in a Matrix
46. Program: Finding Sum and Average of Matrix elements
47. Program: Product of Matrix elements
48. Program: Sorting of Matrix elements
49. Program: Transpose of a Matrix
50. Program: Addition of two matrices
51. Program: Multiplication of two matrices

Chapter-1: Pandas
Pandas – A Python data analytics library

What is Data Analysis

Data plays an important role in businesses.

Examples:

• A chain of hospitals (like Apollo Hospitals) holds data such as medical reports, prescriptions and
patient feedback.
• A bank with thousands of branches (like AXIS Bank) holds the transaction details of lakhs of customers.
• Stock exchanges hold share market data for thousands of companies.
• Mobile phone network companies (like Airtel) hold data on crores of customers, such as location
movement data and voice data.
• Loan lending financial companies (like Bajaj Finserv) hold the loan repayment data of lakhs of customers.
• Supermarket/retail chains (like Big Bazaar, More, Reliance, Spencer’s, Shoppers Stop) hold the purchase
details of lakhs of customers.
• E-shopping companies (like Flipkart) hold the online shopping data of lakhs of customers.

Such data is stored in a system called a “Data Warehouse” (an OLAP system). This data can be queried to answer
the questions raised by the management of an organization. This is called Data Analysis/Data Analytics.

Note: A Data Warehouse is different from an RDBMS like Oracle, which is used for an organization's day-to-day
business transactions with its customers (an OLTP system).

Data analysis is a process of inspecting the data with the goal of discovering useful information, drawing
conclusions and supporting decision-making.

Data analysis is also common in other domains, such as crime case analysis by a police department, or analyzing
the reasons for losing a state or national election.

Ex1: 911 Emergency Calls

1st 5 records:

Question1: What are the top 5 zipcodes for 911 calls

Ans:

Step1.

Note: Notice that the value counts are in descending order

Step2.

Question2: What are the top 5 townships(twp) for 911 calls

Ans:

Question3: How many unique title codes are there?

Ans:

Step1.

Step2.

Question4: What is the most common “Reason” (a new column created using the prefix of the title) for a 911
call?

Ans:

Step1.

Step2.
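
The four questions above can be answered with a few Series methods. The following is a minimal sketch on a small made-up DataFrame: only the column names zip, twp and title come from the questions, while the rows and the variable name calls are illustrative assumptions.

```python
import pandas as pd

# Made-up rows standing in for the 911 dataset (illustrative only).
calls = pd.DataFrame({
    "zip": [19401, 19401, 19464, 19401, 19464, 19403],
    "twp": ["NORRISTOWN", "NORRISTOWN", "POTTSTOWN",
            "NORRISTOWN", "POTTSTOWN", "LOWER PROVIDENCE"],
    "title": ["EMS: FALL VICTIM", "Fire: GAS LEAK", "EMS: FALL VICTIM",
              "Traffic: VEHICLE ACCIDENT", "EMS: CARDIAC EMERGENCY",
              "EMS: FALL VICTIM"],
})

top_zips = calls["zip"].value_counts().head(5)   # Question 1 (sorted descending)
top_twps = calls["twp"].value_counts().head(5)   # Question 2
unique_titles = calls["title"].nunique()         # Question 3

# Question 4: the "Reason" is the part of the title before the colon.
calls["Reason"] = calls["title"].apply(lambda t: t.split(":")[0])
most_common_reason = calls["Reason"].value_counts().idxmax()
```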

Where is Data Analysis Used

Data analysis is used in various fields and industries where data is collected and analyzed to gain insights, make
business decisions:

• Hotel Business
• Health Care
• Automobiles Sales
• Social Sciences like Political Science
• Sports Analysis
• E-Commerce Sales Analysis
• Uber Data Analytics
• Etc.

Use Cases – Applications of Data Analytics

Ex: Co-branded Credit Card

Advantages offered by CitiBank to Customers:
• No joining fees
• No renewal fees
• No annual fee
• No add-on fees
• Every time the customer spends some money (say Rs. 100) with his/her credit card, the customer gets 2 JP (Jet
Airways loyalty program) miles free. Additional benefits, like extra baggage allowance, are also provided.

Now what is the advantage for both these companies, Jet Airways and CitiBank, in giving away this free
stuff?
• After analyzing the spending history of its customers, CitiBank identified air travelers
as a highly profitable customer segment.
• Now CitiBank is targeting them with more focused marketing strategies and providing higher customer
satisfaction through more relevant offers.
• The more profitable customers for a credit-card-issuing bank are those who forget (due to a busy life or some
other reason) to repay the spent amount within the billing period and pay interest on it.

This is a WIN-WIN situation for CitiBank, Jet Airways and for the Customer.

What is pandas

It’s an open-source Python library for data analysis and data manipulation, with basic built-in plotting for data visualization.

The name pandas is derived from "panel data" and "Python data analysis".

Library Highlights

• A fast and efficient DataFrame object for data manipulation with integrated indexing;
• Tools for reading and writing data between in-memory data structures and different formats: CSV and
text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
• Intelligent data alignment and integrated handling of missing data: gain automatic label-based
alignment in computations and easily manipulate messy data into an orderly form;
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
• Aggregating or transforming data with a powerful group by engine allowing split-apply-combine
operations on data sets;
• High performance merging and joining of data sets;
• Python with pandas is in use in a wide variety of academic and commercial domains, including Finance,
Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

• etc.

Data Structures used in Pandas


• Series
• DataFrame

Structural Perspective of Datatypes

Collected data can be divided into 3 categories:

Structured Data

• It has a Schema Structure.


• The data is stored in fixed-fields such as RDBMS, spreadsheets etc

Un-Structured Data

• It doesn’t have a Schema Structure.


• Social Media (Tweets, Facebook Posts, Comments, Likes etc), Images, Audio and Videos, E-
Commerce Reviews, News Portals and NoSQL Databases, Online Discussion Forums like
Stackoverflow.com, Question and Answer forums like Quora.com

Semi-Structured Data

• It does not have a fixed schema, but it carries tags or markers that describe and organize the data.


• Examples: XML, JSON, HTML tables, Email

What is a DataFrame

It is basically a 2-D data structure used to represent tabular data.

At its core, it is very much like operating a headless version (no graphical interface) of a spreadsheet such as Excel, or a table in an RDBMS.

Most of the datasets you work with will be what are called dataframes.

You may be familiar with this term already, as it is used in other languages (like “R”). If not, a dataframe is
most often just like a spreadsheet, with columns and rows; that's all there is to it!

From here, we can use pandas to perform operations on our datasets very quickly. Each column of a
DataFrame is a pandas Series.

How to install pandas

Step1. From Windows Explorer, go to the Python installation directory, then to the Scripts directory,
and use the “pip” tool to install pandas (for example, by running “pip install pandas” from a command prompt opened there):

Using iPython Notebook

Until now we've worked with Python either directly via the interactive Python console, or by writing Python
programs using a text editor.

However, there are other ways to work with Python. IPython is a set of tools originally developed to make it
easier for scientists to work with Python and data.

Installation of ipython notebook:

Starting iPython Notebook:

Now a local server gets started…

Note: If we use “Anaconda” Software, it comes with a lot of built-in libraries like pandas. So, no need to install
anything explicitly.

Installing pandas with Anaconda (Steps)

Note: Anaconda is a Python distribution. It’s an alternative way to install pandas and its dependencies:

Note: Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for
Scientific computing and Data Science.

Step1.

Step2.

Step3.

Step4.

Latest:

After download, double click on it

What is Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that
contain live code, equations, visualizations and narrative text.

How to use Anaconda

From the start menu:

The following is open:

From here onwards rest is same as done earlier.

Another way to open Jupyter Notebook (common on both Windows OS and Mac OS)

Step1.

Step2.

Step3.

Note: The Jupyter Notebook is an open-source web application

Step3. We should see the following interface:

Step4. After coding each line, save it (press Ctrl + S or click on the “Save” icon; it is also auto-saved
every 2 minutes) and click the “run cell” icon:

Note: Shortcuts to run a cell:

Ctrl + Enter: to execute the cell and stay on it

Shift + Enter: to execute the cell and move to the next cell
(OR)
Alt + Enter: to execute the cell and insert a new cell below

Note: To save the document, just click on “untitled” section at the top, give it a name (extension is .ipynb)

Note: To know the present/current working directory, just type: “pwd”

Note: To open it whenever required, File → Open and select the document

Note: To put some normal text/ description:

Program1: Reading tabular data file (data with rows and columns) into pandas
(Exs: Excel spreadsheet, csv, tsv)

Step1.

Note: “as” stands for “alias name”

Step2. Now code one more statement and click “run cell”:

Note: Press tab key to get suggestions.

Note: Make sure that internet connection is available for this dataset

Note:

We got the formatted data. The underlying data file is tab separated.
It assumes that the 1st row is a header row.
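
A minimal sketch of this step. To keep it runnable without an internet connection, it reads a small tab-separated text held in memory instead of the original dataset; the data and the names tsv_text and orders are illustrative assumptions.

```python
import io
import pandas as pd   # "pd" is the conventional alias name

# Hypothetical tab-separated data; the first row is the header row.
tsv_text = "order_id\tquantity\titem_name\n1\t1\tChips\n1\t2\tIzze\n2\t1\tBowl\n"

orders = pd.read_table(io.StringIO(tsv_text))   # tab is the default separator
first_two = orders.head(2)                      # head(7) would give the 1st 7 rows
```

With a real file, the in-memory object is simply replaced by a file path or URL.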

Program2: To display 1st 7 rows only

Program3: Explicitly specifying the delimiter symbol and solving header row problem

Let’s take another dataset which is not delimited by tab

Step1.

Notice that everything is put into one column.

Notice that here the delimiter is pipe symbol ( | )

Step2. Modify the read_table()

Notice that here the 1st row is not a header row, it is a data row. So, we need to tell pandas that there is no
header row.

Step3.
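
The three steps can be sketched as follows, assuming a small pipe-delimited text without a header row (the data and variable names are made up for illustration).

```python
import io
import pandas as pd

# Hypothetical pipe-delimited data with no header row.
pipe_text = "1|24|M|technician\n2|53|F|other\n3|23|M|writer\n"

# Step 1: with the default tab separator, everything lands in one column
# and the first data row is mistaken for the header.
one_column = pd.read_table(io.StringIO(pipe_text))

# Steps 2 and 3: name the delimiter and say that there is no header row.
users = pd.read_table(io.StringIO(pipe_text), sep="|", header=None)
```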

Program4: Explicitly specifying the column names

Note: For that 1st we have to create a list of strings

Now the output looks perfect.

Note: for read_table() default separator is “\t”
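
A minimal sketch of passing explicit column names through the names parameter. The column names in user_cols and the data are hypothetical.

```python
import io
import pandas as pd

pipe_text = "1|24|M|technician\n2|53|F|other\n"

# First create a list of strings, one per column.
user_cols = ["user_id", "age", "gender", "occupation"]

users = pd.read_table(io.StringIO(pipe_text), sep="|", header=None, names=user_cols)
```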

Program5. Demo on read_csv()

Program: Demo on type() and selecting a pandas series from a DataFrame

Note: type() function gives the datatype of the object

Note: myufoObject[‘City’] → selects a pandas Series (the data of one particular column) from the DataFrame

Note: type(myufoObject[‘City’]) → gives pandas.core.series.Series datatype

Note: column names are case-sensitive. If we give type(myufoObject[‘city’]) → it raises a KeyError
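
The notes above can be checked with a short sketch. The DataFrame is a made-up stand-in for the UFO dataset, and ufo is a hypothetical variable name.

```python
import pandas as pd

ufo = pd.DataFrame({"City": ["Ithaca", "Willingboro"], "State": ["NY", "NJ"]})

city = ufo["City"]                          # bracket notation selects one column
frame_is_dataframe = isinstance(ufo, pd.DataFrame)
column_is_series = isinstance(city, pd.Series)

# Column names are case-sensitive: "city" is not "City".
try:
    ufo["city"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```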

Note: The notebook has an IntelliSense-like auto-completion feature: just press Tab after typing a dot, and it helps us select an attribute or method:

Program: Creating a new series/column in a dataframe
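
One possible version of this program, assuming a small made-up DataFrame with City and State columns. A new column is created by assigning to a new name in brackets; the column name Location is an illustrative choice.

```python
import pandas as pd

ufo = pd.DataFrame({"City": ["Ithaca", "Willingboro"], "State": ["NY", "NJ"]})

# Assigning to a name that does not exist yet creates the new column.
ufo["Location"] = ufo["City"] + ", " + ufo["State"]
```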

Program: Filtering rows of a pandas DataFrame by column value

Step1.

Note: Notice that it is a tuple.

Step2. The requirement is to show only those movies whose duration is >= 200 minutes

Step3. Writing it in a compact way:
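
A minimal sketch of the filter, assuming a made-up movies DataFrame with a duration column (in minutes).

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "duration": [142, 229, 200],
})

is_long = movies["duration"] >= 200     # boolean Series, one value per row
long_movies = movies[is_long]           # keeps only the rows where it is True

# Compact way: build the condition inline.
long_movies_compact = movies[movies["duration"] >= 200]
```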

Program: using the loc indexer - (conditional lookup)

Note: “loc” is an attribute/property of DataFrame

The “loc” property is used to access a group of rows and columns by label(s) or a boolean array.

Note:
1st arg: which rows we want
2nd arg: which columns we want

Note: If we don’t mention any column then all columns are displayed.

Ex2:

Note: Here “2” is a row label.

The value we pass in will always be interpreted as the label. It will never be interpreted as the integer position
along the index. So, we specified which rows we want by using the label of that specific row, which is 2.

Ex3: Getting multiple rows

Ex4: Slicing both rows and columns:
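
The four examples can be sketched on a made-up movies DataFrame with the default row labels 0 to 3 (data and names are illustrative).

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Crime", "Drama", "Action", "Drama"],
    "duration": [142, 229, 200, 96],
})

# Conditional lookup: rows chosen by a condition, one column chosen by name.
long_titles = movies.loc[movies["duration"] >= 200, "title"]

row_2 = movies.loc[2]                      # the row whose LABEL is 2, all columns
rows_0_to_2 = movies.loc[0:2]              # label slices include both endpoints
block = movies.loc[0:2, "title":"genre"]   # slicing rows and columns together
```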

Note: If we develop and run any pandas code using Python IDLE and on the Python Shell:

Program: Applying multiple filter criteria to a pandas DataFrame

Note: parentheses are important.

Note: notice that it is single ampersand but not double. Remember that here we should not use Python’s “and”
operator.

Note: for OR operation we have to use single pipe symbol.
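
A short sketch of combining conditions, on made-up data. Each condition sits in its own parentheses, with & for AND and | for OR.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Crime", "Drama", "Action", "Drama"],
    "duration": [142, 229, 200, 96],
})

long_dramas = movies[(movies["duration"] >= 200) & (movies["genre"] == "Drama")]
long_or_crime = movies[(movies["duration"] >= 200) | (movies["genre"] == "Crime")]
```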

Program: Reading from a selection of columns

Program: Demo on iterrows()

Note: this function gives both the indices and the rows data.
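
A minimal sketch of iterrows() on a made-up DataFrame. Each iteration yields the row's index label and the row itself as a Series.

```python
import pandas as pd

ufo = pd.DataFrame({"City": ["Ithaca", "Willingboro"], "State": ["NY", "NJ"]})

seen = []
for index, row in ufo.iterrows():
    # "row" is a Series, so its values are read by column name.
    seen.append((index, row["City"], row["State"]))
```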

Program: Demo on dropping a column or a row from display, also covers usage of “axis” parameter

For drop():
axis = 0 (row) (default)
axis = 1 (column)

Dropping a column:

Dropping a row:

Note: default axis value is 0.

Note: If you want to delete multiple columns at the same time, enter them in the form of a list.
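
The drop() variants above, sketched on a made-up DataFrame. drop() returns a new DataFrame and leaves the original untouched unless inplace=True is given.

```python
import pandas as pd

ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke"],
    "Colors Reported": [None, None, "RED"],
    "State": ["NY", "NJ", "CO"],
})

without_colors = ufo.drop("Colors Reported", axis=1)     # drop one column
without_two_cols = ufo.drop(["City", "State"], axis=1)   # drop several columns
without_first_rows = ufo.drop([0, 1], axis=0)            # drop rows by label
```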

Program: Demo on mean() method

Note: It displays the average of each numeric column

Now it shows the row-wise mean: for mean(), axis = 1 gives one mean per row (moving across the columns).

So, the default axis value is 0 (i.e., one mean per column, moving down the rows)

Note: for drop() function


axis = 1 is equivalent to axis = ‘columns’
axis = 0 is equivalent to axis = ‘index’
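
A small sketch of the axis parameter with mean(), on made-up numbers.

```python
import pandas as pd

drinks = pd.DataFrame({"beer": [0, 89, 25], "wine": [0, 54, 14]})

col_means = drinks.mean()                   # axis=0 (default): one mean per column
row_means = drinks.mean(axis=1)             # axis=1: one mean per row
row_means_named = drinks.mean(axis="columns")   # same as axis=1
```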

Program: Demo on “groupby” in pandas

Note: Recall that, “GROUP BY” Statement in SQL is used to arrange identical data into groups based on a
categorical column with the help of some functions.

Note: We use “groupby” when we have to analyze some pandas series by some category

Note: Had we not mentioned “beer_servings” then it would have displayed all the numerical columns
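
A minimal groupby sketch. Only the column name beer_servings comes from the note above; the other column names and all values are made up.

```python
import pandas as pd

drinks = pd.DataFrame({
    "country": ["India", "Japan", "France", "Italy"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "beer_servings": [9, 77, 127, 85],
    "wine_servings": [0, 16, 370, 237],
})

# Mean of one Series per category.
beer_by_continent = drinks.groupby("continent")["beer_servings"].mean()

# Without selecting a column, every numeric column is aggregated.
all_numeric = drinks.groupby("continent").mean(numeric_only=True)
```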

Note: Also try count(), mean()

To understand the 25%, 50% and 75% statistics:

Ex1: Notice that it is sorted data

Ex2: Notice that it is unsorted data

Ex3:
Creating a new DataFrame from a collection of Series

Step1. Create a series

Ex: Subject1 (Maths) details of 3 students

Step2. Create another series

Ex: Subject2 (Physics) details of 3 students

Step3. Create a dictionary

Giving labels/columns to the above two series

Step4. Create a DataFrame

Note: Now we can drop any row with its label name (eg: df1.drop(“student2”))
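
Steps 1 to 4 can be sketched as follows. The marks are made up; the label student2 follows the note above.

```python
import pandas as pd

students = ["student1", "student2", "student3"]

maths = pd.Series([85, 72, 91], index=students)     # Step 1
physics = pd.Series([78, 88, 69], index=students)   # Step 2

marks = {"Maths": maths, "Physics": physics}        # Step 3: column label -> Series
df1 = pd.DataFrame(marks)                           # Step 4

df1_without_s2 = df1.drop("student2")               # drop a row by its label
```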

Program: Demo on using multiple aggregate functions at one go
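
One way to apply several aggregate functions at once is agg() with a list of function names. A minimal sketch on made-up data:

```python
import pandas as pd

drinks = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "beer_servings": [9, 77, 127, 85],
})

# One output column per aggregate function, one row per group.
stats = drinks.groupby("continent")["beer_servings"].agg(["count", "min", "max", "mean"])
```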

Program: Checking for null values in a DataFrame

Step1.

Step2.

Program: To figure out whether there are any null values in a DataFrame

Program: To figure out how many null values are present in a DataFrame

Note: Notice that the above command gives column wise count, instead if we want to count in the entire
DataFrame:
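
The null-value checks from the last three programs, sketched on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, np.nan, 6], "C": [7, 8, 9]})

null_mask = df.isnull()                    # True wherever a value is missing
has_any_null = df.isnull().values.any()    # is there any null value at all?
nulls_per_column = df.isnull().sum()       # column-wise count
total_nulls = df.isnull().sum().sum()      # count over the entire DataFrame
```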

Program: Dealing with null values/missing data

(a).

(b). Dropping Rows having “nan” values

(c). to do the operation along the columns:

(d). Dropping based on “threshold” value

Note: “thresh” is the minimum number of non-NaN values a row must have in order to be kept.

Note: In the above example, only row 0 has 3 non-null values.

Note: Notice that we have given the “thresh” value as 2, which means: do not delete the row if it has at
least 2 non-NaN values.

(OR)

For a DataFrame with 3 columns, this is the same as deleting all rows with 2 or more NaN values.

(e). Filling the “nan” values

(f). Filling with “mean” of the column
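
Cases (b) to (f) can be sketched on a made-up three-column DataFrame in which only row 0 is complete.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [5, np.nan, np.nan], "C": [1, 2, 3]})

drop_rows = df.dropna()               # (b) drop rows having any NaN
drop_cols = df.dropna(axis=1)         # (c) drop columns having any NaN
at_least_two = df.dropna(thresh=2)    # (d) keep rows with >= 2 non-NaN values
filled_zero = df.fillna(0)            # (e) fill NaN with a fixed value
filled_mean = df["A"].fillna(df["A"].mean())   # (f) fill with the column mean
```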

Program: Concatenation of DataFrames

Note: Concatenation of DataFrames is also known as Binding of DataFrames.

Note: Adding Rows to existing Rows or adding Columns to existing Columns

Case1: 100 students’ data of GITAM University and 100 students’ data of Andhra University. Here we join row
wise.

Case2: One survey group collected some information about IIT Bombay CSE students, like name, gender and phone.
Another survey group collected some additional information about the same students, like address, +2 college
name and native place. Here we join column wise.

Step1.

Step2. Let’s get data from one more CSV file

Step3. Now concatenating them

Note: Keep in mind that dimensions should match along the axis you are concatenating on.

We use the pd.concat() and pass a list of DataFrames to concatenate together.

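Note: The leftmost values are the index (row) labels. After concatenation these labels can repeat, whereas the
integer positions used by iloc (0, 1, 2, …, 13) are always unique.

Note: For a clean row-wise result, both DataFrames should have the same columns; pandas fills any column
that is missing from one of them with NaN.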

Note: To ignore the index:

pd.concat([student1, student2], ignore_index=True)
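
A minimal sketch of these concatenation variants. The names student1, student2 and RegNum follow the text; the rows themselves are made up.

```python
import pandas as pd

student1 = pd.DataFrame({"RegNum": [101, 102], "Name": ["Asha", "Ravi"]})
student2 = pd.DataFrame({"RegNum": [201, 202], "Name": ["Kiran", "Meena"]})

stacked = pd.concat([student1, student2])                        # row labels repeat
renumbered = pd.concat([student1, student2], ignore_index=True)  # fresh labels
side_by_side = pd.concat([student1, student2], axis=1)           # column wise
```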

Note: Notice the indices for rows, all the labels would be unique now. To set “RegNum” as index label:

Now we can say:

To concatenate column wise, mention axis=1

If we give axis=1 for the above example:

So, let’s use better datasets:

Program: Setting a column as Index

Consider the following DataFrame:

Step1. If we want to get the details of empid: 103

Note: Notice that we couldn’t get the row of empid: 103, because it is not an index, it is just a value in the
dataframe

Step2. Now setting empid column as index column

Note: Notice that the DataFrame “df” is not modified.

To modify itself:

Now getting the row using empid value:

Resetting the index:
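
The whole sequence can be sketched as follows. The empid column and the value 103 come from the text; the name column and its values are made up.

```python
import pandas as pd

df = pd.DataFrame({"empid": [101, 102, 103], "name": ["Anil", "Bina", "Chetan"]})

indexed = df.set_index("empid")        # returns a new DataFrame; df is unchanged
df_unchanged = "empid" in df.columns

row_103 = indexed.loc[103]             # now 103 is a row label

df.set_index("empid", inplace=True)    # modify df itself
name_103 = df.loc[103, "name"]

df.reset_index(inplace=True)           # empid becomes an ordinary column again
```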

loc() vs iloc()

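loc is label based (the label can be an integer or a string)

iloc is integer-position based

Note: Both are indexers, so they are used with square brackets (e.g. df.loc[2], df.iloc[2]) and are not called like functions.

Difference1:

iloc with a single integer always gives only one record, whereas loc with a single integer label might return multiple records (when that label is repeated in the index)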

Step1.

Step2.

Step3.

Difference2:

Step1.

Note: Notice that this time the index is a string.

Step2.

Step3.

Note: However, if we use slicing concept, then, even iloc can give multiple rows/records
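
Both differences can be sketched with two made-up DataFrames: one with a repeated integer label and one with string labels.

```python
import pandas as pd

# Difference 1: the integer label 10 is repeated in the index.
df = pd.DataFrame({"name": ["Anil", "Bina", "Chetan"]}, index=[10, 20, 10])

by_label = df.loc[10]        # every row labelled 10 (two rows here)
by_position = df.iloc[0]     # exactly one row: the first one
sliced = df.iloc[0:2]        # with slicing, iloc can also return several rows

# Difference 2: with a string index, loc takes labels and iloc takes positions.
df2 = pd.DataFrame({"score": [1, 2, 3]}, index=["a", "b", "c"])
score_by_label = df2.loc["b", "score"]
score_by_position = df2.iloc[1]["score"]
```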

Program: Getting unique elements of a column in a DataFrame

Note: Better use any column where the data is repeated

Program: Getting count of unique elements of a column in a DataFrame

Program: Getting count of each unique element of a column in a DataFrame
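
The three programs above map to unique(), nunique() and value_counts(). A minimal sketch on a made-up column with repeated data:

```python
import pandas as pd

movies = pd.DataFrame({"genre": ["Drama", "Crime", "Drama", "Action", "Drama"]})

unique_genres = movies["genre"].unique()       # the distinct values
num_unique = movies["genre"].nunique()         # how many distinct values
genre_counts = movies["genre"].value_counts()  # count of each distinct value
```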

concat () vs merge () vs join ()

concat() is used to append one (or more) dataframes one below the other (or sideways, depending on
whether the axis option is set to 0 or 1).

merge() is used to combine two dataframes on the basis of the values of common columns (like a SQL join).

join() is used to combine dataframes on the basis of the index.
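
A small sketch contrasting merge() and join() on two made-up DataFrames that share an empid column.

```python
import pandas as pd

left = pd.DataFrame({"empid": [1, 2, 3], "name": ["Anil", "Bina", "Chetan"]})
right = pd.DataFrame({"empid": [2, 3, 4], "dept": ["HR", "IT", "Ops"]})

# merge: match rows on a common column (inner join by default).
merged = pd.merge(left, right, on="empid")

# join: match rows on the index (left join by default).
joined = left.set_index("empid").join(right.set_index("empid"))
```

Here merged keeps only the empid values present in both frames, while joined keeps every row of the left frame and leaves dept missing where there is no match.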

Program: Demo on apply() on a DataFrame

(a). without using apply()

(b). with using the apply() function

For applying transformation on each value

Program: Demo on apply() on a lambda expression
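
A minimal sketch of apply() with a named function, a lambda expression and a built-in function. The DataFrame and the helper add_bonus are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["anil", "bina"], "salary": [1000, 2000]})

def add_bonus(salary):
    """Hypothetical transformation applied to each value."""
    return salary + 500

df["with_bonus"] = df["salary"].apply(add_bonus)            # named function
df["name_upper"] = df["name"].apply(lambda s: s.upper())    # lambda expression
df["name_len"] = df["name"].apply(len)                      # built-in function
```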

Program: Knowing column names of a DataFrame

Program: Sorting a DataFrame

For descending order:

Chapter-2: Pandas - Set2

Creating a Series with Dictionary, using index and values attributes of Series class

RangeIndex object

Note: Since the column name is not specified, it appears as a RangeIndex object.

Applying Lambda function on a Series

Note:
Few more points on a DataFrame

• A DataFrame can be thought of as a collection of Series objects. Pandas DataFrames allow for
data representation like Excel spreadsheets.
• DataFrame consists of rows and columns. Rows are observations or instances. Columns are
variables

Change the column names of a DataFrame

With Original Column names – assigned at the time of creation of the DataFrame from a Dictionary

Then, assigning new names:

Changing name of a Particular Column in a DataFrame

Setting Column names while creating a DataFrame

Note: If data is entered as shown above, the column name list is not recognized, so, we have to set the
columns ourselves and create a DataFrame.

Changing Row Index

Renaming Row Index

Add a new Column with single value to a Dataframe

Add a new Row with same value to a Dataframe

Using iloc indexer to update a value

Ex1:

Ex2: Another way

Ex3: Yet another way:

Sorting by index column

Sorting in descending order based on a column

Sorting using two columns

Add a new column (“City”) by repeating some City names:

Sorting by one column:

Sorting by two columns:
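
The sorting variants of this chapter, sketched on a made-up DataFrame with a repeated City column:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Vizag", "Hyderabad", "Vizag", "Hyderabad"],
    "Name": ["Asha", "Ravi", "Kiran", "Meena"],
    "Age": [30, 25, 22, 35],
})

by_index_desc = df.sort_index(ascending=False)        # sort by the index
by_age_desc = df.sort_values("Age", ascending=False)  # descending on one column
by_city_then_age = df.sort_values(["City", "Age"])    # City first, then Age
```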

Hierarchical Indexing / Multi-Indexing

A multi-index lets you use more than one level of labels on the rows (or columns) of a pandas object, so that
each row is identified by a combination of keys. It is a multi-level, or hierarchical, index object.

pandas provides several constructors, such as MultiIndex.from_arrays, MultiIndex.from_tuples,
MultiIndex.from_product and MultiIndex.from_frame, which create a multi-index from arrays, tuples, dataframes,
etc.

Ex1:

Creating the data in the required format:

Then, accessing it using Row labels:

Then, accessing it using column names:

Then, accessing it using row labels and column names:
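
A minimal sketch of Ex1 using MultiIndex.from_tuples. The level names year and quarter and all values are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
sales = pd.DataFrame(
    {"units": [10, 12, 15, 18], "revenue": [100, 120, 150, 180]},
    index=index,
)

rows_2023 = sales.loc["2023"]                  # by outer row label
units = sales["units"]                         # by column name
one_value = sales.loc[("2024", "Q2"), "revenue"]   # row labels and column name
```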

Ex2:

Grouping by multiple columns in a DataFrame

Pandas Data Pre-processing for Optimal Model Execution

Operations that we perform during Data Pre-Processing / Data Cleansing

Demo on Replacing the Missing Data – with the Mode

Step1. Get the data

Finding Mode

The mode is the value that appears most often in a set of data values.

Step2. Finding number of null values on each column

Step3. Then, we want to know the value which appears the most for the “embark_town” column

Another way:

Step4. Now replacing the null values with most frequent value
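
Steps 2 to 4 can be sketched on a tiny made-up DataFrame. Only the column name embark_town comes from the text.

```python
import pandas as pd

titanic = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton", None],
})

nulls_per_column = titanic.isnull().sum()                   # Step 2

most_frequent = titanic["embark_town"].mode()[0]            # Step 3
same_answer = titanic["embark_town"].value_counts().idxmax()   # another way

# Step 4: replace the null values with the most frequent value.
titanic["embark_town"] = titanic["embark_town"].fillna(most_frequent)
```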

Demo on fillna()’s parameter: method = ‘ffill’

Note: ffill is used to fill the missing values in the dataframe. 'ffill' stands for 'forward fill' and
will propagate the last valid observation forward.

Step1. Get the data

Step2. To get the rows which contain null values on a particular column

Step3. Display the before and after rows too

Step4. Now use “ffill”
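
A minimal forward-fill sketch on made-up data. It calls the ffill() method directly; to the best of my knowledge, recent pandas versions deprecate the method='ffill' argument of fillna() in favour of ffill(), so treat the choice of call as an assumption about the installed version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 26.0, np.nan, np.nan]})

rows_with_null = df[df["age"].isnull()]   # rows that contain null in "age"

# Each missing value takes the last valid value seen above it.
filled = df.ffill()
```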

Demo1 on Processing Duplicate Data

Find the duplicate values among the entire row data of the DataFrame

Step1. Get the data from StudentsDuplicateCSVFile.csv

Step2. Know which are duplicate rows

Step3. Show the duplicate Rows

Step4. Remove the duplicate Rows – method 1

Step4. Drop the Duplicate Rows – method 2

Demo2 - Find the duplicate value in the specific column data of the DataFrame

Demo3 - Remove duplicate rows based on one or more columns
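
The three duplicate-handling demos can be sketched together on a made-up DataFrame (it stands in for the CSV file, which is not reproduced here).

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Ravi"],
    "city": ["Vizag", "Vizag", "Vizag", "Delhi"],
})

# Demo1: duplicates across the entire row.
dup_mask = students.duplicated()           # True for repeated rows
dup_rows = students[dup_mask]              # show the duplicate rows
method1 = students[~dup_mask]              # remove them with the mask
method2 = students.drop_duplicates()       # or with drop_duplicates()

# Demo2: duplicates within one specific column.
name_dups = students["name"].duplicated()

# Demo3: remove rows that repeat in one or more chosen columns.
unique_names = students.drop_duplicates(subset=["name"])
```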

Chapter-3: NumPy

Datatypes

It is a classification that identifies the type of data.

Examples: Integers, floats, booleans, strings, lists etc

Note: Datatypes also determine the size of the data.

Data Structures

Data Structures deal with group of data.

A Data Structure refers to the organization, management and storage of data that enables efficient
access and modification.

How fast an element can be searched, depends also on the Data Structure being used.

Types of Data Structures used in Computer Science:

In Python:

Note:

A sequential list is stored in contiguous memory locations, whereas linked list elements are not stored in
contiguous locations.

Direct access file is also known as random access or relative file organization.

Relationship between Index and Records

Book Index Example:

Algorithms

An algorithm is a step-by-step procedure for solving a computable problem or for executing a
calculation.

Flow Chart of an Algorithm:

Method of Memory Storage Per Data Structure

Note: Show the storage and data access (with inserting an element) – Array Vs List

What is “NumPy”

• NumPy stands for Numerical Python.
• It is the fundamental package for scientific computing with Python.
• NumPy is a Python library used for working with arrays.
• Using the Numpy libraries, mathematical operations on arrays are executed very efficiently.

Numpy arrays are the main reason we use Numpy and they come in two flavors
• Vectors (1-d arrays)
• Matrices (2-d arrays)

Note: Numpy is available as a built-in library in Anaconda Python distribution

Note:

Python lists are similar to arrays, but they are slower to process because a list stores pointers to objects
scattered in memory. NumPy arrays, on the other hand, are stored in one contiguous block of memory with a
single element type, so their elements can be accessed and processed very efficiently.

NumPy Arrays Vs Python Sequences(Lists, Tuples, Dictionaries & Sets)

Note: Remember that the array module in core Python supports 1-D arrays only.

Why use NumPy Arrays

NumPy Datatypes

Program: NumPy Demo1

Note: np.array() converts a normal Python list to an array

Program: NumPy Demo2 – transforming the 1D array into 2D array
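
One possible version of this demo: create a 1-D array from a Python list with np.array() and transform it with reshape(). The values are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])   # 1-D array from a normal Python list
b = a.reshape(2, 3)                # the same 6 values as 2 rows x 3 columns
```

The new shape must hold exactly as many elements as the original array (here 2 x 3 = 6).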

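Note: Dictionaries and sets should be converted to a list first (e.g. list(my_set) or list(my_dict.values())); passing them directly to np.array() produces a 0-dimensional object array rather than an array of their elements.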

Program: NumPy avoids copies wherever possible - Demo

Program: If we need a true copy of an array - demo

Note: Notice that “a” hasn’t changed now.

Note: If we use the copy(), only the values are copied into the new location (different address).
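
A minimal sketch of both demos, using slicing as one common case where NumPy returns a view instead of a copy. The variable names follow the text where possible; the values are made up.

```python
import numpy as np

a = np.arange(5)     # [0 1 2 3 4]

b = a[1:4]           # slicing gives a view on a's data, not a copy
b[0] = 99            # so this also changes a[1]

c = a.copy()         # a true copy, stored at a different address
c[0] = -1            # "a" does not change this time
```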

Program: converting nested lists to a matrix (2-D array)

Note: Had we passed simple list we would have got 1-D array

Program: To demonstrate that in NumPy arrays, the operations are propagated to the individual elements.

Note: In Python Core, the elements get repeated
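
The contrast can be sketched in a few lines (values are illustrative):

```python
import numpy as np

nums = [1, 2, 3]
arr = np.array(nums)

doubled_list = nums * 2    # core Python: the elements get repeated
doubled_arr = arr * 2      # NumPy: every element is multiplied by 2
summed_arr = arr + arr     # NumPy: element-by-element addition
```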

Program: Indexing Demo1

Note: 2, 3, 4 are indices.


Note: The last two results are examples of conditional selection

Program: Indexing Demo2

Note: This assignment is known as broadcasting

Notice that all values > 50 are changed to 50 and all values < 15 are changed to 15.
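
A minimal sketch of this kind of assignment, with made-up values. np.clip() is shown as an alternative that gives the same result without modifying the original array.

```python
import numpy as np

arr = np.array([5, 20, 75, 40, 10, 60])

clipped = np.clip(arr, 15, 50)   # alternative: returns a new, limited array

arr[arr > 50] = 50   # the single value 50 is broadcast to every selected element
arr[arr < 15] = 15   # likewise for the lower limit
```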

Eg: salaries of employees must be between a given range only like between 10000 and 1 Lakh only.

Program: Array Indexing Demo3 – Generating arrays using arange()

Program: Generating arrays using arange() – Ex2

Note: Press tab for displaying code suggestions

Note: We can even give a fraction value as step value.

Program: To generate zeros

Program: To generate ones

Program: Demo on linspace()

Note: generating values from start to end based on number of points we want

Note: To generate linearly spaced points

Note: We can even give negative values
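
A short sketch contrasting arange() (start, stop, step) with linspace() (start, end, number of points). The values are illustrative.

```python
import numpy as np

evens = np.arange(0, 10, 2)         # stop value 10 is excluded
quarters = np.arange(0, 1, 0.25)    # a fractional step is allowed

points = np.linspace(0, 1, 5)       # 5 linearly spaced points, end value included
symmetric = np.linspace(-1, 1, 3)   # negative values are allowed
```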

Program: To generate identity matrix

Program: To generate random numbers between 0 to 1

Note: For each execution the output may differ.

Program: Generating random integers

Arguments: low (inclusive), high (exclusive), total no. of elements (size)
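
A minimal sketch of the two random generators, using np.random.rand() and np.random.randint(). The output differs on each execution, so only the shapes and ranges are fixed.

```python
import numpy as np

fractions = np.random.rand(3)        # 3 floats, each >= 0 and < 1

# 10 integers from 1 to 6: the upper limit 7 itself is never produced.
dice = np.random.randint(1, 7, 10)
```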

Shortcut way:

Program: Finding index of max value and min value in an array

Program: Knowing datatype of an array

Program: Slicing / Indexing 2-D arrays

NumPy Operations

• Array with Array


• Array with Scalars
• Universal Array Functions

Program: Performing arithmetic operations between two 1-D arrays

Note: The two arrays must have the same shape (or shapes that are compatible under broadcasting)

Program: Applying a scalar value to a 1-D array

Program: Using Universal array Functions – on 1-D arrays
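A minimal sketch with a few universal functions, assuming illustrative values (the choice of functions is an assumption):

```python
import numpy as np

arr = np.array([1, 4, 9, 16])

roots = np.sqrt(arr)        # element-wise square root: [1. 2. 3. 4.]
squares = np.square(arr)    # element-wise square
largest = np.max(arr)       # reduction to a single value: 16

print(roots, squares, largest)
```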

Program: Comparing Arrays
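A minimal sketch, assuming two illustrative arrays of the same size:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 2, 1])

print(a == b)    # [False  True False]
print(a > b)     # [False False  True]
print(a <= b)    # [ True  True False]
```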

Note: We can use the relational operators (<, >, <=, >=, == and !=) to compare arrays of the same size. These
operators compare the corresponding elements of the arrays and return another array of Boolean values.

Program: Demo on any() and all()

Note: any() determines whether at least one element of the array is True, whereas all() determines
whether every element of the array is True.

These functions return a single boolean value, either True or False.
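A minimal sketch of any() and all(), assuming illustrative arrays:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 2, 1])

c = a < b                # [ True False False]

print(np.any(c))         # True: at least one element is True
print(np.all(c))         # False: not every element is True
```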

Program: Demo on logical_and(), and logical_or() functions
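A minimal sketch, assuming illustrative values and conditions:

```python
import numpy as np

x = np.array([5, 15, 25, 35])

both = np.logical_and(x > 10, x < 30)     # True where both conditions hold
either = np.logical_or(x < 10, x > 30)    # True where at least one holds

print(both)      # [False  True  True False]
print(either)    # [ True False False  True]
```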

Program: Demo on where()

Note: where() can be used to create a new array (subgroup) based on whether a given condition is True or False.

Syntax: newarray = where(condition, expression1, expression2)


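Where the condition is True, the element is taken from expression1; otherwise it is taken from expression2. Both expressions are evaluated in full; where() only selects between their results element by element.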
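A minimal sketch of where(), assuming illustrative values (keep even numbers, replace odd ones with 0):

```python
import numpy as np

a = np.array([10, 21, 30, 41])

evens_only = np.where(a % 2 == 0, a, 0)   # pick from a where even, else 0
print(evens_only)                         # [10  0 30  0]
```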

Note: This compares the corresponding elements of two arrays and retrieves the bigger element of each pair.
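Picking the bigger element of each pair can be sketched with where(), assuming illustrative arrays; np.maximum() is shown as an equivalent shortcut:

```python
import numpy as np

a = np.array([10, 50, 30])
b = np.array([20, 40, 60])

bigger = np.where(a > b, a, b)    # element from a where a > b, else from b
print(bigger)                     # [20 50 60]
print(np.maximum(a, b))           # same result
```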

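Note: The default integer type depends on the platform: it is int64 on most 64-bit systems, and it was int32 on Windows before NumPy 2.0. To get a specific width such as int16 or int64, request it explicitly with np.int16 or np.int64.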

Program: To retrieve non-zero elements from an array

To get the elements:

arr[result]
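A minimal sketch using np.nonzero(), assuming illustrative values (the name "result" follows the text above):

```python
import numpy as np

arr = np.array([1, 0, 2, 0, 3])

result = np.nonzero(arr)    # tuple holding the indices of non-zero elements
print(result[0])            # [0 2 4]
print(arr[result])          # [1 2 3]
```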

Program: Aliasing arrays

Note: Suppose “a” is an array and we assign it to “b”:

b=a

This is a simple assignment that does not make any new copy. It means “b” is not a new array and memory is
not allocated to “b” elements. Here there is only one array referenced/pointed by two variables.

This is known as aliasing, which means giving another name to an existing object. Hence any modification made
to the array via one variable is reflected through the 2nd variable as well.

To prove that both point to the same memory locations, check their ids

id(a)

id(b)

They would be the same.
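A minimal sketch of aliasing, assuming illustrative values:

```python
import numpy as np

a = np.arange(5)
b = a               # no copy: "b" is just another name for the same array

b[0] = 99           # the change is visible through "a" as well
print(a)            # [99  1  2  3  4]
print(id(a) == id(b))   # True
print(b is a)           # True
```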

Program: Viewing arrays (shallow copying)

Note: This is done by view(). It creates a new array object that shares the data buffer of the existing array; the
elements themselves are not copied. Because the data is shared, modifying the elements through the new array also modifies the original array.

The two variables refer to different array objects (even though the element data is shared). To confirm this, check their ids:

id(a)

id(b)

Now, they would be different
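A minimal sketch of view(), assuming illustrative values; np.shares_memory() is used here to confirm that the two array objects share their element data:

```python
import numpy as np

a = np.arange(5)
b = a.view()        # new array object over the same data

b[0] = 99           # changes the shared data
print(a)            # [99  1  2  3  4]
print(id(a) == id(b))            # False: two distinct objects
print(np.shares_memory(a, b))    # True: one underlying buffer
```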

Program: Copying arrays (deep copying)

Note: Here we use copy(). The new array is independent, so modifying one array does not affect
the 2nd one.

To prove that they are stored in different memory locations, check their ids

id(a)

id(b)

They would be different
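A minimal sketch of copy(), assuming illustrative values:

```python
import numpy as np

a = np.arange(5)
b = a.copy()        # independent array with its own data

b[0] = 99           # only the copy changes
print(a)            # [0 1 2 3 4]
print(b)            # [99  1  2  3  4]
print(np.shares_memory(a, b))    # False
```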

Program: Attributes of an Array – ndim
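A minimal sketch of the ndim attribute, assuming illustrative arrays:

```python
import numpy as np

one_d = np.array([1, 2, 3])
two_d = np.array([[1, 2, 3], [4, 5, 6]])

print(one_d.ndim)   # 1
print(two_d.ndim)   # 2
```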

Program: Attributes of an array – shape

Note: The “shape” attribute gives the shape of an array. The shape is a tuple listing the number of elements
along each dimension. A dimension is also called an axis in NumPy.

For a 1D array, shape is a one-element tuple giving the no. of elements, e.g. (5,).

For a 2D array, shape gives the no. of rows and the no. of columns, e.g. (2, 3).

Note: We can also change the shape using this “shape” attribute

Note: reshape() doesn’t modify the original array’s shape, whereas shape member modifies the original array’s
shape.
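A minimal sketch contrasting reshape() with assignment to the shape attribute, assuming illustrative values:

```python
import numpy as np

a = np.arange(6)
print(a.shape)          # (6,)

r = a.reshape(2, 3)     # returns a new array object; "a" keeps its shape
print(r.shape)          # (2, 3)
print(a.shape)          # (6,)

a.shape = (3, 2)        # changes the shape of "a" itself, in place
print(a.shape)          # (3, 2)
```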

Program: Attributes of an array – size, itemsize, dtype, nbytes

(a) size attribute

Note: the “size” attribute gives the total no. of elements in the array.

(b) itemsize attribute

Note: the “itemsize” attribute gives the memory size of an array element in bytes

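Note: With the default 64-bit integer type (int64), each element occupies 8 bytes. On platforms where the default is int32 (for example, Windows before NumPy 2.0), each element occupies 4 bytes.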

(c) dtype attribute

Note: the “dtype” attribute gives the datatype of the elements in the array

Note: To mention the datatype explicitly while creating the array:

(d) nbytes attribute

Note: It gives the total number of bytes occupied by an array
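A minimal sketch of the four attributes, assuming illustrative values; the dtype is set explicitly to int32 so that the byte counts do not depend on the platform:

```python
import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.int32)

print(a.size)       # 4  -> total number of elements
print(a.itemsize)   # 4  -> bytes per element for int32
print(a.dtype)      # int32
print(a.nbytes)     # 16 -> size * itemsize
```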

Program: Demo on reshape()

Note: reshape() is used to change the shape of an array. The new array should have the same no. of elements as
in the original array.
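A minimal sketch of reshape(), assuming illustrative sizes; the try/except shows what happens when the element counts do not match:

```python
import numpy as np

a = np.arange(12)
b = a.reshape(3, 4)        # 12 elements -> 3 rows x 4 columns
print(b.shape)             # (3, 4)

try:
    a.reshape(5, 5)        # 25 != 12, so this is rejected
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)   # True
```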

Program: Demo on flatten()

Note: It returns a copy of the input array, flattened to one dimension.
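A minimal sketch of flatten(), assuming illustrative values:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
f = m.flatten()     # 1-D copy of the data

f[0] = 99           # the original is not affected, since f is a copy
print(f)            # [99  2  3  4]
print(m)            # still [[1 2] [3 4]]
```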

Program: Matrix object in numpy – demo1

Note: In mathematics, a matrix is a rectangular array of elements arranged in rows and columns. A matrix
with only 1 row is called a “row matrix”, and a matrix with only 1 column is called a “column
matrix”. Both correspond to vectors, although a numpy matrix object always stores them as 2-D (shape 1 x n or n x 1).

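To work with matrices, numpy provides a special object called “matrix”. In numpy, a matrix is a specialized 2D
array that retains its 2D nature through operations. The NumPy documentation no longer recommends this class for new code; regular 2-D arrays support the same operations.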

Another way (by directly passing the elements):

Yet Another way (by passing a string with elements with semicolons after each row):
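A minimal sketch of the three ways of creating a matrix object, assuming illustrative values (the variable names are assumptions):

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

m1 = np.matrix(arr)                 # from an existing 2-D array
m2 = np.matrix([[1, 2], [3, 4]])    # by passing the elements directly
m3 = np.matrix("1 2; 3 4")          # from a string, rows separated by semicolons

print(m1)
print(type(m3))
```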

Program: Getting diagonal elements of a Matrix

Note: We can find the diagonal even of a normal 2-D array

For reverse diagonal:
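A minimal sketch on a normal 2-D array, assuming illustrative values; flipping the columns with np.fliplr() is one way (an assumption here) to reach the reverse diagonal:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

main_diag = np.diagonal(a)                # [1 5 9]
anti_diag = np.diagonal(np.fliplr(a))     # [3 5 7]

print(main_diag)
print(anti_diag)
```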

Program: Finding Max and Min elements in a Matrix
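A minimal sketch, assuming an illustrative 3 x 3 matrix:

```python
import numpy as np

m = np.matrix("1 2 3; 4 5 6; 7 8 9")

print(m.max())    # 9
print(m.min())    # 1
```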

Program: Finding Sum and Average of Matrix elements
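A minimal sketch, assuming an illustrative 3 x 3 matrix:

```python
import numpy as np

m = np.matrix("1 2 3; 4 5 6; 7 8 9")

print(m.sum())           # 45
print(m.mean())          # 5.0
print(m.sum(axis=0))     # column sums: [[12 15 18]]
```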

Program: Product of Matrix elements
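A minimal sketch, assuming the matrix holds the values 1 to 9 (an assumption chosen to be consistent with the column and row products quoted in the text):

```python
import numpy as np

m = np.matrix("1 2 3; 4 5 6; 7 8 9")

print(m.prod())          # 362880
print(m.prod(axis=0))    # column products: [[ 28  80 162]]
print(m.prod(axis=1))    # row products: 6, 120, 504 (as a column)
```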

i.e., 28 * 80 * 162 (the product of the column products) or 6 * 120 * 504 (the product of the row products); both equal 362880.

Program: Sorting of Matrix elements

Step1. Let’s consider the following matrix:

Step2.

Note: The default is axis = -1 (the last axis), which is axis = 1 for a 2-D array or matrix, so each row is sorted.
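A minimal sketch of both steps using np.sort(), assuming an illustrative unsorted matrix:

```python
import numpy as np

# Step 1: an unsorted matrix
m = np.matrix("3 1 2; 9 7 8; 6 4 5")

# Step 2: sort along the default (last) axis, i.e. within each row
by_rows = np.sort(m)
# Sort along axis 0, i.e. within each column
by_cols = np.sort(m, axis=0)

print(by_rows)
print(by_cols)
```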

Program: Transpose of a Matrix
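A minimal sketch, assuming an illustrative 2 x 3 matrix:

```python
import numpy as np

m = np.matrix("1 2 3; 4 5 6")
t = m.T                 # rows become columns

print(t)
print(t.shape)          # (3, 2)
```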

Program: Addition of two matrices
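A minimal sketch, assuming two illustrative 2 x 2 matrices:

```python
import numpy as np

a = np.matrix("1 2; 3 4")
b = np.matrix("5 6; 7 8")

total = a + b           # element-wise addition
print(total)            # [[ 6  8] [10 12]]
```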

Program: Multiplication of two matrices
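A minimal sketch, assuming two illustrative 2 x 2 matrices. For matrix objects, * performs matrix multiplication; for normal 2-D arrays, * is element-wise and @ performs matrix multiplication:

```python
import numpy as np

a = np.matrix("1 2; 3 4")
b = np.matrix("5 6; 7 8")

product = a * b                          # matrix product
elementwise = np.multiply(a, b)          # element-by-element product
array_product = np.asarray(a) @ np.asarray(b)   # same matrix product on arrays

print(product)        # [[19 22] [43 50]]
print(elementwise)    # [[ 5 12] [21 32]]
```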
