Unit Ii Getting Started With Pandas

Pandas is a Python library used for data manipulation and analysis. It provides high-performance, user-friendly data structures and data analysis tools to make working with structured data fast, easy, and expressive. Some key features of Pandas include labeled axes for data alignment, integrated time series functionality, handling of missing data, and relational operations. Pandas has applications in domains like economics, recommendation systems, stock prediction, neuroscience, statistics, advertising, and analytics due to its ability to efficiently handle large datasets.

Uploaded by

T. Sruthi


UNIT II
GETTING STARTED WITH PANDAS

UNIT II Syllabus:
Getting Started with pandas: Introduction to pandas, Library Architecture, Features,
Applications, Data Structures, Series, DataFrame, Index Objects, Essential Functionality
(Reindexing, Dropping entries from an axis, Indexing, selection, and filtering), Sorting and
ranking, Summarizing and Computing Descriptive Statistics, Unique Values, Value Counts,
Handling Missing Data, Filtering out missing data.


Introduction to pandas:
Pandas contains high-level data structures and manipulation tools designed to make
data analysis fast and easy in Python. Pandas is built on top of NumPy and makes it
easy to use in NumPy-centric applications. Consider the following requirements:

• Data structures with labeled axes supporting automatic or explicit data alignment.
This prevents common errors resulting from misaligned data and working with
differently-indexed data coming from different sources.
• Integrated time series functionality.
• The same data structures handle both time series data and non-time series data.
• Arithmetic operations and reductions (like summing across an axis) that pass
on the metadata (axis labels).
• Flexible handling of missing data.
• Merge and other relational operations found in popular databases
(SQL-based, for example).

I wanted to be able to do all of these things in one place, preferably in a language
well suited to general purpose software development. Python was a good candidate
language for this, but at that time there was not an integrated set of data structures
and tools providing this functionality.

Over the last four years, pandas has matured into a quite large library capable of
solving a much broader set of data handling problems than I ever anticipated, but it
has expanded in its scope without compromising the simplicity and ease-of-use that
I desired from the very beginning. I hope that after reading this book, you will find it
to be just as much of an indispensable tool as I do.

Throughout the rest of the book, I use the following import conventions for pandas:
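A minimal sketch of the convention most commonly seen, assuming pandas and NumPy are installed. Importing Series and DataFrame into the local namespace is optional shorthand.

```python
import numpy as np
import pandas as pd

# Optional: bring the two main classes into the local namespace,
# since they are used so frequently.
from pandas import Series, DataFrame
```

With these imports, `pd.Series` and `Series` refer to the same class.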

Pandas Library Architecture:

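The following list gives an idea of the hierarchy of the files within the pandas
library. It reflects the layout of older pandas releases; several of these packages
have since been reorganized or removed: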

1. pandas/core:

This part consists of the basic files for the data structures present within the
library, for example Series and DataFrame. There are various Python files within
the core. The most important of them are:

• api.py: Imports the key modules that are used later.
• base.py: Provides the base for the other classes present, like PandasObject
and StringMixin.
• common.py: Contains the common utility methods that help in handling
various data structures.
• config.py: Helps to handle configurable objects found throughout the
package.

These are the essential Python modules that handle most of the work in the core
of pandas.

2. pandas/src:

This contains algorithms which provide basic functionality to the library. The code
here is usually written in C or Cython.

3. pandas/io:

pandas/io is an essential part of the pandas library architecture. It contains input
and output tools that help pandas handle files of various formats. Essential
modules found here are:

• api.py: Handles the various imports needed for input and output functions.
• auth.py: Handles authentication and the methods dealing with it.
• common.py: Takes care of functionality common to the input and output
functions.
• data.py: Helps to handle data that is input or output.

4. pandas/tools:

pandas/tools contains auxiliary algorithms. These support functions such as
pivot, merge, join, and concatenation, along with other functions for manipulating
data sets.

5. pandas/sparse:

This part consists of sparse versions of various data structures like DataFrame and
Series. A sparse version is suited to data in which most values are missing or
unavailable.

6. pandas/stats:

This part of the pandas library architecture contains panel and linear regression,
as well as moving-window regression. Various statistics-related functions
can be found in this portion.

7. pandas/util:

Various utilities, testing tools, and development tools can be found here. The
classes in pandas/util are used for testing and debugging any part of the library.


8. pandas/rpy:

It provides an interface for connecting to the R programming language, called RPy2.
Using pandas with both R and Python can give you a much better grasp of data analysis.

Pandas Features:

o It has a fast and efficient DataFrame object with default and customized
indexing.
o It supports reshaping and pivoting of data sets.
o It groups data for aggregations and transformations.
o It provides data alignment and integrated handling of missing data.
o It provides time series functionality.
o It processes data sets in a variety of forms, such as matrix data, heterogeneous
tabular data, and time series.
o It handles multiple operations on data sets, such as subsetting, slicing,
filtering, groupby, re-ordering, and re-shaping.
o It integrates with other libraries such as SciPy and scikit-learn.
o It provides fast performance, and if you want to speed it up even more, you can
use Cython.
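
A few of the features listed above (grouping, pivoting, and filtering) can be sketched in a short example. The sales table, its column names, and its values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})

# Group by one column and aggregate another.
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Reshape: one row per region, one column per quarter.
wide = sales.pivot(index="region", columns="quarter", values="revenue")

# Filter rows with a boolean condition.
large = sales[sales["revenue"] > 90]

print(revenue_by_region)
print(wide)
print(large)
```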

Applications of Pandas:
In this list we will cover the most fundamental applications of Pandas:

1. Economics:

Economics depends constantly on data analysis. Analyzing data to find patterns
and to understand trends in how various sectors of the economy are growing is
essential for economists. Therefore, many economists have started
using Python and Pandas to analyze huge datasets. Pandas provides a
comprehensive set of tools, such as DataFrames and file handling. These tools help
immensely in accessing and manipulating data to get the desired results. Through
these applications of Pandas, economists around the world have been able to
make new breakthroughs.

2. Recommendation Systems:

We have all used Spotify or Netflix and been amazed at the brilliant
recommendations provided by these sites. These systems are built with machine
learning, often deep learning. Building models that provide such recommendations
is one of the most important applications of Pandas. These models are mostly
written in Python, and Pandas, being one of the main libraries of Python, is used
when handling data in such models. Pandas is well suited to managing large amounts
of data, and a recommendation system is possible only by learning from and handling
huge masses of data. Functions like groupby and mapping help tremendously in making
these systems possible.

3. Stock Prediction:

The stock market is extremely volatile. However, that does not mean that it cannot be
modeled. With the help of Pandas and a few other libraries like NumPy and
matplotlib, we can build models that attempt to predict how the stock markets
turn out. This is possible because there is a lot of historical stock data that tells
us how stocks have behaved. By learning from this data, a model can
predict the next move with some accuracy. Not only this, but people can
also automate the buying and selling of stocks with the help of such prediction models.

4. Neuroscience:

Understanding the nervous system has always been on the minds of humankind
because there are many mysteries about our bodies that we have not
yet solved. Machine learning has helped this field immensely, with the help of
the various applications of Pandas. Again, the data manipulation capabilities of
Pandas have played a major role in compiling huge amounts of data, which has
helped neuroscientists understand trends within our bodies
and the effect of various things on our entire nervous system.

5. Statistics:

Statistics has also benefited from the various applications of Pandas.
Since statistics deals with a lot of data, a library like Pandas, which deals with data
handling, helps in many different ways. The mean, median, and mode functions are
only the most basic ones for performing statistical calculations. There
are many other, more complex functions associated with statistics, and Pandas plays a
large role in making them convenient to apply.
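
As a minimal sketch of the basic functions mentioned above, the following computes the mean, median, and mode of a Series. The scores are invented for illustration.

```python
import pandas as pd

# Hypothetical exam scores, invented for illustration.
scores = pd.Series([70, 85, 85, 90, 100])

mean_score = scores.mean()      # arithmetic mean
median_score = scores.median()  # middle value of the sorted data
mode_score = scores.mode()      # a Series, because several values can tie

print(mean_score, median_score, mode_score.tolist())
```

Note that `mode` returns a Series rather than a single number, since more than one value can be the most frequent.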

6. Advertising:

Advertising has taken a huge leap in the 21st century. Nowadays advertising has
become very personalized, which helps companies to gain more and more customers.
This again has been possible largely because of machine learning and deep
learning. Models going through customer data learn to understand what exactly the
customer wants, providing companies with great advertisement ideas. There are
many applications of Pandas in this. The customer data is often processed with the help
of this library, and many of the functions present in Pandas also help.

7. Analytics:

Analytics has become easier than ever with the use of Pandas. Whether it is website
analytics or analytics of some other platform, Pandas handles it with its data
manipulation and handling capabilities. The visualization capabilities of Pandas play
a big role in this field too. It not only takes in data and displays it but also helps in
applying many functions over the data.

8. Natural Language Processing:

NLP, or Natural Language Processing, has taken the world by storm and is
creating a lot of buzz. The main concept is to decipher human language and
the nuances related to it. This is very difficult, but with the help of the various
applications of Pandas and scikit-learn, it is easier to create an NLP model, which
can be improved continuously with the help of various other libraries and their
functions.

9. Big Data:

One of the applications of Pandas is that it can work with big data too. Python has
good connections with Hadoop and Spark, allowing Pandas to have access to big
data. With suitable connector libraries, one can also write data to Spark or Hadoop
with the help of Pandas.

10. Data Science:

Pandas and data science are almost synonymous. Most of the examples above are a
product of data science itself. It is a very broad umbrella that encompasses
anything that deals with analyzing data, and thus almost all applications of Pandas
fall under the scope of data science. Pandas is mainly used for processing the data.
Therefore, data science in Python without Pandas is very difficult.

Introduction to pandas Data Structures:

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for
every problem, they provide a solid, easy-to-use basis for most applications.

Series:
A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index. The
simplest Series is formed from only an array of data:
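
A minimal sketch, assuming pandas is imported as `pd`; the four values are arbitrary illustrative data.

```python
import pandas as pd

# Build a Series from a plain list; no index is supplied.
obj = pd.Series([4, 7, -5, 3])

# Printing shows the index in the left column and the values in the right.
print(obj)
```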


The string representation of a Series displayed interactively shows the index on the
left and the values on the right. Since we did not specify an index for the data, a
default one consisting of the integers 0 through N - 1 (where N is the length of the
data) is created. You can get the array representation and index object of the Series
via its values and index attributes, respectively:
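
A short sketch of both attributes, using the same arbitrary data. The second Series, with hand-picked labels, is added here only to contrast with the default index.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

values = obj.values  # the underlying NumPy array
index = obj.index    # the default integer index, 0 through N - 1

# An explicit index can be supplied when the Series is created.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

print(values)
print(index)
print(obj2["a"])
```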

DataFrame:
A DataFrame represents a tabular, spreadsheet-like data structure containing an
ordered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.).
The DataFrame has both a row and column index; it can be thought of as a dict of
Series, all sharing the same index. Compared with other DataFrame-like
structures you may have used before (like R’s data.frame), row-oriented and
column-oriented operations in DataFrame are treated roughly symmetrically. Under
the hood, the data is stored as one or more two-dimensional blocks rather than a list,
dict, or some other collection of one-dimensional arrays. The exact details of
DataFrame’s internals are far outside the scope of this book.
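
One common way to build a DataFrame, sketched here under the assumption that pandas is imported as `pd`, is from a dict of equal-length lists. The column names and values are illustrative only.

```python
import pandas as pd

# Each dict key becomes a column; the lists must have equal length.
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(data)

years = frame["year"]  # selecting a column returns a Series
row = frame.loc[0]     # selecting a row by its index label also returns a Series

print(frame)
```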

Note: While DataFrame stores the data internally in a two-dimensional format, you
can easily represent much higher-dimensional data in a tabular format using
hierarchical indexing, a subject of a later section and a key ingredient in many of the
more advanced data-handling features in pandas.
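
As a rough sketch of the idea in the note, the following stores data keyed by two dimensions (city and year) in a one-dimensional Series with a two-level index. The cities, years, and temperatures are invented for illustration.

```python
import pandas as pd

# Hypothetical measurements keyed by (city, year).
index = pd.MultiIndex.from_tuples(
    [("Austin", 2020), ("Austin", 2021), ("Boston", 2020), ("Boston", 2021)],
    names=["city", "year"],
)
temps = pd.Series([20.1, 20.8, 11.3, 11.9], index=index)

# Partial indexing on the outer level selects all years for one city.
austin = temps.loc["Austin"]

# unstack moves the inner level into columns, giving a 2-D table.
table = temps.unstack("year")

print(table)
```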


Index Objects:
pandas’ Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when
constructing a Series or DataFrame is internally converted to an Index:
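
A small sketch of this conversion, with arbitrary labels. It also shows that an Index cannot be modified in place, which is what makes it safe to share an Index between data structures.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

# The list of labels was converted to an Index object.
index = obj.index
is_index = isinstance(index, pd.Index)

# Index objects are immutable: item assignment raises TypeError.
try:
    index[1] = "d"
    mutated = True
except TypeError:
    mutated = False

print(index, is_index, mutated)
```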

Essential Functionality:
In this section, I’ll walk you through the fundamental mechanics of interacting with
the data contained in a Series or DataFrame. Upcoming chapters will delve more
deeply into data analysis and manipulation topics using pandas. This book is not
intended to serve as exhaustive documentation for the pandas library; I instead focus
on the most important features, leaving the less common (that is, more esoteric)
things for you to explore on your own.
Reindexing:
A critical method on pandas objects is reindex, which means to create a new object
with the data conformed to a new index. Consider a simple example from above:
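
A minimal sketch with arbitrary data: reindex rearranges the values according to the new index and introduces missing values for labels that were not present. The `fill_value` argument is one option for replacing those gaps.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# "e" is not in the original index, so its value is missing (NaN).
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

# Missing labels can be filled with a chosen value instead.
obj3 = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

print(obj2)
print(obj3)
```

The original `obj` is left unchanged; reindex returns a new object.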

Dropping entries from an axis:

Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the drop
method will return a new object with the indicated value or values deleted from an
axis:
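
A short sketch with arbitrary labels and values, covering both a Series and a DataFrame:

```python
import pandas as pd

obj = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d", "e"])

new_obj = obj.drop("c")         # drop a single label
smaller = obj.drop(["d", "c"])  # drop several labels at once

frame = pd.DataFrame(
    {"one": [0, 4], "two": [1, 5]},
    index=["Ohio", "Colorado"],
)

# For a DataFrame, axis=1 drops a column instead of a row.
no_two = frame.drop("two", axis=1)

print(new_obj)
print(no_two)
```

In each case the original object keeps all of its entries.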

Indexing, selection, and filtering:

Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers. Here are some examples
of this:
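
A minimal sketch with arbitrary data. Positional access is written with `iloc` here, which keeps the distinction between labels and positions explicit.

```python
import pandas as pd

obj = pd.Series([0.0, 1.0, 2.0, 3.0], index=["a", "b", "c", "d"])

by_label = obj["b"]             # select by index label
by_position = obj.iloc[1]       # select by integer position
several = obj[["b", "a", "d"]]  # select several labels, in the given order
filtered = obj[obj < 2]         # boolean filtering
label_slice = obj["b":"c"]      # label slices include the end point

print(several)
print(filtered)
print(label_slice)
```

Unlike ordinary Python slicing, slicing with labels includes the end point.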

Sorting and ranking:

Sorting a data set by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object:
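
A short sketch with arbitrary labels, sorting a Series by its index and a DataFrame by either axis:

```python
import pandas as pd

obj = pd.Series(range(4), index=["d", "a", "b", "c"])
sorted_obj = obj.sort_index()

frame = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7]],
    index=["three", "one"],
    columns=["d", "a", "b", "c"],
)

by_rows = frame.sort_index()                         # sort by row labels
by_cols = frame.sort_index(axis=1, ascending=False)  # sort by column labels, descending

print(sorted_obj)
print(by_rows)
print(by_cols)
```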

Summarizing and Computing Descriptive Statistics:

pandas objects are equipped with a set of common mathematical and statistical
methods. Most of these fall into the category of reductions or summary statistics,
methods that extract a single value (like the sum or mean) from a Series or a Series of
values from the rows or columns of a DataFrame. Compared with the equivalent
methods of vanilla NumPy arrays, they are all built from the ground up to exclude
missing data. Consider a small DataFrame:
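
The following sketch builds a small DataFrame with some missing values (the numbers are arbitrary) and applies a few reductions. The `skipna` option controls whether missing values are excluded.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()        # one value per column; NaN entries are skipped
row_sums = df.sum(axis=1)  # one value per row

# With skipna=False, any NaN in a row makes that row's result NaN.
row_means_strict = df.mean(axis=1, skipna=False)

print(col_sums)
print(row_sums)
print(row_means_strict)
print(df.describe())       # several summary statistics at once
```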


Unique Values, Value Counts, and Membership:
Another class of related methods extracts information about the values contained in
a one-dimensional Series. To illustrate these, consider this example:
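
A minimal sketch with arbitrary letters as data, showing one method for each of the three ideas in the heading:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()       # distinct values, in order of first appearance
counts = obj.value_counts()  # how often each value occurs
mask = obj.isin(["b", "c"])  # membership test, one boolean per element

print(uniques)
print(counts)
print(obj[mask])
```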

Handling Missing Data:

Missing data is common in most data analysis applications. One of the goals in
designing pandas was to make working with missing data as painless as possible.

For example, all of the descriptive statistics on pandas objects exclude missing data
as you’ve seen earlier in the chapter.

I do not claim that pandas’s NA representation is optimal, but it is simple and
reasonably consistent. It’s the best solution, with good all-around performance
characteristics and a simple API, that I could concoct in the absence of a true NA
data type or bit pattern in NumPy’s data types. Ongoing development work in
NumPy may change this in the future.
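
As a small sketch of how this looks in practice, pandas treats both the floating-point value NaN and Python's None as missing, and `isna` detects either. The strings are arbitrary illustrative data.

```python
import numpy as np
import pandas as pd

# Both None and np.nan are treated as missing values.
string_data = pd.Series(["aardvark", None, np.nan, "avocado"])

missing = string_data.isna()  # True where the value is missing

print(missing)
```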

Filtering Out Missing Data:

You have a number of options for filtering out missing data. While doing it by hand
is always an option, dropna can be very helpful. On a Series, it returns the Series with
only the non-null data and index values:
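
A minimal sketch with arbitrary numbers, comparing dropna with the equivalent filtering done by hand:

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

cleaned = data.dropna()    # keep only the non-null entries
same = data[data.notna()]  # the same result via boolean indexing

print(cleaned)
```

The surviving entries keep their original index labels, so the result is indexed by 0, 2, and 4 rather than renumbered.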
