UNIT II
GETTING STARTED WITH PANDAS
UNIT II Syllabus:
Getting Started with pandas: Introduction to pandas, Library Architecture, Features,
Applications, Data Structures, Series, DataFrame, Index Objects, Essential Functionality
(Reindexing, Dropping entries from an axis, Indexing, selection, and filtering), Sorting and
ranking, Summarizing and Computing Descriptive Statistics, Unique Values, Value Counts,
Handling Missing Data, filtering out missing data.
Introduction to pandas:
Pandas contains high-level data structures and manipulation tools designed to make
data analysis fast and easy in Python. Pandas is built on top of NumPy, which makes it
easy to use in NumPy-centric applications. It was designed to meet the following requirements:
• Data structures with labeled axes supporting automatic or explicit data alignment.
This prevents common errors resulting from misaligned data and working with
differently-indexed data coming from different sources.
• Integrated time series functionality.
• The same data structures handle both time series data and non-time series data.
• Arithmetic operations and reductions (like summing across an axis) that pass
on the metadata (axis labels).
• Flexible handling of missing data.
• Merge and other relational operations found in popular databases
(SQL-based, for example).
I wanted to be able to do all of these things in one place, preferably in a language
well suited to general purpose software development. Python was a good candidate
language for this, but at that time there was not an integrated set of data structures
and tools providing this functionality.
Over the last four years, pandas has matured into a quite large library capable of
solving a much broader set of data handling problems than I ever anticipated, but it
has expanded in its scope without compromising the simplicity and ease-of-use that
I desired from the very beginning. I hope that after reading this book, you will find it
to be just as much of an indispensable tool as I do.
Throughout the rest of the book, I use the following import conventions for pandas:
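The import listing did not survive in these notes. As an assumption based on common practice, the conventions used in the sketches in this unit are the following; the small Series at the end is only there to show that both spellings refer to the same class.

```python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Series and DataFrame are used so often that they are also imported directly;
# pd.Series and Series are the same class.
s = Series([1, 2, 3])
```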
Pandas Library Architecture:
The following list gives an idea of the hierarchy of the files within the Pandas
library. It reflects the layout of early pandas releases; several of these packages
have since been moved or removed:
1. pandas/core:
This part of the library contains the basic files for the data structures,
such as Series and DataFrame. There are various Python files within the core.
The most important of them are:
api.py: This imports the key modules that are used later throughout the
library.
base.py: This provides base classes, such as PandasObject and StringMixin,
for all the other classes.
common.py: This contains the common utility functions that help in handling
various data structures.
config.py: This helps to handle configurable objects found throughout the
package.
These are the essential Python modules that handle most of the work in the core
of Pandas.
2. pandas/src:
This contains algorithms which provide basic functionality to the library. The code
here is usually written in C or Cython.
3. pandas/io:
pandas/io is an essential part of the Pandas library architecture. It contains input
and output tools that help Pandas handle files of various formats. Essential
modules found here are:
api.py: This module handles various imports needed for input and output
functions.
auth.py: This module handles authentication and the methods dealing with it.
common.py: Functionality shared by the input and output functions is taken care
of by this module.
data.py: This module helps to handle data that is input or output.
4. pandas/tools:
pandas/tools contains auxiliary algorithms. These support operations such as
pivot, merge, join, and concatenation, and other functions for manipulating
data sets.
5. pandas/sparse:
This part consists of sparse versions of data structures such as DataFrame and
Series. A sparse version is meant for data in which most values are missing or unavailable.
6. pandas/stats:
This part of the Pandas library architecture contains panel and linear regression
as well as moving window regression. Various statistics-related functions
can be found in this portion.
7. pandas/util:
Various utilities, testing tools, and development tools can be found here. The classes
in pandas/util are used for testing and debugging any part of the library.
8. pandas/rpy:
It provides an interface to the R programming language through RPy2. Using Pandas
with both R and Python can give you a much better grasp of data analysis.
Pandas Features:
o It has a fast and efficient DataFrame object with default and customized
indexing.
o It supports reshaping and pivoting of data sets.
o It groups data for aggregations and transformations.
o It provides data alignment and integrated handling of missing data.
o It provides time series functionality.
o It processes data sets in different formats, such as matrix data, heterogeneous
tabular data, and time series.
o It handles many operations on data sets, such as subsetting, slicing,
filtering, grouping, re-ordering, and re-shaping.
o It integrates with other libraries such as SciPy and scikit-learn.
o It provides fast performance, and performance-critical code can be sped up
further with Cython.
Applications of Pandas:
In this list we will cover the most fundamental applications of Pandas:
1. Economics:
Economics relies constantly on data analysis. Analyzing data to find patterns
and to understand how the economy is growing in various sectors is
essential for economists. Therefore, many economists have started
using Python and Pandas to analyze large datasets. Pandas provides a
comprehensive set of tools, such as DataFrames and file handling. These tools help
considerably in accessing and manipulating data to get the desired results. Through
these applications of Pandas, economists around the world have been able to
carry out analyses that were previously impractical.
2. Recommendation Systems:
We have all used Spotify or Netflix and been amazed at the
recommendations provided by these services. These systems are products of deep
learning. Building models that provide recommendations is one of the most important
applications of Pandas. Such models are mostly written in Python, and Pandas is
one of the main Python libraries used for handling data in them. Pandas
is well suited to managing large amounts of data, and a recommendation
system is possible only by learning from and handling large masses of data. Functions like
groupby and map help considerably in making these systems possible.
3. Stock Prediction:
The stock market is extremely volatile. However, that does not mean that it cannot be
modeled. With the help of Pandas and a few other libraries like NumPy and
matplotlib, we can build models that attempt to predict how stock markets
will move. This is possible because there is a large amount of historical stock data that tells
us how stocks have behaved. By learning from this data, a model can
predict the next move with some accuracy. In addition, people can
automate the buying and selling of stocks with the help of such prediction models.
4. Neuroscience:
Understanding the nervous system has long been on the minds of humankind,
because there are many mysteries about our bodies that we have not yet
solved. Machine learning has helped this field immensely, supported by
the various applications of Pandas. Again, the data manipulation capabilities of
Pandas have played a major role in compiling the large amounts of data that have
helped neuroscientists understand patterns inside our bodies
and the effect of various factors on the nervous system.
5. Statistics:
Statistics has benefited greatly from the various applications of Pandas.
Since statistics deals with large amounts of data, a library like Pandas, which is built for data
handling, helps in many different ways. The mean, median, and
mode functions are just basic ones that help in performing statistical calculations. There
are many more complex functions associated with statistics, and Pandas plays a
large role in computing them reliably.
6. Advertising:
Advertising has taken a huge leap in the 21st Century. Nowadays advertising has
become very personalized which helps companies to get more and more customers.
This again has been possible largely because of machine learning and deep
learning. Models that go through customer data learn what
customers want, giving companies useful ideas for advertisements. There are
many applications of Pandas in this. The customer data is often prepared with the help
of this library, and many of the functions in Pandas also help with the analysis.
7. Analytics:
Analytics has become easier with the use of Pandas. Whether it is website
analytics or analytics of some other platform, Pandas handles it with its data
manipulation and handling capabilities. The plotting capabilities of Pandas play
a big role in this field too. It not only takes in data and displays it but also helps in
applying many functions over the data.
8. Natural Language Processing:
NLP, or Natural Language Processing, has attracted a great deal of attention.
The main goal is to decipher human language and
the nuances related to it. This is very difficult, but with the help of
Pandas and scikit-learn, it is easier to create an NLP model, which
can then be improved continuously with the help of other libraries and their
functions.
9. Big Data:
Pandas can also be used alongside Big Data tools. Python integrates well
with Hadoop and Spark, which gives Pandas access to data held in those systems.
Results can likewise be written back to Spark or Hadoop from Python.
10. Data Science:
Pandas and data science are almost synonymous. Most of the examples above are
themselves products of data science. It is a very broad umbrella that encompasses
anything that deals with analyzing data, and thus almost all applications of Pandas
fall under the scope of data science. Pandas is mainly used for processing the data.
Therefore, data science in Python without Pandas is very difficult.
Introduction to pandas Data Structures:
To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for
every problem, they provide a solid, easy-to-use basis for most applications.
Series:
A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index. The
simplest Series is formed from only an array of data:
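The original listing is missing from these notes. A minimal sketch of such a Series, with values chosen only for illustration, might look like this:

```python
import pandas as pd

# A Series built from a plain list; no index is supplied
obj = pd.Series([4, 7, -5, 3])
print(obj)  # index labels appear on the left, values on the right
```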
The string representation of a Series displayed interactively shows the index on the
left and the values on the right. Since we did not specify an index for the data, a
default one consisting of the integers 0 through N - 1 (where N is the length of the
data) is created. You can get the array representation and index object of the Series
via its values and index attributes, respectively:
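As a sketch with illustrative values (not the original listing), the two attributes can be inspected as follows. The second Series, with hypothetical labels "d", "b", "a", "c", shows that an explicit index can be supplied instead of the default one.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
values = obj.values      # NumPy array holding the data
index = obj.index        # default integer index 0 through 3

# A Series can also be given explicit labels
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
a_value = obj2["a"]      # label-based lookup
```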
DataFrame:
A DataFrame represents a tabular, spreadsheet-like data structure containing an
ordered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.).
The DataFrame has both a row and column index; it can be thought of as a dict of
Series, all sharing the same index. Compared with other DataFrame-like
structures you may have used before (like R’s data.frame), row-oriented and
column-oriented operations in DataFrame are treated roughly symmetrically. Under
the hood, the data is stored as one or more two-dimensional blocks rather than a list,
dict, or some other collection of one-dimensional arrays. The exact details of
DataFrame’s internals are far outside the scope of this book.
Note: While DataFrame stores the data internally in a two-dimensional format, you
can easily represent much higher-dimensional data in a tabular format using
hierarchical indexing, a subject of a later section and a key ingredient in many of the
more advanced data-handling features in pandas.
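The DataFrame listings are missing from these notes. One common way to build a DataFrame is from a dict of equal-length lists; the sketch below assumes small made-up data, and the column names are illustrative only.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2001, 2002],
    "pop": [1.5, 1.7, 2.4, 2.9],
}
# columns= fixes the order in which the columns appear
frame = pd.DataFrame(data, columns=["year", "state", "pop"])

years = frame["year"]        # a column is retrieved as a Series
first_row = frame.iloc[0]    # a row is retrieved by position with iloc
frame["debt"] = 0.0          # assigning to a new name adds a column
```

Each column keeps its own value type, while all columns share the same row index.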
Index Objects:
pandas’ Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when
constructing a Series or DataFrame is internally converted to an Index:
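A short sketch of this conversion, using illustrative labels. It also shows two properties of Index objects that are not stated in the text above: they are immutable and they support membership tests.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index            # the list of labels became an Index

# Index objects are immutable: item assignment raises TypeError
try:
    index[1] = "d"
    mutated = True
except TypeError:
    mutated = False

has_b = "b" in index         # an Index supports membership tests
```

Immutability is what makes it safe to share one Index among several data structures.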
Essential Functionality:
In this section, I’ll walk you through the fundamental mechanics of interacting with
the data contained in a Series or DataFrame. Upcoming chapters will delve more
deeply into data analysis and manipulation topics using pandas. This book is not
intended to serve as exhaustive documentation for the pandas library; I instead focus
on the most important features, leaving the less common (that is, more esoteric)
things for you to explore on your own.
Reindexing:
A critical method on pandas objects is reindex, which means to create a new object
with the data conformed to a new index. Consider a simple example:
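The listing is missing here, so the following is a minimal sketch with illustrative values. Calling reindex rearranges the data according to the new index and introduces missing values for labels that were not present; fill_value is one way to supply a default instead.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# "e" has no value in obj, so it becomes NaN in the result
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

# fill_value supplies a default for labels that were not present
obj3 = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)
```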
Dropping entries from an axis:
Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the drop
method will return a new object with the indicated value or values deleted from an
axis:
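A sketch of drop on a Series and on a DataFrame, using made-up labels. The original objects are left unchanged in each case.

```python
import pandas as pd

obj = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d", "e"])
new_obj = obj.drop("c")              # obj itself is unchanged
smaller = obj.drop(["d", "c"])       # several labels at once

frame = pd.DataFrame(
    [[0, 1], [2, 3], [4, 5]],
    index=["Ohio", "Colorado", "Utah"],
    columns=["one", "two"],
)
rows_dropped = frame.drop(["Colorado"])      # drop by row label (default axis)
cols_dropped = frame.drop("two", axis=1)     # axis=1 drops a column
```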
Indexing, selection, and filtering:
Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers. Here are some examples of
this:
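The example listings are missing from these notes; the sketch below uses illustrative values. Positional access is written with iloc here, which is an explicit alternative to passing integers to obj[...].

```python
import pandas as pd

obj = pd.Series([0.0, 1.0, 2.0, 3.0], index=["a", "b", "c", "d"])

by_label = obj["b"]              # single label
by_position = obj.iloc[1]        # explicit positional access
by_list = obj[["b", "a", "d"]]   # several labels, in the order requested
filtered = obj[obj < 2]          # boolean filtering
label_slice = obj["b":"c"]       # label slices include the endpoint
```

Unlike ordinary Python slicing, slicing with labels is inclusive of the end label.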
Sorting and ranking:
Sorting a data set by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object:
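A sketch of sort_index with illustrative labels. The last three lines go slightly beyond the sentence above and show sort_values and rank, which cover sorting by value and ranking; all data here is made up.

```python
import pandas as pd

obj = pd.Series(range(4), index=["d", "a", "b", "c"])
sorted_obj = obj.sort_index()                 # index order: a, b, c, d

frame = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7]],
    index=["three", "one"],
    columns=["d", "a", "b", "c"],
)
by_rows = frame.sort_index()                  # sort by row labels
by_cols = frame.sort_index(axis=1)            # sort by column labels
descending = frame.sort_index(axis=1, ascending=False)

values = pd.Series([7, -5, 7, 4])
by_value = values.sort_values()               # sort by the data, not the labels
ranks = values.rank()                         # tied values share the average rank
```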
Summarizing and Computing Descriptive Statistics:
pandas objects are equipped with a set of common mathematical and statistical
methods. Most of these fall into the category of reductions or summary statistics,
methods that extract a single value (like the sum or mean) from a Series or a Series of
values from the rows or columns of a DataFrame. Compared with the equivalent
methods of vanilla NumPy arrays, they are all built from the ground up to exclude
missing data. Consider a small DataFrame:
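The DataFrame from the original listing is not available, so this sketch assumes a small frame with a few missing values. It shows reductions along each axis, the skipna option, and describe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()                            # one value per column, NaN skipped
row_sums = df.sum(axis=1)                      # one value per row
strict_means = df.mean(axis=1, skipna=False)   # NaN if any value in the row is missing
summary = df.describe()                        # several summary statistics at once
```

Passing skipna=False turns off the default exclusion of missing data.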
Unique Values, Value Counts, and Membership:
Another class of related methods extracts information about the values contained in
a one-dimensional Series. To illustrate these, consider this example:
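The example data is missing from these notes. A minimal sketch with an illustrative Series of letters shows the three methods named in the heading:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()               # distinct values in order of first appearance
counts = obj.value_counts()          # frequency of each value
mask = obj.isin(["b", "c"])          # boolean membership test
subset = obj[mask]                   # keep only the matching entries
```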
Handling Missing Data:
Missing data is common in most data analysis applications. One of the goals in
designing pandas was to make working with missing data as painless as possible.
For example, all of the descriptive statistics on pandas objects exclude missing data
as you’ve seen earlier in the chapter.
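By default pandas uses the floating-point value NaN as its marker for missing data. The sketch below, with illustrative values, shows how missing entries can be detected with isnull and how reductions skip them.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(["aardvark", "artichoke", np.nan, "avocado"])
missing = string_data.isnull()       # True where the value is missing

numbers = pd.Series([1.0, np.nan, 3.0])
total = numbers.sum()                # the NaN is skipped
average = numbers.mean()             # mean of the two non-missing values
```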
I do not claim that pandas’s NA representation is optimal, but it is simple and
reasonably consistent. It’s the best solution, with good all-around performance
characteristics and a simple API, that I could concoct in the absence of a true NA
data type or bit pattern in NumPy’s data types. Ongoing development work in
NumPy may change this in the future.
Filtering Out Missing Data:
You have a number of options for filtering out missing data. While doing it by hand
is always an option, dropna can be very helpful. On a Series, it returns the Series with
only the non-null data and index values:
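A sketch of dropna with made-up data. The DataFrame part goes slightly beyond the sentence above and shows the how option, which controls whether a row is dropped when any value or only when every value is missing.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])
cleaned = data.dropna()                  # keeps labels 0, 2 and 4
same = data[data.notnull()]              # equivalent boolean-indexing form

frame = pd.DataFrame(
    [[1.0, 6.5, 3.0], [1.0, np.nan, np.nan], [np.nan, np.nan, np.nan]]
)
any_missing = frame.dropna()             # drops rows containing any NaN
all_missing = frame.dropna(how="all")    # drops only rows that are entirely NaN
```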