George H. Data Science From Scratch... 2020
Introduction
“A data scientist is a professional who outshines any software
engineer in statistics and outshines any statistician in software
engineering.”
- Josh Wills.
In the past decade, the demand for data scientists has increased in
the IT and business world. Many companies were gathering data on
their clients and needed experts to analyze it. A data scientist is a
person who combines statistics, programming, and research skills to
extract and interpret information from large amounts of data. Most
data scientists started their careers as data analysts or
statisticians and later expanded their roles to include programming.
Why Python?
There are a lot of arguments about the first language to master when
starting data science. A few people suggest R, some mention Java,
but the majority support Python. Python has been called “a great
introductory language”; it is an object-oriented, high-level
programming language. Guido van Rossum created the language; he
began work on it in 1989 and first released it in 1991. He named it
after his favorite comedy show, Monty Python's Flying Circus, and
drew many of its ideas from ABC, a language he had assisted in
creating.
Python can be used to develop online and offline games, create
graphical user interfaces, and handle network programming, data
analysis, scripting, machine learning, artificial intelligence, and
much more. As an object-oriented language, the objects and data
structures used in creating a program are subject to manipulation by
the user. Writing code in Python is very similar to writing
instructions in English. The computer cannot execute this source
code directly, so Python requires an interpreter to translate it
into machine-readable instructions. The code can only run after
translation.
Some programming languages become obsolete after a few years of
operation and are substituted with languages that are more effective
and relevant. Python is still relevant and very much useful despite
being 30 years old. That’s why it is very popular among people who
are learning how to program for the first time. It was chosen as an
introductory language for the following reasons:
1. Type exit()
2. Type quit()
3. Hold down control and Z, then press enter.
Chapter 2: Python 101
Python Syntax
Python syntax is the set of rules that specifies how users write
code and how the system interprets it. Before writing and running
programs in Python, you have to get accustomed to its syntax.
Indentation
A lot of programming languages separate blocks of code with curly
braces, but not Python. Rather, it uses indentation to set the
boundaries of a block of code.
Before you can grasp the purpose of indentation in Python, you have
to understand what a block of code means. A block of code is a
group of statements executed one after the other. Do you remember
what a statement is? It’s an executable command.
if opinion == "yes":            # line i
    print("I love Python")      # line ii
    loop = True                 # line iii
Lines i, ii, and iii make up the “if” block of code. The system runs
line i, then, if the condition is true, line ii and finally line iii.
The statements under line i are indented. You indent by pressing tab
at the start of the line. You can also use spaces to indent (four
spaces per level is the usual convention), but never mix spaces and
tabs in the same program. The level of indentation matters:
statements at the same level of indentation make up a block.
It’s possible to have more than one set of indentation, there’s no
limit. For example
>>> def house_rent_cost(weeks):
        cost = 35 * weeks
        if weeks >= 8:
            cost -= 70
        elif weeks >= 3:
            cost -= 20
        return cost
How many blocks are present in the code above?
NB:
Comments
Comments are notes added to code that describe or explain what it
does. Leaving a comment can help you and others reviewing your code
understand its purpose. A comment has no impact on the code, as the
Python interpreter automatically skips it. You can create a comment
at any point in the code by starting it with the hash symbol ‘#’. The
moment the interpreter recognizes the hash symbol, it skips the
words until it reaches the end of the line.
To write comments that span multiple lines, you can either start
each line with the hash symbol or surround the text with triple
quotes “““ ”””. Text in triple quotes is really a string that the
interpreter evaluates and discards; placed first in a function, it
becomes the function's documentation string.
def increase_income(rating, sal, percentage):
    # increase income of workers
    """increase rating based on rating and
    percentage
    rating 1- 6 10% increase"""
The rules above also guide the naming of every other type of
identifier.
1. Integer
2. Float Numbers
3. Complex Numbers
4. Long ( it’s now part of Integer )
A major advantage of using Python is that when you run your
program, it automatically recognizes the numeric data type even if
you don’t declare its type.
1. Integers
They are whole numbers that do not contain a decimal point. An
integer can be positive or negative, as long as it has no decimal
point or fractional part. There are four main types of integers:
- Regular integers: These are just regular numbers, e.g. 496,
-324, 17, etc.
- Octal literals: These are numbers written in base 8. To declare
this type of integer, you have to begin the number with 0o or 0O
(zero followed by a lower case o or upper case O, in that order).
Example
1. >>> z = 0x24567
>>> print(z)
148839 # the interpreter converted the hexadecimal z to base 10
2. >>> y = 0XABCD
>>> print(y)
43981
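The text above describes octal literals while the worked examples use the hexadecimal prefix 0x, so a short sketch covering the prefixes side by side may help. The variable names here are illustrative only and do not come from the book.

```python
# Integer literals written in other bases are stored as ordinary integers
# and shown in base 10 when printed.
octal_value = 0o17      # base 8: 1*8 + 7
hex_value = 0x1F        # base 16: 1*16 + 15
binary_value = 0b101    # base 2: 4 + 0 + 1

print(octal_value)      # 15
print(hex_value)        # 31
print(binary_value)     # 5

# int() parses a string written in a given base,
# and hex() converts an integer back to a prefixed string.
parsed = int("24567", 16)
print(parsed)           # 148839
print(hex(43981))       # 0xabcd
```

The last two lines reproduce the values from the examples above: 0x24567 is 148839 and 43981 is 0xABCD.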
2. Floating Numbers
These are real numbers with decimal points. They are popularly
referred to as floats. They can also be written in scientific
notation, where e means “times 10 to the power of”.
NB: Any integer value can be written as a float (7 as 7.0), but not
every float is an integer.
>>> 5.4e3
5400.0
>>> 5.4e2
540.0
3. Complex Numbers
These are numbers that contain both a real part and an imaginary
part.
E.g.
>>> z = 4 + 5j # 4 is the real part and 5j is the imaginary part
>>> y = 7 + 2j # j is the imaginary unit
>>> w = 3 + 6j
>>> u = z + y + w
>>> print(u)
(14+13j)
Strings
Strings are groups of letters and/or characters delimited with
quotation marks, single or double. Once a string is declared, it can’t
be changed.
How to Assign a String
To assign a string to a variable, you have to define it with quotation
marks: ' ', " ", or """ """. Triple quotation marks are used
for strings that spill over to another line.
Examples
1. How will you assign the string ‘blue’ to a variable
(bag_colour)?
Typing bag_colour = blue is wrong: without quotation marks, Python
treats blue as the name of a variable and raises a NameError if it
is not defined.
Solution: Type
>>> bag_colour = 'blue'
To verify the assignment, print bag_colour. The output should be
>>> print(bag_colour)
blue
2. Assign a multi-line string.
Solution:
>>> multi_line = ''' The road to becoming a fully qualified data scientist is
long, but with discipline and the right mindset, you can make it shorter.
There are no shortcuts to data science, but there are ways to shorten the
journey, reading this book is one of them. '''
>>> print(multi_line)
The road to becoming a fully qualified data scientist is long, but with discipline
and the right mindset, you can make it shorter.
There are no shortcuts to data science, but there are ways to shorten the
journey, reading this book is one of them.
NB:
Character   H   e   l   l   o       W   o   r   l   d   !
Index       0   1   2   3   4   5   6   7   8   9   10  11
The output:
>>> print ( string1[4:10])
o Worl
>>> print ( string1 [1:5])
ello
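A minimal sketch of string indexing and slicing, assuming string1 holds "Hello World!" (the value is inferred from the two slices printed above; the original assignment is not shown in this excerpt). The name shouted is illustrative.

```python
string1 = "Hello World!"   # assumed value, consistent with the slices above

print(string1[0])       # H  (indexing starts at 0)
print(string1[4:10])    # o Worl  (index 4 up to, but not including, 10)
print(string1[1:5])     # ello
print(string1[-1])      # !  (negative indices count from the end)
print(len(string1))     # 12

# Strings cannot be changed in place; methods such as upper()
# return a new string and leave the original untouched.
shouted = string1.upper()
print(shouted)          # HELLO WORLD!
print(string1)          # Hello World!
```

A slice [start:stop] always stops one position before stop, which is why string1[4:10] returns six characters.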
HELLO WORLD!
Chapter 4: In-built Python Features
Python Keywords
Python keywords are words that have a specific function in
programming. The words cannot be used to name a variable, define a
function, constant or any other type of identifier. Using a keyword
for a purpose different from its function will lead to problems when
running your program. The keywords are listed in
alphabetical order:
and as assert
break class continue
def del elif
else except False
finally for from
global if import
in is lambda
nonlocal not or
pass print raise
return True try
while with yield
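The exact keyword list depends on the Python version you run (for instance, print is a keyword only in Python 2). As a hedged sketch, the standard keyword module can report the list for your own interpreter; the identifier bag_colour is just a sample name.

```python
import keyword

# kwlist holds every reserved word for the interpreter running this code.
print(keyword.kwlist)

# iskeyword() tells you whether a name is reserved before you use it.
print(keyword.iskeyword("for"))          # True
print(keyword.iskeyword("bag_colour"))   # False
```

Checking a name this way is a quick safeguard before using it for a variable, function, or module.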
NB: The naming of a function follows the same rules that guide the
naming of a variable.
Examples
NB: When writing text in the print() function, you have to be careful
with spacing. You have to leave spaces at the appropriate places so
your text won’t get muddled up. Visualize how you want your
statement to look and write your code around that.
When you run the code written above, it will bring up something like
this
Hi, can you enter your full name ?
The text above is a prompt asking the user to enter the necessary
information. Press enter, after inputting the necessary information.
Luke Evans
Wow, your name sounds intelligent Luke Evans!
How old were you on your last birthday?
As you can see the spaces used in the code are appropriate and the
name entered was able to fit in well.
19
Really, you are 19 years old, Luke Evans!
Can you see how interactive the program is? It’s engaging the user
while asking for information.
At this point, if the answer of the user isn’t 18, the program will end.
quit() will close the program. But, if the answer is 18, the program
goes on.
if sport == 'yes':
    print("Awesome!")
    sport_type = input("What type of sport? ")
else:
    print("Thank you for filling the survey.")
    quit()
NB : How to use the ‘if’ and ‘else’ statement will also be explained
later .
max()
This function returns the highest value among a set of values or
variables.
Example
min()
This function returns the lowest value among a set of values or
variables. Examples
len()
This function returns the number of items inside a variable, such as
the characters in a string or the elements in a list.
Example
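The worked examples for max(), min(), and len() are missing from this excerpt, so here is a minimal sketch of all three. The list scores is made-up sample data, not taken from the book.

```python
scores = [12, 75, 87, 34, 45]   # hypothetical sample data

highest = max(scores)   # largest value in the list
lowest = min(scores)    # smallest value in the list
count = len(scores)     # number of items in the list

print(highest)          # 87
print(lowest)           # 12
print(count)            # 5

# max() and min() also accept separate values, and len() works on strings.
print(max(3, 9, 4))     # 9
print(min(3, 9, 4))     # 3
print(len("Python"))    # 6
```

Each function returns a value, so the result can be stored in a variable or passed straight to print().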
range()
The range() function is used to produce a sequence of numbers.
range(n) produces numbers that start from 0 and end at n-1. For
example, range(13) is equivalent to [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12]. For range(2, 15), the numbers start at 2 and end at 14
(15-1):
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
range(2, 15, 3) means the numbers will start from 2, stop before 15,
and move 3 numbers per step. It is equivalent to [2, 5, 8, 11, 14].
In Python 3, range() returns a range object rather than a list, so
wrap it in list() to see the numbers.
Syntax for range():
range(start, stop, size_of_step)
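The three forms of range() described above can be checked with a short sketch. list() is used so the numbers are visible in Python 3; the variable names are illustrative.

```python
first = list(range(13))         # start defaults to 0, stop is 13
second = list(range(2, 15))     # start at 2, stop before 15
third = list(range(2, 15, 3))   # start at 2, stop before 15, step of 3

print(first)    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(second)   # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
print(third)    # [2, 5, 8, 11, 14]
```

The stop value is never included, which is why range(2, 15) ends at 14.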
Chapter 5: Basic Operators
Python operators are symbols and words used to execute operations
on values and variables. There are seven categories of operators used
in Python:
1. Arithmetic Operators
2. Logical Operators
3. Relational Operators
4. Bitwise Operators
5. Membership Operators
6. Assignment Operators
7. Identity Operators.
For the purpose of this book, only 4 of the operators will be
discussed.
Arithmetic Operators
These are operators that perform mathematical operations. You can
use the operators to create algorithms that automatically solve
mathematical problems. There are seven types of arithmetic
operators:
i. Addition
This operator adds two or more values together. The same
addition symbol ‘+’ used for normal mathematical expressions is
used for addition in Python.
Examples
Both methods are correct. When you run it, 36 will display on the
next line.
ii. Subtraction
This operator subtracts one value from another. The symbol that
represents subtraction in Python is the hyphen ‘-’.
Example
iii. Multiplication
This operator multiplies two values. The symbol used for this
operation is different from the symbol used for normal mathematical
expressions. The asterisk ‘*’ symbol is used for multiplication in
Python.
Example
iv. Division.
This operator performs the division operation. The slash ‘/’ symbol
carries out the division operation in Python. In Python 2.7, ‘/’
performs floor division when both values are integers. To perform
normal division in Python 2.7, you have to import it by typing
>>> from __future__ import division
In Python 3, ‘/’ always performs normal division.
Example
v. Floor Division
This operator performs the division operation and rounds the
result down to the nearest whole number, dropping the decimal
part. The symbol for this operator is the double slash ‘//’.
Examples
vi. Exponent
The exponent operator performs the ‘raise to power’ function in
Python. The symbol used to perform an exponential calculation is
‘**’
Examples
vii. Modulus
This operator produces the remainder left over after performing a
division operation. The percentage symbol ‘%’ is used for modulus in
Python.
Examples
Operator          Symbol
Addition          +
Subtraction       -
Multiplication    *
Division          /
Floor division    //
Exponent          **
Modulus           %
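The individual examples for the seven arithmetic operators are missing from this excerpt. The sketch below runs each operator on the same pair of values; a and b are arbitrary sample numbers, and the __future__ import is included only so that ‘/’ behaves the same in Python 2.7 and Python 3.

```python
from __future__ import division   # no effect in Python 3; true division in 2.7

a = 17
b = 5

print(a + b)     # 22   addition
print(a - b)     # 12   subtraction
print(a * b)     # 85   multiplication
print(a / b)     # 3.4  division
print(a // b)    # 3    floor division
print(a ** 2)    # 289  exponent
print(a % b)     # 2    modulus (remainder)

# Floor division rounds down, so with a negative value the result
# moves away from zero.
print(-17 // 5)  # -4
```

Note that a // b and a % b fit together: 17 equals 3 * 5 + 2.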
num_maleguest = 22
num_femaleguest = 28
total_guest = num_femaleguest + num_maleguest   # addition
print("a. There are " + str(total_guest) + " guests in the bar")
num_pplgroup = total_guest // 3   # floor division
water_ppl = total_guest % 3       # modulus
print("b. There are " + str(num_pplgroup) + " people in a group.")
print("c. " + str(water_ppl) + " people got water.")
ttl_malebag = num_maleguest * 2   # multiplication
ttl_femalebag = num_femaleguest * 3
ttl_giftbag = ttl_femalebag + ttl_malebag
print("d. Men were given " + str(ttl_malebag) + " bags, women were given "
      + str(ttl_femalebag) + " bags, and " + str(ttl_giftbag) + " bags were given in total.")
square = (num_maleguest ** 2) / num_femaleguest   # exponentiation, division
print("e. The answer is " + str(square))
print("f. " + str(num_femaleguest - num_maleguest) + " women didn't get to dance.")   # subtraction
i. Equal to
The symbol ‘=’ is used to represent this operation. It is used to assign
the value on the right to the variable on the left.
Example
>>> i = 5
j = 10
k = 25
v. Divide and
This operator divides the value of the variable on the left by the
value on the right, then assigns the quotient to the variable on the
left. The symbol used to carry out this operation is ‘/=’.
Example
>>> l = 16
l /= 5 # the expression is equivalent to l = l / 5
print (l)
3.2
Operator            Symbol
Equal to            =
Add and             +=
Subtract and        -=
Multiply and        *=
Divide and          /=
Exponent and        **=
Modulus and         %=
Floor divide and    //=
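Most of the worked examples for the assignment operators are missing from this excerpt. As a sketch, the chain below applies each operator in turn to a single variable; the name total and the numbers are arbitrary.

```python
from __future__ import division   # keeps '/=' as true division in Python 2.7

total = 16      # equal to: assign 16
total += 4      # add and:          total = total + 4   -> 20
total -= 5      # subtract and:     total = total - 5   -> 15
total *= 2      # multiply and:     total = total * 2   -> 30
total /= 4      # divide and:       total = total / 4   -> 7.5
total **= 2     # exponent and:     total = total ** 2  -> 56.25
total %= 10     # modulus and:      total = total % 10  -> 6.25
total //= 2     # floor divide and: total = total // 2  -> 3.0

print(total)    # 3.0
```

Each compound operator is shorthand for applying the arithmetic operator and then storing the result back in the same variable.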
i. equal to
This operator checks whether the value on the right is the same as the
value on the left. The symbol used to perform this operation is ‘==’.
Examples
1. >>> 6 == 5
False
2. >>> i = 12
j = 15
i == j
False
3. >>> j = 12
k = 12
j == k
True
3. >>> j = 15
k = 12
j>k
True
True
3. >>> j = 2
k = 100
j >= k
False
Operator                    Symbol
Equal to                    ==
Less than                   <
Greater than                >
Less than or equal to       <=
Greater than or equal to    >=
Not equal to                !=
Logical operators
Logical operators combine or invert conditions, telling the
interpreter when a compound statement is True or False. There are
three logical operators in Python:
or,
and,
not.
or
If two operations are joined with or, the interpreter tests the
first operation and only tests the second operation if the first is
False. The result is True if at least one operation is True.
Examples
>>> (8 != 2) or (8 <= 2) # the first operation is True, so the second is never tested
True
and
This operator tells the interpreter that both operations have to be
True for the result to be True. If either operation is False, then
Python returns False.
Examples
>>>(8 != 2) and (8 <= 2) # operation 2 is False
False
>>>(2 <= 8) and (2 == 2) # both operations are True
True
>>>(2 >= 8) and (2 != 2) # both operations are False
False
not
This operator tells the interpreter to print the opposite state of
correctness of the operation.
Examples
>>>not(8 != 2) # operation is True
False
>>>not(2 >= 8) # operation is False
True
>>>not(2 != 2) # operation is False
True
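The point that or only tests its second operation when the first is False (and that and stops as soon as one operation is False) can be observed directly. The sketch below uses a hypothetical helper, check(), that records which operations were actually evaluated.

```python
calls = []   # records the label of every operation that gets tested

def check(label, result):
    """Hypothetical helper: note that this operation ran, then return its result."""
    calls.append(label)
    return result

# 'or' stops at the first True operation.
outcome_or = check("first", True) or check("second", False)

# 'and' stops at the first False operation.
outcome_and = check("third", False) and check("fourth", True)

print(outcome_or)       # True
print(outcome_and)      # False
print(calls)            # ['first', 'third']
print(not outcome_or)   # False
```

The operations labelled "second" and "fourth" never appear in calls because the interpreter already knew the answer without testing them.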
Conditional Statements
Conditional statements are used to execute actions based on whether
a condition is determined to be True or False. The use of if-else
conditional statements or expressions is a very important part of
programming; they let a program choose between actions without
repeating code, which keeps programs short. It’s easier to write
code with conditional statements.
Syntax of if-else statements:
if condition:
    block_1_statements
elif condition_2:
    block_2_statements
else:
    block_3_statements
if, elif, and else are the Python keywords used to write conditional
statements. Relational and logical operators are used to create the
conditions.
Flow:
The interpreter tests condition, if it is true, it executes
block_1_statements. If condition is false, it moves on and tests
condition_2. A true result will lead to the execution of
block_2_statements. If it is a false result, the interpreter will execute
block_3_statements.
Examples
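>>> full_name = input("Hi, can you enter your full name? ")
print("Wow, your name sounds intelligent " + full_name + "!")
age = input("How old were you on your last birthday? ")
print("Really, you are " + age + " years old, " + full_name + "!")
if age == '18':
    print("Excellent, this questionnaire is for you!")
    hours = input("How many hours per day do you spend on your smartphone? ")
elif age == '19':   # assumed condition; the original elif line is missing from this excerpt
    print("Good, please answer this questionnaire")
    hours = input("How many hours per day do you spend on your system? ")
else:
    print("Thank you for filling the survey.")
    quit()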
There are three different types of output for this program, depending
on the value the user inputs .
Output 1:
Hi, can you enter your full name? Luke Evans
Wow, your name sounds intelligent Luke Evans!
How old were you on your last birthday? 18
Really, you are 18 years old, Luke Evans!
Excellent, this questionnaire is for you!
How many hours per day do you spend on your smartphone? 13
Output 2:
Hi, can you enter your full name? Luke Evans
Wow, your name sounds intelligent Luke Evans!
How old were you on your last birthday? 19
Really, you are 19 years old, Luke Evans!
Good, please answer this questionnaire
How many hours per day do you spend on your system? 9
Output 3:
Hi, can you enter your full name? Luke Evans
Wow, your name sounds intelligent Luke Evans!
How old were you on your last birthday? 12
Really, you are 12 years old, Luke Evans!
Thank you for filling the survey. # Closes the program
In the first output, the if condition is True and the interpreter
executes the if block statements. In the second output, the if
condition is False and the interpreter moves on to the elif block. The
elif block was tested to be True and the elif block statements were
executed. In the third output, both the if and elif conditions were
False. The interpreter moved on to the else statement and quit the
program.
Nested if statement
A nested if statement occurs when another if statement is present
inside an if statement.
Example
>>> number = int(input(" Enter a number: "))   # input() returns text, so convert it to an integer
if number >= 0:
    if number == 0:
        print(" Input is equal to zero ")
    else:
        print(" Input is a positive number ")
else:
    print(" Input is a negative number ")
Output 1:
Enter a number: 0
Input is equal to zero
Output 2:
Enter a number: 12
Input is a positive number
Output 3:
Enter a number: -3
Input is a negative number
NB: Do not forget to end the if statement with the symbol ‘:’ to
prevent syntax error .
Loops
A loop is a programming construct that controls the flow of a
program by executing a set of statements repeatedly. There are two
types of loop statements in Python: the for loop and the while loop.
Example
>>> bag_brands = ['Gucci', 'Chanel', 'Louis Vuitton', 'Michael Kors', 'Buscemi']
for choice in bag_brands:
    if choice == 'Gucci':
        pass   # body missing from this excerpt
    if choice == 'Chanel':
        print('If you are choosing ' + choice)
    if choice == 'Buscemi':
        pass   # body missing from this excerpt
>>> integer = [12, 75, 87, 34, 45, 56, 67, 78, 87, 98, 54, 34, 65, 87, 42]
square = 0
for value in integer:
    square = value**2
    print(square)
NB:
if choice == 'Buscemi':
    print('My all-time favorite is ' + choice + ', you definitely have to pick this.')
    print('Price is $1,800.')
else:
The condition set in the program is that the choice of bag has to be
Buscemi. Until the condition is met, the program will keep printing
the else statement.
>>> total = 0   # named total so the built-in sum() function is not overwritten
for value in range(0, 35, 3):
    total = total + value
    print(total)
print('The final sum is', total)
The range function used tells the interpreter to start from 0, end at
34, and move 3 numbers at a time. Upon executing, the following
will display on the screen:
0
3
9
18
30
45
63
84
108
135
165
198
The final sum is 198
The first thing the interpreter does is to check if the condition is true,
if true, it executes the statement(s) in the body of the while loop.
Then it starts again at the condition of the loop and keeps executing
the command(s) until the condition turns false .
Example
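>>> number = 0
print('The first number is:', number)
while number < 10:
    number = number + 1
    print(True)
The output will be:
The first number is: 0
True
True
True
True
True
True
True
True
True
True
The loop stops after ten passes because number reaches 10 and the
condition turns false. If you leave out the line number = number + 1,
the condition will always remain true and the program will keep
printing True till you close the window.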
Lists
Lists are used to store data in Python. A list is a data type that
can store several other data types such as strings, integers, and
objects. Lists are very powerful as they can hold more than one data
type at once and can be modified at any point after creation. They
perform the same function as an array in other programming
languages. Lists are ordered, so each element in a list has its own
specific position (index). Knowledge of how to create, use, and
manipulate a list is crucial to a data scientist whose main job is to
analyze and extract data. Every single thing you need to know about
a list is covered in this chapter.
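The code that creates and reads the lists is only partly preserved in this excerpt. The sketch below shows one way to create lists and access their elements; the list contents are assumptions chosen to match the surviving output ("Data", "Science", "From", "Scratch"), and multi_d_list is an illustrative name.

```python
empty_list = []                                              # a list with no elements
simple_list = ['Data', 'Science', 'From', 'Scratch']         # assumed contents
multi_d_list = [['Data', 'Science'], ['From', 'Scratch']]    # a list of lists

print(empty_list)            # []
print(simple_list[0])        # Data     (first element, index 0)
print(simple_list[-1])       # Scratch  (negative index counts from the end)
print(multi_d_list[0][1])    # Science  (row 0, column 1)
print(multi_d_list[1][0])    # From     (row 1, column 0)
print(len(simple_list))      # 4

# A list can mix data types and repeat values.
variety_list = [1, 5, 'Data', 8, 'Science', 6, 'From', 3, 'Scratch']
print(variety_list[2])       # Data
```

In a multi-dimensional list, the first index picks the inner list and the second index picks the element inside it.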
print ( simple_list )
print ( multiD_list)
num_list = [1, 5, 7, 2, 8, 8, 8, 6, 3]
print("\nList with the repeated Numbers: ")
print ( num_list )
# how to create a List with different data types: strings and numbers
variety_list = [1, 5, 'Data', 8, 'Science', 6, 'From' , 3, 'Scratch' ]
print("\nList with different data types: ")
print (variety_list)
Science
From
Scratch
[1, 5, 7, 2, 8, 8, 8, 6, 3]
Output :
Initial empty list:
[]
Output:
Elements from the list:
Data
Science
From
Access elements in a Multi-Dimensional list:
Science
From
Access an element with negative index:
Scratch
5
# to remove an element from a particular location in a list using the pop() function
s_list.pop( )
print ("\nList after popping a specific element: ")
print ( s_list)
Output:
Initial s_list:
['my', 'baby', 'cat', 'ate', 'a', 'big', 'meal', 1, 2, 'before', 'it', 'slept']
# Perform slicing
slice_list = D_List[5:10] # start printing from the 6th element to the 10th element
print ("\n Slicing elements in a range 5-10: ")
print ( slice_list )
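Only fragments of the list-modification program survive here. The following sketch, which reuses the s_list contents printed above, walks through adding, removing, popping, and slicing. The added words 'soundly' and 'little' are made up for illustration.

```python
s_list = ['my', 'baby', 'cat', 'ate', 'a', 'big', 'meal', 1, 2, 'before', 'it', 'slept']

s_list.append('soundly')     # add one element to the end
s_list.insert(1, 'little')   # insert an element at index 1
s_list.remove(1)             # remove the first element equal to 1

last = s_list.pop()          # with no argument, pop() removes the last element
third = s_list.pop(2)        # with an index, pop() removes that position

print(last)                  # soundly
print(third)                 # baby
print(s_list)

# Slicing copies a range of elements without changing the list:
# index 5 up to, but not including, index 10.
slice_list = s_list[5:10]
print(slice_list)            # ['big', 'meal', 2, 'before', 'it']
```

Passing an index to pop() is what lets you remove an element from a particular location; without an index, only the last element is removed.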
A good data scientist should know how to create, add, remove, and
manipulate lists in any form. It lessens the stress involved in
extracting data and makes it easier and faster .
Tuples
Tuples are very similar to lists. They are also used to store data and
can hold multiple data types at once. However, there are two major
differences between lists and tuples. A tuple is delimited with
parentheses ( ) not square brackets [ ] and it cannot be modified after
it’s created. Once a tuple is created, it holds the same value forever
until it’s deleted.
Filled_tuple2 = ( Filled_tuple2, )
# the comma is necessary to create a tuple with a single element
print ( Filled_tuple2 )
2nd Tuple:
('Data', 'Science', 'From', 'Scratch')
Output:
Removal of the First character:
('A', 'T', 'A', 'S', 'C', 'I', 'E', 'N', 'C', 'E')
a, b = b, a # a is now 20 and b 10
print (' The value of a is: ' +str(a))
print (' The value of b is: ' +str(b))
# combination of tuple and function for multiple assignment
def product_and_sum(c, d):   # defining the function product_and_sum
    return (c * d), (c + d)
The output:
The value of a is: 10
The value of b is: 20
The value of a is: 20
The value of b is: 10
The value of mn is: (12, 7)
The value of m and n are: 50 and 15
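The program that produced the output above is only partly preserved. The sketch below is one reconstruction that yields the same values; the arguments passed to product_and_sum (3 and 4, then 5 and 10) are assumptions picked to match the printed results, not values confirmed by the source.

```python
a, b = 10, 20      # multiple assignment: a is 10 and b is 20
a, b = b, a        # swap using a tuple: a is now 20 and b is 10

def product_and_sum(c, d):
    # Returning two values separated by a comma returns a tuple.
    return (c * d), (c + d)

mn = product_and_sum(3, 4)       # assumed arguments; keeps the whole tuple
m, n = product_and_sum(5, 10)    # assumed arguments; unpacks the tuple

print(mn)        # (12, 7)
print(m, n)      # 50 15

# A tuple cannot be modified after it is created.
words = ('Data', 'Science', 'From', 'Scratch')
try:
    words[0] = 'Big'
except TypeError:
    print('Tuples cannot be changed in place.')
```

Unpacking with m, n = ... only works when the number of names on the left matches the number of elements in the tuple.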
print ( Tuple_a )
Output:
Traceback (most recent call last):
File "C:/Python27/Lib/idlelib/nc.py", line 3, in <module>
print ( Tuple_a )
NameError: name 'Tuple_a' is not defined
Sets
Sets are unordered collections of data. Unlike a list, a set doesn’t
retain the order in which you store its elements. Sets do not hold
duplicate elements and can be edited at any time.
Python sets behave much like mathematical sets: both support
union, intersection, and difference operations. Checking whether an
element is present is also very fast in a set, typically much faster
than searching a list.
Output:
Initial Empty Set:
set()
As you can see, the order in which the elements are stored in a set
is quite different from the order that ends up being printed. There’s
no way to predict how the interpreter will print the data present in
a set, so it’s best to use a list or tuple if you have to store data
in a particular order. The next objective is to learn how to add,
delete, and edit sets in different ways.
How to Modify a Set
The built-in add() method is used to add elements to a set. It adds
a single element per call, so adding several elements with add()
requires a for loop. Without the for loop, the way to add multiple
elements at once is the update() method, which takes one or more
iterables (lists, tuples, strings, or other sets) and adds each of
their items to the set. Tuples and strings can also be added as
single elements with add() because they cannot be modified. A list
cannot be added as an element because it can be edited; passing a
list to add() raises a TypeError. Adding a duplicate element never
causes an error: whether the set is being created or modified, the
interpreter simply ignores the duplicate, so every element in a set
stays unique.
The remove() method is used to delete an element from a set. If that
particular element does not exist in the set, a KeyError will occur
and the program will stop running. To prevent an interruption in the
running of the program, the discard() method can be used. It will
remove the element if it exists, and if it does not, it allows the
program to continue running without a hitch. The pop() method used
to delete elements in lists also works on sets; it removes and
returns one element at a time, but which element it picks is
arbitrary. To remove all the elements in a set, the clear() method
is used.
NB: Because a set is unordered, it’s not possible to know which
element will be deleted by the pop() function. The best option
is to use a method that allows you to specify the element to
remove.
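A minimal sketch of the set methods discussed above. The starting values of set_a are assumptions (the original assignment is missing from this excerpt), and the flag names are illustrative.

```python
set_a = {1, 5, 7}              # assumed starting values

set_a.add(7)                   # duplicate: silently ignored, no error
set_a.add((9, 11))             # a tuple is added as one single element
set_a.update([3, 15], 'ab')    # update() adds every item of each iterable

set_a.remove(5)                # remove() deletes an element that exists
set_a.discard(100)             # discard() does nothing if the element is absent

try:
    set_a.remove(100)          # remove() raises KeyError for a missing element
except KeyError:
    remove_failed = True

try:
    set_a.add([1, 2])          # a list can be edited, so it cannot be an element
except TypeError:
    list_rejected = True

print(set_a)                   # print order may differ from insertion order
print(len(set_a))              # 7
```

Note how update() split the string 'ab' into the separate elements 'a' and 'b', while add() kept the tuple (9, 11) whole.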
# Python program that demonstrate how to Add and remove elements from a Set
print ( set_a )
set_a.add( 7 )
set_a.add( 3 )
set_a.add( 15 )
set_a.add( 19 )
print ( set_a )
set_a.add(i)
set_a.add( ( 9, 11 ) )
print ( set_a )
print ( "\n Set after Adding elements with the Update function: " )
print ( set_a )
set_a.remove( 5 )
set_a.remove( 15 )
print ( set_a )
set_a.discard( 11 )
set_a.discard( 9 )
print ( set_a )
for i in range( 3, 4 ):
    set_a.remove(i)
print ( "\nSet after Removing a range of elements: " )
print ( set_a )
print ( set_a )
# how to remove all the elements in a Set with the clear( ) function
set_a.clear( )
print ( set_a )
set()
Frozen Sets
Frozen sets are sets that can no longer be modified. They have no
add(), remove(), pop(), or any other method that changes a set;
calling one raises an AttributeError. However, they can still be
printed, tested for membership, and combined with other sets through
operations that return a new set.
Examples
# Python program that demonstrates how FrozenSet works
# Create a Set
Set_a = ('D', 'a', 't', 'a', 'S', 'c', 'i', 'e', 'n', 'c', 'e' )
Output:
The FrozenSet is :
frozenset({'t', 'i', 'c', 'S', 'a', 'n', 'D', 'e'})
Empty FrozenSet:
frozenset( )
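Most of the frozen set program is missing from this excerpt. The sketch below rebuilds the idea from the surviving first line and output; the names letters, frozen, and bigger are illustrative.

```python
letters = ('D', 'a', 't', 'a', 'S', 'c', 'i', 'e', 'n', 'c', 'e')

frozen = frozenset(letters)    # duplicates are dropped, as in an ordinary set
print(frozen)                  # print order may vary
print(len(frozen))             # 8

empty_frozen = frozenset()
print(empty_frozen)            # frozenset()

# A frozen set has no add() method, so this raises AttributeError.
try:
    frozen.add('x')
except AttributeError:
    add_rejected = True

# Reading is still allowed, and set operations return a new frozen set.
print('a' in frozen)           # True
bigger = frozen | {'x'}
print(len(bigger))             # 9
```

Because a frozen set can never change, it can itself be stored as an element of another set or used as a dictionary key.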
Dictionaries
Like a set, a dictionary is a collection of data that can contain
multiple data types at once. In Python 2.7 it is unordered; since
Python 3.7 it remembers the order in which items were inserted. What
distinguishes Python dictionaries from other data types is their
ability to link one piece of data to another. A dictionary works
like a map in which you store a particular value inside a location.
The location and the value are called a ‘key-value’ pair. A
real-life dictionary is also a good example of how a Python
dictionary works. The words that are defined are the keys and the
definitions are the values. Just as a word can have different
meanings, a key can hold a list or tuple containing several values.
The value stored under a key can be modified, but the key itself
can’t be changed. While different keys can hold identical values,
each key must be unique and be of a data type that is uneditable,
like tuples, integers, and strings.
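A short sketch of the key-value behaviour described above. The dictionary name dictry and its first entries are assumptions modelled on the surviving fragments; the flag names are illustrative.

```python
dictry = {}                       # an empty dictionary
dictry[0] = 'Data'                # add key-value pairs one at a time
dictry[1] = 'Science'
dictry[2] = 'From'
dictry[3] = 'Scratch'
print(dictry[2])                  # From

dictry[1] = 'Analysis'            # the value under an existing key can be replaced
dictry[(4, 5)] = ['a', 'b']       # a tuple works as a key; a list works as a value

removed = dictry.pop(0)           # pop() removes a key and returns its value
print(removed)                    # Data

try:
    dictry[['bad', 'key']] = 1    # a list can be edited, so it cannot be a key
except TypeError:
    list_key_rejected = True

try:
    dictry.pop(5)                 # popping a missing key raises KeyError
except KeyError:
    missing_key = True

print(dictry)
```

To avoid the KeyError, dictry.pop(5, None) or dictry.get(5) can be used when a key may be absent.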
print ( Emp_Dict )
Dictry[2] = 'From'
Dictry[3] = 'Scratch'
print ( "\nDictionary after adding 3 elements: " )
print ( Dictry )
Example 2
Empty Dictionary:
{}
Modules
Some features of Python do not load automatically; to access them
you have to import the modules they are stored in. Modules are files
that contain code: definitions and statements.
How to Create, Name, and Save a Module
A module can contain classes, functions, and variables. To create a
module you have to define what it will contain and save it.
Examples
1. def multiply( c, d ):
       product = c * d
       return product
3. constantX = 15
constantY = 32
After defining the parameters of the module you have to save it as a
.py file on your system. It’s best to save the module with a name that
relates to what the module does. For Example 1 above, the name that
best fits the module is ‘multiply.py’, ‘print_func’ fits Example 2,
and ‘constant’ for Example 3. You must not use a Python keyword to
name a module to prevent errors while running the program.
To use what a module defines, you first have to import the module,
then use the dot ‘.’ operator to access its contents:
>>> import multiply, print_func, constant
>>> multiply.multiply(3, 5)
>>> print_func.print_function(" I'm Luke Evans ")
>>> print(constant.constantY)
>>> print(" The value of constant X = ", constant.constantX)
Output:
15
Hi : I’m Luke Evans
32
(' The value of constant X = ', 15)
When you install Python, you gain access to tons of modules. You
can find them in the Lib directory of the Python program file
installed. An example of an existing module is the ‘math’ module:
>>> import math
>>> print(" The real value of pi is ", math.pi)
Output:
(' The real value of pi is', 3.141592653589793)
Exceptions
As a beginner, a lot of errors are bound to happen in the course of
running your codes. The moment the interpreter encounters an error,
it terminates the program. There are two possible errors that can
occur, a syntax error and an exception. A syntax error is caused
when a command statement is not written in the correct format. For
example,
Dictry = { }
print ( "Empty Dictionary: " )
print ( Dictry ))
SyntaxError: invalid syntax
The error was caused by the incorrect print statement in the 3rd line.
An exception occurs when a properly constructed command
statement results in an error. For an exception, the interpreter prints a
Traceback in the window. These Tracebacks show you exactly where
the error originated. For example,
Traceback (most recent call last):
File "C:/Python27/Lib/idlelib/bs.py", line 67, in <module>
Dictry_b.pop( 5 )
KeyError: 5
This traceback tells you that the error originated in the 67th line and
was caused due to the absence of key 5 in the dictionary Dictry_b .
Creating an Exception
It’s possible to raise an exception in the middle of your code to
stop it from running if your conditions are not satisfied. The raise
keyword combined with a conditional statement is used to
accomplish this. For example,
a = 13
b = 21
x = a + b
if x > 5:
    raise Exception('x should not exceed 5. The value of x was: {}'.format(x))
print(" The value of x does not exceed 5. ")   # only reached when no exception is raised
Handling Exceptions
Exceptions in Python can be caught and handled with a try and
except statement. The try statement is a separate block from the
except block. The try block contains the normal program to be
executed while the except block contains the alternative program(s)
to be executed if an exception occurs.
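As a hedged sketch of the try and except pattern just described, the function below catches one specific exception type instead of every possible error. The dictionary stock and the function units_in_stock are hypothetical names used only for illustration.

```python
stock = {'Gucci': 3, 'Chanel': 0}   # hypothetical sample data

def units_in_stock(brand):
    """Return the number of units for a brand, or 0 if the brand is unknown."""
    try:
        # Normal program: look the brand up in the dictionary.
        return stock[brand]
    except KeyError:
        # Alternative program: runs only if the key does not exist.
        print('No record for ' + brand + ', assuming 0 units.')
        return 0

print(units_in_stock('Gucci'))   # 3
print(units_in_stock('Prada'))   # prints the notice, then 0
```

Naming the exception (except KeyError) means unrelated mistakes, such as a misspelled variable, still surface as errors instead of being silently hidden.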
import sys
a = 13
b = 21
c = a + b
try:
    print(x)   # x was never defined, so this line raises a NameError
    print(" x is less or equal to 5 ")
except:
    print("Oops!", sys.exc_info()[0], "occurred.")
    print(" x is not defined. ")
Output:
read_lines = read_lines.readlines()   # reads the text line by line into a list
write_file = open('Science.txt', 'w')
# 'w' creates a new file named Science.txt if it does not exist in the
# working directory, and overwrites any existing file with that name
append_file = open('appending_file.txt', 'a')
# 'a' appends to the bottom of the existing file
write_file.close()
# this closes the file
It’s quite common for programmers to forget to close a file after they
are done with coding, to prevent this the open statement is written
with a with block.
with open('read_file.txt', 'r') as f:
    data1 = f.read()
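The modes 'w', 'a', and 'r' and the with block can be tried together in one self-contained sketch. It writes into a temporary folder so that no existing file is touched; the folder, the file contents, and the variable names are assumptions made for illustration.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                    # scratch folder for this demo
path = os.path.join(folder, 'Science.txt')

with open(path, 'w') as f:                     # 'w' creates (or overwrites) the file
    f.write('Data\n')

with open(path, 'a') as f:                     # 'a' adds to the bottom of the file
    f.write('Science\n')

with open(path, 'r') as f:                     # 'r' opens the file for reading
    lines = f.readlines()                      # one list item per line

print(lines)       # ['Data\n', 'Science\n']
print(f.closed)    # True: the with block closed the file automatically
```

Because each with block closes its file on exit, there is no close() call to forget.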
NB: For full understanding of HTML and its tags, visit website
.
This handful of features of a site will allow you to do a lot.
However, you won’t be able to get some complicated data this
way. Not all of the main data or content will be labelled main; in
most cases, you will have to inspect the webpage (Ctrl + Shift + I
on Windows).
Example 1.
Python 2.7
pip
Requests library
Lxml library ( website )
The Code
The first thing to do is to check whether the site permits data scraping.
To do that, read the terms and conditions of the site. Amazon
permits extraction of data as long as it’s used to add value to the
world. Another method is to check the robots.txt file of the site. This
is done by adding /robots.txt to the end of the site's URL.
www.amazon.com/robots.txt
The program is designed to extract details of some sneakers sold on
Amazon:
from lxml import html
import csv, os, json
import requests
from time import sleep

def AmazonProductParser(url):
    heading = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    for attempt in range(5): # retry a few times instead of looping forever
        sleep(3)
        try:
            page = requests.get(url, headers=heading)
            if page.status_code != 200:
                raise ValueError('captcha')
            doc = html.fromstring(page.content)
            XPATH_NAME = '//h1[@id="title"]//text()'
            XPATH_DEAL_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
            XPATH_REAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
            XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
            XPATH_STOCK_AVAILABILITY = '//div[@id="availability"]//text()'
            PRODUCT_NAME = doc.xpath(XPATH_NAME)
            PRODUCT_DEAL_PRICE = doc.xpath(XPATH_DEAL_PRICE)
            PRODUCT_CATEGORY = doc.xpath(XPATH_CATEGORY)
            PRODUCT_REAL_PRICE = doc.xpath(XPATH_REAL_PRICE)
            PRODUCT_STOCK_AVAILABILITY = doc.xpath(XPATH_STOCK_AVAILABILITY)
            # join the text fragments returned by each XPath into clean strings
            NAME = ' '.join(''.join(PRODUCT_NAME).split()) or None
            DEAL_PRICE = ' '.join(''.join(PRODUCT_DEAL_PRICE).split()) or None
            CATEGORY = ' > '.join(i.strip() for i in PRODUCT_CATEGORY) or None
            REAL_PRICE = ''.join(PRODUCT_REAL_PRICE).strip() or None
            AVAILABILITY = ''.join(PRODUCT_STOCK_AVAILABILITY).strip() or None
            if not REAL_PRICE:
                REAL_PRICE = DEAL_PRICE
            data = {
                'NAME': NAME,
                'DEAL_PRICE': DEAL_PRICE,
                'CATEGORY': CATEGORY,
                'REAL_PRICE': REAL_PRICE,
                'STOCK_AVAILABILITY': AVAILABILITY,
                'URL': url,
            }
            return data
        except Exception as e:
            print(e)
    return None

if __name__ == "__main__":
    # ReadAsin, which loops over the products and saves the results,
    # is not shown in this listing
    ReadAsin()
The data file will be named Sneakers and it will be a .json file. It can
be opened with any text editor. The data in the file will have the
following structure:
{
    "NAME": "Teva Lightweight Waterproof Comfort Hiking Training Boxing Wrestling Gym Arrowood Swift Mid Premier Sneakers",
    "DEAL_PRICE": "$49.99 - $57.98",
    "CATEGORY": "Clothing, Shoes & Jewelry > Men > Shoes > Fashion Sneakers",
    "REAL_PRICE": "$49.99 - $57.98",
    "STOCK_AVAILABILITY": null,
    "URL": "http://www.amazon.com/dp/B073Y6MPR3"
},
{
    "NAME": "adidas Women's Cloudfoam Pure Running Shoe",
    "DEAL_PRICE": "$35.00 - $155.00 Lower price available on select options",
    "CATEGORY": "Clothing, Shoes & Jewelry > Women > Shoes > Fashion Sneakers",
    "REAL_PRICE": "$35.00 - $155.00 Lower price available on select options",
    "STOCK_AVAILABILITY": null,
    "URL": "http://www.amazon.com/dp/B0711R2TNB"
},
Python 3.0
pip
Requests library
Lxml library ( website )
Dateutil ( website )
The Code
After inspecting the site for permissions, create the program:
# -*- coding: utf-8 -*-
# this helps the interpreter deal with the Unicode characters in the product details
from lxml import html
from json import dump, loads
from requests import get
import json
from re import sub
from dateutil import parser as dateparser
from time import sleep

def ExtractReviews(asin):
    amzon_url = 'http://www.amazon.com/dp/' + asin
    heading = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    for i in range(5):
        reply = get(amzon_url, headers=heading, verify=False, timeout=30)
        if reply.status_code == 404:
            return {"url": amzon_url, "error": "page not found"}
        if reply.status_code != 200:
            continue
        clean_reply = reply.text
        parser_ = html.fromstring(clean_reply)
        XPATH_AGGREGATE_ = '//span[@id="acrCustomerReviewText"]'
        XPATH_REVIEW_SECTION_1_ = '//div[contains(@id,"reviews-summary")]'
        XPATH_REVIEW_SECTION_2_ = '//div[@data-hook="review"]'
        XPATH_AGGREGATE__RATING_ = '//table[@id="histogramTable"]//tr'
        XPATH_PRODUCT_NAME_ = '//h1//span[@id="productTitle"]//text()'
        XPATH_PRODUCT_PRICE_ = '//span[@id="priceblock_ourprice"]/text()'
        # The review-level expressions used below (XPATH_AUTHOR, XPATH_RATING_,
        # XPATH_REVIEW_HEADER, XPATH_REVIEW_POSTED_DATE, XPATH_REVIEW_TEXT_1 to
        # XPATH_REVIEW_TEXT_3 and XPATH_REVIEW_COMMENTS) are not shown in this listing.
        raw_product_price = parser_.xpath(XPATH_PRODUCT_PRICE_)
        raw_product_name = parser_.xpath(XPATH_PRODUCT_NAME_)
        total_ratings = parser_.xpath(XPATH_AGGREGATE__RATING_)
        reviews = parser_.xpath(XPATH_REVIEW_SECTION_1_)
        product_price = ''.join(raw_product_price).strip()
        product_name = ''.join(raw_product_name).strip()
        if not reviews:
            reviews = parser_.xpath(XPATH_REVIEW_SECTION_2_)
        # the step that fills ratings_dict from total_ratings is not shown in this listing
        ratings_dict = {}
        reviews_list = []
        for review in reviews:
            raw_review_author = review.xpath(XPATH_AUTHOR)
            raw_review_rating = review.xpath(XPATH_RATING_)
            raw_review_header = review.xpath(XPATH_REVIEW_HEADER)
            raw_review_posted_date = review.xpath(XPATH_REVIEW_POSTED_DATE)
            raw_review_text1 = review.xpath(XPATH_REVIEW_TEXT_1)
            raw_review_text2 = review.xpath(XPATH_REVIEW_TEXT_2)
            raw_review_text3 = review.xpath(XPATH_REVIEW_TEXT_3)
            # Cleaning data
            author = ' '.join(' '.join(raw_review_author).split())
            review_rating = ''.join(raw_review_rating).replace('out of 5 stars', '')
            review_header = ' '.join(' '.join(raw_review_header).split())
            try:
                review_posted_date = dateparser.parse(''.join(raw_review_posted_date)).strftime('%d %b %Y')
            except Exception:
                review_posted_date = None
            review_text = ' '.join(' '.join(raw_review_text1).split())
            # the steps that merge raw_review_text2 and raw_review_text3 are not shown
            full_review_text = review_text
            raw_review_comments = review.xpath(XPATH_REVIEW_COMMENTS)
            review_comments = ''.join(raw_review_comments)
            review_comments = sub('[A-Za-z]', '', review_comments).strip()
            review_dict = {
                'review_comment_count': review_comments,
                'review_text': full_review_text,
                'review_posted_date': review_posted_date,
                'review_header': review_header,
                'review_rating': review_rating,
                'review_author': author
            }
            reviews_list.append(review_dict)
        data = {
            'ratings': ratings_dict,
            'reviews': reviews_list,
            'url': amzon_url,
            'name': product_name,
            'price': product_price
        }
        return data
    return {"url": amzon_url, "error": "failed to process the page"}
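def ReadAsin():
    # Add your own ASINs here
    AsinList = [ 'B07KC21BMT',
                 'B07DPRQMDH',
                 'B07DPSVJMN',
                 'B07417N22S',
                 'B073Y6MPR3',
                 'B0711R2TNB',
                 'B000ARG5T8',
                 'B00D881KE6',
                 'B07TWMDM6Z',
                 'B07FYB1H5J', ]
    extracted_data = []
    for asin in AsinList:
        print("Downloading and processing page http://www.amazon.com/dp/" + asin)
        extracted_data.append(ExtractReviews(asin))
        sleep(5)
    # save everything that was collected to the data file
    with open('sneaker reviews.json', 'w') as f:
        dump(extracted_data, f, indent=4)
    print("Done Scrapping Amazon Sneakers. Check the data file in the directory.")

if __name__ == '__main__':
    ReadAsin()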
The interpreter will run the program and print in the window:
Downloading and processing page http://www.amazon.com/dp/B07KC21BMT
Downloading and processing page http://www.amazon.com/dp/B07DPRQMDH
Downloading and processing page http://www.amazon.com/dp/B07DPSVJMN
Downloading and processing page http://www.amazon.com/dp/B073Y6MPR3
Downloading and processing page http://www.amazon.com/dp/B0711R2TNB
Downloading and processing page http://www.amazon.com/dp/B000ARG5T8
Downloading and processing page http://www.amazon.com/dp/B00D881KE6
Downloading and processing page http://www.amazon.com/dp/B07TWMDM6Z
Downloading and processing page http://www.amazon.com/dp/B07FYB1H5J
Done Scrapping Amazon Sneakers. Check the data file in the directory.
The data is saved in a file named ‘sneaker reviews.json’. The file will
contain:
[
{
"ratings": {
"5 star": "60%",
"4 star": "19%",
"3 star": "6%",
"2 star": "5%",
"1 star": "10%"
},
"reviews": [
{
"review_comment_count": "",
"review_text": "They so white , my shoes voted for trump. They so white I got pulled over and the cop kept on going . They so white my credit scored jumped They so white I started balancing my checkbook They so white I took some random kids to soccer practice and gave them orange slices They so white I started singing the national.... Now my knees dirty",
"review_posted_date": "19 Oct 2018",
"review_header": "You can wear em to bed",
"review_rating": "5.0 ",
"review_author": "Tony"
},
{
"review_comment_count": "",
"review_text": "I love em! Looks good ..fits great..thanks you to who ever created this shoe ..its clean and casual ..im getting every color.",
"review_posted_date": "11 Jul 2018",
"review_header": "I love em! Looks good",
"review_rating": "5.0 ",
"review_author": "Preston Moore"
},
{
"review_comment_count": "",
"review_text": "Love them! Pretty comfortable and breathable shoes. Looks just like in the photos! I really like wearing them & they are just what I was looking for. Great seller, communicates and very fast shipping five stars. Definitely buying again.",
"review_posted_date": "27 Jul 2018",
"review_header": "Love them! It\u2019s worth it buy them!!!",
"review_rating": "5.0 ",
"review_author": "Gabby Lavorata"
},
{
"review_comment_count": "",
"review_text": "I\u2019m not a name brand person so them being knock offs didn\u2019t bother me. I wore them to a concert and to a park they are very comfortable",
"review_posted_date": "10 Oct 2018",
"review_header": "Comfortable",
"review_rating": "5.0 ",
"review_author": "Anonymous"
},
{
"review_comment_count": "",
"review_text": "Fit true to size. Good quality for the money. Expected alot less. Was very comfortable. Shoelaces more for show than to actually use. material is a stretchy mesh knit. It's very lightweight.",
"review_posted_date": "05 Jul 2018",
"review_header": "Cant beat for the price!",
"review_rating": "5.0 ",
"review_author": "becca bodey"
},
{
"review_comment_count": "",
"review_text": "Not bad at all! I really like them for the house and running quick errands. soft, gentle, stylish. I recommend to size down.",
"review_posted_date": "30 Oct 2018",
"review_header": "Impressive!",
"review_rating": "5.0 ",
"review_author": "tania mattos"
},
{
"review_comment_count": "",
"review_text": "They fit perfect for me.I would recommend them to anyone. Especially since I have special made devices on my feet & legs Thank you",
"review_posted_date": "04 Aug 2018",
"review_header": "AWESOME SHOES!!!",
"review_rating": "5.0 ",
"review_author": "Country Girl"
},
{
"review_comment_count": "",
"review_text": "They are so breathable and comfortable and very beatiful, I cant wait to wear them to join the party.",
"review_posted_date": "13 Oct 2018",
"review_header": "breathable comfortable beautiful",
"review_rating": "5.0 ",
"review_author": "arthas"
}
],
NB :
The color of the line can be specified as any of the 7 main colours.
The marker can be specified as +, o, and *
The style of the line can be specified as '-', '--', '-.', ':', 'None', ' ', '', 'solid', 'dashed', 'dashdot', and 'dotted'.
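The options in the note above can be sketched with a short line chart. The year and fans values are the ones used in the scatterplot exercise in this chapter; the particular color, marker, and line style are arbitrary choices, and the chart is drawn into an in-memory buffer (an assumption made so that the example does not need a display or write a file).

```python
import io

import matplotlib
matplotlib.use("Agg")            # draw off-screen so no window is needed
from matplotlib import pyplot as plt

year = [1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016]
fans = [32, 46, 150, 75, 250, 173, 173, 380]

# color, marker and linestyle are keyword arguments of plt.plot
lines = plt.plot(year, fans, color="green", marker="o", linestyle="--")
plt.title("Line chart of Fans Every Four Years")
plt.xlabel("Year")
plt.ylabel("Fans")

buffer = io.BytesIO()
plt.savefig(buffer, format="png")   # render the chart into memory

line = lines[0]
print(line.get_color(), line.get_marker(), line.get_linestyle())
```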
#default width of the bar is 0.8, the statement above adds 0.15 to the width
plt.bar(year_w, fans) # plot bars with left x-coordinates [year_w] and heights [fans]
plt.title("Bar chart of Fans Every Four Years") # title of bar chart
Solution:
#program that demonstrates how to construct a scatterplot
from matplotlib import pyplot as plt
year = [ 1988,1992, 1996, 2000, 2004, 2008, 2012, 2016 ]
fans = [ 32, 46, 150, 75, 250, 173, 173, 380 ]
year_number = [ 1, 2, 3, 4, 5, 6, 7, 8 ]
plt.scatter( year, fans )
xy=(year_count, fan_count),
S/N   Class A   Class B
1     8         32
2     20        47
3     48        49
4     55        50
5     67        63
6     74        73
7     81        80
8     89        80
9     92        98
10    97        99
Solution:
#program that demonstrates how to construct a scatterplot with equal axis
from matplotlib import pyplot as plt
Class_A = [ 8, 20, 48, 55, 67, 74, 81, 89, 92, 95 ]
Class_B = [ 32, 47, 49, 50, 63, 73, 80, 80, 88, 99 ]
Student_number = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
plt.scatter( Class_A, Class_B )
for Student_number, A_count, B_count in zip( Student_number, Class_A, Class_B ) :
plt.annotate( Student_number,
Output:
Matplotlib is not limited to line charts, bar charts, and
scatterplots. It can be used for various other graph illustrations such
as:
Paths
Three-dimensional plotting
Streamplot
Pie charts
Tables
Filled curves
Log plots
Polar plots, and many more.
Chapter 11: Linear Algebra
The word algebra comes from the Arabic ‘al-jabr’, meaning “the
reunion of broken parts”. It involves the use of known parts to find
out unknown parts in mathematics. Linear algebra is a branch of
algebra that is concerned with linear functions and linear equations.
Basically, it is used to describe geometric objects like planes in
different dimensions and allows mathematical calculations to be
performed on them. Ordinarily, algebra focuses on one-dimensional
scalars while linear algebra deals with multi-dimensional vectors
and matrices.
Previous knowledge of linear algebra is not necessarily a prerequisite
for data science. However, you will need to master the aspects of
the topic that are absolutely necessary in data science. There are four
kinds of objects from linear algebra that are used in data science:
Scalar
Vectors
Matrix
Tensor
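As a rough illustration of how the four objects differ, the sketch below builds one of each with NumPy and prints its number of dimensions. The values are placeholders chosen for the example, not data from the book.

```python
import numpy as np

scalar = np.array(32)                       # a single number, 0 dimensions
vector = np.array([32, 81, 164])            # one row of numbers, 1 dimension
matrix = np.array([[32, 81, 164],
                   [26, 61, 165]])          # rows and columns, 2 dimensions
tensor = np.array([[[1, 2], [3, 4]],
                   [[5, 6], [7, 8]]])       # a stack of matrices, 3 dimensions

for name, obj in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, obj.ndim, obj.shape)
```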
S/N   Age   Weight (kg)   Height (cm)
1     32    81            164
2     26    61            165
3     45    74            171
4     54    72            170
5     27    75            179
6     21    65            177
7     28    80            187
8     23    68            155
9     26    78            180
10    30    83            185
Solution:
from numpy import array # used to create arrays in Python
Worker1 = array( [ 32,     # age
                   81,     # weight
                   164 ] ) # height
Worker2 = array( [ 26, 61, 165 ] )
Worker3 = array( [ 45, 74, 171 ] )
Worker4 = array( [ 54, 72, 170 ] ) …
The list can go on and on. Mathematical calculations can be
performed with the data in the vectors as long as the vectors are of
equal length, i.e., have the same number of elements. It’s not possible
to add a two-dimensional vector to a three-dimensional vector. To
find the sum of the ages, weights, and heights of two of the workers in
the company:
from numpy import array
def vector_add( a, b ):
    y = a + b
    print ( y )
    return
Worker2 = array( [ 26,
                   61,
                   165 ] )
Worker3 = array( [ 45,
                   74,
                   171 ] )
vector_add( Worker2, Worker3 )
Output of the program:
[ 71 135 336]
2. Grade 5 students in a school wrote 5 exams each on
different topics. The exams were graded out of 50,
however, the results of the exams need to be recorded
out of 100. Create a five-dimensional vector with
the grades for each subject and convert the scores to marks out of 100.
S/N   Maths   English   Geography   Spanish   Science
1     37      42        37          23        39
2     45      41        48          36        47
3     32      39        21          15        21
4     35      38        33          36        35
5     22      48        37          34        26
Solution:
from numpy import array
def vector_multiply( a, c ):
    y = c * a
    print ( y )
    return
Maths = array( [ 37,
45,
32,
35,
22 ] )
English = array( [ 42,
41,
39,
38,
48 ] )
Geography = array( [ 37,
48,
21,
33,
37 ] )
Spanish = array( [ 23,
36,
15,
36,
34 ] )
Science = array( [ 39,
47,
21,
35,
26 ] )
print (" The score over hundred in Maths: ")
vector_multiply( Maths, 2)
print (" The score over hundred in English: ")
vector_multiply( English, 2)
print (" The score over hundred in Geography: ")
vector_multiply( Geography, 2)
print (" The score over hundred in Spanish: ")
vector_multiply( Spanish, 2)
print (" The score over hundred in Science: ")
vector_multiply( Science, 2)
Output:
The score over hundred in Maths:
[74 90 64 70 44]
The score over hundred in English:
[84 82 78 76 96]
The score over hundred in Geography:
[74 96 42 66 74]
The score over hundred in Spanish:
[46 72 30 72 68]
The score over hundred in Science:
[78 94 42 70 52]
G = [ [ 37, 45, 32, 35, 22 ], # matrix G has 5 rows and 5 columns.
[ 42, 41, 39, 38, 48], # row 1 represents English grades
[1, 1, 0, 1, 0, 0, 0, 0, 1, 1 ] ]
If a number is chosen at random, it’s quite easy to find out if the
student with the corresponding number is male or female.
if S[ 5 ][ 8 ] == 1:
    print ( "Female" )
else:
    print ( "Male" )
Output:
Same Gender
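A matrix stored as a list of lists can be inspected with a couple of small helpers. This is a sketch using the grades from example 2, with one row per subject; the helper names shape and get_column are assumptions made for the example.

```python
# Each row holds one subject's grades for students 1 to 5
# (values taken from the grades table in example 2).
G = [
    [37, 45, 32, 35, 22],   # row 0: Maths
    [42, 41, 39, 38, 48],   # row 1: English
    [37, 48, 21, 33, 37],   # row 2: Geography
    [23, 36, 15, 36, 34],   # row 3: Spanish
    [39, 47, 21, 35, 26],   # row 4: Science
]

def shape(matrix):
    """Return (number of rows, number of columns)."""
    rows = len(matrix)
    columns = len(matrix[0]) if matrix else 0
    return rows, columns

def get_column(matrix, j):
    """Return column j, i.e. every subject's grade for student j + 1."""
    return [row[j] for row in matrix]

print(shape(G))            # rows and columns of G
print(G[1][4])             # English grade of student 5
print(get_column(G, 0))    # all grades of student 1
```

Indexing is zero-based, so G[1][4] reads the second row and the fifth column.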
Chapter 12: Statistics
Statistics is the practice of gathering and analyzing large quantities of
data to extract information. Statistics is used to answer important
questions about data, especially questions about populations. It
provides a structured approach to answering each question that does
not rely on bias or intuition. However, statistics is a very wide topic
with numerous libraries, so there is no single way to approach
problems statistically. It’s easy to input the data but not so easy to
calculate and justify the results with the various methods
available. This chapter focuses on two major types of descriptive
statistics. It discusses their relevance and how to compute them in
Python. It also covers some of the most powerful statistical libraries
and tools in the Python arsenal, how the libraries work, and how to
use them productively.
Descriptive Statistics
Descriptive statistics is used to characterize and summarize a given
set of data based on its properties. Basically, it describes the main
features of the given data by producing short summaries of the data.
There are four different ways in which descriptive statistics can be
classified:
print ( x )
return
Example
1. Analyze the data gathered from a group of workers during
health week.
S/N   Age   Weight (kg)   Height (cm)
1     32    81            164
2     26    61            165
3     45    74            171
4     54    72            170
5     27    75            179
6     21    65            177
7     28    80            187
8     23    68            155
9     26    78            180
10    30    83            185
Calculate the mean age, weight, and height.
Solution:
from __future__ import division
# remember that / performs integer division on integers in Python 2.7 without this import
def mean( y ) :
    x = sum( y ) / len( y )
    print ( x )
    return
Age = [ 32, 26, 45, 54, 27, 21, 28, 23, 26, 30]
Weight = [ 81, 61, 74, 72, 75, 65, 80, 68, 78, 83 ]
Height = [ 164, 165, 171, 170, 179, 177, 187, 155, 180, 185 ]
mean( Age )
mean( Height )
mean( Weight )
Output:
31.2
173.3
73.7
The next measure is the median. The median is the middle value in a
given set of data once the values are arranged in order. If the data
set has an even number of values, the median is the average of the
two middle values. Because the median depends on the order of the
values, the data must be sorted before the middle value is selected.
The function to calculate the median is defined as:
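def median( y ):
    n = len( y )
    sorted_y = sorted( y )
    midpoint = n // 2
    if n % 2 == 1:
        # odd number of values: return the middle one
        return sorted_y[ midpoint ]
    else:
        # even number of values: return the average of the two middle ones
        return ( sorted_y[ midpoint - 1 ] + sorted_y[ midpoint ] ) / 2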
Example :
Calculate the median age of the data gathered from the workers
during the health week.
Solution:
from __future__ import division
def median( y ):
    g = len( y )
    sorted_y = sorted( y )
    # this arranges the data from smallest to largest
    midpoint = g // 2
    if g % 2 == 1:
        # this instructs the interpreter to return the middle value if odd
        print ( sorted_y [ midpoint ] )
        return
    else:
        less = midpoint - 1
        high = midpoint
        i = sorted_y [ less ] + sorted_y [ high ]
        j = i / 2
        print ( j )
Age = [ 32, 26, 45, 54, 27, 21, 28, 23, 26, 30]
Weight = [ 81, 61, 74, 72, 75, 65, 80, 68, 78, 83 ]
Height = [ 164, 165, 171, 170, 179, 177, 187, 155, 180, 185 ]
median ( Age )
median ( Weight )
median ( Height )
Output:
27.5
74.5
174.0
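The hand-written functions can be cross-checked against Python's built-in statistics module (available in Python 3). This is a small sketch, not part of the original solution; it reuses the workers' data from the example.

```python
import statistics

Age = [32, 26, 45, 54, 27, 21, 28, 23, 26, 30]
Weight = [81, 61, 74, 72, 75, 65, 80, 68, 78, 83]
Height = [164, 165, 171, 170, 179, 177, 187, 155, 180, 185]

summary = {}
for name, data in [("Age", Age), ("Weight", Weight), ("Height", Height)]:
    # statistics.median sorts the data itself, so the input order does not matter
    summary[name] = (statistics.mean(data), statistics.median(data))
    print(name, summary[name])
```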
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
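from math import sqrt
def mean( y ):
    x = sum( y ) / len( y )
    print ( x )
    return x
def mean_deviation( p ):
    # deviation of every value from the mean
    p_bar = sum( p ) / len( p )
    return [ p_i - p_bar for p_i in p ]
def variance( p ):
    n = len( p )
    deviations = mean_deviation( p )
    o = sum( d ** 2 for d in deviations ) / ( n - 1 ) # sum of squared deviations over n - 1
    print ( "The variance is " + str( o ) )
    standard_deviation = sqrt( o ) # the function for square root
    print ( "The standard deviation is " + str( standard_deviation ) )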
def median( y ):
    g = len( y )
    sorted_y = sorted( y )
    # this arranges the data from smallest to largest
    midpoint = g // 2
    if g % 2 == 1:
        # this instructs the interpreter to return the middle value if odd
        print ( sorted_y [ midpoint ] )
        return
    else:
        less = midpoint - 1
        high = midpoint
        i = sorted_y [ less ] + sorted_y [ high ]
        j = i / 2
        print ( j )
Age = [ 32, 26, 45, 54, 27, 21, 28, 23, 26, 30]
Weight = [ 81, 61, 74, 72, 75, 65, 80, 68, 78, 83 ]
Height = [ 164, 165, 171, 170, 179, 177, 187, 155, 180, 185 ]
mean ( Age )
mean ( Weight )
mean ( Height )
median ( Age )
median ( Weight )
median ( Height )
variance ( Age )
variance ( Weight )
variance ( Height )
Output:
31.2
73.7
173.3
27.5
74.5
174.0
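The variance and standard deviation of the ages can be checked with a compact, self-contained sketch. It assumes the sample form of the variance, dividing by n - 1, and compares the result with the built-in statistics module; the name sample_variance is made up for this example.

```python
import math
import statistics

Age = [32, 26, 45, 54, 27, 21, 28, 23, 26, 30]

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

age_variance = sample_variance(Age)
age_std = math.sqrt(age_variance)
print(age_variance, age_std)

# statistics.variance and statistics.stdev use the same n - 1 denominator
print(math.isclose(age_variance, statistics.variance(Age)))
print(math.isclose(age_std, statistics.stdev(Age)))
```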
2. SymPy
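$$f(x) = \begin{cases} 1 & \text{if } 0 < x \le 1 \\ 2 & \text{if } 1 < x \le 2 \\ 0 & \text{otherwise} \end{cases}$$
In calculus, you learned Riemann integration, which you can apply
here as
$$\int_0^2 f(x)\,dx = 1 + 2 = 3$$
which has the usual interpretation as the area of the two rectangles
that make up f (x). So far, so good.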
With Lebesgue integration, the idea is very similar except that you
focus on the y-axis rather than moving along the x-axis. The
question is given f (x) = 1, what is the set of x values for which this
is true? For example, this is true whenever x ∈ (0,1]. So now
there’s a correspondence between the values of the function (namely,
1 and 2) and the sets of x values for which this is true, namely,
{(0,1]} and {(1,2]}, respectively.
To compute the integral, you simply take the function values (i.e., 1,
2) and some way of measuring the size of the corresponding interval
(i.e., μ) as in the following:
$$\int_0^2 f\,d\mu = 1\,\mu((0,1]) + 2\,\mu((1,2])$$
Some of the notations above have been suppressed to emphasize
generality.
NB: This gives the same value of the integral as in the Riemann case
when μ((0,1]) = μ((1,2]) = 1.
By introducing the μ function as a way of measuring the intervals
above, you have introduced another degree of freedom in the
integration. This accommodates many weird functions that are not
tractable using the usual Riemann theory. Nonetheless, the key step
in the above discussion is the introduction of the μ function, which
you will encounter again as the so-called probability density
function.
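The idea of summing each function value times the measure of the set where it is attained can be sketched in a few lines. The names level_sets, length, and lebesgue_integral are assumptions made for this example, and intervals (a, b] are represented as plain (a, b) tuples.

```python
from fractions import Fraction

def length(interval):
    """The usual measure mu of an interval (a, b]: its length b - a."""
    a, b = interval
    return Fraction(b) - Fraction(a)

# The simple function f: value -> set of x where f takes that value
level_sets = {1: (0, 1), 2: (1, 2)}   # (a, b) stands for the interval (a, b]

def lebesgue_integral(level_sets, mu):
    """Sum of (function value) x (measure of the set where it is attained)."""
    return sum(value * mu(interval) for value, interval in level_sets.items())

print(lebesgue_integral(level_sets, length))   # 1*1 + 2*1

# A different way of measuring intervals changes the value of the integral
double_length = lambda interval: 2 * length(interval)
print(lebesgue_integral(level_sets, double_length))
```

Swapping the measure while keeping the function fixed is the extra degree of freedom mentioned above.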
Random Variables
Most introductions to probability jump straight into random
variables and then explain how to compute complicated integrals.
The problem with this approach is that it skips over some of the
important subtleties that will be considered now. Unfortunately, the
term random variable is not very descriptive. The better term is a
measurable function. To understand why this is a better term, it’s
necessary to dive into the formal constructions of probability by way
of a simple example. Consider tossing a fair six-sided die. There are
only six outcomes possible,
Ω = { 1, 2, 3, 4, 5, 6}
As you know, if the die is fair, then the probability of each outcome
is 1/6. To say this formally, the measure of each set (i.e., {1}, {2}, ...,
{6}) is μ({1}) = μ({2}) = ... = μ({6}) = 1/6. In this case,
the μ function discussed earlier is the usual probability mass
function, denoted by P. The measurable function maps a set into a
number on the real line. For example, {1} → 1 is one such
uninteresting function.
Things are about to get more interesting. Suppose you were asked to
construct a fair coin from the fair die. In other words, you’re to
throw the die and then record the outcomes as if you had just tossed
a fair coin. How will you do this?
One way would be to define a measurable function that says if the
die comes up 3 or less, then you declare heads and otherwise declare
tails. This strategy creates two different non-overlapping sets {1,2,3}
and {4,5,6}. Each set has the same probability measure,
P ({1,2,3}) = 1/2
P ({4,5,6}) = 1/2
And the problem is solved. Every time the die comes up {1,2,3}
record heads, and record tails otherwise.
Is this the only way to construct a fair coin experiment from a fair
die?
Alternatively, you can define the sets as {1}, {2}, {3,4,5,6}. The
corresponding measure for each set can be defined as the following
P ({1}) = 1/2
P ({2}) = 1/2
P ({3,4,5,6}) = 0
leading to another solution to the fair coin problem. To
implement this, all you need to do is ignore every throw in which
the die shows 3, 4, 5, or 6 and throw again. This is wasteful, but it
solves the problem.
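Both constructions can be simulated to see that each behaves like a fair coin. This is a sketch only; the function names, the seed, and the number of throws are arbitrary choices made for the example.

```python
import random

def coin_by_halves(rng):
    """Heads if the die shows 1, 2 or 3; tails otherwise."""
    return "H" if rng.randint(1, 6) <= 3 else "T"

def coin_by_rejection(rng):
    """Heads on 1, tails on 2; ignore 3 to 6 and throw again."""
    while True:
        roll = rng.randint(1, 6)
        if roll == 1:
            return "H"
        if roll == 2:
            return "T"

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 20000
heads_halves = sum(coin_by_halves(rng) == "H" for _ in range(n)) / n
heads_rejection = sum(coin_by_rejection(rng) == "H" for _ in range(n)) / n
print(heads_halves, heads_rejection)   # both should be close to 0.5
```

The second version needs three throws on average for every recorded toss, which is the waste mentioned above.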
There’s a slightly more interesting problem when you toss two dice.
Assume that each throw is independent, meaning that the outcome of
one does not influence the other.
What are the sets in this case? They are all pairs of possible
outcomes from two throws as shown below,
Ω = {(1,1), (1,2), ..., (5,6), (6,6)}
What are the measures of each of these sets? By virtue of the
independence claim, the measure of each is the product of the
respective measures of each element. For instance,
$$P((1,2)) = P(\{1\})\,P(\{2\}) = \frac{1}{6^2}$$
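The next step is to collect all of the (a,b) pairs that sum to each of the
possible values from two to twelve.
from collections import defaultdict
d = { (i, j): i + j for i in range(1, 7) for j in range(1, 7) } # map each pair to its sum
dinv = defaultdict(list)
for i, j in d.items():
    dinv[ j ].append( i ) # i is the pair, j is its sum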
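Once the pairs are grouped by their sum, the probability of each sum follows by counting, since every pair has measure 1/36. The following self-contained sketch uses exact fractions; the variable names are chosen for this example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

pairs_by_sum = defaultdict(list)
for a, b in product(range(1, 7), repeat=2):
    pairs_by_sum[a + b].append((a, b))

# Each pair has measure 1/36, so P(sum = s) = (number of pairs) / 36
probability = {s: Fraction(len(pairs), 36) for s, pairs in pairs_by_sum.items()}
for s in sorted(probability):
    print(s, probability[s])
```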
Convergence
The absence of a probability density for the raw data means that
sequences of random variables have to be argued about in a
structured way. Recall the following notation from basic calculus,
xn → xo
for a sequence of real numbers xn. This means that
for any given ε > 0, no matter how small, you can exhibit an m such
that for any n > m, you have
|xn − xo| < ε
Intuitively, this means that once you get past m in the sequence, you
get to within ε of xo. This means that nothing surprising happens
in the sequence on the long march to infinity, which gives a sense of
uniformity to the convergence process. When arguing about
convergence for statistics, you want the same look-and-feel as you
have here, but because this is about random variables, there is a need
for other concepts. There are two moving parts for random variables.
Recall that random variables are really functions that map sets into
the real line:
X: Ω → R.
Thus, one part to keep track of is the behavior of the subsets of Ω
while arguing about convergence. The other part is the sequence of
values that the random variable takes on the real line and how those
behave in the convergence process.
$$P\left\{\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\right\} = 1$$
Example
To get the feel for the mechanics of this kind of convergence,
consider the following sequence of uniformly distributed random
variables on the unit interval, Xn ∼ U[0,1]. Now, consider taking the
maximum of the set of n such variables as the following,
$$X_{(n)} = \max\{X_1, \ldots, X_n\}$$
Thus, this sequence converges almost surely. You can work this
example out in Python using Scipy to make it concrete with the
following code,
>>> from scipy import stats
>>> u = stats.uniform()
>>> xn = lambda i: u.rvs(i).max()
>>> xn(5)
0.96671783848200299
Thus, the xn variable is the same as the X(n) random variable in the
example.
There are still some cases where a particular realization will fall
below the chosen threshold. To satisfy the probability guarantee of
the definition, you have to make sure that for whatever n you settle on,
the probability of this kind of noncompliant behavior is
extremely small, say, less than 1%. Now, you can compute the
following to estimate the probability that the maximum exceeds 0.95
for n = 60 over 1000 realizations,
>>> import numpy as np
>>> np.mean([xn(60) > 0.95 for i in range(1000)])
0.96099999999999997
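The estimate can be compared with the exact value, since the maximum of n independent uniform draws exceeds 0.95 unless all n draws fall at or below 0.95. The sketch below uses only the standard library instead of SciPy; the seed and the function name are arbitrary.

```python
import random

rng = random.Random(1)   # fixed seed, chosen arbitrarily

def xn(n):
    """Maximum of n independent uniform draws on [0, 1)."""
    return max(rng.random() for _ in range(n))

trials = 1000
estimate = sum(xn(60) > 0.95 for _ in range(trials)) / trials
exact = 1 - 0.95 ** 60   # P(max > 0.95) = 1 - P(all 60 draws <= 0.95)
print(estimate, exact)
```

Raising n pushes the exact value towards one, which is the convergence the example describes.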
Example
To get some sense of the mechanics of this kind of convergence, let
{X1, X2, X3, ...} be the indicators of the corresponding intervals,
(0,1], (0,1/2], (1/2,1], (0,1/3], (1/3,2/3], (2/3,1]
Solution:
Keep splitting the unit interval into equal chunks and enumerate
those chunks with Xi . Because each Xi is an indicator function, it
takes only two values: zero and one. For example, X2 = 1 if 0 < x
≤ 1/2 and zero otherwise.
NB: x ∼ U(0,1), which means that P(X2 = 1) = 1/2.
Now compute the sequence of P(Xn > ε) for each n, for some ε ∈ (0,1).
For X1, P(X1 > ε) = 1, because the interval (0,1] is covered entirely
by X1. For X2 and X3, P(Xn > ε) = 1/2; for X4, X5, and X6,
P(Xn > ε) = 1/3; and so on. These probabilities shrink to zero, so Xn
converges to zero in probability. However, for any fixed x, the
sequence of values Xn(x) never settles down, because it keeps bouncing
between zero and one. This means that almost sure convergence fails
here even though there is convergence in probability. The key
distinction is that convergence in probability considers the
convergence of a sequence of probabilities whereas almost sure
convergence is concerned about the sequence of values of the
random variables over sets of events that fill out the underlying
probability space entirely (i.e., with probability one). This is a very
good example to work through in Python. The following is a
function to compute the different subintervals,
>>> make_interval = lambda n: np.array(list(zip(range(n+1), range(1,n+1)))) / float(n)
>>> intervals = np.vstack([make_interval(i) for i in range(1,5)])
>>> print(intervals)
[[ 0. 1. ]
[ 0. 0.5 ]
[ 0.5 1. ]
[ 0. 0.33333333 ]
[ 0.33333333 0.66666667 ]
[ 0.66666667 1. ]
[ 0. 0.25 ]
[ 0.25 0.5 ]
[ 0.5 0.75 ]
[ 0.75 1. ]]
Now that the individual bit strings are available, the next objective is
to show convergence, that is, that the probability of each entry goes
to a limit. For example, using ten realizations,
>>> print np.vstack([bits(u.rvs()) for i in range(10)])
[ [1 1 0 1 0 0 0 1 0 0]
[1 1 0 1 0 0 0 1 0 0]
[1 1 0 0 1 0 0 1 0 0]
[1 0 1 0 0 1 0 0 1 0]
[1 0 1 0 0 1 0 0 1 0]
[1 1 0 0 1 0 0 1 0 0]
[1 1 0 1 0 0 1 0 0 0]
[1 1 0 0 1 0 0 1 0 0]
[1 1 0 0 1 0 0 1 0 0]
[1 1 0 1 0 0 1 0 0 0] ]
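The definition of bits is not included in the text above, so the following is one possible reconstruction, offered as an assumption: for a single draw x it returns the indicator values of the successive subintervals. The helper names and the use of exact fractions are choices made for this sketch.

```python
from fractions import Fraction

def make_intervals(max_parts):
    """Split (0, 1] into 1, 2, ..., max_parts equal chunks, in order."""
    intervals = []
    for n in range(1, max_parts + 1):
        for k in range(n):
            intervals.append((Fraction(k, n), Fraction(k + 1, n)))
    return intervals

def bits(x, max_parts=4):
    """Indicator values X_1(x), X_2(x), ... for one draw x in (0, 1]."""
    return [1 if a < x <= b else 0 for a, b in make_intervals(max_parts)]

print(bits(0.3))
```

Within each group of n chunks exactly one indicator equals one, so the ones never stop appearing however far the sequence is extended.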
NB :
Now that the model is fitted, the fit can be evaluated using the predict
method,
>>> xi = np.linspace(0,10,15) # more points to draw
>>> xi = xi.reshape((-1,1)) # reshape as columns
>>> yp = lr.predict(xi)
Multilinear Regression
The Scikit-learn module easily extends linear regression to multiple
dimensions. For example, for multi-linear regression,
y = α0 + α1x1 + α2x2 + ··· + αnxn
The problem is to find all of the α terms given the training set {x1,
x2,...,xn, y}. To create another sample data set:
>>> X = np.random.randint(20,size=(10,2))
>>> Y = X.dot([1, 3])+1 + np.random.randn(X.shape[0])*20
>>> lr=LinearRegression()
>>> lr.fit(X,Y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> print lr.coef_
[ 0.35171694 4.04064287]
The coef_ variable now has two terms in it, corresponding to the
two input dimensions. The constant offset is already built in and is
an option on the LinearRegression constructor.
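To see what the fit is estimating, the same kind of data can be generated without noise and solved with ordinary least squares. This sketch uses NumPy's lstsq rather than Scikit-learn, which is a substitution made for illustration; the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)             # fixed seed, chosen arbitrarily
X = rng.randint(20, size=(10, 2))
Y = X.dot([1, 3]) + 1                      # noise-free version of the data above

# Prepend a column of ones so the first coefficient plays the role of alpha_0
A = np.column_stack([np.ones(len(X)), X])
alphas, residuals, rank, singular_values = np.linalg.lstsq(A, Y, rcond=None)
print(alphas)                              # approximately [1, 1, 3]
```

With the large noise term used in the text, the estimated coefficients can land far from 1 and 3, as the printed coef_ values show.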
Polynomial Regression
The approach above can be extended to polynomial regression by using
the PolynomialFeatures class in the preprocessing sub-module. To keep
it simple, let’s go back to the one-dimensional example. First, create
synthetic data,
Theory of Learning
There is nothing so practical as a good theory. In this section, the
formal framework for thinking about machine learning will be
established. This framework will help you think beyond particular
methods for machine learning so you can integrate new methods or
combine existing methods intelligently. Both machine learning and
statistics share the common goal of trying to derive understanding
from data. Some historical perspective helps. Most of the methods in
statistics were derived towards the start of the 20th century when
data were hard to come by.
Society was preoccupied with the potential dangers of human
overpopulation and work was focused on studying agriculture and
crop yields. At this time, even a dozen data points was considered
plenty. Around the same time, the deep foundations of probability
were being established by Kolmogorov. Thus, the lack of data meant
that the conclusions had to be buttressed by strong assumptions and
solid mathematics provided by the emerging theory of probability.
Furthermore, inexpensive powerful computers were not yet widely
available.
The situation today is much different: there are lots of data collected
and powerful and easily programmable computers are available. The
important problems no longer revolve around a dozen data points on
a farm acre, but rather millions of points on a square millimeter of a
DNA microarray. Does this mean that statistics will be superseded
by machine learning? In contrast to classical statistics, which is
concerned with developing models that characterize, explain, and
describe phenomena, machine learning is primarily concerned with
prediction, usually at the expense of all else.
Areas like exploratory statistics are very closely related to machine
learning, but the degree of emphasis on prediction is still
distinguishing. In some sense, this is unavoidable due to the size of
the data machine learning can reduce. In other words, machine
learning can help distill a table of a million columns into one
hundred columns, but is it still possible to interpret one hundred
columns meaningfully? In classical statistics, this was never an issue
because data were of a much smaller scale. Whereas mathematical
models, usually normal distributions, fitted with observations are
common in statistics, machine learning uses data to construct models
that sit on complicated data structures and exploit nonlinear
optimizations that lack closed-form solutions.
A common maxim is that statistics is data plus analytical theory and
machine learning is data plus computable structures. This makes it
seem like machine learning is completely ad-hoc and devoid of the
underlying theory, but this is not the case, and both machine learning
and statistics share many important theoretical results.
Next, define the target function below which just checks if the
number of zeros in the binary representation exceeds the number of
ones. If so, then the function outputs 1 and 0 otherwise (i.e., Y =
{0,1}).
df.f=np.array(df.index.map(lambda i:i.count('0'))
              > df.index.map(lambda i:i.count('1')),dtype=int)
df.head(8) # show top half only
f
x
0000 1
0001 1
0010 1
0011 0
0100 1
0101 0
0110 0
0111 0
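The same table can be reproduced without a data frame. This is a plain-Python sketch of the target function; the names target, inputs, and table are chosen for the example.

```python
def target(bits):
    """Return 1 if the string has more zeros than ones, else 0."""
    return int(bits.count("0") > bits.count("1"))

inputs = ["{0:04b}".format(i) for i in range(2 ** 4)]   # '0000' ... '1111'
table = {x: target(x) for x in inputs}
for x in inputs[:8]:
    print(x, table[x])
```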
The hypothesis set for this problem is the set of all possible functions
of X. The set D represents all possible input/output pairs. The
corresponding hypothesis set H has $2^{16}$ elements, one of which
matches f. There are $2^{16}$ elements in the hypothesis set because
each of the sixteen input elements can take one of two possible
values, zero or one. Thus, the size of the hypothesis set
is 2×2×···×2 = $2^{16}$. Now, presented with a training set consisting of
the first eight input/output pairs, the goal is to minimize errors over
the training set, $E_{in}(\hat{f})$. There are $2^8$ elements from the
hypothesis set that exactly match f over the training set. One more
element is needed in order to proceed: the assumption that the
training set represents a random sampling (in-sample data) from a
greater population (out-of-sample data) that is consistent with the
population that $\hat{f}$ will ultimately predict upon.
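The counting argument above can be checked by brute force. The following is only an illustrative sketch: it encodes each hypothesis as a 16-bit integer whose bit i is the output for input i (an encoding chosen here for convenience, not one used in the text) and counts how many hypotheses agree with f on the first eight inputs.

```python
# Brute-force check of the hypothesis-set counts (illustrative sketch).
# Assumption: a hypothesis h is a 16-bit integer; bit i is h's output on input i.

def target(i):
    """Target f: 1 if the 4-bit representation of i has more zeros than ones."""
    bits = '{0:04b}'.format(i)
    return int(bits.count('0') > bits.count('1'))

n_inputs = 16
train_inputs = range(8)        # the first eight inputs, 0000 through 0111

n_hypotheses = 2 ** n_inputs   # every assignment of 0/1 to the sixteen inputs

# Count the hypotheses that reproduce f on every training input
n_consistent = sum(
    all((h >> i) & 1 == target(i) for i in train_inputs)
    for h in range(n_hypotheses)
)

print(n_hypotheses)   # 65536
print(n_consistent)   # 256
```

The 256 consistent hypotheses differ only in what they output on the eight inputs that the training set never shows.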
There is a subtle consequence of this assumption—whatever the
machine learning method does once deployed, in order for it to
continue to work, it cannot disturb the data environment that it was
trained on. Said differently, if the method is not to be trained
continuously, then it cannot break this assumption by altering the
generative environment that produced the data it was trained on. For
example, suppose a model that predicts hospital readmissions based
on seasonal weather and patient health is developed. Because the
model is so effective, in the next six months, the hospital forestalls
readmissions by delivering interventions that improve patient health.
Clearly using the model cannot change seasonal weather, but
because the hospital used the model to change patient health, the
training data used to build the model is no longer consistent with the
forward-looking health of the patients. Thus, there is little reason to
think that the model will continue to work as well going forward.
Returning to the previous example, suppose that the first eight
elements from X are twice as likely as the last eight. The following
code is a function that generates elements from X according to this
distribution.
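np.random.seed(12)
def get_sample(n=1):
    # Inputs 0-7 are listed twice, so each is twice as likely as inputs 8-15
    population = list(range(8))*2 + list(range(8, 16))
    if n == 1:
        return '{0:04b}'.format(int(np.random.choice(population)))
    else:
        return [get_sample(1) for _ in range(n)]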
The next block applies the function definition to the sampled data
to generate the training set consisting of eight elements.
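A minimal sketch of this step is given here under stated assumptions. It rebuilds the table and the sampler so that it runs on its own; the name train matches the later listings, while the construction of df and the use of .loc are assumptions. The elements drawn depend on the seed and the NumPy version, so they need not match the output shown later.

```python
import numpy as np
import pandas as pd

# Table of all sixteen 4-bit inputs with the target f (assumed construction)
index = pd.Index(['{0:04b}'.format(i) for i in range(16)], name='x')
df = pd.DataFrame(index=index)
df['f'] = [int(s.count('0') > s.count('1')) for s in index]

np.random.seed(12)

def get_sample(n=1):
    """Draw n inputs; the first eight are twice as likely as the last eight."""
    population = list(range(8)) * 2 + list(range(8, 16))
    if n == 1:
        return '{0:04b}'.format(int(np.random.choice(population)))
    return [get_sample(1) for _ in range(n)]

# Eight draws (repeats are possible) together with their target values
train = df['f'].loc[get_sample(8)]
print(train)
print(train.index.unique().shape)  # number of distinct inputs in the training set
```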
Notice that even though there are eight elements, there is redundancy
because these are drawn according to an underlying probability.
Otherwise, if all sixteen different elements were drawn, the training
set would consist of the complete specification of f, and then it would
be clear which h ∈ H to pick! However, this effect gives a clue as to
how it will ultimately work.
Given the elements in the training set, consider the set of elements
from the hypothesis set that exactly match. How to choose among
these? The answer is it does not matter! Why? Because under the
assumption that the prediction will be used in an environment that is
determined by the same probability, getting something outside of the
training set is just as likely as getting something inside the training
set. The size of the training set is key here: the bigger the training
set, the less likely that there will be real-world data that fall outside
of it and the better $\hat{f}$ will perform. The following code shows the
elements of the training set in the context of all possible data.
This assumes that the hypothesis set is big enough to capture the
entire training set (which it is for this example).
df['fhat'] = df.f.loc[train.index.unique()]  # .ix was removed from pandas; .loc selects by label
df.fhat
x
0000 NaN
0001 NaN
0010 1
0011 0
0100 1
0101 NaN
0110 0
0111 NaN
1000 1
1001 0
1010 NaN
1011 NaN
1100 NaN
1101 NaN
1110 NaN
1111 NaN
Name: fhat, dtype: float64
NB: There are NaN symbols where the training set had no values.
For definiteness, you can fill these in with zeros, although you can
fill them with anything you want so long as whatever you do is not
determined by the training set.
df['fhat'] = df['fhat'].fillna(0)  # final specification of fhat
Now, pretend you have deployed this and generate some test data.
test = df.f.loc[get_sample(50)]
(df.loc[test.index, 'fhat'] != test).mean()
0.17999999999999999
The result shows the error rate, given the probability mechanism that
is generating the data. The following Pandas-fu compares the overlap
between the training set and the test set in the context of all possible
data. The NaN values show the rows where the test data had items
absent in the training data. Recall that the method returns zero for
these items. As shown, sometimes this works in its favor, and
sometimes not.
pd.concat([test.groupby(level=0).mean(),
           train.groupby(level=0).mean()],
          axis=1,
          keys=['test', 'train'])
test train
0000 1 NaN
0001 1 NaN
0010 1 1
0011 0 0
0100 1 1
0101 0 NaN
0110 0 0
0111 0 NaN
1000 1 1
1001 0 0
1010 0 NaN
1011 0 NaN
1100 0 NaN
1101 0 NaN
1110 0 NaN
1111 0 NaN
Note that where the test data and training data share elements, they
agree. When the test set produces an element that was unseen in
training, the prediction may or may not match. Now, you are in a
position to ask how big the training set should be to achieve a given
level of performance.
For example, on average, how many in-samples are needed for a
given error rate? For this problem, you can ask how large (on
average) must the training set be in order to capture all of the
possibilities and achieve perfect out-of-sample error rates? For this
problem, this turns out to be sixty-three.
>>> train = df.f.loc[get_sample(63)]
>>> del df['fhat']
>>> df['fhat'] = df.f.loc[train.index.unique()]
>>> df['fhat'] = df['fhat'].fillna(0)  # final specification of fhat
>>> test = df.f.loc[get_sample(50)]
>>> (df.loc[test.index, 'fhat'] != test).mean()  # error rate
0.0
Notice that this bigger training set has a better error rate. Because
it captured more of the complexity of the unknown f, it is able to
identify the best element from the hypothesis set. This
example shows the trade-offs between the size of the training set, the
complexity of the target function, the probability structure of the
data, and the size of the hypothesis set.
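The question of how large the training set must be can also be explored by simulation. The sketch below reads "how large on average" as the expected number of draws until all sixteen inputs have appeared at least once, which is an interpretation made here. The helper name draws_until_complete, the seed, and the number of trials are illustrative choices. The estimate changes with those choices and need not coincide exactly with the figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling weights: the first eight inputs are twice as likely as the last eight
weights = np.array([2] * 8 + [1] * 8, dtype=float)
probs = weights / weights.sum()

def draws_until_complete(rng, probs):
    """Number of draws needed until every input has been seen at least once."""
    seen = set()
    n_draws = 0
    while len(seen) < len(probs):
        seen.add(int(rng.choice(len(probs), p=probs)))
        n_draws += 1
    return n_draws

n_trials = 1000  # arbitrary choice; more trials give a steadier estimate
counts = [draws_until_complete(rng, probs) for _ in range(n_trials)]
print(np.mean(counts))  # estimated average training-set size for full coverage
```

The spread of counts is wide, so a single training set of the average size does not guarantee that every input has been seen.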
Theory of Generalization
The main question is how the method will perform once deployed. It
would be nice to have some kind of performance guarantee. In other
words, after working hard to minimize the errors in the training set,
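what errors can you expect at deployment? In training, the in-sample
error, $E_{in}(\hat{f})$, is minimized, but that’s not good enough. There
also needs to be a guarantee about the out-of-sample error,
$E_{out}(\hat{f})$. The mathematical statement of this is
$$P\left(\left|E_{out}(\hat{f}) - E_{in}(\hat{f})\right| > \epsilon\right) < \delta$$
for a given ε and δ. Informally, this says that the probability of the
respective errors differing by more than a given ε is less than some
quantity, δ. This basically means that whatever the performance on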
the training set, it should probably be pretty close to the
corresponding performance once deployed.
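To make the statement concrete, the following sketch estimates how often the in-sample error of a single, fixed hypothesis misses its out-of-sample error by more than ε. The values of p_error, epsilon, and the training-set sizes are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

p_error = 0.2     # assumed out-of-sample error rate of one fixed hypothesis
epsilon = 0.05    # tolerated gap between in-sample and out-of-sample error
n_trials = 10_000

results = {}
for n in (50, 200, 1000):
    # Each trial: number of mistakes made on a fresh training set of size n
    mistakes = rng.binomial(n, p_error, size=n_trials)
    e_in = mistakes / n
    # Fraction of training sets whose in-sample error misses by more than epsilon
    results[n] = float(np.mean(np.abs(e_in - p_error) > epsilon))
    print(n, results[n])
```

The estimated probability falls as n grows. The sketch covers only a hypothesis fixed in advance. A hypothesis chosen by minimizing the training error is the harder case, because the choice itself depends on the sample.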
Note that this does not say that the in-sample errors (Ein) are any
good in an absolute sense. It just says that you should not expect
much different after deployment. Thus, good generalization means
no surprises after deployment, not necessarily good performance, by
any means. There are two main ways to get at this: cross-validation
and probability inequalities. For probability inequalities, there are two
entangled issues: the complexity of the hypothesis set and the
probability of the data. It is possible to separate these two by
deriving a separate notion of complexity free from any particular
data probability.
VC Dimension. First, there is a need to quantify model complexity.
Let $\mathcal{A}$ be a class of sets and $F = \{x_1, x_2, \ldots, x_n\}$ a
set of n data points. Then, define
$$N_{\mathcal{A}}(F) = \#\{F \cap A : A \in \mathcal{A}\}$$
This counts the number of subsets of F that can be extracted by the
sets of A. The number of items in the set (i.e., cardinality) is noted
by the # symbol. For example, suppose F = {1} and $\mathcal{A}$ = {(x ≤ a)}. In
other words, $\mathcal{A}$ consists of all intervals closed on the right and
parameterized by a. In this case, you have $N_{\mathcal{A}}(F) = 2$ because
both subsets of F, the empty set (a < 1) and {1} (a ≥ 1), can be
extracted from F using $\mathcal{A}$.
The shatter coefficient is defined as
$$s(\mathcal{A}, n) = \max_{F} N_{\mathcal{A}}(F)$$
where the maximum runs over all finite sets F of size n. Note that this
sweeps over all finite sets so you don’t need to worry about any
particular data set of finitely many points. The definition is concerned
with $\mathcal{A}$ and how its sets can pick off elements from the data set.
A set F is shattered by $\mathcal{A}$ if $\mathcal{A}$ can pick out every
subset of it. This provides a sense of how the complexity in $\mathcal{A}$
consumes data. In the last example, the set of half-closed intervals
shattered every singleton set $\{x_1\}$.
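The counting behind these definitions can be tried on a two-point set. The sketch below uses a hypothetical helper, extracted_subsets, that lists the distinct subsets cut out by the half-lines {x ≤ a}. It assumes the points are distinct real numbers.

```python
def extracted_subsets(points):
    """Distinct subsets of `points` cut out by half-lines {x <= a}."""
    pts = sorted(points)
    # One threshold below every point, then one at each point, covers all cases
    thresholds = [pts[0] - 1.0] + pts
    return {frozenset(x for x in pts if x <= a) for a in thresholds}

F = [1.0, 2.0]
subsets = extracted_subsets(F)
print(len(subsets), 2 ** len(F))            # 3 4
print(sorted(sorted(s) for s in subsets))   # [[], [1.0], [1.0, 2.0]]
```

Only three of the four subsets appear. The subset {2.0} cannot be extracted without also capturing 1.0, so this two-point set is not shattered by the half-lines.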
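Now, this leads to the main definition of the Vapnik-Chervonenkis
dimension $d_{VC}$, which is defined as the largest k for which
$s(\mathcal{A}, k) = 2^k$. With this definition in hand, the
generalization bound, in its standard form, is
$$E_{out}(\hat{f}) \leq E_{in}(\hat{f}) + \sqrt{\frac{8}{n} \ln\left(\frac{4\left((2n)^{d_{VC}} + 1\right)}{\delta}\right)}$$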
with probability at least 1−δ. This basically says that the expected
out-of-sample error can be no worse than the in-sample error plus a
penalty due to the complexity of the hypothesis set. The expected in-
sample error comes from the training set but the complexity penalty
comes from just the hypothesis set, so you have disentangled these
two issues. A general result like this, for which you do not worry
about the probability of the data, is certain to be pretty generous, but
nonetheless, it tells you how the complexity penalty enters into the
out-of-sample error. In other words, the bound on $E_{out}(\hat{f})$ gets
worse for a more complex hypothesis set. Thus, this generalization
bound is a useful guideline but not very practical if the plan is to get
a good estimate of $E_{out}(\hat{f})$.
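To see how generous such a bound can be, the sketch below evaluates the complexity penalty for a few training-set sizes. It assumes the standard form of the VC bound, with penalty $\sqrt{(8/n)\ln(4((2n)^{d_{VC}}+1)/\delta)}$. The function name vc_penalty and the values of d_vc and delta are illustrative choices.

```python
import math

def vc_penalty(n, d_vc, delta):
    """Complexity penalty sqrt((8/n) * ln(4 * ((2n)**d_vc + 1) / delta))."""
    return math.sqrt(8.0 / n * math.log(4.0 * ((2 * n) ** d_vc + 1) / delta))

d_vc, delta = 1, 0.05   # illustrative values, not from the text
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(vc_penalty(n, d_vc, delta), 3))
```

The penalty shrinks only slowly as n grows and rises with d_vc, which is why the bound works better as a guide to how complexity enters than as a numerical estimate.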
Conclusion
Now you’ve taken a step in the thousand-mile journey: you’ve read
this book. The concepts and techniques covered in this book are
designed to guide beginners and immerse them in the world of
data science. While you learned some complicated programs and
techniques, there is still room to learn more. There’s more to statistics,
probability, machine learning, and most of the topics taught in the
book. The basics taught here should pique your interest and keep
you restless until you’ve mastered all there is to know about
data science.
Python isn’t the only programming language that is used for data
science; it’s just the best one to learn first. You can move on to
other programming languages and test your skills there. With the
knowledge and skills you’ve acquired here in Python, it won’t be so
difficult to master other programming languages.
Machine learning is an entire field on its own, and there are
numerous resources available that dig deeper into the subject than
what’s taught here. The chapter on machine learning in this book
will serve as the foundation you need for future learning.
If you’re satisfied with the knowledge learned in this book, the next
course of action is to practice, practice, practice! You already
learned how to find and mine data in chapter 9; put it to use. There’s
data everywhere around you, so start analyzing and solving problems.
Have fun creating algorithms that have an impact on society. If your
intention is to start a career with the skills learned here, participate in
competitions to improve yourself. The Internet is full of sites that
offer rewards to competition winners, and sometimes
employment opportunities.
If you don’t succeed or solve the desired problem with the first
program you write, don’t get discouraged. Call it version 1.0 and
keep upgrading until you achieve your goal.
“Inspiration is cheap, but rigor is expensive” – let this famous data
science quote be your watchword. Good luck.