Introduction to Data Science
Couse intro & Python tutorial
Bilal Shoaib Khan
Contact for the course
• Instructor: Dr. Bilal Shoiab Khan
–
[email protected] Plan for this lecture
• Data Science - why all the excitement
• What is data science
• Course information – syllabus, grading, etc.
• Basic Python programming
Data Scientists are in high demand
Also in academia
Pays Well
Demand will outpace supply
Data Scientist Job Trend in last 3 years
Job postings Jobseeker interest
0.151% 0.074%
Source: indeed.com
Data Science: Why all the Excitement?
e.g.,
Google Flu Trends:
Detecting outbreaks
two weeks ahead
of CDC data
New models are estimating
which cities are most at risk
for spread of the Ebola virus.
9
Why the all the Excitement?
10
The unreasonable effectiveness of Deep
Learning (CNNs)
2012 Imagenet challenge:
Classify 1 million images into 1000 classes.
11
“Big Data” Sources
User Generated (Web &
It’s All Happening On-line Mobile)
Every:
Click
Ad impression
Billing event
….
Fast Forward, pause,… .
Server request
Transaction
Network message
Fault
…
Internet of Things / M2M Health/Scientific Computing
Graph Data
Lots of interesting data
has a graph structure:
• Social networks
• Communication networks
• Computer Networks
• Road networks
• Citations
• Collaborations/Relationships
• …
Some of these graphs can get
quite large (e.g., Facebook*
user graph)
13
There's certainly a lot of it!
Data, data everywhere…
1 Zettabyte 1.8 ZB 8.0 ZB
logarithmic scale
800 EB
Data produced each year
161 EB
5 EB
1 Exabyte
120 PB
100-years of HD video + audio
60 PB
Human brain's capacity
1 Petabyte 14 PB
1 Petabyte == 1000 TB 2002 2006 2009 2011 2015
1 TB = 1000 GB
References
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf (2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm (life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf (w/sound) – almost certainly a gross overestimate, as sleep can be compressed significantly!
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
“Data is the New Oil”
– World Economic Forum 2011
“Data Science” an Emerging Field
O’Reilly Radar report, 2011 16
Data Science – A Definition
Data Science is the science which uses computer
science, statistics and machine learning,
visualization and human-computer interactions
to collect, clean, integrate, analyze, visualize,
interact with data to create data products.
17
Goal of Data Science
Turn data into data products.
How to use data?
• Data => exploratory analysis => knowledge
models => product / decision marking
• Data => predictive models => evaluate /
interpret => product / decision making
Data Scientist’s Practice
Clean,
prep
Hypothesize Large Scale
Digging Around Model Exploitation
in Data
Evaluate
Interpret
Example data science applications
• Marketing: predict the characteristics of high life time
value (LTV) customers, which can be used to support
customer segmentation, identify upsell opportunities,
and support other marking initiatives
• Logistics: forecast how many of which things you need
and where will we need them, which enables learn
inventory and prevents out of stock situations
• Healthcare: analyze survival statistics for different
patient attributes (age, blood type, gender, etc.) and
treatments; predict risk of re-admittance based on
patient attributes, medical history, etc.
More Examples
• Transaction Databases Recommender systems (NetFlix), Fraud
Detection (Security and Privacy)
• Wireless Sensor Data Smart Home, Real-time Monitoring,
Internet of Things
• Text Data, Social Media Data Product Review and Consumer
Satisfaction (Facebook, Twitter, LinkedIn), E-discovery
• Software Log Data Automatic Trouble Shooting (Splunk)
• Genotype and Phenotype Data Epic, 23andme, Patient-Centered
Care, Personalized Medicine
Data Science – One Definition
What’s Hard about Data Science
• Overcoming assumptions
• Making ad-hoc explanations of data patterns
• Overgeneralizing
• Communication
• Not checking enough (validate models, data pipeline
integrity, etc.)
• Using statistical tests correctly
• Prototype Production transitions
• Data pipeline complexity (who do you ask?)
Data Science concerns
About the course
• A mixture of theory and practice
• Introductory, broad overview of subjects
• Focus on practical aspects, but not on ever-changing
technology and tools
• Seminar style - I am here to learn as well as to teach
• Language choice: python
– Relatively easy to learn (for computer scientist) compared
to R (more popular among statisticians)
– Open source means easy access (as opposed to SAS or
MATLAB)
– Which one is more frequently used in data science?
Textbook
• Required:
– Data Science from Scratch (DSS) by Joel
Grus
– Python for Data Analysis (PDA) by Wes
McKinney
– Free e-book: Think Stats (TS) by Allen B.
Downey. PDF | website
• Optional: Python Data Science
Handbook (PDSH) by Jake VanderPlas
Grading policy
• 5% attendance and participation
• 20% homework assignments and in-class
exercises
• 25% midterm exam
• 50% final exam / project
• I reserve the right to slightly adjust the
weights of individual components if necessary
Brief introduction of Python
• Invented in the Netherlands, early 90s by Guido
van Rossum
• Open sourced from the beginning
• Considered a scripting language, but is much
more
– No compilation needed
– Scripts are evaluated by the interpreter, line by line
– Functions need to be defined before they are called
Different ways to run python
• Call python program via python interpreter from a Unix/windows
command line
– $ python testScript.py
– Or make the script directly executable, with additional header lines in the
script
• Using python console
– Typing in python statements. Limited functionality
>>> 3 +3
6
>>> exit()
• Using ipython console
– Typing in python statements. Very interactive.
In [167]: 3+3
Out [167]: 6
– Typing in %run testScript.py
– Many convenient “magic functions”
Anaconda for python3
• We’ll be using anaconda which includes python
environment and an IDE (spyder) as well as many
additional features
– Can also use Enthought
• Most python modules needed in data science are
already installed with the anaconda distribution
• Install with python 3.6 (and install python 2.7 as
secondary from anaconda prompt)
• Key diff between Python 2 and python 3
Ipython magic functions
• who, whos, who_ls
• time, timeit
• debug
• pwd, ls, cd, etc.
• ?
• ??
Python programming in <2 hours
• This is not a comprehensive python language
class
• Will focus on parts of the language that is worth
attention and useful in data science
• Two parts:
– Basics - today
– More advanced – next week and/or as we go
• Comprehensive Python language reference and
tutorial available in Anacondo Navigator under
“Learning” and on python.org
Formatting
• Many languages use curly braces to delimit blocks of code. Python
uses indentation. Incorrect indentation causes error.
• Comments start with #
• Colons start a new block in many constructs, e.g. function
definitions, if-then clause, for, while
for i in [1, 2, 3, 4, 5]:
# first line in "for i" block
print (i)
for j in [1, 2, 3, 4, 5]:
# first line in "for j" block
print (j)
# last line in "for j" block
print (i + j)
# last line in "for i" block print "done looping
print (i)
print ("done looping”)
• Whitespace is ignored inside parentheses and
brackets.
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 +
9 + 10 + 11 + 12 + 13 + 14 +
15 + 16 + 17 + 18 + 19 + 20)
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
easier_to_read_list_of_lists =
[ [1, 2, 3],
[4, 5, 6],
[7, 8, 9] ]
Alternatively:
long_winded_computation = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + \
9 + 10 + 11 + 12 + 13 + 14 + \
15 + 16 + 17 + 18 + 19 + 20
Modules
• Certain features of Python are not loaded by
default
• In order to use these features, you’ll need to
import the modules that contain them.
• E.g.
import matplotlib.pyplot as plt
import numpy as np
Variables and objects
• Variables are created the first time it is assigned a
value
– No need to declare type
– Types are associated with objects not variables
• X=5
• X = [1, 3, 5]
• X = ‘python’
– Assignment creates references, not copies
X = [1, 3, 5]
Y= X
X[0] = 2
Print (Y) # Y is [2, 3, 5]
Assignment
• You can assign to multiple names at the same
time
x, y = 2, 3
• To swap values
x, y = y, x
• Assignments can be chained
x=y=z=3
• Accessing a name before it’s been created (by
assignment), raises an error
Arithmetic
• a=5+2 # a is 7
• b = 9 – 3. # b is 6.0
• c=5*2 # c is 10
• d = 5**2 # d is 25
• e=5%2 # e is 1
Built in numerical types: int, float, complex
• f=7/2
# in python 2, f will be 3, unless “from __future__
import division”
• f = 7 / 2 # in python 3 f = 3.5
• f = 7 // 2 # f = 3 in both python 2 and 3
• f = 7 / 2. # f = 3.5 in both python 2 and 3
• f = 7 / float(2) # f is 3.5 in both python 2 and 3
• f = int(7 / 2) # f is 3 in both python 2 and 3
String - 1
• Strings can be delimited by matching single or double
quotation marks
single_quoted_string = 'data science'
double_quoted_string = "data science"
escaped_string = 'Isn\'t this fun'
another_string = "Isn't this fun"
real_long_string = 'this is a really long string. \
It has multiple parts, \
but all in one line.'
• Use triple quotes for multi line strings
multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""
String - 2
• Use raw strings to output backslashes
tab_string = "\t" # represents the tab character
len(tab_string) # is 1
not_tab_string = r"\t" # represents the characters '\' and 't'
len(not_tab_string) # is 2
• Strings can be concatenated (glued together) with the + operator,
and repeated with *
s = 3 * 'un' + 'ium' # s is 'unununium'
• Two or more string literals (i.e. the ones enclosed between quotes)
next to each other are automatically concatenated
s1 = 'Py' 'thon'
s2 = s1 + '2.7'
real_long_string = ('this is a really long string. '
‘It has multiple parts, '
‘but all in one line.‘)
List - 1
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]
list_length = len(integer_list) # equals 3
list_sum = sum(integer_list) # equals 6
• Get the i-th element of a list
x = [i for i in range(10)] # is the list [0, 1, ..., 9]
zero = x[0] # equals 0, lists are 0-indexed
one = x[1] # equals 1
nine = x[-1] # equals 9, 'Pythonic' for last element
eight = x[-2] # equals 8, 'Pythonic' for next-to-last element
• Get a slice of a list
one_to_four = x[1:5] # [1, 2, 3, 4]
first_three = x[:3] # [0, 1, 2]
last_three = x[-3:] # [7, 8, 9]
three_to_end = x[3:] # [3, 4, ..., 9]
without_first_and_last = x[1:-1] # [1, 2, ..., 8]
copy_of_x = x[:] # [0, 1, 2, ..., 9]
another_copy_of_x = x[:3] + x[3:] # [0, 1, 2, ..., 9]
List - 2
• Check for memberships
1 in [1, 2, 3] # True
0 in [1, 2, 3] # False
• Concatenate lists
x = [1, 2, 3]
y = [4, 5, 6]
x.extend(y) # x is now [1,2,3,4,5,6]
x = [1, 2, 3]
y = [4, 5, 6]
z = x + y # z is [1,2,3,4,5,6]; x is unchanged.
• List unpacking (multiple assignment)
x, y = [1, 2] # x is 1 and y is 2
[x, y] = 1, 2 # same as above
x, y = [1, 2] # same as above
x, y = 1, 2 # same as above
_, y = [1, 2] # y is 2, didn't care about the first element
List - 3
• Modify content of list
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
x[2] = x[2] * 2 # x is [0, 1, 4, 3, 4, 5, 6, 7, 8]
x[-1] = 0 # x is [0, 1, 4, 3, 4, 5, 6, 7, 0]
x[3:5] = x[3:5] * 3 # x is [0, 1, 4, 9, 12, 5, 6, 7, 0]
x[5:6] = [] # x is [0, 1, 4, 9, 12, 7, 0]
del x[:2] # x is [4, 9, 12, 7, 0]
del x[:] # x is []
del x # referencing to x hereafter is a NameError
• Strings can also be sliced. But they cannot modified (they are immutable)
s = 'abcdefg'
a = s[0] # 'a'
x = s[:2] # 'ab'
y = s[-3:] # 'efg'
s[:2] = 'AB' # this will cause an error
s = 'AB' + s[2:] # str is now ABcdefg
The range() function
for i in range(5):
print (i) # will print 0, 1, 2, 3, 4 (in separate lines)
for i in range(2, 5):
print (i) # will print 2, 3, 4
for i in range(0, 10, 2):
print (i) # will print 0, 2, 4, 6, 8
for i in range(10, 2, -2):
print (i) # will print 10, 8, 6, 4
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
... print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
Range() in python 2 and 3
• In python 2, range(5) is equivalent to [0, 1, 2, 3, 4]
• In python 3, range(5) is an object which can be iterated,
but not identical to [0, 1, 2, 3, 4] (lazy iterator)
print (range(3)) # in python 3, will see "range(0, 3)"
print (range(3)) # in python 2, will see "[0, 1, 2]"
print (list(range(3))) # will print [0, 1, 2] in python 3
x = range(5)
print (x[2]) # in python 2, will print "2"
print (x[2]) # in python 3, will also print “2”
x[2] = 5 # in python 2, will result in [0, 1, 5, 3, 4, 5]
x[2] = 5 # in python 3, will cause an error.
Ref to lists
• What are the expected output for the following code?
a = list(range(10))
b = a
b[0] = 100
print(a) [100, 1, 2, 3, 4, 5, 6, 7, 8, 9]
a = list(range(10))
b = a[:]
b[0] = 100
print(a) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
tuples
• Similar to lists, but are immutable
• a_tuple = (0, 1, 2, 3, 4) Note: tuple is defined by comma, not parens,
which is only used for convenience. So a = (1)
• Other_tuple = 3, 4 is not a tuple, but a = (1,) is.
• Another_tuple = tuple([0, 1, 2, 3, 4])
• Hetergeneous_tuple = (‘john’, 1.1, [1, 2])
• Can be sliced, concatenated, or repeated
a_tuple[2:4] # will print (2, 3)
• Cannot be modified
a_tuple[2] = 5
TypeError: 'tuple' object does not support item assignment
Tuples - 2
• Useful for returning multiple values from
functions
def sum_and_product(x, y):
return (x + y),(x * y)
sp = sum_and_product(2, 3) # equals (5, 6)
s, p = sum_and_product(5, 10) # s is 15, p is 50
• Tuples and lists can also be used for multiple
assignments
x, y = 1, 2
[x, y] = [1, 2]
(x, y) = (1, 2)
x, y = y, x
Dictionaries
• A dictionary associates values with unique keys
empty_dict = {} # Pythonic
empty_dict2 = dict() # less Pythonic
grades = { "Joel" : 80, "Tim" : 95 } # dictionary literal
• Access/modify value with key
joels_grade = grades["Joel"] # equals 80
grades["Tim"] = 99 # replaces the old value
grades["Kate"] = 100 # adds a third entry
num_students = len(grades) # equals 3
try:
kates_grade = grades["Kate"]
except KeyError:
print "no grade for Kate!"
Dictionaries - 2
• Check for existence of key
joel_has_grade = "Joel" in grades # True
kate_has_grade = "Kate" in grades # False
• Use “get” to avoid keyError and add default value
joels_grade = grades.get("Joel", 0) # equals 80
kates_grade = grades.get("Kate", 0) # equals 0
no_ones_grade = grades.get("No One") # default
default is None
#Which of the following is faster?
• Get all items
'Joel' in grades # faster. Hashtable
'Joel' in all_keys # slower. List.
all_keys = grades.keys() # return a list of all keys
all_values = grades.values() # return a list of all values
all_pairs = grades.items() # a list of (key, value) tuples
Dictionaries - 2
• Check for existence of key
joel_has_grade = "Joel" in grades # True
kate_has_grade = "Kate" in grades # False
• Use “get” to avoid keyError and add default value
joels_grade = grades.get("Joel", 0) # equals 80
kates_grade = grades.get("Kate", 0) # equals 0
no_ones_grade = grades.get("No One") # default
default is None
• Get all items In python3, The following will not return lists but
iterable objects
all_keys = grades.keys() # return a list of all keys
all_values = grades.values() # return a list of all values
all_pairs = grades.items() # a list of (key, value) tuples
Difference between python 2 and
python 3: Iterable objects vs lists
• In Python 3, range() returns a lazy iterable object.
– Value created when needed x = range(10000000) #fast
– Can be accessed by index x[10000] #allowed. fast
• Similarly, dict.keys(), dict.values(), and dict.items()
(also map, filter, zip, see next)
– Value can NOT be accessed by index
– Can convert to list if really needed
– Can use for loop to iterate
keys = grades.keys()
keys[0] # error
for key in keys: print (key) #ok
Control flow - 1
• if-else
if 1 > 2:
message = "if only 1 were greater than two..."
elif 1 > 3:
message = "elif stands for 'else if'"
else:
message = "when all else fails use else (if you want to)"
print (message)
parity = "even" if x % 2 == 0 else "odd"
• Difference between python 2 and python3 print
• In python 2, print is a statement
• Print(message) and print message are both valid
• In python 3, print is a function
• Only print(message) is valid
Truthiness
• True All keywords are case sensitive.
• False 0, 0.0, [], (), ‘’, None are considered
False. Most other values are True.
• None
• and In [137]: print ("True") if '' else print ('False')
False
• or
• not a = [0, 0, 0, 1]
• any any(a)
Out[135]: True
• all all(a)
Out[136]: False
Comparison
Operation Meaning a = [0, 1, 2, 3, 4]
b = a
< strictly less than c = a[:]
<= less than or equal
a == b
> strictly greater than Out[129]: True
>= greater than or equal a is b
Out[130]: True
== equal
!= not equal a == c
Out[132]: True
is object identity
a is c
is not negated object identity Out[133]: False
Bitwise operators: & (AND), | (OR), ^ (XOR), ~(NOT), << (Left Shift), >> (Right Shift)
Control flow - 2
• loops
x = 0
while x < 10:
print (x, "is less than 10“)
x += 1
What happens if we forgot to indent?
for x in range(10): Keyword pass in loops:
pass Does nothing, empty statement placeholder
for x in range(10):
if x == 3:
continue # go immediately to the next iteration
if x == 5:
break # quit the loop entirely
print (x)
Exceptions
try:
print 0 / 0
except ZeroDivisionError:
print ("cannot divide by zero")
https://docs.python.org/3/tutorial/errors.html
Functions - 1
• Functions are defined using def
def double(x):
"""this is where you put an optional docstring
that explains what the function does.
for example, this function multiplies its
input by 2"""
return x * 2
• You can call a function after it is defined
z = double(10) # z is 20
• You can give default values to parameters
def my_print(message="my default message"):
print (message)
my_print("hello") # prints 'hello'
my_print() # prints 'my default message‘
Functions - 2
• Sometimes it is useful to specify arguments by name
def subtract(a=0, b=0):
return a – b
subtract(10, 5) # returns 5
subtract(0, 5) # returns -5
subtract(b = 5) # same as above
subtract(b = 5, a = 0) # same as above
Functions - 3
• Functions are objects too
In [12]: def double(x): return x * 2
...: DD = double;
...: DD(2)
...:
Out[12]: 4
In [16]: def apply_to_one(f):
...: return f(1)
...: x=apply_to_one(DD)
...: x
...:
Out[16]: 2
Functions – lambda expression
• Small anonymous functions can be created
with the lambda keyword.
In [18]: y=apply_to_one(lambda x: x+4)
In [19]: y
Out[19]: 5
In [104]: def small_func(x): return x+4
...: apply_to_one(small_func)
Out[104]: 5
lambda expression - 2
• Small anonymous functions can be created
with the lambda keyword.
In [22]: pairs = [(2, 'two'), (3, 'three'), (1, 'one'), (4, 'four')]
...: pairs.sort(key=lambda pair: pair[0])
...: pairs
Out[22]: [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
In [107]: def getKey(pair): return pair[0]
...: pairs.sort(key=getKey)
...: pairs
Out[107]: [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')
Sorting list
• Sorted(list): keeps the original list intact and returns
a new sorted list
• list.sort: sort the original list
x = [4,1,2,3]
y = sorted(x) # is [1,2,3,4], x is unchanged
x.sort() # now x is [1,2,3,4]
• Change the default behavior of sorted
# sort the list by absolute value from largest to smallest
x = [-4,1,-2,3]
y = sorted(x, key=abs, reverse=True) # is [-4,3,-2,1]
# sort the grades from highest count to lowest
# using an anonymous function
newgrades = sorted(grades.items(),
key=lambda (name, grade): grade,
reverse=True)
List comprehension
• A very convenient way to create a new list
In [51]: squares = [x * x for x in range(5)]
In [52]: squares
Out[52]: [0, 1, 4, 9, 16]
In [64]: for x in range(5): squares[x] = x
*x
...: squares
Out[64]: [0, 1, 4, 9, 16]
List comprehension - 2
• Can also be used to filter list
In [65]: even_numbers = [x for x in range(5) if x % 2 == 0]
In [66]: even_numbers
Out[66]: [0, 2, 4]
In [68]: even_numbers = []
In [69]: for x in range(5):
...: if x % 2 == 0:
...: even_numbers.append(x)
...: even_numbers
Out[69]: [0, 2, 4]
List comprehension - 3
• More complex examples:
# create 100 pairs (0,0) (0,1) ... (9,8), (9,9)
pairs = [(x, y)
for x in range(10)
for y in range(10)]
# only pairs with x < y,
# range(lo, hi) equals
# [lo, lo + 1, ..., hi - 1]
increasing_pairs = [(x, y)
for x in range(10)
for y in range(x + 1, 10)]
Functools: map, reduce, filter
• Do not confuse with MapReduce in big data
• Convenient tools in python to apply function
to sequences of data
In [203]: def double(x): return 2*x In [205]: [double(i) for i in range(5)]
...: b=range(5) Out[205]: [0, 2, 4, 6, 8]
...: list(map(double, b))
Out[203]: [0, 2, 4, 6, 8]
In [204]: double(b)
Traceback (most recent call last):
…
TypeError: unsupported operand type(s) for *: 'int' and 'range'
Functools: map, reduce, filter
• Do not confuse with MapReduce in big data
• Convenient tools in python to apply function
to sequences of data
In [208]: def is_even(x): return x%2==0
...: a=[0, 1, 2, 3]
...: list(filter(is_even, a))
...:
Out[208]: [0, 2]
In [209]: [a[i] for i in a if is_even(i)]
Out[209]: [0, 2]
Functools: map, reduce, filter
• Do not confuse with MapReduce in big data
• Convenient tools in python to apply function
to sequences of data
In [216]: from functools import reduce
In [217]: reduce(lambda x, y: x+y, range(10))
Out[217]: 45
In [220]: reduce(lambda x, y: x*y, [1, 2, 3, 4])
Out[220]: 24
zip
• Useful to combined multiple lists into a list of
tuples
In [238]: list(zip(['a', 'b', 'c'], [1, 2, 3], ['A', 'B', 'C']))
Out[238]: [('a', 1, 'A'), ('b', 2, 'B'), ('c', 3, 'C')]
In [245]: names = ['James', 'Tom', 'Mary']
...: grades = [100, 90, 95]
...: list(zip(names, grades))
...:
Out[245]: [('James', 100), ('Tom', 90), ('Mary', 95)]
Argument unpacking
• zip(*[a, b,c]) same as zip(a, b, c)
In [252]: gradeBook = [['James', 100],
['Tom', 90],
['Mary', 95]]
...: [names, grades]=zip(*gradeBook)
In [253]: names
Out[253]: ('James', 'Tom', 'Mary')
In [254]: grades
Out[254]: (100, 90, 95)
In [259]: list(zip(['James', 100], ['Tom', 90], ['Mary', 95]))
Out[259]: [('James', 'Tom', 'Mary'), (100, 90, 95)]
args and kargs
• Convenient for taking variable number of
unnamed and named parameters
In [260]: def magic(*args, **kwargs):
...: print ("unnamed args:", args)
...: print ("keyword args:", kwargs)
...: magic(1, 2, key="word", key2="word2")
...:
unnamed args: (1, 2)
keyword args: {'key': 'word', 'key2': 'word2'}
Useful methods and modules
• The Python Tutorial
– Input and Output
• The Python Standard Library Reference
– Common string methods
– Regular expression operations
– Numeric and Mathematical Modules
– CSV File Reading and Writing
Files - input
inflobj = open(‘data’, ‘r’) Open the file ‘data’ for
input
S = inflobj.read() Read whole file into one
String
S = inflobj.read(N) Reads N bytes (N >= 1)
L = inflobj.readline () Read one line
L = inflobj.readlines() Returns a list of line strings
https://docs.python.org/3/tutorial/inputoutput.html
Files - output
outflobj = open(‘data’, ‘w’) Open the file ‘data’
for writing
outflobj.write(S) Writes the string S to
file
outflobj.writelines(L) Writes each of the
strings in list L to file
outflobj.close() Closes the file
https://docs.python.org/3/tutorial/inputoutput.html
Module math
Command name Description Constant Description
abs(value) absolute value e 2.7182818...
ceil(value) rounds up pi 3.1415926...
cos(value) cosine, in radians
floor(value) rounds down
log(value) logarithm, base e
log10(value) logarithm, base 10
max(value1, value2) larger of two values
min(value1, value2) smaller of two values
round(value) nearest whole number # preferred.
sin(value) sine, in radians import math
sqrt(value) square root math.abs(-0.5)
#bad style. Many unknown #This is fine
#names in name space. from math import abs
from math import * abs(-0.5)
abs(-0.5)
Module random
• Generating random numbers are important in
statistics
In [75]: import random
...: four_uniform_randoms = [random.random() for _ in range(4)]
...: four_uniform_randoms
...:
Out[75]:
[0.5687302894847388,
0.6562738117250464,
0.3396960191199996,
0.016968446644451407]
• Other useful functions: seed(), randint, randrange, shuffle, etc.
• Type in “random” and then use tab completion to see available
functions and use “?” to see docstring of function.
Important python modules for data
science
• Numpy
– Key module for scientific computing
– Convenient and efficient ways to handle multi
dimensional arrays
• pandas
– DataFrame
– Flexible data structure of labeled tabular data
• Matplotlib: for plotting
• Scipy: solutions to common scientific computing
problem such as linear algebra, optimization,
statistics, sparse matrix
Module paths
• In order to be able to find a module called myscripts.py, the
interpreter scans the list sys.path of directory names.
• The module must be in one of those directories.
>>> import sys
>>> sys.path
['C:\\Python26\\Lib\\idlelib', 'C:\\WINDOWS\\system32\\python26.zip',
'C:\\Python26\\DLLs', 'C:\\Python26\\lib', 'C:\\Python26\\lib\\plat-win',
'C:\\Python26\\lib\\lib-tk', 'C:\\Python26', 'C:\\Python26\\lib\\site-
packages']
>>> import myscripts
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
import myscripts.py
ImportError: No module named myscripts.py
Appendix
Sequence types: Tuples,
Lists, and Strings
Sequence Types
1. Tuple: (‘john’, 32, [CMSC])
A simple immutable ordered sequence of
items
Items can be of mixed types, including
collection types
2. Strings: “John Smith”
– Immutable
– Conceptually very much like a tuple
3. List: [1, 2, ‘john’, (‘up’, ‘down’)]
Mutable ordered sequence of items of mixed
types
Similar Syntax
• All three sequence types (tuples, strings, and
lists) share much of the same syntax and
functionality.
• Key difference:
– Tuples and strings are immutable
– Lists are mutable
• The operations shown in this section can be
applied to all sequence types
– most examples will just show the operation
performed on one
Defining Sequence
• Define tuples using parentheses and commas
>>> tu = (23, ‘abc’, 4.56, (2,3),
‘def’)
• Define lists are using square brackets and commas
>>> li = [“abc”, 34, 4.34, 23]
• Define strings using quotes (“, ‘, or “““).
>>> st = “Hello World”
>>> st = ‘Hello World’
>>> st = “““This is a multi-line
string that uses triple quotes.”””
Accessing one element
• Access individual members of a tuple, list, or string
using square bracket “array” notation
• Note that all are 0 based…
>>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> tu[1] # Second item in the tuple.
‘abc’
>>> li = [“abc”, 34, 4.34, 23]
>>> li[1] # Second item in the list.
34
>>> st = “Hello World”
>>> st[1] # 2nd character in string. Still str type
‘e’
Positive and negative indices
>>> t = (23, ‘abc’, 4.56, (2,3),
‘def’)
Positive index: count from the left, starting with 0
>>> t[1]
‘abc’
Negative index: count from right, starting with –1
>>> t[-3]
4.56
Slicing: return copy of a subset
>>> t = (23, ‘abc’, 4.56,
(2,3), ‘def’)
Return a copy of the container with a subset of the
original members. Start copying at the first index,
and stop copying before second.
>>> t[1:4]
(‘abc’, 4.56, (2,3))
Negative indices count from end
>>> t[1:-1]
(‘abc’, 4.56, (2,3))
Slicing: return copy of a subset
>>> t = (23, ‘abc’, 4.56,
(2,3), ‘def’)
Omit first index to make copy starting from
beginning of the container
>>> t[:2]
(23, ‘abc’)
Omit second index to make copy starting at first
index and going to end
>>> t[2:]
(4.56, (2,3), ‘def’)
Copying the Whole Sequence
• [ : ] makes a copy of an entire sequence
>>> t[:]
(23, ‘abc’, 4.56, (2,3), ‘def’)
• Note the difference between these two lines for mutable
sequences
>>> l2 = l1 # Both refer to 1 ref,
# changing one affects
both
>>> l2 = l1[:] # Independent copies,
two refs
The ‘in’ Operator
• Boolean test whether a value is inside a
container:
>>> t = [1, 2, 4, 5]
>>> 3 in t
False
>>> 4 in t
True
>>> 4 not in t
False
• For strings, tests for substrings
>>> a = 'abcde'
>>> 'c' in a
True
>>> 'cd' in a
True
>>> 'ac' in a
False
The + Operator
The + operator produces a new tuple, list, or string whose
value is the concatenation of its arguments.
>>> (1, 2, 3) + (4, 5, 6)
(1, 2, 3, 4, 5, 6)
>>> [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
>>> “Hello” + “ ” + “World”
‘Hello World’
The * Operator
• The * operator produces a new tuple, list, or string
that “repeats” the original content.
>>> (1, 2, 3) * 3
(1, 2, 3, 1, 2, 3, 1, 2, 3)
>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> “Hello” * 3
‘HelloHelloHello’
Mutability:
Tuples vs. Lists
Lists are mutable
>>> li = [‘abc’, 23, 4.34, 23]
>>> li[1] = 45
>>> li
[‘abc’, 45, 4.34, 23]
• We can change lists in place.
• Name li still points to the same memory
reference when we’re done.
Tuples are immutable
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> t[2] = 3.14
Traceback (most recent call last):
File "<pyshell#75>", line 1, in -toplevel-
tu[2] = 3.14
TypeError: object doesn't support item assignment
• You can’t change a tuple.
• You can make a fresh tuple and assign its reference
to a previously used name.
>>> t = (23, ‘abc’, 3.14, (2,3), ‘def’)
• The immutability of tuples means they’re faster
than lists.
Operations on Lists Only
>>> li = [1, 11, 3, 4, 5]
>>> li.append(‘a’) # Note the
method syntax
>>> li
[1, 11, 3, 4, 5, ‘a’]
>>> li.insert(2, ‘i’)
>>>li
[1, 11, ‘i’, 3, 4, 5, ‘a’]
The extend method vs +
• + creates a fresh list with a new memory ref
• extend operates on list li in place.
>>> li.extend([9, 8, 7])
>>> li
[1, 2, ‘i’, 3, 4, 5, ‘a’, 9, 8, 7]
• Potentially confusing:
– extend takes a list as an argument.
– append takes a singleton as an argument.
>>> li.append([10, 11, 12])
>>> li
[1, 2, ‘i’, 3, 4, 5, ‘a’, 9, 8, 7, [10, 11,
12]]
Operations on Lists Only
Lists have many methods, including index, count, remove,
reverse, sort
>>> li = [‘a’, ‘b’, ‘c’, ‘b’]
>>> li.index(‘b’) # index of 1st
occurrence
1
>>> li.count(‘b’) # number of
occurrences
2
>>> li.remove(‘b’) # remove 1st
occurrence
>>> li
[‘a’, ‘c’, ‘b’]
Operations on Lists Only
>>> li = [5, 2, 6, 8]
>>> li.reverse() # reverse the list *in place*
>>> li
[8, 6, 2, 5]
>>> li.sort() # sort the list *in place*
>>> li
[2, 5, 6, 8]
>>> li.sort(some_function)
# sort in place using user-defined comparison
Tuple details
• The comma is the tuple creation operator, not
parens
>>> 1,
(1,)
• Python shows parens for clarity (best practice)
>>> (1,)
(1,)
• Don't forget the comma!
>>> (1)
1
• Trailing comma only required for singletons
others
• Empty tuples have a special syntactic form
>>> ()
()
>>> tuple()
()
Summary: Tuples vs. Lists
• Lists slower but more powerful than tuples
– Lists can be modified, and they have lots of handy
operations and mehtods
– Tuples are immutable and have fewer features
• To convert between tuples and lists use the list() and
tuple() functions:
li = list(tu)
tu = tuple(li)