School of Computer Science and Electronics Engineering, University of Essex
Lecture 2: Being a Data Scientist
CE880: An Approachable Introduction to Data Science
Haider Raza
Tuesday, 24 Jan 2023
About Myself
- Name: Haider Raza
- Position: Senior Lecturer in Artificial Intelligence
- Research interests: AI, Machine Learning, Data Science
- Contact: [email protected]
- Academic Support Hours: 1-2 PM on Fridays via Zoom. The Zoom link is available
on Moodle
- Website: www.sagihaider.com
Different IDEs for Working in Python: Notebook, PyCharm, Colab, etc.
A code editor is a tool for writing and editing code. Code editors are usually
lightweight and can be great for learning. However, once your program gets larger, you
need to test and debug your code; that is where IDEs (integrated development
environments) come in.
Source: KDNuggets
Different Python Text Editors
Sublime Text https://www.sublimetext.com/
Atom https://atom.io/
Vim https://www.vim.org/
Visual Studio Code https://code.visualstudio.com/
What is Anaconda?
Anaconda is a distribution of the Python and R programming languages that aims to
simplify package management and deployment.
Source: Anaconda
Why do people use Jupyter Notebooks for programming in Python?
The Jupyter Notebook is an open-source web application that allows you to create
and share documents that contain live code, equations, visualizations and narrative
text. Uses include: data cleaning and transformation, numerical simulation, statistical
modeling, data visualization, machine learning, and much more.
Source: Anaconda
Why are we using Google Colab?
- It is free
- It offers free GPU access (e.g. an NVIDIA Tesla K80, a card that costs about £3,500)
- Colab integrates with Git-based version control (notebooks can be saved to GitHub),
and it is easy to write notes and documentation, including figures and tables,
using Markdown
- It runs on virtual machines on Google servers, so you don't need to install common
packages such as NumPy, Pandas, TensorFlow, and Keras
- It can link to your GitHub profile
Source: Google Colab
Introduction to NumPy
What is NumPy?
NumPy is a Python library used for working with arrays.
- It also has functions for working in the domains of linear algebra, Fourier
transforms, and matrices.
- NumPy was created in 2005 by Travis Oliphant. It is an open-source project and
you can use it freely.
- NumPy stands for Numerical Python.
Source: https://assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
Why are we using NumPy?
- It is the fundamental package for scientific computing with Python
- It provides an N-dimensional array object
- It offers linear algebra, Fourier transform, and random number capabilities
- It is a building block for other packages (e.g. SciPy)
- It is open source
Example (Code)
import numpy as np
NumPy Array
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of
nonnegative integers. The number of dimensions is the rank of the array; the shape of
an array is a tuple of integers giving the size of the array along each dimension.
Source: GeeksforGeeks
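A minimal sketch of rank and shape, using two made-up arrays (the names a and b are illustrative only):

```python
import numpy as np

# A rank-1 array: one dimension holding three integers
a = np.array([1, 2, 3])

# A rank-2 array: two dimensions, 2 rows by 3 columns
b = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim, a.shape)  # 1 (3,)
print(b.ndim, b.shape)  # 2 (2, 3)
```

Here ndim reports the rank and shape reports the size along each dimension.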
NumPy Indexing
Source: GeeksforGeeks
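A short sketch of common indexing patterns, assuming a small made-up 3x4 array named m (not taken from the slide figure):

```python
import numpy as np

m = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

element = m[1, 2]        # single element: row 1, column 2 (zero-based), value 7
first_row = m[0]         # whole first row
last_column = m[:, -1]   # every row, last column
block = m[:2, 1:3]       # slice: first two rows, columns 1 and 2
large = m[m > 8]         # boolean mask: all values greater than 8

print(element, first_row, last_column, block, large, sep="\n")
```

Indices start at zero, and slices exclude their end position.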
NumPy Data Type
Every numpy array is a grid of elements of the same type.
Source: Medium
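A small sketch of how NumPy chooses and converts element types; the arrays are made up for illustration:

```python
import numpy as np

ints = np.array([1, 2, 3])          # integer type inferred (exact width is platform dependent)
floats = np.array([1.0, 2.5, 3.0])  # float64 inferred
mixed = np.array([1, 2.5])          # mixed input is upcast to a single type, float64

# astype returns a converted copy; converting floats to integers drops the fraction
truncated = floats.astype(np.int64)

print(ints.dtype, floats.dtype, mixed.dtype, truncated)
```

Because every element shares one type, mixing integers and floats yields a float array rather than a mixed one.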
NumPy Math
Basic mathematical functions operate elementwise on arrays, and are available both as
operator overloads and as functions in the numpy module:
Source: Stanford
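A minimal sketch of the operator and function forms side by side, using two made-up 2x2 arrays:

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[5.0, 6.0], [7.0, 8.0]])

sum_op = x + y          # operator overload
sum_fn = np.add(x, y)   # equivalent function in the numpy module
product = x * y         # elementwise product: [[5, 12], [21, 32]]
roots = np.sqrt(x)      # elementwise square root

print(sum_op, sum_fn, product, roots, sep="\n")
```

Both forms give the same result; the operator form is usually easier to read.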
NumPy Math...
* is elementwise multiplication, not matrix multiplication. We instead use the dot
function to compute inner products of vectors, to multiply a vector by a matrix, and
to multiply matrices. dot is available both as a function in the numpy module and as
an instance method of array objects:
Source: Stanford
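A short sketch contrasting dot with *, using made-up vectors and matrices:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

inner = v.dot(w)         # inner product: 9*11 + 10*12 = 219
mat_vec = x.dot(v)       # matrix times vector: [29, 67]
mat_mat = np.dot(x, y)   # matrix product: [[19, 22], [43, 50]]
elementwise = x * y      # elementwise product: [[5, 12], [21, 32]]

print(inner, mat_vec, mat_mat, elementwise, sep="\n")
```

For two-dimensional arrays the @ operator (x @ y) gives the same matrix product as np.dot(x, y).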
NumPy Cheat Sheet
Source: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
Introduction to Pandas
What is Pandas?
Pandas is a Python library used for working with data sets.
- It has functions for analyzing, cleaning, exploring, and manipulating data.
- The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the
library was created by Wes McKinney in 2008.
Example (Code)
import pandas as pd
Introduction to Pandas
What is Pandas?
Pandas is a Python library used for working with data sets.
- It handles missing data easily
- It uses Series for one-dimensional data and DataFrame for two-dimensional
(tabular) data
- It provides an efficient way to slice the data
- It provides a flexible way to merge, concatenate, or reshape the data
- It includes powerful tools for working with time series
- It can store multiple data types
What is a Series?
A Series is a one-dimensional data structure. It can hold any data type, such as
integer, float, or string. It is useful when you want to perform a computation on, or
return, a one-dimensional array. A Series, by definition, cannot have multiple columns;
for that case, use a DataFrame.
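A minimal sketch of a Series, with made-up values, index labels, and name:

```python
import pandas as pd

# One-dimensional labelled data
s = pd.Series([1.5, 2.0, 3.5], index=["a", "b", "c"], name="score")

total = s.sum()    # aggregate over all values: 7.0
doubled = s * 2    # elementwise arithmetic keeps the index
b_value = s["b"]   # look up a value by its label: 2.0

print(s.ndim, total, b_value)
print(doubled)
```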
What does a DataFrame look like?
Source: W3resource
Creating a DataFrame
You can convert a NumPy array to a pandas DataFrame with pd.DataFrame(). The
opposite is also possible: to convert a pandas DataFrame to an array, you can use
np.array().
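A short sketch of the round trip between a NumPy array and a DataFrame; the column names "a" and "b" are made up:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])

# NumPy array to DataFrame, with column labels added
df = pd.DataFrame(arr, columns=["a", "b"])

# DataFrame back to a NumPy array
back = np.array(df)
also_back = df.to_numpy()   # the DataFrame method that does the same conversion

print(df)
print(back)
```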
Reshaping with Pandas
Pandas provides various methods for reshaping DataFrames and Series.
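A sketch of three common reshaping methods (melt, pivot, and concat), using a made-up table of marks; other methods may have been shown on the original slide:

```python
import pandas as pd

wide = pd.DataFrame({
    "student": ["Ann", "Bob"],
    "maths": [70, 55],
    "physics": [65, 80],
})

# Wide to long: one row per (student, subject) pair
long = wide.melt(id_vars="student", var_name="subject", value_name="mark")

# Long back to wide: subjects become columns again
wide_again = long.pivot(index="student", columns="subject", values="mark")

# Append the rows of another DataFrame with the same columns
extra = pd.DataFrame({"student": ["Cat"], "maths": [90], "physics": [75]})
combined = pd.concat([wide, extra], ignore_index=True)

print(long, wide_again, combined, sep="\n\n")
```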
Subset Row with Pandas
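A minimal sketch of selecting rows, assuming a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Dan"],
    "age": [23, 31, 27, 31],
})

older = df[df["age"] > 25]                      # rows matching a condition
first_two = df.head(2)                          # first two rows
by_position = df.iloc[1:3]                      # rows at positions 1 and 2
unique_age = df.drop_duplicates(subset="age")   # keep the first row for each age

print(older, first_two, by_position, unique_age, sep="\n\n")
```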
Subset Column with Pandas
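A minimal sketch of selecting columns, again with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "age": [23, 31],
    "height_cm": [165, 180],
})

ages = df["age"]                          # one column, returned as a Series
subset = df[["name", "age"]]              # several columns, returned as a DataFrame
by_label = df.loc[:, "age":"height_cm"]   # label slice, inclusive at both ends
by_pattern = df.filter(regex="^h")        # columns whose name starts with "h"

print(ages, subset, by_label, by_pattern, sep="\n\n")
```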
Summarising Data with Pandas
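A short sketch of common summary operations, using made-up groups and values:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

n_rows = len(df)                                    # number of rows
mean_value = df["value"].mean()                     # mean of one column: 8.5
counts = df["group"].value_counts()                 # rows per distinct group
summary = df["value"].describe()                    # count, mean, std, min, quartiles, max
group_means = df.groupby("group")["value"].mean()   # mean per group: a = 2.0, b = 15.0

print(n_rows, mean_value)
print(counts, summary, group_means, sep="\n\n")
```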
Dealing with Missing Data using Pandas
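A minimal sketch of detecting, dropping, and filling missing values in a made-up DataFrame; which strategy is appropriate depends on the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [4.0, 5.0, np.nan],
})

missing_per_column = df.isna().sum()   # count of missing values in each column
dropped = df.dropna()                  # drop every row that contains a missing value
filled_zero = df.fillna(0)             # replace missing values with a constant
filled_mean = df.fillna(df.mean())     # replace missing values with each column's mean

print(missing_per_column, dropped, filled_zero, filled_mean, sep="\n\n")
```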
Plotting with Pandas
Introduction to Matplotlib
What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a
visualization utility.
- Matplotlib was created by John D. Hunter.
- Matplotlib is open source and we can use it freely.
- Matplotlib is mostly written in Python; a few segments are written in C,
Objective-C, and JavaScript for platform compatibility.
What can Matplotlib do?
Source: https://towardsdatascience.com/python-data-visualization-with-matplotlib-part-2-66f1307d42fb
Introduction to Matplotlib
Matplotlib is a library for making 2D plots in Python. It is designed with the
philosophy that you should be able to create simple plots with just a few commands:
Source: matplotlib
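A minimal sketch of such a simple plot. The sine data are made up, and the Agg backend is an assumption chosen so the example also runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()   # one figure containing one set of axes
line, = ax.plot(x, y)      # draw y against x as a line
```

In a notebook or Colab you would normally drop the matplotlib.use line and finish with plt.show() to display the figure.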
Matplotlib types of plot
Source: matplotlib
Matplotlib Plots Tweaking
Source: matplotlib
Matplotlib Plots Organisation
Source: matplotlib
Matplotlib Plots Label
Source: matplotlib
Matplotlib Plots Explore and Save
Source: matplotlib
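A sketch that tweaks, labels, and saves a plot. The styling choices and the output file name sine_example.png are illustrative assumptions, and the Agg backend is assumed so the example runs without a display:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)

fig, ax = plt.subplots()
# Tweak the line's appearance with colour and line style
ax.plot(x, np.sin(x), color="red", linestyle="--", label="sin(x)")

# Label the plot
ax.set_title("A sine wave")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Save the figure to a temporary directory (hypothetical file name)
out_path = os.path.join(tempfile.gettempdir(), "sine_example.png")
fig.savefig(out_path, dpi=100)
plt.close(fig)
```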
What is a Version Control System (VCS)?
A version control system is a software tool for tracking and managing all the changes
made to source code during project development. It keeps a record of every change
made to the code and lets us revert to a previous version if a mistake is made in the
current one. Without a VCS in place, it would be very difficult to monitor the
development of a project.
The three types of VCS are:
- Local Version Control System
- Centralized Version Control System
- Distributed Version Control System
What is a Local Version Control System?
A local version control system is located on your local machine.
- If the local machine crashes, it would not be possible to retrieve the files, and all
the information will be lost.
- If anything happens to a single version, all the versions made after that will be
lost.
- It is not possible to collaborate with others.
What is a Centralized Version Control System?
- There is a single central server that contains all the files related to the
project.
- Many collaborators check out files from this single server (you only have a
working copy).
- The problem is that if the central server crashes, almost everything related to the
project will be lost.
What is a Distributed Version Control System?
- There are one or more servers and many collaborators, similar to the
centralized system.
- The difference is that collaborators do not only check out the latest version: each
collaborator has an exact copy (mirror) of the main repository (including
its entire history) on their local machine.
- Each user has their own repository and a working copy. Even if the server
crashes, we would not lose everything, because copies reside on several other
computers.
Difference between Git and GitHub
- Git is a version control tool (software) for tracking changes in source code.
- GitHub is a web-based cloud service for hosting your source code (Git repositories). It
is a centralized service.
- Git doesn't require GitHub, but GitHub requires Git.
Introduction to GitHub
What is GitHub?
GitHub is a code hosting platform for collaboration and version control. GitHub lets
you (and others) work together on projects.
What can a GitHub repository do?
- A GitHub repository can be used to store a development project
- It can contain folders and any type of file (HTML, CSS, JavaScript, documents,
data, images)
- A GitHub repository should also include a licence file and a README file about
the project
- A GitHub repository can also be used to store ideas, or any resources that you
want to share
What can GitHub do?
Source: http://jr0cket.co.uk/