School of Computer Science and Electronics Engineering, University of Essex

Lecture 2: Being a Data Scientist


CE880: An Approachable Introduction to Data Science

Haider Raza
Tuesday, 24 Jan 2023

About Myself

- Name: Haider Raza
- Position: Senior Lecturer in Artificial Intelligence
- Research interests: AI, Machine Learning, Data Science
- Contact: [email protected]
- Academic Support Hours: 1-2 PM on Fridays via Zoom (the Zoom link is available
  on Moodle)
- Website: www.sagihaider.com

Different IDEs for Working in Python: Notebook, PyCharm, Colab, etc.

Code editors are tools used to write and edit code. They are usually
lightweight and can be great for learning. However, once your program gets larger, you
need to test and debug your code; that is where IDEs [1] come in.

[1] KDNuggets

Different Python Text Editors

Sublime Text https://www.sublimetext.com/

Atom https://atom.io/

Vim https://www.vim.org/

Visual Studio Code https://code.visualstudio.com/

What is Anaconda?

Anaconda is a distribution of the Python and R programming languages that aims to
simplify package management and deployment.

[1] Anaconda

Why Do People Use Jupyter Notebooks for Programming in Python?

The Jupyter Notebook is an open-source web application that allows you to create
and share documents that contain live code, equations, visualizations and narrative
text. Uses include: data cleaning and transformation, numerical simulation, statistical
modeling, data visualization, machine learning, and much more.

[1] Anaconda

Why Are We Using Google Colab?

- It is free
- You can use a GPU for free (e.g. an NVIDIA Tesla K80, versus a purchase cost of £3500.00)
- Colab provides built-in version control integration using Git, and it is easy to
  create notes and documentation, including figures and tables, using Markdown
- Notebooks run on virtual machines on Google servers, so you don’t need to install common
  packages such as NumPy, Pandas, TensorFlow, and Keras
- You can link Colab to your GitHub profile

[1] Google Colab

Introduction to NumPy

What is NumPy?
NumPy is a Python library used for working with arrays.

- It also has functions for working in the domains of linear algebra, Fourier transforms,
  and matrices.
- NumPy was created in 2005 by Travis Oliphant. It is an open-source project and
  you can use it freely.
- NumPy stands for Numerical Python.

[1] https://assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

Why Are We Using NumPy?

- Fundamental package for scientific computing with Python
- N-dimensional array object
- Linear algebra, Fourier transform, and random number capabilities
- Building block for other packages (e.g. SciPy)
- Open source
Example (Code)

import numpy as np

NumPy Array

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of
nonnegative integers. The number of dimensions is the rank of the array; the shape of
an array is a tuple of integers giving the size of the array along each dimension.
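
The definitions of rank and shape can be illustrated with a minimal sketch. The array values and variable names below are made up for illustration and are not taken from the lecture.

```python
import numpy as np

# A rank-1 array (one dimension) built from a Python list
a = np.array([1, 2, 3])
print(a.ndim)    # rank: 1
print(a.shape)   # shape: (3,)

# A rank-2 array (two dimensions) built from nested lists
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b.ndim)    # rank: 2
print(b.shape)   # shape: (2, 3), i.e. 2 rows and 3 columns

# Elements are indexed by a tuple of nonnegative integers
print(b[0, 2])   # row 0, column 2 -> 3
```

Here `ndim` reports the rank and `shape` reports the size along each dimension.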

[1] geeksforgeeks

NumPy Indexing

[1] geeksforgeeks

NumPy Data Type

Every numpy array is a grid of elements of the same type.
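
As a small sketch of what "same type" means in practice, the example below inspects the `dtype` attribute of a few arrays. The values are illustrative only, and the exact integer width NumPy picks by default can depend on the platform.

```python
import numpy as np

x = np.array([1, 2])                     # NumPy infers an integer dtype
y = np.array([1.0, 2.0])                 # NumPy infers a floating-point dtype
z = np.array([1, 2], dtype=np.float64)   # dtype requested explicitly

print(x.dtype, y.dtype, z.dtype)

# Mixing integers and floats upcasts every element to one common type
mixed = np.array([1, 2.5])
print(mixed.dtype)
```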

[1] Medium

NumPy Math

Basic mathematical functions operate elementwise on arrays, and are available both as
operator overloads and as functions in the numpy module:
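
A minimal sketch of the two equivalent forms, using made-up 2x2 arrays (the variable names are illustrative assumptions):

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[5.0, 6.0], [7.0, 8.0]])

# Operator form and function form give the same elementwise result
sum_op = x + y
sum_fn = np.add(x, y)

prod_op = x * y              # elementwise product
prod_fn = np.multiply(x, y)

root = np.sqrt(x)            # elementwise square root

print(sum_op)
print(prod_op)
print(root)
```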

[1] Stanford

NumPy Math...

* is elementwise multiplication, not matrix multiplication. We instead use the dot
function to compute inner products of vectors, to multiply a vector by a matrix, and
to multiply matrices. dot is available both as a function in the numpy module and as
an instance method of array objects:
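
The difference between `*` and `dot` can be sketched as follows. The arrays are small made-up examples chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

# Inner product of two vectors: 9*11 + 10*12 = 219
inner_method = v.dot(w)        # instance-method form
inner_function = np.dot(v, w)  # module-function form

# Matrix times vector: [1*9 + 2*10, 3*9 + 4*10] = [29, 67]
mat_vec = x.dot(v)

# Matrix times matrix: [[19, 22], [43, 50]]
mat_mat = np.dot(x, y)

# Elementwise product, for contrast: [[5, 12], [21, 32]]
elementwise = x * y
```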

[1] Stanford

NumPy Cheat Sheet

[1] https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

Introduction to Pandas

What is Pandas?
Pandas is a Python library used for working with data sets.

- It has functions for analyzing, cleaning, exploring, and manipulating data.
- The name "Pandas" refers to both "Panel Data" and "Python Data
  Analysis"; the library was created by Wes McKinney in 2008.

Example (Code)

import pandas as pd

Introduction to Pandas

- It easily handles missing data
- It uses Series for one-dimensional data and DataFrame for
  two-dimensional (tabular) data
- It provides an efficient way to slice the data
- It provides a flexible way to merge, concatenate, or reshape the data
- It includes powerful time series tools
- It can store multiple data types
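
A few of these features (missing data, slicing, merging) can be sketched on a tiny table. The data, column names, and variable names below are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
scores = pd.DataFrame({
    "student": ["A", "B", "C"],
    "mark": [70.0, np.nan, 55.0],   # one missing value
})
groups = pd.DataFrame({
    "student": ["A", "B", "C"],
    "group": ["x", "y", "x"],
})

# Missing data: count it, fill it with the column mean, or drop the row
n_missing = scores["mark"].isna().sum()
filled = scores.fillna({"mark": scores["mark"].mean()})  # mean of 70 and 55
dropped = scores.dropna()

# Merge two tables on a shared key column
merged = scores.merge(groups, on="student")

# Slice the first two rows by position
first_two = merged.iloc[:2]
print(first_two)
```

Filling with the mean is only one possible strategy; which one is appropriate depends on the data.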

What is a Series?

A Series is a one-dimensional data structure. It can hold any data type, such as
integer, float, or string. It is useful when you want to perform computation or return
a one-dimensional array. A Series, by definition, cannot have multiple columns. For
multiple columns, use the DataFrame structure.
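
A minimal sketch of creating and computing with a Series. The values and index labels are made up for illustration.

```python
import pandas as pd

# A one-dimensional Series of floats with custom index labels
s = pd.Series([1.5, 2.5, 3.0], index=["a", "b", "c"])

total = s.sum()     # 1.5 + 2.5 + 3.0 = 7.0
doubled = s * 2     # arithmetic is applied to every element
b_value = s["b"]    # look up a value by its label

# A Series can also hold strings
names = pd.Series(["Ann", "Bob"])

print(s)
print(total, b_value)
```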

What Does a DataFrame Look Like?

[1] W3resource

Creating a DataFrame

You can convert a NumPy array to a pandas DataFrame with pd.DataFrame(). The
opposite is also possible: to convert a pandas DataFrame to an array, you can use
np.array().
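
The round trip between the two structures can be sketched as below. The column names "a" and "b" are an assumption added for illustration; `DataFrame.to_numpy()` is shown as an alternative that pandas also provides.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])

# NumPy array -> DataFrame (column names are illustrative)
df = pd.DataFrame(arr, columns=["a", "b"])
print(df)

# DataFrame -> NumPy array
back = np.array(df)
also_back = df.to_numpy()   # equivalent pandas method
```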

Reshaping with Pandas

Pandas provides various methods to reshape DataFrame and Series objects.
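
As a hedged sketch, three commonly used reshaping tools are `melt` (wide to long), `pivot` (long to wide), and `pd.concat` (stacking tables). The slide does not say which methods it covered, so this selection and the data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame({
    "city": ["Colchester", "London"],
    "2021": [10, 20],
    "2022": [11, 22],
})

# Wide -> long: one row per (city, year) pair
long = wide.melt(id_vars="city", var_name="year", value_name="value")

# Long -> wide again: years become columns, cities become the index
wide_again = long.pivot(index="city", columns="year", values="value")

# Stack two tables on top of each other
stacked = pd.concat([wide, wide], ignore_index=True)

print(long)
print(wide_again)
```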

Subset Row with Pandas

Subset Column with Pandas

Summarising Data with Pandas

Dealing with Missing Data using Pandas

Plotting with Pandas

Introduction to Matplotlib

What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a visualization
utility.

- Matplotlib was created by John D. Hunter.
- Matplotlib is open source and we can use it freely.
- Matplotlib is mostly written in Python; a few segments are written in C,
  Objective-C, and JavaScript for platform compatibility.

What Can Matplotlib Do?

[1] https://towardsdatascience.com/python-data-visualization-with-matplotlib-part-2-66f1307d42fb

Introduction to Matplotlib

Matplotlib is a library for making 2D plots in Python. It is designed with the
philosophy that you should be able to create simple plots with just a few commands:
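
A minimal sketch of such a plot is given here. The sine curve, labels, and output file name "sine.png" are illustrative assumptions; the `matplotlib.use("Agg")` line selects a non-interactive backend so the script runs without a display, and can be dropped in a notebook such as Colab.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Sample 100 points of sin(x) on [0, 2*pi]
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Create a figure with one set of axes and draw a line plot
fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")

# Label the plot
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A simple line plot")
ax.legend()

# Save the figure to a file (hypothetical file name)
fig.savefig("sine.png")
```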

[1] matplotlib

Matplotlib Types of Plot

[1] matplotlib

Matplotlib Plots Tweaking

[1] matplotlib

Matplotlib Plots Organisation

[1] matplotlib

Matplotlib Plots Label

[1] matplotlib

Matplotlib Plots Explore and Save

[1] matplotlib

What is a Version Control System (VCS)?

Version control systems are software tools for tracking and managing all the changes
made to the source code during project development. A VCS keeps a record of every
change made to the code. It also allows us to revert to a previous version
of the code if a mistake is made in the current version. Without a VCS in place, it
would be difficult to monitor the development of the project.
The three types of VCS are:

- Local Version Control System
- Centralized Version Control System
- Distributed Version Control System

What is a Local Version Control System?

A local version control system is located on your local machine.

- If the local machine crashes, it would not be possible to retrieve the files, and all
  the information will be lost.
- If anything happens to a single version, all the versions made after that will be
  lost.
- It is not possible to collaborate with others.

What is a Centralized Version Control System?

- There is a single central server that contains all the files related to the
  project.
- Many collaborators check out files from this single server (each has only a
  working copy).
- The problem is that if the central server crashes, almost everything related to the
  project will be lost.

What is a Distributed Version Control System?

- There are one or more servers and many collaborators, similar to the
  centralized system.
- The difference is that collaborators do not only check out the latest version: each
  collaborator has an exact copy (mirror) of the main repository (including
  its entire history) on their local machine.
- Each user has their own repository and a working copy. Even if the server
  crashes, we would not lose everything, as several copies reside on other
  computers.

Difference between Git and GitHub

- Git is a version control tool (software) to track the changes in the source code.
- GitHub is a web-based cloud service to host your source code (Git repositories). It
  is a centralized system.
- Git doesn't require GitHub, but GitHub requires Git.

Introduction to GitHub

What is GitHub?
GitHub is a code hosting platform for collaboration and version control. GitHub lets
you (and others) work together on projects.
What Can a GitHub Repository Do?

- A GitHub repository can be used to store a development project
- It can contain folders and any type of file (HTML, CSS, JavaScript, Documents,
  Data, Images)
- A GitHub repository should also include a licence file and a README file about
  the project
- A GitHub repository can also be used to store ideas, or any resources that you
  want to share

What Can GitHub Do?

[1] http://jr0cket.co.uk/
