Introduction to Data Science
Unit 1
What is Data Science
⚫ Data science – getting insight from data.
⚫ Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.
⚫ This aspect of data science is all about uncovering findings from data: diving in at a granular level to mine and understand complex behaviors, trends, and inferences. It's about surfacing hidden insight that can help companies make smarter business decisions.
⚫ "Data science is the discipline of making data useful."


Data All Around
⚫Lots of data is being collected and
warehoused
⚫Web data, e-commerce
⚫Financial transactions, bank/credit
transactions
⚫Online trading and purchasing
⚫Social Network
Types of Data We Have
⚫Relational Data (Tables/Transaction/Legacy
Data)
⚫Text Data (Web)
⚫Semi-structured Data (XML)
⚫Graph Data
⚫Social Network, Semantic Web (RDF), …
⚫Streaming Data
Big Data and Data Science
⚫ Over the last decade there's been a massive explosion in both the data generated and retained by companies. Sometimes we call this "big data," and we'd like to build something with it.
⚫ Data scientists are the people who make sense of all this data and figure out just what can be done with it.
⚫ Data Scientist has been called the "sexiest job title of the 21st century."
Data Science: A Multidisciplinary Field
⚫ Data science is an extension of various data analysis fields such as data mining, statistics, predictive analysis and many more.
⚫ Data Science is a huge field that uses a lot of methods and concepts which belong to other fields like information science, statistics, mathematics, and computer science.
⚫ Some of the techniques utilized in Data Science encompass machine learning, visualization, pattern recognition, probability models, data engineering, signal processing, etc.
A Brief History of Data Science
⚫ The term “Data Science” has emerged only recently to
specifically designate a new profession that is expected to
make sense of the vast stores of big data.
⚫ But making sense of data has a long history and has been
discussed by scientists, statisticians, librarians, computer
scientists and others for years. The following timeline traces
the evolution of the term “Data Science” and its use, attempts
to define it, and related terms.
⚫ Early Beginnings (1960s-1970s):
The roots of data science trace back to the world of statistics,
a discipline with a history dating back centuries. In the 1960s
and 1970s, as computers began to emerge, statisticians and
computer scientists started exploring ways to analyze and
visualize data using these early computing machines.
A Brief History of Data Science
⚫ Emergence of Data Mining (1980s-1990s):
Fast forward to the 1980s and 1990s, and we see the birth of data
mining as a distinct field. Researchers began developing algorithms to
uncover meaningful patterns and insights buried within vast datasets.
They started working on classification, clustering, and association rule
mining.
⚫ Growth of Data Warehousing (1990s):
Around the same time, data warehousing technologies became more
prominent. These systems allowed organizations to centralize and
manage large volumes of data effectively.

⚫ Rise of Machine Learning (1990s-Present):
Machine learning came into the limelight in the 1990s, with significant
progress in algorithms and techniques. This paved the way for
predictive modeling and data-driven decision-making. Advances in
computational power and the availability of labeled data fueled the
growth of machine learning.
A Brief History of Data Science
⚫ Big Data Era (2000s-Present):
The 2000s ushered in the era of big data, marked by the explosive
growth of data thanks to the internet, social media, and sensors.
Technologies like Hadoop and MapReduce were developed to
process and analyze these massive datasets.
⚫ Data Science as a Discipline (2000s-Present):
In the early 2000s, the term “data science” gained popularity as a
way to describe this interdisciplinary field that combines statistics,
computer science, domain expertise, and data analysis. Universities
and organizations started offering data science programs and
certifications.
⚫ Data Science in Industry (2010s-Present):
Data science became indispensable across various industries like
finance, healthcare, marketing, and e-commerce. Companies like
Google, Facebook, and Amazon played pivotal roles in shaping the
field and its applications.
A Brief History of Data Science
⚫ Tools and Frameworks (2010s-Present):
The open-source ecosystem for data science tools and
frameworks, including Python, R, TensorFlow, and scikit-
learn, expanded rapidly, making data analysis more
accessible to a broader audience.

The history of data science continues to evolve alongside
technology and the ever-growing ocean of data. Today, data
science is a crucial discipline in our modern world, with
applications spanning a wide array of domains. It's likely to
remain a dynamic and evolving field for years to come.
What do Data Scientists do?
⚫ Data scientists are the key to realizing the opportunities
presented by big data.
⚫ They bring structure to it, find compelling patterns in it, and
advise executives on the implications for products, processes,
and decisions.
⚫ Some of the application areas are:
⚫ Internet Search
⚫ Digital Advertisements
⚫ Recommender Systems like Amazon, Netflix
⚫ Image Recognition, e.g., tagging your friends
⚫ Speech Recognition, e.g., Google Voice, Siri, Cortana, etc.
⚫ Healthcare, e.g., disease prediction
⚫ Airline Route Planning
⚫ Fraud and Risk Detection
⚫ Delivery logistics e.g. DHL, FedEx
Data Science Use Cases
⚫ The evolution of data science and advanced forms of
analytics has given rise to a wide range of applications that
are providing better insights and business value in the
enterprise.

⚫ While many different types of organizations are implementing analytics applications driven by data
science, those applications are mostly focused on areas
that have proven their value over the past decade. By
digging deeper into them, businesses can gain benefits that
include competitive advantages over business rivals; better
service to customers, citizens, users and patients; and the
ability to respond more effectively to a rapidly changing
business environment that demands continuous adaptation.
Data Science Use Cases
Pattern recognition
⚫ Identifying patterns in data sets is a fundamental data science task.
For example, pattern recognition helps retailers and e-commerce
companies spot trends in customer purchasing behavior.
⚫ Companies such as Amazon and Walmart have long used data science
approaches to discover purchasing patterns. In one interesting early
example, Walmart noticed that many customers making purchases in
anticipation of a hurricane or tropical storm also bought strawberry Pop-
Tarts. Such correlations, often unexpected, can help drive more effective
purchasing, inventory management and marketing strategies.

Predictive modeling
⚫ While predictive analytics has been around for decades, data science
applies machine learning and other algorithmic approaches to large
data sets to improve decision-making capabilities by creating models
that better predict customer behavior, financial risks, market trends and
more.
Data Science Use Cases
⚫ Predictive analytics applications are used in a wide range of
industries, including financial services, retail, manufacturing,
healthcare, travel and government.
⚫ For example, manufacturers use predictive maintenance
systems to help reduce equipment breakdowns and improve
production uptime. Airplane makers Boeing and Airbus also
depend on predictive maintenance to improve their fleet
availability. Similarly, Chevron, BP and other companies in the
energy sector use predictive modelling to improve equipment
reliability in settings where maintenance is difficult and costly
to perform.

Recommendation engines and personalization systems


⚫ It traditionally has been very difficult to tailor products and
services to the specific needs of individuals; doing so was too
time-consuming and costly.
Data Science Use Cases
⚫ Fortunately, the combination of data science, machine learning
and big data now enables organizations to build a detailed profile
of individual customers. Over time, their systems can learn
people's preferences and match them with others who have
similar preferences -- an approach known as hyper-
personalization.
⚫ Companies such as Home Depot, Lowe's and Netflix use hyper-
personalization techniques driven by data science to better focus
their offerings to customers through recommendation engines
and personalized marketing.
In Delivery Logistics
⚫ Various Logistics companies like DHL, FedEx, etc. make use of
Data Science. Data Science helps these companies to find the
best route for the Shipment of their Products, the best time
suited for delivery, the best mode of transport to reach the
destination, etc.
Data Science Use Cases
Medicine and Drug Development
⚫ The process of creating medicine is very difficult and
time-consuming and has to be done with full discipline
because it is a matter of someone's life.
⚫ Without Data Science it takes lots of time, resources, and
finance to develop a new medicine or drug, but with the
help of Data Science it becomes easier because the
prediction of the success rate can be determined based on
biological data and factors. Algorithms based on data
science can forecast how a compound will react in the
human body without lab experiments.
Myths about Data Science
⚫ Ph.D is Mandatory to Become a Data Scientist
⚫ A Full-Time Data Science Degree is a Must for Making the Transition
⚫ All your previous Work Experience will Translate to the Data Science
Domain
⚫ Necessary to have a Computer
Science/Mathematics/Statistics/Programming Background
⚫ Learning a Tool is Enough to Become a Data Scientist
⚫ Deep Learning Requires Computational Power that Only Top
Companies Have
⚫ Once Built, AI Systems will Continue to Evolve and Generalize by
Themselves
⚫ Data Science is Only About Building Predictive Models
⚫ Participating in Data Science Competitions Translates to Real-Life
Projects
⚫ Data Collection is a Breeze, the Focus should be on Building Models
Myths about Data Science
⚫ Data Science is all about building machine learning and deep
learning models
⚫ Only people with a programming or mathematical
background can become Data Scientists
⚫ Data Analysts, Data Engineers, and Data Scientists all
perform the same tasks
Choosing a data science language
⚫ Choosing the correct tool makes your life easier.
⚫ Data scientists usually use only a few languages
because they make working with data easier.
⚫ With this in mind, here are the four top languages for
data science work in order of preference (used by
91 percent of the data scientists out there):
⚫ Python
⚫R
⚫ SAS
⚫ SQL
Outlining the core competencies of a
data scientist
⚫ A data scientist requires knowledge of a broad range of skills
in order to perform the required tasks. In fact, so many
different skills are required that data scientists often work
in teams.
⚫ Someone who is good at gathering data might team up with
an analyst and someone gifted in presenting information.
⚫ It would be hard to find a single person with all the
required skills.
⚫ The following list describes areas in which a data scientist
could excel:
⚫ Data capture
⚫ Analysis
⚫ Presentation
Outlining the core competencies of a data
scientist
⚫Data capture:
⚫ Capturing data begins by managing a data source using
database management skills. However, raw data isn’t
particularly useful in many situations— you must also
understand the data domain. Finally, you must have
data‐modeling skills so that you understand how the
data is connected and whether the data is structured.
⚫Analysis:
⚫ You perform some analysis using basic statistical tool
skills, much like those that just about everyone learns in
college. However, the use of specialized math tricks and
algorithms can make patterns in the data more obvious
or help you draw conclusions.
Outlining the core competencies of a data
scientist
⚫Presentation:
⚫ It’s important to provide a graphical presentation of
these patterns to help others visualize what the numbers
mean and how to apply them in a meaningful way. More
important, the presentation must tell a specific story so
that the impact of the data isn’t lost.
Understanding the role of programming
⚫ A data scientist may need to know several programming
languages in order to achieve specific goals.
⚫ Manually performing these tasks is time consuming and error
prone, so programming presents the best method for achieving
the goal.
⚫ Given the number of products that most data scientists use, it may
not be possible to use just one programming language.
⚫ Yes, Python can load data, transform it, analyze it, and even
present it to the end user, but it works only when the language
provides the required functionality.

⚫ The languages you choose depend on a number of criteria.


⚫ How you intend to use data science in your code
⚫ Your familiarity with the language
⚫ The need to interact with other languages
⚫ The availability of tools to enhance the development environment
⚫ The availability of APIs and libraries to make performing tasks easier
Creating the Data Science Pipeline
The data science pipeline requires the data scientist to follow
particular steps in the preparation, analysis, and presentation of the
data.

⚫ Preparing the data - The data that you access from various sources
doesn't come in an easily packaged form, ready for analysis. You may
also need to transform it to make all the data sources cohesive and
amenable to analysis. Transformation may require changing data
types, the order in which data appears, and even the creation of data
entries based on the information provided by existing entries.

⚫ Performing exploratory data analysis - Data science provides access
to a wealth of statistical methods and algorithms that help you
discover patterns in the data. A single approach doesn't ordinarily do
the trick. You typically use an iterative process to rework the data from
a number of perspectives.
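
As a rough illustration of the preparation and exploratory steps described above, here is a minimal pandas sketch. It assumes a hypothetical CSV file named sales.csv with order_date, region, and amount columns; the file name and columns are illustrative only.

    # A minimal data-preparation sketch, assuming a hypothetical
    # sales.csv file with order_date, region, and amount columns.
    import pandas as pd

    df = pd.read_csv("sales.csv")

    # Change data types so the columns are amenable to analysis.
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["region"] = df["region"].astype("category")

    # Create a new entry derived from existing entries.
    df["order_month"] = df["order_date"].dt.to_period("M")

    # Reorder and sort so the data is cohesive and analysis-ready.
    df = df[["order_month", "region", "amount"]].sort_values("order_month")

    # A first exploratory pass: look at the data from several angles.
    print(df.describe())
    print(df.groupby("region")["amount"].mean())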
Creating the Data Science Pipeline
⚫ Learning from data - As you iterate through various statistical
analysis methods and apply algorithms to detect patterns, you begin
learning from the data. In fact, it's the fun part of data science
because you can't ever know in advance precisely what the data will
reveal to you. Of course, the imprecise nature of data and the finding
of seemingly random patterns in it means keeping an open mind.

⚫ Visualizing - Visualization means seeing the patterns in the data and
then being able to react to those patterns. It also means being able
to see when data is not part of the pattern. Think of yourself as a
data sculptor, removing the data that lies outside the patterns (the
outliers) so that others can see the masterpiece of information
beneath.

⚫ Obtaining insights and data products - The insights you obtain
from manipulating and analyzing the data help you to perform real-
world tasks. For example, you can use the results of an analysis to
make a business decision.
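
Continuing the hypothetical sales.csv example from the previous sketch, the snippet below hints at the learning, visualizing, and insight steps; the column names and the closing business rule are purely illustrative.

    # A sketch of the learning, visualizing, and insight steps,
    # reusing the hypothetical sales.csv data from the earlier example.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("sales.csv", parse_dates=["order_date"])

    # Learn from the data: aggregate it from another perspective.
    monthly = df.groupby(df["order_date"].dt.to_period("M"))["amount"].sum()

    # Visualize the pattern (and spot months that fall outside it).
    monthly.plot(kind="bar", title="Revenue by month")
    plt.tight_layout()
    plt.show()

    # Turn the analysis into a simple, data-driven business decision.
    best_month = monthly.idxmax()
    print(f"Consider increasing inventory ahead of {best_month}.")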
Understanding Python’s Role in Data Science

⚫ Many different ways are available for accomplishing data science
tasks.
⚫ Python represents one of the few single-stop solutions that you can
use to solve complex data science problems.
⚫ Instead of having to use a number of tools to perform a task, you can
simply use a single language, Python, to get the job done.
⚫ The Python difference is the large number of scientific and math
libraries created for it by third parties. Plugging in these libraries
greatly extends Python and allows it to easily perform tasks that other
languages could perform only with great difficulty.

⚫ A significant benefit of Python is that it supports four different coding styles:
⚫ Functional
⚫ Procedural
⚫ Object Oriented
⚫ Imperative
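
A small sketch contrasting the four styles on the same task (summing the squares of a list); it is an illustrative comparison, not an exhaustive treatment of each paradigm.

    # The same task, summing the squares of a list, in each style.
    numbers = [1, 2, 3, 4]

    # Imperative: change state step by step with explicit statements.
    total = 0
    for n in numbers:
        total += n * n

    # Procedural: the same logic wrapped in a reusable procedure.
    def sum_of_squares(values):
        result = 0
        for v in values:
            result += v * v
        return result

    # Object-oriented: data and behavior bundled together in a class.
    class SquareSummer:
        def __init__(self, values):
            self.values = values

        def total(self):
            return sum(v * v for v in self.values)

    # Functional: built from expressions and higher-order functions.
    functional_total = sum(map(lambda v: v * v, numbers))

    print(total, sum_of_squares(numbers),
          SquareSummer(numbers).total(), functional_total)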
Learning to Use Python Fast
⚫ Data science involves many different tasks. Python can
be used to perform various tasks of the data science
pipeline, such as:
⚫ Loading data
⚫ Training model
⚫ Visualizing data
Performing Rapid Prototyping and
Experimentation
⚫ Python is all about creating applications quickly and
then experimenting with them to see how things work.
⚫ The act of creating an application design in code
without necessarily filling in all the details is called
prototyping, and Python makes it fast.
⚫ Data science doesn’t rely on static solutions. You may
have to try multiple solutions to find the particular
solution that works best.
⚫ After you create a prototype, you use it to experiment
with various algorithms to determine which algorithm
works best in a particular situation.
Considering Speed of Execution
⚫The following factors control the speed of
execution for your data science application:
⚫Dataset size
⚫Loading technique
⚫Coding style
⚫Machine capability
⚫Analysis algorithm
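
As a rough, hedged illustration of how coding style alone can change execution speed, the snippet below times a plain Python loop against a vectorized NumPy call on the same array; the exact numbers depend on dataset size and machine capability.

    # Timing a Python-level loop versus a vectorized NumPy sum.
    import timeit
    import numpy as np

    data = np.random.rand(1_000_000)

    loop_time = timeit.timeit(lambda: sum(float(x) for x in data), number=3)
    vector_time = timeit.timeit(lambda: data.sum(), number=3)

    print(f"Python loop: {loop_time:.3f} s  NumPy vectorized: {vector_time:.3f} s")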
Using the Python Ecosystem for Data Science
This section provides an overview of the libraries you use for the
data science examples.
⚫ Accessing scientific tools using SciPy
The SciPy stack (http://www.scipy.org/) contains a host of other
libraries that you can also download separately. These libraries
provide support for mathematics, science, and engineering.
⚫ These libraries are
⚫ NumPy
⚫ SciPy
⚫ matplotlib
⚫ IPython
⚫ Sympy
⚫ Pandas
⚫ SciPy is a general‐purpose library that provides
functionality for multiple problem domains. It also provides
support for domain‐specific libraries, such as Scikit‐learn,
Scikit‐image, and statsmodels.
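
As a small, illustrative example of the kind of domain-specific functionality SciPy offers, the sketch below runs a two-sample t-test from scipy.stats on made-up measurements.

    # A two-sample t-test using scipy.stats on made-up measurements.
    import numpy as np
    from scipy import stats

    group_a = np.array([5.1, 5.3, 5.5, 5.0, 5.2])
    group_b = np.array([5.8, 6.0, 5.9, 6.1, 5.7])

    t_statistic, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_statistic:.2f}, p = {p_value:.4f}")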
Using the Python Ecosystem for Data Science
⚫ Performing fundamental scientific computing
using NumPy
The NumPy library (http://www.numpy.org/) provides the
means for performing n-dimensional array manipulation.
NumPy functions include support for linear algebra,
Fourier transforms, and random-number generation.
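
A short sketch of those NumPy capabilities; the values are arbitrary examples.

    # N-dimensional arrays, linear algebra, FFT, and random numbers.
    import numpy as np

    a = np.array([[1.0, 2.0], [3.0, 4.0]])             # a 2-D array
    inverse = np.linalg.inv(a)                          # linear algebra
    spectrum = np.fft.fft([0.0, 1.0, 0.0, -1.0])        # Fourier transform
    samples = np.random.default_rng(0).normal(size=3)   # random numbers

    print(inverse, spectrum, samples, sep="\n")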

⚫ Performing data analysis using pandas


The pandas library (http://pandas.pydata.org/) provides
support for data structures and data analysis tools. The
basic principle behind pandas is to provide data analysis
and modeling support for Python.
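
A minimal pandas sketch, using a small made-up table, of the data structures and analysis tools mentioned above.

    # Building a DataFrame and applying basic analysis tools.
    import pandas as pd

    scores = pd.DataFrame({
        "student": ["Asha", "Ben", "Chen", "Dina"],
        "subject": ["Math", "Math", "Stats", "Stats"],
        "marks": [78, 85, 91, 64],
    })

    print(scores.describe())                          # summary statistics
    print(scores.groupby("subject")["marks"].mean())  # grouped analysis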
Using the Python Ecosystem for Data Science
⚫ Implementing machine learning using Scikit‐learn
⚫ The Scikit‐learn library (http://scikit‐learn.org/stable/) is
one of a number of Scikit libraries that build on the
capabilities provided by NumPy and SciPy to allow Python
developers to perform domain‐specific tasks. It provides
access to the following sorts of functionality:
⚫ Classification
⚫ Regression
⚫ Clustering
⚫ Dimensionality reduction
⚫ Model selection
⚫ Preprocessing
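
A minimal, illustrative Scikit-learn sketch touching several of these areas (classification, model selection via a train/test split, and preprocessing) on the library's built-in iris dataset.

    # Classification with preprocessing and a train/test split.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Chain preprocessing and the classifier into one model.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
    model.fit(X_train, y_train)

    print("Test accuracy:", model.score(X_test, y_test))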
Using the Python Ecosystem for Data Science
⚫ Plotting the data using matplotlib
The matplotlib library (http://matplotlib.org/) provides you with a
MATLAB‐like interface for creating data presentations of the
analysis you perform.
The library focuses primarily on 2D output, but it still provides
you with the means to express graphically the data patterns you
see in the data you analyze. Without this library, you couldn’t
create output that people outside the data science community
could easily understand.
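
A minimal matplotlib sketch producing a simple 2D line plot; the data is an arbitrary sine curve used only for illustration.

    # A simple 2D line plot as a data presentation.
    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(0, 10, 100)
    y = np.sin(x)

    plt.plot(x, y, label="sin(x)")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title("A simple data presentation")
    plt.legend()
    plt.show()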

⚫ Parsing HTML documents using Beautiful Soup


The Beautiful Soup library (http://www.crummy.com/software/BeautifulSoup/)
download is actually found at https://pypi.python.org/pypi/beautifulsoup4/4.3.2.
This library provides the means for parsing HTML or XML data in a manner
that Python understands. It allows you to work with tree-based data.
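
A tiny Beautiful Soup sketch parsing an inline HTML snippet and walking its tree; the snippet itself is made up for illustration.

    # Parsing a small HTML snippet and navigating its tree.
    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <h1>Report</h1>
      <ul><li>alpha</li><li>beta</li></ul>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")
    print(soup.h1.text)                             # -> Report
    print([li.text for li in soup.find_all("li")])  # -> ['alpha', 'beta']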
Working at the command line or in the IDE
⚫ Anaconda is a product that makes using Python even
easier. It comes with a number of utilities that help you
work with Python in a variety of ways.
⚫ Anaconda is actually a compilation of several open source
applications. You can use these applications individually or
in cooperation with each other to achieve specific coding
goals.

⚫ Jupyter Notebook/IPython Notebook


⚫ Spyder
⚫ IPython QT Console
⚫ Anaconda Command Prompt
⚫ Python Interpreter
⚫ IPython Console
Creating new sessions with Anaconda
Command Prompt
⚫ Only one of the Anaconda utilities provides direct access to
the command line, Anaconda Command Prompt.
⚫ When you start this utility, you see a command prompt at
which you can type commands.
⚫ Using the Anaconda Command Prompt you can start the various
utilities through commands:
⚫ jupyter notebook to open Jupyter Notebook
⚫ spyder to open Spyder
⚫ ipython to open the IPython Console
⚫ python to open the Python interpreter
⚫ ipython qtconsole to open the IPython QT Console
Python Interpreter
⚫ The Python interpreter is in interactive mode when it reads
commands from a tty. The primary prompt is the following:
>>>
⚫ When it shows this prompt, it is waiting for the developer's
next command. This is the REPL (read-eval-print loop). Before it
prints the first prompt, the Python interpreter prints a welcome
message that also states its version number and a copyright
notice.
⚫ This is the secondary prompt, which denotes continuation lines:
...
⚫ You quit the interpreter by typing quit() and pressing Enter.
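
A short illustrative interactive session showing both prompts; the commands themselves are arbitrary.

    >>> total = 0
    >>> for n in range(3):
    ...     total += n
    ...
    >>> total
    3
    >>> quit()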


Entering the IPython environment
⚫ The Interactive Python (IPython) environment
provides enhancements to the standard Python
interpreter.
⚫ Run Native Shell Commands
⚫ Syntax Highlighting
⚫ Proper Indentation
⚫ Tab Completion
⚫ Documentation like str.capitalize?
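
For example, an IPython session might look roughly like the following (output abbreviated); the ? suffix displays documentation and ! runs a native shell command.

    In [1]: str.capitalize?
    Signature: str.capitalize(self, /)
    Docstring: Return a capitalized version of the string. ...

    In [2]: !echo hello
    hello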
Entering IPython QTConsole environment
⚫ It adds a GUI on top of IPython that makes using the
enhancements that IPython provides a lot easier.
Editing scripts using Spyder
⚫ Spyder is a fully functional Integrated Development
Environment (IDE). You use it to load scripts, edit
them, run them, and perform debugging tasks.
