Introduction to data science
Data science:
Data science is the domain of study that deals with vast volumes of data using modern tools
and techniques to find unseen patterns, derive meaningful information, and make business
decisions. Data science uses complex machine learning algorithms to build predictive
models.
The data used for analysis can come from many different sources and be presented in various
formats.
Data science is about extraction, preparation, analysis, visualization, and maintenance of
information. It is a cross-disciplinary field that uses scientific methods and processes to
draw insights from data.
The Data Science Lifecycle
Data science’s lifecycle consists of five distinct stages, each with its own tasks:
Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage
involves gathering raw structured and unstructured data.
Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data
Architecture. This stage covers taking the raw data and putting it in a form that can be used.
Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization. Data
scientists take the prepared data and examine its patterns, ranges, and biases to determine
how useful it will be in predictive analysis.
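Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining,
Qualitative Analysis. This is the core of the lifecycle. This stage involves performing
the various analyses on the data.
Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making.
In this final stage, analysts present the results in easily readable forms such as charts,
graphs, and reports.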
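As a rough illustration of the Maintain and Process stages described above, the following sketch cleans a small list of raw values and then summarizes it. The sample values and the helper names clean and summarize are made up for this example; a real project would typically work on much larger data with dedicated tools.

```python
from statistics import mean, median

# Hypothetical raw records; None, "" and "abc" stand for missing or invalid entries.
raw_ages = ["34", "29", None, "41", "", "29", "abc", "57"]


def clean(values):
    """Maintain stage: drop missing or non-numeric entries and convert the rest to int."""
    cleaned = []
    for value in values:
        if value is None:
            continue
        text = value.strip()
        if text.isdigit():
            cleaned.append(int(text))
    return cleaned


def summarize(values):
    """Process stage: describe the range and central tendency of the prepared data."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


ages = clean(raw_ages)
summary = summarize(ages)
print(ages)
print(summary)
```

Only five of the eight raw entries survive cleaning, which is why the Maintain stage comes before any analysis.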
Applications of data science in various fields
Major Applications of Data Science
1. In Search Engines
One of the most useful applications of Data Science is in search engines. When we want
to search for something on the internet, we mostly use search engines like Google, Yahoo,
Bing, etc. Data Science is used to return relevant results faster.
2. In Transport
Data Science has also entered the transport field, for example in driverless cars.
Driverless cars are expected to help reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm and, with the help
of Data Science techniques, analyzed to learn things such as the speed limit on highways,
busy streets, and narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry. Financial institutions always face
the problems of fraud and risk of losses. They therefore need to automate risk-of-loss
analysis in order to make strategic decisions for the company. Financial institutions also
use Data Science analytics tools to make predictions about the future.
For example, Data Science is central to the stock market, where it is used to examine past
behavior with historical data, with the goal of predicting future outcomes.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user
experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions
similar to our past choices, and we also get recommendations based on the most bought,
most rated, and most searched products. All of this is done with the help of Data Science.
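The "most bought" style of recommendation mentioned above can be sketched in a few lines. This is only a minimal popularity-based sketch; the purchase log, the user names, and the function most_bought are invented for illustration and do not describe how any particular website works.

```python
from collections import Counter

# Hypothetical purchase log: (user, product) pairs.
purchases = [
    ("asha", "phone case"),
    ("ravi", "phone case"),
    ("asha", "charger"),
    ("meena", "charger"),
    ("ravi", "earphones"),
    ("meena", "phone case"),
]


def most_bought(purchase_log, already_owned, top_n=2):
    """Return the top_n most-purchased products the user does not already own."""
    counts = Counter(product for _, product in purchase_log)
    ranked = [product for product, _ in counts.most_common()]
    return [p for p in ranked if p not in already_owned][:top_n]


owned_by_ravi = {product for user, product in purchases if user == "ravi"}
recommendations = most_bought(purchases, owned_by_ravi)
print(recommendations)
```

Personalized recommendations go further by also using each user's own past choices, but the counting step is a common starting point.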
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Tumor detection.
Drug discoveries.
Medical Image Analysis.
Virtual Medical Bots.
Genetics and Genomics.
Predictive Modeling for Diagnosis etc.
INTRODUCTION TO DATA SCIENCE
12 CSE NRCM P.LAKSHMI PRASANNA(ASST.PROFESSOR)
6. Image Recognition
Data Science is also used in image recognition. For example, when we upload a photo
with a friend on Facebook, Facebook suggests tags for the people in the picture. This is
done with the help of machine learning and Data Science. When an image is recognized,
the faces in it are compared against the user’s Facebook friends, and if a face in the
picture matches someone’s profile, Facebook suggests auto-tagging that person.
7. Targeting Recommendation
Targeting Recommendation is one of the most important applications of Data Science. Whatever
the user searches for on the internet, he/she will then see related posts and advertisements
everywhere.
For example, suppose I want a mobile phone, so I search for it on Google, and afterwards I
change my mind and decide to buy it offline. Data Science helps the companies that are paying
for advertisements for that mobile phone. Everywhere on the internet, on social media, on
websites, and in apps, I will see recommendations for the mobile phone I searched for,
which nudges me to buy it online.
8. Airline Routing Planning
With the help of Data Science, the airline sector is also growing: for example, it becomes
easier to predict flight delays. It also helps airlines decide whether to fly directly to
the destination or to take a halt in between. For example, a flight from Delhi to the U.S.A.
can take a direct route, or it can halt in between before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are
used together with machine learning, so that the computer improves its performance with the
help of past data. Many games, such as chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming, and it has to be done
with full discipline because it is a matter of someone’s life. Without Data Science, developing
a new medicine or drug takes a lot of time, resources, and money. With the help of Data
Science, it becomes easier because the likely success rate can be predicted from biological
data or factors. Algorithms based on data science can forecast how a drug will react in the
human body before lab experiments are carried out.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data
Science helps these companies to find the best route for the Shipment of their Products, the
best time suited for delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
Autocomplete is an important application of Data Science in which the user types just a few
letters or words and is offered a completion for the rest of the line. In Google Mail, when
we are writing a formal mail to someone, the autocomplete feature suggests an efficient way
to complete the whole line. Autocomplete is also widely used in search engines, in social
media, and in various apps.
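A very simple form of autocomplete can be sketched as a prefix lookup over a sorted list of known phrases. This is a minimal sketch under stated assumptions: the phrase list and the function autocomplete are hypothetical, and production systems usually rank suggestions with models learned from usage data rather than alphabetically.

```python
from bisect import bisect_left

# Hypothetical vocabulary of previously seen phrases, kept in sorted order.
PHRASES = sorted([
    "data analyst",
    "data engineer",
    "data science",
    "data security",
    "database administrator",
    "deep learning",
])


def autocomplete(prefix, phrases=PHRASES, limit=3):
    """Return up to `limit` phrases that start with `prefix`, in alphabetical order."""
    # In a sorted list, all phrases sharing a prefix sit next to each other,
    # starting at the first phrase that is >= the prefix.
    start = bisect_left(phrases, prefix)
    matches = []
    for phrase in phrases[start:]:
        if not phrase.startswith(prefix):
            break
        matches.append(phrase)
        if len(matches) == limit:
            break
    return matches


print(autocomplete("data s"))
print(autocomplete("data"))
```

Typing more letters narrows the suggestions, which is the behavior the user sees while writing a mail or a search query.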
Data security issues
What is Data Security?
Data security is the process of protecting corporate data and preventing data loss through
unauthorized access. This includes protecting your data from attacks that can encrypt or
destroy data, such as ransomware, as well as attacks that can modify or corrupt your data.
Data security also ensures data is available to anyone in the organization who has access
to it.
Some industries require a high level of data security to comply with data protection
regulations. For example, organizations that process payment card information must use
and store payment card data securely, and healthcare organizations in the USA must
secure protected health information (PHI) in line with the HIPAA standard.
Data Security vs Data Privacy
Data privacy is the distinction between data in a computer system that can be shared with
third parties (non-private data), and data that cannot be shared with third parties (private
data). There are two main aspects to enforcing data privacy:
Access control—ensuring that anyone who tries to access the data is authenticated to
confirm their identity, and authorized to access only the data they are allowed to access.
Data protection—ensuring that even if unauthorized parties manage to access the data,
they cannot view it or cause damage to it. Data protection methods include encryption,
which prevents anyone from viewing data if they do not have a private encryption key, and
data loss prevention mechanisms, which prevent users from transferring sensitive data
outside the organization.
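The access control aspect described above has two steps, authentication and authorization, which can be sketched as follows. Everything here is an assumption made for illustration: the user store, the permission table, the fixed salt, and the function names are hypothetical, and a real system would use per-user random salts and a maintained identity service.

```python
import hashlib
import hmac

SALT = b"demo-salt"  # hypothetical fixed salt, for illustration only


def hash_password(password):
    """Derive a password hash so the plain password is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), SALT, 100_000)


# Hypothetical user store and permission table.
USERS = {"analyst": hash_password("s3cret")}
PERMISSIONS = {"analyst": {"sales_report"}}


def authenticate(user, password):
    """Step 1: confirm the identity of whoever is trying to access the data."""
    stored = USERS.get(user)
    return stored is not None and hmac.compare_digest(stored, hash_password(password))


def can_read(user, password, resource):
    """Step 2: allow access only to the data this user is authorized to see."""
    return authenticate(user, password) and resource in PERMISSIONS.get(user, set())


print(can_read("analyst", "s3cret", "sales_report"))
print(can_read("analyst", "s3cret", "payroll"))
```

The second call is denied even though the password is correct, because the user is authenticated but not authorized for that resource.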
Data security has many overlaps with data privacy. The same mechanisms used to ensure
data privacy are also part of an organization’s data security strategy.
The primary difference is that data privacy mainly focuses on keeping data confidential,
while
EVALUATION METRICS
Model evaluation metrics measure the performance of a machine learning model, and
evaluating a model is an integral component of any data science project. The aim is to
estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.
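One common way to estimate performance on unseen data is to hold out part of the labelled data, fit the model on the rest, and measure accuracy only on the held-out part. The sketch below assumes a made-up data set and a deliberately trivial threshold model; the names train_test_split, fit_threshold, and accuracy are defined here for illustration and are not taken from any library.

```python
import random

# Hypothetical labelled data: (hours_studied, passed) pairs, made up for illustration.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]


def train_test_split(rows, test_size=3, seed=0):
    """Shuffle the rows and hold out `test_size` of them as unseen test data."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(threshold, rows):
    """Fraction of rows where predicting 'x >= threshold' matches the true label."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)


def fit_threshold(train_rows):
    """Pick the threshold that gives the best accuracy on the training rows only."""
    candidates = sorted({x for x, _ in train_rows})
    return max(candidates, key=lambda t: accuracy(t, train_rows))


train_rows, test_rows = train_test_split(data)
threshold = fit_threshold(train_rows)
test_accuracy = accuracy(threshold, test_rows)
print(threshold, test_accuracy)
```

The test rows play no part in choosing the threshold, so the accuracy measured on them is an estimate of how the model would behave on future data.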
Confusion Matrix
A confusion matrix is a matrix representation of the prediction results of a binary
classification test. It is often used to describe the performance of a classification
model (or “classifier”) on a set of test data for which the true values are known.
The confusion matrix itself is relatively simple to understand, but the related terminology
can be confusing.
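The terminology becomes easier to follow with a small worked example. The sketch below counts the four cells of a binary confusion matrix, using the standard names true positive (TP), false positive (FP), false negative (FN), and true negative (TN), and derives accuracy, precision, and recall from them. The label lists are made up for illustration.

```python
def confusion_matrix(actual, predicted):
    """Count the four outcomes of a binary classifier (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1  # predicted positive, actually positive
        elif a == 0 and p == 1:
            counts["FP"] += 1  # predicted positive, actually negative
        elif a == 1 and p == 0:
            counts["FN"] += 1  # predicted negative, actually positive
        else:
            counts["TN"] += 1  # predicted negative, actually negative
    return counts


# Hypothetical true labels and model predictions for ten test examples.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])
print(cm)
print(accuracy, precision, recall)
```

Here the classifier makes one false positive and one false negative, giving an accuracy of 0.8 and a precision and recall of 0.75 each.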
Evolution of Data Science: Growth & Innovation
Data science was born from the idea of merging applied statistics with computer science.
The resulting field of study would use the extraordinary power of modern computing.
Scientists realized they could not only collect data and solve statistical problems but also
use that data to solve real-world problems and make reliable fact-driven predictions.
1962: American mathematician John W. Tukey first articulated the data science dream. In
his now-famous article “The Future of Data Analysis,” he foresaw the inevitable emergence
of a new field nearly two decades before the first personal computers. While Tukey was
ahead of his time, he was not alone in his early appreciation of what would come to be
known as “data science.”
1977: The theories and predictions of “pre” data scientists like Tukey and Peter Naur became
more concrete with the establishment of The International Association for Statistical
Computing (IASC), whose mission was “to link traditional statistical methodology, modern
computer technology, and the knowledge of domain experts in order to convert data into
information and knowledge.”
1980s and 1990s: Data science began taking more significant strides with the emergence
of the first Knowledge Discovery in Databases (KDD) workshop and the founding of the
International Federation of Classification Societies (IFCS).
1994: Business Week published a story on the new phenomenon of “Database Marketing.”
It described the process by which businesses were collecting and leveraging enormous
amounts of data to learn more about their customers, competition, or advertising
techniques.
1990s and early 2000s: Data science clearly emerged as a
recognized and specialized field. Several data science academic journals began to
circulate, and data science proponents like Jeff Wu and William S. Cleveland continued to
help develop and expound upon the necessity and potential of data science.
2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook
uncovering large amounts of data, new technologies capable of processing them became
necessary. Hadoop rose to the challenge, and later on Spark and Cassandra made their
debuts.
2014: Due to the increasing importance of data, and organizations’ interest in finding
patterns and making better business decisions, demand for data scientists began to see
dramatic growth in different parts of the world.
2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the
realm of data science.
2018: New data protection regulations, such as the General Data Protection Regulation (GDPR)
in the European Union, are perhaps one of the most significant developments in the evolution
of data science.
2020s: We are seeing additional breakthroughs in AI and machine learning, and an
ever-increasing demand for qualified professionals in Big Data.
Roles in Data Science
Data Analyst
Data Engineers
Database Administrator
Machine Learning Engineer
Data Scientist
Data Architect
Statistician
Business Analyst
Data and Analytics Manager
1. Data Analyst
Data analysts are responsible for a variety of tasks including visualisation, munging, and
processing of massive amounts of data. They also have to perform queries on the
databases from time to time. One of the most important skills of a data analyst is
optimization.
Some important roles and responsibilities of a data analyst include:
Extracting data from primary and secondary sources using automated tools
Developing and maintaining databases
Performing data analysis and making reports with recommendations
To become a data analyst: SQL, R, SAS, and Python are some of the sought-after
technologies for data analysis.
2. Data Engineers
Data engineers build and test scalable Big Data ecosystems for businesses so that
data scientists can run their algorithms on data systems that are stable and highly
optimized. Data engineers also update the existing systems with newer or upgraded
versions of the current technologies to improve the efficiency of the databases.
Some important roles and responsibilities of a data engineer include:
Design and maintain data management systems
Data collection/acquisition and management
Conducting primary and secondary research
To become a data engineer: technologies that require hands-on experience include Hive,
NoSQL, R, Ruby, Java, C++, and MATLAB.
3. Database Administrator
The job profile of a database administrator is largely self-explanatory: they are
responsible for the proper functioning of all the databases of an enterprise, and they grant
or revoke access to them for employees of the company depending on their requirements.
Some important roles and responsibilities of a database administrator include:
Working on database software to store and manage data
Working on database design and development
Implementing security measures for databases
Preparing reports, documentation, and operating manuals
To become a database administrator: required skills include database backup and recovery,
data security, and data modeling and design.
4. Machine Learning Engineer
Machine learning engineers are in high demand today. However, the job profile comes with
its challenges. Apart from having in-depth knowledge of some of the most powerful
technologies such as SQL, REST APIs, etc., machine learning engineers are also expected
to perform A/B testing, build data pipelines, and implement common machine learning
algorithms such as classification, clustering, etc.
Some important roles and responsibilities of a machine learning engineer include:
Designing and developing Machine Learning systems
Researching Machine Learning Algorithms
Testing Machine Learning systems
Developing apps/products based on client requirements
To become a machine learning engineer: you need experience with technologies like Java,
Python, JS, etc. You should also have a strong grasp of statistics and mathematics.
5. Data Scientist
Data scientists have to understand the challenges of business and offer the best solutions
using data analysis and data processing. For instance, they are expected to perform
predictive analysis and run a fine-toothed comb through “unstructured/disorganized” data
to offer actionable insights.
Some important roles and responsibilities of a data scientist include:
Identifying data collection sources for business needs
Processing, cleansing, and integrating data
Automating data collection and management processes
Using Data Science techniques/tools to improve processes
To become a data scientist, you have to be an expert in R, MATLAB, SQL, Python, and other
complementary technologies.
6. Data Architect
A data architect creates the blueprints for data management so that the databases can be
easily integrated, centralized, and protected with the best security measures. They also
ensure that the data engineers have the best tools and systems to work with.
Some important roles and responsibilities of a data architect include:
Developing and implementing an overall data strategy in line with the business/organization
Identifying data collection sources in line with data strategy
Collaborating with cross-functional teams and stakeholders for smooth functioning of
database systems
Planning and managing end-to-end data architecture
To become a data architect: the role requires expertise in data warehousing, data modelling,
extraction, transformation, and loading (ETL), etc. You must also be well versed in Hive, Pig,
Spark, etc.
7. Statistician
A statistician, as the name suggests, has a sound understanding of statistical theories and
data organization. Not only do they extract and offer valuable insights from the data
clusters, but they also help create new methodologies for the engineers to apply.
Some important roles and responsibilities of a statistician include:
Collecting, analyzing, and interpreting data
Analyzing data, assessing results, and predicting trends/relationships using statistical
methodologies/tools
Designing data collection processes
To become a statistician: required skills include SQL, data mining, and the various machine
learning technologies.
8. Business Analyst
The role of business analysts is slightly different from other data science jobs. While they
do have a good understanding of how data-oriented technologies work and how to handle
large volumes of data, they also separate the high-value data from the low-value data.
Some important roles and responsibilities of a business analyst include:
Understanding the business of the organization
Conducting detailed business analysis – outlining problems, opportunities, and solutions
Working on improving existing business processes
To become a business analyst: the role requires an understanding of business finances and
business intelligence, as well as IT technologies like data modelling, data visualization tools, etc.