UNIT I DATA SCIENCE
Data Science
Data science is the in-depth study of massive amounts of data. It involves extracting meaningful
insights from raw, structured, and unstructured data using scientific methods, different
technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate the data so that you can
find something new and meaningful.
Data science uses powerful hardware, programming systems, and efficient algorithms to solve
data-related problems. It is the foundation of modern artificial intelligence.
In short, we can say that data science is all about:
o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final result.
Example:
Suppose we want to travel from station A to station B by car. We need to make
decisions such as: which route will get us to the destination fastest, which route will avoid
traffic jams, and which will be most cost-effective. All these decision factors act as input data,
and the answer we derive from them is the product of data analysis, which is a part of data
science.
What is Data Science & Advantages and disadvantages of Data Science
Data science has become an essential part of any industry today. It’s a method for
transforming business data into assets that help organizations improve revenue, reduce costs, seize
business opportunities, improve customer experience, and more. Data science is one of the most
discussed topics in industry today. Its popularity has grown over the years, and companies
have started implementing data science techniques to grow their business and increase customer
satisfaction. Data science is the domain of study that deals with vast volumes of data using modern
tools and techniques to find unseen patterns, derive meaningful information, and make business
decisions.
Advantages of Data Science :- In today’s world, data is being generated at an enormous rate. Every
second, vast amounts of data are produced, whether by users of Facebook or other social networking
sites, by the calls that people make, or by different organizations. Because of this huge amount
of data, the field of Data Science has become highly valuable and offers a number of advantages.
Some of the advantages are mentioned below :-
Multiple Job Options :- Being in demand, it has given rise to a large number of career
opportunities in its various fields. Some of them are Data Scientist, Data Analyst, Research
Analyst, Business Analyst, Analytics Manager, Big Data Engineer, etc.
Business benefits :- Data Science helps organizations know how and when their products sell
best, so products can be delivered to the right place at the right time. Organizations can make
faster and better decisions to improve efficiency and earn higher profits.
Highly Paid jobs & career opportunities :- Data Scientist has repeatedly been ranked among the
most attractive jobs, and salaries for the position are correspondingly high. According to a Dice
Salary Survey, the average annual salary of a Data Scientist is $106,000.
Hiring benefits :- Data science has made it comparatively easier to sort through data and identify
the best candidates for an organization. Big Data and data mining have made the processing and
screening of CVs, aptitude tests, and recruitment games easier for hiring teams.
Disadvantages of Data Science :- Everything that comes with a number of benefits also has some
drawbacks. So let’s have a look at some of the disadvantages of Data Science :-
Data Privacy :- Data is the core component that can increase the productivity and revenue of an
industry by enabling game-changing business decisions. But the information or insights obtained
from the data can be misused against an organization, a group of people, or a committee.
Information extracted from structured as well as unstructured data can likewise be misused
against a group of people of a country or some committee.
Cost :- The tools used for data science and analytics can cost an organization a great deal, as
some of the tools are complex and require people to undergo training in order to use them. Also,
it is very difficult to select the right tools for the circumstances, because their selection
depends on proper knowledge of the tools as well as their accuracy in analyzing the data and
extracting information.
Facets of Data:
Big data and data science deal with very large amounts of data of many different types. The main
categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a row-and-column format, which helps applications retrieve and
process data easily. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure. The
most common form of structured data or records is a database where specific information is stored based
on a methodology of columns and rows.
Structured data is also searchable by data type within content. Structured data is understood by computers
and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
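The row-and-column idea can be sketched with Python's built-in sqlite3 module. The table, column names, and values below are purely illustrative, not from any real dataset:

```python
import sqlite3

# Structured data: a predefined schema of rows and columns in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Chennai"), (2, "Ravi", "Mumbai")],
)
# Because the structure is predefined, retrieval by column is straightforward.
rows = conn.execute("SELECT name FROM customers WHERE city = 'Chennai'").fetchall()
print(rows)  # [('Asha',)]
```

Any database management system would serve equally well; the point is that the schema is fixed before the data arrives.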
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not
used for unstructured data, so it is difficult to retrieve the required information. Unstructured
data has no identifiable structure.
• Unstructured data can be in the form of text (documents, email messages, customer feedback),
audio, video, or images. Email is an example of unstructured data.
• Even today, in most organizations more than 80% of the data is in unstructured form. It carries
a lot of information, but extracting information from these various sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
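As a small illustration of why unstructured data is harder to work with: extracting a specific field, such as an email address, requires pattern matching rather than a simple column lookup. The feedback text and regular expression below are illustrative:

```python
import re

# Unstructured text has no fixed schema, so a specific field (here, an email
# address) must be found by pattern matching instead of a column lookup.
feedback = "Great service! Contact me at priya@example.com or call after 6pm."
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", feedback)
print(emails)  # ['priya@example.com']
```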
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and sentences, then apply
meaning and understanding to that information. This helps machines to understand language as humans
do.
• Natural language processing is the driving force behind machine intelligence in many modern real-
world applications. The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion and sentiment analysis.
• For natural language processing to help machines understand human language, it must go through
speech recognition, natural language understanding, and machine translation. It is an iterative
process composed of several layers of text analysis.
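One of the NLP tasks listed above, sentiment analysis, can be caricatured with a toy lexicon-based scorer in plain Python. The word lists are invented for illustration; real systems use trained models rather than hand-written lexicons:

```python
# A toy lexicon-based sentiment scorer, a deliberately simplified sketch.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    # Count positive and negative words and compare the totals.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The support team was great and the product is excellent"))
```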
Machine-Generated Data
• Machine-generated data is information that is created without human interaction, as a result of
a computer process or application activity. This means that data entered manually by an end user
is not considered machine-generated.
• Machine data contains a definitive record of all activity and behavior of our customers, users,
transactions, applications, servers, networks, factory machinery and so on.
• It's configuration data, data from APIs and message queues, change events, the output of diagnostic
commands and call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data.
Machine data is generated continuously by every processor-based system, as well as many consumer-
oriented systems.
• It can be either structured or unstructured. In recent years, the volume of machine data has
surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based
services and RFID technologies, is making IT infrastructures more complex.
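As a concrete example of machine-generated data, the sketch below parses one fabricated web server log line in the Common Log Format using Python's re module:

```python
import re

# One line of a web server access log (Common Log Format); the line is invented.
line = '192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Groups: client IP, timestamp, method, path, status code, response size.
pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]+" (\d{3}) (\d+)'
m = re.match(pattern, line)
ip, timestamp, method, path, status, size = m.groups()
print(ip, method, path, status)  # 192.168.1.10 GET /index.html 200
```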
Graph-based or Network Data
• Graphs are data structures to describe relationships and interactions between entities in complex
systems. In general, a graph contains a collection of entities called nodes and another collection of
interactions between a pair of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our problem domain. By
connecting nodes with edges, we will end up with a graph (network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents. Data is stored just like
we might sketch ideas on a whiteboard. Our data is stored without restricting it to a predefined model,
allowing a very flexible way of thinking about and using it.
• Graph databases are used to store graph-based data and are queried with specialized query languages
such as SPARQL.
Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use
relationships to process financial and purchase transactions in near-real time. With fast graph queries,
we are able to detect that, for example, a potential purchaser is using the same email address and credit
card as included in a known fraud case.
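The shared-attribute fraud pattern described above can be sketched in plain Python by grouping purchasers on each attribute. A graph database does this at scale over relationships, but the idea is the same; the purchase records are invented:

```python
from collections import defaultdict

# Group buyers by shared attributes (email, card); any attribute used by more
# than one buyer is flagged, mimicking the fraud-detection query described above.
purchases = [
    {"buyer": "alice",   "email": "a@example.com", "card": "1111"},
    {"buyer": "bob",     "email": "b@example.com", "card": "2222"},
    {"buyer": "mallory", "email": "a@example.com", "card": "1111"},
]

by_attribute = defaultdict(set)
for p in purchases:
    by_attribute[("email", p["email"])].add(p["buyer"])
    by_attribute[("card", p["card"])].add(p["buyer"])

suspicious = {attr: buyers for attr, buyers in by_attribute.items() if len(buyers) > 1}
print(suspicious)
```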
• Graph databases can also help users easily detect relationship patterns, such as multiple people
associated with a personal email address, or multiple people sharing the same IP address but
residing at different physical addresses.
• Graph databases are a good choice for recommendation applications. With graph databases, we can
store in a graph relationships between information categories such as customer interests, friends and
purchase history. We can use a highly available graph database to make product recommendations to a
user based on which products are purchased by others who follow the same sport and have similar
purchase history.
• Graph theory was probably the main method in social network analysis in the early history of the
social network concept. The approach is applied to social network analysis in order to determine
important features of the network, such as the nodes and links (for example, influencers and their
followers).
• Influencers on social network have been identified as users that have impact on the activities or opinion
of other users by way of followership or influence on decision made by other users on the network as
shown in Fig. 1.2.1.
• Graph theory has proved very effective on large-scale datasets such as social network data,
because it can bypass building an actual visual representation of the data and run directly on
data matrices.
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are
trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for sound/music
and moving-picture information. Audio and video digital recordings, encoded by audio and video
codecs, can be uncompressed, lossless compressed, or lossy compressed depending on the desired
quality and use case.
• It is important to remark that multimedia data is one of the most important sources of information and
knowledge; the integration, transformation and indexing of multimedia data bring significant challenges
in data management and analysis. Many challenges have to be addressed including big data,
multidisciplinary nature of Data Science and heterogeneity.
• Data Science is playing an important role to address these challenges in multimedia data. Multimedia
data usually contains various forms of media, such as text, image, video, geographic coordinates and
even pulse waveforms, which come from multiple sources. Data Science can be a key instrument
covering big data, machine learning and data mining solutions to store, handle and analyze such
heterogeneous data.
Streaming Data
Streaming data is data that is generated continuously by thousands of data sources, which typically send
in the data records simultaneously and in small sizes (order of Kilobytes).
• Streaming data includes a wide variety of data such as log files generated by customers using your
mobile or web applications, ecommerce purchases, in-game player activity, information from social
networks, financial trading floors or geospatial services and telemetry from connected devices or
instrumentation in data centers.
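A minimal sketch of the streaming idea in Python: records arrive one at a time in small sizes, and an aggregate is maintained incrementally instead of waiting for a batch. The event source and sizes below are invented:

```python
# Each record is small (kilobytes), and processing happens as records arrive.
def stream_of_events():
    # Stand-in for a real source such as a message queue or a log tail.
    for size_kb in [2, 5, 1, 7, 3]:
        yield {"bytes": size_kb * 1024}

total = 0
for event in stream_of_events():
    total += event["bytes"]  # aggregate incrementally, record by record
print(total)  # 18432
```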
Difference between Structured and Unstructured Data

Structured Data                              Unstructured Data
Arranged in rows and columns                 No predefined format or identifiable structure
Stored in database management systems        Stored as text, email, audio, video and images
Easy to retrieve and process                 Difficult to retrieve required information
Example: an Excel table                      Example: an email message
Benefits of Data Science
Data science has many benefits, and it's quickly becoming an essential tool for businesses of all sizes.
1. Improved Decision-Making
By using data to address problems and inform viewpoints, data scientists play a critical role in allowing
better decision-making. To analyze and process massive datasets and to extract insightful data, they use
a variety of methodologies. Data scientists' work offers data-driven insights that can enable companies
and organizations to make wise decisions.
A data scientist might examine patient data in a healthcare organization, for instance, to find trends and
patterns that can improve patient outcomes. In the retail sector, data analysis may be used to develop
new goods and services and to have a better understanding of consumer behavior.
2. Increased Efficiency
Business operations can be made more efficient and costs can be cut with the use of data science.
Businesses can spot inefficiencies and potential improvement areas by analyzing data. Afterwards,
modifications that boost efficiency while cutting expenses can be made using the knowledge.
To analyze its supply chain and locate bottlenecks that are creating delays, for instance, a corporation
could use data science. The organization can shorten delivery times and boost overall efficiency by
altering their supply chain in response to this information.
3. Enhanced Customer Experience
Discovering customer preferences and behavior can be accomplished through data analysis. The
customer experience can be improved by using this information to create goods and services that are
catered to the needs of the user.
Using data science, a business may, for example, analyze prior customer purchases and make customized
product recommendations. The probability of repeat business might rise as a result of this.
4. Competitive Advantage
By empowering them to make better decisions and discover new opportunities, data science may provide
firms a competitive edge. Businesses may remain competitive by utilizing data to obtain insights into
their processes and customers.
A store, for instance, could use data science to examine sales data and spot fresh trends. Based on this
knowledge, the merchant can create new products or change their marketing plan to benefit from these
trends before their rivals.
5. Predictive Analytics
Based on past data, data science can be used to forecast future results. Businesses can find trends and
forecast future occurrences by using machine learning algorithms to analyze massive datasets. A
healthcare professional could, for instance, use data science to identify the individuals most at risk of
contracting a specific disease and provide preventive care due to this predictive analysis.
6. Personalized Marketing and Customer Segmentation
Organizations can segment their consumer bases and develop individualized marketing efforts using data
science. Businesses may send tailored and relevant communications that increase customer engagement
and conversion rates by analyzing consumer data and behavior. This allows them to better understand
individual preferences and needs.
For instance, a retail business can utilize data science approaches to recognize high-value clients and
develop tailored marketing campaigns or loyalty schemes to improve client retention. Similar to this, an
e-commerce platform can make pertinent product recommendations based on a user's browsing history
and buying habits by using customer segmentation.
7. Better Healthcare Outcomes
The healthcare sector could undergo a transformation because of data science. Data scientists can gain
insights to increase diagnosis precision, optimize treatment strategies, and improve patient care,
eventually resulting in better healthcare outcomes, by analyzing patient data, medical records, and
clinical studies.
Additionally, by taking into account a patient's unique traits, such as genetics, lifestyle, and previous
treatment outcomes, data science enables the optimization of treatment programs. Data scientists can
find patterns and connections in large-scale clinical data that help them choose the best treatments for
certain patient profiles.
8. Efficient Resource Allocation
Utilizing data on resource utilization, demand trends, and supply chain dynamics, data science aids
organizations in maximizing resource allocation. As a result, waste is reduced and operational efficiency
is increased while resources like inventory, people, and equipment are appropriately allocated.
9. Continuous Improvement
Organizations with a culture of continual development benefit from data science. Organizations can
assess performance, monitor advancement, and pinpoint areas for development by analyzing data. This
data-driven strategy encourages an attitude of constant improvement and innovation.
10. Innovation and New Opportunities
Last but not least, data science may help companies innovate and spot new opportunities. Data science
is becoming a driving force behind innovation, allowing companies to find fresh perspectives and
untapped potential. Additionally, data science can find new business prospects by examining competition
data, market dynamics, and consumer behavior.
Data science-driven innovation goes beyond just product creation. Additionally, it can apply to process
innovation, in which businesses employ data analysis to spot inefficiencies, bottlenecks, and potential
for automation or optimization.
What Is the Data Science Process?
The data science process is a systematic approach to solving a data problem. It provides a
structured framework for articulating your problem as a question, deciding how to solve it, and
then presenting the solution to stakeholders.
Data Science Life Cycle
Another term for the data science process is the data science life cycle. The terms can be used
interchangeably, and both describe a workflow process that begins with collecting data, and ends
with deploying a model that will hopefully answer your questions. The steps include:
Framing the Problem
Understanding and framing the problem is the first step of the data science life cycle. This framing
will help you build an effective model that will have a positive impact on your organization.
Collecting Data
The next step is to collect the right set of data. High-quality, targeted data—and the mechanisms
to collect them—are crucial to obtaining meaningful results. Since much of the roughly 2.5
quintillion bytes of data created every day come in unstructured formats, you’ll likely need to
extract the data and export it into a usable format, such as a CSV or JSON file.
Cleaning Data
Most of the data you collect during the collection phase will be unstructured, irrelevant, and
unfiltered. Bad data produces bad results, so the accuracy and efficacy of your analysis will
depend heavily on the quality of your data.
Cleaning data eliminates duplicate and null values, corrupt data, inconsistent data types, invalid
entries, missing data, and improper formatting.
This step is the most time-intensive process, but finding and resolving flaws in your data is
essential to building effective models.
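The cleaning step can be sketched in plain Python on a small invented dataset, removing duplicate records and records with missing values:

```python
# Remove exact duplicates and rows with missing values from invented records.
raw = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},    # duplicate record
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 29},
]

seen = set()
clean = []
for row in raw:
    key = (row["id"], row["age"])
    if key in seen or row["age"] is None:
        continue  # drop duplicates and rows with null values
    seen.add(key)
    clean.append(row)
print(clean)  # [{'id': 1, 'age': 34}, {'id': 3, 'age': 29}]
```

Real projects would typically use a library such as pandas for this, but the operations are the same.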
Exploratory Data Analysis (EDA)
Now that you have a large amount of organized, high-quality data, you can begin conducting an
exploratory data analysis (EDA). Effective EDA lets you uncover valuable insights that will be
useful in the next phase of the data science life cycle.
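A tiny EDA sketch using Python's statistics module: summary statistics give a first view of a variable's shape before modeling. The values below are invented:

```python
import statistics

# Summary statistics for one invented variable, a first step of EDA.
ages = [23, 29, 31, 34, 35, 41, 44, 52]
print("mean:", statistics.mean(ages))      # 36.125
print("median:", statistics.median(ages))  # 34.5
print("stdev:", round(statistics.stdev(ages), 2))
```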
Model Building and Deployment
Next, you’ll do the actual data modeling. This is where you’ll use machine learning, statistical
models, and algorithms to extract high-value insights and predictions.
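The modeling step can be illustrated by fitting a simple least-squares line to invented data in plain Python. Real projects would typically use a library such as scikit-learn, but the principle is the same:

```python
# Fit y = slope * x + intercept by least squares on invented data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

def predict(x):
    return slope * x + intercept

print(predict(6))  # 12.0
```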
The big data ecosystem and data science
Distributed file systems
A distributed file system is similar to a normal file system, except that it runs on multiple servers
at once. Because it is a file system, you can do almost all the same things you would do on a normal
file system. Actions such as storing, reading, and deleting files and adding security to files are at
the core of every file system, including the distributed one. Distributed file systems have some
significant advantages:
• They can contain files larger than any one computer disk.
• Files get automatically replicated across multiple servers for redundancy or parallel operations,
while hiding the complexity of doing so from the user.
• The system scales easily: you are no longer bound by the memory or storage restrictions of a
single server.
Distributed programming framework
Once you have the data stored on the distributed file system, you want to exploit it. An important
aspect of working on a distributed hard disk is that you will not move your data to your program,
but rather you will move your program to the data. When you start from scratch with a normal
general-purpose programming language such as C, Python, or Java, you need to deal with the
complexities that come with distributed programming such as restarting jobs that have failed,
tracking the results from the different subprocesses, and so on. Luckily, the open-source
community has developed many frameworks to handle this for you and give you a much better
experience working with distributed data and dealing with many of the challenges it carries.
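The map-reduce style used by these frameworks can be imitated on a single machine in Python: the "map" step runs independently on each chunk of data, and the "reduce" step merges the partial results. The text chunks are invented:

```python
from collections import Counter

# Single-machine sketch of the map-reduce idea: map over chunks, then merge.
chunks = [
    "big data needs distributed processing",
    "distributed processing moves the program to the data",
]

def map_phase(chunk):
    return Counter(chunk.split())          # partial word counts for one chunk

partials = [map_phase(c) for c in chunks]  # in a real cluster: different servers
total = sum(partials, Counter())           # reduce: merge the partial counts
print(total["data"], total["processing"])  # 2 2
```

A framework like Hadoop additionally handles failed jobs, scheduling, and data locality, which this sketch deliberately ignores.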
Data integration framework
Once you have a distributed file system in place, you need to add some data. This means that you
need to move data from one source to another, and this is where the data integration frameworks
such as Apache Sqoop and Apache Flume excel. The process is similar to an extract, transform,
and load process in a traditional data warehouse.
Machine learning frameworks
When you have the data in place, it’s time to extract the coveted insights. This is where you rely
on the fields of machine learning, statistics, and applied mathematics. Before World War II,
everything had to be calculated by hand, which severely limited the possibilities of data
analysis. After World War II, computers and scientific computing were developed; a single
computer could do all the counting and calculations, and a world of opportunities opened up. Ever
since this breakthrough, people need only derive the mathematical formulas, write them as an
algorithm, and load their data into the analysis software.
NoSQL databases
If you need to store huge amounts of data, you require software that is specialized in managing
and querying this data. Traditionally this has been the playing field of relational databases such as
Oracle SQL, MySQL, Sybase IQ, and others. While they still are the go-to technology for many
use cases, new types of databases have emerged under the grouping of NoSQL databases.
The name of this group can be misleading as “No” in this context stands for “Not Only.” A lack
of functionality in SQL is not the biggest reason for the paradigm shift, and many of the NoSQL
databases have implemented a version of SQL themselves. But traditional databases had some
shortcomings that did not allow them to scale well. By solving some of the problems of traditional
databases, NoSQL databases allow for a virtually endless growth of data.
Many different types of databases have arisen, but they can be categorized into the following types:
• Column databases - Data is stored in columns, which allows some algorithms to perform much
faster queries. Newer technologies use cell-wise storage. Table-like structures are still very
important.
• Document stores - Document stores no longer use tables, but store every observation in a
document. This allows for a much more flexible data scheme.
• Streaming data - Data is collected, transformed, and aggregated not in batches but in real time.
Although we have categorized it here as a database to help you in tool selection, it is more a
particular type of problem that drove the creation of technologies like Storm.
• Key-value stores - Data is not stored in a table; rather, you assign a key for every value, such
as org.marketing.sales.2015: 20000. This scales very well but places almost all the
implementation on the developer.
• SQL on Hadoop - Batch queries on Hadoop are written in a SQL-like language that uses the
map-reduce framework in the background.
• New SQL - This class combines the scalability of NoSQL databases with the advantages of a
relational database. They all have a SQL interface and a relational data model.
• Graph databases - Not every problem is best stored in a table. Some problems are more naturally
translated into graph theory and stored in graph databases. A classic example of this is a social
network.
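The key-value model from the list above can be mimicked with a plain Python dict. Stores such as Redis add persistence and distribution, which a dict of course does not have:

```python
# A dict as a stand-in for a key-value store: values are looked up by an
# opaque key, as in the org.marketing.sales.2015 example above.
kv = {}
kv["org.marketing.sales.2015"] = 20000
kv["org.marketing.sales.2016"] = 25000

print(kv.get("org.marketing.sales.2015"))  # 20000
# There is no schema: interpreting the key's structure is the developer's job,
# which is why the text says the implementation falls on the developer.
sales_2015 = kv.get("org.marketing.sales.2015", 0)
```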
Service programming
Suppose that you have made a world-class soccer prediction application on Hadoop, and you want
to allow others to use the predictions made by your application. However, you have no idea of the
architecture or technology of everyone keen on using your predictions. Service tools excel here by
exposing big data applications to other applications as a service. Data scientists sometimes need
to expose their models through services. The best-known example is the REST service, where REST
stands for representational state transfer. It is often used to feed websites with data.
Security
Do you want everybody to have access to all of your data? Probably not. You need fine-grained
control over access to the data, but you don't want to manage this on an application-by-application
basis. Big data security tools allow you to have central and fine-grained control over access to
the data, keeping it confidential and closed to unwanted access. Big data security has become a
topic in its own right, and data scientists will usually only be confronted with it as data
consumers; seldom will they implement the security themselves.