
UNIT I - INTRODUCTION
Data
In computing, data is information that has been translated into a form that is efficient for movement or processing.
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive
amounts of data produced today. It adds methods from computer science to the repertoire of
statistics.
Benefits and uses of data science
Data science and big data are used almost everywhere in both commercial and noncommercial
settings.
 Commercial companies in almost every industry use data science and big data to gain
insights into their customers, processes, staff, competition, and products.
 Many companies use data science to offer customers a better user experience, as well as
to cross-sell, up-sell, and personalize their offerings.
 Governmental organizations are also aware of data’s value. Many governmental
organizations not only rely on internal data scientists to discover valuable information,
but also share their data with the public.
 Nongovernmental organizations (NGOs) use it to raise money and defend their causes.
 Universities use data science in their research but also to enhance the study experience
of their students. The rise of massive open online courses (MOOCs) produces a lot of
data, which allows universities to study how this type of learning can complement
traditional classes.
Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends
to require different tools and techniques. The main categories of data are these:
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and images
 Streaming
Let’s explore all these interesting data types.
Structured data
 Structured data is data that depends on a data model and resides in a fixed field within a
record. As such, it’s often easy to store structured data in tables within databases or
Excel files.
 SQL, or Structured Query Language, is the preferred way to manage and query data
that resides in databases.
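As a minimal sketch (assuming a hypothetical customers table in a local SQLite database named sales.db), querying structured data from Python could look like this:

import sqlite3

# Connect to a local SQLite database (hypothetical file name)
conn = sqlite3.connect("sales.db")
cursor = conn.cursor()

# A simple SQL query against a structured table with fixed fields
cursor.execute("SELECT name, age FROM customers WHERE age > 30")
for name, age in cursor.fetchall():
    print(name, age)

conn.close()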
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or
varying. One example of unstructured data is your regular email.
Natural language
 Natural language is a special type of unstructured data; it’s challenging to process because it requires
knowledge of specific data science techniques and linguistics.
 The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
 Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.
Machine-generated data
 Machine-generated data is information that’s automatically created by a computer, process,
application, or other machine without human intervention.
 Machine-generated data is becoming a major data resource and will continue to do so.
 The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
Graph-based or network data
 “Graph data” can be a confusing term because any data can be shown in a graph.
 Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
 The graph structures use nodes, edges, and properties to represent and store graphical data.
 Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two people.
Streaming data
 The data flows into the system when an event happens instead of being loaded into a data store in a
batch.
 Examples are the “What’s trending” list on Twitter, live sporting or music events, and the stock market.
Data Science Process
Overview of the data science process
The typical data science process consists of six steps through which you’ll iterate, as shown in the figure.
1. The first step of this process is setting a research goal. The main purpose here is making sure all the
stakeholders understand the what, how, and why of the project. In every serious project this will result
in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw
form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and modeling.
4
cse
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data.
You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The
insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling” throughout this book). It is now
that you attempt to gain the insights or make the predictions stated in your project charter. Now is the
time to bring out the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the analysis, if
needed. One goal of a project is to change a process and/or make better decisions. You may still need
to convince the business that your findings will indeed change the business process as expected. This
is where you can shine in your influencer role. The importance of this step is more apparent in projects
on a strategic and tactical level. Certain projects require you to perform the business process over and
over again, so automating the project will save time.
Defining research goals
A project starts by understanding the what, the why, and the how of your project. The outcome should be a
clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action with a
timetable. This information is then best placed in a project charter.
Spend time understanding the goals and context of your research
 An essential outcome is the research goal that states the purpose of your assignment in a clear and
focused manner.
 Understanding the business goals and context is critical for project success.
 Continue asking questions and devising examples until you grasp the exact business expectations,
identify how your project fits in the bigger picture, appreciate how your research is going to change
the business, and understand how they’ll use your results.
Create a project charter
A project charter requires teamwork, and your input covers at least the following:
 A clear research goal
 The project mission and context
 How you’re going to perform your analysis
 What resources you expect to use
 Proof that it’s an achievable project, or proof of concepts
 Deliverables and a measure of success
 A timeline
Retrieving data
 The next step in data science is to retrieve the required data. Sometimes you need to go into the field
and design a data collection process yourself, but most of the time you won’t be involved in this step.
 Many companies will have already collected and stored the data for you, and what they don’t have can
often be bought from third parties.
 More and more organizations are making even high-quality data freely available for public and
commercial use.
 Data can be stored in many forms, ranging from simple text files to tables in a database. The objective
now is acquiring all the data you need.
Start with data stored within the company (Internal data)
 Most companies have a program for maintaining key data, so much of the cleaning work may already
be done. This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.
 Data warehouses and data marts are home to preprocessed data, while data lakes contain data in its
natural or raw format.
 Finding data even within your own company can sometimes be a challenge. As companies grow, their
data becomes scattered around many places. The data may be dispersed as people change positions and
leave the company.
 Getting access to data is another difficult task. Organizations understand the value and sensitivity of
data and often have policies in place so everyone has access to what they need and nothing more.
 These policies translate into physical and digital barriers called Chinese walls. These “walls” are
mandatory and well-regulated for customer data in most countries.
External Data
 If data isn’t available inside your organization, look outside your organization. Companies provide
data so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter,
LinkedIn, and Facebook.
 More and more governments and organizations share their data for free with the world.
 Open data providers, such as government open-data portals, are a good place to get started.
Data Preparation (Cleansing, Integrating, Transforming Data)
Your model needs the data in a specific format, so data transformation will always come into play. It’s a
good habit to correct data errors as early on in the process as possible. However, this isn’t always possible in a
realistic setting, so you’ll need to take corrective actions in your program.
Cleansing data
Data cleansing is a sub process of the data science process that focuses on removing errors in your data so
your data becomes a true and consistent representation of the processes it originates from.
 The first type is the interpretation error, such as when you take the value in your data for granted, like
saying that a person’s age is greater than 300 years.
 The second type of error points to inconsistencies between data sources or against your company’s
standardized values.
An example of this class of errors is putting “Female” in one table and “F” in another when they represent
the same thing: that the person is female.
Overview of common errors
Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors;
diagnostic plots can be especially insightful. For example, in the figure we use a measure to identify data points
that seem out of place. We do a regression to get acquainted with the data and detect the influence of
individual observations on the regression line.
Data Entry Errors
 Data collection and data entry are error-prone processes. They often require human intervention, and
because humans make mistakes, errors can be introduced into the chain.
 Data collected by machines or computers isn’t free from errors either. Some errors arise from human
sloppiness, whereas others are due to machine or hardware failure.
 Detecting data errors when the variables you study don’t have many classes can be done by tabulating
the data with counts.
 When you have a variable that can take only two values, “Good” and “Bad”, you can create a
frequency table and see if those are truly the only two values present. In the table the values “Godo” and
“Bade” point out that something went wrong in at least 16 cases.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
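A quick way to build such a frequency table is pandas’ value_counts; the column values below are made up for illustration:

import pandas as pd

# Hypothetical categorical observations containing a few typos
quality = pd.Series(["Good", "Bad", "Good", "Godo", "Bade", "Good"])

# Tabulate the data with counts to spot values that shouldn't exist
print(quality.value_counts())

# Correct the misspelled categories with simple replacement rules
quality = quality.replace({"Godo": "Good", "Bade": "Bad"})
print(quality.value_counts())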
Redundant Whitespace
 Whitespaces tend to be hard to detect but cause errors like other redundant characters would.
 Whitespace causes mismatches between strings such as “FR ” and “FR”, which can lead to dropping the
observations that couldn’t be matched.
 If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that will remove the leading and trailing
whitespaces. For instance, in Python you can use the strip() function to remove leading and trailing
spaces.
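A small sketch with made-up values showing why the whitespace matters and how strip() fixes it:

country = "FR "             # value with a trailing space
reference = "FR"

print(country == reference)            # False: the redundant whitespace causes a mismatch
print(country.strip() == reference)    # True: strip() removes leading and trailing spaces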
Fixing Capital Letter Mismatches
Capital letter mismatches are common. Most programming languages make a distinction between “Brazil”
and “brazil”.
In this case you can solve the problem by applying a function that returns both strings in lowercase, such as
.lower() in Python. “Brazil”.lower() == “brazil”.lower() should result in true.
Impossible Values and Sanity Checks
Here you check the value against physically or theoretically impossible values such as people taller than 3
meters or someone with an age of 299 years. Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way to
find outliers is to use a plot or a table with the minimum and maximum values.
In the figure, the plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on
the upper side when a normal distribution is expected.
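A minimal sketch of this min/max check on a made-up column (pandas and matplotlib are assumed to be available):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical observations containing one suspicious value
heights = pd.Series([1.65, 1.72, 1.80, 1.58, 1.75, 4.20], name="height_m")

# A table with the minimum and maximum is often enough to spot outliers
print(heights.describe()[["min", "max"]])

# A simple box plot makes the distant observation visible at a glance
heights.plot(kind="box")
plt.show()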
Dealing with Missing Values
Missing values aren’t necessarily wrong, but you still need to handle them separately; certain modeling
techniques can’t handle missing values. They might be an indicator that something went wrong in your data
collection or that an error happened in the ETL process. Common techniques data scientists use range from
simply omitting the affected observations to imputing the missing values.
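A small sketch of those two options, dropping or imputing missing values with pandas (the column name is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 42, 31, np.nan]})

# Option 1: omit the observations with a missing value
cleaned = df.dropna(subset=["age"])

# Option 2: impute the missing values, here with the column mean
imputed = df.fillna({"age": df["age"].mean()})
print(imputed)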
Integrating data
Your data comes from several different places, and in this substep we focus on integrating these different
sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
The Different Ways of Combining Data
You can perform two operations to combine information from different data sets.
 Joining
 Appending or stacking
Joining Tables
 Joining tables allows you to combine the information of one observation found in one table with the
information that you find in another table. The focus is on enriching a single observation.
 Let’s say that the first table contains information about the purchases of a customer and the other table
contains information about the region where your customer lives.
 Joining the tables allows you to combine the information so that you can use it for your model, as
shown in the figure.
Figure: Joining two tables on the item and region key.
To join tables, you use variables that represent the same object in both tables, such as a date, a country name,
or a Social Security number. These common fields are known as keys. When these keys also uniquely define
the records in the table they are called primary keys.
The number of resulting rows in the output table depends on the exact join type that you use.
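A minimal pandas sketch of such a join on a shared key (the table and column names are hypothetical):

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 20, 75]})
regions = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "East"]})

# Join the two tables on their common key to enrich each purchase observation
enriched = purchases.merge(regions, on="customer_id", how="left")
print(enriched)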
Appending Tables
Appending or stacking tables is effectively adding observations from one table to another table.
 One table contains the observations from the month January and the second table contains
observations from the month February. The result of appending these tables is a larger one with the
observations from January as well as February.
Figure: Appending data from tables is a common operation but requires an equal structure in the tables being
appended.
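Appending tables with an equal structure can be sketched in pandas as follows (the monthly tables are made up):

import pandas as pd

january = pd.DataFrame({"customer_id": [1, 2], "amount": [50, 20]})
february = pd.DataFrame({"customer_id": [3, 4], "amount": [75, 10]})

# Stack the observations of both months into one larger table
combined = pd.concat([january, february], ignore_index=True)
print(combined)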
Transforming data
Certain models require their data to be in a certain shape, so you transform your data so it takes a suitable form
for data modeling.
Relationships between an input variable and an output variable aren’t always linear. Take, for instance, a
relationship of the form y = a·e^(bx). Taking the logarithm of the output variable turns this into the linear
relationship log y = log a + bx, which simplifies the estimation problem dramatically. Other times you might
want to combine two variables into a new variable.
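A hedged sketch of this idea on made-up data: fitting y = a·e^(bx) by regressing log(y) on x with NumPy.

import numpy as np

# Made-up data that follows y = a * exp(b * x) with a little noise
x = np.linspace(0.1, 5, 50)
y = 2.0 * np.exp(0.8 * x) * np.random.normal(1.0, 0.02, size=x.size)

# Taking the log of y turns the problem into ordinary linear regression
b, log_a = np.polyfit(x, np.log(y), 1)
print("estimated a:", np.exp(log_a), "estimated b:", b)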
Reducing the Number of Variables
 Having too many variables in your model makes the model difficult to handle, and certain techniques
don’t perform well when you overload them with too many input variables. For instance, all the
techniques based on a Euclidean distance perform well only up to 10 variables.
 Data scientists use special methods to reduce the number of variables but retain the maximum amount
of data.
The figure shows how reducing the number of variables makes it easier to understand the key values. It also
shows how two variables account for 50.6% of the variation within the data set (component1 = 27.8% +
component2 = 22.8%). These variables, called “component1” and “component2,” are both combinations of
the original variables. They’re the principal components of the underlying data structure.
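A minimal scikit-learn sketch of this kind of reduction on made-up data (the 50.6% figure above comes from the example, not from this code):

import numpy as np
from sklearn.decomposition import PCA

# Made-up data set with five partly correlated input variables
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
data[:, 1] += data[:, 0]              # introduce correlation between two variables

# Keep only the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(data)

# Share of the variation in the data explained by each component
print(pca.explained_variance_ratio_)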
Turning Variables into Dummies
 Dummy variables can only take two values: true (1) or false (0). They’re used to indicate the absence of
a categorical effect that may explain the observation.
 In this case you’ll make separate columns for the classes stored in one variable and indicate it with 1 if
the class is present and 0 otherwise.
 An example is turning one column named Weekdays into the columns Monday through Sunday. You
use an indicator to show if the observation was on a Monday; you put 1 on Monday and 0 elsewhere.
 Turning variables into dummies is a technique that’s used in modeling and is popular with, but not
exclusive to, economists.
Figure: Turning variables into dummies is a data transformation that breaks a variable that has multiple
classes into multiple variables, each having only two possible values: 0 or 1.
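A short sketch of turning a Weekdays-style column into dummy variables with pandas:

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

# One column per class, with 1 when the class is present and 0 otherwise
dummies = pd.get_dummies(df["weekday"], dtype=int)
print(dummies)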
Exploratory data analysis
During exploratory data analysis you take a deep dive into the data (see the figure below). Information
becomes much easier to grasp when shown in a picture, therefore you mainly use graphical techniques to
gain an understanding of your data and the interactions between variables.
The goal isn’t to cleanse the data, but it’s common that you’ll still discover anomalies you missed before,
forcing you to take a step back and fix them.
 The visualization techniques you use in this phase range from simple line graphs or histograms, as
shown in the figure below, to more complex diagrams such as Sankey and network graphs.
 Sometimes it’s useful to compose a composite graph from simple graphs to get even more insight
into the data. Other times the graphs can be animated or made interactive to make it easier and,
let’s admit it, way more fun.
The techniques we described in this phase are mainly visual, but in practice they’re certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can also be a part of
exploratory analysis. Even building simple models can be a part of this step.
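A small sketch of the simplest of these techniques, a histogram and a line graph with matplotlib, on made-up data:

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=1000)   # made-up observations

# Histogram: the distribution of a single variable
plt.hist(values, bins=30)
plt.title("Histogram of a made-up variable")
plt.show()

# Line graph: the same values, sorted, as a simple trend view
plt.plot(np.sort(values))
plt.title("Sorted values")
plt.show()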
Build the models
 With clean data in place and a good understanding of the content, you’re ready to build models with
the goal of making better predictions, classifying objects, or gaining an understanding of the system
that you’re modeling.
 This phase is much more focused than the exploratory analysis step, because you know what you’re
looking for and what you want the outcome to be.
Building a model is an iterative process. The way you build your model depends on whether you go with
classic statistics or the somewhat more recent machine learning school, and the type of technique you want to
use. Either way, most models consist of the following main steps:
 Selection of a modeling technique and variables to enter in the model
 Execution of the model
 Diagnosis and model comparison
Model and variable selection
You’ll need to select the variables you want to include in your model and a modeling technique. You’ll need
to consider model performance and whether your project meets all the requirements to use your model, as
well as other factors:
 Must the model be moved to a production environment and, if so, would it be easy to implement?
 How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
 Does the model need to be easy to explain?
Model execution
 Once you’ve chosen a model you’ll need to implement it in code.
 Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-
learn. These packages use several of the most popular techniques.
 Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the
process. As you can see in the following code, it’s fairly easy to use linear regression with
StatsModels or Scikit-learn.
 Doing this yourself would require much more effort even for the simple techniques. The following
listing shows the execution of a linear prediction model.
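Since the listing itself isn’t reproduced here, the sketch below shows one way to execute a linear prediction model with StatsModels on made-up data:

import numpy as np
import statsmodels.api as sm

# Made-up predictors and a target that depends on them linearly plus noise
predictors = np.random.random(1000).reshape(500, 2)
target = predictors.dot(np.array([0.4, 0.6])) + np.random.random(500)

# Fit an ordinary least squares (OLS) linear model and inspect the result
model = sm.OLS(target, predictors)
results = model.fit()
print(results.summary())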
Model diagnostics and model comparison
 You’ll be building multiple models from which you then choose the best one based on multiple
criteria. Working with a holdout sample helps you pick the best-performing model.
 A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate
the model afterward.
 The principle here is simple: the model should work on unseen data. You use only a fraction of your
data to estimate the model and the other part, the holdout sample, is kept out of the equation.
 The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
 Multiple error measures are available, and in the figure we show the general idea on comparing models.
The error measure used in the example is the mean square error.
Formula for mean square error: MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Mean square error is a simple measure: check for every prediction how far it was from the truth, square this
error, and average the squared errors over all predictions.
The above figure compares the performance of two models to predict the order size from the price. The first
model is size = 3 * price and the second model is size = 10.
 To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%), without
showing the other 20% of data to the model.
 Once the model is trained, we predict the values for the other 20% of the variables based on those for
which we already know the true value, and calculate the model error with an error measure.
 Then we choose the model with the lowest error. In this example we chose model 1 because it has the
lowest total error.
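A sketch of this holdout comparison on made-up price/size data, using mean square error as the error measure (the two candidate models are the ones named above):

import numpy as np

# Made-up data: 1,000 observations where size roughly equals 3 * price
rng = np.random.default_rng(1)
price = rng.uniform(1, 10, 1000)
size = 3 * price + rng.normal(0, 1, 1000)

# Hold out 20% of the data; the first 800 observations are used for estimation
price_test, size_test = price[800:], size[800:]

# Model 1: size = 3 * price        Model 2: size = 10
pred1 = 3 * price_test
pred2 = np.full_like(price_test, 10.0)

mse1 = np.mean((size_test - pred1) ** 2)
mse2 = np.mean((size_test - pred2) ** 2)
print("model 1 MSE:", mse1, "model 2 MSE:", mse2)   # keep the model with the lowest error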
Many models make strong assumptions, such as independence of the inputs, and you have to verify that these
assumptions are indeed met. This is called model diagnostics.
Presenting findings and building applications
 Sometimes people get so excited about your work that you’ll need to repeat it over and over again
because they value the predictions of your models or the insights that you produced.
 This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s
sufficient that you implement only the model scoring; other times you might build an application that
automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage of the
data science process is where your soft skills will be most useful, and yes, they’re extremely important.
Data mining
Data mining is the process of discovering actionable information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be
discovered by traditional data exploration because the relationships are too complex or because there is too
much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be applied
to specific scenarios, such as:
 Forecasting: Estimating sales, predicting server loads or server downtime
 Risk and probability: Choosing the best customers for targeted mailings, determining the probable
break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
 Recommendations: Determining which products are likely to be sold together, generating
recommendations
 Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
 Grouping: Separating customers or events into clusters of related items, analyzing and predicting
affinities
Building a mining model is part of a larger process that includes everything from asking questions about the
data and creating a model to answer those questions, to deploying the model into a working environment. This
process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the technologies in
Microsoft SQL Server that you can use to complete each step.
Defining the Problem
The first step in the data mining process is to clearly define the problem, and consider ways that data can be
utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by
which the model will be evaluated, and defining specific objectives for the data mining project. These tasks
translate into questions such as the following:
 What are you looking for? What types of relationships are you trying to find?
 Does the problem you are trying to solve reflect the policies or processes of the business?
 Do you want to make predictions from the data mining model, or just look for interesting patterns and
associations?
 Which outcome or attribute do you want to try to predict?
 What kind of data do you have and what kind of information is in each column? If there are multiple
tables, how are the tables related? Do you need to perform any cleansing, aggregation, or processing to
make the data usable?
 How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of
the business?
Preparing Data
 The second step in the data mining process is to consolidate and clean the data that was identified in
the Defining the Problem step.
 Data can be scattered across a company and stored in different formats, or may contain inconsistencies
such as incorrect or missing entries.
 Data cleaning is not just about removing bad data or interpolating missing values, but about finding
hidden correlations in the data, identifying sources of data that are the most accurate, and determining
which columns are the most appropriate for use in analysis.
Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating mean and standard
deviations, and looking at the distribution of the data. For example, you might determine by reviewing the
maximum, minimum, and mean values that the data is not representative of your customers or business
processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis
for your expectations. Standard deviations and other distribution values can provide useful information about
the stability and accuracy of the results.
Building Models
The mining structure is linked to the source of data, but does not actually contain any data until you process it.
When you process the mining structure, SQL Server Analysis Services generates aggregates and other
statistical information that can be used for analysis. This information can be used by any mining model that is
based on the structure.
Exploring and Validating Models
Before you deploy a model into a production environment, you will want to test how well the model performs.
Also, when you build a model, you typically create multiple models with different configurations and test all
models to see which yields the best results for your problem and your data.
Deploying and Updating Models
After the mining models exist in a production environment, you can perform many tasks, depending on your
needs. The following are some of the tasks you can perform:
 Use the models to create predictions, which you can then use to make business decisions.
 Create content queries to retrieve statistics, rules, or formulas fromthe model.
 Embed data mining functionality directly into an application. You can include Analysis Management
Objects (AMO), which contains a set of objects that your application can use to create, alter, process,
and delete mining structures and mining models.
 Use Integration Services to create a package in which a mining model is used to intelligently separate
incoming data into multiple tables.
 Create a report that lets users directly query against an existing mining model.
 Update the models after review and analysis. Any update requires that you reprocess the models.
 Update the models dynamically as more data comes into the organization; making constant
changes to improve the effectiveness of the solution should be part of the deployment strategy.
Data warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed
by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data
consolidation.
Characteristics of data warehouse
The main characteristics of a data warehouse are as follows:
 Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information rather than the
overall processes of a business. Such subjects may be sales, promotion, inventory, etc.
 Integrated
A data warehouse is developed by integrating data from varied sources into a consistent format.
The data must be stored in the warehouse in a consistent and universally acceptable manner in terms of
naming, format, and coding. This facilitates effective data analysis.
 Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is read-only. Previous
data is not erased when current data is entered. This helps you to analyze what has happened and
when.
 Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly or
implicitly. An example of time variance in a data warehouse is exhibited in the primary key, which
must have an element of time like the day, week, or month.
Database vs. Data Warehouse
Although a data warehouse and a traditional database share some similarities, they need not be the same idea.
The main difference is that in a database, data is collected for multiple transactional purposes. However, in a
data warehouse, data is collected on an extensive scale to perform analytics. Databases provide real-time data,
while warehouses store data to be accessed for big analytical queries.
Data Warehouse Architecture
Usually, data warehouse architecture comprises a three-tier structure.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational database system. Back-end tools are
used to cleanse, transform and feed data into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two ways.
The ROLAP or Relational OLAP model is an extended relational database management system that maps
multidimensional data processes to standard relational processes.
The MOLAP or Multidimensional OLAP model acts directly on multidimensional data and operations.
Top Tier
This is the front-end client interface that gets data out from the data warehouse. It holds various tools like
query tools, analysis tools, reporting tools, and data mining tools.
How Data Warehouse Works
Data Warehousing integrates data and information collected from various sources into one comprehensive
database. For example, a data warehouse might combine customer information from an organization’s point-
of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential
information about employees, salary information, etc. Businesses use such components of a data warehouse to
analyze customers.
Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns in
vast volumes of data and devising innovative strategies for increased sales and profits.
Types of Data Warehouse
There are three main types of data warehouse.
Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates decision-support services throughout
the enterprise. The advantage to this type of warehouse is that it provides access to cross-organizational
information, offers a unified approach to data representation, and allows running complex queries.
Operational Data Store (ODS)
This type of data warehouse refreshes in real time. It is often preferred for routine activities like storing
employee records. It is required when data warehouse systems do not support reporting needs of the business.
Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department, region, or business unit.
Every department of a business has a central repository or data mart to store data. The data from the data mart
is stored in the ODS periodically. The ODS then sends the data to the EDW, where it is stored and used.
Summary
In this chapter you learned the data science process consists of six steps:
 Setting the research goal—Defining the what, the why, and the how of your project in a project
charter.
 Retrieving data—Finding and getting access to data needed in your project. This data is either found
within the company or retrieved from a third party.
 Data preparation—Checking and remediating data errors, enriching the data with data from other data
sources, and transforming it into a suitable format for your models.
 Data exploration—Diving deeper into your data using descriptive statistics and visual techniques.
 Data modeling—Using machine learning and statistical techniques to achieve your project goal.
 Presentation and automation—Presenting your results to the stakeholders and industrializing your
analysis process for repetitive reuse and integration with other tools.
