CONTENTS
KOE 093 : Data Warehousing & Data Mining
ANALYSIS OF AKTU PAPERS
UNIT-1 : DATA WAREHOUSING
Overview, Definition, Data Warehousing Components, Building a Data Warehouse, Warehouse Database, Mapping the Data Warehouse to a Multiprocessor Architecture, Difference between Database System and Data Warehouse, Multi Dimensional Data Model, Data Cubes, Stars, Snow Flakes, Fact Constellations, Concept Hierarchy.
UNIT-2 : DATA WAREHOUSE PROCESS AND TECHNOLOGY
Warehousing Strategy, Warehouse Management and Support Processes, Warehouse Planning and Implementation, Hardware and Operating Systems for Data Warehousing, Client/Server Computing Model & Data Warehousing, Parallel Processors & Cluster Systems, Distributed DBMS Implementations, Warehousing Software, Warehouse Schema Design.
UNIT-3 : DATA MINING
Overview, Motivation, Definition & Functionalities, Data Processing, Form of Data Pre-processing, Data Cleaning : Missing Values, Noisy Data (Binning, Clustering, Regression, Computer and Human Inspection), Inconsistent Data, Data Integration and Transformation. Data Reduction : Data Cube Aggregation, Dimensionality Reduction, Data Compression, Numerosity Reduction, Discretization and Concept Hierarchy Generation.
UNIT-4 : CLASSIFICATION AND CLUSTERING
Classification : Definition, Data Generalization, Analytical Characterization, Analysis of Attribute Relevance, Mining Class Comparisons, Statistical Measures in Large Databases, Statistical-Based Algorithms, Distance-Based Algorithms, Decision Tree-Based Algorithms.
Clustering : Introduction, Similarity and Distance Measures, Hierarchical and Partitional Algorithms. Hierarchical Clustering : CURE and Chameleon. Density Based Methods : DBSCAN, OPTICS. Grid Based Methods : STING, CLIQUE. Model Based Method : Statistical Approach. Association Rules : Introduction, Large Item Sets, Basic Algorithms, Parallel and Distributed Algorithms, Neural Network Approach.
UNIT-5 : DATA VISUALIZATION
Aggregation, Historical Information, Query Facility, OLAP Function and Tools. OLAP Servers, ROLAP, MOLAP, HOLAP, Data Mining Interface, Security, Backup and Recovery, Tuning Data Warehouse, Testing Data Warehouse. Warehousing Applications and Recent Trends : Types of Warehousing Applications, Web Mining, Spatial Mining and Temporal Mining.
SHORT QUESTIONS
SOLVED PAPERS (2013-14 TO 2018-19)
ANALYSIS OF PREVIOUS AKTU PAPERS
(Year-wise tables; only the topic lists and question references below are legible. * = Asked in different years.)
Unit-1 : Data Warehousing
Topics : Data warehousing components; Building a data warehouse; Mapping the data warehouse; Difference between data warehouse and database; Data cube; Concept hierarchy.
Questions listed : 1.10*, 1.11, 1.12*, 1.13
Unit-2 : Data Warehouse Process and Technology
Topics : Warehouse strategy; Warehouse planning; Hardware of data warehouse; Client/server computing model; Distributed DBMS implementation.
Unit-3 : Data Mining
Topics : Introduction; Data processing; Data cleaning; Data reduction.
Questions listed : 3.9, 3.10, 3.12
Unit-4 : Classification and Clustering
Questions listed : 4.6, 4.7, 4.8, 4.9, 4.10, 4.12, 4.19, 4.15, 4.16*, 4.17, 4.18, 4.20, 4.21*, 4.22, 4.25, 4.26, 4.27, 4.28, 4.30, 4.31, 4.34, 4.35*
Unit-5 : Data Visualization and Overall Perspective
CONTENTS
Overview, Definition, Data Warehousing Components
Building a Data Warehouse, Warehouse Database
Mapping the Data Warehouse to a Multiprocessor Architecture
Difference between Database System and Data Warehouse
Multidimensional Data Model, Data Cube
Stars, Snow Flakes, Fact Constellations
Concept Hierarchy
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 1.1. What do you mean by data warehouse ? Discuss its key features with suitable example.
Answer
A data warehouse (DW) is a collection of corporate information and data derived from operational systems and external data sources. It is designed to support business decisions by allowing data consolidation, analysis and reporting at different aggregate levels.
Key features of data warehouse are :
1. Subject oriented : A data warehouse can be used to analyze a particular subject area. For example, sales, marketing etc. can be a particular subject.
2. Integrated : A data warehouse integrates data from multiple data sources, so that the information is stored in a common format.
3. Time variant : Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months or even older from a data warehouse. A data warehouse can hold all addresses associated with a customer.
4. Non-volatile : Data in the data warehouse is not changed or erased when new data is entered in it.
5. Data granularity : It refers to the level of detail of the data. For example, a query may ask which region, across all the stores, recorded the maximum sales of a product.
Que 1.2. Describe the components of data warehouse.
Answer
Following are the various components of data warehouse :
a. Data warehouse database : The database is implemented on the RDBMS technology. Due to some constraints, different approaches to database are used :
1. RDBMS are deployed in parallel to allow scalability.
2. New index structures are used to bypass relational table scans and improve speed.
3. Multidimensional databases are used to overcome any limitations due to relational data model.
b. ETL tools : The functionality of sourcing, acquisition, cleanup and transformation tools, also called ETL tools, includes :
1. Removing unwanted data from operational databases.
2. Converting to common data names and definitions.
3. Establishing defaults for missing data.
Fig. Information flow of a data warehouse.
c. Metadata : Metadata is data about data that describes the data warehouse. It is used for building, maintaining, managing and using the data warehouse.
d. Access tools : Access tools are divided into four main categories :
1. Query and reporting tools
2. Application development tools
3. Online analytical processing tools
4. Data mining tools
e. Data warehouse bus architecture : Data warehouse bus determines the flow of data in our warehouse. The data flow in a data warehouse can be categorized as Inflow, Upflow, Downflow, Outflow and Meta flow.
PART-2
Building a Data Warehouse, Warehouse Database.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 1.3. Explain the concept of building a data warehouse.
Answer
Following steps should be adopted to build a successful data warehouse :
1. Business considerations :
a. Approach : For data warehouse development, one of the two approaches is used :
i. Top-down approach : In the top-down approach, the data warehouse is built first. The data marts are then created from the data warehouse.
ii. Bottom-up approach : In the bottom-up approach, data marts are created first and then the data warehouse is built.
b. Organizational issues : Most IS organizations have expertise in developing operational systems.
2. Design considerations : There are several points related to data warehouse design :
a. Data content : The data warehouse system should not contain as much detail-level data as the operational system used to source this data.
b. Metadata : Metadata is data about data. It means it is a description and context of the data. It helps to organize, find and understand data.
c. Data distribution : It becomes necessary to know how the data should be divided across multiple servers and which users should get access to which type of data.
d. Tools : The tools provide the facilities for defining the transformation and cleanup rules, data movement, end-user query, reporting and data analysis.
e. Performance considerations : An ideal data warehouse system should support interactive query processing.
3. Technical considerations : A number of technical issues are to be considered when implementing and building a data warehouse :
a. Hardware platform : The data warehouse server has to be able to support large data volumes and complex queries.
b. The database management system that supports the warehouse database.
c. Communication infrastructure : A data warehouse user requires a large bandwidth to interact with the data warehouse and retrieve a large amount of data for analysis.
d. The hardware platform and software to support the metadata repository.
e. The systems management framework that enables centralized management and administration of the entire environment.
4. Implementation considerations : The implementation of data warehouse requires the integration of many products :
a. Access tools : Ranking, statistical analysis, time series analysis, artificial intelligence and information mapping are some of the examples of access tool types.
b. Data extraction, cleanup, transformation and migration.
c. Data placement strategies : As a data warehouse grows, there should be a way to store the data in storage media and distribute the data in the data warehouse across multiple servers.
d. Metadata : Metadata is data about data. It means it is a description and context of the data. It helps to organize, find and understand data.
e. User sophistication levels : A certain degree of sophistication is required to effectively use the warehouse.
• Mapping the relational database to the multiprocessing hardware architectures allows a successful implementation of data warehouse.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 1.4. Enumerate the steps involved in mapping the data warehouse to a multiprocessor architecture.
AKTU 2016-17, Marks 10
OR
What is the architecture of data warehouse operations ?
Answer
Steps involved in mapping the data warehouse to a multiprocessor architecture :
1. Relational database technology for data warehouse :
a. Linear speed up : The ability to increase the number of processors to reduce response time.
b. Linear scale up : The ability to provide same performance on the same requests as the database size increases.
Types of parallelism :
i. Horizontal parallelism : The database is partitioned across multiple disks, and the same task runs concurrently on different processors against different partitions of the data (data partitioning).
ii. Vertical parallelism : This form of parallelism decomposes the serial SQL query into lower level operations such as scan, join, sort etc., which are executed concurrently, the output of one operation feeding the next (query partitioning).
Fig. Response time of a serial RDBMS compared with horizontal parallelism (data partitioning) and vertical parallelism (query partitioning).
Data partitioning : Data partitioning is the key component for effective parallel execution of database operations. Partition can be done randomly or intelligently.
a. Random partitioning : It includes random data striping across multiple disks on a single server.
b. Intelligent partitioning : It assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks.
c. Hash partitioning : A hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.
d. Key range partitioning : Rows are placed and located in the partitions according to the value of the partitioning key.
e. Schema partitioning : An entire table is placed on one disk, another table is placed on a different disk etc. This is useful for small reference tables.
f. User defined partitioning : It allows a table to be partitioned on the basis of a user defined expression.
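Hash partitioning and key range partitioning, as described above, can be sketched in a few lines. This is only an illustrative sketch: the function names, the `cust_id` key, the range bounds and the use of CRC32 as the hash algorithm are assumptions made for the example, not part of any particular DBMS.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Return a partition number computed from a hash of the partitioning key."""
    # CRC32 is an assumed hash; it is chosen only because it is deterministic.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def key_range_partition(key, upper_bounds):
    """Return the index of the first range whose upper bound is >= key."""
    return bisect.bisect_left(upper_bounds, key)


# Hypothetical rows and ranges: partition 0 holds ids up to 100,
# partition 1 holds ids up to 500, partition 2 holds the rest.
rows = [{"cust_id": 17}, {"cust_id": 250}, {"cust_id": 999}]
bounds = [100, 500]

range_parts = [key_range_partition(r["cust_id"], bounds) for r in rows]
hash_parts = [hash_partition(r["cust_id"], 4) for r in rows]
```

With key range partitioning a query on a range of keys touches only the matching partitions, whereas hash partitioning tends to spread rows evenly but scatters neighbouring keys.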
2. Database architectures of parallel processing : There are three DBMS software architecture styles for parallel processing :
a. Shared memory or shared-everything architecture : It has the following characteristics :
i. Multiple Processing Units (PU) share memory.
ii. It is simple to implement and provides a single system image.
iii. It is used for implementing an RDBMS on SMP (Symmetric Multiprocessor).
b. Shared disk architecture : Shared disk architecture implements shared ownership of the entire database between RDBMS servers, each of which is running on a node of a distributed memory system.
c. Shared nothing architecture : In shared nothing architecture, each CPU is connected to a given disk and owns the part of a table or database stored on that disk. Shared nothing systems are concerned with access to disks, not with access to memory.
3. Parallel DBMS features :
a. Parallel environment
b. DBMS management tools
c. Price / Performance
d. Scope and techniques of parallel DBMS operations
e. Optimized implementations
f. Application transparency
4. Alternative technologies : Technologies for improving performance in data warehouse environment include the following :
a. Advanced database indexing products
b. Multidimensional databases
c. Specialized RDBMS
5. Parallel DBMS vendors :
a. Oracle : It supports parallel database processing.
b. Informix : It supports full parallelism.
c. IBM : It offers a parallel client/server database product, DB2 Parallel Edition.
d. Sybase : It implemented its parallel DBMS functionality in a product called Sybase MPP (Sybase + NCR).
Que 1.5. Define data warehouse. What strategies should be taken care of while designing a warehouse ?
Answer
Data warehouse : Refer Q. 1.1, Page 1-2D, Unit-1.
The strategies that should be taken care of while designing a warehouse are :
1. Educate yourself : We must understand what users want because the purpose of a data warehouse system is to provide decision-makers the accurate, timely information they need to make the right choices.
2. Determine business requirements : To determine business requirements, we should understand the following :
a. Why the requestor needs a data warehouse.
b. What they are trying to accomplish (saving time in collecting data, higher quality of data, supporting certain applications etc.). We need to tie these business objectives to data sources.
c. What business rules to follow and what users and/or applications to support.
3. Make a timeline : Break up the business objectives mentioned above into two to three month incremental deliverables.
4. Choosing architecture, methodology and technology and building the team.
CONCEPT OUTLINE
• A database system describes processing at operational sites whereas a data warehouse describes processing at warehouse.
• A multidimensional data model is used for the design of corporate data warehouses.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 1.6. What is data warehouse ? How does it differ from a database ?
Answer
Data warehouse : Refer Q. 1.1, Page 1-2D, Unit-1.
Difference :
S.No. | Data warehouse | Database
1. | It involves historical processing of information. | It involves day-to-day processing.
2. | It is used to analyze the business. | It is used to run the business.
3. | It focuses on information out. | It focuses on data in.
Difference between OLAP and OLTP :
S.No. | Basis | OLAP | OLTP
1. | Abbreviation | It stands for Online Analytical Processing. | It stands for Online Transaction Processing.
2. | Use | It is used for query processing. | It is used for transaction processing.
3. | Data | It holds historical data. It stores only relevant data. | It holds current data. It stores all data.
4. | Type | It is analysis driven. | It is application driven.
5. | Source | The data comes from various OLTP sources. | It is the original source of data.
6. | Purpose | To help with planning, problem solving and decision support. | To control and run fundamental business tasks.
Metadata :
1. Metadata is data about data. It means it is a description and context of the data. It helps to organize, find and understand data.
2. In data warehouse, metadata are the data that define warehouse objects.
3. Metadata can be classified into two types : Technical metadata and Business metadata.
Importance of metadata :
1. Metadata drives data warehouse processes.
2. Metadata gives the user the meaning of each data element.
3. Metadata establishes the context for data elements.
Multidimensional data model :
1. Multidimensional data model stores data in the form of data cube. A data cube allows data to be viewed in multiple dimensions.
2. Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional data model.
Fact constellations :
1. The fact constellation can have multiple fact tables that share many dimension tables.
2. This type of schema can be viewed as a collection of stars or snowflakes, and hence is called a galaxy schema or a fact constellation.
3. The main disadvantage of fact constellation schemas is its more complicated design.
Example : Let us assume that Deccan Electronics would like to have another fact table for supply and delivery. It may contain five dimensions, or keys : time, item, delivery agent, origin and destination, along with the numeric measures as the number of units supplied and the cost of delivery. It can be seen that both fact tables can share the same item-dimension table as well as time-dimension table. A fact constellation schema is shown in Fig. 1.10.3.
Fig. 1.10.3. Fact constellation.
Difference :
S.No. | Star schema | Fact constellation
1. | In star schema, each dimension is represented by only one table. | In fact constellation, dimension tables are shared by multiple fact tables.
2. | It is simple to understand and easily designed. | It is more complex and hard to design.
3. | It does not use normalization. | It uses normalization.
4. | It saves space due to single fact table. | It does not save space due to multiple fact tables.
Fig. 1.10.2. Snowflake schema.
Que 1.11. Suppose that a data warehouse for Big University consists of the following four dimensions : student, course, semester and instructor, and two measures : count and avg_grade. When at the lowest conceptual level (for example, for a given student, course, semester and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.
i. Draw a snowflake schema diagram for the data warehouse.
ii. Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (for example, roll-up from semester to year) should one perform in order to list the average grade of CS courses for each student of the Big University.
Answer
i. Snowflake schema (tables legible in the figure) :
Fact table : student_id, course_id, semester_id, instructor_id, count, avg_grade
Semester dimension table : semester_id, semester, year
ii. Starting with the base cuboid [student, course, semester, instructor] :
1. Roll-up on course from (course_key) to major.
2. Roll-up on student from (student_key) to University.
3. Dice on course, student with department = "CS" and University = "Big University".
4. Drill-down on student from University to student name.
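The effect of the operations in part ii can be checked on a toy base cuboid. The sketch below does not model a cube engine: it reaches the same final answer directly by filtering (the dice) and then averaging at the student level. All rows, names and grades are hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical base-cuboid rows:
# (student, university, course, department, semester, instructor, grade)
base_cuboid = [
    ("Asha", "Big University", "DBMS", "CS", "Sem1", "Rao", 8.0),
    ("Asha", "Big University", "OS", "CS", "Sem2", "Iyer", 6.0),
    ("Asha", "Big University", "Physics", "PH", "Sem1", "Das", 9.0),
    ("Ravi", "Big University", "DBMS", "CS", "Sem1", "Rao", 7.0),
    ("Mona", "Other University", "DBMS", "CS", "Sem1", "Rao", 5.0),
]


def avg_cs_grade_per_student(rows):
    """Average grade of CS courses for each Big University student."""
    # Dice: keep only department = "CS" and university = "Big University".
    diced = [r for r in rows if r[3] == "CS" and r[1] == "Big University"]
    # Aggregate away course, semester and instructor; keep the student level.
    totals = defaultdict(lambda: [0.0, 0])
    for student, _univ, _course, _dept, _sem, _instr, grade in diced:
        totals[student][0] += grade
        totals[student][1] += 1
    return {s: total / n for s, (total, n) in totals.items()}


result = avg_cs_grade_per_student(base_cuboid)
```

Here `result` maps each Big University student to the average of that student's CS grades; the Physics row and the row from the other university are removed by the dice.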
CONCEPT OUTLINE
• A concept hierarchy is a directed acyclic graph of concepts, where each of the concepts is identified by a unique name.
Que 1.12. Describe concept hierarchy with example.
Answer
1. Concept hierarchy represents the relationship between data elements in such a way that they can relate to each other as one above another, one below another.
2. Example of concept hierarchy is Date hierarchy which forms relationship as Year → Month → Day → Week etc.
Fig. Date hierarchy : Date, Year, Month, Day, Week.
3. There are three main types of hierarchies in data warehouse design :
a. Balanced hierarchy
b. Unbalanced hierarchy
c. Ragged hierarchy
4. Concept hierarchy reduces the data by collecting and replacing low level concepts by higher level concepts.
a. Determining the dimensions to be included.
b. Determining the location to place the hierarchy of each dimension of information.
Partitioning : Refer Q. 1.4, Page 1-6D, Unit-1.
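The way a concept hierarchy such as the Date hierarchy of Que 1.12 reduces data can be sketched as follows. Each low level concept (a day) is replaced by its higher level concept (month or year) and the measure is summed. The `daily_sales` facts and the `roll_up` helper are hypothetical, introduced only for this illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical daily sales facts: (day, units sold)
daily_sales = [
    (date(2018, 1, 5), 10),
    (date(2018, 1, 20), 5),
    (date(2018, 2, 3), 7),
    (date(2019, 2, 3), 4),
]


def roll_up(facts, level):
    """Replace each day by its ancestor at the given level and sum the measure."""
    to_level = {
        "day": lambda d: d.isoformat(),
        "month": lambda d: f"{d.year}-{d.month:02d}",
        "year": lambda d: str(d.year),
    }[level]
    totals = Counter()
    for day, units in facts:
        totals[to_level(day)] += units
    return dict(totals)


by_month = roll_up(daily_sales, "month")
by_year = roll_up(daily_sales, "year")
```

Four daily facts collapse to three monthly totals and two yearly totals, which is the data reduction that point 4 of the answer refers to.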
Data Warehouse Process & Technology
CONTENTS
Part-1 : Warehousing Strategy, Warehouse Management and Support Processes
Part-2 : Warehouse Planning and Implementation
Part-3 : Hardware and Operating Systems for Data Warehousing
Part-4 : Client/Server Computing Model and Data Warehousing, Parallel Processors and Cluster Systems
Part-5 : Distributed DBMS Implementations, Warehousing Software and Warehouse Schema Design
A warehouse strategy involves many important decisions. The important elements that make up the warehouse strategy are described below :
1. Preliminary data warehouse rollout plan : [...]
2. Preliminary data warehouse architecture : It defines the overall [...]
3. Short-listed data warehouse environment and tools : Create [...] to meet warehousing needs.
Que 2.2. Explain warehouse management and support processes.
Answer
Warehouse management and support processes are designed to address the aspects of planning and managing the DW project that affect its successful implementation and extension.
Steps in warehouse management and support processes :
1. Define issue tracking and resolution process : It includes following guidelines : issue description, urgency, raised by, assigned to, date opened, date closed, resolved by and resolution description.
2. Perform capacity planning : It can be done in following forms :
i. Space required : Space is required for the warehouse data, backup and recovery, indexing and metadata.
ii. Machine processing power : It takes into account the processing power available and the processing requirements.
iii. Network bandwidth.
3. Define warehouse purging rules : It defines the rules for archiving or removing older data from the data warehouse.
4. Define security management : It keeps the data warehouse secure. Various steps involved in security management are :
i. Determine and evaluate IT assets
ii. Analyze risk
iii. Define security practices
iv. Implement practices
v. Monitor violations and take corresponding actions
vi. Re-evaluate IT assets and risk
5. Define backup and recovery strategy :
i. Data to be backed up : Identify the data that must be backed up on a regular basis. This gives us an indication of the regular backup size.
ii. Batch window of the warehouse : It determines the maximum allowable down time for the warehouse.
iii. Maximum acceptable time for recovery : It determines the maximum acceptable time for the warehouse data and metadata to be restored.
iv. Acceptable costs for backup and recovery : Different backup mechanisms imply different backup costs.
6. Set up collection of warehouse usage statistics : Warehouse usage statistics are collected to provide the data warehouse designer with inputs for further refining the data warehouse design and to track the general usage and acceptance of the warehouse.
Que 2.3. Write a short note on data warehouse planning.
Answer
The data warehouse planning describes the activities related to planning of the data warehouse. The steps in data warehouse planning are :
1. Assemble and orient the team :
a. Distribute copies of DW strategy.
b. Set up teams and specify roles.
c. Give training if required.
d. Set up milestones and check points.
2. Conduct decisional requirements analysis : It means gaining a thorough understanding of the information needs of decision makers.
3. Conduct decisional source system audit : Survey the current source systems for the data warehouse.
4. Design logical and physical warehouse schema : It includes two schema design techniques :
a. Normalization : Normalize field attribute data so as to fall within a small specified range, such as 0.0 to 1.0.
b. Dimensional modeling : This technique produces denormalized, star schema designs consisting of fact and dimension tables. A variation of the dimensional star schema also exists (i.e., snowflake schema).
5. Produce source-to-target field mapping : The source-to-target field mapping documents how fields in the operational systems are transformed into the data warehouse fields.
6. Select development and production environment and tools : It finalizes the computing environment and tool set for rollout based on the results of development and production environment.
7. Create prototype for this rollout : It creates a prototype of the data warehouse using the final tools and production environment.
8. Create implementation plan for this rollout : It drafts an implementation plan for the rollout.
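The normalization mentioned in step 4, scaling attribute values so that they fall within a small range such as 0.0 to 1.0, is commonly done with min-max scaling. The sketch below assumes that technique; the text does not name a specific formula, and the function name and the `incomes` sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are equal; map every value to the lower bound.
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


# Hypothetical attribute values
incomes = [20000, 35000, 50000]
normalized = min_max_normalize(incomes)
```

The smallest income maps to 0.0, the largest to 1.0 and the middle value to 0.5, so attributes measured on very different scales become comparable.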
Que 2.4. Explain the steps and guidelines for data warehouse implementation.
Answer
Steps for data warehouse implementation :
1. Requirements analysis and capacity planning : The first step in data warehousing involves defining enterprise needs, defining architecture, carrying out capacity planning and selecting the hardware and software tools.
2. Hardware integration : Once the hardware and software have been selected, they need to be put together by integrating the server, the storage devices and the client software tools.
3. Modeling : Modeling is a major step that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is complex.
4. Physical modeling : This involves designing the physical data warehouse organization, data placement, data partitioning, deciding on access methods and indexing.
5. Sources : The data for the data warehouse is likely to come from a number of data sources. This step involves identifying and connecting the sources using gateways, ODBC drivers or other wrappers.
6. ETL : The data from the source systems will need to go through an ETL process. The step of designing and implementing the ETL process may involve identifying a suitable ETL tool vendor and purchasing and implementing the tool.
7. Populate the data warehouse : Once the ETL tools have been agreed upon, testing the tools will be required, perhaps using a staging area.
8. User applications : For the data warehouse to be useful there must be end-user applications. This step involves designing and implementing applications required by the end users.
9. Roll-out the warehouse and applications : Once the data warehouse has been populated and the end-user applications are tested, the warehouse system and the applications may be rolled out for the user community to use.
Guidelines for data warehouse implementation :
1. Build incrementally : Data warehouses must be built incrementally. It is recommended that a data mart may first be built with one particular project in mind and then the data warehouse can be implemented in an iterative manner, allowing all data marts to extract information from the data warehouse.
2. Need a champion : A data warehouse project must have a champion who is willing to carry out considerable research into expected costs and benefits of the project.
3. Senior management support : A data warehouse project must be fully supported by the senior management. Given the resource intensive nature of such projects and the time they take to implement, a warehouse project calls for a sustained commitment from senior management.
4. Ensure quality : Only data that has been cleaned should be loaded in the data warehouse.
5. Business plan : The financial costs (hardware, software and peopleware), expected benefits and a project plan (including an ETL plan) for a data warehouse project must be clearly outlined and understood by all stakeholders.
6. Training : A data warehouse project must not overlook data warehouse training requirements.
7. Adaptability : The project should build in adaptability so that changes may be made to the data warehouse when required. Like any system, a data warehouse will need to change, as needs of an enterprise change.
8. Joint management : The project must be managed by both IT and business professionals in the enterprise.
PART-3
Hardware and Operating Systems for Data Warehousing.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 2.5. Explain hardware and operating systems used in data warehouse.
Answer
Hardware and operating system refers to the server platforms and operating systems that serve as the computing environment of the data warehouse.
Hardware and operating system used in data warehouse are :
1. Parallel hardware technology :
a. Symmetric multiprocessing (SMP) systems : These systems consist of two or more processors that share a common memory and operating system. Since the resources are shared, they can be centrally managed. They make use of high speed interconnections.
Fig. 2.5.1. SMP architecture (CPUs sharing a common memory).
b. Massively parallel processor (MPP) systems : It uses a large number of processors which communicate through some message interface. Each processor has its own CPU, memory and disk subsystem.
Fig. 2.5.2. MPP architecture.
2. Clustered system : These systems are configured with multiported disk arrays so that nodes which have direct disk access enjoy same disk I/O rates as standalone SMP systems. Nodes which do not have direct disk access must use the high-speed cluster interconnect mechanism.
Fig. 2.5.3. Cluster of four SMP systems.
Que 2.6. Explain selection criteria for hardware selection.
Answer
The following criteria are recommended for hardware selection :
1. Delivery lead time
2. Reference sites
3. Availability of support
PART-4
Client/Server Computing Model and Data Warehousing, Parallel Processors and Cluster Systems.
CONCEPT OUTLINE
• Parallel processing divides a task into fragments to speed up the execution of programs.
Que 2.7. Explain client/server architecture.
Answer
1. Client/server architecture is a network architecture in which each computer on the network is either a client or a server.
2. Client/server architecture works when the client sends a request to the server over the network connection, which is then processed and delivered to the client.
Components of client/server architecture :
1. Client : It is a computer which requests service from the server.
2. Server : It is any computer that provides services to the client.
3. Communication middleware : It is the component through which client and server communicate.
Fig. Client sending a request to the server.
Advantages of client/server architecture : [...]
Disadvantages of client/server architecture :
1. Single point of failure
2. Costly to maintain DB server
Que 2.8. What are the types of client/server architecture ?
Answer
1. Two-tier architecture :
Advantages of two-tier architecture :
1. Interoperability
2. [...]
3. [...]
4. Transparency
5. Security
Disadvantages of two-tier architecture :
1. Network traffic is handled less efficiently.
2. The client and server are tightly coupled.
2. Three-tier architecture : In the three-tier architecture, a middleware is used between the client environment and the database management environment. It is used in large environments.
Advantages of three-tier architecture :
1. Improves performance
2. Improves flexibility
Que 2.9. Describe distributed memory architecture.
Answer
In distributed memory architecture, each processor in the system has its own memory and does not directly access another processor's memory.
Two types of distributed memory architecture are :
1. Shared nothing architecture :
a. Shared nothing architecture is used in distributed computing in which each node has its own memory, storage and independent input/output interfaces.
b. Each node does not share any resources with other nodes, and the nodes communicate with each other by passing messages.
2. Shared disk architecture :
a. A shared disk architecture is a distributed computing architecture in which all disks are shared among the nodes.
b. Multiple processors can access all disks directly via an intercommunication network and every processor has local memory.
Fig. Shared disk architecture (global shared disk subsystem).
Cluster system :
1. In a cluster system, every processor unit (PU) executes a copy of the operating system and the inter-PU communications are performed over an open-systems-based interconnection.
2. Cluster system is designed for high availability by providing shared access to disks.
3. Cluster system has many characteristics of MPP system, including a very high-speed scalable interconnection mechanism and support for hundreds of PUs.
Fig. 2.10.1. Distributed memory cluster.
PART-5
Distributed DBMS Implementations, Warehousing Software and Warehouse Schema Design.
CONCEPT OUTLINE
• Schema is a logical description of the entire database.
1. Connectivity tools : These tools connect systems in a heterogeneous environment.
For example :
i. IBM : Data Joiner
ii. Oracle : Transparent Gateway
iii. SAS : SAS/Connect
iv. Sybase : Enterprise Connect
2. Extraction tools : There are two primary methods to use extraction tools, i.e., bulk extraction and change-based replication.
For example :
i. Apertus Carleton : Passport
ii. Platinum : InfoPump
3. Transformation tools : These tools have the following features :
i. Field splitting and consolidation
ii. Standardization
4. Data quality tools :
For example :
i. Data Flux : Data Quality Workbench
ii. Prism : Quality Manager
iii. Pine Cone Systems : Content Tracker
5. Data loaders : They load the transformed data into the data warehouse.
6. Data access and retrieval tools : These tools are classified into two categories :
i. OLAP tools : These allow users to make ad hoc queries or generate queries against the warehouse database.
ii. Reporting tools : These allow users to produce canned and sophisticated reports based on warehouse data.
7. Data modeling tools : These tools allow users to prepare and maintain an information model of both source and target database.
For example :
i. Cayenne Software : Terrain
ii. Relational Matters : Syntagma Designer
iii. Sybase : PowerDesigner WarehouseArchitect
8. Warehouse management tools : These tools assist warehouse administrators in the day-to-day management and administration of the warehouse.
For example :
i. Pine Cone Systems : Usage Tracker, Refreshment Tracker.
ii. Red Brick Systems : Enterprise Control and Coordination.
Que. Discuss various warehouse schema design techniques.
Answer
Various warehouse schema design techniques are :
1. OLTP systems use normalized data structures.
2. Dimensional modeling for decisional systems : It provides a number of techniques for denormalizing database to create star schema.
3. Star schema : Refer Q. 1.10, Page 1-14D, Unit-1.
4. Dimensional hierarchies : Each dimension will have hierarchies that imply grouping and structure.
5. Granularity of the fact table : The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps :
a. Determine which dimensions will be included.
b. Determine where along the hierarchy of each dimension the information will be kept.
6. Aggregates or summaries : Aggregates are the summarization of fact related data for the purpose of improved performance. Aggregates are to be considered for use when the number of detailed records to be processed is large and/or the processing of the customer queries begins to impact the performance.
7. Dimensional attributes : The attribute values are used to establish the context of the facts.
8. Multiple star schemas : A data warehouse will have multiple star schemas and many fact tables.
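The ideas of granularity and aggregates can be illustrated with a minimal sketch. It assumes a hypothetical fact table stored at the grain of one row per day, store and product; the row values, the field names and the helper `build_monthly_aggregate` are illustrative only and do not come from any particular warehouse product.

```python
from collections import defaultdict

# Hypothetical fact rows at the lowest grain: one row per (day, store, product).
fact_sales = [
    {"date": "2019-01-05", "store": "S1", "product": "P1", "amount": 120.0},
    {"date": "2019-01-17", "store": "S1", "product": "P2", "amount": 80.0},
    {"date": "2019-02-02", "store": "S1", "product": "P1", "amount": 50.0},
    {"date": "2019-01-09", "store": "S2", "product": "P1", "amount": 200.0},
]


def build_monthly_aggregate(rows):
    """Summarize day-level facts into one total per (month, store)."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]  # climb the time hierarchy: day -> month
        totals[(month, row["store"])] += row["amount"]
    return dict(totals)


monthly_sales = build_monthly_aggregate(fact_sales)
for (month, store), amount in sorted(monthly_sales.items()):
    print(month, store, amount)
```

Queries that only need monthly totals per store can read the small aggregate instead of scanning every detailed record, which is the performance motive given for aggregates.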
Data Mining
CONTENTS
Part-1 : Overview, Motivation, Definition and Functionalities
Part-2 : Data Processing, Form of Data Pre-Processing
Part-3 : Data Cleaning : Missing Values, Noisy Data (Binning, Clustering, Regression, Computer and Human Inspection), Inconsistent Data
Part-4 : Data Reduction : Data Cube Aggregation, Dimensionality Reduction, Data Compression, Numerosity Reduction, Discretization and Concept Hierarchy Generation and Decision Tree
PART-1
Overview, Motivation, Definition and Functionalities.
CONCEPT OUTLINE
• Data mining is a process used by organizations to turn raw data into useful information.
• Functionalities of data mining include characterization, discrimination and outlier analysis.
Que 3.1. Explain data, information and knowledge.
[AKTU 2014, Marks 05]
Answer
Data : Data are raw facts and figures that can be processed or stored by a computer. For example, text, numbers, symbols, etc.
Information : Information is data that has been processed into a form that gives it meaning. For example, analysis of retail sales data can provide information on which products are selling.
Knowledge : Knowledge is the understanding of rules needed to interpret information. For example, information on retail market sales can be analyzed with promotional efforts to yield knowledge of customer behaviour.
[Figure: Data -> Information -> Knowledge]
Que 3.2. What is data mining ? Define the major issues in data mining. [AKTU, Marks 05]
OR
Describe challenges to data mining regarding data mining methodology and user interaction issues.
Answer
Data mining : Data mining is defined as a process used to extract usable data from a larger set of any raw data.
Key features of data mining :
1. Automatic pattern predictions based on trend and behaviour analysis.
2. Prediction based on likely outcomes.
3. Creation of decision oriented information.
4. Focus on large datasets and databases for analysis.
5. Clustering based on groups of facts not previously known.
Major issues in data mining :
1. Mining methodology and user interaction issues :
a. Mining different kinds of knowledge in databases : Different users may be interested in different kinds of knowledge.
b. Interactive mining of knowledge at multiple levels of abstraction : It allows users to focus the search for patterns from different angles.
c. Incorporation of background knowledge : Background knowledge is used to guide the discovery process and to express the discovered patterns.
d. Data mining query languages and ad hoc data mining : Data mining query language should be integrated with data warehouse query language.
e. Presentation and visualization of data mining results : Once the patterns are discovered, they need to be expressed in high level languages.
f. Handling noisy or incomplete data : Data cleaning methods are required to handle the noise and incomplete objects while mining the data regularities.
g. Pattern evaluation : The patterns discovered should be interesting; patterns that only represent common knowledge are of little use.
2. Performance issues :
a. Efficiency and scalability of data mining algorithms : In order to effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable.
b. Parallel, distributed and incremental mining algorithms : Factors such as the huge size of databases, the wide distribution of data and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms.
3. Diverse data types issues :
a. Handling of relational and complex types of data.
b. Mining information from heterogeneous databases and global information systems.
Que 3.3. Explain the steps of knowledge discovery in databases (KDD). [AKTU 2017-18, Marks 10]
Answer
Steps involved in knowledge discovery are :
1. Data cleaning :
a. Data cleaning is defined as removal of noisy and irrelevant data from the collection.
b. It includes :
i. Cleaning in case of missing values.
ii. Cleaning noisy data, where noise is a random or variance error.
iii. Cleaning with data discrepancy detection and data transformation tools.
2. Data integration :
a. Data integration is defined as heterogeneous data from multiple sources combined in a common source (Data Warehouse).
b. It includes :
i. Data integration using data migration tools.
ii. Data integration using data synchronization tools.
iii. Data integration using ETL (Extract-Transform-Load) process.
3. Data selection :
a. Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data collection.
b. It includes :
i. Data selection using neural network.
ii. Data selection using decision trees.
iii. Data selection using Naive Bayes.
iv. Data selection using clustering, regression, etc.
4. Data transformation :
a. In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
b. Data transformation is a two step process :
i. Data mapping : Assigning elements from source base to destination to capture transformations.
ii. Code generation : Creation of the actual transformation program.
5. Data mining :
a. Data mining is defined as clever techniques that are applied to extract potentially useful patterns.
b. It includes :
i. Transforms task relevant data into patterns.
ii. Decides purpose of model using classification or characterization.
6. Pattern evaluation : Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge based on given measures.
7. Knowledge representation : Knowledge representation is defined as a technique which utilizes visualization tools to represent data mining results.
[Figure: Data mining -> Patterns]
Que 3.4. How are data mining systems classified ? Describe each classification with example. [AKTU 2016-17, Marks 10]
Answer
Data mining systems can be classified according to the following :
1. Database technology
2. Statistics
3. Machine learning
Data mining system can also be classified as :
a. Classification based on the databases mined : Database systems are classified according to different criteria such as data models and types of data. For example, if we classify databases according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.
b. Classification based on the kind of knowledge mined : It means the data mining system is classified on the basis of functionalities such as characterization, discrimination, association analysis, classification, prediction, outlier analysis, evolution analysis. A comprehensive data mining system usually provides multiple integrated data mining functionalities.
c. Classification based on the techniques utilized : We can classify a data mining system according to the kind of techniques used, such as the degree of user interaction (autonomous systems, interactive exploratory systems, query-driven systems) or the methods of analysis employed such as machine learning, statistics, visualization, pattern recognition, neural networks.
d. Classification based on the applications adapted : We can classify a data mining system according to the applications adapted. The applications are as follows : finance, telecommunications, DNA, stock markets.
Que 3.5. Explain data mining functionalities.
Answer
Following are the data mining functionalities :
1. Data characterization : It is a summarization of the general characteristics or features of a target class of data.
2. Data discrimination : It refers to the mapping or classification of a class with some predefined group or class.
3. Association analysis : It analyses the set of items that frequently appear together in a transactional dataset.
4. Classification : In classification, data are grouped into predefined classes.
5. Prediction : It refers to predicting some unavailable data values rather than class labels.
6. Cluster analysis : Classification and prediction analyze class-labeled data objects, whereas clustering analyzes data objects without consulting a known class label.
7. Outlier analysis : Outliers are data elements that cannot be grouped in a given class or cluster.
8. Evolution analysis : Evolution analysis refers to the description and modelling of regularities or trends for objects whose behaviour changes over time.
Que 3.6. Describe the difference between the following approaches for the integration of data mining system with database or data warehouse systems : no coupling, loose coupling and semi-tight coupling. [AKTU 2015-16, Marks 7.5]
Answer
If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the no coupling scheme.
Various integration schemes are as follows :
a. No coupling : In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms.
b. Loose coupling : In this scheme, the data mining system may use some of the functions of database and data warehouse system. It fetches the data from the data repository and performs data mining on that data.
c. Semi-tight coupling : In this scheme, the data mining system is linked with a database or a data warehouse system, and efficient implementations of a few data mining primitives can be provided in the database.
d. Tight coupling : In this scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.
PART-2
Data Processing, Form of Data Pre-Processing.
CONCEPT OUTLINE
• Data processing is the conversion of data into usable and desired form.
• Forms of data processing are data cleaning, data integration, data transformation and data reduction.
Que 3.7. What are the different forms of data processing ?
[AKTU 2014-15, Marks 06]
Answer
Different forms of data processing are :
1. Data cleaning : Data cleaning is a process to remove the noisy data, clean the data by filling in the missing values and correct the inconsistencies in data.
2. Data integration : Data integration is a technique that combines the data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.
3. Data transformation : In this step, data is transformed or consolidated into forms appropriate for mining. It involves :
c. Generalization : In generalization, low-level data are replaced with high-level data by climbing concept hierarchies.
d. Normalization : Normalization scales attribute data so as to fall within a small specified range, such as 0.0 to 1.0. It is of two types :
i. Min-max normalization : It is a technique that linearly rescales the data so that it lies between 0 and 1.
ii. Z-score normalization : Transform the data by converting the values to a common scale with an average of zero and a standard deviation of one.
e. Attribute/feature construction : New attributes are constructed from the given ones.
4. Data reduction : Data reduction is used to obtain a reduced representation of data that is much smaller in volume while maintaining the integrity of original data.
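The two normalization types named above can be sketched as follows. This is a minimal illustration and not a library API: the function names and the sample values are assumptions, and the z-score version uses the population standard deviation (dividing by n).

```python
from statistics import mean, pstdev


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("min-max normalization needs at least two distinct values")
    scale = (new_max - new_min) / (highest - lowest)
    return [new_min + (v - lowest) * scale for v in values]


def z_score_normalize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score normalization needs non-constant values")
    return [(v - mu) / sigma for v in values]


marks = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical attribute values
print(min_max_normalize(marks))
print(z_score_normalize(marks))
```

Min-max normalization preserves the relative spacing of the values but is sensitive to extreme values, whereas z-score normalization is usually preferred when the minimum and maximum are unknown or dominated by outliers.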
Que 3.8. Data consolidation is data modeling activity. This statement is true or not ? Justify.
[AKTU 2013-14, Marks 05]
Answer
1. The statement is true, as data consolidation means transforming data into the forms that are appropriate for mining by performing certain operations.
2. The normal data which we obtain from different data sources is not in a suitable form to be stored in data warehouses or for performing data mining operations. So, data is modeled for further activities after performing data consolidation.
3. Data consolidation involves the following operations : Refer Q. 3.7, Page 3-8D, Unit-3.
‘Data Cleaning : Missing Values, Noisy Data (Binning, Clustering
‘Regression, Computer and Furman Inspection) Inco
msiatent Data.s10D (CST) EE
Que 3.9. How to handle noisy data ?
Answer
Noise is a random error or variance in a measured variable.
Following are the data smoothing techniques :
1. Binning : It is a technique in which first of all we sort the data and then partition the data into equal-frequency bins. For example,
Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
a. Partition into (equal-frequency) bins :
Bin a : 4, 8, 15, Bin b : 21, 21, 24, Bin c : 25, 28, 34
b. Smoothing by bin means : In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Bin a : 9, 9, 9, Bin b : 22, 22, 22, Bin c : 29, 29, 29
c. Smoothing by bin boundaries : In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
Bin a : 4, 4, 15, Bin b : 21, 21, 24, Bin c : 25, 25, 34
2. Regression :
a. Data can be smoothed by fitting the data to a regression function. Linear regression and multiple linear regression are types of regression.
b. A regression task begins with a dataset in which the target values are known.
c. For example, a regression model could be used to predict the value of a house based on location, number of rooms, lot size, and other factors.
3. Clustering :
a. Outliers may be detected by clustering, where similar values are organized into groups, or clusters.
b. Values that fall outside of the set of clusters may be considered outliers.
c. For example, clustering analysis can be used in areas such as market research, pattern recognition, data analysis, and image processing.
4. Combined computer and human inspection : The outliers can also be identified with the help of computer and human inspection. The outlier patterns can be informative or garbage. Humans can sort out the garbage patterns.
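The binning example with the nine price values can be reproduced with a short sketch. The function names are illustrative, the code assumes the number of values divides evenly into the bins, and a value exactly midway between the two boundaries is sent to the lower boundary (an assumption, since the text does not state a tie rule).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_bin_boundaries(bins):
    """Replace every value by the nearer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: a value exactly midway goes to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_bin_means(bins))
print(smooth_by_bin_boundaries(bins))
```

With three bins, the printed results agree with Bin a, Bin b and Bin c in the worked example for both smoothing methods.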
Que 3.10. Elaborate the different strategies for data cleaning.
[AKTU 2017-18, Marks 10]
Answer
Data is cleaned through processes such as data migration, data scrubbing and data auditing :
1. Data migration :
a. During data migration, transformation rules are specified (for example, replacing sex by gender) to clean the data.
b. Transcription errors, incomplete information and lack of standard formats are also addressed during data migration.
2. Data scrubbing :
a. It involves detecting and removing errors and inconsistencies from data in order to improve the quality of data.
b. Data scrubbing involves a complex cleaning and mapping process that is the most labor intensive part of building a data warehouse.
c. During the cleaning process, desired information is filtered out and its quality is maintained for the target system.
3. Data auditing :
a. Data auditing tools make it possible to discover rules and relationships or to signal violation of stated rules by scanning data.
b. It enhances the system's reliability and makes it possible to prevent, detect, and eliminate data errors, irregularities, and fraud.
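Que 3.11. List the ways to handle the missing values. What do you mean by inconsistent data ?
Answer
Ways to handle missing values are :
1. Ignore the tuple : This is usually done when class label is missing.
2. Fill in the missing value manually : This approach may not be feasible for a large data set.
3. Use a global constant to fill in the missing value : Replace all missing attribute values by the same constant.
4. Use the attribute mean to fill in the missing value.
5. Use the attribute mean for all samples belonging to the same class.
6. Use the most probable value to fill in the missing value : This may be determined with regression or decision tree induction.
Inconsistent data : Data inconsistency occurs when similar data is kept in different formats in more than one place. Such data must be corrected before mining.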
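Three of the filling strategies listed above (global constant, attribute mean, and attribute mean per class) can be sketched as below. The function names and sample records are hypothetical, `None` is assumed to mark a missing value, and the class-mean version assumes every class has at least one known value.

```python
from statistics import mean


def fill_with_constant(values, constant="Unknown"):
    """Replace every missing value by the same global constant."""
    return [constant if v is None else v for v in values]


def fill_with_attribute_mean(values):
    """Replace every missing value by the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(rows):
    """rows are (class_label, value) pairs; fill using the mean of the same class."""
    by_class = {}
    for label, value in rows:
        if value is not None:
            by_class.setdefault(label, []).append(value)
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [(label, class_means[label] if value is None else value)
            for label, value in rows]


incomes = [30, None, 50, 40, None]  # hypothetical attribute with gaps
customers = [("low_risk", 60), ("low_risk", None),
             ("high_risk", 20), ("high_risk", None), ("low_risk", 80)]
print(fill_with_constant(incomes))
print(fill_with_attribute_mean(incomes))
print(fill_with_class_mean(customers))
```

Filling by class mean generally distorts the data less than a single overall mean, because the substituted value reflects the group the tuple belongs to.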
Que 3.12. Explain Chi-square test method. Show using Chi-square test whether gender and preferred reading are independent or not from the following data (given are the observed counts).
            | Male | Female | Total
Fiction     | 250  | 200    | 450
Non-Fiction | 50   | 1000   | 1050
Total       | 300  | 1200   | 1500
[AKTU 2015-16, Marks 15]
Answer
1. A correlation relationship between two categorical (discrete) attributes, A and B, can be discovered by a $\chi^2$ (Chi-square) test.
2. The $\chi^2$ value (also known as the Pearson $\chi^2$ statistic) is computed as
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum_{i} \sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
where $o_{ij}$ is the observed frequency (i.e., actual count) of the joint event $(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$, which can be computed as
$e_{ij} = \frac{\text{count}(A = a_i) \times \text{count}(B = b_j)}{N}$
where $N$ is the number of data tuples.
1. Suppose that a group of 1,500 people was surveyed. The gender of each person was noted. Each person was polled as to whether their preferred type of reading material was fiction or non-fiction. Thus, we have two attributes, gender and preferred reading.
2. The observed frequency (or count) of each possible joint event is summarized in the contingency table as shown, where the numbers in parentheses are the expected frequencies.
            | Male     | Female     | Total
Fiction     | 250 (90) | 200 (360)  | 450
Non-Fiction | 50 (210) | 1000 (840) | 1050
Total       | 300      | 1200       | 1500
3. The expected frequency for the cell (male, fiction) is
$e_{11} = \frac{\text{count(male)} \times \text{count(fiction)}}{N} = \frac{300 \times 450}{1500} = 90$
and so on.
4. Using the equation for $\chi^2$ computation, we get
$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840}$
$= 284.44 + 121.90 + 71.11 + 30.48 = 507.93$
5. For this 2 × 2 table, the degrees of freedom are (2 - 1)(2 - 1) = 1. For 1 degree of freedom, the $\chi^2$ value needed to reject the hypothesis at the 0.001 significance level is 10.828. Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
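The computation above can be checked with a small sketch that derives the expected counts from the row and column totals. The function name `chi_square` is illustrative; the observed counts are the ones given in the question, with rows for fiction and non-fiction and columns for male and female.

```python
def chi_square(observed):
    """Return the Pearson chi-square statistic and degrees of freedom
    for a contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count for cell (i, j)
            statistic += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Rows: fiction, non-fiction. Columns: male, female.
observed = [[250, 200], [50, 1000]]
statistic, dof = chi_square(observed)
print(round(statistic, 2), dof)

CRITICAL_VALUE_0_001 = 10.828  # from the text, for 1 degree of freedom
print("reject independence:", statistic > CRITICAL_VALUE_0_001)
```

The unrounded statistic is about 507.94; the hand computation gives 507.93 because each term is rounded to two decimals before adding. Either way the value is far above 10.828.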
Answer
Methods for attribute subset selection are :
1. Stepwise forward selection : In this method, the best of the original attributes is determined and added to the reduced set.
For example : Initial attribute set : {A1, A2, A3, A4, A5}
Initial reduced set : {} => {A1} => {A1, A4}
Reduced attribute set : {A2, A3, A5}
2. Stepwise backward elimination : It removes the worst attribute remaining in the set.
For example : Initial attribute set : {A1, A2, A3, A4, A5}
{A1, A3, A4, A5} => {A1, A4, A5}
Reduced attribute set : {A1, A5}
3. Combination of forward selection and backward elimination : This procedure selects the best attribute and removes the worst from remaining attributes.
For example :
Initial attribute set : {A1, A2, A3, A4, A5}
Reduced attribute set in stepwise forward selection : {A2, A3, A5}
Reduced attribute set in stepwise backward elimination : {A1, A5}
Reduced attribute set : {A1, A2, A3, A5}
4. Decision tree induction : It constructs a flowchart where the best attribute is chosen to partition the data into individual classes.
For example : Initial attribute set : {A1, A2, A3, A4, A5, A6}
Fig. 3.14.1. Decision tree.
Reduced attribute set : {A1, A4, A6}
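Stepwise forward selection can be sketched as a greedy loop. This is only an illustration under stated assumptions: the caller supplies a `score` function that rates a candidate subset, and the `RELEVANCE` weights and `toy_score` below are hypothetical stand-ins for a real measure such as information gain or model accuracy.

```python
def stepwise_forward_selection(attributes, score, max_attributes):
    """Greedily add the attribute that most improves score(subset)."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining and len(selected) < max_attributes:
        best_candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [best_candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the reduced set
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_score = candidate_score
    return selected


# Hypothetical relevance weights, used only to make the example runnable.
RELEVANCE = {"A1": 0.5, "A2": 0.0, "A3": 0.0, "A4": 0.3, "A5": 0.0}


def toy_score(subset):
    return sum(RELEVANCE[a] for a in subset)


reduced = stepwise_forward_selection(["A1", "A2", "A3", "A4", "A5"], toy_score, max_attributes=3)
print(reduced)
```

With these assumed weights the reduced set grows as {} => {A1} => {A1, A4} and then stops, because no further attribute raises the score. Backward elimination follows the same pattern in reverse, starting from the full set and dropping the attribute whose removal hurts the score least.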
Que 3.15. Write a short note on dimensionality reduction.
Answer
1. In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data.
2. There are two components of dimensionality reduction :
a. Feature selection : Feature selection is a process of removing features that are not relevant or are redundant.
b. Feature extraction : Feature extraction is a process of transforming raw data into features suitable for modeling.
3. The various methods used for dimensionality reduction include :
a. Wavelet transform : It is a linear signal processing technique that transforms a data vector into a numerically different vector of wavelet coefficients.
b. Principal Component Analysis (PCA) : In this, the data in a higher dimensional space is mapped to data in a lower dimension space. It involves the following steps :
i. Construct the covariance matrix of the data.
ii. Compute the eigenvectors of this matrix.
iii. Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of variance of the original data.
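The three PCA steps can be sketched in plain Python for the first principal component only. This is a simplified illustration: it uses power iteration instead of a full eigen-decomposition, it assumes the starting vector is not orthogonal to the leading eigenvector, and the sample points are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Step i: centre the data and build the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def leading_eigenvector(matrix, iterations=200):
    """Step ii (simplified): power iteration for the largest eigenvalue."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            raise ValueError("covariance matrix is zero; data has no variance")
        vec = [x / norm for x in nxt]
    return vec


def project_onto_first_component(rows):
    """Step iii: map each point to its coordinate along the first component."""
    cov, centered = covariance_matrix(rows)
    pc1 = leading_eigenvector(cov)
    scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
    return pc1, scores


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # hypothetical 2-D data
pc1, scores = project_onto_first_component(points)
print(pc1)
print(scores)
```

Because the sample points lie on the line y = 2x, a single component captures all of their variance, so the two-dimensional data is represented by one number per point without loss.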
Que 3.16. Discuss numerosity reduction in detail.
Answer
In numerosity reduction, data volume can be reduced by choosing alternative forms of data representation. The various methods used for numerosity reduction include :
a. Regression and log-linear model : These models are used to approximate the given data.
b. Histograms : Histograms use binning to approximate data distributions. It divides data into buckets and stores the average or sum for each bucket.
c. Clustering : Partition data set into clusters based on similarity and store the cluster representation only.
d. Sampling : It allows a large data set to be represented by a much smaller random sample of the data.
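Two of the non-parametric methods above, histograms and sampling, can be sketched as follows. The function names, the equal-width bucketing choice and the fixed seed are assumptions made for illustration; the histogram helper expects at least two distinct values.

```python
import random


def equal_width_histogram(values, n_buckets):
    """Store only a count and a sum per bucket instead of the raw values."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_buckets
    buckets = [{"low": lowest + i * width, "high": lowest + (i + 1) * width,
                "count": 0, "sum": 0.0} for i in range(n_buckets)]
    for v in values:
        index = min(int((v - lowest) / width), n_buckets - 1)  # max value goes in last bucket
        buckets[index]["count"] += 1
        buckets[index]["sum"] += v
    return buckets


def simple_random_sample(values, k, seed=0):
    """Simple random sample without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(values, k)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical attribute values
histogram = equal_width_histogram(data, 3)
for bucket in histogram:
    print(bucket)
sample = simple_random_sample(data, 4)
print(sample)
```

The histogram keeps three small records in place of ten values, and the average of a bucket can be recovered as sum divided by count.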
Que 3.17. Distinguish between dimensionality reduction and numerosity reduction.
Answer
S. No. | Dimensionality reduction | Numerosity reduction
1. | In dimensionality reduction, data encoding or transformations are applied to obtain a reduced or compressed representation of original data. | In numerosity reduction, data volume is reduced by choosing alternative, smaller forms of data representation.
2. | Methods for dimensionality reduction are : a. Wavelet transforms b. Principal Component Analysis (PCA) | Methods for numerosity reduction are : a. Regression and log-linear model (parametric) b. Histograms, clustering, sampling (non-parametric)
3. | It can be used for removing irrelevant and redundant attributes. | It is merely a representation technique of original data in a smaller form.
Que 3.18. Write short note on concept hierarchy generation for numeric data.
Answer
Concept hierarchy generation methods for numeric data are :
1. Binning :
a. Binning is a top-down splitting technique based on a specified number of bins.
b. Binning is an unsupervised discretization technique.
2. Histogram analysis :
a. Histograms partition the values for an attribute into disjoint ranges called buckets.
b. Histogram analysis is an unsupervised discretization technique.
3. Cluster analysis : It is used to partition the data into clusters or groups.
Answer
1. Categorical data are discrete data.
2. Categorical attributes have a finite number of distinct values, with no ordering among the values.
3. There are several methods for generation of concept hierarchies for categorical data :
a. Specification of a partial ordering of attributes explicitly at the schema level by experts : Concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or an expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
b. Specification of a portion of a hierarchy by explicit data grouping : In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration. However, it is realistic to specify explicit groupings for a small portion of the intermediate level data.
c. Specification of a set of attributes but not their partial ordering : A user may specify a set of attributes forming a concept hierarchy, but omit to specify their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
d. Specification of only a partial set of attributes : To handle partially specified hierarchies, it is important to embed data semantics in the database schema so that attributes with tight semantic connections can be pinned together.
Que 3.20. What do you mean by data mining ? Differentiate between data mining technique and data mining strategy.
[AKTU 2013-14, Marks 05]
Answer
Data mining : Refer Q. 3.2, Page 3-2D, Unit-3.
CONTENTS
Part-1 : Classification : Definition, Data Generalization, Analytical Characterization, Analysis of Attribute Relevance
Part-10 : Basic Algorithms, Parallel and Distributed Algorithms, Neural Network Approach
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 4.1. Define the terms data generalization and analytical characterization with example.
Answer
Data generalization :
1. Data generalization summarizes data by replacing relatively low level
values with higher level concepts.
2. Data generalization approaches include : Data cube approach and
attribute oriented induction approach.
3. Data generalization is a form of descriptive data mining.
4. For example, let us consider the database of XYZ electronics. Instead of examining individual customer transactions, the sales manager may prefer to view the data generalized to higher levels, such as summarized by customer groups according to regions, income, etc.
Analytical characterization :
1. Analytical characterization performs attribute and dimension relevance analysis in order to filter out irrelevant or weakly relevant attributes.
2. It is performed to overcome the various limitations of class characterization.
3. For example, employee birth_date, birth_month, birth_year are not relevant to the employee's salary, but experience is highly relevant to the salary of employee.
Que 4.2. Explain data cube approach and attribute oriented approach.
[AKTU 2014-15, Marks 05]
OR
Discuss basic approaches of data generalization.
Answer
There are two basic approaches of data generalization :
1. Data cube approach :
a. It is also known as OLAP approach.
b. In this approach, computation and results are stored in the data cube.
c. It is an efficient approach as it is helpful to make the past selling graph.
d. It uses roll-up and drill-down operations on a data cube.
2. Attribute oriented induction :
a. It is an online data analysis, query oriented and generalization based approach.
b. In this approach, we perform generalization on the basis of different values of each attribute within the relevant data set. After that, same tuples are merged and their respective counts are accumulated in order to perform aggregation.
c. Attribute oriented induction approach uses two methods :
i. Attribute removal
ii. Attribute generalization
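Attribute oriented induction, as described in point 2, can be sketched in a few lines. The customer records, the city-to-region hierarchy and the age groups below are hypothetical and chosen only to show attribute removal, attribute generalization and the merging of identical tuples with accumulated counts.

```python
from collections import Counter

# Hypothetical concept hierarchy for the attribute "city": city -> region.
CITY_TO_REGION = {"Lucknow": "North", "Delhi": "North", "Chennai": "South"}


def generalize_age(age):
    """Hypothetical concept hierarchy for the attribute "age"."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle_aged"
    return "senior"


def attribute_oriented_induction(rows):
    counts = Counter()
    for row in rows:
        # Attribute removal: "name" has many distinct values and no hierarchy, so it is dropped.
        # Attribute generalization: city -> region, age -> age group.
        generalized = (CITY_TO_REGION[row["city"]], generalize_age(row["age"]))
        counts[generalized] += 1  # identical generalized tuples are merged, counts accumulate
    return dict(counts)


customers = [
    {"name": "Asha", "city": "Lucknow", "age": 24},
    {"name": "Ravi", "city": "Delhi", "age": 27},
    {"name": "Meena", "city": "Chennai", "age": 45},
    {"name": "Karan", "city": "Delhi", "age": 52},
]
generalized_relation = attribute_oriented_induction(customers)
print(generalized_relation)
```

Four detailed tuples collapse into three generalized tuples, each carrying a count that can later be used for aggregation.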
PART-2
Mining Class Comparisons.
In many applications, users may not be interested in having a single class description, but may need to compare two or more classes.
Steps of class comparisons are :
1. Data collection
2. Dimension relevance analysis
3. Synchronous generalization
4. Presentation of the derived comparison
Que 4.4. Discuss the role of statistics in data mining.
Answer
1. Statistics is a component of data mining that provides the tools for data analysis.
2. It is the science of learning from data and includes everything from collecting and organizing to analyzing and presenting data.
3. Statistics is used in data mining to manage the data, its analysis and the automation of data analysis.
4. Main areas where the statistical approach is used in data mining include smoothing of data, sampling and data analysis.
Que 4.5. Explain various measures of central tendency.
Answer
Measures of central tendency are :
1. Mean :
a. It is a center of the data set.
b. Let data set X have n values $x_1, x_2, \ldots, x_n$.
Mean of data set : $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
2. Median :
a. It is the middle value of the ordered set if the number of values n is an odd number, or it is the average of the middle two values if n is an even number.
b. Median $= x_{(n+1)/2}$ if n is odd, and Median $= \frac{x_{n/2} + x_{n/2+1}}{2}$ if n is even.
3. Mode : It is the most frequently occurring value from a large data set. For moderately skewed data, Mode = 3 × Median - 2 × Mean.
4. Midrange : It is the average of the largest and smallest value of the data set.
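The four measures can be computed with the Python standard library as a quick check. The sample data is hypothetical, and `midrange` is a small helper written for this sketch because the `statistics` module does not provide one.

```python
from statistics import mean, median, mode


def midrange(values):
    """Average of the largest and smallest value."""
    return (min(values) + max(values)) / 2


data = [2, 3, 3, 5, 7, 10]  # hypothetical data set with an even number of values
summary = {
    "mean": mean(data),
    "median": median(data),   # average of the two middle values, 3 and 5
    "mode": mode(data),       # most frequently occurring value
    "midrange": midrange(data),
}
print(summary)
```

Because the data set has an even number of values, the median is the average of the two middle values rather than a value that occurs in the data.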
PART-3
Statistical Measures in Large Database, Statistical-Based Algorithms, Distance-Based Algorithms.
CONCEPT OUTLINE
• The descriptive statistics used as statistical measures are :
1. Measuring the central tendency
2. Measuring the dispersion of data
• Distance-based algorithms are :
1. Simple approach
2. k-nearest neighbours
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 46. |] Discuss various measures of dispersion of data.
oR
Measures ofdaprsin of data are
1. Range:The rage of the data neti the diference between
lowest value, 7 —
\ Range = HL
3 Teint nine in at
Quarles: The frst quartile is denoted by Qi the 25th
noted by Qi percentile
teeter ne ceased Se is the 75th percentile. The distance
bine tei ended quartile enue itn
7 range covered by the middle half of the data. This
Aistaneiscalled as Interquartile Range (QR), defined as:
IQR = Q3-Q1
3% Outliers : Outliers are the values higher/lower than 1.5*1QR.
46D (CaIT6) eo
sification and Clustering
4 Boxplot: Boxpos
are popular way of vialing a diet
borpit ncorpraenthe fie number wunmaryar
4 Typialy, the ends ofthe bo
the ends ofthe ox are the quartiles, 50
length is the Interquartile Range QR =
Themedian is marked by alin within the bx.
‘Twolne called whites outside the box
her) outside the bxexend othe malt
‘Minimon andlaret simun eertone
5. Standard deviation and variance: The standard deviation of a data set gives a measure of how each value in the data set varies from the mean. The standard deviation of n observations $x_1, x_2, \ldots, x_n$ with mean $\bar{x}$ is:
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
The basic properties of the standard deviation are:
a. $\sigma$ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
b. $\sigma = 0$ only when there is no spread, that is, when all observations have the same value. Otherwise $\sigma > 0$.
The variance is the mean of the squared deviations about the mean, denoted by $\sigma^2$. The variance of n observations $x_1, x_2, \ldots, x_n$ is given by:
$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
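A short sketch of the range, variance and standard deviation follows. It assumes the divide-by-n (population) form of the variance and uses `pvariance` and `pstdev` from Python's `statistics` module; the function name `dispersion` and the sample values are illustrative only.

```python
from statistics import pstdev, pvariance

def dispersion(values):
    """Range, population variance and population standard deviation."""
    return {
        "range": max(values) - min(values),
        "variance": pvariance(values),  # mean of squared deviations (divides by n)
        "std_dev": pstdev(values),      # square root of the variance
    }

sample = [1, 3, 9, 15, 20]  # illustrative values
result = dispersion(sample)
print(result)

# With no spread at all, the standard deviation is zero.
print(pstdev([5, 5, 5]))
```

If the sample (divide by n - 1) form is wanted instead, `statistics.variance` and `statistics.stdev` are the corresponding calls.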
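Que 4. Draw a box-and-whisker plot for the following data set:
126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146, 147, 148, 148,
149, 149, 150, 150, 150, 154, 155, 158, 158.
Also find the outliers. AKTU 2015-16, Marks 10
Answer
Given: 126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146, 147, 148,
148, 149, 149, 150, 150, 150, 154, 155, 158, 158
1. Since there are 25 data points, the median Q2 is the 13th value: Q2 = 146.
2. The first half (the values below the median) contains twelve values, so its median is the average of the middle two:
Q1 = (141 + 142)/2 = 141.5
3. The median of the second half is:
Q3 = (150 + 150)/2 = 150
4. IQR = Q3 - Q1 = 150 - 141.5 = 8.5, so 1.5 × IQR = 12.75.
5. Q3 + 12.75 = 150 + 12.75 = 162.75. No data point exceeds 162.75, so there are no outliers at the upper end.
6. Q1 - 12.75 = 141.5 - 12.75 = 128.75. An outlier is any data point beyond these limits. The data point 126 is less than 128.75, so 126 is an outlier.
7. In the plot, the box runs from Q1 = 141.5 to Q3 = 150 with a line at the median 146. The whiskers extend to 132 and 158, and 126 is marked separately as an outlier.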
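The quartiles and the 1.5 × IQR limits for this data set can be reproduced with a few lines of Python. The sketch assumes the "median of each half" rule for Q1 and Q3 (the middle value is left out of both halves when n is odd) and the data values as read from the question; the function names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, median, Q3 and maximum using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n the middle value is excluded from both halves.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return min(xs), median(lower), median(xs), median(upper), max(xs)

def find_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_limit or x > high_limit]

data = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146,
        147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158]

summary = five_number_summary(data)
outliers = find_outliers(data)
print(summary)
print(outliers)
```

Other quartile conventions (for example, interpolated percentiles) can give slightly different Q1 and Q3 values, so the limits depend on the rule chosen.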
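Que. Given the following set of values {1, 3, 9, 15, 20}, determine the jackknife estimate for both the mean and standard deviation of the mean. AKTU, Marks 05
Answer
Given: $x_1 = 1$, $x_2 = 3$, $x_3 = 9$, $x_4 = 15$, $x_5 = 20$, n = 5.
Mean of the full data set:
$\bar{x} = \frac{1 + 3 + 9 + 15 + 20}{5} = 9.6$
Mean with one value ignored:
By ignoring $x_1$: $\frac{3 + 9 + 15 + 20}{4} = 11.75$
By ignoring $x_2$: $\frac{1 + 9 + 15 + 20}{4} = 11.25$
By ignoring $x_3$: $\frac{1 + 3 + 15 + 20}{4} = 9.75$
By ignoring $x_4$: $\frac{1 + 3 + 9 + 20}{4} = 8.25$
By ignoring $x_5$: $\frac{1 + 3 + 9 + 15}{4} = 7$
Jackknife estimate for mean is given by:
$\frac{11.75 + 11.25 + 9.75 + 8.25 + 7}{5} = \frac{48}{5} = 9.6$
Standard deviation of the full data set:
$\sigma = \sqrt{\frac{1}{5}\left[(1 - 9.6)^2 + (3 - 9.6)^2 + (9 - 9.6)^2 + (15 - 9.6)^2 + (20 - 9.6)^2\right]} = \sqrt{51.04} = 7.144$
Standard deviation with one value ignored (deviations are taken about the mean 9.6):
By ignoring $x_1$: $\sigma_1 = \sqrt{\frac{1}{4}\left[(3 - 9.6)^2 + (9 - 9.6)^2 + (15 - 9.6)^2 + (20 - 9.6)^2\right]} = \sqrt{45.31} = 6.73$
By ignoring $x_2$: $\sigma_2 = \sqrt{\frac{1}{4}\left[(1 - 9.6)^2 + (9 - 9.6)^2 + (15 - 9.6)^2 + (20 - 9.6)^2\right]} = \sqrt{52.91} = 7.27$
By ignoring $x_3$: $\sigma_3 = \sqrt{\frac{1}{4}\left[(1 - 9.6)^2 + (3 - 9.6)^2 + (15 - 9.6)^2 + (20 - 9.6)^2\right]} = \sqrt{63.71} = 7.98$
By ignoring $x_4$: $\sigma_4 = \sqrt{\frac{1}{4}\left[(1 - 9.6)^2 + (3 - 9.6)^2 + (9 - 9.6)^2 + (20 - 9.6)^2\right]} = \sqrt{56.51} = 7.52$
By ignoring $x_5$: $\sigma_5 = \sqrt{\frac{1}{4}\left[(1 - 9.6)^2 + (3 - 9.6)^2 + (9 - 9.6)^2 + (15 - 9.6)^2\right]} = \sqrt{36.76} = 6.06$
Mean of these values:
$\bar{\sigma}_{(\cdot)} = \frac{6.73 + 7.27 + 7.98 + 7.52 + 6.06}{5} = 7.11$
Jackknife estimate for standard deviation is given by:
$n\,\sigma - (n - 1)\,\bar{\sigma}_{(\cdot)} = 5(7.144) - (5 - 1)(7.11) = 35.72 - 28.44 = 7.28$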
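The jackknife calculation can be sketched in code as the full-sample estimate scaled by n, minus (n - 1) times the average of the leave-one-out estimates. The helper names below are hypothetical. The `spread_about` helper is an assumption read from the worked numbers, which appear to measure every leave-one-out deviation about the full-sample mean 9.6 and to divide by the number of values kept.

```python
from math import sqrt

def jackknife(values, estimator):
    """Bias-corrected jackknife: n*theta_full - (n-1)*mean(leave-one-out thetas)."""
    n = len(values)
    theta_full = estimator(values)
    leave_one_out = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    theta_bar = sum(leave_one_out) / n
    return n * theta_full - (n - 1) * theta_bar

def mean(xs):
    return sum(xs) / len(xs)

def spread_about(center):
    """Root-mean-square deviation about a fixed center (divides by len(xs))."""
    return lambda xs: sqrt(sum((x - center) ** 2 for x in xs) / len(xs))

data = [1, 3, 9, 15, 20]
jk_mean = jackknife(data, mean)
jk_std = jackknife(data, spread_about(mean(data)))  # center fixed at 9.6
print(round(jk_mean, 2), round(jk_std, 2))
```

Because the code does not round the intermediate square roots, its standard-deviation estimate can differ from a hand calculation in the second decimal place. A more conventional jackknife would recompute the mean inside each leave-one-out sample; passing a different `estimator` is enough to try that variant.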
Que. Write short notes on:
i. Quantile plots
ii. Histograms
iii. Scatter plots
OR
Explain the various graphs for statistical class description.
Different types of graphs are:
1. Histogram: In this, we partition the data distribution of an attribute into disjoint subsets (buckets), and the width of each subset should be uniform. Each subset is represented by a rectangle whose height is equal to the count of the values in the subset.
2. Scatter plots: This graphical method is used for determining the existence of any relationship or pattern between two numerical attributes. In this method, every pair of values is considered as a pair of coordinates in an algebraic sense and plotted as a point in the plane.
3. Quantile plots: A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute. Second, it plots quantile information. The mechanism used in this step is slightly different from the percentile computation.
4. Q-Q (Quantile-Quantile) plot: A quantile-quantile plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool that allows the user to view whether there is a shift in going from one distribution to another.
Que. Write a short note on Bayesian classification.
[ARTO 2015-14, Marks 0
1. Bayesian classifiers are statistical classifiers.
2. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
3. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
4. Bayesian classification is based on Bayes' theorem.
Bayes' theorem: The purpose of Bayes' theorem is to predict the class label for a given tuple. Let X be a data tuple. In Bayesian terms, X is considered "evidence". Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. There are two types of probabilities:
1. Posterior probability, P(H|X)
2. Prior probability, P(H)
where X is the data tuple and H is some hypothesis. According to Bayes' theorem,
$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
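As a small numeric illustration of Bayes' theorem, the sketch below computes a posterior probability from a prior and two likelihoods. The function name and all three input numbers are hypothetical; the evidence term P(X) is expanded with the law of total probability over H and not-H.

```python
def posterior(prior_h, p_x_given_h, p_x_given_not_h):
    """P(H|X) = P(X|H) P(H) / P(X), with P(X) summed over H and not-H."""
    evidence = p_x_given_h * prior_h + p_x_given_not_h * (1 - prior_h)
    return p_x_given_h * prior_h / evidence

# Hypothetical numbers: P(H) = 0.3, P(X|H) = 0.8, P(X|not H) = 0.2
p = posterior(0.3, 0.8, 0.2)
print(round(p, 4))
```

Observing evidence that is more likely under H than under not-H raises the probability of H above its prior of 0.3.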
Que. Write a short note on Naive Bayes classifier.
1. A Naive Bayes classifier uses probability theory to classify data. Naive Bayes is also known as simple Bayes or independence Bayes.
2. Naive Bayes is a kind of classifier which uses Bayes' theorem.
3. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class.
4. The class with the highest probability is considered the most likely class. This is also known as Maximum A Posteriori (MAP).
5. A Naive Bayes classifier assumes that all the features are unrelated to each other.
6. For example, a fruit may be considered to be an apple on the basis of several features. Even if these features depend on each other or on the existence of the other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability that the fruit is an apple.
Que. Classify the tuple X = (Colour = RED, Type = SUV, Origin = DOMESTIC) using Naive Bayes. Training data is given in the following table where the class label is (STOLEN). AKTU, Marks 15
[Table: training data with columns Colour, Type, Origin, Stolen]
Answer
[Table: Yes/No frequency counts for Colour, Type and Origin]
Therefore the prediction is no.

Que. Classify the tuple X = (age = senior, income = medium, student = no, credit rating = fair) using the training data in the following table.
Age | Income | Student | Credit rating | Class: buys_computer
youth | high | No | Fair | No
youth | high | No | Excellent | No
middle aged | high | No | Fair | Yes
senior | medium | No | Fair | Yes
senior | low | Yes | Fair | Yes
senior | low | Yes | Excellent | No
middle aged | low | Yes | Excellent | Yes
youth | medium | No | Fair | No
youth | low | Yes | Fair | Yes
senior | medium | Yes | Fair | Yes
youth | medium | Yes | Excellent | Yes
middle aged | medium | No | Excellent | Yes
middle aged | high | Yes | Fair | Yes
senior | medium | No | Excellent | No
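Laplacian correction: Laplacian correction is a technique used for avoiding zero probability values. It assumes that the training set is so large that adding 1 to each count will make a negligible difference in the estimated probability, while avoiding the case of a zero probability. If 1 is added to each of q counts, then q must also be added to the corresponding denominator used in the probability calculation.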
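The effect of the correction can be seen on a small set of counts. In this sketch the function name and the counts are hypothetical: one attribute value never occurs in the class, which would otherwise give a conditional probability of zero.

```python
def laplace_corrected(counts):
    """Add 1 to each of the q counts and q to the denominator."""
    q = len(counts)
    total = sum(counts.values()) + q
    return {value: (count + 1) / total for value, count in counts.items()}

# Hypothetical counts of an attribute within one class (1000 tuples in all)
income_counts = {"low": 0, "medium": 990, "high": 10}
probs = laplace_corrected(income_counts)
print(probs)
```

The corrected estimates stay close to the uncorrected ones (990/1000 becomes 991/1003), but no value is left with probability zero.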
X = (age = senior, income = medium, student = no, credit rating = fair)
P(Ci), the prior probability of each class, can be estimated based on the training tuples:
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age = senior | buys_computer = yes) = 3/9 = 0.333
P(age = senior | buys_computer = no) = 2/5 = 0.400
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = no | buys_computer = yes) = 3/9 = 0.333
P(student = no | buys_computer = no) = 4/5 = 0.800
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
Using the above probabilities:
P(X | buys_computer = yes) = P(age = senior | buys_computer = yes)
× P(income = medium | buys_computer = yes)
× P(student = no | buys_computer = yes)
× P(credit_rating = fair | buys_computer = yes)
= 0.333 × 0.444 × 0.333 × 0.667 = 0.033
P(X | buys_computer = no) = P(age = senior | buys_computer = no)
× P(income = medium | buys_computer = no)
× P(student = no | buys_computer = no)
× P(credit_rating = fair | buys_computer = no)
= 0.400 × 0.400 × 0.800 × 0.400 = 0.051
Compute P(X|Ci)P(Ci) for each class:
P(X | buys_computer = yes) × P(buys_computer = yes) = 0.033 × 0.643 = 0.021
P(X | buys_computer = no) × P(buys_computer = no) = 0.051 × 0.357 = 0.018
The Bayesian classifier predicts buys_computer = yes for tuple X.
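The hand calculation can be cross-checked with a short counting-based sketch. It assumes the 14 training tuples of the buys_computer example, written out in the code in the column order (age, income, student, credit rating, class); the function name `naive_bayes_scores` is hypothetical and no Laplacian correction is applied.

```python
from collections import Counter

# Assumed training tuples: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(rows, x):
    """Return P(X|Ci) * P(Ci) for every class Ci, using raw frequency counts."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(Ci)
        for i, value in enumerate(x):
            matches = sum(1 for row in rows if row[-1] == label and row[i] == value)
            score *= matches / n_label  # conditional P(x_i | Ci)
        scores[label] = score
    return scores

x = ("senior", "medium", "no", "fair")
scores = naive_bayes_scores(ROWS, x)
prediction = max(scores, key=scores.get)
print({label: round(s, 3) for label, s in scores.items()}, prediction)
```

The class with the larger product is chosen, which is the Maximum A Posteriori decision.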
Que. Explain distance-based algorithms in detail.
1. Distance-based algorithms are non-parametric methods that can be used for classification.
2. These algorithms classify objects by the dissimilarity between them, as measured by a distance function.
3. There are two types of distance-based algorithm:
a. Simple approach: It assumes that each class is represented by its center or centroid. The new item is placed in the class with the largest similarity value.
b. k-nearest neighbours: The KNN scheme requires not only the training data but also the desired classification for each item. When a classification is to be made for a new item, its distance to each item in the training set must be determined. Only the K closest entries in the training set are considered. The new item is then placed in the class that contains the most items from this set of K closest items.
Algorithm:
Input:
T    // Training data
K    // Number of neighbours
t    // Input tuple to classify
Output:
c    // Class to which t is assigned
KNN algorithm:
// Algorithm to classify tuple using KNN
N = ∅;
// Find set of neighbours, N, for t
for each d ∈ T do
    if |N| < K, then
        N = N ∪ {d};
    else
        if ∃ u ∈ N such that sim(t, u) ≤ sim(t, d), then
        begin
            N = N - {u};
            N = N ∪ {d};
        end
// Find class for classification
c = class to which the most u ∈ N are classified;
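A compact way to try the KNN scheme is to sort the training items by distance to the new item and take a majority vote among the K closest. This is a minimal sketch and not the incremental neighbour-set procedure itself: it assumes Euclidean distance (via `math.dist`, available from Python 3.8) as the dissimilarity measure, and the function name and training points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_classify(training, t, k):
    """Place t in the class most common among its k closest training items.

    training: list of (point, label) pairs; t: point to classify.
    """
    nearest = sorted(training, key=lambda item: dist(item[0], t))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-class training data in the plane
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

print(knn_classify(training, (2, 2), 3))
print(knn_classify(training, (6.5, 6.5), 3))
```

Sorting the whole training set costs more than maintaining a running set of K neighbours, but it keeps the sketch short. Choosing an odd K helps avoid tied votes in two-class problems.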