DMBA210 : Management Information System
Unit – 11
Managing Data Resources
TABLE OF CONTENTS
1 Introduction
1.1 Objectives
5 Database Concepts
6 Data Warehouses
6.1 Data mining uses
7 Summary
8 Glossary
9 Terminal Questions
10 Answers
11 References
1. INTRODUCTION
Data as a resource of any organisation has gained critical importance in recent years. It has to be
managed carefully with the objective of ensuring access, reliability, and security. Databases consist of
simple structures such as fields, records, and tables. When assembled within an architecture, these
simple structures provide an immensely useful yet manageable resource for organisations. There
are many types of database designs, the most popular being that of tables related to each other, called
a relational database. Commercial implementations of such databases are called database
management systems (DBMS), which have many features to easily create, update, query, and manage
data. All such systems rely on SQL, a computer language that allows easy definition and manipulation
of tables and relations.
When organisations accumulate large masses of data, the focus shifts from simply using the data for
transactions to using the data to help in decision making. Data is separated out into special
tables called warehouses that are then used for analysis.
1.1 Objectives
After studying this unit, you should be able to:
• describe the need for data management
2. CASE STUDY
For citizens of India, mobility from one state to another is a problem. If one moves, say, from Uttar
Pradesh to Karnataka, then in the new state of residence, one will have to open a new bank account,
obtain a new permit for cooking gas cylinders, get a new electricity connection, re-register an old
vehicle in the new state, and, if needed, get a new ration card. This is because these documents cannot
be transferred easily from Uttar Pradesh to Karnataka, as there are no legal provisions to do so.
These documents require considerable time and effort to obtain the first time, so
applying and waiting to get them a second time is a huge waste of effort.
It is partially to address this problem of transfer of documents that the Government of India initiated
the Unique Identification Number (UID) scheme. Under this scheme, every citizen of India will be
provided a unique number that will be backed by an infrastructure to verify and authenticate the
number. A special organisation, called Unique Identification Authority of India (UIDAI), was created
in 2009 for this purpose, and was charged with issuing the UIDs to citizens. The UIDAI will eventually
provide a unique 12-digit number to all citizens of India with the assurance that the number is unique
(each number is associated with a unique citizen), is verifiable, and is valid across India.
Many citizens of India already have several documents that provide them with unique numbers, such
as a PAN card (for income tax), a passport, a driver’s licence, and a ration card.
However, not all citizens have a need for or use all these cards. For instance, the number of income
tax payers in India is a small fraction of the population (as agricultural income is not taxed and a bulk
of India’s population relies on agriculture). Furthermore, most citizens do not have a passport, as they
don’t need to travel across borders, and many do not have a driver’s licence either, as they do not own
a motorised vehicle. The ration card is meant for people below the poverty line, but can be issued to
any citizen of India. Thus, most or all of these cards that provide a unique number are not available
or are of little real use to all citizens of India. It is in this context that the UID becomes important.
A UID number can provide a basis for uniting these disparate identification projects under a
common umbrella. Thus, a citizen who has a PAN card and also a driver’s licence can be seen, through
the unique number, to be the same person. This will reduce redundancy in the issuing of unique
numbers as well as control fraud and misuse of the numbers.
As envisaged, the UIDAI will issue a unique number to citizens based on biometric authentication.
This number is called Aadhaar. With a scan of ten fingerprints and the iris, each citizen will receive a
unique 12-digit number. Aadhaar can be used by banks, or the tax authorities, or schools, or the ration
card agencies to issue cards or validation documents to citizens. The idea is that if a citizen presents
a bank card to a merchant for some commercial transaction, the merchant can verify that the card
belongs to the particular individual by checking against the UIDAI database. This verification service
will be made available at a reasonable cost by the UIDAI.
Aadhaar, in this sense, becomes what in database terminology is called a primary key, a unique
identifier for a record that can be used across the database and across applications without worry
of duplication. Various agencies like banks, the tax authority, the passport agency, the motor vehicles
department, and the public distribution system can then use Aadhaar to issue their own verification
and authentication documents. A citizen can move from one part of the country to another, and with
Aadhaar he/she can retrieve or use his/her card anywhere and be assured that his/her identity is
authenticated.
The unique number is used for many transactions other than social security, including those of credit
card purchases, property purchases and college enrolment.
The Aadhaar scheme has come in for a fair measure of criticism. For a country as diverse and
complex as India, critics argue, such a scheme is not suitable. Some argue that the Aadhaar scheme
will link many vital sources of information about individuals under a common source and thus
compromise individual privacy. Those with dubious intentions can snoop into online and
computerised records of individuals and have access to a vast store of information, something that is
not possible without a primary key. Others contend that, in the case of poor and marginal citizens,
obtaining and maintaining such a number will become an additional burden, and instead of helping
them, it will further impede their ability to make a living and function effectively. Still, others argue
that Aadhaar will become another tool in the hands of a corrupt and power-hungry bureaucracy,
which will extract further rents from those unable to understand the value of this scheme and how it
can be used effectively.
Aadhaar in India is similar to a unique number given to citizens in other countries. In the USA, all
citizens are required to have a Social Security Number (SSN) that was originally designed to provide
them with social security – such as pension, medical care, job-loss compensation, and so on – but it is
now used for many different purposes such as for opening a bank account, obtaining a driver’s licence,
getting a credit card, being registered for medical insurance, enrolling in school or college and so on.
Such schemes for providing social security, along with a unique number, are prevalent in European
countries too, such as Spain and France. In all these countries, the unique number is used for many
transactions other than social security.
The role envisaged for Aadhaar is best captured by the Chairman of the UIDAI, Mr. Nandan Nilekani:
‘The name Aadhaar communicates the fundamental role of the number issued by the UIDAI: the
number as a universal identity infrastructure, a foundation over which public and private agencies
can build services and applications that benefit residents across India’.
1. Aadhaar’s guarantee of uniqueness and centralised, online identity verification would be the
basis for building these multiple services and applications, and facilitating greater connectivity
to markets.
2. Aadhaar would also give any resident the ability to access these services and resources,
anytime, anywhere in the country.
3. Aadhaar can, for example, provide the identity infrastructure for ensuring financial inclusion
across the country – banks can link the unique number to a bank account for every resident,
and use the online identity authentication to allow residents to access the account from
anywhere in the country.
4. Aadhaar would also be a foundation for the effective enforcement of individual rights. A clear
registration and recognition of the individual’s identity with the state is necessary to
implement their rights – to employment, education, food, etc. The number, by ensuring such
registration and recognition of individuals, would help the state deliver these rights.
Source: uidai.gov.in (accessed on June 2011).
With advances in programming languages, this situation changed and data was maintained in
separate files that different programs could use. This improved the ability to share data, but it
introduced problems of data updating and integrity. If one program changed the data, other
programs had to be informed of this development and their logic had to be altered accordingly.
A start in organising data came with the idea of the relational data storage model, put forward by
British scientist E.F. Codd in 1970. Codd, then working with IBM in the USA, showed how data could
be stored in structured form in files that were linked to each other, and could be used by many
programs with simple rules of modification. This idea was taken up by commercial database software,
like Oracle, and became the standard for data storage and use.
SELF-ASSESSMENT QUESTIONS – 1
Multiple Choice Questions:
1 Before database systems were invented, the problem with storing data was that
a) data could not be found in the program
b) data and programs were not easily located on the computer
c) data was easily lost or corrupted
d) data and programs were together, and changing data required changing the program
True or False Questions:
2 E.F. Codd invented the idea of holding data in separate, linked files. (True/False)
1. Researchers estimate that the total amount of data stored in the world is of the order of 295
exabytes or 295 billion gigabytes. This estimate is based on an assessment of analogue and
digital storage technologies from 1986 to 2007. The report (by M. Hilbert and P. Lopez, which
appeared in Science Express in February 2011) states that paper-based storage of data,
which was about 33% in 1986, had shrunk to only about 0.07% in 2007, as now most of the
data is stored in digital form. Data is mostly stored on computer hard discs or on optical
storage devices.
2. The consulting firm IDC estimated (in 2008) that the annual growth in data takes place in
two forms:
(a) Structured: Here data is created and maintained in databases and follows a certain data
model. The growth in structured data is about 22% annually (compounded).
(b) Unstructured: Here data is kept informally, without following a data model. The growth in
unstructured data is about 62% annually.
3. The large online auction firm eBay has a data warehouse of more than 6 petabytes (6 million
gigabytes), and adds about 150 billion rows per day to its database tables.
The above examples highlight the incredible amounts of data that are being created and stored
around the world. Managing this data so that it can be used effectively presents a strong
challenge to database systems: The systems not only have to store the data but also have to make
it available almost instantly whenever needed, allow users to search through the data efficiently,
and also ensure that the data is safe and uncorrupted. Different aspects of the need for database
systems are discussed in the following sections.
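To get a feel for what the annual growth rates quoted above imply, the following minimal sketch compounds them over a few years. It assumes that the 62% figure for unstructured data is also compounded annually (the text states this only for structured data); the function names are illustrative.

```python
import math


def grow(initial, annual_rate, years):
    """Size after compounding annual_rate for the given number of years."""
    return initial * (1 + annual_rate) ** years


def doubling_time(annual_rate):
    """Years needed for a quantity to double at a compounded annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


structured_rate = 0.22    # IDC estimate quoted in the text
unstructured_rate = 0.62  # IDC estimate; compounding is an assumption here

for label, rate in [("structured", structured_rate),
                    ("unstructured", unstructured_rate)]:
    print(f"{label}: grows {grow(1.0, rate, 5):.2f}-fold in 5 years, "
          f"doubles in about {doubling_time(rate):.1f} years")
```

Under these assumptions, unstructured data doubles in well under two years, which is one reason why storage and search become pressing concerns so quickly.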
course registration or for the library. Furthermore, the programs and applications that use the data
are not aware of where and how the data is maintained; they only need to know how to make
simple calls to access the data.
rights and privileges to do so. Modern database systems enable sophisticated ways in which these
four functions can be enabled or disabled for users and administrators.
SELF-ASSESSMENT QUESTIONS – 2
Multiple Choice Questions:
3 Serious challenges for modern databases include
a) managing concurrency, security and recovery from crashes
b) size
c) distributed access
d) support for heterogeneous databases
5. DATABASE CONCEPTS
A database is a collection of files that have stored data. The files and the data within them are related
in some manner – either they are from the same domain, the same function, the same firm, or some
other category. The files in the database are created according to the needs of the function or
department and are maintained with data that is relevant for the applications the department runs.
Consider another example of a ‘Product’ database. This may contain files related to the types of
products, the details about product features, the prices and history of prices of products, the regional
and seasonal sales figures of products, and the details of product designs. Such files could be used by
the manufacturing department to determine production schedules, by the marketing department to
determine sales campaigns, or by the finance department to determine overhead allocations.
A collection of fields is called a record. Each record is like a row in a spreadsheet; it consists of a
predefined number of fields. In most cases, the sizes of the fields are fixed; this ensures that the total size
of a record is also fixed. Records are a collection of fields that have meaning in the context of the
application.
Table 11.1 shows five fields that define a record. Each field contains data pertaining to some aspect
of a student – roll number (Aadhaar number in this case), first name, last name, year of birth, and the
subject in which the student is majoring. The data in each field is written in characters or numbers.
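The idea that fixed-size fields give a fixed-size record can be sketched with Python's standard struct module. The field widths below (12, 20, 20 and 20 bytes, plus a 2-byte year) and the sample values are assumptions chosen for illustration; they are not taken from Table 11.1.

```python
import struct

# Hypothetical layout: 12-byte Aadhaar number, 20-byte first name,
# 20-byte last name, 2-byte year of birth, 20-byte major.
# "<" requests a packed layout with no padding between fields.
RECORD_FORMAT = "<12s20s20sH20s"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # identical for every record


def pack_record(aadhaar, first, last, year, major):
    """Encode one student record into exactly RECORD_SIZE bytes."""
    return struct.pack(
        RECORD_FORMAT,
        aadhaar.encode("ascii"),
        first.encode("ascii"),   # short values are padded with null bytes
        last.encode("ascii"),
        year,
        major.encode("ascii"),
    )


def unpack_record(raw):
    """Decode a packed record back into its five fields."""
    aadhaar, first, last, year, major = struct.unpack(RECORD_FORMAT, raw)

    def text(field):
        return field.rstrip(b"\x00").decode("ascii")

    return text(aadhaar), text(first), text(last), year, text(major)


record = pack_record("123456789012", "Aamir", "Khan", 1990, "Physics")
print(RECORD_SIZE, len(record), unpack_record(record))
```

Because every record occupies the same number of bytes, the position of the n-th record in a file can be computed directly as n times the record size.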
For each record, there should be at least one field that uniquely identifies the record. This ensures
that even if there are two students with exactly the same name (say Aamir Khan), with the same year
of birth, and the same major, then there is at least one field that will distinguish the records of the two
students. In Table 11.1, Aadhaar number is the unique identifier. In other cases this could be an
employee number, a tax number, or even a random number generated by the system. This unique
field is called a primary key.
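The role of the primary key can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name, column names and sample values are hypothetical. Two students with the same name, year of birth and major are accepted because their key values differ, while a second record with an existing key is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE student (
        aadhaar    TEXT PRIMARY KEY,   -- the unique identifier
        first_name TEXT,
        last_name  TEXT,
        birth_year INTEGER,
        major      TEXT
    )
""")

# Same name, same year of birth, same major: only the key differs.
conn.execute("INSERT INTO student VALUES (?, ?, ?, ?, ?)",
             ("111122223333", "Aamir", "Khan", 1990, "Physics"))
conn.execute("INSERT INTO student VALUES (?, ?, ?, ?, ?)",
             ("444455556666", "Aamir", "Khan", 1990, "Physics"))


def try_insert(row):
    """Return True if the row is stored, False if the key already exists."""
    try:
        conn.execute("INSERT INTO student VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = try_insert(
    ("111122223333", "Someone", "Else", 1985, "History"))
student_count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(duplicate_accepted, student_count)
```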
A table is contained in a file. Each table may contain a few records or a very large number of records.
A database consists of many files. Modern database systems allow table sizes to include billions of
records. Furthermore, very large tables may be split and stored on different servers. Figure 11.1
shows the basic elements of a database.
In relational databases, the tables are related to each other. These relations allow data to be linked
according to some logic and then extracted from the tables. A detailed example of this is provided in
a later section.
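As a brief preview, the sketch below relates two hypothetical tables through a shared key and extracts linked data with a join. The table names, columns and rows are invented for illustration, and SQLite (through Python's sqlite3 module) stands in for a full DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (aadhaar TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE enrolment (
        aadhaar TEXT REFERENCES student(aadhaar),  -- links back to student
        course  TEXT
    );
    INSERT INTO student VALUES ('111122223333', 'Aamir Khan');
    INSERT INTO student VALUES ('444455556666', 'Meera Rao');
    INSERT INTO enrolment VALUES ('111122223333', 'Databases');
    INSERT INTO enrolment VALUES ('111122223333', 'Statistics');
    INSERT INTO enrolment VALUES ('444455556666', 'Databases');
""")

# The relation (matching aadhaar values) lets one query draw on both tables.
rows = conn.execute("""
    SELECT s.name, e.course
    FROM student AS s
    JOIN enrolment AS e ON s.aadhaar = e.aadhaar
    WHERE e.course = 'Databases'
    ORDER BY s.name
""").fetchall()
print(rows)
```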
(such databases are often provided as off-the-shelf applications), but the use and maintenance is only
by the user.
Personal databases are highly tuned to the needs of the user. They are not meant to be shared. These
databases also cannot be shared, as they reside on personal devices; and this is a limitation of these
systems.
Enterprise or organisational databases are accessed by all members of the organisation. Figure 11.2
shows how these are typically organised in the client–server mode. A central database server
provides database capabilities to different applications that reside on other computers. These client
applications interact with the database server to draw on data services, whereas the database server
is managed independently. An advantage of these database servers is that they can be made highly
secure, with strong access restrictions, and can also be backed up carefully to recover from crashes.
While designing client–server databases, a prime issue to be addressed is where the processing
should take place. If data processing has to be done on the client using, say, three tables, then these
tables have to be moved across the network to the client, which should have enough computing
capacity to do the processing. If, on the other hand, the computing is done on the server then the
clients have to send processing requests to the server and await the results, and this puts a lot of load
on the server. Clients such as mobile phones or personal computers often do not have the processing
capacity to deal with large amounts of data, so the processing is invariably left to the server.
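The trade-off can be made concrete with a small sketch. Here an in-memory SQLite database stands in for the database server, and the table and figures are hypothetical. Both functions compute the same total, but the first moves every row to the client while the second asks the server to do the work and receives a single row.

```python
import sqlite3

server = sqlite3.connect(":memory:")  # stands in for the database server
server.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
server.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100), ("South", 250), ("North", 50), ("South", 25)],
)


def total_on_client(region):
    """Fetch the whole table, then filter and add up on the client."""
    all_rows = server.execute("SELECT region, amount FROM sales").fetchall()
    total = sum(amount for r, amount in all_rows if r == region)
    return total, len(all_rows)  # total, rows moved across the network


def total_on_server(region):
    """Send a request; the server filters and aggregates."""
    rows = server.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchall()
    return rows[0][0], len(rows)


print(total_on_client("North"), total_on_server("North"))
```

With four rows the difference is negligible; with millions of rows, the first approach burdens the network and the client, while the second burdens the server.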
The architecture often used in enterprises is referred to as three-tier architecture. Here the clients
interact with application servers, which then call upon database servers for their data needs. Here the
load of processing for applications and for data is spread across two sets of servers, thus enabling
greater efficiency. Figure 11.3 depicts this architecture.
Databases may be centralised or decentralised within organisations. Centralised databases are
designed on the client–server model, with a two-tier or three-tier architecture. Decentralised or
distributed databases have tables distributed across many servers on a network. The servers may
be geographically distributed, but for the applications they appear as a single entity. One type of
distributed server has the entire database replicated across many servers. This is called a
homogeneous database. Figure 11.4 shows this database. Those users who are close to a particular
server are able to access data from that particular one, whereas others access data from other,
physically closer servers. When data is changed on any one server, it is also changed on the others.
Distributed databases can also be federated in nature. It means the databases across the network are
not the same; they are heterogeneous. In such architecture, when application servers draw on the
databases, special algorithms pull together the required data from diverse servers and present a
consolidated picture. This architecture is useful where the data entry and management of servers
is particular to a region. For example, multinational banks use federated databases as their databases
in different countries operate on different currencies and exchange criteria, and rely on local data.
For applications requiring global data, the applications use special logic for analysing the disparate
data.
A special class of software, known as middleware, is used to connect disparate databases.
As databases can have different data structures for the same kind of data, the middleware software
allows the databases to read and write data to and from each other. For example, the data field for
‘student name’ may have a space for 30 characters in one database and 40 characters in another. The
fact that they are referring to the same concept is captured by the middleware that enables the
translation from one to the other. The middleware is also used by the Application Layer to read and
use data from many databases. In modern web-centric applications, the middleware plays a major
role in allowing the use of distributed databases by application servers.
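The ‘student name’ example can be sketched as a tiny translation routine. This is only an illustration of the idea, assuming fixed-width text fields padded with spaces; real middleware handles many more differences (data types, encodings, naming), and the function names here are hypothetical.

```python
def to_fixed_width(value, width):
    """Pad (or truncate) a string to exactly `width` characters."""
    return value[:width].ljust(width)


def translate_name(name_field, target_width):
    """Convert a fixed-width name field to another database's width."""
    return to_fixed_width(name_field.rstrip(), target_width)


# Database A stores names in 30 characters, database B in 40.
name_in_db_a = to_fixed_width("Aamir Khan", 30)
name_in_db_b = translate_name(name_in_db_a, 40)

print(len(name_in_db_a), len(name_in_db_b), repr(name_in_db_b.rstrip()))
```

Note that translating from the wider field to the narrower one would silently truncate names longer than 30 characters in this sketch, which is the kind of mismatch that middleware has to detect and manage.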
SELF-ASSESSMENT QUESTIONS – 3
Fill in the blanks:
4 A __________________ database is one whose tables are maintained on various different servers.
Multiple Choice Questions:
5 _________________ connects distributed databases to different client devices.
a) Firmware
b) Software
c) Dataware
d) Middleware
6 Federated databases imply databases across a network that are
a) the same
b) heterogeneous
c) centralised
d) highly secure
7 Personal databases are databases
a) created by individual users for their personal use in organisations or at home
b) contain the personal data of every employee of an organisation
c) are highly tuned to the needs of the user and therefore not meant to be shared
d) both (a) and (b)
6. DATA WAREHOUSES
Since the inception of desktop computing, in the mid-1980s, around the world, there has been a
proliferation of data use and needs for data storage. Almost all employees of organisations, above a
certain size, now use computers and produce, modify or read data. For very large organisations, the
amount of data that is used on a day-to-day basis could run into petabytes. With this huge
explosion in data, organisations felt the need for:
1. Consolidating much of the data from various databases into a whole that could be understood
clearly.
2. Focusing on the use of data for decision making, as opposed to simply for running transactions.
The need for creating data warehouses arose from the above two needs. The technology of data
warehouses draws on enterprise databases to create a separate set of tables and relations that can be
used to run particular kinds of queries and analytical tools. Warehouses differ from transaction
databases in that users can run complex queries on them, related to the functions of the
enterprise, without affecting transaction processing.
To create a data warehouse, data is extracted from transactional tables and pre-processed to remove
unwanted data types and then loaded into tables in the warehouse. The extraction process requires
making queries into transactional databases that are currently being used. This is a challenge as the
data tables may be distributed across various servers, and the data may be changing rapidly. The data
obtained from these tables is maintained in a staging area, a temporary storage area, where the data
is scrubbed. The idea of data scrubbing is to remove clearly recognisable erroneous data. This task is
often difficult, as errors are not obvious – say a misspelt name or a wrong address – and require
careful examination to remove them. At the scrubbing stage, data is not corrected in any manner; it is
invariably removed from the collection of raw data.
Once the data is scrubbed or cleaned, it is loaded onto the tables that constitute the warehouse. When
an organisation is creating a warehouse for the first time, the entire data is loaded into a database,
using a particular design. Subsequent data that is obtained from the transaction databases is then
extracted, cleaned, and loaded incrementally to the earlier tables.
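The extract, scrub and load steps described above can be sketched as follows. Two in-memory SQLite databases stand in for the transactional database and the warehouse; the table names, the sample rows and the scrubbing rules (an empty customer name or a negative amount counts as erroneous) are assumptions made for illustration. As in the text, erroneous rows are removed rather than corrected.

```python
import sqlite3

source = sqlite3.connect(":memory:")     # transactional database
warehouse = sqlite3.connect(":memory:")  # data warehouse

source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "Asha", 120.0),
    (2, "", 75.0),       # missing customer name
    (3, "Ravi", -40.0),  # negative amount
    (4, "Meera", 60.0),
])
warehouse.execute(
    "CREATE TABLE orders_fact (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")


def extract(last_loaded_id):
    """Pull rows newer than the last load into a staging list."""
    return source.execute(
        "SELECT id, customer, amount FROM orders WHERE id > ? ORDER BY id",
        (last_loaded_id,),
    ).fetchall()


def scrub(staged_rows):
    """Drop clearly erroneous rows; nothing is corrected."""
    return [row for row in staged_rows if row[1].strip() and row[2] >= 0]


def load(clean_rows):
    """Append the cleaned rows to the warehouse table."""
    warehouse.executemany("INSERT INTO orders_fact VALUES (?, ?, ?)", clean_rows)
    warehouse.commit()


def run_etl(last_loaded_id=0):
    """One extract-scrub-load pass; returns the highest id seen so far."""
    staged = extract(last_loaded_id)
    load(scrub(staged))
    return staged[-1][0] if staged else last_loaded_id


last_id = run_etl()  # first-time load of all existing data
source.execute("INSERT INTO orders VALUES (5, 'Kiran', 200.0)")
last_id = run_etl(last_id)  # incremental load of new data only

loaded_ids = [r[0] for r in
              warehouse.execute("SELECT id FROM orders_fact ORDER BY id")]
print(last_id, loaded_ids)
```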
Data pertaining to a particular domain or a problem to be analysed is maintained in data marts. For
example, a mart may be created to examine sales data alone. This mart will collect data related to the
sales activities across the organisation and store them in the warehouse. However, it will exclude the
data related to production, finance, employees, and so on. The mart can then be analysed for
particular problems related to the sales trends, sales predictions and so on. Furthermore, the mart
may be updated on a periodic basis to include the fresh data available.
Data in warehouses can be stored in tables with timestamps. This is the dimensional method of
creating warehouses. The idea here is to store data in a single table or a few unrelated tables that are given
one additional attribute of a timestamp (that indicates when the data was collected or created). For
example, one table in a dimensional warehouse may include data on customers, sales, products,
orders, shipping, and a timestamp of each transaction. Each timestamp will pertain to one particular
event in the life of the organisation when a transaction occurred and the data was created. Such a
table can be analysed to examine trends in sales, fluctuations in orders across seasons, and so on.
Another method of storing data is in the regular tables-and-relations format of relational databases.
Here too an additional attribute of a timestamp is included within the tables.
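A minimal sketch of how a timestamp attribute supports trend analysis is given below. The single warehouse table, its columns and its rows are hypothetical; the query groups sales by month using SQLite's strftime function, which is one way (not the only way) to examine fluctuations across seasons.

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("""
    CREATE TABLE sales (
        customer TEXT, product TEXT, amount INTEGER,
        sold_at  TEXT   -- timestamp of the transaction
    )
""")
wh.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Asha",  "Umbrella", 300,  "2010-07-03 10:15:00"),
    ("Ravi",  "Umbrella", 450,  "2010-07-19 16:40:00"),
    ("Meera", "Heater",   900,  "2010-12-05 11:05:00"),
    ("Asha",  "Heater",   1100, "2010-12-21 18:30:00"),
    ("Ravi",  "Umbrella", 200,  "2010-12-28 09:00:00"),
])

# Total sales per product per month, derived from the timestamp.
monthly = wh.execute("""
    SELECT strftime('%Y-%m', sold_at) AS month, product, SUM(amount)
    FROM sales
    GROUP BY month, product
    ORDER BY month, product
""").fetchall()

for month, product, total in monthly:
    print(month, product, total)
```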
Various kinds of analysis can be conducted on data available in warehouses including data mining,
online analytical processing, and data visualisation. These different methods are designed to extract
patterns and useful information from very large data sets. Online analytical processing (OLAP) is used to
analyse and report data on sales trends, forecasts, and other time-series-based analyses. Such
analyses allow managers to see and visualise the data in a form that shows interesting and unusual
patterns that would not be easily visible from the analysis of transaction data alone.
In modern organisations, ones that have a strong online presence and collect data from customer
visits to websites and transaction data from different types of e-commerce sites, the extent and size
of the data is such that analysing it for patterns is almost impossible, unless a warehouse is used. For
example, one firm analyses data, using OLAP, from millions of visitors to different pages of its website
to dynamically place advertisements that would conform to the visitors’ interests, as determined by
the regions on the page the visitor hovers over or clicks on.
Data warehouses are an active area of development and have strong commercial potential for
database vendors. Almost all major commercial vendors of DBMS have products that can be used to
create and manage warehouses.
Data mining is used with data accumulated in data warehouses. Following are some examples of data
stored in warehouses that are used for mining:
1. Click-stream data: This data is collected from website pages as users click on links or other
items on the web page. Data on where a user clicks, after what interval, what page the user
goes to, whether the user returns and visits other links, etc., are collected. The data are mined to
identify which links are most frequently visited, for how long and by what kind of users. The
online search firm, Google, has initiated an entire field of mining click-stream data that is
known as web analytics.
2. Point-of-sale purchase data: Data obtained from retail sales counters is the classic data set
to which mining software was applied. The data pertains to the item quantities, price values,
date and time of purchase, and details about customers that are obtained from point-of-sale
terminals. The data is used to perform ‘market basket’ analysis, which essentially shows what
kinds of items are typically purchased with each other. In a famous case, a large retailer found
from a market basket analysis that men in a certain part of the USA were likely to buy beer and
diapers on Thursday evenings. This was an unusual finding and the retailer sought to explain
why. It was later learned that many families with young children planned their weekend trips
on Thursday evening, at which point the women would ask men to go and buy diapers from
the store. The men would take this opportunity to buy beer, also for the weekend. The retailer
used this information to announce promotions and increase sales of both these products.
3. Online search data: This data is about the search terms that users type into search boxes on web pages.
Many organisations collect the text typed in by users while they are searching for some
information. This text data reveals what users are interested in and is mined for patterns. The
data collected pertains to the search texts typed in, the time at which they are typed and the
number of times different searches are done. Many online retailers, such as Flipkart, mine this
data to identify what users are interested in and then make product suggestions based on
association rules. For example, users searching for books on Java programming may be offered
what others have seen and purchased, including associated books on programming they have
not considered.
4. Text data: This is text data that is posted by users on web pages, blogs, e-mails, wikis, Twitter
feeds, and others. Many organisations have found that by mining this data they can glean
interesting insights and trends. Many tools and software programs have been created recently
to mine text data. One example is provided by the online site called Wordle.
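The ‘market basket’ analysis mentioned under point-of-sale data can be sketched in a few lines. The baskets below are invented sample data, and the measures shown (co-occurrence counts, support and confidence) are the standard building blocks of association rules; commercial mining tools use far more efficient algorithms on far larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale data: one set of items per purchase.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"diapers", "milk", "beer"},
    {"bread", "chips"},
]


def pair_counts(baskets):
    """Count how often each pair of items is bought together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def support(item_a, item_b, baskets):
    """Fraction of all baskets that contain both items."""
    both = sum(1 for b in baskets if item_a in b and item_b in b)
    return both / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


top_pair, top_count = pair_counts(baskets).most_common(1)[0]
print(top_pair, top_count,
      support("beer", "diapers", baskets),
      confidence("diapers", "beer", baskets))
```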
The www.wordle.net site hosts an application that mines text submitted to it. The application counts
the frequency of words appearing in the submitted text and then creates a word ‘cloud’ with the most
frequent words appearing as the largest. For example, Figure 11.5 has a word cloud of the text of a
portion of the Constitution of India. Text from Part III of the Constitution, comprising the
Fundamental Rights, was submitted to wordle.net. This part consists of about 13 pages of printed text.
Figure 11.5: Wordle cloud for text from Part III of the Constitution of India pertaining to Fundamental
Rights
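The word-counting step behind such a word cloud can be sketched as below. This is not Wordle's actual implementation; the stop-word list and the sample sentence are assumptions made for illustration, and the drawing of the cloud itself (sizing each word in proportion to its count) is left out.

```python
import re
from collections import Counter

# Very common words that would otherwise dominate the cloud (a small,
# hypothetical list; real tools use much longer ones).
STOP_WORDS = {"the", "of", "to", "and", "a", "in", "or", "by", "be", "shall"}


def word_frequencies(text, top_n=3):
    """Return the top_n most frequent words, ignoring case and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Invented sample text, not a quotation from the Constitution.
sample = ("The law protects rights. Rights granted by law bind the State, "
          "and the State upholds the law.")

frequencies = word_frequencies(sample)
print(frequencies)
```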
SELF-ASSESSMENT QUESTIONS – 4
Fill in the blanks:
8 A _______________ is a data warehouse containing data pertaining to a particular domain.
9 _________________ is the process of removing erroneous data from data warehouses.
7. SUMMARY
• Different aspects of the need for database systems are data independence, reduced data
redundancy, data consistency, data access, data administration, managing concurrency,
managing security, recovery from crashes, and application development.
• The field in a database whose values are unique is called the primary key.
• In homogeneous databases, those users who are close to a particular server are able to access
data from that particular one, whereas others access data from other, physically closer servers.
• In heterogeneous databases, when application servers draw on the databases, special
algorithms pull together the required data from diverse servers and present a consolidated
picture.
• Middleware is a special class of software which is used to connect disparate databases.
• Different types of database designs are relational model, hierarchical model, object-oriented
model, network model, and object-relational model.
• Elements of DBMS are tables, queries, forms, and reports.
• Information regarding the data in tables – for what purpose it is created, by whom, who
maintains it, and so on – is referred to as metadata.
• The technology of data warehouses draws on enterprise databases to create a separate set of
tables and relations that can be used to run particular kinds of queries and analytical tools.
• Data mining means extracting patterns and knowledge from historical data that are typically
housed in data warehouses.
8. GLOSSARY
Database - A software program that enables storage, access, and use of data by other
software applications.
Field - A defined space of given size, which stores the basic elements of data.
Metadata - Contains details about how data is stored in files; provides information
on how the data can be used and managed.
Primary key - A field that contains data that uniquely identifies a record.
Distributed database - A database whose tables are maintained on various different servers.
9. TERMINAL QUESTIONS
1. What is the need for data management? Why is it difficult to manage data?
2. Describe some of the challenges of modern database management.
3. What is the difference between fields, records and files?
4. Why is a primary key needed?
5. What is the difference between a personal database and an organisational database?
6. What is the advantage of three-tier architecture?
7. Why is middleware important?
8. Describe briefly how to create a data warehouse.
10. ANSWERS
Self-Assessment Questions
1. (d) data and programs were together, and changing data required changing the program
2. True
3. (a) managing concurrency, security and recovery from crashes
4. Distributed
5. (d) Middleware
6. (b) heterogeneous
7. (d) highly secure
8. Data mart
9. Data scrubbing
11. REFERENCES
• Hoffer, J.A., Prescott, M.B. and McFadden, F.R. (2007) Modern Database Management, 8th edn,
Prentice Hall, NJ.
E-References
• An article ‘How much Information is there in the World?’ in USC News, February 2011 is
available at: http://uscnews.usc.edu (accessed on June 2011).
• An article ‘eBay’s Two Enormous Data Warehouses’, in DBMS2, 2009 is available at:
http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/ (accessed on
June 2011).