Unit 1
Digital Data
Data has grown exponentially since the advent of the computer
and the internet. In fact, the computer and the internet together have
given data its digital form.
Digital data can be classified into 3 forms:
Unstructured
Semi-structured
Structured
Data is usually unstructured, which makes extracting
information from it difficult.
According to Merrill Lynch, 80-90% of business data is either
unstructured or semi-structured.
Gartner also estimated that unstructured data constitutes 80% of
all enterprise data.
Definition of data and information and characteristics of good information
Data refers to raw, basic facts that have not yet been processed, e.g. the price of a product or the
number of products purchased.
For example, a price of $6 and a quantity of 10 do not by themselves convey any meaning to a customer at a
point-of-sale till. Information is processed data that conveys meaning to the recipient.
For example, multiplying $6 by 10 gives us $60, which is the total bill that the customer should pay.
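The step from data to information in this example can be sketched in a few lines of Python. The function name total_bill is made up for illustration; the figures are the ones used above.

```python
def total_bill(unit_price, quantity):
    """Turn two raw facts (data) into the amount due (information)."""
    return unit_price * quantity

# Raw data captured at the point-of-sale till
unit_price = 6   # dollars
quantity = 10

bill = total_bill(unit_price, quantity)
print(f"Total to pay: ${bill}")  # prints "Total to pay: $60"
```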
Good information should be timely and available when it is needed.
The following are the characteristics of good information.
Accurate – information must be free from errors and mistakes. This is achieved by following strict set
standards for processing data into information. For example, adding $6 + 10 would give us inaccurate
information. Accurate information for our example is multiplying $6 by 10.
Complete – all the information needed to make a good decision must be available. Nothing should be
missing. If tax applies to the total amount that the customer should pay, then
it should be included as well. Leaving it out can mislead the customer into thinking they should pay only $60
when, in fact, they must pay tax as well.
Cost Effective – the cost of obtaining information must not exceed the benefit of the information in
monetary terms.
User-focused – the information must be presented in a way that addresses the
information requirements of the target user. For example, operational managers require very
detailed information, and this should be considered when presenting information to operational
managers. The same information would not be appropriate for senior managers because they would
have to process it again. To them, it would be data and not information.
Relevant – the information must be relevant to the recipient. The information must be directly
related to the problem that the intended recipient is facing. If the ICT department wants to buy a new
server, information that talks about a 35% discount on laptops would not be relevant in such a
scenario.
Authoritative – the information must come from a reliable source. Let's say you have a bank
account and you would like to transfer money to another bank account that uses a different currency
from yours. Using the exchange rate from a bureau de change would not be considered authoritative
compared to getting the exchange rate directly from your bank.
Timely – information should be available when it is needed. Let's say your company wants to merge
with another company. Information that evaluates the other company that you want to merge with
must be provided before the merger, and you must have sufficient time to verify the information.
Structured Data
Simply put, data is something that provides information about a particular thing and can be used for analysis.
Data can have different sizes and formats.
For example, a resume or CV holds information about one person, including educational details,
personal interests, work experience, address etc., in a PDF or DOCX file only a few kilobytes in size.
This is very small data that can be easily retrieved and analyzed. But with the advent of newer
technologies in this digital era, there has been a tremendous rise in data size.
Data has grown from kilobytes (KB) to petabytes (PB). This huge amount of data is referred to as big data
and requires advanced tools and software for processing, analyzing and storing it.
Data that has a structure, is well organized either in the form of tables or in some other way, and
can be easily operated on is known as structured data.
Searching and accessing information in this type of data is very easy.
For example, data stored in a relational database in the form of tables with multiple rows and
columns.
A spreadsheet is another good example of structured data.
Structured data is data that adheres to a pre-defined data
model and is therefore straightforward to analyse.
Structured data conforms to a tabular format with
relationships between the different rows and columns.
Common examples of structured data are Excel files or
SQL databases. Each of these has structured rows and
columns that can be sorted.
Structured data depends on the existence of a data model – a model of
how data can be stored, processed and accessed.
Because of the data model, each field is discrete and can be accessed
separately or jointly along with data from other fields. This makes
structured data extremely powerful: it is possible to quickly aggregate
data from various locations in the database.
Structured data is considered the most 'traditional' form of data
storage, since the earliest versions of database management systems
(DBMS) were able to store, process and access structured data.
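A minimal sketch of these ideas, using Python's built-in sqlite3 module as a stand-in for any relational DBMS. The sales table and its rows are invented for illustration. The table definition plays the role of the data model, a single field is read on its own, and values are aggregated across rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The data model: every row must have these three typed fields
conn.execute(
    "CREATE TABLE sales ("
    "product TEXT NOT NULL, unit_price REAL NOT NULL, quantity INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", 1.5, 4), ("notebook", 6.0, 10), ("pen", 1.5, 2)],
)

# Each field is discrete, so one column can be accessed on its own
products = [row[0] for row in conn.execute(
    "SELECT DISTINCT product FROM sales ORDER BY product")]

# Fields can also be combined and aggregated across rows
totals = dict(conn.execute(
    "SELECT product, SUM(unit_price * quantity) FROM sales GROUP BY product"))

print(products)  # ['notebook', 'pen']
print(totals)
```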
CHARACTERISTICS OF STRUCTURED DATA
Structured data is organized in semantic chunks (entities), with similar
entities grouped together to form relations or classes
Entities in the same group have the same description
Structured data:
Conforms to a data model
Data is stored in the form of rows and columns, e.g. a relational database
Similar entities are grouped
Attributes in a group are the same
Data resides in fixed fields within a record or file
Definition, format and meaning of data are explicitly known
Highly Organised
Clearly Defined
Easy to Access
Easy to Analyse
Where Does Structured Data Come From?
SQL Databases
Spreadsheets
Sensors
Medical Devices
Online Forms
Point of Sales Systems
Web and Server Logs
How Easy Is It to Work with Structured Data?
Working with structured data is easy when it comes to storage, scalability,
security, and update and delete operations.
1) Storage: Both built-in and user-defined data types help with the storage
of structured data.
2) Scalability: Growth in data size does not cause scalability problems.
3) Security: Ensuring security is easy.
4) Update and delete operations: These are easy because of the structured form of the data.
Hassle Free Retrieval
Retrieving the desired information from structured data is painless because of the
following features:
1) Retrieving information: A well-designed structure helps in the retrieval of data.
2) Indexing and searching: Data can be indexed based not only on a text string but also on
other attributes. This enables streamlined search.
3) Mining data: Structured data can be easily mined, and knowledge can be extracted from it.
4) BI operations: Business Intelligence works very well with structured data. Hence data
mining and data warehousing can be easily undertaken.
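Indexing on an attribute other than a text string (point 2 above) can be sketched with sqlite3. The orders table, the index name idx_orders_date and the sample rows are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, order_date, amount) VALUES (?, ?, ?)",
    [("alice", "2024-01-15", 10.0),
     ("bob", "2024-02-03", 25.0),
     ("carol", "2024-03-20", 40.0)],
)

# Index on a date attribute rather than on free text
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")

# A range search on the indexed attribute (ISO dates sort correctly as text)
recent = conn.execute(
    "SELECT customer, amount FROM orders "
    "WHERE order_date >= ? ORDER BY order_date",
    ("2024-02-01",),
).fetchall()

# The index is recorded in the database catalogue
index_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]

print(recent)
```

Because every row has the same known fields, the database can build such an index without inspecting the content itself.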
[Figure: example of structured data in an Excel sheet]
Unstructured Data
Unstructured data is information that either does not have a predefined
data model or is not organized in a pre-defined manner.
Unstructured information is typically text-heavy, but may contain data
such as dates, numbers, and facts as well. This results in irregularities
and ambiguities that make it difficult to understand using traditional
programs as compared to data stored in structured databases.
Common examples of unstructured data include audio and video files or
NoSQL databases.
The ability to store and process unstructured data has greatly grown in recent years, with
many new technologies and tools coming to the market that are able to store specialised
types of unstructured data.
MongoDB, for example, is optimized to store documents.
Apache Giraph, by contrast, is built for processing graphs, that is, relationships between
nodes.
The ability to analyze unstructured data is especially relevant in the context of Big Data,
since a large part of the data in organizations is unstructured.
Think about pictures, videos or PDF documents.
The ability to extract value from unstructured data is one of the main drivers behind the
quick growth of Big Data.
How to Manage Unstructured Data
1) Indexing:
Indexing helps in searching and retrieval.
Unstructured data is indexed based on text or some other attribute, e.g. the file name.
Indexing unstructured data is difficult because the data has no pre-defined
attributes and does not follow any pattern or naming convention.
Text can be indexed based on a text string, but for non-text files such as audio or video, indexing
depends on file names. This becomes a hindrance when naming conventions are not followed.
2) Tags/Metadata:
Using metadata, data in a document can be tagged.
This enables search and retrieval.
But in unstructured data, this is difficult as little or no metadata is available.
Structure of data has to be determined which is very difficult as the data itself has no particular
format and is coming from more than one source.
3) Taxonomy:
It is classifying data on the basis of relationships that exist between data.
Data can be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an
organization.
Since the data is unstructured, naming conventions or standards are not consistent across an
organisation, thus making it difficult to classify data
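The indexing and tagging ideas above can be sketched as a small inverted index in Python. The file names, texts and hand-assigned tags are invented. Real systems use far more elaborate text analysis, so this is only an illustration of the principle.

```python
import re
from collections import defaultdict

# Text documents can be indexed by their words; the audio file has no text at all
documents = {
    "memo.txt": "Server outage reported on Monday",
    "notes.txt": "Customer reported a billing problem",
    "call_0412.wav": "",
}

# Hypothetical metadata: tags someone attached to the audio file by hand
tags = {"call_0412.wav": {"customer", "complaint"}}

def build_index(docs, tags):
    """Map each word or tag to the set of files it occurs in."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(name)
        for tag in tags.get(name, ()):
            index[tag].add(name)
    return index

index = build_index(documents, tags)
print(sorted(index["reported"]))  # ['memo.txt', 'notes.txt']
print(sorted(index["customer"]))  # ['call_0412.wav', 'notes.txt']
```

Without the tags, the audio file could be found only through its file name, which is exactly the difficulty described above.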
Challenges in Storing Unstructured Data
1) Storage space: Unstructured data is difficult to store and manage. A lot of space is required to store
images, videos, audio etc.
2) Scalability: As the data grows, scalability becomes an issue and the cost of storing data also
increases.
3) Retrieving information: Even once unstructured data is stored, it is difficult to retrieve and
recover.
4) Security: Ensuring security is difficult due to the varied sources of data, for example email, images
and audio.
5) Update and delete: Updating and deleting are difficult because the data has no fixed structure.
6) Indexing and searching: Indexing unstructured data is difficult and error-prone, as the structure
is not clear and attributes are not pre-defined. As a result, search results are not very
accurate. Indexing also becomes more difficult as the volume of data grows.
Solutions to Storage Challenges of Unstructured Data
1) Changing formats: Unstructured data may be converted into formats that are easily managed, stored
and searched. For example, IBM is working on a solution that will convert audio, video etc.
into text.
2) Developing new hardware: New hardware needs to be developed to support unstructured data. It may
either complement the existing storage devices or be a standalone system for unstructured data.
3) Storing in RDBMS/BLOBs: Unstructured data may be stored in relational databases that support
BLOBs (Binary Large Objects). While unstructured data such as a video or image file cannot be stored
neatly in a relational column, there is no such problem when it comes to storing its metadata,
such as the date and time of its creation, the owner or author of the data etc.
4) Storing in XML format: Unstructured data may be stored in XML format, which tries to give some
structure to it by using tags and elements.
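Point 3 can be sketched with sqlite3, which supports a BLOB column type. The media table, the owner name and the placeholder bytes are assumptions for illustration. The file content is stored as an opaque blob, while its metadata sits in ordinary searchable columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media ("
    "id INTEGER PRIMARY KEY, "
    "owner TEXT NOT NULL, "        # metadata: who created the file
    "created_at TEXT NOT NULL, "   # metadata: date and time of creation
    "content BLOB NOT NULL)"       # the unstructured payload itself
)

# Placeholder bytes standing in for a real image file
payload = bytes([0x89, 0x50, 0x4E, 0x47]) + b"placeholder image bytes"

conn.execute(
    "INSERT INTO media (owner, created_at, content) VALUES (?, ?, ?)",
    ("alice", "2024-01-01T00:00:00+00:00", payload),
)

# The metadata can be queried; the blob comes back unchanged as bytes
owner, stored = conn.execute(
    "SELECT owner, content FROM media WHERE owner = ?", ("alice",)
).fetchone()

print(owner, len(stored))
```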
[Figure: an email response as an example of unstructured data]
UIMA: A possible solution to Unstructured data
UIMA stands for Unstructured Information Management Architecture and is a component architecture and
software framework implementation for the analysis of unstructured content like text, video and audio data.
Unstructured information represents the largest, most current and fastest growing source of information
available to businesses and governments.
The motivation to develop such a framework was to build a common platform for unstructured analytics, to
foster reuse of analysis components and to reduce duplication of analysis development. The pluggable
architecture of UIMA lets you easily plug in your own analysis components and combine them
with others. A full analysis task in a solution using unstructured analytics, such as search or government
intelligence applications, is often not a monolithic thing but a multi-stage process where different modules
need to build on each other to form a powerful analysis chain. In some cases annotators from different
specialized vendors may also need to work together to produce the results needed. The UIMA application
interested in such analysis results does not need to know the details of how annotators work together to
create the results. The UIMA framework takes care of the integration and orchestration of multiple
annotators.
So the major goal of UIMA is to transform unstructured information to structured information by
orchestrating analysis engines to detect entities or relations and thus to build the bridge between
the unstructured and the structured world.
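The idea of chaining annotators can be illustrated with a toy pipeline in Python. This is not the UIMA API (UIMA is a Java and C++ framework): the Document class, the annotator functions and the KNOWN_PARTS vocabulary are all hypothetical names invented here to show how each stage builds on the annotations of the previous one.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    # Each annotation is a tuple: (type, start, end, value)
    annotations: list = field(default_factory=list)

KNOWN_PARTS = {"brake", "clutch"}  # made-up vocabulary

def token_annotator(doc):
    """Stage 1: mark every word-like token."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(("token", m.start(), m.end(), m.group()))

def part_annotator(doc):
    """Stage 2: builds on the tokens to detect known car parts."""
    tokens = [a for a in doc.annotations if a[0] == "token"]
    for _, start, end, value in tokens:
        if value.lower() in KNOWN_PARTS:
            doc.annotations.append(("part", start, end, value))

def date_annotator(doc):
    """Stage 3: detect ISO-style dates in the raw text."""
    for m in re.finditer(r"\b\d{4}-\d{2}-\d{2}\b", doc.text):
        doc.annotations.append(("date", m.start(), m.end(), m.group()))

def run_pipeline(doc, annotators):
    """The caller only supplies the chain; it need not know how stages cooperate."""
    for annotate in annotators:
        annotate(doc)
    return doc

doc = run_pipeline(
    Document("Brake pads replaced on 2024-03-05"),
    [token_annotator, part_annotator, date_annotator],
)
parts = [a[3] for a in doc.annotations if a[0] == "part"]
dates = [a[3] for a in doc.annotations if a[0] == "date"]
print(parts, dates)  # ['Brake'] ['2024-03-05']
```

The output is structured (typed entities with positions) even though the input was free text, which is the bridge described above.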
What Can UIMA Be Used For?
UIMA is, by itself, an empty framework. Its purpose is to enable a world-wide, diverse community to develop
inter-operable, often complex analytic components, and to allow them to be combined and run together, with
framework-supplied scale-out and remoting as needed.
There are lots of use cases where UIMA may be applicable.
1. One of the major use cases is search applications. Within search applications, the unstructured content, which is
available mainly as text of various kinds, must be processed and analyzed to be searchable. To obtain a
powerful search application, the text content must be analyzed to determine the document language, followed by
language-dependent linguistic processing such as tokenization, lemmatization and part-of-speech detection.
After these steps, more sophisticated analysis such as entity detection and relation detection between entities can
be done. UIMA and UIMA components can be used for all of these analysis steps.
2. Another important use case is business or government intelligence. For example, UIMA analysis is used to
extract structured information from car repair reports. This data is then used for quality feed-back and problem
early warning systems.
3. Other possible uses of UIMA are the analysis of call center notes to detect product
problems and customer issues, or a public image monitoring solution that finds out what others, for example in
internet forums or press releases, think about a product or company.
Semi-Structured Data
Semi-structured data is a form of structured data that does not obey the formal structure of
data models associated with relational databases or other forms of data tables, but nonetheless
contains tags or other markers to separate semantic elements and enforce hierarchies of records
and fields within the data. Therefore, it is also known as self-describing structure.
In semi-structured data, the entities belonging to the same class may have different attributes
even though they are grouped together, and the attributes' order is not important.
Semi-structured data are increasingly occurring since the advent of the Internet where full-text
documents and databases are not the only forms of data anymore, and different applications
need a medium for exchanging information. In object-oriented databases, one often finds semi-
structured data.
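A short sketch of the self-describing idea, using JSON and Python's json module. The two records are made up. They belong to the same class of entity but carry different attributes, in a different order.

```python
import json

raw = """[
  {"name": "Asha", "email": "asha@example.com"},
  {"email": "ben@example.com", "name": "Ben", "phone": "555-0100"}
]"""

people = json.loads(raw)

# Each record names its own fields, so no shared schema is required
attribute_sets = [sorted(person) for person in people]

# A missing attribute is simply absent; the reader decides how to handle it
phones = [person.get("phone") for person in people]

print(attribute_sets)
print(phones)  # [None, '555-0100']
```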
Example of Semi-Structured data
Semi-structured data does not conform to tabular or relational formats such as Excel sheets or SQL databases, but
nonetheless contains some level of organization through semantic elements like tags. For
instance, consider HTML, which does not restrict the amount of information you can collect in
a document, but enforces a certain hierarchy of nested elements.
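The hierarchy that HTML tags impose can be made visible with Python's standard html.parser module. The OutlineParser class and the sample markup are written for this illustration. The sketch assumes every tag is explicitly closed, which real-world HTML does not guarantee.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>"

parser = OutlineParser()
parser.feed(html)

for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The text content can be of any length, but every piece of it sits at a well-defined place in the tag tree.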
Examples of Semi-Structured Data
Email
CSV, XML and JSON documents
NoSQL databases
HTML
Electronic data interchange (EDI)
RDF
Pros and Cons of Using a Semi-structured Data Format
Advantages
Programmers persisting objects from their application to a database do not need to worry
about object-relational impedance mismatch, but can often serialize objects via a light-
weight library.
Support for nested or hierarchical data often simplifies data models representing complex
relationships between entities.
Support for lists of objects simplifies data models by avoiding messy translations of lists
into a relational data model.
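The first and third advantages can be sketched with Python dataclasses and the json module. The Order and LineItem classes are invented for this example. A nested object holding a list is serialized in one call, with no extra table and no mapping layer.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class LineItem:
    product: str
    quantity: int

@dataclass
class Order:
    customer: str
    items: list = field(default_factory=list)  # a list of LineItem objects

order = Order("alice", [LineItem("pen", 2), LineItem("notebook", 1)])

# asdict() walks the nested dataclasses; json.dumps() writes the result as text
serialized = json.dumps(asdict(order))

# Reading it back yields plain dicts and lists with the same shape
restored = json.loads(serialized)
print(restored["items"][0]["product"])  # pen
```

In a relational model, the same order would normally be split across an orders table and a line-items table linked by a key.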
Disadvantages
The traditional relational data model has a popular and ready-made query language, SQL;
semi-structured formats generally lack an equally universal one.
Prone to "garbage in, garbage out": with the restraints removed from the data model, less
forethought is needed to operate a data application, so poor-quality data gets in more easily.
NoSQL vs SQL: Why NoSQL is better for Big Data applications
Big Data NoSQL databases were pioneered by top internet companies like Amazon, Google,
LinkedIn and Facebook to overcome the drawbacks of RDBMS. An RDBMS is not always the
best solution, as it cannot keep up with the growth of unstructured data.
As data processing requirements grow exponentially, NoSQL offers a flexible, cloud-friendly
approach to processing unstructured data with ease.
In the early days of the web, 1,000 users was a major load for a web application, and 10,000
users was considered an extreme scenario.
As per the web statistics report in 2018, there are about 6 billion people who are connected
to the world wide web and the amount of time that the internet users spend on the web is
somewhere close to 35 billion hours per month, which is increasing gradually.
With so many mobile and web applications available, it is now common to have very
large numbers of users, who generate a lot of unstructured data. There is a need for a database
technology that can store, process and analyze this data around the clock.
To start with: Relational Databases –
The fundamental concept behind databases such as MySQL, Oracle
Express Edition and MS-SQL, which use SQL, is that they are all
Relational Database Management Systems that make use of relations
(generally referred to as tables) for storing data.
In a relational database, the data is correlated with the help of some
common characteristics that are present in the Dataset and the
outcome of this is referred to as the Schema of the RDBMS.
What is NoSQL?
NoSQL is a database technology driven by Cloud Computing, the Web, Big Data and the
Big Users.
NoSQL now leads the way at popular internet companies such as LinkedIn, Google,
Amazon and Facebook, helping them overcome the drawbacks of the 40-year-old RDBMS.
A NoSQL database, also read as "Not Only SQL", is an alternative to an SQL database and,
unlike SQL, does not require any kind of fixed table schema.
NoSQL generally scales horizontally and avoids major join operations on the data. A NoSQL
database can be referred to as structured storage, of which the relational database is
a subset.
NoSQL covers a multitude of databases, each with a different kind of
data storage model. The most popular types are Graph, Key-Value, Columnar and
Document.
Limitations of SQL
Relational Database Management Systems that use SQL are schema-oriented, i.e. the structure of the data
must be known in advance so that the data adheres to the schema.
Examples of such predefined schema based applications that use SQL include Payroll Management System,
Order Processing, and Flight Reservations.
SQL is not suited to processing unpredictable and unstructured information. Big Data
applications, however, demand an occurrence-oriented database that is highly flexible and operates on a
schema-less data model.
SQL databases are vertically scalable. This means that they can only be scaled by increasing the
horsepower of the underlying hardware, which makes processing large batches of data costly.
IT enterprises need to increase the RAM, SSD, CPU, etc., on a single server in order to manage the
increasing load on the RDBMS.
With increasing database size or an increasing number of users, Relational Database Management
Systems using SQL suffer from serious performance bottlenecks, which makes real-time processing of
unstructured data very hard.
With Relational Database Management Systems, built-in clustering is difficult due to the ACID properties of
transactions.
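The schema-oriented limitation described above can be sketched by contrasting sqlite3 with a plain Python list of dictionaries, used here as a stand-in for a schema-less document store. The employees table and the extra twitter field are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT NOT NULL, salary REAL NOT NULL)")
conn.execute(
    "INSERT INTO employees (name, salary) VALUES (?, ?)", ("Asha", 5000.0)
)

# A record with a field the schema did not anticipate is refused
try:
    conn.execute(
        "INSERT INTO employees (name, salary, twitter) VALUES (?, ?, ?)",
        ("Ben", 4500.0, "@ben"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True  # the table has no column named twitter

# Stand-in for a schema-less collection: each record carries its own fields
documents = []
documents.append({"name": "Asha", "salary": 5000.0})
documents.append({"name": "Ben", "salary": 4500.0, "twitter": "@ben"})

print(rejected, len(documents))  # True 2
```

In the relational case the schema would first have to be altered; in the schema-less case the burden of handling the irregular field shifts to the code that reads the data.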
NoSQL vs SQL – 4 Key Differences:
1. Nature of Data and Its Storage- Tables vs. Collections
2. Speed – Normalization vs. Storage Cost
An RDBMS requires a higher degree of normalization, i.e. data needs to be broken down into several
small logical tables to avoid data redundancy and duplication. Normalization helps manage data
in an efficient way, but the complexity of spanning several related tables
hampers the performance of data processing in relational databases using SQL.
On the other hand, in NoSQL databases such as Couchbase, Cassandra and MongoDB, data is
stored in flat collections where data may be duplicated repeatedly. A single piece
of data is hardly ever partitioned off; rather, it is stored as a whole entity. Hence,
read and write operations on a single entity are easier and faster.
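The trade-off can be sketched side by side. The normalized version uses sqlite3 with two tables and a join. The denormalized version is a plain Python dictionary standing in for a document in a NoSQL collection. Table names and data are invented for illustration.

```python
import sqlite3

# Normalized: customer and orders live in separate tables linked by a key
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        product TEXT
    );
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 'pen'), (11, 1, 'notebook');
""")

# Reading the whole entity requires a join across both tables
joined = conn.execute(
    "SELECT c.name, o.product FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Denormalized: the whole entity is stored as one document
customer_doc = {
    "_id": 1,
    "name": "Asha",
    "orders": [{"id": 10, "product": "pen"}, {"id": 11, "product": "notebook"}],
}

# Reading the entity is a single lookup with no join
embedded = [(customer_doc["name"], o["product"]) for o in customer_doc["orders"]]

print(joined == embedded)  # True
```

The document form is quicker to read as a unit, at the price of duplicating data if the same customer details are embedded in many documents.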
3. Horizontal Scalability vs. Vertical Scalability
The most beneficial aspect of NoSQL databases like HBase for
Hadoop, Couchbase and 10gen's MongoDB is the ease of
scalability to handle huge volumes of data.
For instance, if you operate an eCommerce website similar to
Amazon and you happen to be an overnight success - you will have
tons of customers visiting your website.
Under such circumstances, if you are using a relational database, i.e.,
SQL, you will have to meticulously replicate and repartition the
database so as to fulfill the increasing demand of the customers.
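Horizontal scaling rests on spreading records across several machines. A minimal sketch of hash-based partitioning follows, with three in-memory dictionaries standing in for three servers. The function names are hypothetical and no particular NoSQL product is implied.

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Three dictionaries stand in for three separate servers
shards = [dict() for _ in range(3)]

def put(key, value):
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    return shards[shard_for(key, len(shards))].get(key)

for n in range(100):
    put(f"customer-{n}", {"visits": n})

print(get("customer-42"))         # {'visits': 42}
print([len(s) for s in shards])   # how the 100 records were spread
```

With simple modulo hashing, changing the number of shards moves most keys to a different shard, which is why production systems typically rely on more refined schemes such as consistent hashing.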
4. NoSQL vs SQL / CAP vs. ACID
Relational databases using SQL have long been known in the database landscape for
maintaining integrity through the ACID properties (Atomicity, Consistency, Isolation and
Durability) of transactions, and most storage vendors rely on these properties.
NoSQL databases work on the concept of the CAP priorities: at any one time you can
choose only 2 of the 3 priorities in the CAP theorem (Consistency, Availability,
Partition tolerance), as it is highly difficult to attain all three in a changing distributed
node system.
One can term NoSQL databases as BASE, the opposite of ACID, meaning:
BA = Basically Available – availability is assured.
S = Soft State – the state of the system can change at any time, even without executing any
query, because node updates take place every now and then to fulfill the ever-changing
requirements.
E = Eventually Consistent – NoSQL database systems will become consistent in the long
run.
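The atomicity part of ACID can be sketched with sqlite3. The accounts table and the transfer function are invented for this example. A transfer consists of two updates, and if the second one violates a constraint, the first is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint failed; nothing was applied

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is undone

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

A BASE-style system would instead accept the writes on whichever node is reachable and reconcile the replicas later.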
Why should you choose a NoSQL Database like HBase, Couchbase or Cassandra over RDBMS?
1) Applications and databases need to work with Big Data.
2) Big Data needs a flexible data model with a better
database architecture.
3) To process Big Data, these databases need continuous
application availability with modern transaction support.