Lecture 1-2 Big Data

The document outlines a lecture on structured and unstructured data, covering fundamental concepts such as data, information, and metadata, as well as the DIKW pyramid. It introduces NoSQL databases, their characteristics, and types, including key-value, column-based, document-based, and graph-based stores. Additionally, it discusses the evolution of data storage and retrieval techniques, emphasizing the importance of data in modern business and competitive advantage.

Uploaded by

tijjanifatima01

7082

CEM

Lecture 1 - Structured and Unstructured Data

M.S DALHATU

2023-2024
Outline 2

o Part 1 - Data
• Basic concepts, data and information, the DIKW pyramid, metadata
• Data structures: structured, semi-structured, quasi-structured, and unstructured data
• Databases, RDBMS, ACID

o Part 2 - A Brief Introduction to NoSQL
• NoSQL characteristics - NoSQL
• The CAP theorem – BASE

o Part 3 - NoSQL Database Types
• NoSQL database types
• Key-value stores (Redis)
• Column-based stores (Bigtable)
• Document-based stores (MongoDB)
• Graph-based stores (Neo4j)
• NoSQL and Big Data
3

Lecture 1 - Part 1
Data and Information
Basic Concepts 4

o What is data? What is information? How are they different? Can you give examples of each?
o Can you think of higher levels?
o What is metadata?
Data 5

o Data - the foundation of technological activity

o Database - a highly organized collection of assembled data

o Database Management System - sophisticated software that controls the database and the database environment
Data and Information 6

Data
o Example: does 178 mean anything to you? If yes, then what?
o Historically, this term referred to facts concerning objects and events that could be recorded and stored on computer media. This data can be numeric, character, etc. The definition has since been expanded to include objects such as documents, emails, tweets, images, and audio.
o Data is a representation of facts, concepts, or instructions in a formalized manner.
o Data is stored representations of objects and events that have meaning and importance in the user’s environment.
o Data is raw facts that constitute a building block of information.
o It is important to note that not all data conveys useful information.
Information
o The two concepts, data and information, are closely related, but we can define information as data that has been processed. Information gives us a deeper insight into data.
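The step from data to information can be sketched in a few lines of Python. This is only an illustration: the extra readings, the `context` dictionary, and the function name `to_information` are assumptions made for the example, not part of the lecture.

```python
# Raw data: bare numbers with no context (178 is the value from the slide;
# the other two readings are made up for this sketch).
readings = [178, 165, 181]

# Hypothetical context that says what the numbers represent.
context = {"attribute": "height", "unit": "cm"}

def to_information(values, context):
    """Process raw values into a statement that a reader can interpret."""
    mean = sum(values) / len(values)
    return (
        f"Average {context['attribute']}: {mean:.1f} {context['unit']} "
        f"over {len(values)} people"
    )

summary = to_information(readings, context)
print(summary)
```

On its own, 178 could be a height, a price, or a room number. Once the values are processed together with their context, the result is something a person can act on.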
The DIKW Pyramid 7

Knowledge
o Knowledge can be defined as: acquiring, processing, understanding, and using information
Wisdom
o Wisdom can be defined as: knowledge in action
o It is how data and information can give us insight into the future
[Figure: the DIKW pyramid, from base to top: Data, Information, Knowledge, Wisdom]
Metadata 8
o Metadata is data that describes the properties or characteristics of end-user data and the context of this data.

o Metadata is the description of the data, or “data about data”.

o Metadata is available for query and manipulation.

o Examples of metadata are data names, definitions, length (or size), allowable values, the source of the data, where the data is stored, ownership, and usage.

o It is sometimes referred to as the system catalog, data dictionary, or data directory (technically, metadata is the information stored in the data dictionary/system catalog).
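As a rough illustration of metadata being "available for query", the sketch below uses Python's built-in `sqlite3` module and SQLite's own catalog. The `student` table and its columns are hypothetical, and other database systems expose their system catalog through different commands.

```python
import sqlite3

# In-memory database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, height_cm REAL)"
)
conn.execute("INSERT INTO student (name, height_cm) VALUES ('Amina', 178)")

# SQLite's catalog table lists the objects stored in the database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(student)").fetchall()
metadata = {
    name: {"type": col_type, "required": bool(notnull)}
    for _, name, col_type, notnull, _, _ in columns
}

print(tables)
print(metadata)
```

The row `('Amina', 178)` is the end-user data. The column names, types, and constraints returned by the catalog queries are the metadata that describes it.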
Data in History 9

o People have been interested in data for at least the past 12,000 years.

o Non-computer, primitive methods of data storage and handling.


Data in History 10

o Shepherds kept track of their flocks with pebbles.

o A primitive but legitimate example of data storage and retrieval.


Data in History 11

o Dating back to 8500 B.C., unearthed clay tokens or “counters” may have been used for record keeping in primitive forms of accounting.

o Tokens, with special markings on them, were sealed in hollow clay vessels that accompanied commercial goods in transit.
Data Through the Ages 12

o Record-keeping - the recording of data to keep track of how much a person has produced and what it can be bartered or sold for.

o With time, different kinds of data were kept:
• calendars, census data, surveys, land ownership records, marriage records, records of church contributions, family trees, etc.
Data Through the Ages 13

o Double-entry bookkeeping - originated in the trading centers of fourteenth-century Italy.

o The earliest known example is from a merchant in Genoa and dates to the year 1340.
Early Data Problems Spawn Calculating Devices 14

o People became interested in devices that could “automatically” process their data.

o Blaise Pascal produced an adding machine that was an early version of today’s mechanical automobile odometers.
Punched Cards - Data Storage 15

o Invented in 1805 by Joseph Marie Jacquard of France.

o Jacquard’s method of storing fabric patterns, a form of graphic data, as holes in punched cards was a very clever means of data storage.

o Of great importance for computing devices to follow.


Era of Modern Information Processing 16

o The 1880 U.S. Census took about seven years to compile by hand.

o Basing his work on Jacquard’s punched card concept, Herman Hollerith arranged to have the census data stored in punched cards and invented machinery to tabulate them.

o In 1896 Hollerith formed the Tabulating Machine Company to produce and commercially market his devices -- this later became IBM.
Era of Modern Information Processing 17

o James Powers developed devices to automatically feed cards into the equipment and to automatically print results.

o In 1911 he established the Powers Tabulating Machine Company -- this later became Unisys Corporation.
The Mid-1950s 18

o The introduction of electronic computers.

o Witnessed a boom in economic development.

o From this point onward, it would be virtually impossible to tie advances in computing devices to specific, landmark data storage and retrieval needs.
Modern Data Storage Media 19

o Punched paper tape - the earliest form of modern data storage, introduced in the 1870s and 1880s.

o Punched cards were the only data storage medium used in the
increasingly sophisticated electromechanical accounting machines of
the 1920s, 1930s, and 1940s.
Modern Data Storage Media 20
o The middle to late 1930s saw the beginning of the era of erasable magnetic storage media.

o By the late 1940s, early work was done on the use of magnetic tape for recording data.

o By 1950, several companies were developing the magnetic tape concept for commercial use.

o Magnetic Tape - commercially available units in 1952.

o Direct Access Magnetic Devices - began to be developed at MIT in the late 1930s and early 1940s.
Modern Data Storage Media 21

o Magnetic Drum - early 1950s; forerunners of magnetic disk technology.

o Magnetic Disk - commercially available in the mid-1950s.

o Compact Disk (CD) – introduced as a data storage medium in 1985.

o Solid-state technology – Flash drives.


Using Data for Competitive Advantage 22

o Data has become indispensable to every kind of modern business and government organization.

o Data, the applications that process the data, and the computers on which the applications run are fundamental to every aspect of every kind of endeavor.
o Data is a corporate resource, possibly the most important corporate resource.
o Data can give a company a crucial competitive advantage.

o For example, FedEx had a significant competitive advantage when it first provided access to its package tracking data on its Web site.
Data Structures 23
o One of the main characteristics of Big Data, as we will see later, is that it comes in different forms. In fact, unlike much of traditional data analysis, Big Data deals mainly with unstructured or semi-structured data, which requires different techniques and tools to process and analyze.

o There are other classifications of data structures that we are not going to cover.
Categories of Data 24

o Data can be grouped into four categories:


• Structured Data
• Semi-structured Data
• Quasi-structured Data
• Unstructured Data
Structured Data 25
o Data containing a defined data type, format, and structure.

o The most important structured data types are numeric, character, and dates.

o Structured data is stored in tabular form.

o Structured data is most commonly found in traditional databases and data warehouses.

o Most often data comes unstructured, but structure is imposed on it by humans and
machines.

o Structured data is typically the easiest data format to interpret


Semi-structured Data 26
o Textual data files with a discernible pattern that enables parsing

o Examples of this data are XML, JSON, HTML, and CSV files.
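The "discernible pattern that enables parsing" can be seen with Python's standard `json` and `csv` modules. This is a minimal sketch; the sample records are invented for the example.

```python
import csv
import io
import json

# Hypothetical semi-structured inputs: the pattern (keys, delimiters,
# header row) is what makes them machine-parsable.
json_text = '{"name": "Amina", "courses": ["Big Data", "Databases"]}'
csv_text = "name,height_cm\nAmina,178\nBello,165\n"

# JSON: nested keys and lists are recovered as Python dicts and lists.
record = json.loads(json_text)

# CSV: the header row names the fields; every value arrives as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(record["courses"])
print(rows)
```

Note that the CSV values come back as text (`"178"`, not `178`). Unlike structured data in a database table, the data types are not enforced by the format and have to be imposed by whoever parses the file.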


Quasi-structured Data 27
o Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats).

o An example of this data structure is a hyperlink.

o Not all references recognize this as a separate category of data structures.
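To show how a hyperlink can be given structure "with effort and tools", the sketch below splits a URL using Python's standard `urllib.parse` module. The URL is a made-up example of the kind of entry that might appear in a clickstream log.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical clickstream entry.
url = "https://www.example.com/search?q=big+data&page=2&lang=en"

# Split the URL into its components (scheme, host, path, query, ...).
parts = urlparse(url)

# Turn the query string into a dictionary of parameter lists.
params = parse_qs(parts.query)

print(parts.netloc, parts.path)
print(params)
```

The parameter names and their order are chosen by each website, so a parser written for one site's links will not necessarily make sense of another's. That inconsistency is what separates this category from semi-structured data.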
Unstructured Data 28
o Data that has no inherent structure, which may include text documents, PDF files, images, and videos.

o Unstructured data is context-specific.


What is Information? 29

o An organized form of data is known as information.

Definition:
o Data that have been processed so that they are meaningful.
o Data that have been processed for a purpose.
o Data that have been interpreted and understood by the recipient.
What is Management? 30

According to Theo Haimann, management has three meanings:

o Management as a Noun: refers to the group of managers.
o Management as a Process: refers to the functions of management, i.e., planning, organizing, directing, controlling, etc.
o Management as a Discipline: refers to the subject of management.

Management is an individual or a group of people who accept responsibility for running an organization. They plan, organize, direct, and control all the essential activities of the organization. Managers do not do the work themselves. They motivate others to do the work and coordinate (i.e., bring together) all the work to achieve the objectives of the organization.
What is Information Management? 31

The application of management techniques to collect information, communicate it within and outside the organization, and process it so that managers can make quicker and better decisions.
What is Information storage & retrieval 32

A systematic process of collecting and cataloging data so that they can be located and displayed on request. Computers and data processing techniques have made possible high-speed access to large amounts of information for government, commercial, and academic purposes.

A branch of computer or library science relating to the storage, locating, searching, and selecting, upon demand, of relevant data on a given subject.
What is Information storage & retrieval 33

Basic concept of information storage

Storage can refer to a place, such as a storage room, where paper records are kept. It can also refer to a storage device, such as a computer hard disk, CD, DVD, or similar device that can hold data.
Types of storage media
Storage keeps data and information for future use. Common storage media are:
• Hard Disk
• Floppy Disk
• CD & DVD
• USB Flash Drive
Information storage & retrieval Cont. 34

Hard Disk:
• It is usually located inside the computer.
• It stores the programs that the computer needs to work.
Information storage & retrieval Cont. 35

Floppy Disk:
• It is a portable storage medium.
• Insert it into the computer to save your information.
Information storage & retrieval Cont. 36

CD & DVD:
• It is a portable storage medium.
• It allows you to save information on it.
Information storage & retrieval Cont. 37
USB Flash Drive:
• It is very easy to carry.
• It holds more data than a floppy disk.
• It is a much smaller device than the others.
Basic concept of Information Retrieval 38
"An information retrieval system is an information system, that is, a system used to store items of information that need to be processed, searched, retrieved, and disseminated to various user populations" (Salton, 1983)

Major Components of IR
Information retrieval can be divided into several major constituents, which include:
• Database
• Search mechanism
• Language
• Interface
Basic concept of Information Retrieval 40
Major Components of IR
• Database:
A system whose key concept is simply a particular way of handling data. Its objective is to record and maintain information.
• Search mechanism:
Information that is organized systematically can be searched and retrieved when a corresponding search mechanism is provided.
• Search procedures can be categorized as basic or advanced.
• The capacity of the search mechanism determines what retrieval techniques will be available to users and how information stored in databases can be retrieved.
Basic concept of Information Retrieval 41
Major Components of IR
• Language:
Information relies on language when being processed, transferred, or communicated.
Language can be classified as natural language or controlled vocabulary.
• Interface:
• The interface is usually what decides whether or not an information retrieval system is considered user friendly. The quality of an interface is checked by its interaction mode.
• It determines the ultimate success of a system for information retrieval.
Information Retrieval techniques 42
Major retrieval techniques are:
• Basic Retrieval Techniques
• Advanced Retrieval techniques

Boolean Searching
Logical operations are also known as Boolean operators.
• The AND operator, for narrowing down a search
• The OR operator, for broadening a search
• The NOT operator, for excluding unwanted results
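The three Boolean operators map directly onto set operations. The sketch below builds a small inverted index (term to set of document IDs) in plain Python; the three documents and the helper names `build_index` and `lookup` are assumptions made for the example, not a description of any particular retrieval system.

```python
# Hypothetical mini-collection: document ID -> text.
documents = {
    1: "big data storage and retrieval",
    2: "relational database storage",
    3: "big data analytics with nosql",
}

def build_index(docs):
    """Build an inverted index: term -> set of IDs of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(documents)

def lookup(term):
    """Return the set of documents containing `term` (empty if none)."""
    return index.get(term, set())

# AND narrows: documents containing both terms (set intersection).
and_result = lookup("big") & lookup("storage")
# OR broadens: documents containing either term (set union).
or_result = lookup("nosql") | lookup("relational")
# NOT excludes: documents with "storage" but without "big" (set difference).
not_result = lookup("storage") - lookup("big")

print(and_result, or_result, not_result)
```

With this collection, the AND query returns only document 1, the OR query returns documents 2 and 3, and the NOT query returns document 2.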
Information Retrieval techniques 43
[Figure: Boolean searching]
Information Retrieval techniques 44
Proximity Searching:
A proximity search allows you to specify how close two (or more) words
must be to each other in order to register a match.
There are three types of proximity searches:
• Word proximity
• Sentence proximity
• Paragraph proximity
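Word proximity can be sketched as a check on the distance between word positions. The function name `within_words` and the sample sentence are assumptions for this example. Sentence and paragraph proximity could follow the same idea, with sentences or paragraphs as the unit instead of words.

```python
def within_words(text, first, second, max_distance):
    """Return True if `first` and `second` occur within `max_distance`
    word positions of each other anywhere in `text`."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )

text = "information storage and retrieval systems help users find relevant information"

# "storage" (position 1) and "retrieval" (position 3) are 2 words apart.
print(within_words(text, "storage", "retrieval", 2))
# "storage" (position 1) and "users" (position 6) are 5 words apart.
print(within_words(text, "storage", "users", 3))
```

This sketch splits on whitespace only, so punctuation attached to a word would prevent a match. A real system would tokenize the text more carefully.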
Information Retrieval techniques 45
Range Searching:
It is most useful with numerical information. The following options are usually available for range searching:
• Greater than (>) and less than (<)
• Equal to (=)
• Not equal to (/= or <>)
• Greater than or equal to (>=)
• Less than or equal to (<=)
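A range search over numeric fields can be sketched with Python's standard `operator` module. The symbols used as dictionary keys (`>`, `<`, `=`, `<>`, `>=`, `<=`), the sample records, and the function name `range_search` are assumptions of this example. Real retrieval systems each define their own query syntax.

```python
import operator

# Map each range-search symbol to a comparison function.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    "<>": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
}

# Hypothetical catalogue records.
records = [
    {"title": "Report A", "year": 1998},
    {"title": "Report B", "year": 2005},
    {"title": "Report C", "year": 2012},
]

def range_search(rows, field, symbol, value):
    """Return the rows whose `field` satisfies the comparison with `value`."""
    compare = OPERATORS[symbol]
    return [row for row in rows if compare(row[field], value)]

recent = range_search(records, "year", ">=", 2005)
not_2005 = range_search(records, "year", "<>", 2005)

print([row["title"] for row in recent])
print([row["title"] for row in not_2005])
```

Two calls can be chained to express a bounded range, for example everything published after 2000 and before 2010.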
Information Retrieval techniques 46
Advanced Retrieval Techniques:
• Fuzzy searching
• Query expansion
• Multiple database searching

Fuzzy searching:
It is designed to find terms that are spelled incorrectly at data entry or at the query point.
For example, the term computer could be misspelled as compter, compiter, or comyter. Optical Character Recognition (OCR) or compressed texts could also result in erroneous results. Fuzzy searching is designed for the detection and correction of spelling errors that result from OCR and text compression.
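One simple way to approximate fuzzy searching is to compare a query term against a vocabulary of known terms and pick the closest one. The sketch below uses `difflib.get_close_matches` from Python's standard library. The vocabulary, the `cutoff` value of 0.75, and the function name `correct` are assumptions for the example, and production systems typically use more specialized matching.

```python
import difflib

# Hypothetical vocabulary of correctly spelled index terms.
vocabulary = ["computer", "database", "retrieval", "storage"]

def correct(term, vocabulary, cutoff=0.75):
    """Return the vocabulary term most similar to `term`,
    or None if nothing is similar enough."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for misspelling in ["compiter", "comyter"]:
    print(misspelling, "->", correct(misspelling, vocabulary))
```

Both misspellings resolve to `computer`, because they share most of their characters with it in the same order. A term with no close neighbour in the vocabulary returns `None` instead of a forced guess.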
Information Retrieval techniques 47
[Figure: fuzzy searching]
Information Retrieval techniques 48
Query expansion:
Query expansion is a retrieval technique that allows the end user to improve retrieval performance by revising search queries based on results already retrieved.
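The idea of revising a query from earlier results can be sketched as follows. This is one simple automatic variant, assumed for illustration: it suggests the most frequent new words found in the documents already retrieved. The stopword list, the sample documents, and the function name `expand_query` are all hypothetical, and in practice the end user may choose the new terms by hand.

```python
from collections import Counter

# Hypothetical list of words too common to be useful as search terms.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "with"}

def expand_query(query_terms, retrieved_docs, extra_terms=2):
    """Add the most frequent new terms from already retrieved documents."""
    counts = Counter()
    for text in retrieved_docs:
        for word in text.lower().split():
            if word not in STOPWORDS and word not in query_terms:
                counts[word] += 1
    new_terms = [term for term, _ in counts.most_common(extra_terms)]
    return list(query_terms) + new_terms

# Results returned by a first search for "big data" (made-up texts).
first_results = [
    "nosql database for big data",
    "big data storage in a nosql database",
    "nosql systems",
]

expanded = expand_query(["big", "data"], first_results)
print(expanded)
```

Here `nosql` and `database` occur most often in the first results, so they are added to the original query. Running the expanded query should surface documents that discuss the same topic but were missed by the two original terms alone.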
Information Retrieval techniques 49
Information Retrieval Systems:
o Online systems
o CD-ROM systems
o OPAC
o Web Information Retrieval Systems

Online systems
Online information retrieval systems allow the user to search databases located remotely with the help of computer and telecommunication technology.
o Basic searching techniques
o Advanced retrieval techniques
Examples:
Library of Congress, University of Punjab Library
Information Retrieval techniques 50
Future Trends in Online Information Retrieval Systems:
o A great increase in the number of information services that can be accessed from around the world.
o Specialized systems will be more "user oriented" and easily accessible.
o They should be oriented to natural language rather than controlled vocabularies.
o Computer-aided instruction should be incorporated into systems.
o Future online systems must require less effort to use. They should adapt to the user rather than expecting the user to adapt to them.
Internet and web developments have brought significant changes to the economics of the information industry, from which end users benefit.
Through the information storage and retrieval system, users can access the relevant information freely or on payment of a fee.
