Evolution of Data Management
Database Management Systems
Early days → DBMS
• Collecting, storing, processing and retrieving data
• Applications built on top of file systems
Examples:
• Bank
• Hospital
Redundancies → same info in more than one place
Inconsistencies → different values for the same info
• Plethora of drawbacks
1. Data redundancy and inconsistency
◦ Multiple data formats, duplication in different files
2. Difficulty in accessing data
◦ Need to write a new program to carry out each new task
3. Data isolation
◦ Multiple files and formats
4. Integrity problems
◦ Integrity constraints (e.g., account balance > 0) become “buried” in program code rather than being stated explicitly
◦ Hard to add new constraints or change existing ones
5. Atomicity of updates
◦ Example: Transfer of funds from one account to another should either be completed or not happen at all (see the sketch after this list)
◦ Failures may leave data in an inconsistent state with partial updates carried
out
6. Concurrent access by multiple users
◦ Needed for performance
◦ Example: Two people reading a balance (e.g., 100) and then withdrawing money (e.g., 50 each) at the same time
◦ Uncontrolled concurrent accesses can lead to inconsistencies
7. Security problems
◦ Hard to provide user access to some, but not all, data
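To make the integrity and atomicity drawbacks (4 and 5) concrete, here is a minimal sketch of how a DBMS states a constraint declaratively and applies a multi-step update atomically, using Python's built-in sqlite3 module; the account table, its values, and the transfer amounts are illustrative assumptions rather than anything from the notes.

```python
import sqlite3

# In-memory database; schema and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")  # constraint stated explicitly, not buried in code
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(src, dst, amount):
    """Move `amount` between accounts atomically: both updates commit or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on any error
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
    except sqlite3.IntegrityError:
        # e.g., the CHECK constraint rejected an overdraft; no partial update remains
        print("transfer aborted, balances unchanged")

transfer(1, 2, 70)   # succeeds
transfer(1, 2, 500)  # would violate balance >= 0, so it is rolled back as a whole
print(conn.execute("SELECT id, balance FROM account").fetchall())  # [(1, 30), (2, 120)]
```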
Relational Database Management Systems
Following Developments → RDBMS
◦ Designed to address DBMS drawbacks/inefficiencies
◦ Data is stored in the form of tables
◦ Maintaining the relationships among tables
◦ Supports large data sizes, distribution, many users, multiple levels of data security, integrity constraints, etc. (a minimal schema sketch follows)
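A minimal sketch of the points above (data stored in tables, with the relationships among tables maintained by the system), again using Python's sqlite3; the customer/account schema and the query are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce relationships between tables

# Data lives in tables; the foreign key maintains the relationship among them.
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE account  (id INTEGER PRIMARY KEY,
                       owner_id INTEGER NOT NULL REFERENCES customer(id),
                       balance  INTEGER NOT NULL);
INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO account  VALUES (10, 1, 100), (11, 2, 50);
""")

# Declarative access: one SQL query instead of a hand-written file-reading program.
rows = conn.execute("""
    SELECT c.name, SUM(a.balance)
    FROM customer c JOIN account a ON a.owner_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', 100), ('Bob', 50)]

# The relationship is enforced: an account owned by a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO account VALUES (12, 99, 10)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```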
Internet growth
- The example of Wikipedia
• Free content encyclopedia
• Among the popular websites
• Written/maintained: community of volunteer contributors
• Various actions available to the average Web user!
• Create new articles
• Extend existing articles
• Translate to other languages
• Events appear within minutes
Big Data and its challenges
• Reality: ever-increasing data, demanding users
Big Data
• Information assets that require new forms of processing to enable enhanced decision
making and insight discovery
• Definition of Big Data expressed through Vs
1. Volume:
◦ Amount of generated and stored data
◦ Wikipedia: 6.5M English articles, users, other languages, etc.
2. Velocity:
◦ Rate/Speed at which the data is generated, received, collected, and
(perhaps) processed
◦ Wikipedia: e.g., 6,000 editors make more than 100 edits per month on the English articles
3. Variety:
◦ Different types of data that are available
◦ RDBMS: structured data that fits neatly
◦ Web systems, e.g., Wikipedia: unstructured and semistructured data types,
such as text, audio, and video
◦ Requires additional preprocessing to derive meaning and support metadata
4. Veracity:
◦ Quality of captured data
◦ Truthfulness of the data and how much we can rely on it
◦ Low veracity → high percentage of meaningless data (e.g., noise)
5. Value:
◦ Refers to the inherent wealth (i.e., economic and social) embedded in the
data
◦ Consider the biggest tech companies: a large part of their value comes from their data, which they constantly analyze to improve efficiency and develop new products
Even more Big Data Characteristics
• Visualization:
◦ Display the data
◦ Technical issues due to limitations of in-memory technology, scalability,
response time, etc.
• Volatility:
◦ Everything changes … thus we always need to be aware of whether data is now irrelevant, historic, or just not useful
• Vulnerability:
◦ New security concerns
Example
Data Integration
• Entities encode a large part of our knowledge
• Valuable asset for numerous current applications and (Web) systems
• A plethora of different objects share the same name
• Example: London
Entity Resolution
• Task that identifies and aggregates the different descriptions that refer to the same real-world objects (a minimal sketch follows the list of domains below)
• Primary usefulness:
◦ Improves data quality and integrity
◦ Fosters re-use of existing data sources
• Example application domains:
◦ Linked Data
◦ Building Knowledge Graphs
◦ Census data
◦ Price comparison portals
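As a rough illustration of the task (not a method taken from the notes), the sketch below compares descriptions pairwise with a simple character-level similarity and aggregates the pairs above a threshold; the toy records, the chosen attributes, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy descriptions of real-world entities; note the two variants of London, UK
# and the unrelated London in Canada.
descriptions = [
    {"id": 1, "name": "London", "country": "United Kingdom"},
    {"id": 2, "name": "Londn",  "country": "United Kingdom"},   # misspelled duplicate
    {"id": 3, "name": "London", "country": "Canada"},           # London, Ontario
    {"id": 4, "name": "Paris",  "country": "France"},
]

def similarity(a, b):
    """Average character-level similarity over the shared attributes."""
    fields = ["name", "country"]
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields]
    return sum(scores) / len(scores)

# Pairs above the threshold are taken to describe the same real-world object
# and would be aggregated into a single profile.
matches = [(a["id"], b["id"])
           for a, b in combinations(descriptions, 2)
           if similarity(a, b) >= 0.8]
print(matches)  # [(1, 2)] -- only the two descriptions of London, UK
```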
Data Management
• Challenges arise from the application settings
• Examples:
◦ Data characteristics
◦ System and resources
◦ Time restrictions
• Evolving nature of the application settings implies a constant modification of the challenges
• Primary reason for the plethora of Entity Resolution methods
Challenges
Veracity
• No longer only structured data with known semantics and quality
• Dealing with high levels of description noise
+ Volume
• Very large number of descriptions
+ Variety
• Large volumes of semi-structured, unstructured or highly heterogeneous structured
data
+ Velocity
• Continuously increasing volume of available data
Challenges in time
Big Data refers to the inherent wealth, economic and social, embedded in any data collection
- Data storage
- Finding the needle in the haystack
- Data processing
- Scalability
Architectural choices to consider
• Storage layer
• Programming model & execution engine
• Scheduling
• Optimizations
• Fault tolerance
• Load balancing
Scalability in data management (Chronological order)
Traditional databases
◦ Constrained functionality: SQL only
◦ Efficiency limited by server capacity
- Memory
- CPU (central processing unit)
- HDD (hard disk drive)
- Network
• Scaling can be done by
◦ Adding more hardware
◦ Creating better algorithms
- But we can still reach the limits
Distributed databases
• Innovation:
◦ Add more DBMSs & partition the data
• Constrained functionality:
◦ Answer SQL queries
• Efficiency limited by network, #servers
• API offers location transparency
◦ User/application always sees a single machine
◦ User/application does not care about data location (a minimal sketch follows this list)
• Scaling: add more/better servers, faster network
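A minimal sketch of partitioning data across several servers behind a location-transparent API; the DistributedStore class and its dictionary-backed "servers" are illustrative assumptions standing in for real DBMS instances reached over the network.

```python
class DistributedStore:
    """The application talks to this one object and never learns which server holds a key."""

    def __init__(self, num_servers):
        # Each server is modelled as an in-memory dict; in reality these would be
        # separate DBMS instances on different machines.
        self.servers = [dict() for _ in range(num_servers)]

    def _server_for(self, key):
        # Hash partitioning decides where each record lives.
        return self.servers[hash(key) % len(self.servers)]

    def put(self, key, value):
        self._server_for(key)[key] = value

    def get(self, key):
        return self._server_for(key).get(key)

store = DistributedStore(num_servers=3)
store.put("alice", {"balance": 100})
store.put("bob", {"balance": 50})
print(store.get("alice"))  # the caller sees a single store, not three partitions
```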
Massively parallel processing platforms
• Innovation:
◦ Connect computers (nodes) over a LAN & make development, parallelization, and
robustness easy
• Functionality:
◦ Generic data-intensive computing
• Efficiency relies on network, #computers, and algorithms
• API offers location & parallelism transparency
◦ Developers don’t need to know where data is stored or how the code will be parallelized
• Scaling:
◦ Add more and/or better computers
Cloud
• Massively parallel processing platforms running over rented hardware
• Innovation: Elasticity, standardization
◦ Amazon requires huge computational capacity near holidays
◦ A university requires very few resources during holidays
• Resources can be automatically adjusted (elasticity)
• API offers location and parallelism transparency
• Scaling: it’s magic!
Big Data models
Store, manage, and process Big Data by harnessing large clusters of commodity nodes
• MapReduce family: simpler, more constrained (a single-machine sketch follows this list)
• 2nd generation: enables more complex processing and data, optimization opportunities
- Apache Spark, Google Pregel, Microsoft Dryad
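A single-machine sketch of the MapReduce programming model named above (word count): the developer writes only the map and reduce functions, while a real platform would distribute them over a cluster and handle the shuffle, scheduling, and fault tolerance. The function names and sample documents are illustrative.

```python
from collections import defaultdict

# Developer-supplied logic: map emits (key, value) pairs, reduce aggregates all
# values collected for one key.
def map_fn(line):
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(lines):
    """Single-process stand-in for the platform's parallel map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in map_fn(line):
            groups[key].append(value)       # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())   # reduce phase

documents = ["big data needs new processing models",
             "MapReduce models data processing over clusters"]
print(mapreduce(documents))  # e.g. {'big': 1, 'data': 2, ..., 'models': 2, ...}
```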
Big Data Analytics (according to IBM)
• Driven by artificial intelligence, mobile devices, social media and the Internet of Things
(IoT)
• Data sources are becoming more complex than those for traditional data
◦ e.g., Web applications allow user-generated data
- Deliver deeper insights
- Power innovative data applications
- Better and faster decision-making
- Predicting future outcomes
- Enhanced business intelligence
Analytics
• Traditional computation (e.g., SQL):
◦ Exact and all answers over the whole data collection
Interactive processing:
• Users give their opinion during the processing
• Thus:
- Users understand the problem
- Users influence decisions
• ER: system users are asked to help during the processing, i.e., their answers are
considered as part of the algorithm
Crowdsourcing processing:
• Difficult tasks or opinion questions in the processing are given to a group of people
• ER: humans are asked about the relation between profiles for a small compensation per
reply
Approximate processing:
• Use a representative sample instead of the entire input data collection
• Give approximate output rather than exact answers
• Answers are given with guarantees (see the sampling sketch below)
• ER: profiles are the same with 95% certainty
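A minimal sketch of answering from a representative sample: it estimates the fraction of matching pairs and reports a normal-approximation 95% interval instead of an exact count; the synthetic is_match check, population size, and sample size are assumptions.

```python
import math
import random

random.seed(0)

def is_match(pair_id):
    """Stand-in for an expensive comparison; pretend ~10% of all pairs are true matches."""
    return pair_id % 10 == 0

population = range(1_000_000)                 # the entire input collection of pairs
sample = random.sample(population, k=2_000)   # representative sample instead of all of it

p_hat = sum(is_match(p) for p in sample) / len(sample)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))  # ~95% confidence half-width

print(f"estimated match rate: {p_hat:.3f} ± {margin:.3f} (95% confidence)")
```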
Progressive processing:
• Efficiently process the data given the limited time and/or computational resources that are currently available
• ER: results are shown as soon as they are available (a small generator-based sketch follows)
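A small generator-based sketch of the progressive idea: matches are emitted as soon as they are found, and comparison simply stops when the time budget runs out; the toy matching rule, record layout, and budget are assumptions.

```python
import time
from itertools import combinations

def progressive_matches(descriptions, budget_seconds):
    """Yield each match as soon as it is found; stop when the time budget is spent."""
    deadline = time.monotonic() + budget_seconds
    for a, b in combinations(descriptions, 2):
        if time.monotonic() > deadline:
            return                                    # budget exhausted; partial results were already emitted
        if a["name"].lower() == b["name"].lower():    # toy matching rule
            yield a["id"], b["id"]

people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "ann"}, {"id": 3, "name": "Bob"}]
for match in progressive_matches(people, budget_seconds=0.5):
    print("found so far:", match)   # results appear as soon as they are available
```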
Incremental processing:
• The rate of data updates is often high, which quickly makes the previous results obsolete
• Update existing processing information rather than recomputing from scratch (see the sketch below)
• Allow leveraging new evidence from updates to:
• Fix previous inconsistencies or
• Complete the information
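A minimal sketch of the incremental idea for ER: each arriving description is compared only against already-indexed descriptions that share a cheap blocking key, so earlier results are refined rather than recomputed from scratch; the blocking key, toy matching rule, and records are assumptions.

```python
from collections import defaultdict

# Index of already-processed descriptions, grouped by a cheap blocking key.
index = defaultdict(list)

def blocking_key(desc):
    return desc["name"].strip().lower()

def add_description(desc):
    """Process one update: compare only within its block, then index it."""
    key = blocking_key(desc)
    matches = [old["id"] for old in index[key]
               if old["country"].lower() == desc["country"].lower()]  # toy matching rule
    index[key].append(desc)
    return matches

# Initial data, then an update arriving later; previous results are refined, not rebuilt.
print(add_description({"id": 1, "name": "London", "country": "UK"}))       # []
print(add_description({"id": 2, "name": "London", "country": "Canada"}))   # []
print(add_description({"id": 3, "name": "london ", "country": "uk"}))      # [1]
```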