ADVANCED DATABASES
Unit 2 ( Module 1 )
Introduction To Modern Databases
What is a Database?
A database is like a digital storage room where data is kept. Imagine a huge,
organized file cabinet. It stores all kinds of information like customer details, product
information, or transaction records.
Modern Databases:
Modern databases are more advanced and powerful than older ones. They are
designed to store, manage, and quickly retrieve large amounts of data, even as that
data grows rapidly. They use techniques such as indexing, replication, and access
control to keep the data organized, easy to access, and secure. They are essential
for everything from small apps to large companies with millions of users.
Examples of Where Modern Databases Are Used:
E-commerce websites use databases to store product information, customer
orders, and payments.
Social media platforms use databases to store user profiles, posts, and comments.
Banks use databases to track account information and transactions.
NoSQL, NewSQL
1. NoSQL:
NoSQL stands for "Not Only SQL". It’s a type of database designed for storing
and managing large amounts of data that may not fit well into traditional relational
databases.
NoSQL databases can spread huge amounts of data across many servers.
They can store data in different formats, such as key-value pairs, documents,
wide columns, or graphs.
Types of NoSQL Databases:
o Document-Based: Stores data in documents (e.g., JSON or BSON format).
Example: MongoDB.
o Key-Value Stores: Stores data as key-value pairs. Example: Redis.
o Column-Based (Wide-Column): Stores each row as a flexible set of columns grouped
into column families, so different rows can hold different columns. Example:
Cassandra.
o Graph-Based: Designed for relationships between data (e.g., social networks).
Example: Neo4j.
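To make the four types easier to compare, here is a small sketch that expresses the same kind of user data in each model. Plain Python structures stand in for the real storage engines, and the names, keys, and the `followers_of` helper are invented for illustration only.

```python
# Hypothetical user data expressed in the four NoSQL data models.
# Plain Python structures stand in for the real storage engines.

# Key-value: an opaque value looked up by a single key.
key_value_store = {"user:1:name": "Asha", "user:1:city": "Pune"}

# Document: a self-contained, nested, JSON-like record.
document = {
    "_id": 1,
    "name": "Asha",
    "address": {"city": "Pune"},
    "tags": ["premium", "mobile"],
}

# Wide-column: each row key maps to its own set of columns,
# so different rows may carry different columns.
wide_column = {
    "user:1": {"name": "Asha", "city": "Pune"},
    "user:2": {"name": "Ravi", "last_login": "2024-01-05"},
}

# Graph: nodes plus labelled edges that describe relationships.
nodes = {1: "Asha", 2: "Ravi"}
edges = [(1, "FOLLOWS", 2)]


def followers_of(node_id):
    """Return names of nodes that have a FOLLOWS edge pointing at node_id."""
    return [nodes[src] for src, label, dst in edges
            if label == "FOLLOWS" and dst == node_id]


print(key_value_store["user:1:city"])   # Pune
print(document["address"]["city"])      # Pune
print(sorted(wide_column["user:2"]))    # ['last_login', 'name']
print(followers_of(2))                  # ['Asha']
```

Notice that only the graph model makes the relationship itself ("FOLLOWS") a first-class piece of data.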
When to Use: NoSQL is ideal for projects that need to handle:
o Large amounts of unstructured or semi-structured data.
o Quick scalability and flexibility.
o Real-time data, like social media or IoT (Internet of Things) data.
2. NewSQL:
NewSQL is a newer category of databases that aim to provide the advantages of SQL
(structured data and relational models) with the scalability and performance
features that NoSQL databases offer.
It is designed to scale horizontally, which means it can handle increased traffic and
large amounts of data by adding more servers (just like NoSQL), while still
supporting transactional processing (as needed by banking systems).
What it is: NewSQL is built to combine the best of both worlds: it supports traditional
SQL (structured queries, transactions) but can handle large-scale data and
distributed architectures like NoSQL.
Popular NewSQL Databases:
o Google Spanner: A distributed relational database that can scale horizontally
while maintaining strong consistency guarantees.
o CockroachDB: A distributed SQL database that is easy to scale while
maintaining SQL features.
o VoltDB: A high-performance NewSQL database designed for fast transactions.
When to Use: NewSQL is useful when you need:
o Relational data but also need to scale to handle high traffic.
o Strong consistency and ACID transactions at a large scale.
o High availability with minimal downtime.
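As a rough illustration of the horizontal scaling mentioned above, the sketch below spreads records across several servers by hashing each key. The node names and the `node_for` and `distribute` helpers are hypothetical, and this is not how any particular NewSQL product is implemented.

```python
import hashlib

# Hypothetical node names; a real cluster would discover these dynamically.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key, nodes=NODES):
    """Pick a node by hashing the key (stable across runs, unlike hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes=NODES):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


placement = distribute([f"order:{i}" for i in range(1, 10)])
for node, keys in placement.items():
    print(node, keys)
```

Simple modulo hashing moves most keys whenever a server is added or removed. Real distributed databases generally use more elaborate schemes, such as range partitioning or consistent hashing, to limit that movement.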
RDBMS Databases
RDBMS (Relational Database Management System):
An RDBMS is a type of database that stores data in an organized way, using tables
that are related to each other. It's like a digital spreadsheet where the data is
structured into rows and columns.
Example:
StudentID | First_Name | Last_Name | Age | Major
1         | John       | Doe       | 20  | Computer Science
2         | Jane       | Smith     | 22  | Mathematics
This is a simple example of an RDBMS table where:
The columns represent attributes (like name, age, major).
Each row represents a single student.
Examples: MySQL, PostgreSQL, Oracle, SQL Server.
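The student table above can be tried out with SQLite, which ships with Python. This is a minimal sketch: the column types and the sample query are assumptions, and the other systems listed use slightly different SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """
    CREATE TABLE Students (
        StudentID  INTEGER PRIMARY KEY,
        First_Name TEXT NOT NULL,
        Last_Name  TEXT NOT NULL,
        Age        INTEGER,
        Major      TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "Doe", 20, "Computer Science"),
        (2, "Jane", "Smith", 22, "Mathematics"),
    ],
)

# Each returned tuple is one row; each position in it is one column.
rows = conn.execute(
    "SELECT First_Name, Major FROM Students WHERE Age > ? ORDER BY StudentID",
    (21,),
).fetchall()
print(rows)  # [('Jane', 'Mathematics')]
```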
NoSQL Vs RDBMS Databases
Feature      | NoSQL                                                       | RDBMS (SQL)
Data Model   | Flexible (documents, key-value, graphs, etc.)               | Structured (tables with rows and columns)
Schema       | No fixed schema (can change over time)                      | Fixed schema (predefined structure)
Scaling      | Horizontal scaling (across many servers)                    | Vertical scaling (requires stronger hardware)
Transactions | Not always ACID-compliant (eventual consistency)            | ACID-compliant (strong consistency and reliability)
Performance  | High performance, especially for large datasets             | Optimized for complex queries and transactions
Use Cases    | Big data, real-time apps, flexible data (social media, IoT) | Financial systems, CRMs, inventory systems, reporting
Examples     | MongoDB, Cassandra, Redis, Neo4j                            | MySQL, PostgreSQL, Oracle, SQL Server
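The schema difference can be seen in a few lines. In this sketch, SQLite plays the role of the RDBMS and a list of Python dictionaries stands in for a document store. The `products` table and its fields are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")

# Fixed schema: a column that was never declared is rejected.
try:
    conn.execute(
        "INSERT INTO products (id, name, color) VALUES (1, 'Pen', 'blue')"
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Schema-less documents: each record may carry its own fields.
product_docs = [
    {"id": 1, "name": "Pen", "color": "blue"},
    {"id": 2, "name": "Laptop", "specs": {"ram_gb": 16}},
]
fields_seen = sorted({field for doc in product_docs for field in doc})

print(rejected)      # True
print(fields_seen)   # ['color', 'id', 'name', 'specs']
```

In the relational case the table would first have to be altered to add the new column, while the document collection accepts the new field without any change.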
Tools
1. Database Management Systems (DBMS):
These are the core tools used to create, manage, and interact with databases. They allow
users to store, retrieve, and manipulate data.
Examples:
o MySQL: A popular open-source relational database system.
o PostgreSQL: Another open-source database system known for its advanced
features.
o MongoDB: A NoSQL document database used for flexible, JSON-like data
storage.
2. ETL Tools (Extract, Transform, Load):
ETL tools are used to move and manipulate data from different sources and load it into a
data warehouse or database.
Extract: Getting data from various sources.
Transform: Cleaning or converting the data into a suitable format.
Load: Putting the data into the final destination (like a data warehouse).
Examples:
o Informatica: A powerful tool used for data integration.
o Talend: An open-source ETL tool that helps in connecting and transforming
data.
o Apache NiFi: A tool for automating the flow of data between systems.
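The three ETL steps can be sketched with the Python standard library alone. This is only an illustration of the idea, not how the tools above work internally. The CSV content, the `orders` table, and the `extract`, `transform`, and `load` function names are all assumptions.

```python
import csv
import io
import sqlite3

# An in-memory string stands in for a source file.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,not_a_number
"""


def extract(text):
    """Extract: read rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert amounts, skip rows that cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        name = row["customer"].strip().title()
        cleaned.append((int(row["order_id"]), name, amount))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)  # [(1, 'Alice', 19.99), (2, 'Bob', 5.0)]
```

The third row is dropped in the transform step because its amount is not a number. A real pipeline would usually log such rows instead of discarding them silently.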
3. Data Warehousing Tools:
These are used to store and manage large amounts of historical data that come from
various sources, making it easier for businesses to run reports and analyze trends.
Examples:
o Amazon Redshift: A cloud-based data warehouse that can handle large
datasets.
o Google BigQuery: A tool for running fast, SQL-like queries on massive
amounts of data in the cloud.
4. Database Performance Tuning Tools:
These tools help optimize and monitor how well a database is running. They make sure the
database is fast, efficient, and can handle a lot of queries.
Examples:
o Oracle Enterprise Manager: Helps monitor and manage Oracle databases.
o SQL Profiler (for SQL Server): Monitors and analyzes SQL queries to identify
slow parts of the database.
o pgAdmin: A tool for managing PostgreSQL databases; its dashboards and
query-plan (EXPLAIN) viewer help locate slow queries.
5. Backup and Recovery Tools:
These tools ensure that your data is safe and can be restored if something goes wrong, like a
system failure or human error.
Examples:
o Veeam: A backup and recovery tool for both databases and virtual
environments.
o RMAN (Recovery Manager): A tool for backing up and recovering Oracle
databases.
6. Data Migration Tools:
These tools help you move data from one system or format to another, such as moving data
between different databases or to the cloud.
Examples:
o AWS Database Migration Service: Helps you move databases to the cloud
with minimal downtime.
o Microsoft Data Migration Assistant: Used to migrate databases to SQL Server.
7. NoSQL Database Tools:
These tools help manage and interact with NoSQL databases that store data in ways other
than traditional tables (e.g., key-value pairs, documents, or graphs).
Examples:
o MongoDB Compass: A GUI tool for MongoDB that helps visualize and analyze
data.
o Cassandra Query Language (CQL): The SQL-like query language used to interact
with Apache Cassandra (a NoSQL database), typically through the cqlsh shell.
8. Database Security Tools:
These tools ensure that the data is protected and only authorized users can access or modify
it.
Examples:
o IBM Guardium: Monitors and protects sensitive data in databases.
o Oracle Audit Vault: A tool for monitoring database security and compliance.
9. Data Visualization and Reporting Tools:
These tools help create reports and visualizations of the data stored in databases, making it
easier to analyze trends and make decisions.
Examples:
o Tableau: A popular tool for creating visualizations and dashboards from
database data.
o Power BI: A Microsoft tool that connects to various databases and creates
interactive reports and dashboards.
OLTP & OLAP
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are two types
of database systems used for different purposes.
1. OLTP (Online Transaction Processing):
o It's designed for handling everyday transactions and operations.
o Example: When you make a purchase online, check your bank account
balance, or update your contact details, these are all OLTP activities.
o Focus: Speed, accuracy, and handling many small transactions at once (like
inserting, updating, or deleting records).
o Databases are usually highly normalized (organized to minimize redundancy).
Example: An e-commerce website where every time a customer buys something, the system
records the transaction, updates the inventory, and adjusts the customer's order history.
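The e-commerce example can be sketched as a single transaction: the order is recorded and the stock is reduced together, or neither change happens. SQLite is used here only as a convenient stand-in. The table names, the `place_order` helper, and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE inventory (
        product TEXT PRIMARY KEY,
        stock   INTEGER CHECK (stock >= 0)
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer TEXT, product TEXT, qty INTEGER
    );
    INSERT INTO inventory VALUES ('keyboard', 5);
    """
)


def place_order(conn, customer, product, qty):
    """Record the order and reduce stock together, or do neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO orders (customer, product, qty) VALUES (?, ?, ?)",
                (customer, product, qty),
            )
            conn.execute(
                "UPDATE inventory SET stock = stock - ? WHERE product = ?",
                (qty, product),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative stock level


first = place_order(conn, "alice", "keyboard", 2)    # succeeds
second = place_order(conn, "bob", "keyboard", 10)    # fails and is rolled back
stock = conn.execute("SELECT stock FROM inventory").fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(first, second, stock, order_count)  # True False 3 1
```

The failed order leaves no trace: its row in `orders` is rolled back along with the rejected stock update.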
2. OLAP (Online Analytical Processing):
o It's designed for complex data analysis and reporting, often using historical
data.
o Example: Looking at business trends over the past year, running reports on
sales performance by region, or analyzing data for decision-making.
o Focus: Complex queries, aggregations, and summarizations of large datasets,
often for decision-making.
o Databases are usually denormalized (some redundancy is kept so that
analytical queries need fewer joins and run faster).
Example: A company’s manager might run an OLAP query to find out how sales have
changed over the last 5 years in different regions.
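A query of this kind typically groups and sums many rows. The sketch below runs such an aggregation on a tiny made-up `sales` table in SQLite. Real OLAP systems work on far larger data and usually on specialised storage, so treat this only as an illustration of the query shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (2022, "North", 100.0), (2022, "South", 80.0),
        (2023, "North", 150.0), (2023, "South", 60.0),
        (2023, "North", 50.0),
    ],
)

# Read-heavy analytical query: summarise many rows into a few totals.
summary = conn.execute(
    """
    SELECT region, year, SUM(amount) AS total
    FROM sales
    GROUP BY region, year
    ORDER BY region, year
    """
).fetchall()
for row in summary:
    print(row)
```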
Key Differences:
OLTP is about fast and efficient handling of transactions, while OLAP is about
analyzing large amounts of data for patterns and trends.
OLTP databases have lots of small updates, inserts, and deletions, whereas OLAP
databases focus on large read-heavy operations, like summarizing and analyzing
data.
Data Preparation & Cleaning Techniques
In an advanced database context, data preparation and cleaning techniques are all
about making sure the data you work with is accurate, consistent, and usable for
analysis or further processing. The most common techniques are listed below.
1. Handling Missing Data
Why?: Missing data can mess up your analysis, so it's important to deal with it.
How?:
o Remove Missing Data: If only a small share of values is missing, you can simply
remove the rows or columns that contain them.
o Fill with Defaults: Replace missing values with a substitute such as the mean,
median, or most frequent value.
o Prediction: Use algorithms to predict what the missing values should be
based on other data.
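The first two options can be sketched with the standard `statistics` module. The sample values and the helper names `drop_missing` and `fill_missing` are assumptions. The prediction option is not shown because it needs a trained model.

```python
from statistics import mean, median, mode

# Hypothetical sample: None marks a missing age.
ages = [20, None, 22, 22, None, 30]


def drop_missing(values):
    """Remove missing entries entirely."""
    return [v for v in values if v is not None]


def fill_missing(values, strategy=mean):
    """Replace missing entries with a statistic of the observed values."""
    fill_value = strategy(drop_missing(values))
    return [fill_value if v is None else v for v in values]


print(drop_missing(ages))           # [20, 22, 22, 30]
print(fill_missing(ages, mean))     # [20, 23.5, 22, 22, 23.5, 30]
print(fill_missing(ages, median))
print(fill_missing(ages, mode))     # [20, 22, 22, 22, 22, 30]
```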
2. Removing Duplicates
Why?: Duplicate data can distort your results, making them inaccurate.
How?: Find and remove rows that are exactly the same to ensure that each record is
unique.
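A minimal sketch of exact-duplicate removal, assuming each record is a dictionary with hashable values. The `remove_duplicates` helper and the sample records are illustrative.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 1, "email": "a@example.com"},  # exact duplicate of the first row
]
unique_customers = remove_duplicates(customers)
print(len(unique_customers))  # 2
```

Rows that differ only in spelling or spacing are not caught by this approach. They need the standardization steps described in the following sections first.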
3. Standardizing Data
Why?: Data may come from different sources with different formats (like dates in
various formats), which can cause confusion.
How?:
o Consistent Formats: Make sure everything is in the same format (e.g., dates
should all be in YYYY-MM-DD).
o Scaling: If you're working with numbers, you may need to normalize them (rescale
to a fixed range such as 0 to 1) or standardize them (rescale to zero mean and
unit variance) so that values from different sources are comparable.
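Both ideas can be sketched in a few lines. The list of accepted date formats is an assumption (note that "05/03/2024" is read as day first here), and `to_iso_date` and `min_max_scale` are hypothetical helper names.

```python
from datetime import datetime

# Assumed input formats; extend this tuple for the sources you actually have.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def min_max_scale(values):
    """Rescale numbers linearly into the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


print([to_iso_date(d) for d in ["2024-03-05", "05/03/2024", "March 5, 2024"]])
print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```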
4. Handling Outliers
Why?: Outliers (data points far from the norm) can skew your analysis and make
results unreliable.
How?: Identify and either remove outliers or transform them to be in line with other
data, depending on their significance.
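The text does not name a specific detection method. One common choice is the interquartile range (IQR) rule, sketched below. The multiplier 1.5 is a convention rather than a fixed rule, and the helper names and sample values are assumptions.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits based on the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def split_outliers(values, k=1.5):
    """Separate values inside the IQR limits from those outside."""
    low, high = iqr_bounds(values, k)
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


inliers, outliers = split_outliers([10, 12, 11, 13, 12, 11, 95])
print(inliers)   # [10, 12, 11, 13, 12, 11]
print(outliers)  # [95]
```

Whether a flagged value should be removed is a judgment call. An unusually large order may be a data-entry error or a genuine big customer.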
5. Dealing with Categorical Data
Why?: Many machine learning algorithms can't work with categories like "yes", "no",
"red", "blue" directly.
How?: Convert these categories into numbers or one-hot encode them (creating
separate columns for each category).
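One-hot encoding can be sketched without any library. The `one_hot_encode` helper is hypothetical, and it orders the categories alphabetically.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a 1 in its category's column."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories]
            for value in values]
    return categories, rows


categories, encoded = one_hot_encode(["red", "blue", "red", "green"])
print(categories)  # ['blue', 'green', 'red']
for row in encoded:
    print(row)
```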
6. Text Data Cleaning
Why?: If you're working with text data (like customer reviews or tweets), it might
contain extra or irrelevant information.
How?:
o Remove unwanted characters (like punctuation or special symbols).
o Lowercase everything to make it uniform.
o Remove common words, known as stop words (like "the", "is", "and"), that
don’t add much meaning.
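The three steps can be combined in one small function. The stop-word list here is deliberately tiny and the pattern only keeps ASCII letters and digits. Both are simplifying assumptions.

```python
import re

STOP_WORDS = {"the", "is", "and", "a", "to"}  # tiny illustrative list


def clean_text(text):
    """Lowercase, strip punctuation, and drop stop words; return the tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation/symbols -> spaces
    return [token for token in text.split() if token not in STOP_WORDS]


print(clean_text("The delivery is FAST, and the price is great!!!"))
# ['delivery', 'fast', 'price', 'great']
```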
7. Fixing Inconsistent Data
Why?: Sometimes data entries aren’t consistent (e.g., "USA" vs "U.S.A." or "NY" vs
"New York").
How?: Standardize the way things are written, making sure they all follow the same
naming rules.
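A simple way to apply this is a lookup table that maps every known variant to one canonical spelling. The `CANONICAL` mapping and the `standardize` helper below are illustrative. In practice the table is built from the variants actually found in the data.

```python
# Known variants (lowercased) mapped to one canonical spelling.
CANONICAL = {
    "usa": "USA",
    "u.s.a.": "USA",
    "united states": "USA",
    "ny": "New York",
    "new york": "New York",
}


def standardize(value, mapping=CANONICAL):
    """Return the canonical spelling, or the trimmed input if it is unknown."""
    key = " ".join(value.strip().lower().split())
    return mapping.get(key, value.strip())


print([standardize(v) for v in ["U.S.A.", " usa", "NY", "New  York", "Texas"]])
# ['USA', 'USA', 'New York', 'New York', 'Texas']
```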
8. Converting Data Types
Why?: Data may be stored incorrectly (e.g., numbers stored as text or dates stored as
plain text), making it hard to work with.
How?: Convert data into the right type (e.g., turning a string of numbers into actual
numeric values).
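A sketch of safe type conversion follows. The helpers `to_number` and `to_date` are hypothetical. They return `None` instead of raising an error, so bad values can be handled by the missing-data techniques above. Treating commas as thousands separators is an assumption that does not hold in every locale.

```python
from datetime import date


def to_number(text):
    """Convert text such as '1,250' or ' 42.5 ' to a number, else None."""
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        try:
            return float(cleaned)
        except ValueError:
            return None


def to_date(text):
    """Convert 'YYYY-MM-DD' text to a date object, else None."""
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        return None


print(to_number("1,250"), to_number(" 42.5 "), to_number("n/a"))
print(to_date("2024-03-05"), to_date("05/03/2024"))
```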
9. Data Transformation
Why?: Sometimes data needs to be changed to make it more useful for analysis.
How?:
o Log Transformation: For very large numbers, taking the logarithm can make
the data easier to analyze.
o Feature Engineering: Create new columns from existing data, like splitting a
"date" column into "day", "month", and "year".
10. Data Consistency Checks
Why?: You need to make sure your data is valid and follows the rules you expect
(e.g., no negative values for ages or prices).
How?: Verify that the data follows the expected rules, then correct or flag any
violations (for example, replacing a negative price with a verified value or marking
the record for review).
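A rule check can be as simple as a function that reports which rows break which rule. The field names `age` and `price` and the `find_violations` helper are assumptions for this sketch.

```python
def find_violations(rows):
    """Return (row_index, message) pairs for rows that break a rule."""
    problems = []
    for index, row in enumerate(rows):
        if row["age"] < 0:
            problems.append((index, "age must not be negative"))
        if row["price"] < 0:
            problems.append((index, "price must not be negative"))
    return problems


records = [
    {"age": 34, "price": 19.99},
    {"age": -2, "price": 5.00},
    {"age": 51, "price": -1.50},
]
violations = find_violations(records)
for index, message in violations:
    print(index, message)
```

Reporting violations instead of silently changing values keeps the decision about how to fix each record with someone who knows the data.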
11. Data Aggregation
Why?: Sometimes, you need to combine data into a simpler form to make it more
useful for analysis.
How?: You might combine data from different rows or columns into a single
summary, like calculating the total sales from individual product sales.
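The total-sales example can be sketched as a group-and-sum over individual sale records. The `total_by` helper and the sample sales are made up for illustration.

```python
from collections import defaultdict


def total_by(rows, key, value):
    """Sum the `value` field of the rows, grouped by the `key` field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


sales = [
    {"product": "pen", "amount": 2.5},
    {"product": "book", "amount": 12.0},
    {"product": "pen", "amount": 3.5},
]
totals = total_by(sales, "product", "amount")
print(totals)  # {'pen': 6.0, 'book': 12.0}
```

Inside a database the same result is usually obtained with a `GROUP BY` query and the `SUM` function.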
By applying these techniques, you make sure that the data in your advanced
database is clean, consistent, and ready for more complex analysis, like generating
reports, building models, or making predictions.