Unit 1. Foundations of Data Systems


1.1. Introduction to Data Processing
Data in its raw form is not useful to any organization. Data processing is the
method of collecting raw data and translating it into usable information. It
is usually performed in a step-by-step process by a team of data
scientists and data engineers in an organization. The raw data is collected,
filtered, sorted, processed, analyzed, stored, and then presented in a readable
format.
Data processing is essential for organizations to create better business strategies
and increase their competitive edge. By converting the data into readable
formats like graphs, charts, and documents, employees throughout the
organization can understand and use the data.

1.2. Stages of data processing


The data processing cycle consists of a series of steps in which raw data (input) is
fed into a system to produce actionable insights (output). Each step is taken in a
specific order, but the entire process repeats in a cyclic manner: the output of one
data processing cycle can be stored and fed in as the input for the next cycle.
Generally, there are six main steps in the data processing cycle:
Step 1: Collection
The collection of raw data is the first step of the data processing cycle. The type
of raw data collected has a huge impact on the output produced. Hence, raw
data should be gathered from defined and accurate sources so that the
subsequent findings are valid and usable. Raw data can include monetary
figures, website cookies, profit/loss statements of a company, user behavior, etc.
Step 2: Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw
data to remove unnecessary and inaccurate data. Raw data is checked for errors,
duplication, miscalculations or missing data, and transformed into a suitable
form for further analysis and processing. This is done to ensure that only the
highest quality data is fed into the processing unit.
The purpose of this step is to remove bad data (redundant, incomplete, or incorrect
data) and begin assembling high-quality information that can be used effectively
for business intelligence.
Step 3: Input
In this step, the raw data is converted into machine readable form and fed into
the processing unit. This can be in the form of data entry through a keyboard,
scanner or any other input source.
Step 4: Data Processing
In this step, the raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate a desirable
output. This step may vary slightly from process to process depending on the
source of data being processed (data lakes, online databases, connected devices,
etc.) and the intended use of the output.
Step 5: Output
The data is finally transmitted and displayed to the user in a readable form like
graphs, tables, vector files, audio, video, documents, etc. This output can be
stored and further processed in the next data processing cycle.
Step 6: Storage
The last step of the data processing cycle is storage, where data and metadata
are stored for further use. This allows for quick access and retrieval of
information whenever needed, and also allows it to be used as input in the next
data processing cycle directly.
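
To make the cycle concrete, here is a minimal sketch in Python (using the pandas library) of one pass through collection, preparation, processing, output, and storage. The file name, column names, and aggregation are hypothetical, chosen only to illustrate the flow.

```python
import pandas as pd

# Collection / input: read raw data (hypothetical file and columns)
raw = pd.read_csv("sales_raw.csv")          # e.g. columns: date, region, amount

# Preparation: remove duplicates and rows with missing values
clean = raw.drop_duplicates().dropna()

# Processing: aggregate the cleaned data into a usable summary
summary = clean.groupby("region")["amount"].sum().reset_index()

# Output: present the result in a readable form
print(summary)

# Storage: persist the output so it can feed the next cycle
summary.to_csv("sales_summary.csv", index=False)
```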
Now that we have covered what data processing is and how its cycle works, we can
look at the types.
Types of Data Processing
There are different types of data processing based on the source of data and the
steps taken by the processing unit to generate an output. There is no one-size-
fits-all method that can be used for processing raw data.

Types and their uses:

Batch Processing: Data is collected and processed in batches. Used for large amounts of data. Example: payroll systems.

Real-time Processing: Data is processed within seconds of the input being given. Used for small amounts of data. Example: withdrawing money from an ATM.

Online Processing: Data is automatically fed into the CPU as soon as it becomes available. Used for continuous processing of data. Example: barcode scanning.

Multiprocessing: Data is broken down into frames and processed using two or more CPUs within a single computer system. Also known as parallel processing. Example: weather forecasting.

Time-sharing: Allocates computer resources and data in time slots to several users simultaneously.

Data Processing Methods


There are three main data processing methods - manual, mechanical and
electronic.
Manual Data Processing
This data processing method is handled manually. The entire process of data
collection, filtering, sorting, calculation, and other logical operations is carried
out with human intervention, without the use of any electronic devices or
automation software. It is a low-cost method that requires little to no tooling,
but it is error-prone, labor-intensive, time-consuming, and tedious.
Mechanical Data Processing
Data is processed mechanically through the use of devices and machines. These
can include simple devices such as calculators, typewriters, printing press, etc.
Simple data processing operations can be achieved with this method. It produces
far fewer errors than manual data processing, but growing data volumes have made
this method increasingly complex and difficult.
Electronic Data Processing
Data is processed with modern technologies using data processing software and
programs. A set of instructions is given to the software to process the data and
yield output. This method is the most expensive but provides the fastest
processing speeds with the highest reliability and accuracy of output.

1.3. Data Analytics


Data analytics is the collection, transformation, and organization of data in order
to draw conclusions, make predictions, and drive informed decision making.
Data analytics is often confused with data analysis. While these are related
terms, they aren’t exactly the same. In fact, data analysis is a subcategory of
data analytics that deals specifically with extracting meaning from data. Data
analytics, as a whole, includes processes beyond analysis, including data
science (using data to theorize and forecast) and data engineering (building data
systems).

Types of Data Analytics


Data analytics is a broad field. There are four primary types of data analytics:
descriptive, diagnostic, predictive and prescriptive analytics. Each type has a
different goal and place in the data analysis process. These are also the primary
data analytics applications in business.
 Descriptive analytics helps answer the question “What happened?” For a
business, this can be used to describe outcomes to stakeholders. By
developing key performance indicators (KPIs), descriptive analysis
strategies can help track successes or failures. For example, metrics such
as return on investment (ROI) are often used. Specialized metrics can also
be developed to track performance specific to an industry. This process
requires the collection of relevant data, data processing, analysis and
visualization. Together, this can provide essential insight into past
performance.
 Diagnostic analytics helps answer questions about why things happened.
These techniques supplement basic descriptive analytics. They take the
findings from descriptive analytics and dig deeper to find the causes
behind trends and outcomes. Key performance indicators are further
investigated to discover why they improved or worsened. This analysis
could look like the following steps:
o Identify anomalies in the data. These may be unexpected changes
in a metric or a particular market.
o Collect data related to these anomalies.
o Implement statistical techniques to find relationships and trends
that explain the anomalies.
 Predictive analytics helps answer questions about what will happen in the
future. These techniques use historical data to identify trends and
determine if they are likely to reoccur or change. Predictive analytical
tools and techniques include a variety of statistical and machine learning
techniques, such as neural networks, decision trees and regression.
 Prescriptive analytics helps answer questions about what should be done.
By using insights from predictive analytics, data-driven decisions can be
made, even in the face of uncertainty. Prescriptive analytics techniques
rely on machine learning strategies that can find patterns in large datasets.
These types of data analytics provide the insight that businesses need to make
effective and efficient decisions. Used in combination, they provide a well-
rounded understanding of a company’s needs and opportunities.
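As a rough illustration of the descriptive and predictive types, the short Python sketch below computes descriptive statistics for a KPI and then makes a naive linear-trend forecast. The revenue figures are invented, and a real predictive model would be far more involved.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly revenue figures (the KPI being tracked)
revenue = pd.Series([120, 135, 150, 149, 170, 182],
                    index=pd.period_range("2023-01", periods=6, freq="M"))

# Descriptive analytics: what happened?
print(revenue.describe())            # mean, spread, min/max of past performance

# Predictive analytics (very simplified): fit a linear trend and
# extrapolate one period ahead
x = np.arange(len(revenue))
slope, intercept = np.polyfit(x, revenue.values, deg=1)
forecast = slope * len(revenue) + intercept
print(f"Naive forecast for next month: {forecast:.1f}")
```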
The Role of Data Analytics
Data analytics can enhance operations, efficiency, and performance in numerous
industries by shining a spotlight on patterns. Implementing these techniques can
give companies and businesses a competitive edge. Let's take a look at the
process of data analysis divided into four basic steps.
Gathering Data
As the name suggests, this step involves collecting or gathering data and
information from across a broad spectrum of sources. The various forms of
information are then converted into a common format so they can eventually be
analyzed. This step can take more time than any other.
Data Management
Data requires a database to contain, manage, and provide access to the
information that has been gathered. The next step in data analytics is therefore
the creation of such a database to manage the information.
While some people or organizations may store data in Microsoft Excel
spreadsheets, Excel is limited for this purpose and is more a tool for basic
analysis and calculations such as in finance. Relational databases are a much
better option than Excel for data storage. They allow for the storage of much
greater volumes of data, and allow for efficient access. The relational structure
allows for tables to easily be used together. Structured Query Language, known
by its initials SQL, is the computer language used to work with and query
relational databases. Developed in the 1970s, SQL allows for easy interaction
with relational databases, enabling datasets to be built, queried, and analyzed.
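
As a minimal, self-contained illustration of querying a relational database with SQL, the sketch below uses Python's built-in sqlite3 module; the table and column names are made up for the example.

```python
import sqlite3

# In-memory relational database for illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Build a table and load a few rows (hypothetical schema)
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                [("North", 120.0), ("South", 95.5), ("North", 210.25)])
conn.commit()

# Query the data with SQL
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()
```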
Statistical Analysis
The third step is statistical analysis. It involves the interpretation of the gathered
and stored data into models that will hopefully reveal trends that can be used to
interpret future data. This is achieved through open-source programming
languages such as Python. More specific tools for data analytics, like R, can be
used for statistical analysis or graphical modeling.
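A small example of the kind of statistical analysis described here, assuming NumPy is available: it measures the correlation between two invented variables and fits a simple linear trend.

```python
import numpy as np

# Hypothetical observations: advertising spend vs. units sold
spend = np.array([10, 12, 15, 18, 20, 25], dtype=float)
units = np.array([110, 118, 131, 145, 152, 170], dtype=float)

# Strength of the linear relationship between the two variables
r = np.corrcoef(spend, units)[0, 1]

# Simple linear model describing the trend
slope, intercept = np.polyfit(spend, units, deg=1)

print(f"correlation r = {r:.3f}")
print(f"units ~= {slope:.2f} * spend + {intercept:.2f}")
```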
Data Presentation
The results of the data analytics process are meant to be shared. The final step is
formatting the data so it’s accessible to and understandable by others,
particularly those individuals within a company who are responsible for growth,
analysis, efficiency, and operations. Having access can be beneficial to
shareholders as well.
Why Is Data Analytics Important?
Implementing data analytics in the business model means companies can reduce
costs by identifying more efficient ways of doing business. A company can also
use data analytics to make better business decisions.

1.4. Batch processing


Batch processing is a way of processing high-volume data, where data is
collected, stored, and processed in batches, usually at regular intervals or on
demand. A “batch” is a group of events or data points collected within a given
time period, typically hours or days.
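As a rough sketch of the idea, the following Python function processes an accumulated batch of payroll-style records (like the payroll use case listed below) in one run; the file layout and calculation are hypothetical and meant only to show that the whole batch is read and processed together.

```python
import csv
from collections import defaultdict

def run_payroll_batch(path):
    """Process a whole period's worth of timesheet records in one batch run."""
    pay_by_employee = defaultdict(float)

    # The entire accumulated batch is read and processed together
    with open(path, newline="") as f:
        for row in csv.DictReader(f):           # columns: employee, hours, rate
            pay_by_employee[row["employee"]] += float(row["hours"]) * float(row["rate"])

    # Output is produced once, for all events in the batch
    for employee, pay in pay_by_employee.items():
        print(f"{employee}: {pay:.2f}")

# Typically scheduled (e.g. by cron) at the end of the pay period:
# run_payroll_batch("timesheets_2024_05.csv")
```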
A few use cases of batch processing are as follows:
1. Payroll: Payments in payroll systems are disbursed to employees at the
end of the period (usually weekly or monthly). The computation of the
payments to be disbursed is performed periodically and the funds are
procured accordingly. Even reimbursement claims are disbursed in
batches after auditing. Reporting and reconciliation of expenses are
performed in batches at the end of the financial cycle.
2. Inventory Management: Several processes in inventory management for
e-commerce and manufacturing depend on demand forecasting based on
past customer behavior, upcoming high-velocity events like festival
seasons, the day or month of the year, product reviews, geographic location,
etc., to determine the optimal levels of products to be maintained in
warehouses. This requires collecting structured and semi-structured data
from different sources such as the product catalog, customer orders, and
product reviews, and performing analysis or running machine learning
algorithms to generate demand forecasts for the products. Due to the volume
of data and the nature of the analysis, such processing is generally done in batches.
Example Design of Batch Processing

Advantages of batch based processing are as follows:


1. Efficient use of compute: Because the data is processed in bulk, compute is
used efficiently and individual instances do not sit idle during processing.
Batch workloads are typically predictable, so most of the capacity of the
compute instances is fully utilized.
2. Simplicity in terms of infrastructure maintenance: Maintaining the
resources and stack is simpler for batch processing because data does not
flow in real time, so capacity rarely needs to be readjusted. Since processing
happens in batches, fluctuations in data volume are averaged out, and
resource capacity changes are infrequent.
3. Ease and flexibility of generating analytical reports: Batch processing
makes it easy to generate reports at multiple grains (or dimensions). It
also improves the quality of the output data, since multiple updates for
the same event accumulate over a period of time. It likewise enables easier
generation of reports on historical data, which would otherwise require
backfilling and/or re-processing events one by one. For example, if someone
works in transportation finance and needs to create a financial report at the
carrier and country level over a month of data, this can easily be implemented
using batch processing. Furthermore, if the requirements change and another
report (including historical events) is required at the grain of carrier,
country, and business type, no backfill is needed as long as the additional
attribute (here, business type) is available in the historical dataset.
Some limitations of batch based processing are as follows:
1. High processing time and lack of responsiveness: As batch processing
involves processing data in bulk, processing time is always high. The
processing cannot be expected to finish in a few milliseconds or seconds,
nor can a response be expected from a batch process in real time. The
process executes on a group of data points, generating output for all the
events accumulated over the period together.
2. Risk of zero output for partial failure: In batch processing, if an
unhandled error occurs for even one event, the entire process fails, and
the blast radius of such a failure impacts all the events in the batch.
The complete job must then be re-executed to generate the output. This is
a problem if the availability of data is critical within a defined SLA.
3. Larger compute requirement: Because of the bulk nature of the data, batch-based
processing often requires larger compute instances and clusters. Unlike
stream-based processes, it cannot run on a small cluster that is always on,
so high-end server instances must be procured. However, it makes optimal use
of the overall computation power, and because there is less wasted capacity,
compute cost is typically lower in the long term.

1.5. Stream processing


Stream processing is the act of processing an event or a data point as it is being
generated. The processing is done either in real time or in near real time. Stream
processing operates either on individual data points or on small sets of data
points collected over a few seconds or minutes.
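A minimal sketch of this idea in Python is shown below: events are consumed one at a time from a simulated source and a running count is updated immediately, rather than waiting for a batch to accumulate. The event format and counts are invented for illustration (similar in spirit to the social-media use case listed below).

```python
import time
from collections import Counter

def event_stream():
    """Simulated source of events arriving one at a time."""
    for post_id in ["p1", "p2", "p1", "p3", "p1", "p2"]:
        yield {"type": "view", "post": post_id}
        time.sleep(0.1)          # stand-in for real arrival delays

view_counts = Counter()

# Each event is processed as soon as it is generated; the latest
# counts are available in (near) real time after every event.
for event in event_stream():
    view_counts[event["post"]] += 1
    print(dict(view_counts))
```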
A few use cases of stream processing are as follows:
1. Delivery and its Tracking: Delivery and its tracking in transportation
companies are naturally stream-oriented, since each customer places requests
to deliver packages individually. This applies to package delivery and
tracking, among other use cases. Even customer support for issues such as
delivery delays or lost packages operates at the per-request level, not over
a bulk of packages at once.
2. Social Media: Operations like post view counts, like counts, and subscription
counts in social media platforms work in near real time. Hence they are
stream-based processes.
Example Design of Stream Processing

Advantages of stream based processing are as follows:


1. Low processing time: As stream-based processing handles either an individual
event or a small set of events, processing time is always low. We do not
expect processing times of hours or days for such use cases.
2. Availability of latest information: One can expect the output from stream-based
processing in real time or near real time. Also, because events are
processed in near real time, any issue with the processing of an event can be
identified early, as soon as the event is generated. This helps in fixing
issues early as well.
3. Smaller compute requirement: Stream-based processing can be kick-started
with smaller compute instances and cluster sizes, with long-running
processes handling events in real time. Consequently it doesn't require
high-end server instances.
Some limitations of stream based processing are as follows:
1. Handling out-of-order or missing events: Since each event is processed
individually, the system needs to handle dependencies and out-of-order
processing of events, which makes it relatively more complex. This is
because all the dependent events, or earlier versions of the events, may
not be available at the time of processing.
2. Maintenance of queryable state: Handling out-of-order and missing events
typically requires maintaining queryable state for the events in indexed
storage, resulting in higher cost. For example, if a payment event for an
order is ingested before the order event itself, the payment event may need
to be stored in a separate queryable store and matched against the future
order event.
3. Complexity in concurrency and error handling: It is difficult in a stream-based
process to ensure fault tolerance and consistency when concurrent events
arrive from multiple sources. Because processing happens at the transaction
level, incorrect ordering or unavailability of events can cause intermittent
failures for different sources. For example, if a synchronous system X
receives a create request for transaction A from service Y and an update
request for A from service Z, then service Z's updates will fail
intermittently whenever Y has not yet created transaction A before Z makes
its update request. Solving such scenarios may require defining an ordering
of events, optimistic locking, retry strategies that consider the worst
possible cases, etc.
Scenarios for Batch and Stream Processing
Use batch based approach if,
1. Data volume is high: A batch-based approach should only be used when data
is in bulk, i.e. somewhere in the range of tens or hundreds of GBs at least.
If the data volume is only a few GBs, one should rethink before going with
batch processing.
2. Loading full dataset is justified: One should only choose batch-based
processing if there is a genuine need to process or load bulk data of
hundreds of GBs with high compute cost. If the processing can be completed
with a few GBs of data, the data range should not be increased just for the
sake of choosing a batch-based approach.
3. There is no strict SLA and high processing time is acceptable: One should
only choose batch-based processing if there is no strict SLA for data
processing and timely availability of data is not critical. If the output
has to be made available to customers by period X and batch processing risks
missing availability for a significant chunk of the output, then one should
evaluate choosing stream-based processing instead.
Use stream based approach if,
1. Request and response expectation is at event level: Stream-based
processing is recommended if incoming events are readily available
individually and a response is expected for individual events, or for a set
of events aggregated over a small period of time in a micro-batch (say
1 minute or 5 minutes). It is possible to accumulate and segregate data at
the request and response layers and choose batch-based processing, but other
factors must be traded off against the added complexity of
accumulation/segregation.
2. There is strict SLA and low processing time is expected: Stream-based
processing should be chosen if availability of data within a certain
period is critical. For example, if it is critical to provide data to
stakeholders within 1 hour, then each transaction should be processed
individually rather than running a batch process at the end of the hour.
A batch process is problematic here because any failure while processing
the data at the end of the hour would result in no output for stakeholders
and a large blast radius, whereas stream processing has a better chance of
success for individual events.
3. Lookup over indexed data is enough: One should go with stream-based
processing if there is no need to look up over a huge set of unindexed
data; otherwise additional indexes have to be created, which results in
additional cost. Here, unindexed data refers to data that is not stored
against a particular key or set of keys and is not queryable without
iterating over the entire dataset.

Difference between Batch Processing and Stream Processing:

1. Batch processing refers to processing a high volume of data in a batch within a specific time span. Stream processing refers to processing a continuous stream of data immediately as it is produced.
2. Batch processing processes a large volume of data all at once. Stream processing analyzes streaming data in real time.
3. In batch processing, the data size is known and finite. In stream processing, the data size is unknown and infinite in advance.
4. In batch processing, data is processed in multiple passes. In stream processing, data is generally processed in a few passes.
5. A batch processor takes a longer time to process data. A stream processor takes a few seconds or milliseconds to process data.
6. In batch processing, the input graph is static. In stream processing, the input graph is dynamic.
7. In batch processing, data is analyzed on a snapshot. In stream processing, data is analyzed continuously.
8. In batch processing, the response is provided after job completion. In stream processing, the response is provided immediately.
9. Batch examples are distributed programming platforms like MapReduce, Spark, GraphX, etc. Stream examples are programming platforms like Spark Streaming and S4 (Simple Scalable Streaming System).
10. Batch processing is used in payroll and billing systems, food processing systems, etc. Stream processing is used in stock markets, e-commerce transactions, social media, etc.
11. Batch processing processes data in batches or sets, typically stored in a database or file system. Stream processing processes data in real time, as it is generated or received from a source.
12. Batch processing processes data in discrete, finite batches or jobs. Stream processing processes data continuously and incrementally.

1.6. What is data migration?


Data migration is the process of selecting, preparing, and moving existing
data from one computing environment to another. Data may be migrated
between applications, storage systems, databases, data centers, and business
processes.

Each organization’s data migration goals and processes are unique. They must
consider many factors such as costs, timing, technical requirements, impact to
business operations, the potential for data loss, compliance requirements, and
more.

1.6.1. Why do businesses migrate data?

Businesses may undertake data migration projects for a number of reasons, such
as to:

 Reduce media, storage, or other IT equipment costs

 Expand and scale storage capacity

 Improve customers’ website or digital experience

 Centralize and simplify data management

 Accelerate application performance

 Merge data from a company acquisition

 Meet new compliance or security requirements

 Enhance data analytics and reporting capabilities


Consider the following example: when someone buys a new computer, they
usually prefer to install the newest versions of software and copy over only
the most important files from the old one. Carrying over obsolete software and
files would unnecessarily take up storage space and slow down the new
device. Likewise, efficient data migration ensures that the new system uses
correctly cleansed, extracted, and transformed data.

Data migration can be a key enabler of digital transformation — the use of


digital technology to modernize business workloads and processes. It often goes
hand-in-hand with cloud migration — specifically to ensure that no outdated or
corrupt data is migrated to an organization’s new cloud infrastructure.

1.6.2. What are the main types of data migration?

Data centers store files or databases used by software applications, which drive
business processes and workflows. As such, data migration is commonly
categorized into five types:

1. Storage migration transfers data from one storage medium to another.
Organizations may change physical media formats (such as moving from paper
to digital files or hard disk drives), or move from on-premises storage to
cloud storage. Data can also be migrated between one or more cloud storage
systems. After a storage migration, how the data is accessed changes,
although the data itself does not.

2. Application migration shifts software applications from one


computing environment to another. This may include migrating
application programs from an on-premise server to
a cloud environment, between clouds (e.g., from AWS to Microsoft
Azure), or upgrading an application and retiring the old one. Because
every application has a unique data model, the format of the data (as
well as how it is viewed by end users) may change during an
application migration.

3. Business process migration transfers applications or databases (such


as a CRM or ERP platform) that are operated by humans to produce a
service for customers. A business process migration is usually
prompted by a company merger, acquisition, or reorganization.
4. Database migration, also sometimes called schema migration,
moves data between two or more databases. Databases are managed
with database management systems (DBMS) such as Oracle,
MySQL, PostgreSQL, and others, so database migration can mean
moving from one DBMS to another, or upgrading to a newer DBMS
version.

5. Data center migration refers to transferring assets from one data


center to another location or operating environment. A data center
migration is particularly complex because data centers include IT
assets that store, retrieve, distribute, or archive data and applications.
Depending on the organization’s objectives, data center migration
can involve completely changing physical hardware, virtual
machines, or cloud solutions.

1.6.3. What does the data migration process involve?

There is no “one size fits all” process for every type of data migration.
However, a complete data migration plan contains three phases, which then
comprise a number of other components and stages.

1. Pre-migration

2. Migration (“go-live”)

3. Post-migration (test/audit)

Pre-migration (planning/discovery)

Pre-migration is the initial planning phase, which ensures that the migration will
go smoothly and aims to minimize risks. During this phase, the data migration
teams establish project objectives, scope, staffing/resources required, and
critical requirements.

Pre-migration tasks can include (but are not limited to):

 Assessing (profiling) data sources, destinations and formats


 Inspecting data quality, anomalies, or duplications

 Identifying impacted users and potential disruption

 Defining hardware, software, and security requirements

 Determining costs, staff, and data migration tools required

 Setting a migration completion timeline

 Cleansing or reformatting data

 Backing up data and determining what to do with obsolete data

 Deciding on specific approach (described in the next section)

 Creating risk mitigation and stakeholder communication plans

Migration (“go-live”)

Once the plan has been created, the right permissions are secured, and all the
data is ready for migrating to the target system, the actual data migration begins.
The “go live” execution can include:

 Loading the necessary permissions and settings

 Testing the migration with a mirror of the live environment

 Implementing the data migration policies and security rules

 Testing data in the new system to ensure it is accurate

 Fixing problems from migration

There are a few specific strategies for application migrations to the cloud, such
as re-hosting (also called “lift and shift”), re-architecting, re-platforming, and
others.

Post-migration (validation)
Data migration is not complete after “flipping the switch.” The results of the
migration must be audited and validated to make sure everything has been
correctly transferred and logged.

Once the post-migration audit is deemed successful, the old system can be
decommissioned.
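
One common post-migration check is comparing record counts between the source and target systems. The sketch below shows this idea with Python's sqlite3 module; the database paths and table names are placeholders, and a real audit would also verify content (e.g., checksums) and business rules.

```python
import sqlite3

def table_row_count(db_path, table):
    with sqlite3.connect(db_path) as conn:
        # Table names cannot be parameterized, so they come from a trusted list
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def validate_migration(source_db, target_db, tables):
    """Flag any table whose row count differs between source and target."""
    mismatches = []
    for table in tables:
        src = table_row_count(source_db, table)
        tgt = table_row_count(target_db, table)
        if src != tgt:
            mismatches.append((table, src, tgt))
    return mismatches

# Example (placeholder paths and table names):
# print(validate_migration("legacy.db", "new.db", ["customers", "orders"]))
```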

1.7. Transactional data processing


Transactional data processing (TDP) is a type of data processing specifically
focused on managing and processing transactions within information systems.
Transactions are defined as a series of operations performed as a single unit of
work, typically involving the addition, modification, or deletion of data in a
database. TDP ensures that these operations are completed accurately and
consistently, even in the presence of system failures or concurrent transactions.

Key Concepts of Transactional Data Processing

1. Transaction
Definition: A sequence of operations performed as a single logical unit of work. Transactions typically involve reading and writing data to a database.
Properties: A transaction must adhere to the ACID properties to ensure reliability and correctness.

2. ACID Properties
Atomicity: Ensures that all operations in a transaction are completed successfully or none are applied. If one part of the transaction fails, the entire transaction is rolled back.
Consistency: Ensures that a transaction transforms the database from one consistent state to another. All constraints, rules, and data integrity are maintained before and after the transaction.
Isolation: Ensures that the operations of one transaction are isolated from those of other concurrent transactions. Each transaction should operate as if it is the only transaction running.
Durability: Ensures that once a transaction is committed, its changes are permanently saved to the database, even in the event of a system crash.

3. Transaction Management
Commit: Finalizes a transaction, making all changes permanent and visible to other transactions.
Rollback: Reverts all changes made during a transaction if an error occurs, restoring the database to its previous state.
Concurrency Control: Manages the simultaneous execution of transactions to ensure consistency and avoid conflicts. Techniques include locking mechanisms and optimistic concurrency control.

4. Isolation Levels
Read Uncommitted: Allows transactions to read uncommitted changes made by other transactions. This can lead to dirty reads.
Read Committed: Ensures that transactions only read committed changes. This prevents dirty reads but allows non-repeatable reads.
Repeatable Read: Ensures that if a transaction reads a data item, it will see the same value if it reads it again during the transaction. This prevents non-repeatable reads but may allow phantom reads.
Serializable: Provides the highest level of isolation by ensuring that transactions are executed in a serial order, as if they were executed one after another. This prevents dirty reads, non-repeatable reads, and phantom reads.
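
To illustrate commit and rollback in practice, here is a minimal sketch using Python's built-in sqlite3 module. The bank-transfer schema is invented, but the pattern is the atomicity described above: either both updates are committed together, or the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move money between accounts as a single atomic transaction."""
    try:
        cur = conn.cursor()
        cur.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, sender))
        cur.execute("SELECT balance FROM accounts WHERE name = ?", (sender,))
        if cur.fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        cur.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, receiver))
        conn.commit()        # both updates become permanent together
    except Exception:
        conn.rollback()      # neither update is applied
        raise

transfer(conn, "alice", "bob", 30.0)     # succeeds and commits
# transfer(conn, "alice", "bob", 500.0)  # would roll back: insufficient funds
```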

Transactional Data Processing Workflow

1. Transaction Initiation
Activity: A transaction begins when a request is made to perform a series of operations (e.g., placing an order).
Outcome: A new transaction is started and assigned a unique transaction ID.

2. Operation Execution
Activity: Operations such as read, write, update, or delete are performed on the database.
Outcome: Intermediate changes are made to the database, subject to ACID properties.

3. Transaction Commit/Rollback
Commit: If all operations are successful, the transaction is committed, and changes are made permanent.
Rollback: If any operation fails, the transaction is rolled back, and all changes are undone.

4. Concurrency Management
Activity: The system manages concurrent transactions to ensure data integrity and consistency.
Outcome: Techniques like locking or timestamp ordering are used to handle simultaneous transactions.

5. Recovery
Activity: In case of system failures or crashes, recovery mechanisms ensure that committed transactions are preserved and uncommitted transactions are rolled back.
Outcome: The database is restored to a consistent state after recovery.


Applications of Transactional Data Processing

1. Banking Systems: Handling financial transactions such as deposits, withdrawals, and transfers, ensuring consistency and accuracy.
2. E-Commerce: Managing online transactions such as order processing, inventory updates, and payment processing.
3. Reservation Systems: Managing bookings and reservations for airlines, hotels, and other services, ensuring that reservations are accurately recorded and processed.
4. Point-of-Sale Systems: Handling transactions at retail stores, including sales, returns, and inventory updates.
5. Healthcare Systems: Managing patient records, appointment scheduling, and billing, ensuring data accuracy and privacy.

Transactional Data Processing Tools and Technologies

1. Database Management Systems (DBMS)
Examples: Oracle, Microsoft SQL Server, MySQL, PostgreSQL
Features: Provide built-in support for transaction management, ACID properties, and concurrency control.

2. Transaction Processing Monitors (TPMs)
Examples: IBM CICS, BEA Tuxedo
Features: Manage and coordinate transactions across distributed systems, providing high availability and scalability.

3. Distributed Transaction Systems
Examples: Two-Phase Commit Protocol (2PC), XA Transactions
Features: Handle transactions that span multiple databases or systems, ensuring atomicity and consistency across distributed environments.

1.8. What Is Data Mining?


Data mining is the process of searching and analyzing a large batch of raw data in
order to identify patterns and extract useful information.

Companies use data mining software to learn more about their customers. It can
help them to develop more effective marketing strategies, increase sales, and
decrease costs. Data mining relies on effective data collection, warehousing, and
computer processing.

How Data Mining Works


Data mining involves exploring and analyzing large blocks of information to
glean meaningful patterns and trends. It is used in credit risk management, fraud
detection, and spam filtering. It also is a market research tool that helps reveal the
sentiment or opinions of a given group of people. The data mining process breaks
down into four steps:

1. Data is collected and loaded into data warehouses on site or on a cloud


service.
2. Business analysts, management teams, and information technology
professionals access the data and determine how they want to organize it.
3. Custom application software sorts and organizes the data.
4. The end user presents the data in an easy-to-share format, such as a graph
or table.

Data mining techniques


Here are some of the most popular types of data mining:

Association rules: An association rule is an if-then, rule-based method for
finding relationships between variables in a data set. The strength of a
relationship is measured by support and confidence. Confidence reflects how
often the "then" part of the rule holds when the "if" part is true, and support
measures how often the related elements appear together in the data.

These methods are frequently used for market basket analysis, enabling
companies to better understand the relationships between different products,
such as those that are frequently purchased together. Understanding customer
habits enables businesses to develop better cross-selling strategies and
recommendation engines.
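
As a small worked example of support and confidence, the following Python snippet evaluates the rule "if bread then butter" over a handful of invented shopping baskets.

```python
# Hypothetical market-basket data
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# Rule under evaluation: IF bread THEN butter
has_bread = [b for b in baskets if "bread" in b]
has_both = [b for b in has_bread if "butter" in b]

support = len(has_both) / len(baskets)        # how often the items appear together
confidence = len(has_both) / len(has_bread)   # how often 'butter' follows 'bread'

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```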

Classification: Classes of objects are predefined, as needed by the


organization, with definitions of the characteristics that the objects have in
common. This enables the underlying data to be grouped for easier analysis.

For example, a consumer product company might examine its couponing


strategy by reviewing past coupon redemptions together with sales data,
inventory stats and any consumer data on hand to find the best future campaign
strategy.

Clustering: Closely related to classification, clustering reports similarities, but


then also provides more groupings based on differences. Preset classifications
for a soap manufacturer might include detergent, bleach, laundry softener, floor
cleaner and floor wax; while clustering might create groups including laundry
products and floor care.

Decision tree: This data mining technique uses classification or regression


analytics to classify or predict potential outcomes based on a set of decisions.
As the decision tree name suggests, it uses a tree-like visualization to represent
the potential outcomes of these decisions.

K-nearest neighbor (KNN): Also known as the KNN algorithm, K-nearest


neighbor is a nonparametric algorithm that classifies data points based on their
proximity and association to other available data. This algorithm assumes that
similar data points are found near each other. As a result, it seeks to calculate
the distance between data points, usually through Euclidean distance, and then it
assigns a category based on the most frequent category or average.

Neural networks: Primarily used for deep learning algorithms, neural


networks process training data by mimicking the interconnectivity of the human
brain through layers of nodes. Each node is made up of inputs, weights, a bias
(or threshold) and an output.

If that output value exceeds the set threshold, it “fires” or activates the node,
passing data to the next layer in the network. Neural networks learn this
mapping function through supervised learning, making adjustments based on
the loss function through the process of gradient descent. When the cost
function is at or near zero, an organization can be confident in the model’s
accuracy to yield the correct answer.

Predictive analytics: By combining data mining with statistical modeling


techniques and machine learning, historical data can be analyzed by using
predictive analytics to create graphical or mathematical models intended to
identify patterns, forecast future events and outcomes, and identify risks and
opportunities.

Regression analysis: This technique discovers relationships in data by


predicting outcomes based on predetermined variables. This can
include decision trees and multivariate and linear regression. Results can be
prioritized by the closeness of the relationship to help determine what data is
most or least significant. An example would be for a soft drink manufacturer to
estimate the needed inventory of drinks before the arrival of predicted hot
summer weather.
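
As an illustration of one of the techniques above, the sketch below trains a K-nearest neighbor classifier on scikit-learn's built-in iris dataset. This assumes scikit-learn is installed; the dataset and parameters are chosen purely for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small, well-known dataset used purely for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Classify each point by the majority class of its 5 nearest neighbors
# (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(f"test accuracy: {knn.score(X_test, y_test):.2f}")
```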

1.9. Data management strategy


A data management strategy is like a game plan for handling all the information a
company deals with every day.

What is a Data Management Strategy?

A data management strategy is key to how organizations approach their data


assets. It’s more than a set of guidelines; it’s a comprehensive approach
encompassing data collection, storage, and processing, ensuring data quality
and security throughout the data lifecycle.

This strategy involves the thoughtful integration of data management tools, from
data warehousing to advanced data analytics, to transform raw data into actionable
insights.

Effective data management strategy is not just about safeguarding data—it’s about
unlocking the full potential of your data assets. It encompasses everything from
master data management, ensuring uniformity and accuracy, to predictive
analytics, helping forecast trends and behaviors.

With a keen focus on data governance and privacy, it guides how organizations
manage, analyze, and use their data, aligning it with business objectives and
ensuring operational efficiency.
This strategy is a blueprint for managing data in a way that maximizes its business
value. It involves ongoing data management processes and practices, engaging data
management professionals to oversee data integration, data modeling, and data
analysis.

It’s a dynamic and evolving plan, reflecting the latest in data management practices
and technologies, tailored to meet the specific needs of business users and the
company’s data strategy.

An effective data management strategy is the foundation for a robust data


management program. It’s essential in realizing a comprehensive enterprise data
management strategy, facilitating everything from data migration to utilizing data
management platforms.

It ensures that data remains secure, high-quality, and readily available,


transforming data management from a mere operational task to a strategic asset
driving business growth.

Data Management Strategy Benefits

- Enhanced Data Quality and Reliability
- Robust Data Security and Governance
- Streamlined Decision-Making with Data Analytics
- Efficient Data Integration and Utilization

Types of Data Management Systems


There’s a variety of data management systems, each serving different needs. Data
warehouses store processed data, while data lakes keep raw data in its native
format.

Master data management systems ensure consistency across all data sources. And
let’s not forget data management software that aids in integrating, processing, and
analyzing data efficiently.

7 Steps to an Effective Data Management Strategy

1. Identify Business Objectives


Start by setting clear, specific objectives for what your data should achieve. These
goals range from improving operational efficiency to enhancing customer
experiences or driving product innovation.
Consider how data analytics can uncover insights to drive your business forward,
influence your market positioning, or create competitive advantages. Establish
short-term and long-term goals, ensuring they are measurable and aligned with the
business strategy.

2. Assess Data Assets


Evaluate the current state of your data assets. This assessment involves analyzing
the type, quality, relevance, and sources of the data you possess. Understand the
limitations and potentials of your structured and unstructured data.

Identify areas where data quality can be improved, such as accuracy, completeness,
and timeliness. This step is also about recognizing gaps in your data collection and
understanding how these gaps impact your ability to meet your business objectives.

3. Implement Data Governance


Develop a comprehensive data governance framework that establishes control
mechanisms to manage your data effectively. Define clear roles and
responsibilities for data management within your organization, including a
governance team or data stewards.

Set up policies and procedures for data usage, privacy, and security, ensuring they
comply with relevant laws and regulations. Implement standards and metrics to
maintain data quality and consistency across different departments.

4. Invest in Data Management Tools


Select and invest in data management tools and technologies that best fit your
strategic needs. This may include data integration platforms to connect disparate
systems, data analytics software for gaining insights, or CRM systems for
managing customer data.

Prioritize tools with scalability, user-friendliness, and strong integration


capabilities with existing systems. Consider future-proofing your investment by
choosing solutions that adapt to evolving data needs and technologies.

5. Continuously Monitor and Adapt


Adopt a proactive approach to managing your data strategy. Regularly review and
refine your strategy to align with changing business objectives, emerging market
trends, and new technological advancements.

Stay vigilant for changes in data regulations, advancements in data processing


technologies, and shifts in data sources and types. Establish a routine for updating
policies, processes, and tools to keep your data management strategy current and
effective.
6. Develop a Data-Driven Culture
Cultivate a data-driven culture within your organization to ensure the successful
implementation of your data strategy.

Encourage all levels of the organization to understand the value of data and how to
use it in their roles. Offer training and resources to improve data literacy among
employees.

Promote open communication about data findings and encourage a culture where
data-driven insights are valued and used for decision-making. In light of this effort,
it’s notable that recent studies indicate only 23% of executives feel their
organizations have reached a data-driven status, marking a decline from 31% just
four years ago.

This statistic highlights a critical challenge in evolving organizational cultures to


fully embrace and leverage data for strategic decision-making, underscoring the
urgency for businesses to intensify their efforts in fostering a genuinely data-driven
environment.

7. Leverage Data Analytics for Insights


Effectively use data analytics to turn raw data into actionable insights. Analyze
data to identify patterns, trends, and correlations that can inform strategic
decisions. Use predictive analytics to anticipate market changes, customer
behaviors, and potential risks.

Advanced analytics can help uncover hidden opportunities, optimize operations,


and enhance customer experiences. Ensure that the insights derived are effectively
communicated and accessible to decision-makers.

Highlighting the power of timely data utilization, research has shown that the most
successful marketers, those delivering exceptional results, are 50% more likely to
apply analytics data to their campaigns almost in real time.

This approach highlights the importance of data analytics in driving marketing


success and the competitive advantage gained through the immediate application
of insights.

1.10. What Is Data as a Service?


Data as a Service (DaaS) refers to the concept of providing data storage, processing,
analytics, and other data capabilities as an on-demand service.
With data as a service, organizations can access and analyze data on an as-needed basis,
without having to invest in and manage large data infrastructure and platforms themselves.

At its core, DaaS delivers data management and usage as a cloud-based service. Companies
can leverage DaaS solutions to store structured and unstructured data across public, private,
or hybrid cloud environments.

1.10.1. Benefits of Data as a Service

Data as a Service offers several key benefits for organizations looking to leverage data
analytics and insights without the burden of building and maintaining complex data
infrastructure.

Cost Savings

One of the biggest advantages of data as a service is significant cost savings compared to
owning and managing data infrastructure. With DaaS, organizations pay only for the data
they need, when they need it.

They avoid large capital expenditures (CapEx) on hardware, software, maintenance, and IT
staffing required for on-premises data analytics platforms. DaaS allows scaling data usage up
and down on demand.

Scalability and Flexibility

DaaS provides easy scalability to support both short-term projects and fluctuating workloads,
as well as long-term growth. Organizations can start small with a proof of concept, then
quickly scale up as needs expand.

The cloud-based nature of DaaS means capacity can be added or reduced almost instantly.
This scalability and flexibility is difficult to achieve with on-premises systems.

Faster Time to Insights

With data as a service, organizations can gain insights from their data much faster compared
to traditional analytics tools. The time required to set up infrastructure is eliminated.
Subject matter experts can quickly access ready-to-analyze data through intuitive interfaces
rather than waiting on IT and data engineers. This enables faster iterations and reduces time
to uncover impactful insights.

Focus on Core Business

Data as a service alleviates the burden of building and maintaining data infrastructure, freeing
up resources to focus on core business goals. Technical staff are no longer bogged down
supporting complex data pipelines.

Instead, they can dedicate time to high-value activities that drive business outcomes. The
organization avoids the overhead of managing infrastructure and can redirect those efforts
toward differentiating capabilities.

1.10.2. Data as a Service Challenges

Data as a Service provides organizations with many benefits, but it also comes with some
challenges that need to be addressed. Some of the key challenges with DaaS include:

Data Security and Privacy Concerns

One major concern with data as a service is data security and privacy, especially when
sensitive data is involved. Organizations need to carefully evaluate security measures and
privacy policies of any potential DaaS provider.

With data residing outside the organization's own infrastructure, there is always a risk of
unauthorized access or data breaches.

DaaS providers must have robust security in place, including encryption, access controls, and
compliance with regulations. Organizations should conduct risk assessments and due
diligence before adopting DaaS.

Vendor Dependence and Lock-In

Adopting data as a service often means becoming dependent on an external vendor. This can
limit flexibility and control compared to managing data internally. There is also the risk
of vendor lock-in if data and integrations are tied to proprietary platforms.
Migrating large datasets to a new provider is challenging. Organizations should negotiate
appropriate exit strategies with vendors and use open APIs and standard integrations where
possible.

Integration with Legacy Systems

Integrating data as a service with existing on-premises infrastructure and legacy systems can
be difficult. DaaS may use data structures, APIs, and platforms different from current
systems.

IT teams need sufficient resources to build and maintain integrations between DaaS and
internal systems. Adequate bandwidth must be provisioned as large volumes of data may
need to be moved back and forth.

With careful planning and vendor evaluation, organizations can overcome these challenges
and successfully leverage DaaS for analytics, storage, and other use cases. However, the risks
must be addressed upfront through security controls, service agreements, and integration
strategies.
