INTERNSHIP
REPORT
GOOGLE ANDROID DEVELOPER VIRTUAL INTERNSHIP
PRESENTED BY :
K.Shithikanteshwar
K.Shithikam
SUMMER INTERNSHIP
GOOGLE ANDROID DEVELOPER
Under the Guidance of
Mr. Jinka Sreedhar
Submitted
By
K.Shithikanteshwar
ABSTRACT
This extensive program for Android development gives students the tools
they need to design cutting-edge, useful Kotlin applications. Beginning with
the fundamentals of Kotlin and Android Studio, it walks students through
creating and launching their first app. Interactive components, unit testing,
UI state management, and sophisticated features like data classes,
collections, scrollable lists, and Material Design integration are all covered in
the program.
The concepts of app architecture (activities, lifecycles, ViewModels, and StateFlow) are
examined, and Jetpack Compose is introduced for navigation and for adapting screens to
different sizes. The curriculum covers HTTP and REST with Retrofit, image loading with
Coil, and web data integration using Kotlin coroutines. Learners become proficient in
Room, Preferences DataStore, WorkManager for background tasks, SQL basics, and other
local storage solutions. With the help of this well-organized curriculum, students can
create reliable, effective, and eye-catching Android apps.
Table of Contents
Chapter 1: Your First Android App
1.1. Introduction
1.2. Write Kotlin Programs to Display Text and Images
1.3. Download and Install Android Studio
1.4. Build a Basic App with a Simple User Interface
1.5. Run the App on Physical Devices and Emulators
Chapter 2: Building App UI
2.1. Introduction
2.2. Expand on Kotlin Fundamentals to Build Interactive Apps
2.3. Use Conditionals, Function Types, Classes, and Lambda Expressions
2.4. Learn About UI Composition and State Management
2.5. Add Interactive Elements Like Buttons
2.6. Create a Tip Calculator App
2.7. Write Unit Tests for Isolated Functions
Chapter 3: Display Lists and Use Material Design
3.1. Introduction
3.2. Build Apps Displaying Lists of Data
3.3. Enhance Apps with Material Design Principles
3.4. Work with Data Classes, Functions, and Collections
3.5. Create Scrollable Lists with Interactive Elements
3.6. Learn Material Design Principles for Modern UIs
Chapter 4: Navigation and App Architecture
4.1. Introduction
4.2. Learn Best Practices for App Architecture
4.3. Understand Activities, Lifecycles, StateFlow, and ViewModels
4.4. Focus on Responsive UI Design
4.5. Set Up Navigation Using Jetpack Compose
4.6. Adapt Apps for Various Screen Sizes
Chapter 5: Connect to the Internet
5.1. Introduction
5.2. Use Kotlin Coroutines for Concurrency
5.3. Learn HTTP and REST with Retrofit
5.4. Implement a Repository Pattern for Centralized Data Access
5.5. Use Coil to Load and Display Images
5.6. Apply Dependency Injection for Scalable Code
Chapter 6: Data Persistence
6.1. Introduction
6.2. Learn Local Data Storage and Persistence
6.3. Understand SQL Basics
6.4. Use Room Library for Database Management
6.5. Debug with Database Inspector
6.6. Store User Preferences with Preference DataStore
Chapter 7: WorkManager
7.1. Introduction
7.2. Explore WorkManager API for Background Tasks
7.3. Create a Worker Object and Enqueue Work
7.4. Create Constraints on WorkRequests
7.5. Use the Background Task Inspector to Inspect and Debug WorkManager
Chapter 8: Views and Compose
8.1. Introduction
8.2. Learn to Use Compose and the Older UI Toolkit Based on Views Side-by-Side
8.3. Understand the View-Based UI Toolkit and Build App UI Using XML
8.4. Add a Composable in an App Built with Views
8.5. Add Navigation Component to the App and Navigate Between Fragments
8.6. Use AndroidView to Display Views
8.7. Add Existing View-Based UI Components in a Compose App
References
Unit 1: My First Android App
Overview
In this unit, I began my journey into the world of Android app development. The unit was divided
into three pathways, each focusing on different aspects of creating an Android app. By the end of
this unit, I had built a simple Android application that displays text and images and could run it
on an Android device or emulator.
Pathways
This unit was structured into three key pathways:
1. Programming Basics with Kotlin
2. Setting Up Android Studio
3. Building and Running My First App
Learning Objectives
The main goals I aimed to achieve in this unit were:
1. Learn Programming Basics
I understood the fundamental principles of programming.
I wrote simple programs in Kotlin that output text, providing a foundation for more
complex app development.
2. Download and Install Android Studio
I set up Android Studio, the official Integrated Development Environment (IDE) for
Android development.
I familiarized myself with the interface and essential tools available in Android
Studio.
3. Build an Android App with a Simple User Interface
I created a new Android project in Android Studio.
I designed a user interface that included text and images.
I used XML to define UI components and Kotlin to handle their behavior.
4. Run the App on a Device or Emulator
I tested my application on an Android device or emulator.
I debugged and refined my app to ensure it functioned correctly and met user
expectations.
Pathway Details
Pathway 1: Programming Basics with Kotlin
Introduction to Kotlin:
I learned the basics of Kotlin, the preferred programming language for Android
development.
I understood the syntax and structure of Kotlin programs.
Writing and Executing Simple Kotlin Programs:
I practiced writing simple programs that displayed text output.
I became familiar with the Kotlin development environment.
Understanding Variables, Data Types, and Control Structures:
I learned how to declare and use variables.
I explored different data types in Kotlin.
I understood and implemented control structures such as loops and conditionals.
Pathway 2: Setting Up Android Studio
Downloading and Installing Android Studio:
I followed step-by-step instructions for downloading Android Studio from the official
website.
I installed Android Studio on my operating system (Windows, macOS, or Linux).
Overview of Android Studio Interface:
I learned about the key components of the Android Studio interface.
I understood the important features and tools within the IDE.
Creating and Configuring My First Android Project:
I created a new Android project.
I configured project settings and understood the project structure.
Pathway 3: Building and Running My First App
Designing the User Interface Using XML:
I learned about XML and its role in Android UI design.
I created a simple user interface that included TextView and ImageView
components.
Adding and Configuring UI Components:
I added TextView and ImageView components to my layout.
I configured the properties of these components to achieve the desired design.
Writing Kotlin Code to Interact with UI Components:
I wrote Kotlin code to handle user interactions and update the UI.
I implemented simple event handlers and data binding techniques.
Running the App on an Android Device or Emulator:
I set up and used an Android emulator.
I connected a physical Android device for app testing.
I debugged and troubleshot common issues.
Practical Application
Throughout this unit, I engaged in hands-on exercises and projects that reinforced the concepts
and skills I learned. By the end of the unit, I had:
Developed a solid understanding of basic programming principles.
Gained practical experience with the Kotlin programming language.
Successfully set up and navigated the Android Studio development environment.
Built a functional Android app with a simple user interface.
Tested and debugged the app on an Android device or emulator.
CHAPTER 2 - ANOMALY DETECTION IN UNSUPERVISED LEARNING
As we know, unsupervised learning is used in many ways and is crucial in real life, such as in
anomaly detection. Anomaly detection involves identifying rare items, events, or
observations that deviate significantly from the majority of the data. This is particularly
important in various domains, including fraud detection, network security, fault detection in
manufacturing, and health monitoring. Unlike supervised learning, where the model is
trained on labeled data, unsupervised learning does not rely on predefined labels and
instead focuses on uncovering the underlying structure of the data.
2.1 Introduction -
Anomaly detection, a critical aspect of data analysis, refers to the identification of rare
items, events, or observations that deviate significantly from the majority of the data. This
process is essential in various domains, such as fraud detection, network security, fault
detection, and medical diagnosis, where identifying outliers can indicate critical issues or
rare events.
Unsupervised learning is particularly suited for anomaly detection due to its ability to
identify patterns in data without requiring labeled examples. Unlike supervised learning,
where the algorithm is trained on a dataset with known outcomes, unsupervised learning
deals with unlabeled data, making it ideal for scenarios where anomalies are rare and labels
are scarce or non-existent.
Autoencoders, a type of neural network, consist of an encoder and a decoder. The encoder
compresses the input data into a lower-dimensional latent space, while the decoder
reconstructs the original data from this compressed representation. By training on normal
data, autoencoders learn to accurately reconstruct it. Anomalies, which differ from the
normal patterns, are poorly reconstructed, resulting in higher reconstruction errors. This
characteristic makes autoencoders a powerful tool for anomaly detection in various
applications, including image processing, network traffic analysis, and industrial equipment
monitoring.
2.2 Anomaly Detection Use Cases
1. Fraud Detection
• Financial Transactions: Identify suspicious transactions in credit card usage, wire
transfers, and banking activities.
• Insurance Claims: Detect fraudulent insurance claims by identifying patterns that
deviate from normal claim behavior.
2. Network Security
• Intrusion Detection Systems (IDS): Monitor network traffic for unusual patterns that
may indicate security breaches or cyber-attacks.
• Endpoint Security: Identify anomalous behavior on devices that could signify malware
or unauthorized access.
3. Manufacturing
• Predictive Maintenance: Detect early signs of equipment failure by monitoring sensor
data for deviations from normal operating conditions.
• Quality Control: Identify defects in production processes by spotting anomalies in
production line data.
4. Healthcare
• Patient Monitoring: Detect irregularities in patient vitals or lab results that could
indicate medical conditions or emergencies.
• Medical Imaging: Identify abnormal patterns in radiology images, such as tumors or
lesions.
5. Finance
• Market Analysis: Spot unusual trading patterns or stock price movements that could
indicate insider trading or market manipulation.
• Risk Management: Identify anomalous trends in market data that may affect
investment portfolios.
6. Retail
• Customer Behavior Analysis: Detect unusual purchasing patterns that might indicate
fraud or shifts in consumer trends.
• Inventory Management: Identify discrepancies in inventory levels that could signify
theft or stock misplacement.
7. Energy and Utilities
• Smart Grid Monitoring: Detect anomalies in power usage data that might indicate
issues like power theft or faults in the distribution network.
• Oil and Gas: Identify irregularities in drilling data that could point to potential
equipment failures or operational inefficiencies.
8. Transportation
• Fleet Management: Monitor vehicle performance and driver behavior to detect
anomalies that could suggest maintenance needs or unsafe driving practices.
• Public Transit: Identify unusual patterns in passenger data to optimize routes and
schedules.
9. Telecommunications
• Call Detail Records (CDR) Analysis: Detect unusual calling patterns that might indicate
fraud or misuse of services.
• Network Performance Monitoring: Identify network performance issues by spotting
deviations from expected traffic patterns.
10. Environmental Monitoring
• Climate Data Analysis: Detect unusual patterns in weather data that could indicate
climate anomalies or environmental changes.
• Pollution Monitoring: Identify abnormal levels of pollutants that might indicate
environmental hazards or violations of regulations.
Anomaly detection plays a crucial role across various industries by identifying irregular
patterns that could signify important, often critical, events. By leveraging advanced analytics
and machine learning, organizations can proactively address potential issues, enhance
security, and improve operational efficiency.
Finding anomalies using a dataset –
The dataset contains information about various countries, including socio-economic and
health indicators such as GDP per capita, life expectancy, child mortality rates, access to
clean water, access to sanitation, and literacy rates. The task is to find anomalous countries in it.
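The notebook output for this step is not reproduced in this extract. As a minimal sketch of the idea (the file name country_data.csv and the 5% contamination value are assumptions, not values from the report), an Isolation Forest can be used to flag outlying countries:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("country_data.csv")            # hypothetical file with the indicators described above
features = df.select_dtypes("number")           # keep only the numeric indicators

X = StandardScaler().fit_transform(features)    # put all indicators on a comparable scale

iso = IsolationForest(contamination=0.05, random_state=42)   # assumed outlier fraction
df["anomaly"] = iso.fit_predict(X)              # -1 = anomaly, 1 = normal

print(df[df["anomaly"] == -1])                  # countries flagged as anomalous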
2.3 Anomaly Detection Approaches
2.3.1. Statistical Methods
Statistical methods rely on the assumption that data follows a
known distribution. Anomalies are detected by identifying data points that significantly
deviate from this distribution. Common statistical techniques include Z-score, which
measures how many standard deviations a data point is from the mean, and Grubbs' test,
which detects outliers in a univariate dataset. These methods are relatively simple to
implement and interpret, making them suitable for datasets with a clear and stable
distribution pattern.
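As a small illustration of the Z-score idea described above (the 3-sigma threshold and the synthetic data are assumptions for the example):

import numpy as np

def zscore_outliers(x, threshold=3.0):
    # flag points more than `threshold` standard deviations from the mean
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.where(np.abs(z) > threshold)[0]

data = np.append(np.random.normal(50, 5, 200), [95.0])   # 200 normal points plus one injected outlier
print(zscore_outliers(data))                             # index of the injected outlier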
2.3.2. Machine Learning-Based Methods
Machine learning approaches for anomaly
detection can be broadly categorized into supervised, unsupervised, and semi-supervised
methods. Supervised methods require labeled training data with known anomalies, which
can be used to train models like decision trees, support vector machines, or neural networks
to recognize similar patterns in new data. Unsupervised methods, such as clustering (e.g., K-
means) and dimensionality reduction (e.g., PCA), do not require labeled data and are useful
when anomalies are not well-defined. Semi-supervised methods combine both approaches,
using a small amount of labeled data to improve the performance of unsupervised
techniques. Machine learning methods are powerful and flexible, capable of handling
complex and high-dimensional data.
2.3.3. Proximity-Based Methods
Proximity-based methods detect anomalies by measuring
the distance between data points. Techniques like K-nearest neighbors (KNN) identify
anomalies as points that are far from their nearest neighbors. Density-based methods, such
as DBSCAN, identify clusters of varying density and consider points in low-density regions as
anomalies. These methods are effective for datasets where normal points form dense
clusters, and anomalies are isolated.
2.3.4. Time Series Analysis
Time series analysis is particularly valuable for anomaly
detection in data that is indexed over time. This approach focuses on identifying deviations
from expected temporal patterns. Techniques like Autoregressive Integrated Moving
Average (ARIMA), Seasonal Decomposition of Time Series (STL), and Exponential Smoothing
State Space Model (ETS) are commonly used to model time series data. Time series analysis
can capture trends, seasonality, and cyclic behavior, making it ideal for applications such as
financial market analysis, network monitoring, and predictive maintenance.
Choosing time series analysis for anomaly detection offers several advantages. Firstly, it
accounts for temporal dependencies and trends, which are often crucial for accurately
identifying anomalies in sequential data. For instance, a sudden spike in network traffic
might be normal during business hours but unusual at midnight. Time series models can
differentiate between such contextually normal variations and genuine anomalies. Secondly,
time series analysis can handle seasonality effectively, identifying patterns that recur at
regular intervals, such as daily, weekly, or yearly cycles. This is essential for distinguishing
between regular fluctuations and true anomalies. Lastly, time series models can provide
predictive insights, enabling proactive measures before anomalies lead to significant issues.
CHAPTER 3 - ANOMALY DETECTION IN TIME SERIES DATA
3.1 INTRODUCTION – Anomaly detection in time series data involves identifying unusual
patterns, outliers, or deviations from normal behavior within a sequence of data points
indexed by time. Time series data is prevalent in various domains, including finance,
healthcare, manufacturing, telecommunications, and environmental monitoring. Detecting
anomalies in such data is crucial for early identification of irregularities, which could signify
critical events or issues requiring attention. Here's a detailed explanation of anomaly
detection in time series:
1. Understanding Time Series Data:
• Time series data consists of observations recorded over time, typically at regular
intervals. Examples include stock prices, sensor readings, weather measurements,
and patient vitals.
• Time series data often exhibits certain characteristics such as trends (long-term
changes), seasonality (regular patterns that repeat over fixed intervals), and noise
(random fluctuations).
• Anomalies in time series data can manifest as sudden spikes, drops, shifts, or irregular
patterns that deviate significantly from the expected behavior.
2. Approaches to Anomaly Detection:
• Threshold-based Methods: Define thresholds based on historical data or statistical
properties (e.g., mean, standard deviation) and flag data points that fall outside these
thresholds as anomalies.
• Statistical Models: Use statistical techniques such as time series decomposition,
autoregressive models, or moving averages to model the underlying patterns in the
data and identify deviations from the expected behavior.
• Machine Learning Models: Employ supervised, unsupervised, or semi-supervised
machine learning algorithms to learn patterns from historical data and detect
anomalies. Common techniques include support vector machines, isolation forests,
and recurrent neural networks.
3. Preprocessing:
• Preprocessing steps may include cleaning the data to handle missing values,
smoothing to remove noise, and normalization to scale the data appropriately.
• Time series decomposition techniques such as seasonal decomposition of time series
(STL) or moving averages can help separate the data into trend, seasonal, and residual
components, facilitating anomaly detection.
4. Feature Engineering:
• Extract relevant features from the time series data that capture important aspects of
the underlying patterns, such as trend, seasonality, and cyclic behavior.
• Additional features such as lagged values, moving averages, or difference
transformations can provide valuable information for anomaly detection models.
5. Model Training and Evaluation:
• Train the anomaly detection model using historical time series data, ensuring that the
model captures both normal and anomalous patterns.
• Evaluate the model's performance using appropriate metrics such as precision, recall,
F1-score, or area under the receiver operating characteristic curve (AUC-ROC).
• Consider cross-validation techniques to assess the model's generalization ability and
robustness to unseen data.
6. Interpretation and Post-processing:
• Interpret the results of the anomaly detection model, distinguishing between true
anomalies and false positives.
• Post-processing techniques such as filtering, aggregation, or contextual analysis can
help refine the detected anomalies and provide actionable insights to stakeholders.
• Incorporate human expertise and domain knowledge to validate and interpret the
detected anomalies, ensuring that appropriate actions are taken in response to
identified irregularities.
7. Continuous Monitoring and Adaptation:
• Deploy the anomaly detection model in a real-time or near-real-time environment for
continuous monitoring of incoming time series data.
• Periodically retrain the model using updated data to adapt to changes in the
underlying patterns and maintain its effectiveness over time.
Anomaly detection in time series data is a multifaceted task that requires a combination of
domain knowledge, statistical techniques, and machine learning algorithms. By effectively
leveraging these approaches, organizations can enhance their ability to detect and respond
to anomalous events, thereby improving operational efficiency, minimizing risks, and
ensuring the reliability of their systems and processes.
Finding anomalies using a time series dataset –
The "daily-minimum-temperatures" dataset comprises daily minimum temperature records
for a specific location. Each row corresponds to a single day and includes the following
columns: Date (YYYY-MM-DD) and Min Temp (°C).
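The notebook code for this analysis is not included in this extract. A minimal sketch using a rolling Z-score (the file name, column names, 30-day window, and 3-sigma threshold are assumptions):

import pandas as pd

df = pd.read_csv("daily-minimum-temperatures.csv", parse_dates=["Date"], index_col="Date")
temps = df["Min Temp"]

# compare each day with the mean and spread of its surrounding 30-day window
rolling_mean = temps.rolling(30, center=True).mean()
rolling_std = temps.rolling(30, center=True).std()
z = (temps - rolling_mean) / rolling_std

anomalies = temps[z.abs() > 3]      # days that deviate strongly from their local behaviour
print(anomalies)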
3.2 Types of anomalies in time-series data
As the figure above shows, outliers in time series can have two different meanings. The
semantic distinction between them is mainly based on your interest as the analyst, or the
particular scenario.
These observations have been related to noise, erroneous or unwanted data, which by itself
isn’t interesting to the analyst. In these cases, outliers should be deleted or corrected to
improve data quality, and generate a cleaner dataset that can be used by other data mining
algorithms. For example, sensor transmission errors are eliminated to obtain more accurate
predictions, because the main goal is to make predictions.
Nevertheless, in recent years – especially in the area of time series data – many researchers
have aimed to detect and analyze unusual, but interesting phenomena. Fraud detection is a
good example – the main objective is to detect and analyze the outlier itself. These
observations are often referred to as anomalies.
The anomaly detection problem for time series is usually formulated as identifying outlier
data points relative to some norm or usual signal. Take a look at some outlier types:
Let’s break this down one by one:
3.2.1 Point outlier
A point outlier is a datum that behaves unusually in a specific time instance when compared
either to the other values in the time series (global outlier), or to its neighboring points
(local outlier).
Example: are you aware of the Gamestop frenzy? A slew of young retail investors bought
GME stock to get back at big hedge funds, driving the stock price way up. That sudden,
short-lived spike that occurred due to an unlikely event is an additive (point) outlier. The
unexpected growth of a time-based value in a short period (looks like a sudden spike) comes
under additive outliers.
Point outliers can be univariate or multivariate, depending on whether they affect one or
more time-dependent variables, respectively.
Fig. 1a contains two univariate point outliers, O1 and O2, whereas the multivariate time
series in Fig. 1b is composed of three variables and has both univariate (O3) and
multivariate (O1 and O2) point outliers.
Fig. 1: Point outliers in time series data.
We will take a deeper look at Univariate Point Outliers in the Anomaly Detection section.
3.2.2 Subsequence outlier
This means consecutive points in time whose joint behavior is unusual, although each
observation individually is not necessarily a point outlier. Subsequence outliers can also be
global or local, and can affect one (univariate subsequence outlier) or more (multivariate
subsequence outlier) time-dependent variables.
Fig. 2 provides an example of univariate (O1 and O2 in Fig. 2a, and O3 in Fig. 2b) and
multivariate (O1 and O2 in Fig. 2b) subsequence outliers. Note that the latter does not
necessarily affect all the variables (e.g., O2 in Fig. 2b).
Fig. 2: Subsequence outliers in time series data.
3.3 Anomaly detection techniques in time series data
3.3.1 Anomaly Detection with Autoencoders
Introduction: Autoencoders are unsupervised learning models that aim to reconstruct input
data while learning a compact representation (latent space) of the data's features. This
makes them suitable for anomaly detection, as anomalies often deviate significantly from
the normal patterns present in the data.
Dimensionality Reduction and Outlier Detection: In anomaly detection, dimensionality
reduction serves as a means to reveal outliers. Despite potentially losing some information
during dimensionality reduction, the main patterns in the data are retained. Outliers, being
extreme deviations from these patterns, become more apparent once the data is projected
into a lower-dimensional space.
Why Autoencoders? Autoencoders offer advantages over traditional techniques like
Principal Component Analysis (PCA) due to their ability to capture nonlinear relationships
within the data. While PCA relies on linear transformations, autoencoders leverage
nonlinear activation functions and multiple layers to perform more complex
transformations. This makes autoencoders more adept at handling complex and non-linear
data problems.
Implementation with PyOD: PyOD is a Python module that simplifies the implementation of
anomaly detection techniques, including autoencoders. Below is a step-by-step guide to
implementing anomaly detection using autoencoders with PyOD.
1. Data Generation: Generate synthetic data using the generate_data() function from
PyOD. This function creates a dataset with a specified number of features, observations,
and a given percentage of outliers.
2. Visualization with PCA: Use Principal Component Analysis (PCA) to reduce the
dimensionality of the data to two dimensions for visualization purposes.
3. Model Building: Construct an autoencoder model specifying the architecture of the
network, including the number of layers and neurons per layer.
4. Anomaly Detection: Predict anomaly scores for the test data using the trained
autoencoder model.
5. Summary Statistics: Assign outliers based on a chosen threshold and compute
summary statistics for each cluster.
Autoencoders offer a powerful approach to anomaly detection, especially in scenarios
involving complex and non-linear data. By leveraging their ability to learn latent
representations, autoencoders can effectively identify anomalies in high-dimensional
datasets, providing valuable insights for various applications.
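A minimal sketch of the data generation, model fitting, and scoring steps is given below (the PCA visualisation step is omitted). The data sizes and contamination fraction are assumptions, and constructor arguments vary slightly between PyOD releases, so only the shared parameters are set:

from pyod.models.auto_encoder import AutoEncoder
from pyod.utils.data import generate_data

# 1. synthetic data: 300 train / 100 test points, 20 features, 5% outliers (assumed values)
X_train, X_test, y_train, y_test = generate_data(
    n_train=300, n_test=100, n_features=20, contamination=0.05, random_state=42)

# 3. build and train the autoencoder with its default architecture
clf = AutoEncoder(contamination=0.05)
clf.fit(X_train)

# 4. anomaly scores and binary labels for the test data
scores = clf.decision_function(X_test)   # higher score = more anomalous
labels = clf.predict(X_test)             # 1 = outlier, 0 = inlier

# 5. simple summary
print("flagged outliers:", int(labels.sum()), "of", len(labels))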
3.3.2 Anomaly Detection with STL decomposition
STL stands for seasonal-trend decomposition procedure based on LOESS. This technique
gives you the ability to split your time series signal into three parts: seasonal, trend, and
residue.
It works for seasonal time series, which is also the most common type of time series data. To
generate an STL-decomposition plot, we just use statsmodels to do the heavy lifting for us:

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

plt.rc('figure', figsize=(12, 8))
plt.rc('font', size=15)

# lim_catfish_sales is the monthly catfish sales series loaded earlier (not shown in this extract)
result = seasonal_decompose(lim_catfish_sales, model='additive')
fig = result.plot()
This is Catfish sales data from 1996–2000 with an anomaly introduced in Dec-1998
If we analyze the deviation of residue and introduce some threshold for it, we’ll get an
anomaly detection algorithm. To implement this, we only need the residue data from the
decomposition.
import matplotlib.dates as mdates

plt.rc('figure', figsize=(12, 6))
plt.rc('font', size=15)
fig, ax = plt.subplots()
x = result.resid.index
y = result.resid.values
ax.plot_date(x, y, color='black', linestyle='--')
# annotate the point where the anomaly was introduced (index 35, i.e. Dec-1998)
ax.annotate('Anomaly', (mdates.date2num(x[35]), y[35]), xytext=(30, 20),
            textcoords='offset points', color='red',
            arrowprops=dict(facecolor='red', arrowstyle='fancy'))
fig.autofmt_xdate()
plt.show()
Residue from the above STL decomposition
Pros
It’s simple, robust, it can handle a lot of different situations, and all anomalies can still be
intuitively interpreted.
Cons
The biggest downside of this technique is its rigid tuning options: apart from the threshold
and perhaps the confidence interval, there is not much you can adjust. For example, suppose
you are tracking users on a website that was closed to the public and was then suddenly
opened. In that case, you should track anomalies that occur before and after the launch
separately.
3.3.3 DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clever way to find
unusual or outlier data points in a group of data. Imagine you have a bunch of points on a
map, and you want to find the weird ones that don’t really fit into any group.
Here’s how DBSCAN works:
Step 1: Select a starting point
• Begin by randomly selecting a data point from your dataset.
Step 2: Define a radius (Epsilon) and a minimum number of points (Min_Samples)
Specify two important values:
• Epsilon (a radius around the selected point).
• Min_Samples (the minimum number of data points that should be within this radius
to form a cluster)
Step 3: Check neighboring points
• Examine all data points within the defined radius (Epsilon) around the selected point.
Step 4: Form a cluster
• If there are at least as many data points within the Epsilon radius as specified by
Min_Samples, consider the selected point and these nearby points as a cluster.
Step 5: Expand the cluster
• Now, for each point within this newly formed cluster, repeat the process. Check for
nearby points within the Epsilon radius.
• If additional points are found, add them to the cluster. This process continues
iteratively, expanding the cluster until no more points can be added.
Step 6: Identify outliers (noise)
• Any data points that are not included in any cluster after the expansion process are
labeled as outliers or noise. These points do not belong to any cluster.
Imagine you have a field with a bunch of people scattered around, and you want to organize
a game of tag. Some people are standing close together, and others are standing alone.
DBSCAN helps you identify two things:
1. Groups of Players: It starts by picking a person, any person, and puts an imaginary
hula hoop around them (this is like setting a maximum distance). Now, it checks how
many other people are inside that hula hoop. If there are enough (more than a
certain number you decide in advance), it forms a group. This group is like a team of
players playing tag.
2. Lonely Players: After forming that group, it picks a person within that group, puts a
hula hoop around them, and checks if there are more people inside. If yes, it adds
them to the group. This process continues until there are no more people to add to
that group.
Now, here’s the cool part: Anyone who doesn’t end up in a group is the outlier or the
“lonely player.” These are the people who don’t belong to any team, or in data terms, they
are the outliers.
To apply DBSCAN for outlier detection in Python using Scikit-Learn, we begin by importing the
necessary libraries and modules, as follows:
Step 1: Import necessary libraries
• The code starts by importing the required Python libraries, including NumPy for
numerical operations, Matplotlib for data visualization, and the DBSCAN class from
scikit-learn for implementing the DBSCAN algorithm.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
Step 2: Create a synthetic dataset
# Create a synthetic dataset with normal and anomalous data points
n_samples = 300
X, y = make_blobs(n_samples=n_samples, centers=2, random_state=42, cluster_std=1.0)
anomalies = np.array([[5, 5], [6, 6], [7, 7]])
• In this step, a synthetic dataset is generated to illustrate the concept. The dataset is
created using the make_blobs function, producing two clusters of data points with
some isolated anomalies.
• n_samples determines the total number of data points, and the centers parameter
specifies the number of clusters (2, in this case).
• The anomalies variable is an array of manually created anomalous data points.
Step 3: Combine normal and anomalous data
# Combine the normal data and anomalies
X = np.vstack([X, anomalies])
• The normal data and anomalies are combined into a single dataset represented by
the X array using np.vstack.
Step 4: Visualize the dataset
# Visualize the dataset
plt.scatter(X[:, 0], X[:, 1], c='b', marker='o', s=25)
plt.title("Synthetic Dataset")
plt.show()
• The code plots the dataset to provide a visual representation. It uses Matplotlib to
create a scatter plot, where normal data points are marked in blue circles.
• The resulting plot visually shows the two clusters together with a few isolated points,
which are the manually added anomalies.
Step 5: Apply DBSCAN for anomaly detection
# Apply DBSCAN for anomaly detection with increased epsilon
dbscan = DBSCAN(eps=1, min_samples=41)  # increased eps
labels = dbscan.fit_predict(X)

# Anomalies are considered as points with label -1
anomalies = X[labels == -1]
• DBSCAN is applied for anomaly detection using the DBSCAN class from scikit-learn.
The parameters eps (epsilon) and min_samples control the algorithm's behavior.
• The eps parameter sets the radius within which points are considered neighbors.
• The min_samples parameter defines the minimum number of points required to form
a cluster.
• The code then fits the DBSCAN model to the dataset using fit_predict to obtain
cluster labels for each data point.
Step 6: Identify anomalies
# Anomalies are considered as points with label -1
anomalies = X[labels == -1]
• Anomalies are identified by finding data points labeled as -1. These points do not
belong to any cluster and are considered outliers or anomalies.
Step 7: Visualize the anomalies
# Visualize the anomalies
plt.scatter(X[:, 0], X[:, 1], c='b', marker='o', s=25)
plt.scatter(anomalies[:, 0], anomalies[:, 1], c='r', marker='x', s=50, label='Anomalies')
plt.title("Anomaly Detection with DBSCAN (Anomalies Outside Clusters)")
plt.legend()
plt.show()
• The code plots the anomalies found by DBSCAN in red crosses on top of the original
data points.
• This visualization helps to highlight the anomalies detected by the algorithm.
Step 8: Print the identified anomalies
# Print the identified anomalies
print("Identified Anomalies:")
print(anomalies)
• The code concludes by printing the coordinates of the identified anomalies, allowing
you to see the specific data points classified as anomalies by the DBSCAN algorithm.
By following these steps, you can effectively identify anomalies with DBSCAN and visualize
the results.
Conclusion
DBSCAN is a valuable tool for anomaly detection, offering a data-driven approach to
uncovering outliers in complex datasets. By following the step-by-step guide and code
above, you can integrate DBSCAN into your own data analysis projects, enhance your
anomaly detection capabilities, and make more informed decisions based on the unique
insights that outliers can provide.
3.4 Support vector machine in Machine Learning
Support Vector Machines: A support vector machine (SVM) is a supervised learning model
used for classification and regression problems. It is widely favored because it achieves
notable accuracy with relatively little computational power, and it is used mostly for
classification. Of the three broad types of learning (supervised, unsupervised, and
reinforcement learning), SVMs belong to the supervised family. An SVM is a discriminative
classifier formally defined by a separating hyperplane: given labeled training data, the
algorithm outputs an optimal hyperplane that categorizes new examples. In two-dimensional
space, this hyperplane is a line splitting the plane into two parts, with each class lying
on either side. The aim of the SVM algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points.
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used
for classification and regression tasks. The main idea behind SVM is to find the best
boundary (or hyperplane) that separates the data into different classes.
In the case of classification, an SVM algorithm finds the best boundary that separates the
data into different classes. The boundary is chosen in such a way that it maximizes the
margin, which is the distance between the boundary and the closest data points from each
class. These closest data points are called support vectors.
SVMs can also be used for non-linear classification by using a technique called the kernel
trick. The kernel trick maps the input data into a higher-dimensional space where the data
becomes linearly separable. Common kernels include the radial basis function (RBF) and the
polynomial kernel.
SVMs can also be used for regression tasks by allowing for some of the data points to be
within the margin, rather than on the boundary. This allows for a more flexible boundary
and can lead to better predictions.
SVMs have several advantages, such as the ability to handle high-dimensional data and the
ability to perform well with small datasets. They also have the ability to model non-linear
decision boundaries, which can be very useful in many applications. However, SVMs can be
sensitive to the choice of kernel, and they can be computationally expensive when the
dataset is large.
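For anomaly detection specifically, the one-class variant of the SVM learns a boundary around normal data and flags points that fall outside it. A minimal scikit-learn sketch with synthetic data (the kernel, gamma, and nu values are assumptions):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(200, 2)                      # "normal" points clustered near the origin
X_test = np.vstack([0.3 * rng.randn(20, 2),            # more normal points
                    rng.uniform(-4, 4, size=(5, 2))])  # a few scattered outliers

clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05)    # nu bounds the expected outlier fraction
clf.fit(X_train)

pred = clf.predict(X_test)                             # +1 = inlier, -1 = outlier
print("outliers at indices:", np.where(pred == -1)[0])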
CHAPTER 4 - MULTIVARIATE TIMESERIES ANALYSIS
4.1 INTRODUCTION - A novel deep neural network model is proposed for multivariate time
series anomaly detection, integrating convolutional and LSTM networks. The method,
comprising feature fusion and timing prediction steps, addresses challenges posed by noise,
dimensionality, and hidden features in abnormal data. The convolutional network captures
feature correlations and abstracts multivariate time series features, while LSTM effectively
identifies anomalies and predicts time series offsets. Applied to server fault diagnosis and
network traffic anomaly detection, the approach demonstrates superior performance over
traditional methods, offering significant application benefits in complex scenarios.
Experimental evaluations confirm its effectiveness in large multivariate time series datasets.
4.2 Example – The dataset is a weather-conditions dataset with attributes (columns) such as
pollution, dew point, temperature, wind speed, snow, and rain. We will use the multivariate
LSTM time series forecasting technique to predict the pollution level for the coming hours
based on these conditions.
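A small sketch of how such a multivariate series can be reshaped into supervised samples for an LSTM. The window length of 24 hours and the column order (pollution first) are assumptions for illustration:

import numpy as np

def make_windows(values, n_in=24, target_col=0):
    # turn a (timesteps, features) array into LSTM inputs and next-step targets
    X, y = [], []
    for i in range(len(values) - n_in):
        X.append(values[i:i + n_in])              # the past n_in hours of all features
        y.append(values[i + n_in, target_col])    # the next hour's pollution value
    return np.array(X), np.array(y)

# values: array with columns [pollution, dew, temp, wind_speed, snow, rain]
# X, y = make_windows(values, n_in=24, target_col=0)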
CHAPTER 5 - MULTIVARIATE FORECASTING
5.1 INTRODUCTION - Multivariate forecasting in multivariate time series data involves
predicting future values based on multiple interdependent time series. These methods
account for the correlations and interactions between the different time series, often
resulting in more accurate and robust forecasts. Here are several methods and techniques
used for multivariate forecasting:
5.2 TYPES OF FORECASTING METHODS
5.2.1 Auto-Regressive Integrated Moving Average (ARIMA) –
ARIMA (Auto-Regressive Integrated Moving Average) for multivariate time series data
extends the traditional ARIMA model to handle multiple related variables simultaneously.
1. Auto-Regressive (AR) Component: Models the relationship between each variable
and its own lagged values. In multivariate ARIMA, each variable is regressed against its own
past values and possibly past values of other variables.
2. Integrated (I) Component: Deals with non-stationarity by differencing the time series
data. This component can be applied individually to each variable or collectively to a set of
variables.
3. Moving Average (MA) Component: Models the relationship between each variable
and the errors of its lagged values. It captures short-term irregularities in each variable's
behaviour.
4. Multivariate Extension: Combines these components to account for dependencies
between variables. This includes:
- Vector Auto-Regressive (VAR): A multivariate extension of AR that models each
variable as a linear function of lagged values of all variables.
- Vector Moving Average (VMA): Captures dependencies on past error terms from all
variables.
- Vector Auto-Regressive Moving Average (VARMA): Combines VAR and VMA
components to model interdependencies and short-term dynamics.
5. Forecasting: Once the ARIMA model is fitted to multivariate data, it can forecast future
values for each variable based on historical observations and estimated parameters.
In essence, ARIMA for multivariate time series data extends the classic ARIMA model to
handle multiple variables, capturing both their individual dynamics and their
interdependencies over time.
5.2.2 Vector Autoregression (VAR) Model:
The VAR model is designed to capture the linear interdependencies among multiple time
series by considering the past values of all series in the system. This approach assumes that
the data is stationary, meaning the statistical properties like mean and variance remain
constant over time. By including lagged values of all variables, VAR models can effectively
capture the dynamic relationships within the multivariate time series, making it a robust tool
for analyzing the temporal interactions between different time series.
Example: Take a dataset containing measurements of electric power consumption in one
household with a one-minute sampling rate over a period of almost 4 years. Different
electrical quantities and some sub-metering values are available.
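The original preprocessing code is not reproduced here. As a sketch, assuming the standard UCI distribution of this dataset (file household_power_consumption.txt, semicolon-separated, with Date and Time columns and '?' for missing values), the one-minute readings can be loaded and resampled to hourly means:

import pandas as pd

df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?", low_memory=False)
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"], dayfirst=True)
df = df.set_index("datetime").drop(columns=["Date", "Time"]).astype(float)

# resample the one-minute readings to hourly means to make multivariate modelling tractable
hourly = df.resample("H").mean().dropna()
print(hourly.head())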
5.2.2.1 The Augmented Dickey-Fuller (ADF) test is a statistical test used to assess whether a
time series is stationary or non-stationary. Stationarity is a fundamental assumption in time
series analysis, as many modeling techniques and forecasting methods rely on the premise
that the statistical properties of the series do not change over time.
The ADF test evaluates a null hypothesis that a unit root is present in the time series,
indicating non-stationarity. The alternative hypothesis suggests that the series is stationary,
meaning it exhibits a constant mean, variance, and autocovariance over time. The test
calculates a test statistic and compares it with critical values to determine the significance of
the result.
Key components of the ADF test include:
- ADF Statistic: This is the test statistic computed from the data. More negative values
indicate stronger evidence against the null hypothesis of non-stationarity.
- p-value: The probability value associated with the test statistic. A low p-value (typically
less than 0.05) suggests rejecting the null hypothesis in favor of stationarity.
- Critical Values: These are pre-determined thresholds that the ADF statistic must
surpass for the result to be considered significant at a certain confidence level.
Interpreting the ADF test results involves comparing the ADF statistic with critical values and
examining the p-value. If the ADF statistic is lower than the critical values and the p-value is
sufficiently small, it indicates strong evidence that the series is stationary. Conversely, a
higher ADF statistic or a larger p-value suggests the series may be non-stationary.
In practice, before applying forecasting models like ARIMA or VAR, analysts often perform
the ADF test to ensure the time series data meets the stationarity assumption. If the series is
found to be non-stationary, techniques such as differencing may be employed to transform
the data into a stationary form. Overall, the ADF test is a crucial tool for ensuring the
reliability and accuracy of time series analysis and forecasting.
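In Python, the ADF test is available in statsmodels. A minimal sketch on one of the hourly series prepared earlier (the column name Global_active_power is assumed from the standard dataset):

from statsmodels.tsa.stattools import adfuller

result = adfuller(hourly["Global_active_power"].dropna())
print("ADF statistic  :", result[0])
print("p-value        :", result[1])
print("critical values:", result[4])   # thresholds at the 1%, 5% and 10% levels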
5.2.2.2 Granger causality test - The Granger causality test is a statistical hypothesis test used
to determine whether one time series can predict another. Named after Clive Granger, who
introduced it in econometrics, the test assesses whether the past values of one variable (X)
help in forecasting another variable (Y) beyond the information contained in Y's past values
alone.
To perform the Granger causality test:
1. Null Hypothesis (H0): The null hypothesis states that past values of X do not cause Y. In
other words, X does not Granger-cause Y.
2. Alternative Hypothesis (H1): The alternative hypothesis suggests that X Granger-causes Y,
meaning past values of X provide significant predictive information about Y beyond what
is explained by Y's own past values.
3. Procedure:
- The test involves regressing the dependent variable Y on lagged values of both Y and X
(and potentially other variables) up to a specified number of lags.
- It compares the fit of this regression model against a reduced model that excludes X's
lagged values using statistical measures such as F-statistic, p-values, or information criteria
like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
4. Interpreting Results:
- If the p-value associated with the F-statistic is less than a chosen significance level
(commonly 0.05), the null hypothesis is rejected. This indicates that X Granger-causes Y,
implying a potential causal relationship in a predictive sense.
- Conversely, a non-significant p-value suggests that including X's lagged values does
not significantly improve the predictive power of the model for Y.
5. Applications:
- In econometrics, the Granger causality test is widely used to analyze causal
relationships between economic variables, such as GDP and unemployment rates.
- It is also applied in various fields including finance, neuroscience, and
environmental sciences to explore predictive relationships among variables over time.
6. Considerations:
- The Granger causality test does not establish causality in a deterministic sense but
rather infers predictive causality based on statistical evidence.
- Proper interpretation requires careful consideration of model assumptions,
potential confounding factors, and the appropriateness of the lag length chosen for the
test.
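A minimal sketch of the test in Python with statsmodels, reusing the hourly frame from the earlier sketch (the column pair and the maximum lag of 4 are assumptions):

from statsmodels.tsa.stattools import grangercausalitytests

# the second column is tested as a Granger cause of the first column
pair = hourly[["Global_active_power", "Voltage"]]
res = grangercausalitytests(pair, maxlag=4)

p_value = res[4][0]["ssr_ftest"][1]    # p-value of the F-test at lag 4
print("p-value at lag 4:", p_value)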
A VAR model is now fitted to the dataset containing measurements of electric power
consumption in one household with a one-minute sampling rate over a period of almost 4
years; different electrical quantities and some sub-metering values are available.
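Since the original listing is not reproduced in this extract, a minimal VAR sketch with statsmodels on the hourly frame prepared earlier (the maximum lag of 24 and the 12-step horizon are assumptions):

from statsmodels.tsa.api import VAR

model = VAR(hourly)
results = model.fit(maxlags=24, ic="aic")    # choose the lag order by AIC, up to 24 hours

# forecast the next 12 hours for all variables from the last k_ar observations
forecast = results.forecast(hourly.values[-results.k_ar:], steps=12)
print(forecast)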
5.2.3 Vector Autoregressive Moving Average (VARMA) Model:
Building on the VAR model, the VARMA model includes moving average components, which
account for past forecast errors in addition to the past values of the series. This addition
allows the model to capture not only the direct influences of previous observations but also
the residual effects of past errors, providing a more comprehensive understanding of the
underlying data. VARMA models are suitable for stationary time series and offer enhanced
modeling capabilities by combining autoregressive and moving average processes.
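In statsmodels, a VARMA model can be fitted with the VARMAX class. A hedged sketch on two of the (assumed stationary) hourly series, with an assumed order of (1, 1):

from statsmodels.tsa.statespace.varmax import VARMAX

subset = hourly[["Global_active_power", "Voltage"]]   # two-variable example; columns are assumptions
model = VARMAX(subset, order=(1, 1))                  # VARMA(p=1, q=1)
results = model.fit(disp=False)

print(results.summary())
print(results.forecast(steps=12))                     # joint 12-step-ahead forecast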
5.2.4 Vector Autoregressive Integrated Moving Average (VARIMA) Model:
The VARIMA model extends the VARMA approach to handle non-stationary time series data.
It does this by applying differencing to the data, which transforms it into a stationary series
before applying the VARMA framework. Differencing involves subtracting the previous
observation from the current observation to remove trends and stabilize the mean of the
time series. By integrating differencing with autoregressive and moving average
components, VARIMA models effectively manage complex time series dynamics, making
them suitable for analyzing and forecasting non-stationary multivariate time series.
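A sketch of the VARIMA idea, reusing the two-column subset from the previous sketch: difference once to remove trend, fit a VARMA on the differences, then integrate the forecasts back to the original scale (the order and horizon are assumptions):

diffed = subset.diff().dropna()                 # first differences remove the trend

model_d = VARMAX(diffed, order=(1, 1))
results_d = model_d.fit(disp=False)

future_diff = results_d.forecast(steps=12)
# undo the differencing: cumulative sum of forecast differences plus the last observed level
future_levels = future_diff.cumsum() + subset.iloc[-1]
print(future_levels)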
CHAPTER 6 - DEEP LEARNING
6.1 INTRODUCTION - Deep learning is a subset of machine learning that uses neural
networks with many layers (hence "deep") to model complex patterns in data. It excels in
tasks like image recognition, natural language processing, and speech recognition. Key
features include:
Neural Networks : Composed of interconnected nodes (neurons) that simulate the human
brain.
Layers : Multiple layers process data, with each layer extracting higher-level features.
Training : Involves large datasets and high computational power, often using GPUs.
Backpropagation : An optimization technique to adjust weights by minimizing error.
Deep learning models automatically learn features from raw data, reducing the need for
manual feature extraction.
6.2 TYPES OF DEEP LEARNING METHODS
6.2.1 LSTM MODEL –
An LSTM (Long Short-Term Memory) model is a type of recurrent neural network (RNN)
designed to handle and analyze sequential data. Unlike traditional RNNs, LSTMs can learn
and remember over long sequences, making them particularly effective for tasks where
context and time play critical roles.
Key Features of LSTM:
Memory Cells: LSTMs have special units called memory cells that maintain information over
long periods. These cells are the core of the LSTM's ability to remember dependencies
across time steps.
Gates: LSTMs use three types of gates to control the flow of information:
- Forget Gate: Determines which information from the previous cell state should be
discarded.
- Input Gate: Decides which new information should be stored in the cell state.
- Output Gate: Determines the output based on the cell state.
Long-Term Dependencies: The architecture of LSTMs allows them to capture and utilize
long-term dependencies in data. This is crucial for tasks where earlier data points in a
sequence significantly influence later ones.
Applications of LSTM:
1. Time Series Forecasting: Predicting future values based on historical data, such as
stock prices, weather patterns, and sales forecasting.
2. Natural Language Processing (NLP): Tasks like language modelling, machine
translation, text generation, and sentiment analysis.
3. Speech Recognition: Converting spoken language into text by understanding and
predicting sequences of phonemes or words.
4. Anomaly Detection: Identifying unusual patterns or outliers in sequential data, such
as in network security or fault detection in industrial systems.
The LSTM model was implemented for multivariate time series data containing the
measurements of electric power consumption in one household with a one-minute sampling
rate over a period of almost 4 years; different electrical quantities and some sub-metering
values are available.
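The original notebook code is not reproduced in this extract. A minimal Keras sketch of a multivariate LSTM operating on windowed data of shape (samples, timesteps, features); the layer sizes, window length, feature count, and training settings are assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_timesteps, n_features = 24, 7        # e.g. 24 hourly steps of 7 electrical quantities (assumed)

model = Sequential([
    LSTM(64, input_shape=(n_timesteps, n_features)),   # memory cells summarise the input window
    Dense(1),                                           # predict the next value of the target series
])
model.compile(optimizer="adam", loss="mse")

# X: (samples, 24, 7) windows and y: (samples,) next-step targets, e.g. built as in Chapter 4
# model.fit(X, y, epochs=20, batch_size=64, validation_split=0.1)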
6.2.2 Convolutional Neural Network (CNN):
A Convolutional Neural Network (CNN) is a type of deep learning model primarily used for
analysing visual imagery. It is particularly powerful for tasks such as image classification,
object detection, and image segmentation. However, CNNs can also be adapted for other
types of data that exhibit spatial or sequential patterns, such as time series data or natural
language processing tasks.
Key Components:
1. Convolutional Layers: These are the core building blocks of a CNN. Convolutional
layers apply a set of learnable filters (kernels) to small regions of the input data. This process
extracts features such as edges, textures, and patterns from the input. Each filter learns to
detect different features, and multiple filters are applied in parallel to generate a feature
map.
2. Pooling Layers: After each convolutional layer, pooling layers are often used to down
sample the feature maps generated by the convolution. Common pooling operations include
max pooling and average pooling, which reduce the spatial dimensions of the feature maps
while retaining important information.
3. Activation Function: Typically, rectified linear unit (ReLU) activation functions are used
after convolutional and fully connected layers to introduce non-linearity into the network,
enabling it to learn complex patterns and relationships in the data.
4. Fully Connected Layers: In the final stages of a CNN, fully connected layers (dense
layers) are used to combine the features extracted by the convolutional and pooling layers.
These layers integrate the high-level features and make predictions based on them, such as
class probabilities in classification tasks.
5. Output Layer: The output layer of a CNN depends on the specific task. For example, in
image classification, it might have neurons corresponding to each class label, with softmax
activation to output class probabilities. In regression tasks, it might output a single value or
multiple values directly.
Applications:
- Computer Vision: Image classification, object detection, image segmentation.
- Natural Language Processing: Text classification, sentiment analysis, language
translation (when adapted).
- Signal Processing: Speech recognition, time series analysis (when adapted).
The CNN model was likewise implemented for multivariate time series data containing the
measurements of electric power consumption in one household with a one-minute sampling
rate over a period of almost 4 years; different electrical quantities and some sub-metering
values are available.
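As with the LSTM, the original listing is not reproduced here. A minimal Keras sketch of a 1-D CNN for the same kind of windowed multivariate input; the filter counts, kernel size, and window shape are assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_timesteps, n_features = 24, 7        # assumed window length and feature count

model = Sequential([
    Conv1D(filters=64, kernel_size=3, activation="relu",
           input_shape=(n_timesteps, n_features)),      # convolutions learn local temporal patterns
    MaxPooling1D(pool_size=2),                          # downsample the feature maps
    Flatten(),
    Dense(32, activation="relu"),
    Dense(1),                                           # next-step forecast of the target
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X, y, epochs=20, batch_size=64, validation_split=0.1)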
Anomaly detection in multivariate time series data like electric power consumption involves
identifying unusual patterns that do not conform to expected behaviour. Here’s a brief
overview of how you can approach this:
1. Data Understanding and Preprocessing:
- Data Exploration: Begin by visualizing and exploring the data to understand its
patterns and distributions.
- Feature Engineering: Extract relevant features such as active power, reactive power,
voltage, and sub-metering values if they are available.
2. Normalization:
- Scale the data appropriately, especially since different electrical quantities may have
different scales. Techniques like `MinMaxScaler` can be used for this purpose.
3. Model Selection:
- Choose a suitable anomaly detection model. Common approaches include:
- Statistical Methods: Such as Z-score or modified Z-score.
- Machine Learning Models: For example, Isolation Forest, One-Class SVM, or Autoencoders.
- Time Series Specific Methods: Like Seasonal Hybrid Extreme Studentized Deviate (S-HESD)
test.
4. Training and Detection:
- Train the chosen model on the training data, which should ideally be anomaly-free or
contain labeled anomalies.
- Use the trained model to detect anomalies in the test data or real-time streaming data.
5. Anomaly Insertion (if necessary):
- If your dataset lacks anomalies for training purposes, consider synthetic data generation
techniques to simulate anomalies. This involves injecting anomalous patterns into the
dataset based on domain knowledge or statistical analysis.
6. Evaluation:
- Evaluate the performance of your anomaly detection model using appropriate metrics
such as precision, recall, F1-score, or area under the Receiver Operating Characteristic (ROC)
curve.
7. Iterate and Refine:
- Refine your approach based on the results obtained and iterate if necessary, adjusting
parameters or trying different models until satisfactory performance is achieved.
For this electric power consumption dataset in particular, be sure to handle seasonality,
trends, and potentially missing values appropriately during the preprocessing and modeling
phases.
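Tying steps 2-4 of this outline together, a hedged sketch using scaling followed by an Isolation Forest on the hourly frame from Chapter 5 (the 1% contamination value is an assumption):

from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import IsolationForest

scaled = MinMaxScaler().fit_transform(hourly)        # step 2: put all quantities on a 0-1 scale

detector = IsolationForest(contamination=0.01, random_state=42)   # steps 3-4: fit and detect
flags = detector.fit_predict(scaled)                 # -1 = anomalous hour, 1 = normal hour

anomalous_hours = hourly.index[flags == -1]
print(len(anomalous_hours), "anomalous hours flagged")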
REFERENCES –
James Stradling's Medium article on unsupervised machine learning with one-class support
vector machines (https://medium.com/@jamesstradling/unsupervised-machine-learning-with-one-class-support-vector-machines-129579a49d1d)
provided valuable insights into anomaly detection using SVMs.
Aimonks' Medium article on multivariate time series analysis with TensorFlow
(https://medium.com/aimonks/multivariate-timeseries-analysis-using-tensorflow-9554e607077a)
was instrumental in navigating the complexities of TensorFlow for analyzing multivariate
time series data.
Builtin's guide on the elbow method for determining the optimal number of clusters
(https://builtin.com/data-science/elbow-method) enriched my understanding of clustering
algorithms.
ChatGPT and various other websites were consulted for the implementation of the code.