Clustering of Time-Series Data
1. Introduction
The rapid development of technology has led to the recording of many processes in an electronic environment, to the storage of these records, and to their accessibility on demand. With evolving technologies such as cloud computing and big data, the accumulation of large amounts of data in databases, together with the need to parse and screen it for useful information, has made data mining necessary.
The data kept in databases, which grow to enormous sizes every second, can be examined in two classes according to how they change over time: static and temporal. Data are called static when their feature values do not change with time; if the feature values change with time, they are called temporal or time-series data.
Today, with the increase in processor speed and the development of storage technologies, real-world applications can easily record data that change over time. Time-series analysis is a trending research subject because of its prevalence in various fields ranging from science, engineering, bioinformatics, finance, and government to health-care applications [1–3]. Data analysts look for answers to questions such as: Why does the data change this way? Are there any patterns? Which series show similar patterns? Subsequence matching, indexing, anomaly detection, motif discovery, and clustering of the data answer some of these questions [4].
Clustering, one of the most important concepts of data mining, reveals the structure of unlabeled data sets by separating them into homogeneous groups. Many general-purpose clustering algorithms are used for the clustering of time-series data.
Data representation is one of the main challenges for time-series clustering, because time-series data are often much larger than memory [7, 8]: the need for processing power rises, and the time required for the clustering process grows rapidly with data size. In addition, time-series data are multidimensional, which is difficult for many clustering algorithms to handle and slows down the calculation of the similarity measure. Consequently, it is very important to represent time-series data without slowing down the algorithm execution time and without significant data loss. Therefore, some requirements can be listed for any data representation method [9]:
i. Significantly reduce the data dimensionality,

ii. Maintain the local and global shape characteristics of the time series,

iii. Achieve low computational cost,

iv. Provide good reconstruction quality from the reduced representation, and

v. Be insensitive to noise or handle noise implicitly.
Dimension reduction is one of the most frequently used data representation approaches in the literature [7, 10–12].
Figure 1.
Time-series clustering.
• Non-data adaptive methods use fixed-size parameters to represent the time-series data. The following methods are among the non-data adaptive representation methods: Discrete Fourier Transform (DFT) [18], Discrete Wavelet Transform (DWT) [20–22], Discrete Cosine Transformation (DCT) [17], Perceptually Important Point (PIP) [23], Piecewise Aggregate Approximation (PAA) [24], Chebyshev Polynomials (CHEB) [25], Random Mapping [26], and Indexable Piecewise Linear Approximation (IPLA) [27].
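As an illustration of this non-data adaptive family, the following is a minimal sketch of PAA, which replaces each of a fixed number of equal-width frames with its mean (the function name and segment count here are our own choices):

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: represent a series by the
    mean of each of n_segments equal-width frames."""
    series = np.asarray(series, dtype=float)
    # np.array_split handles lengths not divisible by n_segments
    return np.array([frame.mean() for frame in np.array_split(series, n_segments)])

t = np.sin(np.linspace(0, 4 * np.pi, 128))
print(paa(t, 8))  # 8 mean values approximating the 128-point series
```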
Many representation methods have been proposed for time-series data, each offering a different trade-off between the aforementioned requirements. The correct selection of the representation method plays a major role in the effectiveness and usability of the application to be performed.
Definition:
The similarity between two time series of length n, T = {t1, t2, …, tn} and U = {u1, u2, …, un}, is measured by the length of the path connecting their pairs of points [11]. This distance is the measure of similarity: D(T, U) is a function that takes the two time series T and U as input and calculates their distance d.
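For equal-length series, one common instance of D is the Euclidean distance; a minimal sketch (the function name is ours):

```python
import numpy as np

def euclidean_distance(t, u):
    """D(T, U) as the straight-line (Euclidean) distance between two
    equal-length series viewed as points in n-dimensional space."""
    t, u = np.asarray(t, dtype=float), np.asarray(u, dtype=float)
    return float(np.sqrt(np.sum((t - u) ** 2)))

d = euclidean_distance([1.0, 2.0, 3.0], [1.5, 2.5, 2.0])
print(d)
```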
Metrics to be used in clustering must cope with the problems caused by common features of time-series data such as noise, temporal drift, longitudinal scaling, offset translation, linear drift, discontinuities, and amplitude scaling. Various methods have been developed for measuring similarity, and the choice of method is problem specific. These methods can be grouped under three main headings: similarity in time, similarity in shape, and similarity in change.
Similarity in time means that the series are highly dependent on time, rising and falling in synchrony. Such a measure is costly to compute on the raw time series, so a preprocessing or transformation step is required beforehand [34, 36].
Clustering algorithms that use a similarity-in-shape measure assign time series containing similar patterns to the same cluster. Such a measure is independent of time and does not care how often, or when, the pattern occurs [37, 38].
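A classic measure in this spirit is dynamic time warping (DTW), which allows local stretching and compression of the time axis so that similarly shaped but misaligned series compare as close. A minimal O(n·m) dynamic-programming sketch:

```python
import numpy as np

def dtw_distance(t, u):
    """Dynamic time warping distance between two series (possibly of
    different lengths), using absolute difference as the local cost."""
    n, m = len(t), len(u)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(t[i - 1] - u[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))
```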
Using a similarity-in-change metric yields clusters of time series that have a similar autocorrelation structure. However, it is not a suitable metric for short time series [29, 39, 40].
Definition:
Given a dataset of n time series T = {t1, t2, …, tn}, time-series clustering is the process of partitioning T into C = {C1, C2, …, Ck} according to a certain similarity criterion. Ci is called a "cluster," where T = C1 ∪ C2 ∪ … ∪ Ck and Ci ∩ Cj = ∅ for i ≠ j.
Contrary to the partitioning approach, which aims to segment the data into non-intersecting groups, the hierarchical approach produces a hierarchical series of nested clusters that can be represented graphically by a dendrogram, a tree-like diagram. The branches of the dendrogram show the similarity between the clusters as well as how the clusters were formed. A desired number of clusters can be obtained by cutting the dendrogram at a certain level.
Hierarchical clustering methods [47–49] are based either on separating clusters step by step into subgroups or on the stepwise integration of individual clusters into one cluster [50]. According to how the dendrogram is created, hierarchical clustering methods are divided into two groups: agglomerative and divisive.
In agglomerative hierarchical clustering methods, each observation is initially treated as an independent cluster; the closest clusters are then repeatedly merged until all observations form a single cluster.
In divisive hierarchical clustering methods, all observations are initially evaluated as a single cluster and are then repeatedly split, with the observations farthest from the rest separated into a new cluster. This process continues until each observation forms its own cluster.
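A minimal sketch of agglomerative clustering of time series with SciPy, including the dendrogram "cut" described above (the toy sine/cosine data, linkage method, and cluster count are our own choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50)
# toy data: 10 noisy sine series and 10 noisy cosine series of length 50
series = np.vstack(
    [np.sin(x) + 0.1 * rng.standard_normal(50) for _ in range(10)]
    + [np.cos(x) + 0.1 * rng.standard_normal(50) for _ in range(10)]
)

# agglomerative clustering on Euclidean distances with complete linkage
z = linkage(series, method="complete", metric="euclidean")

# "cut the dendrogram" to obtain a fixed number of clusters
labels = fcluster(z, t=2, criterion="maxclust")
print(labels)  # sine and cosine series fall into separate clusters
```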
Hierarchical clustering not only forms groups of similar series but also provides a graphical representation of the data. The graphical presentation gives the user an overall view of the data and an idea of its distribution. However, a small change in the data set can lead to large changes in the dendrogram. Another drawback is the high computational complexity.
In the grid-based approach, grids made up of square cells are used to examine the data space. Thanks to the grid structure, it is independent of the number of objects in the database. The most typical example is STING [56], which uses rectangular cells at different levels of resolution. It precalculates and records statistical information about the properties of each cell. The query process usually begins at a high level of the hierarchical structure. For each cell at the current level, the confidence interval, which reflects the cell's relevance to the query, is computed. Irrelevant cells are excluded from the following steps, and the query process continues with the corresponding cells at the lower level until the lowest layer is reached.
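As a toy illustration of the grid idea, the sketch below precomputes per-cell counts over a 2-D space (counts standing in for STING's richer per-cell statistics; the function name, cell count, and threshold are our own assumptions):

```python
import numpy as np

def grid_counts(points, n_cells=8):
    """Precompute per-cell statistics (here: point counts) over a 2-D
    data space, in the spirit of STING's lowest-resolution layer."""
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins + 1e-12  # avoid division by zero
    idx = np.clip((points - mins) / spans * n_cells, 0, n_cells - 1).astype(int)
    counts = np.zeros((n_cells, n_cells), dtype=int)
    for i, j in idx:
        counts[i, j] += 1
    return counts

rng = np.random.default_rng(1)
pts = rng.normal(size=(1000, 2))
counts = grid_counts(pts)
# a query can discard sparsely populated cells wholesale,
# without touching the individual points inside them
relevant_cells = np.argwhere(counts > 20)
```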
After analyzing the data set and obtaining a clustering solution, there is no guarantee of the significance and reliability of the results: the data will be clustered even if there is no natural grouping. Therefore, whether the obtained clustering solution differs from a random solution should be determined by applying some tests. The methods developed to test the quality of clustering solutions are classified into two types: external indices and internal indices.
• The external index is the most commonly used clustering evaluation method, also known as external validation or external criterion. The ground truth is the set of target clusters, usually created by experts. This index measures how well the target clusters and the resulting clusters overlap. Entropy, Adjusted Rand Index (ARI), F-measure, Jaccard Score, Fowlkes and Mallows Index (FM), and Cluster Similarity Measure (CSM) are the best-known external indices.
• The internal indices evaluate clustering results using the features of the data sets and meta-data, without any external information. They are often used in cases where the correct solutions are not known. The sum of squared error (SSE) is one of the most used internal methods, in which the distance of each series to its nearest cluster center determines the error, so clusters of similar time series are expected to give lower error values. The distance between two clusters (CD) index, root-mean-square standard deviation (RMSSTD), Silhouette index, R-squared index, Hubert-Levin index, semi-partial R-squared (SPR) index, weighted inter-intra index, homogeneity index, and separation index are the common internal indices; a small example computing one external and one internal index follows this list.
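A brief sketch using scikit-learn (the toy data and label vectors are our own): the Adjusted Rand Index compares predicted labels against ground truth (external), while the Silhouette coefficient uses only the data and the predicted labels (internal):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(0)
# two well-separated toy groups of 20 series (30 time points each)
data = np.vstack([rng.normal(0, 1, (20, 30)), rng.normal(5, 1, (20, 30))])
truth = [0] * 20 + [1] * 20            # ground-truth labels (for the external index)
pred = [0] * 18 + [1] * 2 + [1] * 20   # labels from some clustering run

print(adjusted_rand_score(truth, pred))  # external: overlap with ground truth
print(silhouette_score(data, pred))      # internal: uses only the data itself
```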
The funFEM algorithm [55, 57] allows one to cluster time series or, more generally, functional data. It is based on a discriminative functional mixture model (DFM) that clusters the curves (data) in a functional subspace. Given observed curves {x1, x2, …, xn}, funFEM aims to cluster them into K homogeneous groups. It assumes that there exists an unobserved random variable Z = {z1, z2, …, zn}, zi ∈ {0, 1}^K, where zik is defined as 1 if xi belongs to group k and 0 otherwise. The goal of the clustering task is to predict the value zi = (zi1, …, ziK) of Z for each observed curve xi, for i = 1, …, n. To decide the group memberships of Z, funFEM alternates over the three steps of the Fisher-EM algorithm [57] (the "F step," "E step," and "M step"). In other words, among the 12 defined discriminative functional mixture (DFM) models, Fisher-EM decides which one fits the data best. The Fisher-EM algorithm alternates between three steps:
• an E step in which the posterior probabilities that each curve belongs to each group are computed,

• an F step in which the discriminative latent subspace is estimated by maximizing a Fisher criterion, and

• an M step in which the parameters of the mixture model are estimated in the latent subspace by maximizing the conditional expectation of the complete likelihood.
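funFEM itself is distributed as an R package. As a rough Python sketch of the same pipeline idea — project the curves onto a basis and fit a mixture model by EM — one could write the following; note that this is not the DFM model and omits the discriminative F step (the polynomial basis stands in for the spline basis funFEM typically uses):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
# toy curves from two functional groups
curves = np.vstack(
    [np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(100) for _ in range(15)]
    + [np.cos(2 * np.pi * x) + 0.2 * rng.standard_normal(100) for _ in range(15)]
)

# project each curve onto a small polynomial basis (spline-basis stand-in)
basis = np.polynomial.polynomial.polyvander(x, 5)      # (100, 6) design matrix
coef, *_ = np.linalg.lstsq(basis, curves.T, rcond=None)  # (6, n_curves)

# fit a Gaussian mixture on the basis coefficients; plain EM plays the
# role that Fisher-EM plays in funFEM (without the F step)
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(coef.T)
print(labels)
```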
Table 1.
Input data of the FunFEM algorithm.
Gene       Cluster
AADAC      2
AAK1       3
AAMP       3
AANAT      1
AARS       4
AASDHPPT   3
AASS       1
AATF       3
AATK       2
…          …
ZP2        1
ZPBP       1
ZW10       3
ZWINT      4
ZYX        4
ZZEF1      2
ZZZ3       3
Table 2.
Output data of the FunFEM algorithm.
The approach to be taken depends on the application area and the characteristics of the data. For this reason, the clustering of gene expression data, a special area of time-series clustering, is examined in this section as a case study. A microarray is a technology that measures the expression levels of large numbers of genes simultaneously. DNA microarray technology outperforms traditional approaches in the identification of gene copies in a genome, in the identification of nucleotide polymorphisms and mutations, and in the discovery and development of new drugs. It is used as a diagnostic tool for diseases, and DNA microarrays are widely used to classify gene expression changes in cancer cells.
A gene expression time series (gene profile) is a set of data generated by measuring expression levels at different cases/times in a single sample. Gene expression time series have two main characteristics: they are short and unevenly sampled. In the Stanford Microarray Database, more than 80% of the time-series experiments contain fewer than 9 time points [63]; series with fewer than 50 observations are considered quite short for statistical analysis. These characteristics distinguish gene expression time series from other time-series data (business, finance, etc.). In addition to these characteristics, three basic similarity requirements can be identified for gene expression time series: scaling and shifting, unevenly distributed sampling points, and shape (internal structure) [64]. Scaling and shifting problems arise for two reasons: (i) genes with a common sequence are expressed similarly, but such genes need not have the same level of expression at the same time; and (ii) measurement variation is introduced by the microarray technology itself, which is often corrected by normalization.
The scaling and shifting factors in the expression level may hide similar expressions and should not be taken into account when measuring the similarity between two expression profiles. The sampling interval length is informative and cannot be ignored in similarity comparisons. In microarray experiments, the change in intensity, rather than the absolute intensity of the gene expression, characterizes the shape of the expression profile. The internal structure can be represented by deterministic functions, by symbols describing the series, or by statistical models.
There are many popular clustering techniques for gene expression data. The common goal of all of them is to explain the different functional roles of the genes involved in a key biological process: genes expressed in a similar way may have a similar functional role in the process [65].
In addition to all these approaches, the clustering of gene expression data can be examined in three different classes: gene-based clustering, sample-based clustering, and subspace clustering (Figure 2) [66]. In gene-based clustering, genes are treated as objects and instances (time points/samples) as features. Sample-based clustering is exactly the opposite: samples are treated as objects and genes as features. The distinction between these two clustering approaches is based on the basic characterization of the clustering process used for gene expression data. Some clustering algorithms, such as K-means and the hierarchical approach, can be used to cluster both genes and samples. In molecular biology, "any function in the cell is carried out with the participation of a small subset of genes, and the cellular function only occurs in a small subset of samples." With this idea, genes and samples are handled symmetrically in subspace clustering: a gene or a sample can be either an object or a feature.
In gene-based clustering, the aim is to group co-expressed genes together. However, due to the complex nature of microarray experiments, gene expression data often contain high amounts of noise; the characterizing features of gene expression data are often linked to each other (clusters often have a high intersection ratio); and some problems arise from constraints of the biological domain.
Figure 2.
Gene expression data clustering approaches.
Also, for the biologists who will use microarray data, the relationships between genes or between clusters are usually a more favored subject than the clusters of genes themselves. That is, it is also important for the algorithm to produce graphical presentations, not just clusters. K-means, self-organizing maps (SOM), hierarchical clustering, the graph-theoretic approach, model-based clustering, and the density-based approach (DHC) are examples of gene-based clustering algorithms.
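A minimal sketch of gene-based clustering (toy data; K-means from scikit-learn used as the clustering algorithm, with genes as objects and time points as features):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# toy expression matrix: 100 genes (objects) x 8 time points (features)
expr = rng.standard_normal((100, 8))

# gene-based clustering: co-expressed genes end up in the same cluster
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(expr)
gene_clusters = km.labels_
# sample-based clustering would instead cluster expr.T (samples as objects)
```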
The goal of the sample-based approach is to find the phenotype structure or the sub-structure of the samples. The phenotypes of the studied samples [67] can only be distinguished by small gene subsets whose expression levels are highly correlated with the cluster discrimination. These genes are called informative genes. The other genes in the expression matrix have no role in the decomposition of the samples and are considered noise in the database. Traditional clustering algorithms, such as K-means, SOM, and hierarchical clustering, can be applied directly to clustering samples by taking all genes as features. However, the ratio of the informative genes to the irrelevant genes (noise ratio) is usually 1:10, which hinders the reliability of the clustering algorithms; methods are therefore needed to identify the informative genes. Selection of the informative genes can be examined in two categories: supervised and unsupervised. The supervised approach is used in cases where phenotype information such as "patient" and "healthy" is available; a classifier containing only the informative genes is then constructed using this information. The supervised approach is often used by biologists to identify informative genes. In the unsupervised approach, no label specifying the phenotype of the samples is available. The lack of labels, and therefore the fact that the informative genes cannot guide the clustering, makes the unsupervised approach more complicated. Two problems need to be addressed in the unsupervised approach: (i) the high number of genes versus the limited number of samples and (ii) the fact that the vast majority of the collected genes are irrelevant. Two strategies can be mentioned for these problems: unsupervised gene selection and interrelated clustering. In unsupervised gene selection, gene selection and sample clustering are treated as two separate processes: first the gene dimension is reduced, and then classical clustering algorithms are applied. Since there is no training set, gene selection relies solely on statistical models that analyze the variance of the gene expression data. Interrelated clustering dynamically combines iterative clustering and gene selection by using the relationship between genes and samples; after many iterations, the sample partitions converge to the real sample structure, and the selected genes are likely candidates for the informative gene set.
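A minimal sketch of the first strategy, unsupervised gene selection (variance-based filtering followed by K-means on the samples; the toy data, gene count, and threshold are our own choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expr = rng.standard_normal((2000, 20))  # 2000 genes x 20 samples

# step 1: unsupervised gene selection by variance across samples
var = expr.var(axis=1)
informative = expr[np.argsort(var)[-200:]]  # keep the 200 most variable genes

# step 2: cluster the samples using only the selected genes as features
sample_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(informative.T)
print(sample_labels)
```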
When subspace clustering is applied to gene expression matrices, a cluster is treated as a "block" consisting of a subset of genes and a subset of experimental conditions. The expression pattern of the genes in the same block is consistent under the conditions in that block. Different greedy heuristic approaches have been adapted to approximate the optimal solution.
Subspace clustering was first described by Agrawal et al. in 1998 in the context of general data mining [68]. In subspace clustering, two subspace sets may share the same objects and properties, while some objects may not belong to any subspace set. Subspace clustering methods usually define a model to determine the target block and then search the gene-sample space. Some examples of subspace clustering methods proposed for gene expression are biclustering [69], coupled two-way clustering (CTWC) [70], and the plaid model [71].
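As an illustration of the block idea, scikit-learn's spectral co-clustering — a method related to, but distinct from, the specific algorithms of [69–71] — finds such gene-condition blocks (the toy matrix and implanted block are our own):

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# toy expression matrix with an implanted block of co-expressed
# genes (rows) under a subset of conditions (columns)
expr = rng.random((60, 12))
expr[:20, :4] += 3.0

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(expr)
gene_groups = model.row_labels_          # gene subset of each block
condition_groups = model.column_labels_  # condition subset of each block
```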
According to different clustering criteria, the data can be clustered into co-expressed gene groups, samples belonging to the same phenotype, or genes from the same biological process. However, even if the same criteria are used, different clustering algorithms can cluster the data in different forms. For this reason, it is necessary to select the algorithm most suitable for the data distribution.
4. Conclusions
Clustering of time-series data is used as an effective method for data analysis in many areas, from social media usage and financial data to bioinformatics. Various methods have been introduced for time-series data, and the chosen approach is specific to the application, determined by needs such as time, speed, reliability, storage, and so on. When determining the approach to clustering, three basic issues need to be decided: data representation, similarity measure, and clustering algorithm.
Data representation involves transforming the multidimensional and noisy structure of the time-series data into a lower-dimensional representation that best expresses the whole data. The most commonly used methods for this purpose are dimension reduction and feature extraction.
It is challenging to measure the similarity of two time series. This chapter has examined similarity measures in three groups: similarity in shape, similarity in time, and similarity in change.
As for time-series clustering algorithms, it is not wrong to say that they are an evolution of conventional clustering algorithms. Therefore, the classification of traditional clustering algorithms (developed for static data) has been included: partitioning, hierarchical, model-based, grid-based, and density-based.
Partitioning algorithms initially require prototypes, and the accuracy of the algorithm depends on the defined prototypes and the update method. However, they are successful in finding similar series and in clustering time series of equal length. The fact that the number of clusters is not required as an initial parameter is a prominent and well-known feature of hierarchical algorithms; their ability to work on time series of unequal length also puts them one step ahead of other algorithms. However, hierarchical algorithms are not suitable for large data sets due to their computational complexity and scalability problems. Model-based algorithms suffer from problems such as the initialization of parameters based on user predictions and slow processing times on large databases. Density-based algorithms are generally not preferred for time-series data due to their high computational complexity. Each approach has pros and cons compared to the others, and the choice of algorithm for time-series clustering varies completely according to the characteristics of the data and the needs of the application. Therefore, in the last section, a study on the clustering of gene expression data, which is a specific field of application, has been presented.
In time-series data clustering, there is a need for algorithms that are fast, accurate, and memory-efficient on the large data sets that today's applications produce.
© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References
[2] Özkoç EE, Oğul H. Content-based search on time-series microarray databases using clustering-based fingerprints. Current Bioinformatics. 2017;12(5):398-405. ISSN: 2212-392X

[3] Lin J, Keogh E, Lonardi S, Lankford J, Nystrom D. Visually mining and monitoring massive time series. In: Proceedings of 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining–KDD '04; 2004. p. 460

[4] Bornemann L, Bleifuß T, Kalashnikov D, Naumann F, Srivastava D. Data change exploration using time series clustering. Datenbank-Spektrum. 2018;18(2):79-87

[5] Rani S, Sikka G. Recent techniques of clustering of time series data: A survey. International Journal of Computers and Applications. 2012;52(15):1

[6] Aghabozorgi S, Shirkhorshidi AS, Wah TY. Time-series clustering–A decade review. Information Systems. 2015;53:16-38

[7] Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery; 13 June 2003; ACM. pp. 2-11

[8] Keogh EJ, Pazzani MJ. A simple dimensionality reduction technique for fast similarity search in large time series databases. In: Pacific-Asia Conference

[10] Keogh E, Lin J, Fu A. Hot sax: Efficiently finding the most unusual time series subsequence. In: Fifth IEEE International Conference on Data Mining (ICDM'05); 27 November 2005; IEEE. pp. 226-233

[11] Ghysels E, Santa-Clara P, Valkanov R. Predicting volatility: Getting the most out of return data sampled at different frequencies. Journal of Econometrics. 2006;131(1-2):59-95

[12] Kawagoe GD. Grid Representation of Time Series Data for Similarity Search. In: Data Engineering Workshop; 2006

[13] Agronomischer Zeitreihen CA. Time Series Clustering in the Field of Agronomy. Technische Universitat Darmstadt (Master-Thesis); 2013

[14] Keogh E, Lonardi S, Ratanamahatana C. Towards parameter-free data mining. In: Proceedings of Tenth ACM SIGKDD International Conference on Knowledge Discovery Data Mining; 2004, Vol. 22, No. 25. pp. 206-215

[15] Keogh E, Chakrabarti K, Pazzani M, Mehrotra S. Locally adaptive dimensionality reduction for indexing large time series databases. ACM SIGMOD Record. 2001;27(2):151-162

[16] Keogh E, Pazzani M. An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: Proceedings of the 4th International