Chapter

Clustering of Time-Series Data


Esma Ergüner Özkoç

Abstract

The process of separating groups according to similarities in the data is called "clustering." There are two basic principles: (i) similarity within a cluster is the highest and (ii) similarity between clusters is the least. Time-series data are unlabeled data obtained from different periods of a process or from more than one process. These data can be gathered from many different areas, including engineering, science, business, finance, health care, and government. Given unlabeled time-series data, clustering usually produces groups of series with similar characteristics. Time-series clustering methods are examined in three main sections: data representation, similarity measure, and clustering algorithm. The scope of this chapter includes the taxonomy of time-series data clustering and the clustering of gene expression data as a case study.

Keywords: time-series data, data mining, data representation, similarity measure, clustering algorithms, gene expression data clustering

1. Introduction

The rapid development of technology has led to the recording of many processes in electronic environments, the storage of these records, and their accessibility on demand. With evolving technologies such as cloud computing and big data, the accumulation of large amounts of data in databases and the need to parse and screen them for useful information have made data mining necessary.
The data kept in databases, which grow to enormous sizes every second, can be examined in two parts according to how they change over time: static and temporal. Data are called static when their feature values do not change with time; if the feature values change with time, the data are called temporal or time-series data.
Today, with the increase in processor speed and the development of storage
technologies, real-world applications can easily record changing data over time.
Time-series analysis is a trending research subject because of its prevalence in various fields ranging from science, engineering, bioinformatics, finance, and government to health-care applications [1–3]. Data analysts look for the answers to questions such as: Why does the data change this way? Are there any patterns? Which series show similar patterns? Subsequence matching, indexing, anomaly detection, motif discovery, and clustering of the data answer some of these questions [4]. Clustering, which is one of the most important concepts of data mining, reveals the structure of unlabeled data sets by separating them into homogeneous groups. Many general-purpose clustering algorithms are used for the clustering of time-series data, either directly or after adaptation. Algorithm selection depends entirely on the purpose of the application and on the properties of the data, such as sales data, exchange rates in finance, gene expression data, or image data for face recognition.
In the age of informatics, the analysis of multidimensional data that has emerged as part of the digital transformation in every field has gained considerable importance. These data may come from one or more sensors sampled at different times, from stock markets, or from call records in a call center. This type of data, obtained by observing the movement of a variable over time so that the observations are distributed according to time, is called time-series data. Time-series analysis is used for many purposes such as future forecasting, anomaly detection, subsequence matching, clustering, motif discovery, and indexing. Within the scope of this study, the methods developed for time-series data clustering, which is important for every field of digital life, are examined in three main sections. In the first section, the methods proposed in the literature for preparing multidimensional data for clustering (dimension reduction) are categorized. In the second section, the similarity criteria used when deciding which objects to assign to a cluster are classified. In the third section, clustering algorithms for time-series data are examined under five main headings according to the method used. In the last part of the study, the use of time-series clustering in bioinformatics, which is one of its favorite application areas, is presented.

2. Time-series clustering approaches

There are many different categorizations of time-series clustering approaches. For example, time-series clustering approaches can be examined in three main sections according to the characteristics of the data used: whether they work directly on raw data, indirectly with features extracted from the raw data, or indirectly with models built from the raw data [5]. Another categorization is according to the clustering method: shape-based, feature-based, and model-based [6]. But whatever the categorization is, for any time-series clustering approach, the main points to be considered are: how to measure the similarity between time series, how to compress the series or reduce dimension, and what algorithm to use for clustering. Therefore, this chapter examines time-series clustering approaches according to three main building blocks: data representation methods, distance measurements, and clustering algorithms (Figure 1).

2.1 Data representation

Data representation is one of the main challenging issues for time-series clustering, because time-series data are much larger than memory size [7, 8], which increases the need for high processor power, and the time required for the clustering process grows exponentially. In addition, time-series data are multidimensional, which is difficult for many clustering algorithms to handle and slows down the calculation of the similarity measure. Consequently, it is very important to represent time-series data in a way that neither slows down the algorithm execution time nor incurs significant data loss. Therefore, some requirements can be listed for any data representation method [9]:

i. Significantly reduce the data size/dimensionality,

ii. Maintain the local and global shape characteristics of the time series,


iii. Acceptable computational cost,

iv. Reasonable level of reconstruction from the reduced representation,

v. Insensitivity to noise or implicit noise handling.

Dimension reduction is one of the methods most frequently used in the literature [7, 10–12] for data representation.

Definition:
The representation of a time series T with length n is a model T̄ with reduced dimensions, such that T̄ approximates T [13]. Dimension reduction or feature extraction is a very useful method for reducing the number of variables/attributes or units in multivariate statistical analyses, so that the number of attributes is reduced to a number that can be handled.

Figure 1. Time-series clustering.
Due to the noisy and high-dimensional nature of many time-series data, data representations have been studied extensively and are generally examined in four main sections: data adaptive, non-data adaptive, model-based, and data dictated [6].

• Data adaptive methods have parameters that change according to the time-series data being processed. Methods in this category try to minimize the global reconstruction error by using segments of unequal length. Although this makes it difficult to compare several time series, such methods approximate each series better. Some of the popular data adaptive representation methods are: Symbolic Aggregate Approximation (SAX) [14], Adaptive Piecewise Constant Approximation (APCA) [15], Piecewise Linear Approximation (PLA) [16], Singular Value Decomposition (SVD) [17, 18], and Symbolic Natural Language (NLG) [19]. (A small sketch of SAX-style symbolization follows this list.)

• Non-data adaptive methods use fixed-size parameters for representing time-series data. The following methods are among the non-data adaptive representation methods: Discrete Fourier Transform (DFT) [18], Discrete Wavelet Transform (DWT) [20–22], Discrete Cosine Transformation (DCT) [17], Perceptually Important Point (PIP) [23], Piecewise Aggregate Approximation (PAA) [24], Chebyshev Polynomials (CHEB) [25], Random Mapping [26], and Indexable Piecewise Linear Approximation (IPLA) [27].

• Model-based methods assume that the observed time series was produced by an underlying model. The real issue here is to find the parameters that produce this model. Two time series produced by the same set of parameters under the underlying model are considered similar. Some of the model-based methods are: Auto-regressive Moving Average (ARMA) [28, 29], Time-Series Bitmaps [30], and Hidden Markov Model (HMM) [31–33].

• Data dictated methods determine the dimension reduction rate automatically, whereas in the three categories mentioned above the dimension reduction rate is determined by the user. The most common example of a data dictated method is clipped data [34–36].
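To make the representation step concrete, the following is a minimal R sketch, under illustrative assumptions (series length, segment count, and a four-letter alphabet), of Piecewise Aggregate Approximation (a non-data adaptive method) followed by a SAX-style symbolization of the PAA coefficients; it is a simplified illustration of the cited techniques [14, 24], not a reference implementation.

```r
# Minimal sketch: PAA (non-data adaptive) followed by SAX-style
# symbolization (data adaptive). Series length is assumed divisible
# by the number of segments.
paa <- function(x, segments) {
  n <- length(x)
  # average the series over equal-sized frames
  sapply(split(x, rep(1:segments, each = n / segments)), mean)
}

sax <- function(x, segments, breakpoints = c(-0.6745, 0, 0.6745)) {
  z <- (x - mean(x)) / sd(x)   # z-normalize the raw series
  reduced <- paa(z, segments)  # dimension reduction step
  # map each PAA coefficient to a symbol; these breakpoints split the
  # standard normal distribution into four equiprobable regions
  letters[findInterval(reduced, breakpoints) + 1]
}

set.seed(1)
series <- cumsum(rnorm(64))    # toy time series of length 64
sax(series, segments = 8)      # e.g., "a" "b" "c" "d" ...
```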

Many representation methods for time-series data have been proposed, each offering different trade-offs between the aforementioned requirements. The correct selection of the representation method plays a major role in the effectiveness and usability of the application to be performed.

2.2 Similarity/distance measure

In particular, the similarity measure is the most essential ingredient of time-series clustering. The similarity or distance for time-series clustering is calculated approximately, not based on exact matching as in traditional clustering methods, and it requires a distance function to compare two time series. In other words, the similarity of the time series is not calculated, it is estimated. If the estimated distance is large, the similarity between the time series is small, and vice versa.

4
Clustering of Time-Series Data
DOI: http://dx.doi.org/10.5772/intechopen.84490

Definition:
Similarity between two time series of length n, T = {t1, t2, …, tn} and U = {u1, u2, …, un}, is the length of the path connecting the pairs of points [11]. This distance is the measure of similarity. D(T, U) is a function that takes two time series (T, U) as input and calculates their distance d.
Metrics to be used in clustering must cope with the problems caused by common
features of time-series data such as noise, temporal drift, longitudinal scaling, offset
translation, linear drift, discontinuities, and amplitude scaling. Various methods
have been developed for similarity measure, and the method to choose is problem
specific. These methods can be grouped under three main headings: similarity in
time, similarity in shape, and similarity in change.

2.2.1 Similarity in time

Series are similar in time when their values are highly correlated at each point in time. Such a measure is costly to compute on the raw time series, so a preprocessing or transformation is required beforehand [34, 36].
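As a small sketch of similarity in time, assuming equal-length, point-wise aligned series, two commonly used measures are the Euclidean distance and a Pearson-correlation-based distance (the toy data below are illustrative):

```r
# Two "similarity in time" measures on aligned, equal-length series
euclid   <- function(t, u) sqrt(sum((t - u)^2))
cor_dist <- function(t, u) 1 - cor(t, u)  # 0 when perfectly correlated

set.seed(1)
t1 <- sin(seq(0, 2 * pi, length.out = 50))
t2 <- t1 + rnorm(50, sd = 0.1)  # same behavior in time, plus noise
euclid(t1, t2)
cor_dist(t1, t2)
```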

2.2.2 Similarity in shape

Clustering algorithms that use a similarity-in-shape measure assign time series containing similar patterns to the same cluster, independently of time: they do not care when, or how many times, the pattern occurs [37, 38].
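Dynamic Time Warping [37, 38] is the classic example of a shape-based measure. The sketch below uses the CRAN dtw package, which is our choice for illustration (the chapter does not prescribe an implementation):

```r
# Shape-based comparison with Dynamic Time Warping (CRAN "dtw" package)
library(dtw)

query     <- sin(seq(0, 2 * pi, length.out = 80))
reference <- sin(seq(0, 2 * pi, length.out = 80) - 0.5)  # same shape, shifted

alignment <- dtw(query, reference, keep.internals = TRUE)
alignment$distance                 # accumulated warping cost
# plot(alignment, type = "twoway") # optional visual check of the alignment
```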

2.2.3 Similarity in change

The result of using this metric is time-series clusters that have a similar autocorrelation structure. However, it is not a suitable metric for short time series [29, 39, 40].
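A minimal sketch of similarity in change, in the spirit of the ARMA-based distances cited above [28, 29], fits an autoregressive model to each series and compares the coefficient vectors; the fixed order p = 3 is an illustrative assumption:

```r
# "Similarity in change": distance between fitted AR coefficient vectors
ar_dist <- function(t, u, p = 3) {
  a <- ar(t, order.max = p, aic = FALSE)$ar
  b <- ar(u, order.max = p, aic = FALSE)$ar
  sqrt(sum((a - b)^2))
}

set.seed(2)
x <- arima.sim(list(ar = c(0.7, -0.2)), n = 200)
y <- arima.sim(list(ar = c(0.7, -0.2)), n = 200)  # same generating process
z <- arima.sim(list(ar = -0.5), n = 200)          # different dynamics
ar_dist(x, y)  # small: similar autocorrelation structure
ar_dist(x, z)  # larger
```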

2.3 Clustering algorithms

The process of separating groups according to similarities in the data is called "clustering." There are two basic principles: the similarity within a cluster is the highest, and the similarity between clusters is the least. Clustering is done on the basis of the characteristics of the data, using multivariate statistical methods. When dividing data into clusters, the similarities/distances of the data to each other are measured according to the type of the data (discrete, continuous, nominal, ordinal, etc.).
Han and Kamber [41] classify the general-purpose clustering algorithms, which were actually designed for static data, in five main sections: partition-based, hierarchical-based, density-based, grid-based, and model-based. Besides these, a wide variety of algorithms has been developed for time-series data. However, some of these algorithms (ignoring minor differences) intend to use the methods developed for static data directly, without changing the algorithm, by transforming the temporal data into a static form. Other approaches apply a preprocessing step to the data before using the clustering algorithm; this preprocessing step converts the raw time-series data into feature vectors using dimension reduction techniques, or converts them into parameters of a specified model [42].


Definition:
Given a dataset of n time series T = {t1, t2, …, tn}, time-series clustering is the process of partitioning T into C = {C1, C2, …, Ck} according to a certain similarity criterion, where Ci is called a "cluster" and

$T = \bigcup_{i=1}^{k} C_i \quad \text{and} \quad C_i \cap C_j = \emptyset \ \text{for } i \neq j$  (1)

In this section, previously developed clustering algorithms are categorized. Some of these algorithms work directly with raw time-series data, while others use the data representation techniques mentioned previously. Clustering algorithms are generally classified as: partitioning, hierarchical, density-based, model-based, and grid-based clustering.

2.3.1 Partitioning clustering

The K-means [43] algorithm is a typical partition-based clustering algorithm: the data are divided into a number of predefined clusters by optimizing a predefined criterion. Its most important advantage is its simplicity and speed, so it can be applied to large data sets. However, the algorithm may not produce the same result in each run and cannot handle outliers. The self-organizing map [44] is more robust to noisy data than K-means; the user is required to enter the number of clusters and the grid dimensions, and it is difficult to determine the number of clusters for time-series data. Other examples of partition-based clustering are CLARANS [45] and K-medoids [46]. In addition, the partitioning approach is suitable for low-dimensional, well-separated data; however, time-series data are multidimensional and often contain intersecting, embedded clusters.
In essence, these algorithms treat time series as n-dimensional vectors and apply distance or correlation functions to determine the amount of similarity between two series. Euclidean distance, Manhattan distance, and the Pearson correlation coefficient are the most commonly used functions.
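The following is a minimal K-means sketch in base R, treating each equal-length series as a row vector; the synthetic data and K = 2 are illustrative assumptions:

```r
# K-means on time series treated as n-dimensional vectors (one per row)
set.seed(3)
flat   <- t(replicate(10, sin(seq(0, 2 * pi, length.out = 30)) + rnorm(30, sd = 0.2)))
rising <- t(replicate(10, seq(0, 2, length.out = 30) + rnorm(30, sd = 0.2)))
X <- rbind(flat, rising)   # 20 series of length 30

# nstart > 1 mitigates the run-to-run instability mentioned above
km <- kmeans(X, centers = 2, nstart = 20)
km$cluster                 # cluster label for each series
```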

2.3.2 Hierarchical clustering

Contrary to the partitioning approach, which aims to segment the data into non-intersecting clusters, the hierarchical approach produces a hierarchical series of nested clusters that can be represented graphically (a dendrogram, or tree-like diagram). The branches of the dendrogram show the similarity between the clusters as well as how the clusters were formed. A desired number of clusters can be obtained by cutting the dendrogram at a certain level.
Hierarchical clustering methods [47–49] are based either on the stepwise division of the whole data set into subgroups, or on the stepwise integration of individual clusters into larger ones [50]. According to how the dendrogram is created, hierarchical clustering methods are divided into two kinds: agglomerative and divisive.
In agglomerative hierarchical clustering methods, each observation is initially treated as an independent cluster; then, each cluster is repeatedly merged with its closest cluster until a single cluster containing all observations is obtained.
In divisive hierarchical clustering methods, all observations are initially evaluated as a single cluster and then repeatedly split, separating the observations farthest from each other into new clusters. This process continues until each observation forms its own cluster.


Hierarchical clustering not only forms groups of similar series but also provides a graphical representation of the data. The graphical presentation gives the user an overall view of the data and an idea of its distribution. However, a small change in the data set can lead to large changes in the hierarchical dendrogram; another drawback is high computational complexity.
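A minimal agglomerative sketch in base R follows; the synthetic series matrix, the average linkage, and the cut level are illustrative assumptions:

```r
# Agglomerative hierarchical clustering of series (one series per row)
set.seed(4)
X <- rbind(t(replicate(5, cumsum(rnorm(30)))),                    # random walks
           t(replicate(5, sin(seq(0, 4 * pi, length.out = 30))))) # sinusoids

d  <- dist(X)                         # pairwise Euclidean distances
hc <- hclust(d, method = "average")   # stepwise merging of closest clusters
plot(hc)                              # draws the dendrogram
cutree(hc, k = 2)                     # clusters obtained by cutting the tree
```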

2.3.3 Density-based clustering

The density-based clustering approach is based on the concepts of density and the attraction of objects. The idea is to create clusters from dense multidimensional regions where objects attract each other. In the core of dense areas, objects are very close together and crowded; objects at the boundaries of the clusters are scattered more sparsely than in the core. In other words, density-based clustering determines dense areas of the object space; the clusters are dense areas separated by sparse areas. The DBSCAN [51] and OPTICS [52] algorithms are the best known examples of density-based clustering.
The density-based approach is robust in noisy environments and also deals with outliers when defining embedded clusters. However, density-based clustering techniques suffer from high computational complexity and input parameter dependency when a dimensional index structure is not used.
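A minimal DBSCAN sketch follows, using the CRAN dbscan package on feature vectors (a package choice and parameter values of ours; in a time-series setting, the rows would typically be reduced representations of the series):

```r
# DBSCAN on 2-D feature vectors (CRAN "dbscan" package)
library(dbscan)

set.seed(5)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # dense area 1
           matrix(rnorm(100, mean = 5), ncol = 2))  # dense area 2

res <- dbscan(X, eps = 1, minPts = 5)  # eps/minPts are the input parameters
res$cluster                            # 0 marks noise; others are clusters
```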

2.3.4 Model-based clustering

The model-based approach [53–55] uses a statistical infrastructure to model the cluster structure of the time-series data. It is assumed that the underlying probability distributions of the data come from a finite mixture. Model-based algorithms usually try to estimate the likelihood of the model parameters by applying statistical techniques such as Expectation Maximization (EM). The EM algorithm iterates between an "E-step," which computes a matrix z such that zik is an estimate of the conditional probability that observation i belongs to group k given the current parameter estimates, and an "M-step," which computes maximum likelihood parameter estimates given z. Once the EM algorithm converges, each data object is assigned to the cluster with the highest probability, so as to maximize the likelihood over the entire data set.
The most important advantage of the model-based approach is that it estimates the probability that observation i belongs to cluster k. In some cases, a time series is likely to belong to more than one cluster; for such time-series data, this probability-giving property of the approach is the reason it is preferred. The approach assumes that the data set follows a certain distribution, but this assumption is not always correct.
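A minimal EM-based sketch follows, using Gaussian mixtures from the CRAN mclust package (our illustrative choice, not prescribed by the chapter); note how the returned z matrix matches the E-step description above:

```r
# Model-based clustering: EM over Gaussian mixtures (CRAN "mclust" package)
library(mclust)

set.seed(6)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

fit <- Mclust(X, G = 1:4)  # EM for 1..4 mixture components, selected by BIC
fit$G                      # chosen number of clusters
head(fit$z)                # z matrix: posterior membership probabilities
fit$classification         # hard assignment to the highest-probability cluster
```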

2.3.5 Grid-based clustering

In this approach, grids made up of square cells are used to examine the data space; thanks to the grid structure, it is independent of the number of objects in the database. The most typical example is STING [56], which uses quadrilateral cells at different levels of resolution. It precalculates and records statistical information about the properties of each cell. The query process usually begins at a high level of the hierarchical structure. For each cell at the current level, the confidence interval, which reflects the cell's relevance to the query, is computed; irrelevant cells are excluded from the next steps, and the query process continues on the corresponding cells at the lower level until the lowest layer is reached.


After analyzing the data set and obtaining a clustering solution, there is no guarantee of the significance and reliability of the results; the data will be clustered even if there is no natural grouping. Therefore, whether the obtained clustering solution differs from a random solution should be determined by applying some tests. The methods developed to test the quality of clustering solutions are classified into two types: external index and internal index (a small sketch of one index of each type follows the list below).

• The external index is the most commonly used clustering evaluation method, also known as external validation or external criterion. The ground truth is the set of target clusters, usually created by experts. This index measures how well the target clusters and the resulting clusters overlap. Entropy, Adjusted Rand Index (ARI), F-measure, Jaccard Score, Fowlkes and Mallows Index (FM), and Cluster Similarity Measure (CSM) are the best known external indexes.

• The internal indexes evaluate clustering results using only the features of the data sets and meta-data, without any external information. They are often used in cases where the correct solutions are not known. The sum of squared error, in which the distance to the nearest cluster determines the error, is one of the most used internal methods; clusters of similar time series are expected to give lower error values. The distance between two clusters (CD) index, root-mean-square standard deviation (RMSSTD), Silhouette index, R-squared index, Hubert-Levin index, semi-partial R-squared (SPR) index, weighted inter-intra index, homogeneity index, and separation index are the common internal indexes.
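As referenced above, here is a small sketch computing the Adjusted Rand Index (external, via the mclust package) and the mean Silhouette width (internal, via the cluster package); the package choices and toy data are our illustrative assumptions:

```r
# One external (ARI) and one internal (Silhouette) validation index
library(mclust)   # provides adjustedRandIndex()
library(cluster)  # provides silhouette()

set.seed(7)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))
truth <- rep(1:2, each = 30)                    # ground-truth labels
cl <- kmeans(X, centers = 2, nstart = 10)$cluster

adjustedRandIndex(cl, truth)                    # external: overlap with truth
mean(silhouette(cl, dist(X))[, "sil_width"])    # internal: cohesion/separation
```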

2.3.6 Clustering algorithm example: FunFEM

The funFEM algorithm [55, 57] allows one to cluster time series or, more generally, functional data. FunFEM is based on a discriminative functional mixture model (DFM) which allows the clustering of the curves (data) in a functional subspace. If the observed curves are {x1, x2, …, xn}, FunFEM aims to cluster them into K homogeneous groups. It assumes that there exists an unobserved random variable Z = {z1, z2, …, zn} ∈ {0, 1}^K, where zik is defined as 1 if xi belongs to group k and 0 otherwise. The goal of the clustering task is to predict the value zi = (zi1, …, ziK) of Z for each observed curve xi, for i = 1…n. The FunFEM algorithm alternates over the three steps of the Fisher-EM algorithm [57] ("F-step," "E-step," and "M-step") to decide the group memberships Z = {z1, z2, …, zn}. In other words, from the 12 defined discriminative functional mixture (DFM) models, Fisher-EM decides which fits the data best. The Fisher-EM algorithm alternates between three steps:

• an E-step in which the posterior probabilities that observations belong to the K groups are computed,

• an F-step that estimates the orientation matrix U of the discriminative latent space conditionally on the posterior probabilities,

• an M-step in which the parameters of the mixture model are estimated in the latent subspace by maximizing the conditional expectation of the complete likelihood.

The Fisher-EM algorithm updates the parameters repeatedly until the Aitken criterion is satisfied. The Aitken criterion estimates the asymptotic maximum of the log-likelihood in order to detect the algorithm's convergence in advance [57]. In model-based clustering, a model is defined by its number of components/clusters K and its parameterization. In the model selection task, several models are reviewed and the most appropriate model for the considered data is selected.
FunFEM allows one to choose between the AIC (Akaike Information Criterion) [58], BIC (Bayesian Information Criterion) [59], and ICL (Integrated Completed Likelihood) [60] when deciding the number of clusters. The penalty terms are $\frac{\gamma(M)}{2} \log(n)$ in the BIC criterion, $\gamma(M)$ in the AIC criterion, and $\sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \log(t_{ik})$ in the ICL criterion. Here, $\gamma(M)$ indicates the number of parameters in the model M, n is the number of observations, K is the number of clusters, and $t_{ik}$ is the probability of the ith observation belonging to the kth cluster.
FunFEM is implemented in the R programming language and is provided as a function [61]. The algorithm is applied to a time-series gene expression data set in the following section. The input of the algorithm is the gene expression data given in Table 1. The table shows the gene expression values measured in a microarray experiment; the measurement was performed at six different time points for each gene. The data were taken from the GEO database (GSE2241) [62]. For this input data, the best model selected is DkBk with K = 4 (BIC = 152654.5). As a result, the method assigned each gene to the appropriate cluster determined by the algorithm; Table 2 shows the gene symbols and cluster numbers.
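A minimal sketch of such a funFEM run is given below, following the usage pattern of the package documentation [61]; the matrix name expr (genes in rows, the six time points in columns), the B-spline basis size, and the candidate range K = 2:6 are illustrative assumptions, not the chapter's exact script:

```r
# Sketch of a funFEM run on a genes-by-timepoints matrix "expr"
library(funFEM)
library(fda)

tp <- 1:6                                         # six measurement times
basis <- create.bspline.basis(rangeval = c(1, 6), nbasis = 5)
fdobj <- smooth.basis(tp, t(expr), basis)$fd      # functional data object

res <- funFEM(fdobj, K = 2:6, crit = "bic")       # model selection by BIC
res$K          # selected number of clusters (K = 4 in the chapter's run)
head(res$cls)  # cluster assignment per gene, as reported in Table 2
```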

Gene Symbol  TP1     TP2    TP3    TP4     TP5     TP6
AADAC        18.4    29.7   30     79.7    86.7    163.2
AAK1         253.2   141.8  49.2   118.7   145.2   126.7
AAMP         490     340.9  109.1  198.4   210.5   212
AANAT        5.6     1.4    3.7    3.1     1.6     4.9
AARS         1770    793.6  226.5  1008.9  713.3   1253.7
AASDHPPT     940.1   570.5  167.2  268.6   683     263.5
AASS         10.9    1.9    1.5    4.1     19.7    25.5
AATF         543.4   520.1  114.5  305.7   354.2   384.9
AATK         124.5   74.5   17     25.6    64.6    13.6
…            …       …      …      …       …       …
ZP2          4.1     1.4    0.8    1.4     1.4     3
ZPBP         23.4    13.7   7      7.8     22.3    26.9
ZW10         517.1   374.5  72.6   240.8   345.7   333.1
ZWINT        1245.4  983.4  495.3  597.4   1074.3  620.7
ZYX          721.6   554.9  135.5  631.5   330.9   706.8
ZZEF1        90.5    49.3   18.6   66.7    10.4    52.2
ZZZ3         457.3   317.1  93     243.2   657.5   443

Table 1. Input data of the FunFEM algorithm (expression values at six time points, TP1–TP6).


Gene symbol  Cluster number
AADAC        2
AAK1         3
AAMP         3
AANAT        1
AARS         4
AASDHPPT     3
AASS         1
AATF         3
AATK         2
…            …
ZP2          1
ZPBP         1
ZW10         3
ZWINT        4
ZYX          4
ZZEF1        2
ZZZ3         3

Table 2. Output data of the FunFEM algorithm.

3. Clustering approaches for gene expression data clustering

The approach to be taken depends on the application area and the characteristics of the data. For this reason, the clustering of gene expression data, a special application of time-series clustering, is examined in this section as a case study. The microarray is a technology that measures the expression levels of large numbers of genes simultaneously. DNA microarray technology surpasses traditional approaches in the identification of gene copies in a genome, in the identification of nucleotide polymorphisms and mutations, and in the discovery and development of new drugs. It is used as a diagnostic tool for diseases, and DNA microarrays are widely used to classify gene expression changes in cancer cells.
A gene expression time series (gene profile) is a set of data generated by measuring expression levels at different cases/times in a single sample. Gene expression time series have two main characteristics: they are short and unevenly sampled. In the Stanford Microarray Database, more than 80% of the time-series experiments contain fewer than 9 time points [63], and series with fewer than 50 observations are considered quite short for statistical analysis. These characteristics separate gene expression time-series data from the time-series data of other domains (business, finance, etc.). In addition to these characteristics, three basic similarity requirements can be identified for gene expression time series: scaling and shifting, unevenly distributed sampling points, and shape (internal structure) [64]. Scaling and shifting problems arise for two reasons: (i) genes with a common sequence have similar expression patterns, but such genes need not have the same expression level at the same time, and (ii) biases of the microarray technology, which are often corrected by normalization.


The scaling and shifting factors in the expression level may hide similar expressions and should not be taken into account when measuring the similarity between two expression profiles. The sampling interval length is informative and cannot be ignored in similarity comparisons. In microarray experiments, the change in intensity, rather than the absolute intensity of the gene expression, characterizes the shape of the expression profile. The internal structure can be represented by a deterministic function, by symbols describing the series, or by statistical models.
There are many popular clustering techniques for gene expression data. The common goal of all of them is to explain the different functional roles of the genes that take part in a key biological process: genes expressed in a similar way may have a similar functional role in the process [65].
In addition to all these approaches, it is possible to examine the clustering of gene expression data in three different classes: gene-based clustering, sample-based clustering, and subspace clustering (Figure 2) [66]. In gene-based clustering, genes are treated as objects and samples (time points/patients) as features. Sample-based clustering is exactly the opposite: samples are treated as objects and genes as features. The distinction between these two clustering approaches is based on the basic characterization of the clustering process used for gene expression data. Some clustering algorithms, such as K-means and the hierarchical approach, can be used to cluster both genes and samples. In molecular biology, "any function in the cell is carried out with the participation of a small subset of genes, and the cellular function only occurs in a small subset of samples." With this idea, genes and samples are handled symmetrically in subspace clustering: either may serve as object or feature.
In gene-based clustering, the aim is to group co-expressed genes together. However, due to the complex nature of microarray experiments, gene expression data often contain large amounts of noise, have features that are often linked to each other (clusters often have a high intersection ratio), and raise some problems arising from constraints of the biological domain.

Figure 2.
Gene expression data clustering approaches.


Also, among the biologists who use microarray data, the relationships between the genes within a cluster, or between the clusters themselves, are often a more favored subject than the mere cluster memberships; that is, it is also important for the algorithm to produce graphical presentations, not just clusters. K-means, self-organizing maps (SOM), hierarchical clustering, the graph-theoretic approach, model-based clustering, and the density-based approach (DHC) are examples of gene-based clustering algorithms.
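The following is a minimal gene-based sketch: genes as objects, time points as features, and a 1 − Pearson correlation distance, which tolerates the scaling and shifting of expression levels discussed above. The matrix name expr (genes in rows, time points in columns, as in Table 1) and K = 4 are illustrative assumptions:

```r
# Gene-based clustering with a correlation-based distance
corr_dist <- as.dist(1 - cor(t(expr)))       # pairwise distance between genes
hc <- hclust(corr_dist, method = "average")  # agglomerative clustering
clusters <- cutree(hc, k = 4)                # co-expressed gene groups
table(clusters)                              # cluster sizes
```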
The goal of the sample-based approach is to find the phenotype structure or the sub-structure of the samples. The phenotypes of the samples studied [67] can only be distinguished by small gene subsets whose expression levels are highly correlated with the cluster distinction. These genes are called informative genes; the other genes in the expression matrix have no role in the decomposition of the samples and are considered noise in the database. Traditional clustering algorithms, such as K-means, SOM, and hierarchical clustering, can be applied directly to clustering samples by taking all genes as features. However, the ratio of informative genes to irrelevant genes (the noise ratio) is usually 1:10, which hinders the reliability of the clustering algorithms; methods are therefore used to identify the informative genes. The selection of informative genes is examined in two categories: supervised and unsupervised. The supervised approach is used in cases where phenotype information such as "patient" and "healthy" is attached to the samples; a classifier containing only the informative genes is then constructed using this information. The supervised approach is often used by biologists to identify informative genes. In the unsupervised approach, no label specifying the phenotype of the samples is given. The lack of labeling, and therefore the fact that the informative genes cannot guide the clustering, makes the unsupervised approach more complicated. There are two problems that need to be addressed in the unsupervised approach: (i) the high number of genes versus the limited number of samples and (ii) the fact that the vast majority of the collected genes are irrelevant. Two strategies can be mentioned for these problems in the unsupervised approach: unsupervised gene selection and associated clustering. In unsupervised gene selection, gene selection and sample clustering are treated as two separate processes: first the gene dimension is reduced, and then classical clustering algorithms are applied. Since there is no training set, the choice of genes is based solely on statistical models that analyze the variance of the gene expression data. Associated clustering dynamically combines repeated clustering and gene selection by exploiting the relationship between genes and samples; after many iterations, the sample partitions converge to the real sample structure, and the selected genes are likely candidates for the informative gene set.
When subspace clustering is applied to gene expression matrices, a cluster is treated as a "block" consisting of a subset of genes and a subset of experimental conditions. The expression pattern of the genes in the same block is consistent under the conditions in that block. Different greedy heuristic approaches have been adapted to approximate the optimal solution.
Subspace clustering was first described by Agrawal et al. in 1998 for general data mining [68]. In subspace clustering, two subspace clusters may share the same objects and properties, while some objects may not belong to any subspace cluster. Subspace clustering methods usually define a model to determine the target block and then search the gene-sample space. Some examples of subspace clustering methods proposed for gene expression are biclustering [69], coupled two-way clustering (CTWC) [70], and the plaid model [71].
According to different clustering criteria, the data can be clustered into co-expressed gene groups, samples belonging to the same phenotype, or genes from the same biological process. However, even if the same criteria are used, different clustering algorithms may cluster the data in different forms. For this reason, it is necessary to select the algorithm best suited to the data distribution.

4. Conclusions

Clustering of time-series data is used as an effective method for data analysis in many areas, from social media usage and financial data to bioinformatics. Various methods have been introduced for time-series data, and which approach is chosen is specific to the application; the application determines needs such as time, speed, reliability, storage, and so on. When determining the approach to clustering, three basic issues need to be decided: data representation, similarity measure, and clustering algorithm.
Data representation involves transforming the multidimensional and noisy structure of the time-series data into a lower-dimensional form that best expresses the whole data; the most commonly used methods for this purpose are dimension reduction and feature extraction.
It is challenging to measure the similarity of two time series. This chapter has examined similarity measures in three sections: similarity in shape, similarity in time, and similarity in change.
For the time-series clustering algorithms, it is not wrong to say that they are the evolution of conventional clustering algorithms. Therefore, the classification of traditional clustering algorithms (developed for static data) has been included: partitioning, hierarchical, model-based, grid-based, and density-based. Partitioning algorithms initially require prototypes, and the accuracy of the algorithm depends on the defined prototypes and the update method; however, they are successful in finding similar series and in clustering time series of equal length. The fact that the number of clusters need not be given as an initial parameter is a prominent and well-known feature of hierarchical algorithms; at the same time, their ability to work on time series of unequal length puts them one step ahead of other algorithms. However, hierarchical algorithms are not suitable for large data sets due to their computational complexity and scalability problems. Model-based algorithms suffer from problems such as the initialization of parameters based on user predictions and slow processing times on large databases. Density-based algorithms are generally not preferred for time-series data due to their high computational complexity. Each approach has pros and cons compared to the others, and the choice of algorithm for time-series clustering varies completely according to the characteristics of the data and the needs of the application. Therefore, in the last section, a study on the clustering of gene expression data, which is a specific field of application, has been presented.
In time-series data clustering, there is a need for algorithms that run fast, accurately, and with less memory on the large data sets of today.


Author details

Esma Ergüner Özkoç


Başkent University, Ankara, Turkey

*Address all correspondence to: [email protected]

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.


References

[1] Ratanamahatana C. Multimedia retrieval using time series representation and relevance feedback. In: Proceedings of the 8th International Conference on Asian Digital Libraries (ICADL2005); 2005. pp. 400-405

[2] Özkoç EE, Oğul H. Content-based search on time-series microarray databases using clustering-based fingerprints. Current Bioinformatics. 2017;12(5):398-405. ISSN: 2212-392X

[3] Lin J, Keogh E, Lonardi S, Lankford J, Nystrom D. Visually mining and monitoring massive time series. In: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining–KDD '04; 2004. p. 460

[4] Bornemann L, Bleifuß T, Kalashnikov D, Naumann F, Srivastava D. Data change exploration using time series clustering. Datenbank-Spektrum. 2018;18(2):79-87

[5] Rani S, Sikka G. Recent techniques of clustering of time series data: A survey. International Journal of Computers and Applications. 2012;52(15):1

[6] Aghabozorgi S, Shirkhorshidi AS, Wah TY. Time-series clustering–A decade review. Information Systems. 2015;53:16-38

[7] Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery; 13 June 2003; ACM. pp. 2-11

[8] Keogh EJ, Pazzani MJ. A simple dimensionality reduction technique for fast similarity search in large time series databases. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining; 18 April 2000; Springer, Berlin, Heidelberg. pp. 122-133

[9] Esling P, Agon C. Time-series data mining. ACM Computing Surveys (CSUR). 2012;45(1):12

[10] Keogh E, Lin J, Fu A. Hot sax: Efficiently finding the most unusual time series subsequence. In: Fifth IEEE International Conference on Data Mining (ICDM'05); 27 November 2005; IEEE. pp. 226-233

[11] Ghysels E, Santa-Clara P, Valkanov R. Predicting volatility: Getting the most out of return data sampled at different frequencies. Journal of Econometrics. 2006;131(1-2):59-95

[12] Kawagoe GD. Grid representation of time series data for similarity search. In: Data Engineering Workshop; 2006

[13] Agronomischer Zeitreihen CA. Time Series Clustering in the Field of Agronomy. Technische Universitat Darmstadt (Master-Thesis); 2013

[14] Keogh E, Lonardi S, Ratanamahatana C. Towards parameter-free data mining. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2004, Vol. 22, No. 25. pp. 206-215

[15] Keogh E, Chakrabarti K, Pazzani M, Mehrotra S. Locally adaptive dimensionality reduction for indexing large time series databases. ACM SIGMOD Record. 2001;27(2):151-162

[16] Keogh E, Pazzani M. An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining; 1998. pp. 239-241

[17] Korn F, Jagadish HV, Faloutsos C. Efficiently supporting ad hoc queries in large datasets of time sequences. ACM SIGMOD Record. 1997;26:289-300

[18] Faloutsos C, Ranganathan M, Manolopoulos Y. Fast subsequence matching in time-series databases. ACM SIGMOD Record. 1994;23(2):419-429

[19] Portet F, Reiter E, Gatt A, Hunter J, Sripada S, Freer Y, et al. Automatic generation of textual summaries from neonatal intensive care data. Artificial Intelligence. 2009;173(7):789-816

[20] Chan K, Fu AW. Efficient time series matching by wavelets. In: Proceedings of the 1999 15th International Conference on Data Engineering; 1999, Vol. 15, No. 3. pp. 126-133

[21] Agrawal R, Faloutsos C, Swami A. Efficient similarity search in sequence databases. Foundations of Data Organization and Algorithms. 1993;46:69-84

[22] Kawagoe K, Ueda T. A similarity search method of time series data with combination of Fourier and wavelet transforms. In: Proceedings of the Ninth International Symposium on Temporal Representation and Reasoning; 2002. pp. 86-92

[23] Chung FL, Fu TC, Luk R. Flexible time series pattern matching based on perceptually important points. In: Joint Conference on Artificial Intelligence Workshop; 2001. pp. 1-7

[24] Keogh E, Pazzani M, Chakrabarti K, Mehrotra S. A simple dimensionality reduction technique for fast similarity search in large time series databases. Knowledge and Information Systems. 2000;1805(1):122-133

[25] Cai Y, Ng R. Indexing spatio-temporal trajectories with Chebyshev polynomials. In: Proceedings of the 2004 ACM SIGMOD International Conference; 2004. p. 599

[26] Bingham E. Random projection in dimensionality reduction: Applications to image and text data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2001. pp. 245-250

[27] Chen Q, Chen L, Lian X, Liu Y. Indexable PLA for efficient similarity search. In: Proceedings of the 33rd International Conference on Very Large Data Bases; 2007. pp. 435-446

[28] Corduas M, Piccolo D. Time series clustering and classification by the autoregressive metric. Computational Statistics & Data Analysis. 2008;52(4):1860-1872

[29] Kalpakis K, Gada D, Puttagunta V. Distance measures for effective clustering of ARIMA time-series. In: Proceedings of the 2001 IEEE International Conference on Data Mining; 2001. pp. 273-280

[30] Kumar N, Lolla N, Keogh E, Lonardi S. Time-series bitmaps: A practical visualization tool for working with large time series databases. In: Proceedings of the 2005 SIAM International Conference on Data Mining; 2005. pp. 531-535

[31] Minnen D, Starner T, Essa M, Isbell C. Discovering characteristic actions from on-body sensor data. In: Proceedings of the 10th IEEE International Symposium on Wearable Computers; 2006. pp. 11-18

[32] Minnen D, Isbell CL, Essa I, Starner T. Discovering multivariate motifs using subsequence density estimation and greedy mixture learning. In: Proceedings of the National Conference on Artificial Intelligence; 2007, Vol. 22, No. 1. p. 615

[33] Panuccio A, Bicego M, Murino V. A hidden Markov model-based approach to sequential data clustering. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Berlin, Heidelberg: Springer; 2002. pp. 734-743

[34] Bagnall AJ, Ratanamahatana C, Keogh E, Lonardi S, Janacek G. A bit level representation for time series data mining with shape based similarity. Data Mining and Knowledge Discovery. 2006;13(1):11-40

[35] Ratanamahatana C, Keogh E, Bagnall AJ, Lonardi S. A novel bit level time series representation with implications for similarity search and clustering. In: Proceedings of the 9th Pacific-Asia International Conference on Knowledge Discovery and Data Mining (PAKDD'05); 2005. pp. 771-777

[36] Bagnall AJ, Janacek G. Clustering time series with clipped data. Machine Learning. 2005;58(2):151-178

[37] Sakoe H, Chiba S. A dynamic programming approach to continuous speech recognition. In: Proceedings of the Seventh International Congress on Acoustics; 1971, Vol. 3. pp. 65-69

[38] Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1978;26(1):43-49

[39] Smyth P. Clustering sequences with hidden Markov models. Advances in Neural Information Processing Systems. 1997;9:648-654

[40] Xiong Y, Yeung DY. Mixtures of ARMA models for model-based time series clustering. In: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM); 2002. pp. 717-720

[41] Han J, Kamber M. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann; 2001. pp. 346-389

[42] Liao TW. Clustering of time series data—a survey. Pattern Recognition. 2005;38(11):1857-1874

[43] MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; 21 June 1967, Vol. 1, No. 14. pp. 281-297

[44] Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences. 1999;96(6):2907-2912

[45] Ng RT, Han J. Efficient and effective clustering methods for spatial data mining. In: Proceedings of the International Conference on Very Large Data Bases; 1994. pp. 144-144

[46] Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 39. Hoboken, New Jersey: Wiley Online Library; 1990

[47] Guha S, Rastogi R, Shim K. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD Record. 1998;27(2):73-84

[48] Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record. 1996;25(2):103-114

[49] Karypis G, Han EH, Kumar V. Chameleon: Hierarchical clustering using dynamic modeling. Computer. 1999;32(8):68-75

[50] Beal M, Krishnamurthy P. Gene expression time course clustering with countably infinite hidden Markov models. arXiv preprint arXiv:1206.6824; 2012

[51] Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Knowledge Discovery and Data Mining. Vol. 96, No. 34; August 1996. pp. 226-231

[52] Ankerst M, Breunig M, Kriegel H. OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record. 1999;28(2):40-60

[53] Fisher DH. Knowledge acquisition via incremental conceptual clustering. Machine Learning. 1987;2(2):139-172

[54] Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing. 1987;37(1):54-115

[55] Bouveyron C, Côme E, Jacques J. The discriminative functional mixture model for the analysis of bike sharing systems. The Annals of Applied Statistics. 2015;9(4):1726-1760

[56] Wang W, Yang J, Muntz R. STING: A statistical information grid approach to spatial data mining. In: Proceedings of the International Conference on Very Large Data Bases; 1997. pp. 186-195

[57] Bouveyron C, Brunet C. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing. 2012;22:301-324

[58] Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716-723

[59] Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association. 1995;90(430):773-795

[60] Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22:719-725

[61] Bouveyron C. funFEM: Clustering in the Discriminative Functional Subspace. R package version 1; 2015

[62] Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, et al. NCBI GEO: Archive for high-throughput functional genomic data. Nucleic Acids Research. 2009;37(Database):D885-D890

[63] Kuenzel L. Gene clustering methods for time series microarray data. Biochemistry. 2010;218

[64] Moller-Levet CS, Cho KH, Yin H, Wolkenhauer O. Clustering of gene expression time-series data. Technical report. Department of Computer Science, University of Rostock, Germany; 2003

[65] Beal M, Krishnamurthy P. Gene expression time course clustering with countably infinite hidden Markov models. arXiv preprint arXiv:1206.6824; 2012

[66] Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering. 2004;16(11):1370-1386

[67] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531-537

[68] Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record. 1998;27(2):94-105

[69] Cheng Y, Church GM. Biclustering of expression data. In: ISMB; 2000, Vol. 8, No. 2000. pp. 93-103

[70] Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences. 2000;97(22):12079-12084

[71] Lazzeroni L, Owen A. Plaid models for gene expression data. Statistica Sinica. 2002;1:61-86
