Graphical Models for Data Mining
NLP-AI Seminar
Manoj Kumar Chinnakotla
Outline of the Talk
Graphical Models - Overview
Motivation
Bayesian Networks
Markov Random Fields
Inferencing and Learning
Expressive Power
Example Applications
Gene Expression Analysis
Web Page Classification
Summary
Graphical Models - An Introduction
Graph G = <V, E> representing a family of
probability distributions
Nodes V - Random Variables
Edges E - Indicate Stochastic Dependence
G encodes Conditional Independence assertions in
the domain
Mainly two kinds of Models
Directed (a.k.a Bayesian Networks)
Undirected (a.k.a Markov Random Fields (MRFs))
Graphical Models (Contd.)
[Figure: Bayesian network over the nodes Cloudy, Sprinkler, Rain and Wet Grass; adapted from [RN95]]
Direction of edges based on causal knowledge
A - B : Not sure of causality
A → B : A causes B
Mixed versions also possible - Chain Graphs
Why Graphical Models?
Framework for modeling and efficiently reasoning
about multiple correlated random variables
Provides insights into the assumptions of existing
models
Allows qualitative specification of independence assumptions
Why Graphical Models?
Recent Trends in Data Mining
Traditional learning algorithms assume
Data available in record format
Instances are i.i.d. samples
Recent domains like Web, Biology, Marketing have
more richly structured data
Examples : DNA Sequences, Social Networks, Hyperlink structure of Web, Phylogeny Trees
Relational Data Mining - Data spread across multiple tables
Relational Structure helps significantly in enhancing
accuracy [CDI98, LG03]
Graphical Models offer a natural formalism to
model such data
Directed Models : Bayesian Networks
Bayes Net - DAG encoding the conditional independence assumptions among the variables
Cycles not allowed - Edges usually have causal interpretations
Specifies a compact representation of joint distribution over the variables given by
$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P_i(X_i \mid Pa(X_i))$
where $Pa(X_i)$ = Parents of Node $X_i$ in the network
$P_i$ - Conditional Probability Distribution (CPD) of $X_i$
Figure adapted from [RN95]
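As a minimal sketch of this factorization, the Python snippet below evaluates the joint distribution for a network over Cloudy, Sprinkler, Rain and Wet Grass. The edge structure follows the usual textbook version of this example, and the CPD numbers and helper names (`bernoulli`, `joint`) are illustrative assumptions, not values from the talk.

```python
from itertools import product

# Assumed CPDs for the edges Cloudy -> Sprinkler, Cloudy -> Rain,
# Sprinkler -> WetGrass and Rain -> WetGrass (illustrative numbers only).
P_C = 0.5                                   # P(Cloudy = True)
P_S_GIVEN_C = {True: 0.1, False: 0.5}       # P(Sprinkler = True | Cloudy)
P_R_GIVEN_C = {True: 0.8, False: 0.2}       # P(Rain = True | Cloudy)
P_W_GIVEN_SR = {                            # P(WetGrass = True | Sprinkler, Rain)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.9, (False, False): 0.0,
}

def bernoulli(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    """P(C, S, R, W) as the product of each node's CPD given its parents."""
    return (bernoulli(P_C, c)
            * bernoulli(P_S_GIVEN_C[c], s)
            * bernoulli(P_R_GIVEN_C[c], r)
            * bernoulli(P_W_GIVEN_SR[(s, r)], w))

# The factorized joint sums to one over all 2^4 assignments.
total = sum(joint(*values) for values in product([True, False], repeat=4))
```

Four small tables (1 + 2 + 2 + 4 free parameters) stand in for the 15 free parameters of an unconstrained joint over four binary variables, which is the compactness the slide refers to.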
Undirected Graphical Models
Markov Random Fields
Have been well studied and applied in Vision
No underlying causal structure
Joint distribution can be factorized into
$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(X_c)$
where C - Set of cliques in graph
$\psi_c$ - Potential function (a positive function) on the clique $X_c$
Z - Partition Function given by
$Z = \sum_{\vec{x}} \prod_{c \in C} \psi_c(X_c)$
[Figure: undirected graph over the nodes X1, X2, X3, X4, X5]
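The role of the partition function can be sketched on a toy model. The snippet below assumes a three-node chain with two edge cliques and a made-up potential that favours equal neighbours; the graph, the potential values and the function names are all hypothetical and chosen only to keep the brute-force sum small.

```python
from itertools import product

# Hypothetical MRF: chain X1 - X2 - X3; the cliques are the two edges.
CLIQUES = [(0, 1), (1, 2)]

def potential(a, b):
    """Assumed positive potential: neighbours that agree score higher."""
    return 2.0 if a == b else 1.0

def unnormalized(x):
    """Product of clique potentials for one full assignment x."""
    score = 1.0
    for i, j in CLIQUES:
        score *= potential(x[i], x[j])
    return score

STATES = list(product([0, 1], repeat=3))

# Partition function: sum of the unnormalized score over every assignment.
Z = sum(unnormalized(x) for x in STATES)

def prob(x):
    """Normalized joint probability P(x) = (1/Z) * product of potentials."""
    return unnormalized(x) / Z
```

The sum defining Z ranges over every joint assignment, so this brute-force version only works for tiny models; that cost is what motivates the inference algorithms discussed later in the talk.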
Expressive Power
Directed vs Undirected Models
The dependencies that each kind of model can represent are not exactly the same
Example : [Figure adapted from [JP98]]
Decomposable Models - Class of dependencies
which both can model
What Class of Distributions Can be Modeled?
Inference
Given a subset of observed variables $X_K$, compute the distribution $P(X_U \mid X_K)$, where $\vec{X} = \{X_U\} \cup \{X_K\}$
Marginals - involve summation over exponentially many terms
Complexity handled by exploiting the graphical structure
Algorithms : Exact and Approximate
Some Examples : Variable Elimination, Sum-Product Algorithm, Sampling Algorithms
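To illustrate how graph structure tames the exponential sum, the sketch below computes a marginal on a hypothetical chain A -> B -> C -> D in two ways: by summing the full joint, and by eliminating one variable at a time. The chain, the shared CPD table and the function names are assumptions made for the example.

```python
from itertools import product

# Hypothetical chain A -> B -> C -> D of binary variables.
PRIOR = [0.6, 0.4]                    # P(A)
CPD = [[0.7, 0.3], [0.2, 0.8]]        # P(child | parent), reused for every edge (assumption)
N_EDGES = 3

def marginal_brute_force():
    """P(D) by summing the full joint over all 2^4 assignments."""
    result = [0.0, 0.0]
    for a, b, c, d in product([0, 1], repeat=4):
        result[d] += PRIOR[a] * CPD[a][b] * CPD[b][c] * CPD[c][d]
    return result

def marginal_by_elimination():
    """P(D) by eliminating A, then B, then C, one local sum at a time."""
    message = PRIOR[:]
    for _ in range(N_EDGES):
        message = [sum(message[p] * CPD[p][child] for p in (0, 1))
                   for child in (0, 1)]
    return message

brute = marginal_brute_force()
eliminated = marginal_by_elimination()
```

Both functions return the same distribution, but the elimination version does work linear in the chain length, whereas the brute-force sum grows exponentially with the number of variables.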
Learning
Estimating graphical structure G and parameters from data
Standard maximum-likelihood (ML) estimates used when variables in the model are fully observable
MRFs use Iterative Algorithms for parameter estimation
Structure Learning relatively hard
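For a directed model with fully observed discrete variables, the maximum-likelihood CPD reduces to normalized counts. The following sketch assumes a single edge Cloudy -> Rain and a small made-up sample; the data and the name `ml_cpd` are hypothetical.

```python
from collections import Counter

# Hypothetical fully observed samples of (Cloudy, Rain).
DATA = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

def ml_cpd(samples):
    """ML estimate of P(Rain = r | Cloudy = c) as count(c, r) / count(c)."""
    pair_counts = Counter(samples)
    parent_counts = Counter(c for c, _ in samples)
    return {(c, r): pair_counts[(c, r)] / parent_counts[c]
            for (c, r) in pair_counts}

cpd = ml_cpd(DATA)
```

Each CPD is estimated independently from the counts of a node together with its parents, which is why parameter learning is easy in the fully observed directed case.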
Applications
Bio-informatics
Gene Expression Analysis
Gene Expression Analysis - Introduction
Standard Techniques - Clustering and Bayesian Networks
Probabilistic Relational Models (PRMs)
Integrating Additional Information into PRM
Learning PRMs from Data
DNA - The Blueprint of Life!
DNA - Deoxyribonucleic Acid
Double Helix Structure
Each Strand - Sequence of Nucleotides {Adenine (A), Guanine (G), Cytosine (C), Thymine (T)}
Complementary Strands - A pairs with T, G pairs with C
Gene - Portions of DNA that code for Proteins or
large biomolecules
The Central Dogma - Transcription and
Translation
Figure Source : www.swbic.org/education/comp-bio/images/
Gene Expression
Each cell has the same copy of DNA, yet different cells
synthesize different Proteins!
Example : Cells making the proteins needed for muscles,
eye lens etc.
Gene said to be expressed if it produces its corresponding protein
Genes expressed vary - Based on time, location, environmental and biological conditions
Expression regulated by a complex collection of
proteins
DNA Micro-array Technology
Micro-array or Gene chips used for experiments
Allows measurement of expression levels of tens of
thousands of genes simultaneously
Many experiments measure expression of same set
of genes under various environmental/biological
conditions
Example : Cell is heated up, cooled down, drug added
Expression Level
Estimated based on the amount of mRNA for that gene currently present in the cell
Ratio of the expression level under the experimental condition to the level under the normal condition is used instead
Gene Expression Data
Enormous amount of expression data for various
species publicly available
Some Examples
EBI Micro-array data repository (http://www.ebi.ac.uk/arrayexpress/)
Stanford Micro-array Database (http://genomewww5.stanford.edu/) etc.
Figure Source : [?]
The Problem - Drowning in Data!
Where is Information?
Enormous amount of data
EBI data repository has grown 100-fold just in a year!
Difficult for humans to comprehend, detect patterns
Biological experiments - Costly and Time consuming
Machine Learning/Data Mining techniques to the
rescue
Allow learning of models which provide useful insight into
the biological processes
Reduce the number of biological experiments needed
Gene Expression Analysis - Approaches
Aim
To identify co-regulated genes
To gain biological insight into gene regulatory
mechanisms
Approaches
Clustering
Bayesian Networks
Probabilistic Relational Models (PRMs)
Focus of the Presentation
Probabilistic Models for Gene Expression using
PRMs
Clustering
Two-Side Clustering
Genes and Experiments partitioned into clusters G1 , . . . , Gk
and E1 , . . . , El simultaneously
Summarizes data into k × l groups
Assumption - Expression governed by a distribution specific
to each combination of Gene/Experiment clusters
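The k × l summary can be sketched as follows, assuming the cluster assignments are already known; finding them is the actual clustering problem and is not shown. The expression matrix, the assignments and the name `block_means` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical expression matrix: rows are genes, columns are experiments.
EXPR = [
    [1.0, 1.2, -0.8],
    [0.9, 1.1, -1.0],
    [-1.1, -0.9, 0.7],
]
GENE_CLUSTER = [0, 0, 1]     # assumed gene cluster of each row (k = 2)
EXP_CLUSTER = [0, 0, 1]      # assumed experiment cluster of each column (l = 2)

def block_means(matrix, row_cluster, col_cluster):
    """Average expression for each (gene cluster, experiment cluster) block."""
    blocks = defaultdict(list)
    for g, row in enumerate(matrix):
        for e, value in enumerate(row):
            blocks[(row_cluster[g], col_cluster[e])].append(value)
    return {key: mean(values) for key, values in blocks.items()}

summary = block_means(EXPR, GENE_CLUSTER, EXP_CLUSTER)
```

Here the nine measurements collapse into four block summaries, one per combination of gene cluster and experiment cluster.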
Clustering Techniques - Problems
Similarity based on all the measurements. What if similarity
exists only over a subset of measurements?
Difficult to integrate additional information - Gene Annotation, Cell-Type/Strain used, Gene Promoters
Bayesian Networks
Bayes Net - DAG encoding the conditional independence assumptions among the variables
Specifies a compact representation of joint distribution over the variables given by
$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$
where $Pa(X_i)$ = Parents of Node $X_i$ in the network
Provides insight into the influence patterns across variables
Friedman et al. have applied it to learn gene regulatory mechanisms
Bayesian Networks (Contd.)
Modeling Relational Data
Relational Data - Data spread across multiple tables
Provides valuable additional information for learning models
Example : DNA Sequence Information, Gene Annotations
Bayes Nets not suitable for modeling relational data
Bayes Net Learning Algorithms - Attribute Based
Assume all the data to be present in a single table
Make sample independence assumption
Solution : Why not flatten the data?
Will make the samples dependent
Cannot be used to reach conclusions based on relational dependencies
Probabilistic Relational Models (PRMs)
Learns a probabilistic model over a relational
schema involving multiple entities
Entities in the current problem - Gene, Array and
Expression
Each entity X can have attributes of the form
X.B - Simple Attribute
X.R.C - Attribute of another relation where R is a Reference
Slot
Reference Slots - Similar to foreign keys in the
database world
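One way to picture the schema is as plain Python classes, where a reference slot is a field holding another object. This is only a sketch: the field names (`cluster`, `level`) are hypothetical stand-ins for attributes of the Gene, Array and Expression entities.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    cluster: int        # simple attribute of the form X.B

@dataclass
class Array:
    name: str
    cluster: int        # simple attribute of the form X.B

@dataclass
class Expression:
    gene: Gene          # reference slot, like a foreign key to Gene
    array: Array        # reference slot, like a foreign key to Array
    level: float        # simple attribute of the form X.B

g = Gene("g1", cluster=2)
a = Array("a1", cluster=0)
e = Expression(gene=g, array=a, level=1.3)

# An attribute of the form X.R.C: follow the reference slot, then read the attribute.
gene_cluster_of_e = e.gene.cluster
```

Following `e.gene.cluster` mirrors the X.R.C notation: the expression object reaches an attribute of the gene it refers to.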
PRMs (Contd.)
Attributes of objects - Random Variables
Given the above, a PRM is defined by
A class-level dependency structure S
The parameter set θ_S for the resultant Conditional Probability Distributions (CPDs)
The PRM is only a class-level template - Gets
instantiated for each object
A Sample PRM
Figure Source : [FGKP99]
PRM for Gene Expression
[Figure: PRM over the classes Gene, Array and Expression, with node labels GCluster, Phase, AAM, ACluster and Level]
Figure Source : [STG+01]
Inferencing in PRMs
A Relational Skeleton is an instantiation of this schema
For Example : 1000 gene objects, 100 array objects
and 100,000 expression objects
Relational skeleton completely specifies the values for the reference slots
Objective
Given the PRM and a relational skeleton, with observed evidence regarding some
variables, update the probability distribution over
the rest of the variables
Inferencing in PRMs (Contd.)
Given a relational skeleton, a PRM induces a
Bayesian Network over all the random variables
Parents and CPDs of Bayes Net - Obtained from
class-level PRM
Bayesian Network Inferencing Algorithms are then
used for inference in the resultant network
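The unrolling step can be sketched as follows. The snippet assumes a class-level template in which Expression.Level depends on the cluster of its gene and the cluster of its array; that dependency structure, the tiny skeleton and the name `ground_network` are illustrative assumptions.

```python
# Hypothetical relational skeleton: the objects and their reference-slot values.
GENES = ["g1", "g2"]
ARRAYS = ["a1", "a2", "a3"]
EXPRESSIONS = [(g, a) for g in GENES for a in ARRAYS]   # (gene slot, array slot)

def ground_network(genes, arrays, expressions):
    """Map every ground random variable to the list of its parent variables."""
    parents = {}
    for g in genes:
        parents[("Gene.GCluster", g)] = []
    for a in arrays:
        parents[("Array.ACluster", a)] = []
    for g, a in expressions:
        # Assumed class-level rule, instantiated once per expression object.
        parents[("Expression.Level", (g, a))] = [
            ("Gene.GCluster", g),
            ("Array.ACluster", a),
        ]
    return parents

network = ground_network(GENES, ARRAYS, EXPRESSIONS)
```

The single class-level rule yields one Level variable per expression object, all sharing the same CPD, and the resulting dictionary describes an ordinary Bayesian network to which standard inference algorithms apply.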
Integrating Additional Sources of Data
DNA Sequence Information
Transcription Factors (TFs) - Proteins that bind
to specific DNA sequences in the promoter region
known as binding sites
TFs encourage or repress the start of transcription
Why is sequence information important?
Help in identifying TF binding sites
Two genes with similar expression profiles are most likely controlled by the same TFs
New features added
Base pairs of Promoter Sequence
Regulates variable g.R(t) for each TF t
PRM with Promoter Sequence Information
[Figure: PRM over the classes Gene, Array and Expression, with node labels S1, S2, S3, g.R(t1), g.R(t2), Phase, ACluster and Level]
Figure Source : [SBS+02]
Learning the Models
CPD Parameter Estimation
Expression.Level modeled using a Gaussian
CPD divides the expression values into k × l groups
Parameter set constitutes the mean and variance of each
group
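A minimal sketch of this parameter estimation, assuming the gene and array cluster of every measurement is observed: each (gene cluster, array cluster) group gets its own Gaussian, fitted by maximum likelihood. The observations and the name `fit_gaussian_cpd` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical rows of (gene cluster, array cluster, expression level).
OBSERVATIONS = [
    (0, 0, 1.0), (0, 0, 1.4),
    (0, 1, -0.5), (0, 1, -0.7),
    (1, 0, 0.2), (1, 0, 0.0),
    (1, 1, 2.0), (1, 1, 2.2),
]

def fit_gaussian_cpd(rows):
    """ML mean and variance of Expression.Level for each cluster pair."""
    groups = defaultdict(list)
    for gene_cluster, array_cluster, level in rows:
        groups[(gene_cluster, array_cluster)].append(level)
    # pvariance divides by n, which is the maximum-likelihood variance.
    return {key: (mean(values), pvariance(values))
            for key, values in groups.items()}

params = fit_gaussian_cpd(OBSERVATIONS)
```

With k = l = 2 this produces four (mean, variance) pairs. When the cluster attributes are hidden rather than observed, an iterative procedure is needed instead of this one-pass fit.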
CPD Structure Learning
Scoring Function - measure of goodness of a structure relative to data
Search Algorithm - finding the structure with highest score
Bayesian Score as scoring function - Posterior of structure
given data P(S | D)
Greedy local structure search used for search algorithm
PRMs for Gene Expression : Conclusion
Templates for directed graphical models over relational data
PRMs can be applied to relational data spread across
multiple tables
Capable of learning unified models integrating sequence information, expression data and annotation
data
Can easily accommodate additional information related to domain
Web Mining
Collective Web Page Classification [CDI98]
Class of neighbouring pages (in Web Graph) usually
correlated.
Construct a directed graphical model based on the
web graph.
Nodes - Random Variables for the category of each page
Given an assignment of categories for some nodes :
Run inferencing on the above graphical model
Find the Most Probable Explanation for the rest
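The last two steps can be sketched on a toy web graph. The snippet assumes a small tree of hyperlinks, two categories, and a CPD under which a page shares the category of the page linking to it with probability 0.8; the graph, the numbers and the exhaustive search are illustrative assumptions, not the method of [CDI98].

```python
from itertools import product

CATEGORIES = ["sports", "news"]
# Hypothetical web graph as (linking page, linked page) pairs; p0 is the root.
LINKS = [("p0", "p1"), ("p0", "p2"), ("p2", "p3"), ("p2", "p4")]
OBSERVED = {"p0": "sports", "p3": "news", "p4": "news"}
HIDDEN = ["p1", "p2"]

def joint(assignment):
    """Joint probability of a full category assignment under the assumed CPDs."""
    prob = 1.0 / len(CATEGORIES)              # uniform prior on the root page
    for parent, child in LINKS:
        prob *= 0.8 if assignment[parent] == assignment[child] else 0.2
    return prob

def most_probable_explanation():
    """Try every labelling of the hidden pages and keep the most probable one."""
    best, best_prob = None, -1.0
    for values in product(CATEGORIES, repeat=len(HIDDEN)):
        candidate = dict(zip(HIDDEN, values))
        p = joint({**OBSERVED, **candidate})
        if p > best_prob:
            best, best_prob = candidate, p
    return best

mpe = most_probable_explanation()
```

Page p2 is labelled "news" even though its parent is "sports", because its two observed children outweigh the single parent. The exhaustive search is exponential in the number of unlabelled pages, so a realistic web graph would need one of the inference algorithms mentioned earlier.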
Summary
Graphical Models - A natural formalism for modeling multiple correlated random variables
Allows integration of domain knowledge in the form
of dependency structures
Techniques especially useful when data spread
across multiple tables
Allows easy integration of new additional information
Thanks!
References
[NLD99] Nir Friedman, Lise Getoor, Daphne Koller and Avi Pfeffer, Learning Probabilistic Relational Models, In Proceedings of IJCAI 1999, pages 1300-1309, 1999.
[CDI98] Soumen Chakrabarti, Byron E. Dom and Piotr Indyk, Enhanced hypertext categorization using hyperlinks, In Proceedings of SIGMOD-98, ACM International Conference on Management of Data, pages 307-318, 1998.
[Chi02] David Maxwell Chickering, The WinMine Toolkit, Microsoft, MSR-TR-2002-103, 2002, Redmond, WA.
[Col02] Michael Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, In Proceedings of EMNLP 2002, pages 1-8, 2002.
[Fri00] Friedman N., Linial M., Nachman I. and Pe'er D., Using Bayesian Networks to Analyze Expression Data, Journal of Computational Biology, vol. 7, pages 601-620, 2000.
[GS04] Shantanu Godbole and Sunita Sarawagi, Discriminative Methods for Multi-labeled Classification, In Proceedings of PAKDD 2004, 2004.
[LG03] Qing Lu and Lise Getoor, Link-based Classification, In Proceedings of ICML 2003, page 496, August 2003.
[Mur01] Kevin P. Murphy, The Bayes Net Toolbox for MATLAB, Journal of Computing Science and Statistics, vol. 33, 2001.
[FGKP99] Nir Friedman, Lise Getoor, Daphne Koller and Avi Pfeffer, Learning Probabilistic Relational Models, In Proceedings of IJCAI, pages 1300-1309, 1999.
[STG+01] E. Segal, B. Taskar, A. Gasch, N. Friedman and D. Koller, Rich probabilistic models for gene expression, Bioinformatics, vol. 17, pages S243-S252, 2001.
[SBS+02] E. Segal, Y. Barash, I. Simon, N. Friedman and D. Koller, From promoter sequence to expression: A probabilistic framework, In Proceedings of RECOMB, 2002.
[RN95] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, 1995.
[MWJ99] Kevin P. Murphy, Yair Weiss and Michael I. Jordan, Loopy belief propagation for approximate inference: An empirical study, In Proceedings of UAI 99, pages 467-475, 1999.
[JP98] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, 1988.