Mining Social Network Graphs
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
November 13, 17, 2014

Social Network
No introduction required
Really? We still need to understand a few properties
Disclaimer: the brand logos are used here entirely for educational purposes

Social Network
§ A collection of entities
– Typically people, but could be something else too
§ At least one relationship between entities of the network
– For example: friends
– Sometimes boolean: two people are either friends or they are not
– May have a degree
– Discrete degree: friends, family, acquaintances, or none
– Real-valued degree: the fraction of the average day that two people spend talking to each other
§ An assumption of nonrandomness or locality
– Hard to formalize
– Intuition: that relationships tend to cluster
– If entity A is related to both B and C, then the probability that B and C
are related is higher than average (random)

Social Network as a Graph
[Figure: a graph with a boolean (friends) relationship on nodes A, B, C, D, E, F, G and edges AB, AC, BC, BD, DE, DF, DG, EF, FG]
§ Check for the non-randomness criterion
§ In a random graph (V,E) of 7 nodes and 9 edges, if XY is an edge and YZ is an edge, what is the probability that XZ is an edge?
– For a large random graph, it would be close to $|E| / \binom{|V|}{2} = 9/21 \approx 0.43$
– Small graph: XY and YZ are already edges, so compute within the remaining edges and node pairs
– So the probability is $(|E| - 2) / (\binom{|V|}{2} - 2) = 7/19 \approx 0.37$
§ Now let’s compute what is the probability for this graph in particular
Example courtesy: Leskovec, Rajaraman and Ullman
[Figure: the same graph with a boolean (friends) relationship, annotated “Does have locality property”]
§ For each node X, list every pair YZ of neighbors of X and check whether YZ is an edge
§ Example: if X = A, the only pair is YZ = BC, and it is an edge

X      Pairs YZ                  Yes/Total
A      BC                        1/1
B      AC, AD, CD                1/3
C      AB                        1/1
D      BE, BG, BF, EF, EG, FG    2/6
E      DF                        1/1
F      DE, DG, EG                2/3
G      DF                        1/1
Total                            9/16 ≈ 0.56
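
The locality check can be reproduced with a short script. This is a minimal sketch, assuming the edge list AB, AC, BC, BD, DE, DF, DG, EF, FG (read off the neighbor pairs in the table); the names `EDGES`, `build_adjacency` and `locality_counts` are illustrative and not part of the slides.

```python
from itertools import combinations

# Assumed edge list of the 7-node example graph (derived from the table).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def locality_counts(edges):
    """Count the neighbor pairs {Y, Z} of every X and how many are edges."""
    adj = build_adjacency(edges)
    yes = total = 0
    for x in sorted(adj):
        for y, z in combinations(sorted(adj[x]), 2):
            total += 1
            if z in adj[y]:
                yes += 1
    return yes, total


yes, total = locality_counts(EDGES)
n, e = 7, len(EDGES)
# Small random graph baseline: (|E| - 2) / (C(|V|, 2) - 2)
random_baseline = (e - 2) / (n * (n - 1) // 2 - 2)
print(f"observed {yes}/{total} = {yes / total:.2f}, random baseline {random_baseline:.2f}")
```

The observed fraction (9/16) is well above the random baseline (7/19), which is what the locality property predicts.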

Types of Social (or Professional) Networks
§ Of course, the “social network” itself, but also several other types
§ Telephone network
– Nodes are phone numbers
– AB is an edge if A and B talked over the phone within the last week, or month, or ever
– Edges could be weighted by the number of calls made, or by the total time of conversation
§ Email network: nodes are email addresses
– AB is an edge if A and B sent mails to each other within the last week, or month, or ever
– Requiring mail in both directions matters: one-directional edges would give spammers an edge to every address they mail
– Edges could be weighted
§ Other networks: collaboration network, where nodes are authors of papers and two authors share an edge if they have jointly written a paper
§ These networks also exhibit the locality property

Clustering of Social Network Graphs
§ Locality property → there are clusters
§ Clusters are communities
– People of the same institute, or company
– People in a photography club
– Any set of people with something in common
§ Need to define a distance between points (nodes)
§ In graphs with weighted edges, different distances exist
§ For graphs with a “friends” or “not friends” relationship
– Distance is 0 (friends) or 1 (not friends)
– Or 1 (friends) and infinity (not friends)
– Both of these violate the triangle inequality: if XY and YZ are edges but XZ is not, then d(X,Z) > d(X,Y) + d(Y,Z)
– Fix the triangle inequality: distance = 1 (friends) and 1.5 or 2 (not friends), or the length of the shortest path
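
The claims about the triangle inequality can be checked by brute force over all triples of nodes. This is a minimal sketch, assuming the 7-node example graph with edges AB, AC, BC, BD, DE, DF, DG, EF, FG; the helper names (`two_valued`, `shortest_path`, `satisfies_triangle_inequality`) are illustrative.

```python
from collections import deque
from itertools import permutations

# Assumed edge list of the 7-node example graph.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
NODES = sorted({v for edge in EDGES for v in edge})
ADJ = {v: set() for v in NODES}
for u, v in EDGES:
    ADJ[u].add(v)
    ADJ[v].add(u)


def two_valued(friend, stranger):
    """Distance that only depends on whether x and y share an edge."""
    return lambda x, y: friend if y in ADJ[x] else stranger


def shortest_path(x, y):
    """Number of edges on a shortest path from x to y, found by BFS."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for w in ADJ[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist.get(y, float("inf"))


def satisfies_triangle_inequality(d):
    """Check d(x, z) <= d(x, y) + d(y, z) for all distinct x, y, z."""
    return all(d(x, z) <= d(x, y) + d(y, z)
               for x, y, z in permutations(NODES, 3))


candidates = {
    "0 / 1": two_valued(0, 1),
    "1 / infinity": two_valued(1, float("inf")),
    "1 / 1.5": two_valued(1, 1.5),
    "1 / 2": two_valued(1, 2),
    "shortest path": shortest_path,
}
results = {name: satisfies_triangle_inequality(d)
           for name, d in candidates.items()}
for name, ok in results.items():
    print(f"{name}: {'holds' if ok else 'violated'}")
```

On this graph the triple (A, B, D) already breaks the first two choices, since AB and BD are edges but AD is not.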

Traditional Clustering
[Figure: the example graph on nodes A–G]
§ Intuitively, two communities: {A, B, C} and {D, E, F, G}
§ Traditional clustering depends on the distance
– Likely to put two nodes with small distance in the same cluster
– Social network graphs have cross-community edges
– Severe merging of communities is likely
§ May join B and D (and hence the two communities) with a probability that is not low

Betweenness of an Edge
[Figure: the example graph on nodes A–G]
§ Betweenness of an edge AB: the number of pairs of nodes (X,Y) such that AB lies on the shortest path between X and Y
– There can be more than one shortest path between X and Y
– In that case, credit AB with the fraction of those shortest paths that include the edge AB
§ What does a high betweenness score mean?
– The edge runs “between” two communities
§ Betweenness gives a better measure
– Edges such as BD get a higher score than edges such as AB
§ Not a distance measure; it may not satisfy the triangle inequality, but that does not matter

The Girvan–Newman Algorithm
Calculate betweenness of edges
§ Step 1 – BFS: Start at a node X, perform a BFS with X as root
§ Observe: level of node Y = length of shortest path from X to Y
§ Edges between levels are called “DAG” edges
– Each DAG edge is part of at least one shortest path from X
§ Step 2 – Labeling: Label each node Y by the number of shortest paths from X to Y
[Figure: BFS from root E on the example graph. Level 1: D, F. Level 2: B, G. Level 3: A, C. Labels: E = 1, D = 1, F = 1, B = 1, G = 2, A = 1, C = 1.]
Step 3 – credit sharing:
§ Each leaf node gets credit 1
§ Each non-leaf node gets 1 + sum(credits of the DAG edges to the level below)
§ Credit of DAG edges: Let $Y_i$ ($i = 1, \dots, k$) be the parents of $Z$, and $p_i = \mathrm{label}(Y_i)$
$$\mathrm{credit}(Y_i, Z) = \frac{\mathrm{credit}(Z) \times p_i}{p_1 + \cdots + p_k}$$
§ Intuition: a DAG edge $Y_i Z$ gets the share of the credit of Z proportional to the number of shortest paths from X to Z going through $Y_i Z$
Finally: Repeat Steps 1, 2 and 3 with each node as root. For each edge, betweenness = (sum of credits obtained in all iterations) / 2
[Figure: credit sharing for the BFS from root E. Node credits: A = 1, C = 1, G = 1, B = 3, F = 1.5, D = 4.5. Edge credits: BA = 1, BC = 1, DB = 3, DG = 0.5, FG = 0.5, ED = 4.5, EF = 1.5.]
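
Steps 1 to 3 and the final halving can be sketched as follows. This is one possible reading of the steps, not a reference implementation; the edge list of the example graph and the function names `credits_from_root` and `edge_betweenness` are assumptions made for illustration.

```python
from collections import deque

# Assumed edge list of the 7-node example graph.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def credits_from_root(adj, root):
    """Run Steps 1-3 for one root; return {frozenset({u, v}): credit}."""
    # Step 1 (BFS): level[y] = length of a shortest path from root to y.
    # Step 2 (labeling): label[y] = number of shortest paths from root to y.
    level, label, order = {root: 0}, {root: 1}, []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                label[w] = 0
                queue.append(w)
            if level[w] == level[u] + 1:  # (u, w) is a DAG edge
                label[w] += label[u]
    # Step 3 (credit sharing): walk the BFS order backwards, deepest first.
    node_credit = {v: 1.0 for v in order}
    edge_credit = {}
    for z in reversed(order):
        parents = [y for y in adj[z] if level[y] == level[z] - 1]
        label_sum = sum(label[y] for y in parents)
        for y in parents:
            share = node_credit[z] * label[y] / label_sum
            edge_credit[frozenset((y, z))] = share
            node_credit[y] += share
    return edge_credit


def edge_betweenness(edges):
    """Sum the credits over all roots, then halve the totals."""
    adj = build_adjacency(edges)
    total = {}
    for root in adj:
        for edge, credit in credits_from_root(adj, root).items():
            total[edge] = total.get(edge, 0.0) + credit
    # Each shortest path is seen once from each of its two endpoints.
    return {edge: credit / 2 for edge, credit in total.items()}


betweenness = edge_betweenness(EDGES)
for edge in sorted(betweenness, key=lambda e: -betweenness[e]):
    print("".join(sorted(edge)), betweenness[edge])
```

With root E, `credits_from_root` reproduces the edge credits shown in the figure (for example 4.5 for ED and 0.5 for DG). Over all roots, the bridge BD ends up with the highest betweenness.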

Computation in practice
§ Complexity: n nodes, e edges
– BFS starting at one node: O(e)
– Do it for n nodes
– Total: O(ne) time
– Very expensive for large graphs
§ Method in practice
– Choose a random subset W of the nodes
– Compute credit of each edge starting at each node in W
– Sum and compute betweenness
– A reasonable approximation
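
The sampling idea can be sketched as below. The credit computation follows Steps 1 to 3 of the Girvan–Newman algorithm. The rescaling factor n / (2|W|) is an assumption added here so that the estimate is in the same units as the exact betweenness; the slides only say to sum the credits. The example edge list and all function names are illustrative.

```python
import random
from collections import deque

# Assumed edge list of the 7-node example graph.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def credits_from_root(adj, root):
    """BFS, labeling and credit sharing for a single root."""
    level, label, order = {root: 0}, {root: 1}, []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                label[w] = 0
                queue.append(w)
            if level[w] == level[u] + 1:  # DAG edge
                label[w] += label[u]
    node_credit = {v: 1.0 for v in order}
    edge_credit = {}
    for z in reversed(order):
        parents = [y for y in adj[z] if level[y] == level[z] - 1]
        label_sum = sum(label[y] for y in parents)
        for y in parents:
            share = node_credit[z] * label[y] / label_sum
            edge_credit[frozenset((y, z))] = share
            node_credit[y] += share
    return edge_credit


def approximate_betweenness(edges, sample_size, seed=0):
    """Estimate betweenness from a random subset W of the roots."""
    adj = build_adjacency(edges)
    nodes = sorted(adj)
    roots = random.Random(seed).sample(nodes, sample_size)  # the subset W
    scale = len(nodes) / (2 * sample_size)  # assumed rescaling, see lead-in
    estimate = {}
    for root in roots:
        for edge, credit in credits_from_root(adj, root).items():
            estimate[edge] = estimate.get(edge, 0.0) + credit * scale
    # Edges that are never a DAG edge for a sampled root are absent.
    return estimate


full = approximate_betweenness(EDGES, sample_size=7)
sampled = approximate_betweenness(EDGES, sample_size=3)
print(len(full), len(sampled))
```

When W contains every node, the estimate coincides with the exact betweenness; smaller samples trade accuracy for an O(|W|e) running time.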

Finding Communities using Betweenness
Method 1:
§ Keep adding edges (among existing ones) starting from lowest betweenness
§ Gradually join small components to build large connected components
Method 2:
§ Start from all existing edges. The graph may look like one big component.
§ Keep removing edges starting from highest betweenness
§ Gradually split large components to arrive at communities
At some point, removing the edge with highest betweenness would split the graph into separate components
§ For a fixed threshold of betweenness, both methods would ultimately produce the same clustering
§ However, a suitable threshold is not known beforehand
§ Method 1 vs Method 2
– Method 2 is likely to take fewer operations. Why?
– There are fewer inter-community edges than intra-community edges
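
The equivalence of the two methods for a fixed threshold can be illustrated with a small sketch. The betweenness values in `BETWEENNESS` are an assumption: they were worked out by hand for the 7-node example graph from the definition of betweenness and are not listed on the slides. The function names are illustrative.

```python
# Assumed betweenness of each edge of the example graph (worked out by hand).
BETWEENNESS = {
    ("A", "B"): 5.0, ("A", "C"): 1.0, ("B", "C"): 5.0, ("B", "D"): 12.0,
    ("D", "E"): 4.5, ("D", "F"): 4.0, ("D", "G"): 4.5,
    ("E", "F"): 1.5, ("F", "G"): 1.5,
}


def components(nodes, edges):
    """Connected components of (nodes, edges), computed with union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return {frozenset(group) for group in groups.values()}


def method_1(betweenness, threshold):
    """Add edges from lowest betweenness upward, stopping at the threshold."""
    nodes = {v for edge in betweenness for v in edge}
    kept = []
    for edge in sorted(betweenness, key=betweenness.get):
        if betweenness[edge] >= threshold:
            break
        kept.append(edge)
    return components(nodes, kept)


def method_2(betweenness, threshold):
    """Remove edges from highest betweenness downward, down to the threshold."""
    nodes = {v for edge in betweenness for v in edge}
    remaining = set(betweenness)
    for edge in sorted(betweenness, key=betweenness.get, reverse=True):
        if betweenness[edge] < threshold:
            break
        remaining.discard(edge)
    return components(nodes, remaining)


clusters = method_2(BETWEENNESS, threshold=12)
print(sorted("".join(sorted(c)) for c in clusters))
```

With a threshold of 12, Method 2 removes a single edge (BD) while Method 1 has to add the other eight, which illustrates why Method 2 tends to need fewer operations.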

Triangles in Social Network Graph
§ The number of triangles in a social network graph is expected to be much larger than in a random graph of the same size
– The locality property
§ Counting the number of triangles tells us
– How much the graph looks like a social network
– Age of community
• A new community forms
• Members bring in their like-minded friends
• Such new members are expected to eventually connect to other members directly

Triangle Counting Algorithm
Graph (V, E); |V| = n, |E| = m
§ Step 1: Compute degree of each node
– Examine each edge
– Add degree 1 to each of the two nodes
– Takes O(m) time
§ Step 2: A hash table (vi,vj) → 1
– So that, given two nodes, we can determine if they have an edge between them
– Construction takes O(m) time
– Each query takes expected O(1) time, with a proper hash function
§ Step 3: An index v → list of nodes adjacent to v
– Construction takes O(m) time, querying takes O(1) time

Counting Heavy Hitter Triangles
§ Heavy hitter node: a node with degree ≥ √m
§ Note: there are at most 2√m heavy hitter nodes
– More than 2√m such nodes would give a total degree > 2m, but the total degree is exactly 2m because |E| = m
§ Heavy hitter triangle: a triangle whose 3 nodes are all heavy hitters
§ Number of possible heavy hitter triangles: at most $\binom{2\sqrt{m}}{3} = O(m^{3/2})$
§ For each possible triangle, use the hash table (step 2) to check if all three edges exist
§ Takes $O(m^{3/2})$ time

Counting other Triangles
§ Consider an ordering of nodes: vi << vj if
– Either degree(vi) < degree(vj), or
– degree(vi) = degree(vj) and i < j
§ For each edge (vi,vj)
– If both nodes are heavy hitters, skip (already done)
– Suppose vi is not a heavy hitter
– Find nodes w1,w2,…,wk which are adjacent to vi (using the node → adjacent nodes index, step 3) [Takes O(k) time]
– For each wl, l = 1, …, k, check if edge vjwl exists in O(1) time; total O(k) time
– Count the triangle {vi, vj, wl} if and only if
• Edge vjwl exists
• Also vi << wl
– Total time for each edge (vi,vj) is O(√m), since k < √m when vi is not a heavy hitter
– There are m edges, so the total time is $O(m^{3/2})$
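
The whole triangle counting procedure (degrees, edge hash table, adjacency index, heavy hitter triangles, then the remaining triangles) can be sketched as follows. Two points are assumptions of this sketch rather than statements from the slides: each edge is oriented so that vi << vj, and a triangle is counted only when vj << wl as well (which implies vi << wl), so that a triangle is not counted once per edge at vi. Node names stand in for the indices i, j when breaking ties.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of node pairs."""
    m = len(edges)
    # Step 1: degree of each node, O(m).
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Step 2: hash table on edges, expected O(1) per existence query.
    edge_set = {frozenset(edge) for edge in edges}
    # Step 3: index from each node to its adjacent nodes.
    adjacent = {v: [] for v in degree}
    for u, v in edges:
        adjacent[u].append(v)
        adjacent[v].append(u)

    def has_edge(u, v):
        return frozenset((u, v)) in edge_set

    def rank(v):
        # The ordering <<: by degree, ties broken by node name.
        return (degree[v], v)

    # Heavy hitters: degree >= sqrt(m), tested without floating point.
    heavy = {v for v in degree if degree[v] ** 2 >= m}

    count = 0
    # Heavy hitter triangles: test every triple of heavy hitters.
    for a, b, c in combinations(sorted(heavy), 3):
        if has_edge(a, b) and has_edge(b, c) and has_edge(a, c):
            count += 1
    # Other triangles: found from the edge joining their two lowest nodes.
    for u, v in edges:
        if u in heavy and v in heavy:
            continue
        vi, vj = sorted((u, v), key=rank)  # vi << vj, so vi is not heavy
        for w in adjacent[vi]:
            # Assumed tie-break: require vj << w (hence vi << w as well).
            if has_edge(vj, w) and rank(vj) < rank(w):
                count += 1
    return count


EXAMPLE = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
           ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
print(count_triangles(EXAMPLE))
```

On the example graph (assumed edge list above) the heavy hitters are B, D and F, which do not form a triangle, so all triangles come from the second phase.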

Optimality
Worst case scenario
§ If G is a complete graph on n nodes, then $m = \binom{n}{2}$
§ Number of triangles = $\binom{n}{3}$, which is of order $m^{3/2}$
§ Cannot even enumerate all triangles in less than $O(m^{3/2})$ time
§ Hence it is the lower bound for computing all triangles
If G is sparse
§ Consider a complete graph G’ with n nodes, m edges
§ Note that $m = \binom{n}{2} = O(n^2)$
§ Construct G from G’ by adding a chain of length $n^2$
§ The number of triangles remains the same, $O(m^{3/2})$
§ The number of edges remains of the same order, O(m)
§ G is quite sparse: the chain lowers the edge to node ratio
§ Still cannot compute the triangles in less than $O(m^{3/2})$ time
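
The orders of magnitude above can be made explicit with a short calculation. This is a sketch for large n; for the sparse case it assumes that the chain contributes about $n^2$ nodes and about $n^2$ edges, and it writes T for the number of triangles.

```latex
\begin{align*}
\text{Complete graph } G' :\quad
  m &= \binom{n}{2} = \frac{n(n-1)}{2} \approx \frac{n^2}{2}, \qquad
  T = \binom{n}{3} = \frac{n(n-1)(n-2)}{6} \approx \frac{n^3}{6}, \\
  m^{3/2} &\approx \frac{n^3}{2\sqrt{2}}
  \quad\Longrightarrow\quad
  \frac{T}{m^{3/2}} \to \frac{\sqrt{2}}{3}, \text{ so } T \text{ is of order } m^{3/2}. \\[4pt]
\text{Sparse graph } G :\quad
  |V_G| &\approx n + n^2, \qquad
  |E_G| \approx \frac{n^2}{2} + n^2 = \frac{3n^2}{2} \approx 3m, \qquad
  \frac{|E_G|}{|V_G|} \approx \frac{3}{2}, \\
  T_G &= T \approx \frac{n^3}{6}, \text{ which is still of order } |E_G|^{3/2}.
\end{align*}
```

So the edge to node ratio drops from about n/2 in G’ to a constant in G, while the number of triangles stays of order $m^{3/2}$.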

Directed Graphs in (Social) Networks
§ Set of nodes V and directed edges (arcs) u → v
§ The web: pages link to other pages
§ Phone calls: persons make calls to other persons
§ Twitter, Google+: people follow other people
§ All undirected graphs can be considered as directed
– Think of each edge as bidirectional

Paths and Neighborhoods
§ Path of length k: a sequence of nodes $v_0, v_1, \dots, v_k$ from $v_0$ to $v_k$ so that $v_i \to v_{i+1}$ is an arc for i = 0, …, k – 1
§ Neighborhood N(v,d) of radius d for a node v: set of all nodes
w such that there is a path from v to w of length ≤ d
§ For a set of nodes V, N(V,d):= {w | there is a path of length ≤ d
from some v in V to w}
§ Neighborhood profile of a node v: sequence of sizes of its
neighborhoods of radius d = 1, 2, …; that is
|N(v,1)|, |N(v,2)|, |N(v,3)|, …

Neighborhood Profile
[Figure: the example graph on nodes A–G]
Neighborhood profile of B:
|N(B,1)| = 4
|N(B,2)| = 7
Neighborhood profile of A:
|N(A,1)| = 3
|N(A,2)| = 4
|N(A,3)| = 7
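
The two profiles can be recomputed with a breadth-first expansion. This is a minimal sketch, assuming the example graph with edges AB, AC, BC, BD, DE, DF, DG, EF, FG, each treated as two arcs; `build_arcs` and `neighborhood_profile` are illustrative names. Note that N(v,d) contains v itself, because a path of length 0 reaches v.

```python
# Assumed edge list of the 7-node example graph.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_arcs(edges):
    """Treat each undirected edge as two arcs, one in each direction."""
    out = {}
    for u, v in edges:
        out.setdefault(u, set()).add(v)
        out.setdefault(v, set()).add(u)
    return out


def neighborhood_profile(out, v):
    """Return [|N(v,1)|, |N(v,2)|, ...] until the neighborhood stops growing."""
    reached = {v}   # N(v, 0): only v itself
    frontier = {v}  # nodes first reached at the current radius
    profile = []
    while frontier:
        frontier = {w for u in frontier for w in out.get(u, ())} - reached
        if frontier:
            reached |= frontier
            profile.append(len(reached))
    return profile


arcs = build_arcs(EDGES)
print("B:", neighborhood_profile(arcs, "B"))
print("A:", neighborhood_profile(arcs, "A"))
```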

Diameter of a Graph
§ Diameter of a graph G(V,E): the smallest integer d
such that for any two nodes v, w in V, there is a path
of length at most d from v to w
– Only makes sense for strongly connected graphs, where any node can be reached from any node
§ The web graph: not strongly connected
– But there is a large strongly connected component
§ The six degrees of separation conjecture
– The diameter of the graph of the people in the world is six

Diameter and Neighborhood Profile
§ Neighborhood profile of a node v: |N(v,1)|, |N(v,2)|, |N(v,3)|, …, where |N(v,k)| = |V| for some k
§ Denote the smallest such k as d(v)
§ If G is a complete graph, d(v) = 1
§ Diameter of G is the maximum of d(v) over all nodes v
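
The relation between d(v) and the diameter can be sketched in a few lines. The example edge list and the names `cover_radius` and `diameter` are assumptions for illustration; the sketch expects the arc index to contain a key for every node and returns `None` when the graph is not strongly connected.

```python
# Assumed edge list of the 7-node example graph.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_arcs(edges):
    """Treat each undirected edge as two arcs, one in each direction."""
    out = {}
    for u, v in edges:
        out.setdefault(u, set()).add(v)
        out.setdefault(v, set()).add(u)
    return out


def cover_radius(out, v):
    """d(v): smallest k with |N(v,k)| = |V|, or None if no such k exists."""
    reached, frontier, k = {v}, {v}, 0
    while frontier and len(reached) < len(out):
        frontier = {w for u in frontier for w in out[u]} - reached
        reached |= frontier
        k += 1
    return k if len(reached) == len(out) else None


def diameter(out):
    """Maximum of d(v) over all nodes, or None if some d(v) is undefined."""
    radii = [cover_radius(out, v) for v in out]
    return None if None in radii else max(radii)


arcs = build_arcs(EDGES)
print({v: cover_radius(arcs, v) for v in sorted(arcs)}, diameter(arcs))
```

Each call to `cover_radius` is one BFS, so this direct approach costs one BFS per node.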

Reference
§ Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman, Chapter 10