Social Network Analysis
Link Prediction
What is Link Prediction?
❑The problem of predicting the existence of a link between two entities in a network

❑Involves several research communities, ranging from statistics and network science to machine learning and data mining

❑Helps predict the state of a dynamic network at a future timestamp
Application Areas
Online Social Networks
◦ Recommend friends to connect
◦ Suggest users/pages to follow
◦ Predict new links
E-commerce
◦ Recommend products/services
Network Reconstruction
◦ Remove unauthentic edges
◦ Predict missing links
Citation Networks
◦ Predict missing citations
◦ Predict future collaboration
Bioinformatics/Biology
◦ Predict protein-protein interactions
◦ Infer interactions between drugs and targets
Link Prediction in Recommender Systems
Users and items form a bipartite graph

Predict links between users and items

Motivation
Understanding how social networks evolve

The link prediction problem

Given a snapshot of a social network at time t, we seek to accurately predict the edges that will be
added to the network during the interval (t, t’)

Temporal Changes in a Network
$G_{t_0}(V, E)$: topology at time $t = t_0$;   $G_{t_1}(V', E')$: topology at time $t = t_1$ ($t_1 > t_0$)

❑Case I
❑ new nodes are added, but they do not form any link

❑Case II
❑ new nodes join and form new connections

❑Case III
❑ no new nodes join, but some new edges are formed

❑Case IV
❑ some existing edges are removed, but their endpoints are retained

❑Case V
❑ some existing nodes and edges are removed
Link Prediction: Problem Definition
Given snapshots of a network $G_{t_0}(V, E)$ at time $t = t_0$ and $G_{t_i}(V, E')$ at time $t = t_i$ ($t_i > t_0$),
the set $E' \setminus E$ contains the edges that joined the network during the interval $(t_0, t_i)$. The task of
link prediction is then defined as the prediction of the edge set $E' \setminus E$ at time $t = t_0$.

Alternatively,

the problem of link prediction can be stated as the task of determining the likelihood that any
two nodes that are not connected at time $t = t_0$ will be connected at time $t = t_i$ ($t_i > t_0$).
Link Prediction: Heuristic Models
❑The idea of link prediction is
❑to connect nodes that share some similarities but are not yet linked

❑Heuristic measures of structural similarity


❑Local Heuristics
❑Global Heuristics
Link Prediction: Local Heuristic
❑$G(V, E)$: an undirected dynamic network

❑Three nodes $x, y, z \in V$ such that, at the current time instance


▪ $(x, z) \in E$, $(y, z) \in E$
▪ $(x, y) \notin E$
▪ The task: decide whether the link $(x, y)$ will form in the near future

❑Some local structural-similarity-based heuristics for the above


▪ Common Neighbourhood
▪ Jaccard Similarity
▪ Preferential Attachment
▪ Adamic Adar
▪ Salton Index
Local Heuristic: Common Neighborhood
❑Triadic closure property
❑By virtue of the common friend z, x and y are highly likely to become friends in the future

❑Common Neighborhood score between two randomly selected nodes $x$ and $y$:

$$S_{CN}(x, y) = |\Gamma(x) \cap \Gamma(y)|$$

where $\Gamma(v)$ is the neighbourhood set of node $v$

❑The higher the number of common neighbours, the more likely the nodes are to be linked in the future

❑Example: $S_{CN}(A, C) = |\{B, D, E, F\} \cap \{D, E, F\}| = 3$ (a sketch follows below)
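A minimal sketch of the score, assuming a small stand-in graph that reproduces the neighbour sets of the example (the slide's figure for $G_1$ is not shown here); networkx also offers nx.common_neighbors directly.

```python
import networkx as nx

# Stand-in graph reproducing the example's neighbour sets:
# Γ(A) = {B, D, E, F}, Γ(C) = {D, E, F}
G1 = nx.Graph([("A", "B"), ("A", "D"), ("A", "E"), ("A", "F"),
               ("C", "D"), ("C", "E"), ("C", "F")])

def common_neighbours_score(G, x, y):
    """S_CN(x, y) = |Γ(x) ∩ Γ(y)|."""
    return len(set(G.neighbors(x)) & set(G.neighbors(y)))

print(common_neighbours_score(G1, "A", "C"))  # 3
```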
Local Heuristic: Jaccard Similarity
❑Normalized version of the common neighborhood score

❑Jaccard Similarity score between two randomly selected nodes $x$ and $y$:

$$S_{J}(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$$

❑The ratio of the number of common neighbors to the number of all neighbors of the two nodes

❑Example: $S_{J}(A, C) = \dfrac{|\{B, D, E, F\} \cap \{D, E, F\}|}{|\{B, D, E, F\} \cup \{D, E, F\}|} = \dfrac{3}{4}$ (see the sketch below)
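A minimal sketch on the same stand-in graph; nx.jaccard_coefficient provides a library version of this score.

```python
import networkx as nx

G1 = nx.Graph([("A", "B"), ("A", "D"), ("A", "E"), ("A", "F"),
               ("C", "D"), ("C", "E"), ("C", "F")])

def jaccard_score(G, x, y):
    """S_J(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|."""
    nbr_x, nbr_y = set(G.neighbors(x)), set(G.neighbors(y))
    return len(nbr_x & nbr_y) / len(nbr_x | nbr_y)

print(jaccard_score(G1, "A", "C"))  # 3 / 4 = 0.75
```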
Local Heuristic: Preferential Attachment
❑Derived from the concept of preferential attachment in scale-free networks

❑The likelihood of a node $x$ obtaining a new edge is proportional to $k_x$, the degree of the node

❑Preferential Attachment score between two randomly selected nodes $x$ and $y$:

$$S_{PA}(x, y) = k_x \times k_y$$

❑Future interaction between them depends on the existing degrees of the individual nodes

❑Example: in network $G_1$, $S_{PA}(A, C) = k_A \times k_C$ (a sketch follows below)
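A minimal sketch; the degrees come from whichever graph is supplied, and the small graph below is only an illustrative stand-in (the slide's $G_1$ is not reproduced), so its value is not the slide's.

```python
import networkx as nx

def preferential_attachment_score(G, x, y):
    """S_PA(x, y) = k_x * k_y."""
    return G.degree(x) * G.degree(y)

# Illustrative stand-in graph, not the slide's G1
G = nx.Graph([("A", "B"), ("A", "D"), ("A", "E"), ("A", "F"),
              ("C", "D"), ("C", "E"), ("C", "F")])
print(preferential_attachment_score(G, "A", "C"))  # 4 * 3 = 12 in this stand-in graph
```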


Local Heuristic: Adamic-Adar
❑Primary objective: shift the focus towards rare events

❑Assigns higher weights to less-connected common neighbours

❑Adamic-Adar metric between two randomly selected nodes $x$ and $y$:

$$S_{AA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}$$

❑Example: in network $G_1$, $S_{AA}(A, C) = \frac{1}{\log 3} + \frac{1}{\log 3} + \frac{1}{\log 5}$ (a sketch follows below)
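A minimal sketch using the natural logarithm; the graph below is an illustrative stand-in (with a few extra edges among the common neighbours), not the slide's $G_1$, so its value differs from the example.

```python
import math
import networkx as nx

def adamic_adar_score(G, x, y):
    """S_AA(x, y) = sum over z in Γ(x) ∩ Γ(y) of 1 / log(k_z)."""
    common = set(G.neighbors(x)) & set(G.neighbors(y))
    return sum(1.0 / math.log(G.degree(z)) for z in common)

# Illustrative stand-in graph; D, E, F are the common neighbours of A and C
G = nx.Graph([("A", "B"), ("A", "D"), ("A", "E"), ("A", "F"),
              ("C", "D"), ("C", "E"), ("C", "F"),
              ("D", "E"), ("E", "F")])
print(round(adamic_adar_score(G, "A", "C"), 3))  # 1/ln(3) + 1/ln(4) + 1/ln(3) ≈ 2.542
```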
Local Heuristic: Salton Index
❑Salton Index (the cosine similarity commonly used to measure the similarity between a pair of documents or embeddings in a vector space) between two randomly selected nodes $x$ and $y$:

$$S_{SI}(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{\sqrt{k_x \times k_y}}$$
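A minimal sketch on the same illustrative stand-in graph used above.

```python
import math
import networkx as nx

def salton_index(G, x, y):
    """S_SI(x, y) = |Γ(x) ∩ Γ(y)| / sqrt(k_x * k_y)."""
    common = set(G.neighbors(x)) & set(G.neighbors(y))
    return len(common) / math.sqrt(G.degree(x) * G.degree(y))

G = nx.Graph([("A", "B"), ("A", "D"), ("A", "E"), ("A", "F"),
              ("C", "D"), ("C", "E"), ("C", "F")])
print(round(salton_index(G, "A", "C"), 3))  # 3 / sqrt(4 * 3) ≈ 0.866
```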
Global Heuristic: Katz Score
❑Inspired by Katz centrality

❑Takes into account the influence of neighbors beyond 1 hop

❑However, the longer the path, the less likely the end nodes are to influence each other

❑Between two random nodes $x$ and $y$, the Katz score is given by

$$S_{KZ}(x, y) = \sum_{p=1}^{\infty} \alpha^{p} \cdot A^{p}_{x,y}$$

Here, $A^{p}_{x,y}$ is the number of paths of length $p$ that exist between $x$ and $y$,

and $\alpha$ is a damping factor that reduces the impact of longer paths (a truncated sketch follows below)
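A minimal sketch of a truncated Katz score, assuming the series is cut off at a maximum length p_max; powers of the adjacency matrix are used to count the walks of each length.

```python
import numpy as np
import networkx as nx

def katz_score(G, x, y, alpha=0.05, p_max=5):
    """S_KZ(x, y) ≈ sum over p = 1..p_max of alpha**p * A^p[x, y]."""
    nodes = list(G.nodes())
    idx = {v: i for i, v in enumerate(nodes)}
    A = nx.to_numpy_array(G, nodelist=nodes)
    score, A_p = 0.0, np.eye(len(nodes))
    for p in range(1, p_max + 1):
        A_p = A_p @ A                      # A_p now holds A**p (walk counts of length p)
        score += (alpha ** p) * A_p[idx[x], idx[y]]
    return score

G = nx.path_graph(4)                       # 0 - 1 - 2 - 3
print(katz_score(G, 0, 3))                 # dominated by the single length-3 walk: ≈ alpha**3
```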


Item-based Approach
◦ Similar to the user-based approach, but operates on the item side

Item-based Example: user 1, item 3

         Item 1   Item 2   Item 3   Item 4   Item 5
User 1     8        1        ?        2        7
User 2     2        ?        5        7        5
User 3     5        4        7        4        7
User 4     7        1        7        3        8
User 5     1        7        4        6        ?
User 6     8        3        ?        3        7
Find Similarity(item_i with item 3)

         Item 1   Item 2   Item 3   Item 4   Item 5
User 1     8        1        ?        2        7
User 2     2        ?        5        7        5
User 3     5        4        7        4        7
User 4     7        1        7        3        8
User 5     1        7        4        6        ?
User 6     8        3        ?        3        7
Similarity between Items

         Item 3   Item 4   Item 5
User 1     ?        2        7
User 2     5        7        5
User 3     7        4        7
User 4     7        3        8
User 5     4        6        ?
User 6     ?        3        7
Similarity between Items 3 and 5

         Item 3   Item 5
User 1     ?        7
User 2     5        5
User 3     7        7
User 4     7        8
User 5     4        ?
User 6     ?        7

• Only consider users who have rated both items
• For each user, calculate the difference in ratings for the two items
• Take the average of this difference over the users
• Pearson correlation coefficients can also be used, as in user-based approaches
Prediction: calculating r(user1, item3)

[Figure: the target Item 3 linked to user 1's rated items, Item 1 (rating 8), Item 2 (1), Item 4 (2), Item 5 (7)]

$$r(\text{user}_1, \text{item}_3) = \alpha \cdot \{\, r(\text{user}_1, \text{item}_1)\,\text{sim}(\text{item}_1, \text{item}_3) + r(\text{user}_1, \text{item}_2)\,\text{sim}(\text{item}_2, \text{item}_3) + r(\text{user}_1, \text{item}_4)\,\text{sim}(\text{item}_4, \text{item}_3) + r(\text{user}_1, \text{item}_5)\,\text{sim}(\text{item}_5, \text{item}_3) \,\}$$

where $\alpha$ is a normalization factor, $\alpha = 1 / \sum_i \text{sim}(\text{item}_i, \text{item}_3)$
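A minimal sketch of this prediction on the example rating matrix above. The slides define item similarity via the average rating difference over co-rating users; converting that distance into a similarity with 1 / (1 + avg_diff) is an assumption made here so the weighted sum behaves sensibly (Pearson correlation, also mentioned on the slides, would be an alternative).

```python
ratings = {                      # user -> {item: rating}; missing = unrated
    "user1": {"item1": 8, "item2": 1, "item4": 2, "item5": 7},
    "user2": {"item1": 2, "item3": 5, "item4": 7, "item5": 5},
    "user3": {"item1": 5, "item2": 4, "item3": 7, "item4": 4, "item5": 7},
    "user4": {"item1": 7, "item2": 1, "item3": 7, "item4": 3, "item5": 8},
    "user5": {"item1": 1, "item2": 7, "item3": 4, "item4": 6},
    "user6": {"item1": 8, "item2": 3, "item4": 3, "item5": 7},
}

def sim(a, b):
    """Similarity of items a and b from the average rating difference over
    users who rated both (assumed conversion: 1 / (1 + avg_diff))."""
    diffs = [abs(r[a] - r[b]) for r in ratings.values() if a in r and b in r]
    return 1.0 / (1.0 + sum(diffs) / len(diffs)) if diffs else 0.0

def predict(user, target):
    """r(user, target) = alpha * sum_i r(user, i) * sim(i, target),
    with alpha = 1 / sum_i sim(i, target), as on the slide."""
    rated = ratings[user]
    alpha = 1.0 / sum(sim(i, target) for i in rated)
    return alpha * sum(r * sim(i, target) for i, r in rated.items())

print(round(predict("user1", "item3"), 2))   # about 5.6 under these assumptions
```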
Link Prediction: Machine Learning
Link Prediction
Case I (task of inferring missing links)
◦ Only a single snapshot $G_{t_i}(V, E)$ of the network at timestamp $t = t_i$
◦ Split $E$ into disjoint sets $E_{train}$ and $E_{test}$
◦ To obtain the test set, delete edges from $E$ and add them to $E_{test}$
◦ Deletion strategies:
  ◦ Uniformly at random
  ◦ Based on the degrees of their endpoints

Case II (task of predicting future links)
◦ At least two snapshots of the network: $G_{t_i}(V, E)$ at time $t = t_i$ and $G_{t_j}(V, E')$ at time $t = t_j$ ($t_j > t_i$)
◦ Set $E_{train} = E$ and $E_{test} = E' \setminus E$
Link Prediction
✓Initial network: $G_{t_i}(V, E)$ at timestamp $t = t_i$

✓Set of all possible edges in the network: $U$

✓Obtain $E_{train}$ and $E_{test}$ following either of the cases mentioned earlier

✓Set of edges not formed up to timestamp $t = t_i$: $L = U \setminus E_{train}$

✓Convert the problem of link prediction into a binary classification problem

✓ Edges in $E_{test}$ form the positive samples
✓ Edges in the set $L \setminus E_{test}$ form the negative samples (a sampling sketch follows below)
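A minimal sketch of the sample construction for Case I, assuming a stand-in network; the held-out edges are the positives, and an equal number of pairs that never formed an edge (i.e. pairs in $U \setminus E = L \setminus E_{test}$) are sampled as negatives.

```python
import random
import networkx as nx

def make_samples(G, test_fraction=0.2, seed=0):
    """Hold out a fraction of edges as E_test (positives) and sample the same
    number of never-linked pairs of G as negatives."""
    rng = random.Random(seed)
    edges = list(G.edges())
    rng.shuffle(edges)
    n_test = int(test_fraction * len(edges))
    E_test = edges[:n_test]                                # positive samples
    G_train = G.copy()
    G_train.remove_edges_from(E_test)                      # E_train = E minus E_test
    negatives = rng.sample(list(nx.non_edges(G)), n_test)  # negative samples
    return G_train, E_test, negatives

G = nx.karate_club_graph()                                 # stand-in network
G_train, positives, negatives = make_samples(G)
print(len(positives), len(negatives))                      # 15 15
```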
Link Prediction using Supervised Learning Methods

[Figure: candidate node pairs (P1, P2, P3, …) pass through a feature extractor that produces labelled feature vectors, e.g. [1, 2, 0, …, 1] → +1 and [0, 0, 1, …, 1] → −1, which are then fed to a supervised learning algorithm]
Link Prediction – Datasets

Dataset    Number of papers   Number of authors
BIOBASE    831478             156561
DBLP       540459             1564617
Link Prediction
✓ Choose features that represent some form of proximity between the pair of vertices that represent a data point

✓ The definition of such features may vary from domain to domain for link prediction

✓ Coauthorship network: two authors are close (in the sense of a social network) to each other if their research work revolves around a larger set of identical keywords

✓ Terrorist network: two suspects can be close if they are experts in an identical set of dangerous skills
Link Prediction
✓Scientists x and y from the social network

✓P1: probability that x and y coauthor

✓Scientist z, from the same network, does multi-disciplinary research and has a rich set of connections in the community

✓P2: probability that x will coauthor with z

✓P2 is always higher than P1 if z is a prolific researcher

✓If either (or both) of the scientists is prolific, it is more likely that they will collaborate
Link Prediction: Features for ML

✓Individual measure: how prolific a particular scientist is

✓Individual feature: the number of different areas (s)he has worked on

✓Aggregated feature: summing the values of an individual measure over the pair of authors yields a meaningful pairwise feature

✓Terrorist network: the number of languages a suspect can speak/write

Link Prediction: Features for ML
✓Features that arise from the network topology

✓Applicable equally to all domains, since their values depend only on the structure of the network (topological features)

✓Link prediction: the most obvious feature is the shortest distance between the pair of nodes

✓The shorter the distance, the better the chance that they will collaborate

✓Other similar measures: number of common neighbors, Jaccard's coefficient, edge-disjoint k shortest distances, etc.
Link Prediction: Features for ML
Individual Features
Proximity Features

Keyword Match Count

✓Measures the proximity of a pair of nodes (authors)
✓List all the keywords that each individual author has introduced in their papers and take the intersection of the two sets
✓The larger the size of the intersection, the more likely they are to work in related areas
✓and hence to be a candidate future coauthor pair
Link Prediction: Features for ML
Aggregated Features
✓Authors with a higher paper count are more prolific

✓If either (or both) of the authors is (are) prolific, the probability that this pair will coauthor is higher than for a random pair of authors

✓Sum of Papers: add the number of papers that the pair of authors published in the training years

✓Not all authors appear in all the training years, so normalize the paper count of each author by the number of years in which they appeared
Link Prediction: Features for ML
Aggregated Features
Sum of Neighbors
✓ Represents the social connectivity of the pair of authors
✓ Obtained by adding the numbers of neighbors the two authors have

✓ Neighborhood is obtained from the coauthorship information

✓ Variant of this feature: weighted sum of neighbors, where the weights represent the number of publications that a node has with that specific neighbor
Link Prediction: Features for ML
Aggregated Features
Sum of Keyword Counts

✓In scientific publications, keywords play a vital role in representing the specific domain of work of researchers

✓Researchers who have a wide range of interests or who work on interdisciplinary research usually use more keywords

✓They have a better chance to collaborate with new researchers

✓Use the sum function to aggregate this attribute over the author pair
Link Prediction: Features for ML
Aggregated Features
Sum of Classification Codes

✓Research publications are categorized with code strings that organize related areas

✓A publication that has multiple codes is more likely to be an interdisciplinary work

✓Researchers in these areas usually have more collaborators

Link Prediction: Features for ML
Aggregated Features
Sum of log(Secondary Neighbors Count)

✓ Assume an author is directly connected to another author who is highly connected

✓The former has a better chance to coauthor with a distant node through the latter

✓The number of secondary neighbors in a social network usually grows exponentially, so use the logarithm

Edge-disjoint shortest distance: two paths are edge disjoint if they do not share any edge
Link Prediction: Features for ML
Topological Features
Shortest Distance

✓Kleinberg observed that in social networks most pairs of nodes are connected by very short paths

✓The smallest hop count is taken as the shortest distance between two nodes

✓Variant: k edge-disjoint shortest distances; each of these can be one feature

✓Weighted variant: the weight on an edge is the reciprocal of the number of papers the corresponding author pair has coauthored
Link Prediction: Features for ML
Topological Features
Clustering Index
✓A node that is locally dense is more likely to grow more edges than one located in a sparser neighborhood
✓The clustering index measures this localized density
✓Newman defines the clustering index as the fraction of pairs of a person's collaborators who have also collaborated with one another
✓For a node u of a graph, the clustering index of u is (a sketch follows below):

$$C(u) = \frac{3 \times \text{number of triangles with } u \text{ as one node}}{\text{number of connected triples with } u \text{ as one node}}$$
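A sketch following the slide's formula literally, reading "connected triples with u as one node" as triples in which u appears anywhere (as the middle or as an end node); this interpretation is an assumption. The standard networkx local clustering coefficient is printed alongside for comparison.

```python
import networkx as nx

def clustering_index(G, u):
    """3 * (# triangles containing u) / (# connected triples containing u)."""
    triangles = nx.triangles(G, u)                          # triangles that include u
    k = G.degree(u)
    centred = k * (k - 1) // 2                              # triples with u as the middle node
    as_end = sum(G.degree(v) - 1 for v in G.neighbors(u))   # triples with u as an end node
    triples = centred + as_end
    return 3 * triangles / triples if triples else 0.0

G = nx.karate_club_graph()
print(clustering_index(G, 0), nx.clustering(G, 0))
```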
Link Prediction: Features for ML
Topological Features
Shortest Distance in the Author-KW Graph

✓Extend the social network by adding keyword (KW) nodes

✓Each KW node is connected to an author node by an edge if that keyword is used by the author in any of their papers

✓Two keywords that appear together in any paper are also connected by an edge

✓The shortest distance between two author nodes in this extended graph is computed to obtain this attribute value (a sketch follows below)
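A sketch of the extended author-keyword graph on hypothetical toy data; since the toy data has no per-paper records, keyword co-occurrence is approximated per author, which is an assumption.

```python
from itertools import combinations
import networkx as nx

author_keywords = {                                  # hypothetical bibliographic data
    "author_a": {"link prediction", "graph mining"},
    "author_b": {"graph mining", "social networks"},
    "author_c": {"protein interaction"},
}
coauthor_edges = [("author_b", "author_c")]          # existing coauthorship links

G = nx.Graph(coauthor_edges)
for author, kws in author_keywords.items():
    G.add_edges_from((author, kw) for kw in kws)     # author -- keyword edges
    G.add_edges_from(combinations(sorted(kws), 2))   # keyword -- keyword co-occurrence

print(nx.shortest_path_length(G, "author_a", "author_c"))  # 3 in this toy graph
```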
Supervised Learning Methods
✓Citation networks (BIOBASE, DBLP)

✓Use machine learning algorithms to predict future co-authorship (decision tree, k-NN, multilayer perceptron, SVM, RBF network); a minimal pipeline sketch follows below

✓Identify the group of features that are most helpful in prediction

✓Best predictor features: Keyword Match Count, Sum of Neighbors, Sum of Papers, Shortest Distance
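A minimal end-to-end sketch on a stand-in network, using only topological features (common neighbours, Jaccard, preferential attachment, shortest distance) and a decision tree, one of the classifiers listed above; the keyword and paper-count features would require the underlying bibliographic records.

```python
import random
import networkx as nx
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def pair_features(G, x, y):
    """Topological features for a candidate pair (x, y)."""
    nbr_x, nbr_y = set(G.neighbors(x)), set(G.neighbors(y))
    cn = len(nbr_x & nbr_y)                                  # common neighbours
    jac = cn / len(nbr_x | nbr_y) if (nbr_x | nbr_y) else 0.0
    pa = G.degree(x) * G.degree(y)                           # preferential attachment
    try:
        dist = nx.shortest_path_length(G, x, y)              # shortest distance
    except nx.NetworkXNoPath:
        dist = -1                                            # flag for disconnected pairs
    return [cn, jac, pa, dist]

rng = random.Random(0)
G = nx.karate_club_graph()                                   # stand-in network

# Positive samples: held-out edges; negative samples: an equal number of non-edges.
edges = list(G.edges()); rng.shuffle(edges)
positives = edges[:20]
observed = G.copy(); observed.remove_edges_from(positives)   # features computed on this graph
negatives = rng.sample(list(nx.non_edges(G)), len(positives))

X = [pair_features(observed, *p) for p in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("precision:", precision_score(y_te, pred), "recall:", recall_score(y_te, pred))
```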

Evaluation: Link Prediction Methods
Evaluation – Confusion Matrix

Predicted ↓ \ Actual →   Link formed            Link not formed
Link formed              True Positive (TP)     False Positive (FP)
Link not formed          False Negative (FN)    True Negative (TN)
Link Prediction Methods
❑Accuracy (ACC): the ratio of the total number of correct predictions to the total number of predictions

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

❑Precision (P): out of all the links predicted by the model as positive, how many actually belong to the positive samples

$$P = \frac{TP}{TP + FP}$$

❑Recall (R): out of all the links that are actually positive, how many are predicted as positive by the model

$$R = \frac{TP}{TP + FN}$$
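A small sketch computing the three metrics from confusion-matrix counts; the counts plugged in below are the ones obtained in the worked example on the next slide.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# TP = 2, TN = 2, FP = 1, FN = 1 (see the worked example that follows)
print(accuracy(2, 2, 1, 1), precision(2, 1), recall(2, 1))  # ≈ 0.667 each
```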
Confusion Matrix
To find the confusion-matrix metrics for the prediction of network G in the figure (a worked computation follows below):

Actual positive links = {(a, b), (b, c), (c, d)};  Actual negative links = {(a, d), (a, c), (b, d)}

Predicted positive links = {(a, b), (b, c), (b, d)};  Predicted negative links = {(a, d), (a, c), (c, d)}

[Figure: network G with the actual positive links and the predicted positive links marked]

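The computation sketched as code, using the edge sets listed above.

```python
actual_pos = {("a", "b"), ("b", "c"), ("c", "d")}
actual_neg = {("a", "d"), ("a", "c"), ("b", "d")}
pred_pos   = {("a", "b"), ("b", "c"), ("b", "d")}
pred_neg   = {("a", "d"), ("a", "c"), ("c", "d")}

TP = len(pred_pos & actual_pos)   # {(a, b), (b, c)}  -> 2
FP = len(pred_pos & actual_neg)   # {(b, d)}          -> 1
FN = len(pred_neg & actual_pos)   # {(c, d)}          -> 1
TN = len(pred_neg & actual_neg)   # {(a, d), (a, c)}  -> 2

print("ACC =", (TP + TN) / (TP + TN + FP + FN))  # 4/6 ≈ 0.667
print("P   =", TP / (TP + FP))                   # 2/3
print("R   =", TP / (TP + FN))                   # 2/3
```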