
Probabilistic Matching of User Profiles

A typical business uses multiple platforms to interact with its end users, that is, its customers. It may make use of a mobile application, a web application, CRM tools, and also a number of marketing channels. In our CDP system we have integrated a number of such platforms, and each of these platforms captures the user's demographic data.

The set of users interacting with the CDP system through each of these platforms would differ. However, a subset of these users might be common across platforms, and these users, intentionally or unintentionally, might use different credentials on each platform. The task of our probabilistic matching model is to identify such user records.

Possible reasons for errors in the credentials:

● Typos/misspellings
● Letters or words out of order
● Fused or split words
● Missing or extra letters
● Incomplete words or extraneous information
● Incorrect or missing punctuation
● Abbreviations

Probabilistic matching, also known as data matching, is the task of finding records in a data set that refer to the same entity across different data sources. These entities can be a person, a product, etc. Record linkage is necessary when joining different data sets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number), which may be due to differences in record shape, storage location, or curator style or preference.

Record linkage package

The Python Record Linkage Toolkit is a library to link records within or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package contains indexing methods, functions to compare records, and classifiers. The package is developed for research and for linking small or medium sized files.

1. Preprocessing
The data obtained from different sources might not be in the same format. The data needs to be in the same format before we can perform matching or compute similarity scores between values of different columns. Hence the data needs to be standardized using some preprocessing techniques.

A few of the preprocessing tasks we have performed on our data set are as follows:
● Lowercase / Uppercase
● Stopwords removal
● Postcode Clean Up
● Removal of Irrelevant Symbols
The preprocessing steps may vary from dataset to dataset, as a new dataset may require more preprocessing to bring it into a standardized format. Once the data is standardized we can proceed with indexing. A minimal preprocessing sketch follows.
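As a minimal sketch, the Record Linkage Toolkit ships a clean() helper for this kind of standardization; the column names and values below are illustrative, not from our actual dataset:

import pandas as pd
from recordlinkage.preprocessing import clean

# Illustrative profile data; column names are hypothetical.
df = pd.DataFrame({
    "given_name": ["  Aanya ", "VIKAS#", "rohit"],
    "postcode": ["4000 ", "40-00", "4001"],
})

# clean() lowercases the text and strips brackets and irrelevant symbols.
df["given_name"] = clean(df["given_name"])

# Postcode clean-up: keep digits only.
df["postcode"] = df["postcode"].str.replace(r"[^0-9]", "", regex=True)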

2. Indexing
Next we need to create pairs of records. Record pairs are created, and similarity scores are calculated using string similarity algorithms to determine whether a pair of records is considered a match/duplicate.

There are several indexing techniques available for record linkage, such as:
● Full Index
A Full Index is created from all possible combinations of record pairs in the data set. Using a Full Index carries a risk on data volume, as the number of record pairs grows quadratically with the number of records.
● Blocking
Indexing by blocking is a good alternative to the Full Index, as record pairs are produced only within the same block (records having a common value). By blocking on a particular column, the number of record pairs can be greatly reduced.
● Sorted Neighborhood
Indexing by Sorted Neighborhood is another alternative that produces pairs with nearby values; for example, records are paired up when there are similarities in the column "Surname", such as Laundon and Lanyon.

Using only the Blocking or only the Sorted Neighborhood approach, there is a chance of missing actual matches. We can reduce the possibility of missing actual match records by combining both approaches, while still producing a smaller volume of record pairs than the Full Index; a sketch of this combination follows.
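As a hedged sketch using the Record Linkage Toolkit (the column names and the two dataframes df_a and df_b are assumptions for illustration), both indexing algorithms can be registered on one Index object, and their candidate pairs are combined:

import recordlinkage

# df_a and df_b are the two preprocessed datasets to be linked.
# Registering both algorithms gives the union of their candidate pairs.
indexer = recordlinkage.Index()
indexer.block("postcode")                          # exact agreement on postcode
indexer.sortedneighbourhood("surname", window=5)   # nearby surname values
candidate_pairs = indexer.index(df_a, df_b)        # pandas MultiIndex of pairs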

3. Comparison matrix
Now that record pairs are generated, we perform a comparison on each record pair to create a comparison vector containing the similarity scores between the two records. Comparison can be done with many different methods to compute similarity values for strings, numeric values, or dates. In our scenario, where we are calculating similarity scores for string values, we can use the following algorithms:
● Jaro-Winkler
● Levenshtein
The Jaro-Winkler similarity score gives more importance to the beginning of the string, so this algorithm is used to calculate the similarity score for features such as name, address, state, etc. The Levenshtein similarity score gives higher importance to the order of the characters, so this algorithm is used to calculate the similarity score for features such as street number, postcode, etc.
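A minimal sketch of this step with the Record Linkage Toolkit, again assuming illustrative column names and the candidate_pairs built above:

import recordlinkage

# Build one comparison vector per candidate pair; one labelled score per column.
compare = recordlinkage.Compare()
compare.string("given_name", "given_name", method="jarowinkler", label="given_name")
compare.string("address_1", "address_1", method="jarowinkler", label="address_1")
compare.string("postcode", "postcode", method="levenshtein", label="postcode")
features = compare.compute(candidate_pairs, df_a, df_b)  # the comparison matrix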

4. Model building
The comparison matrix generated above can be used for model implementation. We will train a model to classify duplicates and non-duplicates based on the data set provided. Since we don't have labeled data, we will apply clustering algorithms to the comparison matrix generated above, as in the sketch below.
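For instance, as a sketch, the toolkit's unsupervised classifiers can cluster the comparison vectors into matches and non-matches without labels:

import recordlinkage

# KMeans clusters comparison vectors into two groups (match / non-match)
# without needing labeled training data.
kmeans = recordlinkage.KMeansClassifier()
matches = kmeans.fit_predict(features)  # MultiIndex of pairs classified as matches
print(f"{len(matches)} candidate pairs classified as duplicates")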

Advantages of the record linkage package: the package has a number of supervised and unsupervised algorithms to choose from. We can also incorporate user-defined algorithms into the package.

Disadvantages
It works on small to medium sized datasets (up to roughly 0.1 million records).
Splink
Splink is a free and open source PySpark package that implements the Fellegi-Sunter model of record linkage and enables parameters to be estimated using the Expectation-Maximization (EM) algorithm.

General working of a probabilistic record linkage model:

1. Start with a prior probability, then compare each column in the pair. Increase the probability if the column agrees and decrease it if it disagrees.
2. The size of the increase or decrease depends on the amount of evidence contained in a column. Columns with a higher number of distinct values carry stronger evidence of a match when they agree, because any two values chosen at random are less likely to match by coincidence; compare, for example, date of birth against gender.

This, in a nutshell, is how probabilistic record linkage works. By comparing records and
weighing evidence appropriately, we estimate the probability of a match.
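As a toy sketch with made-up m/u values (none of these numbers come from our data), the update can be expressed as multiplying prior odds by a Bayes factor per column:

import math

# Toy example: prior odds updated column by column. All numbers are made up.
prior = 0.01                 # prior probability that a random pair is a match
odds = prior / (1 - prior)

# The Bayes factor for an agreeing column is m/u. Date of birth has many
# distinct values (low u), so it carries far more evidence than gender.
bayes_factors = {
    "dob":    0.95 / 0.001,  # strong evidence
    "gender": 0.98 / 0.5,    # weak evidence
}
for column, factor in bayes_factors.items():
    odds *= factor

posterior = odds / (1 + odds)  # convert odds back to a probability (~0.95 here)
print(f"match probability: {posterior:.4f}")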
The most common type of probabilistic record linkage model is called the Fellegi-Sunter
model.

Working of the Fellegi-Sunter model:

1. Indexing.
2. To reduce the number of pairs for comparison, use blocking.
3. The FS model begins by comparing the records column by column and assigning each comparison to two or more 'similarity levels'. A simple example of a two-level comparison rule for a column may be:
● If the values in the column exactly match, assign the comparison to similarity level 1
● Otherwise assign the comparison to similarity level 0

These comparison values are called gamma (γ) values. For each row we get a vector of gamma values; for example, a pair that agrees on first name and postcode but disagrees on surname would have γ = (1, 0, 1) for those three columns.

4. We then combine the individual column comparisons to get the overall probability of a match. When combining, each column has a different weight: a gender column is less informative than a date of birth column, so the weight of gender will be lower than the weight of date of birth.
5. The FS model estimates the weight of each column.
6. All columns are assumed to be mutually independent of each other. This assumption makes the model equivalent to a Naive Bayes classifier and allows the match probability to be expressed as a repeated multiplication of conditional probabilities.
7. The appropriate threshold above which to accept record pairs as valid matches is typically determined through manual inspection of record pairs within a range of weight scores.

m and u probabilities
How much should we increase our estimate of match probability if we observe a match
on first name? How about a match on gender? And what about if we observe a
mismatch on these fields?

We are interested in evaluating statements like:

Pr(records match | first name matches)

This can be quantified using the m and u probabilities for each column, combined with
Bayes Theorem (see annex for a refresher).

Consider the first name column. We have defined two similarity levels: level 1 if the first
name exactly matches, and level 0 otherwise.

m probabilities
The m probabilities for the first name column are:

Pr(first name matches | records are a true match)
Pr(first name does not match | records are a true match)

That is, amongst record comparisons which are true matches, what proportion have a match on first name, and what proportion mismatch on first name?

This is a measure of how often misspellings, nicknames or aliases occur in the first name field.

u probabilities
The u probabilities for the first name column are:

Pr(first name matches | records are a true non-match)
Pr(first name does not match | records are a true non-match)

That is, amongst record comparisons which are true non-matches, what proportion have a match on first name, and what proportion mismatch on first name?

The values of the m probability and the u probability remain the same for a given column throughout.

For an agreement, we calculate the weight log(m / u).

For a disagreement, we calculate the weight log((1 − m) / (1 − u)).
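As a quick worked example with hypothetical values for the first name column:

import math

# Hypothetical m/u values; base-2 logs give weights in bits.
m, u = 0.9, 0.01

agreement_weight = math.log2(m / u)                 # ≈ +6.49: strong evidence for a match
disagreement_weight = math.log2((1 - m) / (1 - u))  # ≈ -3.31: evidence against a match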
Parameters:

1. link_type
Summary: The type of data linking task - link_and_dedupe, link_only, or dedupe_only. Required.
Description: - When link_and_dedupe, splink finds links within and between input datasets; if a single dataset is provided, it will be deduped. - When link_only, splink finds links between datasets, but does not attempt to deduplicate the datasets (it does not try to find links within each input dataset).
Default value if not provided: link_and_dedupe

2. proportion_of_matches
Summary: The proportion of record comparisons thought to be matches
Description: This provides the initial value (prior) from which the EM algorithm will start iterating
Default value if not provided: 0.3

3. em_convergence
Summary: Convergence tolerance for the EM algorithm
Description: The algorithm will stop iterating when the maximum change in model parameters between iterations is below this value
Default value if not provided: 0.0001

4. max_iterations
Summary: The maximum number of iterations to run even if convergence has not been reached
Description: Set this value to zero if you do not want to use the EM algorithm and just want to
score matches from values you have manually specified in the m_probabilities and
u_probabilities arrays
Default value if not provided: 25

5. blocking_rules
Summary: A list of one or more blocking rules to apply. A Cartesian join is applied if blocking_rules is empty or not supplied.
Description: Each rule is a SQL expression representing the blocking rule, which will be used to
create a join. The left table is aliased with l and the right table is aliased with r. For example, if
you want to block on a first_name column, the blocking rule would be l.first_name =
r.first_name. Note that splink deduplicates the comparisons generated by the blocking rules. If
empty or not supplied, all comparisons between the input dataset(s) will be generated and
blocking will not be used. For large input datasets, this will generally be computationally
intractable because it will generate comparisons equal to the number of rows squared.

6. num_levels
Summary: The number of different similarity categories (gradations of similarity) that will be computed for this column.
Description: A greater value for num_levels means the algorithm can be more granular about how string similarity is treated - e.g. with three levels, it can distinguish between strings which are an almost-exact match, strings which are quite similar, and strings which don't match at all. However, more levels result in longer compute times and can sometimes affect convergence. By default, for a string variable, two levels imply level 0: no match, level 1: almost exact match. Three levels imply level 0: no match, level 1: strings are similar but not exactly the same, level 2: strings are almost exactly the same.
Default value if not provided: 2

7. term_frequency_adjustments
Summary: Whether ex post term frequency adjustments should be made to match scores for
this column
Description: For some columns such as first name, the value of first name is important due to
the distribution of values being non-uniform. For instance, a match on 'linacre' contains more
information than a match on 'smith'. If this is set to true, a term frequency adjustment is made to
account for these differences.
Default value if not provided: false

settings = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.state = r.state"
    ],
    "comparison_columns": [
        {
            "col_name": "given_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname"
        },
        {
            "col_name": "address_1",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "address_2"
        },
        {
            "col_name": "suburb"
        },
        {
            "col_name": "postcode"
        },
        {
            "col_name": "date_of_birth"
        }
    ],
    "em_convergence": 0.01
}
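As a hedged usage sketch, this settings dict is passed to Splink along with the data; the constructor signature below follows Splink v1-era PySpark releases and may differ in later versions, and df and spark are assumed to exist:

from splink import Splink

# Assumes `df` is a Spark DataFrame of user profiles and `spark` is an
# active SparkSession.
linker = Splink(settings, df, spark)
df_e = linker.get_scored_comparisons()  # each pair scored with a match probability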

Use cases:

Although the goal of probabilistic matching models is always to find and match the correct records out of several similar records, the application differs from industry to industry. Here is a closer look at how probabilistic matching is applied across multiple contexts (https://dataladder.com/benefits-data-matching/):

● Government and Public Sector: It can be used to detect fraud in passport and license applications.

● Banking and Finance: Banks and financial services institutions utilize data matching to identify culprits as part of anti-money laundering initiatives, meet KYC compliance requirements, or carry out FICO credit scoring.
● E-commerce: In e-commerce, an everyday use case is price-comparison platforms. They use data matching to locate identical products from different stores, even if they don't have the same description.

● Mailing lists: Data matching can help clean up email lists to get rid of duplicates and dirty
data.

● Healthcare: Matching medical records with other data to study the effect of things like
drugs, treatments, and the environment.

● Fraud detection: Data matching can help identify suspicious transactions, behaviors, and
individuals.

● Computing: Data matching can help optimize computing processes. By detecting duplicate data, deduplication algorithms help reduce storage needs and network data transfer.

Presentation points:
Agenda
Introduction to the problem statement and its nature (supervised vs. unsupervised)
Understanding the data set (switch to notebook)
recordlinkage (keep switching between notebook and ppt)
Result analysis
Splink introduction
Settings dict and parameters explanation
Interpreting the results
Advantages of recordlinkage

References

https://towardsdatascience.com/performing-deduplication-with-record-linkage-and-supervised-learning-b01a66cc6882
https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0370-0
https://www.robinlinacre.com/maths_of_fellegi_sunter/
https://www.sciencedirect.com/science/article/pii/S1532046409001051

● Record Linkage Using Specialized Packages: Utilized record linkage packages such as the Python Record Linkage Toolkit and Splink.
