
Probabilistic Matching of User Profiles

A typical business uses multiple platforms to interact with its end users, that is, its customers. It may make use of a mobile application, a web application, CRM tools, and also a number of marketing channels. In our CDP system we have integrated a number of such platforms, and each of these platforms captures the user's demographic data.

The set of users interacting with the CDP system through each of these platforms would differ. However, a subset of these users might be common across platforms, and these users, intentionally or unintentionally, might use different credentials on each platform. The task of our probabilistic matching model is to identify such user records.

Possible reasons for errors in the credentials:

● Typos/misspellings
● Letters or words out of order
● Fused or split words
● Missing or extra letters
● Incomplete words or extraneous information
● Incorrect or missing punctuation
● Abbreviations

Probabilistic matching, also known as data matching, is the task of finding records in a data set that refer to the same entity across different data sources. These entities can be a person, a product, etc. Record linkage is necessary when joining different data sets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number), which may be due to differences in record shape, storage location, or curator style or preference.

Record linkage package

The Python Record Linkage Toolkit is a library to link records within or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package contains indexing methods, functions to compare records, and classifiers. The package is developed for research and for linking small or medium sized files.

1. Preprocessing
The data obtained from different sources might not be in the same format. The data needs to be in the same format before we can perform matching or compute similarity scores between values of different columns. Hence the data needs to be standardized using some preprocessing techniques.

A few of the preprocessing tasks we have performed on our data set are as follows:
● Lowercase / Uppercase
● Stopwords removal
● Postcode Clean Up
● Removal of Irrelevant Symbols
The preprocessing steps may vary from dataset to dataset, as a new dataset may require more preprocessing to bring it into a standardized format. Once the data is standardized we can proceed with indexing. A minimal preprocessing sketch follows.
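As a minimal sketch, the Record Linkage Toolkit ships a clean() helper for this kind of standardization; the column names and values below are illustrative, not from our actual dataset:

import pandas as pd
from recordlinkage.preprocessing import clean

# Illustrative profile data; column names are hypothetical.
df = pd.DataFrame({
    "given_name": ["  Aanya ", "VIKAS#", "rohit"],
    "postcode": ["4000 ", "40-00", "4001"],
})

# clean() lowercases the text and strips brackets and irrelevant symbols.
df["given_name"] = clean(df["given_name"])

# Postcode clean-up: keep digits only.
df["postcode"] = df["postcode"].str.replace(r"[^0-9]", "", regex=True)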

2. Indexing
Next we need to create pairs of records. Record pairs are created, and similarity scores are calculated using string similarity algorithms to determine whether a pair of records is considered a match/duplicate.

There are several indexing techniques available for record linkage, such as:
● Full Index
A Full Index is created from all possible combinations of record pairs in the data set. Using a Full Index carries a risk on data volume, as the number of record pairs grows quadratically with the number of records.
● Blocking
Indexing by blocking is a good alternative to the Full Index, as record pairs are produced only within the same block (records having a common value). By blocking on a particular column, the number of record pairs can be greatly reduced.
● Sorted Neighborhood
Indexing by Sorted Neighborhood is another alternative that produces pairs with nearby values; for example, records are paired up when there are similarities in the column "Surname", such as Laundon and Lanyon.

Using only the Blocking or only the Sorted Neighborhood approach, there is a chance of missing actual matches. We can reduce the possibility of missing actual match records by combining both approaches, while still producing a smaller volume of record pairs than the Full Index; a sketch of this combination follows.
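As a hedged sketch using the Record Linkage Toolkit (the column names and the two dataframes df_a and df_b are assumptions for illustration), both indexing algorithms can be registered on one Index object, and their candidate pairs are combined:

import recordlinkage

# df_a and df_b are the two preprocessed datasets to be linked.
# Registering both algorithms gives the union of their candidate pairs.
indexer = recordlinkage.Index()
indexer.block("postcode")                          # exact agreement on postcode
indexer.sortedneighbourhood("surname", window=5)   # nearby surname values
candidate_pairs = indexer.index(df_a, df_b)        # pandas MultiIndex of pairs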

3. Comparison matrix
Now that record pairs are generated, we perform a comparison on each record pair to create a comparison vector containing the similarity scores between the two records. Comparison can be done with many different methods to compute similarity values for strings, numeric values, or dates. In our scenario, where we are calculating similarity scores for string values, we can use the following algorithms:
● Jaro-Winkler
● Levenshtein
The Jaro-Winkler similarity score gives more importance to the beginning of the string, so this algorithm is used to calculate the similarity score for features such as name, address, state, etc. The Levenshtein similarity score gives higher importance to the order of the characters, so this algorithm is used to calculate the similarity score for features such as street number, postcode, etc.
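A minimal sketch of this step with the Record Linkage Toolkit, again assuming illustrative column names and the candidate_pairs built above:

import recordlinkage

# Build one comparison vector per candidate pair; one labelled score per column.
compare = recordlinkage.Compare()
compare.string("given_name", "given_name", method="jarowinkler", label="given_name")
compare.string("address_1", "address_1", method="jarowinkler", label="address_1")
compare.string("postcode", "postcode", method="levenshtein", label="postcode")
features = compare.compute(candidate_pairs, df_a, df_b)  # the comparison matrix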

4. Model building
The comparison matrix generated above can be used for model implementation. We will train a model to classify duplicates and non-duplicates based on the data set provided. Since we don't have labeled data, we will apply clustering algorithms to the comparison matrix generated above, as in the sketch below.
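For instance, as a sketch, the toolkit's unsupervised classifiers can cluster the comparison vectors into matches and non-matches without labels:

import recordlinkage

# KMeans clusters comparison vectors into two groups (match / non-match)
# without needing labeled training data.
kmeans = recordlinkage.KMeansClassifier()
matches = kmeans.fit_predict(features)  # MultiIndex of pairs classified as matches
print(f"{len(matches)} candidate pairs classified as duplicates")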

Advantages of the record linkage package: the package has a number of supervised and unsupervised algorithms to choose from. We can also incorporate user-defined algorithms into the package.

Disadvantages
It works on small to medium sized datasets (up to roughly 0.1 million records).
Splink
Splink is a free and open source PySpark package that implements the Fellegi-Sunter model of record linkage and enables parameters to be estimated using the Expectation-Maximization (EM) algorithm.

General working of a probabilistic record linkage model:

1. Start with a prior probability, then compare each column in the pair. Increase the probability if the column agrees and decrease it if it disagrees.
2. The size of the increase or decrease depends on the amount of evidence contained in a column. Columns with a higher number of distinct values carry stronger evidence of a match when they agree, because any two values chosen at random are less likely to match by coincidence; compare, for example, date of birth against gender.

This, in a nutshell, is how probabilistic record linkage works. By comparing records and
weighing evidence appropriately, we estimate the probability of a match.
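As a toy sketch with made-up m/u values (none of these numbers come from our data), the update can be expressed as multiplying prior odds by a Bayes factor per column:

import math

# Toy example: prior odds updated column by column. All numbers are made up.
prior = 0.01                 # prior probability that a random pair is a match
odds = prior / (1 - prior)

# The Bayes factor for an agreeing column is m/u. Date of birth has many
# distinct values (low u), so it carries far more evidence than gender.
bayes_factors = {
    "dob":    0.95 / 0.001,  # strong evidence
    "gender": 0.98 / 0.5,    # weak evidence
}
for column, factor in bayes_factors.items():
    odds *= factor

posterior = odds / (1 + odds)  # convert odds back to a probability (~0.95 here)
print(f"match probability: {posterior:.4f}")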
The most common type of probabilistic record linkage model is called the Fellegi-Sunter
model.

Working of the Fellegi-Sunter model:

1. Indexing.
2. To reduce the number of pairs for comparison, use blocking.
3. The FS model begins by comparing the records column by column and assigning each comparison to two or more 'similarity levels'. A simple example of a two-level comparison rule for a column may be:
● If the values in the column exactly match, assign the comparison to similarity level 1
● Otherwise assign the comparison to similarity level 0

These comparison values are called gamma (γ) values. For each row we get a vector of gamma values; for example, a pair that agrees on first name and postcode but disagrees on surname would have γ = (1, 0, 1) for those three columns.

4. We then combine the individual column comparisons to get the overall probability of a match. When combining, each column has a different weight: a gender column is less informative than a date of birth column, so the weight of gender will be lower than the weight of date of birth.
5. The FS model estimates the weight of each column.
6. All columns are assumed to be mutually independent of each other. This assumption makes the model equivalent to a Naive Bayes classifier and allows the match probability to be expressed as a repeated multiplication of conditional probabilities.
7. The appropriate threshold above which to accept record pairs as valid matches is typically determined through manual inspection of record pairs within a range of weight scores.

m and u probabilities
How much should we increase our estimate of match probability if we observe a match
on first name? How about a match on gender? And what about if we observe a
mismatch on these fields?

We are interested in evaluating statements like:

Pr(records match | first name matches)

This can be quantified using the m and u probabilities for each column, combined with
Bayes Theorem (see annex for a refresher).

Consider the first name column. We have defined two similarity levels: level 1 if the first
name exactly matches, and level 0 otherwise.

m probabilities
The m probabilities for the first name column are:

Pr(first name matches | records are a true match)
Pr(first name does not match | records are a true match)

That is, amongst record comparisons which are true matches, what proportion have a match on first name, and what proportion mismatch on first name?

This is a measure of how often misspellings, nicknames or aliases occur in the first name field.

u probabilities
The u probabilities for the first name column are:

Pr(first name matches | records are a true non-match)
Pr(first name does not match | records are a true non-match)

That is, amongst record comparisons which are true non-matches, what proportion have a match on first name, and what proportion mismatch on first name?

The values of the m probability and the u probability remain the same for a given column throughout.

For an agreement, we calculate the weight log(m / u).

For a disagreement, we calculate the weight log((1 − m) / (1 − u)).
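As a quick worked example with hypothetical values for the first name column:

import math

# Hypothetical m/u values; base-2 logs give weights in bits.
m, u = 0.9, 0.01

agreement_weight = math.log2(m / u)                 # ≈ +6.49: strong evidence for a match
disagreement_weight = math.log2((1 - m) / (1 - u))  # ≈ -3.31: evidence against a match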
Parameters:

1. link_type
Summary: The type of data linking task - link_and_dedupe, link_only, or dedupe_only. Required.
Description: - When link_and_dedupe, splink finds links within and between input datasets; if a single dataset is provided, it will be deduped. - When link_only, splink finds links between datasets, but does not attempt to deduplicate the datasets (it does not try to find links within each input dataset).
Default value if not provided: link_and_dedupe

2. proportion_of_matches
Summary: The proportion of record comparisons thought to be matches
Description: This provides the initial value (prior) from which the EM algorithm will start iterating
Default value if not provided: 0.3

3. em_convergence
Summary: Convergence tolerance for the EM algorithm
Description: The algorithm will stop iterating when the maximum change in model parameters between iterations is below this value
Default value if not provided: 0.0001

4. max_iterations
Summary: The maximum number of iterations to run even if convergence has not been reached
Description: Set this value to zero if you do not want to use the EM algorithm and just want to
score matches from values you have manually specified in the m_probabilities and
u_probabilities arrays
Default value if not provided: 25

5. blocking_rules
Summary: A list of one or more blocking rules to apply. A Cartesian join is applied if blocking_rules is empty or not supplied.
Description: Each rule is a SQL expression representing the blocking rule, which will be used to
create a join. The left table is aliased with l and the right table is aliased with r. For example, if
you want to block on a first_name column, the blocking rule would be l.first_name =
r.first_name. Note that splink deduplicates the comparisons generated by the blocking rules. If
empty or not supplied, all comparisons between the input dataset(s) will be generated and
blocking will not be used. For large input datasets, this will generally be computationally
intractable because it will generate comparisons equal to the number of rows squared.

6. num_levels
Summary: The number of different similarity categories (gradations of similarity) that will be computed for this column.
Description: A greater value for num_levels means the algorithm can be more granular about how string similarity is treated - e.g. with three levels, it can distinguish between strings which are an almost-exact match, strings which are quite similar, and strings which don't match at all. However, more levels result in longer compute times and can sometimes affect convergence. By default, for a string variable, two levels imply level 0: no match, level 1: almost exact match. Three levels imply level 0: no match, level 1: strings are similar but not exactly the same, level 2: strings are almost exactly the same.
Default value if not provided: 2

7. term_frequency_adjustments
Summary: Whether ex post term frequency adjustments should be made to match scores for
this column
Description: For some columns such as first name, the value of first name is important due to
the distribution of values being non-uniform. For instance, a match on 'linacre' contains more
information than a match on 'smith'. If this is set to true, a term frequency adjustment is made to
account for these differences.
Default value if not provided: false

settings = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.state = r.state"
    ],
    "comparison_columns": [
        {
            "col_name": "given_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname"
        },
        {
            "col_name": "address_1",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "address_2"
        },
        {
            "col_name": "suburb"
        },
        {
            "col_name": "postcode"
        },
        {
            "col_name": "date_of_birth"
        }
    ],
    "em_convergence": 0.01
}
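As a hedged usage sketch, this settings dict is passed to Splink along with the data; the constructor signature below follows Splink v1-era PySpark releases and may differ in later versions, and df and spark are assumed to exist:

from splink import Splink

# Assumes `df` is a Spark DataFrame of user profiles and `spark` is an
# active SparkSession.
linker = Splink(settings, df, spark)
df_e = linker.get_scored_comparisons()  # each pair scored with a match probability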

Use cases:

Although the goal of probabilistic matching models is always to find and match the correct records out of several similar records, the application differs from industry to industry. Here is a closer look at how probabilistic matching is applied across multiple contexts (https://dataladder.com/benefits-data-matching/):

● Government and Public Sector: It can be used to detect fraud in passport and license applications.

● Banking and Finance: Banks and financial services institutions utilize data matching to identify culprits as part of anti-money laundering initiatives, meet KYC compliance requirements, or carry out FICO credit scoring.
● E-commerce: In e-commerce, an everyday use case is price-comparison platforms. They use data matching to locate identical products from different stores, even if they don't have the same description.

● Mailing lists: Data matching can help clean up email lists to get rid of duplicates and dirty
data.

● Healthcare: Matching medical records with other data to study the effect of things like
drugs, treatments, and the environment.

● Fraud detection: Data matching can help identify suspicious transactions, behaviors, and
individuals.

● Computing: Data matching can help optimize computing processes. By detecting duplicate data, deduplication algorithms help reduce storage needs and network data transfer.

Presentation points:
Agenda
Introduction to the problem statement and its nature (supervised vs. unsupervised)
Understanding the data set (switch to notebook)
recordlinkage (keep switching between notebook and ppt)
Result analysis
Splink introduction
Settings dict and parameters explanation
Interpreting the results
Advantages of recordlinkage

References

https://towardsdatascience.com/performing-deduplication-with-record-linkage-and-supervised-learning-b01a66cc6882
https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0370-0
https://www.robinlinacre.com/maths_of_fellegi_sunter/
https://www.sciencedirect.com/science/article/pii/S1532046409001051

● Record Linkage Using Specialized Packages: Utilized record linkage packages such as the Python Record Linkage Toolkit and Splink.
