
Int. J. Engg. Res. & Sci. & Tech., Vol. 2, No. 2, May 2013
ISSN 2319-5991, www.ijerst.com
© 2013 IJERST. All Rights Reserved
Research Paper
RECORD DEDUPLICATION USING GENETIC PROGRAMMING APPROACH

S Kavitha¹, K Arunadevi¹, S V Sivaranjani² and V Karthika¹

*Corresponding Author: V Karthika, [email protected]

With the rapid growth of technology, the use of databases has become widespread, and dirty data has become their biggest drawback. Dirty data can contain mistakes such as spelling or punctuation errors, incorrect data associated with a field, incomplete or outdated data, or even data that is duplicated in the database. Various data cleaning tools are used to remove dirty data. In this paper we propose a genetic programming (GP) approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function able to identify whether two entries in a repository are replicas or not. In addition, our genetic programming approach is capable of automatically adapting these functions to a given fixed replica identification boundary. We apply this genetic programming approach to a blood bank database management system to deduplicate the records.

Keywords: Database integration, Evolutionary computing, Genetic algorithms


¹ Kumaraguru College of Technology, Coimbatore, Tamil Nadu, India.
² Department of Information Technology, Kumaraguru College of Technology, Coimbatore, Tamil Nadu, India.


INTRODUCTION
Several systems, such as digital libraries and other database systems like organizational databases, are affected by duplicates. We propose a genetic programming approach to find a deduplication function that is able to identify whether two entries in a repository are replicas or not. Deduplication is the task of identifying duplicate data in a repository that refer to the same real-world entity or object and systematically substituting reference pointers for the redundant blocks; it is also known as storage capacity optimization. The consequences of dirty data fall into several categories: (1) performance degradation, since the additional useless data demands more processing and more time is required to answer simple user queries; (2) quality loss, since the presence of replicas and other inconsistencies leads to distortions in reports and misleading conclusions based on the existing data; and (3) increased operational costs, since the additional volume of useless data requires investment in more storage media and extra computational processing power to keep response times acceptable. To avoid these problems, it is necessary to study the causes of dirty data in repositories. A major cause is the presence of duplicates, quasi-replicas, or near-duplicates in these repositories, mainly those constructed by the aggregation or integration of distinct data sources. The problem of detecting and removing duplicate entries in a repository is generally known as record deduplication.

In our project we remove the dirty data in the blood bank management system. As part of the genetic programming approach, information gain and entropy calculations are used to deduplicate the records.
RELATED WORKS
Record deduplication is a growing research topic in databases and many other fields, as mentioned above. Data collected from disparate sources often contains redundant entries, and further replicas arise from OCR-processed documents. This leads to inconsistencies that may affect the integrity of the database and of database management systems.
These problems can be addressed with genetic programming, an evolutionary algorithm-based methodology inspired by biological evolution that finds computer programs to perform a user-defined task. It is a specialization of genetic algorithms (GA) in which each individual is a computer program. It is a machine learning technique used to optimize a population of computer programs according to a fitness landscape determined by a program's ability to perform a given computational task.

Figure 1: Overview of Our Project of Record Deduplication
The main contribution of this paper is a GP-based approach to record deduplication that: (1) outperforms an existing state-of-the-art machine learning based method found in the literature; (2) provides less computationally intensive solutions, since it suggests deduplication functions that use the available evidence more efficiently, and frees the user from the burden of choosing how to combine similarity functions and repository attributes, which distinguishes our approach from all existing methods, since they require user-provided settings; and (3) frees the user from the burden of choosing the replica identification boundary value, since it is able to automatically select the deduplication functions that best fit this deduplication parameter.
PROPOSED WORK
Figure 1 shows an overview of our record deduplication project. The blood group data set in which we are going to find duplicates is taken as input. The entropy value is calculated for the data set as a whole, i.e., for positive as well as negative blood types. Based on the entropy value, the donor records are displayed in a tree structure in which the blood groups are grouped together. Entropy is part of the gain computation: the entropy value is applied in the gain formula, which is used to display the donor records with the highest priority.
DESCRIPTION
Administrator
The person responsible for setting up and maintaining the system is called the system administrator. The administrator maintains the database in a secure manner and is responsible for installation, configuration, monitoring, and administration, and for resolving the duplicates that affect database performance and capacity. Admins are usually charged with installing, supporting, and maintaining servers or other computer systems, and with planning for and responding to service outages and other problems. The administrator is also responsible for creating backup and recovery policies and for monitoring network communication.

Our project is implemented for a blood bank system. Here the administrator maintains the whole database, which contains details such as user registrations and blood donation records, as well as the various branches of that particular blood bank. The admin can look up all these details and can insert or update any information in the database, as well as delete any unwanted information from it. He calculates the entropy and gain values in order to group the user details by priority, so that he can display the blood groups in order; from this the admin can learn which blood group is required most. Finally, he merges/integrates all the blood bank branches to find the duplicate entries in the database by means of the gain values, so that the duplicates are displayed separately. After that, he sends mails to the blood banks whose databases contain the duplicates.

The admin creates new branches with branch IDs, can change the user name and password of each branch admin, lock/unlock their accounts, and monitor the security of the transactions.


Creation of DB and Entropy Calculation
Entropy is a measurement from information theory. Here we show how to calculate the entropy of a given set of data:

Entropy(S) = −Σ p(I) log₂ p(I), summed over the n blood group categories I

where p(I) refers to the proportion of records in blood group category I and S refers to the collection. We illustrate the entropy calculation with the following tables.
From Table 1 it is known that two persons with the positive type of the AB blood group, one with the negative type of AB, and one with the positive type of B have been registered:

AB+ = 2
AB− = 1
B+ = 1
Table 1: Master Table

Phone Number    Blood Group    Blood Type
123456789       AB             +
1234567891      AB             -
1234567892      AB             +
1234567893      B              +
The entropy value is calculated for each blood group by means of the above formula:

Entropy(AB+) = −(2/5) log₂(2/5) = 0.960
Entropy(AB−) = −(1/5) log₂(1/5) = 0.590
Entropy(B+) = −(1/5) log₂(1/5) = 0.789
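As a minimal sketch (our illustration, not the paper's code), the entropy of a collection of blood group labels could be computed in Python as follows; the function name and data layout are assumptions, and the label counts follow Table 1:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a collection of category labels: Entropy(S) = -sum p(I) log2 p(I)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

# Blood groups registered in the master table (Table 1).
groups = ["AB+", "AB-", "AB+", "B+"]
print(entropy(groups))  # entropy of the whole collection S
```

Each term −p(I) log₂ p(I) in the sum corresponds to one blood group category, as in the per-group values worked out above.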
Integration of Dataset and Detection Using Gain Value
The gain value is calculated for the records. Based on the gain value, the records that have the same key attribute values are grouped, and they are displayed in order of highest priority. Grouping records makes it easier to identify duplicate records and also allows easy access to the records; it improves system performance when searching for and retrieving records. After finding the entropy, we next find the gain value. Entropy is part of the gain computation. Information gain is written G(S, A), where S is the collection of data in the data set and A is the attribute for which the information gain is calculated over the collection S.

Table 2: Transaction Table (Excluding Dirty Data)

Branch    Phone Number    Donation Date
B1        123456789       10/1/12
B2        1234567893      10/1/12
B1        1234567891      1/1/12
B1        1234567892      2/1/12
B2        123456789       1/7/12
Gain(S, A) = Entropy(S) − Σ (|Sᵥ|/|S|) × Entropy(Sᵥ), summed over the values v of attribute A

where Sᵥ is the subset of S for which attribute A takes the value v. The entropy value is applied in the above formula to find the gain value for each blood group; the gain value is calculated from the corresponding entropy value.
Table 2 is the transaction table, which shows that blood donors can donate blood at different branches, recorded with their personal details.
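A minimal sketch of the gain computation above (again our illustration; the record layout and field names such as "branch" and "blood_group" are assumptions standing in for a join of Tables 1 and 2):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a collection of category labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, label):
    """Gain(S, A) = Entropy(S) - sum over values v of A of (|Sv|/|S|) * Entropy(Sv)."""
    total = len(records)
    gain = entropy([r[label] for r in records])  # Entropy(S)
    for v in {r[attribute] for r in records}:    # each value v of attribute A
        subset = [r[label] for r in records if r[attribute] == v]  # labels of Sv
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Hypothetical records joining the master and transaction tables:
# one blood group per donation, split by branch.
records = [
    {"branch": "B1", "blood_group": "AB+"},
    {"branch": "B2", "blood_group": "B+"},
    {"branch": "B1", "blood_group": "AB-"},
    {"branch": "B1", "blood_group": "AB+"},
    {"branch": "B2", "blood_group": "AB+"},
]
print(information_gain(records, attribute="branch", label="blood_group"))
```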
Display the Duplicates
The administrator can merge the databases to find the duplicate entries as a whole. Based on the gain value, the admin can identify the duplicate entries in the database; for example, if the gain is a negative value, the admin knows that the corresponding blood group is duplicated, and he obtains the overall duplicates from the whole database. To detect the duplicates from each branch, he can split the databases; from each branch the duplicate entries are displayed, and mails are sent to those branches.
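The paper's criterion above is the sign of the gain value; as a complementary sketch (ours, with assumed field names, not the paper's method), the merge-and-group step that surfaces duplicate entries across branches could look like this:

```python
from collections import defaultdict

def find_duplicates(transactions):
    """Group merged branch records by (phone, donation date);
    any key appearing in more than one record marks a duplicate entry."""
    seen = defaultdict(list)
    for t in transactions:
        seen[(t["phone"], t["date"])].append(t["branch"])
    return {key: branches for key, branches in seen.items() if len(branches) > 1}

# Hypothetical merged transactions in the shape of Table 2.
merged = [
    {"branch": "B1", "phone": "123456789", "date": "10/1/12"},
    {"branch": "B2", "phone": "123456789", "date": "10/1/12"},  # same donor/date at another branch
    {"branch": "B1", "phone": "1234567891", "date": "1/1/12"},
]
for (phone, date), branches in find_duplicates(merged).items():
    print(f"duplicate entry for {phone} on {date} in branches {branches}")
```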
GENETIC PROGRAMMING APPROACH
The problem of record duplication is solved by several evolutionary techniques, and genetic programming is one of the best-known evolutionary programming techniques among them. The main aspect that distinguishes GP from other evolutionary techniques is that it represents the concepts and the interpretation of a problem as a computer program, and even the data are viewed and manipulated in this way. This special characteristic enables GP to model any other machine learning representation. Another advantage of GP over other evolutionary techniques is its applicability to symbolic regression problems, since its representation structures are variable. GP is able to discover the independent variables and their relationships with each other and with any dependent variable. Thus, GP can find the correct functional form that fits the data and discover the appropriate coefficients.
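To make this representation concrete, the sketch below shows one possible encoding of a GP individual as an expression tree over similarity evidence, applied against a fixed replica identification boundary. The fitness evaluation and evolutionary loop are omitted, and the operator set, attribute names, and boundary are illustrative assumptions, not the paper's actual configuration:

```python
import random

# Each attribute similarity (a value in [0, 1]) is one piece of evidence; a GP
# individual is an expression tree that combines them into a single score.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b, "max": max, "min": min}

def random_tree(attrs, depth=2):
    """Grow a random tree: leaves are attribute names, internal nodes are operators."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(attrs)
    op = random.choice(list(OPS))
    return (op, random_tree(attrs, depth - 1), random_tree(attrs, depth - 1))

def evaluate(tree, sims):
    """Apply the evolved function to one record pair's similarity scores."""
    if isinstance(tree, str):
        return sims[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, sims), evaluate(right, sims))

def is_replica(tree, sims, boundary=1.0):
    """Classify a pair as replica when the combined score reaches the boundary."""
    return evaluate(tree, sims) >= boundary

# Hypothetical similarities for one pair of donor records.
sims = {"name": 0.9, "phone": 1.0, "blood_group": 1.0}
individual = random_tree(list(sims))
print(individual, is_replica(individual, sims))
```

In a full GP run, a population of such trees would be evolved with crossover and mutation and scored by how well their replica/non-replica decisions match labeled record pairs.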
EXPERIMENTAL RESULTS
Figure 2 depicts the duplicate records in each branch, and Figure 3 is the output of the genetic programming approach, which clearly shows the duplicate entries across all the branches.
Figure 2: Duplicates at Each Branch

Figure 3: Duplicate Entries for all the Branches



CONCLUSION
Identifying and handling replicas is important to guarantee the quality of the information made available by data-intensive systems such as digital libraries and e-commerce brokers. These systems rely on consistent data to offer high-quality services, and may be affected by the existence of duplicates, quasi-replicas, or near-duplicate entries in their repositories. For this reason, there have been significant investments from private and government organizations in developing methods for removing replicas from large data repositories. In this paper, we presented a GP-based approach to record deduplication. Our approach is able to automatically suggest deduplication functions based on evidence present in the data repositories. The suggested functions properly combine the best evidence available in order to identify whether two or more distinct record entries are replicas (i.e., represent the same real-world entity) or not.
Our experiments show that our GP-based approach is able to adapt the suggested deduplication functions to different boundary values used to classify a pair of records as replicas or not. Moreover, the results suggest that using a fixed boundary value, as close to 1 as possible, eases the evolutionary effort and also leads to better solutions.
As future work, we intend to conduct additional research to extend the range of application of our GP-based approach to record deduplication.
