© 2014 Ancestry.com Operations, Inc. All rights reserved. This dataset, and any portions thereof, may be downloaded and reused without permission for the limited purposes of research, private study or education (non-commercial use) only, and subject to the following terms: You may not remove any copyright, trademark, or other proprietary notices, including attribution information, credits, and notices, that are placed in or near the text, images, or data; you may not use the dataset, or any portions thereof, for commercial purposes. If you wish to use the dataset for any purpose beyond the permitted uses, you must obtain prior written permission from Ancestry and the authors; the dataset is provided “as is” without a warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability, fitness for a particular use, and/or non-infringement; and you must cite the author and source of the content as you would material from any printed work. This dataset should be cited as: Jeffrey Sukharev, Leonid Zhukov, Alexandrin Popescul, "Parallel corpus approach for name matching in record linkage," Proceedings of IEEE ICDM 2014, Shenzhen, China.
Format:
records25k_data.tsv (5 columns)
last name #1 : last name originated from Ancestry user tree nodes
last name #2 : last name originated from Ancestry records
co-occurrence counter : number of times this name pair occurs in the Records dataset
marginal counter last name #1 : counter of last name #1 among tree node last names
marginal counter last name #2 : counter of last name #2 among last names in Ancestry records
search12.5k_data.tsv (5 columns)
last name #1 : last name originated from user search logs
last name #2 : last name originated from user search reformulation (also from search logs)
co-occurrence counter : number of times this name pair occurs in the Search dataset
marginal counter last name #1 : counter of last name #1 in search logs
marginal counter last name #2 : counter of last name #2 in search logs
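The two pair files share the same five-field layout, so one reader can serve both. The following is a minimal sketch, not an official loader. It assumes the files are tab-separated, UTF-8 encoded, and have one pair per line; the names NamePair and read_pairs are hypothetical. Lines that do not parse (for example a header row, if one is present) are skipped.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class NamePair(NamedTuple):
    """One row of a pair file (hypothetical field names)."""
    name1: str          # last name #1
    name2: str          # last name #2
    cooccurrence: int   # co-occurrence counter for the pair
    count1: int         # marginal counter for last name #1
    count2: int         # marginal counter for last name #2


def read_pairs(lines: Iterable[str]) -> Iterator[NamePair]:
    """Yield NamePair records from tab-separated lines.

    Assumption: five tab-separated fields per line, no quoting.
    Blank lines and rows whose counters are not integers are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            continue  # blank or malformed line
        name1, name2, co, c1, c2 = row
        try:
            yield NamePair(name1, name2, int(co), int(c1), int(c2))
        except ValueError:
            continue  # non-numeric counters, e.g. a header row


# Usage (assumes the file is in the working directory and UTF-8 encoded):
# with open("records25k_data.tsv", encoding="utf-8", newline="") as f:
#     pairs = list(read_pairs(f))
```

The same function can be pointed at search12.5k_data.tsv, since its fields have the same order and meaning.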
search_surnames_counts_250k.tsv (2 columns)
last name
counter
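The two-column counts file can be read into a dictionary in the same way. This is a sketch under the same assumptions (tab-separated, UTF-8, one surname per line); read_surname_counts is a hypothetical name. If a surname appears more than once, the last value is kept.

```python
import csv
from typing import Dict, Iterable


def read_surname_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Map each last name to its counter.

    Assumption: two tab-separated fields per line, no quoting.
    Blank lines and rows with a non-integer counter are skipped.
    """
    counts: Dict[str, int] = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # blank or malformed line
        name, raw_count = row
        try:
            counts[name] = int(raw_count)
        except ValueError:
            continue  # e.g. a header row, if present
    return counts


# Usage (assumes the file is in the working directory and UTF-8 encoded):
# with open("search_surnames_counts_250k.tsv", encoding="utf-8", newline="") as f:
#     surname_counts = read_surname_counts(f)
```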