
alexZeakis/TokenJoin_preprocessing


TokenJoin_preprocessing

This repository is part of TokenJoin. Run this code first to preprocess the datasets into the input that TokenJoin needs.

Usage

Step 1. Download or clone the project:

$ git clone https://github.com/alexZeakis/TokenJoin_preprocessing

Step 2. Open a terminal in the project's root folder and install it by running:

$ mvn install

Step 3. Make create_datasets.sh executable and run it:

$ chmod +x create_datasets.sh
$ ./create_datasets.sh <in_dir> <out_dir>

Datasets

We have used six real-world datasets:

  • Yelp: 160,016 sets extracted from the Yelp Open Dataset. Each set refers to a business. Its elements are the categories associated with it.

  • GDELT: 500,000 randomly selected sets from January 2019 extracted from the GDELT Project. Each set refers to a news article. Its elements are the themes associated with it. Themes are hierarchical. Each theme is represented by a string concatenating all themes from it to the root of the hierarchy.

  • Enron: 517,431 sets, each corresponding to an email message. The elements are the words contained in the message body.

  • Flickr: 500,000 randomly selected images from the Flickr Creative Commons dataset. Each set corresponds to a photo. The elements are the tags associated with that photo.

  • DBLP: 500,000 publications from the DBLP computer science bibliography. Each set refers to a publication. The elements are author names and words in the title.

  • MIND: 123,130 articles from the MIcrosoft News Dataset. Each set corresponds to an article. The elements are the words in its abstract.
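
The GDELT entry above describes each theme as a string that concatenates all themes from it up to the root of the hierarchy. The following is a minimal sketch of that idea, not the repository's actual code: the function name `theme_path`, the `PARENT` mapping, the theme names, and the `/` separator are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical hierarchy: each theme maps to its parent, and the root maps to None.
# These names are illustrative only and are not real GDELT themes.
PARENT: Dict[str, Optional[str]] = {
    "leaf": "mid",
    "mid": "root",
    "root": None,
}


def theme_path(theme: str, parent: Dict[str, Optional[str]], sep: str = "/") -> str:
    """Concatenate a theme with all of its ancestors, from the theme up to the root.

    The separator is an assumption; the README does not say which one is used.
    """
    parts = []
    node: Optional[str] = theme
    while node is not None:
        parts.append(node)
        # Move one level up; a theme missing from the mapping is treated as a root.
        node = parent.get(node)
    return sep.join(parts)


print(theme_path("leaf", PARENT))  # prints "leaf/mid/root"
```

With this representation, two themes that share ancestors also share a suffix of their strings, so a string similarity between elements can reflect how close the themes are in the hierarchy.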

The preprocessed CSV versions of the datasets used in the experiments can be found here.
