Reddit

Tools for curating Reddit archive data.

0. Requirements

1) Packages
zstandard - pip install zstandard
nltk      - pip install nltk
2) Data directory
|---data 
|   |---2010
|   |   |--- RS_2010-01.zst 
|   |   |--- ... 
|   |   |--- RS_2010-12.zst 
|   |   |--- RC_2010-01.zst 
|   |   |--- ...
|   |   |--- RC_2010-12.zst
|   |---origin
|   |   |--- rs_2010.csv 
|   |   |--- rc_2010.csv 
|   |   |--- ...  
|   |---processed
|   |   |--- dataset1.csv 
|   |   |--- dataset2.csv 
|   |   |--- dataset3.csv

1. Data Explanation

1) rs*.csv, rc*.csv (folder: origin)
Data extracted from the RS_*.zst and RC_*.zst dumps.
2) dataset1.csv (folder: processed)
Each row's text contains multiple sentences (columns: id, subreddit, text, type; type is one of title, post, comment).
3) dataset2.csv (folder: processed)
Each row's text is a single sentence (same columns as dataset1.csv).
4) dataset3.csv (folder: processed)
dataset2.csv with personal information (names) removed using a BERT NER tagger.
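The de-identification step for dataset3.csv replaces named-entity spans with a placeholder. A minimal sketch of that replacement, assuming character-offset spans have already been produced by a NER tagger (the `redact` helper and the `[NAME]` placeholder are illustrative, not the repository's actual code):

```python
def redact(text, spans, placeholder="[NAME]"):
    """Replace each (start, end) character span in text with a placeholder.

    Spans are applied right-to-left so earlier offsets stay valid
    after each substitution.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text

# Example: spans as a NER tagger might emit for PERSON entities
sentence = "Alice met Bob at the park."
person_spans = [(0, 5), (10, 13)]
print(redact(sentence, person_spans))  # [NAME] met [NAME] at the park.
```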

2. How to use

1) Extract data
Extract the records for the specified subreddit and year from the .zst dump files:
$ python data_extract.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --year {$YEAR} 
The extracted data is stored in the origin folder.
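The Pushshift-style dumps are zstd-compressed NDJSON (one JSON object per line), so the core of the extraction is filtering lines by the `subreddit` field. A minimal sketch of that filtering step, assuming the file has already been decompressed into an iterable of lines (the function name and field handling are illustrative, not the script's actual code):

```python
import json

def filter_records(lines, subreddit):
    """Yield parsed JSON records whose 'subreddit' field matches.

    In practice `lines` would come from streaming a RS_/RC_ .zst file,
    e.g. zstandard.ZstdDecompressor().stream_reader(fh) wrapped in
    io.TextIOWrapper.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if record.get("subreddit") == subreddit:
            yield record

raw = [
    '{"subreddit": "AskReddit", "id": "a1", "title": "hello"}',
    '{"subreddit": "python", "id": "b2", "title": "world"}',
]
print([r["id"] for r in filter_records(raw, "python")])  # ['b2']
```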

2) Process data
Pre-process the extracted Reddit data to create dataset1.csv:
$ python create_dataset1.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --year {$YEAR} 
Pre-process dataset1.csv to create dataset2.csv:
$ python create_dataset2.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --year {$YEAR} 
Delete personal information (names) from dataset2.csv to create dataset3.csv:
$ python de-identification.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --year {$YEAR}
Processed data is stored in the processed folder. (dataset1: document-level, dataset2: single-sentence)
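The dataset1 → dataset2 step splits each multi-sentence text into one row per sentence. The repository lists nltk as a requirement, presumably for `nltk.sent_tokenize`; a dependency-free sketch of the same idea with a naive regex splitter (an illustration, not the repository's actual code):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace.

    The pipeline would use nltk.sent_tokenize instead, which handles
    abbreviations and other edge cases this regex does not.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "I tried the new update. It broke everything! Anyone else?"
print(split_sentences(doc))
# ['I tried the new update.', 'It broke everything!', 'Anyone else?']
```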

3) Concat data

Concatenate the dataset1 files into one file:
$ python concat_doc.py --data_path {$DATA_PATH}
Concatenate the dataset2 files into one file:
$ python concat_data.py --data_path {$DATA_PATH}
Concatenate the dialog data into one file:
$ python concat_dialog.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --dtype {$DIALOG_TYPE}
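The concat scripts merge the per-subreddit/per-year CSVs into a single file. A minimal sketch of such a merge with the stdlib csv module, writing the header only once (the function name and file layout are assumptions, not the scripts' actual code):

```python
import csv

def concat_csv(in_paths, out_path):
    """Append rows from each input CSV to out_path, writing the header once.

    Assumes every input file shares the same header row.
    """
    header_written = False
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in in_paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader)
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)
```

Usage would be e.g. `concat_csv(sorted(glob.glob("processed/*/dataset2.csv")), "dataset2_all.csv")`, with the glob pattern adapted to the actual folder layout.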

About

Reddit data crawling & archive data processing
