Reddit

Tools for curating Reddit archive data.

0. Requirements

1) Packages
zstandard - pip install zstandard
nltk      - pip install nltk
2) Data directory
|---data 
|   |---2010
|   |   |--- RS_2010-01.zst 
|   |   |--- ... 
|   |   |--- RS_2010-12.zst 
|   |   |--- RC_2010-01.zst 
|   |   |--- ...
|   |   |--- RC_2010-12.zst
|   |---origin
|   |   |--- rs_2010.csv 
|   |   |--- rc_2010.csv 
|   |   |--- ...  
|   |---processed
|   |   |--- dataset1.csv 
|   |   |--- dataset2.csv 
|   |   |--- dataset3.csv

1. Data Explanation

1) rs*.csv, rc*.csv (folder: origin)
Data extracted from the RS_*.zst and RC_*.zst dumps.
2) dataset1.csv (folder: processed)
Each row's text contains multiple sentences (columns: id, subreddit, text, type; type is one of title, post, comment).
3) dataset2.csv (folder: processed)
Each row's text is a single sentence (same columns as dataset1.csv).
4) dataset3.csv (folder: processed)
dataset2.csv with personal information (names) removed using a BERT NER tagger.
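The de-identification step for dataset3.csv replaces named-entity spans with a placeholder. A minimal sketch of that replacement, assuming character-offset spans have already been produced by a NER tagger (the `redact` helper and the `[NAME]` placeholder are illustrative, not the repository's actual code):

```python
def redact(text, spans, placeholder="[NAME]"):
    """Replace each (start, end) character span in text with a placeholder.

    Spans are applied right-to-left so earlier offsets stay valid
    after each substitution.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text

# Example: spans as a NER tagger might emit for PERSON entities
sentence = "Alice met Bob at the park."
person_spans = [(0, 5), (10, 13)]
print(redact(sentence, person_spans))  # [NAME] met [NAME] at the park.
```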

2. How to use

1) Extract data
Extract the records for the specified subreddit and year from the .zst dump files:
$ python data_extract.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --year {$YEAR} 
The extracted data is stored in the origin folder.
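The Pushshift-style dumps are zstd-compressed NDJSON (one JSON object per line), so the core of the extraction is filtering lines by the `subreddit` field. A minimal sketch of that filtering step, assuming the file has already been decompressed into an iterable of lines (the function name and field handling are illustrative, not the script's actual code):

```python
import json

def filter_records(lines, subreddit):
    """Yield parsed JSON records whose 'subreddit' field matches.

    In practice `lines` would come from streaming a RS_/RC_ .zst file,
    e.g. zstandard.ZstdDecompressor().stream_reader(fh) wrapped in
    io.TextIOWrapper.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if record.get("subreddit") == subreddit:
            yield record

raw = [
    '{"subreddit": "AskReddit", "id": "a1", "title": "hello"}',
    '{"subreddit": "python", "id": "b2", "title": "world"}',
]
print([r["id"] for r in filter_records(raw, "python")])  # ['b2']
```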

2) Process data
Pre-process the extracted Reddit data to create dataset1.csv:
$ python create_dataset1.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --year {$YEAR} 
Pre-process dataset1.csv to create dataset2.csv:
$ python create_dataset2.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --year {$YEAR} 
Delete personal information (names) from dataset2.csv to create dataset3.csv:
$ python de-identification.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --year {$YEAR}
Processed data is stored in the processed folder. (dataset1: document-level, dataset2: single-sentence)
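The dataset1 → dataset2 step splits each multi-sentence text into one row per sentence. The repository lists nltk as a requirement, presumably for `nltk.sent_tokenize`; a dependency-free sketch of the same idea with a naive regex splitter (an illustration, not the repository's actual code):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace.

    The pipeline would use nltk.sent_tokenize instead, which handles
    abbreviations and other edge cases this regex does not.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "I tried the new update. It broke everything! Anyone else?"
print(split_sentences(doc))
# ['I tried the new update.', 'It broke everything!', 'Anyone else?']
```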

3) Concat data

Concatenate the dataset1 files into one file:
$ python concat_doc.py --data_path {$DATA_PATH}
Concatenate the dataset2 files into one file:
$ python concat_data.py --data_path {$DATA_PATH}
Concatenate the dialog data into one file:
$ python concat_dialog.py --data_path {$DATA_PATH} --subreddit {$SUBREDDIT_NAME} --dtype {$DIALOG_TYPE}
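The concat scripts merge the per-subreddit/per-year CSVs into a single file. A minimal sketch of such a merge with the stdlib csv module, writing the header only once (the function name and file layout are assumptions, not the scripts' actual code):

```python
import csv

def concat_csv(in_paths, out_path):
    """Append rows from each input CSV to out_path, writing the header once.

    Assumes every input file shares the same header row.
    """
    header_written = False
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in in_paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader)
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)
```

Usage would be e.g. `concat_csv(sorted(glob.glob("processed/*/dataset2.csv")), "dataset2_all.csv")`, with the glob pattern adapted to the actual folder layout.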

About

Reddit data crawling & archive data processing
