ihardy/radio-transcripts

radio-transcripts

radioscraper.py scrapes transcripts from WAMU's website for (non-commercial!) research purposes.

Usage

Run the scraper in a directory that does not already contain directories named 'corpus' or 'transcript_collection'. Before running, change the constants in the script to reflect the details of the target show. URL_STEM should not include a trailing '/'.
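The two preconditions above can be checked before a run. The following is a minimal sketch, not part of radioscraper.py: the helper name preflight_problems is hypothetical, and the URL_STEM value is an assumption inferred from the example URL in this README.

```python
import os

# Assumed value, inferred from the example URL in this README; the real
# constant is defined in radioscraper.py and may differ.
URL_STEM = 'http://thekojonnamdishow.org/shows'

# Directory names that must not already exist in the working directory.
OUTPUT_DIRS = ('corpus', 'transcript_collection')


def preflight_problems(workdir, url_stem):
    """Return a list of problems; an empty list means the run can proceed."""
    problems = []
    for name in OUTPUT_DIRS:
        if os.path.isdir(os.path.join(workdir, name)):
            problems.append("directory '%s' already exists in %s" % (name, workdir))
    if url_stem.endswith('/'):
        problems.append("URL_STEM must not end with '/'")
    return problems


if __name__ == '__main__':
    for problem in preflight_problems(os.getcwd(), URL_STEM):
        print('Problem: %s' % problem)
```

If the script prints nothing, the current directory and the stem both satisfy the requirements described above.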

Important: the scrapetranscripturls() function needs to be changed as well. Adjust the first url[##:##] slice so that it returns only the date of the show, and the second url[##:##] slice so that it returns the first few letters of the show title, without any '/' characters.

Example:

>>> url = 'http://thekojonnamdishow.org/shows/2013-06-10/tale-two-sequesters/transcript'
>>> url[35:45]
'2013-06-10'
>>> url[46:50]
'tale'
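Rather than counting characters by hand, the slice positions can be derived from the length of the stem. This is only a sketch under stated assumptions: it assumes transcript URLs have the form URL_STEM/YYYY-MM-DD/title/transcript, as in the example above, and the function slice_bounds and its title_chars parameter are hypothetical names, not part of radioscraper.py.

```python
DATE_LENGTH = len('YYYY-MM-DD')  # 10 characters


def slice_bounds(url_stem, title_chars=4):
    """Return (date_start, date_end, title_start, title_end) for url[a:b] slices.

    Assumes URLs look like <url_stem>/<YYYY-MM-DD>/<title>/transcript.
    """
    date_start = len(url_stem) + 1       # skip the '/' after the stem
    date_end = date_start + DATE_LENGTH
    title_start = date_end + 1           # skip the '/' after the date
    title_end = title_start + title_chars
    return date_start, date_end, title_start, title_end


url_stem = 'http://thekojonnamdishow.org/shows'
url = url_stem + '/2013-06-10/tale-two-sequesters/transcript'
d0, d1, t0, t1 = slice_bounds(url_stem)
print(d0, d1, t0, t1)            # 35 45 46 50
print(url[d0:d1], url[t0:t1])    # 2013-06-10 tale
```

The resulting numbers are the ones to put into the two url[##:##] slices. Keep title_chars small: if a show title is shorter than title_chars, the second slice will run past the title and pick up a '/'.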

The result will be a corpus directory with clean files including only words spoken on the broadcast, and a transcript_collection directory with files that include show title, speaker names, and timestamps.
