radio-transcripts

radioscraper.py scrapes transcripts from WAMU's website for (non-commercial!) research purposes.

Usage

Run the scraper in a directory that does not already contain directories named 'corpus' or 'transcript_collection'. Before running, change the constants in the script to match the target show. URL_STEM must not include the trailing '/'.
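The two preconditions above can be checked before a run. The following is only a sketch and is not part of the scraper: the function name `preflight_problems` and the example `URL_STEM` value are assumptions made for illustration, and the real constant should be whatever the target show requires.

```python
import os

# Assumed example value; set this to the stem of the target show's URLs.
URL_STEM = 'http://thekojonnamdishow.org'

# Directory names taken from the instructions above.
OUTPUT_DIRS = ('corpus', 'transcript_collection')


def preflight_problems(workdir, url_stem):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in OUTPUT_DIRS:
        # The working directory must not already contain these directories.
        if os.path.isdir(os.path.join(workdir, name)):
            problems.append("directory %r already exists" % name)
    if url_stem.endswith('/'):
        problems.append("URL_STEM should not end with '/'")
    return problems


if __name__ == '__main__':
    for problem in preflight_problems(os.getcwd(), URL_STEM):
        print(problem)
```

Running this from the intended working directory prints nothing when both conditions hold.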

Important: the scrapetranscripturls() function needs to be changed as well. In that function, adjust the first url[##:##] slice so that it returns only the date of the show, and the second url[##:##] slice so that it returns the first few letters of the show title, without any '/' characters.

Example:

>>> url = 'http://thekojonnamdishow.org/shows/2013-06-10/tale-two-sequesters/transcript'

>>> url[35:45]
'2013-06-10'

>>> url[46:50]
'tale'
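Counting characters by hand to find the slice numbers is error-prone, so they can be computed from one sample transcript URL instead. This is a sketch under assumptions: `slice_bounds` is a hypothetical helper, not a function in the scraper, and it assumes the URL path has the form /shows/<date>/<title>/transcript as in the example above. It is written for Python 3 (`urllib.parse`).

```python
from urllib.parse import urlparse


def slice_bounds(url, prefix_len=4):
    """Return ((date_start, date_end), (title_start, title_end)) for url.

    Assumes the path looks like /shows/<date>/<title>/transcript.
    """
    # Splitting the path gives ['', 'shows', <date>, <title>, 'transcript'].
    parts = urlparse(url).path.split('/')
    date = parts[2]
    # Locate the date inside the full URL string.
    date_start = url.index('/' + date + '/') + 1
    date_end = date_start + len(date)
    # The title starts right after the '/' that follows the date.
    title_start = date_end + 1
    return (date_start, date_end), (title_start, title_start + prefix_len)


url = 'http://thekojonnamdishow.org/shows/2013-06-10/tale-two-sequesters/transcript'
(d0, d1), (t0, t1) = slice_bounds(url)
print(d0, d1, url[d0:d1])  # 35 45 2013-06-10
print(t0, t1, url[t0:t1])  # 46 50 tale
```

The printed numbers are the ones to put in place of the two url[##:##] slices. Keep `prefix_len` short enough that the title slice cannot run into the next '/' for shows with short first words.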

The result will be a corpus directory with clean files including only words spoken on the broadcast, and a transcript_collection directory with files that include show title, speaker names, and timestamps.
