Home
Welcome to the Podget wiki!
When we download items from a podcast feed, we often find filename issues with the downloaded files. There are a few ways we can try to mitigate those. Version 0.9.1 of Podget added the ability to rename a downloaded file based on the contents of the TITLE tag just before or after it in the feed. Before that, we could look at the tags set by the server on the downloaded file to see if there was a better name to use.
This is something we can do for RSS Feeds. It does not work for ATOM feeds yet.
The first thing we need to do is determine whether a feed uses TITLE tags and whether they come before or after an item's enclosure tag.
To do so, from a shell we can run:
wget -O - http://somesite.com/feed.rss | sed -n -e :a -e 's/.*<enclosure.*url\s*=\s*"\([^"]\+\)".*/URL \1/Ip' -e t -e "s/.*<enclosure.*url\s*=\s*'\([^']\+\)'.*/URL \1/Ip" -e t -e 's/.*<title>\(.*\)<[/]title>.*$/TITLE \1/Ip' -e t -e '/\(<enclosure\|<title>\).*/I{N;s/\ *\n/\ /;T;ba}'
This gives us a list of all the download URLs from the ENCLOSURE tags, along with the TITLE tags from the feed. Each line starts with either TITLE or URL: an item to download is a URL line, and the name for an item is a TITLE line. One exception is that many feeds show several TITLE lines at the top. Podget ignores repeated TITLE tags and only uses the one closest to the URL.
Now there are two common options for TITLE and URL tags. Most commonly feeds will have the TITLE tag then the URL tag. Infrequently a feed may have the URL then the TITLE tag.
So we have two possible options:
- OPT_FILENAME_RENAME_TITLETAG - For feeds where TITLE tag precedes the URL tag.
- OPT_FILENAME_RENAME_REVTITLETAG - For feeds where the URL tag precedes the TITLE tag.
We use these options in our serverlist file like so:
http://somesite.com/feed.rss CATEGORY Feed Name OPT_FILENAME_RENAME_TITLETAG
http://somesite.com/feed.rss CATEGORY Feed Name OPT_FILENAME_RENAME_REVTITLETAG
Which version you use is dependent upon how the feed is formatted.
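As a rough aid for reading that listing, here is a minimal sketch of a helper that guesses the tag order. The function name guess_title_order is hypothetical and not part of Podget. It assumes every item in the feed has both a TITLE and a URL line, so the last line of the listing reveals which tag comes second within an item.

```shell
#!/bin/sh
# guess_title_order: hypothetical helper, not part of Podget.
# Reads the "TITLE ..." / "URL ..." listing on stdin and prints the option
# that appears to match the feed, judged by the last line of the listing.
# Assumption: every item has both a TITLE and a URL line.
guess_title_order() {
    last=$(grep -E '^(TITLE|URL)( |$)' | tail -n 1)
    case "$last" in
        URL*)   echo "OPT_FILENAME_RENAME_TITLETAG" ;;    # TITLE came first
        TITLE*) echo "OPT_FILENAME_RENAME_REVTITLETAG" ;; # URL came first
        *)      echo "unknown"; return 1 ;;               # nothing usable
    esac
}
```

Piping the output of the sed command above into guess_title_order prints one of the two option names. Treat the result as a hint only; feeds with missing titles or enclosures can fool it, so check a few items by hand.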
On some feeds, you will find very little in common between the downloaded filename and the title. For those feeds you may need to manually download a few items and listen to them to determine which title belongs to which file, and therefore the order of the tags.
For some feeds, the downloaded filename will be superior to the title so you won't want to use these options.
For some feeds, the title will not allow for easy listing in the order the items were created, so you may also want to rename the downloaded file to include its modification date. To do so:
http://somesite.com/feed.rss CATEGORY Feed Name OPT_FILENAME_RENAME_TITLETAG OPT_FILENAME_RENAME_MDATE
This renames the downloaded file twice: once with the TITLE, and then again to prefix the name with the date and time that it was modified. Renaming by modification date does not work with every server, so it might take some testing to see if it works for you. See below for more information about the OPT_FILENAME_RENAME_MDATE option.
Knowing which option a feed needs is perhaps the hardest question when it comes to using any of the "OPT_" options in your server list. Unfortunately, we do not have an automated way to test feeds yet, so we have to fall back on manual tests.
To determine if a feed is configured in a way where OPT_CONTENT_DISPOSITION or OPT_FILENAME_LOCATION can help, we need to first download the enclosures list from the feed and then download a few items from the feed to examine what tags they give us.
NOTE: For this example, I will be using a feed from Naked Astronomy because it is formatted nicely for these explanations. Other feeds will not be as easy to read.
To download a feed and filter for enclosure tags, run:
$ wget -O - http://rss.acast.com/naked_astronomy_podcast | grep enclosure
--2016-05-03 09:54:02-- http://rss.acast.com/naked_astronomy_podcast
Resolving rss.acast.com (rss.acast.com)... 137.117.90.63
Connecting to rss.acast.com (rss.acast.com)|137.117.90.63|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 150865 (147K) [text/xml]
Saving to: ‘STDOUT’
<enclosure url="http://rss.acast.com/naked_astronomy_podcast/eyes-on-the-sky-for-mercury/media.mp3" length="24728564" type="audio/mpeg"/>
<enclosure url="http://rss.acast.com/naked_astronomy_podcast/riding-on-a-space-sofa/media.mp3" length="36828171" type="audio/mpeg"/>
<enclosure url="http://rss.acast.com/naked_astronomy_podcast/adventures-in-satspotting/media.mp3" length="25695313" type="audio/mpeg"/>
[CLIP]....
The "-O -" option tells wget to write what is downloaded to standard output (generally the screen), and we pipe it to grep to filter for just the enclosure tags we care about. From those tags, we can see that the URL for every item ends in "media.mp3". This would create problems if we simply downloaded them all, because every file would be saved under the same name. So we need to dive a little deeper to see how they are handling the naming issue.
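To spot this kind of collision without reading every line, a small filter can list the filenames that appear more than once. This is only a sketch: duplicate_basenames is a hypothetical name, and it assumes double-quoted url attributes with one enclosure tag per line, as in the output above.

```shell
#!/bin/sh
# duplicate_basenames: hypothetical helper, not part of Podget.
# Reads feed XML (or just its enclosure lines) on stdin and prints every
# filename that more than one enclosure URL ends in.
# Assumption: url attributes are double-quoted, one enclosure per line.
duplicate_basenames() {
    sed -n 's/.*<enclosure[^>]*url="\([^"]*\)".*/\1/p' |
        sed -e 's/[?#].*$//' -e 's#.*/##' |   # drop query string, keep last path part
        sort | uniq -d                        # keep only names seen more than once
}
```

For the Naked Astronomy lines shown above this prints media.mp3 once. Empty output suggests the URLs already end in distinct names.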
Next, we randomly pick at least one item from the feed and download it to see what tags they've configured.
$ wget --server-response http://rss.acast.com/naked_astronomy_podcast/eyes-on-the-sky-for-mercury/media.mp3
--2016-05-03 10:03:15-- http://rss.acast.com/naked_astronomy_podcast/eyes-on-the-sky-for-mercury/media.mp3
Resolving rss.acast.com (rss.acast.com)... 137.117.90.63
Connecting to rss.acast.com (rss.acast.com)|137.117.90.63|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 302 Found
Cache-Control: no-cache, no-store, must-revalidate
Content-Length: 148
Content-Type: text/plain; charset=utf-8
Location: http://audiostitcher.bwh9c8255b4.netdna-cdn.com/naked_astronomy_podcast/eyes-on-the-sky-for-mercury/lsD7Oa1uNC1-aWC_IWzAKg.mp3
Vary: Accept, Accept-Encoding
Server: Microsoft-IIS/8.0
Set-Cookie: TiPMix=41.3533173694058; path=/; Domain=rss.acast.com
Set-Cookie: acastRss=0ce0cb9c5eb6cc91261d3ededd232234f478228d5204a52c534c63f1d79e7af3e07e478a9ff2adb567f7ff9857cfdf8dc4cc0389d43aaaace80b1e61f6a65b18043028394c7a8326ffe8b073332dfa29; path=/; expires=Wed, 03 May 2017 16:03:15 GMT
Arr-Disable-Session-Affinity: true
Date: Tue, 03 May 2016 16:03:16 GMT
Location: http://audiostitcher.bwh9c8255b4.netdna-cdn.com/naked_astronomy_podcast/eyes-on-the-sky-for-mercury/lsD7Oa1uNC1-aWC_IWzAKg.mp3 [following]
--2016-05-03 10:03:16-- http://audiostitcher.bwh9c8255b4.netdna-cdn.com/naked_astronomy_podcast/eyes-on-the-sky-for-mercury/lsD7Oa1uNC1-aWC_IWzAKg.mp3
Resolving audiostitcher.bwh9c8255b4.netdna-cdn.com (audiostitcher.bwh9c8255b4.netdna-cdn.com)... 94.31.29.128
Connecting to audiostitcher.bwh9c8255b4.netdna-cdn.com (audiostitcher.bwh9c8255b4.netdna-cdn.com)|94.31.29.128|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Tue, 03 May 2016 16:03:16 GMT
Content-Type: audio/mpeg
Content-Length: 12603873
Connection: keep-alive
Cache-Control: public, max-age=604800, s-max-age=604800
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Content-Length
Content-Disposition: filename=eyes-on-the-sky-for-mercury.mp3
Arr-Disable-Session-Affinity: true
Server: NetDNA-cache/2.2
X-Cache: HIT
Accept-Ranges: bytes
Length: 12603873 (12M) [audio/mpeg]
Saving to: ‘media.mp3’
media.mp3 100%[======================================================================================================================>] 12.02M 1.38MB/s in 7.6s
2016-05-03 10:03:24 (1.58 MB/s) - ‘media.mp3’ saved [12603873/12603873]
OK, looking at the server response tags (the HTTP headers), we see two that could be useful. First, the Location tags. We care about the last Location tag, but unfortunately in this case it ends in a filename that isn't very useful. So we look at the Content-Disposition tag, which has the option "filename=eyes-on-the-sky-for-mercury.mp3". This is the filename we want!
So if we were to configure this feed in our server list, it would look something like:
http://rss.acast.com/naked_astronomy_podcast CATEGORY Naked Astronomy OPT_CONTENT_DISPOSITION
Now if the Location tag had ended in a usable filename then we could have used OPT_FILENAME_LOCATION.
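The manual check above can be sketched as two small filters over saved wget output. Both function names are hypothetical and not part of Podget, and the parsing is deliberately simple: it handles plain and double-quoted filename= values and nothing more elaborate.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Podget. Both read saved
# `wget --server-response` output on stdin.

# Print the filename from the last Content-Disposition line, if any.
content_disposition_filename() {
    tr -d '\r' |
        grep -i '^ *Content-Disposition:' |
        tail -n 1 |
        sed -n 's/.*filename="\{0,1\}\([^";]*\)"\{0,1\}.*/\1/p'
}

# Print the last path component of the final Location line, if any.
last_location_filename() {
    tr -d '\r' |
        grep -i '^ *Location:' |
        tail -n 1 |
        sed -e 's/^[^:]*: *//' -e 's/ \[following\]$//' \
            -e 's/[?#].*$//' -e 's#.*/##'
}
```

wget writes the server response to standard error, so something like "wget --server-response -O /dev/null URL 2> headers.txt" saves it to a file (headers.txt is just an example name) that can be fed to either function. Whichever one prints a readable name points to the option to try.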
Some feeds are hosted on servers that do not reliably provide the "Content-Disposition: filename=" tag. For these feeds, you will find that multiple Podget sessions with '--force' result in different files getting renamed to understandable strings. For these feeds, there is a second option, OPT_DISPOSITION_FAIL, to be used in conjunction with OPT_CONTENT_DISPOSITION. It tells Podget not to place the URL in the COMPLETED log unless it gets a 'filename=' tag. Because the URL is not logged, it will be retried the next time Podget runs, and given the hit-or-miss nature of the server providing the "filename=" tag, it may take a few runs for every file to be renamed. I had one feed, provided by a reputable source, that took six runs of Podget to correctly rename all the files. Unfortunately, I have not found an easy way to test in advance whether this will be a problem, so it may be advisable to use OPT_DISPOSITION_FAIL by default until a feed has proven to be well behaved.
Another option for feeds that use a common filename for every enclosure item (OPT_FILENAME_RENAME_MDATE) is to prefix each filename with its modification date. This only works for some feeds: some include a usable modification date and others do not (their files all end up with a modification date of when they were saved to your PC). To determine if this option is usable for a feed, manually download a few items from it as we did above. Then look at the items with:
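The retry behaviour described here can be pictured with a short sketch. This is not Podget's actual code: record_if_named and its arguments are hypothetical, and the log is assumed to be a plain text file with one URL per line.

```shell
#!/bin/sh
# record_if_named URL FILENAME LOGFILE
# Hypothetical sketch, not Podget's implementation.
# Appends URL to LOGFILE only when a 'filename=' value was received, so an
# unnamed download stays out of the log and is tried again on the next run.
record_if_named() {
    url=$1
    name=$2
    log=$3
    if [ -n "$name" ]; then
        printf '%s\n' "$url" >> "$log"   # named: mark as completed
    else
        return 1                         # unnamed: leave out of the log
    fi
}
```

Under this scheme a file is downloaded again on each run until the server happens to send a name, which would explain why some feeds need several passes.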
ls -ltr
This lists them by modification date (oldest first). If the files have different modification dates, we can use this option even if the filenames are the same or appear to be semi-random strings. With this option, saved filenames are prefixed in the format YYYYMMDD_HHh_MMm_[filename]: a 4-digit year, 2-digit month, 2-digit day, 2-digit hour (24-hour clock), and 2-digit minute. This gives us a filename that is sortable and tells us a little about the file.
If the files do not provide a modification date but have unique names, we can still use this option; however, the date and time that the file was saved to our system will be used as the modification date. We can then see the files in the order we got them. It's not much, but sometimes it's the best we can do.
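To preview what such a name would look like for a file you have already downloaded, the prefix can be built from the file's modification time. This is a sketch under assumptions: mdate_prefixed_name is a hypothetical name, it relies on GNU date accepting a file with -r, and Podget's own implementation may differ.

```shell
#!/bin/sh
# mdate_prefixed_name FILE
# Hypothetical helper, not part of Podget. Prints FILE's name prefixed with
# its modification time as YYYYMMDD_HHh_MMm_ (local time).
# Assumption: GNU date, where -r FILE reads the file's modification time.
mdate_prefixed_name() {
    file=$1
    prefix=$(date -r "$file" '+%Y%m%d_%Hh_%Mm_') || return 1
    printf '%s%s\n' "$prefix" "$(basename "$file")"
}
```

A media.mp3 last modified at 10:03 local time on 3 May 2016 would come out as 20160503_10h_03m_media.mp3. The function only prints the name and does not rename anything.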