Track plain text file changes as feed kind, diffing against previous content #8107
nykula wants to merge 1 commit into FreshRSS:edge
Hello and thanks for this PR.
Hmm, I initially thought of reusing the existing HTML+XPath scraping function and adding a checkbox to turn off DOM parsing, but that was a more invasive change, and decoupling the two functions seemed to make sense to me. The idea is to create and expose the simplest possible RSS feed (using FreshRSS capabilities) from a non-RSS source, which is less about the MIME type than about the source being just a text, config or code file.
I suspect there's a misunderstanding here. XML feeds that are sometimes sent out as […]
But what if these plain text files are really, really big?
What would be an appropriate […]
My question was merely what would happen. :-) If the answer is "unlimited", I'd say limit it to whatever the code has decided elsewhere.
Sure. I monitor the following files:
- Faircamp en.rs at Codeberg. The host provides an RSS feed of all commits but not one of changes to a specific file, as far as I know.
- Ghost Commander strings.xml at SourceForge. The host also seems to provide an RSS feed of all commits, without more precision.
- Helio translations at Google Sheets. The host offers no RSS feeds, so I monitor its CSV export.
- (would like to, haven't figured out a setup yet) LineageOS at Crowdin. The host used to offer an RSS feed but removed it with a promise of a JSON API and didn't implement it.

I learned yesterday about a utility called urlwatch, which is probably a cleaner way to observe files than patching FreshRSS, but I rely on a shared server and would rather it didn't keep my SMTP credentials.
Added a naive diff using PHP array functions. It gets stored on the entry as a […]. The diff runs against the latest existing article in the same feed that doesn't have the same guid (content hash); for this I added a […]. Since the PHP standard library doesn't provide a more sophisticated text diffing algorithm like Python's difflib, I want to see the insertions first and the deletions later, so the diff in the resulting feed looks like this (is it acceptable to hardcode Unicode emoji in the model?):
Yes |
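For readers following the thread, the naive diff described above can be sketched roughly as below. This is not the code from the PR: the function name `naiveLineDiff` and the `+ `/`- ` markers are hypothetical placeholders (the PR uses emoji markers), and the sketch only assumes what the comment states, namely PHP array functions with insertions listed before deletions.

```php
<?php
/**
 * Naive line-based diff (illustrative sketch, not the PR's implementation).
 * Lists lines found only in the new text first, then lines found only in
 * the old text. array_diff() compares values only, so line positions and
 * repeated lines are not tracked.
 */
function naiveLineDiff(string $old, string $new): string {
    // Split on any newline sequence; fall back to an empty list on failure.
    $oldLines = preg_split('/\R/', $old) ?: [];
    $newLines = preg_split('/\R/', $new) ?: [];

    $added = array_diff($newLines, $oldLines);   // present only in $new
    $removed = array_diff($oldLines, $newLines); // present only in $old

    $out = [];
    foreach ($added as $line) {
        $out[] = '+ ' . $line; // hypothetical marker for insertions
    }
    foreach ($removed as $line) {
        $out[] = '- ' . $line; // hypothetical marker for deletions
    }
    return implode("\n", $out);
}

echo naiveLineDiff("a\nb\nc", "a\nc\nd"), "\n";
// prints:
// + d
// - b
```

Because only set membership is compared, a line that moves within the file, or a duplicate of a line that already exists, produces no output.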
Hello, this patch has been working as expected for me for the last few weeks, so I rebased against edge, fixed all linter complaints and unmarked its draft status. |
$sql = <<<SQL
SELECT id, guid, title, author, {$content}, link, date, `lastSeen`, `lastUserModified`, {$hash} AS hash, is_read, is_favorite, id_feed, tags, attributes
FROM `_entry` WHERE id_feed=:id_feed AND guid!=:guid
ORDER BY date DESC
Suggested change:
- ORDER BY date DESC
+ ORDER BY id DESC
We are probably interested in the latest received entry (date can potentially be anything).
Yes, comparing against the latest received is the idea. I'll try changing that to id locally. Where can different dates come from? I saw custom timestamp parsing logic in loadHtmlXpath but in my branch there's currently no logic like that.
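To illustrate why the ordering matters, here is a small in-memory sketch of selecting the comparison baseline. It is not FreshRSS code: `latestOtherEntry` and the sample data are hypothetical, and it assumes, as the suggestion above implies, that `id` grows with reception order while `date` may hold any value.

```php
<?php
/**
 * Pick the most recently received entry whose guid differs from $guid.
 * Entries are plain arrays here; 'id' is assumed to increase with
 * reception order, whereas 'date' may be arbitrary.
 */
function latestOtherEntry(array $entries, string $guid): ?array {
    $latest = null;
    foreach ($entries as $entry) {
        if ($entry['guid'] === $guid) {
            continue; // skip the entry currently being processed
        }
        if ($latest === null || $entry['id'] > $latest['id']) {
            $latest = $entry;
        }
    }
    return $latest;
}

// Hypothetical sample: the first received entry carries the newest date.
$entries = [
    ['id' => 1, 'guid' => 'aaa', 'date' => 1700000300],
    ['id' => 2, 'guid' => 'bbb', 'date' => 1700000100],
    ['id' => 3, 'guid' => 'ccc', 'date' => 1700000200],
];

$previous = latestOtherEntry($entries, 'ccc');
echo $previous['guid'], "\n"; // prints "bbb"; ordering by date would pick "aaa"
```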
        }
    }
    if ($diff !== '') {
        $attributesByGuid[$item['guid']] = ['diff' => $diff];
I have not considered it in depth, but instead of saving the diff in an attribute, could it be an idea to use the normal article content and add <ins> and <del> tags, which can be further styled with CSS? During the next diff, the <del> tags can simply be stripped before the comparison.
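A minimal sketch of the stripping step suggested here, under the assumption that "stripping" means dropping `<del>` elements together with their contents and unwrapping `<ins>` elements so that only the current text remains. The function name `stripDiffMarkup` is hypothetical, and the regular expressions assume the tags are not nested.

```php
<?php
/**
 * Recover the current plain text from content annotated with <ins>/<del>.
 * Illustrative only: assumes non-nested tags and no ">" inside attributes.
 */
function stripDiffMarkup(string $html): string {
    // Remove deleted passages entirely, including their contents.
    $withoutDel = preg_replace('~<del\b[^>]*>.*?</del>~is', '', $html) ?? $html;
    // Keep inserted passages, but drop the surrounding <ins> tags.
    return preg_replace('~</?ins\b[^>]*>~i', '', $withoutDel) ?? $withoutDel;
}

$stored = "line one\n<del>old line\n</del><ins>new line\n</ins>line three";
echo stripDiffMarkup($stored), "\n";
// prints:
// line one
// new line
// line three
```

The result can then serve as the "previous content" for the next comparison. Placing the tags correctly in the first place still requires a position-aware diff.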
About keeping diffs in content, my case involves long files with just a couple lines being added or removed at a time, for which showing small diffs instead of the full content with highlights seemed a good match. I also used simple array_diff, no full-fledged diff algorithm, so the code doesn't currently know at which places new lines appear or disappear, making it difficult to figure out where to put <ins> and <del> (otherwise, I agree using semantic HTML for that is a good idea).
I did initially try to keep the content and diff together in the content column, but I got lost in things related to loadCompleteContent, withEnclosures, _content, DOMDocument etc and didn't come up with a version of code working reliably. (Maybe someone else reading this is more confident about implementing it that way?) Keeping just the original content in the content column, while treating the diff like an optional view-related thing stored in an attribute, on the other hand, was quite straightforward to implement and worked the first time.
Note: Draft, seems to work but the files I need to watch don't update very often so I still have to test it more.
Goal: Notify about updates to translatable files in free software using different version control systems and hosting platforms (SourceForge SVN, Google Sheets). There are specialized tools that scrape websites using headless browsers, make screenshots etc, but my need is simpler and mostly matches what FreshRSS is already doing.
Ideally the tool would store a diff output against the most recently existing item, not the entire file contents, but I don't know how to implement that.
Changes proposed in this pull request:
- `KIND_PLAIN_TEXT` and `TYPE_PLAIN_TEXT` constants
- `loadPlainText` function in Feed model, call it from feed controller
- `text` mimetype option to `httpGet`

How to test the feature manually:

- `Plain text (entire file)` type and file URL

Pull request checklist:
Additional information can be found in the documentation.