Track plain text file changes as feed kind, diffing against previous content #8107

Open
nykula wants to merge 1 commit into FreshRSS:edge from nykula:edge

nykula (Contributor) commented Oct 13, 2025

Note: Draft, seems to work but the files I need to watch don't update very often so I still have to test it more.

Goal: Notify about updates to translatable files in free software using different version control systems and hosting platforms (SourceForge SVN, Google Sheets). There are specialized tools that scrape websites using headless browsers, take screenshots, etc., but my need is simpler and mostly matches what FreshRSS is already doing.

Ideally the tool would store a diff output against the most recently existing item, not the entire file contents, but I don't know how to implement that.

Changes proposed in this pull request:

  • add KIND_PLAIN_TEXT and TYPE_PLAIN_TEXT constants
  • implement loadPlainText function in Feed model, call it from feed controller
  • add text mimetype option to httpGet

How to test the feature manually:

  1. Host a plain text file somewhere
  2. Subscribe to a new feed using Plain text (entire file) type and file URL
  3. Replace feed title and description with something short describing the file
  4. Edit the file
  5. Get a notification about the edit during next feeds update

Pull request checklist:

  • clear commit messages
  • code manually tested
  • unit tests written (optional if too hard)
  • documentation updated

Additional information can be found in the documentation.

Alkarex (Member) commented Oct 13, 2025

Hello and thanks for this PR.
Quick feedback already, and I will come back to it later:

  • It is not a new type of feed, but an attribute, similar to what we do with feeds requiring special cURL options, etc.
  • So instead of a new feed type, we could expose a checkbox when adding a feed, to allow wrong Mime types.
  • The information is already saved at the moment as an anchor #force_feed at the end of a URL when adding a feed, but updating to a checkbox (saved as an attribute) would probably be a nice improvement.

Alkarex added this to the 1.28.0 milestone Oct 13, 2025
Alkarex added the "RSS standard" label (the standard is defined e.g. on https://www.rssboard.org) Oct 13, 2025
nykula (Contributor, Author) commented Oct 13, 2025

Hmm, I initially thought of reusing the existing HTML+Xpath scraping function and adding a checkbox to turn off DOM parsing, but that was a more invasive change and decoupling the two functions seemed to make sense to me. The idea is to create and expose the simplest RSS feed possible (using FreshRSS capabilities) from a non-RSS source, which isn't as much about the mime type, but about it being just a text, config or code file.

Frenzie (Member) commented Oct 13, 2025

I suspect there's a misunderstanding here. XML feeds are sometimes sent out as text/plain.

But what if these plain text files are really, really big?

nykula (Contributor, Author) commented Oct 13, 2025

What would be an appropriate CURLOPT_MAXFILESIZE in that case, and would it help? The files I want to watch are indeed of varied size and not very small, some are just 5 KB, but the largest one so far is 400 KB.

Frenzie (Member) commented Oct 13, 2025

My question was merely what would happen. :-) If the answer is unlimited I'd say limit it to whatever the code has decided elsewhere.

Alkarex (Member) commented Oct 13, 2025

> I suspect there's a misunderstanding here. XML feeds are sometimes sent out as text/plain.

@Frenzie Indeed, that is what I thought it was about.

@nykula Would you have an example use case, including a corresponding text document?

nykula (Contributor, Author) commented Oct 14, 2025

Sure. I monitor the following files:

Faircamp en.rs at Codeberg. The host provides an RSS feed of all commits but not one of changes to a specific file, as far as I know.

Ghost Commander strings.xml at SourceForge. The host also seems to provide an RSS feed of all commits without more precision.

Helio translations at Google Sheets. The host offers no RSS feeds, so I monitor its CSV export.

(would like to, haven't figured out a setup yet) LineageOS at Crowdin. The host used to offer an RSS feed but removed it with a promise of a JSON API and didn't implement it.

I learned yesterday about a utility called urlwatch, which is probably a cleaner way to observe files than patching FreshRSS, but I rely on a shared server and would rather it didn't keep my SMTP credentials.

nykula (Contributor, Author) commented Nov 1, 2025

Added a naive diff using PHP array functions. It gets stored on the entry as a diff attribute, which feed views show instead of the full content when available. This addresses the concern about big files, and lets me see which new lines are there to translate without comparing the files manually.

The diff runs against the latest existing article in the same feed that doesn't have the same guid (content hash); for this I added a latestExceptGuid function to the EntryDAO model. If there's nothing yet to diff against, the full text is returned to the views.

The PHP standard library doesn't provide a more sophisticated text diffing algorithm like Python's difflib, and I want to see the insertions first and the deletions later, so the diff in the resulting feed looks like this (is it acceptable to hardcode Unicode emoji in the model?):

➕
'New line for translation' => '',
'Another new line' => '',
'Etc' => '',

➖
'Removed line' => 'Translation that is now unused',
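
For readers following along, the naive diff described above could be sketched roughly as follows. This is an illustration only, not the code from the pull request: the function name `naiveLineDiff` and the exact output layout are assumptions. The only things taken from the comment are the use of PHP array functions on lines and the ordering, with insertions before deletions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not from the PR): line-based diff using array_diff().
 * It only detects which lines appeared or disappeared, not where in the file.
 */
function naiveLineDiff(string $old, string $new): string {
	$oldLines = explode("\n", $old);
	$newLines = explode("\n", $new);

	// Lines present in the new text but absent from the old one, and vice versa.
	$added = array_diff($newLines, $oldLines);
	$removed = array_diff($oldLines, $newLines);

	$parts = [];
	if ($added !== []) {
		$parts[] = "➕\n" . implode("\n", $added);
	}
	if ($removed !== []) {
		$parts[] = "➖\n" . implode("\n", $removed);
	}
	// An empty string means no line was added or removed.
	return implode("\n\n", $parts);
}
```

Because `array_diff` compares values as a set, such a sketch would not notice reordered lines or a change in how many times an identical line occurs, which is consistent with the limitation mentioned later in the review thread.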

Alkarex (Member) commented Nov 2, 2025

> is it acceptable to hardcode Unicode emoji

Yes

Alkarex modified the milestones: 1.28.0, 1.29.0 Dec 16, 2025
nykula changed the title from "WIP: Track plain text file changes as feed kind" to "Track plain text file changes as feed kind, diffing against previous content" Jan 2, 2026
nykula marked this pull request as ready for review January 2, 2026 22:50
nykula (Contributor, Author) commented Jan 2, 2026

Hello, this patch has been working as expected for me for the last few weeks, so I rebased against edge, fixed all linter complaints and unmarked its draft status.

Alkarex (Member) commented Jan 2, 2026

Thanks, I will try to review shortly. This will target version 1.29.0 and not the upcoming minor 1.28.1 that focusses on bug fixes and which will be released ASAP.

> rebased against edge

Side note: I prefer to just merge edge (or just click the Update branch button, which does it well), because rebasing or any other form of git history rewriting breaks many things: comments, reviews, changes since last review, co-authored-by, permalinks, local branches, etc.

We Squash and merge so no need to manually squash multiple commits :-)

$sql = <<<SQL
SELECT id, guid, title, author, {$content}, link, date, `lastSeen`, `lastUserModified`, {$hash} AS hash, is_read, is_favorite, id_feed, tags, attributes
FROM `_entry` WHERE id_feed=:id_feed AND guid!=:guid
ORDER BY date DESC
Member

Suggested change
- ORDER BY date DESC
+ ORDER BY id DESC

We are probably interested in the latest received entry (date can potentially be anything).

Contributor Author

Yes, comparing against the latest received is the idea. I'll try changing that to id locally. Where can different dates come from? I saw custom timestamp parsing logic in loadHtmlXpath but in my branch there's currently no logic like that.

}
}
if ($diff !== '') {
$attributesByGuid[$item['guid']] = ['diff' => $diff];
Alkarex (Member) commented Jan 3, 2026

I have not considered it in depth, but instead of saving the diff in an attribute, could it be an idea to just use the normal article content and add <ins> and <del> tags, which can be further styled by CSS? During the next diff, the <del> tags can then just be stripped before the comparison.
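
One possible reading of this suggestion, sketched here under assumptions: before the next comparison, deleted passages are dropped and inserted passages are unwrapped, so that only the text of the most recent version remains. The helper name `stripDiffMarkup` is hypothetical and does not appear in the pull request, and the regular expressions assume the tags are not nested.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: recover the text of the latest version
 * from content annotated with <ins> and <del> tags.
 */
function stripDiffMarkup(string $html): string {
	// Drop deleted passages entirely, including the text they wrap.
	$withoutDeleted = preg_replace('~<del\b[^>]*>.*?</del>~is', '', $html) ?? $html;
	// Keep inserted text but remove the surrounding <ins> tags.
	return preg_replace('~</?ins\b[^>]*>~i', '', $withoutDeleted) ?? $withoutDeleted;
}
```

Removing a `<del>` block this way leaves behind whatever line break surrounded it, so a real implementation would likely also need to normalize whitespace, or use a DOM parser rather than regular expressions.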

Contributor Author

About keeping diffs in content, my case involves long files with just a couple lines being added or removed at a time, for which showing small diffs instead of the full content with highlights seemed a good match. I also used simple array_diff, no full-fledged diff algorithm, so the code doesn't currently know at which places new lines appear or disappear, making it difficult to figure out where to put <ins> and <del> (otherwise, I agree using semantic HTML for that is a good idea).

I did initially try to keep the content and diff together in the content column, but I got lost in things related to loadCompleteContent, withEnclosures, _content, DOMDocument etc and didn't come up with a version of code working reliably. (Maybe someone else reading this is more confident about implementing it that way?) Keeping just the original content in the content column, while treating the diff like an optional view-related thing stored in an attribute, on the other hand, was quite straightforward to implement and worked the first time.
