Track plain text file changes as feed kind, diffing against previous content

nykula · 2025-10-13T10:42:34Z

Note: Draft, seems to work but the files I need to watch don't update very often so I still have to test it more.

Goal: Notify about updates to translatable files in free software using different version control systems and hosting platforms (SourceForge SVN, Google Sheets). There are specialized tools that scrape websites using headless browsers, make screenshots etc, but my need is simpler and mostly matches what FreshRSS is already doing.

Ideally the tool would store a diff against the most recent existing item rather than the entire file contents, but I don't know how to implement that.

Changes proposed in this pull request:

add KIND_PLAIN_TEXT and TYPE_PLAIN_TEXT constants
implement loadPlainText function in Feed model, call it from feed controller
add text mimetype option to httpGet

How to test the feature manually:

Host a plain text file somewhere
Subscribe to a new feed using Plain text (entire file) type and file URL
Replace feed title and description with something short describing the file
Edit the file
Get a notification about the edit during next feeds update
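
The test steps above depend on noticing that the file changed between two feed updates. A later comment in this thread says the entry guid is a content hash, so one way to picture the mechanism is the minimal Python sketch below. It is an illustration, not the PR's PHP code: `make_item`, `is_new`, the dictionary shape and the choice of SHA-256 are all assumptions.

```python
import hashlib


def make_item(url: str, text: str) -> dict:
    """Wrap an entire text file as a single feed item (hypothetical shape)."""
    # Assumption: the guid is a hash of the content, so any edit yields a new guid.
    guid = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"guid": guid, "link": url, "content": text}


def is_new(item: dict, known_guids: set) -> bool:
    """An item counts as new only if its content hash was never seen before."""
    return item["guid"] not in known_guids


before = make_item("https://example.org/strings.txt", "one\n")
after = make_item("https://example.org/strings.txt", "one\ntwo\n")
print(is_new(after, {before["guid"]}))
```

With this scheme an unchanged file produces the same guid on every update and therefore no new entry, while any edit produces a new one.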

Pull request checklist:

clear commit messages
code manually tested
unit tests written (optional if too hard)
documentation updated

Additional information can be found in the documentation.

Alkarex · 2025-10-13T10:50:28Z

Hello and thanks for this PR.
Quick feedback already, and I will come back to it later:

It is not a new type of feed, but an attribute, similar to what we do with feeds requiring special cURL options, etc.
So instead of a new feed type, we could expose a checkbox when adding a feed, to allow wrong Mime types.
The information is already saved at the moment as an anchor #force_feed at the end of a URL when adding a feed, but updating to a checkbox (saved as an attribute) would probably be a nice improvement.

nykula · 2025-10-13T11:27:37Z

Hmm, I initially thought of reusing the existing HTML+Xpath scraping function and adding a checkbox to turn off DOM parsing, but that was a more invasive change and decoupling the two functions seemed to make sense to me. The idea is to create and expose the simplest RSS feed possible (using FreshRSS capabilities) from a non-RSS source, which isn't as much about the mime type, but about it being just a text, config or code file.

Frenzie · 2025-10-13T11:33:24Z

I suspect there's a misunderstanding here. XML feeds are sometimes sent out as text/plain.

But what if these plain text files are really, really big?

nykula · 2025-10-13T11:49:30Z

What would be an appropriate CURLOPT_MAXFILESIZE in that case, and would it help? The files I want to watch are indeed of varied size and not very small, some are just 5 KB, but the largest one so far is 400 KB.

Frenzie · 2025-10-13T11:53:34Z

My question was merely what would happen. :-) If the answer is unlimited I'd say limit it to whatever the code has decided elsewhere.
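
The size question can be made concrete with a cap enforced while reading the response body. As far as the libcurl documentation describes it, CURLOPT_MAXFILESIZE only takes effect when the size is known before the download starts, so a cap applied during reading covers the remaining cases. The following is a minimal Python sketch of that idea, not FreshRSS code: `read_capped`, `TooLargeError` and the 1 MB limit are hypothetical.

```python
import io


class TooLargeError(Exception):
    """Raised when a body exceeds the configured limit (hypothetical name)."""


def read_capped(stream, max_bytes: int, chunk_size: int = 8192) -> bytes:
    """Read a binary stream fully, aborting once more than max_bytes arrived."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise TooLargeError(f"body exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# A 400 KB body, like the largest file mentioned above, fits under a 1 MB cap.
body = read_capped(io.BytesIO(b"x" * 400_000), max_bytes=1_000_000)
print(len(body))
```

Reading in chunks means an oversized file is rejected after at most one chunk past the limit instead of being buffered in full.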

Alkarex · 2025-10-13T18:50:16Z

> I suspect there's a misunderstanding here. XML feeds are sometimes sent out as text/plain.

@Frenzie Indeed, that is what I thought it was about.

@nykula Would you have an example of use-case including example of corresponding text document?

nykula · 2025-10-14T16:35:31Z

Sure. I monitor the following files:

Faircamp en.rs at Codeberg. The host provides an RSS feed of all commits but not one of changes to a specific file, as far as I know.

Ghost Commander strings.xml at SourceForge. The host also seems to provide an RSS feed of all commits without more precision.

Helio translations at Google Sheets. The host offers no RSS feeds, so I monitor its CSV export.

(would like to, haven't figured out a setup yet) LineageOS at Crowdin. The host used to offer an RSS feed but removed it with a promise of a JSON API and didn't implement it.

I learned yesterday about a utility called urlwatch, which is probably a cleaner way to observe files than patching FreshRSS, but I rely on a shared server and would rather it didn't keep my SMTP credentials.

nykula · 2025-11-01T22:15:42Z

Added a naive diff using PHP array functions. It gets stored on the entry as a diff attribute, which feed views show instead of the full content when available. This addresses the concern about big files, and lets me see which new lines are there to translate without comparing the files manually.

The diff runs against the latest existing article in the same feed that doesn't have the same guid (content hash), for this I added a latestExceptGuid function to the EntryDAO model. If there's nothing yet to diff against, the full text is returned to the views.

Since the PHP standard library doesn't provide a more sophisticated text diffing algorithm like Python's difflib, I want to see the insertions first and the deletions later, so the diff in the resulting feed looks like this (is it acceptable to hardcode Unicode emoji in the model?):

➕
'New line for translation' => '',
'Another new line' => '',
'Etc' => '',

➖
'Removed line' => 'Translation that is now unused',
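
The output format shown above can be produced by a set-based line comparison: lines only in the new file come first, lines only in the old file second. Below is a minimal Python sketch of that approach, written as an illustration and not taken from the PR's PHP code; `naive_diff` is a hypothetical name.

```python
def naive_diff(old_text: str, new_text: str) -> str:
    """List lines present only in the new text, then lines present only in the old."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    old_set, new_set = set(old_lines), set(new_lines)
    # Membership tests only: position and repetition of lines are ignored.
    added = [line for line in new_lines if line not in old_set]
    removed = [line for line in old_lines if line not in new_set]
    parts = []
    if added:
        parts.append("➕\n" + "\n".join(added))
    if removed:
        parts.append("➖\n" + "\n".join(removed))
    return "\n\n".join(parts)


old = "'Kept' => 'x',\n'Removed line' => 'Translation that is now unused',"
new = "'Kept' => 'x',\n'New line for translation' => '',\n'Etc' => '',"
print(naive_diff(old, new))
```

Because the comparison looks only at membership, a line that merely moves within the file, or a duplicate that is added or dropped, does not show up in the result. For translation files where each key appears once, that limitation rarely matters.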

Alkarex · 2025-11-02T09:35:15Z

> is it acceptable to hardcode Unicode emoji

Yes

nykula · 2026-01-02T22:52:41Z

Hello, this patch has been working as expected for me for the last few weeks, so I rebased against edge, fixed all linter complaints and unmarked its draft status.

Alkarex · 2026-01-02T23:22:02Z

Thanks, I will try to review shortly. This will target version 1.29.0 and not the upcoming minor 1.28.1 that focusses on bug fixes and which will be released ASAP.

> rebased against edge

Side note: I prefer to just merge edge (or just click the Update branch button, which does it well), because rebasing or any other form of git history rewriting breaks many things: comments, reviews, changes since last review, co-authored-by, permalinks, local branches, etc.

We Squash and merge so no need to manually squash multiple commits :-)

Alkarex · 2026-01-03T22:47:13Z

app/Models/EntryDAO.php

+		$sql = <<<SQL
+SELECT id, guid, title, author, {$content}, link, date, `lastSeen`, `lastUserModified`, {$hash} AS hash, is_read, is_favorite, id_feed, tags, attributes
+FROM `_entry` WHERE id_feed=:id_feed AND guid!=:guid
+ORDER BY date DESC


Suggested change:

-ORDER BY date DESC
+ORDER BY id DESC

We are probably interested in the latest received (date can potentially be anything)

nykula · 2026-01-06

Yes, comparing against the latest received is the idea. I'll try changing that to id locally. Where can different dates come from? I saw custom timestamp parsing logic in loadHtmlXpath but in my branch there's currently no logic like that.
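
The difference between the two orderings is easy to see on a toy table. The following Python sketch uses an in-memory SQLite database and is only an illustration: the reduced `entry` table, the `LIMIT 1` clause and the assumption that ids grow in the order entries are received are not taken from the PR.

```python
import sqlite3


def latest_except_guid(conn, feed_id, guid, order_by="id"):
    """Return the most recent entry of a feed whose guid differs from `guid`."""
    # order_by is restricted to two known column names, never user input.
    if order_by not in ("id", "date"):
        raise ValueError("unsupported ordering")
    sql = (
        "SELECT id, guid, date FROM entry "
        "WHERE id_feed = :id_feed AND guid != :guid "
        f"ORDER BY {order_by} DESC LIMIT 1"
    )
    return conn.execute(sql, {"id_feed": feed_id, "guid": guid}).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry (id INTEGER PRIMARY KEY, guid TEXT, date INTEGER, id_feed INTEGER)"
)
conn.executemany(
    "INSERT INTO entry (id, guid, date, id_feed) VALUES (?, ?, ?, ?)",
    [
        (1, "aaa", 300, 7),  # received first, but carries the newest date
        (2, "bbb", 100, 7),  # received second, with an older date
        (3, "ccc", 200, 7),  # the version currently being processed
    ],
)

print(latest_except_guid(conn, 7, "ccc", order_by="id"))
print(latest_except_guid(conn, 7, "ccc", order_by="date"))
```

Ordering by id picks the entry received just before the current one, while ordering by date picks whichever entry happens to carry the largest date value, which need not be the previous version.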

Alkarex · 2026-01-03T23:06:59Z

app/Models/Feed.php

+			}
+		}
+		if ($diff !== '') {
+			$attributesByGuid[$item['guid']] = ['diff' => $diff];


I have not considered it in depth, but instead of saving the diff in an attribute, could it be an idea to just use the normal article content and add <ins> and <del> tags, which can be further styled by CSS? During the next diff, the <del> tags can just be stripped before the comparison.

https://developer.mozilla.org/docs/Web/HTML/Reference/Elements/ins

https://developer.mozilla.org/docs/Web/HTML/Reference/Elements/del

nykula · 2026-01-06

About keeping diffs in content, my case involves long files with just a couple lines being added or removed at a time, for which showing small diffs instead of the full content with highlights seemed a good match. I also used simple array_diff, no full-fledged diff algorithm, so the code doesn't currently know at which places new lines appear or disappear, making it difficult to figure out where to put <ins> and <del> (otherwise, I agree using semantic HTML for that is a good idea).

I did initially try to keep the content and diff together in the content column, but I got lost in things related to loadCompleteContent, withEnclosures, _content, DOMDocument etc and didn't come up with a version of code working reliably. (Maybe someone else reading this is more confident about implementing it that way?) Keeping just the original content in the content column, while treating the diff like an optional view-related thing stored in an attribute, on the other hand, was quite straightforward to implement and worked the first time.
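
For comparison, the <ins>/<del> idea discussed in this review thread needs a position-aware diff. The following Python sketch shows one way it could work, using the standard `difflib` module; it illustrates the suggestion and is not part of the PR, whose PHP code has no equivalent built in. `mark_changes` and `strip_marks` are hypothetical names.

```python
import difflib
import html


def mark_changes(old_text: str, new_text: str) -> str:
    """Return the new text line by line, with removed lines kept in <del>
    and added lines wrapped in <ins>."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            out += [f"<del>{html.escape(l)}</del>" for l in old_lines[i1:i2]]
        if tag in ("insert", "replace"):
            out += [f"<ins>{html.escape(l)}</ins>" for l in new_lines[j1:j2]]
        if tag == "equal":
            out += [html.escape(l) for l in new_lines[j1:j2]]
    return "\n".join(out)


def strip_marks(marked: str) -> str:
    """Recover the plain new text: drop <del> lines, unwrap <ins> lines."""
    lines = []
    for line in marked.split("\n"):
        if line.startswith("<del>") and line.endswith("</del>"):
            continue
        if line.startswith("<ins>") and line.endswith("</ins>"):
            line = line[len("<ins>"):-len("</ins>")]
        lines.append(html.unescape(line))
    return "\n".join(lines)


marked = mark_changes("a\nb\nc", "a\nc\nd")
print(marked)
print(strip_marks(marked))
```

Escaping every line before wrapping it means a file that literally contains `<del>` cannot be confused with the markup, so stripping the marks before the next comparison returns the stored text unchanged. The trade-off raised above still applies: the full file is kept in the content, with highlights, instead of a short list of changed lines.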

Alkarex added this to the 1.28.0 milestone Oct 13, 2025

Alkarex added the "RSS standard" label (the standard is defined f.e. on https://www.rssboard.org) Oct 13, 2025

Alkarex modified the milestones: 1.28.0, 1.29.0 Dec 16, 2025

Track plain text file changes as feed kind, diffing against previous content

95e448f

nykula force-pushed the edge branch from f01de4d to 95e448f Compare January 2, 2026 22:44

nykula changed the title from "WIP: Track plain text file changes as feed kind" to "Track plain text file changes as feed kind, diffing against previous content" Jan 2, 2026

nykula marked this pull request as ready for review January 2, 2026 22:50

Alkarex reviewed Jan 3, 2026
