Watermarking script #6017
On a whim, I vibe-coded this in about 30 minutes: https://aclanthology.org/watermark.html (Thanks to VS Code and GPT 5!) |
|
Haven't had too close of a look yet, but the first thing I noticed: why are page numbers below the footer? In the official proceedings, they're above the footer. |
The instructions say "Separate footer lines with the Enter key." but if I do that, a black square appears on the PDF. It only works correctly when entering "\n". |
|
Thanks—fixed. |
Pretty cool, though I remain skeptical about vibe coding. :)
Most of add_footer.py is quite arcane (mostly due to reportlab.pdfgen's API) and not really commented so it's hard – though possible with some effort – to follow what's going on. I mostly focused on the CGI script now to check for security concerns and the like.
A general comment: I have been wondering about throwing all of our scripts into bin/, which has become a mixture of (i) core build scripts, (ii) data ingestion & modification scripts, (iii) one-off scripts that are probably outdated by now, and (iv) other miscellaneous stuff. It’s quite unclear which of these scripts are still useful and what for, unless you look into each of them.
I was wondering if we could start categorizing them into subfolders, or at least name them more explicitly (e.g. here I would prefer add_footer_to_pdf.py).
This is definitely long overdue for a reorg. bin/ itself isn't that great of a name. One suggestion is to use scripts/ instead, and then have some kind of minimal one-level nesting within it, following your taxonomy above: build, data, misc.
| ).pages[0] | ||
| page.merge_page(pnum_cache[nkey]) | ||
|
|
||
| # Footer only on first page; place it ABOVE the fixed page number |
Just noting again for the record that this is not where *ACL proceedings currently place the footer.
Yeah... but the current ACL choice is ugly, and also (I suspect) just some random person's quick decision. Witness (from ACL 2025):
It's different even from ten years ago (source):
Maybe I shouldn't in turn just arbitrarily change it, but I think it looks better.
Agree that the second one looks better, but it still has the footer below the page number on the first page, which my (completely subjective) gut reaction finds more appealing :)
Regardless of subjective appeal, there is an argument though for making the footer of revisions consistent with the original.
The footstamp offset varies by conference and year. Our options are (a) come up with a good default, and ideally get ACL to consolidate on that or (b) provide more knobs in this user interface to allow users to fiddle and match the original. I guess we should do both.
Now I obviously haven't checked all conferences, but regarding the two examples you posted, it seems to me that the difference between them is not actually the placement of the footer, but the margins of the page content. In other words, I think the absolute placement of the footer may actually be the same there.
| python add_footer.py -p 199 in.pdf out.pdf "…" | ||
| python add_footer.py -p 199 --footer-size 9 --pagenum-size 10 --bottom-margin 14 in.pdf out.pdf "…" | ||
| Copyright 2025, Matt Post |
Not Apache license, like all the other scripts?
Just an oversight. This is mostly just a proof of concept.
| POST a multipart/form-data request with fields: | ||
| pdf (file, required) The input PDF | ||
| footer_text (text, optional) First-page footer block; use <i>…</i> for italics; newlines allowed | ||
| page_start (int, optional) Starting page number (>=1) |
Random thought: does it ever happen that people submit revisions with more pages than the original?
Yes, in fact, the page counts occasionally don't line up. I'm not sure what this should mean for revisions. I think this is why it will be important to add the revision stamp on the side.
| Depends on: reportlab, pypdf (already required by bin/add_footer.py) | ||
| """ | ||
|
|
||
| import cgi |
This module is deprecated and removed in Python 3.13. https://docs.python.org/3/library/cgi.html
(I haven’t looked into alternatives, but I noted that Python appears to be systematically removing CGI-related functions from its core libs, stating e.g. that "CGI has not been considered a good way to do things for well over a decade." https://docs.python.org/3/library/http.server.html#http.server.CGIHTTPRequestHandler)
| if method != 'POST': | ||
| http_error(405, 'Use POST with multipart/form-data.') |
Incorrect use of 405:
The origin server MUST generate an Allow header field in a 405 response containing a list of the target resource's currently supported methods.
— https://httpwg.org/specs/rfc9110.html#status.405
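A minimal sketch of a 405 response that carries the required Allow header, assuming the script writes CGI-style headers to stdout. The script's actual http_error helper is not shown in this thread, so format_http_error, its extra_headers parameter, and method_not_allowed below are hypothetical names used only for illustration.

```python
def format_http_error(status, message, extra_headers=None):
    """Build a CGI-style plain-text error response as a string (hypothetical helper)."""
    reasons = {
        400: "Bad Request",
        405: "Method Not Allowed",
        500: "Internal Server Error",
    }
    lines = [
        f"Status: {status} {reasons.get(status, 'Error')}",
        "Content-Type: text/plain; charset=utf-8",
    ]
    # Extra headers let a caller attach Allow (or anything else) to the response.
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line separates the headers from the body.
    return "\r\n".join(lines) + "\r\n\r\n" + message + "\n"


def method_not_allowed():
    """405 response listing POST as the only supported method."""
    return format_http_error(
        405, "Use POST with multipart/form-data.", {"Allow": "POST"}
    )
```

Returning the response as a string rather than printing it keeps the header logic easy to check in isolation; the CGI script would write the result to stdout and exit.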
| if length > MAX_BYTES: | ||
| http_error(400, f'File too large (> {MAX_BYTES//1024//1024}MB).') | ||
|
|
||
| form = cgi.FieldStorage() |
Since the cgi module is deprecated, a quick search suggests that multipart may be a suitable replacement here, though I haven't looked into this more closely.
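For comparison, here is a rough sketch of parsing a multipart/form-data body with only the standard library's email package, which needs no extra dependency. This is not what the PR does, and parse_multipart is a hypothetical helper; a real handler would also need to enforce the size limit before reading the body, and very large uploads may be better served by a streaming parser.

```python
from email.parser import BytesParser
from email.policy import HTTP


def parse_multipart(content_type, body):
    """Parse a multipart/form-data body into {field name: (filename, bytes)}.

    `content_type` is the request's Content-Type header value (including the
    boundary parameter); `body` is the raw request body as bytes.
    """
    # The email parser expects headers first, so prepend the Content-Type.
    prefix = (
        b"Content-Type: " + content_type.encode("ascii") + b"\r\n"
        b"MIME-Version: 1.0\r\n\r\n"
    )
    message = BytesParser(policy=HTTP).parsebytes(prefix + body)
    if not message.is_multipart():
        raise ValueError("expected multipart/form-data")
    fields = {}
    for part in message.iter_parts():
        name = part.get_param("name", header="content-disposition")
        if name is None:
            continue  # skip parts without a form field name
        # filename is None for plain text fields
        fields[name] = (part.get_filename(), part.get_payload(decode=True))
    return fields
```

In a CGI setting, content_type would come from the CONTENT_TYPE environment variable and body from reading CONTENT_LENGTH bytes off sys.stdin.buffer.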
| if 'pdf' not in form or not getattr(form['pdf'], 'file', None): | ||
| http_error(400, 'Missing PDF file.') | ||
| pdf_item = form['pdf'] | ||
| footer_text = form.getfirst('footer_text', '')[:10000] # cap length |
footer_text is never checked or sanitized, meaning that the corresponding argument to add_footer.py can contain arbitrary bytes/Unicode sequences. I wonder if funky input here could cause problems, but haven't checked this thoroughly.
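One possible shape for such a check, as a sketch only: normalize line endings and drop control characters before the text is handed to add_footer.py. The function name sanitize_footer_text is hypothetical, and whether stray carriage returns are what produced the black square reported earlier in this thread is a guess, not something verified here.

```python
import unicodedata

MAX_FOOTER_CHARS = 10000  # same cap as in the snippet above


def sanitize_footer_text(raw):
    """Normalize newlines and drop control/format characters (hypothetical helper)."""
    # Browsers typically submit textarea line breaks as CRLF; keep plain "\n" only.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Remove every character in a Unicode "C*" category (control, format,
    # surrogate, private use, unassigned), except the newline itself.
    cleaned = "".join(
        ch
        for ch in text
        if ch == "\n" or not unicodedata.category(ch).startswith("C")
    )
    return cleaned[:MAX_FOOTER_CHARS]
```

This deliberately leaves the <i>…</i> markup untouched, since how add_footer.py interprets other tags is not visible in this thread.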
| # Cleanup temp dir | ||
| try: | ||
| shutil.rmtree(tmp_dir) | ||
| except Exception: | ||
| pass | ||
|
|
If any error is raised before, this is never reached, potentially filling up the disk with temporary files. You probably want to wrap the entire main function in a try/except block (and give it the temporary directory as an argument).
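A sketch of one way to guarantee cleanup, using tempfile.TemporaryDirectory as a context manager instead of a manual rmtree at the end. The names process_upload and stamp are hypothetical; stamp stands in for whatever actually invokes add_footer.py.

```python
import tempfile
from pathlib import Path


def process_upload(pdf_bytes, stamp):
    """Write the upload to a scratch directory, run `stamp`, return the output bytes.

    `stamp` is any callable taking (input_path, output_path); it is a
    placeholder for the real call into add_footer.py.
    """
    # The directory is removed when the block exits, even if `stamp` raises
    # or the script bails out via sys.exit().
    with tempfile.TemporaryDirectory(prefix="watermark_") as tmp_dir:
        in_path = Path(tmp_dir) / "in.pdf"
        out_path = Path(tmp_dir) / "out.pdf"
        in_path.write_bytes(pdf_bytes)
        stamp(in_path, out_path)
        return out_path.read_bytes()
```

Reading the output into memory before leaving the block means the response can be written after the scratch directory is already gone.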
| add_footer = find_add_footer() | ||
| if not add_footer.exists(): | ||
| http_error(500, 'Server configuration error: add_footer.py not found.') |
I’d probably move this higher, before doing anything with the uploaded data at all.
|
Thanks for all the feedback, @mbollmann. Can you elaborate on reportlab? Is that module deprecated or something? It made it very easy to do PDF modifications. aclpub2 is using a big clunky Java package for this. |
| c.setFont(FONT_REG, size) | ||
| text = str(page_num) | ||
| tw = c.stringWidth(text, FONT_REG, size) | ||
| x = (w - tw) / 2.0 |
@mjpost I know nothing at all about reportlab, I was just commenting that its API seems to be very low-level, making it not very intuitive to follow what's going on in the code. For example, lines like
x = (w - tw) / 2.0
are not very descriptive. I can guess what this does, but it’s hard to review because of it. Maybe not super important either for a script like this though.
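As an illustration of the readability point, the centering arithmetic could be wrapped in a small named helper. This is only a sketch; centered_x and its parameter names are hypothetical and do not appear in the PR.

```python
def centered_x(page_width, text_width):
    """Return the x coordinate at which text of `text_width` is horizontally centered."""
    return (page_width - text_width) / 2.0


# Possible use inside the page-number code (w and tw as in the diff above):
#   x = centered_x(w, tw)
```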
| python-slugify>=2.0 | ||
| pytz | ||
| PyYAML>=3.0 | ||
| reportlab |
If reportlab is here, pypdf should be too.
For revised PDFs, we are currently unable to add the footer. This script enables that and has only Python dependencies. I used it to correct the PDF here: https://aclanthology.org/2025.acl-long.426/
This also updates the add_revision.py script to add a revision watermark.
Comments on format etc. are welcome.
Edit: There is now a web service: https://aclanthology.org/watermark.html