@mjpost
Member

mjpost commented Sep 18, 2025

For revised PDFs, we are currently unable to add the footer. This script enables that and has only Python dependencies. I used it to correct the PDF here: https://aclanthology.org/2025.acl-long.426/

This also updates the add_revision.py script to add a revision watermark:
Image

Comments on format etc. are welcome.

Edit: There is now a web service: https://aclanthology.org/watermark.html

@github-actions

Build successful. This preview will be removed when the branch is merged.

mjpost requested a review from nschneid on September 18, 2025, 16:45

@mjpost
Member Author

mjpost commented Sep 18, 2025

On a whim, I vibe-coded this in about 30 minutes: https://aclanthology.org/watermark.html

(Thanks to VS Code and GPT 5!)

@mbollmann
Member

Haven't had too close of a look yet, but the first thing I noticed: why are page numbers below the footer? In the official proceedings, they're above the footer.

@mbollmann
Member

> On a whim, I vibe-coded this in about 30 minutes: aclanthology.org/watermark.html

The instructions say "Separate footer lines with the Enter key." but if I do that, a black square appears on the PDF. It only works correctly when entering "\n".

@mjpost
Member Author

mjpost commented Sep 18, 2025

Thanks—fixed.

Member

@mbollmann left a comment

Pretty cool, though I remain skeptical about vibe coding. :)

Most of add_footer.py is quite arcane (mostly due to reportlab.pdfgen's API) and not really commented so it's hard – though possible with some effort – to follow what's going on. I mostly focused on the CGI script now to check for security concerns and the like.

Member

A general comment: I have been wondering about throwing all of our scripts into bin/, which has become a mixture of (i) core build scripts, (ii) data ingestion & modification scripts, (iii) one-off scripts that are probably outdated by now, and (iv) other miscellaneous stuff. It’s quite unclear which of these scripts are still useful and what for, unless you look into each of them.

I was wondering if we could start categorizing them into subfolders, or at least name them more explicitly (e.g. here I would prefer add_footer_to_pdf.py).

Member Author

This is definitely long overdue for a reorg. bin/ itself isn't that great of a name. One suggestion is to use scripts/ instead, and then have some kind of minimal one-level nesting within it, following your taxonomy above: build, data, misc.

).pages[0]
page.merge_page(pnum_cache[nkey])

# Footer only on first page; place it ABOVE the fixed page number
Member

Just noting again for the record that this is not where *ACL proceedings currently place the footer.

Member Author

Yeah....but the current ACL choice is ugly, and also (I suspect) just some random person's quick decision. Witness (from ACL 2025):

image

It's different even from ten years ago (source):

image

Maybe I shouldn't in turn just arbitrarily change it, but I think it looks better.

Member

Agree that the second one looks better, but it still has the footer below the page number on the first page, which my (completely subjective) gut reaction finds more appealing :)

Member

Regardless of subjective appeal, there is an argument though for making the footer of revisions consistent with the original.

Member Author

The footstamp offset varies by conference and year. Our options are (a) come up with a good default, and ideally get ACL to consolidate on that or (b) provide more knobs in this user interface to allow users to fiddle and match the original. I guess we should do both.

Member

Now I obviously haven't checked all conferences, but regarding the two examples you posted, it seems to me that the difference between them is not actually the placement of the footer, but the margins of the page content. In other words, I think the absolute placement of the footer may actually be the same there.

python add_footer.py -p 199 in.pdf out.pdf "…"
python add_footer.py -p 199 --footer-size 9 --pagenum-size 10 --bottom-margin 14 in.pdf out.pdf "…"
Copyright 2025, Matt Post
Member

Not Apache license, like all the other scripts?

Member Author

Just an oversight. This is mostly just a proof of concept.

POST a multipart/form-data request with fields:
pdf (file, required) The input PDF
footer_text (text, optional) First-page footer block; use <i>…</i> for italics; newlines allowed
page_start (int, optional) Starting page number (>=1)
Member

Random thought: does it ever happen that people submit revisions with more pages than the original?

Member Author

Yes, in fact, the page counts occasionally don't line up. I'm not sure what this should mean for revisions. I think this is why it will be important to add the revision stamp on the side.

Depends on: reportlab, pypdf (already required by bin/add_footer.py)
"""

import cgi
Member

This module has been deprecated since Python 3.11 and was removed in Python 3.13. https://docs.python.org/3/library/cgi.html

(I haven’t looked into alternatives, but I noted that Python appears to be systematically removing CGI-related functions from its core libs, stating e.g. that "CGI has not been considered a good way to do things for well over a decade." https://docs.python.org/3/library/http.server.html#http.server.CGIHTTPRequestHandler)

Comment on lines +59 to +60
if method != 'POST':
    http_error(405, 'Use POST with multipart/form-data.')
Member

Incorrect use of 405:

The origin server MUST generate an Allow header field in a 405 response containing a list of the target resource's currently supported methods.
https://httpwg.org/specs/rfc9110.html#status.405
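
For illustration, a 405 response that satisfies this requirement could be built roughly as follows. This is only a sketch: `build_error_response` and its `extra_headers` parameter are hypothetical names, not the signature of the script's `http_error` helper, whose body is not shown in this thread.

```python
from http import HTTPStatus


def build_error_response(status: int, message: str, extra_headers=None) -> str:
    """Return a CGI-style plain-text error response: headers, blank line, body.

    Hypothetical helper for illustration; not the PR's http_error().
    """
    reason = HTTPStatus(status).phrase
    lines = [
        f"Status: {status} {reason}",
        "Content-Type: text/plain; charset=utf-8",
    ]
    # Extra headers such as Allow are appended after the standard ones.
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    return "\r\n".join(lines) + "\r\n\r\n" + message + "\n"


# A 405 response that lists the only supported method.
response = build_error_response(
    405, "Use POST with multipart/form-data.", {"Allow": "POST"}
)
```

Since the endpoint only accepts POST, the Allow value here is a single method.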

if length > MAX_BYTES:
    http_error(400, f'File too large (> {MAX_BYTES//1024//1024}MB).')

form = cgi.FieldStorage()
Member

Since the cgi module is deprecated, a quick search suggests that multipart may be a suitable replacement here, though I haven't looked into this more closely.
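
A dependency-free alternative, sketched here as an assumption rather than a tested replacement, is the standard library's email package, which can parse a multipart/form-data body once the request's Content-Type header is prepended to it. `parse_multipart` is a hypothetical helper, not part of this PR and not the API of the multipart package, and it has not been checked against large or unusual binary uploads.

```python
from email.parser import BytesParser
from email.policy import default


def parse_multipart(body: bytes, content_type: str) -> dict:
    """Parse a multipart/form-data body into {field: (filename, payload_bytes)}.

    Hypothetical helper for illustration only.
    """
    # The email parser needs the Content-Type header (which carries the
    # boundary) in front of the raw body to recognise the individual parts.
    header = f"Content-Type: {content_type}\r\n\r\n".encode("ascii")
    message = BytesParser(policy=default).parsebytes(header + body)
    if not message.is_multipart():
        raise ValueError("expected a multipart/form-data body")

    fields = {}
    for part in message.iter_parts():
        name = part.get_param("name", header="content-disposition")
        if name is None:
            continue  # skip parts that are not named form fields
        # filename is None for plain text fields such as footer_text
        fields[name] = (part.get_filename(), part.get_payload(decode=True))
    return fields


# Small hand-built request body used to exercise the helper.
example_body = (
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="footer_text"\r\n\r\n'
    b"hello\r\n"
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="pdf"; filename="in.pdf"\r\n'
    b"Content-Type: application/pdf\r\n\r\n"
    b"%PDF-1.4\r\n"
    b"--XYZ--\r\n"
)
fields = parse_multipart(example_body, "multipart/form-data; boundary=XYZ")
```

In a CGI script the body would come from `sys.stdin.buffer` and the content type from the `CONTENT_TYPE` environment variable.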

if 'pdf' not in form or not getattr(form['pdf'], 'file', None):
    http_error(400, 'Missing PDF file.')
pdf_item = form['pdf']
footer_text = form.getfirst('footer_text', '')[:10000]  # cap length
Member

@mbollmann, Sep 19, 2025

footer_text is never checked or sanitized, meaning that the corresponding argument to add_footer.py can contain arbitrary bytes/Unicode sequences. I wonder if funky input here could cause problems, but haven't checked this thoroughly.
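
One possible mitigation, sketched under the assumption that the footer only needs printable characters and line breaks, is to normalize line endings and drop every character in a Unicode "Other" category before handing the text on. `sanitize_footer_text` is a hypothetical name, not a function in this PR.

```python
import unicodedata


def sanitize_footer_text(raw: str, max_len: int = 10000) -> str:
    """Normalize newlines and drop control, format, surrogate, private-use,
    and unassigned characters, keeping only "\n" among the control characters.

    Hypothetical helper for illustration only.
    """
    # Browsers usually submit textarea line breaks as CRLF.
    normalized = raw.replace("\r\n", "\n").replace("\r", "\n")
    kept = [
        ch
        for ch in normalized
        # Categories Cc, Cf, Cs, Co and Cn all start with "C".
        if ch == "\n" or not unicodedata.category(ch).startswith("C")
    ]
    return "".join(kept)[:max_len]


cleaned = sanitize_footer_text("a\x00b\r\nc\u200b")
```

Dropping format characters (category Cf) also removes joiners that some scripts rely on, so a real filter might need to be less strict.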

Comment on lines +159 to +164
# Cleanup temp dir
try:
    shutil.rmtree(tmp_dir)
except Exception:
    pass

Member

If any error is raised before, this is never reached, potentially filling up the disk with temporary files. You probably want to wrap the entire main function in a try/except block (and give it the temporary directory as an argument).
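
As a sketch of that idea, `tempfile.TemporaryDirectory` used as a context manager removes the directory however the block exits, including on exceptions. `run_in_temp_dir` and the `work` callback are hypothetical names standing in for the script's main logic.

```python
import tempfile
from pathlib import Path


def run_in_temp_dir(work):
    """Call work(tmp_dir) and delete the directory afterwards, even on error.

    Hypothetical wrapper for illustration only.
    """
    with tempfile.TemporaryDirectory(prefix="add_footer_") as tmp_dir:
        # Everything the request writes should live under tmp_dir.
        return work(Path(tmp_dir))


def failing_work(tmp_dir: Path):
    """Stand-in for request handling that fails after writing a file."""
    (tmp_dir / "in.pdf").write_bytes(b"%PDF-1.4")
    raise RuntimeError("simulated failure")


seen_dirs = []


def recording_work(tmp_dir: Path):
    seen_dirs.append(tmp_dir)
    return failing_work(tmp_dir)


try:
    run_in_temp_dir(recording_work)
except RuntimeError:
    pass  # the directory has already been removed at this point
```

If `http_error` ends the request by raising `SystemExit` (an assumption, since its body is not shown here), the cleanup would still run, because the context manager also handles that exception.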

Comment on lines +114 to +116
add_footer = find_add_footer()
if not add_footer.exists():
    http_error(500, 'Server configuration error: add_footer.py not found.')
Member

I’d probably move this higher, before doing anything with the uploaded data at all.

@mjpost
Member Author

mjpost commented Sep 19, 2025

Thanks for all the feedback, @mbollmann. Can you elaborate on reportlab? Is that module deprecated or something? It made it very easy to do PDF modifications. aclpub2 is using a big clunky Java package for this.

c.setFont(FONT_REG, size)
text = str(page_num)
tw = c.stringWidth(text, FONT_REG, size)
x = (w - tw) / 2.0
Member

@mjpost I know nothing at all about reportlab, I was just commenting that its API seems to be very low-level, making it not very intuitive to follow what's going on in the code. For example, lines like

    x = (w - tw) / 2.0

are not very descriptive. I can guess what this does, but it’s hard to review because of it. Maybe not super important for a script like this, though.
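
For example, the same arithmetic reads more easily behind a small named helper. This is a sketch with hypothetical names (`centered_x`, `page_width`, `text_width`), assuming `w` is the page width and `tw` the rendered text width, as the quoted snippet suggests.

```python
def centered_x(page_width: float, text_width: float) -> float:
    """Left x-coordinate at which text of text_width is horizontally centred."""
    free_space = page_width - text_width
    # Half of the unused width goes to the left of the text.
    return free_space / 2.0


# US Letter is 612 points wide; a 12-point-wide page number starts at x = 300.
x = centered_x(612.0, 12.0)
```

reportlab's canvas also provides `drawCentredString(x, y, text)`, which takes the centre point directly and might remove the need for the manual calculation.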

python-slugify>=2.0
pytz
PyYAML>=3.0
reportlab
Member

If reportlab is here, pypdf should be as well.
