Suggestion: Track comparison test images via git-lfs #13068

Closed
smheidrich opened this issue Dec 31, 2018 · 4 comments

Comments

@smheidrich
Contributor

I was told in #10748 (comment) that there shouldn't be too many image comparison tests, and I assume (though I didn't ask) that this is primarily because the repository size would grow too large, in particular as these images have to be replaced whenever significant changes to the code are made. Are there other important reasons that I'm missing?

Anyway, I wanted to ask: has there ever been discussion about tracking images for image comparison tests via git-lfs? It would only store a hash of each image in the repository, while the actual file contents are hosted separately, so that e.g. cloning the repo can be sped up by only downloading a few recent image versions (of course, it makes absolutely no difference if you really need to have the entire history of all images locally).
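To make the "only store a hash" part concrete, here is a minimal sketch of the small text pointer that git-lfs commits in place of a tracked file. The three-line layout (`version`, `oid sha256:...`, `size`) follows the Git LFS pointer format; the helper name `lfs_pointer` and the sample bytes are hypothetical and only serve as illustration.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file content.

    Hypothetical helper for illustration; git-lfs writes these itself.
    """
    oid = hashlib.sha256(content).hexdigest()  # content hash identifies the blob
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"  # size of the real file in bytes
    )


# Stand-in bytes for a baseline image (not a real PNG).
sample_image = b"\x89PNG placeholder image data"
print(lfs_pointer(sample_image))
```

The pointer stays roughly the same size no matter how large the image is, which is why replacing an image would add only a few lines of text to the repository history.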

Thoughts? Couldn't find any existing issues mentioning LFS, sorry if this has already been discussed elsewhere.

@smheidrich
Contributor Author

I only now found out how terrible the quota and bandwidth limits are. Never mind then, this wouldn't work for matplotlib.

@timhoffm
Member

timhoffm commented Jan 1, 2019

How much quota and bandwidth would we need? One can purchase additional data packs. If we decide that git LFS would really help us, I assume we can probably get this funded.

I haven't used git LFS myself, so I cannot contribute much to the discussion.

@smheidrich
Contributor Author

smheidrich commented Jan 2, 2019

@timhoffm The two baseline_images folders together are about 45 MB at the current revision. Assuming that all those images would eventually be replaced by ones tracked via LFS, and that almost everyone would only pull their latest versions when cloning (which is git-lfs's default behavior), we have to sum up:

  • Traffic consumed by Travis builds:
    • This of course depends on whether Travis can cache the repo or needs to clone it again on every build. Does anyone know if Travis does this for regular git repos without LFS?
    • This seems to indicate that Travis doesn't cache LFS content by default.
      • There is a proposed workaround in the issue comments, which I guess should work? Not sure.
    • Still, assuming no caching whatsoever, and with matplotlib's average of around 4 builds per day (rough estimate from looking at https://travis-ci.org/matplotlib/matplotlib/builds), this would use up at least 4 × 30 × 45 MB = 5.4 GB per month.
  • Regular users / contributors who clone the repo with default settings (i.e. only most recent LFS content revisions):
  • Contributors who clone more than just the most recent LFS content revisions, e.g. because they need the complete history (not just hashes) of all test images. No idea how to estimate this, but this could easily grow to the same order of magnitude (or beyond) as the other numbers even if only very few users do this, as it would be at least an order of magnitude more data per clone¹.
  • I think people pushing and pulling changed images to/from the repo while contributing can be neglected in comparison, but I may be wrong.
  • Anything else I forgot or estimated incorrectly? Let me know!

So we're looking at a lower bound of ~ 13 GB per month, assuming the size of all test images at a given revision together stays the same (45 MB), that the repo doesn't become much more popular or active, etc.
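The arithmetic above can be sketched as follows. This is only a back-of-the-envelope check using the figures quoted in this comment (45 MB of images, about 4 builds per day, a ~13 GB lower bound). The number of non-CI clones it prints is merely what those two totals imply together, not a figure given anywhere in the thread, and the function name is hypothetical.

```python
# Figures taken from the comment above; decimal units (1 GB = 1000 MB).
BASELINE_IMAGES_MB = 45   # both baseline_images folders at the current revision
BUILDS_PER_DAY = 4        # rough Travis average
DAYS_PER_MONTH = 30
MB_PER_GB = 1000
LOWER_BOUND_GB = 13       # overall lower bound stated above


def monthly_traffic_gb(downloads: int, size_mb: float = BASELINE_IMAGES_MB) -> float:
    """Traffic in GB if every download fetches the full current image set."""
    return downloads * size_mb / MB_PER_GB


# Travis with no caching at all: one full download per build.
travis_gb = monthly_traffic_gb(BUILDS_PER_DAY * DAYS_PER_MONTH)

# Whatever is left of the lower bound must come from non-CI clones.
remaining_gb = LOWER_BOUND_GB - travis_gb
implied_clones = remaining_gb * MB_PER_GB / BASELINE_IMAGES_MB

print(f"Travis, no caching: {travis_gb:.1f} GB/month")
print(f"Left for other clones: {remaining_gb:.1f} GB/month "
      f"(about {implied_clones:.0f} default clones)")
```

Changing `BASELINE_IMAGES_MB` or `BUILDS_PER_DAY` shows how quickly the total moves if the image set grows or CI activity increases.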

But I'm really not sure about this whole idea anymore: even if you buy more bandwidth, anyone can disrupt the whole project by pulling enough LFS content anonymously, no account or anything required. This could happen even without malicious intent, if enough people want to get the full history of all images, as mentioned above.


¹ Estimated from the whole matplotlib repo with all of its history being ~ 600 MB, though to be honest I don't know what fraction of that actually goes into test images. This script yields absolute directory history sizes that cannot be right, because they're smaller than the directories at just the most recent revision, but relatively speaking the test image folders come out near the top, so I should think they make up a significant portion of the total repository size.

@timhoffm
Member

timhoffm commented Jan 4, 2019

@smheidrich Thanks for the detailed discussion! For reference, a git LFS data pack with 50 GB of bandwidth costs $5 per month (and can be bought multiple times, i.e. 100 GB for $10 etc.).
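As a rough illustration of that pricing, the sketch below turns a monthly bandwidth estimate into a number of data packs and a cost. It assumes only the $5 per 50 GB figure quoted here; the function names and the `free_gb` parameter (for any bandwidth already included in the account, defaulting to none) are hypothetical.

```python
import math

PACK_BANDWIDTH_GB = 50  # bandwidth per data pack, as quoted above
PACK_PRICE_USD = 5      # price per pack per month, as quoted above


def packs_needed(monthly_gb: float, free_gb: float = 0.0) -> int:
    """Whole data packs required to cover the given monthly bandwidth."""
    billable = max(0.0, monthly_gb - free_gb)  # free_gb is an assumed allowance
    return math.ceil(billable / PACK_BANDWIDTH_GB)


def monthly_cost_usd(monthly_gb: float, free_gb: float = 0.0) -> int:
    """Monthly cost in USD, since packs can only be bought whole."""
    return packs_needed(monthly_gb, free_gb) * PACK_PRICE_USD


# The ~13 GB/month lower bound from the earlier comment fits in one pack.
print(monthly_cost_usd(13))
```

Because packs come in whole units, the cost moves in $5 steps: anything up to 50 GB per month costs the same as the 13 GB lower bound.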

I cannot comment on the usefulness of git LFS for our case. Leaving this to the other devs.
