gh-51067: Add `remove()` and `repack()` to `ZipFile` #134627
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
It would probably be better to raise an AttributeError instead of a ValueError here, since you are trying to access an attribute that a closed zipfile doesn't have
This behavior simply resembles
|
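For reference, a minimal sketch of how the existing `ZipFile` methods already behave on a closed archive: both reading and writing raise `ValueError`. It uses only the current standard library, not the methods proposed in this PR, and the member names are arbitrary.

```python
import io
import zipfile

# Build a small in-memory archive, then close it.
buf = io.BytesIO()
zf = zipfile.ZipFile(buf, "w")
zf.writestr("a.txt", b"hello")
zf.close()

# Record which exception type each operation raises on the closed archive.
errors = {}
operations = (
    ("read", lambda: zf.read("a.txt")),
    ("writestr", lambda: zf.writestr("b.txt", b"x")),
)
for name, call in operations:
    try:
        call()
    except ValueError as exc:
        errors[name] = type(exc).__name__

print(errors)
```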
Nicely inform @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although care must be taken, as there might be a potential risk in the space-reclaiming algorithm. The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted. |
- Separate individual validation tests. - Check underlying repacker not called in validation. - Use `unlink` to prevent FileNotFoundError. - Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking. - Move the protection logic to `ZipFile.repack()`.
@emmatyping Still pending review... I'd like to know whether there are still any problems with this PR. |
I think it’s fine
Unfortunately it's premature to move forward with this PR until the discussion in the issue is resolved. There are open questions about whether we should add this API or not, and if we do what it would look like. That needs to be worked out in the issue before this PR can move forward, sorry! |
- NotImplementedError is a subclass of RuntimeError.
Quickly released as I'm tired of doing all the migration jobs and dealing with all the compatibility crabs over and over again. This work will be migrated to that package and updating of this branch will be halted, until folks are sincerely going to merge into the standard library. |
This sounds like another feature, but it does sound useful. Perhaps it should have its own issue (if it doesn't already). I'd say let's limit the scope here to addressing the linked issue. |
I don't love that there's two implementations here. My reading from the code is that unsigned descriptors are deprecated, so maybe it would be enough to simply omit support for them, opt for the faster implementation, and (maybe) warn if such a descriptor is encountered that it's unsupported. How prevalent are such descriptors? |
Actually there are three implementations. 😂 Therefore the brief algorithm/condition for the slow scan is:
It should be quite unlikely to happen even if strict_descriptor==False, but the problem is still that it may be catastrophically slow once it happens, and could be used intentionally and offensively. The prevalence of the deprecated unsigned data descriptor is hard to tell. Most apps like WinZip, WinRAR, 7z would probably write no data descriptor since they are generally not used in a streaming condition. For streaming cases I think most developers would simply use the ZIP implementation at their hand, and to answer this question we'd have to check the ZIP implementation of popular programming languages like C, Java, JS, php, C#, etc. Since this feature has been already implemented, I'd prefer to keep it unless there's a strong reason that it should not exist. Defaulting strict_descriptor to False should be enough if it's not welcome. A minor reason is that we still cannot clearly say that unsigned data descriptor is not supported even if the slow scan is removed, due to the decompression approach. But maybe another choice is that the decompression approach should also be skipped when strict_descriptor==True. |
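As background for the streaming case mentioned above, here is a small sketch of what the current `zipfile` module does when it writes to an unseekable stream: it sets general purpose flag bit 3 and appends a data descriptor after the entry data. `WriteOnlyStream` is a hypothetical stand-in for a pipe or socket, not part of this PR.

```python
import io
import zipfile


class WriteOnlyStream:
    """Hypothetical stand-in for a pipe or socket: no seek() or tell()."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)

    def flush(self):
        pass


stream = WriteOnlyStream()
with zipfile.ZipFile(stream, "w") as zf:
    zf.writestr("a.txt", b"hello")

raw = b"".join(stream.chunks)

# Re-open the bytes from a seekable buffer to inspect what was written.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    info = zf.getinfo("a.txt")
    # Bit 3 (0x08) means sizes and CRC follow the data in a data descriptor.
    uses_data_descriptor = bool(info.flag_bits & 0x08)
    content = zf.read("a.txt")

# The signed form of the data descriptor starts with PK\x07\x08.
has_signed_descriptor = b"PK\x07\x08" in raw

print(uses_data_descriptor, has_signed_descriptor)
```

This only shows what Python itself writes; it says nothing about how common unsigned descriptors are in archives produced by other tools.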
Reworked |
@danny0838 in reply to #51067 (comment) (sorry about fragmenting the discussion) - I'm generally supportive of the iterative approach of start simple and expand from that. Having said that, since you already implemented _remove_members, I don't see why not expose it through a public API, given that ZipFile has other examples for functions that receive a list of members (members arg of extractall) - can simply follow the same pattern for consistency. I'm not fond of the idea of a 2 step removal (remove/repack), since it can leave the file in a state that some may see (me included) as corruption - zip files aren't supposed to have a CD entry without a corresponding LF entry. Also some of the questions (how to handle a folder) are relevant regardless of whether we take the 1 step or 2 step approach. And last but not least - perfect is the enemy of good - it's not the last Python version, and we can always improve as long as the initial version is reasonable, which I believe it is. |
I don't think There are still design considerations even for your proposed members, for example:
The current implementation of
I don't get this.
A delayed repacking isn't necessarily done after the archive is closed, it can also happen simply after the method is called and returned. For example, a repacking after a mixed operations of writing, removing, copying, renaming, and just before archive closing, such as an interactive ZIP file editing tool would do. It's probably challenging enough to design a one-time
I don't think there's too much need to dig into such details for low level APIs—just let them work simply like other existing methods. It's not the case for a one-time high level |
@danny0838 I can share my own opinions about the desired behavior but of course not everyone would agree:
I don't see the need to separate into read/write, as I believe they should all be consistent.
I'd follow extract in this case.
I might be wrong here, but I think
Just like in
Ok, fair enough - it just seems less desirable/clean to me, but I might be a minority here.
Yes of course, but it might also be called after the file is closed. I can see some performance benefits, but it would seem like something that a low level API would provide, as opposed to a high level one.
I'm fine with a "naive" API as well, documentation is enough for these cases at this point IMO. |
The truth is that they are not consistent. When you say they should be consistent, do you mean there should be multiple file version for
Actually there is no
In Likewise, for Anyway, the current |
What I'm saying is that there's already precedent in the package for a function that has 2 versions, so it's not out of the ordinary to add another one.
Yup, I mean
The behavior of
Yeah I agree, and since multiple files with the same names are not very common, I'd opt for treating
My implementation is based on the original PR, so not much there in regards to these options. In any case I'd proceed with your approach for the time being. |
This is a revised version of PR #103033, implementing two new methods in `zipfile.ZipFile`: `remove()` and `repack()`, as suggested in this comment.
Features
`ZipFile.remove(zinfo_or_arcname)`
- Removes a file entry (given as a `str` path or `ZipInfo`) from the central directory.
- … `str` path is provided.
- … `ZipInfo` instance.
- Supported modes: `'a'`, `'w'`, `'x'`.
`ZipFile.repack(removed=None)`
- … `removed` is passed (as a sequence of removed `ZipInfo`s), only their corresponding local file entry data are removed.
- Supported mode: `'a'`.
Rationales
Heuristics Used in `repack()`
Since `repack()` does not immediately clean up removed entries at the time a `remove()` is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes are really previously removed files and safe to remove.
While local file entries begin with the magic signature `PK\x03\x04`, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.
To safely reclaim space, `repack()` assumes that in a normal ZIP file, local file entries are stored consecutively:
- … `BadZipFile` error is raised and no changes are made.
Check the doc in the source code of `_ZipRepacker.repack()` (which is internally called by `ZipFile.repack()`) for more details.
Supported Modes
There have been opinions that repacking should support modes `'w'` and `'x'` (e.g. #51067 (comment)).
This is NOT introduced since such modes do not truncate the file at the end of writing, and so would not actually shrink the file after a removal has been made. Although we could change the behavior of the existing API, further care would be needed because modes `'w'` and `'x'` may be used on an unseekable file and would be broken by such a change. OTOH, mode `'a'` is not expected to work with an unseekable file since an initial seek is made immediately when it is opened.
📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/
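As a small illustration of why the `PK\x03\x04` signature alone is not a reliable indicator (the point made in the heuristics rationale), the sketch below stores a payload that happens to contain the signature and compares a naive byte scan against the central directory. It uses only the existing standard library; the member names and payload are made up for the example.

```python
import io
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"

# A payload that happens to contain the local file header signature.
payload = b"prefix " + LOCAL_HEADER_SIG + b" suffix"

buf = io.BytesIO()
# ZIP_STORED keeps the payload bytes verbatim inside the archive.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("a.bin", payload)
    zf.writestr("b.txt", b"plain text")

raw = buf.getvalue()

# A naive scan counts every occurrence of the signature in the raw bytes.
signature_hits = raw.count(LOCAL_HEADER_SIG)

# The central directory lists only the real entries.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    real_entries = len(zf.infolist())

print(signature_hits, real_entries)
```

The scan reports more hits than there are entries, which is why a signature match by itself cannot be treated as proof that a stale local file entry starts there.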