fix: filter duplicates from previous page during pagination #2989
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2989      +/-   ##
==========================================
- Coverage   96.61%   96.58%   -0.03%
==========================================
  Files          95       95
  Lines        6080     6088       +8
==========================================
+ Hits         5874     5880       +6
- Misses        206      208       +2
Flags with carried forward coverage won't be shown.
Thanks for working on this @ericfrederich; if this is important to users we can add it to the client.
However, I'd add this as a feat rather than a fix, as I think it's expected that offset pagination is not reliable, so we're providing an enhancement for getting around GitLab's limitation here, not fixing a bug in python-gitlab.
I just have a few additional questions 🙇
gitlab/client.py (Outdated)

@@ -1174,6 +1174,8 @@ def __init__(
        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._prev_page_objects = []
Are we sure it's enough to just compare against the previous page? Or is there a chance that an item from one of the earlier pages is repeated (e.g. by using a certain sort parameter that ends up returning an old item again)?
This duplication occurs when sorting is by ID and new objects are created, because the new objects land on pages you have already retrieved and the pagination requests have no context (i.e. no dedicated database cursor for that query). It's like looking at page 2 of something in the UI, creating a new object in another tab, then going back and refreshing page 2: everything has shifted and items from page 1 (which you already retrieved) are now on page 2.
I believe this duplication is independent of sort order, but it is more pronounced with the default ID sorting because any new object will result in duplication.
Let's think about other kinds of sorting, for example alphabetical. In this case, new objects that would appear on pages after your current page would not result in duplication, but objects that would appear on earlier pages would. Say you're listing projects, you've got all the A's and B's, and you're in the middle of the C's. If a new project called Zoo is created, you will not see duplication; but if a new project called Apples is created, the shift happens and you'll get a duplicate of a project starting with C that you already got on the previous page.
I do not believe it's necessary to store any extra items other than the ones from the previous page.
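To make the shift concrete, here is a tiny, self-contained simulation of offset pagination (plain Python, no GitLab calls; the numbers are made up for illustration):

# Standalone illustration of the shift: items are sorted newest-first
# (e.g. a descending sort) and a new object is created between the
# page-1 and page-2 requests.
items = [5, 4, 3, 2, 1]                 # object ids, newest first
per_page = 2

page_1 = items[0:per_page]              # [5, 4]

items.insert(0, 6)                      # a new object is created elsewhere

page_2 = items[per_page:2 * per_page]   # [4, 3] -- id 4 was already on page 1

assert 4 in page_1 and 4 in page_2      # this is the duplicate being filtered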
I almost hate to mention it, but there may be a case where the previous page is not enough, and it doesn't have to do with other sort parameters.
If, between pages, more items are created on earlier pages than the per_page number, then duplicates could still occur.
How rare this is depends heavily on how the user code is implemented.
# This code has a higher possibility of 100 projects being created between pagination calls
for project in gl.projects.list(iterator=True):
    something_that_takes_a_long_time(project)

# This code has a lower possibility of 100 projects being created between pagination calls
projects = gl.projects.list(get_all=True)
for project in projects:
    something_that_takes_a_long_time(project)
If we're worried about OutOfMemory errors as @max-wittig mentioned, the entire thing could be refactored to work off sets of integers (object IDs) instead of lists of objects.
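For illustration, a minimal sketch of the set-of-integers idea (the method name _filter_duplicates is invented here; it only assumes a _retrieved_object_ids set initialised in __init__):

# Sketch only: track the integer ids already yielded, so memory use is bounded
# by one int per object rather than one dict per object.
def _filter_duplicates(self, data):
    fresh = [item for item in data if item["id"] not in self._retrieved_object_ids]
    self._retrieved_object_ids.update(item["id"] for item in fresh)
    return fresh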
gitlab/client.py (Outdated)

        except Exception as e:
            raise gitlab.exceptions.GitlabParsingError(
                error_message="Failed to parse the server message"
            ) from e

        self._data = []
        for item in data:
I think this should not be the default behavior, but conditional on an argument we can supply, for example something like an if remove_duplicates: check.
If this were a feature offered as an option, I have a hard time thinking anyone would opt to have the duplicates returned.
gitlab/client.py (Outdated)

        except Exception as e:
            raise gitlab.exceptions.GitlabParsingError(
                error_message="Failed to parse the server message"
            ) from e

        self._data = []
        for item in data:
Are we manually looping over items and adding them one by one for performance reasons here? Or could we just do list(OrderedDict.fromkeys(data)) and emit a single warning with the difference between the initial data and the filtered data?
We could remove the loop and make the code more concise. As far as complexity goes, doing a set difference still has a loop; it's just abstracted away and done in C instead of Python. I imagine the performance difference (the loop is at most 100 items) is negligible compared to the network API call.
I will change the code though and update this PR.
Actually, it looks like we can't use set() or OrderedDict on these, as at this point the objects are still just unhashable dicts.
I like option 2 as it's more concise. Also, perhaps CPython has optimized list comprehensions, whereas option 1 is a generic for loop.
If we only keep a single page's previous objects we can choose between these two options. If we decide to keep all objects, we should instead use a set() of ints. So maybe let's decide that first.
option 1
One loop; calculate dupes and filter within a single loop
duplicates = []
self._data = []
for item in data:
    if item in self._prev_page_objects:
        duplicates.append(item)
        continue
    self._data.append(item)
option 2
Two loops; one to calculate dupes, another to do the filtering.
# loop once to calculate dupes
duplicates = [o for o in data if o in self._prev_page_objects]
# loop again to remove the dupes
self._data = [o for o in data if o not in duplicates]
single warning
In both cases we can defer the warning until all duplicates are detected and emit a single warning.
if duplicates:
    utils.warn(
        message=(
            f"During pagination duplicate object with id(s) "
            f"{[item['id'] for item in duplicates]} returned from Gitlab and filtered"
        ),
        category=UserWarning,
    )
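(For reference, a sketch of why set() / OrderedDict can't be used directly here and how keying on the id would sidestep it; illustrative only, not what this PR does:)

# set(data) or OrderedDict.fromkeys(data) raise TypeError: unhashable type: 'dict',
# because the items are plain dicts at this point. Keying on the id works instead:
prev_ids = {item["id"] for item in self._prev_page_objects}
duplicates = [item for item in data if item["id"] in prev_ids]
self._data = [item for item in data if item["id"] not in prev_ids]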
Couldn't this one cause OutOfMemory errors?
I think this is why only the previous page is stored in this implementation (so max 100 entries). Not sure if this 100% ensures deduplication (see my comment above), but it's probably not as bad.
Note: I've also made remarks above on possibly changing it to use a set of integers instead of objects, which should help immensely with memory consumption. I just wanted to mention here that it would not be forever, just during pagination. The generator holding a reference to the
Sorry for ignoring this for 2 weeks; I was watching for activity on the issue and wasn't checking this PR. I see 3 open discussion points which need decisions (or just 2 decisions, depending on the results of the others).
I'll refrain from changing code until we have consensus on those decision points. I just wanted to apologize, summarize, and communicate that I'm currently waiting for decisions. We can use the dedicated conversations above for each of the points.
cd22af5 to 99cee2a: Updated the PR branch.
Thanks again, and sorry for the delay @ericfrederich.
I have some more concerns; see my comments. It seems a bit tricky to get this right in a generic way, due to the nature of what gets returned in paginated endpoints.
If feasible, it would be great if we could test this somehow, based on the examples you outlined in the issue.
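For what it's worth, here is a rough sketch of what such a test could look like with the responses library, assuming the deduplicated behaviour this PR aims for (the URLs and payloads are invented, the pagination headers may need adjusting to match what GitlabList actually reads, and the duplicate warning may need to be caught depending on the project's warning filters):

import gitlab
import responses


@responses.activate
def test_list_filters_duplicate_from_previous_page():
    # Page 1 returns ids 1 and 2; by the time page 2 is requested the objects
    # have shifted, so id 2 shows up again alongside id 3.
    responses.add(
        responses.GET,
        "https://gitlab.example.com/api/v4/projects",
        json=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
        headers={
            "Link": '<https://gitlab.example.com/api/v4/projects?page=2>; rel="next"'
        },
        status=200,
    )
    responses.add(
        responses.GET,
        "https://gitlab.example.com/api/v4/projects?page=2",
        json=[{"id": 2, "name": "b"}, {"id": 3, "name": "c"}],
        status=200,
    )

    gl = gitlab.Gitlab("https://gitlab.example.com", private_token="token")
    projects = list(gl.projects.list(iterator=True))

    # With deduplication enabled, id 2 should only be yielded once.
    assert [p.id for p in projects] == [1, 2, 3]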
duplicate_ids = (
    set(o["id"] for o in self._data) & self._retrieved_object_ids
)
if duplicate_ids:
I think we can shorten this and use a set comprehension directly:

- duplicate_ids = (
-     set(o["id"] for o in self._data) & self._retrieved_object_ids
- )
- if duplicate_ids:
+ if duplicate_ids := {o["id"] for o in self._data} & self._retrieved_ids:
        **kwargs: Any,
    ) -> None:
        self._gl = gl

        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._dedupe = dedupe
        self._retrieved_object_ids: set[int] = set()
(to go with the duplicate_ids below)

- self._retrieved_object_ids: set[int] = set()
+ self._retrieved_ids: set[int] = set()
@@ -1167,13 +1167,17 @@ def __init__(
        url: str,
        query_data: Dict[str, Any],
        get_next: bool = True,
        dedupe: bool = True,
Maybe we can just go with this in case some newbies get confused :)

- dedupe: bool = True,
+ deduplicate: bool = True,
        **kwargs: Any,
    ) -> None:
        self._gl = gl

        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._dedupe = dedupe
- self._dedupe = dedupe
+ self._dedupe = deduplicate
@@ -1205,6 +1209,21 @@ def _query(
                error_message="Failed to parse the server message"
            ) from e

        if self._dedupe:
            duplicate_ids = (
                set(o["id"] for o in self._data) & self._retrieved_object_ids
Another issue, as you can see from all the failing tests: we're not actually guaranteed to have id attributes returned. Some endpoints will return a different attribute as the unique identifier (we use _id_attr in our classes for this reason), and http_list() could potentially be used to return arbitrary paginated data.
This makes me think it would almost be better to do this in ListMixin based on the presence of self._obj_cls._id_attr, but I'm not sure if it's too late at that stage to do it efficiently:
python-gitlab/gitlab/mixins.py, lines 248 to 250 in 541a7e3:

if isinstance(obj, list):
    return [self._obj_cls(self, item, created_from_list=True) for item in obj]
return base.RESTObjectList(self, self._obj_cls, obj)
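For illustration, deduplication keyed on _id_attr at that level might look roughly like this (the helper name and its placement are invented; this only shows the shape, not a proposal for the final code):

from typing import List

from gitlab import base


def _deduplicate(objects: List[base.RESTObject]) -> List[base.RESTObject]:
    # Hypothetical helper: drop repeats based on each class's _id_attr
    # ("id", "iid", "key", ... or None when there is no unique identifier).
    seen = set()
    unique = []
    for obj in objects:
        if obj._id_attr is None:
            # No unique identifier defined for this class; keep everything.
            unique.append(obj)
            continue
        obj_id = getattr(obj, obj._id_attr)
        if obj_id in seen:
            continue
        seen.add(obj_id)
        unique.append(obj)
    return unique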
fixes #2979
Changes
Added a _prev_page_objects attribute to the GitlabList class to be able to filter out duplicates when items are created while pagination is happening. The code could be much more concise if I didn't use utils.warn.
:Documentation and testing
Please consider whether this PR needs documentation and tests. This is not required, but highly appreciated:
Docs and tests skipped for now while discussion happens on #2979.