fix: filter duplicates from previous page during pagination #2989

Open · wants to merge 2 commits into main from filter-pagination-dupes

Conversation

ericfrederich
Contributor

fixes #2979

Changes

Added a _prev_page_objects attribute to the GitlabList class so that duplicates can be filtered out when items are created while pagination is in progress.

The code could be much more concise if I didn't use utils.warn:

# keep only items that were not already returned on the previous page
self._data: List[Dict[str, Any]] = [r for r in result.json() if r not in self._prev_page_objects]
...
# remember this page so the next query can be checked against it
self._prev_page_objects = list(self._data)

Documentation and testing

Please consider whether this PR needs documentation and tests. This is not required, but highly appreciated:

Docs and tests skipped for now while discussion happens on #2979.


codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.58%. Comparing base (10ee58a) to head (cd22af5).

Files with missing lines | Patch % | Lines
gitlab/client.py | 77.77% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2989      +/-   ##
==========================================
- Coverage   96.61%   96.58%   -0.03%     
==========================================
  Files          95       95              
  Lines        6080     6088       +8     
==========================================
+ Hits         5874     5880       +6     
- Misses        206      208       +2     
Flag | Coverage Δ
api_func_v4 | 82.62% <77.77%> (-0.02%) ⬇️
cli_func_v4 | 82.93% <77.77%> (-0.02%) ⬇️
unit | 88.71% <77.77%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
gitlab/client.py | 98.30% <77.77%> (-0.36%) ⬇️

@nejch (Member) left a comment

Thanks for working on this @ericfrederich. If this is important to users we can add it to the client.

However, I'd add this as a feat rather than a fix: it's expected that offset pagination is not reliable, so we're providing an enhancement to work around GitLab's limitation here, not fixing a bug in python-gitlab.

I just have a few additional questions 🙇

gitlab/client.py Outdated
@@ -1174,6 +1174,8 @@ def __init__(
        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._prev_page_objects = []
Member

Are we sure it's enough to just compare against the previous page? Or is there a chance that an item from one of the earlier pages is repeated (e.g. by using a certain sort parameter that ends up returning an old item again)?

Contributor Author

This duplication occurs when sorting is done by ID and new objects are created: the new objects now exist on pages previously retrieved, and the pagination requests have no context (i.e. no dedicated database cursor for that query). It's like looking at page 2 of something in the UI, creating a new object in another tab, then going back and refreshing page 2: everything has shifted, and items from page 1 (which you already retrieved) are now on page 2.

I believe this duplication is independent of sort order, but it is more pronounced with the default ID sorting because any new object will result in duplication.

Let's think about another kind of sorting, for example alphabetical.
In this case, new objects that would appear on pages after your current page do not cause duplication, but objects that would land on earlier pages do.
Say you're listing projects: you've got all the A's and B's and are in the middle of the C's. If a new project called Zoo is created, you will not see duplication; but if a new project called Apples is created, everything shifts and you'll get back a project starting with C that you already got on the previous page.

I do not believe it's necessary to store any extra items other than the ones from the previous page.
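
To make the shift concrete, here's a small standalone simulation (not python-gitlab code) of offset pagination while a new object is created between page requests:

# Simulate a server that paginates a growing, newest-first list by offset.
objects = [{"id": i} for i in (5, 4, 3, 2, 1)]   # newest first, as with GitLab's default ordering
PER_PAGE = 2

def fetch_page(page):
    # Offset pagination has no cursor: each request re-slices the list as it is *now*.
    start = (page - 1) * PER_PAGE
    return objects[start:start + PER_PAGE]

page1 = fetch_page(1)                 # ids 5, 4
objects.insert(0, {"id": 6})          # a new object is created between requests
page2 = fetch_page(2)                 # ids 4, 3 -- everything shifted by one

seen = {o["id"] for o in page1}
print([o["id"] for o in page2 if o["id"] in seen])   # [4]: the duplicate from page 1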

Contributor Author

I almost hate to mention it, but there may be a case where the previous page is not enough, although it doesn't have to do with other sort parameters.

If between pages, more items are created on previous pages than the per_page number, then it would be possible for duplicates to occur.

The rarity of this occurring depends heavily on how the user code is implemented.

# This code has a higher chance of 100+ projects being created between pagination calls,
# because each page is fetched lazily while the slow work runs
for project in gl.projects.list(iterator=True):
    something_that_takes_a_long_time(project)

# This code has a lower chance, because all pages are fetched up front
projects = gl.projects.list(get_all=True)
for project in projects:
    something_that_takes_a_long_time(project)

If we're worried about OutOfMemory errors, as @max-wittig mentioned, the entire thing could be refactored to work off of sets of integers (object IDs) instead of lists of objects (rough sketch below).
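
A rough sketch of that ID-set variant (illustrative names such as _retrieved_ids, not the current diff):

# Illustrative only: keep every retrieved id as an int in a set instead of
# keeping whole dicts from the previous page.
self._retrieved_ids: set[int] = set()          # in __init__

# in _query(), after decoding the current page into `data`:
self._data = [o for o in data if o["id"] not in self._retrieved_ids]
self._retrieved_ids.update(o["id"] for o in self._data)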

gitlab/client.py Outdated
        except Exception as e:
            raise gitlab.exceptions.GitlabParsingError(
                error_message="Failed to parse the server message"
            ) from e

        self._data = []
        for item in data:
Member

I think this should not be the default behavior, but conditional on an argument we can supply. For example something like if remove_duplicates:.

Contributor Author

If this were a feature offered as an option, I have a hard time thinking anyone would opt to have the duplicates returned.

gitlab/client.py Outdated
        except Exception as e:
            raise gitlab.exceptions.GitlabParsingError(
                error_message="Failed to parse the server message"
            ) from e

        self._data = []
        for item in data:
Member

Are we manually looping over items and adding them one by one for performance reasons here? Or could we just do a list(OrderedDict.fromkeys(data)) and emit a single warning with the difference from the initial data vs. filtered?

Contributor Author

We could remove the loop and make the code more concise. As far as complexity goes, doing a set difference still has a loop; it's just abstracted away and done in C instead of Python. I imagine the performance difference (the loop is at most 100 items) is negligible compared to the network API call.

I will change the code though and update this PR.

Contributor Author

Actually, it looks like we can't use set() or OrderedDict on these, as the objects here are still just unhashable dicts.
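
For reference, a quick standalone illustration of that limitation (not repo code): plain dicts aren't hashable, so any set- or dict-key-based dedup needs a hashable key such as the id.

from collections import OrderedDict

items = [{"id": 1}, {"id": 2}, {"id": 1}]

# Both of these raise TypeError: unhashable type: 'dict'
# set(items)
# OrderedDict.fromkeys(items)

# Deduplicating on a hashable key works and preserves first-occurrence order of ids:
unique = list({item["id"]: item for item in items}.values())
print(unique)   # [{'id': 1}, {'id': 2}]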

I like option 2 as it's more concise. Also, CPython may have optimized list comprehensions, whereas option 1 is a generic for loop.

If we only keep a single page's previous objects we can choose between these two options.

If we decide to keep all objects, we should instead use a set() of ints. So maybe let's decide that first.

option 1

One loop; calculate dupes and filter within a single loop

duplicates = []
self._data = []
for item in data:
    if item in self._prev_page_objects:
        duplicates.append(item)
        continue
    self._data.append(item)

option 2

Two loops; one to calculate dupes, another to do the filtering.

# loop once to calculate dupes
duplicates = [o for o in data if o in self._prev_page_objects]
# loop again to remove the dupes
self._data = [o for o in data if o not in duplicates]

single warning

In both cases we can defer the warning until all duplicates are detected and emit a single warning.

if duplicates:
    utils.warn(
        message=(
            f"During pagination duplicate object with id(s) "
            f"{[item['id'] for item in duplicates]} returned from Gitlab and filtered"
        ),
        category=UserWarning,
    )

@max-wittig (Member)

Couldn't this one cause OutOfMemory errors as we need to store the dataset forever? What's the problem in handling this in client code?

@nejch (Member) commented Sep 17, 2024

> Couldn't this one cause OutOfMemory errors as we need to store the dataset forever? What's the problem in handling this in client code?

I think this is why only the previous page is stored in this implementation (so max 100 entries). Not sure if this 100% ensures deduplication (see my comment above), but it's probably not as bad.

@ericfrederich (Contributor Author)

> Couldn't this one cause OutOfMemory errors as we need to store the dataset forever? What's the problem in handling this in client code?

Note: I've also made remarks above about possibly changing it to use a set of integers instead of objects, which should help immensely with memory consumption.

I just wanted to mention here, though, that it would not be forever, just during pagination. The generator holding a reference to the GitlabList with all this data can be garbage collected.

@ericfrederich (Contributor Author)

Sorry for ignoring this for 2 weeks; I was watching for activity on the issue and wasn't checking this PR.

I see 3 open discussion points which need decisions (or just 2 decisions depending on the results of the others).

  • retain the last page of dicts, or all pages of object ids
  • make deduplication optional? And if so, what should the default be?
  • code style of the for-loop iteration (goes away if we decide to retain all object ids, as they'd be ints and we could use sets)

I'll refrain from changing code until we have consensus on those decision points.

I just wanted to apologize, summarize and communicate that I'm currently waiting for decisions. We can use the dedicated conversations above for each of the points.

@ericfrederich force-pushed the filter-pagination-dupes branch from cd22af5 to 99cee2a on October 15, 2024 at 14:52
@ericfrederich (Contributor Author)

Updated the PR branch; a rough sketch of the combined logic follows the list below.

  • It now holds all object ids, in case more objects than per_page were added during pagination
  • Single warning with all duplicated ids rather than a warning per item
  • Deduplication is optional
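
Pieced together from the hunks in the review below, the updated _query logic looks roughly like this; the filtering and warning steps not shown in those hunks are my reconstruction, not a verbatim copy of the PR:

# Reconstruction based on the diff hunks below; not a verbatim copy of the PR.
if self._dedupe:
    duplicate_ids = {o["id"] for o in self._data} & self._retrieved_object_ids
    if duplicate_ids:
        utils.warn(
            message=(
                f"During pagination, duplicate object(s) with id(s) "
                f"{sorted(duplicate_ids)} were returned from GitLab and filtered"
            ),
            category=UserWarning,
        )
        self._data = [o for o in self._data if o["id"] not in duplicate_ids]
    self._retrieved_object_ids.update(o["id"] for o in self._data)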

@nejch (Member) left a comment

Thanks again and sorry for the delay @ericfrederich.

I have some more concerns; see my comments. It's a bit tricky to get this right in a generic way, it seems, due to the nature of what gets returned by paginated endpoints.

If feasible it would be great if we could test this somehow, based on the examples you outlined in the issue.
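
One way to cover the core behaviour without HTTP mocks would be to exercise the dedup bookkeeping on hand-built pages; a rough sketch (dedupe_page is a hypothetical helper mirroring the PR's logic, not its actual API):

def dedupe_page(page, seen_ids):
    """Hypothetical helper: drop items whose id was already returned, then record the rest."""
    fresh = [item for item in page if item["id"] not in seen_ids]
    seen_ids.update(item["id"] for item in fresh)
    return fresh


def test_item_shifted_from_previous_page_is_filtered():
    seen: set = set()
    page1 = [{"id": 5}, {"id": 4}]
    # An object created after page 1 was fetched shifts id 4 down onto page 2.
    page2 = [{"id": 4}, {"id": 3}]
    assert dedupe_page(page1, seen) == [{"id": 5}, {"id": 4}]
    assert dedupe_page(page2, seen) == [{"id": 3}]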

Comment on lines +1213 to +1216
            duplicate_ids = (
                set(o["id"] for o in self._data) & self._retrieved_object_ids
            )
            if duplicate_ids:
Member

I think we can shorten this and use a set comprehension directly:

Suggested change
            duplicate_ids = (
                set(o["id"] for o in self._data) & self._retrieved_object_ids
            )
            if duplicate_ids:
            if duplicate_ids := {o["id"] for o in self._data} & self._retrieved_ids:

        **kwargs: Any,
    ) -> None:
        self._gl = gl

        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._dedupe = dedupe
        self._retrieved_object_ids: set[int] = set()
Member

(to go with the duplicate_ids below)

Suggested change
        self._retrieved_object_ids: set[int] = set()
        self._retrieved_ids: set[int] = set()

@@ -1167,13 +1167,17 @@ def __init__(
        url: str,
        query_data: Dict[str, Any],
        get_next: bool = True,
        dedupe: bool = True,
Member

Maybe we can just go with this in case some newbies get confused :)

Suggested change
        dedupe: bool = True,
        deduplicate: bool = True,

        **kwargs: Any,
    ) -> None:
        self._gl = gl

        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._dedupe = dedupe
Member

Suggested change
        self._dedupe = dedupe
        self._dedupe = deduplicate

@@ -1205,6 +1209,21 @@ def _query(
                error_message="Failed to parse the server message"
            ) from e

        if self._dedupe:
            duplicate_ids = (
                set(o["id"] for o in self._data) & self._retrieved_object_ids
Member

Another issue, as you can see from all the failing tests: we're not actually guaranteed to have id attributes returned; some endpoints return a different attribute as the unique identifier (we use _id_attr in our classes for this reason), and http_list() could potentially be used to return arbitrary paginated data.

This makes me think it would almost be better to do this in ListMixin, based on the presence of self._obj_cls._id_attr, but I'm not sure if it's too late at that stage to do it efficiently:

if isinstance(obj, list):
    return [self._obj_cls(self, item, created_from_list=True) for item in obj]
return base.RESTObjectList(self, self._obj_cls, obj)
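
For illustration, a very rough sketch of what that could look like around the snippet above in ListMixin.list() (hypothetical code, not anything that exists in python-gitlab; it also only covers the non-iterator case):

# Hypothetical sketch only -- not existing python-gitlab code.
id_attr = getattr(self._obj_cls, "_id_attr", None)
if id_attr and isinstance(obj, list):
    seen: set = set()
    deduped = []
    for item in obj:
        key = item.get(id_attr)
        if key is not None and key in seen:
            continue  # duplicate caused by pages shifting mid-listing
        if key is not None:
            seen.add(key)
        deduped.append(item)
    obj = deduped

if isinstance(obj, list):
    return [self._obj_cls(self, item, created_from_list=True) for item in obj]
return base.RESTObjectList(self, self._obj_cls, obj)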

Successfully merging this pull request may close: Duplicate objects returned from list (#2979)