fix: filter duplicates from previous page during pagination #2989

Open · wants to merge 2 commits into main from filter-pagination-dupes

Conversation

ericfrederich
Contributor

fixes #2979

Changes

Added a _prev_page_objects attribute to the GitlabList class so that duplicates can be filtered out when items are created while pagination is in progress.

The code could be much more concise if I didn't use utils.warn:

# keep only items that were not already returned on the previous page
self._data: List[Dict[str, Any]] = [r for r in result.json() if r not in self._prev_page_objects]
...
# remember this page so the next query can be checked against it
self._prev_page_objects = list(self._data)

Documentation and testing

Please consider whether this PR needs documentation and tests. This is not required, but highly appreciated:

Docs and tests skipped for now while discussion happens on #2979.


codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.58%. Comparing base (10ee58a) to head (cd22af5).

Files with missing lines | Patch % | Lines
gitlab/client.py | 77.77% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2989      +/-   ##
==========================================
- Coverage   96.61%   96.58%   -0.03%     
==========================================
  Files          95       95              
  Lines        6080     6088       +8     
==========================================
+ Hits         5874     5880       +6     
- Misses        206      208       +2     
Flag | Coverage Δ
api_func_v4 | 82.62% <77.77%> (-0.02%) ⬇️
cli_func_v4 | 82.93% <77.77%> (-0.02%) ⬇️
unit | 88.71% <77.77%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
gitlab/client.py | 98.30% <77.77%> (-0.36%) ⬇️

@nejch (Member) left a comment

Thanks for working on this @ericfrederich. If this is important to users we can add it to the client.

However, I'd add this as a feat rather than a fix: it's expected that offset pagination is not reliable, so we're providing an enhancement to work around GitLab's limitation here, not fixing a bug in python-gitlab.

I just have a few additional questions 🙇

gitlab/client.py Outdated
@@ -1174,6 +1174,8 @@ def __init__(
        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._prev_page_objects = []
Member

Are we sure it's enough to just compare against the previous page? Or is there a chance that an item from one of the earlier pages is repeated (e.g. by using a certain sort parameter that ends up returning an old item again)?

Contributor Author

This duplication occurs when sorting is done by ID and new objects are created: the new objects now exist on pages previously retrieved, and the pagination requests have no context (i.e. no dedicated database cursor for that query). It's like looking at page 2 of something in the UI, creating a new object in another tab, then going back and refreshing page 2: everything has shifted, and items from page 1 (which you already retrieved) are now on page 2.

I believe this duplication is independent of sort order, but it is more pronounced with the default ID sorting because any new object will result in duplication.

Let's think about another kind of sorting, for example alphabetical.
In this case, new objects that would appear on pages after your current page do not cause duplication, but objects that would land on earlier pages do.
Say you're listing projects: you've got all the A's and B's and are in the middle of the C's. If a new project called Zoo is created, you will not see duplication; but if a new project called Apples is created, everything shifts and you'll get back a project starting with C that you already got on the previous page.

I do not believe it's necessary to store any extra items other than the ones from the previous page.
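
To make the shift concrete, here's a small standalone simulation (not python-gitlab code) of offset pagination while a new object is created between page requests:

# Simulate a server that paginates a growing, newest-first list by offset.
objects = [{"id": i} for i in (5, 4, 3, 2, 1)]   # newest first, as with GitLab's default ordering
PER_PAGE = 2

def fetch_page(page):
    # Offset pagination has no cursor: each request re-slices the list as it is *now*.
    start = (page - 1) * PER_PAGE
    return objects[start:start + PER_PAGE]

page1 = fetch_page(1)                 # ids 5, 4
objects.insert(0, {"id": 6})          # a new object is created between requests
page2 = fetch_page(2)                 # ids 4, 3 -- everything shifted by one

seen = {o["id"] for o in page1}
print([o["id"] for o in page2 if o["id"] in seen])   # [4]: the duplicate from page 1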

Contributor Author

I almost hate to mention it, but there may be a case where the previous page is not enough, although it doesn't have to do with other sort parameters.

If between pages, more items are created on previous pages than the per_page number, then it would be possible for duplicates to occur.

The rarity of this occurring depends heavily on how the user code is implemented.

# This code has a higher chance of 100+ projects being created between pagination calls,
# because each page is fetched lazily while the slow work runs
for project in gl.projects.list(iterator=True):
    something_that_takes_a_long_time(project)

# This code has a lower chance, because all pages are fetched up front
projects = gl.projects.list(get_all=True)
for project in projects:
    something_that_takes_a_long_time(project)

If we're worried about OutOfMemory errors, as @max-wittig mentioned, the entire thing could be refactored to work off of sets of integers (object IDs) instead of lists of objects (rough sketch below).
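
A rough sketch of that ID-set variant (illustrative names such as _retrieved_ids, not the current diff):

# Illustrative only: keep every retrieved id as an int in a set instead of
# keeping whole dicts from the previous page.
self._retrieved_ids: set[int] = set()          # in __init__

# in _query(), after decoding the current page into `data`:
self._data = [o for o in data if o["id"] not in self._retrieved_ids]
self._retrieved_ids.update(o["id"] for o in self._data)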

gitlab/client.py Outdated
        except Exception as e:
            raise gitlab.exceptions.GitlabParsingError(
                error_message="Failed to parse the server message"
            ) from e

        self._data = []
        for item in data:
Member

I think this should not be the default behavior, but conditional on an argument we can supply. For example something like if remove_duplicates:.

Contributor Author

If this were a feature offered as an option, I have a hard time thinking anyone would opt to have the duplicates returned.

gitlab/client.py Outdated
        except Exception as e:
            raise gitlab.exceptions.GitlabParsingError(
                error_message="Failed to parse the server message"
            ) from e

        self._data = []
        for item in data:
Member

Are we manually looping over items and adding them one by one for performance reasons here? Or could we just do a list(OrderedDict.fromkeys(data)) and emit a single warning with the difference from the initial data vs. filtered?

Contributor Author

We could remove the loop and make the code more concise. As far as complexity goes, doing a set difference still has a loop; it's just abstracted away and done in C instead of Python. I imagine the performance difference (the loop is at most 100 items) is negligible compared to the network API call.

I will change the code though and update this PR.

Contributor Author

Actually, it looks like we can't use set() or OrderedDict on these, as the objects here are still just unhashable dicts.
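
For reference, a quick standalone illustration of that limitation (not repo code): plain dicts aren't hashable, so any set- or dict-key-based dedup needs a hashable key such as the id.

from collections import OrderedDict

items = [{"id": 1}, {"id": 2}, {"id": 1}]

# Both of these raise TypeError: unhashable type: 'dict'
# set(items)
# OrderedDict.fromkeys(items)

# Deduplicating on a hashable key works and preserves first-occurrence order of ids:
unique = list({item["id"]: item for item in items}.values())
print(unique)   # [{'id': 1}, {'id': 2}]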

I like option 2 as it's more concise. Also, CPython may have optimized list comprehensions, whereas option 1 is a generic for loop.

If we only keep a single page's previous objects we can choose between these two options.

If we decide to keep all objects, we should instead use a set() of ints. So maybe let's decide that first.

option 1

One loop; calculate dupes and filter within a single loop

duplicates = []
self._data = []
for item in data:
    if item in self._prev_page_objects:
        duplicates.append(item)
        continue
    self._data.append(item)

option 2

Two loops; one to calculate dupes, another to do the filtering.

# loop once to calculate dupes
duplicates = [o for o in data if o in self._prev_page_objects]
# loop again to remove the dupes
self._data = [o for o in data if o not in duplicates]

single warning

In both cases we can defer the warning until all duplicates are detected and emit a single warning.

if duplicates:
    utils.warn(
        message=(
            f"During pagination duplicate object with id(s) "
            f"{[item['id'] for item in duplicates]} returned from Gitlab and filtered"
        ),
        category=UserWarning,
    )

@max-wittig (Member)

Couldn't this one cause OutOfMemory errors as we need to store the dataset forever? What's the problem in handling this in client code?

@nejch (Member) commented Sep 17, 2024

> Couldn't this one cause OutOfMemory errors as we need to store the dataset forever? What's the problem in handling this in client code?

I think this is why only the previous page is stored in this implementation (so max 100 entries). Not sure if this 100% ensures deduplication (see my comment above), but it's probably not as bad.

@ericfrederich (Contributor Author)

> Couldn't this one cause OutOfMemory errors as we need to store the dataset forever? What's the problem in handling this in client code?

Note: I've also made remarks above about possibly changing it to use a set of integers instead of objects, which should help immensely with memory consumption.

I just wanted to mention here, though, that it would not be forever, just during pagination. The generator holding a reference to the GitlabList with all this data can be garbage collected.

@ericfrederich (Contributor Author)

Sorry for ignoring this for 2 weeks; I was watching for activity on the issue and wasn't checking this PR.

I see 3 open discussion points which need decisions (or just 2 decisions depending on the results of the others).

  • retain the last page of dicts, or all pages of object ids
  • make deduplication optional? And if so, what should the default be?
  • code style of the for-loop iteration (goes away if we decide to retain all object ids, as they'd be ints and we could use sets)

I'll refrain from changing code until we have consensus on those decision points.

I just wanted to apologize, summarize and communicate that I'm currently waiting for decisions. We can use the dedicated conversations above for each of the points.

@ericfrederich force-pushed the filter-pagination-dupes branch from cd22af5 to 99cee2a on October 15, 2024 at 14:52
@ericfrederich (Contributor Author)

Updated the PR branch; a rough sketch of the combined logic follows the list below.

  • It now holds all object ids, in case more objects than per_page were added during pagination
  • Single warning with all duplicated ids rather than a warning per item
  • Deduplication is optional
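
Pieced together from the hunks in the review below, the updated _query logic looks roughly like this; the filtering and warning steps not shown in those hunks are my reconstruction, not a verbatim copy of the PR:

# Reconstruction based on the diff hunks below; not a verbatim copy of the PR.
if self._dedupe:
    duplicate_ids = {o["id"] for o in self._data} & self._retrieved_object_ids
    if duplicate_ids:
        utils.warn(
            message=(
                f"During pagination, duplicate object(s) with id(s) "
                f"{sorted(duplicate_ids)} were returned from GitLab and filtered"
            ),
            category=UserWarning,
        )
        self._data = [o for o in self._data if o["id"] not in duplicate_ids]
    self._retrieved_object_ids.update(o["id"] for o in self._data)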

@nejch (Member) left a comment

Thanks again and sorry for the delay @ericfrederich.

I have some more concerns; see my comments. It's a bit tricky to get this right in a generic way, it seems, due to the nature of what gets returned by paginated endpoints.

If feasible it would be great if we could test this somehow, based on the examples you outlined in the issue.
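
One way to cover the core behaviour without HTTP mocks would be to exercise the dedup bookkeeping on hand-built pages; a rough sketch (dedupe_page is a hypothetical helper mirroring the PR's logic, not its actual API):

def dedupe_page(page, seen_ids):
    """Hypothetical helper: drop items whose id was already returned, then record the rest."""
    fresh = [item for item in page if item["id"] not in seen_ids]
    seen_ids.update(item["id"] for item in fresh)
    return fresh


def test_item_shifted_from_previous_page_is_filtered():
    seen: set = set()
    page1 = [{"id": 5}, {"id": 4}]
    # An object created after page 1 was fetched shifts id 4 down onto page 2.
    page2 = [{"id": 4}, {"id": 3}]
    assert dedupe_page(page1, seen) == [{"id": 5}, {"id": 4}]
    assert dedupe_page(page2, seen) == [{"id": 3}]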

Comment on lines +1213 to +1216
            duplicate_ids = (
                set(o["id"] for o in self._data) & self._retrieved_object_ids
            )
            if duplicate_ids:
Member

I think we can shorten this and use a set comprehension directly:

Suggested change
            duplicate_ids = (
                set(o["id"] for o in self._data) & self._retrieved_object_ids
            )
            if duplicate_ids:
            if duplicate_ids := {o["id"] for o in self._data} & self._retrieved_ids:

        **kwargs: Any,
    ) -> None:
        self._gl = gl

        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._dedupe = dedupe
        self._retrieved_object_ids: set[int] = set()
Member

(to go with the duplicate_ids below)

Suggested change
        self._retrieved_object_ids: set[int] = set()
        self._retrieved_ids: set[int] = set()

@@ -1167,13 +1167,17 @@ def __init__(
        url: str,
        query_data: Dict[str, Any],
        get_next: bool = True,
        dedupe: bool = True,
Member

Maybe we can just go with this in case some newbies get confused :)

Suggested change
        dedupe: bool = True,
        deduplicate: bool = True,

        **kwargs: Any,
    ) -> None:
        self._gl = gl

        # Preserve kwargs for subsequent queries
        self._kwargs = kwargs.copy()

        self._dedupe = dedupe
Member

Suggested change
        self._dedupe = dedupe
        self._dedupe = deduplicate

@@ -1205,6 +1209,21 @@ def _query(
                error_message="Failed to parse the server message"
            ) from e

        if self._dedupe:
            duplicate_ids = (
                set(o["id"] for o in self._data) & self._retrieved_object_ids
Member

Another issue, as you can see from all the failing tests: we're not actually guaranteed to have id attributes returned; some endpoints return a different attribute as the unique identifier (we use _id_attr in our classes for this reason), and http_list() could potentially be used to return arbitrary paginated data.

This makes me think it would almost be better to do this in ListMixin, based on the presence of self._obj_cls._id_attr, but I'm not sure if it's too late at that stage to do it efficiently:

if isinstance(obj, list):
    return [self._obj_cls(self, item, created_from_list=True) for item in obj]
return base.RESTObjectList(self, self._obj_cls, obj)
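
For illustration, a very rough sketch of what that could look like around the snippet above in ListMixin.list() (hypothetical code, not anything that exists in python-gitlab; it also only covers the non-iterator case):

# Hypothetical sketch only -- not existing python-gitlab code.
id_attr = getattr(self._obj_cls, "_id_attr", None)
if id_attr and isinstance(obj, list):
    seen: set = set()
    deduped = []
    for item in obj:
        key = item.get(id_attr)
        if key is not None and key in seen:
            continue  # duplicate caused by pages shifting mid-listing
        if key is not None:
            seen.add(key)
        deduped.append(item)
    obj = deduped

if isinstance(obj, list):
    return [self._obj_cls(self, item, created_from_list=True) for item in obj]
return base.RESTObjectList(self, self._obj_cls, obj)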

Successfully merging this pull request may close: Duplicate objects returned from list (#2979)