fix: filter duplicates from previous page during pagination #2989


Open · wants to merge 2 commits into main
19 changes: 19 additions & 0 deletions gitlab/client.py
@@ -1167,13 +1167,17 @@ def __init__(
url: str,
query_data: Dict[str, Any],
get_next: bool = True,
dedupe: bool = True,
Member commented:

Maybe we can just go with this in case some newbies get confused :)

Suggested change:
- dedupe: bool = True,
+ deduplicate: bool = True,

**kwargs: Any,
) -> None:
self._gl = gl

# Preserve kwargs for subsequent queries
self._kwargs = kwargs.copy()

self._dedupe = dedupe
Member commented:

Suggested change:
- self._dedupe = dedupe
+ self._dedupe = deduplicate

self._retrieved_object_ids: set[int] = set()
Member commented:

(to go with the duplicate_ids below)

Suggested change:
- self._retrieved_object_ids: set[int] = set()
+ self._retrieved_ids: set[int] = set()


self._query(url, query_data, **self._kwargs)
self._get_next = get_next

@@ -1205,6 +1209,21 @@ def _query(
error_message="Failed to parse the server message"
) from e

if self._dedupe:
duplicate_ids = (
set(o["id"] for o in self._data) & self._retrieved_object_ids
Member commented:

Another issue, as you can see from all the failing tests: we're not actually guaranteed to have id attributes returned; some endpoints will return a different attribute as the unique identifier (we use _id_attr in our classes for this reason), and http_list() could potentially be used to return arbitrary paginated data.

This makes me think it would almost be better to do this in ListMixin based on the presence of self._obj_cls._id_attr, but I'm not sure whether it's too late at that stage to do it efficiently:

if isinstance(obj, list):
return [self._obj_cls(self, item, created_from_list=True) for item in obj]
return base.RESTObjectList(self, self._obj_cls, obj)
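For illustration, here is a rough sketch of that idea (not part of this PR): a helper that deduplicates paginated items by whatever unique attribute the class declares via _id_attr, passing everything through when a class has no unique identifier. The function name and the dict-shaped items are assumptions made for the example; _id_attr is python-gitlab's existing convention.

from typing import Any, Dict, Iterator, Optional, Set

def dedupe_by_id_attr(
    items: Iterator[Dict[str, Any]], id_attr: Optional[str]
) -> Iterator[Dict[str, Any]]:
    """Yield items, skipping any whose identifier was already seen."""
    if id_attr is None:
        # Some endpoints have no unique identifier; pass everything through.
        yield from items
        return
    seen: Set[Any] = set()
    for item in items:
        key = item[id_attr]
        if key in seen:
            continue  # duplicate leaked in from a previous page
        seen.add(key)
        yield item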

)
if duplicate_ids:
Comment on lines +1213 to +1216

Member commented:

I think we can shorten this and use a set comprehension directly:

Suggested change:
- duplicate_ids = (
-     set(o["id"] for o in self._data) & self._retrieved_object_ids
- )
- if duplicate_ids:
+ if duplicate_ids := {o["id"] for o in self._data} & self._retrieved_ids:
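As a standalone illustration of the suggested pattern (plain Python, with sample data invented for the example), the walrus operator binds the intersection of the new page's ids with the already-seen ids and branches on it in one statement:

data = [{"id": 1}, {"id": 2}, {"id": 3}]
retrieved_ids = {2, 3}

# Bind the overlap and test its truthiness in a single line (Python 3.8+).
if duplicate_ids := {o["id"] for o in data} & retrieved_ids:
    print(f"filtering duplicates: {duplicate_ids}")  # filtering duplicates: {2, 3}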

utils.warn(
message=(
f"During pagination duplicate object(s) with id(s) "
f"{duplicate_ids} returned from Gitlab and filtered"
),
category=UserWarning,
)
self._data = [o for o in self._data if o["id"] not in duplicate_ids]
self._retrieved_object_ids.update(o["id"] for o in self._data)

self._current = 0

@property
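For context, a minimal sketch of how the new warning would surface to a caller. The server URL and token are placeholders; iterator=True is python-gitlab's existing option for lazy, page-by-page iteration.

import warnings

import gitlab

gl = gitlab.Gitlab("https://gitlab.example.com", private_token="TOKEN")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Iterate lazily so each page is fetched (and deduplicated) on demand.
    projects = list(gl.projects.list(iterator=True))

for w in caught:
    if issubclass(w.category, UserWarning):
        # e.g. "During pagination duplicate object(s) with id(s) {...}
        # returned from Gitlab and filtered"
        print(w.message)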