Description
Description of the problem, including code/CLI snippet
Several REST resources return duplicate objects when listing. I have noticed this with both projects and users.
This may be the expected behavior of GitLab itself, but perhaps this Python package, which already handles pagination, could also handle deduplication based on id.
Expected Behavior
I would expect no duplicate objects when using .list(get_all=True, iterator=True), even if objects of that type are created while the pages are still being fetched.
Actual Behavior
If a project is created while gl.projects.list(get_all=True, iterator=True) is still iterating, you get a duplicate object. The same happens with users and likely with all other object types as well.
End-user mitigation and thoughts
It would be nice if end users didn't have to dedupe themselves.
The below code is overkill but has info I was using while trying to understand the problem.
What I have found is that I do get the warning log about an exact match being returned; I have never seen the AssertionError raised. I also tracked the indices for information. In every instance the duplicates have been at indices x99 and x00 (right on a page boundary).
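This makes sense: when a new project or user is created during iteration, it lands in a page we have already fetched, so we miss it and every remaining object shifts by one index. The last object of the page just fetched then reappears as the first object of the next page.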
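The shift can be reproduced without a GitLab server. The sketch below is only a simulation: `fetch_page` and the `listing` data are hypothetical stand-ins for offset pagination over a newest-first listing, not python-gitlab or GitLab code.

```python
from typing import List


def fetch_page(items: List[int], page: int, per_page: int) -> List[int]:
    """Return one offset-paginated page (1-based) of the current listing."""
    start = (page - 1) * per_page
    return items[start:start + per_page]


# Hypothetical server-side listing: ids 10..1, newest (highest id) first.
listing = list(range(10, 0, -1))
per_page = 5

page1 = fetch_page(listing, 1, per_page)
# A new object is created between the two page requests and sorts first.
listing.insert(0, 11)
page2 = fetch_page(listing, 2, per_page)

# Concatenate the pages the way a client-side iterator would see them.
seen = page1 + page2
duplicates = sorted({x for x in seen if seen.count(x) > 1})
print(page1, page2, duplicates)
```

The last id of the first page comes back as the first id of the second page, and the newly created id 11 is never seen at all.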
WARNING Duplicate project id 31393 at index 1099 and 1100
WARNING Duplicate project id 30028 at index 2099 and 2100
WARNING Duplicate project id 22457 at index 7899 and 7900
WARNING Duplicate user id 222 at index 10299 and 10300
If deduplication is implemented within python-gitlab itself it wouldn't need to keep track of all object ids, just the previous page's object ids, since this only occurs on page boundaries.
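As a rough illustration of that idea, here is a minimal sketch that remembers only the most recently yielded ids instead of every id. The name `dedupe_recent` and its `window` parameter are hypothetical and not part of python-gitlab; the sketch assumes every object has an `id` attribute and that `window` is at least the page size.

```python
from collections import deque
from types import SimpleNamespace
from typing import Any, Deque, Iterable, Iterator, Set


def dedupe_recent(objects: Iterable[Any], window: int = 100) -> Iterator[Any]:
    """Yield objects, skipping any whose id is among the last `window` yielded ids."""
    recent_order: Deque[Any] = deque()  # yielded ids, oldest first
    recent_ids: Set[Any] = set()        # same ids, for O(1) membership checks
    for obj in objects:
        if obj.id in recent_ids:
            continue  # already yielded recently: a page-boundary duplicate
        yield obj
        recent_order.append(obj.id)
        recent_ids.add(obj.id)
        if len(recent_order) > window:
            # Forget the oldest id so memory stays bounded by `window`.
            recent_ids.discard(recent_order.popleft())


# Stand-in objects reproducing one duplicate at a page boundary (page size 5).
raw = [SimpleNamespace(id=i) for i in [10, 9, 8, 7, 6, 6, 5, 4, 3, 2]]
deduped_ids = [obj.id for obj in dedupe_recent(raw, window=5)]
print(deduped_ids)
```

An end user could presumably wrap the existing call the same way, for example dedupe_recent(gl.projects.list(get_all=True, iterator=True), window=100). If more objects than `window` are created between two page requests, a duplicate could still slip through.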
import json
import logging
import tempfile
from pathlib import Path

import click
from gitlab.mixins import CRUDMixin

logger = logging.getLogger(__name__)


def get_stuff(manager: CRUDMixin, **kwargs):
    things = []
    things_by_id = {}  # id -> (index in the deduplicated list, object)
    obj_type = manager.__class__.__name__.removesuffix("Manager").lower()
    for i, thing in enumerate(manager.list(iterator=True, **kwargs)):
        if thing.id in things_by_id:
            existing_idx, existing_thing = things_by_id.get(thing.id)
            # NOTE (assumption): python-gitlab appears to compare RESTObjects by id
            # when both have one, which would make this branch always taken.
            if existing_thing == thing:
                logger.warning("Duplicate %s id %s at index %d and %d", obj_type, thing.id, existing_idx, i)
                continue
            else:
                # Same id but different content: dump both objects for inspection.
                p1 = Path(tempfile.gettempdir()) / f"{obj_type}_{existing_thing.id}_idx_{existing_idx}"
                p2 = Path(tempfile.gettempdir()) / f"{obj_type}_{thing.id}_idx_{i}"
                with p1.open("wt") as f:
                    print(json.dumps(existing_thing.attributes, indent=2, sort_keys=True), file=f)
                with p2.open("wt") as f:
                    print(json.dumps(thing.attributes, indent=2, sort_keys=True), file=f)
                raise AssertionError(
                    f"Duplicate {obj_type} id {thing.id} at index {existing_idx} and {i}; look at {str(p1)} and {str(p2)}"
                )
        things_by_id[thing.id] = (len(things), thing)
        things.append(thing)
        # TODO: this would be better done w/ rich or something
        # Progress output: a marker every 100 objects, a running count every 1000.
        if len(things) % 100 == 0:
            if len(things) % 200 == 0:
                click.secho("...", nl=False, fg="yellow", bold=True)
            else:
                click.secho("...", nl=False, fg="green", bold=True)
            if len(things) % 1000 == 0:
                click.secho(f"\n{len(things)} {obj_type}s", fg="blue")
    click.secho(f"\n{len(things)} {obj_type}s total", fg="blue", bold=True)
    return things
Specifications
- python-gitlab version: python-gitlab==4.10.0
- API version you are using (v3/v4): v4
- Gitlab server version (or gitlab.com): 16.11.6-ee