Duplicate objects returned from list #2979

Open
@ericfrederich

Description of the problem, including code/CLI snippet

At least two REST resources return duplicate objects when listed: I have seen this with both projects and users.
This may be the expected behavior of GitLab itself, but since this Python package already handles pagination, perhaps it could also deduplicate by id.

Expected Behavior

I would expect no duplicate objects from .list(get_all=True, iterator=True), even if objects of that type are created while the pages are still being fetched.

Actual Behavior

If a project is created while gl.projects.list(get_all=True, iterator=True) is still being iterated, one object is returned twice. The same happens with users, and likely with every other object type.

End-user mitigation and thoughts

It would be nice if end users didn't have to dedupe themselves.

The get_stuff function at the end of this section is overkill, but it records the information I used while trying to understand the problem.

What I have found is that I do get the warning log about an exact match being returned, and I have never seen the AssertionError raised. I also tracked the indices for information. In every instance the duplicate has been at index x99 and x00 (right on a page boundary).
This makes sense: when a new project or user is created mid-iteration, it lands on a page we have already fetched, so we miss it and every later item shifts by one index. The last item of the page we just read then comes back as the first item of the next page.

WARNING  Duplicate project id 31393 at index 1099 and 1100
WARNING  Duplicate project id 30028 at index 2099 and 2100
WARNING  Duplicate project id 22457 at index 7899 and 7900
WARNING  Duplicate user id 222 at index 10299 and 10300
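
The page-boundary pattern in the log above can be reproduced with a toy simulation. This is only a sketch under two assumptions that are not confirmed for GitLab here: that the listing is offset-paginated, and that it is ordered newest first. The names fetch_page and simulate_listing are made up for illustration and are not part of python-gitlab.

```python
from typing import List


def fetch_page(ids: List[int], page: int, per_page: int) -> List[int]:
    """Return one page of an offset-paginated listing (pages are 1-based)."""
    start = (page - 1) * per_page
    return ids[start:start + per_page]


def simulate_listing(per_page: int = 3) -> List[int]:
    """Walk all pages while a new object is created after the first page."""
    ids = [6, 5, 4, 3, 2, 1]  # assumed ordering: newest first
    seen: List[int] = []
    page = 1
    while True:
        chunk = fetch_page(ids, page, per_page)
        if not chunk:
            break
        seen.extend(chunk)
        if page == 1:
            # A new object (id 7) is created between the first and second
            # request; it goes to the front and shifts everything by one.
            ids.insert(0, 7)
        page += 1
    return seen


print(simulate_listing())  # [6, 5, 4, 4, 3, 2, 1]
```

With a page size of 3, id 4 appears at indices 2 and 3, the last slot of one page and the first slot of the next, and the new id 7 is never seen. That matches the x99/x00 pattern in the warnings.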

If deduplication is implemented within python-gitlab itself, it wouldn't need to keep track of all object ids, just the previous page's object ids, since this only occurs on page boundaries.

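import json
import logging
import tempfile
from pathlib import Path

import click
from gitlab.mixins import CRUDMixin

logger = logging.getLogger(__name__)


def get_stuff(manager: CRUDMixin, **kwargs):
    # Collect every object from a paginated list, skipping exact duplicates.
    things = []
    things_by_id = {}  # id -> (index of first occurrence, object)
    # e.g. ProjectManager -> "project" (str.removesuffix needs Python 3.9+)
    obj_type = manager.__class__.__name__.removesuffix("Manager").lower()
    for i, thing in enumerate(manager.list(iterator=True, **kwargs)):
        if thing.id in things_by_id:
            existing_idx, existing_thing = things_by_id[thing.id]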
            if existing_thing == thing:
                logger.warning("Duplicate %s id %s at index %d and %d", obj_type, thing.id, existing_idx, i)
                continue
            else:
                p1 = Path(tempfile.gettempdir()) / f"{obj_type}_{existing_thing.id}_idx_{existing_idx}"
                p2 = Path(tempfile.gettempdir()) / f"{obj_type}_{thing.id}_idx_{i}"
                with p1.open("wt") as f:
                    print(json.dumps(existing_thing.attributes, indent=2, sort_keys=True), file=f)
                with p2.open("wt") as f:
                    print(json.dumps(thing.attributes, indent=2, sort_keys=True), file=f)
                raise AssertionError(
                    f"Duplicate {obj_type} id {thing.id} at index {existing_idx} and {i}; look at {str(p1)} and {str(p2)}"
                )
        # Store the raw iteration index so it stays comparable with i above.
        things_by_id[thing.id] = (i, thing)
        things.append(thing)
        # TODO: this would be better done w/ rich or something
        if len(things) % 100 == 0:
            if len(things) % 200 == 0:
                click.secho("...", nl=False, fg="yellow", bold=True)
            else:
                click.secho("...", nl=False, fg="green", bold=True)
        if len(things) % 1000 == 0:
            click.secho(f"\n{len(things)} {obj_type}s", fg="blue")
    click.secho(f"\n{len(things)} {obj_type}s total", fg="blue", bold=True)
    return things
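
The idea mentioned earlier, remembering only about one page's worth of ids rather than every id, could look roughly like the sketch below. It is not python-gitlab code: dedupe_recent and its window parameter are hypothetical names, and it assumes each yielded object has an id attribute and that duplicates only ever show up within one page of each other.

```python
from collections import deque
from typing import Any, Deque, Iterable, Iterator, Set


def dedupe_recent(items: Iterable[Any], window: int = 100) -> Iterator[Any]:
    """Yield items, skipping any whose id is among the last `window` ids yielded."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent_ids: Deque[Any] = deque(maxlen=window)  # ids in arrival order
    recent_set: Set[Any] = set()                   # same ids, for O(1) lookup
    for item in items:
        if item.id in recent_set:
            continue  # seen within the window: treat as a page-boundary duplicate
        if len(recent_ids) == window:
            # The deque drops its oldest id on append; keep the set in sync.
            recent_set.discard(recent_ids[0])
        recent_ids.append(item.id)
        recent_set.add(item.id)
        yield item
```

It would wrap the iterator, for example dedupe_recent(gl.projects.list(get_all=True, iterator=True), window=100), with window set to at least the page size in use. Memory stays bounded by window no matter how many objects are listed, at the cost of not catching a duplicate that reappears further away than that.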

Specifications

  • python-gitlab version: python-gitlab==4.10.0
  • API version you are using (v3/v4): v4
  • GitLab server version (or gitlab.com): 16.11.6-ee
