
Conversation

@MehmedGIT (Contributor) commented Dec 4, 2023

Problematic example case: 25 pages, 8 chunks -> 3.125 pages per chunk. Each chunk should contain 3 pages but contains 4 (due to the ceil method used), which leaves the 7th chunk (chunk index 6) with only the 25th page and the 8th chunk (chunk index 7) empty, which triggers an exception. Using the floor method would introduce other errors, such as page overflow (e.g., 3 pages per chunk covers only 24 pages, and the 25th page is lost).
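A minimal sketch of that failure mode (not the actual list-page code, just the ceil-based behavior reconstructed from the description above):

import math

pages = list(range(1, 26))                 # 25 pages
n_chunks = 8
size = math.ceil(len(pages) / n_chunks)    # ceil(3.125) = 4
chunks = [pages[i * size:(i + 1) * size] for i in range(n_chunks)]
# chunks[6] == [25]; chunks[7] == [] -> the empty chunk that triggers the exception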

Simply using NumPy's np.array_split prevents such edge cases.
Old output range: [[1, 2, 3, 4], [5, 6, 7, 8],..., [21, 22, 23, 24], [25]]
New output range: [[1, 2, 3, 4], [5, 6, 7],..., [20, 21, 22], [23, 24, 25]]
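For reference, a sketch of the NumPy call (kba confirms below that np.array_split is what takes care of these edge cases):

import numpy as np

pages = list(range(1, 26))   # 25 pages
chunks = [chunk.tolist() for chunk in np.array_split(pages, 8)]
# [[1, 2, 3, 4], [5, 6, 7], ..., [23, 24, 25]] -- chunk sizes differ by at most 1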

Of course, it is also possible to have chunks divided based on leaps, e.g. [1]:

def chunks(l, amount):
    """Yield `amount` sequential chunks from `l`, spreading the remainder evenly."""
    if amount < 1:
        raise ValueError('amount must be a positive integer')
    chunk_len = len(l) // amount   # base chunk size
    leap_parts = len(l) % amount   # leftover items to distribute
    remainder = amount // 2        # make it symmetrical
    i = 0
    while i < len(l):
        remainder += leap_parts
        end_index = i + chunk_len
        if remainder >= amount:    # time to emit one longer chunk
            remainder -= amount
            end_index += 1
        yield l[i:end_index]
        i = end_index

would produce:
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12, 13], [14, 15, 16], [17, 18, 19], [20, 21, 22], [23, 24, 25]]
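The output above can be reproduced with a one-line driver:

print(list(chunks(list(range(1, 26)), 8)))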

@MehmedGIT MehmedGIT requested a review from kba December 4, 2023 16:13
@MehmedGIT (Contributor, Author) commented Dec 5, 2023

> If I understand it correctly, parameter l is a list and amount is the number of chunks we want to have. Isn't it easier if we pass in the size of a chunk instead of the number of chunks?
>
> So, we can have a function which splits a list into n-size chunks.

Easier - yes; useful - not as much. The initial idea for dividing workspaces into chunks was to let users divide their workspaces into equal chunks based on the number of CPU cores they have, without having to know how many pages a workspace contains. With the n-size chunks approach, the user still needs to know the total number of pages and decide the N value per workspace to get optimal performance for the number of CPU cores they use. Of course, list-page could also support n-size chunk division as an additional option if that would be useful. I could implement that.
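A minimal sketch of the n-size variant discussed here (the name chunks_of_size is illustrative, not from the PR):

def chunks_of_size(l, n):
    """Yield successive chunks of n items from l; the last chunk may be shorter."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

print(list(chunks_of_size(list(range(1, 26)), 4)))
# [[1, 2, 3, 4], [5, 6, 7, 8], ..., [21, 22, 23, 24], [25]]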

@tdoan2010 (Contributor) commented Dec 5, 2023

Yeah, just realized that. Btw, I found this solution, which does not require numpy. It seems fine to me. Did you try it?

def chunks(l, n):
    """Yield n sequential chunks from l."""
    d, r = divmod(len(l), n)  # the first r chunks get d+1 items, the rest get d
    for i in range(n):
        si = (d + 1) * (i if i < r else r) + d * (0 if i < r else i - r)
        yield l[si:si + (d + 1 if i < r else d)]
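On the 25-page, 8-chunk example above, this yields the same balanced split as np.array_split:

print(list(chunks(list(range(1, 26)), 8)))
# [[1, 2, 3, 4], [5, 6, 7], ..., [23, 24, 25]]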

It could be even simpler if we don't need to have sequential chunks.

def chunks(l, n):
    """Yield n striped chunks from l (every n-th item, offset by i)."""
    for i in range(n):
        yield l[i::n]
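For the same input, the striped variant spreads the pages round-robin instead of sequentially:

print(list(chunks(list(range(1, 26)), 8)))
# [[1, 9, 17, 25], [2, 10, 18], [3, 11, 19], ..., [8, 16, 24]]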

@MehmedGIT (Contributor, Author) commented Dec 5, 2023

Ideally, sequential chunks are preferred when possible. They are also easier to reference: for example, it is easier to say that pages in the range 7..11 failed than to spell out the exact IDs in a failed range. It is also easier to rerun tasks on the failed range, e.g., the page_id would be just PHYS_0007..PHYS_0011 instead of writing out the full IDs of 5 pages. The sequence, of course, does not matter for workspaces where the IDs are hashes and the shortcut referencing does not work.

@kba (Member) left a comment

LGTM. Agree that using numpy for this (as we originally did) is the best solution. It is a simple enough problem, but there are so many edge cases that np.array_split already takes care of.

@kba kba merged commit 8171afb into master Dec 5, 2023
@kba kba deleted the fix-ranges-list-page branch December 5, 2023 13:59