
Conversation

@MehmedGIT (Contributor) commented Dec 4, 2023

Problematic example case: 25 pages, 8 chunks -> 3.125 pages per chunk. Each chunk should contain 3 pages but contains 4 (due to the ceil method used), which leaves the 7th chunk (chunk index 6) with only the 25th page and the 8th chunk (chunk index 7) empty, which triggers an exception. Using the floor method would introduce other errors, such as page overflow (e.g., 3 pages per chunk covers only 24 pages, and the 25th page is lost).
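A minimal sketch of that failure mode (not the actual list-page code, just the ceil-based behavior reconstructed from the description above):

import math

pages = list(range(1, 26))                 # 25 pages
n_chunks = 8
size = math.ceil(len(pages) / n_chunks)    # ceil(3.125) = 4
chunks = [pages[i * size:(i + 1) * size] for i in range(n_chunks)]
# chunks[6] == [25]; chunks[7] == [] -> the empty chunk that triggers the exception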

Simply using NumPy's np.array_split prevents such edge cases.
Old output range: [[1, 2, 3, 4], [5, 6, 7, 8],..., [21, 22, 23, 24], [25]]
New output range: [[1, 2, 3, 4], [5, 6, 7],..., [20, 21, 22], [23, 24, 25]]
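For reference, a sketch of the NumPy call (kba confirms below that np.array_split is what takes care of these edge cases):

import numpy as np

pages = list(range(1, 26))   # 25 pages
chunks = [chunk.tolist() for chunk in np.array_split(pages, 8)]
# [[1, 2, 3, 4], [5, 6, 7], ..., [23, 24, 25]] -- chunk sizes differ by at most 1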

Of course, it is also possible to have chunks divided based on leaps, e.g. [1]:

def chunks(l, amount):
    """Yield `amount` sequential chunks from `l`, spreading the remainder evenly."""
    if amount < 1:
        raise ValueError('amount must be a positive integer')
    chunk_len = len(l) // amount   # base chunk size
    leap_parts = len(l) % amount   # leftover items to distribute
    remainder = amount // 2        # make it symmetrical
    i = 0
    while i < len(l):
        remainder += leap_parts
        end_index = i + chunk_len
        if remainder >= amount:    # time to emit one longer chunk
            remainder -= amount
            end_index += 1
        yield l[i:end_index]
        i = end_index

would produce:
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12, 13], [14, 15, 16], [17, 18, 19], [20, 21, 22], [23, 24, 25]]
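The output above can be reproduced with a one-line driver:

print(list(chunks(list(range(1, 26)), 8)))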

@MehmedGIT MehmedGIT requested a review from kba December 4, 2023 16:13
@MehmedGIT (Contributor, Author) commented Dec 5, 2023

> If I understand it correctly, parameter l is a list and amount is the number of chunks we want to have. Isn't it easier if we pass in the size of a chunk instead of the number of chunks?
>
> So, we can have a function which splits a list into n-size chunks.

Easier - yes; useful - not as much. The initial idea for dividing workspaces into chunks was to let users divide their workspaces into equal chunks based on the number of CPU cores they have, without having to know how many pages a workspace contains. With the n-size chunks approach, the user still needs to know the total number of pages and decide the N value per workspace to get optimal performance for the number of CPU cores they use. Of course, list-page could also support n-size chunk division as an additional option if that would be useful. I could implement that.
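A minimal sketch of the n-size variant discussed here (the name chunks_of_size is illustrative, not from the PR):

def chunks_of_size(l, n):
    """Yield successive chunks of n items from l; the last chunk may be shorter."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

print(list(chunks_of_size(list(range(1, 26)), 4)))
# [[1, 2, 3, 4], [5, 6, 7, 8], ..., [21, 22, 23, 24], [25]]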

@tdoan2010 (Contributor) commented Dec 5, 2023

Yeah, just realized that. Btw, I found this solution, which does not require numpy. It seems fine to me. Did you try it?

def chunks(l, n):
    """Yield n sequential chunks from l."""
    d, r = divmod(len(l), n)  # the first r chunks get d+1 items, the rest get d
    for i in range(n):
        si = (d + 1) * (i if i < r else r) + d * (0 if i < r else i - r)
        yield l[si:si + (d + 1 if i < r else d)]
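On the 25-page, 8-chunk example above, this yields the same balanced split as np.array_split:

print(list(chunks(list(range(1, 26)), 8)))
# [[1, 2, 3, 4], [5, 6, 7], ..., [23, 24, 25]]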

It could be even simpler if we don't need to have sequential chunks.

def chunks(l, n):
    """Yield n striped chunks from l (every n-th item, offset by i)."""
    for i in range(n):
        yield l[i::n]
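For the same input, the striped variant spreads the pages round-robin instead of sequentially:

print(list(chunks(list(range(1, 26)), 8)))
# [[1, 9, 17, 25], [2, 10, 18], [3, 11, 19], ..., [8, 16, 24]]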

@MehmedGIT (Contributor, Author) commented Dec 5, 2023

Ideally, sequential chunks are preferred when possible. They are also easier to reference: for example, it is easier to say that pages in the range 7..11 failed than to spell out the exact IDs in a failed range. It is also easier to rerun tasks on the failed range, e.g., the page_id would be just PHYS_0007..PHYS_0011 instead of writing out the full IDs of 5 pages. The sequence, of course, does not matter for workspaces where the IDs are hashes and the shortcut referencing does not work.

@kba (Member) left a comment

LGTM. Agree that using numpy for this (as we originally did) is the best solution. It is a simple enough problem, but there are so many edge cases that np.array_split already takes care of.

@kba kba merged commit 8171afb into master Dec 5, 2023
@kba kba deleted the fix-ranges-list-page branch December 5, 2023 13:59