@kba kba commented Nov 20, 2023

This extends ocrd workspace list-page with options for different output formats, for partitioning the list of pageIds into roughly equally sized chunks, and for restricting the output by pageId range or numerical range.

E.g. for a workspace with non-contiguous pageIds PHYS_0001..PHYS_0006, PHYS_0008..PHYS_0020, PHYS_0022..PHYS_0029 (i.e. PHYS_0007 and PHYS_0021 are missing, cf. the test workspace in the PR):

ocrd workspace list-page --help
Usage: ocrd workspace list-page [OPTIONS]

  List physical page IDs

Options:
  -f, --output-format [one-per-line|comma-separated|json]
                                  Output format
  -D, --chunk-number INTEGER      Partition the return value into n roughly
                                  equally sized chunks
  -C, --chunk-index INTEGER       Output the nth chunk of results, -1 for all
                                  of them.
  -r, --page-id-range TEXT        Restrict the pages to those matching the
                                  provided range, based on the @ID attribute.
                                  Separate start/end with ..
  -R, --numeric-range TEXT        Restrict the pages to those in the range, in
                                  numerical document order. Separate start/end
                                  with ..
  --help                          Show this message and exit.


# all IDs but comma-separated
ocrd workspace list-page -f comma-separated          
PHYS_0001,PHYS_0002,PHYS_0003,PHYS_0004,PHYS_0005,PHYS_0006,PHYS_0008,PHYS_0009,PHYS_0010,PHYS_0011,PHYS_0012,PHYS_0013,PHYS_0014,PHYS_0015,PHYS_0016,PHYS_0017,PHYS_0018,PHYS_0019,PHYS_0020,PHYS_0022,PHYS_0023,PHYS_0024,PHYS_0025,PHYS_0026,PHYS_0027,PHYS_0028,PHYS_0029

# all IDs but as JSON
ocrd workspace list-page -f json           
[["PHYS_0001", "PHYS_0002", "PHYS_0003", "PHYS_0004", "PHYS_0005", "PHYS_0006", "PHYS_0008", "PHYS_0009", "PHYS_0010", "PHYS_0011", "PHYS_0012", "PHYS_0013", "PHYS_0014", "PHYS_0015", "PHYS_0016", "PHYS_0017", "PHYS_0018", "PHYS_0019", "PHYS_0020", "PHYS_0022", "PHYS_0023", "PHYS_0024", "PHYS_0025", "PHYS_0026", "PHYS_0027", "PHYS_0028", "PHYS_0029"]]

# numeric page id range
ocrd workspace list-page -f comma-separated -R 5..20
PHYS_0006,PHYS_0008,PHYS_0009,PHYS_0010,PHYS_0011,PHYS_0012,PHYS_0013,PHYS_0014,PHYS_0015,PHYS_0016,PHYS_0017,PHYS_0018,PHYS_0019,PHYS_0020,PHYS_0022

# pageID range
ocrd workspace list-page -f comma-separated -r 'PHYS_0006..PHYS_0009'
PHYS_0006,PHYS_0008,PHYS_0009

# Partition into 5 chunks
ocrd workspace list-page -f comma-separated -D 5                     
PHYS_0001,PHYS_0002,PHYS_0003,PHYS_0004,PHYS_0005,PHYS_0006
PHYS_0008,PHYS_0009,PHYS_0010,PHYS_0011,PHYS_0012,PHYS_0013
PHYS_0014,PHYS_0015,PHYS_0016,PHYS_0017,PHYS_0018
PHYS_0019,PHYS_0020,PHYS_0022,PHYS_0023,PHYS_0024
PHYS_0025,PHYS_0026,PHYS_0027,PHYS_0028,PHYS_0029

# Partition into 5 chunks and output only the third chunk (chunk index 2, zero-based)
ocrd workspace list-page -f comma-separated -D 5 -C 2
PHYS_0014,PHYS_0015,PHYS_0016,PHYS_0017,PHYS_0018

This uses numpy.array_split to do the chunking.
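
As a rough illustration of how numpy.array_split distributes the 27 page IDs of the example workspace over 5 chunks, here is a minimal sketch. The helper name `chunk_page_ids` and its parameters are hypothetical and only mirror the `-D`/`-C` options; this is not the code from the PR.

```python
import numpy as np

# Page IDs of the example workspace (PHYS_0007 and PHYS_0021 are missing).
page_ids = [f"PHYS_{n:04d}" for n in range(1, 30) if n not in (7, 21)]


def chunk_page_ids(ids, chunk_number, chunk_index=-1):
    """Hypothetical helper: split ids into chunk_number chunks.

    Returns all chunks if chunk_index is negative, otherwise a list
    containing only the chunk at that zero-based index.
    """
    # array_split allows uneven splits: the first (len % n) chunks
    # receive one extra element.
    chunks = [c.tolist() for c in np.array_split(ids, chunk_number)]
    return chunks if chunk_index < 0 else [chunks[chunk_index]]


for chunk in chunk_page_ids(page_ids, 5):
    print(",".join(chunk))
```

With 27 IDs and 5 chunks, the first two chunks hold 6 IDs and the remaining three hold 5, which matches the `-D 5` output shown above.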

It's inefficient at the moment, since it uses find_files and then sorts the result, and I cannot rule out off-by-one errors in the indexing. But if the general behavior is what was wished for, then I can optimize it properly.
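
The `-r 'PHYS_0006..PHYS_0009'` example above returns both endpoints and skips the missing PHYS_0007, which suggests an inclusive range over the IDs that actually exist. A minimal sketch of that behavior, assuming the IDs are already in document order; `select_id_range` is a hypothetical name, not a function from the PR:

```python
# Page IDs of the example workspace (PHYS_0007 and PHYS_0021 are missing).
page_ids = [f"PHYS_{n:04d}" for n in range(1, 30) if n not in (7, 21)]


def select_id_range(ids, range_spec):
    """Hypothetical helper: return the IDs from start to end, inclusive.

    range_spec has the form 'START..END'; both IDs must exist in ids,
    otherwise list.index raises ValueError.
    """
    start, end = range_spec.split("..")
    lo, hi = ids.index(start), ids.index(end)
    return ids[lo:hi + 1]


print(",".join(select_id_range(page_ids, "PHYS_0006..PHYS_0009")))
# prints PHYS_0006,PHYS_0008,PHYS_0009
```

Making the inclusive end explicit with `hi + 1` is one way to keep the kind of off-by-one question mentioned above visible in the code.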

@kba kba requested a review from MehmedGIT November 20, 2023 18:58
@kba kba linked an issue Nov 20, 2023 that may be closed by this pull request
@kba kba merged commit 30e3763 into master Nov 23, 2023
Successfully merging this pull request may close these issues.

Suggestion: support split page ranges
