refactor/clean up batch-mode workflow #558

@Mikejmnez

Description

In the batch-mode machinery, there are several places where pydap's code should be refactored. Of particular importance is the step after downloading and deserializing data.

The following snippet of code is not production ready, even though its result is correct:

pydap/src/pydap/model.py

Lines 1166 to 1176 in 9dee9aa

```python
# Collect results
results_dict = {}
for var in variables:
    results_dict[var.id] = np.asarray(parsed_dataset[var.id].data[:])
    var._pending_batch_slice = None
    var._is_registered_for_batch = False
    self._batch_registry.discard(var)
    var._batch_promise = None
# Resolve the promise for all waiting arrays
batch_promise.set_results(results_dict)
```

The data is deserialized into numpy arrays (in memory) and held in a dictionary within the dataset object. A better way to handle the deserialized data, one that does not hold the arrays in memory, is the approach `Dap4BaseProxy` takes. For example:

```python
def decode_variable(buffer, start, stop, variable, endian):
    dtype = variable.dtype
    dtype = dtype.newbyteorder(endian)
    if dtype.kind == "S":
        data = numpy.array(decode_utf8_string_array(buffer)).astype(dtype.kind)
        data = data.reshape(variable.shape)
        return data
    else:
        data = numpy.frombuffer(buffer[start:stop], dtype=dtype)
        data = data.reshape(variable.shape)
        return DapDecodedArray(data)
```

This way, once the data has been read, it is no longer held in memory.
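The key property of the `numpy.frombuffer` path above is that it returns a zero-copy view over the response buffer rather than a fresh allocation. A minimal standalone sketch (stand-in data, not pydap's types):

```python
import numpy as np

# numpy.frombuffer returns a view over the underlying buffer rather than
# a copy, so decoding this way does not duplicate the payload in memory.
payload = np.arange(4, dtype="<i4").tobytes()
view = np.frombuffer(payload, dtype="<i4")

# The array does not own its data: it shares memory with `payload`.
print(view.flags.owndata)  # False
```

Because the view shares memory with the buffer, only one copy of the bytes ever exists, in contrast to the `np.asarray(...)` staging in the first snippet.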

The workflow then depends on a secondary function that retrieves the data from the batch promise and assigns it back to the original dataset as an in-memory numpy object. The function snippet is:

pydap/src/pydap/lib.py

Lines 449 to 452 in 9dee9aa

```python
for var in Variables:
    var = ds[var]
    data = promise.wait_for_result(var.id)
    ds[var.id].data = np.asarray(data)
```

Rather than having a dictionary retain the data arrays in memory and then fetching them and assigning them as in-memory numpy arrays, the arrays should be assigned to the dataset itself in the initial step of deserializing the DAP response.
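One possible shape for that refactor, sketched with stand-in objects rather than pydap's actual classes (the attribute names `_pending_batch_slice`, `_is_registered_for_batch`, and `_batch_promise` come from the snippets above; `resolve_batch` and `decode` are hypothetical):

```python
from types import SimpleNamespace
import numpy as np

def resolve_batch(variables, decode):
    """Hypothetical sketch: assign the decoded view directly to each
    variable during deserialization, instead of staging copies in an
    intermediate results_dict and copying them again later."""
    for var in variables:
        # decode() is assumed to return a zero-copy view over the
        # response buffer (as decode_variable above does).
        var.data = decode(var)
        # Clear the batch bookkeeping on the variable itself.
        var._pending_batch_slice = None
        var._is_registered_for_batch = False
        var._batch_promise = None

# Minimal demo with a stand-in variable and a shared response buffer.
buf = np.arange(6, dtype="<i4").tobytes()
v = SimpleNamespace(id="x", _pending_batch_slice=0,
                    _is_registered_for_batch=True, _batch_promise=object())
resolve_batch([v], lambda var: np.frombuffer(buf, dtype="<i4"))
```

With this shape there is a single assignment per variable and no secondary pass through `promise.wait_for_result`, so the dataset holds the only reference to each array.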

This must be fixed before any release. Proper testing should show that there is no explosion of RAM usage.
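One way such a memory check could look, assuming the decode path uses zero-copy views as above (a sketch using only the standard library's `tracemalloc`, not an existing pydap test):

```python
import tracemalloc
import numpy as np

# Sketch of the kind of regression test the issue asks for: decode a
# payload as a view and check that peak allocation stays far below a
# second full copy of the data.
payload = np.arange(1_000_000, dtype="<f8").tobytes()  # ~8 MB

tracemalloc.start()
view = np.frombuffer(payload, dtype="<f8")  # zero-copy decode
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The view allocates only a small array header, not another ~8 MB buffer.
assert peak < len(payload) // 2
```

A dict-staging implementation would allocate roughly the payload size again inside the traced region, so an assertion like this would catch a regression to the copying behavior.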
