Description
The batch-mode machinery has a few places where pydap code should be refactored. Of particular importance is the step right after downloading and deserializing data.
The following code snippet is not production ready, even though its result is correct:
Lines 1166 to 1176 in 9dee9aa
```python
# Collect results
results_dict = {}
for var in variables:
    results_dict[var.id] = np.asarray(parsed_dataset[var.id].data[:])
    var._pending_batch_slice = None
    var._is_registered_for_batch = False
    self._batch_registry.discard(var)
    var._batch_promise = None
# Resolve the promise for all waiting arrays
batch_promise.set_results(results_dict)
```
The data is deserialized into in-memory numpy arrays and held in a dictionary inside the dataset object. A better way to handle the deserialized data, one that does not keep the arrays resident in memory, is the approach Dap4BaseProxy takes. For example:
pydap/src/pydap/handlers/dap.py, Lines 876 to 886 in 9dee9aa
```python
def decode_variable(buffer, start, stop, variable, endian):
    dtype = variable.dtype
    dtype = dtype.newbyteorder(endian)
    if dtype.kind == "S":
        data = numpy.array(decode_utf8_string_array(buffer)).astype(dtype.kind)
        data = data.reshape(variable.shape)
        return data
    else:
        data = numpy.frombuffer(buffer[start:stop], dtype=dtype)
        data = data.reshape(variable.shape)
        return DapDecodedArray(data)
```
That way, once the data has been read, it is released from RAM and is no longer held in memory.
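How a decoded array actually frees its buffer is internal to DapDecodedArray, but the consume-once idea can be sketched as below; ConsumeOnceArray and its read() method are hypothetical illustration names, not pydap API:

```python
import numpy as np


class ConsumeOnceArray:
    """Hypothetical sketch of a consume-once wrapper: it hands out its
    numpy array a single time, then drops the reference so the decoded
    buffer can be garbage collected instead of lingering in RAM."""

    def __init__(self, data: np.ndarray):
        self._data = data

    def read(self) -> np.ndarray:
        if self._data is None:
            raise RuntimeError("array was already consumed")
        data, self._data = self._data, None  # release our only reference
        return data
```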
The workflow then depends on a secondary function that is called to retrieve the data from the batch promise and assign it back to the original dataset as an in-memory numpy object. The function snippet is:
Lines 449 to 452 in 9dee9aa
```python
for var in Variables:
    var = ds[var]
    data = promise.wait_for_result(var.id)
    ds[var.id].data = np.asarray(data)
```
Rather than having a dictionary retain the data arrays in memory, only to fetch them later and assign them as in-memory numpy arrays, the arrays should be assigned to the dataset itself from the initial step of deserializing the DAP response.
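A minimal sketch of that direction, reusing names from the first snippet above (variables, parsed_dataset, self._batch_registry, batch_promise); how the promise should signal waiters once the data lives on the dataset is an open design question, so this is an outline under assumptions, not the actual fix:

```python
# Hypothetical refactor sketch: attach the decoded arrays to the dataset
# variables directly, instead of staging eager numpy copies in a dictionary.
for var in variables:
    # parsed_dataset[var.id].data is assumed to hold the DapDecodedArray
    # returned by decode_variable(); assigning it directly avoids the
    # np.asarray() copy that pins the whole array in memory.
    var.data = parsed_dataset[var.id].data
    var._pending_batch_slice = None
    var._is_registered_for_batch = False
    self._batch_registry.discard(var)
    var._batch_promise = None
# Waiting arrays now only need to learn that their variable is ready;
# they read from the dataset instead of copying out of results_dict.
batch_promise.set_results({var.id: var.data for var in variables})
```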
This is a must before any release. Proper testing should show that there is no explosion of RAM usage.
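A hedged sketch of what such a test could look like, using tracemalloc (which tracks numpy buffer allocations); the server URL and variable name are placeholders, and the batch=True flag of open_url is assumed from pydap's batch-mode usage:

```python
import tracemalloc

import numpy as np
from pydap.client import open_url

# Placeholder endpoint and variable: substitute a real DAP4 dataset.
ds = open_url("http://example.com/dap4/dataset", batch=True)

tracemalloc.start()
data = np.asarray(ds["some_var"].data[:])  # triggers the batched download
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# If deserialized buffers are duplicated before being assigned back to the
# dataset, the peak climbs toward a multiple of data.nbytes; after the
# refactor it should stay close to a single copy of the requested data.
print(f"data: {data.nbytes} bytes, peak traced: {peak} bytes")
```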