
Conversation


@Humboldt-Penguin commented Mar 9, 2025

Right now, the dataset pyshtools.datasets.Mars.MOLA_shape has multiple resolutions available (5759, 2879, 1439, and 719), and chooses/downloads the smallest model which satisfies the user's input lmax. For example, if you call MOLA_shape(2000), it will download the 2879 model. However, if you then call MOLA_shape(1000), it will download the 1439 model instead of reusing the previously downloaded 2879 model. This commit modifies MOLA_shape so that, in the aforementioned scenario, it will reuse the 2879 model.

Note that this relies on dictionaries maintaining insertion order, which has been a language guarantee since Python 3.7 (and was an implementation detail of CPython 3.6).
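The selection logic described above can be sketched as follows. This is an illustrative standalone function, not the actual pyshtools internals; the names `pick_model` and `is_downloaded` are hypothetical.

```python
def pick_model(lmax, resolutions, is_downloaded):
    """Return the resolution to load for the requested lmax.

    resolutions   -- available max degrees, e.g. [5759, 2879, 1439, 719]
    is_downloaded -- callable: resolution -> bool (file already on disk?)
    """
    # Only models whose degree covers the requested lmax are candidates.
    sufficient = [res for res in resolutions if res >= lmax]
    if not sufficient:
        raise ValueError(f"no model satisfies lmax={lmax}")
    # Reuse the smallest sufficient model already on disk, if any ...
    local = [res for res in sufficient if is_downloaded(res)]
    if local:
        return min(local)
    # ... otherwise download the smallest sufficient model.
    return min(sufficient)

# The scenario from the PR description: the 2879 model was downloaded
# earlier, so a request for lmax=1000 reuses it instead of fetching 1439.
pick_model(1000, [5759, 2879, 1439, 719], {2879}.__contains__)  # -> 2879
```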


This change is only applied to Mars datasets. I'd be happy to apply this change to other datasets with multiple resolutions as well, just let me know.


Reminders

  • Base all changes on the develop branch: the master branch is used only when releasing new versions.
  • Run make check to ensure that the python code follows standard formatting conventions.
  • If adding new features, update the docstring to provide all information that is required to use the feature.

@MarkWieczorek
Member

Thanks for the PR. It looks good, but I'll need about 2 weeks before I can get to this. After we agree on this, we'll have to make changes to a couple similar datasets that do the same thing.

Author

Humboldt-Penguin commented Apr 25, 2025

> Thanks for the PR. It looks good, but I'll need about 2 weeks before I can get to this. After we agree on this, we'll have to make changes to a couple similar datasets that do the same thing.

@MarkWieczorek Thanks, I can get started on applying this to other datasets if you'd like. I see a total of nine datasets which offer multiple resolutions.

Just let me know which of these two implementations you'd prefer:

  1. Copy-paste the full logic from my PR into each existing loading function, or
  2. Move all of this logic into a new function (perhaps in pyshtools/utils/), which would take a generic pooch.Pooch object with an existing registry, and return the appropriate filename.

The latter is more abstracted/complicated, but since it would apply to all nine datasets, it would hopefully make future adjustments much easier (cleanness and aesthetics are a bonus). Like I said, I'm happy to implement either one; I just want to avoid stepping on any toes.
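To make option 2 concrete, here is one possible shape for such a shared helper. The function name, parameters, and degree extraction are illustrative assumptions, not the actual implementation: it takes the filename-to-degree mapping derived from a pooch registry plus the set of files already in the local cache, and returns the filename to load.

```python
def choose_filename(lmax, degrees_by_file, local_files):
    """Pick a registry filename for the requested lmax.

    degrees_by_file -- dict mapping registry filename to its max degree
    local_files     -- filenames already present in the local cache
    """
    # Restrict to models whose degree covers the requested lmax.
    sufficient = {f: d for f, d in degrees_by_file.items() if d >= lmax}
    if not sufficient:
        raise ValueError(f"no model satisfies lmax={lmax}")
    # Prefer the smallest sufficient file that is already local ...
    local = [f for f in sufficient if f in local_files]
    if local:
        return min(local, key=sufficient.get)
    # ... otherwise the smallest sufficient file to download.
    return min(sufficient, key=sufficient.get)
```

A caller would build `degrees_by_file` from the pooch object's registry and pass the set of files present in its cache directory; because the helper takes plain dicts and sets, it makes no assumption about the registry's ordering.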

Thanks again for your time.

Member

MarkWieczorek commented Apr 27, 2025

Thanks for your effort on putting this together, and sorry about how long it took to get back to you.

I am thinking that it might be best to create some kind of small helper function to find out if a suitable local file already exists. But instead of putting it in the utils module, what if we just created a hidden function in the pyshtools/datasets/__init__.py file? This won't be used for anything else outside of the datasets module, so I don't think we really need to put it in a different module. Also, at some point in the future I am thinking of breaking this project into several smaller projects (python classes, fortran backend, and datasets+constants). Keeping this function in the datasets module would help when that happens.

I just have one comment about this comment in the code:

> Note that from python 3.6 onwards, dictionaries maintain order of
> declaration/insertion, so this will go from largest model to smallest.

I think that you are assuming that the registry will always be ordered like in this example. Maybe that is ok, but I am just worried that in the future someone might use an inverted order. Perhaps we shouldn't make any assumptions on the ordering.

What do you think?

@Humboldt-Penguin force-pushed the improve_dataset_downloading branch from 087eb30 to 0d90fbc on August 24, 2025 03:59
…isting high-res models instead of downloading lower-res models.

- Context:
    - Some datasets have multiple resolutions, e.g. `pyshtools.datasets.Mars.MOLA_shape` has degrees 5759, 2879, 1439, and 719.
- Current behavior:
    - When a user loads a dataset, pyshtools chooses the smallest model which satisfies the user's input lmax -- for example, calling `MOLA_shape(2000)` will download the 2879 model. However, if you then call `MOLA_shape(1000)`, pyshtools will download the 1439 model. Reusing the previously downloaded 2879 model would save space and time.
- Improvement:
    - This commit makes it so that, in the aforementioned scenario, pyshtools will load the existing 2879 model.
    - In terms of code, all datasets with multiple resolutions now call the method `_choose_sh_model(...)` (located in `/pyshtools/datasets/_utils.py`). This takes the user's desired lmax and a pooch object containing available models, and either loads any existing model with sufficient resolution, or downloads the smallest model with enough resolution.
@Humboldt-Penguin force-pushed the improve_dataset_downloading branch from 0d90fbc to b38cdc9 on August 24, 2025 04:20
Author

Humboldt-Penguin commented Aug 24, 2025

@MarkWieczorek Apologies for the delay. I've cleaned up this PR so it's a single commit with all intended changes.


To summarize (since it's been a while):

  • The goal of this PR is to modify downloading logic for datasets with multiple resolutions to reuse existing high-res model files instead of unnecessarily downloading additional low-res model files.
    • For example, let's say you've called MOLA_shape(2000), which downloaded the degree 2,879 model. If you then call MOLA_shape(1000), pyshtools will download the 1,439 model. Reusing the previously downloaded 2,879 model would save space and time, which is what this PR aims to implement.
  • Initially, this PR only applied the change to Mars.MOLA_shape(...), with logic written directly into the function.
  • Now, based on your feedback, this PR applies the change to all datasets with multiple resolutions via a helper function _choose_sh_model(...).
    • In your previous comment (April 27), you suggested creating a helper function in pyshtools/datasets/__init__.py. However, this leads to circular import errors $^{\dagger}$ unless you define the helper function at the very top of __init__.py, which would push existing imports down ~150 lines. Aesthetically, this doesn't feel right to me, so I placed the helper function in a new file pyshtools/datasets/_utils.py — let me know if that's okay with you.
    • In your previous comment (April 27), you suggested the helper function should NOT assume the filenames in the registry are already sorted from largest to smallest max degree. Therefore, _choose_sh_model(...) automatically determines the degree of each model by assuming all filenames are identical except for a single integer which is taken as the degree. For example, given two filenames Planet_DEM128_314.sh.gz and Planet_DEM128_2718.sh.gz, we identify the degrees as 314 and 2,718 respectively.
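The degree-extraction idea above can be sketched as follows. This is an illustrative standalone function (the name `degrees_from_filenames` is hypothetical): it assumes all registry filenames are identical except for a single integer field, and identifies the degree as the one integer that varies across files.

```python
import re

def degrees_from_filenames(filenames):
    """Map each filename to its max degree, assuming the names differ
    only in a single integer field."""
    # Every run of digits in each filename, field by field.
    numbers = [re.findall(r"\d+", f) for f in filenames]
    n_fields = {len(nums) for nums in numbers}
    if len(n_fields) != 1:
        raise ValueError("filenames do not share a common pattern")
    # The degree is the integer field whose value varies across files.
    varying = [i for i in range(n_fields.pop())
               if len({nums[i] for nums in numbers}) > 1]
    if len(varying) != 1:
        raise ValueError("could not identify a unique degree field")
    i = varying[0]
    return {f: int(nums[i]) for f, nums in zip(filenames, numbers)}

# The example from the PR discussion: "128" is shared by both names,
# so the varying field (314 vs 2718) is taken as the degree.
degrees_from_filenames(["Planet_DEM128_314.sh.gz",
                        "Planet_DEM128_2718.sh.gz"])
```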
$^{\dagger}$: circular import error explanation

If _choose_sh_model(...) is defined in pyshtools/datasets/__init__.py, then:

  1. __init__.py calls from . import Mercury
  2. Mercury.py calls from . import _choose_sh_model
  3. The previous step only succeeds if __init__.py has already bound _choose_sh_model (i.e. that function is defined before calling from . import Mercury) — if it hasn’t yet, we get a circular import error.

Everything builds just fine in a Docker container, make check passes, and the full output of make python-tests is pasted here if you'd like to review: https://gist.github.com/Humboldt-Penguin/abdf2e2d10f8a8a615669e67666cdcda

If everything looks good on your side, this should be ready to merge. I'm happy to make any further tweaks (definitely more timely than before), just let me know.
