Conversation

@ThomasMGeo

This PR adds IO for the GXF file format, a common file type for grav/mag data in USGS releases. It's an ASCII format with a variable header structure.

On the dataset I tested this on, the grid comes out upside down when decoded to xarray. That's not impossible to fix (see the sketch below), but I do like keeping the option of NumPy-plus-metadata output.
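For reference, the flip itself would be a one-liner once the grid is in xarray (sketch; assumes the vertical coordinate ends up named northing):

# Reverse the row order so the grid comes out right side up.
grid = grid.isel(northing=slice(None, None, -1))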

I bet this could be optimized further; I'll try to find more datasets to test with (the smaller the better!). Any comments/optimizations are welcome.

Relevant issues/PRs: this addresses #538, which has some more information about the format itself.

@ThomasMGeo
Author

I found a small file that works from here: https://pubs.usgs.gov/of/2000/ofr-00-0198/html/wyoming.htm

Specifically: mag5001_gxf.gz | 2000-04-24 16:50 | 57K

Potential citation:

Kucks, R.P., and Hill, P.L., 2000, Wyoming aeromagnetic and gravity maps and data—A web site for distribution of data: U.S. Geological Survey Open-File Report 00-0198, https://pubs.usgs.gov/of/2000/ofr-00-0198/html/wyoming.htm

@ThomasMGeo
Author

ToDo: Licensing of data file from USGS.

@ThomasMGeo
Author

ToDo: Add attribution to gist.

@ThomasMGeo
Author

For USGS data, it looks like we just need to add attribution as well: https://www.usgs.gov/faqs/are-usgs-reportspublications-copyrighted

@ThomasMGeo
Author

@santisoler I added some text about the USGS; the attribution to the gist was already there. Let me know if I need to do anything else! I probably won't edit this further until I have some more feedback.

@ThomasMGeo
Author

@santisoler, is there anything else you want to see?

@santisoler
Member

Hi @ThomasMGeo. Thanks for pushing these changes. Let me take a look at this and I'll come back to you.

@santisoler
Member

Thanks @ThomasMGeo for opening this PR!

I took a quick look at it and I have the following comments.

I'm not sure what the benefit is of not making read_gxf always return an xarray.DataArray. Once we have the DataArray, it's trivial to get the underlying Numpy array and the metadata as a dict. And by doing so, we only need to add one public function to Harmonica (read_gxf) instead of two (read_gxf and gxf_to_xarray, which has only a single use). In my opinion we should make read_gxf behave like the other readers in Harmonica and return a DataArray.
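For example (hypothetical usage of the read_gxf this PR adds, assuming it returns a DataArray):

import harmonica as hm

grid = hm.read_gxf("mag5001.gxf")
array = grid.values          # the underlying Numpy array
metadata = dict(grid.attrs)  # the header metadata as a plain dict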

BTW, we need to remember to add any new public function to harmonica/__init__.py and to the API reference in doc/api/index.rst.

One little thing to note is that the docstrings should follow numpydoc's style, so they can be nicely rendered in Sphinx.

I left some more comments below.

Let me know what you think. We can iterate on it later! And feel free to ask for help if you need it.

Comment on lines 10 to 15
1. Grid eXchange Format (*.gxf)

GXF (Grid eXchange File) is a standard ASCII file format for
exchanging gridded data among different software systems.
Software that supports the GXF standard will be able to import
properly formatted GXF files and export grids in GXF format.
@santisoler
Member

I suspect that these paragraphs are coming from this PDF: https://pubs.usgs.gov/of/1999/of99-514/grids/gxf.pdf. Am I right?

Do we have permission to reproduce the same text here? Can we double-check the license of those files? If in doubt, I would remove these lines from this file; a reference should be enough. If we do have permission we can leave them, but I would make sure we are compliant with the license (reference the document, acknowledgement, include the license text, etc.).

@ThomasMGeo
Author

I am happy to remove it.

Comment on lines 187 to 188
if reading_data:
    data_list.append(line)
@santisoler
Member
Nov 26, 2024

From what I understood, these lines populate data_list with the values of the grid. Since the number of data lines will usually be much larger than the number of header lines, why not use np.loadtxt to read them? It performs much faster than appending lines to a list, already returns an array in the right shape, and has some sanity checks (that the number of elements per row is consistent) that we then don't have to take care of.

I think we could read the header lines until we find the #GRID line, keep a record of the number of header rows, and stop the iteration. Then we can use np.loadtxt to read the rest of the file by passing that count to skiprows (+/- 1; we need to figure out the correct indexing). A rough sketch is below.
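Something like this (hypothetical helper; assumes every data row holds the same number of values so np.loadtxt can parse the block):

import numpy as np

def read_grid_values(infile, nrows, ncols):
    # Count header lines up to and including the #GRID marker.
    header_lines = 0
    with open(infile, "r") as f:
        for header_lines, line in enumerate(f, start=1):
            if line.strip() == "#GRID":
                break
    # Let np.loadtxt parse the data block that follows the header.
    values = np.loadtxt(infile, skiprows=header_lines)
    return values.reshape(nrows, ncols)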

Comment on lines 238 to 240
# Get grid dimensions
nrows = int(headers['ROWS'])
ncols = int(headers['POINTS'])
@santisoler
Member

In lines https://github.com/fatiando/harmonica/pull/539/files#diff-5ac31ec9d8d11b47ff39f59bf25e1cad43484857580dd7b6c65cddd8e0f504cfR219-R229 we populate the metadata dictionary with parsed values (float, int, stripped str). Isn't it better to use metadata to get the number of rows and columns, instead of having to parse them again as ints?

That being said, maybe we want the keys in metadata to be lowercase (as you do in the following lines with nx, ny, x_inc, etc.).
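Something along these lines (sketch; key names are hypothetical, assuming the values were already parsed as ints when metadata was built):

# Reuse the already-parsed metadata instead of casting again.
nrows = metadata["rows"]
ncols = metadata["points"]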

Comment on lines 283 to 308
def get_grid_info(metadata: Dict[str, Any]) -> None:
    """
    Print comprehensive information about the GXF grid.

    Parameters:
        metadata (dict): Metadata dictionary from read_gxf
    """
    print("=== Grid Information ===")
    print(f"Title: {metadata.get('TITLE', 'Not specified')}")
    print(f"Grid size: {metadata['nx']} x {metadata['ny']} points")
    print(f"Cell size: {metadata['x_inc']} x {metadata['y_inc']}")
    print(f"Origin: ({metadata['x_min']}, {metadata['y_min']})")
    print(f"Rotation: {metadata.get('ROTATION', 0)}")
    print(f"Dummy value: {metadata.get('DUMMY', 'Not specified')}")

    if 'projection' in metadata:
        print("\n=== Projection Information ===")
        print(f"Type: {metadata['projection']['type']}")
        print(f"Units: {metadata['projection']['units']}")

        params = metadata['projection']['parameters']
        if any(params.values()):
            print("\nProjection Parameters:")
            for key, value in params.items():
                if value is not None:
                    print(f"{key}: {value}")
@santisoler
Member

I think this function is nice for debugging the reader. But I hardly see users using it (we would need to expose it as well). Instead, if read_gxf returns a DataArray, then we don't need this function: xarray can nicely display all the data.

What do you think? Is there any special reason to include this function?

@ThomasMGeo
Author

We can drop this!

@ThomasMGeo
Author

Hi @santisoler, thanks for the review. I just want to touch on a few comments before I dig into the rest of the issues.

> I'm not sure what the benefit is of not making read_gxf always return an xarray.DataArray

I am happy to have it be a very plain DataArray without the CRS (usually the CRS is not given as a code in the files that I have seen). For some of my work, I have had to do some weird warping to get them to match, but this might be rare and isolated.

The get_grid_info function was just used to figure out the projections (again, these are not standard CRSs with modern EPSG codes from what I have seen). So my general workflow would be to load it as a numpy array plus metadata, look at that output, and then load it into xarray with the EPSG code attached (roughly like the sketch below).
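Roughly (sketch; the EPSG code is just an example, assumes read_gxf returns a DataArray, and uses rioxarray to attach the CRS worked out by hand):

import harmonica as hm
import rioxarray  # noqa: F401  (registers the .rio accessor)

grid = hm.read_gxf("mag5001.gxf")
print(grid.attrs)  # eyeball the projection metadata
grid = grid.rio.write_crs("EPSG:32613")  # attach the CRS figured out manually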

Let me know what you think is best, and I can tackle it!

@santisoler
Member

Hi @ThomasMGeo. I see your point. Since the gxf files don't have a strict specification for setting the CRS, I think we should just add to the attrs of the xarray.DataArray whatever information we can find in the header of the file. If that information is enough to determine the CRS, that's cool. But if it's not, too bad. I don't think there's anything the reader function can do to solve that, am I right?

I do like having the information about the projection in the metadata. A nested dict (like the one you already implemented) should be OK (we might want to make sure that structure follows the CF metadata conventions).

Another thing we discussed today during the meeting is to use easting and northing for the x and y directions, but use x and y in case the grid is rotated (very much like the read_oasis_grd function does). Something like the sketch below.
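Maybe this (sketch; assumes the rotation angle ends up in the parsed metadata under a rotation key):

# easting/northing for unrotated grids, plain x/y otherwise,
# mirroring read_oasis_grd.
dims = ("northing", "easting") if metadata["rotation"] == 0 else ("y", "x")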

@ThomasMGeo
Author

Awesome, that all makes sense. Let me work on it for a bit and get back to you. Thanks for giving it a look and for all the feedback! :)

@ThomasMGeo
Author

OK, it's been a long time, but check out some of the changes. Happy for any feedback!

@santisoler
Member

Hi @ThomasMGeo. Thanks for pinging me back. I'm sorry for the delayed reply. I'll try to provide some feedback soon.

In the meantime, I noticed that the grid values in the GXF files are not laid out with a number of rows and columns that matches the shape of the grid. They seem to be a flattened array with 5 values per row, plus some rows with fewer than 5 values. That's annoying, and it basically rules out my idea of using np.loadtxt: we cannot use it in the general case, only when the number of points per row is a multiple of 5. Sorry for bringing that up. A possible workaround is sketched below.
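For the general case we'd probably have to collect the values ourselves and reshape at the end, something like (sketch; assumes infile, nrows and ncols are already known):

import numpy as np

values = []
in_data = False
with open(infile, "r") as f:
    for line in f:
        if in_data:
            # Collect every value, however many sit on each row.
            values.extend(float(v) for v in line.split())
        elif line.strip() == "#GRID":
            in_data = True
grid = np.asarray(values).reshape(nrows, ncols)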

Also, I think we can safely remove the read_gxf_raw function, right? I don't see it being used elsewhere.

I'll leave a comment below with one more thing I saw while quickly looking at the code.

I'll try to come back soon to this PR.

Comment on lines 232 to 233
with open(infile, 'r') as f:
    lines = f.readlines()
@santisoler
Member

I'd recommend avoiding this. readlines loads every line of the file into memory as a string, so for a large file we'd use more memory than needed. I'd recommend iterating over the lines one by one instead. Something like:

with open(infile, "r") as f:
    for line in f:
        ...

@ThomasMGeo
Author

This should be fixed!

@ThomasMGeo
Copy link
Author

Deleted the unnecessary function and test.

Pulling imports to the top.
