geliAIMila

When working on machine learning projects with geospatial data, it is common to train a model on multiple input features, each representing a different observed aspect. Collectively, these features form a high-dimensional data space that contributes to the model's predictions. Instead of analyzing each dimension individually, it can be useful to perform variogram analysis on the high-dimensional data as a whole.

The current codebase can be easily extended to this high-dimensional case by rewriting the semivariance formula as:
$$\gamma(h)= \frac{1}{2N(h)}\sum_{i=1}^{N(h)}\lVert Z(x_i)-Z(x_i+h) \rVert_2^2 ,$$
where:

  • $\gamma$ is the semi-variance.
  • $h$ is the separation distance of the data pair.
  • $N(h)$ is the total number of pairs separated by distance $h$.
  • $Z(x) = [z_0(x), z_1(x), \dots, z_{M-1}(x)] \in \mathbb{R}^{M}$ is the feature vector observed at location $x$, and $\lVert \cdot \rVert_2$ denotes the $L_2$ norm.
  • $M$ is the length of the feature vector, i.e. the number of input features.

This is implemented in the code as:
`diffs = np.sqrt(np.sum(diffs**2, axis=0))`, which calculates the $L_2$ norm of the differences between feature vectors.
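To make the formula concrete, here is a minimal standalone NumPy sketch of the aggregated semivariance. It is not the code in this PR: the function name `aggregated_semivariance`, the `(n, M)` layout of `values` (hence `axis=1` rather than `axis=0`), and the explicit `bin_edges` argument are all assumptions made for illustration.

```python
import numpy as np


def aggregated_semivariance(coords, values, bin_edges):
    """Empirical semivariance of vector-valued observations per lag class.

    coords:    (n, d) array of sample locations
    values:    (n, M) array, one feature vector Z(x) per location (assumed layout)
    bin_edges: increasing 1D array of lag-class edges
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)

    # Indices of all unique point pairs (i < j)
    i, j = np.triu_indices(len(coords), k=1)

    # Separation distance h of each pair
    h = np.linalg.norm(coords[i] - coords[j], axis=1)

    # L2 norm of the feature-vector difference of each pair
    diffs = np.sqrt(np.sum((values[i] - values[j]) ** 2, axis=1))

    gamma = np.full(len(bin_edges) - 1, np.nan)
    for k in range(len(bin_edges) - 1):
        in_bin = (h > bin_edges[k]) & (h <= bin_edges[k + 1])
        n_pairs = np.count_nonzero(in_bin)
        if n_pairs > 0:
            # gamma(h) = 1 / (2 N(h)) * sum of squared L2 norms
            gamma[k] = np.sum(diffs[in_bin] ** 2) / (2 * n_pairs)
    return gamma
```

For $M = 1$ this reduces to the usual univariate estimator, since the $L_2$ norm of a scalar difference is its absolute value. Lag classes without any pairs are returned as `NaN`.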

@geliAIMila changed the title from "Variogram analysis on high dimenion data" to "Variogram analysis on high dimensional data" on Jun 5, 2025
@mmaelicke
Owner

Hi Ge Li,

Thanks for this really interesting PR!

Could you point me to a reference with some additional background? That would help me understand the use case better.
I see the appeal; however, my only concern is that multi-dimensional input is currently used for creating cross-variograms. So we would end up with two parameters that are mutually exclusive, and I fear it could get confusing to parameterize the variogram properly.
I would suggest that we introduce a new parameter that specifies how multivariate data should be used, i.e. multivariate='cross' or multivariate='aggregate', if you feel 'aggregate' describes the action well. Maybe 'norm' would be better....
It should only be a small adaptation, but I think the interface would stay cleaner-ish.
What do you think?

@geliAIMila
Author

geliAIMila commented Jun 5, 2025

Hello Mirko,

Thanks for your quick response.
Background
I am working on machine learning for geospatial/geoscience projects. It is very common to observe spatial autocorrelation, i.e. the observations of a variable are not independent across space but are more similar at nearby locations than at distant ones. Under these circumstances, it is problematic if two spatially adjacent observations end up in the training and validation sets, respectively, during data splitting, as this leads to data leakage. A useful way to address this is an exclusion-zone-based splitting strategy, which excludes training samples within a specific radius of the testing samples. However, it is hard to determine the size of the exclusion zone. Maybe we can use the effective range estimated by a variogram analysis. Since this is a machine learning problem in a high-dimensional data space, it would be useful to perform the analysis in that high-dimensional space as well.
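As a rough illustration of the exclusion-zone idea, a sketch along these lines could drop training candidates that lie too close to any test sample. The helper name `exclusion_zone_split` and its arguments are hypothetical, and using the variogram's effective range as `radius` is only the suggestion made above, not an established rule.

```python
import numpy as np


def exclusion_zone_split(coords, test_idx, radius):
    """Return (train_idx, test_idx) with an exclusion zone around test samples.

    coords:   (n, d) array of sample locations
    test_idx: indices of the samples held out for testing
    radius:   exclusion radius, e.g. an effective range (assumption)
    """
    coords = np.asarray(coords, dtype=float)
    test_idx = np.asarray(test_idx, dtype=int)

    is_test = np.zeros(len(coords), dtype=bool)
    is_test[test_idx] = True

    # Distance from every sample to every test sample, shape (n, n_test)
    test_coords = coords[test_idx]
    d = np.linalg.norm(coords[:, None, :] - test_coords[None, :, :], axis=2)

    # A sample is inside the exclusion zone if any test sample is within radius
    near_test = (d <= radius).any(axis=1)

    # Keep only samples that are neither test samples nor inside the zone
    train_idx = np.flatnonzero(~is_test & ~near_test)
    return train_idx, test_idx
```

The dense distance matrix needs memory proportional to the number of samples times the number of test samples, so for large datasets a spatial index such as a KD-tree would likely be a better fit.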

Code

Yes, introducing a multivariate parameter is a great idea. I prefer multivariate='aggregate'. I have also updated the code accordingly.

Ge
