Variogram analysis on high dimensional data #198

geliAIMila · 2025-06-04T23:52:35Z

When working on machine learning + geospatial projects, it is common to train a model on multiple input features, each representing a different observed aspect. Collectively, these features form a high-dimensional data space that contributes to the model's predictions. Instead of analyzing each dimension individually, it can be useful to perform variogram analysis on this high-dimensional data directly.

The current codebase can be easily extended to this high-dimensional scenario by rewriting the semivariance formula as:
$$\gamma(h)= \frac{1}{2N(h)}\sum_{i=1}^{N(h)}\lVert Z(x_i)-Z(x_i+h) \rVert_2^2 ,$$
where:

- $\gamma$ is the semivariance.
- $h$ is the separation distance of the data pair.
- $N(h)$ is the total number of pairs separated by distance $h$.
- $Z(x) = [z_0(x), z_1(x), \dots, z_{M-1}(x)] \in \mathbb{R}^{M}$ is the feature vector observed at location $x$, and $\lVert \cdot \rVert_2$ denotes the $L_2$ norm.
- $M$ is the length of the feature vector, i.e. the number of input features.

This is implemented in the code as `diffs = np.sqrt(np.sum(diffs**2, axis=0))`, which calculates the $L_2$ norm of the differences between feature vectors.
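For illustration, the formula above can be sketched as a small standalone NumPy function. This is not the code from the PR: the function name `aggregate_semivariance`, its arguments, and the half-open lag bins `(lower, upper]` are assumptions made only for this example.

```python
import numpy as np


def aggregate_semivariance(coords, values, bin_edges):
    """Empirical semivariance from the squared L2 norm of feature differences.

    coords:    (n, d) array of sample locations
    values:    (n, M) array, one M-dimensional feature vector per location
    bin_edges: increasing 1D array of lag-bin edges (hypothetical binning)
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)

    # All unique point pairs (i < j)
    i, j = np.triu_indices(len(coords), k=1)

    # Separation distance of each pair
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)

    # Squared L2 norm of the feature-vector difference of each pair
    sq_norm = np.sum((values[i] - values[j]) ** 2, axis=1)

    gamma = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        in_bin = (dist > bin_edges[b]) & (dist <= bin_edges[b + 1])
        if in_bin.any():
            # gamma(h) = 1 / (2 N(h)) * sum of squared norms in the bin
            gamma[b] = sq_norm[in_bin].sum() / (2 * in_bin.sum())
    return gamma
```

With $M = 1$ the squared norm is just the squared difference of a single variable, so the sketch reduces to the usual univariate estimator. Bins that contain no pairs are left as `NaN`.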

mmaelicke · 2025-06-05T13:41:10Z

Hi Ge Li,

Thanks for this really interesting PR!

Could you provide a reference detailing some additional info for me? That would be very helpful to understand the use-case better.
I see the appeal. My only concern is that a multi-dimensional input is currently used for creating cross-variograms, so we would end up with two mutually exclusive parameters, and I fear it could get confusing to parameterize the variogram properly.
I would suggest that we introduce a new parameter that specifies how multivariate data should be used, i.e. `multivariate='cross'` or `multivariate='aggregate'`, if you feel 'aggregate' describes the action well. Maybe 'norm' would be better...
It should only be a small adaptation, but I think the interface would stay cleaner-ish.
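A rough sketch of how such a switch could behave is shown below. It is purely illustrative and not the library's API or internals: the helper `pair_contributions` is hypothetical, and the 'cross' branch uses the textbook cross-variogram summand (the product of the increments of two variables) rather than whatever the existing implementation does.

```python
import numpy as np


def pair_contributions(values, i, j, multivariate="aggregate"):
    """Per-pair summand of 2 * gamma(h) for paired points i and j.

    values: (n, M) array of observations (hypothetical layout)
    i, j:   integer index arrays of equal length identifying the pairs
    """
    values = np.asarray(values, dtype=float)
    diff = values[i] - values[j]  # shape (n_pairs, M)

    if multivariate == "aggregate":
        # Squared L2 norm across all M features
        return np.sum(diff ** 2, axis=1)

    if multivariate == "cross":
        # Textbook cross-variogram summand; only defined here for two variables
        if values.shape[1] != 2:
            raise ValueError("'cross' expects exactly two variables")
        return diff[:, 0] * diff[:, 1]

    raise ValueError(f"unknown multivariate mode: {multivariate!r}")
```

Keeping both behaviours behind one keyword means an invalid combination fails early with a clear error instead of being silently interpreted one way or the other.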
What do you think?

geliAIMila · 2025-06-05T20:58:19Z

Hello Mirko,

Thanks for your quick response.
Background
I am working on a machine learning project for geospatial/geoscience data. It is very common to observe spatial autocorrelation, i.e. observations of a variable are not independent across space: nearby locations tend to have more similar values than distant ones. In that case, it is problematic if two spatially adjacent observations end up in the training and validation sets, respectively, during data splitting, because this leads to data leakage. A useful way to address this is an exclusion-zone-based splitting strategy, which excludes training samples within a specific radius of the testing samples. However, it is hard to determine the size of the exclusion zone. One option is to use the effective range estimated from a variogram analysis. Since this is a machine learning problem in a high-dimensional data space, it would be useful to perform the analysis in high dimensions as well.
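To make the splitting strategy concrete, here is a minimal sketch. The function name `exclusion_zone_split` and its arguments are made up for this example; `radius` stands in for whatever exclusion-zone size is chosen, for instance an effective range read off a fitted variogram.

```python
import numpy as np


def exclusion_zone_split(coords, test_idx, radius):
    """Return training indices lying farther than `radius` from every test sample.

    coords:   (n, d) array of sample locations
    test_idx: indices of the samples held out for testing
    radius:   exclusion-zone radius, in the same units as coords
    """
    coords = np.asarray(coords, dtype=float)
    test_idx = np.asarray(test_idx, dtype=int)

    is_test = np.zeros(len(coords), dtype=bool)
    is_test[test_idx] = True

    # Distance from every sample to every test sample: shape (n, n_test)
    test_coords = coords[test_idx]
    dist = np.linalg.norm(coords[:, None, :] - test_coords[None, :, :], axis=2)

    # A sample is dropped if it falls inside the zone of any test sample
    too_close = (dist <= radius).any(axis=1)

    return np.flatnonzero(~is_test & ~too_close)
```

This builds the full distance matrix between all samples and the test samples, which is fine for small datasets; for large ones a spatial index would be the more practical choice.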

Code

Yes, it is a great idea to introduce a `multivariate` parameter. I prefer `multivariate='aggregate'`. I have updated the code accordingly.

Ge

geliAI and others added 3 commits November 18, 2024 12:01

support HD data [skip ci]

51fe29f

Update DirectionalVariogram

bd2588f

Make argument name more informative [skip ci]

1938485

geliAIMila changed the title ~~Variogram analysis on high dimenion data~~ Variogram analysis on high dimensional data Jun 5, 2025

introduce parameter multivariate

a00885f
