
Conversation

@knaaptime (Member) commented Aug 4, 2023

add first draft of correlogram

@knaaptime (Member Author)

supersedes #253

@knaaptime (Member Author)

there's nothing harder than pulling off a successful rebase. I'll die on this hill.

@jGaboardi (Member)

> there's nothing harder than pulling off a successful rebase. I'll die on this hill. (commit 5589648)

@knaaptime changed the title from "corrfix" to "add spatial correlogram function" on Aug 4, 2023

which concept of distance to increment. Options are {`band`, `knn`}.
by default 'band' (for `libpysal.weights.DistanceBand` weights)
statistic : str, by default 'I'
which spatial autocorrelation statistic to compute. Options in {`I`, `G`, `C`}
Member

could you add a key to what I, G, and C mean? i.e., saying that they are Moran's I, Geary's C, and Getis-Ord G?

Member

This is a nice suggestion.

raise ValueError("distance_type must be either `band` or `knn`")

# should be able to build the tree once and reuse it?
# but in practice, I'm not seeing any real difference from starting a new W from scratch each time
Member

Not even for large data? I suppose that repeated creation of the tree could make a dent in this.

Member

I'm a bit confused by @knaaptime's comment, and also this response. This should not be building a KDTree repeatedly, but it is instantiating a new W every time. The tree-rebuilding is avoided because the Distance classes (KNN, Kernel, and DistanceBand) all correctly take pre-built KDTrees as input. Build a tree once & it can get reused.

However, the corresponding query(), query_ball_point(), or sparse_distance_matrix() calls are repeated, which could be more efficient if we computed it once for the maximum distance/k value we need.
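
A minimal sketch of the reuse pattern described here, assuming (as stated above) that libpysal's distance-based weights classes accept a pre-built KDTree:

import numpy as np
from libpysal.cg import KDTree
from libpysal.weights import KNN, DistanceBand

coords = np.random.random((1000, 2))  # hypothetical point coordinates
tree = KDTree(coords)  # built once

# each weights object reuses `tree`; only the query step is repeated
w_knn = KNN(tree, k=5)
w_band = DistanceBand(tree, threshold=0.1)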

Member Author

I was "testing" on a dataset with ~65k points, but just not terribly thoroughly

what I meant was, I'd also done KNN.from_dataframe(df, k=k) in each loop, instead of W(tree), where the former needs to build the tree each time in addition to querying it. Then I figured it might be faster to reuse the same tree repeatedly instead.

what I was hoping for was something like @ljwolf said, where I'd do the expensive call once for the largest distance, then repeatedly use the same structure for repeated calls (sorta like how pandana works with preprocess). If we just loop over k sorted in descending order, does that help?

Member

(a) I think it's fine to keep this current implementation for now.
(b) yeah, basically: a more performant implementation would probably use max(distances) to compute .sparse_distance_matrix() (or build a sparse matrix from the output of query() if KNN) one time, and then progressively mask that based on each smaller distance value... basically, we want to query once, rather than repeat the query to yield results we could "precompute".
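
A minimal sketch of that query-once-then-mask idea (hypothetical code, not what this PR does), using scipy directly:

import numpy as np
from scipy.spatial import cKDTree

coords = np.random.random((1000, 2))
tree = cKDTree(coords)
distances = [0.05, 0.1, 0.2]

# one expensive query at the maximum distance...
dmat = tree.sparse_distance_matrix(tree, max(distances)).tocsr()

for d in distances:
    # ...then cheap masking for each smaller threshold
    sub = dmat.copy()
    sub.data[sub.data > d] = 0
    sub.eliminate_zeros()
    # `sub` now holds only pairs within distance d; build a W/Graph from it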

Member Author

gotcha. Same way I did it for the isochrones with pandana. Get the adjacency once and trim it down for each distance. So here it would be something like

tree = KDTree(coords)
distmat = tree.sparse_distance_matrix(tree, max(distances))

for dist in distances:
    w = Graph.build_knn(distmat, k=k)

or something?

Member Author

I guess what I'm asking is: what's the most efficient way to do this with the new Graph?

If we have a Graph based on the max distance, then it's already got all the information we need for every lesser distance, and all we need to do is filter the adjlist where w_ij < distance. Presumably that's what something like Graph.build_knn(distmat, k=k) would do when passed an adjlist or a sparse matrix, because all it needs to do is subset the existing data, not rebuild the distance relationships.

for this problem it's easy enough to just build the adjlist once and literally filter it down for each distance and instantiate a new Graph/W from that adjlist, but is that the best way?

Member

> for this problem it's easy enough to just build the adjlist once and literally filter it down for each distance and instantiate a new Graph/W from that adjlist, but is that the best way?

I suppose this would need to be tested. My sense is that filtering the adjacency based on a distance will be significantly slower than a tree query, but I might be wrong.

Member Author

so, I'm playing with this, and it would be straightforward to filter the adjlist, which is fast, since it's just adj[adj['weight'] <= distance].

...But once we subset the W, now we have to subset the original dataframe to align it, which is something we said we wouldn't do. Is there a good way to do this?

Member

To keep the structure, you can just assign weight=0 to everything above the set distance threshold.
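
A minimal sketch of that suggestion (variable names are hypothetical; assumes `w` is a distance-band W whose weights store distances):

from libpysal.weights import W

adj = w.to_adjlist()  # columns: focal, neighbor, weight
adj.loc[adj["weight"] > distance, "weight"] = 0  # zero out, don't drop, so indices stay aligned
w_d = W.from_adjlist(adj)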


distances = [i + 500 for i in range(0, 2000, 500)]

def test_correlogram():
Member

Do we have some ecosystem-wide standard on what the minimal acceptable test suite is?

Member

No... We've talked about separating "correctness" tests (does the statistic recover the same result every time?) from option tests/user tests (does the function handle all combinations of arguments/input types correctly?), but we've never agreed on what is necessary for any contribution...

Member Author

we should put this on the agenda for the monthly call and/or steering committee.

tbh I don't think an ecosystem-wide policy is realistic. We don't have the resources. There's a lot of heterogeneity in the maturity of different subpackages, so a lot of variation in how the lead maintainers want to manage their stack. Most often, I'm happy to accept new functionality once the test shows that it gets the 'right answer', even if I don't have the capacity to write a fully-parameterized test suite that hits every line--so that's how I manage tobler, segregation, geosnap, etc., otherwise they won't move forward. I know which lines are untested, so I'm ok with that until (a) an issue is raised, (b) I get some time to focus on testing, or (c) the total coverage drops too low (which sometimes triggers b).

Member

Yup, makes sense. I asked because I am not sure when I should be fine with the state of tests when doing a review, so an agreed guideline would be nice.

Member

Very much agree with Martin on this. Seems like our general rule of thumb has been that somewhere between 70% and 90% is acceptable coverage. I always try to get as close to 100% as possible to (attempt to) hit any weird edge cases, but I know that isn't required a lot of the time. If we want to decide on an actual number, this is surely something for at least the steering committee to vote on. Moreover, (1) as a step towards catching edge cases we may want to consider implementing hypothesis, and (2) we should improve testing across all submodules to properly test minimum requirements, as Martin points out here.

Member Author

> If we want to decide on an actual number, this is surely something for at least the steering committee to vote on. Moreover, (1) as a step towards catching edge cases we may want to consider implementing hypothesis, and (2) we should improve testing across all submodules to properly test minimum requirements, #258.

again, I don't think an ecosystem-wide policy is realistic, but it's definitely something we should discuss. The question is about cost-benefit. If we're too strict about test coverage, it will discourage contributions, and I think for many packages that's a greater risk than having some untested code. I don't have time to get absorbed in edge cases, so I'd prefer to let those issues surface before letting them block new functionality getting merged.

case in point, please feel free to write out the rest of the test suite here :)

Member

> If we want to decide on an actual number, this is surely something for at least the steering committee to vote on. Moreover, (1) as a step towards catching edge cases we may want to consider implementing hypothesis, and (2) we should improve testing across all submodules to properly test minimum requirements, #258.
>
> again, I don't think an ecosystem-wide policy is realistic, but it's definitely something we should discuss. The question is about cost-benefit. If we're too strict about test coverage, it will discourage contributions, and I think for many packages that's a greater risk than having some untested code. I don't have time to get absorbed in edge cases, so I'd prefer to let those issues surface before letting them block new functionality getting merged.
>
> case in point, please feel free to write out the rest of the test suite here :)

FWIW @knaaptime I am "on your side" here (though I think we're all on the same side). In my opinion, getting new functionality merged should take precedence over a strict coverage number for individual PRs, so long as that new functionality is being "mostly" covered (where "mostly" can either be more lines covered or majority cases covered). More testing can be added later as time & energy permit. I hope my comments did not come off as combative.

Member Author

😅 sorry my style is always blunt! no offense taken or intended. We're all on the same page. We need some guidelines on merging--definitely. I just want to avoid a situation where we let stuff sit here, because it's usually easier for us core maintainers to take incremental passes at improving test coverage, instead of imposing a big burden on the PR opener.

(for this one, I certainly don't need this PR merged, but while it's sitting here it makes for a good example. I already have the function myself, but if other folks want to use it, I'd prefer not to let corner cases prevent that, since I know I don't have a ton of time to write out the test suite. I can usually commit to responding to bug reports, though.)

Member

My question was primarily coming from my GeoPandas experience and the understanding that what we aim for there is not applicable in PySAL. So I was just wondering what level people feel is enough. Totally agree with all you said above; there's no need to block stuff just because tests are not at 100%. Happy to have a further chat about it during the next call. And happy to merge this with the test suite as is :).

Member

Once again I agree with Martin, but maybe after addressing #259 (comment)?

    STATISTIC = G
elif statistic == "C":
    STATISTIC = Geary
else:
Member

The extraction functions look sufficiently general to allow us to admit any callable like stat(x, w, **kwargs) that returns an object with attributes? There's nothing I/G/C specific in how the attributes are extracted.

I'd love to allow users to send a callable, since this gives us pretty major extensibility for free...
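
A minimal sketch of that dispatch (a hypothetical helper, not the PR's code), where strings map to the built-in classes and anything else is assumed to be a stat(y, w, **kwargs) callable:

from esda.geary import Geary
from esda.getisord import G
from esda.moran import Moran

_BUILTIN = {"I": Moran, "C": Geary, "G": G}

def _resolve_statistic(statistic):
    # strings dispatch to the built-ins; callables pass through untouched
    if isinstance(statistic, str):
        return _BUILTIN[statistic]
    return statistic

# usage: stat = _resolve_statistic("I"); result = stat(y, w); result.I, result.p_sim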

Member Author

cool, that makes sense. I think the first version I wrote actually took the Moran class as the first argument, but I thought the dispatching helped clarify what the function was doing. The internals ended up a lot more generic than I figured they'd need to be.

so the most abstract version returns the output of a callable for a range of W/Graph specifications (in table form), and there are probably plenty of uses for that. But then do we want to call the function something more generic, since it no longer necessarily computes some measure of correlation?

Member

Totally, and then define a specific correlogram() function that only supports Moran, spatial Pearson, or spatial tau?

For the general function, @TaylorOshan and I have been calling these "profile" functions (bottom of page 302), so I'm partial to distance_profile() or spatial_profile().

Member

Well, I guess even like.... I have a nascent implementation of a nonparametric robust Moran/Local Moran statistic that would benefit from being able to plug in directly? I think it's OK to call the function correlogram(), and then allow for a user-defined callable. The fact that it's user-defined means that we can't enforce that the output is a correlation, and this keeps things simpler for the user.

Member Author

profile is the same nomenclature over in segregation, but maybe we should centralize some of this logic then.

@knaaptime changed the title from "add spatial correlogram function" to "[WIP] add spatial correlogram function" on Aug 5, 2023

return (
    pd.DataFrame(outputs)
    .select_dtypes(["number"])
@knaaptime (Member Author) commented Aug 5, 2023

so, in the abstract version, this function becomes spatial_profile(callable, y, gdf, **kwargs), where callable is any function that takes (y, w) as arguments.

in that case, should we get rid of this .select_dtypes here, and return everything the callable has? Also, since this line requires a class, not a generic callable, does that need to be abstracted a bit?

Member

I think just keep it as correlogram. Allowing a user-defined function already means we can't assume/ensure the output is "correlation", but it's a relatively advanced way to work with the function.

And yes, the output type of the callable has to be a namespace/namedtuple/object kind of thing, but I think that's an ok requirement.
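
For example, a minimal sketch of a user-defined callable that meets this requirement (all names hypothetical): it returns an object with attributes, here a namedtuple:

from collections import namedtuple

import numpy as np
from libpysal.weights import lag_spatial

LagCorr = namedtuple("LagCorr", ["corr"])

def lag_correlation(y, w):
    # correlation between y and its spatial lag under weights w
    return LagCorr(corr=np.corrcoef(y, lag_spatial(w, y))[0, 1])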

@knaaptime (Member Author)

this should be just about good to go, but it's still failing for stuff like join counts, because the class stashes all sorts of things as attributes (like the W and its adjacency list) that can't get serialized by joblib, and LOSH/Spatial_Pearson, which have a different signature that needs w.sparse.

@knaaptime (Member Author) commented Aug 6, 2023 via email

@ljwolf (Member) commented Sep 24, 2025

Having my hands in this code now, I'd suggest the following argument renames:

  1. gdf should become geometry. We can always accept a GeoDataFrame and get geometry.geometry when needed, but users should also be able to pass gdf.geometry for explicitness.
  2. variable should be a series or array, not a string. This matches the API elsewhere in esda. This came from segregation, which is mostly gdf/str based, right?
  3. distances should be support. This will mimic other places (like in pointpats) where we use distances to refer to a pairwise distance matrix and support for the values at which a function is evaluated. Further, when distance_type=='knn', distances are not distances. Finally, this will also make it clearer for users to estimate a lowess on precomputed distances:
correlogram(
    gdf.geometry,
    gdf.variable,
    support=[0, 1, 2, 3],
    statistic="lowess",
    stat_kwargs=dict(
        metric="precomputed",
        coordinates=my_distance_matrix,
    ),
)

If support were instead named distances, things look confusing to me:

correlogram(
    gdf.geometry,
    gdf.variable,
    distances=[0, 1, 2, 3],  # why is this referring to specific values of...
    statistic="lowess",
    stat_kwargs=dict(
        metric="precomputed",
        coordinates=my_distance_matrix,  # ...this distance matrix, which is "coordinates"?
    ),
)

@knaaptime (Member Author)

this all makes sense to me. Thanks @ljwolf !

@knaaptime (Member Author)

@ljwolf thanks for the lowess version, looks great. Should we also implement/document that this version works a little differently from GeoDa, which, apart from using lowess, also seems to work on an expanding donut (i.e., it trims lower-bound distances as it moves outward)?

@ljwolf (Member) commented Sep 25, 2025

This does not use a donut. It's using the same lowess regression as in that section, estimating using only data near each support value.

But I think we must set the lowess frac parameter in a data-dependent fashion to match their behavior: it should depend on the data rather than being constant. Setting it to 1/n_bins/2 would estimate using only the data closest to each bin center.

@ljwolf (Member) commented Sep 25, 2025

To explain more fully:

the _lowess_correlogram() part estimates a lowess regression on the correlation and distance:

$z_i z_j = f(d_{ij}) + u_{ij}$

This is the equation listed in the GeoDa documentation. Unfortunately, we have to parameterise this a little differently from GeoDa no matter what we do, unless we want to implement our own lowess regressor or simply use the average (unregularized) correlation in each bin.

statsmodels lowess lets us specify the points at which the lowess is calculated (xvals) and what fraction of the data should be used around each point (frac). This second part is because their approach uses a (novel to me!) interval knn-based method to track points near xvals.
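
A minimal sketch of that statsmodels call (the data here are hypothetical): xvals picks the evaluation points, and frac controls the share of data used around each one:

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

d = np.random.random(500)          # pairwise distances (hypothetical)
zz = np.random.normal(size=500)    # z_i * z_j cross-products (hypothetical)
support = np.linspace(0.1, 0.9, 9)

smoothed = lowess(zz, d, frac=0.2, xvals=support)  # f evaluated at each support value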

Hence, we cannot quite do lowess over bins exactly like GeoDa; the bins must have dynamic width (in terms of distance) and represent a fixed fraction of the data. This has tradeoffs. So, I'm suggesting we deviate from GeoDa's parameterisation and make the following tradeoffs:

  1. set the end of the support to half the longest bounding box diagonal by default. This allows us to ensure coverage for both parametric and nonparametric calculations without requiring us to build the full distance matrix up front. But it's often longer than we want to visualize, since correlations often go to zero quickly.
  2. use only the upper triangle of the covariance and distance matrices if the distance matrix is symmetric. GeoDa docs suggest using the full matrix is not always the best choice, and that they use it to avoid having bins with too few observations. Given statsmodels lowess uses a knn local regression, we don't need to worry about having too few observations at each xval, and the duplication just slows the routine down. If the distance matrix is not symmetric (as might happen if metric='precomputed'), we use the full matrix anyway.
  3. set the lowess frac according to the typical fraction of data in each bin (sketched in code below). Here, note that support won't always span all distances. This minimizes the re-use of points (with no reuse when bins are constant width and span a subset of the data).
    (a) calculate the implied bin widths from the user-supplied support,
    (b) calculate the width of the first and last bin,
    (c) look up/down 1/2 the first/last bin width from the lowest/highest support value (bounded by zero on the left), and
    (d) calculate the fraction of distances that fall within this span: frac_span, the fraction of point pairs spanned by the bins implied by the support values.
    (e) Then, we set the lowess frac to frac_span/n_bins.

For evenly spaced bins, this will assign each observation to its nearest support, even if support.max()+1/2*bin_width leaves off some pairs. For unevenly-spaced support, this will recycle observations across bins. I don't think we can get around this with statsmodels' lowess.
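
A minimal sketch of the frac heuristic in steps (a)-(e) above (a hypothetical helper, not necessarily the PR's exact code):

import numpy as np

def lowess_frac(support, pair_distances):
    support = np.asarray(support, dtype=float)
    widths = np.diff(support)                  # (a) implied bin widths
    lo = max(support[0] - widths[0] / 2, 0)    # (b)/(c) half the first width below, floored at 0
    hi = support[-1] + widths[-1] / 2          # (b)/(c) half the last width above
    in_span = (pair_distances >= lo) & (pair_distances <= hi)
    frac_span = in_span.mean()                 # (d) fraction of pairs inside the span
    return frac_span / len(support)            # (e) frac per bin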

@ljwolf (Member) commented Sep 25, 2025

This now has full tests, and the nonparametric version is in the notebook. If we're fine with the defaults, this is ready to ship, I think.

@ljwolf (Member) commented Sep 25, 2025

The test failure looks to be a network error. I'm not sure how to re-run this? nvm, rerunning now.

@ljwolf changed the title from "[WIP] add spatial correlogram function" to "add spatial correlogram function" on Sep 25, 2025
@martinfleis (Member) left a comment

Pushed some minor fixes to the docs.

@knaaptime (Member Author)

teamwork ftw

@jGaboardi merged commit d89afe3 into pysal:main on Sep 25, 2025. 16 checks passed.