Preparations for multivariate plotting #29877

Open
wants to merge 8 commits into
base: main

Conversation

trygvrad
Contributor

@trygvrad trygvrad commented Apr 6, 2025

PR summary

This PR continues the work of #28658, #28454, and #29876, aiming to close #14168 (Feature request: Bivariate colormapping).

This is part two of the former PR, #29221, and builds upon #29876. Please see #29221 for the previous discussion.

#29876 includes:

  • A MultiNorm class. This is a subclass of colors.Normalize and holds n_variate norms.
  • Testing of the MultiNorm class

Features included in this PR:

  • changes to colorizer.py needed to expose the MultiNorm class

Features not included in this PR:

  • Exposing the functionality provided by MultiNorm together with BivarColormap and MultivarColormap to the plotting functions axes.imshow(...), axes.pcolor(...), and axes.pcolormesh(...)
  • Testing of the new plotting methods
  • Examples in the docs

This commit introduces the MultiNorm class to prepare for the introduction of multivariate plotting methods
@trygvrad trygvrad changed the title Multivariate plot prapare 2 Preparations for multivariate plotting Apr 6, 2025
    return x
else:
    # in case of a dtype with multiple fields:
    try:
Member

Would be good to get at least partial coverage for this branch.

Member

I haven't really been involved in this work, nor do I understand how it works, but there is quite a bit of new code to deal with multiple datatypes. If this will be covered by tests/functionality in later PRs, that is fine; if not, please add tests for (most of) it.

Contributor Author

I was asked to split #29221 into multiple PRs, and this PR is one of them.
There are tests for this functionality in #29221 using the top-level plotting functions (axes.imshow() etc.).
In my mind it is better to test using the top-level API, but if you wish I could add dedicated testing to this PR.

if self.norm.n_output != cmap_obj.n_variates:
    raise ValueError(f"The colormap {cmap} does not support "
                     f"{self.norm.n_output} variates as required by "
                     f"the {type(self.norm)} on this Colorizer.")
Contributor

Error messages typically have no end dot (same comment applies throughout).

Contributor Author

Thanks, I'll need to change this in the other PR as well.

mask = np.empty(x.shape, dtype=np.dtype('bool, '*len(x.dtype.descr)))
for dd, dm in zip(x.dtype.descr, mask.dtype.descr):
    mask[dm[0]] = ~(np.isfinite(x[dd[0]]))
xm = np.ma.array(x, mask=mask, copy=False)
Contributor

Do numpy masked arrays actually support struct arrays as mask, with possibly different masking of the fields?

Contributor Author

I have found that this is the only way numpy supports masking dtypes with multiple fields, but I will see if [("mask", bool, len(x.dtype.descr))], as you suggest below, is a reasonable approach to using a single mask.
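To make the per-field masking concrete, here is a minimal sketch in plain NumPy. The field names "a" and "b" and the sample values are made up for illustration, and np.ma.make_mask_descr is used to build the boolean dtype rather than the string-based dtype in the diff above, so this is not the code under review.

```python
import numpy as np

# Structured array with two fields (variates); the field names are arbitrary.
x = np.zeros(3, dtype=[("a", "f8"), ("b", "f8")])
x["a"] = [1.0, np.nan, 3.0]
x["b"] = [np.inf, 5.0, 6.0]

# Boolean dtype with the same field layout as x.
mask = np.zeros(x.shape, dtype=np.ma.make_mask_descr(x.dtype))
for name in x.dtype.names:
    # Mask the non-finite entries of each field independently.
    mask[name] = ~np.isfinite(x[name])

xm = np.ma.array(x, mask=mask)

# Indexing by field name returns a masked array carrying that field's mask.
mask_a = xm["a"].mask
mask_b = xm["b"].mask
```

In this sketch each field ends up with its own mask, which is the behaviour the question above is about.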

else:
    # in case of a dtype with multiple fields:
    try:
        mask = np.empty(x.shape, dtype=np.dtype('bool, '*len(x.dtype.descr)))
Contributor

Could the dtype be e.g. [("mask", bool, len(x.dtype.descr))] (with a slightly different API)?
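For reference, a small sketch of what such a dtype spec means in plain NumPy: a single field named "mask" that holds a fixed-length boolean subarray. The value n_variates = 2 and the array shape are assumptions chosen for illustration; how this would be wired into the API is not specified here.

```python
import numpy as np

n_variates = 2  # stands in for len(x.dtype.descr); illustrative value

# One field, "mask", whose value is a boolean subarray of length n_variates.
mask_dtype = np.dtype([("mask", bool, n_variates)])

# An array of this dtype keeps the shape of the data ...
masks = np.zeros((4, 5), dtype=mask_dtype)

# ... while accessing the field appends the subarray dimension.
field_view = masks["mask"]
```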

Contributor Author

This is an interesting idea. I'll make a prototype and see if this would add unnecessary complexity somewhere else.

@trygvrad trygvrad force-pushed the multivariate-plot-prapare-2 branch from 54a945c to eeb895c Compare April 10, 2025 20:42
@trygvrad trygvrad force-pushed the multivariate-plot-prapare-2 branch from 41acef7 to 9c62126 Compare April 13, 2025 10:52
@trygvrad
Contributor Author

@anntzer I think this is important, so I wanted to reply to this in the main thread.

Could the dtype be e.g. [("mask", bool, len(x.dtype.descr))] (with a slightly different API)?

The context here is that multivariate data is stored internally as an array with a data type with multiple fields.
This was chosen because it ensures that data.shape returns the same shape for both scalar and multivariate data.
If a numpy array with multiple fields is masked, it must have a separate mask for each channel. I read @anntzer's suggestion as letting the mask be another field, i.e. ['bool', 'float64', 'float64'] interpreted as [mask, variate0, variate1] when a dataset with two variates is masked.
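As a small illustration of the shape argument, the sketch below packs two 2D variates into one structured array. The field names "f0" and "f1" and the sample values are assumptions for the example only, not the names used internally.

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)   # first variate
b = a * 10                         # second variate

# Multivariate data as one array with a multi-field dtype.
multi = np.empty(a.shape, dtype=[("f0", a.dtype), ("f1", b.dtype)])
multi["f0"] = a
multi["f1"] = b

# For comparison: stacking along a new axis changes the shape.
stacked = np.stack([a, b])
```

Here multi.shape matches the shape of a single variate, whereas stacked gains a leading axis of length two.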

It should be noted that when a regular np.array is masked and the mask is False for all values, only a single instance of False is stored (instead of a full array of bools). This is not the case for structured arrays: for them, a full mask (with a separate bool for each field) is stored in all cases.
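This difference can be checked with a short sketch in plain NumPy. The arrays and field names are illustrative; the unmasked case is built here by wrapping an array without passing a mask.

```python
import numpy as np

# Regular array wrapped without a mask: the mask is the np.ma.nomask singleton.
scalar = np.ma.array(np.array([1.0, 2.0, 3.0]))
scalar_has_no_mask = np.ma.getmask(scalar) is np.ma.nomask

# Structured array wrapped without a mask: a full per-field mask is allocated.
struct = np.ma.array(np.zeros(3, dtype=[("a", "f8"), ("b", "f8")]))
struct_mask = np.ma.getmask(struct)
```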

I didn't actually get as far as to prototype this, but I did have a look around.

I have found that it will largely involve changes to colors.multi_norm._iterable_variates_in_data() and cbook.safe_masked_invalid()

I have tried to list the advantages/disadvantages of the two approaches below:

A: Use a masked array with a struct array.

  • Implication: each variate has a separate mask

Advantages:

  1. It is easy to iterate over the channels (this is in any case handled by colors.multi_norm._iterable_variates_in_data())
  2. Easy to parse masked input
  3. np.ma.is_masked() will work for both multivariate and scalar data
    3.1 I don't think this is actually used internally in the context of the data for a relevant plotting method, so this appears to be a minor issue.
  4. Each variate may have a different mask, and we may implement a different blending mode in color for each.
    4.1 That is, instead of making the masked values transparent, it is possible to map them to unique colors (colors that do not otherwise occur in the colormap, typically cyan, magenta, or bright green) so that the user knows which channel has masked [invalid] data.
    4.2 We will probably not support this initially, but choosing this route gives us that flexibility in the future.

Disadvantages:

  1. Need to store a separate mask for each channel.

B: Store the mask as an additional field in the struct array, i.e. [("mask", bool, len(x.dtype.descr))]

  • Implication: a shared mask for all channels

Advantages:

  1. Only one mask
    1.1 Less memory use
    1.2 No ambiguity as to what data is masked

Disadvantages:

  1. In order to iterate over the channels, a masked array must be created for each channel (i.e. slicing the array will not produce masked arrays; this can be handled in colors.multi_norm._iterable_variates_in_data()).
    1.1. The data may be iterated over multiple times in order to produce a single plot [autoset limits(?) etc.]. One way to interpret option A is that it caches each variate in its masked form, whereas with option B the masked version of each variate is created only upon access.
  2. An unmasked and a masked version of the same array have a different number of fields, which can lead to confusion.
  3. Masked input must be parsed
    3.1 With this implementation, practically all data will need to be formatted upon input, whereas with implementation A, data that is already a struct array [or is complex!] is interoperable with the internal workings of matplotlib.
  4. I suspect it will be more difficult to onboard new developers with this approach.
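
A rough sketch of what option B could look like, following the [mask, variate0, variate1] reading above. The helper iter_variates_shared_mask, the field names "mask", "v0", and "v1", and the sample values are all hypothetical and exist only for this illustration; none of this is prototyped in the PR.

```python
import numpy as np


def iter_variates_shared_mask(data):
    """Yield one masked array per variate, all using the shared 'mask' field.

    Hypothetical helper: the masked arrays are built on every call,
    i.e. only upon access, as described for option B.
    """
    shared = data["mask"]
    for name in data.dtype.names:
        if name == "mask":
            continue
        yield np.ma.array(data[name], mask=shared)


# Two variates plus one shared boolean mask field (illustrative layout).
data = np.zeros(3, dtype=[("mask", bool), ("v0", "f8"), ("v1", "f8")])
data["v0"] = [1.0, np.nan, 3.0]
data["v1"] = [4.0, 5.0, np.inf]

# A position is masked if any variate is non-finite there.
data["mask"] = ~(np.isfinite(data["v0"]) & np.isfinite(data["v1"]))

variates = list(iter_variates_shared_mask(data))
```

In this sketch, plain field access such as data["v0"] returns an ordinary ndarray, so the masked view has to be rebuilt wherever the variates are iterated over.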

Having looked at this, my personal opinion is that option A is more suitable for matplotlib because I think it will be easier to maintain.

@anntzer let me know if I have interpreted your suggestion correctly, and if you agree with my assessment of approach A or B, or if you think I should make a full prototype to explore this further.

@trygvrad trygvrad mentioned this pull request Apr 17, 2025
@trygvrad trygvrad force-pushed the multivariate-plot-prapare-2 branch from 9c62126 to a276d89 Compare May 7, 2025 17:53
Development

Successfully merging this pull request may close these issues.

Feature request: Bivariate colormapping
3 participants