Excessive rugplot memory usage #4695

Closed
cviner opened this issue Jul 14, 2015 · 12 comments

Comments

@cviner

cviner commented Jul 14, 2015

Seaborn's sns.distplot uses very large amounts of memory when attempting to plot a large dataset. An attempt to plot a dataset composed of a single vector of 19 591 561 elements (of type float64, all values between 0 and 1, with no NA elements), failed after exceeding 250 GB of memory (using the Agg backend, on Python 2.7.8, with the latest version of all packages obtained from pip). This only occurs when rug=True.

This issue was previously reported on Seaborn's issue tracker (mwaskom/seaborn#645), and the package author suggested that this is in fact a matplotlib issue.

@WeatherGod
Member

Here is the relevant code from seaborn that interfaces with matplotlib:

    if ax is None:
        ax = plt.gca()
    a = np.asarray(a)
    vertical = kwargs.pop("vertical", axis == "y")
    # axhline/axvline each create one Line2D artist per call
    func = ax.axhline if vertical else ax.axvline
    kwargs.setdefault("linewidth", 1)
    # One artist is added to the Axes for every data point
    for pt in a:
        func(pt, 0, height, **kwargs)

    return ax

So, it is creating 1e7 ax{h,v}lines, each with its own transform stack and
properties, rather than using a collection object that can share all this
information and significantly reduce the memory footprint.

Keep in mind that ax{h,v}line was originally designed for drawing only a
few lines in a chart at a time, certainly not more than 10 or so.
Perhaps there are some things that could be done to clean things up a bit,
but I suspect that the greatest gains would come from seaborn using a
LineCollection here instead of just doing a massive for-loop.
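A minimal sketch of the collection-based approach described above. The helper name `rug_with_collection`, the sample data, and the explicit `set_xlim` call are assumptions made for illustration; this is not seaborn's implementation. All segments are stored in a single `LineCollection`, with x in data coordinates and y in axes coordinates, as with `axvline`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the original report
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection


def rug_with_collection(a, height=0.05, ax=None, **kwargs):
    """Hypothetical helper: draw an x-axis rug as one LineCollection."""
    if ax is None:
        ax = plt.gca()
    a = np.asarray(a, dtype=float).ravel()
    kwargs.setdefault("linewidth", 1)
    # Segments have shape (n, 2, 2): n lines, 2 endpoints, (x, y) each.
    segments = np.empty((a.size, 2, 2))
    segments[:, :, 0] = a[:, np.newaxis]  # same x for both endpoints
    segments[:, 0, 1] = 0.0               # bottom of the axes
    segments[:, 1, 1] = height            # fraction of the axes height
    lc = LineCollection(segments, transform=ax.get_xaxis_transform(), **kwargs)
    ax.add_collection(lc)
    return lc


fig, ax = plt.subplots()
data = np.random.default_rng(0).uniform(0, 1, 10_000)  # made-up sample data
lc = rug_with_collection(data, ax=ax)
ax.set_xlim(0, 1)  # set limits explicitly rather than relying on autoscaling
```

However many points are passed in, the Axes ends up holding one artist.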

@mwaskom

mwaskom commented Jul 14, 2015

Wait, the official word from matplotlib is "we can't draw more than 10 lines on a plot?"

OK.

@WeatherGod
Member

Where did I say that? I merely noted that the ax{v,h}line function was
never intended for drawing more than just a few lines, which is why it
isn't very efficient. Not that it can't be fixed/improved.

@WeatherGod
Member

Note, we do have a way to draw a large number of lines: it is called a
LineCollection, which is highly efficient at what it does.

@mwaskom

mwaskom commented Jul 14, 2015

By similar logic, rugplot is "not intended to draw 1e7 lines".

@WeatherGod
Member

I would not disagree with that notion, but as the maintainer of seaborn and user of the matplotlib API, it would make sense to utilize its API in an efficient manner, especially when the mechanisms for doing so are available. Productive feedback to matplotlib would be how we can help make those mechanisms more easily apparent (documentation and/or api changes).

Now, it may very well be that there are some fixable inefficiencies in those methods that will help improve performance a bit, but I can guarantee you that the biggest gains would come from creating a LineCollection object with 1e7 elements in it, rather than 1e7 Line2D objects. Remember that matplotlib's drawing stack requires sorting the list of artists that it has to handle, along with looping over each artist, calling its draw() at every refresh. Meanwhile, many of the Collection objects can bypass a lot of the typical inefficiencies by assuming certain commonalities.
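To make the artist-count argument concrete, here is a small sketch that builds the same rug both ways and counts the artists each Axes has to sort and draw. The array size and variable names are arbitrary choices for illustration, and the sketch measures neither memory nor time.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 500)  # small, made-up dataset

# One Line2D artist per value, as in the loop quoted earlier in the thread.
fig_loop, ax_loop = plt.subplots()
for pt in x:
    ax_loop.axvline(pt, 0, 0.05, linewidth=1)

# One LineCollection artist holding every segment.
fig_coll, ax_coll = plt.subplots()
ax_coll.vlines(x, 0, 0.05, linewidth=1,
               transform=ax_coll.get_xaxis_transform())

n_loop_artists = len(ax_loop.lines)        # grows with len(x)
n_coll_artists = len(ax_coll.collections)  # stays at one
print(n_loop_artists, n_coll_artists)
```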

@efiring
Member

efiring commented Jul 14, 2015

It looks like it would be straightforward to add axhlines and axvlines methods to Axes (or just write them as functions), which would accept vector or scalar arguments and which would generate LineCollection objects.
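One way such functions might look, written as free functions rather than Axes methods. The names `axvlines`, `axhlines`, and `_span_lines`, and their signatures, are hypothetical and chosen to mirror `axvline`/`axhline`; they are not an existing Matplotlib API. Scalars and vectors are both normalized to a 1-D array, and the result is always a `LineCollection`.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection


def _span_lines(ax, positions, lo, hi, vertical, **kwargs):
    """Hypothetical helper: one LineCollection of axis-spanning lines."""
    pos = np.atleast_1d(np.asarray(positions, dtype=float)).ravel()
    segments = np.empty((pos.size, 2, 2))
    # Column 0 is x, column 1 is y; one is in data units, the other a fraction.
    data_col, frac_col = (0, 1) if vertical else (1, 0)
    segments[:, :, data_col] = pos[:, np.newaxis]
    segments[:, 0, frac_col] = lo
    segments[:, 1, frac_col] = hi
    transform = (ax.get_xaxis_transform() if vertical
                 else ax.get_yaxis_transform())
    lc = LineCollection(segments, transform=transform, **kwargs)
    ax.add_collection(lc, autolim=False)
    return lc


def axvlines(ax, x, ymin=0.0, ymax=1.0, **kwargs):
    return _span_lines(ax, x, ymin, ymax, vertical=True, **kwargs)


def axhlines(ax, y, xmin=0.0, xmax=1.0, **kwargs):
    return _span_lines(ax, y, xmin, xmax, vertical=False, **kwargs)


fig, ax = plt.subplots()
single = axvlines(ax, 0.5)                            # scalar input
many = axhlines(ax, [0.2, 0.4, 0.6], xmin=0.0, xmax=0.1)  # vector input
```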

@WeatherGod
Member

Right, I was thinking along those lines, but we would need to be careful of
API breakage. Perhaps return a LineCollection only if the input is
iterable?
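A sketch of that backward-compatible dispatch, assuming a hypothetical wrapper named `axvline_or_collection`. Scalar input keeps the existing `Line2D` return type, while iterable input yields a single `LineCollection`. This only illustrates the idea and is not something Matplotlib implements.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection
from matplotlib.lines import Line2D


def axvline_or_collection(ax, x, ymin=0.0, ymax=1.0, **kwargs):
    """Hypothetical wrapper: Line2D for scalars, LineCollection for iterables."""
    if np.iterable(x):
        return ax.vlines(np.asarray(x, dtype=float), ymin, ymax,
                         transform=ax.get_xaxis_transform(), **kwargs)
    # Scalar path: unchanged behaviour, so existing callers are unaffected.
    return ax.axvline(x, ymin, ymax, **kwargs)


fig, ax = plt.subplots()
scalar_result = axvline_or_collection(ax, 0.3)
vector_result = axvline_or_collection(ax, [0.1, 0.5, 0.9], ymax=0.05)
```

The cost of this design is a return type that depends on the input, which callers then have to handle.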

@efiring
Member

efiring commented Jul 14, 2015

Yes, I think we could do that. I imagine the right way to do it might be with an underlying refactoring, so that LineCollection and Line2D would inherit from a base class. Then the return would be guaranteed to be an instance of that base, but might be further specialized depending on the inputs. This sort of refactoring could help us unify Collections with their related single types. I haven't thought it through; but the combination of close similarity and subtle differences in API between Collection types and the single types has always been problematic.
Another example of a place where we are using single types (Line2D) where the collection would make more sense is in Axis grid lines; they really don't need the staggering complexity of Line2D, and they are almost always generated in bunches.

@tacaswell
Member

I thought the grid lines were drawn as part of the axis Ticks?

@efiring
Member

efiring commented Jul 15, 2015

Ticks and ticklabels are another performance nightmare; I think they are largely responsible for the abysmal performance in making a 10x10 array of subplots.
You are right in the sense that gridline is an attribute of Tick, but it is implemented as a Line2D object. The underlying tradeoff is between performance (vectorize as much as possible, as in collections) and flexibility (present system: each Tick is a highly complex independent object, including a Line2D, with all its capability for fancy markers etc.). There is some optimization in the tick system to reduce the time required to make new Tick objects, but overall it is still a major performance bottleneck.
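A small inspection sketch of the point about ticks, assuming a 3x3 grid rather than the 10x10 case so that it runs quickly. It checks that the tick marks and the grid line of a `Tick` are separate `Line2D` objects, and counts how many `Tick` objects the figure holds. The variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

fig, axes = plt.subplots(3, 3)

# Each Tick carries its own tick marks and grid line as Line2D artists.
first_tick = axes[0, 0].xaxis.get_major_ticks()[0]
line_artists = [first_tick.tick1line, first_tick.tick2line, first_tick.gridline]
all_line2d = all(isinstance(artist, Line2D) for artist in line_artists)

# Total major Tick objects across the figure; each holds three Line2D artists.
total_ticks = sum(
    len(ax.xaxis.get_major_ticks()) + len(ax.yaxis.get_major_ticks())
    for ax in axes.flat
)
print(all_line2d, total_ticks)
```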

@tacaswell tacaswell added this to the proposed next point release milestone Aug 30, 2015
@tacaswell tacaswell modified the milestones: 2.1 (next point release), 2.2 (next next feature release) Oct 3, 2017
@anntzer
Contributor

anntzer commented Apr 21, 2020

The Tick situation is already tracked at #6664. For rugplot's case I think ax.vlines(..., transform=ax.get_xaxis_transform()) should be good enough to create the LineCollection. Closing, as I don't think there's anything actionable on Matplotlib's side.
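A sketch of how a rug function could use this suggestion for either orientation. The function name `rug` and its parameters are assumptions loosely modeled on the seaborn snippet quoted earlier, and this is not seaborn's actual code. `ax.vlines` and `ax.hlines` each return one `LineCollection`, and the blended axis transform keeps the stick length in axes-fraction units.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection


def rug(a, height=0.05, axis="x", ax=None, **kwargs):
    """Hypothetical rug helper built on Axes.vlines / Axes.hlines."""
    if ax is None:
        ax = plt.gca()
    a = np.asarray(a, dtype=float).ravel()
    kwargs.setdefault("linewidth", 1)
    if axis == "y":
        # y in data coordinates, x as a fraction of the axes width
        return ax.hlines(a, 0, height,
                         transform=ax.get_yaxis_transform(), **kwargs)
    # x in data coordinates, y as a fraction of the axes height
    return ax.vlines(a, 0, height,
                     transform=ax.get_xaxis_transform(), **kwargs)


fig, ax = plt.subplots()
values = np.random.default_rng(1).uniform(0, 1, 1_000)  # made-up data
x_rug = rug(values, ax=ax)
y_rug = rug(values, axis="y", ax=ax)
```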

@anntzer anntzer closed this as completed Apr 21, 2020
@story645 story645 removed this from the future releases milestone Oct 6, 2022