Proposal for improving support for wide data #6260

@philippjfr

From the beginning, HoloViews was designed primarily around tidy data. This has the major benefit that data can be cleanly delineated into key dimensions (independent values or coordinates) and value dimensions, which represent a dependent variable, i.e. some kind of measurement. It also makes it easy to perform the groupby operations that allow HoloViews to facet data in a grid (GridSpace), in a layout (NdLayout), using widgets (HoloMap/DynamicMap), and as a set of traces in a plot (NdOverlay). However, in many common scenarios data will not be tidy. The most common case is a collection of timeseries indexed by date(time), with multiple columns that all record the same kind of value. The canonical example is stock prices, where the index is the date and each column records the stock price for a different ticker.

The problem with reshaping this data into tidy form is that it's tremendously inefficient: where before you could have one DataFrame, you now have to create N DataFrames, one for each stock ticker. So here I will lay out my proposal for formally supporting wide data in HoloViews.
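To make the two layouts concrete, here is a minimal, dependency-free sketch of the same data in wide and tidy form. The `wide_to_tidy` helper and the placeholder numbers are assumptions made up for illustration (they are not real prices and not part of HoloViews or pandas); the point is only that the shared datetime index gets copied once per ticker, whether the result is one stacked table or N per-ticker tables.

```python
from datetime import date

# Wide layout: one shared index column plus one column per ticker.
# The values are placeholders, not real stock prices.
wide = {
    "datetime": [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 4)],
    "AAPL": [1.0, 2.0, 3.0],
    "MSFT": [10.0, 20.0, 30.0],
    "IBM": [5.0, 6.0, 7.0],
}


def wide_to_tidy(table, index, var_name, value_name):
    """Stack every non-index column into (index, variable, value) rows."""
    rows = []
    for column, values in table.items():
        if column == index:
            continue
        for key, value in zip(table[index], values):
            rows.append({index: key, var_name: column, value_name: value})
    return rows


tidy = wide_to_tidy(wide, "datetime", "ticker", "price")

# Count stored cells in each layout: the tidy form repeats the index
# (and the ticker name) for every single measurement.
wide_cells = sum(len(values) for values in wide.values())
tidy_cells = sum(len(row) for row in tidy)
print(wide_cells, tidy_cells)
```

With three tickers and three dates the wide table holds 12 cells while the tidy one holds 27, and the gap grows with the number of tickers.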

The Problem

While today you can already organize data in such a way that you create an NdOverlay where each Element provides a view into one column in the wide DataFrame, it breaks HoloViews' internal model of the world. E.g. let's look at what the structure of the ticker data looks like if you do this:

NdOverlay [ticker]
    Curve [datetime] (AAPL)
    Curve [datetime] (MSFT)
    Curve [datetime] (IBM)

Here the ticker names become the values of the NdOverlay key dimension AND the value dimension names of each Curve element. This is clearly inelegant and also conceptually incorrect: AAPL is not a dimension, since it does not represent an actual measurable quantity with an associated unit. The actual measurable quantity is "Stock Price". This is necessary because the element equates the value dimension with the name of the variable in the underlying data, i.e. the string 'AAPL' is used to look up the column in the underlying DataFrame. Downstream this causes issues for the sharing of dimension ranges in plots and for other features that rely on the identity of Dimensions.

The proposal

There are a few proposals that might give us a way out of this, but they are potentially quite disruptive since HoloViews deeply embeds the assumption that the Dimension.name is the name of the variable in the underlying dataset. Introducing a new, distinct attribute on the Dimension to separate the name of the Dimension from the variable to look up therefore does not seem feasible. The only thing that I believe can be feasibly implemented is relying entirely on the Dimension.label for the identity of the Dimension. In most scenarios the name and label are mirrors of each other anyway, but when a user defines a label, that should be sufficient to uniquely identify the Dimension.

Initial testing suggests this would already almost achieve what we want without breaking anything. Based on a quick survey, the changes required to make this work are relatively minor:

  • Dimension.__eq__ should compare just the label, not the name and label, ensuring that Dimension('AAPL', label='Price') and Dimension('MSFT', label='Price') are treated as the same dimension.
  • The Dimension and Dimensioned reprs should be updated to reflect the label as the source of truth of the identity of the dimension.
  • The plotting code must now index the dimension ranges by label and also look them up by label.
  • Logic to link Bokeh axes should be updated to consider only the Dimension.label.
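The first and third items can be sketched roughly as follows. This is a simplified stand-in, not the real `holoviews.Dimension` class: the `Dimension` dataclass and the `shared_ranges` helper are hypothetical and exist only to show label-based identity, with the name still used to look up the column in the data. The numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Dimension:
    """Hypothetical stand-in for holoviews.Dimension (not the real class)."""

    name: str        # key used to look up the column in the underlying data
    label: str = ""  # identity of the dimension; mirrors the name by default

    def __post_init__(self):
        if not self.label:
            self.label = self.name

    def __eq__(self, other):
        # Proposed rule: identity is decided by the label alone.
        if not isinstance(other, Dimension):
            return NotImplemented
        return self.label == other.label

    def __hash__(self):
        # Keep hashing consistent with the label-only equality.
        return hash(self.label)


def shared_ranges(columns, dims):
    """Merge per-column (min, max) ranges, keyed by dimension label."""
    ranges = {}
    for dim in dims:
        values = columns[dim.name]  # the data lookup still uses the name
        lo, hi = min(values), max(values)
        if dim.label in ranges:
            old_lo, old_hi = ranges[dim.label]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        ranges[dim.label] = (lo, hi)
    return ranges


# Placeholder values, one column per ticker as in the wide layout.
columns = {"AAPL": [1.0, 3.0], "MSFT": [2.0, 8.0], "IBM": [0.5, 4.0]}
dims = [Dimension(ticker, label="Price") for ticker in columns]
ranges = shared_ranges(columns, dims)
print(ranges)
```

Under this rule all three ticker columns collapse onto a single "Price" range, while dimensions without an explicit label keep behaving as they do today because the label falls back to the name.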

This would be sufficient to fully support wide data without major disruptive changes to HoloViews, ensuring that linking of dimension ranges continues to work and that the reprs correctly represent the conceptual model HoloViews has of the data.
