Proposal for improving support for wide data #6260

From the beginning, HoloViews was designed primarily around [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). This has the major benefit that data can be clearly delineated into key dimensions (independent values or coordinates) and value dimensions, which represent a dependent variable, i.e. some kind of measurement. It also makes it easy to perform the groupby operations that allow HoloViews to facet data in a grid (GridSpace), in a layout (NdLayout), using widgets (HoloMap/DynamicMap), and as a set of traces in a plot (NdOverlay). However, in many common scenarios data will not be tidy. The most common case is a set of timeseries indexed by date(time), where multiple columns all record the same kind of value. The classic example is stock prices, where the index is the date and each column records the price for a different ticker.
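To make the two layouts concrete, here is a minimal sketch using pandas. The prices are made up for illustration, and the column names `datetime`, `ticker` and `price` are my own choices rather than anything HoloViews requires.

```python
import pandas as pd

# Wide layout: one row per date, one column per ticker (illustrative values only)
wide = pd.DataFrame(
    {
        "AAPL": [190.0, 191.5, 189.8],
        "MSFT": [410.0, 412.3, 409.1],
        "IBM": [185.0, 186.2, 184.7],
    },
    index=pd.date_range("2024-01-01", periods=3, freq="D", name="datetime"),
)

# Tidy layout: one row per (date, ticker) observation, with a single 'price' column
tidy = wide.reset_index().melt(
    id_vars="datetime", var_name="ticker", value_name="price"
)

print(wide.shape)  # 3 rows x 3 ticker columns
print(tidy.shape)  # 9 rows x 3 columns
```

In the tidy frame `ticker` is an ordinary column that can be grouped on, while in the wide frame the ticker only exists as a column name.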

Reshaping this data into tidy form is tremendously inefficient: where before you could have one DataFrame, you now have to create `N` DataFrames, one for each stock ticker. So here I will lay out my proposal for formally supporting wide data in HoloViews.

### The Problem

Today you can already create an NdOverlay in which each Element provides a view into one column of the wide DataFrame, but doing so breaks HoloViews' internal model of the world. E.g. let's look at what the structure of the ticker data looks like if you do this:

```
NdOverlay [ticker]
    Curve [datetime] (AAPL)
    Curve [datetime] (MSFT)
    Curve [datetime] (IBM)
```

Here the ticker names become the values of the NdOverlay key dimension AND the value dimension names of the individual `Curve` elements. This is clearly inelegant and also conceptually incorrect: AAPL is not a dimension, because it does not represent an actual measurable quantity with an associated unit. The actual measurable quantity is "Stock Price". This structure is necessary because the element equates the value dimension with the name of the variable in the underlying data, i.e. the string 'AAPL' is used to look up the column in the underlying DataFrame. Downstream this causes issues for the sharing of dimension ranges in plots and for other features that rely on the identity of Dimensions.

### The Proposal

There are a few approaches that might give us a way out of this, but they are potentially quite disruptive, since HoloViews deeply embeds the assumption that `Dimension.name` is the name of the variable in the underlying dataset. Introducing a new attribute on the `Dimension` to separate the name of the Dimension from the variable to look up therefore does not seem feasible. The only thing that I believe can feasibly be implemented is relying entirely on `Dimension.label` for the identity of the `Dimension`. In most scenarios the `name` and `label` mirror each other anyway, but when a user defines a `label`, that should be sufficient to uniquely identify the Dimension.

Initial testing suggests this would already almost achieve what we want without breaking anything, and based on a quick survey the changes required to make this work are relatively minor:

- `Dimension.__eq__` should compare just the `label`, not both the `name` and the `label`, ensuring that `Dimension('AAPL', label='Price')` and `Dimension('MSFT', label='Price')` are treated as the same dimension.
- The `Dimension` and `Dimensioned` reprs should be updated to reflect the `label` as the source of truth for the identity of the dimension.
- The plotting code must index the dimension ranges by label and also look them up by label.
- The logic that links Bokeh axes should be updated to consider only the `Dimension.label`.

This would be sufficient to fully support wide data without major disruptive changes to HoloViews, ensuring that linking of dimension ranges continues to work and that the reprs correctly represent the conceptual model HoloViews has of the data.
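The idea behind the first and third bullet points can be sketched in plain Python. The `Dim` class and `shared_ranges` helper below are hypothetical stand-ins written for this illustration; they are not HoloViews' `Dimension` class or its plotting code, and the values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Dim:
    """Hypothetical stand-in for a dimension (not HoloViews' Dimension)."""
    name: str                    # variable to look up in the underlying data
    label: Optional[str] = None  # identity of the dimension

    def __post_init__(self):
        # When no label is given, the label mirrors the name
        if self.label is None:
            self.label = self.name

    def __eq__(self, other):
        # Identity is decided by the label alone
        if isinstance(other, Dim):
            return self.label == other.label
        return NotImplemented

    def __hash__(self):
        return hash(self.label)


def shared_ranges(columns):
    """Combine (Dim, values) pairs into one (min, max) range per label."""
    ranges = {}
    for dim, values in columns:
        lo, hi = min(values), max(values)
        if dim.label in ranges:
            prev_lo, prev_hi = ranges[dim.label]
            lo, hi = min(lo, prev_lo), max(hi, prev_hi)
        ranges[dim.label] = (lo, hi)
    return ranges


aapl = Dim("AAPL", label="Price")
msft = Dim("MSFT", label="Price")
volume = Dim("Volume")

ranges = shared_ranges([
    (aapl, [189.8, 191.5]),
    (msft, [409.1, 412.3]),
    (volume, [1000, 2500]),
])
```

Because both ticker dimensions carry the label `Price`, they compare equal and contribute to a single shared range, while `name` stays available for looking up the column in the wide DataFrame.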
