From the beginning, HoloViews was designed primarily around tidy data. This has the major benefit that data can be clearly delineated into key dimensions (independent values or coordinates) and value dimensions, which represent a dependent variable, i.e. some kind of measurement. It also makes it easy to perform the groupby operations that allow HoloViews to facet data in a grid (GridSpace), in a layout (NdLayout), using widgets (HoloMap/DynamicMap) and as a set of traces in a plot (NdOverlay). However, in many common scenarios data will not be tidy. The most common case is a collection of timeseries indexed by date(time), where multiple columns all hold the same kind of value; the canonical example is stock prices, where the index is the date and each column records the stock price for a different ticker.
The problem with reshaping this data into tidy form is that it's tremendously inefficient. Where before you could have one DataFrame, you now have to create N DataFrames, one for each stock ticker. So here I will lay out my proposal for formally supporting wide data in HoloViews.
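To illustrate the cost, here is a minimal pandas sketch of the reshaping described above. The prices are made-up illustrative values, and the column names `ticker` and `price` are my own choices rather than anything HoloViews requires.

```python
import pandas as pd

# Wide data: one row per date, one column per ticker (illustrative values only).
dates = pd.date_range("2024-01-01", periods=3, freq="D")
wide = pd.DataFrame(
    {
        "AAPL": [185.0, 186.5, 184.2],
        "MSFT": [370.1, 372.4, 371.0],
        "IBM": [160.3, 161.0, 159.8],
    },
    index=pd.Index(dates, name="datetime"),
)

# Tidy data: one row per (datetime, ticker) pair, so every date is repeated
# once per ticker.
tidy = wide.reset_index().melt(
    id_vars="datetime", var_name="ticker", value_name="price"
)

# Grouping by ticker then yields one DataFrame per ticker.
per_ticker = {ticker: frame for ticker, frame in tidy.groupby("ticker")}

print(wide.shape, tidy.shape, len(per_ticker))
```

With N tickers the datetime column is stored N times in the tidy form, and the groupby produces N separate frames where the wide form needed only one.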
The Problem
While today you can already organize data in such a way that you create an NdOverlay where each Element provides a view into one column in the wide DataFrame, it breaks HoloViews' internal model of the world. E.g. let's look at what the structure of the ticker data looks like if you do this:
NdOverlay [ticker]
    Curve [datetime] (AAPL)
    Curve [datetime] (MSFT)
    Curve [datetime] (IBM)
Here the ticker names become the values of the NdOverlay key dimension AND the value dimension names of the individual Curve elements. This is clearly inelegant and also conceptually incorrect: AAPL is not a dimension, because it does not represent an actual measurable quantity with an associated unit. The actual measurable quantity is "Stock Price". This is necessary because the element equates the value dimension with the name of the variable in the underlying data, i.e. the string 'AAPL' will be used to look up the column in the underlying DataFrame. Downstream this causes issues for the sharing of dimension ranges in plots and for other features that rely on the identity of Dimensions.
The Proposal
There are a few proposals that might give us a way out of this, but they are potentially quite disruptive since HoloViews deeply embeds the assumption that the Dimension.name is the name of the variable in the underlying dataset. Introducing a new, distinct attribute on the Dimension to separate the name of the Dimension from the variable to look up therefore does not seem feasible. The only thing that I believe can be feasibly implemented is relying entirely on the Dimension.label for the identity of the Dimension. In most scenarios the name and label mirror each other anyway, but when a user defines a label, that should be sufficient to uniquely identify the Dimension.
Based on some initial testing, this would already almost achieve what we want without breaking anything. From a quick survey, the changes required to make this work are relatively minor:
- Dimension.__eq__ should compare just the label, not the name and label, ensuring that Dimension('AAPL', label='Price') and Dimension('MSFT', label='Price') are treated as the same dimension.
- The Dimension and Dimensioned reprs should be updated to reflect the label as the source of truth of the identity of the dimension.
- The plotting code must now index the dimension ranges by label and also look them up by label.
- Logic to link Bokeh axes should be updated to consider only the Dimension.label.
This would be sufficient to fully support wide data without major disruptive changes to HoloViews, ensuring that linking of dimension ranges continues to work and that the reprs correctly represent the conceptual model HoloViews has of the data.
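The core of the idea can be sketched in plain Python. Note that `LabelDim` and `shared_ranges` below are hypothetical stand-ins written for illustration, not the actual HoloViews `Dimension` class or plotting code; they only show equality based on the label and ranges keyed by label, using made-up values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class LabelDim:
    """Hypothetical stand-in for a HoloViews Dimension (not the real class)."""

    name: str                    # column to look up in the underlying data
    label: Optional[str] = None  # identity of the measured quantity

    def __post_init__(self):
        # Mirror the name when no label is given, as described above.
        if self.label is None:
            self.label = self.name

    def __eq__(self, other):
        # Identity is decided by the label alone; the name is ignored.
        if isinstance(other, LabelDim):
            return self.label == other.label
        return NotImplemented

    def __hash__(self):
        return hash(self.label)


def shared_ranges(columns):
    """Combine (dimension, values) pairs into one (min, max) range per label."""
    ranges = {}
    for dim, values in columns:
        lo, hi = min(values), max(values)
        if dim.label in ranges:
            old_lo, old_hi = ranges[dim.label]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        ranges[dim.label] = (lo, hi)
    return ranges


aapl = LabelDim("AAPL", label="Price")
msft = LabelDim("MSFT", label="Price")
volume = LabelDim("volume")

ranges = shared_ranges(
    [(aapl, [185.0, 186.5]), (msft, [370.1, 372.4]), (volume, [10, 20])]
)
print(aapl == msft, ranges)
```

The name is still what gets used to look up the column, while the label decides which dimensions are considered the same and therefore share a range.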