Description
The following code produces a ValueError at the second print with pandas >= 0.22 and any version of xpress (available on conda and PyPI):
import pandas as pd
n = 3
str1 = ['a'] * n
str2 = ['b'] * n
str3 = ['c'] * n
str1[0] = 'd'
df = pd.DataFrame({'key':str1, 'val1':str2, 'val2':str3})
df = df.set_index('key')
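# Before xpress is imported, the lookup works.
print (df.loc['d'])
import xpress as xp
# After the import, the same lookup raises ValueError.
print (df.loc['d'])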
This happens because xpress overloads NumPy's eq operation, which it does through NumPy's PyUFunc_FromFuncAndData() function.
The overloading replaces the function pointers in an array of (operand_type, operand_type, result_type) tuples and may also change those types. For xpress to work, one of the two entries whose operand types are NPY_OBJECT has to be changed so that its result type is also NPY_OBJECT.
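The (operand_type, operand_type, result_type) tuples can be inspected from Python through the ufunc's types attribute. The following is a minimal sketch, assuming only that NumPy is installed; xpress is deliberately not imported, so it shows the unmodified table, and the sample array is made up for illustration:

```python
import numpy as np

# Each entry reads "<in1><in2>-><out>" in NumPy type characters:
# 'O' corresponds to NPY_OBJECT and '?' to NPY_BOOL.
object_loops = [sig for sig in np.equal.types if sig.startswith("OO->")]
print(object_loops)

# With the stock loops, comparing an object array yields booleans,
# which occupy one byte per element.
index_values = np.array(["d", "a", "a"], dtype=object)
mask = np.equal(index_values, "d")
print(mask.dtype, mask.dtype.itemsize)
```

Running the same snippet after import xpress should show whether the result dtype of the comparison has changed.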
The ValueError is raised in pandas' _maybe_get_bool_indexer(), where indexer is declared in Cython as an ndarray of bytes and is then assigned the result of the comparison. The comparison runs xpress' code, which detects that the operands are not xpress objects and falls back to the original comparison operation, but returns an array of objects rather than of bytes. Assigning that array to indexer therefore raises a ValueError.
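The item-size mismatch behind the error message can be reproduced without xpress. This is only a sketch: astype(object) stands in for the object-typed comparison result described above, and fits_uint8_buffer is a hypothetical helper that mimics the size check of a uint8_t buffer, not a pandas or Cython function:

```python
import numpy as np

index_values = np.array(["d", "a", "a"], dtype=object)

# What pandas expects from the comparison: a boolean mask, one byte per item.
expected = index_values == "d"

# Stand-in for the problematic result: same values, but stored as objects.
# (astype is used only to simulate this; it is not what xpress calls.)
as_objects = expected.astype(object)


def fits_uint8_buffer(arr):
    """Hypothetical check: does the item size match that of uint8_t?"""
    return arr.dtype.itemsize == np.dtype(np.uint8).itemsize


print(expected.dtype, fits_uint8_buffer(expected))
print(as_objects.dtype, fits_uint8_buffer(as_objects))
```

An object array stores one pointer per element (8 bytes on the 64-bit platform reported below), which matches the sizes quoted in the ValueError.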
The issue does not occur with pandas < 0.22.
The output is as follows:
val1 b
val2 c
Name: d, dtype: object
Traceback (most recent call last):
File "bug2.py", line 19, in <module>
print (df.loc['d'])
File "/home/pietro/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1478, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/pietro/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 1912, in _getitem_axis
return self._get_label(key, axis=axis)
File "/home/pietro/.local/lib/python3.5/site-packages/pandas/core/indexing.py", line 140, in _get_label
return self.obj._xs(label, axis=axis)
File "/home/pietro/.local/lib/python3.5/site-packages/pandas/core/generic.py", line 2987, in xs
loc = self.index.get_loc(key)
File "/home/pietro/.local/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 157, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 183, in pandas._libs.index.IndexEngine._get_loc_duplicates
File "pandas/_libs/index.pyx", line 191, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
ValueError: Item size of buffer (8 bytes) does not match size of 'uint8_t' (1 byte)
The expected output is as follows:
val1 b
val2 c
Name: d, dtype: object
val1 b
val2 c
Name: d, dtype: object
Here is the output of pd.show_versions():
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-8-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.utf8
LOCALE: en_GB.UTF-8
pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None