BUG: astype(object) downcasts for datetime-dtype #12550

Open
h-vetinari opened this issue Dec 14, 2018 · 45 comments
Labels
component: numpy.datetime64 (and timedelta64) triaged Issue/PR that was discussed in a triage meeting

Comments

@h-vetinari
Contributor

The dtype np.object_ is the most general catch-all type to contain arbitrary python objects. As such, .astype(object) should preferably not upcast, but certainly never downcast:

Reproducing code example:

>>> import numpy as np
>>> arr = np.array([10 ** 18], dtype='M8[ns]')
>>> arr
array(['2001-09-09T01:46:40.000000000'], dtype='datetime64[ns]')
>>> arr[0]
numpy.datetime64('2001-09-09T01:46:40.000000000')
>>> arr.astype(object)
array([1000000000000000000], dtype=object)
>>> arr.astype(object)[0]
1000000000000000000

The expected outcome for the last two lines is:

>>> arr.astype(object)
array([numpy.datetime64('2001-09-09T01:46:40.000000000')], dtype=object)
>>> arr.astype(object)[0]
numpy.datetime64('2001-09-09T01:46:40.000000000')

as well as

>>> arr.astype(object)[0] == arr[0]
True

Numpy/Python version information:

numpy: 1.15.4
python: 3.7.1
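
One way to get the expected outcome today is to fill the object array element by element instead of calling .astype(object). The sketch below is only a workaround under that assumption; the helper name to_object_keep_datetime64 is hypothetical and not part of NumPy.

```python
import numpy as np


def to_object_keep_datetime64(arr):
    """Hypothetical helper: copy `arr` into an object array without casting."""
    out = np.empty(arr.shape, dtype=object)
    for idx in np.ndindex(arr.shape):
        # arr[idx] is a NumPy scalar (here numpy.datetime64); assigning it to
        # an element of an object array stores the scalar itself.
        out[idx] = arr[idx]
    return out


arr = np.array([10 ** 18], dtype="M8[ns]")
obj = to_object_keep_datetime64(arr)
print(type(obj[0]))      # <class 'numpy.datetime64'>
print(obj[0] == arr[0])  # True
```

The Python-level loop is slow for large arrays, so this is meant to illustrate the desired semantics rather than to be a drop-in replacement.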
@eric-wieser
Member

Note that the result is usually a datetime.datetime, if one is possible to create:

>>> np.array([10 ** 10], dtype='M8[us]').astype(object)
array([datetime.datetime(1970, 1, 1, 2, 46, 40)], dtype=object)

@h-vetinari
Contributor Author

Note that the result is usually a datetime.datetime, if one is possible to create:

Fair enough, as long as it's not an int... ;-)

@eric-wieser
Member

eric-wieser commented Dec 14, 2018

Well, in this case it doesn't fit in a datetime, so we don't have that option.

Is it better to return:

  1. Only datetime64
  2. Only rounded / truncated datetime
  3. A mixture of datetime and int
  4. A mixture of datetime64 and datetime

Right now, we do 3.

@h-vetinari
Contributor Author

To me the answer is clearly 1 (it should be clear that a mixture is bound to lead to inconsistencies...).

TBH, I was really surprised that (from your example):

>>> np.array([10 ** 10], dtype='M8[us]').astype(object)
array([datetime.datetime(1970, 1, 1, 2, 46, 40)], dtype=object)

rather than

>>> np.array([10 ** 10], dtype='M8[us]').astype(object)
array([numpy.datetime64('1970-01-01T02:46:40.000000')], dtype=object)

This is also inconsistent because passing an np.datetime64 to a constructor with dtype=object does not change it to datetime.datetime, so I was really surprised that .astype(object) does something different.

>>> np.array([np.datetime64('1970-01-01T02:46:40.000000')], dtype=object)
array([numpy.datetime64('1970-01-01T02:46:40.000000')], dtype=object)
>>>
>>> np.array([np.datetime64('1970-01-01T02:46:40.000000')]).astype(object)
array([datetime.datetime(1970, 1, 1, 2, 46, 40)], dtype=object)

@seberg
Member

seberg commented Dec 15, 2018

Well, the current state does seem strange. But I am not sure I like to change the behaviour because of something that on first sight seems rather hypothetical? (I mean changing the behaviour to returning datetime64, which seems pretty different.)

Do you actually work with dates past 10000 years? If that is a real use case, maybe we should see that python allows it rather?
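
For context on the limits being discussed, the standard-library datetime type only covers years 1 through 9999 at microsecond resolution, so it cannot represent every datetime64 value. The short check below uses only the standard library; the helper year_fits is a hypothetical name introduced for illustration.

```python
import datetime

# Python's datetime stores at most microsecond precision ...
print(datetime.datetime.resolution)          # 0:00:00.000001
# ... and only supports this range of years.
print(datetime.MINYEAR, datetime.MAXYEAR)    # 1 9999


def year_fits(year):
    """Hypothetical check: can datetime.datetime represent this year at all?"""
    try:
        datetime.datetime(year, 1, 1)
    except ValueError:
        return False
    return True


print(year_fits(9999))
print(year_fits(10000))
```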

@h-vetinari
Contributor Author

@seberg

Well, the current state does seem strange. But I am not sure I like to change the behaviour because of something that on first sight seems rather hypothetical? (I mean changing the behaviour to returning datetime64, which seems pretty different.)

I don't really get that point. The underlying values were datetime64 to begin with, why should astype(object) change the values (rather than just the dtype of the array)?
I understand of course that other astype-cases deliberately change the values (indeed, that's often the whole point of the method call), but object is special as the most general container, and should never need to cast at all.

Do you actually work with dates past 10000 years? If that is a real use case, maybe we should see that python allows it rather?

This is not an issue of date range limitations (to me). I'm writing lots of parametrized tests for a larger pandas-PR, and this discrepancy between dtype=object in the constructor and .astype(object) caught me out.

@seberg
Member

seberg commented Dec 17, 2018

This is annoying, but... If you look closer, you will notice that also other types cast to the python version. For datetime this is a bigger step/change and also annoying.

What I mean is that we use .item()'s "convert to a python object" logic for datetime:

  • This assumes that numpy datetimes can be safely cast to python datetimes (which is not quite true)
  • But: Changing this is a pretty large change that needs to be thoroughly discussed (at least as far as I can see). For example, what will happen with matplotlib, and all the other users down there that (unwittingly) are using it to get datetime objects.

So, yes, this is not a safe cast. And yes, it is utterly broken. But I need a lot more to be convinced that changing the cast from returning datetime objects to returning numpy datetime64 objects will not have a huge impact on downstream.

If such a fix would actually fix bugs for downstream in the long run, it would be more compelling. But it seems to me that for most users the bug will be very mild...
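
The shared ".item()" logic described above can be seen by comparing both paths on a value that does fit in a Python datetime. This is a small illustration using the microsecond example from earlier in the thread, not a statement about every unit or NumPy version.

```python
import datetime

import numpy as np

arr = np.array([10 ** 10], dtype="M8[us]")

via_item = arr[0].item()            # scalar -> Python object
via_astype = arr.astype(object)[0]  # array cast -> Python object

# Both routes produce the same kind of Python object for this input.
print(type(via_item), type(via_astype))
print(via_item == via_astype)
```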

@seberg
Member

seberg commented Dec 17, 2018

Note that this discrepancy occurs also for all of the numerical types in numpy, even if it may be just as surprising there.
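
A minimal illustration of the same discrepancy for a numerical type, assuming float32 as the example dtype: casting with .astype(object) yields a Python float, while passing the NumPy scalar to the constructor with dtype=object keeps it.

```python
import numpy as np

values = np.array([1.5], dtype=np.float32)

cast = values.astype(object)                        # converts each element
built = np.array([np.float32(1.5)], dtype=object)   # stores the scalar as given

print(type(cast[0]))   # <class 'float'>
print(type(built[0]))  # <class 'numpy.float32'>
```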

@h-vetinari
Contributor Author

@seberg
We got side-tracked a bit with the datetime vs datetime64 - I see your points, even though I believe downstream effort is not a good reason to keep something in an "utterly broken" state. That's why there's version numbers. ;-)

I can work around the datetime/datetime64 thing, but the ints that get returned in the OP are really a killer.

@seberg
Member

seberg commented Dec 30, 2018

I noticed that years ago I once glanced at changing the default behaviour for all types here (so that the object cast would retain the numpy type). But I am not quite sure we should do it.

EDIT: or well, aim for it maybe. First, there is no way to warn about it. Second, we probably have no clue about how disruptive that would be.

@eric-wieser
Member

eric-wieser commented Dec 30, 2018 via email

@seberg
Member

seberg commented Jan 3, 2019

There seem to be some similar odd casts noted in gh-5180; will close that one in favor of this one.

@eric-wieser
Member

eric-wieser commented Feb 18, 2019

This is a duplicate of #8546 (closed) and #7619

@gerritholl
Contributor

Is it better to return:

1. Only datetime64

2. Only rounded / truncated datetime

3. A mixture of datetime and int

4. A mixture of datetime64 and datetime

Right now, we do 3.

How about adding a keyword argument to .astype and related methods, such that the user can choose between those four options? I usually want a truncated datetime, but maybe sometimes a datetime64. I never want int.
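
Until something like that exists, the choice could be approximated in user code. The function below is a hypothetical sketch, not a NumPy API: the name datetime64_to_object and its mode argument are assumptions, and it only covers two of the four options (datetime64 only, and truncation to microseconds).

```python
import datetime

import numpy as np


def datetime64_to_object(arr, mode="datetime64"):
    """Hypothetical helper; `mode` mimics the proposed keyword argument."""
    if mode == "datetime64":
        # Building from a list of NumPy scalars with dtype=object keeps them.
        flat = np.array(list(arr.ravel()), dtype=object)
        return flat.reshape(arr.shape)
    if mode == "truncate":
        # Truncate to microseconds first, so that values within the range of
        # Python's datetime become datetime.datetime. Years outside that range
        # are not handled here.
        return arr.astype("M8[us]").astype(object)
    raise ValueError(f"unknown mode: {mode!r}")


arr = np.array([10 ** 18], dtype="M8[ns]")
print(datetime64_to_object(arr, "datetime64")[0])
print(datetime64_to_object(arr, "truncate")[0])
```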

@eraoul

eraoul commented Dec 2, 2019

There seems to be tons of discussion, but it's clear that x.astype(datetime) returning an integer instead of a datetime is a serious bug. I just wasted far too long tracking this down, and was completely shocked to realize that the type was wrong after ".astype(datetime)". An error message or options would be great.

@seberg
Member

seberg commented Dec 2, 2019

@eraoul yes, I think we should do something. An error, or at least a warning seems fair to me (it is within the casting machinery, so it may take a bit of care). PRs are welcome, I can give you pointers if you want to look into it.

@seberg added the "triage review" label (Issue/PR to be discussed at the next triage meeting) Dec 2, 2019
@h-vetinari
Contributor Author

I commented a bit on the sister issue #7619, suggesting that .item could also just return np.datetime64 instead of trying (and failing) to cast to datetime.datetime, whereas @eric-wieser floated the possibility of separating the (currently shared/similar) semantics of .astype(object) from those of .item.

I'm fine with either solution, but certainly, the astype-stuff is more pressing IMO.

@seberg added the "triaged" label (Issue/PR that was discussed in a triage meeting) and removed the "triage review" label (Issue/PR to be discussed at the next triage meeting) Dec 18, 2019
@seberg
Member

seberg commented Dec 18, 2019

It seems we converged to the following when talking about it today:

  1. We will deprecate any return which is not a python datetime
  2. It will give an error in the future.
  3. If someone has a use-case, we may try to provide a utility function.

Note that this would include the conversion of NaT to None. Do you think that this will help with the issue? Is the deprecation of conversion to None reasonable?

Please just comment on this, doing the actual change should be fairly straight forward.

EDIT: Marking with milestone to not forget it, please feel free to move when it comes to it.
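
The intended end state (only Python datetimes, otherwise an error) can be approximated in user code. This is a hedged sketch of that behaviour, not the planned NumPy implementation; the name to_pydatetime_strict is hypothetical, and the choice of ValueError is an assumption.

```python
import datetime

import numpy as np


def to_pydatetime_strict(arr):
    """Hypothetical utility: convert to Python date/datetime objects or raise."""
    out = arr.astype(object)
    for value in out.ravel():
        # datetime.datetime is a subclass of datetime.date, so this accepts
        # both; anything else (an int, None for NaT, ...) is rejected.
        if not isinstance(value, datetime.date):
            raise ValueError(
                f"cannot represent {arr.dtype} values as Python datetimes "
                f"(got {type(value).__name__})"
            )
    return out


ok = to_pydatetime_strict(np.array([10 ** 10], dtype="M8[us]"))
print(ok[0])
```

Rejecting None here mirrors the question above about deprecating the NaT to None conversion; a variant that allows None would be an equally plausible design.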

@seberg added this to the 1.19.0 release milestone Dec 18, 2019
@h-vetinari
Contributor Author

@seberg
Thanks for tackling this. I dislike that .astype(object) will still cast np.datetime64 to python datetime at all, but I realize that providing a consistent and realistic path for migration is a challenge.

Still, I think that valid np.datetime64 that are "just" cast to object shouldn't raise an error, just because it's outside of what datetime.datetime can handle. The output there is already demonstrably wrong, so why not return np.datetime64 for those cases, and deprecate returning datetime.datetime?

h-vetinari added a commit to h-vetinari/numpy that referenced this issue Mar 27, 2021
This is to avoid implicitly casting e.g. datetime64 to (python-)datetimes when
using operations - like .astype(object) - that are expected not to change the
type. Fixes numpy#12550.

Co-Authored-By: Sebastian Berg <[email protected]>
@charris modified the milestones: 1.21.0 release, 1.22.0 release May 5, 2021
@charris
Member

charris commented May 5, 2021

@h-vetinari Are you still working on this?

@h-vetinari
Contributor Author

@h-vetinari Are you still working on this?

Hey @charris, thanks for checking in; fundamentally yes, I still want to fix it, but since I read somewhere that 1.21 branches in May, that's not gonna happen until then. #18683 is my first foray into the entrails of numpy, and that's a pretty large initial hurdle to overcome (and work/life hasn't been kind recently).

@rgommers
Member

Bumped to 1.23.0, doesn't seem blocking for 1.22.0

@jbrockmendel
Contributor

Started a branch with just the diff proposed by @seberg here, very similar to #18683.

Eventually ended up scaling it back to only try to change the cases that give integers. So the diff looks like

diff --git a/numpy/core/src/multiarray/arraytypes.c.src b/numpy/core/src/multiarray/arraytypes.c.src
index 71401c60e..733413edd 100644
--- a/numpy/core/src/multiarray/arraytypes.c.src
+++ b/numpy/core/src/multiarray/arraytypes.c.src
@@ -1024,6 +1024,11 @@ DATETIME_getitem(void *ip, void *vap)
         PyArray_DESCR(ap)->f->copyswap(&dt, ip, PyArray_ISBYTESWAPPED(ap), ap);
     }
 
+    // See https://github.com/numpy/numpy/issues/12550
+    if (PyLong_Check(convert_datetime_to_pyobject(dt, meta))) {
+        return PyArray_Scalar(ip, PyArray_DESCR((PyArrayObject *)vap), NULL);
+    }
+
     return convert_datetime_to_pyobject(dt, meta);
 }
 
@@ -1048,6 +1053,11 @@ TIMEDELTA_getitem(void *ip, void *vap)
         PyArray_DESCR(ap)->f->copyswap(&td, ip, PyArray_ISBYTESWAPPED(ap), ap);
     }
 
+    // See https://github.com/numpy/numpy/issues/12550
+    if (PyLong_Check(convert_timedelta_to_pyobject(td, meta))) {
+        return PyArray_Scalar(ip, PyArray_DESCR((PyArrayObject *)vap), NULL);
+    }
+
     return convert_timedelta_to_pyobject(td, meta);
 }

and added tests

diff --git a/numpy/core/tests/test_datetime.py b/numpy/core/tests/test_datetime.py
index baae77a35..4d44e7b1a 100644
--- a/numpy/core/tests/test_datetime.py
+++ b/numpy/core/tests/test_datetime.py
@@ -1557,6 +1557,37 @@ def test_hours(self):
         t[0] = 60*60*24 + 60*60*10
         assert_(t[0].item().hour == 10)
 
+    def test_astype_object_not_ints(self):
+        dts = np.ones(3, dtype=f'M8[ns]')
+        dts[0] = 60*60*24 + 60*60*10
+        assert isinstance(dts[0].item(), np.datetime64)
+
+        obj_dt = dts.astype(object)
+        assert isinstance(obj_dt[0], np.datetime64)
+        assert obj_dt[0].dtype == dts.dtype  # i.e. same resolution
+        assert isinstance(obj_dt[:1].item(), np.datetime64)
+
+        tds = dts.view(f"m8[ns]")
+        obj_td = tds.astype(object)
+        assert isinstance(obj_td[0], np.timedelta64)
+        assert obj_td[0].dtype == tds.dtype  # i.e. same resolution
+        assert isinstance(obj_td[:1].item(), np.timedelta64)
+
+    @pytest.mark.parametrize("unit", ["us", "ms", "s", "m", "h", "D", "W", "M", "Y"])
+    def test_astype_object_not_ints_oob(self, unit):
+        dts = np.array([11000], dtype="M8[Y]").astype(f"M8[{unit}]")
+
+        res = dts.astype(object)
+        assert isinstance(res[0], np.datetime64)
+        assert res[0].dtype == dts.dtype
+
+        tds = dts - np.datetime64(0, unit)
+        res = tds.astype(object)
+        if not isinstance(res[0], datetime.timedelta):
+            # we just want NOT an integer
+            assert isinstance(res[0], np.timedelta64)
+            assert res[0].dtype == tds.dtype

With this, everything in numpy/core/tests/test_datetime.py passes. But I still get a segfault in numpy/core/tests/test_array_coercion.py::TestTimeScalars::test_coercion_timedelta_convert_to_number

I also tried to just call convert_timedelta_to_pyobject/convert_datetime_to_pyobject once in each function but that gave compile-time errors that I punted on.

@h-vetinari think you can get this over the finish line?

@h-vetinari
Contributor Author

@jbrockmendel: @h-vetinari think you can get this over the finish line?

Started looking at this again. I don't have a compiler stack set up on the only machine I have easily available currently, so I'm mostly restricted to CI. Might set up a VM if it becomes necessary.

@seberg: some code in NumPy assumes that most types will return a Python scalar with arr.item() and similar. Datetimes (and float128) don't, so they need extra care sometimes.

I started digging a bit, but I'm not familiar at all with the numpy code-base. Any pointers on which pieces I should look at to avoid the kind of recursion you mentioned?

@seberg modified the milestones: 1.23.0 release, 1.24.0 release May 4, 2022
@seberg
Member

seberg commented May 4, 2022

Pushing this off again unfortunately since branching is getting close. Please ping to bump, although IIRC the whole thing was a bit tricky to fix unfortunately.

@seberg
Member

seberg commented Aug 7, 2024

I don't know if anyone has the cycles/enthusiasm to dive into this one again. But I fixed some infinite recursions here recently which may make this more feasible.
(Of course it is still a breaking change for things that know how to deal with Python datetimes but not NumPy ones and currently can work via datetime_arr + other_obj.)

@mattip removed this from the 2.2.0 release milestone Aug 21, 2024
9 participants