ENH: Move loadtxt to C for much better speed #20580

Merged on Feb 8, 2022 (70 commits)
Commits
1e15b43
ENH: Move npreadtext into NumPy for faster text reading
seberg Oct 20, 2021
db47a42
Fixup size_t's to (mostly?) use npy_intp to silence compiler warnings
seberg Dec 8, 2021
ff91f2b
MAINT: Remove float-to-int integer parsing fallback
seberg Dec 8, 2021
684cefc
ENH: Allow a single converter to be used for all columns
seberg Dec 8, 2021
b8c8240
TST: Fixup current loadtxt tests for changes
seberg Dec 13, 2021
7a42518
STY: Fix some style issues (mainly long lines)
seberg Dec 13, 2021
07389a7
MAINT: Make text reader custom mem handler compatible
seberg Dec 13, 2021
0a636c4
MAINT: Address Tylers review comments
seberg Jan 6, 2022
6bf1b21
BUG: Fix skiprows handling and simplify lineskipping logic
seberg Jan 7, 2022
37523dc
ENH: Raise an error for (most) stray newline characters
seberg Jan 8, 2022
3f2b8d3
ENH: Reject empty string as control character
seberg Jan 8, 2022
66a61b0
Port over tests from npreadtext test suite
rossbar Jan 6, 2022
10a90f0
TST: Small fixups for tests to make sure they pass again
seberg Jan 8, 2022
2a0a4f4
TST: Fix test to align with stricter integer parsing
seberg Jan 8, 2022
1270a17
BUG: Add missing quote copy
seberg Jan 10, 2022
b670ff7
Rename quotechar param and update docstring.
rossbar Jan 10, 2022
bbf14c0
TST: Add tests for quote character support.
rossbar Jan 10, 2022
942d4f8
Add test to check quoting support disabled by default.
rossbar Jan 10, 2022
2912231
Add tests for quote+multichar comments.
rossbar Jan 10, 2022
ff5eb64
TST: structured dtype w/ quotes.
rossbar Jan 10, 2022
1489805
Add tests for empty quotes and escaped quotechars.
rossbar Jan 10, 2022
156964d
rm incorrect comment.
rossbar Jan 10, 2022
6d116b4
DOC: Add release notes for loadtxt changes
seberg Jan 11, 2022
ad0a8e4
MAINT: Replace last uses of raw malloc with PyMem_MALLOC
seberg Jan 11, 2022
4ca4c1a
MAINT: Fixup include guards and use NPY_NO_EXPORT (or static) throughout
seberg Jan 11, 2022
10b04d6
MAINT: Add sanity check to ensure usecols is correct.
seberg Jan 11, 2022
530c954
Add UserWarning when reading no data.
rossbar Jan 11, 2022
3ca9f5a
Add warning on empty file + tests.
rossbar Jan 11, 2022
e1f7ad1
BUG: Fix complex parser and add tests for whitespace and failure paths
seberg Jan 11, 2022
5692292
BUG,TST: Add test for huge-float buffer path and ensure error return
seberg Jan 11, 2022
fac9134
TST: Add test to cover copyswap (byte-swap and unaligned)
seberg Jan 11, 2022
e0e3a72
DOC: See if adding a newline fixes release note rendering
seberg Jan 11, 2022
e4d0e60
BUG: Fix some issues found by a valgrind run
seberg Jan 12, 2022
334470e
BUG: Fix growing when NPY_RELAXED_STRIDES_DEBUG=1 is used
seberg Jan 12, 2022
e2d9f6b
MAINT: Move usecol handling to C and support more than integer cols
seberg Jan 13, 2022
c000c1e
BUG: Make sure num-fields is intp/ssize_t compatible
seberg Jan 13, 2022
cc2c582
BUG: Ensure current num fields holds enough space for ultra-wide columns
seberg Jan 13, 2022
da00bf4
ENH: Give warning for empty-lines not counting towards max-rows
seberg Jan 13, 2022
08fa5ce
MAINT: Small cleanup, use FINLINE for int parsers
seberg Jan 14, 2022
73940d6
MAINT,TST,BUG: Simplify streamer init, fix issues, and add tests
seberg Jan 14, 2022
d2473c0
TST,BUG: Additional bad-file-like test, add missing error path free
seberg Jan 14, 2022
245af22
TST,MAINT: New tests, byteswap cleanups and fixed assert
seberg Jan 14, 2022
cc67c19
TST: Improve test coverage, replace impossible error with assert
seberg Jan 14, 2022
6e67e17
MAINT: Remove unused/unnecessary allow-embedded-newlines option
seberg Jan 14, 2022
d58d361
TST: Add test for hard/impossible to reach universal-newline support …
seberg Jan 14, 2022
eb68e87
MAINT: Use skiplines rather than skiprows internally throughout
seberg Jan 14, 2022
4626931
MAINT: Very minor style cleanups (mostly)
seberg Jan 14, 2022
0cb6bdc
MAINT: Only allocate converters if necessary
seberg Jan 14, 2022
90c71f0
TST: Move most new loadtxt tests to its own file
seberg Jan 15, 2022
1e6b72b
TST,STY: Add small additional tests for converters/usecols
seberg Jan 15, 2022
14cd1bb
DOC: Remove outdated loadtxt TODOs from code
seberg Jan 15, 2022
4f3b3d2
BUG: Fix loadtxt no data warning stacklevel
seberg Jan 15, 2022
3e0d432
TST,BUG: Fortify byteswapping tests and make a small fix
seberg Jan 15, 2022
ecff02c
Update and add converters examples.
rossbar Jan 12, 2022
e15d853
Add quotechar to examples.
rossbar Jan 12, 2022
9f9d755
ENH: Give a clear error when control characters match/are newlines
seberg Jan 19, 2022
5d98d67
TST: Skip unparsable field error tests on PyPy
seberg Jan 19, 2022
9ef1a61
TST: Use hand-picked values for byte-swapping tests
seberg Jan 19, 2022
dfc8989
TST: Catch two more errors that runs into the PyPy issue
seberg Jan 19, 2022
763a3d4
TST: Use repr in byteswapping tests
seberg Jan 19, 2022
b335431
TST: Some tests for control character collisions.
rossbar Jan 28, 2022
0576327
Add test for datetime parametric unit discovery.
rossbar Jan 28, 2022
370792b
Add test for unicode, parametrize for chunksize.
rossbar Jan 28, 2022
8a31abc
Add test for empty string as control characters.
rossbar Jan 28, 2022
0ee03c8
Add test for str dtype len discovery with converters.
rossbar Jan 28, 2022
bab8610
Handle delimiter as bytes.
rossbar Jan 28, 2022
5332a41
Linting.
rossbar Jan 28, 2022
59c2084
TST: Fix exception msg matching in tests.
rossbar Jan 28, 2022
a756bfb
TST: Skip error test using on PyPy (test uses %.100R)
seberg Jan 30, 2022
ef7492c
Add two new examples of converters to docstring examples
rossbar Feb 7, 2022
33 changes: 33 additions & 0 deletions doc/release/upcoming_changes/20580.compatibility.rst
@@ -0,0 +1,33 @@
``np.loadtxt`` has received several changes
-------------------------------------------

The row counting of `numpy.loadtxt` was fixed. ``loadtxt`` ignores fully
empty lines in the file, but counted them towards ``max_rows``.
When ``max_rows`` is used and the file contains empty lines, these will now
not be counted. Previously, it was possible that the result contained fewer
than ``max_rows`` rows even though more data was available to be read.
If the old behaviour is required, ``itertools.islice`` may be used::

    import itertools
    lines = itertools.islice(open("file"), 0, max_rows)
    result = np.loadtxt(lines, ...)
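
The row-counting change can be illustrated with a small in-memory input. This is a minimal sketch, assuming a NumPy version that includes this change; the sample text and variable names are hypothetical, and warnings are silenced because NumPy may warn that an empty line is not counted towards ``max_rows``:

```python
import io
import itertools
import warnings

import numpy as np

# Hypothetical in-memory "file": three data rows and one fully empty line.
TEXT = "1 2\n\n3 4\n5 6\n"

with warnings.catch_warnings():
    # NumPy may warn that the empty line does not count towards max_rows.
    warnings.simplefilter("ignore")
    # New behaviour: the empty line is skipped, so two data rows are returned.
    new_rows = np.loadtxt(io.StringIO(TEXT), max_rows=2)

# Old behaviour, emulated: limit by physical lines, so the empty line
# uses up one of the two slots and only one data row is read.
first_lines = itertools.islice(io.StringIO(TEXT), 0, 2)
old_rows = np.loadtxt(first_lines, ndmin=2)
```

Here ``new_rows`` holds the first two data rows, while ``old_rows`` holds only the first one.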

While generally much faster and improved, `numpy.loadtxt` may now fail to
convert certain strings to numbers that were previously successfully read.
The most important cases for this are:

* Parsing floating point values such as ``1.0`` into integers will now fail
* Parsing hexadecimal floats such as ``0x3p3`` will fail
* An ``_`` was previously accepted as a thousands delimiter, as in ``100_000``.
  This will now result in an error.

If you experience these limitations, they can all be worked around by passing
appropriate ``converters=``. NumPy now supports passing a single converter
to be used for all columns to make this more convenient.
For example, ``converters=float.fromhex`` can read hexadecimal float numbers
and ``converters=int`` will be able to read ``100_000``.
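
A minimal sketch of the two workarounds named above, assuming a NumPy version that includes this change. The sample inputs are hypothetical. ``encoding`` is passed explicitly as an assumption, so that the converter receives ``str`` rather than ``bytes`` on versions where the default encoding is ``'bytes'``:

```python
import io

import numpy as np

# Hexadecimal floats: a single callable is applied to every column.
hex_text = io.StringIO("0x3p3 0x1p-1\n0x1.8p1 0x0p0\n")
hex_values = np.loadtxt(hex_text, converters=float.fromhex, encoding="utf-8")

# Underscore separators: Python's int() accepts "100_000".
int_text = io.StringIO("100_000 2_500\n7 42\n")
int_values = np.loadtxt(int_text, dtype=int, converters=int, encoding="utf-8")
```

Going through a Python converter for every field is likely slower than the default C parsers, so it is probably best reserved for inputs that need it.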

Further, the error messages have been generally improved. However, this means
that error types may differ. In particular, a ``ValueError`` is now always
raised when parsing of a single entry fails.
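
The error behaviour can be sketched as follows, assuming a NumPy version that includes this change. The helper ``load_or_error`` is hypothetical and exists only for illustration; the exact error message is not relied upon:

```python
import io

import numpy as np


def load_or_error(text):
    """Return the parsed array, or the ValueError raised while parsing."""
    try:
        return np.loadtxt(io.StringIO(text))
    except ValueError as exc:
        return exc


good = load_or_error("1.0 2.0\n")   # parses normally
bad = load_or_error("1.0 0x3p3\n")  # hexadecimal float without a converter
```

Code that previously caught a different exception type around ``loadtxt`` may need to catch ``ValueError`` instead.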

8 changes: 8 additions & 0 deletions doc/release/upcoming_changes/20580.new_feature.rst
@@ -0,0 +1,8 @@
``np.loadtxt`` now supports quote character and single converter function
-------------------------------------------------------------------------
`numpy.loadtxt` now supports an additional ``quotechar`` keyword argument
which is not set by default. Using ``quotechar='"'`` will read quoted fields
as used by the Excel CSV dialect.

Further, it is now possible to pass a single callable rather than a dictionary
for the ``converters`` argument.
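
A short sketch of quoted-field reading, assuming a NumPy version that includes this change. The sample data and the structured dtype are hypothetical. Inside the quotes, the delimiter and the comment character are treated as ordinary text:

```python
import io

import numpy as np

# Each label is quoted and contains both a comma and a "#".
text = io.StringIO('"alpha, #42", 10.0\n"beta, #64", 2.0\n')
record_dtype = np.dtype([("label", "U12"), ("value", float)])

records = np.loadtxt(text, dtype=record_dtype, delimiter=",", quotechar='"')
```

Without ``quotechar``, the same input would be split at the comma inside each label, since quoting support is not enabled by default.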
4 changes: 4 additions & 0 deletions doc/release/upcoming_changes/20580.performance.rst
@@ -0,0 +1,4 @@
Faster ``np.loadtxt``
---------------------
`numpy.loadtxt` is now generally much faster than previously as most of it
is now implemented in C.
9 changes: 9 additions & 0 deletions numpy/core/setup.py
@@ -868,6 +868,7 @@ def gl_if_msvc(build_cmd):
join('src', 'multiarray', 'typeinfo.h'),
join('src', 'multiarray', 'usertypes.h'),
join('src', 'multiarray', 'vdot.h'),
join('src', 'multiarray', 'textreading', 'readtext.h'),
join('include', 'numpy', 'arrayobject.h'),
join('include', 'numpy', '_neighborhood_iterator_imp.h'),
join('include', 'numpy', 'npy_endian.h'),
@@ -955,6 +956,14 @@ def gl_if_msvc(build_cmd):
join('src', 'npysort', 'selection.c.src'),
join('src', 'common', 'npy_binsearch.h'),
join('src', 'npysort', 'binsearch.cpp'),
join('src', 'multiarray', 'textreading', 'conversions.c'),
join('src', 'multiarray', 'textreading', 'field_types.c'),
join('src', 'multiarray', 'textreading', 'growth.c'),
join('src', 'multiarray', 'textreading', 'readtext.c'),
join('src', 'multiarray', 'textreading', 'rows.c'),
join('src', 'multiarray', 'textreading', 'stream_pyobject.c'),
join('src', 'multiarray', 'textreading', 'str_to_int.c'),
join('src', 'multiarray', 'textreading', 'tokenize.c.src'),
]

#######################################################################
11 changes: 11 additions & 0 deletions numpy/core/src/multiarray/conversion_utils.c
@@ -993,6 +993,17 @@ PyArray_PyIntAsIntp(PyObject *o)
}


NPY_NO_EXPORT int
PyArray_IntpFromPyIntConverter(PyObject *o, npy_intp *val)
{
    *val = PyArray_PyIntAsIntp(o);
    if (error_converting(*val)) {
        return NPY_FAIL;
    }
    return NPY_SUCCEED;
}


/*
* PyArray_IntpFromIndexSequence
* Returns the number of dimensions or -1 if an error occurred.
3 changes: 3 additions & 0 deletions numpy/core/src/multiarray/conversion_utils.h
@@ -6,6 +6,9 @@
NPY_NO_EXPORT int
PyArray_IntpConverter(PyObject *obj, PyArray_Dims *seq);

NPY_NO_EXPORT int
PyArray_IntpFromPyIntConverter(PyObject *o, npy_intp *val);

NPY_NO_EXPORT int
PyArray_OptionalIntpConverter(PyObject *obj, PyArray_Dims *seq);

3 changes: 3 additions & 0 deletions numpy/core/src/multiarray/multiarraymodule.c
@@ -69,6 +69,7 @@ NPY_NO_EXPORT int NPY_NUMUSERTYPES = 0;

#include "get_attr_string.h"
#include "experimental_public_dtype_api.h" /* _get_experimental_dtype_api */
#include "textreading/readtext.h" /* _readtext_from_file_object */

#include "npy_dlpack.h"

@@ -4456,6 +4457,8 @@ static struct PyMethodDef array_module_methods[] = {
METH_VARARGS | METH_KEYWORDS, NULL},
{"_get_experimental_dtype_api", (PyCFunction)_get_experimental_dtype_api,
METH_O, NULL},
{"_load_from_filelike", (PyCFunction)_load_from_filelike,
METH_FASTCALL | METH_KEYWORDS, NULL},
/* from umath */
{"frompyfunc",
(PyCFunction) ufunc_frompyfunc,