Shortcut from Read XLS[X|B]
Reading both XLSX and XLSB files, fast and memory-safe, with Python, into PyArrow.
This module provides one function: read for reading both .xlsx and .xlsb files.
import polars as pl
import pandas as pd
from rxls.reader import read
polars_df = pl.from_arrow(read(some_file, header=True))
pandas_df = read(some_file, header=True).to_pandas()Some file has no header:
from rxls.reader import read
table = read(some_file, header=0) # produce `Unnamed: 0` ... `Unnamed: N` columns
table = read(some_file, header=['a', 'b', 'c']) # produce table with columns `a`, `b`, `c`, and raise an exception, when column count != 3.Header is placed in unknown row (but we know, it is in top-30 and one cell of it contains word
timestamp)
from rxls.reader import read
table = read(some_file, header=True, lookup_head='^timestamp$') # First row of header must contain cell with word `timestamp`Some rows in column
timestampcontains unwanted cells (with subscripts, totals, or something else), but columnrow_iddoesn't:
from rxls.reader import read
table = read(some_file, header=True, row_filters='^row_id$') # After `header`, will be only rows, that contains non-null value in column `row_id`Some rows in column
timestampcontains unwanted cells, columnrow_iddoesn't, but we will ensure, that table has also non-null columnclient:
from rxls.reader import read
table = read(some_file, header=True, row_filters=['^row_id$', '^client$']) # After `header`, will be only rows, that contains non-null value in columns `row_id` AND `client`Some rows in a sheet contains unwanted cells, but we know, that
row_idcolumn cells are not null, orclientcolumn cells are not null either:
from rxls.reader import read
table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or') # After `header`, will be only rows, that contains non-null value in columns `row_id` OR `client`For tracking rows events:
from rxls.reader import read
from tqdm import tqdm
with tqdm(desc='Reading some file', unit=' rows') as tq:
table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or', row_callback=tq.update)RXLS uses only TWO datatypes for cells of excel files: NUMERIC and STRING. It is because all other datatypes, as I discovered, are just styles, applied to them (named representation in RXLS).
So, when MS Excel show us a date (or time, or timestamp, or duration interval), it's actually a floating-point number "under the hood". This number, when is between [0.0, 1.0), presents part of day (0.5 = 12 hours, 0.125 = 3 hours and vice versa), which useful for time and duration datatypes, and, when bigger or equals to 1, presents days since 1900 with part of day. For date - decimals will be zero. All temporal cells (date/time, datetime, timedelta) will be converted to timestamp[ms] during prepare step. This behaviour may be overriden with dtypes argument - after all, pyarrow's timestamp[ms] datatype may be convert to date64 or time32['ms'] instantly and without any problems. duration datatype in Excel files, usually, not presented, and pyarrow cannot cast timestamp['ms'] to duration['ms']. For such conversion, you can extract timestamp column and convert it as:
import pyarrow as pa
column_name = 'duration_column'
table = read(some_file, header=True)
table.set_column(
next(i for i, x in enumerate(table.column_names) if x == column_name),
pa.field(column_name, pa.duration('ms')),
table[column_name].combine_chunks().view(pa.duration('ms')) # original dtype is pyarrow.timestamp['ms'], which is binary equivalent for pyarrow.duration['ms']
)Strings most common representation is 4-byte little-endian integer (in xlsb) or some number (in xlsx). This number is an index of shared string in special part of Excel file (xl/sharedStrings.[xml|bin]). This special part will be scanned on each read call, before all other steps.
Keeping all this in mind, I've implemented class chunk for keeping original data as-is, until it should be accessed during final table creation. Chunks, actually, are column parts, and contains non-null values with same type and representation. Chunk may be of NULL type, but it contains just count of null values between not-null chunks.
Original data will be contained inside chunk as-is, intact until prepare step:
numpy.NDArray[numpy.float64]- simple numeric values (with optionaltemporalflag)numpy.NDArray[numpy.uint32]- forXLSBRkNumber-formatted floating-poing or integer values.pyarrow.LargeStringArray- forXLSXandXLSBstrings, that are not inshared, but presented as-is.pyarrow.UInt64Array- forXLSXandXLSBshared strings indices.int- fornullchunks: count of null values
RXLS can skip entire chunks (or their parts) before prepare step, when they should be skipped (see lookup_head and row_filters arguments descriptions) - they will never be accessed afterwards.
RXLS performs two main steps for a sheet of Excel file: reading and prepare.
On reading step, no computations will be performed - it's just filling xl_series objects by chunks, keeping original data.
On prepare step, unwanted chunks will be dropped, and others will be prepared as:
numpy.NDArray[numpy.uint32]->numpy.NDArray[numpy.float64]- when numeric chunk hasrknubmerflag.numpy.NDArray[numpy.float64]->numpy.NDArray[numpy.int64]- when numeric chunk hastemporalflag. count of days with decimal part since 1900 (Windows) -> milliseconds since 1970-01-01 (UNIX timestamp)pyarrow.UInt64Array->pyarrow.LargeStringArray- shared strings indices to actual data fromshared.int->pyarrow.NullArrayorpyarrow.Array[T, pyarrow.Scalar[T]]if column has chunks of another common datatypeT.
These operations were totally optimized (and may performs much faster with numba along with (or without) tbb - see Dependencies section).
Also, about lookup_head and header with height > 1.
If header has more than 1 row, every column will fill it's topmost empty cells by use corresponding non-empty cell values from left neighbour. All other cells will be leave as is. After this, non-empty cells will be concatenated from top to down with ', ' separator. These operations performs for each column from left to right.
So, RXLS features are:
- Lazy and fast column-wise computations
- Smart header searching and row filtering
- Keeping datatypes and resolve conflicts.
- At the moment, RXLS just skip empty column (which has no header, nor any data).
- Positional only:
- file: path to the file, or BytesIO with file.
- sheet: index or name of sheet for reading (default:
0)
- Positional:
- header:
bool:Truewhen header is presented in the sheet and have one row heightint:0when no header is presented in the sheet,Nfor header ofNrows height.list[str]: when no header is presented, but we are know, what it should be.
- header:
- Keyword-only
- dtypes: Specify datatypes for columns.
Sequence[pyarrow.DataType]: when we know dtype for each non-empty column.dict[str, pyarrow.DataType]: when we will override dtype for some columns.pyarrow.DataType: when all columns must be of this dtype.None: for original dtypes of columns.
- skip_cols: Skip some columns by their
0-based indices (A == 0). Performs onreadingstep. - skip_rows: Skip some rows on the top of the file. Performs on
readingstep. - skip_rows_after_header: Skip some rows after
header. Performs onpreparestep. - take_rows: Stop reading after this row (
0-based). Performs onreadingstep. - take_rows_non_empty: Leave empty rows in resulting table. Performs on
readingstep. - lookup_head: Regular expression for smart-search of the first row of
header, orcolumnindex, where first non-empty cell is the top-level cell ofheader. - lookup_size: Count of rows to perform lookup when searching for
header. Note: RXLS will raise an exception, iflookup_headwith thislookup_sizeis failed. - row_filters: Regular expression(s) for columns, which content determines empty and non-empty rows.
- row_filters_strategy: Boolean operator(s) for situations with two or more columns in
row_filters. - float_precision. All numeric values in MS Excel are
floating-pointunder the hood, so, when rounding wholefloat64column to this precision gives equal result to just truncate decimals, this column will be converted toint64. - datetime_formats: One or more formats, which may appears in columns with
conflicts. - conflict_resolve: When column contains two or more datatypes, this is a
conflict. When conflict cannot be resolved, whole column will be convert toutf-8. Conflicts may be resolved as:no: All parts in columns withconflictswill be convert toutf8.temporal: Try to convert non-temporal parts of column with some temporal parts to temporal (float64->timestampandutf8->timestamp(using default formats (ISO 8601), or as specified indatetime_formats)numeric: Try to convert non-numeric parts of column with some numeric parts to numeric (utf8->float64).all: Use both strategies to resolve conflicts. When some parts of column is temporal, try to convert all other parts to temporal (also enable two-step string converting:utf8->float64->timestamp)
- utf8_type_infers:
(WIP)When resulting column isutf-8and all non-null cells of it passes regular expression ofnumericvalues, convert it tofloat64(and, maybe, toint64after). - null_values: Advanced list of values, that should be skipped on
readingstep (orcallablepredicate for them). - row_callback: Any callable, which may be called without arguments on each row event. Useful for progress tracking.
- dtypes: Specify datatypes for columns.
- pyarrow>=
14.0.2 - numpy>=
1.24.4 - recordclass>=
0.21.1 - typing_extensions>=
4.9.0
- numba>=
0.58.1(increase import time, but reading speed also increases x3/x4 and up) - tbb>=
2021.11.0(only for numba - additional performance gain) - polars>=
0.20.4(if needs to parse timestamps with milliseconds/microseconds/nanoseconds or AM/PM with timezone) - pandas>=
2.0.3(for pyarrowto_pandasfunctionality) - tqdm>=
4.66.1(fast progress tracking)