forked from tidyverse/vroom
Description
Problem
In non-ALTREP mode (immediate materialization), vroom extracts fields one at a time:
```cpp
for (size_t row = 0; row < num_rows; ++row) {
  string val = idx.get(row, col);  // Individual call per row
  result[row] = Rf_mkCharLen(val.data(), val.size());
}
```

This costs one function call per field and rules out optimizations that operate on a whole column at once.
Proposed Solution
Add a bulk extraction API for entire columns:
```cpp
// Instead of N individual get() calls:
std::vector<std::string_view> extract_column(size_t col) const {
  std::vector<std::string_view> result;
  result.reserve(num_rows());
  // Single pass through column data
  for (size_t row = 0; row < num_rows(); ++row) {
    auto span = idx_.get_field(row, col);
    result.emplace_back(buffer_ + span.start, span.length());
  }
  return result;
}
```

Benefits
- Pre-allocation of output vector
- Sequential memory access pattern
- Potential SIMD acceleration for field extraction
- Reduced function call overhead
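To make the pre-allocation and sequential-access points concrete, here is a minimal self-contained sketch. `FieldSpan` and `FlatIndex` are hypothetical stand-ins, not vroom or libvroom types, and the row-major span layout is an assumption about what a flat index might look like.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical byte range of one field inside the file buffer.
struct FieldSpan {
  std::size_t start;
  std::size_t end;  // one past the last byte of the field
  std::size_t length() const { return end - start; }
};

// Hypothetical flat index: one span per (row, col), stored row-major.
class FlatIndex {
public:
  FlatIndex(const char* buffer, std::size_t num_cols,
            std::vector<FieldSpan> spans)
      : buffer_(buffer), num_cols_(num_cols), spans_(std::move(spans)) {
    assert(num_cols_ > 0 && spans_.size() % num_cols_ == 0);
  }

  std::size_t num_rows() const { return spans_.size() / num_cols_; }

  // Per-field access, one call per row as in the loop under "Problem".
  std::string_view get(std::size_t row, std::size_t col) const {
    const FieldSpan& s = spans_[row * num_cols_ + col];
    return std::string_view(buffer_ + s.start, s.length());
  }

  // Bulk access: one allocation, then a fixed-stride walk over the spans.
  std::vector<std::string_view> extract_column(std::size_t col) const {
    assert(col < num_cols_);
    const std::size_t rows = num_rows();
    std::vector<std::string_view> out;
    out.reserve(rows);
    std::size_t i = col;  // index of this column's span in row 0
    for (std::size_t row = 0; row < rows; ++row, i += num_cols_) {
      const FieldSpan& s = spans_[i];
      out.emplace_back(buffer_ + s.start, s.length());
    }
    return out;
  }

private:
  const char* buffer_;
  std::size_t num_cols_;
  std::vector<FieldSpan> spans_;
};
```

The returned views point into the file buffer, so they stay valid only while that buffer is alive. A caller would convert them to R strings in a second loop.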
Implementation Notes
- Only beneficial for non-ALTREP mode where entire columns are materialized upfront
- Should be opt-in via a flag or detected automatically based on access pattern
- Works well with the flat index design (libvroom#590, "Implement flat index array for O(1) field access")
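One possible shape for the opt-in flag and the automatic detection is sketched below. Every name here (`ExtractMode`, `AccessTracker`, `use_bulk_extraction`) is hypothetical, the 1024-read threshold is an arbitrary placeholder, and treating a front-to-back scan of an ALTREP column as a trigger for bulk extraction is only one reading of "detected automatically based on access pattern".

```cpp
#include <cstddef>

// Hypothetical user-facing switch: explicit choice, or let the reader decide.
enum class ExtractMode { Auto, PerField, Bulk };

// Hypothetical tracker: records whether a column has been read strictly
// front to back (row 0, 1, 2, ...) so far.
class AccessTracker {
public:
  void record(std::size_t row) {
    if (row != next_expected_) sequential_ = false;
    next_expected_ = row + 1;
    ++reads_;
  }

  bool looks_like_full_scan(std::size_t min_reads) const {
    return sequential_ && reads_ >= min_reads;
  }

private:
  std::size_t next_expected_ = 0;
  std::size_t reads_ = 0;
  bool sequential_ = true;
};

// Decide whether to use bulk column extraction.
inline bool use_bulk_extraction(ExtractMode mode, bool altrep,
                                const AccessTracker& tracker) {
  if (mode == ExtractMode::Bulk) return true;       // explicit opt-in
  if (mode == ExtractMode::PerField) return false;  // explicit opt-out
  if (!altrep) return true;  // whole column is materialized upfront anyway
  constexpr std::size_t kMinReads = 1024;  // arbitrary placeholder threshold
  return tracker.looks_like_full_scan(kMinReads);
}
```

An explicit flag always wins in this sketch, so the heuristic only applies when the caller leaves the mode on `Auto`.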
Related
- Depends on the libvroom flat index implementation (libvroom#590, "Implement flat index array for O(1) field access")
- Complements direct type parsing optimization