Description
Background
Users do not notice that once a load package gets into a pipeline, it is stuck there until it is loaded or discarded. In a notebook environment it seems more intuitive that pending packages are deleted after an exception. Here we do not change this behavior, but we make it really obvious what is going on.
Requirements
PR 1:
1. Our primary warning should be improved (let's do it, it is cheap). Right now we warn only if a package is stuck in the load step. Let's warn for all pending packages, so this:
```python
# normalize and load pending data
if self.list_extracted_load_packages():
    self.normalize()
if self.list_normalized_load_packages():
    # if there were any pending loads, load them and **exit**
    if data is not None:
        logger.warn(
            "The pipeline `run` method will now load the pending load packages. The data"
            " you passed to the run function will not be loaded. In order to do that you"
            " must run the pipeline again"
        )
    return self.load(destination, dataset_name, credentials=credentials)
```

should instead use `self.has_pending_data` to issue the warning and get into the execution branch above, as sketched below.
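A minimal sketch of that change, assuming the surrounding `run` method stays as it is (the warning text and the leading newline are illustrative):

```python
# warn for all pending packages, not only those stuck in the load step
if self.has_pending_data:
    if data is not None:
        # a leading newline makes the warning start on its own line in notebooks
        logger.warn(
            "\nThe pipeline `run` method will now load the pending load packages."
            " The data you passed to the run function will not be loaded. In order"
            " to do that you must run the pipeline again"
        )
    # normalize whatever was extracted, then load the pending packages and **exit**
    if self.list_extracted_load_packages():
        self.normalize()
    return self.load(destination, dataset_name, credentials=credentials)
```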
2. In the load and normalize step exception handlers, check if there is any pending data and warn the user in the error message that this happened (we are adding a similar check to the workspace dashboard). NOTE: the extract step will not create pending packages on failure!
3. Make sure this warning stands out: it should start on a separate line. Check how it looks in a notebook.
The warning should explain:
- that the package is left in the pipeline, that it will be loaded on the next run, and that any new data to extract will be ignored
- if there are any pending packages in the load state (`list_normalized_load_packages`) that are partially loaded (`is_package_partially_loaded`), warn that data in the destination was modified, and recommend retrying the load or manual pipeline inspection (look for `def _display_pending_packages()`, which shows this warning in the CLI). Explain how to remove pending packages: via the command line, `pipeline.drop_pending_packages()`, or `pipeline.drop()`, which resets the local working folder fully. A sketch of these checks and cleanup options follows below.
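A minimal sketch of the checks and cleanup options above, written from the caller's side; `my_pipeline` and `data` are hypothetical, and `PackageStorage.is_package_partially_loaded` is assumed to be usable the same way as in the CLI's `_display_pending_packages()`:

```python
import dlt
from dlt.common.storages.load_package import PackageStorage

pipeline = dlt.pipeline("my_pipeline", destination="duckdb")
data = [{"id": 1}]  # hypothetical payload

try:
    pipeline.run(data)
except Exception:
    if pipeline.has_pending_data:
        # the failed package stays in the pipeline and is retried on the next run;
        # any new data passed to `run` will be ignored until it is loaded or dropped
        for load_id in pipeline.list_normalized_load_packages():
            info = pipeline.get_load_package_info(load_id)
            if PackageStorage.is_package_partially_loaded(info):
                # some jobs completed: data in the destination was already modified
                print(f"package {load_id} is partially loaded: retry or inspect it")
        # cleanup options: drop only the pending packages...
        # pipeline.drop_pending_packages()
        # ...or reset the local working folder fully
        # pipeline.drop()
    raise
```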
PR 2:
If `refresh="drop_sources"` is used, we can actually drop all pending packages for the sources in the extract step. In that case:
- we skip checking for pending packages and proceed with extract
- we inspect schema names in the newly extracted packages
- we delete all old pending packages with schema names as above
- we can add this behavior to the documentation (usage sketch below)
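A minimal usage sketch from the user's side (`my_source` is a hypothetical source); with the proposed behavior, old pending packages sharing schema names with the newly extracted packages would be deleted instead of blocking the run:

```python
import dlt

@dlt.source
def my_source():  # hypothetical source, stands in for any real one
    @dlt.resource
    def items():
        yield {"id": 1}

    return items

pipeline = dlt.pipeline("refresh_demo", destination="duckdb")
# refresh="drop_sources" already resets state and destination tables for the
# extracted sources; the PR extends it to also delete matching pending packages
pipeline.run(my_source(), refresh="drop_sources")
```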
Root cause (original report):
dlt version
1.17.1
Describe the problem
I applied a schema contract to a source to prevent schema changes. Everything worked as expected while the contract was in place.
However, when I tried to change the contract to evolve the schema (for example, to add a new column), the pipeline failed with an error stating that the schema is frozen and cannot be modified.
Changing the schema contract setting from `schema_contract = "freeze"` to `schema_contract = "evolve"` does not resolve the issue: the schema remains locked, and the pipeline cannot evolve as expected.
In other words, once a schema contract is set to "freeze", there seems to be no way to unfreeze or modify the schema afterward.
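For reference, a minimal sketch of the contract change described above (`my_resource` and the new column are hypothetical); `schema_contract` is accepted on resources, sources, and `pipeline.run`:

```python
import dlt

@dlt.resource(schema_contract="evolve")  # earlier runs used "freeze"
def my_resource():
    yield {"id": 1, "new_col": "added after switching the contract"}

pipeline = dlt.pipeline("contract_demo", destination="duckdb")
# expected: the new column is accepted again; observed: the schema stays frozen
pipeline.run(my_resource())
```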
Expected behavior
After changing the configuration from `schema_contract = "freeze"` to `schema_contract = "evolve"`, the schema should unlock and allow controlled evolution again.
Steps to reproduce
Google Colab: https://colab.research.google.com/drive/1Z2SuA8sKyqzye3dcLq0Na2wS9-IdLTaT#scrollTo=5q82qCe4kYRH
Operating system
Linux
Runtime environment
Google Colab
Python version
3.12
dlt data source
No response
dlt destination
DuckDB
Other deployment details
No response
Additional information
No response