warn users when pending packages are processed #3227

@AstrakhantsevaAA

Description

Background
Users often do not notice that once a load package enters a pipeline, it stays there until it is loaded or discarded. In a notebook environment it seems more intuitive that pending packages are deleted after an exception. We do not change this behavior here, but we will make it obvious what is going on.

Requirements
PR 1:

  1. Our primary warning should be improved (let's do it, it is cheap). Right now we warn only if a package is stuck in the load step. Let's warn for all pending packages, so this code:
# normalize and load pending data
        if self.list_extracted_load_packages():
            self.normalize()
        if self.list_normalized_load_packages():
            # if there were any pending loads, load them and **exit**
            if data is not None:
                logger.warn(
                    "The pipeline `run` method will now load the pending load packages. The data"
                    " you passed to the run function will not be loaded. In order to do that you"
                    " must run the pipeline again"
                )
            return self.load(destination, dataset_name, credentials=credentials)

should instead use self.has_pending_data to issue the warning and enter the execution branch above.
  2. In the load and normalize step exception handlers, check whether there is any pending data and warn the user in the error message that this happened (we are adding a similar check to the workspace dashboard). NOTE: the extract step will not create pending packages on failure!
  3. Make sure this warning stands out: it should start on a separate line. Check how it looks in a notebook.
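
Requirements 1 and 2 can be sketched roughly as follows. This is a minimal illustration with a stand-in class, not the real pipeline code: only has_pending_data, normalize, load and run are names taken from this issue, while FakePipeline, run_step, PENDING_WARNING and the in-memory lists are hypothetical and exist only to make the control flow runnable.

```python
import logging

logger = logging.getLogger("pending_packages_sketch")

# Hypothetical wording. The leading newline makes the warning start on its own line.
PENDING_WARNING = (
    "\nWARNING: the pipeline holds pending load packages."
    " They are processed before any new data."
)


class FakePipeline:
    """Stand-in for the pipeline; keeps pending package ids in plain lists."""

    def __init__(self, extracted=(), normalized=()):
        self.extracted = list(extracted)    # waiting for the normalize step
        self.normalized = list(normalized)  # waiting for the load step

    @property
    def has_pending_data(self):
        # True when a package waits in the normalize or in the load step
        return bool(self.extracted or self.normalized)

    def normalize(self):
        # move every extracted package to the load step
        self.normalized.extend(self.extracted)
        self.extracted.clear()

    def load(self):
        # "load" everything that is normalized and return the loaded ids
        loaded, self.normalized = self.normalized, []
        return loaded

    def run(self, data=None):
        if self.has_pending_data:
            # warn for all pending packages, not only those stuck in the load step
            if data is not None:
                logger.warning(PENDING_WARNING)
            self.normalize()
            return self.load()  # exit without touching the new data
        # nothing pending: extract, normalize and load the new data
        self.extracted.append(f"package-for-{data}")
        self.normalize()
        return self.load()


def run_step(pipeline, step):
    """Run a normalize or load step; on failure, mention pending data in the error."""
    try:
        return step()
    except Exception as exc:
        if pipeline.has_pending_data:
            raise RuntimeError(f"{exc}{PENDING_WARNING}") from exc
        raise
```

In this sketch a pipeline with one extracted and one normalized package loads both on the next run and ignores the data argument, and a failing step re-raises with the warning appended only when something is actually pending.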

The warning should explain

  • that the package is left in the pipeline, that it will be loaded on the next run, and that any new data to extract will be ignored
  • if any pending packages in the load state (list_normalized_load_packages) are partially loaded (is_package_partially_loaded), that data in the destination was modified and that we recommend retrying the load or inspecting the pipeline manually (see def _display_pending_packages(), which shows this warning in the CLI)
  • how to remove pending packages: use the command line, pipeline.drop_pending_packages(), or pipeline.drop(), which fully resets the local working folder
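
One possible way to assemble such a message is sketched below. PendingPackage and build_pending_warning are hypothetical names introduced for illustration, and the exact wording is an assumption; only the three points above and the drop_pending_packages() / drop() hints come from this issue.

```python
from dataclasses import dataclass


@dataclass
class PendingPackage:
    """Hypothetical record describing one pending load package."""

    load_id: str
    step: str  # "normalize" or "load"
    partially_loaded: bool = False


def build_pending_warning(packages):
    """Compose the multi-line warning; returns "" when nothing is pending."""
    if not packages:
        return ""
    ids = ", ".join(p.load_id for p in packages)
    lines = [
        "",  # empty first element so the joined text starts with a newline
        f"WARNING: pending load packages remain in the pipeline: {ids}.",
        "They will be loaded on the next run and any new data to extract"
        " will be ignored.",
    ]
    # only packages in the load step can be partially loaded
    partial = [p.load_id for p in packages if p.step == "load" and p.partially_loaded]
    if partial:
        lines.append(
            "Partially loaded packages: " + ", ".join(partial) + "."
            " Data in the destination was modified; retry loading or inspect"
            " the pipeline manually."
        )
    lines.append(
        "To remove pending packages use the command line,"
        " `pipeline.drop_pending_packages()`, or `pipeline.drop()` to reset"
        " the local working folder fully."
    )
    return "\n".join(lines)
```

Building the text in one helper would let the run warning, the exception handlers and the dashboard share the same wording.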

PR 2:
If `refresh="drop_sources"` is used, we can actually drop all pending packages with sources from the extract step. In that case:

  • we skip checking for pending packages and proceed with extract
  • we inspect schema names in the newly extracted packages
  • we delete all old pending packages with schema names as above
  • we can add this behavior to the documentation
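
The selection logic of PR 2 could look roughly like the sketch below. It assumes, purely for illustration, that packages are represented as a mapping from load id to schema name; drop_superseded_packages is a hypothetical helper, not an existing dlt function.

```python
def drop_superseded_packages(pending, new_packages):
    """Return the old pending packages that survive a refresh run.

    Both arguments map a load id to the schema name of that package.
    Old packages whose schema name also appears in the newly extracted
    packages are dropped; all others are kept.
    """
    refreshed_schemas = set(new_packages.values())
    return {
        load_id: schema
        for load_id, schema in pending.items()
        if schema not in refreshed_schemas
    }


# usage: only the "github" schema was re-extracted, so "stripe" stays pending
remaining = drop_superseded_packages(
    {"100": "github", "101": "stripe"},
    {"200": "github"},
)
```

Matching on schema names keeps pending packages of unrelated sources intact, which is the scoping the bullets above describe.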

Root cause:

dlt version

1.17.1

Describe the problem

I applied a schema contract to a source to prevent schema changes. Everything worked as expected while the contract was in place.

However, when I tried to change the contract to evolve the schema (for example, to add a new column), the pipeline failed with an error stating that the schema is frozen and cannot be modified.

Changing the schema contract setting from

schema_contract = "freeze"

to

schema_contract = "evolve"

does not resolve the issue: the schema remains locked, and the pipeline cannot evolve as expected.

In other words, once a schema contract is set to "freeze", there seems to be no way to unfreeze or modify the schema afterward.

Expected behavior

After changing the configuration from schema_contract = "freeze" to schema_contract = "evolve", the schema should unlock and allow controlled evolution again.

Steps to reproduce

Google Colab: https://colab.research.google.com/drive/1Z2SuA8sKyqzye3dcLq0Na2wS9-IdLTaT#scrollTo=5q82qCe4kYRH

Operating system

Linux

Runtime environment

Google Colab

Python version

3.12

dlt data source

No response

dlt destination

DuckDB

Other deployment details

No response

Additional information

No response

Metadata

Assignees

No one assigned

    Labels

    QoL (Quality of Life: improve the developer experience), good first issue (Good for newcomers)

    Type

    No type

    Projects

    Status

    Planned

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
