212 changes: 212 additions & 0 deletions docs/website/docs/dlt-ecosystem/transformations/add-map.md
@@ -0,0 +1,212 @@
---
title: Transform data with `add_map`
description: Apply lightweight Python transformations to your data inline using `add_map`.
keywords: [add_map, transform data, remove columns]
---

`add_map` is a method in dlt used to apply custom logic to each data item after extraction. It is typically used to modify records **before** they continue through the pipeline or are loaded to the destination. Common examples include transforming, enriching, validating, cleaning, restructuring, or anonymizing data early in the pipeline.


## Method signature
### `add_map` method
```py
def add_map(
    item_map: ItemTransformFunc[TDataItem],
    insert_at: int = None
) -> TDltResourceImpl:
    ...
```

Use `add_map` to apply a function to each item extracted by a resource. It runs your logic on every record before it continues through the pipeline.

**Arguments**:

- `item_map`: A function that takes a single data item (and optionally metadata) and returns a modified item. If the resource yields a list, `add_map` applies the function to each item automatically.
- `insert_at` (optional): An integer that specifies where to insert the function in the pipeline. For example, if data yielding is at index 0, your transformation at index 1, and incremental processing at index 2, you should set `insert_at=1`.

This page covers how `add_map` works, where it fits in the pipeline, and how to use it in different scenarios.
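
As a minimal sketch of the call pattern (the resource and field names here are invented for illustration, not taken from a real source):

```py
import dlt

@dlt.resource
def customers():
    # A toy resource; add_map applies the function to each dict in the yielded list
    yield [{"id": 1, "country": "de"}, {"id": 2, "country": "fr"}]

def upper_country(item):
    # One item in, one (modified) item out
    item["country"] = item["country"].upper()
    return item

customers.add_map(upper_country)

for row in customers():
    print(row)
```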

## Related methods

In addition to `add_map`, dlt provides:

- **`add_filter`**: Excludes records based on a condition. Works like a filter function that removes items you don't want to load ([`resource.add_filter`](../../../api_reference/dlt/extract/resource#add_filter)).
- **`add_yield_map`**: Produces multiple outputs from a single input item. Returns an iterator instead of a single item ([`resource.add_yield_map`](../../../api_reference/dlt/extract/resource#add_yield_map)).
- **`add_limit`**: Limits the number of records processed by a resource. Useful for testing or reducing data volume during development ([`resource.add_limit`](../../../api_reference/dlt/extract/resource#add_limit)).

These methods help you control the shape and flow of data during transformation.
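
As a quick, hypothetical sketch of how these methods combine (the resource, field names, and thresholds are invented for illustration):

```py
import dlt

@dlt.resource
def orders():
    # Toy data; a real resource would yield items from an API or a database
    yield {"id": 1, "amount": 250}
    yield {"id": 2, "amount": 15}
    yield {"id": 3, "amount": 980}

# Keep only larger orders and limit how many items are pulled from the source during a test run
small_sample = orders().add_filter(lambda item: item["amount"] > 100).add_limit(2)

for row in small_sample:
    print(row)
```

Because each method returns the resource object, the calls can be chained in a single expression, the same pattern used with `add_map` later on this page.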

## `add_map` vs `@dlt.transformer`

dlt offers two primary ways to handle data operations during extraction:

- **`add_map`**: Ideal for simple, item-level operations within a single resource. Typical examples include masking sensitive fields, formatting dates, or computing additional fields for each record individually as it's extracted.
- **`@dlt.transformer`**: Defines a separate transformer resource that enriches or transforms data during extraction, often by fetching related information concurrently or performing complex operations involving other endpoints or external APIs.

If your needs are straightforward and focused on single-record modifications, `add_map` is usually the simplest, most convenient, and most efficient choice.

## Common use cases for `add_map`

- **Data cleaning and enrichment:**
Modify fields in each record as they are pulled from the source. For example, standardize date formats, compute new fields, or enrich records with additional info.

- **Anonymizing or masking sensitive data:**
Before data is loaded into your warehouse, you might want to pseudonymize personally identifiable information (PII), for example for GDPR compliance. Read the docs here: [Pseudonymizing columns.](../../general-usage/customising-pipelines/pseudonymizing_columns)

- **Removing or renaming fields:**
If certain fields from the source are not needed or should have different names, you can modify the record dictionaries in-place. Please find the docs here:

- [Removing columns.](../../general-usage/customising-pipelines/removing_columns)
- [Renaming columns.](../../general-usage/customising-pipelines/renaming_columns)
- **Incremental loading:**
When using incremental loading, you may need to adjust records before the incremental logic runs. This includes filling in missing timestamp or ID fields used as cursors, or dropping records that don’t meet criteria. The `add_map` function with the `insert_at` parameter lets you run these transformations at the right stage in the pipeline.


## Controlling transformation order with `insert_at`

dlt pipelines execute in multiple stages. For example, data is typically yielded at step index `0`, transformations like `add_map` are applied at index `1`, and incremental processing occurs in subsequent steps.

To ensure your transformations are applied before the incremental logic kicks in, it’s important to control the execution order using the `insert_at` parameter of the `add_map` function. This parameter lets you define exactly where your transformation logic is inserted within the pipeline.

```py
import dlt
import hashlib

@dlt.resource
def user_data():
    yield {"id": 1, "first_name": "John", "last_name": "Doe", "email": "[email protected]"}
    yield {"id": 2, "first_name": "Jane", "last_name": "Smith", "email": "[email protected]"}

# First transformation: mask email addresses
def mask_email(record):
    record["email"] = hashlib.sha256(record["email"].encode()).hexdigest()
    return record

# Second transformation: enrich with full name
def enrich_full_name(record):
    record["full_name"] = f"{record['first_name']} {record['last_name']}"
    return record

# Attach transformations, explicitly controlling their order
transformed_users = (
    user_data()
    .add_map(enrich_full_name)  # By default, this would be at the end
    .add_map(mask_email, insert_at=1)  # Explicitly run masking first, right after extraction
)

# Verify the transformed data
for user in transformed_users:
    print(user)
```

**Expected output**

```py
{'id': 1, 'first_name': 'John', 'last_name': 'Doe', 'email': '<hashed_value>', 'full_name': 'John Doe'}
{'id': 2, 'first_name': 'Jane', 'last_name': 'Smith', 'email': '<hashed_value>', 'full_name': 'Jane Smith'}
```

By explicitly using `insert_at=1`, the email masking step (`mask_email`) is executed right after data extraction and before enrichment. This ensures sensitive data is handled securely at the earliest stage possible.

## Incremental behavior with `insert_at`

Use `insert_at` to control when your transformation runs in the pipeline and ensure it executes before incremental filtering. See [incremental loading documentation](../../general-usage/incremental/cursor#transform-records-before-incremental-processing) for more details.

## Filling missing data for incremental loading

Use `add_map` to ensure records are compatible with incremental loading.

If the incremental cursor field (e.g., `updated_at`) is missing, you can provide a fallback like `created_at`. This ensures all records have a valid cursor value and can be processed correctly by the incremental step.

[In this example](../../general-usage/incremental/cursor#transform-records-before-incremental-processing), the third record is made incremental-ready by assigning it a fallback `updated_at` value. This ensures it isn't skipped by the incremental loader.
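
A condensed, hypothetical sketch of that pattern (the field names, values, and `initial_value` are illustrative):

```py
import dlt

@dlt.resource
def events(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")):
    yield {"id": 1, "created_at": "2024-02-01T10:00:00Z", "updated_at": "2024-02-02T08:00:00Z"}
    yield {"id": 2, "created_at": "2024-02-03T09:00:00Z"}  # no updated_at yet

def ensure_cursor(record):
    # Fall back to created_at so the incremental step never sees a missing cursor field
    record.setdefault("updated_at", record["created_at"])
    return record

# insert_at=1 runs the fix right after extraction, before incremental filtering
events.add_map(ensure_cursor, insert_at=1)

for event in events():
    print(event)
```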


## `add_map` vs `add_yield_map`

The difference between `add_map` and `add_yield_map` matters when a transformation returns multiple records from a single input.


### **`add_map`**
- Use `add_map` when you want to transform each item into exactly one item.
- Think of it like modifying or enriching a row.
- You use a regular function that returns one modified item.
- Great for adding fields or changing structure.

#### Example

```py
import dlt

@dlt.resource
def resource():
    yield [{"name": "Alice"}, {"name": "Bob"}]

def add_greeting(item):
    item["greeting"] = f"Hello, {item['name']}!"
    return item

resource.add_map(add_greeting)

for row in resource():
    print(row)
```

#### Output

```sh
{'name': 'Alice', 'greeting': 'Hello, Alice!'}
{'name': 'Bob', 'greeting': 'Hello, Bob!'}
```

### **`add_yield_map`**
- Use `add_yield_map` when you want to turn one item into multiple items, or possibly no items.
- Your function is a generator that uses yield.
- Great for pivoting nested data, flattening lists, or filtering rows.

#### Example

```py
import dlt

@dlt.resource
def resource():
    yield [
        {"name": "Alice", "hobbies": ["reading", "chess"]},
        {"name": "Bob", "hobbies": ["cycling"]}
    ]

def expand_hobbies(item):
    for hobby in item["hobbies"]:
        yield {"name": item["name"], "hobby": hobby}

resource.add_yield_map(expand_hobbies)

for row in resource():
    print(row)
```
#### Output

```sh
{'name': 'Alice', 'hobby': 'reading'}
{'name': 'Alice', 'hobby': 'chess'}
{'name': 'Bob', 'hobby': 'cycling'}
```

## Best practices for using `add_map`

- **Keep transformations simple:**
Functions passed to `add_map` run on each record and should be stateless and lightweight. Use them for tasks like string cleanup or basic calculations. For heavier operations (like per-record API calls), batch the work or move it outside `add_map`, for example into a transformer resource or a post-load step.

- **Use the right tool for the job:**
Use `add_map` for one-to-one record transformations. If you need to drop records, use `add_filter` instead of returning `None` in a map function. To split or expand one record into many, use `add_yield_map`.

- **Chain transformations when needed:**
Since `add_map` and `add_filter` return the resource object, you can chain multiple transformations in sequence. Just be mindful of the order: they execute in the order they are added unless you explicitly control it with `insert_at`.

- **Ordering with `insert_at`:**
When using multiple transforms and built-in steps (like incremental loading), control their order with `insert_at`.

Pipeline steps are zero-indexed in the order they are added. Index `0` is usually the initial data extraction. To run a custom map first, set `insert_at=1`. For multiple custom steps, assign different indices (e.g., one at `1`, another at `2`). If you're unsure about the order, iterate over the resource or check the `dlt` logs to confirm how steps are applied.

- **Advanced consideration - data formats:**
Most `dlt` sources yield dictionaries or lists of them. However, some backends, such as PyArrow, may return data as NumPy arrays or Arrow tables. In these cases, your `add_map` or `add_yield_map` function must handle the input format, possibly by converting it to a list of dicts or a pandas DataFrame. This is an advanced scenario, but important if your transformation fails due to unexpected input types.
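
As an illustrative sketch (not taken from the dlt codebase), a map function can detect an Arrow table and convert it before transforming; `pyarrow` and the `name` column here are assumptions:

```py
import pyarrow as pa

def normalize_names(item):
    # Assumed shape: either a plain dict or a pyarrow.Table with a "name" column
    if isinstance(item, pa.Table):
        rows = item.to_pylist()  # convert the Arrow table to a list of dicts
        for row in rows:
            row["name"] = row["name"].strip().title()
        return pa.Table.from_pylist(rows)  # hand back an Arrow table to keep the Arrow format
    # Plain dict path
    item["name"] = item["name"].strip().title()
    return item

# Attach as usual; the same function handles both formats:
# resource.add_map(normalize_names)
```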
4 changes: 2 additions & 2 deletions docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md
@@ -1,10 +1,10 @@
---
title: Transforming data with dbt
title: Transform data with dbt
description: Transforming the data loaded by a dlt pipeline with dbt
keywords: [transform, dbt, runner]
---

# Transforming data with dbt
# Transform data with dbt

:::tip dlt+
If you want to generate your dbt models automatically, check out [dlt+](../../../plus/features/transformations/dbt-transformations.md).
@@ -1,5 +1,5 @@
---
title: Transforming the Data with dbt Cloud
title: Transform data with dbt Cloud
description: Transforming the data loaded by a dlt pipeline with dbt Cloud
keywords: [transform, sql]
---
4 changes: 2 additions & 2 deletions docs/website/docs/dlt-ecosystem/transformations/python.md
@@ -1,10 +1,10 @@
---
title: Transforming data in Python with Arrow tables or DataFrames
title: Transform data in Python with Arrow tables or DataFrames
description: Transforming data loaded by a dlt pipeline with pandas dataframes or arrow tables
keywords: [transform, pandas]
---

# Transforming data in Python with Arrow tables or DataFrames
# Transform data in Python with Arrow tables or DataFrames

You can transform your data in Python using Pandas DataFrames or Arrow tables. To get started, please read the [dataset docs](../../general-usage/dataset-access/dataset).

16 changes: 8 additions & 8 deletions docs/website/docs/dlt-ecosystem/transformations/sql.md
@@ -1,10 +1,10 @@
---
title: Transforming data with SQL
title: Transform data with SQL
description: Transforming the data loaded by a dlt pipeline with the dlt SQL client
keywords: [transform, sql]
---

# Transforming data using the `dlt` SQL client
# Transform data using the `dlt` SQL client

A simple alternative to dbt is to query the data using the `dlt` SQL client and then perform the
transformations using SQL statements in Python. The `execute_sql` method allows you to execute any SQL statement,
@@ -20,7 +20,7 @@ connection.


Typically you will use this type of transformation if you can create or update tables directly from existing tables
without any need to insert data from your Python environment.
without any need to insert data from your Python environment.

The example below creates a new table `aggregated_sales` that contains the total and average sales for each category and region

@@ -32,21 +32,21 @@ pipeline = dlt.pipeline(destination="duckdb", dataset_name="crm")
with pipeline.sql_client() as client:
client.execute_sql(
""" CREATE OR REPLACE TABLE aggregated_sales AS
SELECT
SELECT
category,
region,
SUM(amount) AS total_sales,
AVG(amount) AS average_sales
FROM
FROM
sales
GROUP BY
category,
GROUP BY
category,
region;
""")
```

You can also use the `execute_sql` method to run select queries. The data is returned as a list of rows, with the elements of a row
corresponding to selected columns. A more convenient way to extract data is to use dlt datasets.
corresponding to selected columns. A more convenient way to extract data is to use dlt datasets.

```py
try:
24 changes: 24 additions & 0 deletions docs/website/docs/dlt-ecosystem/verified-sources/pg_replication.md
@@ -275,3 +275,27 @@ If you wish to create your own pipelines, you can leverage source and resource m

Similarly, to replicate changes from selected columns, you can use the `table_names` and `include_columns` arguments in the `replication_resource` function.

## Optional: Using `xmin` for Change Data Capture (CDC)

PostgreSQL internally uses the `xmin` system column to track row versions. You can use `xmin` to enable an efficient CDC mechanism when working with the `sql_database` source.

To do this, define a `query_adapter_callback` that extracts the `xmin` value from the source table and filters based on an incremental cursor:

```py
import sqlalchemy as sa

def query_adapter_callback(query, table, incremental=None, _engine=None) -> sa.TextClause:
    """Generate a SQLAlchemy text clause for querying a table with optional incremental filtering."""
    select_clause = (
        f"SELECT {table.fullname}.*, xmin::text::bigint as xmin FROM {table.fullname}"
    )

    if incremental:
        where_clause = (
            f" WHERE {incremental.cursor_path}::text::bigint >= "
            f"({incremental.start_value}::int8)"
        )
        return sa.text(select_clause + where_clause)

    return sa.text(select_clause)
```

This approach enables you to track changes based on the `xmin` value instead of a manually defined column, which is especially useful in cases where mutation tracking is needed but a timestamp or serial column is not available.
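
A hedged sketch of wiring the callback into the `sql_database` source is shown below; the table name, pipeline name, and destination are placeholders, and the exact import path may differ between dlt versions:

```py
import dlt
from dlt.sources.sql_database import sql_database

# Pass the callback so every query exposes xmin as a regular column
source = sql_database(
    query_adapter_callback=query_adapter_callback,
).with_resources("my_table")

# Use the synthetic xmin column as the incremental cursor
source.my_table.apply_hints(incremental=dlt.sources.incremental("xmin"))

pipeline = dlt.pipeline(pipeline_name="pg_xmin_cdc", destination="duckdb")
pipeline.run(source)
```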
65 changes: 65 additions & 0 deletions docs/website/docs/general-usage/destination-tables.md
@@ -275,6 +275,71 @@ load_info = pipeline.run(data, table_name="users")
Every time you run this pipeline, a new schema will be created in the destination database with a datetime-based suffix. The data will be loaded into tables in this schema.
For example, the first time you run the pipeline, the schema will be named `mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on.

## dlt’s internal tables

dlt automatically creates internal tables in the destination schema to track pipeline runs, support incremental loading, and manage schema versions. These tables use the `_dlt_` prefix.

### `_dlt_loads`: Load history tracking
This table records each pipeline run. Every time you execute a pipeline, a new row is added to this table with a unique `load_id`. This table tracks which loads have been completed and supports chaining of transformations.


| Column name | Type | Description |
|----------------------|-----------|-------------------------------------------|
| `load_id` | STRING | Unique identifier for the load job |
| `schema_name` | STRING | Name of the schema used during the load |
| `schema_version_hash`| STRING | Hash of the schema version |
| `status` | INTEGER | Load status. Value `0` means completed |
| `inserted_at` | TIMESTAMP | When the load was recorded |

Only rows with `status = 0` are considered complete. Other values represent incomplete or interrupted loads. The status column can also be used to coordinate multi-step transformations.
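
As a small sketch (assuming a duckdb destination and the `execute_sql` pattern used elsewhere in these docs), you can list the completed loads directly:

```py
import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb", dataset_name="mydata")

with pipeline.sql_client() as client:
    rows = client.execute_sql(
        "SELECT load_id, schema_name, inserted_at FROM _dlt_loads WHERE status = 0 ORDER BY inserted_at DESC"
    )
    for load_id, schema_name, inserted_at in rows:
        # Each completed load appears once, newest first
        print(load_id, schema_name, inserted_at)
```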

### `_dlt_pipeline_state`: Pipeline state and checkpoints
This table stores the internal state of the pipeline for each run. This state enables incremental loading and allows the pipeline to resume from where it left off if a previous run was interrupted.


| Column name | Type | Description |
|-------------------|------------------|------------------------------------------------------|
| `version` | INTEGER | Version of this state entry |
| `engine_version` | INTEGER | Version of the dlt engine used |
| `pipeline_name` | STRING | Name of the pipeline |
| `state` | STRING or BLOB | Serialized Python dictionary of pipeline state |
| `created_at` | TIMESTAMP | When this state entry was created |
| `version_hash` | STRING | Hash to detect changes in the state |
| `_dlt_load_id` | STRING | Reference to related load in `_dlt_loads` |
| `_dlt_id` | STRING | Unique identifier for the pipeline state row |


The `state` column contains a serialized Python dictionary that includes:

- Incremental progress (e.g. last item or timestamp processed).
- Checkpoints for transformations.
- Source-specific metadata and settings.

This allows dlt to resume interrupted pipelines, avoid reloading already processed data, and ensure pipelines are idempotent and efficient.

The `version_hash` is recalculated on each update. dlt uses this table to implement last-value incremental loading. If a run fails or stops, this table ensures the next run picks up from the correct checkpoint.
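
As a hedged sketch, you can also peek at the in-memory state of a pipeline that has already run locally; the exact keys depend on your sources and dlt version, so treat this as illustrative only:

```py
import dlt

# Attach to an existing local pipeline by name (assumed to have run at least once)
pipeline = dlt.pipeline(pipeline_name="my_pipeline")

# pipeline.state mirrors what dlt serializes into _dlt_pipeline_state
print(list(pipeline.state.keys()))
```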

### `_dlt_version`: Schema version tracking
This table tracks the history of all schema versions used by the pipeline. Every time dlt updates the schema, for example when new columns or tables are added, a new entry is written to this table.

| Column name | Type | Description |
|------------------|------------------|--------------------------------------------------|
| `version` | INTEGER | Numeric version of the schema |
| `engine_version` | INTEGER | Version of the dlt engine used |
| `inserted_at` | TIMESTAMP | Time the schema version entry was created |
| `schema_name` | STRING | Name of the schema |
| `version_hash` | STRING | Unique hash representing the schema content |
| `schema` | STRING or JSON | Full schema in JSON format |

By keeping previous schema definitions, `_dlt_version` ensures that:

- Older data remains readable
- New data uses updated schema rules
- Backward compatibility is maintained

This table also supports troubleshooting and compatibility checks. It lets you track which schema and engine version were used for any load. This helps with debugging and ensures safe evolution of your data model.


## Loading data into existing tables not created by dlt

You can also load data from `dlt` into tables that already exist in the destination dataset and were not created by `dlt`.