Conversation

@ImDoubD-datazip (Collaborator) commented Oct 25, 2025

Description

Kafka has been added as Olake's latest source driver. Messages from the selected Kafka topics are synced directly to the destination. Currently one mode is supported: streaming CDC.

  • In streaming CDC mode:
    • If a consumer group is provided, that specific consumer group is used to subscribe to all the selected topics for the sync; otherwise, a consumer group is auto-generated.
    • When a state file is used, the sync resumes from the checkpoints persisted in that file. The user has to keep supplying the data or messages that need to be synced.

To Do [New ideas are welcome!!]

TO FIX:

As of now, if the sync is interrupted midway, the offset commit to Kafka still takes place even if the messages were not written to the destination. To recover that data, a sync from the starting offset must be run [auto_offset_reset: earliest].

  • Implement a 2-phase commit system (see the sketch after this list):
    • 1st: process the messages without committing their offsets.
    • 2nd: write to the destination, get acknowledgement of the destination commit, and only then commit the offsets to Kafka. State won't be used here.
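
A minimal sketch of that 2-phase flow, using segmentio/kafka-go (not necessarily the client this driver uses); writeToDestination is a hypothetical stand-in for the destination writer:

package main

import (
	"context"
	"log"

	kafka "github.com/segmentio/kafka-go"
)

// writeToDestination is hypothetical: it must return nil only after the
// destination has acknowledged the commit.
func writeToDestination(msgs []kafka.Message) error { return nil }

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"broker:9092"},
		GroupID: "example-consumer-group",
		Topic:   "example-topic",
	})
	defer r.Close()

	ctx := context.Background()
	batch := make([]kafka.Message, 0, 100)

	// Phase 1: fetch and process messages without committing offsets.
	for len(batch) < cap(batch) {
		m, err := r.FetchMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		batch = append(batch, m)
	}

	// Phase 2: write to the destination first; commit offsets to Kafka only
	// after the destination acknowledges. If the sync is interrupted before
	// this point, the uncommitted messages are simply re-fetched next run.
	if err := writeToDestination(batch); err != nil {
		log.Fatal(err)
	}
	if err := r.CommitMessages(ctx, batch...); err != nil {
		log.Fatal(err)
	}
}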

Fixes #87

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Incremental sync has been tested on both Parquet and Apache Iceberg destinations.

  • When the data is not JSON, unmarshaling fails and the record is returned as null (see the sketch after this list).
  • max_threads controls the level of partition-wise parallelism.
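
A minimal sketch of that decode behavior (illustrative only, not the driver's actual code):

package main

import (
	"encoding/json"
	"fmt"
)

// decode returns the parsed record, or nil when the payload is not valid
// JSON, so the record is emitted as null instead of failing the sync.
func decode(value []byte) interface{} {
	var record map[string]interface{}
	if err := json.Unmarshal(value, &record); err != nil {
		return nil
	}
	return record
}

func main() {
	fmt.Println(decode([]byte(`{"id": 1}`))) // map[id:1]
	fmt.Println(decode([]byte("not-json")))  // <nil>
}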

CDC streaming sync: [screenshot of the sync run]

How to test Kafka as source

  • Create a topic in a Kafka cluster.
  • Write messages to that topic using the console producer command (see the example commands after this list).
  • For the source config: if authentication is enabled on the Kafka broker, use SASL_PLAINTEXT as the security protocol; if encryption is also enabled, use SASL_SSL. In either case, set the SASL mechanism to either PLAIN or SCRAM-SHA-512 and provide a SASL JAAS configuration string containing the username and password.
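
For the first two steps, the standard Kafka CLI tools can be used; the topic name, broker address, and partition count below are placeholders:

# create a test topic (3 partitions exercises max_threads parallelism)
kafka-topics.sh --create --topic example-topic --bootstrap-server broker:9092 --partitions 3 --replication-factor 1

# produce JSON messages interactively
kafka-console-producer.sh --topic example-topic --bootstrap-server broker:9092
> {"id": 1, "event": "signup"}
> {"id": 2, "event": "login"}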

Example source.json:
Using PLAINTEXT protocol

{
    "bootstrap_servers": "broker:9092, broker:9093",
    "protocol": {
      "security_protocol": "PLAINTEXT"
    },
    "consumer_group_id": "example-consumer-group",
    "max_threads": 3,
    "auto_offset_reset": "latest"
}

Using SASL_SSL protocol

{
    "bootstrap_servers": "broker:9092, broker:9093",
    "protocol": {
      "security_protocol": "SASL_SSL",
      "sasl_mechanism": "SCRAM-SHA-512",
      "sasl_jaas_config": "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"YOUR_KAFKA_USERNAME\" password=\"YOUR_KAFKA_PASSWORD\";"
    },
    "consumer_group_id": "example-consumer-group",
    "max_threads": 3,
    "auto_offset_reset": "earliest"
}
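
For reference, the SASL_SSL fields above map onto a consumer roughly like this (a sketch using segmentio/kafka-go, not necessarily the client this driver uses; the username and password from the JAAS string become the SCRAM credentials):

package main

import (
	"crypto/tls"
	"log"

	kafka "github.com/segmentio/kafka-go"
	"github.com/segmentio/kafka-go/sasl/scram"
)

func main() {
	// sasl_mechanism: SCRAM-SHA-512, credentials taken from sasl_jaas_config
	mechanism, err := scram.Mechanism(scram.SHA512, "YOUR_KAFKA_USERNAME", "YOUR_KAFKA_PASSWORD")
	if err != nil {
		log.Fatal(err)
	}
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"broker:9092", "broker:9093"},
		GroupID: "example-consumer-group",
		Topic:   "example-topic",
		Dialer: &kafka.Dialer{
			SASLMechanism: mechanism,
			TLS:           &tls.Config{}, // security_protocol: SASL_SSL
		},
	})
	defer r.Close()
}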

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

@hash-data (Collaborator) commented:
Nice Work @ImDoubD-datazip

@hash-data merged commit 0d2ee8f into staging on Nov 10, 2025
11 checks passed
@hash-data deleted the feat/kafka-v0 branch on November 10, 2025 at 18:14