
Conversation

@alex-thc
Contributor

@alex-thc alex-thc commented Nov 29, 2025

Summary by CodeRabbit

  • New Features

    • Added an S3 connector with per-task in-memory batching and flush-on-task-completion; supports reading/writing partitioned data.
    • CLI flags to configure S3 (region, credentials, profile, endpoint, prefix, path-style).
    • Task ID now flows through transport and messages; connectors can implement an optional task-completion hook to flush per-task buffers.
  • Chores

    • Upgraded AWS SDK and related dependencies to enable S3 support.
    • CI security scanner adjusted to exclude generated code.


@alex-thc alex-thc enabled auto-merge (squash) November 29, 2025 06:01
@coderabbitai

coderabbitai bot commented Nov 29, 2025

Walkthrough

Propagates TaskId through data and barrier flows, adds OnTaskCompletionBarrierHandlerServicable, introduces an S3 sink connector with per-task buffering and barrier-triggered flush to S3, updates proto/Java messages to include task_id, and bumps Go module dependencies for S3.

Changes

  • Task tracking & barrier handling (connectors/common/base.go, protocol/iface/transport.go): Adds TaskId to DataMessage, propagates TaskId in StartReadToChannel and ProcessDataMessages, and adds the OnTaskCompletionBarrierHandlerServicable interface, invoking it for BarrierType_TaskComplete (see the interface sketch below).
  • S3 connector implementation & registration (connectors/s3/connector.go, internal/app/options/connectorflags.go): New S3 connector (ConnectorSettings, NewConn) with per-task in-memory buffering; WriteData buffers documents and flush-on-task-completion writes JSON arrays to S3. Includes per-namespace metadata read/update, concurrency and error handling, CLI flags, and connector registration.
  • Proto and generated messages (proto/adiom/v1/messages.proto, java/src/main/java/adiom/v1/Messages.java): Adds uint32 task_id = 4 to WriteDataRequest; updates generated Java classes to include taskId_, accessors, builders, and serialization/size/equality/hash handling for task_id across affected messages.
  • Go module updates (go.mod): Bumps AWS SDK v2 and adds service/s3 plus related internal/indirect dependency updates to support S3 integration.
  • CI tweak (.github/workflows/test.yml): Adds -exclude-generated to the Gosec scanner arguments.
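
A minimal sketch of the optional hook named above, assuming the single-method shape implied by the barrier handling described later in this review (the exact placement in connectors/common/base.go and the doc comments are assumptions):

type OnTaskCompletionBarrierHandlerServicable interface {
	// OnTaskCompletionBarrierHandler is invoked when a BarrierType_TaskComplete
	// barrier is processed, letting a connector flush any per-task buffers.
	OnTaskCompletionBarrierHandler(taskID uint) error
}

Connectors that do not implement the interface are unaffected; the core only calls the hook when a type assertion succeeds.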

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Core as Connector Core
  participant Buffer as Per-Task Buffer
  participant Barrier as Barrier Handler
  participant S3 as AWS S3
  participant Meta as Namespace Metadata

  Core->>Buffer: WriteData(taskId, data)
  Note right of Buffer: Accumulate JSON docs per task (in-memory)
  Core->>Core: ProcessDataMessages (preserve TaskId)
  Core->>Barrier: Emit Barrier(TaskComplete, barrierTaskId)
  Barrier-->>Core: If implements OnTaskCompletionBarrierHandler -> call OnTaskCompletionBarrierHandler(barrierTaskId)
  Core->>Buffer: Flush buffer for barrierTaskId
  Buffer->>S3: PutObject(namespace/derived-key.json) [JSON array]
  S3-->>Core: PutObject response
  Core->>Meta: Read/Update namespace metadata with new key/count
  Meta-->>Core: Metadata update response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Attention points:
    • connectors/s3/connector.go — concurrency (per-task buffers, metadata mutex), flush-on-barrier correctness, S3 key derivation, error propagation and currentError() handling.
    • connectors/common/base.go & protocol/iface/transport.go — ensure TaskId propagation is consistent across transformation and non-transformation code paths.
    • proto/adiom/v1/messages.proto & java/src/main/java/adiom/v1/Messages.java — serialization compatibility, builder semantics, and default handling for the new task_id field.
    • go.mod — verify dependency upgrades build and do not introduce breaking changes.

Poem

🐰 I buffered bytes with patient paws,
Task IDs marched in tidy rows,
A barrier chimed — flush on cue,
JSON bundles sailed to S3 anew,
Rabbity cheers and crunchy carrots too!

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 38.46%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title 'S3 (json) connector' directly describes the main change: adding a new S3 connector for JSON data, which aligns with the primary additions across connectors/s3/connector.go and supporting infrastructure.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
internal/app/options/connectorflags.go (1)

604-608: Clarify prefix flag behavior in usage text.

The usage says "Override or append to" but the implementation only overrides (via Destination). If the CLI flag is provided, it completely replaces the prefix from the connection string.

Consider updating the usage to reflect actual behavior:

 		altsrc.NewStringFlag(&cli.StringFlag{
 			Name:        "prefix",
-			Usage:       "Override or append to the key prefix derived from the connection string",
+			Usage:       "Override the key prefix derived from the connection string",
 			Destination: &settings.Prefix,
 		}),
connectors/s3/connector.go (2)

418-432: Consider accepting context parameter.

flushBatch uses context.Background() which won't respect cancellation signals from the caller. If the application is shutting down, this S3 put operation could block indefinitely.

-func (c *connector) flushBatch(namespace string, taskID uint, docs [][]byte) error {
+func (c *connector) flushBatch(ctx context.Context, namespace string, taskID uint, docs [][]byte) error {
 	payload := buildJSONArray(docs)
 	key := c.objectKey(namespace, taskID)
-	_, err := c.client.PutObject(context.Background(), &s3.PutObjectInput{
+	_, err := c.client.PutObject(ctx, &s3.PutObjectInput{

462-474: Buffer pre-allocation is underestimated.

buf.Grow(len(docs) * 2) significantly underestimates the required capacity since each document could be much larger than 2 bytes. While this won't cause correctness issues (the buffer grows automatically), it defeats the purpose of pre-allocation.

Consider estimating based on actual document sizes:

 func buildJSONArray(docs [][]byte) []byte {
 	var buf bytes.Buffer
-	buf.Grow(len(docs) * 2)
+	totalSize := 2 + len(docs) // brackets + commas
+	for _, doc := range docs {
+		totalSize += len(doc)
+	}
+	buf.Grow(totalSize)
 	buf.WriteByte('[')
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e223a75 and f9d3667.

⛔ Files ignored due to path filters (6)
  • gen/adiom/v1/adiom.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/adiom/v1/adiomv1connect/adiom.connect.go is excluded by !**/gen/**
  • gen/adiom/v1/adiomv1connect/vector.connect.go is excluded by !**/gen/**
  • gen/adiom/v1/messages.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/adiom/v1/vector.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
  • connectors/common/base.go (6 hunks)
  • connectors/s3/connector.go (1 hunks)
  • go.mod (2 hunks)
  • internal/app/options/connectorflags.go (4 hunks)
  • java/src/main/java/adiom/v1/Messages.java (13 hunks)
  • proto/adiom/v1/messages.proto (1 hunks)
  • protocol/iface/transport.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
connectors/s3/connector.go (4)
connectors/common/base.go (1)
  • ConnectorSettings (49-58)
gen/adiom/v1/adiomv1connect/adiom.connect.go (1)
  • ConnectorServiceHandler (199-210)
gen/adiom/v1/messages.pb.go (2)
  • DataType_DATA_TYPE_JSON_ID (30-30)
  • DataType_DATA_TYPE_MONGO_BSON (29-29)
protocol/iface/connector.go (1)
  • Namespace (104-107)
internal/app/options/connectorflags.go (1)
connectors/s3/connector.go (2)
  • ConnectorSettings (38-49)
  • NewConn (73-116)
🔇 Additional comments (21)
proto/adiom/v1/messages.proto (1)

123-128: LGTM - backward-compatible field addition.

The new task_id field at position 4 is correctly placed and maintains proto backward compatibility. Default zero value for uint32 is safe for existing clients.

protocol/iface/transport.go (1)

28-28: LGTM - clear separation of concerns.

The TaskId field for data messages is distinct from BarrierTaskId for barrier signals. This separation allows proper task association through the data flow while maintaining barrier semantics.
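
For illustration, the two fields might sit side by side on the transport message roughly as below; the field types and the rest of the struct are assumptions, not the actual definition in protocol/iface/transport.go:

type DataMessage struct {
	// TaskId associates ordinary data messages with the task that produced them (new in this PR).
	TaskId uint
	// BarrierTaskId carries the task id for barrier signals such as BarrierType_TaskComplete.
	BarrierTaskId uint
	// ... other fields omitted
}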

go.mod (1)

10-15: LGTM - appropriate dependency additions for S3 support.

The AWS SDK v2 upgrade and S3 service addition are necessary for the new S3 connector. The transitive dependencies (eventstream, checksum, s3shared, etc.) are standard requirements for S3 operations.

internal/app/options/connectorflags.go (2)

300-321: LGTM - S3 connector registration follows established patterns.

The registration logic correctly:

  • Uses case-insensitive prefix matching for s3://
  • Validates connection string presence
  • Parses bucket/prefix before delegating to CreateHelper

548-564: LGTM - robust connection string parsing.

The function correctly handles:

  • Case-insensitive s3:// prefix validation
  • Empty bucket detection
  • Optional prefix extraction via SplitN
connectors/common/base.go (4)

45-47: LGTM - well-designed optional interface.

The single-method interface enables connectors to optionally handle task completion barriers without forcing all connectors to implement this behavior. This follows Go's interface composition idiom.


626-654: LGTM - consistent TaskId propagation across all data paths.

TaskId is correctly set in all three branches:

  1. Transformed multi-response path (line 630)
  2. Transformed single-response path (line 643)
  3. Non-transformed path (line 652)

This ensures task context flows through regardless of transformation configuration.


988-994: LGTM - safe optional interface invocation.

The type assertion pattern correctly checks if the connector implements OnTaskCompletionBarrierHandlerServicable before invoking the hook. Error propagation is handled properly.
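
A sketch of the guarded invocation pattern described here, assuming a connector value and a barrierTaskId in scope:

if handler, ok := connector.(OnTaskCompletionBarrierHandlerServicable); ok {
	if err := handler.OnTaskCompletionBarrierHandler(barrierTaskId); err != nil {
		return err // propagate the connector's flush error
	}
}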


1094-1099: LGTM - TaskId correctly propagated to WriteDataRequest.

The TaskId is appropriately cast from uint to uint32 to match the proto field type, maintaining end-to-end task correlation from reader through writer.

java/src/main/java/adiom/v1/Messages.java (1)

12321-12326: Generated protobuf code - changes look correct.

This file appears to be auto-generated from proto/adiom/v1/messages.proto by the protobuf compiler. The additions for the task_id field (field number 4, uint32) follow standard protobuf Java codegen patterns correctly:

  • Interface getter declaration
  • Field storage with default value 0
  • Serialization/deserialization with wire type 32 (varint)
  • Proper bit field tracking (0x00000008)
  • Builder methods with onChanged() calls

Generated files are typically excluded from manual code review since the source of truth is the .proto file. Consider adding this path to your review exclusion patterns if not already done.

Also applies to: 12454-12463, 13053-13083

connectors/s3/connector.go (11)

1-35: LGTM!

Clean package setup with appropriate imports and well-defined sentinel errors.


37-70: LGTM!

The struct definitions are well-organized. The use of separate mutexes for batches and error state is appropriate for the different access patterns.


72-116: LGTM!

Good validation of required fields, sensible defaults, and flexible credential configuration supporting both profile-based and static credentials.


118-139: LGTM!

Capabilities are correctly defined for an S3-based JSON connector with source and sink support.


141-247: LGTM!

Well-structured plan generation with proper pagination, filtering, and metadata caching. The graceful degradation when metadata is unavailable is a good approach.


249-283: LGTM!

Defensive error handling with appropriate fallback to zero count when metadata is unavailable.


285-327: LGTM with a note.

The implementation is correct. The entire file is loaded into memory, which is fine for typical use cases but could be a concern for very large JSON files. Consider documenting expected file size limits if this becomes a production concern.


329-337: LGTM!

Correctly returns unimplemented for streaming operations that S3 doesn't support.


339-366: LGTM!

Good defensive checks for error state and required task ID. The batching approach is appropriate for S3's write patterns.


368-371: LGTM!

Correctly rejects update operations, which aren't supported given the connector's append-only write model on S3.


476-496: LGTM!

The conversion logic handles supported types correctly. Note that BSON to JSON conversion via map[string]any may lose some BSON-specific type information (e.g., ObjectId, Decimal128), but this is likely acceptable for a generic JSON sink.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
connectors/s3/connector.go (2)

373-395: Empty files still created for tasks with no data.

When batch is nil (no data was written for this task), the code creates an empty storedBatch{} and proceeds to call flushBatch, which writes an empty JSON array to S3. This pollutes the bucket with unnecessary files.

Return early when there's no data to flush:

 func (c *connector) OnTaskCompletionBarrierHandler(taskID uint) error {
 	batch := c.detachBatch(taskID)
 	if batch == nil {
 		slog.Debug("s3 connector received barrier with no data", "taskId", taskID)
-		batch = &storedBatch{}
+		return nil
 	}
+	if len(batch.docs) == 0 {
+		slog.Debug("s3 connector received barrier with empty batch", "taskId", taskID)
+		return nil
+	}
 	if err := c.flushBatch(batch.namespace, taskID, batch.docs); err != nil {

560-587: Race condition in concurrent metadata updates.

updateMetadataAfterFlush performs an unsynchronized read-modify-write. When multiple tasks for the same namespace complete concurrently, metadata updates can be lost because each goroutine reads, modifies, and writes independently.

Example race:

  1. Task A reads metadata {file1: 100}
  2. Task B reads metadata {file1: 100}
  3. Task A writes {file1: 100, task-2.json: 50}
  4. Task B writes {file1: 100, task-3.json: 75}, so the task-2.json entry from step 3 is lost

Add per-namespace mutex to serialize metadata updates:

 type connector struct {
 	adiomv1connect.UnimplementedConnectorServiceHandler

 	client       *s3.Client
 	settings     ConnectorSettings
 	batchesMutex sync.Mutex
 	batches      map[taskKey]*storedBatch

+	metadataMutex sync.Mutex // Serialize metadata updates

 	errMutex sync.RWMutex
 	err      error
 }

Then protect the read-modify-write in updateMetadataAfterFlush:

 func (c *connector) updateMetadataAfterFlush(ctx context.Context, namespace string, taskID uint, recordCount uint64) error {
+	c.metadataMutex.Lock()
+	defer c.metadataMutex.Unlock()
+
 	// Read current metadata (or create empty map if doesn't exist)
 	metadata, err := c.readMetadata(ctx, namespace)

For better concurrency with many namespaces, consider per-namespace locks using sync.Map.

🧹 Nitpick comments (1)
connectors/common/base.go (1)

1098-1098: TaskId type narrowing from uint to uint32.

The conversion from uint (which is 64-bit on 64-bit systems) to uint32 could theoretically overflow if task IDs exceed 2³²-1. While task IDs are typically small sequential numbers making overflow unlikely in practice, consider documenting this limitation or adding validation if task ID limits are a concern.
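
If validation were added, a hedged sketch of guarding the narrowing conversion could look like the helper below (the function name and its call site are hypothetical):

import (
	"fmt"
	"math"
)

func taskIDToUint32(taskID uint) (uint32, error) {
	if uint64(taskID) > math.MaxUint32 {
		return 0, fmt.Errorf("task id %d exceeds uint32 range", taskID)
	}
	return uint32(taskID), nil
}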

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9d3667 and 8ea476e.

⛔ Files ignored due to path filters (6)
  • gen/adiom/v1/adiom.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/adiom/v1/adiomv1connect/adiom.connect.go is excluded by !**/gen/**
  • gen/adiom/v1/adiomv1connect/vector.connect.go is excluded by !**/gen/**
  • gen/adiom/v1/messages.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/adiom/v1/vector.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
  • connectors/common/base.go (6 hunks)
  • connectors/s3/connector.go (1 hunks)
  • go.mod (2 hunks)
  • internal/app/options/connectorflags.go (4 hunks)
  • java/src/main/java/adiom/v1/Messages.java (13 hunks)
  • proto/adiom/v1/messages.proto (1 hunks)
  • protocol/iface/transport.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • proto/adiom/v1/messages.proto
  • protocol/iface/transport.go
🧰 Additional context used
🧬 Code graph analysis (2)
connectors/s3/connector.go (3)
gen/adiom/v1/adiomv1connect/adiom.connect.go (1)
  • ConnectorServiceHandler (199-210)
gen/adiom/v1/messages.pb.go (2)
  • DataType_DATA_TYPE_JSON_ID (30-30)
  • DataType_DATA_TYPE_MONGO_BSON (29-29)
protocol/iface/connector.go (1)
  • Namespace (104-107)
internal/app/options/connectorflags.go (1)
connectors/s3/connector.go (2)
  • ConnectorSettings (38-49)
  • NewConn (73-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (19)
java/src/main/java/adiom/v1/Messages.java (4)

12822-12826: Wire type parsing is correct.

The parsing case 32 corresponds to field number 4 with wire type 0 (varint): (4 << 3) | 0 = 32. The bit flag 0x00000008 correctly tracks the 4th field's presence.


13053-13083: Builder methods follow standard protobuf patterns.

The setTaskId, getTaskId, and clearTaskId builder methods are correctly implemented with proper bit field tracking and onChanged() notifications.


23995-24046: Descriptor string updated for the new field.

The serialized descriptor includes the task_id field definition. This section is typically auto-generated and should match the compiled proto output.


12454-12463: Java implementation correctly matches the proto definition.

The taskId field implementation is properly generated from proto/adiom/v1/messages.proto, which defines uint32 task_id = 4 in the WriteDataRequest message. The wire tag calculation (case 32), field numbering, serialization/deserialization, and builder methods all follow standard protobuf codegen patterns correctly.

connectors/common/base.go (3)

45-47: LGTM! Optional barrier hook interface added.

The new interface enables connectors to implement custom logic when task completion barriers are processed, which is essential for the S3 connector's per-task buffering and flushing strategy.


630-630: LGTM! Task ID propagation implemented correctly.

The TaskId is consistently propagated across all data message creation paths (transformed and non-transformed), enabling per-task tracking and barrier handling downstream.

Also applies to: 643-643, 652-652


988-994: LGTM! Barrier hook integration is well-implemented.

The optional hook is properly guarded with a type assertion and errors are propagated correctly. This allows connectors like S3 to flush buffered data when task completion barriers arrive.

internal/app/options/connectorflags.go (3)

300-321: LGTM! S3 connector registration follows established patterns.

The connector registration correctly parses S3 URIs, extracts bucket and prefix, and delegates to the helper for flag processing—consistent with other connectors in the registry.


548-564: LGTM! Connection string parsing is robust.

The parsing correctly handles the s3://bucket[/prefix] format with proper validation and error messages. Prefix trimming is delegated to the connector implementation, which is appropriate.


596-646: LGTM! S3 flags are comprehensive and well-structured.

The flags cover all necessary AWS S3 configuration options including region, credentials, endpoint customization, and path-style addressing—suitable for both AWS and S3-compatible services.

connectors/s3/connector.go (8)

72-116: LGTM! Connector initialization is well-structured.

The initialization properly validates required settings, configures AWS SDK with flexible credential options (profile, static, default), and supports custom endpoints for testing or S3-compatible services.


118-139: LGTM! Capabilities accurately reflect connector functionality.

The connector correctly advertises JSON source support and BSON/JSON sink support with appropriate capability flags for multi-namespace planning.


141-247: LGTM! Plan generation is thorough and handles edge cases well.

The implementation correctly lists S3 objects with pagination, filters by namespace, loads metadata for count estimation, and gracefully handles missing metadata with appropriate logging.


249-283: LGTM! Metadata retrieval handles missing files gracefully.

The namespace metadata aggregation correctly sums per-file counts and returns appropriate defaults when metadata is unavailable.


285-327: LGTM! Data listing is efficient and memory-safe.

The implementation correctly fetches and validates JSON arrays from S3, with proper byte copying to prevent retaining large backing arrays.


339-366: LGTM! Write buffering is well-designed with proper validation.

The implementation correctly validates task ID, converts data types, and buffers documents in memory for later flushing on barrier events.


397-416: LGTM! Batch management is properly synchronized.

The mutex-protected batch operations correctly handle concurrent writes from multiple tasks.


418-496: LGTM! Helper functions are well-implemented.

The utility functions correctly handle S3 key generation, JSON array construction with buffer optimization, and data type conversion with proper validation. Error state management uses appropriate read/write locking.

go.mod (1)

10-10: Consider updating to latest minor versions for AWS SDK.

The current versions (aws-sdk-go-v2 v1.40.0 and service/s3 v1.92.1) are stable but slightly outdated. Latest available versions are v1.40.1 and v1.93.0 respectively. Update to ensure access to latest bug fixes and improvements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
connectors/s3/connector.go (5)

68-68: Consider per-namespace mutexes for better concurrency.

While the single metadataMutex correctly prevents race conditions (addressing the past review comment), it serializes metadata updates across all namespaces. For workloads with many concurrent tasks updating different namespaces, this global lock can become a bottleneck.

Consider using a sync.Map to hold per-namespace mutexes:

 type connector struct {
 	adiomv1connect.UnimplementedConnectorServiceHandler
 
 	client       *s3.Client
 	settings     ConnectorSettings
 	batchesMutex sync.Mutex
 	batches      map[taskKey]*storedBatch
 
-	metadataMutex sync.Mutex // Serialize metadata updates
+	namespaceMutexes sync.Map // map[string]*sync.Mutex for per-namespace locking
 
 	errMutex sync.RWMutex
 	err      error
 }

Then in updateMetadataAfterFlush, acquire the namespace-specific mutex:

func (c *connector) updateMetadataAfterFlush(ctx context.Context, namespace string, taskID uint, recordCount uint64) error {
	// Get or create mutex for this namespace
	mu, _ := c.namespaceMutexes.LoadOrStore(namespace, &sync.Mutex{})
	namespaceMu := mu.(*sync.Mutex)
	
	namespaceMu.Lock()
	defer namespaceMu.Unlock()
	
	// ... rest of the function unchanged
}

287-329: Consider memory implications for large files.

The current implementation loads the entire JSON array into memory at once. For very large S3 files (e.g., hundreds of MB or GB), this could cause memory pressure or OOM issues.

If you expect to handle large files, consider:

  1. Implementing pagination/chunking within the file (using the NextCursor field)
  2. Setting size limits on individual S3 files
  3. Streaming the JSON array parsing rather than decoding all at once
  4. Monitoring memory usage in production for files above a certain size threshold

393-398: Redundant check: batch is already confirmed non-empty.

The check if len(batch.docs) > 0 is unnecessary because the function already returns early at lines 382-385 if the batch is empty.

Apply this diff to remove the redundant check:

 	// Update metadata file atomically after successful flush
-	if len(batch.docs) > 0 {
-		if err := c.updateMetadataAfterFlush(context.Background(), batch.namespace, taskID, uint64(len(batch.docs))); err != nil {
-			slog.Error("failed to update metadata", "namespace", batch.namespace, "taskId", taskID, "err", err)
-			// Log error but don't fail the barrier - the data was successfully flushed
-		}
+	if err := c.updateMetadataAfterFlush(context.Background(), batch.namespace, taskID, uint64(len(batch.docs))); err != nil {
+		slog.Error("failed to update metadata", "namespace", batch.namespace, "taskId", taskID, "err", err)
+		// Log error but don't fail the barrier - the data was successfully flushed
 	}

424-438: Accept context parameter for better cancellation control.

The function uses context.Background() for the S3 PutObject call, which means the flush operation cannot be cancelled or timeout-controlled by the caller. If OnTaskCompletionBarrierHandler needs to respect deadlines or cancellation, this will be problematic.

Apply this diff to thread the context through:

-func (c *connector) flushBatch(namespace string, taskID uint, docs [][]byte) error {
+func (c *connector) flushBatch(ctx context.Context, namespace string, taskID uint, docs [][]byte) error {
 	payload := buildJSONArray(docs)
 	key := c.objectKey(namespace, taskID)
-	_, err := c.client.PutObject(context.Background(), &s3.PutObjectInput{
+	_, err := c.client.PutObject(ctx, &s3.PutObjectInput{

And update the call site in OnTaskCompletionBarrierHandler:

-	if err := c.flushBatch(batch.namespace, taskID, batch.docs); err != nil {
+	if err := c.flushBatch(context.Background(), batch.namespace, taskID, batch.docs); err != nil {

(Or pass a proper context if available from the barrier handler)


468-480: Inefficient buffer pre-allocation.

The buf.Grow(len(docs) * 2) estimates only 2 bytes per document, which is far too small for typical JSON documents. This will cause multiple reallocations as the buffer grows. For example, 1000 documents of 100 bytes each would need ~100KB but only 2KB is pre-allocated.

Consider a more realistic estimate:

 func buildJSONArray(docs [][]byte) []byte {
 	var buf bytes.Buffer
-	buf.Grow(len(docs) * 2)
+	// Estimate: sum of document sizes + commas + brackets + 10% overhead
+	totalSize := 2 // '[' and ']'
+	for _, doc := range docs {
+		totalSize += len(doc) + 1 // +1 for comma (or close bracket)
+	}
+	buf.Grow(totalSize)
 	buf.WriteByte('[')

Or use a simple heuristic like len(docs) * 100 if you expect documents to average ~100 bytes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5e5c0f and 491d820.

📒 Files selected for processing (1)
  • connectors/s3/connector.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connectors/s3/connector.go (3)
gen/adiom/v1/adiomv1connect/adiom.connect.go (1)
  • ConnectorServiceHandler (199-210)
gen/adiom/v1/messages.pb.go (2)
  • DataType_DATA_TYPE_JSON_ID (30-30)
  • DataType_DATA_TYPE_MONGO_BSON (29-29)
protocol/iface/connector.go (1)
  • Namespace (104-107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (5)
connectors/s3/connector.go (5)

1-35: LGTM!

Package declaration, imports, and error constants are well-structured and appropriate for an S3 connector implementation.


74-118: LGTM!

Constructor properly validates required settings, configures AWS SDK with flexible credential options, and initializes the connector with appropriate defaults.


143-249: LGTM!

The plan generation logic correctly discovers S3 objects, filters by namespace, loads metadata efficiently with caching, and creates appropriate partitions. Error handling with warning logs is appropriate.


504-596: LGTM with concurrency consideration already noted.

The metadata management implementation correctly handles missing files, generates appropriate keys, and performs atomic read-modify-write operations. The global metadata mutex concern was already raised in the earlier comment on line 68.


348-351: TaskID validation is implementation-specific, not protocol-mandated.

The protobuf WriteDataRequest.TaskId field uses uint32 with no required constraint, meaning 0 is a valid protocol value. While the S3 connector rejects taskID == 0, this validation is not enforced at the proto level. The null connector's WriteData method does not perform this check, indicating inconsistent enforcement across connectors. Either clarify in comments why 0 is invalid for this connector, or consider if the validation is necessary given the protocol allows it.

Comment on lines +482 to +502
func convertToJSON(data []byte, dataType adiomv1.DataType) ([]byte, error) {
	switch dataType {
	case adiomv1.DataType_DATA_TYPE_JSON_ID:
		if !json.Valid(data) {
			return nil, fmt.Errorf("invalid json payload")
		}
		return append([]byte(nil), data...), nil
	case adiomv1.DataType_DATA_TYPE_MONGO_BSON:
		var doc map[string]any
		if err := bson.Unmarshal(data, &doc); err != nil {
			return nil, fmt.Errorf("bson to json: %w", err)
		}
		converted, err := json.Marshal(doc)
		if err != nil {
			return nil, fmt.Errorf("marshal json: %w", err)
		}
		return converted, nil
	default:
		return nil, ErrUnsupportedType
	}
}

⚠️ Potential issue | 🟠 Major

BSON to JSON conversion loses type fidelity.

For DATA_TYPE_MONGO_BSON, the code unmarshals BSON to map[string]any and then marshals to JSON. This loses BSON-specific type information:

  • ObjectID → string (hex representation)
  • DateTime → RFC3339 string
  • Decimal128 → float64 (potential precision loss)
  • Binary → base64 string
  • Other BSON types are similarly transformed

If consumers expect to preserve exact BSON semantics or round-trip data back to MongoDB, this conversion is lossy and could cause issues. However, if the S3 connector's purpose is specifically to produce human-readable JSON for analytics/export, this might be acceptable.

If BSON type fidelity is required, consider using bson.MarshalExtJSON for extended JSON format:

 	case adiomv1.DataType_DATA_TYPE_MONGO_BSON:
-		var doc map[string]any
-		if err := bson.Unmarshal(data, &doc); err != nil {
+		var doc bson.Raw
+		if err := bson.Unmarshal(data, &doc); err != nil {
 			return nil, fmt.Errorf("bson to json: %w", err)
 		}
-		converted, err := json.Marshal(doc)
+		converted, err := bson.MarshalExtJSON(doc, true, false)
 		if err != nil {
 			return nil, fmt.Errorf("marshal json: %w", err)
 		}
 		return converted, nil

This preserves BSON types in extended JSON format (e.g., {"$oid": "..."} for ObjectID).

If plain JSON is sufficient, document this behavior clearly so users understand the type conversions.

📝 Committable suggestion

Suggested change:
func convertToJSON(data []byte, dataType adiomv1.DataType) ([]byte, error) {
	switch dataType {
	case adiomv1.DataType_DATA_TYPE_JSON_ID:
		if !json.Valid(data) {
			return nil, fmt.Errorf("invalid json payload")
		}
		return append([]byte(nil), data...), nil
	case adiomv1.DataType_DATA_TYPE_MONGO_BSON:
		var doc bson.Raw
		if err := bson.Unmarshal(data, &doc); err != nil {
			return nil, fmt.Errorf("bson to json: %w", err)
		}
		converted, err := bson.MarshalExtJSON(doc, true, false)
		if err != nil {
			return nil, fmt.Errorf("marshal json: %w", err)
		}
		return converted, nil
	default:
		return nil, ErrUnsupportedType
	}
}
🤖 Prompt for AI Agents
In connectors/s3/connector.go around lines 482 to 502, the current BSON→JSON
path unmarshals BSON into map[string]any then json.Marshal which loses BSON type
fidelity; replace that path to produce Extended JSON by using the bson library's
MarshalExtJSON on the raw BSON (so ObjectID/DateTime/Decimal128/etc. are
preserved as extended JSON tokens) and return that payload, and if you intend to
keep the current plain-JSON behavior instead, update the function comment and
public docs to explicitly state the lossy type conversions so consumers are
aware.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
connectors/s3/connector.go (2)

290-345: ListData behavior is reasonable; BSON round‑trip goes through JSON and may not preserve exact BSON types.

The method validates the S3 key from the partition cursor, reads a JSON array, and:

  • For DATA_TYPE_JSON_ID, returns each element as validated JSON bytes.
  • For DATA_TYPE_MONGO_BSON, unmarshals each JSON element into map[string]any and then re-marshals it back to BSON with bson.Marshal.

This is functionally fine if S3 is primarily a JSON export format. However, the BSON branch does a JSON→generic‑map→BSON conversion, which can change certain BSON types on round‑trip (e.g., extended JSON representations, numeric precision nuances). This was already called out in a previous review; if exact BSON semantics matter, you may want to adopt an Extended JSON strategy instead or clearly document that the S3 connector is JSON-oriented and BSON support is best-effort.

Check the `go.mongodb.org/mongo-driver/bson` documentation for how `bson.Unmarshal` into `map[string]any` followed by `bson.Marshal` treats ObjectID, DateTime, and Decimal128 types when starting from JSON, and whether `MarshalExtJSON` is preferable for preserving BSON semantics.

512-532: convertToJSON validates JSON and handles BSON input; BSON path is lossy and double‑marshals.

convertToJSON:

  • Validates raw JSON for DATA_TYPE_JSON_ID and returns a copy.
  • For DATA_TYPE_MONGO_BSON, unmarshals BSON into map[string]any and re‑marshals to JSON, which is straightforward but does a full decode/encode.
  • Returns ErrUnsupportedType for other data types, aligning with ListData.

As with ListData, the BSON path may lose some type fidelity relative to raw BSON, and it does a full double marshal; if performance or exact BSON preservation becomes important, you might want to revisit this (e.g., Extended JSON, or deferring conversion until needed).

Look up best practices for converting MongoDB BSON to JSON in Go using `go.mongodb.org/mongo-driver/bson` and whether `MarshalExtJSON` is recommended when round-tripping data between BSON and JSON.
🧹 Nitpick comments (6)
internal/app/options/connectorflags.go (1)

596-652: S3Flags align with ConnectorSettings; flag semantics vs description could be clarified.

The flags correctly populate all fields on s3connector.ConnectorSettings (region, prefix, output-format, profile, endpoint, credentials, path-style, pretty-json). Two small nits:

  • The prefix flag description says “Override or append…”, but the current implementation only overwrites settings.Prefix. If append semantics are desired (e.g. connection-string prefix + --prefix subdir), you’d need a bit of logic in the S3 Create action to combine them.
  • pretty-json defaults to true via Value: true, which is good; just ensure this matches your intended default for programmatic callers that bypass CLI.

Both are behavioral clarifications rather than blockers; the current implementation is otherwise consistent.
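
If the append semantics mentioned in the first bullet above were desired, a hedged sketch of combining the two prefixes (the helper name and where it would be called from are hypothetical) might look like:

import "path"

// combinePrefixes joins an optional connection-string prefix with an optional --prefix flag value.
func combinePrefixes(uriPrefix, flagPrefix string) string {
	switch {
	case flagPrefix == "":
		return uriPrefix
	case uriPrefix == "":
		return flagPrefix
	default:
		return path.Join(uriPrefix, flagPrefix)
	}
}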

connectors/s3/connector.go (5)

146-288: GeneratePlan / GetNamespaceMetadata and metadata caching are coherent; watch out for very large buckets.

The plan generation loop:

  • Lists objects under the optional configured prefix.
  • Filters to .json objects while excluding .metadata.json.
  • Derives namespaces from the key path (falling back to "default"), applies optional namespace filters, and looks up per-file counts from a cached readMetadata result per namespace.

GetNamespaceMetadata reuses the same metadata format and simply sums counts, which is consistent.

Only potential concern is scalability: for buckets with very large object counts, doing a full ListObjectsV2 across the prefix to build the plan could be expensive. If you expect such buckets, consider an optional limit or a naming convention that lets you narrow the listing further.
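
For reference, the listing loop described above is typically written with the AWS SDK v2 paginator; this is a sketch under the assumption that the connector iterates keys roughly like this, not the actual implementation:

import (
	"context"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func listJSONKeys(ctx context.Context, client *s3.Client, bucket, prefix string) ([]string, error) {
	var keys []string
	paginator := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	})
	for paginator.HasMorePages() {
		page, err := paginator.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		for _, obj := range page.Contents {
			key := aws.ToString(obj.Key)
			// keep data files, skip per-namespace metadata objects
			if strings.HasSuffix(key, ".json") && !strings.HasSuffix(key, ".metadata.json") {
				keys = append(keys, key)
			}
		}
	}
	return keys, nil
}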


391-438: Barrier handler and batch bookkeeping correctly avoid empty flushes; consider using request context for metadata.

OnTaskCompletionBarrierHandler cleanly:

  • Detaches and discards missing/empty batches (avoiding creation of empty files).
  • Flushes non-empty batches to S3, sets a sticky error on failure, and returns the error.
  • Updates namespace metadata after a successful flush, logging but not failing the barrier on metadata errors.

appendBatch/detachBatch under batchesMutex ensure per-task batches are consistent.

One small improvement would be to thread a context into OnTaskCompletionBarrierHandler (or at least use a shared cancellable context instead of context.Background() inside updateMetadataAfterFlush), so metadata writes can respect shutdown/timeouts like the flush itself.


484-510: JSON array builder works; preallocation heuristic is very low and may cause extra allocations.

buildJSONArray correctly builds a JSON array from individual documents and optionally pretty‑prints each element with json.Indent, falling back gracefully and logging when indentation fails. The newline placement also produces valid JSON.

The only nit is buf.Grow(len(docs) * 2), which significantly underestimates the needed capacity for realistic document sizes and will lead to repeated reallocations for large batches. If you care about performance here, consider summing len(doc) across docs (plus a small overhead) and using that to initialize the buffer capacity.


534-626: Namespace metadata helpers are consistent; global mutex serializes updates safely but could be narrowed later.

metadataKey, readMetadata, writeMetadata, and updateMetadataAfterFlush together implement per-namespace metadata as a JSON map of filename -> recordCount:

  • readMetadata gracefully treats NoSuchKey as “no metadata yet” and decodes the JSON map otherwise.
  • writeMetadata writes the map back as JSON with appropriate content type.
  • updateMetadataAfterFlush locks metadataMutex, performs a read-modify-write on the metadata for the file just flushed, and uses objectKey/path.Base to keep keys consistent with GeneratePlan.

Using a single metadataMutex on the connector is safe and avoids lost updates, matching past review guidance. If you later run into contention with many namespaces flushing concurrently, you could evolve this into a per-namespace lock map, but that’s not required for correctness.
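
A sketch of the read side, following the filename -> recordCount format described above; the function signature and the uint64 value type are assumptions, not the actual implementation:

import (
	"context"
	"encoding/json"
	"errors"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

func readNamespaceMetadata(ctx context.Context, client *s3.Client, bucket, key string) (map[string]uint64, error) {
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		var noKey *types.NoSuchKey
		if errors.As(err, &noKey) {
			return map[string]uint64{}, nil // no metadata yet for this namespace
		}
		return nil, err
	}
	defer out.Body.Close()
	counts := map[string]uint64{}
	if err := json.NewDecoder(out.Body).Decode(&counts); err != nil {
		return nil, err
	}
	return counts, nil
}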


37-120: Consider passing a cancellable context to LoadDefaultConfig instead of context.Background().

The AWS SDK v2 best practices recommend passing a caller-owned context (rather than context.Background()) to LoadDefaultConfig so the config loading can be cancelled or respect process-wide timeouts. This enables graceful shutdown and timeout handling during initialization. If constructor timeouts are not a concern, context.Background() is acceptable, but using a cancellable context (or one with a timeout) is preferred for production deployments with strict shutdown requirements.
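
A hedged sketch of what the suggested constructor change could look like (the 30-second timeout and option set are illustrative only):

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func newS3Client(region string) (*s3.Client, error) {
	// Bound config loading instead of using a bare context.Background().
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
	if err != nil {
		return nil, err
	}
	return s3.NewFromConfig(cfg), nil
}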

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 491d820 and 6ee0eca.

📒 Files selected for processing (2)
  • connectors/s3/connector.go (1 hunks)
  • internal/app/options/connectorflags.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
connectors/s3/connector.go (3)
connectors/common/base.go (1)
  • ConnectorSettings (49-58)
gen/adiom/v1/adiomv1connect/adiom.connect.go (1)
  • ConnectorServiceHandler (199-210)
gen/adiom/v1/messages.pb.go (2)
  • DataType_DATA_TYPE_JSON_ID (30-30)
  • DataType_DATA_TYPE_MONGO_BSON (29-29)
internal/app/options/connectorflags.go (1)
connectors/s3/connector.go (2)
  • ConnectorSettings (38-50)
  • NewConn (76-120)
🔇 Additional comments (6)
internal/app/options/connectorflags.go (1)

548-564: Connection string parser is robust and case-insensitive on the scheme.

The parseS3ConnectionString helper correctly validates the s3:// prefix case-insensitively, enforces a non-empty bucket, and cleanly splits an optional key prefix. Behavior for inputs like s3://bucket, s3://bucket/prefix, and s3://bucket/ is well-defined.
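
A sketch consistent with the behavior described (not the actual implementation; error messages are illustrative):

import (
	"fmt"
	"strings"
)

func parseS3ConnectionString(uri string) (bucket, prefix string, err error) {
	if !strings.HasPrefix(strings.ToLower(uri), "s3://") {
		return "", "", fmt.Errorf("invalid s3 connection string: %q", uri)
	}
	rest := uri[len("s3://"):]
	parts := strings.SplitN(rest, "/", 2)
	if parts[0] == "" {
		return "", "", fmt.Errorf("missing bucket in s3 connection string: %q", uri)
	}
	bucket = parts[0]
	if len(parts) == 2 {
		prefix = parts[1] // optional key prefix; empty for "s3://bucket/"
	}
	return bucket, prefix, nil
}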

connectors/s3/connector.go (5)

122-144: Capabilities advertised in GetInfo match the implemented behavior.

GetInfo correctly reports DbType "s3" and advertises source/sink support for DATA_TYPE_JSON_ID and DATA_TYPE_MONGO_BSON, which aligns with WriteData/convertToJSON and the ListData branches. LSN streaming and updates are marked unsupported as expected.


357-385: WriteData batching and error short‑circuiting look good; taskId validation is strict.

WriteData:

  • Short‑circuits on currentError() to avoid writing after a fatal failure.
  • Requires a nonzero TaskId and returns InvalidArgument otherwise.
  • Converts incoming docs to JSON via convertToJSON and appends them to a per‑task batch.
  • No‑ops gracefully when data is empty.

This matches the per‑task buffering design and keeps the sink robust in the presence of prior errors. The only minor trade‑off is that convertToJSON failures are surfaced as InvalidArgument (client error), which is appropriate given malformed payloads.


440-468: flushBatch and objectKey are consistent with GeneratePlan and metadataKey.

flushBatch writes a single JSON array per task under the key from objectKey, with appropriate ContentType. objectKey and metadataKey both normalize namespaces by trimming slashes and turning dots into path separators, using "default" when namespace is empty, and prepending the optional connector prefix.

This keeps object layout and metadata filenames consistent across write, planning, and metadata reads.


470-483: Sticky error handling is simple and effective.

The currentError/setError pair with RWMutex ensures that once a fatal error is recorded, subsequent writes are promptly rejected without racing. Only the first error is preserved, which is typically sufficient for debugging.
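
A minimal sketch of the described pattern (the standalone type is hypothetical; in the connector these methods live on the connector struct itself):

import "sync"

type stickyError struct {
	mu  sync.RWMutex
	err error
}

func (s *stickyError) currentError() error {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.err
}

func (s *stickyError) setError(err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.err == nil { // only the first error is preserved
		s.err = err
	}
}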


347-355: Remove this concern—errors.ErrUnsupported is available in Go 1.24.

The project requires Go 1.24 (as specified in go.mod), which is later than Go 1.21, the release that introduced errors.ErrUnsupported in the standard library. The code is correct and requires no changes.

Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
internal/app/options/connectorflags.go (1)

300-316: S3 connector registration looks correct, but help-by-name invocation issue remains.

The past review comment about help invocation (e.g., dsync S3 --help) causing an "invalid s3 connection string" error is still applicable. While the urfave/cli library typically handles --help before executing the action, the current implementation sets settings.Uri = args[0] before calling CreateHelper, which could cause issues if the action is invoked.

🧹 Nitpick comments (1)
connectors/s3/connector.go (1)

316-371: ListData correctly implements data retrieval.

The implementation properly reads JSON arrays from S3 and converts them based on the requested data type. For DATA_TYPE_MONGO_BSON, the JSON-to-BSON conversion (lines 352-361) reconstructs BSON documents from the JSON stored in S3.

Note that BSON-specific types (ObjectID, DateTime, Decimal128, etc.) are stored as their JSON representations, not as extended JSON with type markers. This means round-tripping BSON → JSON (in WriteData) → BSON (in ListData) loses type fidelity. If this is intentional for an analytics/export use case, consider documenting this behavior in the package comment or GetInfo response.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ee0eca and 5eb9723.

📒 Files selected for processing (2)
  • connectors/s3/connector.go (1 hunks)
  • internal/app/options/connectorflags.go (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
connectors/s3/connector.go (3)
gen/adiom/v1/adiomv1connect/adiom.connect.go (1)
  • ConnectorServiceHandler (199-210)
gen/adiom/v1/messages.pb.go (2)
  • DataType_DATA_TYPE_JSON_ID (30-30)
  • DataType_DATA_TYPE_MONGO_BSON (29-29)
protocol/iface/connector.go (1)
  • Namespace (104-107)
internal/app/options/connectorflags.go (1)
connectors/s3/connector.go (2)
  • ConnectorSettings (38-51)
  • NewConn (95-146)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (5)
internal/app/options/connectorflags.go (1)

573-629: S3 flags are well-defined.

The flag definitions follow the established pattern with altsrc wrappers for configuration file support and proper destination binding. The required region flag and configurable options for AWS credentials, endpoint, and path-style addressing are appropriate for S3 integration.

Note: The prefix flag description states it will "Override or append to the key prefix derived from the connection string" — verify that this behavior is correctly implemented in the connector (see connector.go review).

connectors/s3/connector.go (4)

1-74: Package structure and types are well-designed.

The connector struct includes proper synchronization primitives (batchesMutex for per-task batch management and metadataMutex for metadata updates), and the ConnectorSettings provides comprehensive configuration options for AWS S3 integration.


383-443: WriteData and barrier handler are correctly implemented.

The sink implementation properly:

  • Buffers data per-task until barrier is received
  • Validates taskID is non-zero (defensive check)
  • Returns early when batch is nil or empty (lines 420-427), preventing empty file creation
  • Handles metadata update failures gracefully without failing the barrier (lines 436-439), since data was successfully flushed

445-508: Batch management and helper functions are correct.

The thread-safe batch operations, S3 key generation, and error tracking are all properly implemented. The namespace path transformation (line 483, converting dots to slashes) provides reasonable hierarchical organization in S3.


510-652: JSON construction and metadata operations are well-implemented.

The buildJSONArray function includes good error handling with fallback when pretty-printing fails (lines 524-527). Metadata operations are properly synchronized with metadataMutex (line 625), addressing the race condition concern from previous reviews. The NoSuchKey error handling (lines 584-590) correctly treats missing metadata files as expected for newly created namespaces.

@alex-thc alex-thc requested a review from adiom-mark December 5, 2025 02:29

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
connectors/s3/connector.go (2)

440-445: Optional: Remove redundant check.

The condition if len(batch.docs) > 0 at line 440 is always true because lines 429-432 already return early when batch.docs is empty. This check can be safely removed.

 	}
 
 	// Update metadata file atomically after successful flush
-	if len(batch.docs) > 0 {
-		if err := c.updateMetadataAfterFlush(context.Background(), batch.namespace, taskID, uint64(len(batch.docs))); err != nil {
-			slog.Error("failed to update metadata", "namespace", batch.namespace, "taskId", taskID, "err", err)
-			// Log error but don't fail the barrier - the data was successfully flushed
-		}
+	if err := c.updateMetadataAfterFlush(context.Background(), batch.namespace, taskID, uint64(len(batch.docs))); err != nil {
+		slog.Error("failed to update metadata", "namespace", batch.namespace, "taskId", taskID, "err", err)
+		// Log error but don't fail the barrier - the data was successfully flushed
 	}

627-657: Recommended: Consider per-namespace mutex for better concurrency.

The global metadataMutex (line 630) serializes metadata updates across all namespaces. With many concurrent namespaces, this could become a bottleneck since updates to different namespaces don't actually conflict.

Consider using per-namespace locking for better concurrency:

 type connector struct {
 	adiomv1connect.UnimplementedConnectorServiceHandler
 
 	client       *s3.Client
 	settings     ConnectorSettings
 	batchesMutex sync.Mutex
 	batches      map[taskKey]*storedBatch
 
-	metadataMutex sync.Mutex // Serialize metadata updates
+	namespaceLocks sync.Map // map[string]*sync.Mutex for per-namespace locking
 
 	errMutex sync.RWMutex
 	err      error
 }

Then in updateMetadataAfterFlush:

 func (c *connector) updateMetadataAfterFlush(ctx context.Context, namespace string, taskID uint, recordCount uint64) error {
-	c.metadataMutex.Lock()
-	defer c.metadataMutex.Unlock()
+	// Get or create mutex for this namespace
+	lockVal, _ := c.namespaceLocks.LoadOrStore(namespace, &sync.Mutex{})
+	mu := lockVal.(*sync.Mutex)
+	mu.Lock()
+	defer mu.Unlock()

This allows concurrent metadata updates for different namespaces while still preventing races within the same namespace.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5eb9723 and fa918fb.

📒 Files selected for processing (1)
  • connectors/s3/connector.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connectors/s3/connector.go (3)
gen/adiom/v1/adiomv1connect/adiom.connect.go (1)
  • ConnectorServiceHandler (199-210)
gen/adiom/v1/messages.pb.go (2)
  • DataType_DATA_TYPE_JSON_ID (30-30)
  • DataType_DATA_TYPE_MONGO_BSON (29-29)
protocol/iface/connector.go (1)
  • Namespace (104-107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (4)
connectors/s3/connector.go (4)

94-153: LGTM: Constructor properly handles configuration and AWS client setup.

The prefix combination logic (lines 101-108) correctly addresses the previous review concern by merging URI and flag-provided prefixes. AWS SDK v2 configuration with credentials, profile, and endpoint overrides is correctly implemented.


179-285: LGTM: Efficient plan generation with metadata caching.

The namespace metadata caching (line 208) prevents redundant S3 reads during plan generation. Namespace filtering and estimated count extraction from metadata are correctly implemented.


287-319: LGTM: Consistent namespace handling with graceful error fallback.

The empty namespace defaulting to "default" (lines 290-292) is now consistent with the key generation helpers, addressing the previous review concern. Graceful error handling with warnings is appropriate for missing metadata.


321-376: LGTM: ListData correctly handles JSON source with type conversion.

The implementation properly reads JSON arrays from S3 and converts to requested output format (JSON passthrough or JSON-to-BSON conversion). JSON validation ensures data integrity.

Comment on lines +450 to +460
func (c *connector) appendBatch(namespace string, taskID uint, docs [][]byte) {
	c.batchesMutex.Lock()
	defer c.batchesMutex.Unlock()
	key := taskKey{taskID}
	batch, ok := c.batches[key]
	if !ok {
		batch = &storedBatch{namespace: namespace}
		c.batches[key] = batch
	}
	batch.docs = append(batch.docs, docs...)
}

⚠️ Potential issue | 🟠 Major

Major issue: Namespace mismatch can cause data to be written to the wrong namespace.

The appendBatch function keys batches solely by taskID (line 453), but stores a single namespace per batch (line 456). If WriteData is called multiple times with the same taskID but different namespaces, the subsequent calls will append documents to the batch created by the first call, effectively ignoring the namespace parameter and causing all documents to be flushed to the first namespace.

Scenario:

  1. WriteData(namespace="orders", taskID=1, docs=[doc1]) → creates batch with namespace="orders"
  2. WriteData(namespace="customers", taskID=1, docs=[doc2]) → appends to existing batch, ignores "customers" namespace
  3. Barrier for task 1 → flushes both doc1 and doc2 to "orders" namespace

Fix: Either validate namespace consistency or make the batch key include both taskID and namespace.

Solution 1: Validate namespace consistency (simpler):

 func (c *connector) appendBatch(namespace string, taskID uint, docs [][]byte) {
 	c.batchesMutex.Lock()
 	defer c.batchesMutex.Unlock()
 	key := taskKey{taskID}
 	batch, ok := c.batches[key]
 	if !ok {
 		batch = &storedBatch{namespace: namespace}
 		c.batches[key] = batch
+	} else if batch.namespace != namespace {
+		// This should never happen if task routing is correct
+		slog.Error("namespace mismatch for taskID", "taskID", taskID, "expected", batch.namespace, "got", namespace)
+		return
 	}
 	batch.docs = append(batch.docs, docs...)
 }

Solution 2: Use composite key (more robust if multi-namespace tasks are valid):

Update taskKey to include namespace:

 type taskKey struct {
-	taskID uint
+	taskID    uint
+	namespace string
 }

Then update all usages accordingly.

📝 Committable suggestion

Suggested change:
func (c *connector) appendBatch(namespace string, taskID uint, docs [][]byte) {
	c.batchesMutex.Lock()
	defer c.batchesMutex.Unlock()
	key := taskKey{taskID}
	batch, ok := c.batches[key]
	if !ok {
		batch = &storedBatch{namespace: namespace}
		c.batches[key] = batch
	} else if batch.namespace != namespace {
		// This should never happen if task routing is correct
		slog.Error("namespace mismatch for taskID", "taskID", taskID, "expected", batch.namespace, "got", namespace)
		return
	}
	batch.docs = append(batch.docs, docs...)
}
🤖 Prompt for AI Agents
connectors/s3/connector.go around lines 450-460: appendBatch currently keys
batches only by taskID causing namespace mixing; fix by making the batch key
composite — add a namespace string field to taskKey and use taskKey{taskID,
namespace} when looking up/creating batches (update the taskKey type declaration
and all places that construct or use it), or if you prefer the simpler option,
validate that when a batch exists its namespace equals the passed namespace and
return/log/error on mismatch; ensure all map lookups/assignments and any
function signatures that used taskKey are updated accordingly.

Comment on lines +515 to +541
func buildJSONArray(docs [][]byte, prettyJSON bool) []byte {
	var buf bytes.Buffer
	buf.Grow(len(docs) * 2)
	buf.WriteByte('[')
	for i, doc := range docs {
		if i > 0 {
			buf.WriteByte(',')
		}
		if prettyJSON {
			var prettyBuf bytes.Buffer
			prettyBuf.Grow(len(doc) + len(doc)/10) // rough estimate
			if err := json.Indent(&prettyBuf, doc, "", " "); err == nil {
				buf.WriteByte('\n')
				doc = prettyBuf.Bytes()
			} else {
				// If indenting fails, fall back to original
				slog.Warn("Failed JSON indentation. Falling back to no-indent")
			}
		}
		buf.Write(doc)
	}
	if prettyJSON {
		buf.WriteByte('\n')
	}
	buf.WriteByte(']')
	return buf.Bytes()
}

🛠️ Refactor suggestion | 🟠 Major

Performance issue: Buffer pre-allocation is far too small.

Line 517 allocates only len(docs) * 2 bytes for a buffer that needs to hold the entire JSON array. For typical documents (hundreds to thousands of bytes each), this is orders of magnitude too small and will cause multiple reallocations as the buffer grows.

Example: 100 documents of 500 bytes each require ~50KB, but Grow(200) only allocates 200 bytes.

Apply this diff to properly size the buffer:

 func buildJSONArray(docs [][]byte, prettyJSON bool) []byte {
 	var buf bytes.Buffer
-	buf.Grow(len(docs) * 2)
+	// Estimate total size: sum of doc lengths + array overhead
+	estimatedSize := 2 // '[' and ']'
+	for _, doc := range docs {
+		estimatedSize += len(doc) + 1 // doc + comma
+	}
+	if prettyJSON {
+		estimatedSize += len(docs) * 3 // rough estimate for newlines and indentation
+	}
+	buf.Grow(estimatedSize)
 	buf.WriteByte('[')
📝 Committable suggestion

Suggested change:
func buildJSONArray(docs [][]byte, prettyJSON bool) []byte {
	var buf bytes.Buffer
	// Estimate total size: sum of doc lengths + array overhead
	estimatedSize := 2 // '[' and ']'
	for _, doc := range docs {
		estimatedSize += len(doc) + 1 // doc + comma
	}
	if prettyJSON {
		estimatedSize += len(docs) * 3 // rough estimate for newlines and indentation
	}
	buf.Grow(estimatedSize)
	buf.WriteByte('[')
	for i, doc := range docs {
		if i > 0 {
			buf.WriteByte(',')
		}
		if prettyJSON {
			var prettyBuf bytes.Buffer
			prettyBuf.Grow(len(doc) + len(doc)/10) // rough estimate
			if err := json.Indent(&prettyBuf, doc, "", " "); err == nil {
				buf.WriteByte('\n')
				doc = prettyBuf.Bytes()
			} else {
				// If indenting fails, fall back to original
				slog.Warn("Failed JSON indentation. Falling back to no-indent")
			}
		}
		buf.Write(doc)
	}
	if prettyJSON {
		buf.WriteByte('\n')
	}
	buf.WriteByte(']')
	return buf.Bytes()
}
🤖 Prompt for AI Agents
In connectors/s3/connector.go around lines 515 to 541, the buffer is currently
pre-allocated with buf.Grow(len(docs) * 2) which is far too small and causes
many reallocations; replace this with a proper estimate: compute total := 2 (for
'[' and ']') + max(0, len(docs)-1) (for commas) + sum(len(doc) for each doc) and
if prettyJSON add a small growth allowance (e.g. sum(len(doc)/10) or a fixed
extra per-doc) to account for indentation/newlines, then call buf.Grow(total)
before writing into the buffer so it is sized appropriately.
