S3 (json) connector #351
base: main
Conversation
Walkthrough
Propagates TaskId through data and barrier flows, adds `OnTaskCompletionBarrierHandlerServicable`, introduces an S3 sink connector with per-task buffering and barrier-triggered flush to S3, updates proto/Java messages to include `task_id`, and bumps Go module dependencies for S3.
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
participant Core as Connector Core
participant Buffer as Per-Task Buffer
participant Barrier as Barrier Handler
participant S3 as AWS S3
participant Meta as Namespace Metadata
Core->>Buffer: WriteData(taskId, data)
Note right of Buffer: Accumulate JSON docs per task (in-memory)
Core->>Core: ProcessDataMessages (preserve TaskId)
Core->>Barrier: Emit Barrier(TaskComplete, barrierTaskId)
Barrier-->>Core: If implements OnTaskCompletionBarrierHandler -> call OnTaskCompletionBarrierHandler(barrierTaskId)
Core->>Buffer: Flush buffer for barrierTaskId
Buffer->>S3: PutObject(namespace/derived-key.json) [JSON array]
S3-->>Core: PutObject response
Core->>Meta: Read/Update namespace metadata with new key/count
Meta-->>Core: Metadata update response
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🧹 Nitpick comments (3)
internal/app/options/connectorflags.go (1)
604-608: Clarify prefix flag behavior in usage text.
The usage says "Override or append to" but the implementation only overrides (via `Destination`). If the CLI flag is provided, it completely replaces the prefix from the connection string. Consider updating the usage to reflect actual behavior:

```diff
 altsrc.NewStringFlag(&cli.StringFlag{
 	Name:        "prefix",
-	Usage:       "Override or append to the key prefix derived from the connection string",
+	Usage:       "Override the key prefix derived from the connection string",
 	Destination: &settings.Prefix,
 }),
```
connectors/s3/connector.go (2)
418-432: Consider accepting a context parameter.
`flushBatch` uses `context.Background()`, which won't respect cancellation signals from the caller. If the application is shutting down, this S3 put operation could block indefinitely.

```diff
-func (c *connector) flushBatch(namespace string, taskID uint, docs [][]byte) error {
+func (c *connector) flushBatch(ctx context.Context, namespace string, taskID uint, docs [][]byte) error {
 	payload := buildJSONArray(docs)
 	key := c.objectKey(namespace, taskID)
-	_, err := c.client.PutObject(context.Background(), &s3.PutObjectInput{
+	_, err := c.client.PutObject(ctx, &s3.PutObjectInput{
```
462-474: Buffer pre-allocation is underestimated.
`buf.Grow(len(docs) * 2)` significantly underestimates the required capacity since each document could be much larger than 2 bytes. While this won't cause correctness issues (the buffer grows automatically), it defeats the purpose of pre-allocation. Consider estimating based on actual document sizes:

```diff
 func buildJSONArray(docs [][]byte) []byte {
 	var buf bytes.Buffer
-	buf.Grow(len(docs) * 2)
+	totalSize := 2 + len(docs) // brackets + commas
+	for _, doc := range docs {
+		totalSize += len(doc)
+	}
+	buf.Grow(totalSize)
 	buf.WriteByte('[')
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (6)
- `gen/adiom/v1/adiom.pb.go` is excluded by `!**/*.pb.go`, `!**/gen/**`
- `gen/adiom/v1/adiomv1connect/adiom.connect.go` is excluded by `!**/gen/**`
- `gen/adiom/v1/adiomv1connect/vector.connect.go` is excluded by `!**/gen/**`
- `gen/adiom/v1/messages.pb.go` is excluded by `!**/*.pb.go`, `!**/gen/**`
- `gen/adiom/v1/vector.pb.go` is excluded by `!**/*.pb.go`, `!**/gen/**`
- `go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (7)
- `connectors/common/base.go` (6 hunks)
- `connectors/s3/connector.go` (1 hunks)
- `go.mod` (2 hunks)
- `internal/app/options/connectorflags.go` (4 hunks)
- `java/src/main/java/adiom/v1/Messages.java` (13 hunks)
- `proto/adiom/v1/messages.proto` (1 hunks)
- `protocol/iface/transport.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
connectors/s3/connector.go (4)
- connectors/common/base.go (1): `ConnectorSettings` (49-58)
- gen/adiom/v1/adiomv1connect/adiom.connect.go (1): `ConnectorServiceHandler` (199-210)
- gen/adiom/v1/messages.pb.go (2): `DataType_DATA_TYPE_JSON_ID` (30-30), `DataType_DATA_TYPE_MONGO_BSON` (29-29)
- protocol/iface/connector.go (1): `Namespace` (104-107)
internal/app/options/connectorflags.go (1)
- connectors/s3/connector.go (2): `ConnectorSettings` (38-49), `NewConn` (73-116)
🔇 Additional comments (21)
proto/adiom/v1/messages.proto (1)
123-128: LGTM - backward-compatible field addition.
The new `task_id` field at position 4 is correctly placed and maintains proto backward compatibility. The default zero value for uint32 is safe for existing clients.
protocol/iface/transport.go (1)
28-28: LGTM - clear separation of concerns.
The `TaskId` field for data messages is distinct from `BarrierTaskId` for barrier signals. This separation allows proper task association through the data flow while maintaining barrier semantics.
go.mod (1)
10-15: LGTM - appropriate dependency additions for S3 support.
The AWS SDK v2 upgrade and S3 service addition are necessary for the new S3 connector. The transitive dependencies (eventstream, checksum, s3shared, etc.) are standard requirements for S3 operations.
internal/app/options/connectorflags.go (2)
300-321: LGTM - S3 connector registration follows established patterns.
The registration logic correctly:
- Uses case-insensitive prefix matching for `s3://`
- Validates connection string presence
- Parses bucket/prefix before delegating to `CreateHelper`
548-564: LGTM - robust connection string parsing.
The function correctly handles:
- Case-insensitive `s3://` prefix validation
- Empty bucket detection
- Optional prefix extraction via `SplitN`
connectors/common/base.go (4)
45-47: LGTM - well-designed optional interface.
The single-method interface enables connectors to optionally handle task completion barriers without forcing all connectors to implement this behavior. This follows Go's interface composition idiom.
626-654: LGTM - consistent TaskId propagation across all data paths.
TaskId is correctly set in all three branches:
- Transformed multi-response path (line 630)
- Transformed single-response path (line 643)
- Non-transformed path (line 652)
This ensures task context flows through regardless of transformation configuration.
988-994: LGTM - safe optional interface invocation.
The type assertion pattern correctly checks if the connector implements `OnTaskCompletionBarrierHandlerServicable` before invoking the hook. Error propagation is handled properly.
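For readers following along outside the diff, here is a minimal, self-contained sketch of the optional-interface pattern this comment describes. The interface and method name come from this PR; the surrounding function and toy connector are illustrative assumptions, not the actual call site in `connectors/common/base.go`.

```go
package main

import "fmt"

// Optional hook (names per this PR): connectors that buffer data per task can
// implement this to flush their state when a task-completion barrier arrives.
type OnTaskCompletionBarrierHandlerServicable interface {
	OnTaskCompletionBarrierHandler(taskID uint) error
}

// Hypothetical invocation site: the type assertion keeps the hook optional,
// and any error is propagated to the barrier-processing caller.
func notifyTaskCompletion(conn any, barrierTaskID uint) error {
	if h, ok := conn.(OnTaskCompletionBarrierHandlerServicable); ok {
		return h.OnTaskCompletionBarrierHandler(barrierTaskID)
	}
	return nil // connectors without the hook are unaffected
}

// toy connector used only to exercise the pattern
type flushLogger struct{}

func (flushLogger) OnTaskCompletionBarrierHandler(taskID uint) error {
	fmt.Println("flush buffered docs for task", taskID)
	return nil
}

func main() {
	_ = notifyTaskCompletion(flushLogger{}, 42) // prints the flush message
	_ = notifyTaskCompletion(struct{}{}, 42)    // no hook -> no-op
}
```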
1094-1099: LGTM - TaskId correctly propagated to WriteDataRequest.
The TaskId is appropriately cast from `uint` to `uint32` to match the proto field type, maintaining end-to-end task correlation from reader through writer.
java/src/main/java/adiom/v1/Messages.java (1)
12321-12326: Generated protobuf code - changes look correct.
This file appears to be auto-generated from `proto/adiom/v1/messages.proto` by the protobuf compiler. The additions for the `task_id` field (field number 4, `uint32`) follow standard protobuf Java codegen patterns correctly:
- Interface getter declaration
- Field storage with default value `0`
- Serialization/deserialization under tag 32 (field 4, varint wire type 0)
- Proper bit field tracking (`0x00000008`)
- Builder methods with `onChanged()` calls

Generated files are typically excluded from manual code review since the source of truth is the `.proto` file. Consider adding this path to your review exclusion patterns if not already done.
Also applies to: 12454-12463, 13053-13083
connectors/s3/connector.go (11)
1-35: LGTM!
Clean package setup with appropriate imports and well-defined sentinel errors.

37-70: LGTM!
The struct definitions are well-organized. The use of separate mutexes for batches and error state is appropriate for the different access patterns.

72-116: LGTM!
Good validation of required fields, sensible defaults, and flexible credential configuration supporting both profile-based and static credentials.

118-139: LGTM!
Capabilities are correctly defined for an S3-based JSON connector with source and sink support.

141-247: LGTM!
Well-structured plan generation with proper pagination, filtering, and metadata caching. The graceful degradation when metadata is unavailable is a good approach.

249-283: LGTM!
Defensive error handling with appropriate fallback to zero count when metadata is unavailable.

285-327: LGTM with a note.
The implementation is correct. The entire file is loaded into memory, which is fine for typical use cases but could be a concern for very large JSON files. Consider documenting expected file size limits if this becomes a production concern.

329-337: LGTM!
Correctly returns unimplemented for streaming operations that S3 doesn't support.

339-366: LGTM!
Good defensive checks for error state and required task ID. The batching approach is appropriate for S3's write patterns.

368-371: LGTM!
Correctly rejects update operations, which aren't supported for S3's append-only nature.

476-496: LGTM!
The conversion logic handles supported types correctly. Note that BSON to JSON conversion via `map[string]any` may lose some BSON-specific type information (e.g., ObjectId, Decimal128), but this is likely acceptable for a generic JSON sink.
Actionable comments posted: 0
♻️ Duplicate comments (2)
connectors/s3/connector.go (2)
373-395: Empty files still created for tasks with no data.
When `batch` is nil (no data was written for this task), the code creates an empty `storedBatch{}` and proceeds to call `flushBatch`, which writes an empty JSON array to S3. This pollutes the bucket with unnecessary files. Return early when there's no data to flush:

```diff
 func (c *connector) OnTaskCompletionBarrierHandler(taskID uint) error {
 	batch := c.detachBatch(taskID)
 	if batch == nil {
 		slog.Debug("s3 connector received barrier with no data", "taskId", taskID)
-		batch = &storedBatch{}
+		return nil
 	}
+	if len(batch.docs) == 0 {
+		slog.Debug("s3 connector received barrier with empty batch", "taskId", taskID)
+		return nil
+	}
 	if err := c.flushBatch(batch.namespace, taskID, batch.docs); err != nil {
```
560-587: Race condition in concurrent metadata updates.
`updateMetadataAfterFlush` performs an unsynchronized read-modify-write. When multiple tasks for the same namespace complete concurrently, metadata updates can be lost because each goroutine reads, modifies, and writes independently.

Example race:
- Task A reads metadata `{file1: 100}`
- Task B reads metadata `{file1: 100}`
- Task A writes `{file1: 100, task-2.json: 50}`
- Task B writes `{file1: 100, task-3.json: 75}`, and the `task-2.json` entry is lost

Add a per-namespace mutex to serialize metadata updates:

```diff
 type connector struct {
 	adiomv1connect.UnimplementedConnectorServiceHandler
 	client        *s3.Client
 	settings      ConnectorSettings
 	batchesMutex  sync.Mutex
 	batches       map[taskKey]*storedBatch
+	metadataMutex sync.Mutex // Serialize metadata updates
 	errMutex      sync.RWMutex
 	err           error
 }
```

Then protect the read-modify-write in `updateMetadataAfterFlush`:

```diff
 func (c *connector) updateMetadataAfterFlush(ctx context.Context, namespace string, taskID uint, recordCount uint64) error {
+	c.metadataMutex.Lock()
+	defer c.metadataMutex.Unlock()
+
 	// Read current metadata (or create empty map if doesn't exist)
 	metadata, err := c.readMetadata(ctx, namespace)
```

For better concurrency with many namespaces, consider per-namespace locks using `sync.Map`.
🧹 Nitpick comments (1)
connectors/common/base.go (1)
1098-1098: TaskId type narrowing from uint to uint32.
The conversion from `uint` (which is 64-bit on 64-bit systems) to `uint32` could theoretically overflow if task IDs exceed 2³²-1. While task IDs are typically small sequential numbers making overflow unlikely in practice, consider documenting this limitation or adding validation if task ID limits are a concern.
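If validation is ever added, a guard could look roughly like the sketch below; the helper name and call shape are hypothetical, only the `uint`-to-`uint32` narrowing concern comes from the comment above.

```go
package main

import (
	"fmt"
	"math"
)

// toTaskID32 narrows a uint task ID to uint32, failing loudly instead of
// silently wrapping if the value ever exceeds the proto field's range.
func toTaskID32(taskID uint) (uint32, error) {
	if uint64(taskID) > math.MaxUint32 {
		return 0, fmt.Errorf("task ID %d exceeds uint32 range", taskID)
	}
	return uint32(taskID), nil
}

func main() {
	id, err := toTaskID32(42)
	fmt.Println(id, err) // 42 <nil>
}
```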
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (6)
- `gen/adiom/v1/adiom.pb.go` is excluded by `!**/*.pb.go`, `!**/gen/**`
- `gen/adiom/v1/adiomv1connect/adiom.connect.go` is excluded by `!**/gen/**`
- `gen/adiom/v1/adiomv1connect/vector.connect.go` is excluded by `!**/gen/**`
- `gen/adiom/v1/messages.pb.go` is excluded by `!**/*.pb.go`, `!**/gen/**`
- `gen/adiom/v1/vector.pb.go` is excluded by `!**/*.pb.go`, `!**/gen/**`
- `go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (7)
- `connectors/common/base.go` (6 hunks)
- `connectors/s3/connector.go` (1 hunks)
- `go.mod` (2 hunks)
- `internal/app/options/connectorflags.go` (4 hunks)
- `java/src/main/java/adiom/v1/Messages.java` (13 hunks)
- `proto/adiom/v1/messages.proto` (1 hunks)
- `protocol/iface/transport.go` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- proto/adiom/v1/messages.proto
- protocol/iface/transport.go
🧰 Additional context used
🧬 Code graph analysis (2)
connectors/s3/connector.go (3)
- gen/adiom/v1/adiomv1connect/adiom.connect.go (1): `ConnectorServiceHandler` (199-210)
- gen/adiom/v1/messages.pb.go (2): `DataType_DATA_TYPE_JSON_ID` (30-30), `DataType_DATA_TYPE_MONGO_BSON` (29-29)
- protocol/iface/connector.go (1): `Namespace` (104-107)
internal/app/options/connectorflags.go (1)
- connectors/s3/connector.go (2): `ConnectorSettings` (38-49), `NewConn` (73-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (19)
java/src/main/java/adiom/v1/Messages.java (4)
12822-12826: Wire type parsing is correct.
The parsing case 32 corresponds to field number 4 with wire type 0 (varint): `(4 << 3) | 0 = 32`. The bit flag `0x00000008` correctly tracks the 4th field's presence.
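For anyone double-checking the arithmetic, a tiny standalone snippet (not from the PR) that reproduces the numbers quoted above:

```go
package main

import "fmt"

func main() {
	const fieldNumber = 4 // task_id
	const wireType = 0    // varint
	fmt.Println(fieldNumber<<3 | wireType) // 32 -> the "case 32" in the generated parser
	fmt.Printf("0x%08X\n", 1<<3)           // 0x00000008 -> the presence bit noted above
}
```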
13053-13083: Builder methods follow standard protobuf patterns.
The `setTaskId`, `getTaskId`, and `clearTaskId` builder methods are correctly implemented with proper bit field tracking and `onChanged()` notifications.

23995-24046: Descriptor string updated for the new field.
The serialized descriptor includes the `task_id` field definition. This section is typically auto-generated and should match the compiled proto output.

12454-12463: Java implementation correctly matches the proto definition.
The `taskId` field implementation is properly generated from `proto/adiom/v1/messages.proto`, which defines `uint32 task_id = 4` in the `WriteDataRequest` message. The wire tag calculation (case 32), field numbering, serialization/deserialization, and builder methods all follow standard protobuf codegen patterns correctly.
connectors/common/base.go (3)
45-47: LGTM! Optional barrier hook interface added.
The new interface enables connectors to implement custom logic when task completion barriers are processed, which is essential for the S3 connector's per-task buffering and flushing strategy.

630-630: LGTM! Task ID propagation implemented correctly.
The TaskId is consistently propagated across all data message creation paths (transformed and non-transformed), enabling per-task tracking and barrier handling downstream.
Also applies to: 643-643, 652-652

988-994: LGTM! Barrier hook integration is well-implemented.
The optional hook is properly guarded with a type assertion and errors are propagated correctly. This allows connectors like S3 to flush buffered data when task completion barriers arrive.
internal/app/options/connectorflags.go (3)
300-321: LGTM! S3 connector registration follows established patterns.
The connector registration correctly parses S3 URIs, extracts bucket and prefix, and delegates to the helper for flag processing, consistent with other connectors in the registry.

548-564: LGTM! Connection string parsing is robust.
The parsing correctly handles the `s3://bucket[/prefix]` format with proper validation and error messages. Prefix trimming is delegated to the connector implementation, which is appropriate.

596-646: LGTM! S3 flags are comprehensive and well-structured.
The flags cover all necessary AWS S3 configuration options including region, credentials, endpoint customization, and path-style addressing, suitable for both AWS and S3-compatible services.
connectors/s3/connector.go (8)
72-116: LGTM! Connector initialization is well-structured.
The initialization properly validates required settings, configures the AWS SDK with flexible credential options (profile, static, default), and supports custom endpoints for testing or S3-compatible services.

118-139: LGTM! Capabilities accurately reflect connector functionality.
The connector correctly advertises JSON source support and BSON/JSON sink support with appropriate capability flags for multi-namespace planning.

141-247: LGTM! Plan generation is thorough and handles edge cases well.
The implementation correctly lists S3 objects with pagination, filters by namespace, loads metadata for count estimation, and gracefully handles missing metadata with appropriate logging.

249-283: LGTM! Metadata retrieval handles missing files gracefully.
The namespace metadata aggregation correctly sums per-file counts and returns appropriate defaults when metadata is unavailable.

285-327: LGTM! Data listing is efficient and memory-safe.
The implementation correctly fetches and validates JSON arrays from S3, with proper byte copying to prevent retaining large backing arrays.

339-366: LGTM! Write buffering is well-designed with proper validation.
The implementation correctly validates the task ID, converts data types, and buffers documents in memory for later flushing on barrier events.

397-416: LGTM! Batch management is properly synchronized.
The mutex-protected batch operations correctly handle concurrent writes from multiple tasks.

418-496: LGTM! Helper functions are well-implemented.
The utility functions correctly handle S3 key generation, JSON array construction with buffer optimization, and data type conversion with proper validation. Error state management uses appropriate read/write locking.
go.mod (1)
10-10: Consider updating to the latest minor versions of the AWS SDK.
The current versions (aws-sdk-go-v2 v1.40.0 and service/s3 v1.92.1) are stable but slightly outdated. The latest available versions are v1.40.1 and v1.93.0 respectively. Update to ensure access to the latest bug fixes and improvements.
…nt now and uses unsafe calls
Actionable comments posted: 1
🧹 Nitpick comments (5)
connectors/s3/connector.go (5)
68-68: Consider per-namespace mutexes for better concurrency.
While the single `metadataMutex` correctly prevents race conditions (addressing the past review comment), it serializes metadata updates across all namespaces. For workloads with many concurrent tasks updating different namespaces, this global lock can become a bottleneck. Consider using a `sync.Map` to hold per-namespace mutexes:

```diff
 type connector struct {
 	adiomv1connect.UnimplementedConnectorServiceHandler
 	client           *s3.Client
 	settings         ConnectorSettings
 	batchesMutex     sync.Mutex
 	batches          map[taskKey]*storedBatch
-	metadataMutex    sync.Mutex // Serialize metadata updates
+	namespaceMutexes sync.Map // map[string]*sync.Mutex for per-namespace locking
 	errMutex         sync.RWMutex
 	err              error
 }
```

Then in `updateMetadataAfterFlush`, acquire the namespace-specific mutex:

```go
func (c *connector) updateMetadataAfterFlush(ctx context.Context, namespace string, taskID uint, recordCount uint64) error {
	// Get or create mutex for this namespace
	mu, _ := c.namespaceMutexes.LoadOrStore(namespace, &sync.Mutex{})
	namespaceMu := mu.(*sync.Mutex)
	namespaceMu.Lock()
	defer namespaceMu.Unlock()
	// ... rest of the function unchanged
}
```
287-329: Consider memory implications for large files.
The current implementation loads the entire JSON array into memory at once. For very large S3 files (e.g., hundreds of MB or GB), this could cause memory pressure or OOM issues.
If you expect to handle large files, consider:
- Implementing pagination/chunking within the file (using the `NextCursor` field)
- Setting size limits on individual S3 files
- Streaming the JSON array parsing rather than decoding all at once (see the sketch after this list)
- Monitoring memory usage in production for files above a certain size threshold
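As a rough illustration of the streaming option, the sketch below uses only the standard library's `json.Decoder`; it is independent of the connector's actual types and API, so treat it as a shape, not a drop-in change.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// streamJSONArray decodes a JSON array element by element, so only one
// document is held in memory at a time (plus the decoder's read buffer).
func streamJSONArray(r io.Reader, handle func(json.RawMessage) error) error {
	dec := json.NewDecoder(r)
	if _, err := dec.Token(); err != nil { // consume '['
		return err
	}
	for dec.More() {
		var doc json.RawMessage
		if err := dec.Decode(&doc); err != nil {
			return err
		}
		if err := handle(doc); err != nil {
			return err
		}
	}
	_, err := dec.Token() // consume ']'
	return err
}

func main() {
	body := strings.NewReader(`[{"a":1},{"a":2}]`)
	_ = streamJSONArray(body, func(doc json.RawMessage) error {
		fmt.Println(string(doc))
		return nil
	})
}
```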
393-398: Redundant check: batch is already confirmed non-empty.
The check `if len(batch.docs) > 0` is unnecessary because the function already returns early at lines 382-385 if the batch is empty. Apply this diff to remove the redundant check:

```diff
 	// Update metadata file atomically after successful flush
-	if len(batch.docs) > 0 {
-		if err := c.updateMetadataAfterFlush(context.Background(), batch.namespace, taskID, uint64(len(batch.docs))); err != nil {
-			slog.Error("failed to update metadata", "namespace", batch.namespace, "taskId", taskID, "err", err)
-			// Log error but don't fail the barrier - the data was successfully flushed
-		}
+	if err := c.updateMetadataAfterFlush(context.Background(), batch.namespace, taskID, uint64(len(batch.docs))); err != nil {
+		slog.Error("failed to update metadata", "namespace", batch.namespace, "taskId", taskID, "err", err)
+		// Log error but don't fail the barrier - the data was successfully flushed
 	}
```
424-438: Accept a context parameter for better cancellation control.
The function uses `context.Background()` for the S3 PutObject call, which means the flush operation cannot be cancelled or timeout-controlled by the caller. If `OnTaskCompletionBarrierHandler` needs to respect deadlines or cancellation, this will be problematic. Apply this diff to thread the context through:

```diff
-func (c *connector) flushBatch(namespace string, taskID uint, docs [][]byte) error {
+func (c *connector) flushBatch(ctx context.Context, namespace string, taskID uint, docs [][]byte) error {
 	payload := buildJSONArray(docs)
 	key := c.objectKey(namespace, taskID)
-	_, err := c.client.PutObject(context.Background(), &s3.PutObjectInput{
+	_, err := c.client.PutObject(ctx, &s3.PutObjectInput{
```

And update the call site in `OnTaskCompletionBarrierHandler`:

```diff
-	if err := c.flushBatch(batch.namespace, taskID, batch.docs); err != nil {
+	if err := c.flushBatch(context.Background(), batch.namespace, taskID, batch.docs); err != nil {
```

(Or pass a proper context if available from the barrier handler.)
468-480: Inefficient buffer pre-allocation.
`buf.Grow(len(docs) * 2)` estimates only 2 bytes per document, which is far too small for typical JSON documents. This will cause multiple reallocations as the buffer grows. For example, 1000 documents of 100 bytes each would need ~100KB but only 2KB is pre-allocated. Consider a more realistic estimate:

```diff
 func buildJSONArray(docs [][]byte) []byte {
 	var buf bytes.Buffer
-	buf.Grow(len(docs) * 2)
+	// Estimate: sum of document sizes + commas + brackets + 10% overhead
+	totalSize := 2 // '[' and ']'
+	for _, doc := range docs {
+		totalSize += len(doc) + 1 // +1 for comma (or close bracket)
+	}
+	buf.Grow(totalSize)
 	buf.WriteByte('[')
```

Or use a simple heuristic like `len(docs) * 100` if you expect documents to average ~100 bytes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `connectors/s3/connector.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connectors/s3/connector.go (3)
- gen/adiom/v1/adiomv1connect/adiom.connect.go (1): `ConnectorServiceHandler` (199-210)
- gen/adiom/v1/messages.pb.go (2): `DataType_DATA_TYPE_JSON_ID` (30-30), `DataType_DATA_TYPE_MONGO_BSON` (29-29)
- protocol/iface/connector.go (1): `Namespace` (104-107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (5)
connectors/s3/connector.go (5)
1-35: LGTM!
Package declaration, imports, and error constants are well-structured and appropriate for an S3 connector implementation.

74-118: LGTM!
The constructor properly validates required settings, configures the AWS SDK with flexible credential options, and initializes the connector with appropriate defaults.

143-249: LGTM!
The plan generation logic correctly discovers S3 objects, filters by namespace, loads metadata efficiently with caching, and creates appropriate partitions. Error handling with warning logs is appropriate.

504-596: LGTM with concurrency consideration already noted.
The metadata management implementation correctly handles missing files, generates appropriate keys, and performs atomic read-modify-write operations. The global metadata mutex concern was already raised in the earlier comment on line 68.
348-351: TaskID validation is implementation-specific, not protocol-mandated.
The protobuf `WriteDataRequest.TaskId` field uses `uint32` with no `required` constraint, meaning 0 is a valid protocol value. While the S3 connector rejects `taskID == 0`, this validation is not enforced at the proto level. The null connector's `WriteData` method does not perform this check, indicating inconsistent enforcement across connectors. Either clarify in comments why 0 is invalid for this connector, or consider whether the validation is necessary given the protocol allows it.
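If the check stays, one way to make the rationale explicit is to attach it to a named, documented error, as in this standalone sketch (the helper and error are hypothetical, not part of the PR):

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingTaskID documents the connector-specific convention: the proto
// allows task_id == 0 (it is just the uint32 default), but this sink keys
// its in-memory batches by task ID, so an unset ID would silently merge
// unrelated writes into one batch.
var errMissingTaskID = errors.New("task_id is required: the s3 sink buffers documents per task")

func validateTaskID(taskID uint32) error {
	if taskID == 0 {
		return errMissingTaskID
	}
	return nil
}

func main() {
	fmt.Println(validateTaskID(0)) // explains why 0 is rejected
	fmt.Println(validateTaskID(7)) // <nil>
}
```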
Code under review (connectors/s3/connector.go):

```go
func convertToJSON(data []byte, dataType adiomv1.DataType) ([]byte, error) {
	switch dataType {
	case adiomv1.DataType_DATA_TYPE_JSON_ID:
		if !json.Valid(data) {
			return nil, fmt.Errorf("invalid json payload")
		}
		return append([]byte(nil), data...), nil
	case adiomv1.DataType_DATA_TYPE_MONGO_BSON:
		var doc map[string]any
		if err := bson.Unmarshal(data, &doc); err != nil {
			return nil, fmt.Errorf("bson to json: %w", err)
		}
		converted, err := json.Marshal(doc)
		if err != nil {
			return nil, fmt.Errorf("marshal json: %w", err)
		}
		return converted, nil
	default:
		return nil, ErrUnsupportedType
	}
}
```
BSON to JSON conversion loses type fidelity.
For DATA_TYPE_MONGO_BSON, the code unmarshals BSON to map[string]any and then marshals to JSON. This loses BSON-specific type information:
- `ObjectID` → string (hex representation)
- `DateTime` → RFC3339 string
- `Decimal128` → float64 (potential precision loss)
- `Binary` → base64 string
- Other BSON types are similarly transformed
If consumers expect to preserve exact BSON semantics or round-trip data back to MongoDB, this conversion is lossy and could cause issues. However, if the S3 connector's purpose is specifically to produce human-readable JSON for analytics/export, this might be acceptable.
If BSON type fidelity is required, consider using bson.MarshalExtJSON for extended JSON format:
```diff
 case adiomv1.DataType_DATA_TYPE_MONGO_BSON:
-	var doc map[string]any
-	if err := bson.Unmarshal(data, &doc); err != nil {
+	var doc bson.Raw
+	if err := bson.Unmarshal(data, &doc); err != nil {
 		return nil, fmt.Errorf("bson to json: %w", err)
 	}
-	converted, err := json.Marshal(doc)
+	converted, err := bson.MarshalExtJSON(doc, true, false)
 	if err != nil {
 		return nil, fmt.Errorf("marshal json: %w", err)
 	}
 	return converted, nil
```

This preserves BSON types in extended JSON format (e.g., `{"$oid": "..."}` for ObjectID).
If plain JSON is sufficient, document this behavior clearly so users understand the type conversions.
🤖 Prompt for AI Agents
In connectors/s3/connector.go around lines 482 to 502, the current BSON→JSON
path unmarshals BSON into map[string]any then json.Marshal which loses BSON type
fidelity; replace that path to produce Extended JSON by using the bson library's
MarshalExtJSON on the raw BSON (so ObjectID/DateTime/Decimal128/etc. are
preserved as extended JSON tokens) and return that payload, and if you intend to
keep the current plain-JSON behavior instead, update the function comment and
public docs to explicitly state the lossy type conversions so consumers are
aware.
Actionable comments posted: 1
♻️ Duplicate comments (2)
connectors/s3/connector.go (2)
290-345: ListData behavior is reasonable; BSON round-trip goes through JSON and may not preserve exact BSON types.
The method validates the S3 key from the partition cursor, reads a JSON array, and:
- For `DATA_TYPE_JSON_ID`, returns each element as validated JSON bytes.
- For `DATA_TYPE_MONGO_BSON`, unmarshals each JSON element into `map[string]any` and then `bson.Marshal`'s it back to BSON.

This is functionally fine if S3 is primarily a JSON export format. However, the BSON branch does a JSON → generic map → BSON conversion, which can change certain BSON types on round-trip (e.g., extended JSON representations, numeric precision nuances). This was already called out in a previous review; if exact BSON semantics matter, you may want to adopt an Extended JSON strategy instead or clearly document that the S3 connector is JSON-oriented and BSON support is best-effort.
Check the `go.mongodb.org/mongo-driver/bson` documentation for how `bson.Unmarshal` into `map[string]any` followed by `bson.Marshal` treats ObjectID, DateTime, and Decimal128 types when starting from JSON, and whether `MarshalExtJSON` is preferable for preserving BSON semantics.
512-532: convertToJSON validates JSON and handles BSON input; the BSON path is lossy and double-marshals.
`convertToJSON`:
- Validates raw JSON for `DATA_TYPE_JSON_ID` and returns a copy.
- For `DATA_TYPE_MONGO_BSON`, unmarshals BSON into `map[string]any` and re-marshals to JSON, which is straightforward but does a full decode/encode.
- Returns `ErrUnsupportedType` for other data types, aligning with `ListData`.

As with `ListData`, the BSON path may lose some type fidelity relative to raw BSON, and it does a full double marshal; if performance or exact BSON preservation becomes important, you might want to revisit this (e.g., Extended JSON, or deferring conversion until needed).
Look up best practices for converting MongoDB BSON to JSON in Go using `go.mongodb.org/mongo-driver/bson` and whether `MarshalExtJSON` is recommended when round-tripping data between BSON and JSON.
🧹 Nitpick comments (6)
internal/app/options/connectorflags.go (1)
596-652: S3Flags align with ConnectorSettings; flag semantics vs. description could be clarified.
The flags correctly populate all fields on `s3connector.ConnectorSettings` (region, prefix, output-format, profile, endpoint, credentials, path-style, pretty-json). Two small nits:
- The `prefix` flag description says "Override or append…", but the current implementation only overwrites `settings.Prefix`. If append semantics are desired (e.g. connection-string prefix + `--prefix` subdir), you'd need a bit of logic in the S3 `Create` action to combine them (see the sketch after this list).
- `pretty-json` defaults to `true` via `Value: true`, which is good; just ensure this matches your intended default for programmatic callers that bypass the CLI.

Both are behavioral clarifications rather than blockers; the current implementation is otherwise consistent.
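If append semantics are wanted, the combination could be as small as the following standalone sketch; `uriPrefix` and `flagPrefix` are illustrative names, not identifiers from this PR.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// combinePrefixes joins the prefix parsed from the connection string with an
// optional --prefix flag value instead of letting the flag overwrite it.
func combinePrefixes(uriPrefix, flagPrefix string) string {
	uriPrefix = strings.Trim(uriPrefix, "/")
	flagPrefix = strings.Trim(flagPrefix, "/")
	switch {
	case uriPrefix == "":
		return flagPrefix
	case flagPrefix == "":
		return uriPrefix
	default:
		return path.Join(uriPrefix, flagPrefix)
	}
}

func main() {
	fmt.Println(combinePrefixes("exports/2024", "orders")) // exports/2024/orders
	fmt.Println(combinePrefixes("", "orders"))             // orders
}
```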
connectors/s3/connector.go (5)
146-288: GeneratePlan / GetNamespaceMetadata and metadata caching are coherent; watch out for very large buckets.
The plan generation loop:
- Lists objects under the optional configured prefix.
- Filters to `.json` objects while excluding `.metadata.json`.
- Derives namespaces from the key path (falling back to `"default"`), applies optional namespace filters, and looks up per-file counts from a cached `readMetadata` result per namespace.

`GetNamespaceMetadata` reuses the same metadata format and simply sums counts, which is consistent. The only potential concern is scalability: for buckets with very large object counts, doing a full `ListObjectsV2` across the prefix to build the plan could be expensive. If you expect such buckets, consider an optional limit or a naming convention that lets you narrow the listing further.
391-438: Barrier handler and batch bookkeeping correctly avoid empty flushes; consider using a request context for metadata.
`OnTaskCompletionBarrierHandler` cleanly:
- Detaches and discards missing/empty batches (avoiding creation of empty files).
- Flushes non-empty batches to S3, sets a sticky error on failure, and returns the error.
- Updates namespace metadata after a successful flush, logging but not failing the barrier on metadata errors.

`appendBatch`/`detachBatch` under `batchesMutex` ensure per-task batches are consistent.
One small improvement would be to thread a context into `OnTaskCompletionBarrierHandler` (or at least use a shared cancellable context instead of `context.Background()` inside `updateMetadataAfterFlush`), so metadata writes can respect shutdown/timeouts like the flush itself.
484-510: JSON array builder works; the preallocation heuristic is very low and may cause extra allocations.
`buildJSONArray` correctly builds a JSON array from individual documents and optionally pretty-prints each element with `json.Indent`, falling back gracefully and logging when indentation fails. The newline placement also produces valid JSON.
The only nit is `buf.Grow(len(docs) * 2)`, which significantly underestimates the needed capacity for realistic document sizes and will lead to repeated reallocations for large batches. If you care about performance here, consider summing `len(doc)` across `docs` (plus a small overhead) and using that to initialize the buffer capacity.
534-626: Namespace metadata helpers are consistent; the global mutex serializes updates safely but could be narrowed later.
`metadataKey`, `readMetadata`, `writeMetadata`, and `updateMetadataAfterFlush` together implement per-namespace metadata as a JSON map of `filename -> recordCount`:
- `readMetadata` gracefully treats `NoSuchKey` as "no metadata yet" and decodes the JSON map otherwise.
- `writeMetadata` writes the map back as JSON with the appropriate content type.
- `updateMetadataAfterFlush` locks `metadataMutex`, performs a read-modify-write on the metadata for the file just flushed, and uses `objectKey`/`path.Base` to keep keys consistent with `GeneratePlan`.

Using a single `metadataMutex` on the connector is safe and avoids lost updates, matching past review guidance. If you later run into contention with many namespaces flushing concurrently, you could evolve this into a per-namespace lock map, but that's not required for correctness.
37-120: Consider passing a cancellable context to `LoadDefaultConfig` instead of `context.Background()`.
The AWS SDK v2 best practices recommend passing a caller-owned context (rather than `context.Background()`) to `LoadDefaultConfig` so the config loading can be cancelled or respect process-wide timeouts. This enables graceful shutdown and timeout handling during initialization. If constructor timeouts are not a concern, `context.Background()` is acceptable, but using a cancellable context (or one with a timeout) is preferred for production deployments with strict shutdown requirements.
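A minimal sketch of that alternative using the public AWS SDK v2 APIs; the timeout value and region are placeholders, and this is a shape suggestion rather than the PR's actual constructor.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	// Bound credential/endpoint resolution instead of using context.Background().
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
	if err != nil {
		log.Fatalf("load aws config: %v", err)
	}
	_ = s3.NewFromConfig(cfg) // hand the client to the connector as usual
}
```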
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `connectors/s3/connector.go` (1 hunks)
- `internal/app/options/connectorflags.go` (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
connectors/s3/connector.go (3)
- connectors/common/base.go (1): `ConnectorSettings` (49-58)
- gen/adiom/v1/adiomv1connect/adiom.connect.go (1): `ConnectorServiceHandler` (199-210)
- gen/adiom/v1/messages.pb.go (2): `DataType_DATA_TYPE_JSON_ID` (30-30), `DataType_DATA_TYPE_MONGO_BSON` (29-29)
internal/app/options/connectorflags.go (1)
- connectors/s3/connector.go (2): `ConnectorSettings` (38-50), `NewConn` (76-120)
🔇 Additional comments (6)
internal/app/options/connectorflags.go (1)
548-564: Connection string parser is robust and case-insensitive on the scheme.
The `parseS3ConnectionString` helper correctly validates the `s3://` prefix case-insensitively, enforces a non-empty bucket, and cleanly splits an optional key prefix. Behavior for inputs like `s3://bucket`, `s3://bucket/prefix`, and `s3://bucket/` is well-defined.
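For reference, a standalone sketch that mirrors the documented parsing rules (case-insensitive `s3://` scheme, required bucket, optional prefix via `SplitN`); it is an approximation of the behavior, not the PR's `parseS3ConnectionString` verbatim.

```go
package main

import (
	"fmt"
	"strings"
)

// parseS3URI mirrors the documented behavior: case-insensitive "s3://" scheme,
// a required bucket, and an optional key prefix after the first "/".
func parseS3URI(uri string) (bucket, prefix string, err error) {
	const scheme = "s3://"
	if len(uri) < len(scheme) || !strings.EqualFold(uri[:len(scheme)], scheme) {
		return "", "", fmt.Errorf("invalid s3 connection string: %q", uri)
	}
	rest := uri[len(scheme):]
	parts := strings.SplitN(rest, "/", 2)
	if parts[0] == "" {
		return "", "", fmt.Errorf("missing bucket in %q", uri)
	}
	if len(parts) == 2 {
		prefix = parts[1]
	}
	return parts[0], prefix, nil
}

func main() {
	fmt.Println(parseS3URI("S3://my-bucket/exports/2024")) // my-bucket exports/2024 <nil>
	fmt.Println(parseS3URI("s3://my-bucket"))              // my-bucket  <nil>
}
```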
connectors/s3/connector.go (5)
122-144: Capabilities advertised in GetInfo match the implemented behavior.
`GetInfo` correctly reports `DbType` "s3" and advertises source/sink support for `DATA_TYPE_JSON_ID` and `DATA_TYPE_MONGO_BSON`, which aligns with `WriteData`/`convertToJSON` and the `ListData` branches. LSN streaming and updates are marked unsupported as expected.
357-385: WriteData batching and error short-circuiting look good; taskId validation is strict.
`WriteData`:
- Short-circuits on `currentError()` to avoid writing after a fatal failure.
- Requires a nonzero `TaskId` and returns `InvalidArgument` otherwise.
- Converts incoming docs to JSON via `convertToJSON` and appends them to a per-task batch.
- No-ops gracefully when `data` is empty.

This matches the per-task buffering design and keeps the sink robust in the presence of prior errors. The only minor trade-off is that `convertToJSON` failures are surfaced as `InvalidArgument` (client error), which is appropriate given malformed payloads.
440-468: flushBatch and objectKey are consistent with GeneratePlan and metadataKey.
`flushBatch` writes a single JSON array per task under the key from `objectKey`, with the appropriate `ContentType`. `objectKey` and `metadataKey` both normalize namespaces by trimming slashes and turning dots into path separators, using `"default"` when the namespace is empty, and prepending the optional connector prefix.
This keeps object layout and metadata filenames consistent across write, planning, and metadata reads.
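A small self-contained sketch of the key shaping described above; the normalization rules (trim slashes, dots to path separators, `"default"` fallback, optional prefix) come from this comment, while the `task-N.json` file-name pattern is only assumed from the metadata examples earlier in the thread.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// objectKeyFor illustrates the normalization: trim slashes, map an empty
// namespace to "default", turn dots into path separators, and prepend the
// optional connector prefix. The "task-N.json" name is illustrative.
func objectKeyFor(prefix, namespace string, taskID uint) string {
	ns := strings.Trim(namespace, "/")
	if ns == "" {
		ns = "default"
	}
	ns = strings.ReplaceAll(ns, ".", "/")
	return path.Join(strings.Trim(prefix, "/"), ns, fmt.Sprintf("task-%d.json", taskID))
}

func main() {
	fmt.Println(objectKeyFor("exports", "shop.orders", 7)) // exports/shop/orders/task-7.json
	fmt.Println(objectKeyFor("", "", 1))                   // default/task-1.json
}
```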
470-483: Sticky error handling is simple and effective.
The `currentError`/`setError` pair with an RWMutex ensures that once a fatal error is recorded, subsequent writes are promptly rejected without racing. Only the first error is preserved, which is typically sufficient for debugging.
347-355: Remove this concern: `errors.ErrUnsupported` is available in Go 1.24.
The project requires Go 1.24 (as specified in `go.mod`), which is well beyond Go 1.21, where `errors.ErrUnsupported` was added to the standard library. The code is correct and requires no changes.
Likely an incorrect or invalid review comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
internal/app/options/connectorflags.go (1)
300-316: S3 connector registration looks correct, but the help-by-name invocation issue remains.
The past review comment about help invocation (e.g., `dsync S3 --help`) causing an "invalid s3 connection string" error is still applicable. While the urfave/cli library typically handles `--help` before executing the action, the current implementation sets `settings.Uri = args[0]` before calling `CreateHelper`, which could cause issues if the action is invoked.
🧹 Nitpick comments (1)
connectors/s3/connector.go (1)
316-371: ListData correctly implements data retrieval.
The implementation properly reads JSON arrays from S3 and converts them based on the requested data type. For `DATA_TYPE_MONGO_BSON`, the JSON-to-BSON conversion (lines 352-361) reconstructs BSON documents from the JSON stored in S3.
Note that BSON-specific types (ObjectID, DateTime, Decimal128, etc.) are stored as their JSON representations, not as extended JSON with type markers. This means round-tripping BSON → JSON (in WriteData) → BSON (in ListData) loses type fidelity. If this is intentional for an analytics/export use case, consider documenting this behavior in the package comment or GetInfo response.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `connectors/s3/connector.go` (1 hunks)
- `internal/app/options/connectorflags.go` (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
connectors/s3/connector.go (3)
- gen/adiom/v1/adiomv1connect/adiom.connect.go (1): `ConnectorServiceHandler` (199-210)
- gen/adiom/v1/messages.pb.go (2): `DataType_DATA_TYPE_JSON_ID` (30-30), `DataType_DATA_TYPE_MONGO_BSON` (29-29)
- protocol/iface/connector.go (1): `Namespace` (104-107)
internal/app/options/connectorflags.go (1)
- connectors/s3/connector.go (2): `ConnectorSettings` (38-51), `NewConn` (95-146)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (5)
internal/app/options/connectorflags.go (1)
573-629: S3 flags are well-defined.
The flag definitions follow the established pattern with altsrc wrappers for configuration file support and proper destination binding. The required `region` flag and configurable options for AWS credentials, endpoint, and path-style addressing are appropriate for S3 integration.
Note: The `prefix` flag description states it will "Override or append to the key prefix derived from the connection string"; verify that this behavior is correctly implemented in the connector (see the connector.go review).
connectors/s3/connector.go (4)
1-74: Package structure and types are well-designed.
The connector struct includes proper synchronization primitives (`batchesMutex` for per-task batch management and `metadataMutex` for metadata updates), and the `ConnectorSettings` provides comprehensive configuration options for AWS S3 integration.
383-443: WriteData and barrier handler are correctly implemented.
The sink implementation properly:
- Buffers data per-task until barrier is received
- Validates taskID is non-zero (defensive check)
- Returns early when batch is nil or empty (lines 420-427), preventing empty file creation
- Handles metadata update failures gracefully without failing the barrier (lines 436-439), since data was successfully flushed
445-508: Batch management and helper functions are correct.
The thread-safe batch operations, S3 key generation, and error tracking are all properly implemented. The namespace path transformation (line 483, converting dots to slashes) provides reasonable hierarchical organization in S3.

510-652: JSON construction and metadata operations are well-implemented.
The `buildJSONArray` function includes good error handling with a fallback when pretty-printing fails (lines 524-527). Metadata operations are properly synchronized with `metadataMutex` (line 625), addressing the race condition concern from previous reviews. The `NoSuchKey` error handling (lines 584-590) correctly treats missing metadata files as expected for newly created namespaces.
Actionable comments posted: 2
🧹 Nitpick comments (2)
connectors/s3/connector.go (2)
440-445: Optional: Remove redundant check.
The condition `if len(batch.docs) > 0` at line 440 is always true because lines 429-432 already return early when `batch.docs` is empty. This check can be safely removed.

```diff
 	}
 	// Update metadata file atomically after successful flush
-	if len(batch.docs) > 0 {
-		if err := c.updateMetadataAfterFlush(context.Background(), batch.namespace, taskID, uint64(len(batch.docs))); err != nil {
-			slog.Error("failed to update metadata", "namespace", batch.namespace, "taskId", taskID, "err", err)
-			// Log error but don't fail the barrier - the data was successfully flushed
-		}
+	if err := c.updateMetadataAfterFlush(context.Background(), batch.namespace, taskID, uint64(len(batch.docs))); err != nil {
+		slog.Error("failed to update metadata", "namespace", batch.namespace, "taskId", taskID, "err", err)
+		// Log error but don't fail the barrier - the data was successfully flushed
 	}
```
627-657: Recommended: Consider a per-namespace mutex for better concurrency.
The global `metadataMutex` (line 630) serializes metadata updates across all namespaces. With many concurrent namespaces, this could become a bottleneck since updates to different namespaces don't actually conflict. Consider using per-namespace locking for better concurrency:

```diff
 type connector struct {
 	adiomv1connect.UnimplementedConnectorServiceHandler
 	client         *s3.Client
 	settings       ConnectorSettings
 	batchesMutex   sync.Mutex
 	batches        map[taskKey]*storedBatch
-	metadataMutex  sync.Mutex // Serialize metadata updates
+	namespaceLocks sync.Map // map[string]*sync.Mutex for per-namespace locking
 	errMutex       sync.RWMutex
 	err            error
 }
```

Then in `updateMetadataAfterFlush`:

```diff
 func (c *connector) updateMetadataAfterFlush(ctx context.Context, namespace string, taskID uint, recordCount uint64) error {
-	c.metadataMutex.Lock()
-	defer c.metadataMutex.Unlock()
+	// Get or create mutex for this namespace
+	lockVal, _ := c.namespaceLocks.LoadOrStore(namespace, &sync.Mutex{})
+	mu := lockVal.(*sync.Mutex)
+	mu.Lock()
+	defer mu.Unlock()
```

This allows concurrent metadata updates for different namespaces while still preventing races within the same namespace.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `connectors/s3/connector.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connectors/s3/connector.go (3)
- gen/adiom/v1/adiomv1connect/adiom.connect.go (1): `ConnectorServiceHandler` (199-210)
- gen/adiom/v1/messages.pb.go (2): `DataType_DATA_TYPE_JSON_ID` (30-30), `DataType_DATA_TYPE_MONGO_BSON` (29-29)
- protocol/iface/connector.go (1): `Namespace` (104-107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (4)
connectors/s3/connector.go (4)
94-153: LGTM: Constructor properly handles configuration and AWS client setup.
The prefix combination logic (lines 101-108) correctly addresses the previous review concern by merging URI and flag-provided prefixes. AWS SDK v2 configuration with credentials, profile, and endpoint overrides is correctly implemented.

179-285: LGTM: Efficient plan generation with metadata caching.
The namespace metadata caching (line 208) prevents redundant S3 reads during plan generation. Namespace filtering and estimated count extraction from metadata are correctly implemented.

287-319: LGTM: Consistent namespace handling with graceful error fallback.
The empty namespace defaulting to "default" (lines 290-292) is now consistent with the key generation helpers, addressing the previous review concern. Graceful error handling with warnings is appropriate for missing metadata.

321-376: LGTM: ListData correctly handles JSON source with type conversion.
The implementation properly reads JSON arrays from S3 and converts to the requested output format (JSON passthrough or JSON-to-BSON conversion). JSON validation ensures data integrity.
Code under review (connectors/s3/connector.go):

```go
func (c *connector) appendBatch(namespace string, taskID uint, docs [][]byte) {
	c.batchesMutex.Lock()
	defer c.batchesMutex.Unlock()
	key := taskKey{taskID}
	batch, ok := c.batches[key]
	if !ok {
		batch = &storedBatch{namespace: namespace}
		c.batches[key] = batch
	}
	batch.docs = append(batch.docs, docs...)
}
```
Major issue: Namespace mismatch can cause data to be written to the wrong namespace.
The appendBatch function keys batches solely by taskID (line 453), but stores a single namespace per batch (line 456). If WriteData is called multiple times with the same taskID but different namespaces, the subsequent calls will append documents to the batch created by the first call, effectively ignoring the namespace parameter and causing all documents to be flushed to the first namespace.
Scenario:
WriteData(namespace="orders", taskID=1, docs=[doc1])→ creates batch with namespace="orders"WriteData(namespace="customers", taskID=1, docs=[doc2])→ appends to existing batch, ignores "customers" namespace- Barrier for task 1 → flushes both doc1 and doc2 to "orders" namespace
Fix: Either validate namespace consistency or make the batch key include both taskID and namespace.
Solution 1: Validate namespace consistency (simpler):
```diff
 func (c *connector) appendBatch(namespace string, taskID uint, docs [][]byte) {
 	c.batchesMutex.Lock()
 	defer c.batchesMutex.Unlock()
 	key := taskKey{taskID}
 	batch, ok := c.batches[key]
 	if !ok {
 		batch = &storedBatch{namespace: namespace}
 		c.batches[key] = batch
+	} else if batch.namespace != namespace {
+		// This should never happen if task routing is correct
+		slog.Error("namespace mismatch for taskID", "taskID", taskID, "expected", batch.namespace, "got", namespace)
+		return
 	}
 	batch.docs = append(batch.docs, docs...)
 }
```

Solution 2: Use a composite key (more robust if multi-namespace tasks are valid):
Update taskKey to include namespace:
```diff
 type taskKey struct {
-	taskID uint
+	taskID    uint
+	namespace string
 }
```

Then update all usages accordingly.
🤖 Prompt for AI Agents
connectors/s3/connector.go around lines 450-460: appendBatch currently keys
batches only by taskID causing namespace mixing; fix by making the batch key
composite — add a namespace string field to taskKey and use taskKey{taskID,
namespace} when looking up/creating batches (update the taskKey type declaration
and all places that construct or use it), or if you prefer the simpler option,
validate that when a batch exists its namespace equals the passed namespace and
return/log/error on mismatch; ensure all map lookups/assignments and any
function signatures that used taskKey are updated accordingly.
Code under review (connectors/s3/connector.go):

```go
func buildJSONArray(docs [][]byte, prettyJSON bool) []byte {
	var buf bytes.Buffer
	buf.Grow(len(docs) * 2)
	buf.WriteByte('[')
	for i, doc := range docs {
		if i > 0 {
			buf.WriteByte(',')
		}
		if prettyJSON {
			var prettyBuf bytes.Buffer
			prettyBuf.Grow(len(doc) + len(doc)/10) // rough estimate
			if err := json.Indent(&prettyBuf, doc, "", " "); err == nil {
				buf.WriteByte('\n')
				doc = prettyBuf.Bytes()
			} else {
				// If indenting fails, fall back to original
				slog.Warn("Failed JSON indentation. Falling back to no-indent")
			}
		}
		buf.Write(doc)
	}
	if prettyJSON {
		buf.WriteByte('\n')
	}
	buf.WriteByte(']')
	return buf.Bytes()
}
```
🛠️ Refactor suggestion | 🟠 Major
Performance issue: Buffer pre-allocation is far too small.
Line 517 allocates only len(docs) * 2 bytes for a buffer that needs to hold the entire JSON array. For typical documents (hundreds to thousands of bytes each), this is orders of magnitude too small and will cause multiple reallocations as the buffer grows.
Example: 100 documents of 500 bytes each require ~50KB, but Grow(200) only allocates 200 bytes.
Apply this diff to properly size the buffer:
```diff
 func buildJSONArray(docs [][]byte, prettyJSON bool) []byte {
 	var buf bytes.Buffer
-	buf.Grow(len(docs) * 2)
+	// Estimate total size: sum of doc lengths + array overhead
+	estimatedSize := 2 // '[' and ']'
+	for _, doc := range docs {
+		estimatedSize += len(doc) + 1 // doc + comma
+	}
+	if prettyJSON {
+		estimatedSize += len(docs) * 3 // rough estimate for newlines and indentation
+	}
+	buf.Grow(estimatedSize)
 	buf.WriteByte('[')
```
🤖 Prompt for AI Agents
In connectors/s3/connector.go around lines 515 to 541, the buffer is currently
pre-allocated with buf.Grow(len(docs) * 2) which is far too small and causes
many reallocations; replace this with a proper estimate: compute total := 2 (for
'[' and ']') + max(0, len(docs)-1) (for commas) + sum(len(doc) for each doc) and
if prettyJSON add a small growth allowance (e.g. sum(len(doc)/10) or a fixed
extra per-doc) to account for indentation/newlines, then call buf.Grow(total)
before writing into the buffer so it is sized appropriately.
Summary by CodeRabbit
New Features
Chores