
Conversation

Collaborator

@terwey terwey commented Nov 1, 2025

Summary

| Area | Highlights |
| --- | --- |
| Storage | Introduce `(venue_id, wallet, order_id)` as the primary identifier for Hyperliquid submissions/status; regenerate sqlc models/queries; migrate and wrap wallet/venue updates in transactions; persist payload metadata for scaled-order audits; expose venue assignment tables. |
| Engine/Emitter | Replace the single Emitter with a QueueEmitter that routes by OrderIdentifier; enforce identifier presence and venue registration; replay creates for venues missing submissions; fan out modify/cancel per venue; add identifier checks and clearer errors. |
| Fill tracker | Track orders by OrderIdentifier (not just CLOID); adjust snapshots, reconciliation, and SSE emission accordingly. |
| Websocket/Refresher | Tag WS updates with venue/wallet; StatusRefresher queries per venue via a registry; record statuses using the full identifier. |
| API/OpenAPI | New endpoints to list/upsert/delete venues and manage bot↔venue assignments; stream/log payloads now include the identifier (venue_id + wallet); CancelOrderByOrderId accepts an optional venue_id; order records/logs surface venue-scoped state. |
| CLI/main | Ensure a default venue/wallet; build the base identifier; register venue emitters with the dispatcher; pass venue/wallet through to WS and workers. |
| Docs/Chore | Update AGENTS.md; revise the multi-venue design doc; guard generated code; fix and extend tests. |
| Fixes | Wrap default wallet/primary assignment in transactions; align Hyperliquid data with venues; restore scaled-order audits and queries. |

BREAKING CHANGES

  • Hyperliquid persistence and APIs are venue-scoped. Any code assuming a single wallet per order must now pass and handle OrderIdentifier { venue_id, wallet, order_id } (see the sketch after this list).
  • SSE/log/API responses include identifiers.venue_id and identifiers.wallet; hyperliquid.identifier added to order state.
  • Storage schema changes require migration; queries that joined on order_id alone must join on (venue_id, wallet, order_id).
  • Cancel operations may need a venue_id filter when multiple submissions exist for the same order.
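For orientation when updating callers, here is a minimal sketch of the identifier shape implied above; the field names mirror the JSON keys (venue_id, wallet, order_id), and the concrete Go types in the repository may differ.

```go
// OrderIdentifier scopes an order to a specific venue and wallet.
// Sketch only: the real definition lives in the recomma codebase and
// the OrderID type there may not be a plain string.
type OrderIdentifier struct {
	VenueID string `json:"venue_id"`
	Wallet  string `json:"wallet"`
	OrderID string `json:"order_id"`
}
```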

Migration notes

  • Run regenerated sqlc + schema migrations before deploy.
  • Use EnsureDefaultVenueWallet to seed the default Hyperliquid venue/wallet.
  • When emitting work, always populate OrderWork.Identifier; the dispatcher will reject items without it (see the sketch after this list).
  • If you target multiple venues per bot, upsert assignments via the new /api/venues/* endpoints, then restart to build WS/status clients per venue.
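As referenced above, a hedged sketch of populating the identifier before handing work to the dispatcher; OrderWork's other fields and the dispatcher API shown here are assumptions for illustration.

```go
// Build the base identifier once at startup (after EnsureDefaultVenueWallet),
// then attach it to every piece of work handed to the dispatcher.
work := OrderWork{
	Identifier: OrderIdentifier{
		VenueID: defaultVenueID, // e.g. the seeded default Hyperliquid venue
		Wallet:  primaryWallet,
		OrderID: orderID,
	},
	// ... action-specific payload (create/modify/cancel) elided ...
}
if err := dispatcher.Enqueue(ctx, work); err != nil {
	// The dispatcher rejects items whose Identifier is missing.
	logger.Error("enqueue rejected", slog.String("error", err.Error()))
}
```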

Review

Read specs/multi_venue_emission.adoc

@terwey terwey force-pushed the codex/investigate-multi-wallet-support-for-hyperliquid branch from 752edc0 to f628030 on November 2, 2025 16:59
terwey added 27 commits November 2, 2025 21:49
…queries-for-venues

`feat(storage): add venue-aware hyperliquid persistence`

**Summary**
* Added venue registry and bot assignment tables while rebuilding Hyperliquid submission/status and scaled order tables around `(venue_id, wallet, order_id)` identifiers and payload metadata so upgrades apply the new schema automatically. [storage/sqlc/schema.sqlL49-L199](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlc/schema.sql#L49-L199)
* Regenerated sqlc queries and models to accept venue-scoped arguments, emit payload metadata, and return composite scaled-order rows for audits and deal views. [storage/sqlc/queries.sqlL527-L888](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlc/queries.sql#L527-L888) [storage/sqlcgen/models.goL11-L139](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlcgen/models.go#L11-L139)
* Updated storage logic to populate default venue and wallet identifiers, persist typed payload blobs for Hyperliquid submissions/statuses, and translate the new scaled-order results for streaming and API responses. [storage/storage.goL328-L607](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/storage.go#L328-L607) [storage/order_scalers.goL330-L648](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/order_scalers.go#L330-L648)
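For readers skimming the regenerated code, a rough sketch of the venue-scoped argument shape these queries now take; the struct and method names below are hypothetical stand-ins for the generated sqlcgen code linked above.

```go
// Hypothetical upsert against the rebuilt (venue_id, wallet, order_id) key.
params := sqlcgen.UpsertHyperliquidSubmissionParams{
	VenueID: venueID,     // composite key, part 1
	Wallet:  wallet,      // composite key, part 2
	OrderID: orderID,     // composite key, part 3
	Payload: payloadJSON, // typed payload metadata for audits
}
if err := queries.UpsertHyperliquidSubmission(ctx, params); err != nil {
	return fmt.Errorf("upsert hyperliquid submission: %w", err)
}
```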
…upport-for-hyperliquid' into codex/refactor-storage-for-venue-aware-identifiers
…aware-identifiers

feat(storage): add venue-aware identifiers
merge main for generated code guard
chore: ensure submodule
…ue' into codex/extend-orderwork-with-identifier-and-refactor-queue-ehfv2o
…ifier-and-refactor-queue-ehfv2o

fix: replay missing venue submissions
claude and others added 25 commits November 12, 2025 00:34
…011CV2b6CpvF4vQWJtSJiSwW

Resolved merge conflicts by combining:
- New SDK API style from spec branch (WithPlanTier)
- Rate limiter wrapper implementation
- Multi-venue support from spec branch
- Comprehensive test suites from both branches

All conflicts resolved while preserving functionality from both branches.
Cleaned up remaining conflict marker from spec/tc-rate-limit merge.
- Remove tc.Bot.Name (the field is not used in tests)
- Change tc.Deal.Id from int64 to int to match SDK types
- Align with type usage patterns from other tests in codebase
Previously, when the rate limit window reset (every 60 seconds), waiting
workflows in the queue were not notified. This caused them to remain stuck
in the queue indefinitely, even though capacity was now available.

The bug manifested as:
- Deal workflows would reserve and get queued
- produce:all-bots would consume all 5 slots and release
- Window would reset after 60 seconds
- produce:all-bots would reserve again immediately
- Deal workflows remained stuck in queue forever

Fix: Call tryGrantWaiting() after resetting the window in resetWindowIfNeeded()
to wake up and grant reservations to queued workflows that now have capacity.

This ensures fair FIFO processing and prevents workflow starvation.
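A minimal sketch of the fix described above, with illustrative names (the real limiter lives in the project's ratelimit package and may differ):

```go
// resetWindowIfNeeded rolls the rate-limit window forward and, per this fix,
// wakes queued workflows so the freed capacity is granted in FIFO order.
func (l *Limiter) resetWindowIfNeeded(now time.Time) {
	if now.Sub(l.windowStart) < l.window {
		return
	}
	l.windowStart = now
	l.consumed = 0
	// Previously missing: without this call, queued workflows stayed blocked
	// even though the new window had capacity available.
	l.tryGrantWaiting()
}
```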
fix: hyperliquid wire format expects 8 decimals for price and size
This refactor addresses the critical design flaw where waiting workflows
could not use freed capacity from AdjustDown/SignalComplete calls.

**Problem:**
Previously, tryGrantWaiting() would return immediately if activeReservation
was non-nil, defeating the entire "early release" pattern:
- produce:all-bots reserves 5 slots
- produce:all-bots calls AdjustDown(2), freeing 3 slots
- But tryGrantWaiting() couldn't grant to waiting deal workflows
- All workflows were serialized, no concurrency

**Solution:**
- Changed from single `activeReservation *reservation` to multiple
  `activeReservations map[string]*reservation`
- tryGrantWaiting() now grants to waiting workflows whenever there's
  capacity, regardless of existing active reservations
- Added calculateTotalReserved() to sum slots across all reservations
- Updated Reserve/Consume/AdjustDown/Extend/SignalComplete/Release
  to work with the map

**Result:**
When produce:all-bots adjusts down from 5→2 slots, the freed 3 slots
are immediately available for deal workflows. Multiple workflows can
now run concurrently, enabling true "early release" behavior per spec.

Closes the issue raised in code review regarding workflow serialization.
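A compact sketch of the bookkeeping this refactor describes, simplified from the real package (locking is omitted; assume callers hold the limiter's mutex):

```go
// reservation tracks one workflow's slots (illustrative shape).
type reservation struct {
	workflowID    string
	slotsReserved int
	slotsConsumed int
	grant         chan struct{} // closed when the reservation is granted
}

// Limiter now holds multiple concurrent reservations keyed by workflow ID.
type Limiter struct {
	limit              int
	consumed           int // slots consumed in the current window
	activeReservations map[string]*reservation
	waiting            []*reservation // FIFO queue of pending reservations
}

// calculateTotalReserved sums reserved slots across all active reservations.
func (l *Limiter) calculateTotalReserved() int {
	total := 0
	for _, r := range l.activeReservations {
		total += r.slotsReserved
	}
	return total
}

// tryGrantWaiting grants queued reservations whenever capacity allows,
// regardless of how many reservations are already active.
func (l *Limiter) tryGrantWaiting() {
	for len(l.waiting) > 0 {
		next := l.waiting[0]
		if l.consumed+l.calculateTotalReserved()+next.slotsReserved > l.limit {
			return // head of queue does not fit yet; preserve FIFO order
		}
		l.waiting = l.waiting[1:]
		l.activeReservations[next.workflowID] = next
		close(next.grant)
	}
}
```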
Line 337 was trying to redeclare totalReserved with := but it was already
declared at line 280 in the function scope. Changed to = for reassignment.
The Extend method was incorrectly using the Reserve capacity formula,
which prevented reservations from growing beyond the per-window limit.

**The Contract:**
Reservations can span multiple windows. The rate limiter enforces that
*consumption per window* doesn't exceed the limit, not that *total
reservations* can't exceed the limit.

**Old (incorrect) formula:**
  if l.consumed + totalReserved - res.slotsReserved + newReservation <= l.limit

This prevented extending a reservation beyond the window limit, even
when the additional consumption would span multiple windows.

**New (correct) formula:**
  if l.consumed + newReservation - res.slotsConsumed <= l.limit

This checks: "Can the additional slots I need (beyond what I've already
consumed) fit in the current window's remaining capacity?"

**Example (limit=10):**
- Reserve 8 slots, consume all 8 in window 1
- Window resets: consumed=0, slotsConsumed=8 (persists with reservation)
- Extend by 5 (total 13): Check 0 + 13 - 8 = 5 <= 10 ✓
- The 5 additional slots fit in the new window

This allows workflows to have large total reservations that span
multiple windows, while still enforcing per-window rate limits.

Fixes TestLimiter_ExtendRequiresWindowReset
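Expressed as code, the corrected check might read as follows (illustrative names, matching the formula above):

```go
// canExtend reports whether growing this reservation to newReservation total
// slots fits the current window: only the slots not yet consumed by the
// reservation count against the window, which is what lets a reservation
// span multiple windows.
func (l *Limiter) canExtend(res *reservation, newReservation int) bool {
	return l.consumed+newReservation-res.slotsConsumed <= l.limit
}
```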
Updated rate_limit.adoc to accurately describe the implementation:

**Core Principle Changes:**
- Changed from "single active reservation" to "multiple concurrent reservations"
- Clarified goal: guarantee sequential execution per workflow, not global serialization
- Added "Sequential Execution Guarantee" mechanism

**Key Updates:**
1. Principle (line 92): Now describes coordination vs serialization
2. Key Mechanisms (lines 110-126):
   - Multiple Concurrent Reservations (replaces Single Active Reservation)
   - Added Sequential Execution Guarantee
   - Added Cross-Window Reservations mechanism
3. Core State (lines 139-145): activeReservation → activeReservations map
4. Reserve operation (lines 157-162): Updated capacity check formula and behavior
5. AdjustDown (lines 180-182): Clarifies enables early release for concurrent workflows
6. Extend (lines 188-192): Documents cross-window formula and rationale
7. Release (lines 208-210): Multiple workflows may be granted
8. Window Reset (lines 218-221): slotsConsumed persists, immediate re-evaluation

**Open Questions Answered:**
- Question 3: Documented capacity formula with rationale
- Question 4: Documented Extend cross-window formula with example

**Example Updates:**
- Line 496: Clarified both workflows have concurrent reservations

The spec now accurately reflects that workflows get sequential execution
guarantees (preventing thundering herd), while multiple workflows can
run concurrently when capacity allows (via early release pattern).
…1CV2b6CpvF4vQWJtSJiSwW

feat: implement ThreeCommas API rate limiting with workflow reservation system
 - internal/api/system_stream.go (lines 36-174) adds per-subscriber locking plus trySend/close helpers so channel closes can’t race with Publish/history sends; cancellation now removes a subscriber under the controller lock and closes it safely afterward.
 - internal/api/system_stream.go (lines 188-254) reworks Publish/Flush to fan out via the guarded send path, skipping already-closed subscribers and logging only when buffers are full.

Co-authored-by: Codex <[email protected]>
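A hedged sketch of the guarded send/close pattern these changes describe; the type and field names below are assumptions, and the real code is in internal/api/system_stream.go.

```go
import "sync"

type Event struct{ /* payload elided */ }

type subscriber struct {
	mu     sync.Mutex
	ch     chan Event
	closed bool
}

// trySend delivers an event unless the subscriber is already closed or its
// buffer is full; the per-subscriber lock prevents a send racing a close.
func (s *subscriber) trySend(ev Event) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return false
	}
	select {
	case s.ch <- ev:
		return true
	default:
		return false // buffer full; the publisher logs and moves on
	}
}

// close marks the subscriber closed and closes its channel exactly once.
func (s *subscriber) close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.ch)
	}
}
```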
This commit documents the investigation of the deadlock bug in the rate
limiter where queued reservations never wake when the time window resets
if no other limiter operations occur.

Key findings:
- Queued workflows block on a channel that's only closed by
  tryGrantWaiting(), which is only called during other limiter operations
- If no active reservations exist and no new operations occur, queued
  workflows hang indefinitely even after the window resets
- The existing TestLimiter_WindowReset actually demonstrates the bug by
  requiring a third workflow to trigger reset detection

Changes:
- Corrected TestLimiter_WindowReset to test expected behavior (auto-wake)
- Added TestLimiter_QueuedReservationDeadlock to explicitly reproduce bug
- Created BUG_ANALYSIS.md with detailed root cause analysis and solutions

Both corrected tests will FAIL with current implementation, proving the
bug exists. They define the expected contract that queued workflows
should automatically wake when the window resets.

Recommended solution: Add background ticker or scheduled wake-up to
detect window resets even when no other operations occur.
Fixes the deadlock bug where queued reservations never wake when the
rate limit window resets if no other limiter operations occur.

Root Cause:
Queued workflows blocked on a channel that was only closed by
tryGrantWaiting(), which was only called during other limiter operations
(Reserve, Release, Consume, etc.). When no operations occurred, the
window could reset but nothing detected it, causing indefinite hangs.

Solution:
Implemented a background ticker that periodically checks for window
resets every window/10 duration (min 100ms). The ticker only runs
resetWindowIfNeeded() when there are queued workflows, ensuring
minimal overhead.

Changes:
1. Updated all godoc comments to explicitly document auto-wake behavior
   - Reserve(): Now clearly states queued workflows auto-wake on reset
   - Other operations: Document their role in waking queued workflows

2. Added background ticker infrastructure:
   - ticker: Runs every window/10 to detect resets
   - done: Channel for graceful shutdown
   - windowResetWatcher(): Background goroutine
   - Stop(): Cleanup method for tests

3. Updated NewLimiter() to start background watcher

The fix ensures the limiter matches its specification, which already
documented that "waiting workflows are immediately re-evaluated and
granted if capacity now available" when windows reset.

Tests:
- TestLimiter_WindowReset: Now correctly tests auto-wake (previously
  required 3rd workflow to trigger reset detection)
- TestLimiter_QueuedReservationDeadlock: Explicitly tests the bug
  scenario (burst exhausts quota, single caller queues, no other ops)

Both tests will now PASS with the background ticker implementation.
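A simplified sketch of the watcher loop described in this commit; the interval math follows the message above, while the field names (mu, done, waiting) are assumptions:

```go
// windowResetWatcher periodically checks for window resets so queued
// workflows auto-wake even when no other limiter operations occur.
func (l *Limiter) windowResetWatcher() {
	interval := l.window / 10
	if interval < 100*time.Millisecond {
		interval = 100 * time.Millisecond
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-l.done: // closed by Stop()
			return
		case <-ticker.C:
			l.mu.Lock()
			// Only do work when someone is actually queued, keeping overhead minimal.
			if len(l.waiting) > 0 {
				l.resetWindowIfNeeded(time.Now())
			}
			l.mu.Unlock()
		}
	}
}
```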
The background ticker goroutine needs to be stopped after each test
to prevent synctest from detecting goroutine leaks. This adds
defer l.Stop() after every NewLimiter() call in tests.

This fixes TestLimiter_FIFOQueue and ensures all other tests properly
clean up resources.
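The resulting test pattern, roughly (the NewLimiter signature here is illustrative):

```go
func TestLimiter_FIFOQueue(t *testing.T) {
	l := NewLimiter(5, time.Minute)
	// Stop the background watcher so synctest does not flag a leaked goroutine.
	defer l.Stop()
	// ... test body ...
}
```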
Removing investigation notes as they're not needed in the PR.
The fix is well-documented in commit messages and godoc comments.
Documents the lessons learned from the rate limiter deadlock bug:

1. Tests must validate contracts (spec + godoc), not implementation
2. Godoc is the bridge between spec and implementation
3. Vague godoc leads to tests that validate bugs instead of exposing them

Includes case study from the ratelimit package showing how the bug
went undetected because tests followed the implementation instead of
the documented contract.

Provides concrete examples of vague vs clear godoc with guidelines
for writing explicit documentation that makes it impossible to write
tests that ignore expected behavior.
…01To3YBZrRDQmvwaAev7cKWZ

Rate Limiter Deadlock Bug
…nstraint-011CUyu1iovAvRm9q9iEP9cq

Claude/fix venue wallet unique constraint 011 c uyu1iov av rm9q9i ep9cq
Collaborator Author

terwey commented Nov 14, 2025

Known bugs:

#108
recomma/storage/storage.go, lines 179 to 191 in ed1cb9c:

```go
// Check if a venue with this (type, wallet) already exists (possibly with a different ID)
existingVenue, err := s.queries.GetVenueByTypeAndWallet(ctx, sqlcgen.GetVenueByTypeAndWalletParams{
	Type:   defaultHyperliquidVenueType,
	Wallet: wallet,
})
if err == nil {
	// A venue with this (type, wallet) already exists.
	// If it has a different ID than the default, we should use it instead of trying to create a duplicate.
	if existingVenue.ID != string(defaultHyperliquidVenueID) {
		// There's already a user-defined venue with this type and wallet.
		// No need to create a separate default venue - just return success.
		return nil
	}
```
When an existing venue with the same (type, wallet) is found, the new logic returns early without inserting or updating the default row (hyperliquid:default). Downstream code (e.g. EnsureDefaultVenueWallet, order identifier construction, and default venue lookups) expects this well-known ID to exist. If an operator already has a venue for the primary wallet under a different ID, startup will skip the upsert and later lookups by hyperliquid:default will fail with sql.ErrNoRows. Instead of returning immediately, the routine should reconcile the existing venue into the default ID or update it in place so the canonical identifier always exists.
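One possible shape of that reconciliation, sketched with a hypothetical query name (UpdateVenueID is only an illustration of folding the existing row into the canonical ID, not an existing generated query):

```go
if existingVenue.ID != string(defaultHyperliquidVenueID) {
	// Re-point the existing (type, wallet) venue to the well-known default ID
	// inside the same transaction, so later lookups by hyperliquid:default
	// never hit sql.ErrNoRows.
	err := s.queries.UpdateVenueID(ctx, sqlcgen.UpdateVenueIDParams{
		OldID: existingVenue.ID,
		NewID: string(defaultHyperliquidVenueID),
	})
	if err != nil {
		return fmt.Errorf("reconcile default venue: %w", err)
	}
	return nil
}
```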

#107 Respect user-specified deal worker and resync settings

recomma/cmd/recomma/main.go, lines 270 to 276 in ed1cb9c:

```go
// Get tier-specific rate limit configuration
rateLimitCfg := planTier.RateLimitConfig()

// Override configuration with tier-specific values (unless explicitly set by user)
cfg.DealWorkers = rateLimitCfg.DealWorkers
cfg.ResyncInterval = rateLimitCfg.ResyncInterval
```

After determining the plan tier, the code unconditionally assigns cfg.DealWorkers and cfg.ResyncInterval from the tier defaults. This overwrites any values explicitly provided by flags or environment variables, despite the comment stating the override should only happen when the user hasn’t set a value. Operators who tune these settings will now have their configuration silently ignored every boot. Consider only applying the tier defaults when the current value still equals the compiled default.
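A minimal sketch of that guard, assuming the compiled defaults are reachable as constants (the names below are illustrative):

```go
// Apply tier defaults only where the operator left the compiled default
// untouched, so explicit flags and environment variables keep their values.
if cfg.DealWorkers == defaultDealWorkers {
	cfg.DealWorkers = rateLimitCfg.DealWorkers
}
if cfg.ResyncInterval == defaultResyncInterval {
	cfg.ResyncInterval = rateLimitCfg.ResyncInterval
}
```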

Collaborator Author

terwey commented Nov 14, 2025

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 144 to 161

e.logger.Info("Checking for new deals from bots", slog.Int("bots", len(bots)))

// Fetch deals per bot concurrently with a reasonable cap.
// Now we know how many bots we have, adjust down the reservation
// We need: 1 (ListBots) + len(bots) (GetListOfDeals per bot)
if e.limiter != nil {
neededSlots := 1 + len(bots)
if err := e.limiter.AdjustDown(workflowID, neededSlots); err != nil {
e.logger.Warn("rate limit adjust down failed", slog.String("error", err.Error()))
}
}

// Fetch deals per bot concurrently with tier-specific concurrency cap
g, gctx := errgroup.WithContext(ctx)
g.SetLimit(32)
g.SetLimit(e.produceConcurrency)

for _, bot := range bots {
b := bot // capture loop var

P1: Avoid starving bots when reservations are smaller than bot count

The new rate‑limited producer reserves at most limit slots based on Limiter.Stats() and only ever reduces that reservation (AdjustDown). When len(bots) exceeds the per‑minute limit, the subsequent GetListOfDeals calls past the first limit bots immediately fail with ErrConsumeExceedsLimit in the rate‑limited client, hit the logger.Error("list deals for bot") path, and are silently dropped. Because the loop order is deterministic, the same leading limit bots are processed every resync while the rest never have their deals polled, so those bots will stall indefinitely. Either cap the number of bots processed per window or wait for additional capacity instead of issuing calls that you know will exceed the reservation.
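One way to act on the "cap the number of bots processed per window" option, sketched with assumed variable names (reservedSlots is whatever the producer actually obtained from Reserve):

```go
// Poll only as many bots as the reservation covers this window; the rest are
// deferred to the next resync instead of issuing calls that are known to fail.
// Rotating the starting offset between resyncs would avoid always favouring
// the same leading bots.
maxBots := reservedSlots - 1 // one slot already spent on ListBots
if maxBots < 0 {
	maxBots = 0
}
if len(bots) > maxBots {
	e.logger.Warn("rate limit smaller than bot count; deferring remaining bots",
		slog.Int("deferred", len(bots)-maxBots))
	bots = bots[:maxBots]
}
```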


@terwey terwey merged commit 99dcf4a into main Nov 14, 2025
5 checks passed
@terwey terwey deleted the codex/investigate-multi-wallet-support-for-hyperliquid branch November 14, 2025 16:16
