feat(venues): add venue-aware order identifiers, multi-venue dispatch, and venue management APIs #51
Conversation
Force-pushed from 752edc0 to f628030
…queries-for-venues `feat(storage): add venue-aware hyperliquid persistence`

**Summary**

* Added venue registry and bot assignment tables while rebuilding Hyperliquid submission/status and scaled order tables around `(venue_id, wallet, order_id)` identifiers and payload metadata, so upgrades apply the new schema automatically. [storage/sqlc/schema.sql L49-L199](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlc/schema.sql#L49-L199)
* Regenerated sqlc queries and models to accept venue-scoped arguments, emit payload metadata, and return composite scaled-order rows for audits and deal views. [storage/sqlc/queries.sql L527-L888](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlc/queries.sql#L527-L888) [storage/sqlcgen/models.go L11-L139](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlcgen/models.go#L11-L139)
* Updated storage logic to populate default venue and wallet identifiers, persist typed payload blobs for Hyperliquid submissions/statuses, and translate the new scaled-order results for streaming and API responses. [storage/storage.go L328-L607](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/storage.go#L328-L607) [storage/order_scalers.go L330-L648](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/order_scalers.go#L330-L648)
…upport-for-hyperliquid' into codex/refactor-storage-for-venue-aware-identifiers
…aware-identifiers feat(storage): add venue-aware identifiers
merge main for generated code guard
chore: ensure submodule
guard generated code
guard generated code
…ue' into codex/extend-orderwork-with-identifier-and-refactor-queue-ehfv2o
…ifier-and-refactor-queue-ehfv2o fix: replay missing venue submissions
…011CV2b6CpvF4vQWJtSJiSwW Resolved merge conflicts by combining:

- New SDK API style from spec branch (WithPlanTier)
- Rate limiter wrapper implementation
- Multi-venue support from spec branch
- Comprehensive test suites from both branches

All conflicts resolved while preserving functionality from both branches.
Cleaned up remaining conflict marker from spec/tc-rate-limit merge.
- Remove tc.Bot.Name (field not used in tests)
- Change tc.Deal.Id from int64 to int to match SDK types
- Align with type usage patterns from other tests in the codebase
Previously, when the rate limit window reset (every 60 seconds), waiting workflows in the queue were not notified. This caused them to remain stuck in the queue indefinitely, even though capacity was now available.

The bug manifested as:

- Deal workflows would reserve and get queued
- produce:all-bots would consume all 5 slots and release
- The window would reset after 60 seconds
- produce:all-bots would reserve again immediately
- Deal workflows remained stuck in the queue forever

Fix: call tryGrantWaiting() after resetting the window in resetWindowIfNeeded() to wake up and grant reservations to queued workflows that now have capacity. This ensures fair FIFO processing and prevents workflow starvation.
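A minimal sketch of what that fix looks like, assuming the field and helper names used in these commit messages (`windowStart`, `window`, `consumed`, `tryGrantWaiting`); the Limiter struct itself is sketched under the refactor commit below, and this is illustrative, not the code from the PR:

```go
// resetWindowIfNeeded rolls the window forward and, crucially, re-evaluates
// the wait queue so workflows are not left parked after a reset.
// The caller is assumed to hold l.mu; all names are assumptions from the commit text.
func (l *Limiter) resetWindowIfNeeded(now time.Time) {
	if now.Sub(l.windowStart) < l.window {
		return // current window still active
	}
	l.windowStart = now
	l.consumed = 0 // fresh per-window budget

	// The fix: wake queued workflows that now fit. Without this call they
	// would stay blocked until some unrelated limiter operation ran.
	l.tryGrantWaiting()
}
```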
fix: hyperliquid wire format expects 8 decimals for price and size
This refactor addresses the critical design flaw where waiting workflows could not use freed capacity from AdjustDown/SignalComplete calls.

**Problem:** Previously, tryGrantWaiting() would return immediately if activeReservation was non-nil, defeating the entire "early release" pattern:

- produce:all-bots reserves 5 slots
- produce:all-bots calls AdjustDown(2), freeing 3 slots
- But tryGrantWaiting() couldn't grant to waiting deal workflows
- All workflows were serialized, with no concurrency

**Solution:**

- Changed from a single `activeReservation *reservation` to multiple `activeReservations map[string]*reservation`
- tryGrantWaiting() now grants to waiting workflows whenever there's capacity, regardless of existing active reservations
- Added calculateTotalReserved() to sum slots across all reservations
- Updated Reserve/Consume/AdjustDown/Extend/SignalComplete/Release to work with the map

**Result:** When produce:all-bots adjusts down from 5 to 2 slots, the freed 3 slots are immediately available for deal workflows. Multiple workflows can now run concurrently, enabling true "early release" behavior per the spec.

Closes the issue raised in code review regarding workflow serialization.
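A hedged sketch of the refactored state and grant loop; the struct layout, the `reservation`/`waiter` types, and the exact capacity check are assumptions inferred from the commit message, not the package's real code:

```go
package ratelimit

import (
	"sync"
	"time"
)

type reservation struct {
	slotsReserved int // slots currently reserved by the workflow
	slotsConsumed int // slots already consumed (persists across window resets)
}

type waiter struct {
	workflowID string
	slots      int
	granted    chan struct{} // closed when the reservation is granted
}

type Limiter struct {
	mu          sync.Mutex
	limit       int           // max slots consumable per window
	window      time.Duration // window length (e.g. one minute)
	windowStart time.Time
	consumed    int

	// Was: activeReservation *reservation. Multiple workflows may now hold
	// reservations concurrently.
	activeReservations map[string]*reservation
	waiting            []*waiter // FIFO queue of blocked Reserve calls

	ticker *time.Ticker  // background window-reset watcher (added in a later commit)
	done   chan struct{} // shuts the watcher down
}

// calculateTotalReserved sums outstanding slots across all active reservations.
func (l *Limiter) calculateTotalReserved() int {
	total := 0
	for _, r := range l.activeReservations {
		total += r.slotsReserved
	}
	return total
}

// tryGrantWaiting grants queued workflows whenever capacity exists, regardless
// of how many reservations are already active -- the core of this refactor.
// The caller is assumed to hold l.mu.
func (l *Limiter) tryGrantWaiting() {
	for len(l.waiting) > 0 {
		next := l.waiting[0]
		if l.consumed+l.calculateTotalReserved()+next.slots > l.limit {
			return // head of the queue doesn't fit yet; preserve FIFO order
		}
		l.waiting = l.waiting[1:]
		l.activeReservations[next.workflowID] = &reservation{slotsReserved: next.slots}
		close(next.granted) // wake the blocked Reserve call
	}
}
```

With this shape, AdjustDown only shrinks one entry in the map and then calls tryGrantWaiting, so freed slots are handed to waiting deal workflows immediately.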
Line 337 was trying to redeclare totalReserved with := but it was already declared at line 280 in the function scope. Changed to = for reassignment.
The Extend method was incorrectly using the Reserve capacity formula, which prevented reservations from growing beyond the per-window limit.

**The contract:** Reservations can span multiple windows. The rate limiter enforces that *consumption per window* doesn't exceed the limit, not that *total reservations* can't exceed the limit.

**Old (incorrect) formula:** `if l.consumed + totalReserved - res.slotsReserved + newReservation <= l.limit`

This prevented extending a reservation beyond the window limit, even when the additional consumption would span multiple windows.

**New (correct) formula:** `if l.consumed + newReservation - res.slotsConsumed <= l.limit`

This checks: "Can the additional slots I need (beyond what I've already consumed) fit in the current window's remaining capacity?"

**Example (limit=10):**

- Reserve 8 slots, consume all 8 in window 1
- Window resets: consumed=0, slotsConsumed=8 (persists with the reservation)
- Extend by 5 (total 13): check 0 + 13 - 8 = 5 <= 10 ✓
- The 5 additional slots fit in the new window

This allows workflows to have large total reservations that span multiple windows, while still enforcing per-window rate limits.

Fixes TestLimiter_ExtendRequiresWindowReset
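For concreteness, a sketch of the corrected check, continuing the Limiter sketched above (requires the standard `errors` import); `ErrExtendExceedsLimit` is a hypothetical error name:

```go
var ErrExtendExceedsLimit = errors.New("extend exceeds current window capacity") // assumed name

// Extend grows an existing reservation by `additional` slots. Only the slots
// not yet consumed must fit into the current window's remaining capacity,
// so a large total reservation can span multiple windows.
func (l *Limiter) Extend(workflowID string, additional int) error {
	l.mu.Lock()
	defer l.mu.Unlock()

	res, ok := l.activeReservations[workflowID]
	if !ok {
		return errors.New("no active reservation for workflow " + workflowID)
	}

	newReservation := res.slotsReserved + additional

	// New formula: consumed + newReservation - slotsConsumed <= limit.
	// Example (limit=10): consume 8 in window 1, window resets (consumed=0,
	// slotsConsumed=8), extend to 13: 0 + 13 - 8 = 5 <= 10, so it fits.
	if l.consumed+newReservation-res.slotsConsumed > l.limit {
		return ErrExtendExceedsLimit
	}

	res.slotsReserved = newReservation
	return nil
}
```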
Updated rate_limit.adoc to accurately describe the implementation.

**Core principle changes:**

- Changed from "single active reservation" to "multiple concurrent reservations"
- Clarified the goal: guarantee sequential execution per workflow, not global serialization
- Added a "Sequential Execution Guarantee" mechanism

**Key updates:**

1. Principle (line 92): now describes coordination vs. serialization
2. Key Mechanisms (lines 110-126): Multiple Concurrent Reservations (replaces Single Active Reservation); added Sequential Execution Guarantee; added a Cross-Window Reservations mechanism
3. Core State (lines 139-145): activeReservation → activeReservations map
4. Reserve operation (lines 157-162): updated the capacity check formula and behavior
5. AdjustDown (lines 180-182): clarifies that it enables early release for concurrent workflows
6. Extend (lines 188-192): documents the cross-window formula and rationale
7. Release (lines 208-210): multiple workflows may be granted
8. Window Reset (lines 218-221): slotsConsumed persists; immediate re-evaluation

**Open questions answered:**

- Question 3: documented the capacity formula with rationale
- Question 4: documented the Extend cross-window formula with an example

**Example updates:**

- Line 496: clarified that both workflows have concurrent reservations

The spec now accurately reflects that workflows get sequential execution guarantees (preventing a thundering herd), while multiple workflows can run concurrently when capacity allows (via the early release pattern).
…1CV2b6CpvF4vQWJtSJiSwW feat: implement ThreeCommas API rate limiting with workflow reservation system
Spec/tc rate limit
Threecommas/ratelimit
- internal/api/system_stream.go (lines 36-174): adds per-subscriber locking plus trySend/close helpers so channel closes can’t race with Publish/history sends; cancellation now removes a subscriber under the controller lock and closes it safely afterward.
- internal/api/system_stream.go (lines 188-254): reworks Publish/Flush to fan out via the guarded send path, skipping already-closed subscribers and logging only when buffers are full.

Co-authored-by: Codex <[email protected]>
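A sketch of what a guarded send path like this can look like; the `subscriber` fields and `Event` type here are assumptions, not the actual types in internal/api/system_stream.go:

```go
package api

import (
	"log/slog"
	"sync"
)

type Event struct{ Payload []byte } // placeholder for the real event type

type subscriber struct {
	mu     sync.Mutex
	ch     chan Event
	closed bool
}

// trySend delivers an event unless the subscriber is closed or its buffer is
// full. Because send and close are serialized by the per-subscriber mutex,
// Publish can never race with a close and panic on a closed channel.
func (s *subscriber) trySend(ev Event, logger *slog.Logger) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return // skip already-closed subscribers
	}
	select {
	case s.ch <- ev:
	default:
		logger.Warn("subscriber buffer full, dropping event")
	}
}

// close marks the subscriber closed and closes its channel exactly once.
// The controller removes the subscriber from its map under the controller
// lock first, then calls this afterward without holding that lock.
func (s *subscriber) close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.ch)
	}
}
```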
This commit documents the investigation of the deadlock bug in the rate limiter where queued reservations never wake when the time window resets if no other limiter operations occur.

Key findings:

- Queued workflows block on a channel that's only closed by tryGrantWaiting(), which is only called during other limiter operations
- If no active reservations exist and no new operations occur, queued workflows hang indefinitely even after the window resets
- The existing TestLimiter_WindowReset actually demonstrates the bug by requiring a third workflow to trigger reset detection

Changes:

- Corrected TestLimiter_WindowReset to test the expected behavior (auto-wake)
- Added TestLimiter_QueuedReservationDeadlock to explicitly reproduce the bug
- Created BUG_ANALYSIS.md with detailed root cause analysis and solutions

Both corrected tests will FAIL with the current implementation, proving the bug exists. They define the expected contract that queued workflows should automatically wake when the window resets.

Recommended solution: add a background ticker or scheduled wake-up to detect window resets even when no other operations occur.
Fixes the deadlock bug where queued reservations never wake when the rate limit window resets if no other limiter operations occur.

Root cause: queued workflows blocked on a channel that was only closed by tryGrantWaiting(), which was only called during other limiter operations (Reserve, Release, Consume, etc.). When no operations occurred, the window could reset but nothing detected it, causing indefinite hangs.

Solution: implemented a background ticker that periodically checks for window resets every window/10 duration (min 100ms). The ticker only runs resetWindowIfNeeded() when there are queued workflows, ensuring minimal overhead.

Changes:

1. Updated all godoc comments to explicitly document auto-wake behavior: Reserve() now clearly states that queued workflows auto-wake on reset; other operations document their role in waking queued workflows
2. Added background ticker infrastructure: ticker (runs every window/10 to detect resets), done (channel for graceful shutdown), windowResetWatcher() (background goroutine), Stop() (cleanup method for tests)
3. Updated NewLimiter() to start the background watcher

The fix ensures the limiter matches its specification, which already documented that "waiting workflows are immediately re-evaluated and granted if capacity now available" when windows reset.

Tests:

- TestLimiter_WindowReset: now correctly tests auto-wake (previously required a third workflow to trigger reset detection)
- TestLimiter_QueuedReservationDeadlock: explicitly tests the bug scenario (a burst exhausts the quota, a single caller queues, no other operations occur)

Both tests now PASS with the background ticker implementation.
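A sketch of the watcher described above, reusing the assumed Limiter fields from the earlier sketches; the interval math and method names follow the commit message, but the details are illustrative:

```go
// NewLimiter builds a limiter and starts the background window-reset watcher.
func NewLimiter(limit int, window time.Duration) *Limiter {
	l := &Limiter{
		limit:              limit,
		window:             window,
		windowStart:        time.Now(),
		activeReservations: make(map[string]*reservation),
		done:               make(chan struct{}),
	}

	// Poll at window/10, but at least every 100ms, so queued workflows wake
	// shortly after a reset even when no other limiter calls arrive.
	interval := window / 10
	if interval < 100*time.Millisecond {
		interval = 100 * time.Millisecond
	}
	l.ticker = time.NewTicker(interval)
	go l.windowResetWatcher()
	return l
}

// windowResetWatcher periodically checks for window resets, but only does
// work while workflows are actually queued, keeping overhead minimal.
func (l *Limiter) windowResetWatcher() {
	for {
		select {
		case <-l.done:
			return
		case <-l.ticker.C:
			l.mu.Lock()
			if len(l.waiting) > 0 {
				l.resetWindowIfNeeded(time.Now())
			}
			l.mu.Unlock()
		}
	}
}

// Stop shuts down the watcher; tests call this to avoid leaking the goroutine.
func (l *Limiter) Stop() {
	l.ticker.Stop()
	close(l.done)
}
```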
The background ticker goroutine needs to be stopped after each test to prevent synctest from detecting goroutine leaks. This adds defer l.Stop() after every NewLimiter() call in tests. This fixes TestLimiter_FIFOQueue and ensures all other tests properly clean up resources.
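The resulting pattern in the tests, sketched with assumed constructor arguments:

```go
func TestLimiter_FIFOQueue(t *testing.T) {
	l := NewLimiter(5, time.Minute)
	defer l.Stop() // stop the background watcher so synctest sees no leaked goroutine

	// ... exercise Reserve / Consume / Release ordering ...
}
```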
Removing investigation notes as they're not needed in the PR. The fix is well-documented in commit messages and godoc comments.
Documents the lessons learned from the rate limiter deadlock bug:

1. Tests must validate contracts (spec + godoc), not implementation
2. Godoc is the bridge between spec and implementation
3. Vague godoc leads to tests that validate bugs instead of exposing them

Includes a case study from the ratelimit package showing how the bug went undetected because tests followed the implementation instead of the documented contract. Provides concrete examples of vague vs. clear godoc, with guidelines for writing explicit documentation that makes it impossible to write tests that ignore expected behavior.
…01To3YBZrRDQmvwaAev7cKWZ Rate Limiter Deadlock Bug
…nstraint-011CUyu1iovAvRm9q9iEP9cq Claude/fix venue wallet unique constraint
Known bugs: #108 #107

recomma/cmd/recomma/main.go

`// Override configuration with tier-specific values (unless explicitly set by user)`

After determining the plan tier, the code unconditionally assigns cfg.DealWorkers and cfg.ResyncInterval from the tier defaults. This overwrites any values explicitly provided by flags or environment variables, despite the comment stating the override should only happen when the user hasn’t set a value. Operators who tune these settings will have their configuration silently ignored on every boot. Consider applying the tier defaults only when the current value still equals the compiled default.
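A hedged sketch of the suggested fix: apply the tier values only when the current setting still equals the compiled default. The type and constant names here (`Config`, `TierDefaults`, `defaultDealWorkers`, `defaultResyncInterval`) are hypothetical stand-ins, not the repo's real identifiers:

```go
package config

import "time"

// Hypothetical compiled defaults registered with the flag/env parser.
const (
	defaultDealWorkers    = 4
	defaultResyncInterval = 5 * time.Minute
)

type Config struct {
	DealWorkers    int
	ResyncInterval time.Duration
}

type TierDefaults struct {
	DealWorkers    int
	ResyncInterval time.Duration
}

// applyTierDefaults overrides only values the operator has not touched,
// approximating "explicitly set by user" by comparing against the compiled defaults.
func applyTierDefaults(cfg *Config, tier TierDefaults) {
	if cfg.DealWorkers == defaultDealWorkers {
		cfg.DealWorkers = tier.DealWorkers
	}
	if cfg.ResyncInterval == defaultResyncInterval {
		cfg.ResyncInterval = tier.ResyncInterval
	}
}
```

An alternative is to track which flags were explicitly set (for example via flag.Visit, which only visits flags that have been set) rather than comparing values.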
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
```diff
 	e.logger.Info("Checking for new deals from bots", slog.Int("bots", len(bots)))

-	// Fetch deals per bot concurrently with a reasonable cap.
+	// Now we know how many bots we have, adjust down the reservation
+	// We need: 1 (ListBots) + len(bots) (GetListOfDeals per bot)
+	if e.limiter != nil {
+		neededSlots := 1 + len(bots)
+		if err := e.limiter.AdjustDown(workflowID, neededSlots); err != nil {
+			e.logger.Warn("rate limit adjust down failed", slog.String("error", err.Error()))
+		}
+	}
+
+	// Fetch deals per bot concurrently with tier-specific concurrency cap
 	g, gctx := errgroup.WithContext(ctx)
-	g.SetLimit(32)
+	g.SetLimit(e.produceConcurrency)

 	for _, bot := range bots {
 		b := bot // capture loop var
```
Avoid starving bots when reservations are smaller than bot count
The new rate‑limited producer reserves at most limit slots based on Limiter.Stats() and only ever reduces that reservation (AdjustDown). When len(bots) exceeds the per‑minute limit, the subsequent GetListOfDeals calls past the first limit bots immediately fail with ErrConsumeExceedsLimit in the rate‑limited client, hit the logger.Error("list deals for bot") path, and are silently dropped. Because the loop order is deterministic, the same leading limit bots are processed every resync while the rest never have their deals polled, so those bots will stall indefinitely. Either cap the number of bots processed per window or wait for additional capacity instead of issuing calls that you know will exceed the reservation.
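One hedged way to implement the suggested cap is to poll only as many bots as the reservation covers and rotate the starting offset between resyncs, so the remaining bots are served in later windows. This is a sketch of one possible fix, not code from the PR:

```go
// selectBotsForWindow returns at most capacity bots, starting at offset and
// wrapping around, plus the offset to use on the next resync. Rotating the
// offset keeps a deterministic loop order from starving the same trailing bots.
func selectBotsForWindow[B any](bots []B, offset, capacity int) (selected []B, nextOffset int) {
	if len(bots) == 0 || capacity <= 0 {
		return nil, offset
	}
	start := offset % len(bots)
	for i := 0; i < len(bots) && len(selected) < capacity; i++ {
		selected = append(selected, bots[(start+i)%len(bots)])
	}
	return selected, (start + len(selected)) % len(bots)
}
```

The producer would then pass the reserved slot count minus one (the ListBots call uses a slot) as capacity and persist the offset between resyncs; alternatively, it could wait on the limiter for additional capacity instead of issuing calls that are known to exceed the reservation.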
Summary
- Use `(venue_id, wallet, order_id)` as the primary identifier for Hyperliquid submissions/status; regenerate `sqlc` models/queries; migrate and wrap wallet/venue updates in transactions; persist payload metadata for scaled-order audits; expose venue assignment tables.
- Replace `Emitter` with a `QueueEmitter` that routes by `OrderIdentifier`; enforce identifier presence and venue registration; replay creates for venues missing submissions; fan out modify/cancel per venue; add identifier checks and clearer errors.
- Key order state on `OrderIdentifier` (not just CLOID); adjust snapshots, reconciliation, and SSE emission accordingly.
- `StatusRefresher` queries per venue via a registry; record statuses using the full identifier.
- Expose the `identifier` (`venue_id` + `wallet`); `CancelOrderByOrderId` accepts an optional `venue_id`; order records/logs surface venue-scoped state.
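For reference, a minimal Go sketch of the identifier the summary describes; the field names and the `Key` helper are assumptions for illustration, not the repo's actual definition:

```go
// OrderIdentifier is the venue-aware key this PR threads through storage,
// dispatch, reconciliation, and status refresh.
type OrderIdentifier struct {
	VenueID string // registered venue the order was sent to
	Wallet  string // wallet that submitted the order
	OrderID string // venue-local order id (a CLOID alone no longer identifies an order)
}

// Key renders the composite identity used to join submissions, statuses,
// and scaled-order rows on (venue_id, wallet, order_id).
func (id OrderIdentifier) Key() string {
	return id.VenueID + "/" + id.Wallet + "/" + id.OrderID
}
```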
BREAKING CHANGES

- Orders are identified by `OrderIdentifier { venue_id, wallet, order_id }`.
- `identifiers.venue_id` and `identifiers.wallet` are exposed; `hyperliquid.identifier` added to order state.
- Consumers keyed on `order_id` alone must join on `(venue_id, wallet, order_id)`.
- Apply a `venue_id` filter when multiple submissions exist for the same order.
Migration notes

- Run `sqlc` + schema migrations before deploy.
- Call `EnsureDefaultVenueWallet` to seed the default Hyperliquid venue/wallet.
- Populate `OrderWork.Identifier`; the dispatcher will reject items without it.
- Register venues via the `/api/venues/*` endpoints, then restart to build WS/status clients per venue.

Review
Read specs/multi_venue_emission.adoc