
Conversation

Collaborator

@terwey terwey commented Nov 1, 2025

Summary

| Area | Highlights |
| --- | --- |
| Storage | Introduce `(venue_id, wallet, order_id)` as the primary identifier for Hyperliquid submissions/status; regenerate sqlc models/queries; migrate and wrap wallet/venue updates in transactions; persist payload metadata for scaled-order audits; expose venue assignment tables. |
| Engine/Emitter | Replace the single Emitter with a QueueEmitter that routes by OrderIdentifier; enforce identifier presence and venue registration; replay creates for venues missing submissions; fan out modify/cancel per venue; add identifier checks and clearer errors. |
| Fill tracker | Track orders by OrderIdentifier (not just CLOID); adjust snapshots, reconciliation, and SSE emission accordingly. |
| Websocket/Refresher | Tag WS updates with venue/wallet; StatusRefresher queries per venue via a registry; record statuses using the full identifier. |
| API/OpenAPI | New endpoints to list/upsert/delete venues and manage bot↔venue assignments; stream/log payloads now include the identifier (venue_id + wallet); CancelOrderByOrderId accepts an optional venue_id; order records/logs surface venue-scoped state. |
| CLI/main | Ensure a default venue/wallet; build the base identifier; register venue emitters with the dispatcher; pass venue/wallet through to WS and workers. |
| Docs/Chore | Update AGENTS.md; revise the multi-venue design doc; guard generated code; fix and extend tests. |
| Fixes | Wrap default wallet/primary assignment in transactions; align Hyperliquid data with venues; restore scaled-order audits and queries. |

BREAKING CHANGES

  • Hyperliquid persistence and APIs are venue-scoped. Any code assuming a single wallet per order must now pass and handle OrderIdentifier { venue_id, wallet, order_id } (see the sketch after this list).
  • SSE/log/API responses include identifiers.venue_id and identifiers.wallet; hyperliquid.identifier added to order state.
  • Storage schema changes require migration; queries that joined on order_id alone must join on (venue_id, wallet, order_id).
  • Cancel operations may need a venue_id filter when multiple submissions exist for the same order.
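For orientation when updating callers, here is a minimal sketch of the identifier shape implied above; the field names mirror the JSON keys (venue_id, wallet, order_id), and the concrete Go types in the repository may differ.

```go
// OrderIdentifier scopes an order to a specific venue and wallet.
// Sketch only: the real definition lives in the recomma codebase and
// the OrderID type there may not be a plain string.
type OrderIdentifier struct {
	VenueID string `json:"venue_id"`
	Wallet  string `json:"wallet"`
	OrderID string `json:"order_id"`
}
```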

Migration notes

  • Run regenerated sqlc + schema migrations before deploy.
  • Use EnsureDefaultVenueWallet to seed the default Hyperliquid venue/wallet.
  • When emitting work, always populate OrderWork.Identifier; the dispatcher will reject items without it (see the sketch after this list).
  • If you target multiple venues per bot, upsert assignments via the new /api/venues/* endpoints, then restart to build WS/status clients per venue.
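As referenced above, a hedged sketch of populating the identifier before handing work to the dispatcher; OrderWork's other fields and the dispatcher API shown here are assumptions for illustration.

```go
// Build the base identifier once at startup (after EnsureDefaultVenueWallet),
// then attach it to every piece of work handed to the dispatcher.
work := OrderWork{
	Identifier: OrderIdentifier{
		VenueID: defaultVenueID, // e.g. the seeded default Hyperliquid venue
		Wallet:  primaryWallet,
		OrderID: orderID,
	},
	// ... action-specific payload (create/modify/cancel) elided ...
}
if err := dispatcher.Enqueue(ctx, work); err != nil {
	// The dispatcher rejects items whose Identifier is missing.
	logger.Error("enqueue rejected", slog.String("error", err.Error()))
}
```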

Review

Read specs/multi_venue_emission.adoc

@terwey terwey force-pushed the codex/investigate-multi-wallet-support-for-hyperliquid branch from 752edc0 to f628030 on November 2, 2025 16:59
terwey added 27 commits November 2, 2025 21:49
…queries-for-venues

`feat(storage): add venue-aware hyperliquid persistence`

**Summary**
* Added venue registry and bot assignment tables while rebuilding Hyperliquid submission/status and scaled order tables around `(venue_id, wallet, order_id)` identifiers and payload metadata so upgrades apply the new schema automatically. [storage/sqlc/schema.sqlL49-L199](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlc/schema.sql#L49-L199)
* Regenerated sqlc queries and models to accept venue-scoped arguments, emit payload metadata, and return composite scaled-order rows for audits and deal views. [storage/sqlc/queries.sqlL527-L888](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlc/queries.sql#L527-L888) [storage/sqlcgen/models.goL11-L139](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/sqlcgen/models.go#L11-L139)
* Updated storage logic to populate default venue and wallet identifiers, persist typed payload blobs for Hyperliquid submissions/statuses, and translate the new scaled-order results for streaming and API responses. [storage/storage.goL328-L607](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/storage.go#L328-L607) [storage/order_scalers.goL330-L648](https://github.com/recomma/recomma/blob/b13be88a6a7b37ce1bd1854676de03e3209c13c5/storage/order_scalers.go#L330-L648)
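For readers skimming the regenerated code, a rough sketch of the venue-scoped argument shape these queries now take; the struct and method names below are hypothetical stand-ins for the generated sqlcgen code linked above.

```go
// Hypothetical upsert against the rebuilt (venue_id, wallet, order_id) key.
params := sqlcgen.UpsertHyperliquidSubmissionParams{
	VenueID: venueID,     // composite key, part 1
	Wallet:  wallet,      // composite key, part 2
	OrderID: orderID,     // composite key, part 3
	Payload: payloadJSON, // typed payload metadata for audits
}
if err := queries.UpsertHyperliquidSubmission(ctx, params); err != nil {
	return fmt.Errorf("upsert hyperliquid submission: %w", err)
}
```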
…upport-for-hyperliquid' into codex/refactor-storage-for-venue-aware-identifiers
…aware-identifiers

feat(storage): add venue-aware identifiers
merge main for generated code guard
chore: ensure submodule
…ue' into codex/extend-orderwork-with-identifier-and-refactor-queue-ehfv2o
…ifier-and-refactor-queue-ehfv2o

fix: replay missing venue submissions
claude and others added 25 commits November 12, 2025 00:34
…011CV2b6CpvF4vQWJtSJiSwW

Resolved merge conflicts by combining:
- New SDK API style from spec branch (WithPlanTier)
- Rate limiter wrapper implementation
- Multi-venue support from spec branch
- Comprehensive test suites from both branches

All conflicts resolved while preserving functionality from both branches.
Cleaned up remaining conflict marker from spec/tc-rate-limit merge.
- Remove tc.Bot.Name (the field is not used in tests)
- Change tc.Deal.Id from int64 to int to match SDK types
- Align with type usage patterns from other tests in codebase
Previously, when the rate limit window reset (every 60 seconds), waiting
workflows in the queue were not notified. This caused them to remain stuck
in the queue indefinitely, even though capacity was now available.

The bug manifested as:
- Deal workflows would reserve and get queued
- produce:all-bots would consume all 5 slots and release
- Window would reset after 60 seconds
- produce:all-bots would reserve again immediately
- Deal workflows remained stuck in queue forever

Fix: Call tryGrantWaiting() after resetting the window in resetWindowIfNeeded()
to wake up and grant reservations to queued workflows that now have capacity.

This ensures fair FIFO processing and prevents workflow starvation.
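A minimal sketch of the fix described above, with illustrative names (the real limiter lives in the project's ratelimit package and may differ):

```go
// resetWindowIfNeeded rolls the rate-limit window forward and, per this fix,
// wakes queued workflows so the freed capacity is granted in FIFO order.
func (l *Limiter) resetWindowIfNeeded(now time.Time) {
	if now.Sub(l.windowStart) < l.window {
		return
	}
	l.windowStart = now
	l.consumed = 0
	// Previously missing: without this call, queued workflows stayed blocked
	// even though the new window had capacity available.
	l.tryGrantWaiting()
}
```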
fix: hyperliquid wire format expects 8 decimals for price and size
This refactor addresses the critical design flaw where waiting workflows
could not use freed capacity from AdjustDown/SignalComplete calls.

**Problem:**
Previously, tryGrantWaiting() would return immediately if activeReservation
was non-nil, defeating the entire "early release" pattern:
- produce:all-bots reserves 5 slots
- produce:all-bots calls AdjustDown(2), freeing 3 slots
- But tryGrantWaiting() couldn't grant to waiting deal workflows
- All workflows were serialized, no concurrency

**Solution:**
- Changed from single `activeReservation *reservation` to multiple
  `activeReservations map[string]*reservation`
- tryGrantWaiting() now grants to waiting workflows whenever there's
  capacity, regardless of existing active reservations
- Added calculateTotalReserved() to sum slots across all reservations
- Updated Reserve/Consume/AdjustDown/Extend/SignalComplete/Release
  to work with the map

**Result:**
When produce:all-bots adjusts down from 5→2 slots, the freed 3 slots
are immediately available for deal workflows. Multiple workflows can
now run concurrently, enabling true "early release" behavior per spec.

Closes the issue raised in code review regarding workflow serialization.
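A compact sketch of the bookkeeping this refactor describes, simplified from the real package (locking is omitted; assume callers hold the limiter's mutex):

```go
// reservation tracks one workflow's slots (illustrative shape).
type reservation struct {
	workflowID    string
	slotsReserved int
	slotsConsumed int
	grant         chan struct{} // closed when the reservation is granted
}

// Limiter now holds multiple concurrent reservations keyed by workflow ID.
type Limiter struct {
	limit              int
	consumed           int // slots consumed in the current window
	activeReservations map[string]*reservation
	waiting            []*reservation // FIFO queue of pending reservations
}

// calculateTotalReserved sums reserved slots across all active reservations.
func (l *Limiter) calculateTotalReserved() int {
	total := 0
	for _, r := range l.activeReservations {
		total += r.slotsReserved
	}
	return total
}

// tryGrantWaiting grants queued reservations whenever capacity allows,
// regardless of how many reservations are already active.
func (l *Limiter) tryGrantWaiting() {
	for len(l.waiting) > 0 {
		next := l.waiting[0]
		if l.consumed+l.calculateTotalReserved()+next.slotsReserved > l.limit {
			return // head of queue does not fit yet; preserve FIFO order
		}
		l.waiting = l.waiting[1:]
		l.activeReservations[next.workflowID] = next
		close(next.grant)
	}
}
```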
Line 337 was trying to redeclare totalReserved with := but it was already
declared at line 280 in the function scope. Changed to = for reassignment.
The Extend method was incorrectly using the Reserve capacity formula,
which prevented reservations from growing beyond the per-window limit.

**The Contract:**
Reservations can span multiple windows. The rate limiter enforces that
*consumption per window* doesn't exceed the limit, not that *total
reservations* can't exceed the limit.

**Old (incorrect) formula:**
  if l.consumed + totalReserved - res.slotsReserved + newReservation <= l.limit

This prevented extending a reservation beyond the window limit, even
when the additional consumption would span multiple windows.

**New (correct) formula:**
  if l.consumed + newReservation - res.slotsConsumed <= l.limit

This checks: "Can the additional slots I need (beyond what I've already
consumed) fit in the current window's remaining capacity?"

**Example (limit=10):**
- Reserve 8 slots, consume all 8 in window 1
- Window resets: consumed=0, slotsConsumed=8 (persists with reservation)
- Extend by 5 (total 13): Check 0 + 13 - 8 = 5 <= 10 ✓
- The 5 additional slots fit in the new window

This allows workflows to have large total reservations that span
multiple windows, while still enforcing per-window rate limits.

Fixes TestLimiter_ExtendRequiresWindowReset
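Expressed as code, the corrected check might read as follows (illustrative names, matching the formula above):

```go
// canExtend reports whether growing this reservation to newReservation total
// slots fits the current window: only the slots not yet consumed by the
// reservation count against the window, which is what lets a reservation
// span multiple windows.
func (l *Limiter) canExtend(res *reservation, newReservation int) bool {
	return l.consumed+newReservation-res.slotsConsumed <= l.limit
}
```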
Updated rate_limit.adoc to accurately describe the implementation:

**Core Principle Changes:**
- Changed from "single active reservation" to "multiple concurrent reservations"
- Clarified goal: guarantee sequential execution per workflow, not global serialization
- Added "Sequential Execution Guarantee" mechanism

**Key Updates:**
1. Principle (line 92): Now describes coordination vs serialization
2. Key Mechanisms (lines 110-126):
   - Multiple Concurrent Reservations (replaces Single Active Reservation)
   - Added Sequential Execution Guarantee
   - Added Cross-Window Reservations mechanism
3. Core State (lines 139-145): activeReservation → activeReservations map
4. Reserve operation (lines 157-162): Updated capacity check formula and behavior
5. AdjustDown (lines 180-182): Clarifies enables early release for concurrent workflows
6. Extend (lines 188-192): Documents cross-window formula and rationale
7. Release (lines 208-210): Multiple workflows may be granted
8. Window Reset (lines 218-221): slotsConsumed persists, immediate re-evaluation

**Open Questions Answered:**
- Question 3: Documented capacity formula with rationale
- Question 4: Documented Extend cross-window formula with example

**Example Updates:**
- Line 496: Clarified both workflows have concurrent reservations

The spec now accurately reflects that workflows get sequential execution
guarantees (preventing thundering herd), while multiple workflows can
run concurrently when capacity allows (via early release pattern).
…1CV2b6CpvF4vQWJtSJiSwW

feat: implement ThreeCommas API rate limiting with workflow reservation system
 - internal/api/system_stream.go (lines 36-174) adds per-subscriber locking plus trySend/close helpers so channel closes can’t race with Publish/history sends; cancellation now removes a subscriber under the controller lock and closes it safely afterward.
 - internal/api/system_stream.go (lines 188-254) reworks Publish/Flush to fan out via the guarded send path, skipping already-closed subscribers and logging only when buffers are full.

Co-authored-by: Codex <[email protected]>
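A hedged sketch of the guarded send/close pattern these changes describe; the type and field names below are assumptions, and the real code is in internal/api/system_stream.go.

```go
import "sync"

type Event struct{ /* payload elided */ }

type subscriber struct {
	mu     sync.Mutex
	ch     chan Event
	closed bool
}

// trySend delivers an event unless the subscriber is already closed or its
// buffer is full; the per-subscriber lock prevents a send racing a close.
func (s *subscriber) trySend(ev Event) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return false
	}
	select {
	case s.ch <- ev:
		return true
	default:
		return false // buffer full; the publisher logs and moves on
	}
}

// close marks the subscriber closed and closes its channel exactly once.
func (s *subscriber) close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.ch)
	}
}
```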
This commit documents the investigation of the deadlock bug in the rate
limiter where queued reservations never wake when the time window resets
if no other limiter operations occur.

Key findings:
- Queued workflows block on a channel that's only closed by
  tryGrantWaiting(), which is only called during other limiter operations
- If no active reservations exist and no new operations occur, queued
  workflows hang indefinitely even after the window resets
- The existing TestLimiter_WindowReset actually demonstrates the bug by
  requiring a third workflow to trigger reset detection

Changes:
- Corrected TestLimiter_WindowReset to test expected behavior (auto-wake)
- Added TestLimiter_QueuedReservationDeadlock to explicitly reproduce bug
- Created BUG_ANALYSIS.md with detailed root cause analysis and solutions

Both corrected tests will FAIL with current implementation, proving the
bug exists. They define the expected contract that queued workflows
should automatically wake when the window resets.

Recommended solution: Add background ticker or scheduled wake-up to
detect window resets even when no other operations occur.
Fixes the deadlock bug where queued reservations never wake when the
rate limit window resets if no other limiter operations occur.

Root Cause:
Queued workflows blocked on a channel that was only closed by
tryGrantWaiting(), which was only called during other limiter operations
(Reserve, Release, Consume, etc.). When no operations occurred, the
window could reset but nothing detected it, causing indefinite hangs.

Solution:
Implemented a background ticker that periodically checks for window
resets every window/10 duration (min 100ms). The ticker only runs
resetWindowIfNeeded() when there are queued workflows, ensuring
minimal overhead.

Changes:
1. Updated all godoc comments to explicitly document auto-wake behavior
   - Reserve(): Now clearly states queued workflows auto-wake on reset
   - Other operations: Document their role in waking queued workflows

2. Added background ticker infrastructure:
   - ticker: Runs every window/10 to detect resets
   - done: Channel for graceful shutdown
   - windowResetWatcher(): Background goroutine
   - Stop(): Cleanup method for tests

3. Updated NewLimiter() to start background watcher

The fix ensures the limiter matches its specification, which already
documented that "waiting workflows are immediately re-evaluated and
granted if capacity now available" when windows reset.

Tests:
- TestLimiter_WindowReset: Now correctly tests auto-wake (previously
  required 3rd workflow to trigger reset detection)
- TestLimiter_QueuedReservationDeadlock: Explicitly tests the bug
  scenario (burst exhausts quota, single caller queues, no other ops)

Both tests will now PASS with the background ticker implementation.
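A simplified sketch of the watcher loop described in this commit; the interval math follows the message above, while the field names (mu, done, waiting) are assumptions:

```go
// windowResetWatcher periodically checks for window resets so queued
// workflows auto-wake even when no other limiter operations occur.
func (l *Limiter) windowResetWatcher() {
	interval := l.window / 10
	if interval < 100*time.Millisecond {
		interval = 100 * time.Millisecond
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-l.done: // closed by Stop()
			return
		case <-ticker.C:
			l.mu.Lock()
			// Only do work when someone is actually queued, keeping overhead minimal.
			if len(l.waiting) > 0 {
				l.resetWindowIfNeeded(time.Now())
			}
			l.mu.Unlock()
		}
	}
}
```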
The background ticker goroutine needs to be stopped after each test
to prevent synctest from detecting goroutine leaks. This adds
defer l.Stop() after every NewLimiter() call in tests.

This fixes TestLimiter_FIFOQueue and ensures all other tests properly
clean up resources.
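The resulting test pattern, roughly (the NewLimiter signature here is illustrative):

```go
func TestLimiter_FIFOQueue(t *testing.T) {
	l := NewLimiter(5, time.Minute)
	// Stop the background watcher so synctest does not flag a leaked goroutine.
	defer l.Stop()
	// ... test body ...
}
```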
Removing investigation notes as they're not needed in the PR.
The fix is well-documented in commit messages and godoc comments.
Documents the lessons learned from the rate limiter deadlock bug:

1. Tests must validate contracts (spec + godoc), not implementation
2. Godoc is the bridge between spec and implementation
3. Vague godoc leads to tests that validate bugs instead of exposing them

Includes case study from the ratelimit package showing how the bug
went undetected because tests followed the implementation instead of
the documented contract.

Provides concrete examples of vague vs clear godoc with guidelines
for writing explicit documentation that makes it impossible to write
tests that ignore expected behavior.
…01To3YBZrRDQmvwaAev7cKWZ

Rate Limiter Deadlock Bug
…nstraint-011CUyu1iovAvRm9q9iEP9cq

Claude/fix venue wallet unique constraint 011 c uyu1iov av rm9q9i ep9cq
Collaborator Author

terwey commented Nov 14, 2025

Known bugs:

#108
recomma/storage/storage.go, lines 179 to 191 in ed1cb9c:

```go
// Check if a venue with this (type, wallet) already exists (possibly with a different ID)
existingVenue, err := s.queries.GetVenueByTypeAndWallet(ctx, sqlcgen.GetVenueByTypeAndWalletParams{
	Type:   defaultHyperliquidVenueType,
	Wallet: wallet,
})
if err == nil {
	// A venue with this (type, wallet) already exists.
	// If it has a different ID than the default, we should use it instead of trying to create a duplicate.
	if existingVenue.ID != string(defaultHyperliquidVenueID) {
		// There's already a user-defined venue with this type and wallet.
		// No need to create a separate default venue - just return success.
		return nil
	}
```
When an existing venue with the same (type, wallet) is found, the new logic returns early without inserting or updating the default row (hyperliquid:default). Downstream code (e.g. EnsureDefaultVenueWallet, order identifier construction, and default venue lookups) expects this well-known ID to exist. If an operator already has a venue for the primary wallet under a different ID, startup will skip the upsert and later lookups by hyperliquid:default will fail with sql.ErrNoRows. Instead of returning immediately, the routine should reconcile the existing venue into the default ID or update it in place so the canonical identifier always exists.
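One possible shape of that reconciliation, sketched with a hypothetical query name (UpdateVenueID is only an illustration of folding the existing row into the canonical ID, not an existing generated query):

```go
if existingVenue.ID != string(defaultHyperliquidVenueID) {
	// Re-point the existing (type, wallet) venue to the well-known default ID
	// inside the same transaction, so later lookups by hyperliquid:default
	// never hit sql.ErrNoRows.
	err := s.queries.UpdateVenueID(ctx, sqlcgen.UpdateVenueIDParams{
		OldID: existingVenue.ID,
		NewID: string(defaultHyperliquidVenueID),
	})
	if err != nil {
		return fmt.Errorf("reconcile default venue: %w", err)
	}
	return nil
}
```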

#107 Respect user-specified deal worker and resync settings

recomma/cmd/recomma/main.go, lines 270 to 276 in ed1cb9c:

```go
// Get tier-specific rate limit configuration
rateLimitCfg := planTier.RateLimitConfig()

// Override configuration with tier-specific values (unless explicitly set by user)
cfg.DealWorkers = rateLimitCfg.DealWorkers
cfg.ResyncInterval = rateLimitCfg.ResyncInterval
```

After determining the plan tier, the code unconditionally assigns cfg.DealWorkers and cfg.ResyncInterval from the tier defaults. This overwrites any values explicitly provided by flags or environment variables, despite the comment stating the override should only happen when the user hasn’t set a value. Operators who tune these settings will now have their configuration silently ignored every boot. Consider only applying the tier defaults when the current value still equals the compiled default.
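A minimal sketch of that guard, assuming the compiled defaults are reachable as constants (the names below are illustrative):

```go
// Apply tier defaults only where the operator left the compiled default
// untouched, so explicit flags and environment variables keep their values.
if cfg.DealWorkers == defaultDealWorkers {
	cfg.DealWorkers = rateLimitCfg.DealWorkers
}
if cfg.ResyncInterval == defaultResyncInterval {
	cfg.ResyncInterval = rateLimitCfg.ResyncInterval
}
```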

Collaborator Author

terwey commented Nov 14, 2025

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 144 to 161

e.logger.Info("Checking for new deals from bots", slog.Int("bots", len(bots)))

// Fetch deals per bot concurrently with a reasonable cap.
// Now we know how many bots we have, adjust down the reservation
// We need: 1 (ListBots) + len(bots) (GetListOfDeals per bot)
if e.limiter != nil {
neededSlots := 1 + len(bots)
if err := e.limiter.AdjustDown(workflowID, neededSlots); err != nil {
e.logger.Warn("rate limit adjust down failed", slog.String("error", err.Error()))
}
}

// Fetch deals per bot concurrently with tier-specific concurrency cap
g, gctx := errgroup.WithContext(ctx)
g.SetLimit(32)
g.SetLimit(e.produceConcurrency)

for _, bot := range bots {
b := bot // capture loop var

P1: Avoid starving bots when reservations are smaller than bot count

The new rate‑limited producer reserves at most limit slots based on Limiter.Stats() and only ever reduces that reservation (AdjustDown). When len(bots) exceeds the per‑minute limit, the subsequent GetListOfDeals calls past the first limit bots immediately fail with ErrConsumeExceedsLimit in the rate‑limited client, hit the logger.Error("list deals for bot") path, and are silently dropped. Because the loop order is deterministic, the same leading limit bots are processed every resync while the rest never have their deals polled, so those bots will stall indefinitely. Either cap the number of bots processed per window or wait for additional capacity instead of issuing calls that you know will exceed the reservation.
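One way to act on the "cap the number of bots processed per window" option, sketched with assumed variable names (reservedSlots is whatever the producer actually obtained from Reserve):

```go
// Poll only as many bots as the reservation covers this window; the rest are
// deferred to the next resync instead of issuing calls that are known to fail.
// Rotating the starting offset between resyncs would avoid always favouring
// the same leading bots.
maxBots := reservedSlots - 1 // one slot already spent on ListBots
if maxBots < 0 {
	maxBots = 0
}
if len(bots) > maxBots {
	e.logger.Warn("rate limit smaller than bot count; deferring remaining bots",
		slog.Int("deferred", len(bots)-maxBots))
	bots = bots[:maxBots]
}
```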


@terwey terwey merged commit 99dcf4a into main Nov 14, 2025
5 checks passed
@terwey terwey deleted the codex/investigate-multi-wallet-support-for-hyperliquid branch November 14, 2025 16:16
