Limit Order v2: Correctness Under Failure

In Part 1, we talked about our take on solving the price monitoring problem - replacing pairwise exchange rate monitoring with a USD-denominated price stream. What happens next is a different ball game: taking ownership of execution correctness in a world where every network call can fail, every process can crash, and the downstream ledger never forgets your mistakes.
The version where this wasn't our problem
In v1, the on-chain program was the state machine. When a limit order was created, filled, or cancelled, each of those actions was a Solana program instruction - with the token deposit wrapped into the order initialisation itself. The order's lifecycle - its current state, its history, and its finality - all lived on-chain.
The keeper - the off-chain service that monitors and fills orders - was stateless by design in v1. Its job was simple: read on-chain orders, check whether any should trigger, and submit fill instructions. If the keeper crashed mid-fill, nothing was lost. No order disappeared. No state was corrupted. The keeper could restart from scratch, re-read the chain, and pick up exactly where it left off. The truth was never in memory. It was on-chain, and the chain doesn't forget.
Correctness was the on-chain program's job, and it did it well. We didn't fully appreciate how much work that single fact was doing for us until we took it away.
The cost of hiding the order book
We took it away for privacy.
In v1, every open order was fully visible on the public ledger. Trigger price, order size, input token, and output token - all readable by anyone with an RPC endpoint. That's a gift to anyone paying attention. MEV bots can front-run triggers. Competitors can reconstruct your entire order book. Sophisticated traders can see the exact price levels where liquidity is clustered and trade against it.
In v2, the sensitive details of an order live off-chain. Order creation is still a transfer event on Solana, so observers can see the deposit amount and the input token. However, key details such as the output token and the trigger price live in our database, not on the ledger. That's enough to deny reconstruction of the order book. You can see that someone deposited 100 SOL into our system. You cannot see whether they're selling at $200 or $2,000, or whether they're selling it for JUP or USDC. This makes it significantly harder to front-run or grief limit orders.
But the on-chain program wasn't just storing data; it was also enforcing correctness. The moment we stopped relying on the program for order state, we forfeited every guarantee it had been providing:
- Durability - orders survive crashes and restarts
- Atomicity - actions fully succeed or fully revert
- Idempotency - retries don't create duplicates
Those are now our problem. On-chain, the SVM (Solana Virtual Machine) runtime enforces that every instruction either fully succeeds or fully reverts. Off-chain services live in a world of partial failures. RPC calls time out. Processes get OOM-killed. A transaction confirms on-chain but the confirmation never reaches our service. The question was never if these things would happen. It was whether we'd handle them correctly when they did.
What the keeper needed to become
v1's keeper was TypeScript - a natural choice alongside an Anchor program with shared IDLs and a JavaScript-native toolchain. But v2's requirements were different. In v1, correctness lived in the on-chain program. The keeper was stateless. It didn't need to track order lifecycles, manage concurrent state transitions, or recover from partial failures. TypeScript was fine for that.
For v2, the keeper IS the state machine. It owns order lifecycles, manages concurrent execution across multiple services, and must recover correctly from any crash at any point. The first problem was architectural: v1 ran trigger detection and execution in a single process, with sharding bolted on after it hit file descriptor limits. The second problem was the language itself. In an async-heavy TypeScript codebase, errors were too opaque. Figuring out why an execution silently stalled required collecting too many data points across too many layers. The team adopted p-flat to make error handling more explicit, but there was nothing in the language to enforce it - a developer could always skip it, and often did. For a system where a missed error can mean lost funds, that wasn't acceptable.
We evaluated alternatives:
- Rust - attempted a rewrite. Async Rust's complexity got the better of us.
- better-result - better ergonomics than p-flat, but a large rewrite for the same enforcement gap.
- EffectTS - genuinely exciting ideas with dependency injection and result types, but too much of a paradigm shift and bleeding edge at the time.
- Go - explicit error returns, context management, panic recovery, first-class concurrency, and more importantly, team familiarity. This one stuck.
The public API stayed in TypeScript - Solana's client libraries are simply more mature there. We rewrote the engine, not the whole stack.
But Go didn't eliminate complexity - it shifted it. The concurrency we gained surfaced race conditions that TypeScript's single-threaded model had quietly masked. Concurrent processes claiming the same orders, state transitions interleaving in ways we hadn't anticipated. That problem is exactly what the database-backed state machine was built to solve.
What we inherited
Once order state lives off-chain, it becomes our responsibility to replicate every guarantee the on-chain program previously enforced. And the blockchain makes that harder than a typical distributed systems problem.
Consider the fundamental challenge: you cannot retry a Solana transaction the way you retry an HTTP request. An HTTP call is (usually) idempotent or at least reversible. A Solana transaction that performs a swap is neither. If we submit a swap transaction and never receive confirmation, we're in an ambiguous state. The transaction might have landed on-chain - the swap executed, the user's tokens moved, and we just didn't hear back. If we naively retry by submitting a new transaction, we could double-execute the order: sending the user's remaining tokens through a second swap that they never asked for.
The downstream is an append-only ledger that never forgets, even when our process does.
You can't roll back a confirmed Solana transaction. Every action we take on-chain is permanent, which means every decision to act must be made with certainty about what has already happened. In a traditional backend, the worst case of a double-write is usually some data inconsistency that you can reconcile later. In our case, the worst case is irreversible loss of user funds.
That asymmetry - between the forgiveness of off-chain systems and the permanence of on-chain actions - is the core tension of the entire architecture.
Durability
The first thing we rebuilt was durability. In v1, an order's state was an on-chain account field updated by program instructions. In v2, every order is modelled as an explicit state machine with transitions stored in the database.
Each state in the lifecycle - depositing, open, executing, execution succeeded, withdrawn, expired, etc. - is a durable record in the database, not a variable in memory. State transitions are enforced at the database level: the system literally cannot move an order from "open" to "withdrawn" without passing through the intermediate states. Invalid transitions are rejected by schema-enforced invariants, not by application logic that might have a bug.
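To make the idea of enforced transitions concrete, here is a minimal sketch in Go. It is not our production code: the real enforcement described above lives in the database schema, and the real table has many more states. The `Store` type, the state names, and the `Executing -> Open` retry edge are assumptions made for illustration. The in-memory map stands in for a table, and the compare-and-set inside `Transition` plays the role of a conditional update such as `UPDATE ... WHERE state = <from>`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// State is a hypothetical subset of the order lifecycle named in the text.
type State string

const (
	Depositing         State = "depositing"
	Open               State = "open"
	Executing          State = "executing"
	ExecutionSucceeded State = "execution_succeeded"
	Withdrawn          State = "withdrawn"
)

// allowed lists the assumed valid edges; anything not listed is rejected.
var allowed = map[State][]State{
	Depositing:         {Open},
	Open:               {Executing},
	Executing:          {ExecutionSucceeded, Open}, // Open again = reset for retry (assumed)
	ExecutionSucceeded: {Withdrawn},
}

var (
	ErrInvalidTransition = errors.New("invalid transition")
	ErrStale             = errors.New("order is no longer in the expected state")
)

// Store stands in for the database table that holds order state.
type Store struct {
	mu     sync.Mutex
	orders map[string]State
}

// Transition moves an order from one state to another only if the edge is
// allowed and the stored state still equals from. The second check is what
// stops two workers from claiming the same order.
func (s *Store) Transition(id string, from, to State) error {
	valid := false
	for _, next := range allowed[from] {
		if next == to {
			valid = true
			break
		}
	}
	if !valid {
		return ErrInvalidTransition
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.orders[id] != from {
		return ErrStale
	}
	s.orders[id] = to
	return nil
}

func main() {
	s := &Store{orders: map[string]State{"order-1": Open}}
	fmt.Println(s.Transition("order-1", Open, Withdrawn)) // rejected: skips intermediate states
	fmt.Println(s.Transition("order-1", Open, Executing)) // accepted: first worker claims the order
	fmt.Println(s.Transition("order-1", Open, Executing)) // rejected: the order was already claimed
}
```

The same shape covers both problems at once: an illegal jump fails on the edge check, and a lost race fails on the stale-state check.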
Any process can die at any point in time and no order disappears. No order gets stuck in an impossible state. When a service restarts, it reads the database, sees exactly where every order stands, and knows if it is safe to resume or reset. There's no reconstruction from memory, no replaying of logs, no hoping that the last checkpoint was recent enough.
Reliable reconciliation also required isolating balances. Earlier, orders sharing an input token shared a single token account. One bad RPC read on a shared account didn't just corrupt one order's balance; it cascaded across every order in that pool. To fix this, we introduced seeded token accounts, giving each order its own deterministic account created at deposit time. A misread can now only affect the one order it belongs to. This does expose individual order balances on-chain, where a shared account had blended them together. We accepted that tradeoff - the critical information, such as the trigger price, remains off-chain.
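For readers unfamiliar with deterministic accounts, the sketch below shows one way an address can be derived from a base key and a per-order seed. It assumes the "with seed" derivation that Solana exposes (SHA-256 over base key, seed, and owner program), which may or may not be the exact mechanism we use; the seed format `order-0001` and the placeholder keys are invented for the example. It omits the check that rejects owners ending in the program-derived-address marker, and prints hex rather than base58 to stay within the standard library.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// maxSeedLen is the seed length limit for seed-derived addresses on Solana.
const maxSeedLen = 32

// PublicKey is a raw 32-byte key.
type PublicKey [32]byte

// DeriveWithSeed hashes base || seed || owner to get a deterministic address.
// The same inputs always produce the same address, so it can be recomputed
// from the order record instead of being looked up.
func DeriveWithSeed(base PublicKey, seed string, owner PublicKey) (PublicKey, error) {
	if len(seed) > maxSeedLen {
		return PublicKey{}, errors.New("seed longer than 32 bytes")
	}
	h := sha256.New()
	h.Write(base[:])
	h.Write([]byte(seed))
	h.Write(owner[:])
	var out PublicKey
	copy(out[:], h.Sum(nil))
	return out, nil
}

func main() {
	var base, tokenProgram PublicKey
	base[0], tokenProgram[0] = 1, 2 // placeholder keys, not real addresses

	a, _ := DeriveWithSeed(base, "order-0001", tokenProgram)
	b, _ := DeriveWithSeed(base, "order-0002", tokenProgram)
	fmt.Println(hex.EncodeToString(a[:]))
	fmt.Println(hex.EncodeToString(b[:]))
}
```

Because each order maps to its own address, a bad balance read is scoped to exactly one order.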
The tradeoff is real. More database round-trips. More careful query design. State transitions that must be thought through rigorously - every edge case, every failure mode, and every possible interleaving of concurrent operations. The state machine for a single price order has nineteen distinct states, each with a defined set of valid transitions and a record of which service is authorised to perform each one. That's a lot of states for what is conceptually "user deposits tokens, system swaps them, user gets output." But each of those states exists because we found a failure mode that required distinguishing it from the states around it.
The payoff: durability that the on-chain program used to provide is now explicitly managed. And because it's explicit, it's auditable. Every state transition is logged. Every order has a complete history. When something goes wrong - and it will - we can trace exactly what happened and when. From a user's perspective: your order doesn't disappear if our system crashes. It picks up where it left off.
Atomicity
Durability tells us where an order stands. Atomicity tells us what actually happened on-chain, which is arguably a harder problem.
The key insight is that on Solana, a transaction's signature is deterministic - it's derived from the transaction content and the signer's key. Once the transaction is signed by the fee payer, the signature is known before it's ever submitted to the network. We use this property as an anchor for crash recovery.
Before submitting any transaction to the blockchain - whether it's an execution swap, a deposit confirmation, or a withdrawal - the system persists the transaction signature to the database. This happens before the transaction is broadcast. If the process crashes after signing but before submitting, the signature is in the database. If it crashes after submitting but before receiving confirmation, the signature is in the database. If the RPC node accepts the transaction but our connection drops before we get the response, the signature is in the database.
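The ordering is the whole point, so here it is as a small Go sketch. The `AttemptStore` and `Broadcaster` interfaces, the `Submit` function, and the fake implementations are hypothetical names for illustration; they stand in for the database and the RPC client.

```go
package main

import (
	"errors"
	"fmt"
)

// SignedTx is a hypothetical signed transaction. Its signature is known as
// soon as it is signed, before anything is sent to the network.
type SignedTx struct {
	Signature string
	Payload   []byte
}

// AttemptStore stands in for the database table of in-flight attempts.
type AttemptStore interface {
	SaveAttempt(orderID, signature string) error
}

// Broadcaster stands in for the RPC client that sends the transaction.
type Broadcaster interface {
	Send(tx SignedTx) error
}

// Submit records the signature durably, then broadcasts. If the write fails,
// nothing is sent. If the send fails or times out, the outcome is unknown,
// but the signature is on record for later reconciliation.
func Submit(store AttemptStore, rpc Broadcaster, orderID string, tx SignedTx) error {
	if err := store.SaveAttempt(orderID, tx.Signature); err != nil {
		return fmt.Errorf("persist signature: %w", err)
	}
	if err := rpc.Send(tx); err != nil {
		return fmt.Errorf("broadcast (outcome unknown, reconcile later): %w", err)
	}
	return nil
}

// memStore is an in-memory AttemptStore used only for this example.
type memStore map[string]string

func (m memStore) SaveAttempt(orderID, signature string) error {
	m[orderID] = signature
	return nil
}

// timeoutRPC simulates an RPC node whose response never reaches us.
type timeoutRPC struct{}

func (timeoutRPC) Send(SignedTx) error { return errors.New("rpc timeout") }

func main() {
	store := memStore{}
	err := Submit(store, timeoutRPC{}, "order-1", SignedTx{Signature: "sig-abc"})
	fmt.Println(err)
	fmt.Println(store["order-1"]) // the signature survives the failed broadcast
}
```

Note that a broadcast error is not treated as "the transaction failed". It only means we don't know yet.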
On recovery, the system takes each persisted signature and checks its status on-chain. The outcome falls into one of a few categories:
- Landed successfully - the transaction executed on-chain without error; advance the order's state
- Landed but failed - the transaction landed on-chain with an error status; reset for retry
- Never seen - the network has no record of it; may still be in flight
- Blockhash expired - Solana transactions include a recent blockhash that gives them a validity window of at most 151 blocks, roughly one minute at the time of writing. If the blockhash has expired and the chain has never seen the transaction, we know with certainty it can never land. That transaction is dead, and we can safely act on the order again.
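The four categories above can be sketched as a single decision function. This is an illustration under assumptions, not our recovery code: `TxStatus`, `Action`, and `Decide` are hypothetical names, and the inputs are assumed to have been read from the chain at a commitment level that won't be rolled back. `lastValidBlockHeight` mirrors the field of the same name that Solana's `getLatestBlockhash` RPC returns alongside a blockhash.

```go
package main

import "fmt"

// TxStatus is the result of looking up a persisted signature on-chain.
type TxStatus int

const (
	StatusNotFound  TxStatus = iota // the chain has no record of the signature
	StatusSucceeded                 // landed and executed without error
	StatusFailed                    // landed with an error status
)

// Action is what recovery should do next for the order.
type Action string

const (
	Advance       Action = "advance order state"
	ResetForRetry Action = "reset for retry"
	Wait          Action = "may still be in flight; check again later"
	SafeToRetry   Action = "transaction is dead; safe to act again"
)

// Decide maps an on-chain lookup to the next step. A transaction that was
// never seen is only declared dead once the chain has moved past the last
// block height at which its blockhash was valid.
func Decide(status TxStatus, currentHeight, lastValidBlockHeight uint64) Action {
	switch status {
	case StatusSucceeded:
		return Advance
	case StatusFailed:
		return ResetForRetry
	}
	if currentHeight > lastValidBlockHeight {
		return SafeToRetry
	}
	return Wait
}

func main() {
	fmt.Println(Decide(StatusSucceeded, 1000, 1100))
	fmt.Println(Decide(StatusFailed, 1000, 1100))
	fmt.Println(Decide(StatusNotFound, 1000, 1100))
	fmt.Println(Decide(StatusNotFound, 1200, 1100))
}
```

The important branch is the third one: "never seen" on its own is not permission to retry.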
This approach relies on us being the signer, but RFQ breaks this assumption - a market maker is the last signer, fee payer, and the party that broadcasts. We never hold the final signed transaction, so we can't persist its signature ahead of time. Instead, we persist the quote identifier before handing off the fill, and reconcile against the quote's status and the chain on recovery. The ambiguity is larger, and the latency profile is different when the final signature isn't ours to make, but the principle is the same.
The system never blindly retries. It never silently skips. It always reconciles with the chain's truth before deciding what to do next.
This is the mechanism that prevents double-execution. Before creating a new swap transaction for an order, the system first checks whether a previous attempt already exists and resolves it. There is no path through the code where a second transaction is created without first accounting for the first.
Idempotency
The on-chain program never executed the same order twice. It didn't need to be told not to. Off-chain, every retry is live ammunition.
Network drops, page reloads, impatient double-clicks - any of these can cause a client to re-submit a request that's already being processed. Every action in the system - deposits, executions, withdrawals - is associated with a traceable action identifier. If an action gets stuck in a pending state, the identifier ensures it can be resolved before any new action is allowed to proceed. A duplicate request doesn't create a duplicate action - it finds the existing one and continues it.
For deposits specifically, this matters more than it might seem. A deposit involves the user signing and submitting an on-chain transfer, then our API confirming receipt. If the user's client loses connectivity after submitting the transaction but before receiving our confirmation, the client will retry. When that retry arrives, the system checks whether a deposit with that identifier is already in progress. If it is, and the transaction is still unconfirmed, the system attempts to re-confirm it inline - checking the chain for the existing transaction rather than asking the user to sign a new one. The user doesn't wait for a background reconciliation job. They get an immediate answer.
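A stripped-down Go sketch of that retry path follows. The `Deposits` type, its `Submit` method, and the `isConfirmed` callback are hypothetical stand-ins for our API handler, the deposits table, and the chain lookup. For brevity the sketch holds a lock during the chain check, which a real service would likely avoid.

```go
package main

import (
	"fmt"
	"sync"
)

// Deposit is a hypothetical deposit record keyed by its action identifier.
type Deposit struct {
	ActionID  string
	Signature string
	Confirmed bool
}

// Deposits stands in for the deposits table plus a chain lookup.
type Deposits struct {
	mu       sync.Mutex
	byAction map[string]*Deposit
	// isConfirmed is an assumed callback that checks a signature on-chain.
	isConfirmed func(signature string) bool
}

// Submit is safe to call repeatedly with the same action identifier. A retry
// finds the existing record and re-checks the chain inline instead of
// creating a second deposit.
func (d *Deposits) Submit(actionID, signature string) (dep Deposit, created bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if existing, ok := d.byAction[actionID]; ok {
		if !existing.Confirmed {
			existing.Confirmed = d.isConfirmed(existing.Signature)
		}
		return *existing, false
	}
	rec := &Deposit{
		ActionID:  actionID,
		Signature: signature,
		Confirmed: d.isConfirmed(signature),
	}
	d.byAction[actionID] = rec
	return *rec, true
}

func main() {
	landed := false // flips to true once the transfer is visible on-chain
	d := &Deposits{
		byAction:    map[string]*Deposit{},
		isConfirmed: func(string) bool { return landed },
	}

	first, created := d.Submit("action-1", "sig-1")
	fmt.Println(created, first.Confirmed) // new record, not yet confirmed

	landed = true
	retry, created := d.Submit("action-1", "sig-1")
	fmt.Println(created, retry.Confirmed) // same record, now confirmed
}
```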
In practice: if your deposit appears stuck, retrying is safe. The system recognises the existing deposit and continues confirming it rather than creating a new one. You won't be charged twice. For other actions like withdrawals and executions, the system aims for the same guarantee, but these depend on RPC responses that can occasionally be unreliable - so the guarantees are best-effort rather than absolute.
Fault isolation
There's one more structural decision that makes all of this manageable: the trigger system and the executor are separate services.
- Trigger system - watches prices, decides which orders should execute, writes that decision to the database. Its job ends at marking an order as ready for execution.
- Executor - picks up orders marked for execution and carries them out: generate the swap, sign the transaction, submit it, confirm it, record the result.
These services communicate through database state and event queues. They share no memory and no direct connections. Either can fail independently. The trigger system can crash and restart without affecting in-flight executions - the executor doesn't care why an order was marked ready for execution, only that it was. The executor can crash and recover without re-triggering orders - the trigger system doesn't know or care about execution progress.
This separation is what makes each service's correctness properties locally verifiable. We can test the trigger system's logic in isolation: given these prices and these orders, does it correctly identify which ones should trigger? We can test the executor's recovery logic in isolation: given a crashed process with these persisted signatures, does it correctly reconcile with the chain? Neither test needs to account for the other service's behaviour, because by design, neither service depends on the other being up.
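As an example of what "testable in isolation" means for the trigger side, the decision can be written as a pure function of prices and orders. The `Order` fields, the `Above` direction flag, and `ShouldTrigger` are assumptions made for this sketch; prices are plain floats here only to keep it short.

```go
package main

import "fmt"

// Order is a hypothetical, trimmed-down view of what the trigger system needs.
type Order struct {
	ID         string
	Token      string  // token whose USD price is watched
	TriggerUSD float64 // trigger price in USD
	Above      bool    // true: fire at or above the trigger; false: at or below
}

// ShouldTrigger returns the IDs of orders whose trigger condition is met.
// It reads no database and makes no network calls, so a test only needs to
// supply prices and orders.
func ShouldTrigger(orders []Order, pricesUSD map[string]float64) []string {
	var ready []string
	for _, o := range orders {
		price, ok := pricesUSD[o.Token]
		if !ok {
			continue // no price for this token means no decision
		}
		if (o.Above && price >= o.TriggerUSD) || (!o.Above && price <= o.TriggerUSD) {
			ready = append(ready, o.ID)
		}
	}
	return ready
}

func main() {
	orders := []Order{
		{ID: "a", Token: "SOL", TriggerUSD: 200, Above: true},
		{ID: "b", Token: "SOL", TriggerUSD: 2000, Above: true},
		{ID: "c", Token: "SOL", TriggerUSD: 150, Above: false},
	}
	fmt.Println(ShouldTrigger(orders, map[string]float64{"SOL": 210}))
}
```

Everything after this decision, from signing to confirmation, belongs to the executor and is tested separately.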
Correctness as discipline
Even with all of this - the state machine, the signature anchoring, the deduplication, and the fault isolation - we've still had incidents. Double-execution scenarios where the chain confirmed a transaction but our service didn't see the confirmation in time and submitted a retry that also landed. Duplicate withdrawals caused by the same class of ambiguity. Each time, the root cause was some variation of the same theme: the gap between what happened on-chain and what our service believed had happened.
Those specific cases have been found and fixed. But the lesson we took from them is that correctness off-chain isn't a destination you arrive at. It's a discipline you maintain.
Every RPC timeout is a potential crack. Every ambiguous transaction status is a potential crack. Every new state in the state machine, every new edge case in the execution flow, every interaction between services that we didn't fully think through - potential cracks. We run continuous verification that cross-references our database state against on-chain reality. We expect that process to keep finding discrepancies, and we expect to keep fixing them, for as long as the system runs.
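In spirit, that verification is a diff between two views of the same orders. The sketch below is a deliberately simplified version: the `OrderRecord` and `Discrepancy` types, the `Verify` function, and the single rule it checks (the on-chain balance must match what the database expects) are all assumptions for illustration. The real checks cover more than balances.

```go
package main

import "fmt"

// OrderRecord is what the database believes about an order.
type OrderRecord struct {
	ID              string
	State           string
	ExpectedBalance uint64 // balance the order's token account should hold
}

// Discrepancy describes one mismatch between the database and the chain.
type Discrepancy struct {
	OrderID string
	Reason  string
}

// Verify compares each database record against the balance read from the
// chain and reports every order where the two disagree.
func Verify(db []OrderRecord, chainBalance map[string]uint64) []Discrepancy {
	var out []Discrepancy
	for _, o := range db {
		got, ok := chainBalance[o.ID]
		switch {
		case !ok:
			out = append(out, Discrepancy{o.ID, "no on-chain balance available"})
		case got != o.ExpectedBalance:
			out = append(out, Discrepancy{o.ID, fmt.Sprintf(
				"state %q expects balance %d, chain shows %d",
				o.State, o.ExpectedBalance, got)})
		}
	}
	return out
}

func main() {
	db := []OrderRecord{
		{ID: "order-1", State: "open", ExpectedBalance: 100},
		{ID: "order-2", State: "withdrawn", ExpectedBalance: 0},
	}
	chain := map[string]uint64{"order-1": 100, "order-2": 40}
	for _, d := range Verify(db, chain) {
		fmt.Println(d.OrderID, "-", d.Reason)
	}
}
```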
We took correctness away from the blockchain and made it our problem. That was the right decision - the privacy and performance gains are real. But we don't pretend it was free. The cost is perpetual vigilance, and we've chosen to pay it honestly rather than pretend the problem is solved.
Next in the series, Part 3 covers the conceptual shift that made all of this worth it: how rethinking what an order means, from barter ratios to prices, unlocked capabilities that the old model couldn't even express.