
What Building a Trading Bot Taught Me About Building Trading Bots

Published: May 4, 2026 · 10 min read
#trading #building #systems #trader-7
[Image: a vintage analogue control panel with one dial subtly misaligned from the others]
The system kept lying to itself about what it was doing.

Across roughly two hundred trades, my bot has rarely been wrong about the market in ways that felt like genuine misreading. It has been wrong about itself.

Which fields existed on which models. Which rule was actually firing. Which gate was doing the work. Which model the LLM router was actually calling. Whether a sprint had even shipped to production. The market did what the market does. The system kept lying to itself about what it was doing in response.

The single most expensive class of mistake hasn't been "the strategy doesn't work." It's been "we shipped something believing it would do X, and it silently did Y, and we didn't notice for weeks."

A trading system is a feedback loop. A feedback loop with broken instrumentation is worse than no system at all. It generates confident decisions on hallucinated data.

This is what I want to write down, because I think it's the lesson that will outlive this particular bot. If you're building any kind of automated decision system, trading or otherwise, the dominant bug class isn't going to be the one you expected. It's going to be the system being unable to tell you the truth about itself.


Five costumes, one bug

Five recurring failure patterns came up across seven months of building. They look different on the surface. They're the same underneath.

Phantom field references. Code referencing fields that didn't exist on the data model. Sprint 99.1 referenced five trailing-stop fields the Trade model didn't have; silent getattr defaults masked it. Sprint 99.3 used opened_at when the column was actually created_at; a silent try/except masked it. Sprint 135 referenced trade.risk_amount and trade.quantity; neither existed. Same shape every time. Code written against an imagined schema. Python's permissiveness hid the bug. Behaviour silently degraded.
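Here's a minimal sketch of that shape, permissive read versus strict read. The names are hypothetical (only risk_amount comes from the sprints above), and the strict helper is an illustration of the discipline, not the bot's actual code:

```python
from dataclasses import dataclass

@dataclass
class Trade:
    symbol: str
    entry_price: float
    # note: no risk_amount field exists on this model

trade = Trade(symbol="BTC/USDT", entry_price=64_000.0)

# The permissive read: the phantom field silently becomes a default, and
# downstream sizing logic quietly degrades instead of failing.
risk = getattr(trade, "risk_amount", 0.0)  # always 0.0, never an error

# The strict read: fail loudly at the boundary instead.
def require(obj, field):
    if not hasattr(obj, field):
        raise AttributeError(f"{type(obj).__name__} has no field {field!r}")
    return getattr(obj, field)

try:
    risk = require(trade, "risk_amount")
except AttributeError as exc:
    print(exc)  # surfaces the phantom field in seconds, not weeks
```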

Removing rules without data. Sprint 120 removed the _MIN_SMA50_DISTANCE rule because it "felt restrictive." The cost: $456 on long positions across 41 trades over 29 days. Long-side win rate collapsed to 14.3%. Nobody had run a query asking what the rule had actually been doing before it was removed. The same instinct shows up in every "let's loosen this gate" proposal that wasn't preceded by a counterfactual.
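For concreteness, a hedged sketch of the counterfactual that should have preceded the removal, assuming closed trades were logged with the feature the rule reads. The field names, threshold, and sample data here are invented:

```python
# What was this gate actually doing? Replay it over closed trades.
MIN_SMA50_DISTANCE = 0.02  # illustrative threshold, not the real value

def rule_would_block(trade):
    return trade["sma50_distance"] < MIN_SMA50_DISTANCE

def counterfactual(closed_trades):
    blocked = [t for t in closed_trades if rule_would_block(t)]
    passed = [t for t in closed_trades if not rule_would_block(t)]
    return {
        "blocked_count": len(blocked),
        "blocked_pnl": sum(t["pnl"] for t in blocked),
        "passed_pnl": sum(t["pnl"] for t in passed),
    }

closed_trades = [
    {"sma50_distance": 0.01, "pnl": -12.0},
    {"sma50_distance": 0.05, "pnl": 8.0},
]
# If blocked_pnl is deeply negative, the rule was earning its keep:
# removing it admits exactly those trades.
print(counterfactual(closed_trades))
```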

Design collisions between sprints. Sprint 130 introduced a cluster cap that was mathematically incompatible with the position sizer from Sprint 41. The result: a seven-day window where the system literally couldn't open any position. Each sprint was internally coherent. Together they produced a system that had no legal moves. Nobody had modelled the interaction.
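The check is cheap once you think to run it. A sketch, with invented numbers standing in for the real sizer and cap:

```python
# Two independently sane constraints can leave no legal position size.
def feasible_size_range(min_size_from_sizer, max_size_from_cluster_cap):
    """Return the legal (lo, hi) interval, or None if the constraints collide."""
    lo, hi = min_size_from_sizer, max_size_from_cluster_cap
    return (lo, hi) if lo <= hi else None

# One sprint's sizer insists on at least $500 per position; another
# sprint's cluster cap limits exposure to $300. Each is coherent alone;
# together the system has no legal moves.
print(feasible_size_range(500.0, 300.0))  # None -> the collision, caught pre-ship
```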

Confidence in opaque infrastructure. The LLM router was silently downgrading opus-4.6 to opus-4.5 for months. The investigation took five phases to surface it. Every reasoning trace, every decision, every backtest interpretation during that window was running on a model the system thought it wasn't using. Same shape. Trusting the layer below without verifying the layer below was doing what it claimed.
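The generic defence is to compare what you asked for with what the response claims you got. A hedged sketch, assuming the provider echoes the served model in its payload (many LLM APIs do); call_router is a hypothetical stand-in for the real routing layer:

```python
def call_router(prompt, model):
    # stand-in for the real router; imagine the provider's response payload
    return {"model": "opus-4.5", "text": "..."}  # the silent downgrade

def verified_completion(prompt, requested_model):
    response = call_router(prompt, requested_model)
    served = response.get("model")
    if served != requested_model:
        raise RuntimeError(f"router served {served!r}, requested {requested_model!r}")
    return response["text"]

try:
    verified_completion("...", "opus-4.6")
except RuntimeError as exc:
    print(exc)  # catches in one call what took five phases to surface
```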

Marking work complete before verifying it landed. Sprints declared "shipped" before checking the deploy actually rolled out. Migration registries that didn't include the latest migration. Tasks marked done based on agent reports rather than filesystem checks.
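The fix is observing the artifact, not the report. A hedged sketch, assuming the service exposes a version endpoint that echoes the git SHA it was built from (a common pattern; the URL and field name here are hypothetical):

```python
import json
import subprocess
import urllib.request

def local_head():
    # the commit we believe we shipped
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def deployed_sha(url="https://bot.example.com/version"):
    # the commit production says it is actually running
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["git_sha"]

def verify_deploy():
    expected, actual = local_head(), deployed_sha()
    if actual != expected:
        raise RuntimeError(f"deployed {actual[:8]}, expected {expected[:8]}")
```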

Five different costumes. One bug.

Every one of these is the same failure mode: acting on assumptions about state without verifying state. Phantom fields assume the model matches the database. Rule removal assumes the rule wasn't doing useful work. Design collisions assume sprints don't interact. Router confidence assumes the configured model is the running model. Premature task completion assumes "agent said done" equals "is done."

This isn't a list of bugs to fix. It's the dominant bug class in any system complex enough to have parts that talk to each other. The verify-before-change discipline isn't a nice-to-have. It's the answer to most of what was actually going wrong. Every time we skipped it we paid four figures or four weeks. Every time we ran it we caught something cheap.


Why systems lie to themselves

I think about this now as the central problem of building software that touches money.

The system isn't trying to deceive you. It's that every layer makes assumptions about every other layer, and the assumptions are usually right, and when they're wrong they fail silently because making them fail loudly is expensive and slow. Python lets a getattr with a default read attributes that don't exist and hand back None. SQL lets you query a column that no longer matches your code's mental model and get empty results. LLM routers let you ask for one model and get another. Each of these conveniences is, in normal use, a kindness. In a system making automated trading decisions, each kindness is an attack surface.

What changes when you finally take this seriously is that "is this code correct?" stops being the right question. The right question is "can I prove this code is doing what I think it's doing, on the actual data, in the actual environment, right now?" Those are not the same question. The first is about the code. The second is about the system. And almost every expensive mistake has been a confusion between the two.

The fix is not more careful coding. The fix is instrumentation. You have to build the system so that it can be honest with itself, and then you have to actually look at what it's telling you. Both halves matter. A system with no observability can't be honest. A system with observability that nobody reads might as well not have it.


What started working

Three things got built in the last two months that I'd call genuine wins. None of them are strategy code.

Verify-before-change tooling. A set of slash commands the system uses on itself before making changes. /hypothesis-test runs counterfactuals before a rule gets removed. /sprint-review checks for design collisions before a sprint ships. /regression-scan checks the deployed system actually matches the intended system. These exist because the system finally noticed it kept making the same class of mistake. They are unglamorous. They are the highest-leverage thing built in months.

The watchdog and risk plumbing. A regime watchdog that intervenes early when conditions shift. A position re-evaluator that holds positions through local noise but exits on global regime change. ATR floors on stops. Maximum hold times. Liquidation safety. The risk management stack is genuinely doing its job. When the system loses, it loses small. That isn't trading edge. It's the precondition for trading edge — the difference between losing money slowly enough to learn and losing money fast enough to blow up.
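As one concrete piece of that stack, here's what an ATR floor on a long stop looks like in miniature. The multiplier and prices are illustrative, not the bot's actual parameters:

```python
def floored_stop(entry_price, proposed_stop, atr, k=1.5):
    """Clamp a long stop so the distance entry - stop is at least k * ATR."""
    min_distance = k * atr
    proposed_distance = entry_price - proposed_stop
    distance = max(proposed_distance, min_distance)
    return entry_price - distance

# A strategy proposing a 50-point stop on a 64,000 entry with ATR = 400
# gets widened to a 600-point stop instead of dying to ordinary noise.
print(floored_stop(64_000.0, 63_950.0, 400.0))  # 63400.0
```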

MFE/MAE instrumentation. Maximum Favourable Excursion and Maximum Adverse Excursion: how far a trade went in your favour before it closed, and how far it went against you. Sprint 138 finally wired these into the trade record. For the first time we can distinguish "stop too tight" from "TP too greedy" in the data, instead of arguing about it on instinct. The plumbing itself is unglamorous. It pays for itself the first time you have to make a stop-versus-target call with evidence rather than vibes.
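The definitions fit in a few lines. A minimal sketch for a long trade, assuming you log the price path while the position is open:

```python
def mfe_mae(entry_price, prices):
    """Excursions for a long trade: (max favourable, max adverse), in price units."""
    mfe = max((p - entry_price for p in prices), default=0.0)
    mae = min((p - entry_price for p in prices), default=0.0)
    return max(mfe, 0.0), min(mae, 0.0)

# A trade entered at 100 that ran to 104, sagged to 97, and closed at 101:
print(mfe_mae(100.0, [100.0, 104.0, 97.0, 101.0]))  # (4.0, -3.0)
```

That's the whole trick: if losers routinely show large MFE before closing red, the target is too greedy; if stopped trades routinely show MAE barely past the stop, the stop is too tight.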

Collectively these say something I didn't expect to be the headline lesson of seven months: the system has gotten markedly better at being honest with itself. The strategy hasn't found edge yet. But the operational layer is now actually trustworthy. That is the precondition for everything else.


What this means for anyone building

If I were starting again I'd build the instrumentation before the strategy. I didn't, and most people building bots don't, because instrumentation is boring and strategy is exciting. The result is months of confident decisions made on data the system was unable to verify it was actually generating. By the time you notice, the loss isn't the money. It's the time spent reasoning from corrupted evidence about whether the strategy works.

Three things I'd tell anyone building an automated trading system, or any system that takes actions in the world without a human in the loop:

Assume the dominant bug class is the system lying to itself, not the strategy being wrong. Build for that. Most of the failure modes I saw weren't novel. They were the same failure mode in different costumes. Recognising the pattern earlier would have saved weeks.

Verify state before acting on it. This is one rule. Phantom field references, opaque infrastructure, design collisions, premature completion — all of them are violations of one rule. If you do nothing else, do that.

Treat instrumentation as production code. The slash commands and audit tools that finally caught these failures aren't tooling. They are the production system. The strategy code rides on top of them. Without the instrumentation, the strategy code is making decisions you can't verify and shouldn't trust.

The trading edge question is still open in my system. I don't know yet whether this strategy makes money in any sustained way, in any regime. That's the next post. But the operational question is closed. The system can now tell me the truth about what it's doing, so that I can make honest decisions about whether it works. That's what seven months of building actually bought. It is not what I thought I was buying. It turned out to be the more important thing.


This is the third post in a series on retail crypto trading in the AI era. The first two, The House Always Wins and But Retail Can Still Play, laid out the structural argument. This one is the operational reckoning. The next post turns from what the bot taught me about itself to what it taught me about the market.
