Now the Real Data Starts

Trader-7 has won its last five trades. A hundred percent win rate. By the all-time numbers it has been profitable for seven months: roughly a hundred trades, 33% win rate, +$910 P&L on a $3,000 starting balance, Sharpe 1.23.
By the honest read, I do not yet know what my system does.
Five trades is not a sample. The all-time 33% is not a measurement either. It is an average across three different versions of the system, separated by a regression I shipped in April that took weeks to surface. The clean window of operating data I can actually trust is six days old. All five trades closed in that window are positive, two of them showing the full trail-protection lifecycle for the first time since the strategy baseline reset.
This is the post I have been heading toward across the rest of this series. Posts one and two laid out why retail can't win the funds' game and how to start finding the one you can. Post three was about the operational discipline a system needs before its outputs can be trusted. This one is the next step in that argument: even with a system that tells the truth, you still need a long enough stretch of stability for what it is telling you to mean something.
Seven months of tinkering bought me the system. The stable run starts now.
What 33% all-time is actually averaging
The naive read of Trader-7's seven months is straightforward. A hundred trades. 33% win rate. +$910 P&L. Sharpe 1.23. Those are not headline-grabbing numbers, but they're not bad. By the standards of most retail trading bots they would count as a quiet success.
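For anyone who wants the mechanics behind those numbers, they fall straight out of the trade ledger. A minimal sketch of the computation, assuming a list of per-trade P&L values and a series of daily returns; the field names, and the crude 365-day Sharpe annualisation, are my illustration rather than Trader-7's actual reporting code:

```python
from statistics import mean, stdev

def headline_stats(trade_pnls, daily_returns, periods_per_year=365):
    """Win rate and net P&L from closed trades, plus a naive annualised Sharpe
    from daily returns. Illustrative only; the real ledger schema will differ."""
    wins = sum(1 for p in trade_pnls if p > 0)
    return {
        "win_rate": wins / len(trade_pnls),
        "net_pnl": sum(trade_pnls),
        "sharpe": (mean(daily_returns) / stdev(daily_returns)) * periods_per_year ** 0.5,
    }
```

Which is the point: every one of those figures is only as meaningful as the window of trades you feed it.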
The problem is that the average is hiding three different systems.
For the first 114 trades, ending 19th April, the system ran at a 41% win rate. The long side was working. Capital had grown to nearly $4,700.
Then I shipped Sprint 120. It removed a rule called _MIN_SMA50_DISTANCE because the rule felt restrictive. The rule had been quietly filtering out long entries where price was too close to the 50-period moving average: entries near a likely mean-reversion zone where the momentum thesis was weakest. Removing it let those entries through.
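For the curious, the rule amounted to something like the sketch below: refuse a long entry when price sits within some minimum distance of the 50-period SMA. The threshold and function names here are stand-ins, not the real implementation.

```python
_MIN_SMA50_DISTANCE = 0.015  # stand-in threshold: 1.5% above the SMA, not the real value

def sma(closes, period=50):
    """Simple moving average of the most recent `period` closes."""
    return sum(closes[-period:]) / period

def long_entry_allowed(price, closes):
    """Block long entries sitting too close to the 50-period SMA, i.e. near the
    mean-reversion zone where the momentum thesis is weakest."""
    sma50 = sma(closes)
    return (price - sma50) / sma50 >= _MIN_SMA50_DISTANCE
```

Unremarkable code. The value was never in the arithmetic; it was in the entries the arithmetic quietly refused.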
The cost: $456 across 41 trades over 29 days. Long-side win rate collapsed to 14.3%. The rule had been doing useful work. I had not run a query asking what it was doing before it was removed.
The deeper failure wasn't the call itself. It was that I wasn't the one looking at the rule when it got pulled.
I'm running three to five repos in parallel these days, and I have been increasing the amount of work I delegate to agents. The agent that proposed removing the rule didn't run a counterfactual either. Nobody asked the obvious question — what is this rule actually doing in the live system? — because the system that should have forced that question wasn't built yet. I had the agents. I didn't have the scaffolding to hold them to evidence.
Deming had a line for this: "In God we trust; all others must bring data." He meant it about manufacturing. It applies to agents. The verify-before-change discipline isn't just about being careful with your own commits. It's about building scaffolding that makes agents surface evidence before any change ships, regardless of how confident they sound. Sprint 120 was the moment that missing layer turned into money.
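The scaffolding I was missing does not have to be clever. Something as blunt as the sketch below, attached to any change that removes an entry rule, would have forced the question. The proposal log and field names are assumptions about the schema; the shape of the check is what matters.

```python
def removal_evidence(proposals, rule_blocks):
    """Evidence that must ship with 'remove rule X': over recent entry proposals
    (filled and paper-tracked alike), how often did the rule fire, and what did
    the entries it blocked go on to do on paper? `outcome_pnl` is assumed to be
    tracked for rejected proposals as well as fills."""
    fired = [p for p in proposals if rule_blocks(p)]
    passed = [p for p in proposals if not rule_blocks(p)]

    def stats(ps):
        wins = sum(1 for p in ps if p["outcome_pnl"] > 0)
        return {"n": len(ps),
                "win_rate": wins / len(ps) if ps else None,
                "total_pnl": sum(p["outcome_pnl"] for p in ps)}

    return {"blocked_by_rule": stats(fired), "let_through": stats(passed)}
```

Nothing in there is sophisticated. The discipline is refusing to merge a rule removal, mine or an agent's, that arrives without that table attached.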
What followed was sprint after sprint of recovery. Sprint 136 fixed the long-side floor that had collapsed. Sprint 137 caught a silent model downgrade where two-thirds of my Opus 4.7 calls were being served by Opus 4.5. Sprint 139 fixed a structural collision between the position sizer and the cluster cap that had produced a seven-day window where the system literally couldn't open any position. Two correctly-designed components were mathematically incompatible at typical volatility, and nothing got through the gate.
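To make the Sprint 139 class of failure concrete, here is a stylised version of how two individually sensible gates can jointly reject everything. Every number below is invented for illustration; the real components and parameters were different.

```python
def risk_based_size(balance, risk_pct, stop_distance_pct):
    """Sizer: notional chosen so that hitting the stop loses risk_pct of the balance."""
    return (balance * risk_pct) / stop_distance_pct

def within_cluster_cap(notional, cluster_exposure, balance, cap_pct):
    """Cluster cap: exposure per asset cluster may not exceed cap_pct of the balance."""
    return cluster_exposure + notional <= balance * cap_pct

# Stylised numbers: at typical volatility the stop sits ~2% away, so a 1% risk
# budget implies a position worth 50% of the balance, which a 25% cluster cap
# can never accept. Every proposal fails the gate and nothing trades.
balance = 3_000
size = risk_based_size(balance, risk_pct=0.01, stop_distance_pct=0.02)              # 1,500
print(within_cluster_cap(size, cluster_exposure=0, balance=balance, cap_pct=0.25))  # False
```

Each gate is defensible in isolation. The incompatibility only exists at the intersection, which is why it could sit there silently for a week.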
If you average wins from a 41% working system, losses from a regression I caused, and a recovery period across four major fixes, you get 33%. That number is not the system. It is an average of the system, the bug, and the climb back.
When the real clock starts
The clean window begins on 30th April, the day Sprint 139 shipped and trades started flowing again. Trade 251 opened six minutes after the redeploy.
That gives me six days of clean operating history. Five closed or partially closed trades. All five positive. The first three closed on the 48-hour maximum-hold ceiling without reaching their profit targets. The most recent two, Trades 254 and 255, hit their first profit targets, banked partial profits, and activated their trailing stops. That trail machinery had been unverified end-to-end since the baseline reset eight days earlier. Now it has fired twice with the right math on both occasions.
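The lifecycle those two trades exercised looks, in outline, like the sketch below. The 50% partial, the trail percentage, and the field names are stand-ins; the sequence is the thing that had not been observed end-to-end until this week: hit the first target, bank a partial, arm a trail behind the high-water mark.

```python
def on_price_update(position, price):
    """Trail-protection lifecycle, sketched for a long position: bank a partial at
    the first target, then trail the remainder behind the high-water mark."""
    if not position["target_hit"] and price >= position["first_target"]:
        position["target_hit"] = True
        banked = position["units"] * 0.5                     # bank half at the first target
        position["units"] -= banked
        position["realised"] += banked * (price - position["entry"])
        position["trail_high"] = price                       # arm the trail from here
    if position["target_hit"] and position["units"] > 0:
        position["trail_high"] = max(position["trail_high"], price)
        trail_stop = position["trail_high"] * (1 - position["trail_pct"])
        if price <= trail_stop:
            position["realised"] += position["units"] * (price - position["entry"])
            position["units"] = 0                            # remainder closed by the trail
    return position

position = {"entry": 100.0, "units": 10.0, "first_target": 104.0,
            "trail_pct": 0.02, "target_hit": False, "realised": 0.0}
for px in (101.0, 104.5, 106.0, 103.8):   # hits the target, trails up, then stops out
    position = on_price_update(position, px)
```

The logic is not the interesting part. The interesting part is that the realised numbers it produced on Trades 254 and 255 matched the per-unit arithmetic in the spec.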
Five is not a sample. Five winners can come from a coin-flip strategy on a good week; any given run of five is a one-in-32 shot on a fair coin. Anyone who has spent ten minutes on a trading forum has seen a screenshot of three or five green trades captioned by someone calling themselves a quant.
But the clean clock starts somewhere, and it has started here. I want any future claim about whether Trader-7 has edge to rest on a sample drawn from the working system, not on the broken-system average being dragged up by a recent green streak.
What the data tells me, accurately: the structural fix from Sprint 139 is sound. The trail-stop fix shipped on the 4th has now been observed end-to-end on two trades, with the per-unit risk arithmetic working as specified. The risk-management gates are firing consistently. The system runs unattended through migrations, deploys, and trade closes without incident. The operational layer is doing its job.
What the data doesn't tell me:
Win rate: five trades is meaningless. Edge versus bull-market beta: unknown. The crypto basket rallied across the period: Bitcoin moved from $73K to $82K, Ethereum up roughly 2%, Solana up 6%. Drift alone could explain positive P&L on a fifty-fifty strategy in a market that has been moving in the same direction as my long-side bias.
That last one is the question I most want answered, and I cannot answer it yet from inside this sample. I need ten clean trades to start having a conversation. Twenty before I'd treat any pattern as real. The sample is now growing usefully. It is not yet large enough to mean what a casual reader of the recent results might assume it means.
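Once the sample is big enough to interrogate, the first check on that question is crude: subtract the basket's own move over the window from whatever the strategy returned. A sketch, with the weights and the 5% strategy return as placeholders rather than measurements:

```python
def basket_drift(moves, weights):
    """Weighted buy-and-hold return of the basket across the window."""
    return sum(weights[sym] * moves[sym] for sym in weights)

# Moves roughly as in the window above (BTC ~ $73K -> $82K, ETH ~ +2%, SOL ~ +6%);
# the weights and the 5% strategy return are placeholders for illustration.
moves = {"BTC": 82_000 / 73_000 - 1, "ETH": 0.02, "SOL": 0.06}
weights = {"BTC": 0.5, "ETH": 0.3, "SOL": 0.2}
drift = basket_drift(moves, weights)
print(f"drift={drift:.1%}, excess={0.05 - drift:+.1%}")   # excess is what edge has to explain
```

A positive P&L that does not clear that bar is exposure, not edge. Ten or twenty clean trades is when that subtraction starts to be worth doing with real numbers.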
The discipline of staying still
Post three was about the operational discipline a system needs to be honest with itself. This post is about the discipline that comes after that.
The system can now tell me the truth about what it's doing. It can't yet tell me what the market is doing in response, because the data the system has produced is contaminated by my own iteration. Every meaningful change in the strategy stack since November has shifted the regime the system was trading in, which means I have no continuous run long enough to read what the strategy actually does in any single regime.
The fix for that isn't another change. It's the absence of changes. A long enough stretch of stability that the data starts to mean something.
This is the harder version of the patience point I made in post two. Refusing the funds' game means choosing to play one where time horizon and lived experience and adaptation are real edges. But all of those edges only show up if you stay still long enough for them to be measurable. The fund analyst who rotates every two years can't accumulate context. The retail trader who tinkers every two weeks can't accumulate data. Both lose the same edge for the same reason.
I have been the second of those for seven months. I had reasons. The system needed the iteration to become trustworthy. Sprint 120 needed Sprint 136 to fix what 120 broke. The Sprint 139 collision had to be found. The model downgrade had to be caught. Each sprint earned its place. Together they produced a system that finally tells the truth, and a dataset that mostly doesn't.
Now the harder thing. Don't change the strategy stack. Let it run. Watch what the next ten, then twenty, then fifty trades look like under stable conditions. Let the data be a measurement instead of a moving target.
This is the discipline most retail traders never get to. They quit during the recovery from a regression they caused. They keep tweaking through what would otherwise be a clean window. They never accumulate enough continuous operating data to know what their system actually does, so they end up arguing about strategy on instinct rather than evidence. The trader who can stay still long enough to read the signal has access to information the impatient version of themselves doesn't.
What I'm doing next
Wait for ten clean trades. Then run a hypothesis test on first-target distance against realised volatility. If trades keep closing on max-hold without reaching their targets, the targets are wrong for the regime, not the system. If the next ten close with a similar profile to the persisted-rejection cohort, that's evidence the no-edge hypothesis is real and bull-market drift has been doing the work. (Across the same period, 113 paper-tracked rejected proposals went 0-for-0.)
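In practice that test starts as a description rather than a p-value: for each clean trade, measure how far the first target sat from entry relative to the favourable move the market actually offered during the hold. The ledger field names below are assumptions.

```python
from statistics import median

def target_vs_realised(trades):
    """Ratio of first-target distance to the realised favourable move during the
    hold, per trade (sketched for long positions). Ratios persistently above 1
    say the targets are asking for more than the regime is giving; the targets
    are wrong, not the system."""
    ratios = []
    for t in trades:
        target_distance = t["first_target"] - t["entry"]
        favourable_move = t["max_price_during_hold"] - t["entry"]
        if favourable_move > 0:
            ratios.append(target_distance / favourable_move)
    return {
        "n": len(ratios),
        "median_ratio": median(ratios) if ratios else None,
        "target_reached_share": sum(1 for r in ratios if r <= 1.0) / len(ratios) if ratios else None,
    }
```

With ten trades that output is a description. Somewhere past twenty, a simple sign test on whether the ratio sits above 1 more often than chance starts to be worth running.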
The right answer to "is the system viable?" is currently "I don't have enough clean data to say." That answer is itself the discipline. It's the answer I would have skipped six months ago, in favour of more confident readings of an average that wasn't a measurement.
The seven months bought me the system that can tell the truth, and the discipline to wait until the truth is readable. Now the real data starts. Game on.
This is the fourth post in a series on retail crypto trading in the AI era. The first three, "The House Always Wins", "But Retail Can Still Play", and "What Building a Trading Bot Taught Me About Building Trading Bots", laid out the structural argument and the operational reckoning. The next post will turn from "the data we don't have yet" to what to do with the data once it lands: how to evolve a trading system from one that survives to one that finds the game retail can actually win.