The Numbers That Matter

Published: February 7, 2026 · 8 min read
Tags: Crypto, trading, agent, win rate, profits

Let me start with the headline: +10.9% return in 8 days during a market that saw Bitcoin crash over 16%.

If that sounds too good to be true, let me give you the context. This is the story of Trader-7, an AI-powered trading system I've been building, and the week it finally clicked — right when the crypto market decided to have its worst week since the FTX collapse.


The Market From Hell

To understand what happened, you need to understand what we were trading through.

On January 29, 2026, the crypto market entered what can only be described as a bloodbath:

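  • Bitcoin dropped from ~$82,667 to ~$68,986, about 16.5% between those two levels, with the decline reaching as much as 21% depending on which high and low you measure from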
  • $5.42 billion in leveraged positions were liquidated
  • $817 million flowed out of Bitcoin ETFs in a single day
  • The Fear & Greed Index hit 12 — "Extreme Fear" (for reference, 50 is neutral)

This wasn't a dip you "buy." This was the kind of market that liquidates accounts, triggers margin calls, and sends retail traders running for the exits. Most AI trading bots — the kind that learn patterns from historical data — would have been caught completely off guard. Bull market patterns don't work in capitulation events.


Where We Started: Sprint 86 and the 30-Day Drought

On January 26, I was staring at logs for a trading system that hadn't executed a single trade in over 30 days.

Let that sink in. A trading system. That wasn't trading.

The irony? It was generating valid signals. The pipeline was working. But somewhere between "here's a good trade" and "execute the order," everything was breaking down.

Three blockers were working in concert:

  1. The LLM validator was rejecting mean-reversion trades because it wanted "reversal confirmation" we weren't providing
  2. 42% of signals were failing our risk-reward ratio checks
  3. Our regime classifier was overriding the strategist's recommendations
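
To make the second blocker concrete, here is a minimal sketch of a risk-reward gate. The function names and the example prices are hypothetical, and the 3:1 minimum is the figure quoted later in this post; whether that was the exact threshold in force at Sprint 86 is not stated.

```python
MIN_REWARD_TO_RISK = 3.0  # assumption: the 3:1 minimum quoted later in the post


def risk_reward(side: str, entry: float, stop: float, target: float) -> float:
    """Reward-to-risk ratio of a planned trade (hypothetical helper)."""
    if side == "LONG":
        risk, reward = entry - stop, target - entry
    elif side == "SHORT":
        risk, reward = stop - entry, entry - target
    else:
        raise ValueError(f"unknown side: {side!r}")
    if risk <= 0 or reward <= 0:
        raise ValueError("stop and target must sit on opposite sides of entry")
    return reward / risk


def passes_rr_check(side: str, entry: float, stop: float, target: float) -> bool:
    """True when the planned trade meets the minimum reward-to-risk ratio."""
    return risk_reward(side, entry, stop, target) >= MIN_REWARD_TO_RISK
```

A LONG at 100 with a stop at 98 and a target at 106 risks 2 to make 6, so it passes at exactly 3:1. A SHORT at 100 with a stop at 102 and a target at 95 comes out at 2.5:1 and would be rejected.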

Sprint 86 was titled "Unblock Trading" — and that's exactly what we did. One blocker at a time, with verification between each phase. By January 30, we had our first trade in a month.

And then the market decided to test everything we'd built.


The Sprint Journey: 86 through 94

What followed was an intensive 8-day debugging and optimization marathon. Each day brought new problems revealed by live trading conditions:

Sprint 87: Fixed a local regime direction bug that was causing misalignment between our market classification and trade direction.

Sprint 88: The real complexity hit here. Our multi-LLM consensus approach (where Claude and DeepSeek both evaluate trades) had shifted confidence distributions. Signals that used to come in at 85-92% were now arriving at 75-82%. Our leverage calculations, designed for the old distribution, were putting too much capital into each trade — leaving room for only one position at a time when we needed three.

Sprint 89: This one came from watching a SOL SHORT position sit at +$6 profit while the entire thesis underneath it crumbled. The AI kept holding because "it's profitable" — the classic disposition effect. We taught it to recognize the difference between "strongly profitable" (let it run) and "marginally profitable with an invalidated thesis" (get out).

Sprint 90: Paper trading accounting bug. Our stop-losses were recording at market price instead of stop price, making our P&L numbers wrong.
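
A minimal sketch of why this matters, assuming a paper-trading engine that only notices a triggered stop on its next price check. The `realized_pnl` helper and the prices are hypothetical, not Trader-7's actual code.

```python
def realized_pnl(side: str, entry: float, exit_price: float, qty: float) -> float:
    """Realized P&L of a closed position; side is "LONG" or "SHORT"."""
    if side not in ("LONG", "SHORT"):
        raise ValueError(f"unknown side: {side!r}")
    direction = 1.0 if side == "LONG" else -1.0
    return direction * (exit_price - entry) * qty


# Hypothetical LONG: entered at 100 with a stop at 98. By the time the engine
# checks prices, the market has already slipped to 97.
entry, stop_price, market_price, qty = 100.0, 98.0, 97.0, 2.0

buggy_pnl = realized_pnl("LONG", entry, market_price, qty)  # booked at market
fixed_pnl = realized_pnl("LONG", entry, stop_price, qty)    # booked at the stop

print(buggy_pnl, fixed_pnl)  # -6.0 -4.0
```

The two bookings differ by the gap between stop and market price times the position size, and that error compounds across every stopped-out trade in the P&L history.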

Sprint 91: The fee monitor was triggering emergency stops because our fee-to-profit ratio looked scary early in the month (when you haven't booked many profits yet, any fee looks huge). We redesigned it from "panic-based" to "information-based."

Sprint 92: Sharpe ratio calculation was wrong. Embarrassing, but important.

Sprint 93: The big one. We watched two LONG positions bleed for 9+ hours during a bearish regime because our signal generator kept outputting LONG signals. The regime was screaming "bearish" but DeepSeek wasn't flipping its direction. We built the "Regime Watchdog" — an autonomous protection system that monitors positions independent of signal generation. After N cycles of regime opposition, it tightens stop-losses to breakeven. After M more cycles with deteriorating P&L, it force-closes.

Sprint 94: Added a missing database column for A/B testing our exit strategies.


The Architecture That Survived the Crash

What made Trader-7 different from a typical trading bot? A few key design decisions:

Multi-LLM Consensus

Every trade gets evaluated by both Claude (Anthropic's model, used here as the strategist) and DeepSeek V3.2 (used for signal generation). They have to agree. This creates natural skepticism: two different models are less likely to be wrong in the same direction at the same time than either one alone.
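
The agreement rule can be sketched as a small gate. This is an illustration only: the `Opinion` type and `consensus` function are hypothetical names, and taking the lower of the two confidences is an assumption, since the post does not say how Trader-7 combines them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    direction: str     # "LONG", "SHORT", or "NONE"
    confidence: float  # 0.0 to 1.0


def consensus(signal: Opinion, strategist: Opinion) -> Optional[Opinion]:
    """Return the agreed trade, or None when the two models disagree."""
    if signal.direction == "NONE" or signal.direction != strategist.direction:
        return None
    # Assumption: be conservative and keep the lower of the two confidences.
    agreed_confidence = min(signal.confidence, strategist.confidence)
    return Opinion(signal.direction, agreed_confidence)
```

Under this rule a LONG from one model and a SHORT from the other produces no trade at all, which is the skepticism the design is after.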

Stance-Adaptive Thresholds

When the market regime is "defensive" (our classification based on BTC's distance from its 50-day moving average), our confidence threshold for new trades jumps to 82%. In "moderate" conditions, it's 72%. This means we're automatically more selective during uncertainty.
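
A rough sketch of how such a gate could look. The 82% and 72% thresholds come from the paragraph above; everything else is assumed for illustration, including the use of a simple moving average, the hypothetical `defensive_gap` cutoff of 5% below the average, and the function names.

```python
from statistics import fmean

# Thresholds quoted in the post.
CONFIDENCE_THRESHOLDS = {"defensive": 0.82, "moderate": 0.72}


def classify_stance(btc_daily_closes: list[float], defensive_gap: float = -0.05) -> str:
    """Classify stance from BTC's distance to its 50-day simple moving average.

    defensive_gap is a hypothetical cutoff; the post does not give the real one.
    """
    if len(btc_daily_closes) < 50:
        raise ValueError("need at least 50 daily closes")
    sma50 = fmean(btc_daily_closes[-50:])
    distance = (btc_daily_closes[-1] - sma50) / sma50
    return "defensive" if distance <= defensive_gap else "moderate"


def passes_confidence(confidence: float, stance: str) -> bool:
    """True when a signal's confidence clears the threshold for the stance."""
    return confidence >= CONFIDENCE_THRESHOLDS[stance]
```

With these numbers an 80% signal is tradable in moderate conditions and rejected in defensive ones.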

Two-Stage Regime Protection

The Regime Watchdog tracks how long each position has been misaligned with the global market regime. After 2-4 hours (depending on leverage), it moves stop-losses to breakeven. After 4-7 hours of continued deterioration, it force-closes. This caught several positions before they turned into full stop-loss hits.
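
The two stages can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not the actual watchdog: the leverage buckets in `stage_hours` are hypothetical (only the 2-4 hour and 4-7 hour ranges come from the post), both limits are treated as total time spent against the regime, and the timer resets when the regime realigns.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    HOLD = "hold"
    MOVE_STOP_TO_BREAKEVEN = "move_stop_to_breakeven"
    FORCE_CLOSE = "force_close"


@dataclass
class Position:
    side: str                        # "LONG" or "SHORT"
    leverage: float
    hours_opposed: float = 0.0       # time spent against the global regime
    pnl_deteriorating: bool = False  # set by whatever tracks P&L trend


def stage_hours(leverage: float) -> tuple[float, float]:
    """Hypothetical mapping: higher leverage gets a shorter fuse."""
    if leverage >= 5:
        return 2.0, 4.0
    if leverage >= 3:
        return 3.0, 5.5
    return 4.0, 7.0


def opposes(side: str, regime: str) -> bool:
    return (side, regime) in {("LONG", "bearish"), ("SHORT", "bullish")}


def watchdog_step(pos: Position, regime: str, elapsed_hours: float) -> Action:
    """Advance the misalignment timer and decide what to do with the position."""
    if not opposes(pos.side, regime):
        pos.hours_opposed = 0.0  # assumption: realignment resets the timer
        return Action.HOLD
    pos.hours_opposed += elapsed_hours
    breakeven_after, close_after = stage_hours(pos.leverage)
    if pos.hours_opposed >= close_after and pos.pnl_deteriorating:
        return Action.FORCE_CLOSE
    if pos.hours_opposed >= breakeven_after:
        return Action.MOVE_STOP_TO_BREAKEVEN
    return Action.HOLD
```

The point of the design is that `watchdog_step` never consults the signal generator, so a model that keeps emitting LONG signals in a bearish regime cannot keep a losing position open.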

Thesis-Aware Exits

The position reevaluator doesn't just ask "am I profitable?" It asks "does my entry thesis still hold?" A +0.5% gain with an invalidated thesis gets closed. The profit is noise. The edge is gone.
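
In code, the decision might look something like the sketch below. The 1.5% "strongly profitable" threshold is the current value mentioned later in this post. The function name is hypothetical, and the handling of losing positions with an invalidated thesis (closing them) is an assumption the post does not spell out.

```python
STRONGLY_PROFITABLE_PCT = 1.5  # current threshold mentioned later in the post


def reevaluate(pnl_pct: float, thesis_valid: bool) -> str:
    """Return "HOLD" or "CLOSE" for an open position (hypothetical helper)."""
    if thesis_valid:
        return "HOLD"   # the reason for entering still stands
    if pnl_pct >= STRONGLY_PROFITABLE_PCT:
        return "HOLD"   # strongly profitable: let it run
    # Marginal gain (or a loss) with an invalidated thesis: the edge is gone.
    return "CLOSE"
```

So `reevaluate(0.5, thesis_valid=False)` returns "CLOSE", which is the +0.5% case described above, while the same position at +2.0% would be left to run.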


The Final Scorecard

After 8 days of live paper trading through the worst market conditions in recent memory:

Metric              Value
Starting Capital    $3,000
Ending Capital      ~$3,327
Total P&L           +$327.40
Return              +10.9%
Win Rate            46.2%
Sharpe Ratio        5.80
Total Trades        26

For context, if you had simply held Bitcoin during this period, you'd be down 16.5%.

Our outperformance: +27.4 percentage points.
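
The scorecard arithmetic is easy to check. The snippet below only recomputes the figures from the table; the note that 46.2% matches 12 wins out of 26 trades is an inference from the rounded numbers, not something the logs are quoted as saying.

```python
def pct_return(starting_capital: float, pnl: float) -> float:
    """Return on starting capital, in percent."""
    return pnl / starting_capital * 100.0


strategy_return = pct_return(3000.00, 327.40)
btc_return = -16.5  # buy-and-hold figure quoted above
outperformance = strategy_return - btc_return

print(round(strategy_return, 1))  # 10.9
print(round(outperformance, 1))   # 27.4
print(round(12 / 26 * 100, 1))    # 46.2, consistent with 12 wins in 26 trades
```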


What the Numbers Don't Tell You

The win rate is 46.2% — below 50%. That's not a typo.

In trading, you don't need to be right most of the time. You need your wins to be bigger than your losses. Our risk-reward enforcement (3:1 minimum, often higher) means we can be wrong more than half the time and still make money.
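
The reasoning can be made explicit with a little arithmetic. This is an idealized sketch: it assumes every winner earns exactly the reward-to-risk multiple, every loser loses exactly one unit of risk, and fees are zero, none of which holds exactly in practice. The function names are illustrative.

```python
def breakeven_win_rate(reward_to_risk: float) -> float:
    """Win rate at which expected profit is zero for a given reward:risk."""
    return 1.0 / (1.0 + reward_to_risk)


def expectancy_in_r(win_rate: float, reward_to_risk: float) -> float:
    """Expected profit per trade, measured in units of risk (R)."""
    return win_rate * reward_to_risk - (1.0 - win_rate)


print(breakeven_win_rate(3.0))                # 0.25
print(round(expectancy_in_r(0.462, 3.0), 3))  # 0.848
```

At 3:1 the break-even win rate is 25%, so a 46.2% win rate leaves a wide margin in this idealized model. Real results will sit lower, because winners closed early by the exit logic and fees both eat into the expectancy.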

The Sharpe ratio of 5.80 is interesting. Anything above 2.0 is considered excellent. Above 3.0 is exceptional. 5.80 suggests our returns are remarkably consistent relative to the volatility we're taking on. That said, 26 trades over 8 days is a small sample. I expect this to normalize over time.
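
For reference, here is one common way to compute an annualized Sharpe ratio from a series of periodic returns. The post does not say how Trader-7 computes its figure (per trade or per day, annualized or not), so the function below and its `periods_per_year=365` default are assumptions for illustration only.

```python
import math
from statistics import fmean, stdev


def sharpe_ratio(returns: list[float], risk_free: float = 0.0,
                 periods_per_year: int = 365) -> float:
    """Annualized Sharpe ratio from per-period returns (sample std dev)."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    excess = [r - risk_free for r in returns]
    spread = stdev(excess)
    if spread == 0:
        raise ValueError("returns have zero variance")
    return fmean(excess) / spread * math.sqrt(periods_per_year)
```

With only eight daily observations, both the mean and the standard deviation are estimated very loosely, and the annualization factor magnifies whatever error is in their ratio. That is one concrete reason to treat a short-window Sharpe figure with caution.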


What I Learned

Bug hunting in live trading is brutal. Every bug has real consequences — positions held too long, exits triggered too early, opportunities missed. There's no "we'll fix it in the next release" when money is on the line.

LLMs as trading components are genuinely different. They don't just pattern match — they reason about market conditions, adapt to regime changes, and explain their decisions. The logs are readable because the AI is explaining itself at every step.

Multi-agent systems need coordination protocols. Having Claude and DeepSeek work together required explicit handoffs, confidence calibration, and fallback handling. When one model is unavailable, the system needs to know what to do.

The disposition effect is real, and AIs aren't immune. Without Sprint 89's thesis-aware exits, the system would hold marginally profitable positions with invalidated theses. Profit isn't the same as edge.

Regime awareness is everything in crypto. BTC leads. When BTC's regime turns bearish, every LONG position is at risk. The Regime Watchdog (Sprint 93) was the single most impactful addition.


What's Next

Trader-7 is still paper trading. The next phase is transitioning to live trading with real capital — starting small and scaling based on continued performance.

The architecture is stable. The protection systems are working. The multi-LLM consensus is generating quality signals. Now it's about building confidence through continued operation and gathering more data.

Some questions I'm still exploring:

  • Should the regime watchdog thresholds be dynamic based on volatility?
  • Is there value in partial position closes instead of binary keep/close decisions?
  • How do we tune the "strongly profitable" threshold (currently 1.5%) based on market conditions?

The Bottom Line

Building an AI trading system is not about creating something that's always right. It's about creating something that manages risk intelligently, adapts to changing conditions, and compounds small edges over time.

For 8 days spanning late January and early February 2026, while the crypto market was in free fall, Trader-7 did exactly that.

The journey from Sprint 86 (a system that couldn't trade) to Sprint 94 (a system that made +10.9% during a -16.5% crash) wasn't pretty. It involved late-night debugging sessions, false starts, rollbacks, and more "wait, that's not right" moments than I can count.

But that's what building something real looks like.


Trader-7 is an experimental LLM-powered trading system currently in paper trading mode. Past performance is not indicative of future results. This is not financial advice.
