How to Debug a Black Box

I have a side project that runs five large language models in parallel and makes them argue with each other about whether to buy a stock.

I’ve been building it for about three months, on evenings and weekends. It runs around the clock on a mini PC in my living room, trading on a demo account because I am not a complete idiot. There are five panel models with deliberately different personalities (a Bull, a Bear, a Value role, a Momentum role and an Analyst), each voting BUY or HOLD on every candidate. A coordinator weighs the votes and decides whether to execute. About $2 a day in API spend.

It is, by any normal definition, a piece of software. But almost none of my usual debugging instincts work on it. The behaviour doesn’t live in the code. It lives in the prompts, and the prompts produce outputs that look plausible whether they’re right or wrong. The system can be quietly broken for weeks while every dashboard glows green.

Here is the moment I realised that.

You Did What?

Sixty-one. Sixty-one consecutive BUY votes from the Bull model. Across forty-odd trading days, every single candidate the panel looked at, the Bull had voted BUY. Not 60 out of 61. Not 59. All sixty-one. The Momentum model was the same shape in reverse, sixty-two HOLDs in a row, no exceptions.

Two of my five panel models were rubber stamps. They’d been rubber stamps for weeks. The only reason I noticed was that I’d added a per-role vote tally to the daily dashboard the night before, mostly out of curiosity.

The system was working. Or rather, it was passing all its tests, executing trades, sending Telegram alerts, and producing what looked like a healthy distribution of decisions. Three of the five models were doing real work and carrying the entire panel. The other two were noise dressed as a vote, and I’d never noticed.

This is a story about how you find that out. Not from reading the code. The code was fine. From reading what the code said.

Why You Can’t Debug This Like Software

Traditional debugging assumes three things.

The code is the system. Read the code, you understand the behaviour.

Inputs are deterministic. Same input, same output.

Failures are loud. A crash, a stack trace, a failed test.

An LLM-heavy system breaks all three.

The prompt is the system. The code is just plumbing around prompts. You can change nothing in the Python, rewrite a single sentence in a system prompt, and watch the behaviour flip a hundred and eighty degrees.

Inputs are live market data, news headlines, stochastic model outputs. Nothing is reproducible. A bug that fires once might never fire again in the same form. Or it might fire on seventy-one of the next hundred runs, each time in a slightly different guise.

Failures are silent. The output looks plausible. The JSON parses. The trade executes. The alert goes out. Everything is green. Meanwhile the system is quietly making the same wrong decision every hour, all day, and you have absolutely no idea.

Unit tests passed through every single failure in this article. All 437 of them. The tests weren’t wrong, exactly. They were just answering a different question than the one I needed to ask.

The One Thing That Makes It Possible

You need to peek inside the black box.

Every LLM call writes its full prompt, full response, model name, and token count to a plain text file. One file per day. Seven days of history on disk. Surfaced through a dashboard tab so I can download any day with two clicks. No structured logging. No OpenTelemetry. No vector database. Just a file that looks like an email archive, one entry after another.
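A minimal sketch of what a logger like that could look like. The directory name, the entry layout and the function names are assumptions made for illustration, not the project's actual code.

```python
from datetime import date, datetime, timedelta
from pathlib import Path

LOG_DIR = Path("transcripts")  # hypothetical location
KEEP_DAYS = 7                  # seven days of history on disk


def log_call(role, model, prompt, response, tokens, log_dir=LOG_DIR, now=None):
    """Append one LLM call to the plain-text file for the current day."""
    now = now or datetime.now()
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{now:%Y-%m-%d}.txt"
    entry = (
        f"=== {now.isoformat(timespec='seconds')} | role={role} "
        f"| model={model} | tokens={tokens}\n"
        f"--- PROMPT\n{prompt}\n"
        f"--- RESPONSE\n{response}\n\n"
    )
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)
    return path


def prune_old(log_dir=LOG_DIR, keep_days=KEEP_DAYS, today=None):
    """Delete daily files older than keep_days; return the names removed."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days - 1)
    removed = []
    for path in sorted(log_dir.glob("*.txt")):
        try:
            file_day = date.fromisoformat(path.stem)
        except ValueError:
            continue  # not one of the daily transcript files
        if file_day < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Appending to one file per day keeps the format readable in any text editor, which matters more here than query speed.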

That is the whole infrastructure.

Without it, a black box. With it, I can read exactly what the Bull said about Apple at 9:17 yesterday. What the Bear said back. What the coordinator concluded and why. I can read the next forty candidates after that, and the forty before, and start to notice things no single transcript would ever show me.

You can’t step through an LLM. You can’t set a breakpoint inside a language model. But you can save the receipts. The prompt in, the text out. The corpus of what the system actually said becomes the system of record. The code is no longer the ground truth. The transcripts are.

Keeping seven days of receipts costs about what it costs to keep the dashboard running. This is not a grand observability budget. It’s a phone-bill observability budget. Which is the point.

The Audit

Every morning, the same process. Coffee, dashboard, transcripts.

I skim yesterday’s file. Not all of it, maybe ten per cent at random (I’m flattering myself), plus anything the overnight metrics flagged. I’m looking for shape, not content. Does it feel like yesterday? Are the responses the right length? Is anyone repeating themselves?

I count votes by role. How did each of the five panel models vote across yesterday’s forty to sixty candidates? What does the distribution look like? If the Bull is at 90% BUY, that’s interesting. If the Bear is at 90% HOLD, that’s interesting. If the Analyst has gone completely silent on a whole category of stock, that’s interesting too.
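The per-role count can be sketched in a few lines. The input shape (a list of role and vote pairs), the function names and the minimum sample size are assumptions for illustration. The 90% threshold mirrors the figure mentioned above.

```python
from collections import Counter, defaultdict


def tally_by_role(votes):
    """votes: iterable of (role, vote) pairs -> {role: Counter of votes}."""
    tally = defaultdict(Counter)
    for role, vote in votes:
        tally[role][vote] += 1
    return dict(tally)


def lopsided_roles(tally, threshold=0.9, min_votes=20):
    """Return {role: (dominant_vote, share)} for roles at or above threshold."""
    flagged = {}
    for role, counts in tally.items():
        total = sum(counts.values())
        if total < min_votes:
            continue  # too few votes to call it a pattern
        vote, n = counts.most_common(1)[0]
        share = n / total
        if share >= threshold:
            flagged[role] = (vote, share)
    return flagged
```

A flag is a prompt to open the transcript and read, not a verdict that the role is broken.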

I cross-reference with outcomes. Every stock the system analysed has an end-of-day price. The trades that closed have a P&L. I line up the verdicts against reality. Did the BUY calls go up? Did the HOLDs stagnate? Are there stocks we rejected that took off without us?

I look for absences. Which Telegram alerts was I expecting yesterday that didn’t arrive? Which scheduled jobs silently did nothing? Alerts not firing is more dangerous than alerts firing wrongly. The missing ones don’t show up in any dashboard.
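One way to make absences visible is to write down what a normal day should contain and compare it with what was actually recorded. The event names and minimum counts below are hypothetical.

```python
from collections import Counter

# Hypothetical expectations for one day: event name -> minimum occurrences.
EXPECTED_DAILY = {
    "market_open_scan": 1,
    "daily_summary": 1,
    "hourly_heartbeat": 8,
}


def find_absences(expected_min, observed_events):
    """Return {event: (expected_minimum, actual_count)} for every shortfall."""
    counts = Counter(observed_events)  # missing names count as 0
    return {
        name: (minimum, counts[name])
        for name, minimum in expected_min.items()
        if counts[name] < minimum
    }
```

An empty result means everything expected showed up. Anything else is a job that silently did nothing, or did less than it should have.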

Then I pick one anomaly, open the transcript, and read the actual prompts and responses for three or four examples. This is where eighty per cent of the discoveries happen.

It’s closer to reading a weekly report than debugging. You’re not looking for a broken line of code. You’re looking for someone who’s been giving you wrong answers and quietly hoping nobody would notice.

Three Things I Found This Way

The Rubber Stamp

The symptom. 61 consecutive BUYs from the Bull. 62 consecutive HOLDs from the Momentum model. Two coin-flips that always came up heads, or always came up tails, depending on which role you asked.

The dig. All five panel roles used the same underlying model. Same API, same weights. The Analyst, Value and Bear were behaving normally, distributions that looked like opinion rather than reflex. The only thing different between the rubber-stampers and the real voters was the system prompt.

I read the Bull’s prompt. There was a line I’d written weeks earlier, when I was worried the Bear was going to suppress everything: “default BUY, one signal is enough.” That sentence had collapsed the role into pure affirmation. I read the Momentum prompt and found an AND-gate. RSI in range AND volume surge AND MACD cross AND price above moving averages AND tech score above 45. Almost every real-world candidate failed at least one of those clauses, which meant mandatory HOLD. A truth table with no path to true.

The fix. Rewrote both prompts. Scoring models in place of hard thresholds. The Bull now had to weigh evidence rather than wave it through. The Momentum role now had a six-dimension score it could reason about, not a five-clause veto.

The lesson. The prompt is the bug more often than the model. Before blaming the LLM, look at what you told it to do. Same model, different instructions, wildly different behaviour. The thing you wrote at 11pm a few weeks ago to fix one problem is the thing that’s quietly broken a different part of the system today.
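The Momentum gate lived in prompt text, not in code, but its logic is easy to illustrate in Python. This is a sketch of the difference between a five-clause veto and a score. The signal names follow the clauses listed above, while the weights and the BUY cut-off are made up for the example.

```python
def and_gate(signals):
    """Hard veto: every clause must hold, or the answer is HOLD."""
    return all(signals.values())


def evidence_score(signals, weights):
    """Soft evidence: add up the weights of the clauses that hold."""
    return sum(weights[name] for name, present in signals.items() if present)


def decide(signals, weights, buy_at=60):
    """BUY when the weighted evidence reaches the (hypothetical) cut-off."""
    return "BUY" if evidence_score(signals, weights) >= buy_at else "HOLD"


# Hypothetical candidate: four of the five clauses hold, one does not.
candidate = {
    "rsi_in_range": True,
    "volume_surge": True,
    "macd_cross": False,
    "above_moving_averages": True,
    "tech_score_above_45": True,
}

# Equal weights summing to 100. Purely an assumption for illustration.
WEIGHTS = {name: 20 for name in candidate}
```

Under the gate this candidate is an automatic HOLD. Under the score it carries 80 points of evidence and the single failed clause is one input among several.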

The Poisoned Mirror

The symptom. Over a week-long window, the panel started HOLDing almost everything. Win rates on the few executed trades were getting worse. The models seemed to be getting more cautious by the day. Nothing in the code had changed.

The dig. A few weeks earlier, I’d built a feature that fed each panel model a summary of the system’s recent track record so they could calibrate. “Your last 30 trades had a 32.7% win rate.” It seemed sensible. Self-aware. The kind of thing that would help the models make better decisions.

I read a few transcripts. The models were literally citing the number in their reasoning. “Given the 32.7% win rate, I am defaulting to HOLD on a weak signal.”

The problem was that 32.7% was the execution win rate. Trades that closed green after stops, defensive sells, and timing noise. The pick accuracy, whether the system’s BUY calls went up by end of day, was around 54%. The models were being told the system was much worse than it actually was, and were reasoning accordingly. I had built a feedback loop that was steadily poisoning its own inputs.

The fix. Changed the injected stat. Led with “54% of your BUY calls went up by end of day.” Clearly labelled execution win rate as affected by stop-outs and defensive timing. Same information, different framing. Win rate recovered within a week.

The lesson. Anything you feed back into the prompt becomes part of the system’s self-image. Be careful which metrics you surface. A number that means one thing to you means something else to a language model reading it cold. The feedback loop you built to make the system smarter is the same feedback loop that can quietly make it stupid, and the difference between the two is a single line of framing.
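The gap between the two numbers is easier to see when they are computed side by side. This sketch assumes a simple record per BUY call. The field names, the helper functions and the exact wording of the injected line are illustrative, not the project's own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuyCall:
    symbol: str
    eod_change_pct: float          # move from call time to end of day
    realised_pnl: Optional[float]  # None if no trade was executed and closed


def share(flags):
    """Fraction of True values; refuses to invent a rate from no data."""
    flags = list(flags)
    if not flags:
        raise ValueError("no data to compute a rate from")
    return sum(flags) / len(flags)


def pick_accuracy(calls):
    """Share of BUY calls where the price was up by end of day."""
    return share(c.eod_change_pct > 0 for c in calls)


def execution_win_rate(calls):
    """Share of closed trades that finished with a positive P&L."""
    return share(c.realised_pnl > 0 for c in calls if c.realised_pnl is not None)


def track_record_line(calls):
    """Lead with pick accuracy and label what the execution figure includes."""
    return (
        f"{pick_accuracy(calls):.0%} of your BUY calls went up by end of day. "
        f"Execution win rate was {execution_win_rate(calls):.0%}, a figure that "
        "also reflects stop-outs and defensive timing."
    )
```

Both functions read the same records. Which one leads the sentence is the framing decision.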

The Pattern Nobody Could See

The symptom. The multi-day retrospective, the page that tracks what happened to every analysed stock three and five days after the system looked at it, started flagging a cluster. Eight stocks in a three-week window had gone up 15 to 36% after the system rejected them. The common thread was the vote shape. Every single one was a 1/5 panel, the Bull alone in favour, the other four roles voting HOLD. A vote pattern that was trivially easy to dismiss as low conviction.

The dig. Those weren’t random. The Bull alone, against the crowd, was catching low-profile setups the other roles were bailing on because the tech score sat in the 30s and low 40s. I pulled a dozen Analyst transcripts in a row and the pattern was unmistakable. The Analyst was implicitly treating tech score as a hard threshold of around 50. Reject anything below, regardless of what the other dimensions said. Not because I’d written that rule. Because the prompt left enough room for the model to invent it, and once it had, it stuck.

The fix. Rewrote the Analyst prompt to use an explicit six-dimension scoring model: trend, momentum, volume, sentiment, risk-to-reward, structure. Validated by replaying yesterday’s real transcripts through the new prompt before deploying it. BUY rate moved from 33% to 59% on the same candidates, with the previously-missed winners now getting through.

The lesson. Aggregate patterns live in places no single transcript can show you. The eight-winner cluster was invisible in any one day. You had to zoom out three weeks and ask the right question. What trait is shared by the trades we regret not taking? This is the thing LLM observability has to support, and no off-the-shelf product does it yet. You build it yourself or you don’t see it.
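The zoomed-out question can be asked of the retrospective data directly. A minimal sketch, assuming each rejected candidate is stored with its count of BUY votes and its later gain. The record keys and function name are hypothetical, and the 15% floor echoes the cluster described above.

```python
from collections import defaultdict


def missed_winners_by_shape(rejected, min_gain_pct=15.0):
    """Group rejected candidates that later rose, keyed by BUY votes out of 5.

    rejected: iterable of dicts with 'symbol', 'buy_votes' and 'gain_pct'.
    """
    groups = defaultdict(list)
    for record in rejected:
        if record["gain_pct"] >= min_gain_pct:
            groups[record["buy_votes"]].append(record["symbol"])
    return dict(groups)
```

If one vote shape holds most of the regrets, that shape is the next thing to read transcripts for.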

What These Have In Common

The prompt is the bug. All three fixes lived in the prompt, not the code. If you’re building LLM systems, your version-controlled unit of behaviour isn’t a function. It’s a markdown string. Write changelogs for prompt rewrites the way you would for schema migrations.

Silent failures dominate. Bugs that crash are easy. Bugs that produce a plausible-looking wrong answer (a 61/61 vote that looks like conviction, a HOLD that quotes a number back at you, an Analyst quietly inventing a threshold you never wrote) are what the audit is actually there to catch. If everything looks “fine”, look harder.

The signal is in the aggregate. You can’t find 61/61 BUYs by reading one transcript. You find them by counting across thousands. This is why the audit is a daily batch job, not a real-time alerting system. Some bugs are only visible when you zoom out a week.

Systems can poison themselves. Once you’re feeding metrics back into prompts (win rates, past decisions, advisories), you’ve closed a loop that can drift. The poisoned mirror wasn’t a mistake in any one place. It was a subtle miscalibration between what I meant and what the model read. Loops amplify mistakes. Assume they will.

Observability compounds. Every audit reveals the next blind spot. Catching the rubber stamps led me to build the per-role BUY-rate table. That table is what made the Analyst-too-strict pattern visible a week later. That led to the replay-against-yesterday’s-transcripts tool, which is now how I validate every future prompt change before I ship it. Each layer makes the next one cheaper.

What This Means If You’re Building With LLMs

Log the prompts. Log the responses. Keep them for a week. It’s the single best thing you can do. A few hours of work catches bugs that would take months otherwise.

Build your audit ritual before you build your tenth feature. You will not find these bugs with tests. Every one of the failures in this article passed its unit tests.

Treat prompts as the system of record. Version-control them with care. The prompt is where the behaviour lives.

Replay, don’t re-roll. When you change a prompt, run yesterday’s real inputs through the new version before you ship. Compare the decisions side by side. It is the closest thing LLM engineering has to a regression test.
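A replay harness can be very small. In this sketch the two decide functions are plain-Python stand-ins for "call the model with the old prompt" and "call the model with the new prompt". The thresholds inside them, the input fields and the function names are all hypothetical.

```python
def replay(recorded_inputs, old_decide, new_decide):
    """Return (symbol, old_decision, new_decision) for each recorded input."""
    return [
        (item["symbol"], old_decide(item), new_decide(item))
        for item in recorded_inputs
    ]


def summarize(rows):
    """BUY rate under each version, plus the symbols whose decision changed."""
    if not rows:
        raise ValueError("nothing to replay")
    n = len(rows)
    return {
        "old_buy_rate": sum(old == "BUY" for _, old, _ in rows) / n,
        "new_buy_rate": sum(new == "BUY" for _, _, new in rows) / n,
        "flipped": [sym for sym, old, new in rows if old != new],
    }


# Stand-ins for real model calls; a real harness would send the recorded
# inputs to the model with each prompt version and parse the vote.
def old_prompt_decide(item):
    return "BUY" if item["tech_score"] >= 50 else "HOLD"


def new_prompt_decide(item):
    return "BUY" if item["tech_score"] >= 35 else "HOLD"
```

Because model outputs are stochastic, a single replay is a sample, so the flipped list is the part worth reading by hand.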

Watch for what didn’t happen. Build a short list of things you expect (alerts, rejections, trades) and notice the absences. What’s missing tells you more than what’s present.

The black box isn’t actually black. You just have to decide that the logs are the system, not the code. Everything else follows.

The views expressed in this article are my own and do not represent the views of my employer.