Twinkle Toes, or: How I Learned to Stop Worrying and Let the Robots Argue

I have never been to a horse race. I do not gamble. My entire betting career is a handful of small win-bets on a Saturday, and the Betfair Exchange terrifies me. So naturally I spent a weekend building an autonomous AI horse-racing tipster and left it running for three months on a small computer in my living room.

It is called ‘Twinkle Toes’, which is a deep-cut Peppa Pig reference and exactly the sort of name you give a thing when you want to be certain you never take it seriously. It costs pennies to run. It has, to date, won and lost an entirely imaginary fortune.

That last part matters, so I will say it plainly and once. Every penny here is paper money. The prices are real, the races are real, and the bets are make-believe; the whole point is to let the variance wash out somewhere a real wallet cannot reach. I am reckless. I am not stupid. Not financially stupid, anyway. Not yet.

Day one, or the crime scene

The first scan went out on a Saturday in March, and within a few hours Twinkle Toes had placed twenty bets, almost every one of them an insult to the concept of betting.

It backed races that had already finished, because nobody had explained time to it. It backed horses that had been withdrawn, because the data filed the runners and the non-runners under the same heading. It recorded most of its prices as zero, because the course names never quite matched. At one point I sat watching a race trying to pick out our horse, and slowly realised our horse was not there. It had been pulled three hours earlier. The system had staked money, imaginary money, on a horse that in any meaningful sense did not exist.

This is the point at which a sensible person tears the whole thing up. I have never been that person.

‘Golden Week’, or the legend that held up

By the next morning the worst of the bugs were patched and Twinkle Toes was betting on races yet to happen. The first one won. A horse called National Question, priced at 7.6, up £165.

Then it lost five in a row.

By the 25th of March the bankroll had slipped below where it started, down to £698. Then Golden Week happened.

The 26th made £403. The 27th made £286. The 28th made £754, most of it off two longshots priced at 9.6 that landed within an hour of each other. The 30th made another £490. A system that had started with a grand, and had been underwater five days earlier, closed the 30th of March on £2,691 and drifted up to a peak of £2,710 by the 7th of April.

So when I tell people Twinkle Toes made about seventeen hundred quid in its opening fortnight, that is not a fisherman’s number. It is the plain gap between the thousand pounds I started with and the peak it reached. If anything I undersell it, because from the low on the 25th to the high on the 30th it put on close to two thousand pounds in six days.

Which is exactly why the number is also a trap. Was that an edge, or two longshots having a good day at the same time while the favourites behaved themselves? I spent the next two months trying to answer that, and the reason I never could is the rest of this article.

Instead of leaving a working thing alone, I took it apart to see how it ran and broke most of what was working in the process. Every time I got close to testing the Golden Week cleanly, I had already changed something underneath it. I will come to that.

The reason you let a machine do this, by the way, is the part the machine cannot do. A person who turns a grand into nearly three in a fortnight either cashes out to protect the legend or bets the lot trying to repeat it. Twinkle Toes can do neither. It cannot retire. It cannot chase. It runs the next scan.

The crash, or what April charged me for March

The sugar rush did not last. April was the bill arriving for March. Two lessons stand out, and both are funny in the way that only your own stupidity is ever funny.

The first: the market was cleverer than my model. Twinkle Toes had a value gate that compared its own idea of a horse’s true odds against the market price, and only bet where it thought it had found value. Sensible. In practice, the bets it threw away were winning far more often than the bets it let through. I had built a machine for discarding winners. I switched the gate off and had my best day in a week. Betfair’s prices, set by people with real money and strong opinions, beat my model by a clean margin. Humbling, but the useful kind.
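For anyone who wants the shape of a value gate, it fits in a few lines. This is a sketch under assumptions, not the code Twinkle Toes ran: the function names, the `min_edge` threshold and the use of decimal odds are all illustrative.

```python
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds, e.g. 4.0 implies 0.25."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


def passes_value_gate(model_probability: float,
                      market_odds: float,
                      min_edge: float = 0.0) -> bool:
    """Hypothetical gate: let a bet through only if the model sees value.

    Expected profit per unit staked at decimal odds is p * odds - 1,
    where p is the model's own estimate of the win probability.
    """
    expected_value = model_probability * market_odds - 1.0
    return expected_value > min_edge
```

A gate like this is only as good as `model_probability`. If the market's implied probability is the better estimate, the gate ends up filtering on the model's errors, which is one plausible reading of a machine that discards winners.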

The second was better. The original system was a single panel that argued with itself in five roles: a form analyst, a speed reader, a market watcher, a conditions expert, and a Sceptic whose entire job was to find reasons not to bet. The Sceptic was voting to back every single horse. Every one. It was not a sceptic. It was a rubber stamp.

It got interesting when I cut the numbers properly. The bets where the panel was unanimous, the strongest signals on paper, lost money. The bets where the Sceptic broke ranks and argued made money. The whole edge of the system was hiding in the one vote I had quietly trained it to swallow. I rebuilt the Sceptic to start from neutral, and I wrote a line in the diary that week I have not managed to improve on since: when an AI agrees with everything, it adds nothing.

Five systems, or five ways of being wrong

That panel was the whole system, once. After April I stopped trying to perfect it and built four more beside it, each working a completely different way. There is the original arguing panel. There is a model that reads eleven years of past results as cold statistics. There is a pair of models that only bet when they independently agree. There is a second panel that waits for the near-final market before it speaks. And there is one that fires only when two of the others land on the same horse without conferring. Five systems, five different ways of being wrong, and a bet goes on only when at least three of them, working separately, pick the same horse.
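The voting rule is simple enough to sketch. What follows assumes each system reports either a horse's name or nothing for a given race; the system labels, the `consensus_pick` name and the data shape are invented for illustration and are not taken from the real code.

```python
from collections import Counter
from typing import Optional


def consensus_pick(picks: dict[str, Optional[str]],
                   min_agree: int = 3) -> Optional[str]:
    """Return the horse backed by at least min_agree systems, else None.

    picks maps a system label to the horse it chose, or None if it abstained.
    """
    votes = Counter(horse for horse in picks.values() if horse is not None)
    if not votes:
        return None
    horse, count = votes.most_common(1)[0]
    return horse if count >= min_agree else None
```

With five voters and a threshold of three, at most one horse can qualify in any race, so no tie-break is needed.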

One of the five turned out to drag down nearly every bet it agreed on, so in May it was quietly benched. Four vote now.

Three ghosts in the machine

Three, because they are the best horror stories the project has given me. Two of them quietly attacked the one thing the whole system depends on, that the five systems are genuinely different. The third had nothing to do with the committee and everything to do with refusing to die, and I could not leave it out.

The first was a well-meaning routine that audited the results every Saturday and wrote a short note on what had gone wrong. The note was then fed, silently, into the panel’s prompts. The Sceptic read it as a list of things to be nervous about, and over a single week it talked itself almost entirely out of betting. The system was using its own anxious summary of its own performance to gradually lose its nerve, which is roughly what therapy is, except with no human in the room. I fixed it. The audit ran again the following Saturday and re-poisoned the well by Monday. The permanent fix is now an all-caps comment in the code, which is the programming equivalent of a sign reading do not feed the Sceptic.

The second is worse. The code that ran the model panels cached them by role name, so whichever system asked first after a restart got that role wired to its own model and kept it. Which meant, and I read this three times before I believed it, that for a fortnight two of the systems whose entire reason to exist was to be independent had been the same model wearing two different hats. The disagreement the whole system was built on had quietly collapsed, and nothing in the output looked wrong. It took a database query to catch.
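The bug is easier to see in miniature. This sketch reconstructs the failure mode from the description above and is not the original code: the system labels, model names and function names are all made up.

```python
# Hypothetical mapping: each system is meant to run on its own model.
MODEL_FOR_SYSTEM = {"panel_a": "model-x", "panel_b": "model-y"}


def build_agent(system: str, role: str) -> dict:
    """Stand-in for wiring a role to the model its system should use."""
    return {"role": role, "model": MODEL_FOR_SYSTEM[system]}


_cache_by_role: dict = {}


def get_agent_buggy(system: str, role: str) -> dict:
    """Cache keyed by role alone: the first system to ask wins the role."""
    if role not in _cache_by_role:
        _cache_by_role[role] = build_agent(system, role)
    return _cache_by_role[role]


_cache_by_system_and_role: dict = {}


def get_agent_fixed(system: str, role: str) -> dict:
    """Cache keyed by (system, role): each system keeps its own model."""
    key = (system, role)
    if key not in _cache_by_system_and_role:
        _cache_by_system_and_role[key] = build_agent(system, role)
    return _cache_by_system_and_role[key]
```

If `panel_a` asks for its sceptic first, the buggy version hands `panel_b` the same `model-x` agent afterwards, and nothing raises an error. The fixed version returns `model-y` for `panel_b`.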

The third. One morning I set out to settle the Golden Week question for good by building a replica and re-running March’s exact setup. The replica placed five bets, lost four, and then its executor died and lay there for twenty-three days while the rest of the project carried on around the corpse. I only found out because a mysterious nine-dollars-a-week line had appeared on a billing dashboard, which turned out to be the dead replica quietly burning credits to produce nothing whatsoever. Total damage, thirty-one dollars. The Golden Week question remains officially unanswered. The legend survives precisely because every attempt to verify it has fallen over, and I have made my peace with that.

Where it is now, or the virtue of mostly disagreeing

The current era is the dullest and, by far, the most interesting.

Twinkle Toes now bets on about seven per cent of races and turns its nose up at the other ninety-three. And buried in the results is the most useful thing the project has produced: more agreement is not better. The bets where three systems agree make money. The bets where four agree lose it. All five agreed exactly once, back when all five still voted, and that bet lost too. The edge lives in three. Four is greedy, five is a committee talking itself into something, and a committee does its best work when it is mostly but not unanimously agreed, which has been true of every one I have ever sat on.
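The cut behind that finding is a plain group-by: add up the profit for each level of agreement. A minimal sketch, assuming each settled bet is stored as a pair of how many systems agreed and what it returned. The field layout is hypothetical, and the rows are invented to show the mechanics.

```python
from collections import defaultdict


def profit_by_agreement(bets):
    """Sum profit per agreement count.

    Each bet is a (systems_agreeing, profit) pair; the layout is hypothetical.
    """
    totals = defaultdict(float)
    for systems_agreeing, profit in bets:
        totals[systems_agreeing] += profit
    return dict(sorted(totals.items()))


# Made-up rows for illustration only, not Twinkle Toes results.
example = [(3, 40.0), (3, -10.0), (4, -10.0), (4, -10.0)]
print(profit_by_agreement(example))  # {3: 30.0, 4: -20.0}
```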

Stranger still, every one of these systems, running on its own, loses money. Make them argue and somehow the arguing turns a profit. That is the whole project in a sentence.

The honest scoreboard, the one that marks the system against the closing price rather than its own optimism, says Twinkle Toes now sits exactly on the market. Not beating it. Not visibly losing to it. After three months of building and two of debugging, it has earned the right to stand on the start line next to the professionals. It has not beaten one yet.
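One common way to mark bets against the closing price is closing line value: the price taken divided by the closing price, minus one. The sketch below assumes that measure and decimal odds. The article does not say which formula the scoreboard uses, so treat the function names and the calculation as illustrative.

```python
from statistics import mean


def closing_line_value(price_taken: float, closing_price: float) -> float:
    """Fractional edge over the close: positive means a better price was taken."""
    if price_taken <= 1.0 or closing_price <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return price_taken / closing_price - 1.0


def average_clv(bets) -> float:
    """Mean closing line value over (price_taken, closing_price) pairs."""
    return mean(closing_line_value(taken, close) for taken, close in bets)
```

An average near zero is what "sits exactly on the market" would look like under this measure: the prices taken were, on balance, no better and no worse than the close.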

One last wrinkle, the one I am enjoying most. Twinkle Toes turns out to place rather more often than it wins. Its horses are forever finishing second or third, beaten by a length, agonisingly close, so since early June every bet it places is shadowed by a place bet on the same horse. Same horse, same stake, same pot.

The idea came out of an earlier paper version that had looked spectacular, the place returns up around sixty per cent, which is exactly what made me distrust it. The live version has been calmer and a good deal more honest. Four winners on the first day, a cold patch in the middle of June where it lost almost everything it touched, and a running total since then of roughly break-even. Which is the precise shape variance takes when it is dressing up as an edge. So I am going to sit on it for months before I believe a word of it. I have been wrong too often this year to start trusting good news the moment it turns up.

The engine, briefly

For anyone who wants the machine and not the moral: most of what Twinkle Toes does is not AI at all. Every morning it settles yesterday’s bets, pulls the day’s racecards, and uses plain old rules, form, going, draw, trainer and jockey records, to narrow each race to a handful of candidates. Only then do the rival systems get a vote. The whole day costs about fifty pence.

It is the same plumbing as the stock-trading bot I wrote about earlier this year: the same scheduler, the same panel of arguing models, the same paper-money-by-default cowardice. The domain is just less forgiving. A race happens once, settles in minutes, and is gone. There is nothing to hold and nothing to sleep on. Same engine block, different car.

About the name, since I have made you read this far without it

In the Peppa Pig episode it comes from, Peppa and George are given a toy horse and immediately fall out over what to call it. George wants Horsey. Peppa wants Twinkle Toes. Neither gives an inch, the row runs the entire episode, and it ends only when Peppa settles it by calling the horse Horsey Twinkle Toes, on the grounds that both names are now true and nobody has to be wrong.

I named the system after Peppa’s half, and not for the compromise. For the argument. Two voices looking at the same horse from two different places, refusing to back down long enough that the answer fell out of the disagreement rather than out of either of them.

That is the committee on a good day. Five voices, five differently wrong angles, mostly refusing to agree, and now and then three of them landing on the same horse so the thing is allowed to act. The argument is the asset. The horse is just whatever they were all looking at when it resolved.

The expensive Betfair key is still unbought. The Golden Week is still unverified. And the system still mostly does not bet, which, for a tipster that has learned anything at all, is exactly right.

The figures here are drawn from the system’s own paper-trading database and are true as of writing: a paper account throughout, roughly seven per cent of races backed, an edge that lives in three-way agreement, and a closing-price scoreboard level with the market. No real money has been staked, and the Golden Week remains gloriously unconfirmed.

The views expressed in this article are my own and do not represent the views of my employer.