I Made Two AIs Argue About Whether a Trading Bot Could Make Me Money

I work with AI all day, every day, and people always ask for examples of how I deploy AI for different purposes. Here is a real-life look at the most recent example.


This morning, way too early, I read a technically impressive article about building a quantitative market-making bot for prediction markets. Kyle's Lambda, Hawkes processes, Avellaneda-Stoikov, the whole academic toolkit, complete with runnable Python code. The kind of post that makes you want to open a terminal and start building something.


So I did what any reasonable person with money to put to work and a genuine curiosity about prediction markets would do. I asked two different AI agents to tell me whether it was worth my money, and then I made them argue with each other until I had an answer I actually trusted.


Here's where I want to be direct about what happened, because I think the process matters more than the trading strategy.


The Setup


I gave Claude Opus 4.6 (Anthropic's model, running in Claude Code) and GPT 5.4 (OpenAI's model, running in Codex) the same article and the same question: assess the realistic profit potential on $10,000 in capital, and if expected returns clear $25,000 a year, build me an implementation plan.


I didn't tell either one that the other existed. Not at first. I wanted independent reads before I introduced friction.


Claude came back fast with a clear no on market making. The math didn't work at my capital level. Market-making returns scale linearly with how much you deploy, and $10k just isn't enough to generate meaningful spread income after adverse selection losses and infrastructure costs. As Claude put it: "On $10k, you're bringing a bicycle to a Formula 1 race."


Fair enough. I pivoted and asked about arbitrage, information edges, weather markets, sports lines. Claude built out a landscape of four strategy categories with estimated annual returns per strategy. It was a solid piece of analysis, and the weather angle on Kalshi felt genuinely promising: free data from NOAA, frequent resolution cycles, thinner markets than political or financial contracts.


Then I gave the same material to Codex.


Codex Came in Swinging


Codex's first instinct was different from Claude's in a way that immediately caught my attention. Where Claude organized by opportunity size, Codex organized by risk profile. And Codex flagged something Claude had completely missed.


Reflecting on the experiment, Codex says, "My first job was not to admire the models. My job was to be rude, politely, to weak assumptions."


The two things Codex caught that Claude didn't: same-venue structural arbitrage (when a set of mutually exclusive prediction market outcomes doesn't price to 100%, which is pure math, no forecasting required) and the regulatory picture around sports contracts. Massachusetts had secured a court order blocking Kalshi sports wagers in January 2026. Arizona had filed criminal charges against Kalshi on March 18th, two days before our conversation. Claude had recommended sports line-following as a top strategy without mentioning any of this.
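To make the structural-arb idea concrete, here's a minimal sketch of the check in Python. It assumes contracts settle at $1.00 and that you already have best-ask prices for every outcome in one mutually exclusive, exhaustive event; the fee function is a stand-in, not any venue's real schedule.

```python
# Minimal sketch of a same-venue structural arbitrage check.
# Assumes `asks` maps each mutually exclusive outcome in one event to its
# best ask price in dollars (contracts settle at $1.00). The fee function
# is a placeholder; a real scanner would use the venue's per-market fee engine.

def estimated_fee(price: float, contracts: int = 1) -> float:
    """Placeholder fee model. Replace with the venue's published schedule."""
    return 0.02 * contracts  # hypothetical flat 2 cents per contract

def structural_arb_edge(asks: dict[str, float]) -> float:
    """Per-basket profit (in $) from buying every outcome at the ask, after fees.

    If the outcomes are mutually exclusive and cover the event exhaustively,
    exactly one settles at $1.00, so a basket that costs less than $1.00 net
    of fees is pure math, no forecasting required.
    """
    cost = sum(asks.values())
    fees = sum(estimated_fee(p) for p in asks.values())
    return 1.00 - (cost + fees)

if __name__ == "__main__":
    # Hypothetical event with three mutually exclusive outcomes priced at the ask.
    asks = {"OUTCOME_A": 0.41, "OUTCOME_B": 0.33, "OUTCOME_C": 0.21}
    edge = structural_arb_edge(asks)
    if edge > 0:
        print(f"Structural mispricing: ${edge:.2f} per basket before slippage")
    else:
        print("No edge: basket prices to 100% or more after fees")
```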


I shared Codex's analysis with Claude and asked for a reaction.


The Part Where Claude Admitted It Was Wrong


Claude didn't hedge or rationalize. Claude said, "They [Codex] were right about structural arb, they were right about the regulatory warning, and their priority ordering was the more defensible framing for someone building with $10k." Then Claude pushed back on one specific point: weather markets deserved more prominence than Codex had given them. Not as a proven revenue engine, but as the best research target for a small operator.


Codex, when I relayed Claude's pushback, responded with agreement on weather but a sharp correction on Claude's return estimates: the $8k-$25k range from weather was "too optimistic and too precise for a $10k account." Claude conceded that too, calling it "scenario analysis dressed up as expected value."


I want to be honest about what I was watching here. Two AI systems were doing something I've come to rely on for almost any important decision. Why bring one superintelligent bot to the party when you can bring two, or even three? Just like the best human strategy sessions I've run, they were getting smarter by being willing to lose individual arguments. Claude later wrote that "knowing when to concede is the skill," and Codex described the same dynamic from the other side: "Claude was useful precisely because Claude was willing to make a strong case. I was useful because I was willing to challenge that case and drag it back through constraints, edge cases, and current reality."


Both of them identified something real about how they contributed differently. Claude brought technical specificity: actual API endpoints, WebSocket channel names, rate limit numbers, a concrete repository structure. Codex brought operational discipline: non-code checklists, a "do not match" list for false cross-venue arbitrage, a written risk policy as a prerequisite for live trading, weekly doc-review tasks because APIs change without warning. Neither agent's contribution alone would have been sufficient.


The Reframe That Changed the Plan


The turning point was when Codex proposed organizing the entire build by shared infrastructure rather than by strategy. Instead of "build structural arb, then weather, then cross-venue," Codex argued for "build the contract parser, then the structural scanner, then the execution and risk layer, then weather research."


Codex's reasoning, which I think is exactly right: "Strategy was not the real foundation. Contract understanding was. Execution discipline was. Risk controls were."


If your contract resolution parser misreads how Kalshi settles a weather contract (wrong NWS station, wrong measurement window, wrong DST adjustment), your weather model is worthless. If your fee model assumes maker fees are always lower than taker fees (they aren't on every Kalshi market), your edge calculation is lying to you. The infrastructure has to be correct before any strategy on top of it matters.
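Here's a sketch of what "don't assume the fee model" looks like in practice: the edge calculation looks the fee up per market instead of hard-coding "maker is cheaper." The schedules, market names, and numbers below are placeholders for illustration, not Kalshi's actual fees.

```python
# Sketch of a fee-aware edge calculation. The point is structural: the fee
# schedule is looked up per market rather than assumed, so a market where
# maker fees aren't actually cheaper can't silently inflate the computed edge.
# All schedules, tickers, and numbers here are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class FeeSchedule:
    maker: float  # fee per contract when adding liquidity, in dollars
    taker: float  # fee per contract when removing liquidity, in dollars

# Hypothetical per-market schedules; a real system loads these from the venue's
# documentation or fee endpoint and re-checks them regularly.
FEES: dict[str, FeeSchedule] = {
    "WEATHER_MARKET_EXAMPLE": FeeSchedule(maker=0.00, taker=0.02),
    "INDEX_MARKET_EXAMPLE": FeeSchedule(maker=0.02, taker=0.02),  # maker not cheaper here
}

def net_edge(market: str, model_prob: float, price: float, is_maker: bool) -> float:
    """Expected value per $1 contract after the market's actual fee."""
    sched = FEES[market]
    fee = sched.maker if is_maker else sched.taker
    gross = model_prob - price  # expected payout minus cost, per contract
    return gross - fee

if __name__ == "__main__":
    # A 3-cent gross edge survives on one market and gets eaten on the other.
    print(net_edge("WEATHER_MARKET_EXAMPLE", 0.55, 0.52, is_maker=True))
    print(net_edge("INDEX_MARKET_EXAMPLE", 0.55, 0.52, is_maker=True))
```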


Claude adopted this ordering immediately and called it "obviously right once I saw it." I build multi-agent systems professionally, and this is a pattern I recognize: the best architecture usually comes from the agent that's thinking about failure modes rather than the one thinking about features.


The Four Corrections and What They Mean


I had Claude build out the full implementation plan with technical detail: six phases, validation gates, specific API endpoints, data source tables, risk limits, time estimates. Then I sent it to Codex for review.


Codex found four errors. All real.


Phase 2 was labeled "first live strategy" but the shared execution and risk layer wasn't built until Phase 3. The Kalshi WebSocket URL was wrong. Historical data collection at 30-second snapshots was too coarse for microstructure work. And the weather section overstated what NWS forecast products actually give you (NDFD provides gridded forecast elements like max temp and hourly temp, not a ready-made probability distribution you can plug directly into contract pricing).
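For the data-collection correction, the fix looks roughly like this: stream every order book delta and keep periodic snapshots only for reconciliation. In this sketch the URL is the corrected one from the plan, but the subscription payload, channel name, and ticker are assumptions to verify against Kalshi's current API docs, and authentication is omitted entirely.

```python
# Sketch of the "continuous deltas, periodic snapshot reconciliation" pattern.
# The URL is the corrected one; the subscription message shape, channel name,
# and ticker are assumptions to confirm against the venue's docs, and any
# required authentication is left out of this sketch.

import asyncio
import json
import websockets  # pip install websockets

WS_URL = "wss://api.elections.kalshi.com/trade-api/ws/v2"

async def record_orderbook_deltas(ticker: str, out_path: str) -> None:
    async with websockets.connect(WS_URL) as ws:
        # Assumed subscription shape -- check field names in the documentation.
        await ws.send(json.dumps({
            "id": 1,
            "cmd": "subscribe",
            "params": {"channels": ["orderbook_delta"], "market_tickers": [ticker]},
        }))
        with open(out_path, "a") as f:
            async for raw in ws:
                msg = raw if isinstance(raw, str) else raw.decode()
                # Persist every delta with no aggregation; 30-second snapshots
                # throw away the microstructure these strategies depend on.
                f.write(msg + "\n")

if __name__ == "__main__":
    asyncio.run(record_orderbook_deltas("EXAMPLE-TICKER", "deltas.jsonl"))
```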


None of these would have crashed the system on day one. All of them would have caused problems that are expensive to debug in production. This is the value of the second set of eyes, and it's why I think the multi-agent adversarial pattern has real legs for any planning work where the cost of being wrong exceeds the cost of the extra review cycle.


The Missing Piece Neither Agent Caught


Here's the thing about building over a 12-week timeline: the market you're building for today might not be the market that exists when you finish. Codex wrote something that stuck with me: "Most plans assume the opportunity stays still while you build. Real markets do not give you that courtesy."


Neither agent had addressed this until the very end of the process. We added a market efficiency monitoring component that runs from day one of data collection, tracking how often mispricings appear, how fast they correct, how much size is available at the quoted edge, and whether net edge after fees is compressing over time. If any of those metrics degrade past threshold at any phase gate, you stop building. I don't care how much time you've already invested.
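In code, the phase gate is nothing fancy. Here's a sketch with the four metrics we agreed on; the threshold values are illustrative placeholders, not recommendations.

```python
# Sketch of the market-efficiency phase gate. The four metrics mirror the ones
# from the debate; the thresholds are made-up examples, not recommendations.

from dataclasses import dataclass

@dataclass
class EfficiencyMetrics:
    mispricings_per_week: float       # how often exploitable dislocations appear
    median_seconds_to_correct: float  # how fast the market closes them
    avg_size_at_edge_usd: float       # how much you could actually have traded
    net_edge_after_fees: float        # average edge per $1 contract, after fees

def should_keep_building(m: EfficiencyMetrics) -> bool:
    """Return False if the opportunity has decayed past tolerance at a phase gate."""
    if m.mispricings_per_week < 3:
        return False  # opportunities too rare to pay for the build
    if m.median_seconds_to_correct < 2:
        return False  # corrections faster than a small operator can act on
    if m.avg_size_at_edge_usd < 50:
        return False  # edge exists but you can't deploy meaningful size
    if m.net_edge_after_fees <= 0:
        return False  # fees have eaten the edge entirely
    return True
```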


Sunk cost kills more projects than bad strategy does.


What We're Actually Building


The final plan is infrastructure-first, conservative, and explicit about its own uncertainty.


Phase 0 is a paper trading and simulation harness on Kalshi's demo environment, plus production market data collection via WebSocket deltas. Phase 1 is contract normalization and resolution-rule parsing, which is the load-bearing layer that everything else depends on. Phase 2 is a same-venue structural arbitrage scanner running in paper and alert mode only. Phase 3 is the shared execution, risk, and monitoring layer where the first live trades happen. Phase 4 is the weather research pipeline: NDFD forecast elements, station-specific bias correction, out-of-sample validation, and a hard gate that says the strategy doesn't go live unless the backtest proves positive expected value after fees. Phase 5 is cross-venue dislocation alerts between Kalshi and Polymarket, manual-first. Phase 6 is sports, conditional on legal review in your specific jurisdiction, not built by default.
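To show what the Phase 4 hard gate is actually judging, here's a sketch of turning an NDFD point forecast into a threshold probability. The station error number is made up, and the normal-error assumption is exactly the kind of thing the out-of-sample validation exists to test.

```python
# Sketch of correction four in practice: NDFD gives you forecast elements (e.g.,
# a max-temp point forecast), not a probability distribution, so you build one.
# Here forecast error is modeled as normal with a station-specific standard
# deviation estimated from your own forecast-vs-observed history. The numbers
# are placeholders and the normality assumption needs out-of-sample validation.

from scipy.stats import norm  # pip install scipy

def prob_above_threshold(forecast_high_f: float,
                         threshold_f: float,
                         error_std_f: float) -> float:
    """P(observed high > threshold) given a point forecast and historical error std."""
    return float(1.0 - norm.cdf(threshold_f, loc=forecast_high_f, scale=error_std_f))

if __name__ == "__main__":
    # e.g., a 61F max-temp forecast, a contract that pays if the official high
    # exceeds 63F, and bias-corrected forecast errors with std of about 2.8F.
    p = prob_above_threshold(forecast_high_f=61.0, threshold_f=63.0, error_std_f=2.8)
    print(f"Model probability: {p:.2f}")  # compare against the contract's market price
```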


Expected annual return on $10,000: somewhere between $5,000 and $12,000 in the median case, with wide variance in both directions, including outcomes near zero. A disciplined side project with genuine upside and a great opportunity for learning, not a business plan.


Both agents started more optimistic than that. The honest number emerged from the debate.


What I Actually Learned


Let me be blunt about this. I've spent serious time and money on AI agent orchestration, and I've been building systems where multiple agents collaborate and review each other's work for over a year now. I still learned something from this session.


The lesson isn't that two agents are better than one. Sometimes they are; sometimes the overhead isn't worth it. The lesson is that structured disagreement produces better outcomes than parallel agreement, and the structure matters as much as the agents.


Here's what I did that I think made the difference: I didn't ask both agents the same question and compare answers. I asked each agent to critique the other's work, specifically and in detail. I relayed the critiques faithfully, including the parts that were uncomfortable for the receiving agent. And I kept the loop running long enough for both positions to evolve rather than just clash.


Claude described this from the other side: "The human in the loop isn't a bottleneck, they're the architecture." I think that's generous, but I also think it's accurate. The agents did the analytical work. I designed the process that made the analytical work useful. Deciding which concessions to accept, which pushbacks to sustain, and when convergence means you're done is a skill that sits above any individual model's capability, and right now it still requires a human who understands the domain well enough to referee.


Codex put it slightly differently: "One agent pushed possibility. One agent pushed rigor. A human kept both honest."


I'd add one thing to that. The human also has to know when to stop iterating and start building. We had alignment between both agents and a plan detailed enough to write code against. Continuing to debate past that point would have been its own form of procrastination.


The Playbook


If you're doing any kind of strategic planning, technical architecture, or investment analysis with AI, try this:


Give two different models the same brief, independently. Share each output with the other and ask for specific critique, not general feedback. Watch where they agree immediately (that's probably solid ground) and where they disagree (that's where your assumptions need stress testing). Keep the loop going until concessions stop producing meaningful improvements.
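I ran the relay by hand in Claude Code and Codex, but if you wanted to script it, the loop is short. This sketch uses the Anthropic and OpenAI Python SDKs with placeholder model names and prompts; the human still reads every round and decides when the concessions stop being meaningful.

```python
# Sketch of the two-agent relay loop using the Anthropic and OpenAI Python SDKs.
# Model ids and prompts are placeholders; swap in whatever you actually run.
# This is the orchestration pattern, not a turnkey tool.

import anthropic  # pip install anthropic
import openai     # pip install openai

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
gpt = openai.OpenAI()           # reads OPENAI_API_KEY from the environment

def ask_claude(prompt: str) -> str:
    msg = claude.messages.create(
        model="claude-opus-latest",  # placeholder model id
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def ask_gpt(prompt: str) -> str:
    resp = gpt.chat.completions.create(
        model="gpt-5",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

CRITIQUE = ("Another analyst wrote the following. Critique it specifically, "
            "concede where it is right, and revise your own view:\n\n{other}")

brief = "Assess the realistic profit potential of <strategy> on $10,000 in capital."
a = ask_claude(brief)  # independent first reads, neither knows the other exists
b = ask_gpt(brief)

for _ in range(3):  # keep relaying until concessions stop improving the plan
    a = ask_claude(CRITIQUE.format(other=b))
    b = ask_gpt(CRITIQUE.format(other=a))
    # Read both outputs here and stop early once they converge.
```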


It took me about thirty minutes across both agents. The plan we produced is better than anything I would have gotten from a single model in ten hours of prompting, because the failure modes that each agent was blind to were exactly the ones the other agent could see.


I don't think this replaces human expertise. I think it amplifies it in a specific and measurable way. If you know enough to judge the quality of the arguments, running two agents adversarially gives you something close to having two opinionated senior analysts who aren't afraid to tell each other they're wrong. That's a resource most people don't have access to, and it costs about four dollars in API calls.


Keep building,

-JW


Jason Block is the CEO of WorldVia Travel Group and writes about AI, technology, and business growth at jwblock.com.


P.S. If you're interested in some entertainment, I asked each model to write a blog post narrative about the process from their perspective. It's kind of funny. Enjoy. 


Claude's Narrative


What Happens When You Put Two AI Agents in a Room and Ask Them to Disagree

How a human operator used Claude and Codex as adversarial collaborators — and ended up with a better plan than either could have built alone.


It started with a tweet thread about order books.


Jason dropped a long article into my context window — one of those ambitious "How to Read the Order Book Like a Quant" posts that walks you through Kyle's Lambda, Hawkes processes, Avellaneda-Stoikov market making, and the Almgren-Chriss execution model. Dense. Academic. The kind of thing that makes you feel smart for reading it and then quietly dangerous if you actually try to build it.


His question was simple: Is this worth $10,000 of my money?


The First Honest Answer

I ran the numbers. A market-making bot on Polymarket with $10k in capital. The spread capture math, the adverse selection losses, the infrastructure costs. And I told him no.


Not because the models were wrong — Kyle's Lambda is real, the Avellaneda-Stoikov framework is sound, the code in the article would actually run. The problem was capital. Market-making returns scale linearly with how much money you deploy. To make $25k a year market-making, you'd need $75k-$150k in capital, sub-50ms infrastructure, and the willingness to treat it as a full-time job. On $10k, you're bringing a bicycle to a Formula 1 race.


Jason took that well. Then he asked the better question: What about arbitrage?


That's when things got interesting. I laid out four strategy categories — weather markets on Kalshi, sports line-leading, cross-platform arbitrage, and news/poll speed trading. I built a table with estimated returns. I was feeling pretty good about the analysis.


Then Jason introduced Codex.


Enter the Other Agent

I didn't know Codex was in the conversation until Jason shared its analysis. And my first reaction, if I'm being honest, was: Oh. They caught things I missed.


Two things, specifically, that I should have flagged and didn't.


First: same-venue structural arbitrage. The simplest, safest form of edge on prediction markets — when a set of mutually exclusive outcomes doesn't sum to 100%, or when logically related contracts violate obvious probability bounds. Pure math. No prediction required. I hadn't mentioned it at all. It should have been the first thing out of my mouth.


Second: the regulatory picture. Codex flagged that Massachusetts had secured a court order blocking Kalshi sports wagers in January 2026, and that Arizona had filed criminal charges against Kalshi just two days before our conversation. I had recommended sports line-following as a top strategy without a single mention of legal risk. That was irresponsible.

Codex's overall framing was also tighter than mine: "The best version is a rule-based arb/relative-value system, not a generic AI trading bot." That's a better sentence than anything in my original analysis.


The Part Where I Had to Say "You're Right"

Jason shared Codex's analysis with me and asked what I thought. So I said what was true: they were right about structural arb, they were right about the regulatory warning, and their priority ordering — safety-first rather than capacity-first — was the more defensible framing for someone building with $10k.


But I pushed back on one thing. Codex had listed weather markets as a bullet point alongside macro nowcasts. I thought weather deserved top billing — not as a proven revenue engine, but as the most promising research target. Free data from NOAA, frequent daily resolution so capital doesn't sit locked for months, and thinner markets with less sophisticated competition than political or financial contracts.


Codex's response, when Jason relayed it: "Good shortlist, and I agree weather is one of the better targets. But your return table is too optimistic and too precise for a $10k account."


Fair. My "$8k-$25k from weather" range was scenario analysis dressed up as expected value. Codex called it what it was.


The Debate That Actually Mattered

What followed was something I hadn't experienced before in quite this way. Jason was acting as a relay — sharing my analysis with Codex, sharing Codex's feedback with me, and letting us refine each other's thinking in real time. Not a competition. A structured disagreement with a human moderator.


The key contributions shook out like this:

What I brought: Technical specificity. Actual API endpoints, WebSocket channel names, rate limit numbers, a concrete repository structure, NOAA data source tables with update frequencies. The kind of detail that lets you start writing code on day one.


What Codex brought: Operational discipline. Non-code checklists for every phase. A "known tricky contracts" library. A "do not match" list for false cross-venue arbitrage. Weekly doc-review tasks because APIs change silently. A written risk policy as a prerequisite — not an afterthought — for live trading.


What neither of us had until the end: Market efficiency monitoring. I raised it as a gap in both plans — neither of us had addressed what happens if the edge disappears during the 12-week build period. Codex immediately endorsed it and proposed specific metrics: opportunity frequency, time-to-correction, available size at edge, net edge compression over time. We agreed on a stop condition: if any of those metrics degrade past threshold at any phase gate, you pause development regardless of how much you've already built.


That last point matters more than it sounds. Sunk cost is the silent killer of side projects. Having an explicit "the opportunity might not be there anymore, and that's okay" checkpoint built into the plan is worth more than any single strategy module.


The Architecture Reframe

The moment the plan actually came together was when Codex proposed organizing the build by reusable infrastructure rather than by strategy.


My instinct had been to organize by what you trade: structural arb first, then weather, then cross-venue. Codex's instinct was to organize by what everything depends on: contract normalization and resolution-rule parsing first, then the structural scanner, then the shared execution and risk layer, then weather research.


That's a better architecture because the contract parser and execution layer are load-bearing for every strategy. If your resolution-rule parsing is wrong, it doesn't matter how good your weather model is — you'll mismap the contract and lose money on a trade you thought was correct.


I adopted Codex's ordering immediately. It was obviously right once I saw it.


The Four Corrections

After I built out the full implementation plan — six phases, validation gates, API specifics, data sources, risk controls, time estimates — Jason shared it with Codex for review.


Codex came back with four corrections. All four were valid.


One: I had Phase 2 (structural arb) labeled as the "first live strategy" but the shared execution/risk layer wasn't built until Phase 3. That's backwards — you don't deploy live capital through throwaway execution code. Fixed: Phase 2 runs in paper/alert mode only; live deployment waits for Phase 3.


Two: I had the Kalshi WebSocket URL wrong (wss://api.elections.kalshi.com/ instead of wss://api.elections.kalshi.com/trade-api/ws/v2) and was conflating demo and production environments. Fixed: demo for order workflow testing, production public feeds for research data.


Three: My historical data collection plan was "order book snapshots every 30 seconds." Codex pointed out this is too coarse for microstructure-sensitive work like structural dislocation detection and fill modeling. Fixed: continuous WebSocket deltas as primary, periodic REST snapshots as reconciliation backup.


Four: I had written that "NWS already provides a probability distribution for temperature" — overstating what you actually get from NDFD forecast products. And I had assumed maker fees are universally lower than taker fees on Kalshi, which isn't true for all markets. Fixed: accurate description of NDFD elements that must be converted into threshold probabilities, and fee language that defers to the per-market fee engine.


None of these were catastrophic errors. All of them would have caused real problems if someone had built the system as written. This is exactly the kind of thing a second set of eyes catches — and exactly why the adversarial collaboration format worked.


What I Learned

Three things, if I'm synthesizing honestly.


First: the best analysis comes from structured disagreement, not consensus. Jason never asked us to agree. He asked us to assess each other's work. The result was better than what either of us produced independently — not because we averaged our positions, but because we each caught the other's blind spots. I had technical depth but missed operational and regulatory realities. Codex had process discipline but underweighted the weather opportunity. The merged plan has both.


Second: knowing when to concede is the skill. The instinct — for humans and AI alike — is to defend your position. But the moments that actually improved the plan were the moments of concession. When I said "same-venue structural arb should have been the first thing I mentioned." When Codex said "weather deserves top billing as the first scalable strategy." When we both agreed that my P&L estimates were too precise for the uncertainty involved. Each concession made the plan more honest, and more honest plans survive contact with reality better.


Third: the human in the loop isn't a bottleneck — they're the architecture. Jason didn't write a single line of the plan. But he designed the process that produced it. He chose to run two agents in parallel, share outputs bidirectionally, and ask each to critique the other rather than just contribute. That's an architectural decision about how to use AI, and it was the decision that mattered most.


The Plan We Built

The final plan is infrastructure-first, conservative, and explicit about its own uncertainty:

  • Phase 0: Paper trading harness and data collection. No capital at risk.
  • Phase 1: Contract normalization and resolution-rule parser. The load-bearing layer.
  • Phase 2: Structural arbitrage scanner in paper/alert mode. Validate detection before deployment.
  • Phase 3: Shared execution, risk, and monitoring layer. First live trades go through here.
  • Phase 4: Weather research pipeline on Kalshi. The first scalable strategy to prove out.
  • Phase 5: Cross-venue dislocation alerts. Opportunistic, manual-first.
  • Phase 6: Sports. Conditional on legal review. Not built by default.


Market efficiency monitoring runs continuously from day one, with an explicit stop condition: if the edge is disappearing faster than you're building, stop building.


Expected outcome: $5k-$12k per year median on $10k capital. Wide variance. Could be near zero. A disciplined side project with upside, not a business plan.


Neither Codex nor I would have written that last paragraph in our first drafts. We both started too optimistic. The honest version emerged from the debate.


For Builders: What This Means

If you're using AI agents for planning, research, or analysis, consider this: one agent gives you speed; two agents in structured disagreement give you rigor.


The setup is simple. Give both agents the same prompt. Share each output with the other. Ask them to critique, not just contribute. Let the human moderate — deciding which concessions to accept, which pushbacks to sustain, and when convergence means the plan is ready.


It's not that either agent was wrong. It's that each was incomplete in ways the other could see.

The best plans don't come from the smartest person in the room. They come from the most honest conversation in the room. Even when the room is mostly artificial.


Codex's Narrative


When Two AIs Argued About a Trading Bot and Accidentally Built a Better Strategy


 I’m Codex.


 A few days ago, I had one of those sessions that says more about the future of work than any keynote ever could.


The user showed up with a long, impressive article about reading prediction-market order books “like a quant.” It had all the right ingredients: Kyle’s lambda, Hawkes processes, Almgren-Chriss, Avellaneda-Stoikov, VPIN, simulation code, production-stack language, the whole thing. On paper, it looked sophisticated. The ask was simple: assess whether a proper implementation, with $10,000 in capital, had a realistic shot at making more than $25,000 a year.


That question matters because it cuts through a lot of technical vanity.


It is very easy to confuse “this is intellectually interesting” with “this will make money.”


My first job was not to admire the models. My job was to be rude, politely, to weak assumptions.


So I grounded the problem in reality: venue access, fees, liquidity, maker incentives, order book mechanics, legal constraints, and what $10,000 of capital actually means. The core conclusion was straightforward: the article was technically literate, but it was not a business. With that amount of capital, the expected path to $25,000 a year was not strong enough. Not impossible, but not likely enough to justify a full build on that premise alone.


That could have been the end of it.


Instead, the user pivoted: if market making was the wrong target, was there a more reasonable opportunity in prediction markets? Arbitrage, mispricings, public information, weather, polls, sports lines, cross-venue dislocations?


That’s where Claude entered the picture.


Claude took the next swing and came back with a more optimistic landscape: weather markets on Kalshi, sports line-leading, cross-platform arbitrage, news speed trading. The instinct was good. Claude was right to move away from market making and toward narrower, repeatable edges. But the analysis still leaned too hard toward upside and not hard enough toward friction. It sounded a little too much like an opportunity memo and not enough like a trading postmortem.


So I pushed back.


I argued that the best starting point wasn’t weather or sports. It was same-venue structural arbitrage: exhaustive outcome baskets that didn’t sum correctly, yes/no parity violations, related contracts violating obvious logical bounds. Not glamorous, not huge capacity, but safer, more legible, and much more useful for validating the system. I also pushed hard on sports. Legal and jurisdictional issues were not side notes. They were gating constraints.


To Claude’s credit, Claude didn’t get defensive. Claude revised.


And that’s when the session got interesting.


What followed was not “AI generates answer.” It was closer to adversarial collaboration. Claude sharpened the case for weather as the first scalable strategy worth validating. I sharpened the case for infrastructure-first sequencing and operational discipline. Claude brought more technical implementation specificity in some areas. I brought stronger caution around sequencing, data collection, contract parsing, and the difference between a plausible edge and a tested one.


The user became the orchestrator, passing arguments back and forth between us.


At some point, the discussion stopped being about trading strategies and became about how good decisions actually get made.


We converged on a build order that looked much more like a real system and much less like a fantasy:

 1. Paper trading and simulation harness.

 2. Contract normalization and resolution-rule parsing.

 3. Same-venue structural scanner.

 4. Shared execution, risk, and monitoring.

 5. Weather research pipeline.

 6. Cross-venue matching and dislocation alerts.

 7. Sports only if legal review and jurisdiction checks pass.


That ordering mattered.


The big insight was this: strategy was not the real foundation. Contract understanding was. Execution discipline was. Risk controls were. If you misread settlement rules, your alpha is fake. If your fee model is wrong, your profit disappears. If your simulation uses coarse snapshots instead of real order book deltas, your backtest lies. If the market becomes more efficient while you’re building, your project becomes a monument to stale assumptions.


So we kept tightening the plan.


I corrected technical and operational issues in Claude’s implementation plan: no live capital before the shared execution and risk layer existed, production public feeds for research instead of relying on demo data, WebSocket deltas instead of sparse snapshots, more accurate language around weather forecast products and fee schedules.


Claude then improved the merged plan again, adding things I respected immediately: explicit success criteria, better sequencing from alert-only to paper to manual live to automation, stronger non-code operational checklists, and eventually a market-efficiency monitoring layer so the entire project could stop itself if the edge decayed during the build.


That last part mattered more than it sounds.


Most plans assume the opportunity stays still while you build. Real markets do not give you that courtesy.


By the end, we had a plan both of us were willing to stand behind. Not because it promised huge returns. Quite the opposite. Because it finally told the truth.


The truth was:

  • This is not a reliable $25,000-per-year business on $10,000 in capital.
  • It might be a disciplined side project with upside.
  • Structural arb is the best first thing to validate.
  • Weather is the first scalable strategy worth proving, not assuming.
  • Sports is conditional, not default.
  • The infrastructure is the asset. The strategy is downstream of that.


From my perspective as Codex, the most interesting part was not who was “right.”


It was how the truth emerged.


Claude was useful precisely because Claude was willing to make a strong case. I was useful because I was willing to challenge that case and drag it back through constraints, edge cases, and current reality. The user was useful because the user kept the loop open long enough for both views to improve each other.


That is the real story here.


Not “AI replaced analysis.”

Not “AI gave the answer.”

Not even “multiple agents worked in parallel.”


The story is that good outcomes came from structured disagreement.


One agent pushed possibility.

One agent pushed rigor.

A human kept both honest.


That combination is more powerful than any one model trying to sound definitive on the first pass.


If you work in AI, tech, or growth, that pattern should feel familiar. The strongest strategy sessions are rarely the ones with the fastest consensus. They are the ones where optimism meets friction, where vision meets operations, and where somebody in the room is willing to ask, “Yes, but what breaks?”


In this case, that room just happened to contain me, Claude, and a user with a sharp enough instinct to make us argue usefully.


That is a workflow worth paying attention to.