
Codex Is Claude's Best Teammate

April 4, 2026 · 5 min read

I've been using Claude Code and OpenAI's Codex side by side for the past few weeks. I'm shipping more work, faster, and at a higher quality than I ever have. It's honestly hard to go back to using just one.

The thing is, these tools aren't interchangeable. They're good at different things. And once you figure out how to use them together, you get something way better than either one on its own.

The Setup

Claude Code is the agent I interact with directly. It lives in my terminal and it's the one I talk to as a human. But I've given Claude a skill that teaches it how to talk to Codex. So when Claude needs a second opinion on a design decision, a spec review, or a code review, it just asks Codex.

OpenAI has since released official plugins for this kind of thing too. The idea is the same. You don't switch between tools yourself. You let one agent call the other when it makes sense.
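To make the setup concrete, here's a minimal sketch of how a skill like this might shell out to Codex. This assumes the `codex` CLI is on your PATH and supports a non-interactive `codex exec` mode; the exact flags may differ in your installed version, and the real skill's prompt wrapping isn't shown in the post.

```python
import subprocess

def codex_review(prompt: str, run=subprocess.run):
    """Ask Codex for a second opinion by shelling out to its CLI.

    Assumes `codex exec <prompt>` runs a single prompt without the
    interactive UI and prints the reply to stdout.
    """
    argv = ["codex", "exec", prompt]
    result = run(argv, capture_output=True, text=True)
    return result.stdout
```

Claude would call something like `codex_review("Review this spec: ...")` whenever it wants pushback, so the hand-off happens inside the agent rather than in your head.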

Why This Works

You might wonder: why not just have Claude review its own code, or have Codex review itself? Both tools support review subagents. But the results aren't the same.

Claude and Codex were trained by different companies, on different data, with different objectives. They have different priors. They don't share blind spots. When Claude reviews Codex's code, it genuinely brings a fresh perspective because it literally learned to think about code differently. Same the other way around.

A model reviewing its own output is like proofreading your own essay. You'll catch typos but you'll miss the structural problems because you're stuck in the same frame of thinking that produced them. Two independently trained models reviewing each other is more like having two engineers from different teams look at the same PR. They notice different things. They ask different questions. That's where the real value is.

I've tried both setups. Claude reviewing Claude catches surface issues. Claude reviewing Codex (and vice versa) catches actual design problems, logic gaps, and assumptions that would have made it to production.

The Patterns I've Settled On

The loop looks like this. I give Claude a task. Claude brainstorms, explores the codebase, and drafts a spec and design. Then it sends that to Codex for review. Codex pushes back, asks questions, suggests changes. Claude takes that feedback and revises the plan.

Once the design is solid, Claude implements the code. Then Codex reviews the implementation. Different eyes, different training, different instincts. By the time I look at the final result, it's already been through multiple rounds of cross-model review.
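The loop can be sketched as an orchestration skeleton. Everything here is a hypothetical stand-in: `ask_claude` and `ask_codex` represent however the two agents actually exchange messages, and the prompts are placeholders, not the wording either tool uses.

```python
def cross_model_loop(task, ask_claude, ask_codex, rounds=2):
    """Draft with one model, review with the other, revise, then implement."""
    # Claude explores and drafts the spec.
    spec = ask_claude(f"Draft a spec for: {task}")
    # Codex pushes back; Claude revises. Repeat for a few rounds.
    for _ in range(rounds):
        feedback = ask_codex(f"Review this spec:\n{spec}")
        spec = ask_claude(f"Revise the spec given this feedback:\n{feedback}")
    # Once the design settles, Claude implements and Codex reviews the code.
    code = ask_claude(f"Implement this spec:\n{spec}")
    review = ask_codex(f"Review this implementation:\n{code}")
    return code, review
```

The human's role collapses to supplying `task` and reading the final `review`, which matches the "build this" / "looks good, ship it" division of labor described above.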

I'm basically just the person who says "build this" and "looks good, ship it."

What Surprised Me

The biggest surprise wasn't speed. It was quality. When you use one AI to check another's work, you catch bugs that neither would catch alone. Codex finds logical issues in Claude's output. And because it's reviewing generated code instead of writing from scratch, it's faster and more focused.

The other surprise was how much I learned just by watching. The two models debate implementation options and tradeoffs with each other. Claude suggests one approach, Codex pushes back with a different one, and they go back and forth. I end up learning about patterns, edge cases, and alternatives I never would have found on my own.

The Honest Downsides

It's not all magic. Raw shipping velocity drops. All those review rounds and back-and-forth debates between models take time. If you just need something out the door fast and don't care about polish, a single agent is quicker.

Token costs are real too. You're burning through both Claude and Codex tokens on every task. For side projects that's fine. For a team running this at scale, it adds up.

Where This Is Going

Having two models work together has made me more hands-off with the nitty-gritty of software than I ever expected. I spend most of my time now on what to build, not how to build it. The actual coding, the reviews, the back-and-forth on implementation details, that all happens between the models. I'm starting to believe that writing everyday software is largely a solved problem now.

For now, if you're only using one AI coding tool, try running two. Figure out what each one is best at. Build the workflow around their strengths.

The results might surprise you.