If you give an LLM a code-aware index, can it explore a codebase more efficiently than normal file reading and text search?
By "more efficiently," I mean things a user can actually feel: less tokens spent, fewer round-trips, fewer tool calls, less time waiting, without losing answer quality.
To test this, I forked a project called CodeRLM and turned it into an MCP server. CodeRLM builds an AST index of your codebase and exposes structured code-navigation tools (caller tracing, symbol search, method extraction, dependency graphs) so the model can ask precise questions about the code instead of grepping around and reading files.
I ran a series of benchmarks on a private Python codebase related to prediction-market trading and market making. The repo is not public, but it was large and messy enough that caller lookup and code navigation were non-trivial.
After all the experiments, the answer ended up being narrower than I expected:
Code-aware tools only help when they take some reasoning burden off the model.
The strongest result came from one focused tool, coderlm_callers, exposed directly over MCP. It returns scoped, structured caller information for any symbol in the codebase so the model doesn't have to grep and read its way to the answer. It cut effective input tokens by 39%.
The setups that performed worse than baseline had the opposite problem: they added overhead without saving enough reasoning work. Replacing Claude Code's built-in tools with a broad CodeRLM suite made everything slower. Wrapping the same caller lookup in a skill added startup cost on every run that erased the gains.
The goal was not to prove that "more tools are better." It was to see whether a structured code index could reduce wasted work.
A normal code-search workflow is: grep for a name, read a file, read another file, guess which matches matter, repeat. That works fine for a lot of tasks. But it gets clumsy when you need to find every call site of an exact method, separate real callers from test doubles, filter out irrelevant directories, or figure out which enclosing function each call belongs to. Those are the kinds of tasks where a code-aware tool should have an edge.
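The structured alternative can be sketched with Python's `ast` module. This is a toy, single-file version of what an AST-based caller index presumably does under the hood (the post doesn't show CodeRLM's internals, so the function name and shape here are illustrative, not the real implementation):

```python
import ast

def find_callers(source: str, method_name: str):
    """Return (enclosing_function, line) for each call site of method_name.

    A toy sketch of AST-based caller lookup: unlike grep, it matches only
    real call expressions and knows which function each call sits inside.
    A real index would do this across the whole repo and cache the results.
    """
    tree = ast.parse(source)
    callers = []

    class Visitor(ast.NodeVisitor):
        def __init__(self):
            self.enclosing = "<module>"

        def visit_FunctionDef(self, node):
            prev, self.enclosing = self.enclosing, node.name
            self.generic_visit(node)
            self.enclosing = prev

        def visit_Call(self, node):
            # Match both bare calls (foo()) and attribute calls (obj.foo())
            name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
            if name == method_name:
                callers.append((self.enclosing, node.lineno))
            self.generic_visit(node)

    Visitor().visit(tree)
    return callers

code = """
def place_order(book):
    return book.submit()

def cancel_all(book):
    book.submit()  # a second call site
    book.log("cancelled")
"""
print(find_callers(code, "submit"))  # [('place_order', 3), ('cancel_all', 6)]
```

Grep for `submit` in the same snippet would also hit definitions, comments, and string literals; the AST walk returns only call expressions, each already tagged with its enclosing function.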
| Experiment | Setup | Effective-token reduction | Outcome | Why |
|---|---|---|---|---|
| 1. Broad replacement | CodeRLM replaces built-in tools | -13% (worse) | More tokens, more API calls, slower, lower pass rate | Claude already knows Read/Grep/Glob well; replacing them with unfamiliar tools added decision overhead and the specialized features went unused |
| 2. Focused caller tool | coderlm_callers isolated via MCP with caller-only tasks | +39.3% (better) | Full compliance, full correctness | Inspecting earlier runs showed coderlm_callers was doing the most useful work; isolating it with scoped, metadata-rich results replaced multiple grep-read-infer cycles |
| 3. Skill wrapper | Same tool wrapped in a Claude Code skill | -4.3% (worse) | Worked correctly, but gains erased | Fixed per-session overhead (skill load, CLI init, output parsing) cost more tokens than the tool saved, visible even on the control task |
Three takeaways:
- coderlm_callers reduced reasoning burden directly. Without it, Claude had to grep for a method name, read each matching file, figure out which hits were actual call sites vs. unrelated mentions, and infer the enclosing function for each, often across multiple rounds of tool calls and file reads. With coderlm_callers, one scoped call returned the filtered caller list with enclosing-function metadata already attached. That eliminated the grep-read-infer loop entirely, which is where the 39% token reduction came from.
- coderlm_grep and coderlm_search did not provide any reasoning benefit over the built-in Grep and Glob that Claude already knows how to use well. They just added more options to choose from without making the model's job easier.
- Broad tool suites and heavy wrappers both made things worse. Anything added on top has to justify its overhead.

I kept the model fixed and changed the tool setup. Every experiment compared a CodeRLM treatment against a baseline Claude Code session using the normal built-in tools: Read, Grep, Glob.
The main metric was effective input tokens: how much text the model had to consume, accounting for cache effects. Lower means cheaper and faster. I also tracked API calls, tool calls, wall-clock time, and pass rate.
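The post doesn't spell out the exact formula, but one reasonable way to compute an effective figure from Anthropic-style usage fields is to cost-weight the cached portions. The discount and premium below are assumptions based on Anthropic's published prompt-caching pricing (cache reads billed at roughly 10% of the base input rate, cache writes at a 25% premium), not numbers from the benchmarks:

```python
def effective_input_tokens(usage: dict,
                           cache_read_discount: float = 0.10,
                           cache_write_premium: float = 1.25) -> float:
    """Cost-weighted input tokens for one API call.

    Field names follow the Anthropic Messages API usage object; the
    discount/premium values are assumptions, not figures from the post.
    """
    return (usage.get("input_tokens", 0)
            + cache_write_premium * usage.get("cache_creation_input_tokens", 0)
            + cache_read_discount * usage.get("cache_read_input_tokens", 0))

# A call that reads mostly from cache is far cheaper than its raw size suggests.
usage = {"input_tokens": 1_200,
         "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 40_000}
print(effective_input_tokens(usage))  # 1200 + 0.10 * 40000 = 5200.0
```

Summing this per call over a session gives one number that tracks what the session actually cost, which is why it works better than raw token counts for comparing tool setups.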
The benchmark itself evolved as the question got sharper.
The first benchmark gave Claude a broad CodeRLM tool suite and removed the normal exploration tools. The idea sounded reasonable. If CodeRLM has better code-aware tools, maybe it should replace Read, Grep, and Glob.
It did not work. The setup used about 13% more tokens, 11% more API calls, ran 7% slower, and had a worse pass rate.
The problem was not that CodeRLM was useless. It was that this was the wrong product shape. When I looked at the logs, Claude mostly used CodeRLM as a clumsy substitute for things it already knew how to do, like reading files and grepping text. The specialized tools were barely touched. So this benchmark mostly measured tool sprawl and decision overhead, not the value of the best CodeRLM features.
The next benchmark kept Claude's normal tools and added CodeRLM beside them. This was immediately better. Both treatments passed all runs, with a +7.3% weighted effective-input-token reduction in favor of CodeRLM.
Promising, but still muddy. Some tasks benefited, some didn't, and some barely used CodeRLM at all. The benchmark mixed caller lookup, dependency lookup, general code reading, and implementation tasks into one suite, so it was hard to tell what was actually driving the improvement.
The second benchmark narrowed the answer but didn't get to the question I cared about most: does the caller-finding tool itself actually help?
Inspecting the logs from the earlier runs showed a clear pattern: across all the CodeRLM tools, coderlm_callers was doing the most useful work. The other tools were either ignored or used as clumsy substitutes for things Claude could already do natively.
That led to the most important change in the project. I narrowed the benchmark to a single tool, coderlm_callers, and a set of pure caller-discovery tasks. Baseline: normal code search. Treatment: use coderlm_callers first.
The result was a +39.3% weighted effective-input-token reduction, with full compliance and full correctness. This was the strongest result in the whole project by a wide margin. The scoped, metadata-rich results from the caller tool replaced multiple grep-read-infer cycles. The tool did the filtering so Claude didn't have to.
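To make "scoped, metadata-rich" concrete: the post doesn't document the tool's actual schema, but a hypothetical coderlm_callers response might look something like this, with the test-vs-production filtering already encoded as metadata instead of left for the model to infer by reading files:

```python
# Hypothetical shape of a coderlm_callers result -- the real schema is
# not shown in the post; this is an illustration of why one structured
# response can replace several grep-read-infer cycles.
response = {
    "symbol": "OrderBook.submit",
    "callers": [
        {"file": "engine/execution.py", "line": 88,
         "enclosing_function": "place_order", "kind": "production"},
        {"file": "engine/risk.py", "line": 142,
         "enclosing_function": "hedge_position", "kind": "production"},
        {"file": "tests/test_execution.py", "line": 31,
         "enclosing_function": "test_place_order", "kind": "test"},
    ],
}

# The filtering the model would otherwise do by reading each file is a
# one-line list comprehension over the metadata:
production_callers = [c for c in response["callers"] if c["kind"] == "production"]
print([c["enclosing_function"] for c in production_callers])
```

Every field here (file, line, enclosing function, caller kind) is something the baseline workflow had to reconstruct from raw grep hits, one file read at a time.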
After the MCP result looked strong, I wanted to test whether wrapping the same tool in a Claude Code skill would perform even better. A skill might feel more native and avoid MCP discovery overhead.
The first skill run was contaminated by usage-quota failures, so I threw it out and reran. The clean rerun gave a clear answer: a -4.3% weighted effective-input-token reduction, i.e. slightly more tokens than baseline. The skill worked reliably and the model used it correctly every time, but it was worse than the direct MCP version.
The problem was fixed startup cost. On each fresh session, the skill path had to pay for skill invocation, instructions loaded into context, CLI init, the callers command, and shell output parsing. That overhead erased the gains from skipping MCP tool discovery.
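This is ordinary break-even arithmetic. The numbers below are made up for illustration, not measured values from the benchmark, but they show the shape of the problem: a fixed per-session cost has to be amortized over enough lookups before the wrapper pays for itself.

```python
# Break-even sketch for a wrapped tool (illustrative numbers only).
fixed_overhead = 2_500     # skill instructions + CLI init + parsing, per session
saving_per_lookup = 1_800  # tokens saved vs. grep-read-infer, per caller lookup

for lookups in (0, 1, 2, 3):
    net = lookups * saving_per_lookup - fixed_overhead
    print(f"{lookups} lookup(s): net saving {net:+} tokens")
# 0 lookup(s): net saving -2500 tokens
# 1 lookup(s): net saving -700 tokens
# 2 lookup(s): net saving +1100 tokens
# 3 lookup(s): net saving +2900 tokens
```

The zero-lookup row is exactly the control-task case: a session that does no caller work still pays the full wrapper cost.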
The cleanest proof was the control task. Even when no code-search work was needed, the skill treatment still cost much more input than the baseline. A meaningful part of the cost came from the wrapper itself, not the work.
So the skill experiment answered a different question than I expected. It wasn't "is caller lookup useful?" It was "is this packaging of caller lookup cheaper?" And in this benchmark shape, the answer was no.
I started by asking whether a code-aware index makes an LLM more efficient. The answer is yes, but only when the tool takes real reasoning burden off the model.
That sounds obvious. The experiments made it concrete.
Every CodeRLM setup that lost had the same problem: it added overhead (more tools to choose from, more instructions to parse, more startup cost) without removing enough reasoning work. The setup that won did the opposite: one focused tool, clear instructions, useful structured output, thin interface.
If I had to reduce the whole project to one sentence:
The best use of CodeRLM was not "more tools." It was one narrow tool, delivered through the thinnest possible interface, that took real reasoning burden off the model.