If you give an LLM a code-aware index, can it explore a codebase more efficiently than normal file reading and text search?
By "more efficiently," I mean things a user can actually feel: less tokens spent, fewer round-trips, fewer tool calls, less time waiting, without losing answer quality.
To test this, I forked a project called CodeRLM and turned it into an MCP server. CodeRLM builds an AST index of your codebase and exposes structured code-navigation tools (caller tracing, symbol search, method extraction, dependency graphs) so the model can ask precise questions about the code instead of grepping around and reading files.
I ran a series of benchmarks on a private Python codebase related to prediction-market trading and market making. The repo is not public, but it was large and messy enough that caller lookup and code navigation were non-trivial.
After all the experiments, the answer ended up being narrower than I expected:
Code-aware tools only help when they take some reasoning burden off the model.
The strongest result came from one focused tool, coderlm_callers, exposed directly over MCP. It returns scoped, structured caller information for any symbol in the codebase so the model doesn't have to grep and read its way to the answer. It cut effective input tokens by 39%.
The setups that performed worse than baseline had the opposite problem: they added overhead without saving enough reasoning work. Replacing Claude Code's built-in tools with a broad CodeRLM suite made everything slower. Wrapping the same caller lookup in a skill added startup cost on every run that erased the gains.
The goal was not to prove that "more tools are better." It was to see whether a structured code index could reduce wasted work.
A normal code-search workflow is: grep for a name, read a file, read another file, guess which matches matter, repeat. That works fine for a lot of tasks. But it gets clumsy when you need to find every call site of an exact method, separate real callers from test doubles, filter out irrelevant directories, or figure out which enclosing function each call belongs to. Those are the kinds of tasks where a code-aware tool should have an edge.
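The structured alternative can be sketched with Python's `ast` module. This is a toy, single-file version of what an AST-based caller index presumably does under the hood (the post doesn't show CodeRLM's internals, so the function name and shape here are illustrative, not the real implementation):

```python
import ast

def find_callers(source: str, method_name: str):
    """Return (enclosing_function, line) for each call site of method_name.

    A toy sketch of AST-based caller lookup: unlike grep, it matches only
    real call expressions and knows which function each call sits inside.
    A real index would do this across the whole repo and cache the results.
    """
    tree = ast.parse(source)
    callers = []

    class Visitor(ast.NodeVisitor):
        def __init__(self):
            self.enclosing = "<module>"

        def visit_FunctionDef(self, node):
            prev, self.enclosing = self.enclosing, node.name
            self.generic_visit(node)
            self.enclosing = prev

        def visit_Call(self, node):
            # Match both bare calls (foo()) and attribute calls (obj.foo())
            name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
            if name == method_name:
                callers.append((self.enclosing, node.lineno))
            self.generic_visit(node)

    Visitor().visit(tree)
    return callers

code = """
def place_order(book):
    return book.submit()

def cancel_all(book):
    book.submit()  # a second call site
    book.log("cancelled")
"""
print(find_callers(code, "submit"))  # [('place_order', 3), ('cancel_all', 6)]
```

Grep for `submit` in the same snippet would also hit definitions, comments, and string literals; the AST walk returns only call expressions, each already tagged with its enclosing function.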
| Experiment | Setup | Effective-token reduction | Outcome | Why |
|---|---|---|---|---|
| 1. Broad replacement | CodeRLM replaces built-in tools | -13% (worse) | More tokens, more API calls, slower, lower pass rate | Claude already knows Read/Grep/Glob well; replacing them with unfamiliar tools added decision overhead and the specialized features went unused |
| 2. Focused caller tool | coderlm_callers isolated via MCP with caller-only tasks | +39.3% (better) | Full compliance, full correctness | Inspecting earlier runs showed coderlm_callers was doing the most useful work; isolating it with scoped, metadata-rich results replaced multiple grep-read-infer cycles |
| 3. Skill wrapper | Same tool wrapped in a Claude Code skill | -4.3% (worse) | Worked correctly, but gains erased | Fixed per-session overhead (skill load, CLI init, output parsing) cost more tokens than the tool saved, visible even on the control task |
Three takeaways:
- coderlm_callers reduced reasoning burden directly. Without it, Claude had to grep for a method name, read each matching file, figure out which hits were actual call sites vs. unrelated mentions, and infer the enclosing function for each, often across multiple rounds of tool calls and file reads. With coderlm_callers, one scoped call returned the filtered caller list with enclosing-function metadata already attached. That eliminated the grep-read-infer loop entirely, which is where the 39% token reduction came from.
- coderlm_grep and coderlm_search did not provide any reasoning benefit over the built-in Grep and Glob that Claude already knows how to use well. They just added more options to choose from without making the model's job easier.
- Broad tool suites and heavy wrappers both made things worse. Anything added on top has to justify its overhead.

I kept the model fixed and changed the tool setup. Every experiment compared a CodeRLM treatment against a baseline Claude Code session using the normal built-in tools: Read, Grep, Glob.
The main metric was effective input tokens: how much text the model had to consume, accounting for cache effects. Lower means cheaper and faster. I also tracked API calls, tool calls, wall-clock time, and pass rate.
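The post doesn't spell out the exact formula, but one reasonable way to compute an effective figure from Anthropic-style usage fields is to cost-weight the cached portions. The discount and premium below are assumptions based on Anthropic's published prompt-caching pricing (cache reads billed at roughly 10% of the base input rate, cache writes at a 25% premium), not numbers from the benchmarks:

```python
def effective_input_tokens(usage: dict,
                           cache_read_discount: float = 0.10,
                           cache_write_premium: float = 1.25) -> float:
    """Cost-weighted input tokens for one API call.

    Field names follow the Anthropic Messages API usage object; the
    discount/premium values are assumptions, not figures from the post.
    """
    return (usage.get("input_tokens", 0)
            + cache_write_premium * usage.get("cache_creation_input_tokens", 0)
            + cache_read_discount * usage.get("cache_read_input_tokens", 0))

# A call that reads mostly from cache is far cheaper than its raw size suggests.
usage = {"input_tokens": 1_200,
         "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 40_000}
print(effective_input_tokens(usage))  # 1200 + 0.10 * 40000 = 5200.0
```

Summing this per call over a session gives one number that tracks what the session actually cost, which is why it works better than raw token counts for comparing tool setups.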
The benchmark itself evolved as the question got sharper.
The first benchmark gave Claude a broad CodeRLM tool suite and removed the normal exploration tools. The idea sounded reasonable. If CodeRLM has better code-aware tools, maybe it should replace Read, Grep, and Glob.
It did not work. The setup used about 13% more tokens, 11% more API calls, ran 7% slower, and had a worse pass rate.
The problem was not that CodeRLM was useless. It was that this was the wrong product shape. When I looked at the logs, Claude mostly used CodeRLM as a clumsy substitute for things it already knew how to do, like reading files and grepping text. The specialized tools were barely touched. So this benchmark mostly measured tool sprawl and decision overhead, not the value of the best CodeRLM features.
The next benchmark kept Claude's normal tools and added CodeRLM beside them. This was immediately better. Both treatments passed all runs, with a +7.3% weighted effective-input-token reduction in favor of CodeRLM.
Promising, but still muddy. Some tasks benefited, some didn't, and some barely used CodeRLM at all. The benchmark mixed caller lookup, dependency lookup, general code reading, and implementation tasks into one suite, so it was hard to tell what was actually driving the improvement.
The second benchmark narrowed the answer but didn't get to the question I cared about most: does the caller-finding tool itself actually help?
Inspecting the logs from the earlier runs showed a clear pattern: across all the CodeRLM tools, coderlm_callers was doing the most useful work. The other tools were either ignored or used as clumsy substitutes for things Claude could already do natively.
That led to the most important change in the project. I narrowed the benchmark to a single tool, coderlm_callers, and a set of pure caller-discovery tasks. Baseline: normal code search. Treatment: use coderlm_callers first.
The result was a +39.3% weighted effective-input-token reduction, with full compliance and full correctness. This was the strongest result in the whole project by a wide margin. The scoped, metadata-rich results from the caller tool replaced multiple grep-read-infer cycles. The tool did the filtering so Claude didn't have to.
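To make "scoped, metadata-rich" concrete: the post doesn't document the tool's actual schema, but a hypothetical coderlm_callers response might look something like this, with the test-vs-production filtering already encoded as metadata instead of left for the model to infer by reading files:

```python
# Hypothetical shape of a coderlm_callers result -- the real schema is
# not shown in the post; this is an illustration of why one structured
# response can replace several grep-read-infer cycles.
response = {
    "symbol": "OrderBook.submit",
    "callers": [
        {"file": "engine/execution.py", "line": 88,
         "enclosing_function": "place_order", "kind": "production"},
        {"file": "engine/risk.py", "line": 142,
         "enclosing_function": "hedge_position", "kind": "production"},
        {"file": "tests/test_execution.py", "line": 31,
         "enclosing_function": "test_place_order", "kind": "test"},
    ],
}

# The filtering the model would otherwise do by reading each file is a
# one-line list comprehension over the metadata:
production_callers = [c for c in response["callers"] if c["kind"] == "production"]
print([c["enclosing_function"] for c in production_callers])
```

Every field here (file, line, enclosing function, caller kind) is something the baseline workflow had to reconstruct from raw grep hits, one file read at a time.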
After the MCP result looked strong, I wanted to test whether wrapping the same tool in a Claude Code skill would perform even better. A skill might feel more native and avoid MCP discovery overhead.
The first skill run was contaminated by usage-quota failures, so I threw it out and reran. The clean rerun gave a clear answer: a -4.3% weighted effective-input-token reduction, i.e. slightly more tokens than baseline. The skill worked reliably and the model used it correctly every time, but it was worse than the direct MCP version.
The problem was fixed startup cost. On each fresh session, the skill path had to pay for skill invocation, instructions loaded into context, CLI init, the callers command, and shell output parsing. That overhead erased the gains from skipping MCP tool discovery.
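This is ordinary break-even arithmetic. The numbers below are made up for illustration, not measured values from the benchmark, but they show the shape of the problem: a fixed per-session cost has to be amortized over enough lookups before the wrapper pays for itself.

```python
# Break-even sketch for a wrapped tool (illustrative numbers only).
fixed_overhead = 2_500     # skill instructions + CLI init + parsing, per session
saving_per_lookup = 1_800  # tokens saved vs. grep-read-infer, per caller lookup

for lookups in (0, 1, 2, 3):
    net = lookups * saving_per_lookup - fixed_overhead
    print(f"{lookups} lookup(s): net saving {net:+} tokens")
# 0 lookup(s): net saving -2500 tokens
# 1 lookup(s): net saving -700 tokens
# 2 lookup(s): net saving +1100 tokens
# 3 lookup(s): net saving +2900 tokens
```

The zero-lookup row is exactly the control-task case: a session that does no caller work still pays the full wrapper cost.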
The cleanest proof was the control task. Even when no code-search work was needed, the skill treatment still cost much more input than the baseline. A meaningful part of the cost came from the wrapper itself, not the work.
So the skill experiment answered a different question than I expected. It wasn't "is caller lookup useful?" It was "is this packaging of caller lookup cheaper?" And in this benchmark shape, the answer was no.
I started by asking whether a code-aware index makes an LLM more efficient. The answer is yes, but only when the tool takes real reasoning burden off the model.
That sounds obvious. The experiments made it concrete.
Every CodeRLM setup that lost had the same problem: it added overhead (more tools to choose from, more instructions to parse, more startup cost) without removing enough reasoning work. The setup that won did the opposite: one focused tool, clear instructions, useful structured output, thin interface.
If I had to reduce the whole project to one sentence:
The best use of CodeRLM was not "more tools." It was one narrow tool, delivered through the thinnest possible interface, that took real reasoning burden off the model.