Skip to main content

Command Palette

Search for a command to run...

How I Built a Codebase Q&A Tool in 2 Days (And What I'd Do Differently)

Updated
10 min readView as Markdown
How I Built a Codebase Q&A Tool in 2 Days (And What I'd Do Differently)

Large codebases are intimidating.

Isn't it? Whether you are a new contributor trying to understand the world of open source and have come across an unfamiliar repository, or a developer returning to your own project after months away and being clueless for the first few hours. You spend more time searching than building.

You jump between files and folders trying to understand the structure. You search for function names, check the imports, find the source, multiple functions with layers of abstraction and more. Good documentation does help to ease the process a bit, but still a lot of work.

I faced the same issue. So I thought, let's fix it.

So I built CodeRepo.

Instead of manually digging through a repository, you simply paste a GitHub URL and start asking questions in plain english.

Now I know some of you may be thinking why not just put the repo url or code files into chatgpt/claude or any other LLM and ask questions. I know and i even did that but faced some issues which CodeRepo fixes. I will explain why CodeRepo is different in this article.

So, moving back, if you want to know how authentication works? Ask it.

Need to find where a particular API is implemented? Ask it.

Trying to understand the flow of a request through the application? Ask it.

CodeRepo analyzes the repository, retrieves the relevant pieces of code, and responds with explanations backed by exact file and line citations so that you can verify every answer yourself. Responses are streamed in real time, making the experience feel more like talking to an engineer who already knows the codebase than waiting for a static search result.

The Architecture

At a higher level overview i have kept the architecture pretty simple.

Note: the app is still in its very early phase. As we grow the architecture will become diverse.

  1. The user pastes a GitHub repository URL

  2. The backend clones the repository from the URL and runs it through an ingestion pipeline.

  3. Every source file is parsed, cleaned, chunked and converted into vector embeddings before being stored in a vector database.

  4. Now, when a user asks a question, a LangGraph-powered agent retreives the most relevant chunks, reasons over them, generates an answer and streams it back to the frontend for the user to view along with precise file and line citations.

This is the entire flow. Now coming to the tech stack:

The Tech Stack

For the frontend, React is used and the backend is built with Django which handles repository ingestion, chat requests and communication with the vector database.

For database, I have used ChromaDB as the vector store and used OpenAI embeddings to store each code chunk semantically, allowing the system to retreive the relevant code based on meaning rather than simple keyword matching.

Finally, there is a LangGraph agent that plans and carries out the retreival and reasoning process.

So, instead of blindly passing retireved chunks to the LLM and generating answer, it follows a structured workflow that retreives context, synthesizes an answer, attaches citations, and streams the response back to the user in real time.

Simple on paper. The interesting stuff is in the details.

The interesting engineering challenges weren't in connecting these components together. The interesting challenges were making the ingestion pipeline reliable, preserving enough context during the chunking process, ensuring the citations stayed accurate and keeping the entire experience responsive even for larger repositories containing thousands of files.

Let's dive into those details.

The Chunking Process (Imp One)

While building the project, one of the important decisions i had to make was improve the chunking process, because the answer quality would depend on it.

Mostly, beginners RAG (Retrieval-Augmented Generation) explanations recommend splitting documents by character count or token limit. Take every N characters, add an overlap of M characters and generate embeddings. Then you are done.

That approach works well for prose because paragraphs dont have tight boundaries. Even if you split a paragraph in half, the surrounding text often provides enough context for the model to capture the meaning.

Let us take an example to understand it more clearly.

Chunk 1: "Most RAG tutorials recommend splitting documents.."

Chunk 2: "...splitting documents by character count or token limit.."

Chunk 3: "...character count or token limit. Take every N characters.."

Even though here each chunk overlaps and cuts sentences in uneven places, the semantic meaning is still preserved well for retrieval to work.

But code is different.

Code has a structure. A function has a structure, a class has a structure that represents a collection. Splitting either of them in the middle creates chunks that no longer give a complete idea of what it is. This will result in embeddings becoming representations of incomplete thoughts. This way the LLM won't understand the context and won't produce proper output.

So, instead of chunking by size, I decided to chunk by syntax. I used tree-sitter to parse every source file into an AST (Abstract Syntax Tree). So, rather than counting characters, it reads the entire tree and extracts classes, functions, methods, and definitions as individual chunks and stores them.

This approach made it easier to store metadata. Every chunk stores its file path along with its starting and ending line numbers. This metadata helps with generating citations, which are shown in the frontend response.

But this approach isn't free. Unlike simple text splitting, Tree-sitter requires a grammar for every programming language you want to support. Currently, we support Python and JavaScript only as most of the repositories use these 2 languages primarily. We will soon add other languages.

This was probably the most important architectural decision for the project as this affects the answer quality the user is going to receive.

Self-Correcting Agent

Retrieval is only half of the problem.

Even if the retriever returns the right pieces of code, there is still no guarantee that the LLM will produce the right output. LLM's are excellent at filling in gaps and sometimes they would assume and infer details that aren't present in the retrieved context. For this project that's a big problem.

Suppose if the assistant introduces a new function name, misattributes a piece of logic or explains behaviour that doesn't actually exist in the repository then the answer becomes misleading. I wanted the system to be conservative, if it couldn't justify a claim from a retrieved code, then it should not make the claim in the first place.

To implement this, I built the agent as a small workflow using LangGraph.

self correcting agent diagram

Each node has a single responsibility. The Router receives the user's question and decides how it should be handled.

The Retriever performs a similarity search against ChromaDB and returns the most relevant code chunks.

The synthesiser takes those chunks and generates a natural language answer. Since every chunk carries metadata such as its file path and line numbers, it helps the synthesiser to attach precise citations alongside each explanation.

Now, when it reaches the critic node, here instead of considering the first response that the LLM produces, a second call is made whose job is to verify the first response. The critic receives three inputs: the user's question, the retrieved chunks and the proposed answer.

The critic responds with a structured JSON.

{ "grounded": true }
{ "grounded": false, "hint": "The answer mentions middleware, but none of the retrieved chunks contain middleware logic." }

If the answer isn't grounded, the workflow doesn't immediately return it to the user. Instead the critic feedback is passed back to the synthesiser, which gets another chance to answer while taking the hint into account.

I stop the process at two retries. If the answer is still not fully grounded after that, the best available response is returned rather than looping indefinitely.

This isn't a perfect hallucination detector. The critic is still another LLM, so it can make mistakes of its own. But in practice it catches many of the obvious failure cases.

To keep the workflow manageable, every node shares a single TypedDict state object. It carries everything the graph needs as it moves from node to node: the user's question, the retrieved chunks, the current answer, citations, retry count, whether the answer is grounded, and any corrective hint from the critic.

Keeping all state explicit instead of hidden inside nodes made the graph much easier to reason about and, more importantly, much easier to debug when something inevitably went wrong.

Why Not Just Use ChatGPT or Claude?

This is probably the first question that comes to mind. If modern LLMs can already read code, why build another tool?

I actually started there. I tried uploading repositories, pasting files into ChatGPT and Claude, and asking questions directly. They worked well for small examples, but I kept running into the same problems as the repositories grew.

The first issue is context. Large repositories simply don't fit into a single prompt. You end up deciding which files to include and which to leave out, hoping you've provided enough information for the model to answer correctly. If the important piece of code lives in a file you didn't include, the answer is likely to be incomplete or wrong.

The second issue is manual exploration. Before asking a question, you often have to spend time figuring out which files are relevant. Ironically, this is exactly the problem you're trying to solve in the first place.

The third issue is trust. When an LLM says, "This function calls another helper," how do you know that's actually true? Without precise references back to the source code, verifying an answer means manually searching through the repository yourself.

CodeRepo approaches the problem differently.

Instead of asking the user to decide what context to provide, it indexes the entire repository once. Every question automatically retrieves only the relevant code before generating an answer. Because every chunk retains its file path and line numbers, the response includes citations that point directly to the implementation, making it easy to verify every claim.

In other words, CodeRepo isn't trying to replace any LLM. Rather it uses an LLM under the hood. What it adds is everything around the model: repository ingestion, syntax-aware chunking, semantic retrieval, grounding, and verifiable citations. Those pieces are what make an LLM genuinely useful for understanding large codebases rather than just reading isolated files.

What's missing?

Right now this project is still an MVP. There are several parts of the architecture i would change before making it production ready.

The biggest limitation right now is that the ingestion pipeline is synchronous. When a user submits a repository, the request blocks while the repository is cloned, parsed, chunked, embedded and indexed. For smaller repositories it works fine but for larger ones it can leave users waiting for an uncomfortable amount of time. A better approach would be to move ingestion into a background worker and allowing the frontend to report progress while processing happens asynchronously.

There is also no repository caching. If two users paste the same Github URL, the system clones and indexes the repository from scratch each time. Repositories should be identified by their commit hash or default branch, allowing previously indexed versions to be reused instead of repeating the entire ingestion pipeline.

Gradually I will add all of these features along with few smaller improvements as well.

What's next

The code is on GitHub: github.com/anik-bin/coderepo

If you're building something similar or have any feedback, I'd like to hear it.