AstraGraph

Problem

LLMs are useful for code Q&A but have a fundamental limit: they can’t reason about structure they can’t see. A standard RAG system chunks files into text and retrieves by semantic similarity, so it finds relevant snippets but discards the structural relationships between them. It can’t reliably answer “what functions does ModelTrainer call?” or “which classes inherit from BaseEncoder?” without the call graph and the class hierarchy.

The goal was to build a retrieval system that captures both the semantic meaning of code and its structural relationships — then lets a query agent choose which retrieval mode to use.

Ingestion pipeline

Ingestion is a two-pass tree-sitter AST pipeline that processes a Python codebase into two stores at once:

Neo4j property graph — models the full entity hierarchy:

Repository → Package → Module → Class/Function → Attribute/Parameter

Cross-cutting edges: CALLS, INHERITS, IMPORTS. All writes are idempotent because node IDs are deterministic UUIDs derived from MD5 hashes, so re-ingesting the same codebase is safe. Unresolved edges (e.g. calls to external libraries) are recorded as audit nodes rather than silently dropped.
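A minimal sketch of how deterministic IDs make writes idempotent. The source does not say exactly what gets hashed, so hashing the entity kind plus its qualified name is an assumption here, as are the `NAMESPACE` value, the function names, and the dict that stands in for the graph store. `uuid.uuid3` is the standard library's MD5-based name UUID.

```python
import uuid

# Hypothetical namespace for this sketch; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "astragraph.example")


def entity_id(kind: str, qualified_name: str) -> str:
    """Derive a stable ID from an entity's kind and qualified name.

    uuid.uuid3 hashes (namespace, name) with MD5, so equal inputs
    always produce the same UUID.
    """
    return str(uuid.uuid3(NAMESPACE, f"{kind}:{qualified_name}"))


def upsert(store: dict, kind: str, qualified_name: str, props: dict) -> str:
    """Insert or overwrite a node keyed by its deterministic ID."""
    node_id = entity_id(kind, qualified_name)
    store[node_id] = {"kind": kind, "name": qualified_name, **props}
    return node_id


graph: dict = {}
for _ in range(2):  # ingest the same two entities twice
    upsert(graph, "Class", "app.models.ModelTrainer", {"lineno": 10})
    upsert(graph, "Function", "app.models.ModelTrainer.fit", {"lineno": 12})

print(len(graph))  # 2: the second run overwrote instead of duplicating
```

Against Neo4j, the equivalent write would typically be a Cypher `MERGE` keyed on this ID, which matches an existing node instead of creating a second one.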

Qdrant vector store — stores chunk embeddings (sentence-transformers) for semantic similarity search over docstrings, function bodies, and comments.

Both stores are populated from the same AST traversal rather than by parsing the code once per store, which keeps ingestion fast.
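The extraction step can be sketched roughly as follows. To keep the example dependency-free it uses Python's built-in `ast` module in place of tree-sitter, and it assumes the two passes are "collect entities" then "resolve edges", which the write-up does not spell out. The `unresolved` list is a stand-in for the audit nodes, and all function names are illustrative.

```python
import ast

SOURCE = """
import json

class BaseEncoder:
    def encode(self, x):
        return json.dumps(x)

class TextEncoder(BaseEncoder):
    def encode(self, x):
        return normalise(x)

def normalise(x):
    return x.strip()
"""


def dotted_name(node: ast.expr) -> str:
    """Return a dotted name for a call target or base class, e.g. 'json.dumps'."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return f"{dotted_name(node.value)}.{node.attr}"
    return "<dynamic>"


def functions(tree: ast.Module):
    """Yield (qualified_name, node) for top-level functions and methods."""
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            yield node.name, node
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    yield f"{node.name}.{item.name}", item


def extract(source: str):
    tree = ast.parse(source)
    classes = [n for n in tree.body if isinstance(n, ast.ClassDef)]

    # Pass 1: collect the entities defined in this module.
    defined = {c.name for c in classes} | {name for name, _ in functions(tree)}

    # Pass 2: emit edges; targets that pass 1 did not see go to `unresolved`.
    edges, unresolved = [], []

    def record(src: str, edge_type: str, dst: str) -> None:
        (edges if dst in defined else unresolved).append((src, edge_type, dst))

    for cls in classes:
        for base in cls.bases:
            record(cls.name, "INHERITS", dotted_name(base))
    for name, fn in functions(tree):
        for child in ast.walk(fn):
            if isinstance(child, ast.Call):
                record(name, "CALLS", dotted_name(child.func))
    return edges, unresolved


edges, unresolved = extract(SOURCE)
print(edges)
print(unresolved)
```

Here `TextEncoder INHERITS BaseEncoder` and `TextEncoder.encode CALLS normalise` resolve inside the module, while the call to `json.dumps` lands in `unresolved` instead of disappearing.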

Query agent

A 5-node LangGraph StateGraph agent routes queries across three retrieval modes:

| Mode | Mechanism |
| --- | --- |
| Graph | Cypher queries over Neo4j for structural lookups |
| Vector | Dense retrieval over Qdrant for semantic similarity |
| Hybrid | Both retrievers at 2× top-k, fused with Reciprocal Rank Fusion |

The agent decides which mode to use based on the query. Structural questions (“what does X call?”) route to graph; conceptual questions (“how is authentication handled?”) route to vector; ambiguous queries go hybrid.
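The routing rule can be illustrated with a toy classifier. The write-up does not say how the agent actually classifies a query, so the keyword patterns below are purely a stand-in for that decision step, and `route`, `STRUCTURAL`, and `CONCEPTUAL` are hypothetical names.

```python
import re

# Words that hint at a structural question about code relationships.
STRUCTURAL = re.compile(
    r"\b(calls?|called by|callers?|inherits?|subclass(es)?|imports?)\b", re.I
)
# Words that hint at a conceptual, "how does this work" question.
CONCEPTUAL = re.compile(r"\b(how|why|explain|handled|works?)\b", re.I)


def route(query: str) -> str:
    """Pick a retrieval mode: 'graph', 'vector', or 'hybrid'."""
    structural = bool(STRUCTURAL.search(query))
    conceptual = bool(CONCEPTUAL.search(query))
    if structural and not conceptual:
        return "graph"
    if conceptual and not structural:
        return "vector"
    # Both signals, or neither: treat the query as ambiguous.
    return "hybrid"


for q in (
    "what functions does ModelTrainer call?",
    "how is authentication handled?",
    "how does the trainer call the encoder?",
):
    print(q, "->", route(q))
```

In a LangGraph StateGraph, a function like this would typically sit behind a conditional edge so that the chosen mode is visible in the agent state.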

RRF re-ranks the combined result list without requiring score calibration between the two retrieval systems — an important practical detail since graph traversal relevance and vector similarity scores are not directly comparable.
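A small sketch of the fusion step, to show why no score calibration is needed: RRF only looks at each item's rank in each list. The constant `k=60` is the value commonly used for RRF and is an assumption here, as are the function name and the example hits.

```python
from collections import defaultdict


def rrf(rankings, k=60, top_k=None):
    """Fuse ranked lists of IDs with Reciprocal Rank Fusion.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Raw retriever scores are never used.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_k]


top_k = 2
# Hybrid mode: each retriever returns 2 * top_k candidates.
graph_hits = ["fit", "train_step", "load", "save"]
vector_hits = ["train_step", "evaluate", "fit", "predict"]

print(rrf([graph_hits, vector_hits], top_k=top_k))  # ['train_step', 'fit']
```

Items that appear in both lists accumulate two terms, so agreement between the graph and vector retrievers is rewarded even when neither ranks the item first.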

Architecture decisions

Typed Protocol storage layer: both GraphStore and VectorStore are typed Python protocols, not concrete implementations. The pipeline and agent depend on the protocols — swapping Neo4j for a different graph database, or Qdrant for FAISS, requires no changes to pipeline or agent code.
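A rough sketch of what such a protocol layer could look like. Only the names GraphStore and VectorStore come from the project; the method signatures and the in-memory implementations are assumptions made for illustration.

```python
from typing import Protocol


class GraphStore(Protocol):
    def add_edge(self, src: str, edge_type: str, dst: str) -> None: ...
    def neighbours(self, src: str, edge_type: str) -> list[str]: ...


class VectorStore(Protocol):
    def add(self, chunk_id: str, vector: list[float]) -> None: ...
    def search(self, vector: list[float], top_k: int) -> list[str]: ...


class InMemoryGraph:
    """Satisfies GraphStore structurally; no inheritance required."""

    def __init__(self) -> None:
        self.edges: set[tuple[str, str, str]] = set()

    def add_edge(self, src: str, edge_type: str, dst: str) -> None:
        self.edges.add((src, edge_type, dst))

    def neighbours(self, src: str, edge_type: str) -> list[str]:
        return sorted(d for s, t, d in self.edges if s == src and t == edge_type)


class InMemoryVectors:
    """Satisfies VectorStore with a brute-force dot-product search."""

    def __init__(self) -> None:
        self.vectors: dict[str, list[float]] = {}

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self.vectors[chunk_id] = vector

    def search(self, vector: list[float], top_k: int) -> list[str]:
        def dot(a: list[float], b: list[float]) -> float:
            return sum(x * y for x, y in zip(a, b))

        ranked = sorted(self.vectors, key=lambda c: -dot(vector, self.vectors[c]))
        return ranked[:top_k]


def callees(graph: GraphStore, function_id: str) -> list[str]:
    """Agent-side code depends only on the protocol, not on Neo4j."""
    return graph.neighbours(function_id, "CALLS")


g = InMemoryGraph()
g.add_edge("ModelTrainer.fit", "CALLS", "load_data")
v = InMemoryVectors()
v.add("auth_doc", [1.0, 0.0])
v.add("train_doc", [0.0, 1.0])

print(callees(g, "ModelTrainer.fit"))  # ['load_data']
print(v.search([0.9, 0.1], top_k=1))   # ['auth_doc']
```

Because protocols are matched structurally, a Neo4j-backed or FAISS-backed class only needs the same methods; a type checker then verifies the fit without any shared base class.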

LangGraph over LangChain agents: LangGraph’s explicit StateGraph gives deterministic routing and easy inspection of agent state at each node. Classic LangChain agents are harder to reason about and debug.

Idempotent writes: deterministic UUIDs from content hashes mean that incremental re-ingestion after a code change only writes new or changed entities. This keeps the graph in sync with an evolving codebase without a full re-ingestion.

Evaluation

Benchmarked on the FastAPI repository — a real-world Python project with deep inheritance, complex call graphs, and rich docstrings.

Evaluated on a 22-query dataset covering structural queries (call resolution, inheritance chains) and semantic queries (feature lookup, pattern identification). Hybrid mode consistently outperformed single-mode retrieval on ambiguous queries.

Deployment

  • Backend: FastAPI, containerised with Docker Compose
  • Graph UI: Cytoscape.js for interactive graph visualisation
  • LLM providers: pluggable — Anthropic, Groq, Ollama all supported via a provider interface
  • Deployed to a live VPS at hasssen.xyz