Building AI agents sounds fun until you actually build one. Then a different set of problems shows up — ones nobody writes about.
Here is what I have learned running agent systems in production: self-improvement conflicts with git, most knowledge bases hit a wall sooner than expected, and adding more agents almost never helps.
The self-improvement problem
One of the selling points of agents like Hermes is that they can self-reflect and improve — updating their own rules based on experience. That sounds great until you think about what that means in a real development workflow.
You have git. You have CI pipelines. You have deploys. If an agent updates its rules on a running server and you then deploy from your repo, those changes get overwritten. The agent learns, you deploy, it forgets.
The workaround I landed on: teach the agent to commit its rule changes to a separate learning branch and open a PR. I review it and merge it. That keeps the learning in version control and under review, instead of living as silent state on a server that the next deploy will overwrite.
It is not elegant, but it works.
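A minimal sketch of what that step could look like, assuming the agent can run `git` and the GitHub CLI (`gh`) on the server. The branch name, the `main` base branch, and the `rules/agent.md` path are placeholders for illustration, not details of the setup described above. It defaults to a dry run so the commands can be inspected first.

```python
import subprocess
from typing import List


def learning_branch_commands(branch: str, paths: List[str], message: str,
                             base: str = "main") -> List[List[str]]:
    """Build the commands that publish rule changes for human review.

    Assumes `branch` is unique per learning session and does not exist yet.
    """
    return [
        ["git", "switch", "-c", branch],               # new learning branch
        ["git", "add", "--", *paths],                  # stage only the rule files
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", "origin", branch],
        # Open a PR instead of merging: a human stays in the loop.
        ["gh", "pr", "create", "--fill", "--base", base, "--head", branch],
    ]


def publish_learning(branch: str, paths: List[str], message: str,
                     dry_run: bool = True) -> List[List[str]]:
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    commands = learning_branch_commands(branch, paths, message)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    publish_learning("learning/session-001", ["rules/agent.md"],
                     "Agent: proposed rule update")
```

The point of the design is that nothing the agent learns reaches the deployed rules without passing through the same review path as any other change.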
Knowledge base: markdown is fine until it isn’t
I started with markdown files and a set of instructions for how to index and cross-validate them. It works early on. The problem is that as the knowledge base grows, more and more tokens go toward retrieving the right information — and fewer toward actually solving the problem. Context windows fill up with retrieval noise.
That is when I started looking seriously at alternatives.
Andrej Karpathy’s approach: interconnected markdown in Obsidian
Before reaching for a vector database, it is worth looking at what Karpathy demonstrated: a graph of interconnected markdown files in Obsidian, where the graph view makes the relationships between notes visible. He uses this alongside Claude Code as a knowledge layer for his work.
The idea is simple — notes link to other notes, creating a navigable knowledge graph. An agent can traverse it the same way a human would. No embeddings, no vector index, no retrieval pipeline. Just files and links.
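To make "just files and links" concrete, here is a small sketch of link traversal, assuming Obsidian-style `[[wikilinks]]` and notes already loaded into a dictionary of title to text. The function names and the depth limit are illustrative choices, not part of any particular tool.

```python
import re
from collections import deque
from typing import Dict, List

# Captures the target of [[Note]], [[Note|alias]] and [[Note#Heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def extract_links(text: str) -> List[str]:
    """Return the note titles that a note links to, in order of appearance."""
    return [target.strip() for target in WIKILINK.findall(text)]


def traverse(notes: Dict[str, str], start: str, max_depth: int = 2) -> List[str]:
    """Breadth-first walk over the link graph, at most max_depth hops from start."""
    seen = {start}
    order: List[str] = []
    queue = deque([(start, 0)])
    while queue:
        title, depth = queue.popleft()
        if title not in notes:
            continue  # dangling link: the target note does not exist
        order.append(title)
        if depth == max_depth:
            continue  # do not expand further from this note
        for target in extract_links(notes[title]):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order
```

An agent can call something like `traverse(notes, "Agents", max_depth=1)` to pull in a note and its direct neighbours, which keeps the context bounded without any embedding step.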
I took this further and built what I call a “librarian” agent — a dedicated sub-agent responsible only for managing a Personal Knowledge Management (PKM) vault in Obsidian:
- Indexes notes automatically
- Suggests fixes for orphaned notes (notes with no incoming links)
- Knows how to do research across the knowledge base
- Links related notes and updates adjacent ones when something changes
- Rejects low-quality content that other agents try to push in
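The orphan check from the list above is the easiest of these duties to show in code. This is a sketch under the same assumption of `[[wikilink]]` syntax. The `roots` parameter (entry-point notes that are allowed to have no incoming links) is a hypothetical addition for illustration.

```python
import re
from typing import Dict, Iterable, List

# Captures the target of [[Note]], [[Note|alias]] and [[Note#Heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(notes: Dict[str, str], roots: Iterable[str] = ("Index",)) -> List[str]:
    """Return titles of notes that no other note links to, sorted by title."""
    linked = set()
    for title, text in notes.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target != title:  # a note linking to itself does not count
                linked.add(target)
    allowed = set(roots)
    return sorted(t for t in notes if t not in linked and t not in allowed)
```

The output is a list of candidates only. Deciding where each orphan should be linked from is the part that needs the model.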
After two months: 138 pull requests, 1,766 notes, 791,086 words, 12,688 links. No RAG, no vector database.
The librarian needs a capable model — Sonnet 4.5 or equivalent. A small local model will not hold up for this role.
Vector databases: when you actually need one
If the knowledge base outgrows plain markdown traversal, here is what I have evaluated:
- Qdrant: works well at scale, with good performance on large datasets. My pick for production use cases with significant data volume.
- ChromaDB: good for in-memory work and smaller setups. Easier to get running locally.
- pgvector: a PostgreSQL extension that adds vector search. If you already have Postgres, try this before adopting a dedicated vector database. Much less operational overhead.
- Pinecone: managed, popular, worth evaluating if you want to avoid running your own infrastructure.
- VoyageAI: an embeddings and reranking API rather than a database. It supplies the vectors and sits in front of one of the stores above.
- FAISS / Neo4j: FAISS (a library, not a database server) for pure similarity search at scale; Neo4j if your knowledge has real graph structure where relationships matter more than content similarity.
One thing I keep coming back to: in my experience, file-based search with grep works better than vector search for code and documentation. LLMs know how to write good grep queries. They are less good at formulating vector search queries and interpreting ranked results. My rule of thumb: use vector databases for accumulated personal content and media; use file-based search for code and structured documentation.
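For reference, a file-based search tool can be very small. The sketch below is a pure-Python stand-in for `grep -rn`, written so it could be exposed to an agent as a tool. The function name, the default `*.md` glob, and the case-insensitive default are assumptions for illustration.

```python
import re
from pathlib import Path
from typing import List, Tuple


def grep(root: str, pattern: str, glob: str = "*.md",
         ignore_case: bool = True) -> List[Tuple[str, int, str]]:
    """Return (relative_path, line_number, line) for every line matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                rel = path.relative_to(root).as_posix()
                hits.append((rel, lineno, line.strip()))
    return hits
```

Each hit is an exact file and line number, so a wrong answer can be traced back to the query and the line that produced it. A ranked list of embedding matches does not give you that.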
Local LLMs: viable, with trade-offs
I have spent time running fully local models to avoid subscription costs and keep everything on my own infrastructure.
Qwen 3.6 35B A3B is my current favourite: it runs on 8 GB of VRAM with manageable RAM requirements. It is capable enough for agent work, though slow (40-minute task runs are acceptable for my use case).
Google recently published TurboQuant, a quantization technique that significantly reduces the VRAM/RAM used by the KV cache. Worth watching if you are running larger models locally.
The honest trade-off: local models are cheaper and private, but you will spend time finding one that meets your quality bar. The model that works for one task may not work for the librarian role described above.
Memory and predictability
Something I keep running into that does not get discussed enough: adding vector-based memory to agents makes them harder to debug.
When something goes wrong with a stateless agent, I can look at the request, trace the reasoning, and understand why it made the decision it made. When the agent has memory — especially in a vector database — the “why” becomes opaque. Some retrieved context influenced the output, but which context, and why did it rank that way?
My more conservative position: if you need agent memory, use structured wiki markdown. One top-level document with a table of contents, linking to topic pages. Re-index daily. Keeps everything human-readable and auditable.
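The daily re-index can be a plain script. Here is a sketch that renders the top-level document from a mapping of topic to page titles. The mapping, the heading layout, and the `[[wikilink]]` format are assumptions for illustration; how the topics are collected from disk is left out.

```python
from typing import Dict, List


def build_index(topics: Dict[str, List[str]], title: str = "Agent memory") -> str:
    """Render a top-level table of contents that links to every topic page."""
    lines = [f"# {title}", ""]
    for topic in sorted(topics):
        lines.append(f"## {topic}")
        for page in sorted(topics[topic]):
            lines.append(f"- [[{page}]]")
        lines.append("")  # blank line between topics
    return "\n".join(lines).rstrip() + "\n"
```

Because the output is sorted and deterministic, a daily run produces a small, reviewable diff whenever the memory changes.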
Fewer agents, better prompts
After building my own orchestrator (orqestra), I arrived at a clear conclusion:
Adding more agents does not help. A pipeline where one LLM monitors another LLM which corrects another LLM sounds robust. In practice it compounds errors and makes the system harder to reason about.
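A back-of-envelope way to see the compounding, under a deliberately simple assumption that is not measured data: each stage in the chain is right with the same probability, stages fail independently, and one bad stage spoils the result.

```python
def chain_success(per_stage: float, stages: int) -> float:
    """Probability that every stage in a chain succeeds, assuming independence."""
    if not 0.0 <= per_stage <= 1.0:
        raise ValueError("per_stage must be a probability between 0 and 1")
    if stages < 1:
        raise ValueError("stages must be at least 1")
    return per_stage ** stages


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, round(chain_success(0.9, n), 3))
```

With three stages at 90% each, the chain as a whole lands at roughly 73%. Real monitors can catch some errors, so this is a pessimistic toy model, but it shows why each extra hop has to earn its place.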
What actually works:
- Reduce the number of agents
- Hard-code the pipeline and agent personas in code — do not let the system configure itself dynamically
- Focus on one conversation at a time and build a labelled set of test requests (valid, ambiguous, absurd)
- Optimise at the token level — think in terms of tokens, not sentences
The prompt is the product. Everything else is plumbing.
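As an illustration of the list above, here is a sketch of a hard-coded two-step pipeline with a labelled test set. The prompts, the example requests, and `fake_llm` are all made up for the example. A real setup would pass in a function that calls the model.

```python
from typing import Callable, List, Tuple

# (system_prompt, user_message) -> reply; the model call is injected.
LLM = Callable[[str, str], str]

# Personas live in code, under version control, not in runtime configuration.
TRIAGE_PROMPT = "Classify the request as valid, ambiguous or absurd. Reply with one word."
ANSWER_PROMPT = "Carry out the request. Be brief."

# Labelled test requests: (request, expected triage label).
LABELLED_REQUESTS: List[Tuple[str, str]] = [
    ("Reset the password for user 42", "valid"),
    ("Fix it", "ambiguous"),
    ("Deploy the moon to staging", "absurd"),
]


def run_pipeline(llm: LLM, request: str) -> Tuple[str, str]:
    """Fixed order: triage first, answer only if the request is valid."""
    label = llm(TRIAGE_PROMPT, request).strip().lower()
    if label != "valid":
        return label, ""
    return label, llm(ANSWER_PROMPT, request)


def evaluate(llm: LLM, cases: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (request, expected, got) for every case the triage step gets wrong."""
    failures = []
    for request, expected in cases:
        got, _ = run_pipeline(llm, request)
        if got != expected:
            failures.append((request, expected, got))
    return failures


def fake_llm(system: str, user: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if system == TRIAGE_PROMPT:
        if "moon" in user:
            return "absurd"
        return "Ambiguous " if len(user.split()) < 3 else "valid"
    return "done: " + user
```

Running `evaluate` against the labelled set after every prompt change turns prompt tuning into a regression test rather than a matter of impressions.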
Summary
| Approach | When it works |
|---|---|
| Interconnected markdown + Obsidian | Starting point for most knowledge bases, scales further than expected |
| “Librarian” agent over markdown | Large PKM with automated maintenance, no RAG needed |
| pgvector | Already have Postgres, want vector search without extra infrastructure |
| Qdrant | Large data volume, production use case |
| ChromaDB | Local/in-memory prototyping |
| File-based search (grep) | Code and structured documentation |
| Local LLMs (Qwen 3.6) | Cost control, privacy, acceptable latency trade-off |
| Hard-coded pipeline + fewer agents | When your multi-agent system is producing unpredictable results |
The pattern I keep seeing: the most progress comes not from adding more agents or more sophisticated retrieval pipelines, but from reducing complexity, keeping humans in the loop for learning and memory updates, and spending real time on the prompt.