Giving Claude Code Long-Term Memory with Qdrant

Claude Code is impressive out of the box — it can read your codebase, run commands, and reason through complex engineering tasks. But every time you start a new conversation, it starts from zero. It doesn’t remember that you debugged a BGP flap last Tuesday, that a specific switch needs a delay after commits, or that your customer uses VLAN 500 at a particular site.

I wanted to fix that. So I built cortex-memory — a semantic memory layer that gives Claude Code persistent, searchable long-term memory using vector embeddings.

The problem

Claude Code has a built-in memory system: markdown files in ~/.claude/ that get loaded into the context window at conversation start. It works, but it has limits:

  • It’s manual. Claude decides what to save based on heuristics, and the index file (MEMORY.md) gets truncated once it grows past ~200 lines.
  • It’s keyword-based. Finding relevant context means Claude has to read through the index and guess which files matter.
  • It doesn’t scale. After a few weeks of daily use, you’ve got dozens of memory files competing for limited context window space.
  • No cross-referencing. It can’t correlate a current alert with a ticket you worked on three months ago.

What I wanted was something closer to how human memory works: store everything, retrieve what’s relevant based on meaning, not keywords.

The stack

  • Qdrant — open-source vector database, self-hosted
  • Ollama — local LLM inference server, running the nomic-embed-text embedding model
  • Python — thin scripts for save/recall/search
  • Bash wrapper — CLI interface that Claude Code calls directly
  • Claude Code Skills — the hook that makes it all seamless

The whole thing runs on my home lab. Qdrant stores the vectors, Ollama generates the embeddings, and a handful of Python scripts glue it together.
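
For a sense of how small each piece is: getting an embedding out of Ollama is a single HTTP call. A minimal sketch, assuming Ollama’s default port and its /api/embeddings endpoint (the helper name is mine, not from the scripts):

import requests

def embed(text: str) -> list[float]:
    # Ask the local Ollama server for a nomic-embed-text embedding (768 floats)
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

The later sketches in this post reuse this embed() helper.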

How it works

Saving memories

When something worth remembering happens in a conversation, Claude (or I) can save it:

./memory-api.sh save --type fact --summary "Customer ACME uses VLAN 500 at AMS1"
./memory-api.sh save --type learning --summary "leaf03 needs 30s delay after commit due to slow TCAM"
./memory-api.sh save --type decision --summary "Chose OSPF over BGP for the new DC fabric"

Under the hood, memory_save.py (sketched after the list):

  1. Cleans the text (strips comments, control characters, limits length to avoid token overflow)
  2. Generates a 768-dimensional embedding via Ollama’s nomic-embed-text model
  3. Creates the Qdrant collection if it doesn’t exist (cosine distance)
  4. Stores the vector alongside a rich metadata payload
  5. Uses a deterministic MD5 hash of summary:timestamp as the point ID to prevent duplicates
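
A condensed sketch of that flow with the qdrant-client Python library. The names are illustrative rather than the literal script, and I format the MD5 digest as a UUID because Qdrant only accepts integers or UUIDs as point IDs:

import hashlib
import uuid
from datetime import datetime, timezone
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient("http://localhost:6333")
COLLECTION = "cortex-memory"

def save_memory(summary: str, mem_type: str, extra: dict) -> None:
    # 768-dim vectors, cosine distance, created on first use
    if not client.collection_exists(COLLECTION):
        client.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=768, distance=Distance.COSINE),
        )
    timestamp = datetime.now(timezone.utc).isoformat()
    # Deterministic ID: MD5 of "summary:timestamp", rendered as a UUID
    point_id = str(uuid.UUID(hashlib.md5(f"{summary}:{timestamp}".encode()).hexdigest()))
    client.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(
            id=point_id,
            vector=embed(summary),  # embed() from the earlier sketch; text is cleaned first
            payload={"type": mem_type, "summary": summary, "timestamp": timestamp, **extra},
        )],
    )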

Four memory types

Not all memories are equal, so I categorized them:

  • conversation — session summaries (“Debugged BGP flapping, found MTU mismatch”)
  • fact — persistent truths about infrastructure (“Customer X uses VLAN 500”)
  • decision — architecture choices and their reasoning (“Chose OSPF because…”)
  • learning — discoveries and gotchas (“This switch needs a delay after commit”)

These types are stored as metadata and can be used as filters during recall.

The payload

Each memory is more than just text and a vector. The full payload:

{
  "type": "fact",
  "summary": "Customer ACME uses VLAN 500 at AMS1",
  "content": "Full text used for embedding generation",
  "topics": ["vlan", "customer-acme"],
  "devices": ["ams1leaf01", "ams1leaf02"],
  "tickets": ["PROJ-1234"],
  "session_id": "optional grouping ID",
  "timestamp": "2026-04-07T15:30:00Z"
}

The devices and tickets fields enable precise filtering. When I’m troubleshooting a specific switch, I can pull up everything related to that device — not just semantically similar content, but exact matches on the device name.
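
In Qdrant terms, those fields become payload filter conditions attached to a search. A small sketch of a device filter (the value is from the example above; the variable name is mine):

from qdrant_client.models import Filter, FieldCondition, MatchValue

# Matches points whose "devices" array contains this exact hostname;
# passed to a search as query_filter alongside the query vector.
device_filter = Filter(
    must=[FieldCondition(key="devices", match=MatchValue(value="ams1leaf01"))]
)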

Recall: semantic search over everything

Recall is where it gets interesting. Instead of keyword matching, the system does semantic similarity search:

./memory-api.sh recall "BGP problems"
./memory-api.sh recall "VLAN assignments" --type fact
./memory-api.sh recall "issues" --device ams1router01

The query gets embedded using the same model, then Qdrant finds the closest vectors in the collection. This means “BGP problems” will match memories about “route flapping”, “peer session down”, or “prefix not propagating” — even if none of those exact words appear in the query.
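
memory_recall.py boils down to embedding the query with the same model and letting Qdrant rank the stored vectors. A rough sketch, reusing the client, the embed() helper, and the filter shape from above:

def recall(query: str, limit: int = 5, query_filter=None):
    hits = client.search(
        collection_name="cortex-memory",
        query_vector=embed(query),   # same embedding model as at save time
        query_filter=query_filter,   # e.g. a device filter, or None
        limit=limit,
    )
    for hit in hits:
        print(f"[{hit.score:.2f}] [{hit.payload['type']}] {hit.payload['summary']}")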

Results come back with relevance scores:

[0.87] [conversation] Debugged BGP flapping on router01 — found MTU mismatch on ae0
[0.72] [fact] Router01 peers with ISP-A on ae0, ISP-B on ae1
[0.65] [learning] Always check MTU on aggregated interfaces after firmware upgrade

Auto-recall: proactive context at conversation start

The real magic is auto_recall.py. This runs automatically when a new conversation starts, taking the user’s first message and searching for relevant context before Claude even responds.

Here’s what it does (a sketch follows the list):

  1. Takes the user’s initial message (e.g., “router01 is dropping BGP sessions again”)
  2. Cleans and embeds it
  3. Searches two collections in parallel:
    • cortex-memory — past conversations, facts, decisions, learnings (threshold: 0.55)
    • jira-tickets — 500+ indexed project tickets (threshold: 0.50)
  4. Returns up to 3 results from each, formatted for Claude’s context
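
A condensed sketch of that orchestration. The collection names and thresholds match the description above; the formatting is simplified, and I run the two searches sequentially here where the real script fires them in parallel:

def auto_recall(first_message: str) -> str:
    vector = embed(first_message)  # cleaned first, as described later in this post
    sections = []
    for collection, header, threshold in [
        ("cortex-memory", "MEMORIES", 0.55),
        ("jira-tickets", "RELATED TICKETS", 0.50),
    ]:
        hits = client.search(
            collection_name=collection,
            query_vector=vector,
            score_threshold=threshold,  # drop weak matches entirely
            limit=3,
        )
        if hits:
            lines = [f"  • [{h.payload.get('type', 'ticket')}] {h.payload['summary']}"
                     f" (relevance: {h.score:.2f})" for h in hits]
            sections.append(f"{header}:\n" + "\n".join(lines))
    if not sections:
        return ""
    return "=== Relevant Context ===\n\n" + "\n\n".join(sections)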

The output looks like:

=== Relevant Context ===

MEMORIES:
  • [conversation] BGP issue on router01 resolved — MTU mismatch (relevance: 0.87)
  • [fact] Router01 BGP peers: ISP-A on ae0, ISP-B on ae1 (relevance: 0.72)

RELATED TICKETS:
  • PROJ-456: BGP migration plan for router01 [Done] (relevance: 0.81)
  • PROJ-789: MTU standardization across fabric [In Progress] (relevance: 0.68)

This means when I say “router01 is dropping BGP sessions again,” Claude already knows:

  • This happened before and it was an MTU issue
  • There’s an active ticket about MTU standardization
  • The device’s BGP topology

No manual context-gathering needed. It just remembers.

Alert correlation: incident response with context

The most operationally useful piece is alert_correlate.py. When an alert fires, this script searches across three collections simultaneously:

  • Jira tickets (5 results, threshold 0.45) — past incidents, change requests, known issues
  • Cortex memories (3 results, threshold 0.50) — past troubleshooting sessions, facts, learnings
  • Device configurations (3 results, threshold 0.50) — relevant config sections

Example invocations:

./alert_correlate.py --device ams1router01 --error "BGP session down"
./alert_correlate.py --query "PRTG alert: esxi01 high CPU"

It auto-detects device names from the query text using regex, then applies device-specific filters to narrow results. The output gives you a correlated view: past tickets about this device, memories from previous troubleshooting, and the relevant config sections — all ranked by semantic relevance.
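
The detection itself is just a regex over the alert text, and the matches feed the same kind of payload filter shown earlier. A rough sketch; the hostname pattern, the device-configs collection name, and applying one filter to all three collections are simplifications on my part:

import re
from qdrant_client.models import Filter, FieldCondition, MatchValue

# Hostnames like "ams1leaf03" or "ams1router01": site code plus role plus number
DEVICE_PATTERN = re.compile(r"\b[a-z]{3}\d(?:leaf|spine|router|fw)\d{2}\b", re.IGNORECASE)

def correlate(query: str):
    devices = [m.lower() for m in DEVICE_PATTERN.findall(query)]
    device_filter = Filter(
        must=[FieldCondition(key="devices", match=MatchValue(value=devices[0]))]
    ) if devices else None
    vector = embed(query)
    results = {}
    # Each collection gets its own result count and score threshold
    for collection, limit, threshold in [
        ("jira-tickets", 5, 0.45),
        ("cortex-memory", 3, 0.50),
        ("device-configs", 3, 0.50),
    ]:
        results[collection] = client.search(
            collection_name=collection,
            query_vector=vector,
            query_filter=device_filter,
            score_threshold=threshold,
            limit=limit,
        )
    return results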

This turns a cold “BGP session down” alert into an informed starting point with historical context.

The key insight is that infrastructure problems don’t always use the same words. A memory about “route flapping caused by MTU mismatch on aggregated ethernet” should match a query about “BGP session dropping” — because they’re about the same class of problem. Traditional keyword search would miss this entirely. Vector similarity captures the meaning.

The nomic-embed-text model (768 dimensions) handles technical infrastructure language well. I’ve been impressed by its ability to connect related concepts even when the terminology differs.

Text cleaning: the boring but critical part

Embedding models are sensitive to garbage input. The scripts include a cleaning pipeline (sketched after the list) that:

  • Removes # comment lines and /* */ block comments
  • Strips binary and control characters
  • Collapses excessive whitespace
  • Truncates to 2000–4500 characters depending on the use case
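
A rough sketch of that kind of pipeline; the regexes and the length cap are illustrative, not the exact ones in the scripts:

import re

def clean(text: str, max_chars: int = 2000) -> str:
    text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)   # drop /* ... */ block comments
    text = re.sub(r"^\s*#.*$", "", text, flags=re.MULTILINE)  # drop whole-line # comments
    text = re.sub(r"[^\x09\x0A\x0D\x20-\x7E]", "", text)      # strip control and non-printable bytes
    text = re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
    return text[:max_chars]                                   # hard length cap before embedding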

Without this, Ollama would occasionally produce weird embeddings or outright fail on large config dumps. It’s not glamorous code, but it prevents a lot of silent failures.

Integration with Claude Code

The whole system plugs into Claude Code through the Skills framework. Skills are directories under ~/.claude/skills/ with a SKILL.md file that teaches Claude how to use the tools. Claude Code loads these at conversation start and can call the scripts directly.

In my CLAUDE.md (the global instruction file), I added directives for Claude to:

  • Run auto_recall.py at conversation start with the user’s first message
  • Use memory-api.sh save when it learns important facts or completes significant troubleshooting
  • Use alert_correlate.py when responding to monitoring alerts

This makes the memory system mostly invisible. I don’t have to think about saving or retrieving — Claude handles it as part of its normal workflow.

Performance

Some numbers from real usage:

  • Embedding generation: ~50ms per query via Ollama (local GPU)
  • Vector search: <10ms for 1000+ points in Qdrant
  • Auto-recall total latency: ~200ms including both collection searches
  • Storage: negligible — each memory point is a 768-float vector plus a small JSON payload

The bottleneck is the embedding step, and even that’s barely noticeable.

What I’d do differently

  • Automatic summarization. Right now, saving conversation memories is a manual trigger. I’d like Claude to automatically summarize and save at the end of significant sessions.
  • Memory decay. Old memories should gradually lose relevance weight. A fact from last week is probably more useful than one from six months ago.
  • Deduplication. The deterministic ID prevents exact duplicates, but semantically similar memories can pile up. A periodic cleanup job that merges near-duplicates would help.
  • Multi-tenant collections. Currently everything lives in one collection per type. If multiple engineers used this, you’d want per-user or per-team isolation.

The bigger picture

LLMs are incredibly capable but fundamentally stateless. Every conversation is a blank slate. Tools like RAG and vector databases are usually discussed in the context of chatbots or document Q&A — but they’re equally powerful as operational memory for AI engineering assistants.

By giving Claude Code a vector-backed memory, it goes from being a very smart tool that I have to re-brief every session, to something closer to a colleague who was there last time and remembers what happened.

The whole cortex-memory skill is about 400 lines of Python and bash. The infrastructure (Qdrant + Ollama) was already running for other projects. The ROI on those 400 lines has been enormous.

Source

The skill lives in ~/.claude/skills/cortex-memory/ and consists of:

  • memory-api.sh — CLI wrapper
  • memory_save.py — embedding + storage
  • memory_recall.py — semantic search + listing
  • auto_recall.py — proactive context recall
  • alert_correlate.py — incident correlation
  • SKILL.md — Claude Code skill definition

If you’re running Claude Code and have a Qdrant instance handy, this is a weekend project that’ll fundamentally change how useful the assistant is day-to-day.