Designing a Self-Hosted AI Assistant (BottyGPT Architecture)
In my last post, I talked about where your AI knowledge should live — and why I chose to self-host DocsGPT rather than rely on third-party tools.
This is the follow-up to that: a look at how the system actually comes together.
This post isn’t about implementation details just yet. It’s about the shape of the system — how the pieces fit together, where things live, and why I made the decisions I did.
The Goal
At a high level, I wanted something pretty simple:
- One AI assistant
- One source of truth
- Available everywhere I publish
That means whether you're reading docs or browsing my site, you're talking to the same system, with the same context and the same understanding of the content.
No fragmentation, no duplicated setups, no weird inconsistencies.
And... not a ton of time spent designing the avatar...

The Core Pattern
The entire setup follows a simple pattern:
One backend, many frontends
There’s a single DocsGPT-powered backend acting as a RAG system, and it’s shared across:
- my main Ghost site
- my Docusaurus docs site
Both use the same:
- `apiHost`
- `apiKey`
- widget implementation
So no matter where you interact with it, the assistant behaves the same way — same scope, same answers, same citations.
That consistency was important to me. If the assistant is part of the experience, it shouldn’t feel different depending on where you are.
The System at a Glance
At runtime, everything lives on a single VM, orchestrated with Docker Compose.
Inside that, there are a handful of key components:
1. The Widget (UI Layer)
This is what you actually see on the site.
It’s embedded into:
- the Ghost theme (via `default.hbs`)
- the docs site (via a script tag or React component)
Its job is simple:
- send queries to the backend
- render responses
- show citations
It doesn’t know anything about the data itself — it just talks to the API.
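To make that concrete, here's a small sketch of the widget's side of the conversation. The endpoint path, host, key, and payload fields below are assumptions in the style of a DocsGPT-like API, not the exact contract:

```python
# Sketch of the widget's three jobs: send a query, render the response,
# show citations. API_HOST, API_KEY, and the /api/answer path are
# hypothetical placeholders, not the real configuration.
import json
import urllib.request

API_HOST = "https://assistant.example.com"  # assumed backend URL
API_KEY = "pk_example"                      # assumed widget key


def build_query(question: str) -> urllib.request.Request:
    """Build the POST request the widget sends to the backend."""
    payload = json.dumps({"question": question, "api_key": API_KEY}).encode()
    return urllib.request.Request(
        f"{API_HOST}/api/answer",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render(response: dict) -> str:
    """Render the answer followed by numbered citations."""
    lines = [response["answer"]]
    lines += [f"[{i + 1}] {src}" for i, src in enumerate(response.get("sources", []))]
    return "\n".join(lines)
```

The point is how little the widget carries: one URL, one key, and some rendering logic. Everything else lives behind the API.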
2. DocsGPT Backend (API)
This is the core of the system.
It handles:
- incoming queries
- retrieval from the vector database
- LLM orchestration
- response formatting
It’s exposed publicly as an API, and everything flows through it.
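The request path can be sketched in a few lines. These are toy stand-ins: the real system retrieves from Qdrant and calls an LLM, but the orchestration has this shape:

```python
# Toy sketch of the backend's request path: rank stored chunks against
# the query embedding, then assemble the prompt the LLM sees.
# In the real system, retrieval is Qdrant and generation is an LLM call.
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vec, index, k=2):
    """Return the k chunks most similar to the query embedding."""
    ranked = sorted(index, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks):
    """Assemble grounded context, keeping sources so answers can cite them."""
    context = "\n".join(f"- {c['text']} (source: {c['source']})" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Response formatting then wraps the LLM output together with the sources that were retrieved, which is where the citations come from.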
3. Worker + Background Jobs
Some tasks don’t belong in the request cycle.
Things like:
- embedding documents
- ingestion
- longer-running operations
These run through a worker (Celery) using Redis as a broker.
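The pattern is worth seeing on its own. Here it is sketched with the standard library standing in for Celery and Redis: the API only enqueues; a worker does the slow work off the request path:

```python
# The enqueue-and-process pattern, with queue.Queue standing in for
# Redis as the broker and a thread standing in for the Celery worker.
# This is an illustration of the pattern, not the actual stack.
import queue
import threading

jobs = queue.Queue()   # stand-in for the Redis broker
results = {}           # stand-in for persisted job state


def enqueue_ingestion(doc_id):
    """What the API does: hand off the job and return immediately."""
    jobs.put(doc_id)


def worker():
    """What the worker does: pull jobs and run the slow embedding work."""
    while True:
        doc_id = jobs.get()
        results[doc_id] = f"embedded:{doc_id}"  # stand-in for real embedding
        jobs.task_done()
```

The request cycle stays fast because the only synchronous cost is a broker write; everything expensive happens on the worker's clock.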
4. Data Layer
There are three main pieces here:
- Qdrant → vector search (embeddings + retrieval)
- Redis → task queue + short-term caching
- MongoDB → metadata and operational state
Each one does one job, and keeps the system modular.
5. Frontend + Static Assets
The DocsGPT frontend (served via Nginx) handles:
- UI assets
- admin interface
- widget resources
This sits alongside the API, but stays separate in responsibility.
Where It All Lives
Everything runs on a single Google Compute Engine VM in the Montréal region.
That includes:
- backend API
- worker
- vector database
- Redis + Mongo
- frontend + Nginx
It’s all containerized and orchestrated through Docker Compose.
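In Compose terms, the layout sketches out roughly like this. Service names, images, and commands here are illustrative assumptions, not the actual compose file:

```yaml
# Illustrative fragment only — names and images are assumptions.
services:
  backend:    # DocsGPT API: queries, retrieval, orchestration
    build: ./application
    depends_on: [redis, mongo, qdrant]
  worker:     # Celery worker for embedding and ingestion jobs
    build: ./application
    depends_on: [redis, mongo]
  redis:      # task broker + short-term cache
    image: redis:7
  mongo:      # metadata and operational state
    image: mongo:7
  qdrant:     # vector search
    image: qdrant/qdrant
  frontend:   # Nginx-served UI, admin, and widget assets
    build: ./frontend
    ports: ["80:80"]
```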
No Kubernetes. No distributed system. Just one well-defined runtime.
Why This Architecture
There are a few principles behind this setup.
1. Simplicity scales surprisingly far
One VM is easy to reason about.
- one place to SSH
- one set of logs
- one deployment target
For the scale I’m operating at, that simplicity is a feature, not a limitation.
2. Consistency across experiences
By centralizing the backend, I avoid:
- duplicated embeddings
- inconsistent answers
- fragmented context
The assistant behaves like a system, not a feature bolted onto individual sites.
3. Data residency matters
Everything runs in Canada.
That’s a deliberate choice — it keeps data within a region I’m comfortable with and makes the trust story a lot cleaner, especially for Canadian clients.
4. Predictable performance and cost
Running a dedicated VM with a local vector database means:
- no surprise per-request costs
- fewer network hops
- more consistent latency
It’s not the most “cloud-native” approach, but it is very predictable.
Tradeoffs (On Purpose)
This setup isn’t trying to do everything.
There are a couple of intentional tradeoffs:
- no automatic horizontal scaling
- single VM = single point of failure
- no multi-region redundancy
And that’s fine — for now.
The goal here wasn’t to build a hyperscale system. It was to build something:
- understandable
- controllable
- and easy to evolve
What’s Next
This post is the high-level view.
In the next one, I’ll break down how this actually runs in practice — the VM setup, Docker Compose, and how all the services are wired together day-to-day.
Because the interesting part isn’t just the architecture, it’s how it behaves when you actually run it!