Designing a Self-Hosted AI Assistant (BottyGPT Architecture)


In my last post, I talked about where your AI knowledge should live — and why I chose to self-host DocsGPT rather than rely on third-party tools.

This is the follow-up to that: a look at how the system actually comes together.

This post isn’t about implementation details just yet. It’s about the shape of the system — how the pieces fit together, where things live, and why I made the decisions I did.


The Goal

At a high level, I wanted something pretty simple:

  • One AI assistant
  • One source of truth
  • Available everywhere I publish

That means whether you're reading docs or browsing my site, you're talking to the same system, with the same context and the same understanding of the content.

No fragmentation, no duplicated setups, no weird inconsistencies.

And... not a ton of time spent designing the avatar...


The Core Pattern

The entire setup follows a simple pattern:

One backend, many frontends

There’s a single DocsGPT-powered backend acting as a RAG system, and it’s shared across:

  • my main Ghost site
  • my Docusaurus docs site

Both use the same:

  • apiHost
  • apiKey
  • widget implementation

So no matter where you interact with it, the assistant behaves the same way — same scope, same answers, same citations.

That consistency was important to me. If the assistant is part of the experience, it shouldn’t feel different depending on where you are.


The System at a Glance

At runtime, everything lives on a single VM, orchestrated with Docker Compose.

Inside that, there are a handful of key components:

1. The Widget (UI Layer)

This is what you actually see on the site.

It’s embedded into:

  • the Ghost theme (via default.hbs)
  • the docs site (via script or React component)

Its job is simple:

  • send queries to the backend
  • render responses
  • show citations

It doesn’t know anything about the data itself — it just talks to the API.
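The embed itself is tiny. Here's a sketch of what the snippet in `default.hbs` might look like, assuming the widget ships as a script bundle — the script URL and the `DocsGPTWidget` init call are illustrative, not the exact DocsGPT API:

```html
<!-- Hypothetical embed: script URL and init function are illustrative -->
<script src="https://example.com/docsgpt-widget.js"></script>
<script>
  // Same apiHost + apiKey on every site, so each surface
  // talks to the one shared backend.
  DocsGPTWidget({
    apiHost: "https://assistant.example.com",
    apiKey: "PUBLIC_WIDGET_KEY",
  });
</script>
```

The docs site uses the same configuration, just wrapped in a React component instead of a raw script tag.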


2. DocsGPT Backend (API)

This is the core of the system.

It handles:

  • incoming queries
  • retrieval from the vector database
  • LLM orchestration
  • response formatting

It’s exposed publicly as an API, and everything flows through it.
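To make that flow concrete, here's a minimal sketch of what a widget-to-backend query could look like, using only Python's standard library. The endpoint path and payload field names are assumptions for illustration, not the exact DocsGPT contract:

```python
import json
import urllib.request

API_HOST = "https://assistant.example.com"  # hypothetical apiHost

def build_query(question: str, api_key: str = "PUBLIC_WIDGET_KEY") -> dict:
    """Shape of the payload the widget might send (field names are illustrative)."""
    return {"question": question, "api_key": api_key, "history": []}

def ask(question: str) -> dict:
    """POST the query to the backend and return the parsed JSON answer."""
    payload = json.dumps(build_query(question)).encode()
    req = urllib.request.Request(
        f"{API_HOST}/api/answer",  # assumed endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Everything the widget does reduces to this one round trip; retrieval, orchestration, and citation assembly all happen behind that single endpoint.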


3. Worker + Background Jobs

Some tasks don’t belong in the request cycle.

Things like:

  • embedding documents
  • ingestion
  • longer-running operations

These run through a worker (Celery) using Redis as a broker.
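The pattern itself is simple: the API enqueues a job and returns immediately, while a worker drains the queue. In my setup that's Celery with Redis as the broker; here's the same idea sketched with Python's stdlib so the shape is visible without any infrastructure:

```python
import queue
import threading

jobs: "queue.Queue" = queue.Queue()
results: dict = {}

def worker() -> None:
    """Drain the queue: this stands in for a Celery worker process."""
    while True:
        doc = jobs.get()
        if doc is None:  # sentinel: shut the worker down
            break
        # In the real system this step would embed the document
        # and write the vectors to Qdrant.
        results[doc] = f"embedded:{doc}"
        jobs.task_done()

# The API-side call site: enqueue and return, never block the request cycle.
t = threading.Thread(target=worker, daemon=True)
t.start()
jobs.put("getting-started.md")
jobs.put("architecture.md")
jobs.join()     # wait for the demo; a real API handler would not do this
jobs.put(None)  # stop the worker
t.join()
```

Swap the `queue.Queue` for Redis and the thread for a Celery worker process, and you have the production shape.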


4. Data Layer

There are three main pieces here:

  • Qdrant → vector search (embeddings + retrieval)
  • Redis → task queue + short-term caching
  • MongoDB → metadata and operational state

Each one does one job, which keeps the system modular.


5. Frontend + Static Assets

The DocsGPT frontend (served via Nginx) handles:

  • UI assets
  • admin interface
  • widget resources

This sits alongside the API, but stays separate in responsibility.


Where It All Lives

Everything runs on a single Google Compute Engine VM in the Montréal region.

That includes:

  • backend API
  • worker
  • vector database
  • Redis + Mongo
  • frontend + Nginx

It’s all containerized and orchestrated through Docker Compose.

No Kubernetes. No distributed system. Just one well-defined runtime.
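As a rough sketch, the Compose file declares one service per component above. The service names, images, and ports here are illustrative placeholders, not my actual file:

```yaml
# Illustrative docker-compose.yml shape -- names and images are placeholders
services:
  backend:
    image: docsgpt/backend        # hypothetical image name
    depends_on: [redis, mongo, qdrant]
  worker:
    image: docsgpt/backend        # same image, worker entrypoint
    command: celery -A app worker
    depends_on: [redis]
  redis:
    image: redis:7
  mongo:
    image: mongo:7
  qdrant:
    image: qdrant/qdrant
  frontend:
    image: nginx:alpine
    ports: ["80:80", "443:443"]
```

One `docker compose up -d` brings the whole assistant online, which is most of the appeal.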


Why This Architecture

There are a few principles behind this setup.

1. Simplicity scales surprisingly far

One VM is easy to reason about.

  • one place to SSH
  • one set of logs
  • one deployment target

For the scale I’m operating at, that simplicity is a feature, not a limitation.


2. Consistency across experiences

By centralizing the backend, I avoid:

  • duplicated embeddings
  • inconsistent answers
  • fragmented context

The assistant behaves like a system, not a feature bolted onto individual sites.


3. Data residency matters

Everything runs in Canada.

That’s a deliberate choice — it keeps data within a region I’m comfortable with and makes the trust story a lot cleaner, especially for Canadian clients.


4. Predictable performance and cost

Running a dedicated VM with a local vector database means:

  • no surprise per-request costs
  • fewer network hops
  • more consistent latency

It’s not the most “cloud-native” approach, but it is very predictable.


Tradeoffs (On Purpose)

This setup isn’t trying to do everything.

There are a couple of intentional tradeoffs:

  • no automatic horizontal scaling
  • single VM = single point of failure
  • no multi-region redundancy

And that’s fine — for now.

The goal here wasn’t to build a hyperscale system. It was to build something:

  • understandable
  • controllable
  • and easy to evolve

What’s Next

This post is the high-level view.

In the next one, I’ll break down how this actually runs in practice — the VM setup, Docker Compose, and how all the services are wired together day-to-day.

Because the interesting part isn't just the architecture; it's how it behaves when you actually run it.