
Running a Production AI Stack on a Single VM


In the last post, I walked through the high-level architecture behind BottyGPT — how everything fits together, and why I chose that structure.

This part is where things get a bit more real.

This is about how the system actually runs day-to-day: the VM, the containers, and how all the moving pieces stay predictable without introducing unnecessary complexity.


Why a Single VM

At the center of everything is a single Google Compute Engine VM, running in the Montréal region.

That decision was very intentional.

There’s a natural instinct to reach for Kubernetes, Cloud Run, or some kind of distributed setup when you’re dealing with multiple services. But for this project, that would have added complexity without solving a real problem.

What I needed was:

  • something easy to reason about
  • something easy to debug
  • something I fully control

A single VM gives me that.

There’s one place to SSH into, one environment to inspect, and one system to monitor. When something breaks (and things always break eventually), I know exactly where to look.


Why Montréal (and GCP)

Running this in Canada wasn’t just a convenience — it was part of the design.

  • data stays in-region
  • latency is predictable
  • the trust story is simple

For a system that’s handling user queries and potentially sensitive content, that matters more than squeezing out marginal performance gains from multi-region setups.

GCP itself was mostly a practical choice. It integrates cleanly with things like Artifact Registry and Secret Manager, which I’m already using elsewhere in the stack.

...and it doesn't get hacked on the regular (looking at you, AWS) 😄


Docker Compose as the Backbone

Everything on the VM runs through Docker Compose.

This is really the core of the runtime!

Instead of managing services individually, Compose defines the entire system in one place:

  • backend API
  • frontend (Nginx + static assets)
  • worker (Celery)
  • Qdrant
  • Redis
  • MongoDB

Each service is isolated, but they’re all part of the same network and lifecycle.

That gives me:

  • consistent environments
  • simple service discovery
  • easy startup and teardown
  • reproducible local ↔ production parity

And importantly, it keeps the system legible.

I can look at a single docker-compose.yaml and understand how everything connects.
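To make that concrete, here's a trimmed-down sketch of what a compose file like this can look like. Service names, image paths, and ports are illustrative, not the real config:

```yaml
services:
  backend:
    image: us-docker.pkg.dev/my-project/botty/backend:latest  # hypothetical Artifact Registry path
    env_file: .env
    depends_on: [redis, mongo, qdrant]

  frontend:
    image: us-docker.pkg.dev/my-project/botty/frontend:latest
    ports: ["80:80", "443:443"]       # Nginx is the only public-facing service

  worker:
    image: us-docker.pkg.dev/my-project/botty/backend:latest
    command: celery -A app.celery worker   # module path is a guess
    depends_on: [redis, mongo]

  qdrant:
    image: qdrant/qdrant
    volumes: [qdrant_data:/qdrant/storage]

  redis:
    image: redis:7-alpine

  mongo:
    image: mongo:7
    volumes: [mongo_data:/data/db]

volumes:
  qdrant_data:
  mongo_data:
```

One file, six services, one shared network — that's the whole topology.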


The Services, In Practice

At runtime, each container has a very clear job.

DocsGPT Backend (Flask)

This is the entry point for all queries.

It:

  • receives requests from the widget
  • handles retrieval from Qdrant
  • coordinates LLM calls
  • returns structured responses with citations

Everything flows through here.


Frontend + Nginx

This serves:

  • the widget assets
  • the DocsGPT UI
  • any static frontend resources

Nginx also handles routing and acts as the public-facing layer for the frontend.
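A rough sketch of that routing layer — the domain, paths, and backend port are illustrative:

```nginx
server {
    listen 80;
    server_name example.com;  # illustrative domain

    # Static widget assets and the DocsGPT UI
    location / {
        root /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;
    }

    # API traffic goes to the Flask backend container over the Compose network
    location /api/ {
        proxy_pass http://backend:7091;   # port is a guess; use whatever the backend binds
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Note that `backend` resolves via Docker's internal DNS — one of the small wins of keeping everything on the same Compose network.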


Worker (Celery)

Some tasks just don’t belong in a request cycle.

The worker handles:

  • document ingestion
  • embedding generation
  • longer-running background tasks

Redis acts as the broker here, keeping things decoupled and asynchronous.


Qdrant (Vector Database)

This is where embeddings live.

It handles:

  • similarity search
  • fast retrieval for RAG
  • indexing of ingested content

Keeping this local to the VM helps reduce latency and keeps things predictable.
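The core idea behind that similarity search — rank stored embeddings by how close they are to the query vector — can be sketched in plain Python. Qdrant does this at scale with proper ANN indexing; this toy version just makes the concept visible:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "collection": three 3-dimensional embeddings with text payloads.
points = [
    ([0.9, 0.1, 0.0], "passage about deployment"),
    ([0.0, 1.0, 0.2], "passage about embeddings"),
    ([0.1, 0.0, 0.95], "passage about logging"),
]

def search(query_vec, top_k=1):
    # Rank every point by similarity to the query (the part an ANN index speeds up).
    ranked = sorted(points, key=lambda p: cosine(query_vec, p[0]), reverse=True)
    return [payload for _, payload in ranked[:top_k]]
```

In production the vectors have hundreds of dimensions and come from an embedding model, but retrieval-for-RAG is this operation at its heart.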


Redis

Redis does two jobs:

  • task queue (for Celery)
  • short-term caching

It’s lightweight, fast, and fits perfectly into this kind of setup.


MongoDB

Mongo stores:

  • metadata
  • ingestion state
  • operational data

It’s not part of the hot path for queries, but it’s important for keeping the system stateful and manageable.
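To make "ingestion state" concrete, here's a hypothetical document of the kind Mongo might hold — every field name here is invented for illustration:

```json
{
  "_id": "doc_8421",
  "source": "docs/getting-started.md",
  "status": "ingested",
  "chunks": 42,
  "embedded_at": "2024-11-03T14:12:00Z",
  "collection": "botty_main"
}
```

Queries never touch this; it exists so the system knows what's been ingested, when, and into which collection.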


Networking and TLS

Not everything needs to be public — and most of it isn’t!

  • internal services communicate over the Docker network
  • only the necessary endpoints are exposed
  • Nginx handles TLS termination

Using Let’s Encrypt keeps HTTPS simple and automatic.

This keeps the surface area small while still making the system accessible where it needs to be.


Logs and Observability

One thing I didn’t want was a "black box".

The VM runs the Google Cloud Ops Agent, which forwards container logs to Cloud Logging.

That gives me:

  • centralized logs
  • basic monitoring
  • a way to trace issues without SSH-ing in blindly

It’s not over-engineered observability, but it’s enough to be practical.


What This Setup Enables

This whole setup leans heavily into a few advantages:

Fast iteration

I can make changes, rebuild containers, and deploy quickly without dealing with orchestration layers.

Easy debugging

Everything lives in one place. No chasing issues across services or regions.

Predictable performance

Local services + no serverless layers = consistent latency and behavior.

Low operational overhead

No cluster management, no scaling rules, no hidden complexity.


The Tradeoff (Again, On Purpose)

This isn’t a system that scales infinitely out of the box.

  • no horizontal scaling
  • single VM = single point of failure
  • limited redundancy

But that’s a conscious decision.

Right now, clarity and control matter more than theoretical scale.

And because everything is containerized and built through CI, the path to scaling later is still there — it just isn’t needed yet.


What’s Next

So far we’ve covered the runtime itself: the VM, the containers, and how the pieces stay predictable.

Next up is how this actually gets shipped and maintained:

  • CI/CD pipelines
  • image builds and tagging
  • deployments
  • secrets management

Because building the system is one thing; running it reliably over time is where things get... interesting.