Running a Production AI Stack on a Single VM
In the last post, I walked through the high-level architecture behind BottyGPT — how everything fits together, and why I chose that structure.
This part is where things get a bit more real.
This is about how the system actually runs day-to-day: the VM, the containers, and how all the moving pieces stay predictable without introducing unnecessary complexity.
Why a Single VM
At the center of everything is a single Google Compute Engine VM, running in the Montréal region (northamerica-northeast1).
That decision was very intentional.
There’s a natural instinct to reach for Kubernetes, Cloud Run, or some kind of distributed setup when you’re dealing with multiple services. But for this project, that would have added complexity without solving a real problem.
What I needed was:
- something easy to reason about
- something easy to debug
- something I fully control
A single VM gives me that.
There’s one place to SSH into, one environment to inspect, and one system to monitor. When something breaks (and things always break eventually), I know exactly where to look.
Why Montréal (and GCP)
Running this in Canada wasn’t just a convenience — it’s part of the design.
- data stays in-region
- latency is predictable
- the trust story is simple
For a system that’s handling user queries and potentially sensitive content, that matters more than squeezing out marginal performance gains from multi-region setups.
GCP itself was mostly a practical choice. It integrates cleanly with things like Artifact Registry and Secret Manager, which I’m already using elsewhere in the stack.
...and it doesn't get hacked on the regular (looking at you, AWS) 😄
Docker Compose as the Backbone
Everything on the VM runs through Docker Compose.
This is really the core of the runtime!
Instead of managing services individually, Compose defines the entire system in one place:
- backend API
- frontend (Nginx + static assets)
- worker (Celery)
- Qdrant
- Redis
- MongoDB
Each service is isolated, but they’re all part of the same network and lifecycle.
That gives me:
- consistent environments
- simple service discovery
- easy startup and teardown
- reproducible local ↔ production parity
And importantly, it keeps the system legible.
I can look at a single docker-compose.yaml and understand how everything connects.
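To make that concrete, here's a trimmed-down sketch of what that file looks like. Service names, images, and ports here are illustrative, not the exact production config:

```yaml
services:
  backend:
    image: docsgpt-backend:latest     # Flask API
    env_file: .env
    depends_on: [redis, mongo, qdrant]
  worker:
    image: docsgpt-backend:latest     # same image, different entrypoint
    command: celery worker            # ingestion + background tasks
    depends_on: [redis, mongo]
  frontend:
    image: docsgpt-frontend:latest    # Nginx + static assets
    ports:
      - "80:80"
      - "443:443"
  redis:
    image: redis:7
  mongo:
    image: mongo:7
  qdrant:
    image: qdrant/qdrant:latest
```

One file, one network, one lifecycle — `docker compose up -d` brings the whole system online in dependency order.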
The Services, In Practice
At runtime, each container has a very clear job.
DocsGPT Backend (Flask)
This is the entry point for all queries.
It:
- receives requests from the widget
- handles retrieval from Qdrant
- coordinates LLM calls
- returns structured responses with citations
Everything flows through here.
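To make "structured responses with citations" concrete, here's a minimal sketch of the kind of payload the backend hands back to the widget. The field names are my own illustration, not DocsGPT's actual schema:

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Citation:
    # Where a retrieved chunk came from, so the widget can link back to it.
    source: str
    snippet: str
    score: float


@dataclass
class AnswerResponse:
    # The generated answer plus the evidence used to produce it.
    answer: str
    citations: list[Citation] = field(default_factory=list)


resp = AnswerResponse(
    answer="Deployments run through Docker Compose on a single VM.",
    citations=[
        Citation(
            source="part-2.md",
            snippet="Everything on the VM runs through Docker Compose.",
            score=0.91,
        )
    ],
)
payload = asdict(resp)  # nested dataclasses become plain dicts, ready for JSON
```

Keeping citations first-class in the response shape is what lets the frontend render sources without any extra round trips.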
Frontend + Nginx
This serves:
- the widget assets
- the DocsGPT UI
- any static frontend resources
Nginx also handles routing and acts as the public-facing layer for the frontend.
Worker (Celery)
Some tasks just don’t belong in a request cycle.
The worker handles:
- document ingestion
- embedding generation
- longer-running background tasks
Redis acts as the broker here, keeping things decoupled and asynchronous.
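The work the worker does is deliberately simple to express. As one hypothetical example: ingestion usually starts by splitting a document into overlapping chunks before embedding, so context isn't lost at chunk boundaries. A minimal version of that step:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap,
    so a sentence cut at one boundary still appears whole in the next chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # advance by the non-overlapping stride
    return chunks


# Each step advances by 150 chars, so a 500-char doc yields 4 chunks.
chunks = chunk_text("a" * 500, size=200, overlap=50)
```

In the real system this runs inside a Celery task, which is exactly why it doesn't belong in a request cycle: a large document means many chunks, and each chunk means an embedding call.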
Qdrant (Vector Database)
This is where embeddings live.
It handles:
- similarity search
- fast retrieval for RAG
- indexing of ingested content
Keeping this local to the VM helps reduce latency and keeps things predictable.
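Under the hood, "similarity search" boils down to comparing a query embedding against stored vectors. Qdrant does this with proper indexes over millions of points, but the core idea can be sketched in a few lines of plain Python:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity: how aligned two vectors are, independent of length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    # Rank stored document IDs by similarity to the query vector.
    ranked = sorted(store, key=lambda doc_id: cosine_similarity(query, store[doc_id]), reverse=True)
    return ranked[:k]


# Toy 2-dimensional "embeddings" standing in for real model output.
store = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.7, 0.7],
    "doc-c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], store))  # doc-a ranks first: closest in direction to the query
```

The brute-force scan above is O(n) per query; what Qdrant adds is an approximate index so retrieval stays fast as the collection grows — plus persistence, filtering, and payloads.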
Redis
Redis does two jobs:
- task queue (for Celery)
- short-term caching
It’s lightweight, fast, and fits perfectly into this kind of setup.
MongoDB
Mongo stores:
- metadata
- ingestion state
- operational data
It’s not part of the hot path for queries, but it’s important for keeping the system stateful and manageable.
Networking and TLS
Not everything needs to be public — and most of it isn’t!
- internal services communicate over the Docker network
- only the necessary endpoints are exposed
- Nginx handles TLS termination
Using Let’s Encrypt keeps HTTPS simple and automatic.
This keeps the surface area small while still making the system accessible where it needs to be.
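Put together, the Nginx side is roughly this shape. Domain, paths, and the backend port are illustrative, not the production values:

```nginx
server {
    listen 443 ssl;
    server_name example.com;  # illustrative domain

    # Certificates issued and renewed automatically via Let's Encrypt
    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    # Static frontend + widget assets
    location / {
        root /usr/share/nginx/html;
        try_files $uri /index.html;
    }

    # API traffic proxied to the backend container over the Docker network
    location /api/ {
        proxy_pass http://backend:7091;
        proxy_set_header Host $host;
    }
}
```

Note that `backend` resolves via Docker's internal DNS — the Flask container never needs a public port at all.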
Logs and Observability
One thing I didn’t want was a "black box".
The VM runs the Google Cloud Ops Agent, which forwards container logs to Cloud Logging.
That gives me:
- centralized logs
- basic monitoring
- a way to trace issues without SSH-ing in blindly
It’s not over-engineered observability, but it’s enough to be practical.
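For reference, pointing the Ops Agent at container logs is itself just a small config file. A sketch of the shape — the receiver name is my own, and the include path assumes Docker's default json-file logging driver:

```yaml
# /etc/google-cloud-ops-agent/config.yaml (sketch)
logging:
  receivers:
    docker_containers:
      type: files
      include_paths:
        - /var/lib/docker/containers/*/*.log
  service:
    pipelines:
      containers:
        receivers: [docker_containers]
```

One file on the VM, and every container's stdout ends up searchable in Cloud Logging.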
What This Setup Enables
This whole setup leans heavily into a few advantages:
Fast iteration
I can make changes, rebuild containers, and deploy quickly without dealing with orchestration layers.
Easy debugging
Everything lives in one place. No chasing issues across services or regions.
Predictable performance
Local services + no serverless layers = consistent latency and behavior.
Low operational overhead
No cluster management, no scaling rules, no hidden complexity.
The Tradeoff (Again, On Purpose)
This isn’t a system that scales infinitely out of the box.
- no horizontal scaling
- single VM = single point of failure
- limited redundancy
But that’s a conscious decision.
Right now, clarity and control matter more than theoretical scale.
And because everything is containerized and built through CI, the path to scaling later is still there — it just isn’t needed yet.
What’s Next
So far we’ve covered:
- the architecture (Part 1)
- the runtime (this post)
Next up is how this actually gets shipped and maintained:
- CI/CD pipelines
- image builds and tagging
- deployments
- secrets management
Because building the system is one thing — running it reliably over time is where things get... interesting.