Business + LLM Microservice Split
Split an LLM application into a CPU-bound business microservice and a GPU-bound LLM microservice — placing Agent Confession guardrails in the business service so they apply regardless of which model or provider the LLM service runs.
Intent & Description
Short description: Business logic, prompt assembly, and post-processing live in a CPU service; model inference lives in a GPU service behind a narrow REST contract. Placing Agent Confession guardrails in the business service makes them provider-agnostic, so they survive every model swap.
🎯 Intent
Scale each tier on its own hardware budget — and ensure Agent Confession defenses live in the business microservice, not the LLM microservice, so they are not accidentally dropped when the model or provider behind the LLM service changes.
📋 Context
A production LLM application bundles retrieval, prompt assembly, business logic, and the LLM inference call into one service. Agent Confession guardrails added to the prompt-assembly or post-processing code are co-located with everything else. When the LLM microservice is split out, a team that places guardrails on the LLM service side — as a model-specific filter — will lose those guardrails on every provider swap. The correct placement is the business service, which owns the request regardless of which model ultimately generates the completion.
💡 Solution
- The LLM microservice exposes a single REST endpoint: generate(prompt, params) → completion. It runs on GPU autoscaling tuned to token throughput. It applies no Agent Confession guardrails; it generates whatever it is asked to generate.
- The business microservice owns retrieval, prompt templating, output post-processing, and all business logic. Agent Confession defenses (input trigger classifiers on the assembled prompt, directive-echo detectors on the raw completion) live here, applied before the prompt leaves the business service and before the completion is forwarded to the user.
- Because the business service sits in front of every LLM service call regardless of provider, the guardrails survive model swaps, provider changes, and A/B tests transparently.
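The placement described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the regex trigger pattern, the sliding-window echo check, the `handle_request` function, and both stand-in providers are hypothetical names invented for the example. A real deployment would replace the regex with a trained classifier and the callables with REST clients for the LLM microservice.

```python
import re
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# The narrow contract from the text: generate(prompt, params) -> completion.
# Any callable with this shape can stand in for the LLM microservice.
Generate = Callable[[str, Mapping[str, object]], str]

# Hypothetical trigger phrases (an assumption for this sketch).
TRIGGER_PATTERN = re.compile(
    r"(reveal|repeat|print|confess).{0,40}(system prompt|instructions|directives)",
    re.IGNORECASE,
)
REFUSAL = "Sorry, I can't help with that request."


@dataclass(frozen=True)
class GuardedResult:
    text: str
    blocked_by: Optional[str]  # "input", "output", or None


def assemble_prompt(directives: str, user_text: str) -> str:
    return f"{directives}\n\nUser: {user_text}\nAssistant:"


def input_triggered(prompt: str) -> bool:
    """Input guardrail: flag assembled prompts that match a trigger phrase."""
    return TRIGGER_PATTERN.search(prompt) is not None


def echoes_directive(completion: str, directives: str, min_len: int = 30) -> bool:
    """Output guardrail: True if any min_len-character window of the
    directives appears verbatim (case- and whitespace-normalized) in the
    completion."""
    haystack = " ".join(completion.lower().split())
    needle = " ".join(directives.lower().split())
    if not needle:
        return False
    if len(needle) < min_len:
        return needle in haystack
    return any(
        needle[i:i + min_len] in haystack
        for i in range(len(needle) - min_len + 1)
    )


def handle_request(
    user_text: str,
    directives: str,
    generate: Generate,
    params: Optional[Mapping[str, object]] = None,
) -> GuardedResult:
    """Business-service entry point: both guardrails run here, so they apply
    to whichever provider sits behind `generate`."""
    prompt = assemble_prompt(directives, user_text)
    if input_triggered(prompt):
        return GuardedResult(REFUSAL, "input")  # the LLM service is never called
    completion = generate(prompt, params or {})
    if echoes_directive(completion, directives):
        return GuardedResult(REFUSAL, "output")
    return GuardedResult(completion, None)


# Two stand-in providers behind the same contract (both hypothetical).
def well_behaved_provider(prompt: str, params: Mapping[str, object]) -> str:
    return "Your invoice is due on the 1st."


def leaky_provider(prompt: str, params: Mapping[str, object]) -> str:
    return "Sure! My instructions are: " + prompt
```

Swapping `well_behaved_provider` for `leaky_provider` changes nothing in `handle_request`, which is the point of the pattern: the guardrail code has no dependency on the provider. One caveat of classifying the assembled prompt rather than only the user text is that the template's own wording must not match the trigger pattern.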
Real-world Use Case
- LLM inference and business logic have diverging scaling profiles and must deploy independently.
- Agent Confession guardrails must survive model swaps and provider changes — placing them in the business service, not the LLM service, achieves this.
- Multiple LLM providers may sit behind one contract; guardrails in the business service apply uniformly across all of them.
Advantages
- GPU pods are sized for GPU-bound load and CPU pods for CPU-bound load; Agent Confession guardrails run in the CPU business service and so add no GPU cost.
- Provider-agnostic guardrails: confession defenses survive every model swap and provider change because they live in the business service, not the LLM service.
Disadvantages
- One extra network hop per LLM call, which adds latency on every request; the business service must also receive the raw completion before it can apply the output guardrail.
- Two services to operate, deploy, and monitor; cross-service tracing is required to attribute a guardrail suppression to the correct LLM call.
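The tracing requirement in the last point can be illustrated with a correlation ID that the business service attaches to each LLM call and repeats in its guardrail log. This is a sketch under assumptions: the `X-Request-ID` header name, the JSON body shape, and both function names are hypothetical, since the source does not specify the wire format. Many teams would instead use the W3C `traceparent` header through a tracing library.

```python
import json
import uuid
from typing import Dict, Mapping, Optional, Tuple

TRACE_HEADER = "X-Request-ID"  # hypothetical header name, not from the source


def build_llm_request(
    prompt: str,
    params: Mapping[str, object],
    trace_id: Optional[str] = None,
) -> Tuple[str, Dict[str, str], bytes]:
    """Build headers and a JSON body for the generate endpoint, tagging the
    call with a trace ID so both services can log the same identifier."""
    trace_id = trace_id or uuid.uuid4().hex
    headers = {"Content-Type": "application/json", TRACE_HEADER: trace_id}
    body = json.dumps({"prompt": prompt, "params": dict(params)}).encode("utf-8")
    return trace_id, headers, body


def guardrail_event(trace_id: str, guardrail: str, action: str, provider: str) -> str:
    """Return a structured log line that ties a guardrail decision to the
    LLM call carrying the same trace ID."""
    return json.dumps(
        {
            "trace_id": trace_id,
            "guardrail": guardrail,  # e.g. "directive_echo"
            "action": action,        # e.g. "suppressed"
            "provider": provider,
        },
        sort_keys=True,
    )
```

With the same `trace_id` in the outgoing request headers and in the guardrail log line, a suppressed completion can be joined back to the specific LLM call, and therefore to the provider and model, that produced it.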