Building a Production-Grade LLM Orchestration Layer with Electron, React & Ollama
The Problem with Cloud-Dependent AI
Most AI tools today require API keys, internet connectivity, and monthly subscriptions. For developers who work with sensitive codebases, operate in compliance-heavy industries, or simply value privacy, this dependency is a non-starter. Gravity OS was built to solve exactly this: it is a local-first AI operating layer that runs entirely offline.
Architecture Overview
Gravity OS sits between the user and multiple local LLMs managed by Ollama. The architecture follows a layered design:
┌─────────────────────────────┐
│     Electron Shell (UI)     │
├─────────────────────────────┤
│  Agent Orchestration Core   │
├─────────────────────────────┤
│   Model Abstraction Layer   │
├─────────────────────────────┤
│         Ollama API          │
├─────────────────────────────┤
│  Local LLM Models (7B-70B)  │
└─────────────────────────────┘
Key Components
Electron Shell: Provides the desktop environment, system tray integration, and native OS features like file system access and notifications.
React Frontend: Built with a component-based architecture for chat interfaces, agent management, and real-time streaming of model responses.
Agent Orchestration Core: The brain of the system. Manages agent definitions, task routing, and collaborative workflows between multiple models.
Model Abstraction Layer: A unified interface over Ollama's API that handles prompt formatting, context window management, and response parsing across different model architectures.
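As a rough illustration of the response-parsing part of the Model Abstraction Layer, the sketch below assumes the raw output of Ollama's /api/generate endpoint, which streams newline-delimited JSON objects carrying a `response` fragment and a `done` flag. The helper name `parseGenerateStream` and the `GenerateChunk` type are hypothetical and only cover the fields used here.

```typescript
// One streamed chunk from /api/generate (only the fields this sketch reads;
// real chunks carry additional metadata).
interface GenerateChunk {
  response?: string;
  done?: boolean;
}

// Hypothetical helper: folds a raw NDJSON buffer into the generated text
// plus a flag saying whether the final chunk has arrived.
function parseGenerateStream(raw: string): { text: string; done: boolean } {
  let text = "";
  let done = false;
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines between chunks
    const chunk = JSON.parse(trimmed) as GenerateChunk;
    text += chunk.response ?? "";
    if (chunk.done) done = true;
  }
  return { text, done };
}
```

Keeping this parsing behind one function means the layers above only ever see plain text and a completion flag, regardless of which model produced them.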
Agent Collaboration Patterns
One of Gravity OS's standout features is multi-agent collaboration. Rather than relying on a single monolithic model, Gravity OS decomposes tasks and routes them to specialized agents:
interface Agent {
  name: string;
  model: string;
  systemPrompt: string;
  capabilities: string[];
}
const agents: Agent[] = [
  {
    name: "Architect",
    model: "codellama:34b",
    systemPrompt: "You are a software architect...",
    capabilities: ["design", "code-review"],
  },
  {
    name: "Writer",
    model: "mistral:7b",
    systemPrompt: "You are a technical writer...",
    capabilities: ["documentation", "editing"],
  },
];
Tasks are routed based on capability matching, and results are aggregated through a reducer pattern that synthesizes outputs into coherent responses.
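A minimal sketch of capability matching and reducer-style aggregation might look like the following. The names `routeTask`, `aggregate`, and `AgentResult` are hypothetical, and plain concatenation stands in for whatever synthesis step Gravity OS actually performs.

```typescript
interface Agent {
  name: string;
  model: string;
  systemPrompt: string;
  capabilities: string[];
}

// Hypothetical shape for one agent's contribution to a task.
interface AgentResult {
  agent: string;
  output: string;
}

// Return every agent that declares the requested capability.
function routeTask(agents: Agent[], capability: string): Agent[] {
  return agents.filter((a) => a.capabilities.includes(capability));
}

// Reducer: fold individual agent outputs into one labelled response,
// separating sections with a blank line.
function aggregate(results: AgentResult[]): string {
  return results.reduce(
    (acc, r) => (acc === "" ? "" : acc + "\n\n") + `[${r.agent}]\n${r.output}`,
    "",
  );
}
```

With the two agents defined earlier, `routeTask(agents, "code-review")` would select only the Architect, since it is the only agent listing that capability.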
Streaming & Performance
Real-time token streaming was critical. Ollama's streaming API returns the response as a sequence of newline-delimited JSON chunks; as each chunk arrives, its token is pushed through Electron's IPC bridge to React's state management:
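// IPC Bridge for Streaming (Electron main process; assumes the ollama npm client)
import { ipcMain } from "electron";
import ollama from "ollama";

ipcMain.handle("ollama:generate", async (event, { model, prompt }) => {
  // With stream: true, the client returns an async iterable of partial responses
  const stream = await ollama.generate({ model, prompt, stream: true });
  for await (const chunk of stream) {
    if (event.sender.isDestroyed()) break; // window closed mid-generation
    event.sender.send("ollama:token", chunk.response);
  }
});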
On the frontend, tokens are buffered and rendered with a typewriter effect, using React's concurrent rendering features to keep UI updates smooth even during heavy generation.
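The buffering step can be sketched without any React dependency. The `TokenBuffer` class below is a hypothetical illustration, not Gravity OS's actual implementation: tokens accumulate as they arrive, and the UI receives one batched update per flush.

```typescript
// Hypothetical buffer: collects streamed tokens and hands them to the UI in batches.
class TokenBuffer {
  private pending: string[] = [];
  private onFlush: (text: string) => void;

  constructor(onFlush: (text: string) => void) {
    this.onFlush = onFlush;
  }

  // Called for every token arriving over IPC.
  push(token: string): void {
    this.pending.push(token);
  }

  // Called on a timer or animation frame; emits at most one update per call
  // and does nothing when no tokens are waiting.
  flush(): void {
    if (this.pending.length === 0) return;
    const text = this.pending.join("");
    this.pending = [];
    this.onFlush(text);
  }
}
```

In a React component, `onFlush` could append to state with a functional update such as `setText((prev) => prev + text)`, with `flush` scheduled through `setInterval` or `requestAnimationFrame`, so a burst of tokens triggers one render instead of many.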
Context Window Management
LLMs have finite context windows. Gravity OS implements a sliding window with semantic compression:
1. Token counting: Estimate token usage before sending
2. Semantic chunking: Split conversation history into meaningful segments
3. Summary injection: Compress older context into summaries
4. Dynamic windowing: Adjust context size based on available VRAM
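The token-counting and summary-injection steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, `fitToWindow` and `summarize` are hypothetical names, the summary's own token cost is not counted against the budget, and VRAM-based sizing is left to whoever supplies `budget`.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough heuristic: about four characters per token for English text.
// A tokenizer matched to the target model would be more accurate.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages that fit the budget; replace everything older
// with a single summary message produced by the caller-supplied function.
function fitToWindow(
  history: Message[],
  budget: number,
  summarize: (older: Message[]) => string,
): Message[] {
  let keepCount = 0;
  let used = 0;
  // Walk from newest to oldest until the next message would exceed the budget.
  for (const msg of [...history].reverse()) {
    const cost = estimateTokens(msg.content);
    if (used + cost > budget) break;
    used += cost;
    keepCount++;
  }
  const split = history.length - keepCount;
  const kept = history.slice(split);
  if (split === 0) return kept; // everything fit, no summary needed
  return [{ role: "system", content: summarize(history.slice(0, split)) }, ...kept];
}
```

In practice `summarize` would itself call a local model to compress the older messages; here it is injected so the windowing logic stays independent of any particular model.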
Security & Privacy
Since everything runs locally, data never leaves the machine. Encryption at rest is applied to conversation history using Node.js's native crypto module. The app sandboxes model execution and prevents any network calls from the agent process.
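A minimal sketch of encrypting conversation history with Node.js's crypto module could look like this. The article does not name the cipher or the key-management scheme, so AES-256-GCM, the scrypt-based key derivation, and the function names `encrypt`, `decrypt`, and `deriveKey` are all assumptions made for illustration.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

// Assumption: AES-256-GCM (authenticated encryption); the article does not specify.
const ALGORITHM = "aes-256-gcm";

function encrypt(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit nonce, freshly generated per message
  const cipher = createCipheriv(ALGORITHM, key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte authentication tag by default
  // Store nonce, auth tag, and ciphertext together as one base64 string.
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

function decrypt(payload: string, key: Buffer): string {
  const data = Buffer.from(payload, "base64");
  const iv = data.subarray(0, 12);
  const tag = data.subarray(12, 28);
  const ciphertext = data.subarray(28);
  const decipher = createDecipheriv(ALGORITHM, key, iv);
  decipher.setAuthTag(tag); // final() throws if the data was tampered with
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// Hypothetical key management: derive a 32-byte key from a passphrase.
function deriveKey(passphrase: string, salt: Buffer): Buffer {
  return scryptSync(passphrase, salt, 32);
}
```

Because GCM is authenticated, a modified or truncated history file fails on decryption instead of silently yielding corrupted text.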
Lessons Learned
Building Gravity OS taught me that local AI is not just feasible; it's often superior for development workflows. The latency of local inference, even on consumer hardware, is offset by zero network overhead and unlimited token budgets. The key was designing the orchestration layer to be model-agnostic, allowing users to swap between Mistral, Llama, CodeLlama, and others without changing their workflow.