Ollama · Python · Electron · Local AI · Privacy

Local-First AI Development: Running LLMs Offline with Ollama, Python & Electron

2024-10-22 · 14 min read

The Case for Local AI

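Cloud AI APIs are convenient, but they come with strings attached: per-token pricing, data privacy concerns, internet dependency, and rate limits. For developers building AI-powered tools, running models locally with Ollama removes these constraints: inference carries no per-token fee, prompts never leave the machine, and nothing depends on a network connection. For many development tasks, a well-chosen local model performs close enough to cloud-hosted models that the difference rarely matters, provided the hardware can hold the model in memory.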

Setting Up Ollama

Ollama makes running local LLMs straightforward. On Linux, installation is a single command (macOS and Windows use a downloadable installer):

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull models
ollama pull mistral:7b
ollama pull codellama:34b
ollama pull llama3:8b
ollama pull nomic-embed-text
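Before wiring anything else up, it helps to confirm that the server is running and the pulls succeeded. The sketch below queries Ollama's `GET /api/tags` endpoint on the default local port and reports which of the models above are missing. The helper names (`missing_models`, `installed_models`) are illustrative, not part of any library, and the script assumes the server is listening on `localhost:11434`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address

# The models pulled in the commands above.
REQUIRED_MODELS = ["mistral:7b", "codellama:34b", "llama3:8b", "nomic-embed-text"]


def missing_models(installed: list[str], required: list[str]) -> list[str]:
    """Return the required models that are absent from the installed list.

    A name without an explicit tag is treated as ':latest', which is how
    Ollama reports names such as 'nomic-embed-text:latest'.
    """
    def normalize(name: str) -> str:
        return name if ":" in name else f"{name}:latest"

    installed_set = {normalize(name) for name in installed}
    return [name for name in required if normalize(name) not in installed_set]


def installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the local Ollama server which models have been pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["name"] for entry in payload.get("models", [])]
```

Calling `missing_models(installed_models(), REQUIRED_MODELS)` returns an empty list when everything is in place. If the server is not running, `installed_models` raises an `OSError` subclass, which is a useful early signal in a startup check.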

Quantization Levels

Not all hardware can run full-precision models. Ollama supports multiple quantization levels:

| Level | Bits | Size Reduction | Quality Impact |
|---|---|---|---|
| Q4_K_M | 4-bit | ~75% | Minimal |
| Q5_K_M | 5-bit | ~60% | Negligible |
| Q8_0 | 8-bit | ~50% | None |
| F16 | 16-bit | Full size | Baseline |

For most development work, Q5_K_M offers the best quality-to-size ratio.
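To get a feel for what these levels mean on real hardware, the arithmetic can be sketched as parameter count times bits per weight. The function below uses the nominal bit widths from the table. That is a simplifying assumption: actual model files store extra scaling data per block, so real sizes come out somewhat larger, and the KV cache and runtime add further memory on top.

```python
# Nominal bits per weight, taken from the table above. Real quantized files
# use slightly more than this, so treat the results as lower bounds.
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "F16": 16}


def estimate_size_gb(params_billions: float, level: str) -> float:
    """Approximate size of the weights in GB (1 GB = 10**9 bytes)."""
    bits = NOMINAL_BITS[level]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for level in NOMINAL_BITS:
        print(f"7B at {level}: about {estimate_size_gb(7, level):.1f} GB")
```

Under these assumptions a 7B model needs roughly 3.5 GB for weights at 4-bit and 14 GB at F16, which is why the quantized variants are the practical choice on consumer GPUs.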

Python Integration

Python is the lingua franca of AI/ML. Here's how to integrate Ollama with Python for programmatic access, using the `ollama` client package (`pip install ollama`):

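import ollama
from typing import Generator

class LocalLLM:
    """Thin wrapper around the ollama client, bound to a single model."""

    def __init__(self, model: str = "mistral:7b"):
        self.model = model

    def generate(self, prompt: str, stream: bool = True) -> Generator[str, None, None]:
        """Yield response text: chunk by chunk when streaming, otherwise once."""
        response = ollama.generate(
            model=self.model,
            prompt=prompt,
            stream=stream,
            options={
                "temperature": 0.7,   # sampling randomness
                "top_p": 0.9,         # nucleus sampling cutoff
                "num_predict": 2048,  # maximum number of tokens to generate
            }
        )

        if stream:
            # With stream=True the client returns an iterator of partial responses.
            for chunk in response:
                yield chunk["response"]
        else:
            # With stream=False it returns one complete response.
            yield response["response"]

    def chat(self, messages: list[dict]) -> str:
        """Send a full message history and return the assistant's reply text."""
        response = ollama.chat(
            model=self.model,
            messages=messages,
            options={"temperature": 0.7}
        )
        return response["message"]["content"]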
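The `chat` method above is stateless: the caller passes the full message history on every call. A minimal sketch of how that history might be maintained across turns follows. `ChatSession` is a hypothetical helper, not part of the `ollama` package, and it takes the chat function as a parameter so it can be exercised without a running model.

```python
from typing import Callable, Optional

Message = dict[str, str]
ChatFn = Callable[[list[Message]], str]


class ChatSession:
    """Keeps the running message history for a multi-turn conversation."""

    def __init__(self, chat_fn: ChatFn, system_prompt: Optional[str] = None):
        self._chat_fn = chat_fn
        self.messages: list[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text: str) -> str:
        """Append the user turn, call the model, and record its reply."""
        self.messages.append({"role": "user", "content": user_text})
        # Pass a copy so the callable cannot mutate the stored history.
        reply = self._chat_fn(list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With the class above, `ChatSession(LocalLLM().chat, "You are a concise assistant.")` would give each `ask` call the context of all earlier turns. Long conversations eventually exceed the model's context window, so a real application would also need to trim or summarize old messages.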

Electron Desktop App

Wrapping the Python backend with Electron provides a native desktop experience:

Python Backend (FastAPI)

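import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import ollama
import uvicorn

app = FastAPI()

class GenerateRequest(BaseModel):
    model: str
    prompt: str
    stream: bool = True  # accepted in the request body; this endpoint always streams

@app.post("/generate")
async def generate(req: GenerateRequest):
    def stream_generator():
        # Relay each chunk from Ollama as one server-sent event.
        for chunk in ollama.generate(
            model=req.model,
            prompt=req.prompt,
            stream=True
        ):
            # JSON-encode the text so a newline inside a chunk cannot break SSE framing.
            payload = json.dumps({"response": chunk["response"]})
            yield f"data: {payload}\n\n"

    return StreamingResponse(stream_generator(), media_type="text/event-stream")

if __name__ == "__main__":
    # Bind to loopback only; the host and port here are illustrative.
    uvicorn.run(app, host="127.0.0.1", port=8000)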
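On the consuming side, a client has to split the `text/event-stream` body back into events. The sketch below shows the core of that parsing under the standard SSE framing rules: `data:` lines accumulate, and a blank line ends the event. The function name `iter_sse_data` is illustrative, and it deliberately ignores other SSE fields such as `event:` and `id:`, which this endpoint does not send.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    `lines` is the response body split into lines. Consecutive 'data:' lines
    belong to one event and are joined with newlines; a blank line ends the
    event. Any other field or comment line is skipped.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the event collected so far, if any.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
    # An event not terminated by a blank line is incomplete and is discarded.
```

The lines can come from any HTTP client that exposes a streamed body line by line. Each yielded string is whatever the server placed in the data field, so the caller decides whether to use it as plain text or decode it as JSON.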

Electron Frontend

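// preload.ts
import { contextBridge } from "electron";

// Expose a minimal API to the renderer. This calls Ollama's own REST API
// (default port 11434) directly rather than going through the FastAPI backend.
contextBridge.exposeInMainWorld("ollama", {
  generate: async (model: string, prompt: string) => {
    const response = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    });
    if (!response.ok) {
      throw new Error(`Ollama request failed: ${response.status}`);
    }
    // With stream: false, Ollama returns a single JSON object.
    return response.json();
  },
});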

Agent Orchestration Patterns

Local models excel at specialized tasks when composed into agent pipelines:

Sequential Pipeline

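import ollama

class AgentPipeline:
    """Runs agents in order, feeding each agent's output to the next one."""

    def __init__(self):
        self.agents = []

    def add_agent(self, name: str, model: str, system_prompt: str):
        self.agents.append({
            "name": name,
            "model": model,
            "system_prompt": system_prompt,
        })

    async def run(self, task: str) -> str:
        # Use the async client so the event loop is not blocked during generation.
        client = ollama.AsyncClient()
        result = task
        for agent in self.agents:
            messages = [
                {"role": "system", "content": agent["system_prompt"]},
                {"role": "user", "content": result},
            ]
            response = await client.chat(
                model=agent["model"],
                messages=messages
            )
            # The previous agent's output becomes the next agent's input.
            result = response["message"]["content"]
        return result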

Parallel Split-Merge

For tasks that benefit from multiple perspectives:

1. **Split**: Divide the task into sub-problems
2. **Parallel**: Each agent works independently
3. **Merge**: A synthesis agent combines results
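The three steps above can be sketched with `asyncio.gather`. The agents are passed in as plain async callables, a hypothetical signature chosen so the control flow can be shown and tested without a running model. In practice each one would wrap a chat call to a local model.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical agent signature: takes a prompt, returns the model's reply.
Agent = Callable[[str], Awaitable[str]]


async def split_merge(
    task: str,
    splitter: Callable[[str], list[str]],
    worker: Agent,
    merger: Agent,
) -> str:
    # 1. Split: divide the task into independent sub-problems.
    subtasks = splitter(task)

    # 2. Parallel: start one worker call per sub-problem and wait for all.
    #    gather() returns results in the same order as the inputs.
    results = await asyncio.gather(*(worker(sub) for sub in subtasks))

    # 3. Merge: hand the numbered partial results to the synthesis agent.
    combined = "\n\n".join(
        f"[{index}] {text}" for index, text in enumerate(results, start=1)
    )
    return await merger(combined)
```

Whether the worker calls actually execute at the same time depends on how the Ollama server is configured and on available memory. If requests are queued, the pattern still works but runs no faster than a sequential loop.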

Performance Optimization

Running LLMs locally requires careful resource management:

**VRAM monitoring**: Track GPU memory usage and swap models when memory runs low

**Model unloading**: Unload models that are no longer in use to free memory

**Batched inference**: Group requests for GPU efficiency

**Response caching**: Cache responses to identical prompts, and optionally use semantic similarity matching to reuse answers for near-duplicate prompts
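As one illustration of the caching item, here is a minimal exact-match cache with least-recently-used eviction. It is a sketch under stated assumptions: `ResponseCache` is a hypothetical class, the `generate` callable stands in for a real model call, and only leading and trailing whitespace is normalized. Semantic similarity matching would additionally need prompt embeddings and a similarity threshold, which are not shown.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Callable, Optional


class ResponseCache:
    """Small in-memory LRU cache keyed on model, prompt, and options."""

    def __init__(self, max_entries: int = 256):
        self.max_entries = max_entries
        self._store: "OrderedDict[str, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str, options: Optional[dict]) -> str:
        # Strip surrounding whitespace so trivially different prompts match.
        blob = json.dumps([model, prompt.strip(), options or {}], sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get_or_generate(
        self,
        model: str,
        prompt: str,
        generate: Callable[[str, str], str],
        options: Optional[dict] = None,
    ) -> str:
        key = self._key(model, prompt, options)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = generate(model, prompt)
        self._store[key] = result
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used
        return result
```

Caching like this suits deterministic settings, such as temperature 0, where returning the same answer for the same prompt is the desired behavior. With sampling enabled, a cache hit hides the variation the user may have asked for.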

Conclusion

Local AI development isn't just a fallback for when the cloud is unavailable; for many projects it's the better development workflow. No network round-trip before the first token, no rate limits, full data privacy, and no per-request costs make it a strong choice for development tools and privacy-sensitive applications.