Local-First AI Development: Running LLMs Offline with Ollama, Python & Electron
The Case for Local AI
Cloud AI APIs are convenient, but they come with strings attached: per-token pricing, data privacy concerns, internet dependency, and rate limits. For developers building AI-powered tools, running models locally with Ollama removes these constraints, and for many development tasks the results are close to those of cloud-hosted models.
Setting Up Ollama
Ollama makes running local LLMs straightforward. On Linux, installation is a single command, after which models are pulled by name:
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull models
ollama pull mistral:7b
ollama pull codellama:34b
ollama pull llama3:8b
ollama pull nomic-embed-text
Quantization Levels
Not all hardware can run full-precision models, so Ollama models are available at multiple quantization levels that trade some output quality for a smaller memory footprint.
For most development work, Q5_K_M offers the best quality-to-size ratio.
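To make the size trade-off concrete, here is a rough back-of-the-envelope sketch. The bits-per-weight figures are approximate assumptions for common GGUF quantization types, not values from this article, and the helper name `estimate_weights_gb` is hypothetical. The estimate covers the weights only; context (KV cache) and runtime overhead come on top.

```python
# Approximate bits per weight for common GGUF quantization types.
# These are rough assumptions for illustration, not measured values.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the model weights in gigabytes (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits / 8


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"7B model at {quant}: ~{estimate_weights_gb(7, quant):.1f} GB")
```

Under these assumptions a 7B model needs roughly 14 GB at F16 and about 5 GB at Q5_K_M, which is why quantized variants fit on consumer hardware.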
Python Integration
Python is the lingua franca of AI/ML. Here's how to integrate Ollama with Python for programmatic access:
import ollama
from typing import Generator


class LocalLLM:
    """Thin wrapper around the ollama Python client for a single model."""

    def __init__(self, model: str = "mistral:7b"):
        self.model = model

    def generate(self, prompt: str, stream: bool = True) -> Generator[str, None, None]:
        # Always a generator: many chunks when streaming, a single item otherwise.
        response = ollama.generate(
            model=self.model,
            prompt=prompt,
            stream=stream,
            options={
                "temperature": 0.7,
                "top_p": 0.9,
                "num_predict": 2048,  # upper bound on generated tokens
            },
        )
        if stream:
            for chunk in response:
                yield chunk["response"]
        else:
            yield response["response"]

    def chat(self, messages: list[dict]) -> str:
        # Each message is a dict with "role" and "content" keys.
        response = ollama.chat(
            model=self.model,
            messages=messages,
            options={"temperature": 0.7},
        )
        return response["message"]["content"]
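The `chat` method above takes a list of role/content dictionaries. As a minimal sketch of how such a list might be assembled for a multi-turn conversation, the helper below (the name `build_messages` and the tuple-based history format are assumptions for illustration) turns a system prompt, prior turns, and a new user input into that shape.

```python
def build_messages(
    system_prompt: str,
    history: list[tuple[str, str]],
    user_input: str,
) -> list[dict]:
    """Assemble a chat message list: system prompt, prior turns, new input.

    history is a list of (user_text, assistant_text) pairs, oldest first.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_input})
    return messages


if __name__ == "__main__":
    msgs = build_messages(
        "You are a concise code reviewer.",
        [("What does this regex do?", "It matches ISO dates.")],
        "Can it be simplified?",
    )
    for m in msgs:
        print(m["role"], "->", m["content"])
```

The resulting list can be passed directly as the `messages` argument of `chat`.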
Electron Desktop App
Wrapping the Python backend with Electron provides a native desktop experience:
Python Backend (FastAPI)
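import json

import ollama
import uvicorn
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class GenerateRequest(BaseModel):
    model: str
    prompt: str
    stream: bool = True  # not read by this handler; the response is always streamed


@app.post("/generate")
async def generate(req: GenerateRequest):
    def stream_generator():
        # Forward each chunk from Ollama as a Server-Sent Events frame.
        for chunk in ollama.generate(
            model=req.model,
            prompt=req.prompt,
            stream=True,
        ):
            # JSON-encode the text so newlines inside a chunk cannot break SSE framing.
            yield f"data: {json.dumps(chunk['response'])}\n\n"

    return StreamingResponse(stream_generator(), media_type="text/event-stream")


if __name__ == "__main__":
    # Host and port are assumptions; adjust them to your setup.
    uvicorn.run(app, host="127.0.0.1", port=8000)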
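On the consuming side, a client has to split the `text/event-stream` body back into events. The following is a minimal sketch of that parsing step, assuming the client already has the response as an iterable of text lines (for example from a streaming HTTP library); the function name `iter_sse_data` is hypothetical. It returns each payload as-is, so any decoding (such as `json.loads` if the server JSON-encodes its chunks) is left to the caller.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a stream of lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data is joined with "\n".
            yield "\n".join(buffer)
            buffer = []
    # An event not terminated by a blank line is incomplete and is dropped.


if __name__ == "__main__":
    stream = ["data: Hello", "", "data: world", ""]
    print(list(iter_sse_data(stream)))
```

Only `data:` fields are handled here; a fuller client would also deal with `event:`, `id:`, and comment lines.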
Electron Frontend
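// preload.ts
import { contextBridge } from "electron";

// Expose a minimal API to the renderer; this calls Ollama's REST API directly.
contextBridge.exposeInMainWorld("ollama", {
  generate: async (model: string, prompt: string) => {
    const response = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    });
    if (!response.ok) {
      throw new Error(`Ollama request failed with status ${response.status}`);
    }
    return response.json();
  },
});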
Agent Orchestration Patterns
Local models excel at specialized tasks when composed into agent pipelines:
Sequential Pipeline
import ollama


class AgentPipeline:
    """Runs agents in order, feeding each agent's output to the next."""

    def __init__(self):
        self.agents = []

    def add_agent(self, name: str, model: str, system_prompt: str):
        self.agents.append({
            "name": name,
            "model": model,
            "system_prompt": system_prompt,
        })

    async def run(self, task: str) -> str:
        # AsyncClient keeps run() from blocking the event loop during model calls.
        client = ollama.AsyncClient()
        result = task
        for agent in self.agents:
            messages = [
                {"role": "system", "content": agent["system_prompt"]},
                {"role": "user", "content": result},
            ]
            response = await client.chat(
                model=agent["model"],
                messages=messages,
            )
            result = response["message"]["content"]
        return result
Parallel Split-Merge
For tasks that benefit from multiple perspectives:
1. Split: divide the task into sub-problems
2. Parallel: each agent works independently
3. Merge: a synthesis agent combines the results
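The three steps can be sketched with `asyncio.gather`. This is a minimal illustration under stated assumptions: an agent is modeled as any async function from text to text, the names `split_merge` and `Agent` are hypothetical, and the stand-in agents below replace real model calls so the sketch runs without a model server.

```python
import asyncio
from typing import Awaitable, Callable

# Assumption: an agent is any async function from text to text.
Agent = Callable[[str], Awaitable[str]]


async def split_merge(
    task: str,
    split: Callable[[str], list[str]],
    workers: list[Agent],
    merge: Agent,
) -> str:
    """Split a task, run one worker per sub-problem concurrently, then merge."""
    sub_tasks = split(task)
    if len(sub_tasks) != len(workers):
        raise ValueError("expected one worker per sub-task")
    # gather() preserves input order, so results line up with sub_tasks.
    partials = await asyncio.gather(
        *(worker(sub) for worker, sub in zip(workers, sub_tasks))
    )
    return await merge("\n\n".join(partials))


# Stand-in agents; in practice each would wrap a chat call to a local model.
async def upper_agent(text: str) -> str:
    return text.upper()


async def reverse_agent(text: str) -> str:
    return text[::-1]


async def merge_agent(text: str) -> str:
    return " | ".join(text.split("\n\n"))


def demo() -> str:
    return asyncio.run(
        split_merge(
            "abc;def",
            lambda t: t.split(";"),
            [upper_agent, reverse_agent],
            merge_agent,
        )
    )


if __name__ == "__main__":
    print(demo())  # prints "ABC | fed"
```

Whether the workers actually run in parallel depends on the backend: the coroutines overlap while waiting on I/O, but a single local model server may still process requests one at a time.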
Performance Optimization
Running LLMs locally requires careful resource management, since model weights and context all have to fit in the memory available on local hardware.
Conclusion
Local AI development isn't just a fallback for when the cloud is unavailable: it's a superior development workflow. No network latency before the first token, unlimited requests, full data privacy, and no per-token costs make it the ideal choice for development tools and privacy-sensitive applications.