What models can I run with OrionPod?

OrionPod supports GGUF-format models. You can browse and download them directly from HuggingFace inside the app. Popular models like Mistral, DeepSeek, Qwen, Kimi, Llama, and Gemma all work. The app automatically filters out models that won't run on your hardware.

Is OrionPod really 100% private?

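Yes. OrionPod runs inference entirely on your machine. There's no telemetry, no analytics, no cloud inference, and no API keys. Your prompts and model outputs never leave your device. Network access is used for browsing and downloading models from HuggingFace and for the update check on launch.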

What hardware do I need for OrionPod?

macOS 10.15 (Catalina) or later. Apple Silicon (M1/M2/M3/M4) is recommended for Metal GPU acceleration, but Intel Macs work too. For 7B models, 8 GB of RAM is a comfortable minimum; larger models need more.

Is OrionPod free?

Yes, OrionPod is free and open source. No subscriptions, no paywalls, no premium tiers.

Download OrionPod

Currently available for macOS

v0.4.0-beta

Yuki (Latest)

2026-06-05

Realigning with a focus on UX

macOS

Universal (Apple Silicon + Intel) · ~25 MB

OrionPod_0.4.0_universal.dmg

Changelog

Changes

› Faster multi-turn chat — the model's context (KV cache) is now reused across turns instead of rebuilt on every message, lowering per-message latency and removing the redundant Metal warm-up after the first response
› Observability now reports the real time-to-first-token per request (previously an estimate derived from tokens/sec)
› Context Window card on Observability page — gauge ring showing token usage ratio, used/max/available counts, messages in context, and pruned message count
› Keyboard shortcuts: `⌘N` new chat, `⌘K` quick model switcher, `⌘,` open settings, `Esc` stop generation
› Quick model switcher — command palette (`⌘K`) with search, keyboard navigation, load/unload
› Dynamic window title — shows "OrionPod — ModelName" when a model is loaded
› Window size and position remembered across launches
› Skeleton loader for model list (replaces plain text spinner)
› Smooth animations: message appearance, sidebar active indicator, modal transitions
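
The time-to-first-token change can be illustrated with a small sketch. It assumes a hypothetical `TtftTimer` helper (not OrionPod's actual `MetricsCollector` API) and contrasts a measured value with the older `1000/tps` estimate named in the For Geeks notes. Timestamps are passed in explicitly so the example stays deterministic.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: records when a request started and when its
/// first token arrived.
pub struct TtftTimer {
    started: Instant,
    first_token: Option<Instant>,
}

impl TtftTimer {
    pub fn start(now: Instant) -> Self {
        Self { started: now, first_token: None }
    }

    /// Call for every generated token; only the first call is recorded.
    pub fn on_token(&mut self, now: Instant) {
        if self.first_token.is_none() {
            self.first_token = Some(now);
        }
    }

    /// Measured time-to-first-token, or None if no token has arrived yet.
    pub fn ttft(&self) -> Option<Duration> {
        self.first_token.map(|t| t.duration_since(self.started))
    }
}

/// The older approximation: one token's worth of time at the average
/// decode rate, in milliseconds.
pub fn estimated_ttft_ms(tokens_per_sec: f64) -> f64 {
    1000.0 / tokens_per_sec
}
```

If prompt processing takes 350 ms and decoding then runs at 40 tokens/sec, the timer reports 350 ms while the estimate reports 25 ms, which shows how an estimate based only on decode speed can miss prompt-processing time.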

For Geeks

› Persistent inference context — `InferenceEngine` keeps one `llama_context` per loaded model (created lazily, reused across turns; recreated only when context size or thread count changes). `generate()` is now `&mut self`
› Incremental KV cache: each new prompt is reconciled against the tokens resident in the cache via longest-common-prefix; stale entries past the divergence point are dropped (`clear_kv_cache_seq`), only the diverging suffix is decoded, and generated tokens are appended for reuse on the next turn
› KV cache reconciles with orion-core pruning automatically (truncates at the divergence point, system-prompt prefix retained) and resets on `clear_conversation`, model switch/unload, and context overflow via `InferenceEngine::reset_context()`
› Metal compute pipelines now compile once per loaded model instead of once per turn
› `AgentEvent::GenerationStats` — new event carrying real tokens generated, tokens/sec, time-to-first-token, and generation time; `MetricsCollector` records actual TTFT instead of approximating it as `1000/tps`
› Frontend `AgentEvent` union extended with the `generation_stats` variant
› `tauri-plugin-window-state` added for persistent window geometry
› `useKeyboardShortcuts` hook for global shortcut handling
› `ChatWindow` converted to `forwardRef` to expose `clearChat()` for programmatic reset
› Context budget metrics now recorded to `MetricsCollector` — wired `ContextBudget` agent event data through to session metrics
› `SessionMetrics` TypeScript type extended with `context_used_tokens`, `context_max_tokens`, `context_messages_in_context`, `context_messages_pruned`
› `ChatTemplate` trait in orion-core — pluggable prompt formatting with `format()`, `format_system()`, `format_message()`, `assistant_prefix()`
› `ChatMLTemplate` default implementation
› `detect_template()` — auto-selects template from GGUF metadata string, falls back to ChatML
› Pair-wise context pruning — user+assistant turns pruned as units, never orphans a question or answer
› Template-aware token accounting — budget counts template overhead (`<|im_start|>`, `<|im_end|>`, etc.) per message, not just raw content
› System prompt + tool schema tokens deducted from context budget before conversation pruning
› `CoreError::Context` on overflow — clear error when system prompt or latest message exceeds context budget
› `prepare_context()` replaces separate `prune_messages()` + `format_chatml()` — single function for prune, format, and budget accounting
› `Agent::with_template()` constructor and `set_template()` for runtime template switching
› 17 new tests in `orion-core/tests/context_tests.rs` (pair-wise pruning, overflow errors, template overhead, tool budget, ChatML formatting, detect_template)
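
The longest-common-prefix reconciliation described in the incremental KV cache note can be sketched as follows. This is a minimal illustration, not the `InferenceEngine` code: `CachedTokens` is a hypothetical stand-in that tracks token IDs only, and the point where a real engine would clear stale KV entries is marked with a comment.

```rust
/// Length of the longest common prefix of two token sequences.
pub fn common_prefix_len(cached: &[i32], prompt: &[i32]) -> usize {
    cached.iter().zip(prompt).take_while(|(a, b)| a == b).count()
}

/// Hypothetical stand-in for the tokens resident in a KV cache.
pub struct CachedTokens {
    tokens: Vec<i32>,
}

impl CachedTokens {
    pub fn new() -> Self {
        Self { tokens: Vec::new() }
    }

    /// Reconciles the cache with a new prompt and returns the suffix
    /// that still has to be decoded.
    pub fn reconcile<'a>(&mut self, prompt: &'a [i32]) -> &'a [i32] {
        let keep = common_prefix_len(&self.tokens, prompt);
        // Drop everything after the divergence point. A real engine would
        // clear the matching KV entries here as well.
        self.tokens.truncate(keep);
        let suffix = &prompt[keep..];
        self.tokens.extend_from_slice(suffix);
        // Note: if the prompt is entirely cached, the suffix is empty; a real
        // engine may still need to re-decode the last token to get logits.
        suffix
    }

    /// Appends generated tokens so the next turn can reuse them.
    pub fn append_generated(&mut self, generated: &[i32]) {
        self.tokens.extend_from_slice(generated);
    }

    pub fn tokens(&self) -> &[i32] {
        &self.tokens
    }
}
```

On an ordinary follow-up turn the whole previous conversation is a shared prefix, so only the new message is decoded. When pruning removes older messages, the prompt diverges earlier and the cache is truncated at that point.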

v0.3.0-alpha

Daru

2026-03-24

Hello World, orion-core: a built-in agent harness

macOS

Universal (Apple Silicon + Intel) · ~24 MB

OrionPod_0.3.0_universal.dmg

Changelog

Changes

› Structured agent event system — chat now receives rich lifecycle events (start, delta, end, error, warning) instead of raw token strings
› Token budget bar in chat toolbar — shows context window usage (e.g. "1,200 / 4,096 tokens") with visual warning when >80% full
› Discord community link added to About modal
› System prompt support — backend accepts custom system prompts via `set_system_prompt` command
› Inference parameter tuning — `set_inference_params` command for runtime temperature, context size, and thread count updates
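
As an illustration of the token budget bar, the sketch below builds the label and the warning flag from a used/max pair. The function names are hypothetical and the real toolbar lives in the frontend; this only shows the arithmetic behind the "1,200 / 4,096 tokens" label and the >80% threshold.

```rust
/// Formats an integer with comma thousands separators, e.g. 4096 -> "4,096".
pub fn with_thousands(n: u32) -> String {
    let digits = n.to_string();
    let mut out = String::new();
    for (i, ch) in digits.chars().enumerate() {
        // Insert a comma whenever a multiple of three digits remains.
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push(',');
        }
        out.push(ch);
    }
    out
}

/// Label for the budget bar plus a flag for the >80% warning state.
pub fn budget_label(used: u32, max: u32) -> (String, bool) {
    let label = format!("{} / {} tokens", with_thousands(used), with_thousands(max));
    // Integer form of used / max > 0.8, which avoids float rounding.
    let warn = max > 0 && (used as u64) * 5 > (max as u64) * 4;
    (label, warn)
}
```

Exactly 80% usage does not trigger the warning in this sketch, matching the ">80%" wording.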

For Geeks

› `orion-core` integration: `LlamaCppBackend` implements `LlmBackend` trait, wrapping `InferenceEngine` for backend-agnostic agent loop
› `AgentState` replaces `InferenceState` — all inference commands route through `orion_core::Agent` for conversation state, context pruning, and ChatML prompt formatting
› `agent-event` Tauri event channel replaces `token-stream` — emits all 13 `AgentEvent` variants (`agent_start`, `message_delta`, `message_end`, `context_budget`, `error`, etc.)
› `set_system_prompt` and `set_inference_params` IPC commands
› `InferenceParams` extended with `n_threads` field (defaults to `available_parallelism - 2`)
› Removed `inference/streaming.rs` (`TokenEvent`, `ChatMessage`, `format_chatml` superseded by orion-core types)
› Removed dead `format_prompt` and `truncate_to_fit` from `InferenceEngine` (context pipeline now in orion-core)
› Frontend: `AgentEvent` discriminated union type (13 variants), `AgentMessage`, `ToolCall`, `ToolResultData` types added to `lib/types.ts`
› Frontend: `setSystemPrompt()` and `setInferenceParams()` IPC wrappers in `lib/tauri.ts`
› Frontend: `ChatWindow.tsx` migrated from `token-stream` listener to `agent-event` with full event handling
› Zero compiler warnings (Rust), zero TypeScript errors
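
The `n_threads` default of `available_parallelism - 2` could look roughly like this. The lower bound of one thread and the fallback when the OS query fails are assumptions of this sketch, not documented OrionPod behavior.

```rust
use std::thread;

/// Thread count for a machine with `cores` logical cores: leave two free,
/// but never go below one (the lower bound is an assumption of this sketch).
pub fn threads_for(cores: usize) -> usize {
    cores.saturating_sub(2).max(1)
}

/// Default derived from the OS-reported parallelism, falling back to a
/// single core if the query fails.
pub fn default_n_threads() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    threads_for(cores)
}
```

Using `saturating_sub` keeps the subtraction from underflowing on machines that report one or two cores.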

v0.2.3-beta

Faye

2026-03-21

Suggested Models and Onboarding

macOS

Universal (Apple Silicon + Intel) · ~24 MB

OrionPod_0.2.3_universal.dmg

Changelog

Changes

› "Surprise Me" model discovery — one-click random suggestions from a curated list of small, high-quality models
› Curated model list: TinyLlama 1.1B, Qwen 2.5 (0.5B/1.5B/3B), Phi 3.5 Mini, Gemma 2 2B, StableLM 2 1.6B, SmolLM2 1.7B, Llama 3.2 (1B/3B)
› "Try Another" re-roll button — skips already-seen and already-downloaded models
› Inline download with progress tracking directly from the suggestion card
› HuggingFace search result caching (5-minute TTL) — fewer API calls, faster repeat searches
› Quantization variant badges on search result cards — see available quants at a glance
› Sort search results by downloads, likes, or recent activity
› Download pause/resume — pause active downloads, resume later (supports HTTP Range)
› Cancel download button with proper cleanup
› Disk space check before downloading — warns if insufficient space
› Download complete toast with "Load now?" action button
› Active download indicator in footer bar
› First-run welcome wizard — guided setup: download a starter model, auto-load, start chatting
› Partial download recovery — detects incomplete downloads after app crash and offers resume
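
The 5-minute search cache can be sketched as a map whose entries carry an insertion timestamp. `TtlCache` is a hypothetical type written for this example (the actual `SearchCache` may differ), and the current time is passed in so the expiry logic is easy to follow and test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL cache keyed by search query.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    pub fn insert(&mut self, query: &str, value: V, now: Instant) {
        self.entries.insert(query.to_string(), (now, value));
    }

    /// Returns a cached value if it is younger than the TTL.
    /// Expired entries are evicted on lookup.
    pub fn get(&mut self, query: &str, now: Instant) -> Option<V> {
        let fresh = match self.entries.get(query) {
            Some((stored_at, _)) => now.duration_since(*stored_at) < self.ttl,
            None => return None,
        };
        if fresh {
            self.entries.get(query).map(|(_, v)| v.clone())
        } else {
            self.entries.remove(query);
            None
        }
    }
}
```

With `Duration::from_secs(300)` as the TTL, a repeated search within five minutes is served from memory, and the first lookup after that removes the stale entry so the caller goes back to the API.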

For Geeks

› `src/lib/curatedModels.ts` — maintainable curated model list as a typed constant array
› `SurpriseCard` component with full download lifecycle (progress, pause/resume/cancel)
› `SearchCache` with TTL-based expiry in `HuggingFaceClient`
› `DownloadManagerState` with `DownloadHandle` for cancel/pause control
› `DownloadSidecar` metadata JSON written alongside downloaded GGUFs
› `available_disk_space()` using `sysinfo::Disks` for volume-aware space check
› `cancel_download`, `pause_download`, `resume_download`, `list_partial_downloads`, `get_available_disk_space` IPC commands
› Download resume via HTTP `Range` header with `.gguf.part` file detection
› `WelcomeWizard` component with 4-step flow (welcome → download → loading → ready)
› `useDownloads` hook now tracks download completion transitions for toast notifications
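
Resuming via the HTTP `Range` header comes down to asking the server for the bytes that are not yet on disk. The sketch below derives the header value from the size of a `.gguf.part` file. The helper names are hypothetical, and the HTTP request itself is left out.

```rust
use std::path::{Path, PathBuf};

/// Path of the partial file for a download target,
/// e.g. "model.gguf" -> "model.gguf.part".
pub fn part_path(target: &Path) -> PathBuf {
    let mut name = target.as_os_str().to_os_string();
    name.push(".part");
    PathBuf::from(name)
}

/// `Range` header value requesting everything after the bytes already
/// on disk. Returns None when nothing has been downloaded yet.
pub fn resume_range(bytes_on_disk: u64) -> Option<String> {
    if bytes_on_disk == 0 {
        None
    } else {
        Some(format!("bytes={}-", bytes_on_disk))
    }
}

/// Looks for a partial file and derives the Range header from its size.
/// A missing or unreadable partial file means a fresh download.
pub fn range_for_target(target: &Path) -> Option<String> {
    let len = std::fs::metadata(part_path(target)).ok()?.len();
    resume_range(len)
}
```

A server that honors the header answers with `206 Partial Content`. A plain `200 OK` means the range was ignored, in which case a client would typically discard the partial file and start over.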

v0.2.2-alpha

2026-03-19

Your models, your rules

macOS

Universal (Apple Silicon + Intel) · ~24 MB

OrionPod_0.2.2_universal.dmg

Changelog

Changes

› GGUF metadata extraction — model cards now show parameter count, context length, and architecture
› Runtime controls — functional thread count, temperature, and context length sliders in Settings
› Chat template auto-detection from GGUF metadata with manual override dropdown (ChatML, Llama 3, Mistral, Gemma, Phi-3, DeepSeek, etc.)
› Context overflow handling — oldest messages automatically pruned when conversation exceeds context window
› Model status events — real-time loading/ready/error/unloaded status via Tauri events
› Update notification toast with download button when a new version is available
› Actionable toast notifications (toasts can now have clickable action buttons)
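
Context overflow handling by dropping the oldest messages can be sketched as below, using the pair-wise variant that the For Geeks notes attribute to `truncate_to_fit()`. The `Msg` type, the precomputed token counts, and the strict user/assistant alternation are assumptions made for this example.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Msg {
    pub role: &'static str, // "user" or "assistant"
    pub tokens: usize,      // token count, assumed to be precomputed
}

/// Drops the oldest user+assistant pairs until the remaining messages fit
/// into `budget` tokens. Returns the kept messages and the number pruned.
/// Assumes messages strictly alternate, starting with a user message.
pub fn prune_pairs(messages: &[Msg], budget: usize) -> (Vec<Msg>, usize) {
    let mut start = 0;
    let mut total: usize = messages.iter().map(|m| m.tokens).sum();
    // Remove two messages at a time, but always keep the most recent turn.
    while total > budget && start + 2 < messages.len() {
        total -= messages[start].tokens + messages[start + 1].tokens;
        start += 2;
    }
    (messages[start..].to_vec(), start)
}
```

If the most recent turn alone is larger than the budget, this sketch returns it anyway; a real implementation would need to report that case as an error.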

For Geeks

› `GgufModelInfo` struct with full GGUF header metadata (params, layers, heads, embedding dim, architecture, chat template)
› `InferenceEngine::format_prompt()` uses `apply_chat_template()` from llama.cpp with ChatML fallback
› `InferenceEngine::truncate_to_fit()` for pair-wise context pruning
› `model-status` Tauri event channel with `ModelStatusEvent` payload
› `AppConfig` extended with `chat_template` option for manual override
› `generate()` accepts configurable `n_threads` and `context_length` from config
› `ModelMetadata` enriched with `context_length`, `architecture`, `chat_template` fields (backward-compatible via `#[serde(default)]`)
› Auto-update check via `https://orionpod.com/api/latest.json` on app launch
› `check_for_updates` Rust IPC command with semver-aware version comparison
› `useUpdateCheck` hook (5s delayed, silent fail, non-blocking)
› `update-web-release.cjs` script for automated release metadata updates
› Changelog auto-extraction from `CHANGELOG.md` into `releases.js`
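
A semver-aware comparison for the update check might look like the following. This is a simplified sketch with hypothetical function names: it compares only the numeric major, minor, and patch fields and ignores pre-release suffixes such as `-beta`, which full SemVer ordering would take into account.

```rust
/// Parses "v0.4.0-beta" or "0.4.0" into (major, minor, patch).
/// Pre-release and build suffixes are ignored, a simplification of SemVer.
pub fn parse_version(s: &str) -> Option<(u64, u64, u64)> {
    let core = s.trim().trim_start_matches('v');
    let core = core.split(|c| c == '-' || c == '+').next()?;
    let mut parts = core.split('.').map(|p| p.parse::<u64>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    let patch = parts.next()??;
    // Reject anything with more than three numeric fields.
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// True when `latest` is strictly newer than `current`.
/// Unparseable versions never trigger an update prompt.
pub fn is_newer(latest: &str, current: &str) -> bool {
    match (parse_version(latest), parse_version(current)) {
        (Some(l), Some(c)) => l > c,
        _ => false,
    }
}
```

Comparing numeric tuples rather than strings is what makes `0.10.0` rank above `0.9.3`. For complete pre-release ordering, a dedicated library such as the `semver` crate would be the more robust choice.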

v0.2.1-rc1

Kusanagi (Deprecated)

2026-03-18

First public release. Metal GPU acceleration, HuggingFace model browser, real-time observability.

macOS

Universal (Apple Silicon + Intel) · ~30 MB

OrionPod_0.2.1_universal.dmg

Changelog

Changes

› Chat interface with streaming responses and markdown rendering
› HuggingFace model browser with hardware compatibility filtering
› GGUF model support — download from HuggingFace or upload local files
› Real-time observability dashboard (tokens/s, memory, latency, GPU usage)
› Metal GPU acceleration on Apple Silicon
› Model parameter controls (temperature, context length, top-p, top-k)
› Toast notifications and user-friendly error messages
› Glassmorphism UI with macOS vibrancy

For Geeks

› Tauri v2 + React + TypeScript + Rust + llama.cpp
› orion-core agent harness crate (backend-agnostic)
› Universal macOS binary (Apple Silicon + Intel), ~30 MB
› Starts in under 2 seconds, <50 MB RAM idle
› Zero telemetry, zero analytics, zero cloud dependencies

System Requirements

✓ macOS 10.15 (Catalina) or later
✓ Apple Silicon (M-series) recommended for Metal GPU acceleration
✓ 8 GB RAM minimum for 7B models

Open the DMG → drag OrionPod to Applications → launch. That's it.