LLM Runners

Discover the best tools for running local LLMs on your own hardware

Rank	App Name	OS	Platform	Supported	Explanation
1	Ollama	Windows / macOS / Linux	CUDA / ROCm / Metal / CPU	LLM, RAG, MCP, API	Most popular local LLM runner. Single command: 'ollama run llama3'. Automatic model management, OpenAI-compatible API, Docker support, MCP agent integration.
2	LM Studio	Windows / macOS / Linux	CUDA / Vulkan / Metal / CPU	LLM, RAG, Chat, API	GUI-based local LLM client. Easy GGUF/Q4 model download and testing. Built-in chat interface, server mode for API access, model comparison tools.
3	llama.cpp	Windows / macOS / Linux	CUDA / Metal / Vulkan / CPU / WebGPU	LLM, Embeddings, RAG	C++ inference engine for LLMs. GGUF format standard. High performance on all hardware. Supports 3-bit to 8-bit quantization. Base for most local LLM tools.
4	KoboldCPP	Windows / macOS / Linux	CUDA / Vulkan / Metal / CPU / WebGPU	LLM, Roleplay, RAG, Story mode	Web-based LLM frontend built on llama.cpp. Excellent for character roleplay and story mode. Smart context management, NovelAI models support, API compatible.
5	llamafile	Windows / macOS / Linux	CUDA / Metal / Vulkan / CPU	LLM, Embeddings, Image Gen	Mozilla project: self-extracting executable containing llama.cpp + models. Single file distribution. No dependencies needed. Runs on any device. Code licensed under Apache 2.0.
6	vLLM	Linux / Windows	CUDA / TensorRT / ROCm	LLM Serving, API, Batch	High-throughput LLM serving engine. PagedAttention technology for memory efficiency. Supports Llama, Mistral, Qwen, Gemma. Production-ready API, multi-GPU support.
7	MLX	macOS	Apple Neural Engine / Metal	LLM, Fine-tuning, Inference	Apple-optimized ML framework. Best performance on Mac with Apple Silicon. Fast inference, easy fine-tuning. Supports Llama, Mistral, Gemma models natively.
8	Jan	Windows / macOS / Linux	CUDA / Vulkan / Metal / CPU	LLM, RAG, API, Extensions	Open-source, cross-platform LLM runner. Desktop app with clean UI. Supports GGUF, Safetensors models. Extension system, local API server, offline-first design.
9	Text Generation WebUI	Windows / macOS / Linux	CUDA / ROCm / Vulkan / CPU	LLM, Roleplay, RAG, Training	Feature-rich web UI for local LLMs. One-click install, extension system. Supports llama.cpp, transformers, ExLlamaV2. Training, embedding, image generation plugins.
10	SillyTavern	Windows / macOS / Linux / Docker	Node.js (connects to any backend)	LLM Frontend, Roleplay, RAG	Advanced LLM frontend for character roleplay. Connects to Ollama, KoboldCPP, llama.cpp, API backends. Token visualization, story mode, avatar system, extensions.
11	ExLlamaV2	Linux / Windows	CUDA	LLM Inference, High Performance	CUDA-optimized LLM inference engine. Faster than llama.cpp for FP16/A16. Supports GPTQ, AWQ, FP16 models. Best performance for NVIDIA GPUs 10XX+.
12	llama.cpp Forks — Aquila	Windows / macOS / Linux	CUDA / CPU	LLM, Chinese Language	llama.cpp fork optimized for Chinese language models. Supports Aquila, Baichuan, ChatGLM models. Better Chinese tokenization and context handling.
13	llama.cpp Fork — GPTQ.cpp	Windows / macOS / Linux	CUDA / CPU	LLM, GPTQ Models	llama.cpp fork for GPTQ quantized models. Runs GPTQ models natively without conversion. Lower VRAM usage, good for 4-bit quantized models.
14	llama.cpp Fork —llama.cpp-Metal	macOS	Metal / Apple Silicon	LLM, Apple GPU Acceleration	Metal-optimized branch of llama.cpp. Best performance on Mac with Apple Silicon. Uses Metal Performance Shaders for GPU acceleration.
15	llama.cpp Fork — WebLLM	Any (Browser)	WebGPU / WASM	LLM, Browser-based Inference	Run LLMs directly in browser via WebGPU. No server needed. Supports Phi-2, Llama, Mistral in browser. Privacy-first, fully client-side execution.
16	llama.cpp Fork — Ollama.cpp	Windows / macOS / Linux	CUDA / Vulkan / CPU	LLM, Lightweight Runtime	Lightweight C++ runtime inspired by Ollama. Minimal dependencies, fast startup. Good for embedded systems and edge devices. Simple API for local inference.

Rank

App Name

Platform

Supported

Explanation

Ollama

Windows / macOS / Linux

CUDA / ROCm / Metal / CPU

LLM, RAG, MCP, API

Most popular local LLM runner. Single command: 'ollama run llama3'. Automatic model management, OpenAI-compatible API, Docker support, MCP agent integration.

LM Studio

Windows / macOS / Linux

CUDA / Vulkan / Metal / CPU

LLM, RAG, Chat, API

GUI-based local LLM client. Easy GGUF/Q4 model download and testing. Built-in chat interface, server mode for API access, model comparison tools.

llama.cpp

Windows / macOS / Linux

CUDA / Metal / Vulkan / CPU / WebGPU

LLM, Embeddings, RAG

C++ inference engine for LLMs. GGUF format standard. High performance on all hardware. Supports 3-bit to 8-bit quantization. Base for most local LLM tools.

KoboldCPP

Windows / macOS / Linux

CUDA / Vulkan / Metal / CPU / WebGPU

LLM, Roleplay, RAG, Story mode

Web-based LLM frontend built on llama.cpp. Excellent for character roleplay and story mode. Smart context management, NovelAI models support, API compatible.

llamafile

Windows / macOS / Linux

CUDA / Metal / Vulkan / CPU

LLM, Embeddings, Image Gen

Mozilla project: self-extracting executable containing llama.cpp + models. Single file distribution. No dependencies needed. Runs on any device. Code licensed under Apache 2.0.

vLLM

Linux / Windows

CUDA / TensorRT / ROCm

LLM Serving, API, Batch

High-throughput LLM serving engine. PagedAttention technology for memory efficiency. Supports Llama, Mistral, Qwen, Gemma. Production-ready API, multi-GPU support.

MLX

macOS

Apple Neural Engine / Metal

LLM, Fine-tuning, Inference

Apple-optimized ML framework. Best performance on Mac with Apple Silicon. Fast inference, easy fine-tuning. Supports Llama, Mistral, Gemma models natively.

Jan

Windows / macOS / Linux

CUDA / Vulkan / Metal / CPU

LLM, RAG, API, Extensions

Open-source, cross-platform LLM runner. Desktop app with clean UI. Supports GGUF, Safetensors models. Extension system, local API server, offline-first design.

Text Generation WebUI

Windows / macOS / Linux

CUDA / ROCm / Vulkan / CPU

LLM, Roleplay, RAG, Training

Feature-rich web UI for local LLMs. One-click install, extension system. Supports llama.cpp, transformers, ExLlamaV2. Training, embedding, image generation plugins.

SillyTavern

Windows / macOS / Linux / Docker

Node.js (connects to any backend)

LLM Frontend, Roleplay, RAG

Advanced LLM frontend for character roleplay. Connects to Ollama, KoboldCPP, llama.cpp, API backends. Token visualization, story mode, avatar system, extensions.

ExLlamaV2

Linux / Windows

CUDA

LLM Inference, High Performance

CUDA-optimized LLM inference engine. Faster than llama.cpp for FP16/A16. Supports GPTQ, AWQ, FP16 models. Best performance for NVIDIA GPUs 10XX+.

llama.cpp Forks — Aquila

Windows / macOS / Linux

CUDA / CPU

LLM, Chinese Language

llama.cpp fork optimized for Chinese language models. Supports Aquila, Baichuan, ChatGLM models. Better Chinese tokenization and context handling.

llama.cpp Fork — GPTQ.cpp

Windows / macOS / Linux

CUDA / CPU

LLM, GPTQ Models

llama.cpp fork for GPTQ quantized models. Runs GPTQ models natively without conversion. Lower VRAM usage, good for 4-bit quantized models.

llama.cpp Fork —llama.cpp-Metal

macOS

Metal / Apple Silicon

LLM, Apple GPU Acceleration

Metal-optimized branch of llama.cpp. Best performance on Mac with Apple Silicon. Uses Metal Performance Shaders for GPU acceleration.

llama.cpp Fork — WebLLM

Any (Browser)

WebGPU / WASM

LLM, Browser-based Inference

Run LLMs directly in browser via WebGPU. No server needed. Supports Phi-2, Llama, Mistral in browser. Privacy-first, fully client-side execution.

llama.cpp Fork — Ollama.cpp

Windows / macOS / Linux

CUDA / Vulkan / CPU

LLM, Lightweight Runtime

Lightweight C++ runtime inspired by Ollama. Minimal dependencies, fast startup. Good for embedded systems and edge devices. Simple API for local inference.