Self-hosted LLM using Ollama with NVIDIA GPU support and Open WebUI.

C# 99.1%
Shell 0.3%
Dockerfile 0.3%
Makefile 0.2%

Thunder 857cf29151	2026-05-11 13:00:06 -05:00
release: v0.4.0 - Phase 4 Automation & Output
Merges all Phase 4 work (M19–M24, M20.1, M21.1):
- Full output formatter system (--output human|json|xml|plaintext|md on all 27 commands)
- Batch pipeline sweep (bench pipeline sweep)
- Configurable discovery authors (bench discover authors)
- Pre-flight doctor (bench doctor)
- Agentic score granularity (weighted 0-100 composite score)
- Results export (bench export tps/agentic, CSV format)
- 317 → 436 tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
.config	feat(m6-s1): GREEN — DatabaseFactory + M001_InitialSchema; tool manifest; 3/3 tests pass	2026-05-07 13:40:50 -05:00
.github	feat(m1): baseline & quick wins — 36→56 tps via 32k context	2026-05-05 16:20:59 -05:00
docs/work	docs(phase4): closeout ceremonies — phase review, lessons migrated, ideas updated, STATUS complete	2026-05-11 12:59:58 -05:00
src/Thunder.LlmBench	feat(m24-s2): GREEN — ExportRepository, ExportCommand, bench export tps/agentic	2026-05-11 12:45:16 -05:00
.env.example	Initial commit: Self-hosted LLM using Ollama with NVIDIA GPU support and Open WebUI.	2026-03-06 06:09:36 -06:00
.gitignore	feat(m8-s3): deploy config — Makefile viewer-generate/viewer-deploy, Caddyfile.snippet docs route, .gitignore docs/viewer/	2026-05-07 19:26:34 -05:00
bench.db	fix: CtxProber/Command - thread model through for llamacpp compose YAML generation; add --model flag; 1 new test	2026-05-08 15:35:46 -05:00
Caddyfile.snippet	fix(m8-s3): align deploy config with actual thundersizzle.tech infrastructure	2026-05-07 20:34:44 -05:00
check-system.sh	Initial commit: Self-hosted LLM using Ollama with NVIDIA GPU support and Open WebUI.	2026-03-06 06:09:36 -06:00
CLAUDE.md	docs(phase4): closeout ceremonies — phase review, lessons migrated, ideas updated, STATUS complete	2026-05-11 12:59:58 -05:00
docker-compose.yaml	feat(m5-s3): IQ4_NL_XL selected as primary agentic model; M5 complete	2026-05-07 07:59:44 -05:00
HARDWARE-CBA.md	Initial commit: Self-hosted LLM using Ollama with NVIDIA GPU support and Open WebUI.	2026-03-06 06:09:36 -06:00
llamacpp.Dockerfile	chore: add sm_52 to CUDA archs, add llamacpp docker-compose service	2026-05-06 09:13:32 -05:00
Makefile	fix(m8-s3): Makefile — use tabs not spaces (make requires tabs)	2026-05-07 20:40:31 -05:00
README.md	Initial commit: Self-hosted LLM using Ollama with NVIDIA GPU support and Open WebUI.	2026-03-06 06:09:36 -06:00
seed.db	test(m9-s1): RED — process/docker/migration M002/server new fields	2026-05-08 05:39:51 -05:00

README.md

Thunder LLM - Self-Hosted LLM Setup

Self-hosted LLM using Ollama with NVIDIA GPU support and Open WebUI.

Architecture

GPU Machine (this machine):

Ollama: LLM inference engine with GPU acceleration

Web Server (thundersizzle.tech):

Open WebUI: Web interface for interacting with LLMs
Caddy: Reverse proxy for HTTPS access

This separation ensures only the GPU-intensive Ollama runs on the GPU machine, while the lightweight web interface runs with your other web services.

Prerequisites

On GPU Machine (this machine):

Docker and Docker Compose installed
NVIDIA GPU drivers installed

NVIDIA Container Toolkit installed

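# For Ubuntu/Debian (the older nvidia-docker repository and apt-key are deprecated):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker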

Verify GPU is accessible in Docker:

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
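
The GPU machine prerequisites above can be checked in one pass before going further. This is a minimal sketch, assuming a POSIX shell; the helper name missing_commands is hypothetical and is not taken from this repository's check-system.sh.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH. Prints nothing when all are present.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example usage on the GPU machine (commented out so that sourcing
# this file has no side effects):
# missing_commands docker nvidia-smi curl
```

An empty result only shows that the binaries exist; the docker run test above is still what confirms that containers can see the GPU.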

On Web Server:

Docker and Docker Compose
Caddy configured (already set up in thundersizzle.tech)
Network connectivity to GPU machine on port 11434

Setup

1. GPU Machine Setup

Find your GPU machine's local IP:

ip addr show | grep "inet " | grep -v 127.0.0.1
# Or: hostname -I
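
On a Docker host the commands above also list bridge addresses such as docker0, which are not the address the web server should use. The sketch below filters the one-line-per-address output of ip; the function name first_lan_ipv4 and the list of interface prefixes to skip are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -4 -o addr show` output on stdin and print
# the first IPv4 address that is neither loopback nor a Docker interface.
first_lan_ipv4() {
  awk '$3 == "inet" && $4 !~ /^127\./ && $2 !~ /^(docker|br-|veth)/ {
         sub(/\/.*/, "", $4)   # drop the /prefix length
         print $4
         exit
       }'
}

# Example usage:
# ip -4 -o addr show | first_lan_ipv4
```

If the machine has several physical interfaces, the first match may not be the one you want, so treat the result as a suggestion to verify.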

Start Ollama:

cd /home/james/Thunder/code/thunder-llm
docker compose up -d

Verify GPU is being used:

docker exec ollama nvidia-smi

Download models:

# Download a smaller model (llama3.2 defaults to the 3B variant)
docker exec -it ollama ollama pull llama3.2

# Or a larger model (70B) if you have enough VRAM (roughly 40 GB or more for the default 4-bit quantization)
docker exec -it ollama ollama pull llama3.1:70b

# Popular alternatives:
docker exec -it ollama ollama pull mistral
docker exec -it ollama ollama pull codellama
docker exec -it ollama ollama pull deepseek-coder

2. Web Server Setup

Configure environment on web server:

Edit /home/james/Thunder/code/thundersizzle.tech/.env and add:

# GPU Machine IP (replace with your actual IP)
GPU_MACHINE_IP=192.168.1.100
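
A typo in this value is an easy way to end up with an Open WebUI that cannot reach Ollama, so it can help to read the value back and sanity-check it. This is a rough sketch, assuming the .env file uses plain KEY=value lines without quotes; env_value and is_ipv4 are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the last value assigned to a key in a .env file.
# $1 = path to the .env file, $2 = variable name
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Hypothetical helper: succeed only if $1 is a dotted-quad IPv4 address
# with every octet in the range 0-255.
is_ipv4() {
  printf '%s\n' "$1" | awk -F. '
    NF == 4 {
      for (i = 1; i <= 4; i++)
        if ($i !~ /^[0-9]+$/ || $i > 255) exit 1
      exit 0
    }
    { exit 1 }'
}

# Example usage on the web server:
# ip=$(env_value /home/james/Thunder/code/thundersizzle.tech/.env GPU_MACHINE_IP)
# is_ipv4 "$ip" && echo "GPU_MACHINE_IP looks valid: $ip"
```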

Add Caddy configuration:

Add the configuration from Caddyfile.snippet to your Caddy config in thundersizzle.tech/caddy/Caddyfile.

Start Open WebUI:

cd /home/james/Thunder/code/thundersizzle.tech
docker compose up -d open-webui
docker compose restart caddy  # Reload Caddy config

Test connectivity:

# From web server, verify you can reach Ollama on GPU machine
curl http://<GPU_MACHINE_IP>:11434/api/tags
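
The /api/tags endpoint returns JSON with a models array, and each entry carries a name field. To see just the model names without installing a JSON tool, a small filter such as the following can be used. It is a sketch based on pattern matching, not a real JSON parser, and model_names is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read an /api/tags response on stdin and print
# one model name per line. Relies on text matching of "name" fields.
model_names() {
  grep -o '"name":[[:space:]]*"[^"]*"' |
    sed 's/^"name":[[:space:]]*"//; s/"$//'
}

# Example usage from the web server:
# curl -fsS "http://${GPU_MACHINE_IP}:11434/api/tags" | model_names
```

An empty list with a successful curl means Ollama is reachable but no models have been pulled yet.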

Usage

Access

Web UI: https://llm.thundersizzle.tech
Ollama API (from web server): http://<GPU_MACHINE_IP>:11434

First-time Setup

Open https://llm.thundersizzle.tech
Create an admin account (first user becomes admin)
Select a model from the dropdown
Start chatting!

Security Notes

Authentication is enabled by default (WEBUI_AUTH=true)
First registered user becomes the admin
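Ollama is published on all interfaces of the GPU machine (0.0.0.0:11434) and has no authentication of its own, so any host that can reach that port can use the API. Keep the machine on a private network and never forward port 11434 from the internet.
Restrict Ollama access to your web server's IP with firewall rules. Ports published by Docker can bypass ufw's rules, so confirm the restriction by running curl from another host: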
```
# On GPU machine
sudo ufw allow from <WEB_SERVER_IP> to any port 11434
```
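
An allow rule only narrows access if everything else is denied, and ufw evaluates rules in the order they were added. The sketch below prints the two rules in that order so they can be reviewed before being applied. The helper name ollama_ufw_rules and the example address are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print ufw commands that allow one source IP to reach
# the Ollama port and deny that port for everyone else.
# $1 = web server IP, $2 = port (defaults to 11434)
ollama_ufw_rules() {
  web_ip=$1
  port=${2:-11434}
  # The allow rule must come first: ufw stops at the first matching rule.
  printf 'ufw allow from %s to any port %s proto tcp\n' "$web_ip" "$port"
  printf 'ufw deny %s/tcp\n' "$port"
}

# Example usage (192.168.1.50 is a placeholder for <WEB_SERVER_IP>):
# ollama_ufw_rules 192.168.1.50
# ollama_ufw_rules 192.168.1.50 | while read -r rule; do sudo $rule; done
```

If ufw's default incoming policy is already deny, the second rule is redundant but harmless.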

Monitoring

GPU Machine:

Check Ollama status:

docker compose ps
docker compose logs ollama

Check GPU usage:

watch -n 1 nvidia-smi
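
For logging or alerting, the full nvidia-smi table is harder to work with than a single number. nvidia-smi can emit selected fields as CSV, and the sketch below turns that into a percentage per GPU. The function name vram_percent is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "used, total" MiB pairs (one GPU per line)
# on stdin and print the VRAM usage of each GPU as a percentage.
vram_percent() {
  awk -F', *' 'NF == 2 && $2 > 0 {
    printf "GPU %d: %d%% VRAM used (%d/%d MiB)\n", NR - 1, ($1 * 100) / $2, $1, $2
  }'
}

# Example usage on the GPU machine:
# nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_percent
```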

Check disk usage:

docker system df
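# /var/lib/docker is readable only by root, so the glob must expand inside a root shell
sudo sh -c 'du -sh /var/lib/docker/volumes/thunder-llm_*'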

Web Server:

Check Open WebUI status:

cd /home/james/Thunder/code/thundersizzle.tech
docker compose logs open-webui

Maintenance

Update images:

GPU Machine:

cd /home/james/Thunder/code/thunder-llm
docker compose pull
docker compose up -d

Web Server:

cd /home/james/Thunder/code/thundersizzle.tech
docker compose pull open-webui
docker compose up -d open-webui

Backup data:

GPU Machine (models):

docker compose down
# Run tar inside a root shell so the volume glob expands with root permissions
sudo sh -c 'tar -czf "llm-backup-$(date +%Y%m%d).tar.gz" /var/lib/docker/volumes/thunder-llm_*'
docker compose up -d

Web Server (user data):

cd /home/james/Thunder/code/thundersizzle.tech
docker compose down open-webui
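# Compose may normalize the project name (for example by dropping the dot), so confirm the volume name first: docker volume ls | grep open_webui
sudo tar -czf open-webui-backup-$(date +%Y%m%d).tar.gz /var/lib/docker/volumes/thundersizzle.tech_open_webui_data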
docker compose up -d open-webui
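
Date-stamped archives accumulate quickly, especially model backups. Because the YYYYMMDD suffix sorts chronologically, old archives can be pruned by name alone. This is a minimal sketch, assuming the file naming used above; prune_backups is a hypothetical helper and it deletes files, so try it on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: keep only the newest N archives named
# <prefix>-YYYYMMDD.tar.gz in a directory and delete the rest.
# $1 = directory, $2 = prefix (e.g. llm-backup), $3 = number to keep
prune_backups() {
  dir=$1
  prefix=$2
  keep=$3
  # Newest first, then skip the first $keep entries and remove the remainder.
  ls -1 "$dir"/"$prefix"-*.tar.gz 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}

# Example usage: keep the three most recent model backups in the current directory
# prune_backups . llm-backup 3
```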

Remove unused models:

docker exec -it ollama ollama list
docker exec -it ollama ollama rm model-name

Troubleshooting

GPU not detected:

Verify NVIDIA drivers: nvidia-smi
Verify Container Toolkit: docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Check Docker daemon has GPU support: docker info | grep -i runtime (nvidia should appear among the runtimes once the toolkit is configured)

Can't access web UI:

Verify Open WebUI is running on web server: docker compose ps open-webui
Check Caddy configuration includes llm.thundersizzle.tech
Verify DNS points to your web server
Test Caddy: docker compose exec caddy caddy validate --config /etc/caddy/Caddyfile

Open WebUI can't connect to Ollama:

Verify Ollama is running on GPU machine: docker ps | grep ollama
Test connection from web server: curl http://<GPU_MACHINE_IP>:11434/api/tags
Check GPU_MACHINE_IP in thundersizzle.tech/.env is correct
Verify firewall allows web server to access GPU machine port 11434
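
When the curl test fails, its exit code already narrows down which of the checks above to look at. The sketch below maps the most common codes to a likely cause. The exit codes are curl's documented ones; the suggested causes are rules of thumb rather than certainties, and explain_curl_exit is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: translate a curl exit code into a likely cause.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: Ollama answered" ;;
    6)  echo "DNS: could not resolve the host name" ;;
    7)  echo "refused: could not connect (Ollama down, wrong IP, or firewall reject)" ;;
    22) echo "HTTP error: the server answered with status 400 or higher" ;;
    28) echo "timeout: packets are probably being dropped (firewall or wrong IP)" ;;
    *)  echo "curl exit code $1: see EXIT CODES in the curl man page" ;;
  esac
}

# Example usage from the web server:
# curl -fsS --max-time 5 "http://${GPU_MACHINE_IP}:11434/api/tags" >/dev/null
# explain_curl_exit $?
```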

Out of memory:

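Use smaller models or a more aggressive quantization
Limit concurrent requests (for example, set OLLAMA_NUM_PARALLEL=1 in the Ollama container's environment)
Check available VRAM: nvidia-smi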
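
A quick way to judge whether a model can fit is to estimate the size of its weights: parameters times bits per weight, divided by 8. The sketch below does that arithmetic. It is a lower bound only, since the KV cache, runtime buffers, and the slightly higher effective bit width of real quantization formats all add to it. The helper name weights_gb is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rough size of model weights in GB.
# $1 = parameters in billions, $2 = bits per weight
# Lower bound only: context (KV cache) and runtime overhead are extra.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Example usage:
# weights_gb 70 4    # 70B model at 4 bits per weight
# weights_gb 3 4     # 3B model at 4 bits per weight
```

Compare the result with the total VRAM reported by nvidia-smi and leave a few GB of headroom for context.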

Slow performance:

Ensure GPU is being used (check nvidia-smi during inference)
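Check model size vs available VRAM: docker exec ollama ollama ps shows whether a loaded model runs fully on the GPU or is split with the CPU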
Monitor CPU/RAM usage with htop