Split config files

2026-02-05 23:27:16 +01:00
parent 3bbd87a3d7
commit 50e725c9e7
7 changed files with 282 additions and 64 deletions
--- a/Services/llamacpp-multi/README.md
+++ b/Services/llamacpp-multi/README.md
@@ -0,0 +1,129 @@
+# LLaMACpp Multi-Instance Setup
+
+A guide to configuring and scaling the number of llama-server instances behind an nginx load balancer.
+
+## Current Structure
+
+- **4 instances** of llama-server (ports 9000-9003)
+- **Nginx** as the load balancer (port 8090)
+- **Supervisor** to manage all processes
+
+## Adding Instances
+
+To increase the number of instances, follow these steps:
+
+### 1. Edit the Containerfile
+
+File: `llamacpp-multi.Containerfile`
+
+Change:
+```dockerfile
+ENV LLAMA_INSTANCES=4
+```
+
+to the desired number of instances (e.g. 6):
+```dockerfile
+ENV LLAMA_INSTANCES=6
+```
+
+### 2. Update the Nginx Configuration
+
+File: `llama-upstream.conf`
+
+Add a `server` entry for each new port to the `upstream llama_backend` block:
+
+```nginx
+upstream llama_backend {
+    least_conn;
+    server 127.0.0.1:9000 max_fails=3 fail_timeout=30s;
+    server 127.0.0.1:9001 max_fails=3 fail_timeout=30s;
+    server 127.0.0.1:9002 max_fails=3 fail_timeout=30s;
+    server 127.0.0.1:9003 max_fails=3 fail_timeout=30s;
+    server 127.0.0.1:9004 max_fails=3 fail_timeout=30s;  # NEW
+    server 127.0.0.1:9005 max_fails=3 fail_timeout=30s;  # NEW
+}
+```
+
+### 3. Update the Exposed Ports in the Containerfile
+
+File: `llamacpp-multi.Containerfile`
+
+Add the new ports:
+```dockerfile
+EXPOSE 8090 9000 9001 9002 9003 9004 9005
+```
+
+### 4. Rebuild the Container
+
+```bash
+cd /home/badstorm/Source/bdi/bdi_podman_serverconf/Services/llamacpp-multi
+podman build -t llamacpp:vulkan-multi-amd64 -f llamacpp-multi.Containerfile .
+```
+
+### 5. Restart the Service
+
+```bash
+systemctl restart llamacpp-multi
+```
+
+## Resource Considerations
+
+Each instance consumes:
+- **~8 GB VRAM** (depends on the model and on `LLAMA_ARG_CTX_SIZE`)
+- **~1-2 CPU cores** (depends on load)
+
+**With an AMD Radeon GPU (RENOIR):**
+- 2 instances: ✅ Stable
+- 4 instances: ⚠️ Works, but monitor memory usage
+- 6+ instances: ❌ Likely to run out of VRAM
+
+Monitor with:
+```bash
+podman stats llamacpp-multi
+```
+
+## Configurable Environment Variables
+
+In the `.container` file you can override:
+
+```ini
+Environment=LLAMA_ARG_PARALLEL=32
+Environment=LLAMA_ARG_THREADS=16
+Environment=LLAMA_ARG_BATCH_SIZE=2048
+Environment=LLAMA_ARG_CTX_SIZE=131072
+Environment=LLAMA_ARG_HF_REPO=unsloth/Qwen3-Coder-Next-GGUF:Q2_K_XL
+Environment=LLAMA_READY_TIMEOUT=600
+```
+
+## Testing
+
+Once the instances are running, test with:
+
+```bash
+curl http://localhost:8090/v1/models
+```
+
+You should see the model listed if all instances are ready.
+
+Load test (concurrent requests):
+```bash
+for i in {1..10}; do
+  curl -X POST http://localhost:8090/api/completion \
+    -H "Content-Type: application/json" \
+    -d '{"prompt": "Once upon a time", "n_predict": 64}' &
+done
+wait
+```
+
+## Troubleshooting
+
+**502 Bad Gateway:**
+```bash
+podman exec llamacpp-multi tail -f /var/log/llama-server-9000.log
+```
+
+**Ready Timeout:**
+Increase `LLAMA_READY_TIMEOUT` if the model takes more than 10 minutes to load.
+
+**Out of Memory:**
+Reduce `LLAMA_ARG_PARALLEL`, `LLAMA_ARG_BATCH_SIZE`, or `LLAMA_ARG_CTX_SIZE`.
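
Steps 2 and 3 of the README above require editing the `upstream llama_backend` server list and the `EXPOSE` line by hand whenever `LLAMA_INSTANCES` changes. Below is a minimal shell sketch that generates both. It assumes the instance ports stay contiguous from 9000 and nginx stays on 8090, as in the README. `gen_upstream` and `gen_expose` are hypothetical helper names, not part of this commit.

```bash
#!/usr/bin/env bash
# Sketch only: gen_upstream and gen_expose are hypothetical helpers,
# not scripts shipped with this commit.

# Print an nginx upstream block with one server line per instance.
gen_upstream() {
  local instances="${1:-4}"     # number of llama-server instances
  local base_port="${2:-9000}"  # first instance port, as in the README
  local i
  echo "upstream llama_backend {"
  echo "    least_conn;"
  for ((i = 0; i < instances; i++)); do
    echo "    server 127.0.0.1:$((base_port + i)) max_fails=3 fail_timeout=30s;"
  done
  echo "}"
}

# Print the matching EXPOSE line (8090 is the nginx port from the README).
gen_expose() {
  local instances="${1:-4}"
  local base_port="${2:-9000}"
  local ports="8090"
  local i
  for ((i = 0; i < instances; i++)); do
    ports+=" $((base_port + i))"
  done
  echo "EXPOSE ${ports}"
}

gen_upstream "${LLAMA_INSTANCES:-4}"
gen_expose "${LLAMA_INSTANCES:-4}"
```

Running it with `LLAMA_INSTANCES=6` prints a six-server upstream block and an `EXPOSE` line covering ports 9000 through 9005, which can be pasted into `llama-upstream.conf` and the Containerfile respectively.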