# LLaMACpp Multi-Instance Setup

Guide to configuring and scaling the number of llama-server instances behind an nginx load balancer.

## Current Layout

- **4 instances** of llama-server (ports 9000-9003)
- **Nginx** as the load balancer (port 8090)
- **Supervisor** to manage all processes

## Adding Instances

To increase the number of instances, follow these steps:

### 1. Edit the Containerfile

File: `llamacpp-multi.Containerfile`

Change:
```dockerfile
ENV LLAMA_INSTANCES=4
```

To the desired number of instances (e.g. 6):
```dockerfile
ENV LLAMA_INSTANCES=6
```

### 2. Update the Nginx Configuration

File: `llama-upstream.conf`

Add a `server` line for each new port to the `upstream llama_backend` block:

```nginx
upstream llama_backend {
    least_conn;
    server 127.0.0.1:9000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:9001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:9002 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:9003 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:9004 max_fails=3 fail_timeout=30s;  # NEW
    server 127.0.0.1:9005 max_fails=3 fail_timeout=30s;  # NEW
}
```
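Because the `server` lines differ only by port, the block can also be generated rather than edited by hand. The following is a minimal sketch, not part of this repository: the function name `gen_upstream` and the default base port argument are hypothetical, and it assumes the ports are contiguous starting at 9000, as in the example above.

```bash
# Print an nginx upstream block for N llama-server instances.
# gen_upstream is a hypothetical helper; the server options mirror
# the hand-written block above.
gen_upstream() {
  local instances=$1
  local base_port=${2:-9000}
  local i=0
  echo "upstream llama_backend {"
  echo "    least_conn;"
  while [ "$i" -lt "$instances" ]; do
    echo "    server 127.0.0.1:$((base_port + i)) max_fails=3 fail_timeout=30s;"
    i=$((i + 1))
  done
  echo "}"
}
```

Running `gen_upstream 6` prints a block with ports 9000 through 9005. Redirecting the output into `llama-upstream.conf` is only safe if that file contains nothing besides the upstream block, which is an assumption to verify first.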

### 3. Update the Exposed Ports in the Containerfile

File: `llamacpp-multi.Containerfile`

Add the new ports:
```dockerfile
EXPOSE 8090 9000 9001 9002 9003 9004 9005
```
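Steps 1 to 3 touch the instance count in more than one place, so it is easy to change one and forget another. As a sanity check before rebuilding, something like the sketch below can compare `LLAMA_INSTANCES` with the number of upstream servers. The helper name `check_instance_count` is hypothetical, and it assumes the `ENV` line and the `server 127.0.0.1:<port>` lines are written exactly as shown in this guide.

```bash
# Compare LLAMA_INSTANCES in the Containerfile with the number of
# "server 127.0.0.1:<port>" lines in the nginx upstream file.
# check_instance_count is a hypothetical helper, not part of the repo.
check_instance_count() {
  local containerfile=$1
  local upstream_conf=$2
  local declared
  local listed
  # Last ENV LLAMA_INSTANCES=<n> line wins, as it would in a build.
  declared=$(sed -n 's/^ENV LLAMA_INSTANCES=\([0-9][0-9]*\).*$/\1/p' "$containerfile" | tail -n 1)
  # grep -c exits non-zero on zero matches but still prints 0.
  listed=$(grep -Ec '^[[:space:]]*server[[:space:]]+127\.0\.0\.1:[0-9]+' "$upstream_conf" || true)
  if [ "$declared" = "$listed" ]; then
    echo "OK: $declared instances declared, $listed upstream servers"
  else
    echo "MISMATCH: LLAMA_INSTANCES=$declared but $listed upstream servers" >&2
    return 1
  fi
}
```

For example, `check_instance_count llamacpp-multi.Containerfile llama-upstream.conf` returns a non-zero status when the two files disagree. It does not check the `EXPOSE` line.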

### 4. Rebuild the Container

```bash
cd /home/badstorm/Source/bdi/bdi_podman_serverconf/Services/llamacpp-multi
podman build -t llamacpp:vulkan-multi-amd64 -f llamacpp-multi.Containerfile .
```

### 5. Restart the Service

```bash
systemctl restart llamacpp-multi
```

## Resource Considerations

Each instance consumes:
- **~8 GB VRAM** (depends on the model and on `LLAMA_ARG_CTX_SIZE`)
- **~1-2 CPU cores** (depends on load)

**With an AMD Radeon GPU (RENOIR):**
- 2 instances: ✅ Stable
- 4 instances: ⚠️ Works, but monitor memory
- 6+ instances: ❌ Likely to run out of VRAM

Monitor with:
```bash
podman stats llamacpp-multi
```
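The per-instance figure above gives a quick way to estimate whether a planned instance count fits before rebuilding. The sketch below is only back-of-the-envelope arithmetic: `vram_fits` is a hypothetical helper, the 8 GB default is the rough figure quoted in this section, and the available memory is a number you supply yourself.

```bash
# Rough VRAM budget check: instances * per-instance GB vs. available GB.
# vram_fits is a hypothetical helper; 8 GB is the approximate figure
# from this README, so pass your own measurement as the third argument.
vram_fits() {
  local instances=$1
  local available_gb=$2
  local per_instance_gb=${3:-8}
  local needed_gb=$((instances * per_instance_gb))
  echo "need ~${needed_gb} GB, have ${available_gb} GB"
  # Exit status 0 when the estimate fits, 1 otherwise.
  [ "$needed_gb" -le "$available_gb" ]
}
```

For instance, `vram_fits 6 32` prints `need ~48 GB, have 32 GB` and returns a failure status. Actual usage depends on the model and context size, so treat the result as a first filter and confirm with `podman stats`.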

## Configurable Environment Variables

In the `.container` file you can override:

```ini
Environment=LLAMA_ARG_PARALLEL=32
Environment=LLAMA_ARG_THREADS=16
Environment=LLAMA_ARG_BATCH_SIZE=2048
Environment=LLAMA_ARG_CTX_SIZE=131072
Environment=LLAMA_ARG_HF_REPO=unsloth/Qwen3-Coder-Next-GGUF:Q2_K_XL
Environment=LLAMA_READY_TIMEOUT=600
```
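`LLAMA_ARG_CTX_SIZE` and `LLAMA_ARG_PARALLEL` interact. The sketch below assumes that llama-server divides the total context evenly across its parallel slots. This has been the behaviour in many llama.cpp versions but may not hold for every build (for example with a unified KV cache), so verify it against the server's startup log. The helper name `ctx_per_slot` is hypothetical.

```bash
# Tokens of context available to each request slot, ASSUMING the
# server splits the total context evenly across parallel slots.
# ctx_per_slot is a hypothetical helper for quick arithmetic.
ctx_per_slot() {
  local ctx_size=$1
  local parallel=$2
  echo $((ctx_size / parallel))
}
```

With the values listed above, `ctx_per_slot 131072 32` gives 4096 tokens per slot under that assumption. This is worth keeping in mind when lowering either variable to save memory.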

## Testing

Once the instances are running, test with:

```bash
curl http://localhost:8090/v1/models
```

You should see the model listed if all instances are ready.
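A request through nginx reaches only one backend, so it cannot confirm that every instance is ready. The sketch below probes each instance directly through llama-server's `/health` endpoint. The helper name `check_backends` is hypothetical, and it assumes the backend ports (9000 and up) are reachable from where you run it, which depends on how the container publishes them.

```bash
# Probe the /health endpoint of each llama-server instance.
# check_backends is a hypothetical helper; it assumes contiguous ports
# starting at 9000 that are reachable from the current host.
check_backends() {
  local instances=$1
  local base_port=${2:-9000}
  local i=0
  local failed=0
  local port
  while [ "$i" -lt "$instances" ]; do
    port=$((base_port + i))
    # -f makes curl fail on HTTP errors such as 503 while loading.
    if curl -sf -o /dev/null --max-time 5 "http://127.0.0.1:${port}/health"; then
      echo "${port} ready"
    else
      echo "${port} NOT ready"
      failed=1
    fi
    i=$((i + 1))
  done
  return "$failed"
}
```

Calling `check_backends 4` prints one line per port and returns a non-zero status if any instance fails the probe.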

Load test (concurrent requests):
```bash
for i in {1..10}; do
  curl -X POST http://localhost:8090/api/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Once upon a time", "n_predict": 64}' &
done
wait
```

## Troubleshooting

**502 Bad Gateway:**
```bash
podman exec llamacpp-multi tail -f /var/log/llama-server-9000.log
```

**Ready Timeout:**
Increase `LLAMA_READY_TIMEOUT` if the model takes longer than 10 minutes to load.

**Out of Memory:**
Reduce `LLAMA_ARG_PARALLEL`, `LLAMA_ARG_BATCH_SIZE`, or `LLAMA_ARG_CTX_SIZE`.