
LLMFlux Supported Models

This document lists all models supported by LLMFlux, along with their hardware requirements and configuration details.

Listing Available Models

To see all available models at any time, run:

llmflux show-models

This will print every model key along with which engines (ollama, vllm, or both) it supports.

Model Naming Convention

LLMFlux identifies models by a model key. This is the name you pass to llmflux commands and the name shown by llmflux show-models. Model keys are case-sensitive and usually correspond to the HuggingFace repository name without the organization prefix.

In models.yaml, each entry typically has:

- name: the model key
- hf_name: the HuggingFace repository (or local path) the engine loads
- resources: GPU memory, batch size, and concurrency settings
- parameters: default sampling settings such as temperature and max_tokens

Some example model keys are Llama-3.2-3B-Instruct and Llama-3.3-70B-Instruct.

In addition, if a matching model on HuggingFace was identified and the engine choice is vLLM, the entry also includes an HF name: the actual model that vLLM will attempt to use.
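As a concrete illustration of the convention, a model key can usually be derived by dropping the organization prefix from the HuggingFace repository name. The helper below is hypothetical (not part of LLMFlux) and only sketches the usual pattern:

```python
def model_key_from_repo(repo_name: str) -> str:
    """Derive a model key from a HuggingFace repository name by dropping
    the organization prefix, e.g.
    'meta-llama/Llama-3.2-3B-Instruct' -> 'Llama-3.2-3B-Instruct'."""
    return repo_name.split("/", 1)[-1]

# Model keys are case-sensitive, so the result is used verbatim.
print(model_key_from_repo("meta-llama/Llama-3.2-3B-Instruct"))
```

Since this is only the usual pattern, llmflux show-models remains the authoritative list of valid keys.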

Supported Models

Llama 3.2

Advanced general-purpose model from Meta.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 1b (base) | 2GB | Any CUDA GPU | Base model without instruction tuning |
| 1b | 8GB | Any CUDA GPU | Lightweight, good for basic tasks |
| 3b | 16GB | A40/A100 | Best balance of performance/resource usage |

Llama 3.2 Vision

Vision-capable variant of Llama 3.2.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 11b | 24GB | A100 | Vision capabilities require more memory |
| 90b | 40GB | A100 80GB | Handles complex images and reasoning |

Llama 3.3

Latest generation of Llama optimized for reasoning.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 70b | 80GB | A100 80GB | State-of-the-art reasoning capabilities |

Gemma 3

Google’s efficient and high-quality models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 1b | 2GB | Any CUDA GPU | Extremely efficient, good for basic tasks |
| 4b | 8GB | Any CUDA GPU | Good performance/resource balance |
| 12b | 16GB | A40/A100 | High quality mid-range option |
| 27b | 24GB | A100 | High performance, vision-capable |

Qwen 2.5

Production-quality models from Alibaba.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 0.5b | 2GB | Any CUDA GPU | Ultra-lightweight model |
| 1.5b | 2GB | Any CUDA GPU | Lightweight model |
| 3b | 8GB | Any CUDA GPU | Good for basic tasks |
| 7b | 16GB | A40/A100 | Default setup, good general model |
| 14b | 16GB | A40/A100 | Mid-range option |
| 32b | 24GB | A100 | High performance |
| 72b | 40GB | A100 80GB | High performance, high resource usage |

Phi 3

Microsoft’s efficient models with strong reasoning capabilities.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| mini | 8GB | Any CUDA GPU | Extremely efficient 3.8B model |
| small | 12GB | Any CUDA GPU | 7B parameters, good performance |
| medium | 24GB | A100 | 14B parameters, balanced option |
| vision | 32GB | A100 | Vision-capable 14B parameter model |

Mistral Models

Family of high-quality open source models.

| Model | Size | Min GPU Memory | Notes |
|-------|------|----------------|-------|
| Mistral | 7b | 16GB | Original Mistral model |
| Mistral-Small | 22b | 16GB | Optimized for inference speed |
| Mistral-Small | 24b | 16GB | Latest small model |
| Mistral-Large | 123b | 80GB | Large capacity model |
| Mistral-Lite | 7b | 16GB | Small footprint model |
| Mistral-NeMo | 12b | 16GB | NVIDIA optimized model |
| Mistral-OpenOrca | 7b | 16GB | Research tuned version |

Mixtral

Mixture-of-experts models with strong performance.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 8x7b | 24GB | A100 | Original MoE model |
| 8x22b | 48GB | A100 80GB | Higher parameter version |

Gemma 2

Google’s second generation efficient models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 2b | 2GB | Any CUDA GPU | Lightweight model |
| 9b | 24GB | A100 | Mid-range option |
| 27b | 24GB | A100 | High performance |

Llama 2

Meta’s second generation LLM.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 7b | 16GB | A40/A100 | Entry-level Llama 2 |
| 13b | 16GB | A40/A100 | Good general use |
| 70b | 40GB | A100 80GB | High performance |

Llama 3

Meta’s third generation LLM.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 8b | 16GB | A40/A100 | Good general-purpose model |
| 70b | 40GB | A100 80GB | High performance |

Llama 3.1

Meta’s Llama 3.1 family with extended context.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 8b | 16GB | A40/A100 | Good general-purpose model |
| 70b | 40GB | A100 80GB | High performance |
| 405b | 80GB | Multi-GPU A100 | Largest open model |

Llama 4

Meta’s latest generation model.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 17b-128e | 24GB | A100 | Maverick mixture-of-experts |

Qwen 2.5 Coder

Alibaba’s code-specialized models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 3b | 8GB | Any CUDA GPU | Lightweight coder |
| 7b | 16GB | A40/A100 | Good code generation |

Qwen 2.5 Math

Alibaba’s math-specialized models (vLLM only).

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 1.5b | 2GB | Any CUDA GPU | Lightweight math model |
| 7b | 8GB | Any CUDA GPU | Good math reasoning |
| 72b | 40GB | A100 80GB | High performance math |
| 72b | 40GB | A100 80GB | Reward model |

Qwen 2.5 Vision

Alibaba’s vision-language model (vLLM only).

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 7b | 8GB | Any CUDA GPU | Vision-language model |

Qwen 3

Alibaba’s latest generation models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 0.6b | 2GB | Any CUDA GPU | Ultra-lightweight |
| 8b | 24GB | A100 | Good general model |
| 14b | 16GB | A40/A100 | Mid-range option |
| 32b | 24GB | A100 | High performance |

QwQ

Alibaba’s reasoning model (vLLM only).

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 32b | 16GB | A40/A100 | Strong reasoning capabilities |

DeepSeek R1

DeepSeek’s reasoning-focused distilled models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 1.5b (Qwen) | 2GB | Any CUDA GPU | Ultra-lightweight reasoning |
| 7b (Qwen) | 16GB | A40/A100 | Good reasoning model |
| 8b (Llama) | 16GB | A40/A100 | Llama-based distillation |
| 14b (Qwen) | 16GB | A40/A100 | Mid-range reasoning |
| 32b (Qwen) | 24GB | A100 | High quality reasoning |
| 70b (Llama) | 40GB | A100 80GB | Highest quality distillation |

DeepSeek Vision

DeepSeek’s vision-language models (vLLM only).

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| vl2-small | 8GB | Any CUDA GPU | Smaller vision model |
| vl2 | 24GB | A100 | Full vision-language model |

CodeLlama

Meta’s code-specialized Llama models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 7b | 16GB | A40/A100 | Lightweight code model |
| 13b | 16GB | A40/A100 | Good code generation |
| 34b | 24GB | A100 | Strong code capabilities |
| 70b | 40GB | A100 80GB | Highest quality code model |

LLaVA

Vision-language models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 7b (v1.6-mistral) | 16GB | A40/A100 | Mistral-based vision |
| 13b (v1.5) | 16GB | A40/A100 | Original LLaVA |
| 34b (v1.6) | 24GB | A100 | Large vision model |

InternVL 2.5

OpenGVLab’s vision-language models (vLLM only).

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 8b | 16GB | A40/A100 | Lightweight vision model |
| 26b | 24GB | A100 | Mid-range vision model |
| 38b | 24GB | A100 | Large vision model |

Phi 3.5

Microsoft’s updated Phi model (vLLM only).

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| vision | 8GB | Any CUDA GPU | Vision-capable model |

Cohere Command R

Cohere’s instruction-following models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 35b (command-r) | 24GB | A100 | Strong instruction following |
| 104b (command-r-plus) | 80GB | A100 80GB | Highest quality |
| 32b (aya-expanse) | 24GB | A100 | Multilingual model |

MedGemma

Google’s medical domain models (vLLM only).

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 4b | 8GB | Any CUDA GPU | Lightweight medical model |
| 27b | 24GB | A100 | Full medical model |

Kimi

Moonshot’s large-scale models.

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| K2 (1T) | 24GB | A100 | Trillion-parameter cloud model |
| K2.5 | 80GB | A100 80GB | Latest generation |

Pixtral

Mistral AI’s vision model (vLLM only).

| Size | Min GPU Memory | Recommended | Notes |
|------|----------------|-------------|-------|
| 12b | 16GB | A40/A100 | Vision-capable Mistral model |

Other Models

| Model | Min GPU Memory | Recommended | Notes |
|-------|----------------|-------------|-------|
| GLM-4-9B | 16GB | A40/A100 | ZAI chatbot model |
| GPT-OSS-120B | 80GB | A100 80GB | OpenAI open-source model |
| Molmo-7B (vLLM only) | 16GB | A40/A100 | Allen AI vision model |
| all-MiniLM-L6-v2 | 2GB | Any CUDA GPU | Sentence embedding model |
| bge-base-en-v1.5 (vLLM only) | 2GB | Any CUDA GPU | Embedding model |
| whisper-large-v3 (vLLM only) | 24GB | A100 | Speech-to-text model |

GPU Memory Requirements

Each model specifies a minimum GPU memory requirement based on practical use cases; for best performance, use the recommended GPU type listed for each size.

Batch Size and Memory Trade-offs

For each model, you can adjust the batch size based on your memory constraints. Higher batch sizes require more memory but enable much faster throughput, while lower batch sizes use less memory but process requests more slowly.

Memory Requirement Calculation

As a rule of thumb, memory usage grows roughly linearly with batch size: each increase of 1 in batch size requires a roughly fixed amount of additional GPU memory, with the exact increment depending on the model size and context length.
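One reason for this roughly linear growth is the per-request KV cache. The sketch below is a back-of-the-envelope estimate with illustrative figures; the layer and head counts are assumptions for a small Llama-class model, not LLMFlux's internal accounting:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size: two tensors (K and V) per layer,
    each of shape [batch, kv_heads, seq_len, head_dim], fp16 by default."""
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / 1024 ** 3

# Assumed figures: 28 layers, 8 KV heads, head_dim 128, 4096-token context.
for bs in (1, 2, 4):
    print(f"batch_size={bs}: ~{kv_cache_gb(28, 8, 128, 4096, bs):.2f} GB")
```

Each extra unit of batch size adds the same increment, which is why lowering the batch size is an effective way to fit a model onto a smaller GPU.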

Example: Limited Memory Configuration

When using a smaller GPU or a large model:

# Using code arguments
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

config = Config()
runner = SlurmRunner(config=config.get_slurm_config())
runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="Llama-3.2-3B-Instruct",
    batch_size=2,  # Reduced from the default of 4 to use less memory
)

# Or using environment variables
# In .env file:
# MODEL_NAME=Llama-3.2-3B-Instruct
# BATCH_SIZE=2

Example: High Memory Configuration

When using a high-end GPU like A100 (80GB), you can significantly increase batch size for better throughput:

# Using code arguments
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

# First, configure SLURM to use the right resources
config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"  # A100 partition
slurm_config.mem = "80G"         # Request full node memory
slurm_config.gpus_per_node = 1   # 1 GPU (A100 80GB)

# Then run with a high batch size
runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="Llama-3.2-3B-Instruct",
    batch_size=16,  # 4x the default for much higher throughput
)

# Or using environment variables
# In .env file:
# MODEL_NAME=Llama-3.2-3B-Instruct
# BATCH_SIZE=16
# SLURM_MEM=80G
# SLURM_PARTITION=a100

Batch Size Recommendations

Model Size GPU Type Recommended Batch Size SLURM Memory Setting
3-7B A100 (40/80GB) 16-24 40G-80G
3-7B A40 (48GB) 12-16 48G
3-7B Mid-range (24GB) 6-8 24G
7-20B A100 (80GB) 8-12 80G
7-20B A100 (40GB) 4-6 40G
20-40B A100 (80GB) 4-6 80G
40-70B A100 (80GB) 2-3 80G
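For scripting, the recommendations above can be transcribed into a simple lookup. The helper below is hypothetical; it mirrors the table rather than anything LLMFlux ships:

```python
# (model-size bucket, GPU type) -> (low, high) recommended batch size,
# transcribed from the table above.
RECOMMENDED_BATCH = {
    ("3-7B", "A100"): (16, 24),
    ("3-7B", "A40"): (12, 16),
    ("3-7B", "24GB"): (6, 8),
    ("7-20B", "A100-80GB"): (8, 12),
    ("7-20B", "A100-40GB"): (4, 6),
    ("20-40B", "A100-80GB"): (4, 6),
    ("40-70B", "A100-80GB"): (2, 3),
}

def recommended_batch_size(model_size: str, gpu: str,
                           conservative: bool = True) -> int:
    """Pick the low end of the range when memory is tight,
    the high end when chasing throughput."""
    low, high = RECOMMENDED_BATCH[(model_size, gpu)]
    return low if conservative else high

print(recommended_batch_size("40-70B", "A100-80GB"))              # 2
print(recommended_batch_size("3-7B", "A40", conservative=False))  # 16
```

Starting at the low end of a range and increasing until memory errors appear is a reasonable way to find the sweet spot for a given GPU.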

Large Model Configuration on Single GPU

When running large models (70B+) on a single A100 (80GB) GPU:

# Using code arguments
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"   # A100 partition
slurm_config.mem = "80G"          # Request full node memory
slurm_config.gpus_per_node = 1    # 1 A100 80GB GPU
slurm_config.cpus_per_task = 16   # Increase CPU cores for preprocessing

runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="Llama-3.3-70B-Instruct",
    batch_size=2,  # Even a batch size of 2 is significant for 70B models
)

# Or using environment variables
# In .env file:
# MODEL_NAME=Llama-3.3-70B-Instruct
# BATCH_SIZE=2
# SLURM_MEM=80G
# SLURM_CPUS_PER_TASK=16
# SLURM_PARTITION=a100

Multi-GPU Configuration for Increased Throughput

To run large models at higher throughput, you can leverage multiple GPUs:

# Multi-GPU setup for maximum performance
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"    # A100 partition
slurm_config.mem = "160G"          # Request memory for multiple GPUs
slurm_config.gpus_per_node = 2     # Request 2 A100 GPUs
slurm_config.cpus_per_task = 32    # Increase CPU cores for multi-GPU processing

runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.jsonl",
    model="Llama-3.3-70B-Instruct",
    batch_size=4,  # Higher batch size possible with multiple GPUs
)

# Or using environment variables
# In .env file:
# MODEL_NAME=Llama-3.3-70B-Instruct
# BATCH_SIZE=4
# SLURM_MEM=160G
# SLURM_CPUS_PER_TASK=32
# SLURM_PARTITION=a100
# SLURM_GPUS_PER_NODE=2

Custom Model Configuration

Custom model config files are supported for the vLLM engine only. This is useful when you want to keep the runtime settings in a custom models.yaml and point hf_name at a local fine-tuned model directory on the cluster.

# custom-models.yaml
models:
  my-custom-qwen:
    name: my-custom-qwen
    hf_name: /custom_qwen/output_dir
    resources:
      gpu_layers: 24
      gpu_memory: "16GB"
      batch_size: 4
      max_concurrent: 1
    parameters:
      temperature: 0.7
      top_p: 0.9
      max_tokens: 2048
      stop_sequences: []

Python example

from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

config = Config()
slurm_config = config.get_slurm_config()

runner = SlurmRunner(config=slurm_config, workspace=".")
job_id = runner.run(
    input_path="data/input/prompts.jsonl",
    output_path="data/output/results.json",
    model="my-custom-qwen",
    custom_config_path="custom-models.yaml",
    batch_size=4,
)

print(job_id)

CLI example

llmflux run \
  --engine vllm \
  --model my-custom-qwen \
  --custom-config-path custom-models.yaml \
  --input data/input/prompts.jsonl \
  --output data/output/results.json
