LLMFlux Supported Models
This document lists all models supported by LLMFlux, along with their hardware requirements and configuration details.
Listing Available Models
To see all available models at any time, run:
```shell
llmflux show-models
```
This will print every model key along with which engines (ollama, vllm, or both) it supports.
Model Naming Convention
LLMFlux identifies models by a model key. This is the name you pass to llmflux commands and the name shown by llmflux show-models. Model keys are case-sensitive and usually correspond to the HuggingFace repository name without the organization prefix.
In models.yaml, each entry typically has:
- a model key (e.g. `Llama-3.2-3B-Instruct`)
- an associated `hf_name` (e.g. `meta-llama/Llama-3.2-3B-Instruct`), which is the full HuggingFace repo name
- an associated `name` (e.g. `llama3.2:3b`), which is the ollama name
Some example model keys are:
- `Llama-3.2-3B-Instruct` - Llama 3.2 model with 3 billion parameters
- `gemma-3-27b-it` - Gemma 3 model with 27 billion parameters
- `Qwen2.5-32B-Instruct` - Qwen 2.5 model with 32 billion parameters
In addition, if a matching model on HuggingFace was identified and the chosen engine is vLLM, the model also includes an `hf_name`. This is the actual model that vLLM will attempt to load.
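The mapping above can be sketched as a simple lookup. The `MODELS` dict and `resolve_model` helper below are illustrative assumptions, not part of the LLMFlux API; the real entries live in models.yaml:

```python
# Illustrative sketch of how a model key resolves to engine-specific names.
# MODELS and resolve_model are hypothetical; real entries live in models.yaml.
MODELS = {
    "Llama-3.2-3B-Instruct": {
        "hf_name": "meta-llama/Llama-3.2-3B-Instruct",  # used by vLLM
        "name": "llama3.2:3b",                          # used by ollama
    },
}

def resolve_model(key: str, engine: str) -> str:
    """Return the engine-specific identifier for a model key."""
    entry = MODELS[key]  # model keys are case-sensitive
    return entry["hf_name"] if engine == "vllm" else entry["name"]
```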
Supported Models
Llama 3.2
Advanced general-purpose model from Meta.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 1b (base) | 2GB | Any CUDA GPU | Base model without instruction tuning |
| 1b | 8GB | Any CUDA GPU | Lightweight, good for basic tasks |
| 3b | 16GB | A40/A100 | Best balance of performance/resource usage |
Llama 3.2 Vision
Vision-capable variant of Llama 3.2.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 11b | 24GB | A100 | Vision capabilities require more memory |
| 90b | 40GB | A100 80GB | Handles complex images and reasoning |
Llama 3.3
Llama generation optimized for reasoning.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 70b | 80GB | A100 80GB | State-of-the-art reasoning capabilities |
Gemma 3
Google’s efficient and high-quality models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 1b | 2GB | Any CUDA GPU | Extremely efficient, good for basic tasks |
| 4b | 8GB | Any CUDA GPU | Good performance/resource balance |
| 12b | 16GB | A40/A100 | High quality mid-range option |
| 27b | 24GB | A100 | High performance, vision-capable |
Qwen 2.5
Production-quality models from Alibaba.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 0.5b | 2GB | Any CUDA GPU | Ultra-lightweight model |
| 1.5b | 2GB | Any CUDA GPU | Lightweight model |
| 3b | 8GB | Any CUDA GPU | Good for basic tasks |
| 7b | 16GB | A40/A100 | Default setup, good general model |
| 14b | 16GB | A40/A100 | Mid-range option |
| 32b | 24GB | A100 | High performance |
| 72b | 40GB | A100 80GB | High performance, high resource usage |
Phi 3
Microsoft’s efficient models with strong reasoning capabilities.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| mini | 8GB | Any CUDA GPU | Extremely efficient 3.8B model |
| small | 12GB | Any CUDA GPU | 7B parameters, good performance |
| medium | 24GB | A100 | 14B parameters, balanced option |
| vision | 32GB | A100 | Vision-capable 14B parameter model |
Mistral Models
Family of high-quality open source models.
| Model | Size | Min GPU Memory | Notes |
|---|---|---|---|
| Mistral | 7b | 16GB | Original Mistral model |
| Mistral-Small | 22b | 16GB | Optimized for inference speed |
| Mistral-Small | 24b | 16GB | Latest small model |
| Mistral-Large | 123b | 80GB | Large capacity model |
| Mistral-Lite | 7b | 16GB | Small footprint model |
| Mistral-NeMo | 12b | 16GB | NVIDIA optimized model |
| Mistral-OpenOrca | 7b | 16GB | Research tuned version |
Mixtral
Mixture-of-experts models with strong performance.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 8x7b | 24GB | A100 | Original MoE model |
| 8x22b | 48GB | A100 80GB | Higher parameter version |
Gemma 2
Google’s second generation efficient models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 2b | 2GB | Any CUDA GPU | Lightweight model |
| 9b | 24GB | A100 | Mid-range option |
| 27b | 24GB | A100 | High performance |
Llama 2
Meta’s second generation LLM.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 7b | 16GB | A40/A100 | Entry-level Llama 2 |
| 13b | 16GB | A40/A100 | Good general use |
| 70b | 40GB | A100 80GB | High performance |
Llama 3
Meta’s third generation LLM.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 8b | 16GB | A40/A100 | Good general-purpose model |
| 70b | 40GB | A100 80GB | High performance |
Llama 3.1
Meta’s Llama 3.1 family with extended context.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 8b | 16GB | A40/A100 | Good general-purpose model |
| 70b | 40GB | A100 80GB | High performance |
| 405b | 80GB | Multi-GPU A100 | Largest open model |
Llama 4
Meta’s latest generation model.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 17b-128e | 24GB | A100 | Maverick mixture-of-experts |
Qwen 2.5 Coder
Alibaba’s code-specialized models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 3b | 8GB | Any CUDA GPU | Lightweight coder |
| 7b | 16GB | A40/A100 | Good code generation |
Qwen 2.5 Math
Alibaba’s math-specialized models (vLLM only).
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 1.5b | 2GB | Any CUDA GPU | Lightweight math model |
| 7b | 8GB | Any CUDA GPU | Good math reasoning |
| 72b | 40GB | A100 80GB | High performance math |
| 72b | 40GB | A100 80GB | Reward model |
Qwen 2.5 Vision
Alibaba’s vision-language model (vLLM only).
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 7b | 8GB | Any CUDA GPU | Vision-language model |
Qwen 3
Alibaba’s latest generation models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 0.6b | 2GB | Any CUDA GPU | Ultra-lightweight |
| 8b | 24GB | A100 | Good general model |
| 14b | 16GB | A40/A100 | Mid-range option |
| 32b | 24GB | A100 | High performance |
QwQ
Alibaba’s reasoning model (vLLM only).
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 32b | 16GB | A40/A100 | Strong reasoning capabilities |
DeepSeek R1
DeepSeek’s reasoning-focused distilled models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 1.5b (Qwen) | 2GB | Any CUDA GPU | Ultra-lightweight reasoning |
| 7b (Qwen) | 16GB | A40/A100 | Good reasoning model |
| 8b (Llama) | 16GB | A40/A100 | Llama-based distillation |
| 14b (Qwen) | 16GB | A40/A100 | Mid-range reasoning |
| 32b (Qwen) | 24GB | A100 | High quality reasoning |
| 70b (Llama) | 40GB | A100 80GB | Highest quality distillation |
DeepSeek Vision
DeepSeek’s vision-language models (vLLM only).
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| vl2-small | 8GB | Any CUDA GPU | Smaller vision model |
| vl2 | 24GB | A100 | Full vision-language model |
CodeLlama
Meta’s code-specialized Llama models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 7b | 16GB | A40/A100 | Lightweight code model |
| 13b | 16GB | A40/A100 | Good code generation |
| 34b | 24GB | A100 | Strong code capabilities |
| 70b | 40GB | A100 80GB | Highest quality code model |
LLaVA
Vision-language models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 7b (v1.6-mistral) | 16GB | A40/A100 | Mistral-based vision |
| 13b (v1.5) | 16GB | A40/A100 | Original LLaVA |
| 34b (v1.6) | 24GB | A100 | Large vision model |
InternVL 2.5
OpenGVLab’s vision-language models (vLLM only).
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 8b | 16GB | A40/A100 | Lightweight vision model |
| 26b | 24GB | A100 | Mid-range vision model |
| 38b | 24GB | A100 | Large vision model |
Phi 3.5
Microsoft’s updated Phi model (vLLM only).
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| vision | 8GB | Any CUDA GPU | Vision-capable model |
Cohere Command R
Cohere’s instruction-following models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 35b (command-r) | 24GB | A100 | Strong instruction following |
| 104b (command-r-plus) | 80GB | A100 80GB | Highest quality |
| 32b (aya-expanse) | 24GB | A100 | Multilingual model |
MedGemma
Google’s medical domain models (vLLM only).
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 4b | 8GB | Any CUDA GPU | Lightweight medical model |
| 27b | 24GB | A100 | Full medical model |
Kimi
Moonshot’s large-scale models.
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| K2 (1T) | 24GB | A100 | Trillion-parameter cloud model |
| K2.5 | 80GB | A100 80GB | Latest generation |
Pixtral
Mistral AI’s vision model (vLLM only).
| Size | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| 12b | 16GB | A40/A100 | Vision-capable Mistral model |
Other Models
| Model | Min GPU Memory | Recommended | Notes |
|---|---|---|---|
| GLM-4-9B | 16GB | A40/A100 | ZAI chatbot model |
| GPT-OSS-120B | 80GB | A100 80GB | OpenAI open-source model |
| Molmo-7B (vLLM only) | 16GB | A40/A100 | Allen AI vision model |
| all-MiniLM-L6-v2 | 2GB | Any CUDA GPU | Sentence embedding model |
| bge-base-en-v1.5 (vLLM only) | 2GB | Any CUDA GPU | Embedding model |
| whisper-large-v3 (vLLM only) | 24GB | A100 | Speech-to-text model |
GPU Memory Requirements
Each model specifies a minimum GPU memory requirement based on practical use cases. For optimal performance:
- A100 (80GB) can run all models, including the largest 70B+ models
- A100 (40GB) can run models up to approximately 40B parameters
- A40 (48GB) can run most models except the largest 70B+ models
- Mid-range GPUs (24GB) can run models up to approximately 27B parameters
- Consumer GPUs (8-16GB) can run smaller models like Phi-3 Mini and Mistral-Lite
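The guidance above can be expressed as a threshold lookup. The `largest_tier` helper and its threshold table are hypothetical and only mirror (and slightly simplify) the bullets in this section:

```python
# Hypothetical helper mirroring the GPU memory guidance above.
# Thresholds are GB of GPU memory; descriptions paraphrase the bullets.
TIERS = [
    (80, "all models, including the largest 70B+ models"),
    (40, "models up to approximately 40B parameters"),
    (24, "models up to approximately 27B parameters"),
    (8, "smaller models like Phi-3 Mini and Mistral-Lite"),
]

def largest_tier(gpu_memory_gb: int) -> str:
    """Return the largest model class that fits in the given GPU memory."""
    for threshold, description in TIERS:
        if gpu_memory_gb >= threshold:
            return description
    return "no listed models fit"
```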
Batch Size and Memory Trade-offs
For each model, you can adjust the batch size based on your memory constraints. Higher batch sizes require more memory but enable much faster throughput, while lower batch sizes use less memory but process requests more slowly.
Memory Requirement Calculation
As a rule of thumb, for every increase of 1 in batch size, you’ll need approximately:
- Small models (1-7B): +3-4GB memory
- Medium models (7-20B): +4-8GB memory
- Large models (20B+): +8-16GB memory
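The rule of thumb can be turned into a back-of-envelope estimator. The per-step figures below are the midpoints of the ranges quoted above; treat this as an illustration, not a measured model:

```python
# Back-of-envelope estimator for the rule of thumb above: each +1 in batch
# size adds roughly per_step GB of memory. The per-step figures are the
# midpoints of the quoted ranges and are assumptions, not measurements.
def extra_memory_gb(model_params_b: float, batch_size: int) -> float:
    """Approximate extra GPU memory (GB) from raising batch size above 1."""
    if model_params_b <= 7:
        per_step = 3.5   # small models: +3-4GB per batch-size step
    elif model_params_b <= 20:
        per_step = 6.0   # medium models: +4-8GB per batch-size step
    else:
        per_step = 12.0  # large models: +8-16GB per batch-size step
    return per_step * (batch_size - 1)
```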
Example: Limited Memory Configuration
When using a smaller GPU or a large model:
```python
# Using code arguments
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

config = Config()
runner = SlurmRunner(config=config.get_slurm_config())
runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="Llama-3.2-3B-Instruct",
    batch_size=2,  # Reduced from the default of 4 to use less memory
)

# Or using environment variables
# In .env file:
# MODEL_NAME=Llama-3.2-3B-Instruct
# BATCH_SIZE=2
```
Example: High Memory Configuration
When using a high-end GPU like A100 (80GB), you can significantly increase batch size for better throughput:
```python
# Using code arguments
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

# First, configure SLURM to use the right resources
config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"   # A100 partition
slurm_config.mem = "80G"          # Request full node memory
slurm_config.gpus_per_node = 1    # 1 GPU (A100 80GB)

# Then run with a high batch size
runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="Llama-3.2-3B-Instruct",
    batch_size=16,  # 4x the default for much higher throughput
)

# Or using environment variables
# In .env file:
# MODEL_NAME=Llama-3.2-3B-Instruct
# BATCH_SIZE=16
# SLURM_MEM=80G
# SLURM_PARTITION=a100
```
Batch Size Recommendations
| Model Size | GPU Type | Recommended Batch Size | SLURM Memory Setting |
|---|---|---|---|
| 3-7B | A100 (40/80GB) | 16-24 | 40G-80G |
| 3-7B | A40 (48GB) | 12-16 | 48G |
| 3-7B | Mid-range (24GB) | 6-8 | 24G |
| 7-20B | A100 (80GB) | 8-12 | 80G |
| 7-20B | A100 (40GB) | 4-6 | 40G |
| 20-40B | A100 (80GB) | 4-6 | 80G |
| 40-70B | A100 (80GB) | 2-3 | 80G |
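The table lends itself to a direct lookup. The `RECOMMENDED` dict below transcribes the rows above as (batch-size range, SLURM memory) pairs, splitting the combined A100 row into 40GB and 80GB entries; the helper itself is hypothetical:

```python
# Hypothetical lookup over the recommendations table above.
# Keys: (model-size bucket, GPU type); values: (batch-size range, SLURM mem).
RECOMMENDED = {
    ("3-7B", "A100-80GB"): ("16-24", "80G"),
    ("3-7B", "A100-40GB"): ("16-24", "40G"),
    ("3-7B", "A40-48GB"): ("12-16", "48G"),
    ("3-7B", "24GB"): ("6-8", "24G"),
    ("7-20B", "A100-80GB"): ("8-12", "80G"),
    ("7-20B", "A100-40GB"): ("4-6", "40G"),
    ("20-40B", "A100-80GB"): ("4-6", "80G"),
    ("40-70B", "A100-80GB"): ("2-3", "80G"),
}

def recommend(size: str, gpu: str):
    """Return (batch-size range, SLURM memory), or None if not tabulated."""
    return RECOMMENDED.get((size, gpu))
```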
Large Model Configuration on Single GPU
When running large models (70B+) on a single A100 (80GB) GPU:
```python
# Using code arguments
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"   # A100 partition
slurm_config.mem = "80G"          # Request full node memory
slurm_config.gpus_per_node = 1    # 1 A100 80GB GPU
slurm_config.cpus_per_task = 16   # Increase CPU cores for preprocessing

runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="Llama-3.3-70B-Instruct",
    batch_size=2,  # Even a batch size of 2 is significant for 70B models
)

# Or using environment variables
# In .env file:
# MODEL_NAME=Llama-3.3-70B-Instruct
# BATCH_SIZE=2
# SLURM_MEM=80G
# SLURM_CPUS_PER_TASK=16
# SLURM_PARTITION=a100
```
Multi-GPU Configuration for Increased Throughput
For large models with higher throughput, you can leverage multiple GPUs:
```python
# Multi-GPU setup for maximum performance
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"   # A100 partition
slurm_config.mem = "160G"         # Request memory for multiple GPUs
slurm_config.gpus_per_node = 2    # Request 2 A100 GPUs
slurm_config.cpus_per_task = 32   # Increase CPU cores for multi-GPU processing

runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.jsonl",
    model="Llama-3.3-70B-Instruct",
    batch_size=4,  # Higher batch size possible with multiple GPUs
)

# Or using environment variables
# In .env file:
# MODEL_NAME=Llama-3.3-70B-Instruct
# BATCH_SIZE=4
# SLURM_MEM=160G
# SLURM_CPUS_PER_TASK=32
# SLURM_PARTITION=a100
# SLURM_GPUS_PER_NODE=2
```
Custom Model Configuration
Custom model config files are supported for the vLLM engine only. This is useful when you want to keep the runtime settings in a custom models.yaml and point hf_name at a local fine-tuned model directory on the cluster.
```yaml
# custom-models.yaml
models:
  my-custom-qwen:
    name: my-custom-qwen
    hf_name: /custom_qwen/output_dir
    resources:
      gpu_layers: 24
      gpu_memory: "16GB"
      batch_size: 4
      max_concurrent: 1
    parameters:
      temperature: 0.7
      top_p: 0.9
      max_tokens: 2048
      stop_sequences: []
```
Python example
```python
from llmflux.core.config import Config
from llmflux.slurm.runner import SlurmRunner

config = Config()
slurm_config = config.get_slurm_config()
runner = SlurmRunner(config=slurm_config, workspace=".")
job_id = runner.run(
    input_path="data/input/prompts.jsonl",
    output_path="data/output/results.json",
    model="my-custom-qwen",
    custom_config_path="custom-models.yaml",
    batch_size=4,
)
print(job_id)
```
CLI example
```shell
llmflux run \
  --engine vllm \
  --model my-custom-qwen \
  --custom-config-path custom-models.yaml \
  --input data/input/prompts.jsonl \
  --output data/output/results.json
```