Use Serverless LoRA Inference
Bring your own custom LoRA for serving fine-tuned models on W&B Inference.
LoRA (Low-Rank Adaptation) lets you personalize large language models by training and storing only a lightweight ‘add-on’ instead of a full new model. This makes customization faster, cheaper, and easier to deploy.
You can train or upload a LoRA to give a base model new capabilities, such as specializing it for customer support, creative writing, or a particular technical field. This allows you to adapt the model’s behavior without having to retrain or redeploy the entire model.
Why use W&B Inference for LoRAs?#
- Upload once, deploy instantly — no servers to manage.
- Track exactly which version is live with artifact versioning.
- Update models in seconds by swapping small LoRA files instead of the full model weights.
Workflow#
- Upload your LoRA weights as a W&B artifact
- Reference the artifact URI as your model name in the API
- W&B dynamically loads your weights for inference
Here’s an example of calling your custom LoRA model using W&B Inference:
```python
from openai import OpenAI

# Placeholders: replace with your own team, project, and W&B API key.
WB_TEAM = "<your-team>"
WB_PROJECT = "<your-project>"
API_KEY = "<your-wandb-api-key>"

# The artifact URI of your LoRA is used as the model name.
model_name = f"wandb-artifact:///{WB_TEAM}/{WB_PROJECT}/qwen_lora:latest"

client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=API_KEY,
    project=f"{WB_TEAM}/{WB_PROJECT}",
)

resp = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Say 'Hello World!'"}],
)
print(resp.choices[0].message.content)
```
Check out this getting started notebook for an interactive demonstration of how to create a LoRA and upload it to W&B as an artifact.
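The model name in the example follows the pattern `wandb-artifact:///<team>/<project>/<artifact>:<alias>`. The helper below is a minimal sketch of building that string in one place. The function name `lora_model_name` and its input checks are illustrative assumptions, not part of any W&B API. Passing a specific artifact version such as `v3` instead of `latest` is assumed to work the same way as the `latest` alias shown above.

```python
def lora_model_name(team: str, project: str, artifact: str, alias: str = "latest") -> str:
    """Build the `model` string for a LoRA stored as a W&B artifact.

    Hypothetical helper: it only formats the URI pattern used on this page.
    """
    parts = (("team", team), ("project", project), ("artifact", artifact), ("alias", alias))
    for label, value in parts:
        # Assumption: empty values or stray separators would produce a malformed URI.
        if not value or "/" in value or ":" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return f"wandb-artifact:///{team}/{project}/{artifact}:{alias}"


print(lora_model_name("my-team", "my-project", "qwen_lora"))
# prints: wandb-artifact:///my-team/my-project/qwen_lora:latest
```

Keeping the URI construction in a single function makes it easier to switch between `latest` and a pinned version when you want to control exactly which LoRA is live.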
Prerequisites#
You need:
- A W&B API key
- A W&B project
- Python 3.8+ with the `openai` and `wandb` packages: `pip install wandb openai`
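The examples on this page rely on three values: `API_KEY`, `WB_TEAM`, and `WB_PROJECT`. One way to supply them, sketched here as an assumption rather than a documented requirement, is to read them from the `WANDB_API_KEY`, `WANDB_ENTITY`, and `WANDB_PROJECT` environment variables that the `wandb` library recognizes. The function name `load_wandb_settings` is hypothetical.

```python
import os
from typing import Mapping, Optional

# Environment variable names recognized by the wandb library.
REQUIRED_VARS = ("WANDB_API_KEY", "WANDB_ENTITY", "WANDB_PROJECT")


def load_wandb_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the values used by the examples, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "API_KEY": env["WANDB_API_KEY"],
        "WB_TEAM": env["WANDB_ENTITY"],
        "WB_PROJECT": env["WANDB_PROJECT"],
    }


# Demonstration with an explicit mapping; call load_wandb_settings() to use os.environ.
settings = load_wandb_settings(
    {"WANDB_API_KEY": "example-key", "WANDB_ENTITY": "my-team", "WANDB_PROJECT": "my-project"}
)
print(settings["WB_TEAM"], settings["WB_PROJECT"])
# prints: my-team my-project
```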
How to add LoRAs and use them#
You can add LoRAs to your W&B account and start using them in two ways: upload a LoRA you trained elsewhere, or train a new one with W&B.
Upload your own custom LoRA directory as a W&B artifact. This works well if you trained your LoRA elsewhere (local environment, cloud provider, or partner service).
The following Python code uploads your locally stored LoRA weights to W&B as a versioned artifact. It creates a `lora` type artifact with the required base model metadata and storage region, adds your LoRA files from a local directory, and logs the artifact to your W&B project for use with inference.
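```python
import wandb

# WB_TEAM and WB_PROJECT are placeholders for your team (entity) and project names.
run = wandb.init(entity=WB_TEAM, project=WB_PROJECT)

artifact = wandb.Artifact(
    "qwen_lora",
    type="lora",
    # Exact name of the base model the LoRA was trained on.
    metadata={"wandb.base_model": "OpenPipe/Qwen3-14B-Instruct"},
    storage_region="coreweave-us",
)

# Add every file in the local LoRA directory to the artifact.
artifact.add_dir("<path-to-lora-weights>")
run.log_artifact(artifact)
run.finish()  # Finish the run so the upload completes before the script exits.
```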
### Key Requirements
To use your own LoRAs with Inference:
* The LoRA must have been trained using one of the models listed in the [Supported Base Models section](#supported-base-models).
* The LoRA must be saved in PEFT format as a `lora` type artifact in your W&B account.
* The maximum supported rank is 16.
* The LoRA must be stored with `storage_region="coreweave-us"` for low latency.
* When uploading, include the name of the base model you trained it on (for example, `meta-llama/Llama-3.1-8B-Instruct`) in the artifact metadata under `wandb.base_model`. This ensures W&B can load it with the correct model.
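Some of these requirements can be checked locally before uploading. The sketch below assumes a standard PEFT adapter directory, which contains an `adapter_config.json` with an integer rank field `r` and a weights file named `adapter_model.safetensors` or `adapter_model.bin`. The function name `check_peft_adapter` is hypothetical, and the check is not an official W&B validation step. It does not cover per-module rank overrides or any server-side checks.

```python
import json
from pathlib import Path
from typing import List

MAX_RANK = 16  # Maximum supported rank stated in the requirements above.
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")


def check_peft_adapter(adapter_dir: str, max_rank: int = MAX_RANK) -> List[str]:
    """Return a list of problems found; an empty list means these checks passed."""
    root = Path(adapter_dir)
    config_path = root / "adapter_config.json"
    if not config_path.is_file():
        return [f"{config_path.name} not found in {root}"]

    problems = []
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # PEFT stores the LoRA rank under the key "r".
    rank = config.get("r")
    if not isinstance(rank, int):
        problems.append("adapter_config.json has no integer 'r' (rank) field")
    elif rank > max_rank:
        problems.append(f"rank {rank} exceeds the supported maximum of {max_rank}")

    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("no adapter weights file found (" + " or ".join(WEIGHT_FILES) + ")")
    return problems
```

A typical use would be to call `check_peft_adapter("<path-to-lora-weights>")` and only create the artifact when the returned list is empty.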
Train a new LoRA with [W&B Training (serverless RL)](/training). Once training is complete, your LoRA is automatically available as a W&B artifact that you can use directly.
For detailed information on how to train your own LoRA, see [OpenPipe's ART quickstart](https://art.openpipe.ai/getting-started/quick-start).
Once your LoRA has been added to your project as an artifact, use the artifact's URI in your inference calls, like this:
```python
# After training completes, use your artifact directly
model_name = f"wandb-artifact:///{WB_TEAM}/{WB_PROJECT}/your_trained_lora:latest"
```
Supported Base Models#
Inference currently supports the following base models. Use these exact strings as the value of `wandb.base_model` in the artifact metadata. More models are coming soon.
- meta-llama/Llama-3.1-70B-Instruct
- meta-llama/Llama-3.1-8B-Instruct
- OpenPipe/Qwen3-14B-Instruct
- Qwen/Qwen3-30B-A3B-Instruct-2507
- Qwen/Qwen2.5-14B-Instruct
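Because the base model string must match exactly, it can help to validate it before creating the artifact. The following sketch hardcodes the model names listed above, so it will go stale as the supported list changes. The function name `lora_artifact_metadata` is a hypothetical helper, not part of the `wandb` library.

```python
# Copied from the supported base models listed on this page; update as the list changes.
SUPPORTED_BASE_MODELS = frozenset(
    {
        "meta-llama/Llama-3.1-70B-Instruct",
        "meta-llama/Llama-3.1-8B-Instruct",
        "OpenPipe/Qwen3-14B-Instruct",
        "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "Qwen/Qwen2.5-14B-Instruct",
    }
)


def lora_artifact_metadata(base_model: str) -> dict:
    """Return artifact metadata for a LoRA, rejecting base model names not listed above."""
    if base_model not in SUPPORTED_BASE_MODELS:
        supported = ", ".join(sorted(SUPPORTED_BASE_MODELS))
        raise ValueError(f"unsupported base model {base_model!r}; expected one of: {supported}")
    return {"wandb.base_model": base_model}


print(lora_artifact_metadata("Qwen/Qwen2.5-14B-Instruct"))
# prints: {'wandb.base_model': 'Qwen/Qwen2.5-14B-Instruct'}
```

The returned dictionary can be passed as the `metadata` argument when constructing the `lora` type artifact. The comparison is case-sensitive, so a name such as `qwen/qwen2.5-14b-instruct` is rejected.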
Pricing#
Serverless LoRA Inference is simple and cost-effective: you pay only for storage and the inference you actually run, rather than for always-on servers or dedicated GPU instances.
- Storage - Storing LoRA weights is inexpensive, especially compared to maintaining your own GPU infrastructure.
- Inference usage - Calls that use LoRA artifacts are billed at the same rates as standard model inference. There are no extra fees for serving custom LoRAs.