Adaptive batching
GPU inference is significantly faster when processing a batch of inputs at once compared to processing them one at a time. BlazeRPC's adaptive batcher collects individual requests from concurrent clients and groups them into batches before calling your model function.
Enabling batching
Batching is enabled by default. Configure it through BlazeApp:

    from blazerpc import BlazeApp

    app = BlazeApp(
        enable_batching=True,
        max_batch_size=32,
        batch_timeout_ms=10.0,
    )

| Parameter | Default | Description |
|---|---|---|
| enable_batching | True | Set to False to process every request individually. |
| max_batch_size | 32 | Maximum number of requests collected into a single batch. |
| batch_timeout_ms | 10.0 | Maximum time (in milliseconds) to wait for a full batch before dispatching a partial one. |
How it works
Batching is fully transparent: you write a normal single-item handler, and the framework collects requests into batches and distributes the results.
The batcher runs as a background asyncio.Task with the following loop:
- Wait for the first request to arrive.
- Collect additional requests until either max_batch_size is reached or batch_timeout_ms elapses.
- Dispatch the collected batch to the model function (called once per item in the batch).
- Distribute each result back to the corresponding client's future.
This means:
- Under high load, batches fill up quickly and are dispatched at full capacity.
- Under light load, the timeout ensures that a lone request is not stuck waiting for a batch to fill. With the default 10 ms timeout, a lone request waits at most about 10 ms before it is dispatched.
- Your handler signature stays the same whether batching is on or off.
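The loop described above can be sketched with plain asyncio. This is an illustration of the collect-then-dispatch pattern under stated assumptions, not BlazeRPC's actual implementation: the class name SketchBatcher, the submit method, and the infer_batch callable are all hypothetical.

```python
import asyncio
from typing import Any, Callable, List


class SketchBatcher:
    """Illustrative batcher; all names here are hypothetical, not BlazeRPC APIs."""

    def __init__(
        self,
        infer_batch: Callable[[List[Any]], List[Any]],
        max_batch_size: int = 32,
        batch_timeout_ms: float = 10.0,
    ) -> None:
        self._infer_batch = infer_batch
        self._max_batch_size = max_batch_size
        self._timeout_s = batch_timeout_ms / 1000.0
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self) -> None:
        """Background loop: collect a batch, dispatch it, distribute results."""
        loop = asyncio.get_running_loop()
        while True:
            # Step 1: block until the first request arrives.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._timeout_s
            # Step 2: keep collecting until the batch is full or time runs out.
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self._queue.get(), timeout=remaining)
                    )
                except asyncio.TimeoutError:
                    break
            # Step 3: dispatch the collected items.
            try:
                results = self._infer_batch([item for item, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            # Step 4: hand each result back to the caller that is waiting on it.
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)


async def demo():
    batch_sizes = []

    def infer_batch(items):
        batch_sizes.append(len(items))
        return [x * 2 for x in items]

    batcher = SketchBatcher(infer_batch, max_batch_size=4, batch_timeout_ms=20.0)
    task = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(3)))
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, three requests arrive almost simultaneously, so they end up in one batch of three that is dispatched when the 20 ms timeout fires, because the batch never reaches max_batch_size=4.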
Example
This example serves a scikit-learn Iris classifier with batching enabled. When multiple clients send classification requests within a short time window, BlazeRPC automatically groups them into a single batch:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    from blazerpc import BlazeApp, TensorInput, TensorOutput

    # Train a small classifier once at import time.
    iris = load_iris()
    clf = LogisticRegression(max_iter=200)
    clf.fit(iris.data, iris.target)

    app = BlazeApp(
        enable_batching=True,
        max_batch_size=16,
        batch_timeout_ms=5.0,
    )


    @app.model("iris")
    def predict_iris(
        features: TensorInput[np.float32, "batch", 4],
    ) -> TensorOutput[np.float32, "batch", 3]:
        # Cast to float32 to match the declared output dtype.
        probs = clf.predict_proba(features).astype(np.float32)
        return probs

When three clients call predict_iris within 5 ms of each other, the batcher groups all three requests into a single batch. The model function runs once, and each client receives only its own result.
Tuning
max_batch_size controls the upper bound on batch size. Set this based on your GPU memory and model throughput characteristics. Larger batches improve throughput but use more memory.
batch_timeout_ms controls latency under light load. Lower values reduce tail latency for individual requests. Higher values give the batcher more time to collect a full batch, improving throughput.
A good starting point:
- For latency-sensitive applications (real-time APIs): batch_timeout_ms=5.0, max_batch_size=8.
- For throughput-optimized workloads (offline processing): batch_timeout_ms=50.0, max_batch_size=64.
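To reason about these two settings before measuring, a back-of-the-envelope estimate can help. The sketch below assumes evenly spaced arrivals at a known rate, which real traffic rarely matches, and the function estimate_batching is a hypothetical helper, not part of BlazeRPC. It estimates the batch size you can expect and the longest time the first request in a batch waits before dispatch.

```python
def estimate_batching(arrival_rate_rps, max_batch_size, batch_timeout_ms):
    """Rough estimate under evenly spaced arrivals (a simplifying assumption).

    Returns (expected_batch_size, max_wait_ms_for_first_request).
    """
    timeout_s = batch_timeout_ms / 1000.0
    # The first request opens the window; count the requests that follow it
    # before the timeout fires, capped at the configured maximum.
    expected_batch_size = min(max_batch_size, 1 + int(arrival_rate_rps * timeout_s))
    # Time needed for the remaining (max_batch_size - 1) requests to arrive.
    fill_time_ms = (max_batch_size - 1) / arrival_rate_rps * 1000.0
    # The first request waits until the batch fills or the timeout fires.
    max_wait_ms = min(batch_timeout_ms, fill_time_ms)
    return expected_batch_size, max_wait_ms


# Latency-sensitive settings at 300 requests/s: batches stay small and the
# timeout bounds the wait.
print(estimate_batching(300, 8, 5.0))      # -> (2, 5.0)

# Throughput-oriented settings at 5000 requests/s: batches fill to 64 in
# roughly 12.6 ms, well before the 50 ms timeout.
print(estimate_batching(5000, 64, 50.0))
```

If the estimated batch size stays far below max_batch_size at your real traffic level, raising max_batch_size changes little; the timeout is the setting that matters in that regime.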
Partial failure handling
If the model function raises an exception, every request in the batch receives that exception. This is the "whole-batch failure" case.
The batcher also supports per-item failure at the infrastructure level. When the internal batch inference function returns an Exception instance at a specific index in the results list, only that item's request is rejected; the other items in the batch still receive their results normally.
If the results list has a different length than the input batch, every request receives a RuntimeError explaining the mismatch.
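The per-item and length-mismatch rules above can be illustrated with asyncio futures. This is a minimal sketch of the described behavior, not BlazeRPC's internal code; the helper name distribute_results and the exact error message are assumptions.

```python
import asyncio
from typing import Any, List


def distribute_results(futures: List[asyncio.Future], results: List[Any]) -> None:
    """Resolve each waiting future from a batch result list (hypothetical helper)."""
    if len(results) != len(futures):
        # Results cannot be matched to requests, so every request fails.
        error = RuntimeError(
            f"batch returned {len(results)} results for {len(futures)} requests"
        )
        for future in futures:
            if not future.done():
                future.set_exception(error)
        return
    for future, result in zip(futures, results):
        if future.done():
            continue  # for example, the caller was cancelled while waiting
        if isinstance(result, Exception):
            # Per-item failure: only this request is rejected.
            future.set_exception(result)
        else:
            future.set_result(result)


async def demo():
    loop = asyncio.get_running_loop()
    futures = [loop.create_future() for _ in range(3)]
    distribute_results(futures, ["ok-0", ValueError("bad input"), "ok-2"])
    # return_exceptions=True lets us inspect each outcome side by side.
    return await asyncio.gather(*futures, return_exceptions=True)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, the first and third callers receive their results and only the second sees the ValueError.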
Disabling batching
Set enable_batching=False to process every request individually:
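A minimal configuration, assuming only the BlazeApp constructor argument documented in the parameter table:

```python
from blazerpc import BlazeApp

# With batching off, each request goes straight to the handler.
app = BlazeApp(enable_batching=False)
```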
This is appropriate when:
- Your model does not benefit from batched inference (e.g., it processes one item at a time internally).
- You want the simplest possible request path for debugging.
Automatic exclusions
Even when enable_batching=True, BlazeRPC automatically skips batching for certain models:
- Streaming models: Server-streaming handlers (streaming=True) are always called individually. The batcher only handles unary RPCs.
- Models using dependency injection: Handlers that use Context or Depends parameters are excluded from batching. Each request is processed individually so that per-request context and dependencies are correctly resolved. A warning is logged at startup for each excluded model.
If you need both batching and shared resources, access them directly in the handler body (e.g., via a module-level variable) rather than through Depends. See the dependency injection guide for details.
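The module-level pattern might look like the sketch below. ConnectionPool, POOL, and lookup are hypothetical stand-ins for whatever shared resource and handler you have. In a real application the handler would also carry the @app.model(...) decorator, which is omitted here so the snippet runs on its own.

```python
import threading


class ConnectionPool:
    """Hypothetical stand-in for an expensive shared resource."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.checkouts = 0
        self._lock = threading.Lock()

    def query(self, key: str) -> str:
        # Guard the counter in case handlers run on multiple threads.
        with self._lock:
            self.checkouts += 1
        return f"value-for-{key}"


# Created once at import time and shared by every request.
POOL = ConnectionPool(size=4)


def lookup(key: str) -> str:
    # The signature has no Context or Depends parameters, so a handler
    # written this way stays eligible for batching.
    return POOL.query(key)
```

The trade-off is that the resource is no longer resolved per request, so anything request-specific still has to go through dependency injection.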