Install on Linux
The easiest way to start the right container for your machine, with good default settings (such as a cache volume and a secure, non-privileged execution mode), is to let the CLI choose and start it with the inference server start command. You need to install Docker first:
pip install inference-cli
inference server start
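Since the commands above depend on both Docker and the CLI being installed, a quick preflight check can save a confusing error. The following is a minimal sketch using only the Python standard library; `missing_tools` is a hypothetical helper name, not part of inference-cli.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools(["docker", "inference"])` returns an empty list when both executables are available, and otherwise lists whichever ones still need to be installed.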
Manually Starting the Container
If you want more control of the container settings you can also start it manually.
The core CPU Docker image includes support for OpenVINO acceleration on x64 CPUs via onnxruntime. Heavy models like SAM2 and CogVLM may run too slowly (dozens of seconds per image) to be practical (and you should look into getting a CUDA-capable GPU if you want to use them).
The primary use cases for CPU inference are processing still images (e.g., NSFW classification of uploads or document verification) and infrequent sampling of frames from a video (e.g., occupancy tracking of a parking lot).
To get started with CPU inference, use the roboflow/roboflow-inference-server-cpu:latest container:
sudo docker run -d \
--name inference-server \
--read-only \
-p 9001:9001 \
--volume ~/.inference/cache:/tmp:rw \
--security-opt="no-new-privileges" \
--cap-drop="ALL" \
--cap-add="NET_BIND_SERVICE" \
roboflow/roboflow-inference-server-cpu:latest
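Because the container is started detached (-d), the docker run command returns before the server is necessarily ready to accept requests. As a rough sketch, assuming the port mapping 9001:9001 used above, a script can poll the port before sending work. `wait_for_port` is a hypothetical helper written with the Python standard library only; it checks that something is listening, not that models are loaded.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves that something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the command above, `wait_for_port("localhost", 9001)` would return True once the published port accepts connections.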
The GPU container adds support for hardware acceleration on cards that support CUDA via NVIDIA-Docker. First follow the NVIDIA Container Toolkit installation guide, then add --gpus all to the docker run command:
sudo docker run -d \
--name inference-server \
--gpus all \
--read-only \
-p 9001:9001 \
--volume ~/.inference/cache:/tmp:rw \
--security-opt="no-new-privileges" \
--cap-drop="ALL" \
--cap-add="NET_BIND_SERVICE" \
roboflow/roboflow-inference-server-gpu:latest
With the GPU container you can optionally enable TensorRT, NVIDIA's model optimization runtime that will greatly increase your models' speed at the expense of a heavy compilation and optimization step (sometimes 15+ minutes) the first time you load each model.
You can enable TensorRT by adding TensorrtExecutionProvider to the ONNXRUNTIME_EXECUTION_PROVIDERS environment variable:
sudo docker run -d \
--name inference-server \
--gpus all \
--read-only \
-p 9001:9001 \
--volume ~/.inference/cache:/tmp:rw \
--security-opt="no-new-privileges" \
--cap-drop="ALL" \
--cap-add="NET_BIND_SERVICE" \
-e ONNXRUNTIME_EXECUTION_PROVIDERS="[TensorrtExecutionProvider,CUDAExecutionProvider,OpenVINOExecutionProvider,CPUExecutionProvider]" \
roboflow/roboflow-inference-server-gpu:latest
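The three commands above differ only in the --gpus flag, the image name, and the optional execution-provider variable. If you launch the container from a script, those differences can be captured in one place. The sketch below is illustrative only: `format_providers` and `docker_run_args` are hypothetical helpers, and the flags are copied from the commands in this section rather than taken from any official API.

```python
import os
from typing import List, Optional, Sequence


def format_providers(providers: Sequence[str]) -> str:
    """Render providers in the bracketed, comma-separated form used above."""
    return "[" + ",".join(providers) + "]"


def docker_run_args(
    gpu: bool = False,
    providers: Optional[Sequence[str]] = None,
    cache_dir: str = "~/.inference/cache",
) -> List[str]:
    """Assemble the docker run argument list shown in this section."""
    args = ["docker", "run", "-d", "--name", "inference-server"]
    if gpu:
        args += ["--gpus", "all"]
    args += [
        "--read-only",
        "-p", "9001:9001",
        # Expand ~ here: no shell does it when the list is passed to subprocess.
        "--volume", os.path.expanduser(cache_dir) + ":/tmp:rw",
        "--security-opt=no-new-privileges",
        "--cap-drop=ALL",
        "--cap-add=NET_BIND_SERVICE",
    ]
    if providers:
        args += ["-e", "ONNXRUNTIME_EXECUTION_PROVIDERS=" + format_providers(providers)]
    flavor = "gpu" if gpu else "cpu"
    args.append(f"roboflow/roboflow-inference-server-{flavor}:latest")
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)`, prefixed with `sudo` if your Docker setup requires it. Keeping the provider order in a Python list also makes it easy to drop TensorrtExecutionProvider again when the long first-load compilation is not acceptable.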
Docker Compose
If you are using Docker Compose for your application, the equivalent YAML is shown below for the CPU container, the GPU container, and the GPU container with TensorRT enabled:
# CPU
version: "3.9"
services:
  inference-server:
    container_name: inference-server
    image: roboflow/roboflow-inference-server-cpu:latest
    read_only: true
    ports:
      - "9001:9001"
    volumes:
      - "${HOME}/.inference/cache:/tmp:rw"
    security_opt:
      - no-new-privileges
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
# GPU
version: "3.9"
services:
  inference-server:
    container_name: inference-server
    image: roboflow/roboflow-inference-server-gpu:latest
    read_only: true
    ports:
      - "9001:9001"
    volumes:
      - "${HOME}/.inference/cache:/tmp:rw"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    security_opt:
      - no-new-privileges
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
# GPU with TensorRT
version: "3.9"
services:
  inference-server:
    container_name: inference-server
    image: roboflow/roboflow-inference-server-gpu:latest
    read_only: true
    ports:
      - "9001:9001"
    volumes:
      - "${HOME}/.inference/cache:/tmp:rw"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      ONNXRUNTIME_EXECUTION_PROVIDERS: "[TensorrtExecutionProvider,CUDAExecutionProvider,OpenVINOExecutionProvider,CPUExecutionProvider]"
    security_opt:
      - no-new-privileges
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
Using Your New Server
Once you have a server running, you can access it via its API or using the Python SDK. You can also use it to build Workflows using the Roboflow Platform UI.
Install the SDK
pip install inference-sdk
Run a workflow
This code runs an example model comparison Workflow on an Inference Server running on your local machine:
from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="http://localhost:9001",  # use local inference server
    # api_key="<YOUR API KEY>"  # optional to access your private data and models
)

result = client.run_workflow(
    workspace_name="roboflow-docs",
    workflow_id="model-comparison",
    images={
        "image": "https://media.roboflow.com/workflows/examples/bleachers.jpg"
    },
    parameters={
        "model1": "yolov8n-640",
        "model2": "yolov11n-640"
    }
)

print(result)
From a JavaScript app, hit your new server with an HTTP request.
const response = await fetch('http://localhost:9001/infer/workflows/roboflow-docs/model-comparison', {
    method: 'POST',
    headers: {
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        // api_key: "<YOUR API KEY>", // optional to access your private data and models
        inputs: {
            "image": {
                "type": "url",
                "value": "https://media.roboflow.com/workflows/examples/bleachers.jpg"
            },
            "model1": "yolov8n-640",
            "model2": "yolov11n-640"
        }
    })
});

const result = await response.json();
console.log(result);
Warning
Be careful not to expose your API Key to external users (in other words: don't use this snippet in a public-facing front-end app).
Because the server exposes an HTTP API, you can call it from any other client application. For example, from the command line using cURL:
curl -X POST "http://localhost:9001/infer/workflows/roboflow-docs/model-comparison" \
    -H "Content-Type: application/json" \
    -d '{
        "api_key": "<YOUR API KEY -- REMOVE THIS LINE IF NOT FILLING>",
        "inputs": {
            "image": {
                "type": "url",
                "value": "https://media.roboflow.com/workflows/examples/bleachers.jpg"
            },
            "model1": "yolov8n-640",
            "model2": "yolov11n-640"
        }
    }'
Tip
ChatGPT is really good at converting snippets like this into other languages. If you need help, try pasting it in and asking it to translate it to your language of choice.
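As one illustration of such a translation, the cURL request above can be reproduced with only the Python standard library, which may be handy where installing the SDK is not an option. This is a sketch, not the SDK's implementation: `build_workflow_request` and `send_request` are hypothetical helper names, and the endpoint path and payload shape are copied from the cURL and JavaScript examples on this page.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_workflow_request(
    base_url: str,
    workspace: str,
    workflow_id: str,
    inputs: Dict[str, Any],
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build the same POST request as the cURL example, without sending it."""
    payload: Dict[str, Any] = {"inputs": inputs}
    if api_key:
        # Only include the key when one is supplied, as the cURL example advises.
        payload["api_key"] = api_key
    url = f"{base_url.rstrip('/')}/infer/workflows/{workspace}/{workflow_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request: urllib.request.Request, timeout: float = 60.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

To mirror the cURL call, build the request with `base_url="http://localhost:9001"`, `workspace="roboflow-docs"`, `workflow_id="model-comparison"`, and the same `inputs` object, then pass it to `send_request`. In line with the warning above, it is safer to read the API key from the environment on the server side (for example from a variable of your choosing) than to hard-code it in client code.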