
Install on Windows

First, you'll need to install Docker Desktop. Then, use the CLI to start the container:

pip install inference-cli
inference server start

To access the GPU, you'll need to ensure you've installed up-to-date NVIDIA drivers and the latest version of WSL 2, and that the WSL 2 backend is configured in Docker. Follow the setup instructions from Docker.
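
If you want to confirm the GPU is visible to Docker before starting the server, a quick sanity check looks like this (the CUDA image tag is only an example; any recent nvidia/cuda base image will do):

wsl --update
wsl --status
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If nvidia-smi prints your GPU, the WSL 2 backend and GPU passthrough are working.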

Then, use the CLI to start the container:

pip install inference-cli
inference server start

Note

If the pip install command fails, you may need to install Python first. Once you have Python version 3.12, 3.11, 3.10, or 3.9 on your machine, retry the command.
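
To see which interpreter and version are on your PATH before retrying, you can run (on some Windows installs the interpreter is exposed as py rather than python):

python --version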

Manually Starting the Container

If you want more control over the container settings, you can also start it manually.

The core CPU Docker image includes support for OpenVINO acceleration on x64 CPUs via onnxruntime. Heavy models like SAM2 and CogVLM may run too slowly on CPU to be practical (often dozens of seconds per image); if you want to use them, consider getting a CUDA-capable GPU.

The primary use cases for CPU inference are processing still images (e.g. NSFW classification of uploads or document verification) or infrequent sampling of frames from a video (e.g. occupancy tracking of a parking lot).

To get started with CPU inference, use the roboflow/roboflow-inference-server-cpu:latest container.

docker run -d ^
    --name inference-server ^
    --read-only ^
    -p 9001:9001 ^
    --volume "%USERPROFILE%\.inference\cache:/tmp:rw" ^
    --security-opt="no-new-privileges" ^
    --cap-drop="ALL" ^
    --cap-add="NET_BIND_SERVICE" ^
    roboflow/roboflow-inference-server-cpu:latest

The GPU container adds support for hardware acceleration on cards that support CUDA via NVIDIA-Docker. Ensure you have set up Docker to access the GPU, then add --gpus all to the docker run command:

docker run -d ^
    --name inference-server ^
    --gpus all ^
    --read-only ^
    -p 9001:9001 ^
    --volume "%USERPROFILE%\.inference\cache:/tmp:rw" ^
    --security-opt="no-new-privileges" ^
    --cap-drop="ALL" ^
    --cap-add="NET_BIND_SERVICE" ^
    roboflow/roboflow-inference-server-gpu:latest

With the GPU container you can optionally enable TensorRT, NVIDIA's model optimization runtime that will greatly increase your models' speed at the expense of a heavy compilation and optimization step (sometimes 15+ minutes) the first time you load each model.

You can enable TensorRT by adding TensorrtExecutionProvider to the ONNXRUNTIME_EXECUTION_PROVIDERS environment variable.

docker run -d ^
    --name inference-server ^
    --gpus all ^
    --read-only ^
    -p 9001:9001 ^
    --volume "%USERPROFILE%\.inference\cache:/tmp:rw" ^
    --security-opt="no-new-privileges" ^
    --cap-drop="ALL" ^
    --cap-add="NET_BIND_SERVICE" ^
    -e ONNXRUNTIME_EXECUTION_PROVIDERS="[TensorrtExecutionProvider,CUDAExecutionProvider,OpenVINOExecutionProvider,CPUExecutionProvider]" ^
    roboflow/roboflow-inference-server-gpu:latest
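
Because the first TensorRT compilation of each model can take a while, it helps to watch the container logs while the engine builds (the name matches the --name flag used above):

docker logs -f inference-server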

Docker Compose

If you are using Docker Compose for your application, the equivalent YAML for the CPU container is:

version: "3.9"

services:
  inference-server:
    container_name: inference-server
    image: roboflow/roboflow-inference-server-cpu:latest

    read_only: true
    ports:
      - "9001:9001"

    volumes:
      - "${USERPROFILE}/.inference/cache:/tmp:rw"

    security_opt:
      - no-new-privileges
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE

For the GPU container:

version: "3.9"

services:
  inference-server:
    container_name: inference-server
    image: roboflow/roboflow-inference-server-gpu:latest

    read_only: true
    ports:
      - "9001:9001"

    volumes:
      - "${USERPROFILE}/.inference/cache:/tmp:rw"

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    security_opt:
      - no-new-privileges
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE

For the GPU container with TensorRT enabled:

version: "3.9"

services:
  inference-server:
    container_name: inference-server
    image: roboflow/roboflow-inference-server-gpu:latest

    read_only: true
    ports:
      - "9001:9001"

    volumes:
      - "${USERPROFILE}/.inference/cache:/tmp:rw"

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    environment:
      ONNXRUNTIME_EXECUTION_PROVIDERS: "[TensorrtExecutionProvider,CUDAExecutionProvider,OpenVINOExecutionProvider,CPUExecutionProvider]"

    security_opt:
      - no-new-privileges
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
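
Assuming you save one of the YAML files above as docker-compose.yml in your working directory, you can manage the service with the usual Compose commands:

docker compose up -d
docker compose logs -f inference-server
docker compose down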

Using Your New Server

Once you have a server running, you can access it via its API or using the Python SDK. You can also use it to build Workflows using the Roboflow Platform UI.
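
To quickly confirm the container is running and the port is reachable before wiring up a client (the exact response body varies by server version), you can run:

docker ps --filter "name=inference-server"
curl http://localhost:9001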

Install the SDK

pip install inference-sdk

Run a workflow

This code runs an example model comparison Workflow on an Inference Server running on your local machine:

from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="http://localhost:9001", # use local inference server
    # api_key="<YOUR API KEY>" # optional to access your private data and models
)

result = client.run_workflow(
    workspace_name="roboflow-docs",
    workflow_id="model-comparison",
    images={
        "image": "https://media.roboflow.com/workflows/examples/bleachers.jpg"
    },
    parameters={
        "model1": "yolov8n-640",
        "model2": "yolov11n-640"
    }
)

print(result)

From a JavaScript app, hit your new server with an HTTP request.

const response = await fetch('http://localhost:9001/infer/workflows/roboflow-docs/model-comparison', {
    method: 'POST',
    headers: {
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        // api_key: "<YOUR API KEY>" // optional to access your private data and models
        inputs: {
            "image": {
                "type": "url",
                "value": "https://media.roboflow.com/workflows/examples/bleachers.jpg"
            },
            "model1": "yolov8n-640",
            "model2": "yolov11n-640"
        }
    })
});

const result = await response.json();
console.log(result);

Warning

Be careful not to expose your API Key to external users (in other words: don't use this snippet in a public-facing front-end app).

Using the server's API, you can access it from any other client application; for example, from the command line using cURL:

curl -X POST "http://localhost:9001/infer/workflows/roboflow-docs/model-comparison" \
-H "Content-Type: application/json" \
-d '{
    "api_key": "<YOUR API KEY -- REMOVE THIS LINE IF NOT FILLING>",
    "inputs": {
        "image": {
            "type": "url",
            "value": "https://media.roboflow.com/workflows/examples/bleachers.jpg"
        },
        "model1": "yolov8n-640",
        "model2": "yolov11n-640"
    }
}'

Tip

ChatGPT is really good at converting snippets like this into other languages. If you need help, try pasting it in and asking it to translate it to your language of choice.