Choosing a Deployment Method
There are three primary ways to deploy Inference:
- Serverless Hosted API - for smaller image models.
- Dedicated Deployment - for bigger models and streaming video.
- Self Hosted - on your own edge device or server.
Each has pros and cons, and the right choice depends on your particular use case and organizational constraints.
Feature | Serverless | Dedicated | Self-Hosted |
---|---|---|---|
Workflows | ✅ | ✅ | ✅ |
Basic Logic Blocks | ✅ | ✅ | ✅ |
Pre-Trained Models | ✅ | ✅ | ✅ |
Fine-Tuned Models | ✅ | ✅ | ✅ |
Universe Models | ✅ | ✅ | ✅ |
Active Learning | ✅ | ✅ | ✅ |
Model Monitoring | ✅ | ✅ | ✅ |
Foundation Models | | ✅ | ✅ |
Video Stream Management | | ✅ | ✅ |
Dynamic Python Blocks | | ✅ | ✅ |
Device Management | | ✅ | ✅ |
Access Local Devices | | | ✅ |
Can Run Offline | | | ✅ |
Billing | Per-Call | Hourly | See Below |
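To make the comparison concrete, here is a minimal sketch that transcribes the feature table into a Python mapping and filters deployment methods by the features you need. The helper `candidate_methods` and the lowercase method labels are hypothetical, written for illustration only; they are not part of the Inference package. The placement of the partially supported rows follows the prose in this guide (for example, the Serverless Hosted API does not run heavy foundation models or stream video).

```python
# Hypothetical helper (not part of the Inference package): the comparison table
# transcribed into data, so required features can be checked programmatically.
SERVERLESS, DEDICATED, SELF_HOSTED = "serverless", "dedicated", "self-hosted"
METHOD_ORDER = [SERVERLESS, DEDICATED, SELF_HOSTED]

ALL_METHODS = frozenset(METHOD_ORDER)
NOT_SERVERLESS = frozenset({DEDICATED, SELF_HOSTED})
SELF_HOSTED_ONLY = frozenset({SELF_HOSTED})

# Feature name (lowercased) -> deployment methods that support it.
FEATURE_SUPPORT: dict[str, frozenset[str]] = {
    "workflows": ALL_METHODS,
    "basic logic blocks": ALL_METHODS,
    "pre-trained models": ALL_METHODS,
    "fine-tuned models": ALL_METHODS,
    "universe models": ALL_METHODS,
    "active learning": ALL_METHODS,
    "model monitoring": ALL_METHODS,
    "foundation models": NOT_SERVERLESS,
    "video stream management": NOT_SERVERLESS,
    "dynamic python blocks": NOT_SERVERLESS,
    "device management": NOT_SERVERLESS,
    "access local devices": SELF_HOSTED_ONLY,
    "can run offline": SELF_HOSTED_ONLY,
}


def candidate_methods(required_features: list[str]) -> list[str]:
    """Return the methods that support every required feature, in table order."""
    remaining = set(ALL_METHODS)
    for feature in required_features:
        key = feature.strip().lower()
        if key not in FEATURE_SUPPORT:
            raise KeyError(f"unknown feature: {feature!r}")
        remaining &= FEATURE_SUPPORT[key]  # keep only methods supporting this feature
    return [method for method in METHOD_ORDER if method in remaining]


if __name__ == "__main__":
    # Needs live video and must keep working through an Internet outage.
    print(candidate_methods(["Video Stream Management", "Can Run Offline"]))
    # -> ['self-hosted']
```

Features only narrow the list; billing and organizational constraints still have to be weighed separately.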
Cloud Hosting
By far the easiest way to get started is with Roboflow's managed services. You can jump straight to building without having to set up any infrastructure. It's often the front door to using Inference, even for those who know they will eventually want to self-host.
There are two cloud hosted offerings with different targeted use-cases, capabilities, and pricing models.
Serverless Hosted API
The Serverless Hosted API supports running Workflows on pre-trained & fine-tuned models, chaining models, basic logic, visualizations, and external integrations.
It supports cloud-hosted VLMs like ChatGPT and Anthropic Claude, but does not support running heavy models like Florence-2 or SAM 2. It also does not support streaming video.
The Serverless API scales down to zero when you're not using it and scales up automatically under load, with a cold-start time of a couple of seconds. You pay per model inference with no minimums, and Roboflow's free tier credits may be used.
Dedicated Deployments
Dedicated Deployments are single-tenant virtual machines allocated for your exclusive use. They can optionally be configured with a GPU, and they run in one of two modes: development mode, where sessions are limited to 3 hours and your machine may be evicted if capacity is needed for a higher-priority task, or production mode, which offers guaranteed capacity and no session time limit.
On a Dedicated Deployment, you can stream video, run custom Python code, access heavy foundation models like SAM 2, Florence-2, and PaliGemma (including your fine-tunes of those models), and install additional dependencies. Dedicated Deployments also run on much higher-performance machines than the instances backing the Serverless Hosted API.
Scale-up time is on the order of a minute or two.
Dedicated Deployments Availability
Dedicated Deployments are only available to Roboflow Workspaces with an active subscription (and are not available on the free trial). They are billed hourly.
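Because Serverless is billed per call and Dedicated Deployments are billed hourly, a rough break-even point is the hourly rate divided by the per-call price. The sketch below assumes one billed unit per request, a dedicated machine that runs (and is billed) for the whole hour, and hypothetical prices you supply yourself. It ignores free-tier credits, cold starts, per-machine throughput limits, and the capability differences in the table, so treat it as a first-pass estimate only.

```python
# Rough break-even between per-call (Serverless) and hourly (Dedicated) billing.
# All prices are hypothetical placeholders; substitute your plan's actual rates.


def break_even_calls_per_hour(hourly_rate: float, price_per_call: float) -> float:
    """Calls per hour at which one hour of per-call billing equals the hourly rate."""
    if hourly_rate < 0 or price_per_call <= 0:
        raise ValueError("need hourly_rate >= 0 and price_per_call > 0")
    return hourly_rate / price_per_call


def cheaper_billing(calls_per_hour: float, hourly_rate: float, price_per_call: float) -> str:
    """Compare one hour of traffic under each model, assuming the dedicated
    machine is billed for the full hour. Ties go to serverless (no commitment)."""
    per_call_cost = calls_per_hour * price_per_call
    return "dedicated" if hourly_rate < per_call_cost else "serverless"


if __name__ == "__main__":
    rate, per_call = 2.0, 0.001  # hypothetical prices, for illustration only
    print(f"break-even: about {break_even_calls_per_hour(rate, per_call):.0f} calls/hour")
    for load in (100, 10_000):
        print(load, "calls/hour ->", cheaper_billing(load, rate, per_call))
```

Bursty or intermittent traffic favors per-call billing even above the break-even rate, since the Serverless API scales to zero between bursts while an hourly machine keeps accruing charges.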
Self Hosting
Running at the edge is a core priority and focus area of Inference. In many use cases, latency matters, bandwidth is limited, interfacing with local devices is key, and resiliency to Internet outages is mandatory.
Running locally on a development machine, an AI computer, or an edge device is as simple as starting a Docker container.
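As a minimal sketch, the script below starts the server container from Python and then waits until its port accepts connections. The image name `roboflow/roboflow-inference-server-cpu` and port 9001 are assumptions to check against the installation instructions, and other hardware may need a different image. The helper names (`docker_run_command`, `wait_for_port`, `start_server`) are hypothetical.

```python
import shutil
import socket
import subprocess
import time

# Assumptions to verify against the installation docs: the CPU image name and
# the port the server listens on. Other hardware may need a different image.
IMAGE = "roboflow/roboflow-inference-server-cpu:latest"
PORT = 9001


def docker_run_command(image: str = IMAGE, port: int = PORT) -> list[str]:
    """Build a detached `docker run` command that publishes the server port."""
    return ["docker", "run", "--rm", "-d", "-p", f"{port}:{port}", image]


def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Poll until host:port accepts a TCP connection; False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not listening yet; try again shortly
    return False


def start_server(image: str = IMAGE, port: int = PORT, timeout: float = 120.0) -> bool:
    """Start the container, then wait for its port. An open port means the
    server is listening, not that every model has finished loading."""
    if shutil.which("docker") is None:
        raise RuntimeError("docker was not found on PATH")
    subprocess.run(docker_run_command(image, port), check=True)
    return wait_for_port("localhost", port, timeout)
```

Calling `start_server()` returns `True` once the port answers. Running detached (`-d`) keeps the script from blocking on the container, and `--rm` removes the container after it is stopped.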
Self-Hosted Pricing
Basic usage of self-hosted Inference Servers is completely free.
Workflows that require a Roboflow API Key to access features powered by Roboflow Cloud (for example, the private model repository) are metered. They consume credits, which cost money once a generous free tier is used up, based on the number of images or the hours of video processed.
Detailed installation instructions and device-specific performance tips are here.
Bring Your Own Cloud
Enterprise compliance policies regarding sensitive data sometimes require running workloads on-premises. This is supported by self-hosting Inference in your own cloud. Billing is the same as for self-hosting on an edge device.
Next Steps
Once you've decided on a deployment method and have a server running, interfacing with it is easy.