Parallel Inference
Note
This feature is only available for Roboflow Enterprise users. Contact our sales team to learn more about Roboflow Enterprise.
You can run multiple models in parallel using Inference with parallel processing, a version of Roboflow Inference that processes inference requests asynchronously.
Inference Parallel supports all the same features as Roboflow Inference, with the exception that it does not support Core models (such as CLIP and SAM).
With Inference Parallel, preprocessing, auto-batching, inference, and post-processing all run in separate threads to increase server FPS throughput.
Separate requests to the same model are batched on the fly, as allowed by $MAX_BATCH_SIZE, and response handling then occurs independently for each request. Images are passed via Python's SharedMemory module to maximize throughput.
These changes result in as much as a 76% speedup on one measured workload.
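For example, you can cap the on-the-fly batch size through the MAX_BATCH_SIZE environment variable. The sketch below is illustrative: the value of 8 is arbitrary, and depending on how run.sh (described below) launches the container, you may need to forward the variable into Docker with -e instead.

```bash
# Illustrative: cap on-the-fly batching at 8 images per batch (arbitrary value).
# If run.sh does not forward environment variables into the container,
# pass it to Docker instead, e.g. `docker run -e MAX_BATCH_SIZE=8 ...`.
export MAX_BATCH_SIZE=8
./inference/enterprise/parallel/run.sh
```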
How To Use Inference with Parallel Processing
You can run Inference with Parallel Processing in two ways: by building and running the server from source with the provided scripts, or by pulling a prebuilt image from Docker Hub.
First, build the parallel server:
./inference/enterprise/parallel/build.sh
Then, run the server:
./inference/enterprise/parallel/run.sh
A message will appear in the terminal indicating that the server is running and ready for use.
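Once the server reports ready, you can send a quick test request to confirm it is serving predictions. This is a minimal sketch: it assumes the server listens on the default port 9001 and exposes the hosted-API-compatible route, and the model ID and image URL are placeholders you should replace with your own.

```bash
# Hypothetical test request: replace the model ID and image URL with your own.
# Assumes the parallel server is listening on the default port 9001.
curl -s -X POST \
  "http://localhost:9001/soccer-players-5fuqs/1?api_key=$ROBOFLOW_API_KEY&image=https://example.com/image.jpg"
```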
We provide a container on Docker Hub that you can pull using docker pull roboflow/roboflow-inference-server-gpu-parallel:latest. If you are pulling a pinned tag, be sure to change the $TAG variable in run.sh.
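If you would rather start a pulled image directly instead of going through run.sh, a command along the following lines may work. The flags shown here (GPU access and the 9001 port mapping) are assumptions, so treat run.sh as the reference for the options the server actually expects.

```bash
# Illustrative only: run the pulled image directly.
# The GPU flag and port mapping are assumptions; see run.sh for the
# options used by the supported launch path.
docker run --rm --gpus all -p 9001:9001 \
  roboflow/roboflow-inference-server-gpu-parallel:latest
```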
Benchmarking
We evaluated the performance of Inference Parallel on a variety of models from Roboflow Universe. We compared the performance of Inference Parallel to the latest version of Inference Server (0.9.5.rc) on the same hardware.
We ran our tests on a computer with eight cores and one GPU. Instance segmentation metrics are calculated using "mask_decode_mode": "fast" in the request body. Requests are posted concurrently with a parallelism of 1000.
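For context, a concurrent client along these lines approximates that setup. It is a sketch rather than the benchmark script we used: the /infer/instance_segmentation route, the port, and the request body layout are assumptions, and water-08xpr/1 is used only because it appears in the results below.

```bash
# Sketch of a concurrent load test: 1000 requests, up to 100 in flight here
# to keep the example modest (the benchmark used a parallelism of 1000).
# request.json holds an instance segmentation request body, e.g.:
# {"model_id": "water-08xpr/1", "api_key": "YOUR_API_KEY",
#  "image": {"type": "url", "value": "https://example.com/image.jpg"},
#  "mask_decode_mode": "fast"}
seq 1000 | xargs -P 100 -I {} \
  curl -s -o /dev/null -X POST "http://localhost:9001/infer/instance_segmentation" \
    -H "Content-Type: application/json" \
    --data @request.json
```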
Here are the results of our tests:
| Workspace | Model | Model Type | Split | 0.9.5.rc FPS | 0.9.5.parallel FPS |
|---|---|---|---|---|---|
| senior-design-project-j9gpp | nbafootage/3 | object-detection | train | 30.2 | 44.03 |
| niklas-bommersbach-jyjff | dart-scorer/8 | object-detection | train | 26.6 | 47.0 |
| geonu | water-08xpr/1 | instance-segmentation | valid | 4.7 | 6.1 |
| university-of-bradford | detecting-drusen_1/2 | instance-segmentation | train | 6.2 | 7.2 |
| fy-project-y9ecd | cataract-detection-viwsu/2 | classification | train | 48.5 | 65.4 |
| hesunyu | playing-cards-ir0wr/1 | classification | train | 44.6 | 57.7 |
Inference with parallel processing enabled achieved higher FPS on every test. On some models, the FPS increase from using parallel processing was greater than 10 FPS.