The Inference Pipeline interface is made for streaming and is likely the best route for real-time use cases. It is an asynchronous interface that can consume many different video sources, including local devices (like webcams), RTSP video streams, and video files. With this interface, you define the source of a video stream and the sinks that consume its predictions.
First, install Inference:
Prior to installation, you may want to configure a Python virtual environment to isolate the dependencies of Inference.
To install Inference via pip:

```bash
pip install inference
```
If you have an NVIDIA GPU, you can accelerate your inference with:

```bash
pip install inference-gpu
```
Next, create an Inference Pipeline:
```python
# import the InferencePipeline interface
from inference import InferencePipeline
# import a built-in sink called render_boxes (sinks are the logic that happens after inference)
from inference.core.interfaces.stream.sinks import render_boxes

# create an inference pipeline object
pipeline = InferencePipeline.init(
    model_id="yolov8x-1280", # set the model id to a yolov8x model with input size 1280
    video_reference="https://storage.googleapis.com/com-roboflow-marketing/inference/people-walking.mp4", # set the video reference (source of video): a link/path to a video file, an RTSP stream url, or an integer representing a device id (usually 0 for built-in webcams)
    on_prediction=render_boxes, # tell the pipeline object what to do with each set of inference results by passing a function
    api_key=api_key, # provide your Roboflow API key for loading models from the Roboflow API
)
# start the pipeline
pipeline.start()
# wait for the pipeline to finish
pipeline.join()
```
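Note that `api_key` must be defined before `InferencePipeline.init(...)` is called. A minimal sketch, assuming the key is stored in a `ROBOFLOW_API_KEY` environment variable (the variable name here is just a convention):

```python
import os

# read the Roboflow API key from an environment variable
api_key = os.environ.get("ROBOFLOW_API_KEY")
```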
Let's break down the example line by line:
- `pipeline = InferencePipeline.init(...)`: Here, we are calling a class method of `InferencePipeline`.
- `model_id="yolov8x-1280"`: We set the model ID to a YOLOv8x model pre-trained on COCO with input resolution 1280.
- `video_reference="https://..."`: We set the video reference to a URL. Later we will show the various values that can be used as a video reference.
- `on_prediction=render_boxes`: The `on_prediction` argument defines our sink (or a list of sinks).
- `pipeline.start()` and `pipeline.join()`: Here, we start and join the thread that processes the video stream.
Inference Pipelines can consume many different types of video streams (a short sketch follows the list below):
- Device Id (integer): Providing an integer instructs a pipeline to stream video from a local device, like a webcam. Typically, built-in webcams show up as device `0`.
- Video File (string): Providing the path to a video file will result in the pipeline reading every frame from the file, running inference with the specified model, then running the `on_prediction` method with each set of resulting predictions.
- Video URL (string): Providing a video URL is equivalent to providing a video file path and avoids needing to first download the video.
- RTSP URL (string): Providing an RTSP URL will result in the pipeline streaming frames from an RTSP stream as fast as possible, then running the `on_prediction` callback on the latest available frame.
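For example, any of the following would be a valid `video_reference` for the same pipeline (a minimal sketch; the device id, file path, and RTSP URL are illustrative, and the API key is assumed to live in a `ROBOFLOW_API_KEY` environment variable):

```python
import os

from inference import InferencePipeline
from inference.core.interfaces.stream.sinks import render_boxes

# a local webcam (device id 0 is typical for built-in webcams)
webcam_reference = 0
# a video file on disk (illustrative path)
file_reference = "./people-walking.mp4"
# an RTSP stream (illustrative URL)
rtsp_reference = "rtsp://192.168.1.10:554/stream"

pipeline = InferencePipeline.init(
    model_id="yolov8x-1280",
    video_reference=webcam_reference,  # swap in file_reference or rtsp_reference as needed
    on_prediction=render_boxes,
    api_key=os.environ.get("ROBOFLOW_API_KEY"),  # assumed environment variable
)
pipeline.start()
pipeline.join()
```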
Sinks define what an Inference Pipeline should do with each prediction. A sink is a function with a signature like:
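```python
from inference.core.interfaces.camera.entities import VideoFrame

def my_sink(  # the function name is arbitrary
    predictions: dict,        # the prediction results for a frame
    video_frame: VideoFrame,  # the frame the predictions were made on
    **kwargs,                 # optional keyword arguments for configuring the sink
) -> None:
    ...
```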
The arguments are:

- `predictions`: A dictionary that is the response object resulting from a call to a model's inference method.
- `video_frame`: A `VideoFrame` object containing metadata and pixel data from the video frame.
- `**kwargs`: Other keyword arguments can be defined to configure a sink.
To create a custom sink, define a new function with the appropriate signature.
```python
# import the VideoFrame object for type hints
from inference.core.interfaces.camera.entities import VideoFrame

def my_custom_sink(  # the function name is up to you
    predictions: dict, # predictions are dictionaries
    video_frame: VideoFrame, # video frames are python objects with metadata and the video frame itself
) -> None:
    # put your custom logic here
    pass
```
Predictions are provided to the sink as a dictionary containing keys:

- `predictions` (list): A list of prediction dictionaries
Each prediction dictionary contains keys:
- `x`: The center x coordinate of the predicted bounding box in pixels
- `y`: The center y coordinate of the predicted bounding box in pixels
- `width`: The width of the predicted bounding box in pixels
- `height`: The height of the predicted bounding box in pixels
- `confidence`: The confidence value of the prediction (between 0 and 1)
- `class`: The predicted class name
- `class_id`: The predicted class ID
The video frame is provided as a `VideoFrame` object with the following attributes:

- `image`: A numpy array containing the image pixels
- `frame_id`: An integer frame ID; a monotonically increasing integer starting at 0 from the time the pipeline was started
- `frame_timestamp`: A Python datetime object of when the frame was captured
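Putting the pieces together, here is a minimal sketch of a custom sink that logs each detection along with frame metadata, wired into a pipeline (the model ID, video reference, and the `log_detections` name are illustrative; the API key is assumed to live in a `ROBOFLOW_API_KEY` environment variable):

```python
import os

from inference import InferencePipeline
from inference.core.interfaces.camera.entities import VideoFrame


def log_detections(predictions: dict, video_frame: VideoFrame, **kwargs) -> None:
    # iterate over the list of prediction dictionaries for this frame
    for prediction in predictions.get("predictions", []):
        print(
            f"frame {video_frame.frame_id} @ {video_frame.frame_timestamp}: "
            f"{prediction['class']} ({prediction['confidence']:.2f}) "
            f"center=({prediction['x']}, {prediction['y']}) "
            f"size={prediction['width']}x{prediction['height']}px"
        )


pipeline = InferencePipeline.init(
    model_id="yolov8x-1280",                     # example model
    video_reference=0,                           # example source: a local webcam
    on_prediction=log_detections,                # the custom sink defined above
    api_key=os.environ.get("ROBOFLOW_API_KEY"),  # assumed environment variable
)
pipeline.start()
pipeline.join()
```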
Built In Sinks
Inference has several sinks built in that are ready to use.
The `render_boxes` sink is made to visualize predictions and overlay them on a stream. It uses Supervision annotators to render the predictions and display the annotated frame.
The UDP sink is made to broadcast predictions over a UDP port. This port can be listened to by client code for further processing.
The Multi-Sink is a way to combine multiple sinks so that multiple actions can happen on a single inference result.
The Video File Sink visualizes predictions, similar to the `render_boxes(...)` sink; however, instead of displaying the annotated frames, it saves them to a video file.
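As a rough illustration of the multi-sink idea, a single wrapper function can fan one inference result out to several sinks. This sketch does not use Inference's built-in multi-sink helper; it simply calls `render_boxes` (the built-in sink shown earlier) and a stand-in custom sink in sequence:

```python
from inference.core.interfaces.camera.entities import VideoFrame
from inference.core.interfaces.stream.sinks import render_boxes


def log_detections(predictions: dict, video_frame: VideoFrame, **kwargs) -> None:
    # a trivial stand-in for your own custom sink
    print(f"frame {video_frame.frame_id}: {len(predictions.get('predictions', []))} detections")


def combined_sink(predictions: dict, video_frame: VideoFrame, **kwargs) -> None:
    # visualize the predictions with the built-in sink
    render_boxes(predictions, video_frame)
    # then run any additional custom logic on the same result
    log_detections(predictions, video_frame)


# use it like any other sink:
# pipeline = InferencePipeline.init(..., on_prediction=combined_sink, ...)
```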
Other Pipeline Configuration
Inference Pipelines are highly configurable. Configurations include:
- `max_fps`: Used to set the maximum rate of frame processing.
- `confidence`: Confidence threshold used for inference.
- `iou_threshold`: IoU threshold used for inference.
```python
pipeline = InferencePipeline.init(
    ...,                # model_id, video_reference, on_prediction, api_key as before
    max_fps=10,         # illustrative value
    confidence=0.5,     # illustrative value
    iou_threshold=0.5,  # illustrative value
)
```
See the reference docs for the full list of Inference Pipeline parameters.
We tested the performance of Inference on a variety of hardware devices.
Below are the results of our benchmarking tests for Inference.
Tested against the same 1080p@60fps RTSP stream emitted by localhost.
Jetson Orin Nano
The old version reached at most 6-7 fps. This test was executed against a 4K@60fps stream, which cannot be decoded at native pace due to resource constraints. The new implementation proved to run without stability issues for a few hours straight.
A GPU workstation with a Tesla T4 was able to run 4 concurrent HD streams at 15 FPS utilising ~80% GPU, reaching over 60 FPS throughput per GPU (against
Inference is deprecating support for `inference.Stream`, our video stream inference interface. `inference.Stream` is being replaced with `InferencePipeline`, which has feature parity and achieves better performance. There are also new, more advanced features available in `InferencePipeline`.
New Features in InferencePipeline
The new implementation allows `InferencePipeline` to re-connect to a video source, eliminating the need to create additional logic to run inference against streams for long hours in a fault-tolerant mode.
Granularity of control
The new implementation lets you decide how video sources are handled, and provides automatic selection of the processing mode. Video files are processed frame-by-frame, with each frame passed to the model; streams are processed to provide continuous, up-to-date predictions on the freshest frames, with the system automatically adjusting to the performance of the hardware to ensure the best experience.
The new implementation allows you to create reports about the `InferencePipeline` state at runtime, providing an easy way to build monitoring on top of it.
Let's assume you used `inference.Stream(...)` with your custom handlers:

```python
import numpy as np

def on_prediction(predictions: dict, image: np.ndarray) -> None:
    # your existing handler logic
    pass
```
Now, the structure of handlers has changed into:
```python
from inference.core.interfaces.camera.entities import VideoFrame

def on_prediction(predictions: dict, video_frame: VideoFrame) -> None:
    pass
```
With `predictions` still being a dict in the same, standard Roboflow format, `video_frame` is a dataclass with the following properties (see the sketch below):

- `image`: the video frame itself (`np.ndarray`)
- `frame_id`: int value representing the place of the frame in stream order
- `frame_timestamp`: time of frame grabbing - the exact moment when the frame appeared in the file/stream on the receiver side (`datetime.datetime`)
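For example, an old handler that worked directly on the `image` array can be adapted by reading the pixels from `video_frame.image` (a minimal sketch; `process_frame` is a hypothetical stand-in for your existing logic):

```python
import numpy as np

from inference.core.interfaces.camera.entities import VideoFrame


def process_frame(image: np.ndarray, predictions: dict) -> None:
    # stand-in for whatever your old inference.Stream handler did with the frame
    print(f"{len(predictions.get('predictions', []))} detections on a "
          f"{image.shape[1]}x{image.shape[0]} frame")


def on_prediction(predictions: dict, video_frame: VideoFrame) -> None:
    # the raw pixels now live on the VideoFrame object
    process_frame(video_frame.image, predictions)
```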
Additionally, it eliminates the need to grab
`InferencePipeline` exposes an interface to manage its state (possibly from a different thread), including