Segment Anything 3 (SAM 3)¶
Segment Anything 3 (SAM 3) is a unified foundation model for promptable segmentation in images and videos. It builds upon SAM 2 by introducing the ability to exhaustively segment all instances of an open-vocabulary concept specified by a short text phrase or exemplars.
SAM 3 can detect, segment, and track objects using:
- Text prompts (e.g., "a person", "red car"): segment every instance of a concept
- Exemplar box prompts: box one example object and segment every similar instance, optionally combined with text and with negative exemplars to exclude lookalikes
- Interactive visual prompts (points, boxes): segment one specific object, SAM 2 style
How to Use SAM 3 with Inference¶
You can use SAM 3 via the Inference Python SDK or the HTTP API.
Prerequisites¶
To use SAM 3, you will need a Roboflow API key. Sign up for a free Roboflow account to retrieve your key.
Python SDK¶
You can run SAM 3 locally using the inference package.
1. Install the package¶
uv pip install inference-gpu[sam3]
2. Run Inference¶
Here is an example of how to use SAM 3 with a text prompt to segment objects.
import os
os.environ["API_KEY"] = "<YOUR_ROBOFLOW_API_KEY>"
from inference.models.sam3 import SegmentAnything3
from inference.core.entities.requests.sam3 import Sam3Prompt
# Initialize the model
# The model will automatically download weights if not present
model = SegmentAnything3(model_id="sam3/sam3_final")
# Define your image (can be a path, URL, or numpy array)
image_path = "path/to/your/image.jpg"
# Define prompts
# SAM 3 supports text prompts, exemplar box prompts, and combinations of both
prompts = [
# Text prompt: segment every instance of a concept
Sam3Prompt(type="text", text="person"),
# Exemplar prompt: box one example object (absolute pixels, top-left
# anchored XYWH) and segment every similar instance in the image.
# box_labels is required: 1 = positive exemplar, 0 = negative exemplar.
Sam3Prompt(
type="visual",
boxes=[Sam3Prompt.Box(x=1409, y=705, width=112, height=183)],
box_labels=[1],
),
# Combined prompt: text narrowed by exemplars. Here the negative
# exemplar suppresses instances similar to the second box.
Sam3Prompt(
type="visual",
text="car",
boxes=[
Sam3Prompt.BoxXYXY(x0=100, y0=200, x1=300, y1=400),
Sam3Prompt.BoxXYXY(x0=500, y0=200, x1=700, y1=400),
],
box_labels=[1, 0],
),
]
# Run inference
response = model.segment_image(
image=image_path,
prompts=prompts,
output_prob_thresh=0.5,
format="polygon" # or "rle", "json"
)
# Process results
for prompt_result in response.prompt_results:
    print(f"Prompt: {prompt_result.echo.text}")
    for prediction in prompt_result.predictions:
        print(f"  Confidence: {prediction.confidence}")
        print(f"  Mask: {prediction.masks}")
Interactive Segmentation (SAM 2 Style)¶
SAM 3 also supports interactive segmentation using points and boxes, maintaining compatibility with the SAM 2 interface. This is handled by the Sam3ForInteractiveImageSegmentation class.
This mode is ideal for "human-in-the-loop" workflows where you want to refine masks using clicks or bounding boxes.
from inference.models.sam3 import Sam3ForInteractiveImageSegmentation
# Initialize the interactive model
model = Sam3ForInteractiveImageSegmentation(model_id="sam3/sam3_final")
# Embed the image (calculates image features)
embedding, img_shape, image_id = model.embed_image(image="path/to/image.jpg")
# Segment with a point prompt
# Each point is {"x", "y", "positive"}: positive=True includes the clicked region, positive=False excludes it
masks, scores, logits = model.segment_image(
image_id=image_id,
prompts={
"points": [{"x": 500, "y": 400, "positive": True}]
}
)
# The result 'masks' contains the segmentation masks for the prompt
HTTP API¶
You can run SAM 3 via the Inference HTTP API. This is useful if you are running the Inference Server in Docker.
SAM 3 exposes two main modes via the API:
1. Promptable Visual Segmentation (PVS): Similar to SAM 2, using points and boxes.
2. Promptable Concept Segmentation (PCS): Using text prompts or mixed text/visual prompts.
1. Start the Server¶
docker run -it --rm -p 9001:9001 --gpus=all roboflow/inference-server:latest
2. Concept Segmentation (Text and Exemplar Prompts)¶
This is the most common usage for SAM 3, allowing you to segment all instances of a concept. Concepts can be described by text:
curl -X POST 'http://localhost:9001/sam3/concept_segment?api_key=<YOUR_API_KEY>' \
-H 'Content-Type: application/json' \
-d '{
"image": {
"type": "url",
"value": "https://media.roboflow.com/inference/sample.jpg"
},
"prompts": [
{ "type": "text", "text": "cat" },
{ "type": "text", "text": "dog" }
]
}'
Concepts can also be described by exemplar boxes — box one example object and SAM 3 segments every similar instance. Boxes use absolute pixel coordinates, either top-left anchored XYWH ({"x", "y", "width", "height"}) or corner form ({"x0", "y0", "x1", "y1"}). box_labels is required alongside boxes: 1 marks a positive exemplar, 0 a negative exemplar to exclude lookalikes. Text and exemplars can be combined in one prompt:
curl -X POST 'http://localhost:9001/sam3/concept_segment?api_key=<YOUR_API_KEY>' \
-H 'Content-Type: application/json' \
-d '{
"image": {
"type": "url",
"value": "https://media.roboflow.com/inference/sample.jpg"
},
"prompts": [
{
"type": "visual",
"text": "dog",
"boxes": [
{ "x": 100, "y": 200, "width": 150, "height": 120 },
{ "x0": 400, "y0": 200, "x1": 550, "y1": 320 }
],
"box_labels": [1, 0]
}
]
}'
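The two box encodings and the box_labels convention described above can be sketched in plain Python. The helper names xywh_to_xyxy and build_exemplar_prompt are hypothetical (they are not part of the inference package); the sketch only assembles a prompt dict in the shape shown in the request above.

```python
from typing import Dict, List, Optional, Sequence

Box = Dict[str, float]


def xywh_to_xyxy(box: Box) -> Box:
    """Convert a top-left anchored XYWH box to corner form (x0, y0, x1, y1)."""
    return {
        "x0": box["x"],
        "y0": box["y"],
        "x1": box["x"] + box["width"],
        "y1": box["y"] + box["height"],
    }


def build_exemplar_prompt(
    positive: Sequence[Box],
    negative: Sequence[Box] = (),
    text: Optional[str] = None,
) -> dict:
    """Hypothetical helper: build a concept-segmentation prompt dict.

    box_labels is kept aligned with boxes: 1 for each positive exemplar,
    0 for each negative exemplar.
    """
    boxes: List[Box] = list(positive) + list(negative)
    if not boxes:
        raise ValueError("at least one exemplar box is required")
    prompt = {
        "type": "visual",
        "boxes": boxes,
        "box_labels": [1] * len(positive) + [0] * len(negative),
    }
    if text is not None:
        prompt["text"] = text
    return prompt


prompt = build_exemplar_prompt(
    positive=[{"x": 100, "y": 200, "width": 150, "height": 120}],
    negative=[{"x0": 400, "y0": 200, "x1": 550, "y1": 320}],
    text="dog",
)
```

Generating box_labels from the positive and negative lists keeps the two arrays the same length by construction, which is the easiest mistake to make when writing the payload by hand.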
3. Visual Segmentation (Points/Boxes)¶
For interactive segmentation similar to SAM 2, you can use the visual segmentation endpoints.
Step 1: Embed the Image (Optional but recommended for speed)
curl -X POST 'http://localhost:9001/sam3/embed_image?api_key=<YOUR_API_KEY>' \
-H 'Content-Type: application/json' \
-d '{
"image": {
"type": "url",
"value": "https://media.roboflow.com/inference/sample.jpg"
}
}'
# Returns an "image_id"
Step 2: Segment with Points and/or a Box
curl -X POST 'http://localhost:9001/sam3/visual_segment?api_key=<YOUR_API_KEY>' \
-H 'Content-Type: application/json' \
-d '{
"image_id": "<IMAGE_ID_FROM_STEP_1>",
"prompts": [
{
"points": [ { "x": 100, "y": 100, "positive": true } ],
"box": { "x": 100, "y": 100, "width": 200, "height": 150 }
}
],
"multimask_output": false
}'
A prompt can contain points, a box, or both. Positive points include the clicked region; negative points ("positive": false) exclude it — add points to iteratively refine the mask. Note that the PVS box is center-anchored XYWH (x, y is the box center), unlike concept segmentation boxes which are top-left anchored.
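Because the two endpoints anchor boxes differently, a box taken from a detector or from a concept-segmentation prompt needs converting before it is sent to visual segmentation. A minimal sketch, using hypothetical helper names that are not part of the inference package:

```python
from typing import Dict

Box = Dict[str, float]


def top_left_to_center(box: Box) -> Box:
    """Top-left anchored XYWH -> center-anchored XYWH (the PVS convention)."""
    return {
        "x": box["x"] + box["width"] / 2,
        "y": box["y"] + box["height"] / 2,
        "width": box["width"],
        "height": box["height"],
    }


def center_to_top_left(box: Box) -> Box:
    """Center-anchored XYWH -> top-left anchored XYWH (the PCS convention)."""
    return {
        "x": box["x"] - box["width"] / 2,
        "y": box["y"] - box["height"] / 2,
        "width": box["width"],
        "height": box["height"],
    }


# A 200x150 box whose top-left corner is at (0, 25) is centered at (100, 100).
pvs_box = top_left_to_center({"x": 0, "y": 25, "width": 200, "height": 150})
```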
The response contains the single highest-confidence mask for the prompt: multimask_output controls how many internal mask proposals the model generates (three when true), but the best proposal is always selected for the response. Send one prompt per request — multiple prompts in one request currently return only one prediction.
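Given the one-prompt-per-request limitation, one way to handle several prompts for the same image is to build a separate request body for each. The helper below is a hypothetical sketch that only constructs the payloads; each one would then be POSTed to the /sam3/visual_segment endpoint shown in Step 2.

```python
from typing import Any, Dict, Iterable, List


def one_payload_per_prompt(
    image_id: str,
    prompts: Iterable[Dict[str, Any]],
    multimask_output: bool = False,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: wrap each prompt in its own request body."""
    return [
        {
            "image_id": image_id,
            "prompts": [prompt],  # exactly one prompt per request
            "multimask_output": multimask_output,
        }
        for prompt in prompts
    ]


payloads = one_payload_per_prompt(
    image_id="<IMAGE_ID_FROM_STEP_1>",
    prompts=[
        {"points": [{"x": 100, "y": 100, "positive": True}]},
        {"box": {"x": 100, "y": 100, "width": 200, "height": 150}},
    ],
)
```

Reusing the same image_id across the payloads lets every request share the embedding computed in Step 1.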
Workflow Integration¶
SAM 3 is fully integrated into Inference Workflows. Two blocks are available:
- The SAM 3 block runs concept segmentation: use Text Prompts to segment all instances of a class by name.
- The SAM 3 Interactive block runs promptable visual segmentation (PVS): use Point Prompts (positive/negative clicks) and/or Box Prompts from other detection models (like YOLO) to segment specific objects.
Example: Text Prompting in Workflows¶
- Add a SAM 3 block to your workflow.
- Connect an image input.
- In the class_names field, enter the classes you want to segment (e.g., ["person", "vehicle"]).
- The block will output instance segmentation predictions compatible with other workflow steps.
Example: Point Prompting in Workflows¶
- Add a SAM 3 Interactive block to your workflow.
- Connect an image input.
- In the points field, provide labeled points (or connect a workflow input of kind labeled_points), e.g. [{"x": 320, "y": 240, "positive": true}, {"x": 100, "y": 100, "positive": false}]. Positive points mark the object to segment; negative points exclude regions to refine the mask.
- Optionally connect detections from another model to the boxes field; each box becomes a separate prompt and its class name is forwarded to the predicted mask.
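The labeled-points value described above can be assembled from raw click coordinates. This is a minimal sketch with a hypothetical helper name; it produces the same list-of-dicts shape as the example in the steps above.

```python
import json
from typing import Dict, Iterable, List, Tuple, Union

Point = Dict[str, Union[int, bool]]


def labeled_points(
    positive_clicks: Iterable[Tuple[int, int]],
    negative_clicks: Iterable[Tuple[int, int]] = (),
) -> List[Point]:
    """Hypothetical helper: turn (x, y) clicks into labeled point dicts."""
    points: List[Point] = [
        {"x": x, "y": y, "positive": True} for x, y in positive_clicks
    ]
    points += [{"x": x, "y": y, "positive": False} for x, y in negative_clicks]
    return points


points = labeled_points(positive_clicks=[(320, 240)], negative_clicks=[(100, 100)])
print(json.dumps(points))
```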
Video Tracking in Workflows¶
The SAM3 Video Tracker workflow block (roboflow_core/sam3_video@v1) runs
SAM3's streaming concept tracker frame by frame: you provide the concepts to
track as text in class_names, and the model runs fused detection and
tracking on every frame. Objects matching a concept keep a stable
tracker_id across frames, and — unlike detector-seeded tracking — objects
that enter the scene mid-stream are picked up automatically, with no
re-prompting and no upstream detection model required. Each emitted mask
carries the concept it matched as its class name and the model's detection
score as its confidence (filter with threshold, default 0.5).
Key properties:
- Stateful and local-only. The block keeps one tracking session per video_metadata.video_identifier and requires WORKFLOWS_STEP_EXECUTION_MODE=local; drive it with InferencePipeline, which delivers frames one at a time with video metadata attached. A GPU is required.
- No prompt scheduling. Concept prompts are registered on the session once; the session is only re-seeded when the stream restarts or class_names changes. For detector-driven (box-prompted) video tracking, use the SAM2 Video Tracker block instead (see the SAM 2 documentation); it also accepts sam3trackervideo as model_id to run SAM3's visually prompted tracker, which shares the sam3video weights package.
- Model. model_id defaults to sam3video, the HuggingFace transformers port of SAM3 video, which exposes the frame-by-frame streaming interface (the native sam3 package's video predictor requires the whole video upfront and cannot be used for live streams).
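The local-only requirement can be checked before the pipeline is built. The sketch below uses a hypothetical helper, ensure_local_step_execution, and assumes the inference package reads WORKFLOWS_STEP_EXECUTION_MODE at import time, so the helper should be called before importing InferencePipeline.

```python
import os

ENV_VAR = "WORKFLOWS_STEP_EXECUTION_MODE"


def ensure_local_step_execution() -> str:
    """Hypothetical guard: default the mode to "local" and reject other values."""
    # setdefault leaves an existing value untouched, so an explicit
    # non-local setting is reported instead of being silently overridden.
    mode = os.environ.setdefault(ENV_VAR, "local")
    if mode != "local":
        raise RuntimeError(
            f"{ENV_VAR}={mode!r}, but the SAM3 Video Tracker block requires 'local'"
        )
    return mode
```

Failing fast here surfaces a misconfigured environment before any video frames are processed.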
Example¶
from inference import InferencePipeline
from inference.core.interfaces.stream.sinks import render_boxes
WORKFLOW = {
"version": "1.0",
"inputs": [{"type": "InferenceImage", "name": "image"}],
"steps": [
{
"type": "roboflow_core/sam3_video@v1",
"name": "tracker",
"images": "$inputs.image",
"class_names": ["person", "forklift"],
"threshold": 0.5,
},
],
"outputs": [
{
"type": "JsonField",
"name": "predictions",
"selector": "$steps.tracker.predictions",
}
],
}
pipeline = InferencePipeline.init_with_workflow(
video_reference="path/to/video.mp4", # or an RTSP URL / camera id
workflow_specification=WORKFLOW,
on_prediction=render_boxes,
api_key="<YOUR-API-KEY>",
)
pipeline.start()
pipeline.join()
Capabilities & Features¶
- Open Vocabulary Segmentation: Unlike SAM 2, which requires visual prompts, SAM 3 can find objects based on text descriptions.
- High Performance: Achieves state-of-the-art performance on open-vocabulary benchmarks.
- Unified Architecture: Handles both detection and segmentation in a single model.
For more technical details, refer to the official SAM 3 paper.
How to Use SAM 3 with Hot SAM 3 Instances Maintained by Roboflow¶
The examples below use Roboflow's serverless infrastructure, which handles GPU provisioning automatically. This makes it well suited to applications that need on-demand segmentation without managing infrastructure.
1. SAM3 Concept Segmentation workflow¶
This example uses SAM3 through a workflow, which lets you combine SAM3's concept segmentation with visualization in a single pipeline. Here, we segment all dogs in an image and automatically visualize the results with polygon overlays.
If you have created a workflow on the Roboflow platform, you can pass workspace_name and workflow_id instead of specification to run it.
import base64
import cv2 as cv
import numpy as np
from inference_sdk import InferenceHTTPClient
# 2. Connect to your workflow
client = InferenceHTTPClient(
api_url="https://serverless.roboflow.com",
api_key="<YOUR_ROBOFLOW_API_KEY>"
)
# 3. Run your workflow on an image
workflow_spec = {
"version": "1.0",
"inputs": [
{
"type": "InferenceImage",
"name": "image"
}
],
"steps": [
{
"type": "roboflow_core/sam3@v1",
"name": "sam",
"images": "$inputs.image",
"class_names": "dog"
},
{
"type": "roboflow_core/polygon_visualization@v1",
"name": "polygon_visualization",
"image": "$inputs.image",
"predictions": "$steps.sam.predictions"
}
],
"outputs": [
{
"type": "JsonField",
"name": "output",
"coordinates_system": "own",
"selector": "$steps.polygon_visualization.image"
}
]
}
result = client.run_workflow(
specification=workflow_spec,
images={
"image": "https://media.roboflow.com/inference/dog.jpeg" # Path or url to your image file
},
use_cache=True # Speeds up repeated requests
)
# 4. Display the result
nparr = np.frombuffer(base64.b64decode(result[0]["output"]), np.uint8)
img = cv.imdecode(nparr, cv.IMREAD_COLOR)
cv.imshow("result", img)
cv.waitKey(0)
cv.destroyAllWindows()
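The choice between an inline specification and a saved workflow can be made explicit when building the arguments for client.run_workflow. The helper name and the placeholder values "my-workspace" and "my-sam3-workflow" are hypothetical; only the keyword names specification, workspace_name, workflow_id and images come from the text above.

```python
from typing import Any, Dict, Optional


def run_workflow_kwargs(
    images: Dict[str, Any],
    specification: Optional[dict] = None,
    workspace_name: Optional[str] = None,
    workflow_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for client.run_workflow."""
    if specification is not None:
        if workspace_name or workflow_id:
            raise ValueError("pass either specification or workspace_name/workflow_id")
        return {"specification": specification, "images": images}
    if not (workspace_name and workflow_id):
        raise ValueError("workspace_name and workflow_id must be given together")
    return {
        "workspace_name": workspace_name,
        "workflow_id": workflow_id,
        "images": images,
    }


kwargs = run_workflow_kwargs(
    images={"image": "https://media.roboflow.com/inference/dog.jpeg"},
    workspace_name="my-workspace",   # placeholder
    workflow_id="my-sam3-workflow",  # placeholder
)
# result = client.run_workflow(**kwargs)
```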
2. SAM3 via the Inference SDK¶
The inference-sdk client wraps both SAM3 endpoints. Prompt dicts take the same shape as the HTTP payloads, so text, exemplar, and combined prompts all work:
from inference_sdk import InferenceHTTPClient
client = InferenceHTTPClient(
api_url="https://serverless.roboflow.com",
api_key="<YOUR_ROBOFLOW_API_KEY>",
)
# Concept segmentation: text, exemplar boxes, or both per prompt
result = client.sam3_concept_segment(
inference_input="https://media.roboflow.com/inference/people-walking.jpg",
prompts=[
{"type": "text", "text": "person"},
{
"type": "visual",
"boxes": [{"x": 1409, "y": 705, "width": 112, "height": 183}],
"box_labels": [1],
},
],
output_prob_thresh=0.5,
)
# Interactive visual segmentation: points and/or a center-anchored box
result = client.sam3_visual_segment(
inference_input="https://media.roboflow.com/inference/people-walking.jpg",
prompts=[{"points": [{"x": 1465, "y": 796, "positive": True}]}],
multimask_output=False,
)
3. SAM3 raw API¶
For direct API access to SAM3 without workflows, you can use Roboflow's serverless endpoint. This approach gives you raw segmentation results that you can process however you need. The example below shows how to segment a dog and draw the resulting polygon directly on the image using OpenCV.
import requests
import cv2 as cv
import numpy as np
response = requests.post(
"https://serverless.roboflow.com/sam3/concept_segment?api_key=<YOUR_ROBOFLOW_API_KEY>",
headers={
"Content-Type": "application/json"
},
json={
"format": "polygon",
"image": {
"type": "url",
"value": "https://media.roboflow.com/dog.jpeg"
},
"prompts": [
{ "text": "dog" }
]
}
)
img_req = requests.get("https://media.roboflow.com/dog.jpeg")
img_arr = np.asarray(bytearray(img_req.content), dtype=np.uint8)
img = cv.imdecode(img_arr, -1)
# cv.polylines expects int32 points, so cast explicitly
polygon_arr = np.array(response.json()["prompt_results"][0]["predictions"][0]["masks"][0], dtype=np.int32)
cv.polylines(img, [polygon_arr], True, (0, 200, 200), 3)
cv.imshow("result", img)
cv.waitKey(0)
cv.destroyAllWindows()
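The example above indexes the first polygon directly, which raises an error when a prompt returns no predictions. A more defensive sketch walks the whole response instead. It assumes the nesting seen above (prompt_results, then predictions, then masks as a list of polygons) plus a confidence field on each prediction; the helper name and the sample values are made up for illustration.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

Polygon = List[List[float]]


def iter_polygons(
    response_json: Dict[str, Any],
) -> Iterator[Tuple[int, Optional[float], Polygon]]:
    """Yield (prompt_index, confidence, polygon) for every returned mask."""
    for prompt_index, prompt_result in enumerate(response_json.get("prompt_results", [])):
        for prediction in prompt_result.get("predictions", []):
            for polygon in prediction.get("masks", []):
                yield prompt_index, prediction.get("confidence"), polygon


# Made-up response: one hit for the first prompt, none for the second.
sample = {
    "prompt_results": [
        {"predictions": [{"confidence": 0.9, "masks": [[[0, 0], [10, 0], [10, 10]]]}]},
        {"predictions": []},
    ]
}
found = list(iter_polygons(sample))
```

Each yielded polygon can be converted with np.array(polygon, dtype=np.int32) and passed to cv.polylines as in the example above.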