VLM As Detector¶
v2¶
Class: VLMAsDetectorBlockV2 (there are multiple versions of this block)
Source: inference.core.workflows.core_steps.formatters.vlm_as_detector.v2.VLMAsDetectorBlockV2
Warning: This block has multiple versions. Please refer to the specific version for details. You can learn more about how versions work here: Versioning
Parses JSON strings from Visual Language Models (VLMs) and Large Language Models (LLMs) into the standardized object detection prediction format. The block extracts bounding boxes, class names, and confidences, converts normalized coordinates to pixel coordinates, maps class names to class IDs, and handles multiple model types and task formats. This enables VLM-based object detection, LLM detection parsing, and text-to-detection conversion workflows.
How This Block Works¶
This block converts VLM/LLM text outputs containing object detection predictions into standardized object detection format compatible with workflow detection blocks. The block:
1. Receives image and VLM output string containing detection results in JSON format
2. Parses JSON content from VLM output:
Handles Markdown-wrapped JSON:
- Searches for JSON wrapped in Markdown code blocks (```json ... ```)
- This format is common in LLM/VLM responses
- If multiple markdown JSON blocks are found, only the first block is parsed
- Extracts JSON content from within markdown tags
Handles raw JSON strings:
- If no markdown blocks are found, attempts to parse the entire string as JSON
- Supports standard JSON format strings
3. Selects appropriate parser based on model type and task type:
- Uses registered parsers that handle different model outputs (google-gemini, anthropic-claude, florence-2, openai)
- Supports multiple task types: object-detection, open-vocabulary-object-detection, object-detection-and-caption, phrase-grounded-object-detection, region-proposal, ocr-with-text-detection
- Each model/task combination uses a specialized parser for that format
4. Parses detection data based on model type:
For OpenAI/Gemini/Claude models:
- Extracts detections array from parsed JSON
- Converts normalized coordinates (0-1 range) to pixel coordinates using image dimensions
- Extracts class names, confidence scores, and bounding box coordinates
- Maps class names to class IDs using provided classes list
- Creates detection objects with bounding boxes, classes, and confidences
For Florence-2 model:
- Uses supervision's built-in LMM parser for Florence-2 format
- Handles different task types with specialized parsing (object detection, open vocabulary, region proposal, OCR, etc.)
- For region proposal tasks: assigns "roi" as class name
- For open vocabulary detection: uses provided classes list for class ID mapping
- For other tasks: uses MD5-based class ID generation or provided classes
- Sets confidence to 1.0 for Florence-2 detections (model doesn't provide confidence)
5. Converts coordinates and normalizes data:
- Converts normalized coordinates (0-1) to absolute pixel coordinates (x_min, y_min, x_max, y_max)
- Scales coordinates using image width and height
- Normalizes confidence scores to valid range [0.0, 1.0]
- Clamps confidence values outside the range
6. Creates class name to class ID mapping:
- For OpenAI/Gemini/Claude: uses provided classes list to create index mapping (class_name → class_id)
- Classes are mapped in order (first class = ID 0, second = ID 1, etc.)
- Classes not in the provided list get class_id = -1
- For Florence-2: uses different mapping strategies based on task type
7. Constructs object detection predictions:
- Creates supervision Detections objects with bounding boxes (xyxy format)
- Includes class IDs, class names, and confidence scores
- Adds metadata: detection IDs, inference IDs, image dimensions, prediction type
- Attaches parent coordinates for crop-aware detections
- Formats predictions in standard object detection format
8. Handles errors:
- Sets error_status to True if JSON parsing fails
- Sets error_status to True if detection parsing fails
- Returns None for predictions when errors occur
- Always includes inference_id for tracking
9. Returns object detection predictions:
- Outputs predictions in standard object detection format (compatible with detection blocks)
- Outputs error_status indicating parsing success/failure
- Outputs inference_id for tracking and lineage
The block enables using VLMs/LLMs for object detection by converting their text-based JSON outputs into standardized detection predictions that can be used in workflows like any other object detection model output.
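The JSON extraction in step 2 can be illustrated with a minimal sketch. This is not the block's actual implementation: the function name `extract_json` and the regular expression are assumptions made for this example, and the fence marker is built programmatically only to keep the snippet easy to embed.

```python
import json
import re
from typing import Any, Optional

# Three backticks, built programmatically so the snippet embeds cleanly.
FENCE = "`" * 3

# Assumed pattern: captures the body of a Markdown block opened with the fence + "json".
JSON_BLOCK = re.compile(
    re.escape(FENCE) + r"json\s*(.*?)\s*" + re.escape(FENCE),
    re.DOTALL,
)


def extract_json(vlm_output: str) -> Optional[Any]:
    """Parse the first Markdown-wrapped JSON block, else the whole string."""
    match = JSON_BLOCK.search(vlm_output)  # search() stops at the first block
    candidate = match.group(1) if match else vlm_output
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # The block described above reports this case through error_status.
        return None


wrapped = (
    "Here are the results:\n"
    + FENCE + 'json\n{"detections": []}\n' + FENCE + "\n"
    + FENCE + 'json\n{"ignored": true}\n' + FENCE
)
first_block = extract_json(wrapped)
raw = extract_json('{"a": 1}')
broken = extract_json("not json")
```

Here `first_block` holds only the content of the first Markdown block, `raw` comes from the raw-string fallback, and `broken` is `None`.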
Common Use Cases¶
- VLM-Based Object Detection: Use Visual Language Models for object detection by parsing VLM outputs into detection predictions (e.g., detect objects with GPT-4V, use Claude Vision for detection, parse Gemini detection outputs), enabling VLM detection workflows
- Open-Vocabulary Detection: Use VLMs for open-vocabulary object detection with custom classes (e.g., detect custom objects with VLMs, use open-vocabulary detection, detect objects not in training set), enabling open-vocabulary detection workflows
- Multi-Task Detection: Use VLMs for various detection tasks (e.g., object detection with captions, phrase-grounded detection, region proposal, OCR with detection), enabling multi-task detection workflows
- LLM Detection Parsing: Parse LLM text outputs containing detection results into standardized format (e.g., parse GPT detection outputs, convert LLM predictions to detection format, use LLMs for detection), enabling LLM detection workflows
- Text-to-Detection Conversion: Convert text-based detection outputs from models into workflow-compatible detection predictions (e.g., convert text predictions to detection format, parse text-based detections, convert model outputs to detections), enabling text-to-detection workflows
- VLM Integration: Integrate VLM outputs into detection workflows (e.g., use VLMs in detection pipelines, integrate VLM predictions with detection blocks, combine VLM and traditional detection), enabling VLM integration workflows
Connecting to Other Blocks¶
This block receives images and VLM outputs and produces object detection predictions:
- After VLM/LLM blocks to parse detection outputs into standard format (e.g., VLM output to detections, LLM output to detections, parse model outputs), enabling VLM-to-detection workflows
- Before detection-based blocks to use parsed detections (e.g., use parsed detections in workflows, provide detections to downstream blocks, use VLM detections with detection blocks), enabling detection-to-workflow workflows
- Before filtering blocks to filter VLM detections (e.g., filter by class, filter by confidence, apply filters to VLM predictions), enabling detection-to-filter workflows
- Before analytics blocks to analyze VLM detection results (e.g., analyze VLM detections, perform analytics on parsed detections, track VLM detection metrics), enabling detection analytics workflows
- Before visualization blocks to display VLM detection results (e.g., visualize VLM detections, display parsed detection predictions, show VLM detection outputs), enabling detection visualization workflows
- In workflow outputs to provide VLM detections as final output (e.g., VLM detection outputs, parsed detection results, VLM-based detection outputs), enabling detection output workflows
Version Differences¶
This version (v2) includes the following enhancements over v1:
- Improved Type System: The inference_id output now uses INFERENCE_ID_KIND instead of STRING_KIND, providing better type safety and semantic meaning for inference tracking identifiers in the workflow system
- OpenAI Model Support: Added support for OpenAI models in addition to Google Gemini, Anthropic Claude, and Florence-2 models, expanding the range of VLM/LLM models that can be used for object detection
- Enhanced Type Safety: Improved type system ensures better integration with workflow execution engine and provides clearer semantic meaning for inference tracking
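The difference in supported model types can be captured in a small lookup, for example when validating workflow definitions before submitting them. The helper below is a hypothetical sketch, not part of the inference package; only the type identifiers and model type strings come from this page.

```python
from typing import Dict, FrozenSet

# Model types per block version, as listed on this page.
SUPPORTED_MODEL_TYPES: Dict[str, FrozenSet[str]] = {
    "roboflow_core/vlm_as_detector@v1": frozenset(
        {"google-gemini", "anthropic-claude", "florence-2"}
    ),
    "roboflow_core/vlm_as_detector@v2": frozenset(
        {"openai", "google-gemini", "anthropic-claude", "florence-2"}
    ),
}


def model_type_supported(step_type: str, model_type: str) -> bool:
    """Hypothetical check: is model_type accepted by the given block version?"""
    return model_type in SUPPORTED_MODEL_TYPES.get(step_type, frozenset())


v2_accepts_openai = model_type_supported("roboflow_core/vlm_as_detector@v2", "openai")
v1_accepts_openai = model_type_supported("roboflow_core/vlm_as_detector@v1", "openai")
```

With this table, `v2_accepts_openai` is `True` and `v1_accepts_openai` is `False`, which mirrors the OpenAI support added in v2.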
Requirements¶
This block requires an image input (for metadata and dimensions) and a VLM output string containing JSON detection data. The JSON can be raw JSON or wrapped in Markdown code blocks (json ...). The block supports four model types: "openai", "google-gemini", "anthropic-claude", and "florence-2". It supports multiple task types: "object-detection", "open-vocabulary-object-detection", "object-detection-and-caption", "phrase-grounded-object-detection", "region-proposal", and "ocr-with-text-detection". The classes parameter is required for OpenAI, Gemini, and Claude models (to map class names to IDs) but optional for Florence-2 (some tasks don't require it). Classes are mapped to IDs by index (first class = 0, second = 1, etc.). Classes not in the list get class_id = -1. The block outputs object detection predictions in standard format (compatible with detection blocks), error_status (boolean), and inference_id (INFERENCE_ID_KIND) for tracking.
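The coordinate scaling, confidence clamping, and index-based class mapping described above can be sketched as follows. The key names `x_min`, `y_min`, `x_max`, and `y_max` and all function names are assumptions for illustration; the exact JSON structure depends on model_type and task_type.

```python
from typing import Dict, List, Tuple


def scale_box(box: Dict[str, float], width: int, height: int) -> Tuple[float, float, float, float]:
    """Convert a normalized (0-1) box to absolute xyxy pixel coordinates."""
    return (
        box["x_min"] * width,
        box["y_min"] * height,
        box["x_max"] * width,
        box["y_max"] * height,
    )


def clamp_confidence(value: float) -> float:
    """Force a confidence score into the valid range [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def map_class_ids(names: List[str], classes: List[str]) -> List[int]:
    """Map class names to IDs by position in `classes`; unknown names get -1."""
    lookup = {name: index for index, name in enumerate(classes)}
    return [lookup.get(name, -1) for name in names]


xyxy = scale_box({"x_min": 0.25, "y_min": 0.5, "x_max": 0.75, "y_max": 1.0}, width=640, height=480)
ids = map_class_ids(["cat", "zebra"], classes=["dog", "cat", "bird"])
```

For a 640x480 image, `xyxy` is `(160.0, 240.0, 480.0, 480.0)`, and `ids` is `[1, -1]` because "zebra" is not in the provided list.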
Type identifier¶
Use the following identifier in step "type" field: roboflow_core/vlm_as_detector@v2 to add the block as a step in your workflow.
Properties¶
| Name | Type | Description | Refs |
|---|---|---|---|
| name | str | Enter a unique identifier for this step. | ❌ |
| classes | List[str] | List of all class names used by the classification model, in order. Required to generate mapping between class names (from VLM output) and class IDs (for detection format). Classes are mapped to IDs by index: first class = ID 0, second = ID 1, etc. Classes from VLM output that are not in this list get class_id = -1. Required for OpenAI, Gemini, and Claude models. Optional for Florence-2 (some tasks don't require it). Should match the classes the VLM was asked to detect. | ✅ |
| model_type | str | Type of the VLM/LLM model that generated the prediction. Determines which parser is used to extract detection data from the JSON output. Supported models: 'openai' (GPT-4V), 'google-gemini' (Gemini Vision), 'anthropic-claude' (Claude Vision), 'florence-2' (Microsoft Florence-2). Each model type has different JSON output formats, so the correct model type must be specified for proper parsing. | ❌ |
| task_type | str | Task type performed by the VLM/LLM model. Determines which parser and format handler is used. Supported task types: 'object-detection' (standard object detection), 'open-vocabulary-object-detection' (detect objects with custom classes), 'object-detection-and-caption' (detection with captions), 'phrase-grounded-object-detection' (ground phrases to detections), 'region-proposal' (propose regions of interest), 'ocr-with-text-detection' (OCR with text region detection). The task type must match what the VLM/LLM was asked to perform. | ❌ |
The Refs column indicates whether the property can be parametrised with dynamic values available at workflow runtime. See Bindings for more info.
Available Connections¶
Compatible Blocks
Check what blocks you can connect to VLM As Detector in version v2.
- inputs:
Halo Visualization,GLM-OCR,Image Threshold,Stitch Images,Morphological Transformation,Classification Label Visualization,Crop Visualization,Icon Visualization,Stability AI Outpainting,Blur Visualization,Reference Path Visualization,MoonshotAI Kimi,OpenAI,Google Gemini,Anthropic Claude,Camera Focus,QR Code Generator,Size Measurement,Model Comparison Visualization,Florence-2 Model,Trace Visualization,Ellipse Visualization,Anthropic Claude,Dot Visualization,Perspective Correction,Label Visualization,Image Convert Grayscale,Florence-2 Model,Text Display,Llama 3.2 Vision,Qwen-VL,PLC ModbusTCP,Image Blur,Absolute Static Crop,SIFT,Google Gemini,Dimension Collapse,Qwen 3.5 API,Qwen 3.6 API,Triangle Visualization,Camera Focus,Contrast Equalization,Polygon Visualization,OpenAI,Heatmap Visualization,Clip Comparison,Google Gemma API,Detections List Roll-Up,Contrast Enhancement,Google Gemini,PLC EthernetIP,Halo Visualization,Color Visualization,Morphological Transformation,MoonshotAI Kimi,Llama 3.2 Vision,Buffer,Polygon Visualization,Image Stack,Mask Visualization,Anthropic Claude,Stability AI Inpainting,Keypoint Visualization,Background Subtraction,Image Slicer,Image Contours,Line Counter Visualization,Image Preprocessing,Dynamic Crop,Depth Estimation,Bounding Box Visualization,Motion Detection,Clip Comparison,Corner Visualization,Polygon Zone Visualization,Camera Calibration,Grid Visualization,Stability AI Image Generation,Dynamic Zone,OpenAI,Circle Visualization,Image Slicer,Relative Static Crop,OpenAI-Compatible LLM,OpenRouter,SIFT Comparison,Pixelate Visualization,Background Color Visualization,Google Gemma
- outputs:
Halo Visualization,Overlap Analysis,Stitch OCR Detections,SAM 3 Interactive,Template Matching,Twilio SMS/MMS Notification,Classification Label Visualization,Crop Visualization,Icon Visualization,Detections Transformation,Blur Visualization,Reference Path Visualization,ByteTrack Tracker,Single-Label Classification Model,Google Gemini,Detections Classes Replacement,Single-Label Classification Model,Byte Tracker,Webhook Sink,Instance Segmentation Model,Track Class Lock,Instance Segmentation Model,Size Measurement,Model Comparison Visualization,MQTT Writer,Florence-2 Model,Trace Visualization,Path Deviation,Ellipse Visualization,Object Detection Model,Keypoint Detection Model,BoT-SORT Tracker,Dot Visualization,Perspective Correction,Label Visualization,Instance Segmentation Model,Florence-2 Model,Text Display,Per-Class Confidence Filter,Roboflow Dataset Upload,Detections Stabilizer,Keypoint Detection Model,Detections Merge,Gaze Detection,Velocity,Keypoint Detection Model,OC-SORT Tracker,SAM 3,Triangle Visualization,Camera Focus,Time in Zone,Polygon Visualization,SAM2 Video Tracker,SORT Tracker,Line Counter,Heatmap Visualization,Multi-Label Classification Model,Detections Stitch,Detections List Roll-Up,Halo Visualization,Color Visualization,Stitch OCR Detections,Event Writer,Multi-Label Classification Model,Image Stack,Email Notification,Polygon Visualization,Mask Visualization,Detections Filter,Stability AI Inpainting,Distance Measurement,Time in Zone,Microsoft SQL Server Sink,Roboflow Asset Library Attributes,PTZ Tracking (ONVIF),Keypoint Visualization,Multi-Label Classification Model,Roboflow Vision Events,Overlap Filter,Twilio SMS Notification,Mask Area Measurement,Email Notification,Detection Offset,Line Counter Visualization,Detections Consensus,Object Detection Model,OPC UA Writer Sink,SAM 3,Byte Tracker,Path Deviation,Dynamic Crop,Byte Tracker,Bounding Box Visualization,Detections Combine,Motion Detection,Roboflow Dataset Upload,Polygon Zone Visualization,Camera 
Calibration,Segment Anything 2 Model,Corner Visualization,Circle Visualization,Time in Zone,Single-Label Classification Model,Roboflow Custom Metadata,Instance Segmentation Model,Model Monitoring Inference Aggregator,Slack Notification,Object Detection Model,Detection Event Log,SIFT Comparison,Pixelate Visualization,Background Color Visualization,Line Counter,Dynamic Zone
Input and Output Bindings¶
The available connections depend on its binding kinds. Check what binding kinds
VLM As Detector in version v2 has.
Bindings
- input
    - image (image): Input image that was used to generate the VLM prediction. Used to extract image dimensions (width, height) for converting normalized coordinates to pixel coordinates and metadata (parent_id) for the detection predictions. The same image that was provided to the VLM/LLM block should be used here to maintain consistency.
    - vlm_output (language_model_output): String output from a VLM or LLM block containing object detection prediction in JSON format. Can be raw JSON string or JSON wrapped in Markdown code blocks (e.g., ```json {...} ```). Format depends on model_type and task_type - different models and tasks produce different JSON structures. If multiple markdown blocks exist, only the first is parsed.
    - classes (list_of_values): List of all class names used by the classification model, in order. Required to generate mapping between class names (from VLM output) and class IDs (for detection format). Classes are mapped to IDs by index: first class = ID 0, second = ID 1, etc. Classes from VLM output that are not in this list get class_id = -1. Required for OpenAI, Gemini, and Claude models. Optional for Florence-2 (some tasks don't require it). Should match the classes the VLM was asked to detect.
- output
    - error_status (boolean): Boolean flag.
    - predictions (object_detection_prediction): Prediction with detected bounding boxes in form of sv.Detections(...) object.
    - inference_id (inference_id): Inference identifier.
Example JSON definition of step VLM As Detector in version v2
{
    "name": "<your_step_name_here>",
    "type": "roboflow_core/vlm_as_detector@v2",
    "image": "$inputs.image",
    "vlm_output": [
        "$steps.lmm.output"
    ],
    "classes": [
        "$steps.lmm.classes",
        "$inputs.classes",
        [
            "dog",
            "cat",
            "bird"
        ],
        [
            "class_a",
            "class_b"
        ]
    ],
    "model_type": [
        "openai"
    ],
    "task_type": "<block_does_not_provide_example>"
}
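The generated definition above lists several alternative example values per field. A filled-in step might look like the sketch below, written as a Python dictionary. The step name `vlm_as_detector` and the upstream step name `gemini` are hypothetical placeholders, and the choice of a single selector for vlm_output and an inline list for classes is an assumption based on the examples above.

```python
import json

# Hypothetical concrete step: "gemini" is assumed to be an upstream VLM step
# whose "output" field carries the JSON text to parse.
vlm_as_detector_step = {
    "name": "vlm_as_detector",
    "type": "roboflow_core/vlm_as_detector@v2",
    "image": "$inputs.image",
    "vlm_output": "$steps.gemini.output",
    "classes": ["dog", "cat", "bird"],
    "model_type": "google-gemini",
    "task_type": "object-detection",
}

# Selectors that a later step or a workflow output could use to read the
# three outputs of this step.
output_selectors = {
    field: f"$steps.{vlm_as_detector_step['name']}.{field}"
    for field in ("predictions", "error_status", "inference_id")
}

print(json.dumps(vlm_as_detector_step, indent=4))
```

The task_type and model_type values should match what the upstream model was actually asked to do, as noted in the Properties table.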
v1¶
Class: VLMAsDetectorBlockV1 (there are multiple versions of this block)
Source: inference.core.workflows.core_steps.formatters.vlm_as_detector.v1.VLMAsDetectorBlockV1
Warning: This block has multiple versions. Please refer to the specific version for details. You can learn more about how versions work here: Versioning
Parses JSON strings from Visual Language Models (VLMs) and Large Language Models (LLMs) into the standardized object detection prediction format. The block extracts bounding boxes, class names, and confidences, converts normalized coordinates to pixel coordinates, maps class names to class IDs, and handles multiple model types and task formats. This enables VLM-based object detection, LLM detection parsing, and text-to-detection conversion workflows.
How This Block Works¶
This block converts VLM/LLM text outputs containing object detection predictions into standardized object detection format compatible with workflow detection blocks. The block:
1. Receives image and VLM output string containing detection results in JSON format
2. Parses JSON content from VLM output:
Handles Markdown-wrapped JSON:
- Searches for JSON wrapped in Markdown code blocks (```json ... ```)
- This format is common in LLM/VLM responses
- If multiple markdown JSON blocks are found, only the first block is parsed
- Extracts JSON content from within markdown tags
Handles raw JSON strings:
- If no markdown blocks are found, attempts to parse the entire string as JSON
- Supports standard JSON format strings
3. Selects appropriate parser based on model type and task type:
- Uses registered parsers that handle different model outputs (google-gemini, anthropic-claude, florence-2)
- Supports multiple task types: object-detection, open-vocabulary-object-detection, object-detection-and-caption, phrase-grounded-object-detection, region-proposal, ocr-with-text-detection
- Each model/task combination uses a specialized parser for that format
4. Parses detection data based on model type:
For Gemini/Claude models:
- Extracts detections array from parsed JSON
- Converts normalized coordinates (0-1 range) to pixel coordinates using image dimensions
- Extracts class names, confidence scores, and bounding box coordinates
- Maps class names to class IDs using provided classes list
- Creates detection objects with bounding boxes, classes, and confidences
For Florence-2 model:
- Uses supervision's built-in LMM parser for Florence-2 format
- Handles different task types with specialized parsing (object detection, open vocabulary, region proposal, OCR, etc.)
- For region proposal tasks: assigns "roi" as class name
- For open vocabulary detection: uses provided classes list for class ID mapping
- For other tasks: uses MD5-based class ID generation or provided classes
- Sets confidence to 1.0 for Florence-2 detections (model doesn't provide confidence)
5. Converts coordinates and normalizes data:
- Converts normalized coordinates (0-1) to absolute pixel coordinates (x_min, y_min, x_max, y_max)
- Scales coordinates using image width and height
- Normalizes confidence scores to valid range [0.0, 1.0]
- Clamps confidence values outside the range
6. Creates class name to class ID mapping:
- For Gemini/Claude: uses provided classes list to create index mapping (class_name → class_id)
- Classes are mapped in order (first class = ID 0, second = ID 1, etc.)
- Classes not in the provided list get class_id = -1
- For Florence-2: uses different mapping strategies based on task type
7. Constructs object detection predictions:
- Creates supervision Detections objects with bounding boxes (xyxy format)
- Includes class IDs, class names, and confidence scores
- Adds metadata: detection IDs, inference IDs, image dimensions, prediction type
- Attaches parent coordinates for crop-aware detections
- Formats predictions in standard object detection format
8. Handles errors:
- Sets error_status to True if JSON parsing fails
- Sets error_status to True if detection parsing fails
- Returns None for predictions when errors occur
- Always includes inference_id for tracking
9. Returns object detection predictions:
- Outputs predictions in standard object detection format (compatible with detection blocks)
- Outputs error_status indicating parsing success/failure
- Outputs inference_id for tracking and lineage
The block enables using VLMs/LLMs for object detection by converting their text-based JSON outputs into standardized detection predictions that can be used in workflows like any other object detection model output.
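The error-handling contract in step 8 (error_status set on failure, predictions set to None, inference_id always present) can be sketched as below. This is an illustrative outline only: `run_parser` and the `parse_detections` callback are hypothetical names, and generating the identifier with `uuid4` is an assumption, since the text does not state how inference_id is produced.

```python
import json
import uuid
from typing import Any, Callable, Dict


def run_parser(raw_json: str, parse_detections: Callable[[Any], Any]) -> Dict[str, Any]:
    """Hypothetical wrapper that mirrors the three outputs described above."""
    # Assumption: a fresh random ID; created up front so it is always returned.
    inference_id = str(uuid.uuid4())
    try:
        parsed = json.loads(raw_json)           # JSON parsing may fail
        predictions = parse_detections(parsed)  # detection parsing may fail
    except Exception:
        # Broad catch on purpose: any failure maps to the same error outputs.
        return {"error_status": True, "predictions": None, "inference_id": inference_id}
    return {"error_status": False, "predictions": predictions, "inference_id": inference_id}


ok = run_parser('{"detections": []}', lambda doc: doc["detections"])
bad_json = run_parser("oops", lambda doc: doc["detections"])
bad_shape = run_parser("{}", lambda doc: doc["detections"])
```

Downstream logic can then branch on `error_status` before touching `predictions`, which is `None` for both `bad_json` and `bad_shape`.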
Common Use Cases¶
- VLM-Based Object Detection: Use Visual Language Models for object detection by parsing VLM outputs into detection predictions (e.g., detect objects with GPT-4V, use Claude Vision for detection, parse Gemini detection outputs), enabling VLM detection workflows
- Open-Vocabulary Detection: Use VLMs for open-vocabulary object detection with custom classes (e.g., detect custom objects with VLMs, use open-vocabulary detection, detect objects not in training set), enabling open-vocabulary detection workflows
- Multi-Task Detection: Use VLMs for various detection tasks (e.g., object detection with captions, phrase-grounded detection, region proposal, OCR with detection), enabling multi-task detection workflows
- LLM Detection Parsing: Parse LLM text outputs containing detection results into standardized format (e.g., parse GPT detection outputs, convert LLM predictions to detection format, use LLMs for detection), enabling LLM detection workflows
- Text-to-Detection Conversion: Convert text-based detection outputs from models into workflow-compatible detection predictions (e.g., convert text predictions to detection format, parse text-based detections, convert model outputs to detections), enabling text-to-detection workflows
- VLM Integration: Integrate VLM outputs into detection workflows (e.g., use VLMs in detection pipelines, integrate VLM predictions with detection blocks, combine VLM and traditional detection), enabling VLM integration workflows
Connecting to Other Blocks¶
This block receives images and VLM outputs and produces object detection predictions:
- After VLM/LLM blocks to parse detection outputs into standard format (e.g., VLM output to detections, LLM output to detections, parse model outputs), enabling VLM-to-detection workflows
- Before detection-based blocks to use parsed detections (e.g., use parsed detections in workflows, provide detections to downstream blocks, use VLM detections with detection blocks), enabling detection-to-workflow workflows
- Before filtering blocks to filter VLM detections (e.g., filter by class, filter by confidence, apply filters to VLM predictions), enabling detection-to-filter workflows
- Before analytics blocks to analyze VLM detection results (e.g., analyze VLM detections, perform analytics on parsed detections, track VLM detection metrics), enabling detection analytics workflows
- Before visualization blocks to display VLM detection results (e.g., visualize VLM detections, display parsed detection predictions, show VLM detection outputs), enabling detection visualization workflows
- In workflow outputs to provide VLM detections as final output (e.g., VLM detection outputs, parsed detection results, VLM-based detection outputs), enabling detection output workflows
Requirements¶
This block requires an image input (for metadata and dimensions) and a VLM output string containing JSON detection data. The JSON can be raw JSON or wrapped in Markdown code blocks (json ...). The block supports three model types: "google-gemini", "anthropic-claude", and "florence-2". It supports multiple task types: "object-detection", "open-vocabulary-object-detection", "object-detection-and-caption", "phrase-grounded-object-detection", "region-proposal", and "ocr-with-text-detection". The classes parameter is required for Gemini and Claude models (to map class names to IDs) but optional for Florence-2 (some tasks don't require it). Classes are mapped to IDs by index (first class = 0, second = 1, etc.). Classes not in the list get class_id = -1. The block outputs object detection predictions in standard format (compatible with detection blocks), error_status (boolean), and inference_id (string) for tracking.
Type identifier¶
Use the following identifier in step "type" field: roboflow_core/vlm_as_detector@v1 to add the block as a step in your workflow.
Properties¶
| Name | Type | Description | Refs |
|---|---|---|---|
| name | str | Enter a unique identifier for this step. | ❌ |
| classes | List[str] | List of all class names used by the detection model, in order. Required for google-gemini and anthropic-claude models to generate mapping between class names (from VLM output) and class IDs (for detection format). Optional for florence-2 model (required only for open-vocabulary-object-detection task). Classes are mapped to IDs by index: first class = ID 0, second = ID 1, etc. Classes from VLM output that are not in this list get class_id = -1. Should match the classes the VLM was asked to detect. | ✅ |
| model_type | str | Type of VLM/LLM model that generated the detection prediction. Determines which parser to use for parsing the JSON output. 'google-gemini': Google Gemini model outputs. 'anthropic-claude': Anthropic Claude model outputs. 'florence-2': Microsoft Florence-2 model outputs. Each model type has different JSON output formats and requires appropriate parsing. | ❌ |
| task_type | str | Task type that was performed by the VLM model. Determines how the JSON output is parsed and what detection format is expected. Supported tasks: 'object-detection' (unprompted detection), 'open-vocabulary-object-detection' (detection with provided classes), 'object-detection-and-caption' (detection with captions), 'phrase-grounded-object-detection' (prompted detection), 'region-proposal' (regions of interest), 'ocr-with-text-detection' (text detection with OCR). Each task type has specific output format requirements. | ❌ |
The Refs column indicates whether the property can be parametrised with dynamic values available at workflow runtime. See Bindings for more info.
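The rule for when classes must be supplied in v1, as stated in the table above, can be written as a small predicate. The function name is hypothetical and the helper is not part of the inference package; it only restates the documented rule.

```python
def classes_required(model_type: str, task_type: str) -> bool:
    """Hypothetical helper: does this v1 model/task combination need `classes`?"""
    if model_type in {"google-gemini", "anthropic-claude"}:
        # Needed to map class names from the VLM output to class IDs.
        return True
    if model_type == "florence-2":
        # Per the table: required only for open-vocabulary detection.
        return task_type == "open-vocabulary-object-detection"
    raise ValueError(f"model_type not supported by v1: {model_type!r}")


needs_for_gemini = classes_required("google-gemini", "object-detection")
needs_for_florence_ovd = classes_required("florence-2", "open-vocabulary-object-detection")
needs_for_florence_rp = classes_required("florence-2", "region-proposal")
```

A check like this could run before building the step definition, so that a missing classes list is caught early rather than surfacing as a parsing error.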
Available Connections¶
Compatible Blocks
Check what blocks you can connect to VLM As Detector in version v1.
- inputs:
Halo Visualization,GLM-OCR,Image Threshold,Stitch Images,Morphological Transformation,Classification Label Visualization,Crop Visualization,Icon Visualization,Stability AI Outpainting,Blur Visualization,Reference Path Visualization,MoonshotAI Kimi,OpenAI,Google Gemini,Anthropic Claude,Camera Focus,QR Code Generator,Size Measurement,Model Comparison Visualization,Florence-2 Model,Trace Visualization,Ellipse Visualization,Anthropic Claude,Dot Visualization,Perspective Correction,Label Visualization,Image Convert Grayscale,Florence-2 Model,Text Display,Llama 3.2 Vision,Qwen-VL,PLC ModbusTCP,Image Blur,Absolute Static Crop,SIFT,Google Gemini,Dimension Collapse,Qwen 3.5 API,Qwen 3.6 API,Triangle Visualization,Camera Focus,Contrast Equalization,Polygon Visualization,OpenAI,Heatmap Visualization,Clip Comparison,Google Gemma API,Detections List Roll-Up,Contrast Enhancement,Google Gemini,PLC EthernetIP,Halo Visualization,Color Visualization,Morphological Transformation,MoonshotAI Kimi,Llama 3.2 Vision,Buffer,Polygon Visualization,Image Stack,Mask Visualization,Anthropic Claude,Stability AI Inpainting,Keypoint Visualization,Background Subtraction,Image Slicer,Image Contours,Line Counter Visualization,Image Preprocessing,Dynamic Crop,Depth Estimation,Bounding Box Visualization,Motion Detection,Clip Comparison,Corner Visualization,Polygon Zone Visualization,Camera Calibration,Grid Visualization,Stability AI Image Generation,Dynamic Zone,OpenAI,Circle Visualization,Image Slicer,Relative Static Crop,OpenAI-Compatible LLM,OpenRouter,SIFT Comparison,Pixelate Visualization,Background Color Visualization,Google Gemma
- outputs:
Overlap Analysis,Template Matching,Morphological Transformation,Classification Label Visualization,Crop Visualization,Detections Transformation,Blur Visualization,Stability AI Outpainting,Reference Path Visualization,OpenAI,YOLO-World Model,Detections Classes Replacement,Anthropic Claude,Track Class Lock,Instance Segmentation Model,Size Measurement,Model Comparison Visualization,Florence-2 Model,Trace Visualization,Label Visualization,Florence-2 Model,Text Display,Qwen-VL,Llama 3.2 Vision,Keypoint Detection Model,Image Blur,Gaze Detection,Velocity,Keypoint Detection Model,LMM,OC-SORT Tracker,Qwen 3.5 API,Qwen 3.6 API,Camera Focus,SORT Tracker,Line Counter,Multi-Label Classification Model,Detections Stitch,Clip Comparison,Google Gemma API,Halo Visualization,Color Visualization,Stitch OCR Detections,MoonshotAI Kimi,Morphological Transformation,Event Writer,Stability AI Inpainting,Cache Set,Time in Zone,Microsoft SQL Server Sink,Roboflow Asset Library Attributes,OpenAI,Roboflow Vision Events,Mask Area Measurement,Detection Offset,CogVLM,Detections Consensus,Object Detection Model,OPC UA Writer Sink,Semantic Segmentation Model,Path Deviation,Dynamic Crop,Byte Tracker,Bounding Box Visualization,Detections Combine,Qwen3.5-VL,SAM 3,Cache Get,OpenAI,Time in Zone,Single-Label Classification Model,Slack Notification,OpenRouter,Detection Event Log,SIFT Comparison,Pixelate Visualization,Google Vision OCR,SAM3 Video Tracker,Dynamic Zone,Google Gemma,Halo Visualization,CLIP Embedding Model,Stitch OCR Detections,GLM-OCR,Image Threshold,SAM 3 Interactive,Twilio SMS/MMS Notification,Icon Visualization,MoonshotAI Kimi,ByteTrack Tracker,Single-Label Classification Model,Google Gemini,Single-Label Classification Model,Byte Tracker,Webhook Sink,Instance Segmentation Model,QR Code Generator,Path Deviation,MQTT Writer,Ellipse Visualization,Object Detection Model,Anthropic Claude,Keypoint Detection Model,BoT-SORT Tracker,Dot Visualization,Perspective Correction,Instance Segmentation 
Model,Seg Preview,Per-Class Confidence Filter,Roboflow Dataset Upload,Detections Stabilizer,Detections Merge,Google Gemini,Local File Sink,SAM 3,Triangle Visualization,Time in Zone,Contrast Equalization,Polygon Visualization,SAM2 Video Tracker,OpenAI,Heatmap Visualization,Perception Encoder Embedding Model,Detections List Roll-Up,Google Gemini,LMM For Classification,Llama 3.2 Vision,Multi-Label Classification Model,Image Stack,Email Notification,Polygon Visualization,Mask Visualization,Anthropic Claude,Detections Filter,Distance Measurement,PTZ Tracking (ONVIF),Keypoint Visualization,Multi-Label Classification Model,Overlap Filter,Twilio SMS Notification,Email Notification,Line Counter Visualization,Byte Tracker,SAM 3,Image Preprocessing,Depth Estimation,Pixel Color Count,Motion Detection,Current Time,Roboflow Dataset Upload,Polygon Zone Visualization,Camera Calibration,Segment Anything 2 Model,Corner Visualization,Moondream2,Stability AI Image Generation,S3 Sink,Circle Visualization,Roboflow Custom Metadata,Instance Segmentation Model,Model Monitoring Inference Aggregator,OpenAI-Compatible LLM,Object Detection Model,Background Color Visualization,Line Counter
Input and Output Bindings¶
The available connections depend on its binding kinds. Check what binding kinds
VLM As Detector in version v1 has.
Bindings
- input
    - image (image): Input image that was used to generate the VLM prediction. Used to extract image dimensions (width, height) for converting normalized coordinates to pixel coordinates and metadata (parent_id) for the detection predictions. The same image that was provided to the VLM/LLM block should be used here to maintain consistency.
    - vlm_output (language_model_output): String output from a VLM or LLM block containing object detection prediction in JSON format. Can be raw JSON string or JSON wrapped in Markdown code blocks (e.g., ```json {...} ```). Format depends on model_type and task_type - different models and tasks produce different JSON structures. If multiple markdown blocks exist, only the first is parsed.
    - classes (list_of_values): List of all class names used by the detection model, in order. Required for google-gemini and anthropic-claude models to generate mapping between class names (from VLM output) and class IDs (for detection format). Optional for florence-2 model (required only for open-vocabulary-object-detection task). Classes are mapped to IDs by index: first class = ID 0, second = ID 1, etc. Classes from VLM output that are not in this list get class_id = -1. Should match the classes the VLM was asked to detect.
- output
    - error_status (boolean): Boolean flag.
    - predictions (object_detection_prediction): Prediction with detected bounding boxes in form of sv.Detections(...) object.
    - inference_id (string): String value.
Example JSON definition of step VLM As Detector in version v1
{
    "name": "<your_step_name_here>",
    "type": "roboflow_core/vlm_as_detector@v1",
    "image": "$inputs.image",
    "vlm_output": [
        "$steps.lmm.output"
    ],
    "classes": [
        "$steps.lmm.classes",
        "$inputs.classes",
        [
            "dog",
            "cat",
            "bird"
        ],
        [
            "class_a",
            "class_b"
        ]
    ],
    "model_type": "google-gemini",
    "task_type": "<block_does_not_provide_example>"
}