Llama 3.2 Vision¶
Class: LlamaVisionBlockV1
Source: inference.core.workflows.core_steps.models.foundation.llama_vision.v1.LlamaVisionBlockV1
Ask a question to the Llama 3.2 Vision model about the contents of an image.
You can specify arbitrary text prompts or use predefined ones; the block supports the following types of prompt:
- Open Prompt (`unconstrained`) - use any prompt to generate a raw response
- Text Recognition (OCR) (`ocr`) - model recognizes text in the image
- Visual Question Answering (`visual-question-answering`) - model answers the question you submit in the prompt
- Captioning (short) (`caption`) - model provides a short description of the image
- Captioning (`detailed-caption`) - model provides a long description of the image
- Single-Label Classification (`classification`) - model classifies the image content as one of the provided classes
- Multi-Label Classification (`multi-label-classification`) - model classifies the image content as one or more of the provided classes
- Structured Output Generation (`structured-answering`) - model returns a JSON response with the specified fields
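Each task type implies a different set of required parameters. As a rough illustration, a step configuration could be pre-validated before submission; note that the mapping below is inferred from the task descriptions above, not taken from the block's source code:

```python
# Illustrative mapping of task types to the extra parameters they need.
# NOTE: this mapping is an assumption inferred from the task descriptions
# in the docs, not an excerpt from the block's implementation.
TASK_EXTRA_PARAMS = {
    "unconstrained": {"prompt"},
    "ocr": set(),
    "visual-question-answering": {"prompt"},
    "caption": set(),
    "detailed-caption": set(),
    "classification": {"classes"},
    "multi-label-classification": {"classes"},
    "structured-answering": {"output_structure"},
}

def missing_params(task_type: str, step: dict) -> set:
    """Return the parameters required by `task_type` that `step` lacks."""
    required = TASK_EXTRA_PARAMS.get(task_type, set())
    return {param for param in required if param not in step}
```

For example, a `classification` step without a `classes` field would be flagged before the workflow ever calls the API.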
Issues with structured prompting
The model tends to be quite unpredictable when structured output (in our case a JSON document) is expected.
This problem may impact tasks like `structured-answering`, `classification` or `multi-label-classification`.
The cause appears to be the model's quite sensitive built-in filters for inappropriate content.
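Given this unpredictability, it is worth parsing the model's output defensively rather than calling `json.loads` directly. A minimal sketch in plain Python (the helper name and fallback strategy are our own, not part of the block):

```python
import json
import re

def parse_structured_answer(raw: str):
    """Try to recover a JSON object from a possibly noisy model response.

    Returns a dict on success, or None when no valid JSON can be found
    (e.g. when the model refused to answer).
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fall back to the first {...} span in the text, which handles responses
    # where the model wraps the JSON in prose or a markdown code fence.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```

Returning `None` instead of raising lets downstream steps treat a refusal the same way as any other empty result.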
🛠️ API providers and model variants¶
The Llama 3.2 Vision model is exposed via the OpenRouter API, and an OpenRouter API key is required to run it.
There are different versions of the model supported:

- the smaller version (`11B`) is faster and cheaper, yet you can expect better quality of results using the `90B` version
- the `Regular` version is a paid (and usually faster) API, whereas the `Free` version is free for OpenRouter clients (as of 01.01.2025)

For now, OpenRouter is the only provider for the Llama 3.2 Vision model, but we will keep you posted if that changes.
API Usage Charges
OpenRouter is an external third party that provides access to the model and charges for usage. Please check out its pricing before use.
💡 Further reading and Acceptable Use Policy¶
Model license
Check out model license before use.
Click here for the original model card.
Usage of this model is subject to Meta's Acceptable Use Policy.
Type identifier¶
Use the following identifier in the step "type" field: `roboflow_core/llama_3_2_vision@v1` to add the block
as a step in your workflow.
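For instance, a minimal step definition using this identifier could be built in Python as below; every field value besides `"type"` is an illustrative placeholder:

```python
import json

# A minimal step definition for the Llama 3.2 Vision block.
# Only the "type" value is fixed by the docs; the other values
# are illustrative placeholders.
step = {
    "name": "llama_vision",
    "type": "roboflow_core/llama_3_2_vision@v1",
    "images": "$inputs.image",
    "task_type": "unconstrained",
    "prompt": "Describe the image",
    "api_key": "$inputs.open_router_api_key",
}

# Step definitions must be JSON-serialisable to be embedded in a workflow.
serialised = json.dumps(step)
```

A full example of the step's JSON shape is shown at the end of this page.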
Properties¶
| Name | Type | Description | Refs |
|---|---|---|---|
| `name` | `str` | Enter a unique identifier for this step. | ❌ |
| `task_type` | `str` | Task type to be performed by the model. The value determines the required parameters and the output response. | ❌ |
| `prompt` | `str` | Text prompt to the Llama model. | ✅ |
| `output_structure` | `Dict[str, str]` | Dictionary with the structure of the expected JSON response. | ❌ |
| `classes` | `List[str]` | List of classes to be used. | ✅ |
| `api_key` | `str` | Your Llama Vision API key (dependent on provider, e.g. OpenRouter API key). | ✅ |
| `model_version` | `str` | Model to be used. | ✅ |
| `max_tokens` | `int` | Maximum number of tokens the model can generate in its response. | ❌ |
| `temperature` | `float` | Temperature to sample from the model - a value in range 0.0-2.0; the higher, the more random / "creative" the generations. | ✅ |
| `max_concurrent_requests` | `int` | Number of concurrent requests the block can execute when a batch of input images is provided. If not given, the block defaults to the value configured globally in the Workflows Execution Engine. Please restrict this if you hit rate limits. | ❌ |
The Refs column marks whether the property can be parametrised with dynamic values available
at workflow runtime. See Bindings for more info.
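The `max_concurrent_requests` throttling is handled internally by the Workflows Execution Engine, but the underlying pattern is a plain semaphore bound on in-flight requests. A sketch of that pattern, with a stand-in coroutine in place of a real model call:

```python
import asyncio

async def run_batch(images, max_concurrent_requests=2):
    """Bound the number of in-flight requests with a semaphore - the same
    pattern `max_concurrent_requests` describes. `fake_request` is a
    stand-in for a real model call over the network."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    peak = 0    # highest number of simultaneously active requests observed
    active = 0

    async def fake_request(image):
        nonlocal peak, active
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # placeholder for network latency
            active -= 1
            return f"response-for-{image}"

    results = await asyncio.gather(*(fake_request(i) for i in images))
    return results, peak
```

With four images and a limit of two, no more than two requests are ever in flight at once, which is exactly the behaviour to aim for when a provider enforces rate limits.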
Available Connections¶
Compatible Blocks
Check what blocks you can connect to Llama 3.2 Vision in version v1.
- inputs:
QR Code Generator,Image Convert Grayscale,Google Gemini,Dynamic Crop,Blur Visualization,SIFT,Bounding Box Visualization,Stability AI Outpainting,Identify Changes,Camera Focus,Slack Notification,Keypoint Visualization,Trace Visualization,Polygon Visualization,Ellipse Visualization,Model Comparison Visualization,OpenAI,Anthropic Claude,Dimension Collapse,Local File Sink,Triangle Visualization,Polygon Zone Visualization,Halo Visualization,LMM,Stability AI Image Generation,Florence-2 Model,Circle Visualization,Email Notification,Google Vision OCR,Google Gemini,Motion Detection,Clip Comparison,Cosine Similarity,Camera Focus,Anthropic Claude,Object Detection Model,Instance Segmentation Model,Perspective Correction,CSV Formatter,Reference Path Visualization,Corner Visualization,Color Visualization,Twilio SMS/MMS Notification,VLM as Classifier,Image Slicer,OpenAI,Stitch OCR Detections,Camera Calibration,Image Blur,Buffer,VLM as Detector,Dot Visualization,Roboflow Custom Metadata,Image Threshold,Model Monitoring Inference Aggregator,Morphological Transformation,Label Visualization,Background Color Visualization,Classification Label Visualization,OCR Model,Roboflow Dataset Upload,Mask Visualization,Dynamic Zone,Detections List Roll-Up,Pixelate Visualization,Absolute Static Crop,Keypoint Detection Model,Size Measurement,Webhook Sink,Grid Visualization,Contrast Equalization,Image Preprocessing,Google Gemini,Relative Static Crop,Stability AI Inpainting,Image Contours,Line Counter Visualization,Stitch Images,OpenAI,Crop Visualization,OpenAI,Llama 3.2 Vision,Icon Visualization,Clip Comparison,SIFT Comparison,Gaze Detection,Depth Estimation,Twilio SMS Notification,Single-Label Classification Model,Florence-2 Model,Background Subtraction,LMM For Classification,Multi-Label Classification Model,EasyOCR,CogVLM,Image Slicer,Roboflow Dataset Upload,Email Notification
- outputs:
QR Code Generator,Google Gemini,Bounding Box Visualization,Stability AI Outpainting,Trace Visualization,Instance Segmentation Model,Pixel Color Count,Ellipse Visualization,OpenAI,Model Comparison Visualization,Triangle Visualization,SAM 3,Distance Measurement,Stability AI Image Generation,Path Deviation,VLM as Detector,CLIP Embedding Model,Florence-2 Model,Email Notification,Google Gemini,VLM as Classifier,Corner Visualization,Color Visualization,Twilio SMS/MMS Notification,OpenAI,Line Counter,Buffer,SAM 3,Dot Visualization,Roboflow Custom Metadata,Image Threshold,Model Monitoring Inference Aggregator,Time in Zone,Classification Label Visualization,Roboflow Dataset Upload,Mask Visualization,Detections List Roll-Up,Line Counter,Keypoint Detection Model,Size Measurement,Detections Consensus,Webhook Sink,Stability AI Inpainting,Line Counter Visualization,Crop Visualization,OpenAI,Llama 3.2 Vision,Icon Visualization,Clip Comparison,Time in Zone,LMM For Classification,SAM 3,Object Detection Model,CogVLM,Roboflow Dataset Upload,Cache Get,Detections Stitch,Seg Preview,YOLO-World Model,Dynamic Crop,Slack Notification,Keypoint Visualization,Polygon Visualization,Cache Set,Anthropic Claude,Local File Sink,Polygon Zone Visualization,Halo Visualization,LMM,Time in Zone,Circle Visualization,Google Vision OCR,Motion Detection,Clip Comparison,Detections Classes Replacement,Anthropic Claude,Object Detection Model,Instance Segmentation Model,Perspective Correction,Perception Encoder Embedding Model,Reference Path Visualization,Stitch OCR Detections,Image Blur,VLM as Detector,Morphological Transformation,Label Visualization,Background Color Visualization,Keypoint Detection Model,Path Deviation,PTZ Tracking (ONVIF),Moondream2,Image Preprocessing,Contrast Equalization,Grid Visualization,Google Gemini,OpenAI,JSON Parser,SIFT Comparison,Depth Estimation,Twilio SMS Notification,VLM as Classifier,Florence-2 Model,Segment Anything 2 Model,Email Notification
Input and Output Bindings¶
The available connections depend on the block's binding kinds. Check what binding kinds
Llama 3.2 Vision in version v1 has.
Bindings
-
input
    - `images` (`image`): The image to infer on.
    - `prompt` (`string`): Text prompt to the Llama model.
    - `classes` (`list_of_values`): List of classes to be used.
    - `api_key` (`string`): Your Llama Vision API key (dependent on provider, e.g. OpenRouter API key).
    - `model_version` (`string`): Model to be used.
    - `temperature` (`float`): Temperature to sample from the model - a value in range 0.0-2.0; the higher, the more random / "creative" the generations.
-
output
    - `output` (`Union[string, language_model_output]`): String value if `string`, or LLM / VLM output if `language_model_output`.
    - `classes` (`list_of_values`): List of values of any type.
Example JSON definition of step Llama 3.2 Vision in version v1
```json
{
    "name": "<your_step_name_here>",
    "type": "roboflow_core/llama_3_2_vision@v1",
    "images": "$inputs.image",
    "task_type": "<block_does_not_provide_example>",
    "prompt": "my prompt",
    "output_structure": {
        "my_key": "description"
    },
    "classes": [
        "class-a",
        "class-b"
    ],
    "api_key": "xxx-xxx",
    "model_version": "11B (Free) - OpenRouter",
    "max_tokens": "<block_does_not_provide_example>",
    "temperature": "<block_does_not_provide_example>",
    "max_concurrent_requests": "<block_does_not_provide_example>"
}
```