Identify Outliers¶

Class: IdentifyOutliersBlockV1

Source: inference.core.workflows.core_steps.sampling.identify_outliers.v1.IdentifyOutliersBlockV1

Identify outlier embeddings compared to prior data using von Mises-Fisher statistical distribution analysis to detect anomalies, unusual patterns, or deviations from normal behavior by comparing current embedding vectors against a sliding window of historical embeddings for quality control, anomaly detection, and data sampling workflows.

How This Block Works¶

This block detects outliers by statistically comparing embedding vectors against historical data using directional statistics. The block:

Receives an embedding vector representing the current data point's features
Normalizes the embedding to unit length:
Converts the embedding to a unit vector (length = 1) for directional analysis
Enables comparison using angular/directional statistics rather than distance-based metrics
Handles zero vectors gracefully by skipping normalization
Tracks sample count and warmup status:
Increments sample counter for each processed embedding
Determines if still in warmup period (samples < warmup parameter)
During warmup, no outliers are identified to allow baseline establishment
Maintains a sliding window of historical embeddings:
Stores normalized embeddings in a buffer that grows up to window_size
When buffer exceeds window_size, removes oldest embeddings (FIFO)
Creates a rolling history of recent data for statistical comparison
Fits von Mises-Fisher (vMF) distribution parameters during warmup completion:
Mean Direction (mu): Calculates the average direction of all historical embeddings
Concentration Parameter (kappa): Measures how tightly clustered the embeddings are around the mean
Uses statistical estimation to model the distribution of embedding directions
vMF distribution is ideal for directional data on a hypersphere (unit vectors)
Computes alignment score for current embedding:
Calculates dot product between current normalized embedding and mean direction vector
Measures how well the current embedding aligns with the typical direction
Higher values indicate closer alignment to the norm, lower values indicate deviation
Calculates empirical percentile of current embedding:
Computes alignment scores for all historical embeddings against the mean direction
Ranks the current embedding's alignment score among historical scores
Determines percentile position (0.0 = lowest, 1.0 = highest) of current embedding
Determines outlier status based on percentile thresholds:
Flags as outlier if percentile is below threshold_percentile (e.g., bottom 5%)
Flags as outlier if percentile is above (1 - threshold_percentile) (e.g., top 5%)
Detects both extreme low and extreme high deviations from the norm
Returns three outputs:
is_outlier: Boolean flag indicating if the current embedding is an outlier
percentile: Float value (0.0-1.0) representing where the embedding ranks among historical data
warming_up: Boolean flag indicating if still in warmup period (always False after warmup)

The block uses von Mises-Fisher distribution analysis, which is designed for directional data on a hypersphere (unit vectors). This makes it well-suited for high-dimensional embeddings where direction matters more than magnitude. The sliding window approach ensures the statistical model adapts to recent trends while the percentile-based detection identifies embeddings that are unusually different from the historical pattern. Lower percentiles indicate embeddings that are less aligned with typical patterns, while higher percentiles indicate embeddings that are unusually well-aligned or different in a positive direction.

Common Use Cases¶

Anomaly Detection: Detect unusual images, objects, or patterns that deviate from normal data (e.g., identify unusual product variations, detect anomalous behavior, flag unexpected patterns), enabling anomaly detection workflows
Quality Control: Identify defective or unusual items in manufacturing or production (e.g., detect product defects, identify quality issues, flag manufacturing anomalies), enabling quality control workflows
Data Sampling: Identify interesting or unusual data points for manual review or further analysis (e.g., sample unusual images for labeling, identify edge cases for model improvement, select interesting data for analysis), enabling intelligent data sampling workflows
Change Detection: Detect when data patterns change significantly from historical norms (e.g., detect scene changes, identify pattern shifts, flag significant variations), enabling change detection workflows
Model Monitoring: Monitor model performance by detecting when embeddings deviate from training distribution (e.g., detect distribution shift, identify out-of-distribution data, monitor model drift), enabling model monitoring workflows
Content Filtering: Identify unusual or inappropriate content that differs from expected patterns (e.g., detect unusual content, flag inappropriate material, identify content anomalies), enabling content filtering workflows

Connecting to Other Blocks¶

This block receives embeddings and produces is_outlier, percentile, and warming_up outputs:

After embedding model blocks (CLIP, Perception Encoder, etc.) to analyze embedding outliers (e.g., identify outliers from CLIP embeddings, analyze Perception Encoder outliers, detect anomalies from embeddings), enabling embedding-to-outlier workflows
After classification or detection blocks with embeddings to identify unusual predictions (e.g., identify unusual detections, flag anomalous classifications, detect outlier predictions), enabling prediction-to-outlier workflows
Before logic blocks like Continue If to make decisions based on outlier detection (e.g., continue if outlier detected, filter based on outlier status, trigger actions on anomalies), enabling outlier-based decision workflows
Before notification blocks to alert on outlier detection (e.g., alert on anomalies, notify about unusual data, trigger alerts on outliers), enabling outlier-based notification workflows
Before data storage blocks to record outlier information (e.g., log outlier data, store anomaly statistics, record unusual data points), enabling outlier data logging workflows
In quality control pipelines where outlier detection is part of quality assurance (e.g., filter outliers in quality pipelines, identify issues in production workflows, detect problems in processing chains), enabling quality control workflows

Requirements¶

This block requires embeddings as input (typically from embedding model blocks like CLIP or Perception Encoder). The block maintains internal state across workflow executions, accumulating a sliding window of historical embeddings. During the warmup period (first warmup samples), no outliers are identified and the block returns is_outlier=False and percentile=0.5. After warmup, the block uses at least warmup embeddings (up to window_size embeddings) to establish statistical baselines. The threshold_percentile parameter (0.0-1.0) controls sensitivity - lower values (e.g., 0.01) detect only extreme outliers, while higher values (e.g., 0.1) detect more moderate deviations. The block works best with consistent embedding models and may need adjustment of threshold_percentile based on expected variation in your data.

Type identifier¶

Use the following identifier in step "type" field: roboflow_core/identify_outliers@v1to add the block as as step in your workflow.

Properties¶

Name	Type	Description	Refs
`name`	`str`	Unique name of step in workflows.	❌
`threshold_percentile`	`float`	Percentile threshold for outlier detection, range 0.0-1.0. Embeddings below this percentile or above (1 - threshold_percentile) are flagged as outliers. Lower values (e.g., 0.01) detect only extreme outliers - very strict. Higher values (e.g., 0.1) detect more moderate deviations - more sensitive. Default 0.05 means bottom 5% and top 5% are outliers. Adjust based on expected variation in your data..	✅
`warmup`	`int`	Number of initial data points required before outlier detection begins. During warmup, no outliers are identified (is_outlier=False) to allow baseline establishment. Must be at least 2 for statistical analysis. Typical range: 3-100 samples. Higher values provide more stable baselines but delay outlier detection. Lower values enable faster detection but may be less accurate initially..	✅
`window_size`	`int`	Maximum number of historical embeddings to maintain in sliding window. The block keeps the most recent window_size embeddings for statistical comparison. When exceeded, oldest embeddings are removed (FIFO). Larger windows provide more stable statistics but adapt slower to distribution changes. Smaller windows adapt faster but may be less stable. Set to None for unlimited window (uses all historical data). Typical range: 10-100 embeddings..	✅

The Refs column marks possibility to parametrise the property with dynamic values available in workflow runtime. See Bindings for more info.

Runtime compatibility¶

soft — runtime hosted_serverless, dedicated_deployment; execution remote; input video: Block keeps per-video state in process memory (keyed by video_metadata.video_identifier). With remote step execution on stateless or multi-replica HTTP runtimes, successive requests may be served by different worker processes, so the state resets between calls and the output is meaningless for tracking / counting / aggregation. Use local step execution in an InferencePipeline for stable cross-frame results.
soft — input image: Block depends on temporal context from video or repeated-frame workflows. With a still image/photo, there is no meaningful history to track, compare, aggregate, or visualize, so the block provides little or no benefit.

Available Connections¶

Compatible Blocks

Check what blocks you can connect to Identify Outliers in version v1.

Input and Output Bindings¶

The available connections depend on its binding kinds. Check what binding kinds Identify Outliers in version v1 has.

Bindings

input
- embedding (embedding): Embedding vector representing the current data point's features. Typically from embedding models like CLIP or Perception Encoder. The embedding is normalized to unit length for directional statistical analysis using von Mises-Fisher distribution. Must be a numerical vector of any dimension..
- threshold_percentile (float_zero_to_one): Percentile threshold for outlier detection, range 0.0-1.0. Embeddings below this percentile or above (1 - threshold_percentile) are flagged as outliers. Lower values (e.g., 0.01) detect only extreme outliers - very strict. Higher values (e.g., 0.1) detect more moderate deviations - more sensitive. Default 0.05 means bottom 5% and top 5% are outliers. Adjust based on expected variation in your data..
- warmup (integer): Number of initial data points required before outlier detection begins. During warmup, no outliers are identified (is_outlier=False) to allow baseline establishment. Must be at least 2 for statistical analysis. Typical range: 3-100 samples. Higher values provide more stable baselines but delay outlier detection. Lower values enable faster detection but may be less accurate initially..
- window_size (integer): Maximum number of historical embeddings to maintain in sliding window. The block keeps the most recent window_size embeddings for statistical comparison. When exceeded, oldest embeddings are removed (FIFO). Larger windows provide more stable statistics but adapt slower to distribution changes. Smaller windows adapt faster but may be less stable. Set to None for unlimited window (uses all historical data). Typical range: 10-100 embeddings..
output
- is_outlier (boolean): Boolean flag.
- percentile (float_zero_to_one): float value in range [0.0, 1.0].
- warming_up (boolean): Boolean flag.

Example JSON definition of step Identify Outliers in version v1

{
    "name": "<your_step_name_here>",
    "type": "roboflow_core/identify_outliers@v1",
    "embedding": "$steps.clip.embedding",
    "threshold_percentile": 0.05,
    "warmup": 3,
    "window_size": 32
}