
Identify Outliers

Class: IdentifyOutliersBlockV1

Source: inference.core.workflows.core_steps.sampling.identify_outliers.v1.IdentifyOutliersBlockV1

Identify outlier embeddings by comparing the current embedding vector against a sliding window of historical embeddings using von Mises-Fisher (vMF) directional statistics. The block flags anomalies, unusual patterns, and deviations from normal behavior, which supports quality control, anomaly detection, and data sampling workflows.

How This Block Works

This block detects outliers by statistically comparing embedding vectors against historical data using directional statistics. The block:

  1. Receives an embedding vector representing the current data point's features
  2. Normalizes the embedding to unit length:
     • Converts the embedding to a unit vector (length = 1) for directional analysis
     • Enables comparison using angular/directional statistics rather than distance-based metrics
     • Handles zero vectors gracefully by skipping normalization
  3. Tracks sample count and warmup status:
     • Increments sample counter for each processed embedding
     • Determines if still in warmup period (samples < warmup parameter)
     • During warmup, no outliers are identified to allow baseline establishment
  4. Maintains a sliding window of historical embeddings:
     • Stores normalized embeddings in a buffer that grows up to window_size
     • When buffer exceeds window_size, removes oldest embeddings (FIFO)
     • Creates a rolling history of recent data for statistical comparison
  5. Fits von Mises-Fisher (vMF) distribution parameters during warmup completion:
     • Mean Direction (mu): Calculates the average direction of all historical embeddings
     • Concentration Parameter (kappa): Measures how tightly clustered the embeddings are around the mean
     • Uses statistical estimation to model the distribution of embedding directions
     • vMF distribution is ideal for directional data on a hypersphere (unit vectors)
  6. Computes alignment score for current embedding:
     • Calculates dot product between current normalized embedding and mean direction vector
     • Measures how well the current embedding aligns with the typical direction
     • Higher values indicate closer alignment to the norm, lower values indicate deviation
  7. Calculates empirical percentile of current embedding:
     • Computes alignment scores for all historical embeddings against the mean direction
     • Ranks the current embedding's alignment score among historical scores
     • Determines percentile position (0.0 = lowest, 1.0 = highest) of current embedding
  8. Determines outlier status based on percentile thresholds:
     • Flags as outlier if percentile is below threshold_percentile (e.g., bottom 5%)
     • Flags as outlier if percentile is above (1 - threshold_percentile) (e.g., top 5%)
     • Detects both extreme low and extreme high deviations from the norm
  9. Returns three outputs:
     • is_outlier: Boolean flag indicating if the current embedding is an outlier
     • percentile: Float value (0.0-1.0) representing where the embedding ranks among historical data
     • warming_up: Boolean flag indicating if still in warmup period (always False after warmup)
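
The steps above can be sketched in a few lines of NumPy. This is an illustrative approximation, not the block's actual source: the class name OutlierSketch, the method name step, and details such as whether the current embedding is added to the window before scoring or how ties are ranked are assumptions made for the example.

```python
from collections import deque
from typing import Optional, Sequence, Tuple

import numpy as np


class OutlierSketch:
    """Hypothetical stand-in for the block's internal state (not the real class)."""

    def __init__(
        self,
        threshold_percentile: float = 0.05,
        warmup: int = 3,
        window_size: Optional[int] = 32,
    ) -> None:
        self.threshold_percentile = threshold_percentile
        self.warmup = warmup
        # deque with maxlen drops the oldest item first (FIFO); None means unbounded.
        self.buffer: deque = deque(maxlen=window_size)
        self.samples = 0

    def step(self, embedding: Sequence[float]) -> Tuple[bool, float, bool]:
        """Return (is_outlier, percentile, warming_up) for one embedding."""
        vec = np.asarray(embedding, dtype=float)
        norm = np.linalg.norm(vec)
        if norm > 0:  # zero vectors are left as-is
            vec = vec / norm

        self.samples += 1
        # Assumption: the current embedding joins the window before scoring.
        self.buffer.append(vec)
        if self.samples < self.warmup:
            return False, 0.5, True

        history = np.stack(list(self.buffer))
        mean = history.mean(axis=0)
        mean_norm = np.linalg.norm(mean)
        mu = mean / mean_norm if mean_norm > 0 else mean  # mean direction

        scores = history @ mu        # alignment of every stored embedding
        current = float(vec @ mu)    # alignment of the current embedding
        # Empirical percentile: share of stored scores strictly below the current one.
        percentile = float(np.mean(scores < current))

        t = self.threshold_percentile
        is_outlier = percentile < t or percentile > 1.0 - t
        return is_outlier, percentile, False
```

Feeding the sketch a run of similar vectors followed by one pointing in a very different direction yields percentile 0.0 and is_outlier=True for the last one, because its alignment score is lower than every score in the window.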

The block uses von Mises-Fisher distribution analysis, which is designed for directional data on a hypersphere (unit vectors). This makes it well-suited for high-dimensional embeddings where direction matters more than magnitude. The sliding window lets the statistical model adapt to recent trends, while the percentile-based detection identifies embeddings that differ unusually from the historical pattern. Low percentiles indicate embeddings that are less aligned with the mean direction than most of the history. High percentiles indicate embeddings that are more closely aligned with it than most of the history. Both tails are flagged as outliers.
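
To make the two vMF parameters concrete, here is a minimal sketch of how mu and kappa can be estimated from a set of unit vectors. It uses a widely used closed-form approximation for kappa based on the mean resultant length; whether the block uses this estimator or a different one is not stated in this page, and the function name fit_vmf is hypothetical.

```python
import numpy as np


def fit_vmf(unit_vectors):
    """Estimate vMF mean direction (mu) and concentration (kappa).

    Assumes the rows of `unit_vectors` are already normalized to length 1.
    """
    x = np.asarray(unit_vectors, dtype=float)
    n, d = x.shape

    resultant = x.sum(axis=0)
    r_norm = np.linalg.norm(resultant)
    if r_norm == 0:
        # Directions cancel out exactly: no preferred direction, no concentration.
        return np.zeros(d), 0.0

    mu = resultant / r_norm
    # Mean resultant length in [0, 1]; clamp just below 1 to avoid dividing by zero.
    r_bar = min(r_norm / n, 1.0 - 1e-12)
    # Closed-form approximation: kappa ~= r_bar * (d - r_bar^2) / (1 - r_bar^2).
    kappa = r_bar * (d - r_bar**2) / (1.0 - r_bar**2)
    return mu, float(kappa)
```

A tightly clustered set of directions gives a mean resultant length close to 1 and therefore a large kappa, while directions spread across the sphere give a small kappa.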

Common Use Cases

  • Anomaly Detection: Detect unusual images, objects, or patterns that deviate from normal data (e.g., identify unusual product variations, detect anomalous behavior, flag unexpected patterns)
  • Quality Control: Identify defective or unusual items in manufacturing or production (e.g., detect product defects, identify quality issues, flag manufacturing anomalies)
  • Data Sampling: Identify interesting or unusual data points for manual review or further analysis (e.g., sample unusual images for labeling, identify edge cases for model improvement, select interesting data for analysis)
  • Change Detection: Detect when data patterns change significantly from historical norms (e.g., detect scene changes, identify pattern shifts, flag significant variations)
  • Model Monitoring: Monitor model performance by detecting when embeddings deviate from the training distribution (e.g., detect distribution shift, identify out-of-distribution data, monitor model drift)
  • Content Filtering: Identify unusual or inappropriate content that differs from expected patterns (e.g., detect unusual content, flag inappropriate material, identify content anomalies)

Connecting to Other Blocks

This block receives embeddings and produces is_outlier, percentile, and warming_up outputs:

  • After embedding model blocks (CLIP, Perception Encoder, etc.) to detect anomalies in the embeddings they produce
  • After classification or detection blocks with embeddings to identify unusual predictions (e.g., identify unusual detections, flag anomalous classifications, detect outlier predictions)
  • Before logic blocks like Continue If to make decisions based on outlier detection (e.g., continue if outlier detected, filter based on outlier status, trigger actions on anomalies)
  • Before notification blocks to alert on outlier detection (e.g., alert on anomalies, notify about unusual data)
  • Before data storage blocks to record outlier information (e.g., log outlier data, store anomaly statistics, record unusual data points)
  • In quality control pipelines where outlier detection is part of quality assurance (e.g., filter outliers in quality pipelines, identify issues in production workflows, detect problems in processing chains)

Requirements

This block requires embeddings as input (typically from embedding model blocks like CLIP or Perception Encoder). The block maintains internal state across workflow executions, accumulating a sliding window of historical embeddings. During the warmup period (set by the warmup parameter), no outliers are identified and the block returns is_outlier=False and percentile=0.5. After warmup, the block uses at least warmup embeddings (up to window_size embeddings) to establish statistical baselines. The threshold_percentile parameter (0.0-1.0) controls sensitivity: lower values (e.g., 0.01) detect only extreme outliers, while higher values (e.g., 0.1) detect more moderate deviations. The block works best when all embeddings come from the same embedding model, and threshold_percentile may need tuning to match the expected variation in your data.
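
As a rough guide to choosing threshold_percentile, the sketch below applies the two-sided rule described above to a set of evenly spread percentile values. Because both tails are flagged, roughly 2 x threshold_percentile of such values end up marked as outliers. The helper names are hypothetical, and real percentiles will not be perfectly evenly spread.

```python
from typing import Iterable


def flags_outlier(percentile: float, threshold_percentile: float) -> bool:
    """Two-sided rule: flag values in the bottom or top tail."""
    return (
        percentile < threshold_percentile
        or percentile > 1.0 - threshold_percentile
    )


def flagged_fraction(percentiles: Iterable[float], threshold_percentile: float) -> float:
    """Share of the given percentile values that the rule would flag."""
    values = list(percentiles)
    flagged = sum(flags_outlier(p, threshold_percentile) for p in values)
    return flagged / len(values)


# 200 evenly spread percentile values strictly inside (0, 1).
evenly_spread = [(i + 0.5) / 200 for i in range(200)]

for threshold in (0.01, 0.05, 0.1):
    share = flagged_fraction(evenly_spread, threshold)
    print(f"threshold_percentile={threshold}: {share:.2f} of samples flagged")
```

With the default of 0.05, about one sample in ten is flagged once the window is full, which is worth keeping in mind when the output drives notifications.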

Type identifier

Use the following identifier in the step "type" field: roboflow_core/identify_outliers@v1 to add the block as a step in your workflow.

Properties

Name Type Description Refs
name str Unique name of step in workflows.
threshold_percentile float Percentile threshold for outlier detection, range 0.0-1.0. Embeddings below this percentile or above (1 - threshold_percentile) are flagged as outliers. Lower values (e.g., 0.01) are very strict and detect only extreme outliers. Higher values (e.g., 0.1) are more sensitive and detect more moderate deviations. Default 0.05 means bottom 5% and top 5% are outliers. Adjust based on expected variation in your data.
warmup int Number of initial data points required before outlier detection begins. During warmup, no outliers are identified (is_outlier=False) to allow baseline establishment. Must be at least 2 for statistical analysis. Typical range: 3-100 samples. Higher values provide more stable baselines but delay outlier detection. Lower values enable faster detection but may be less accurate initially.
window_size int Maximum number of historical embeddings to maintain in the sliding window. The block keeps the most recent window_size embeddings for statistical comparison. When exceeded, oldest embeddings are removed (FIFO). Larger windows provide more stable statistics but adapt more slowly to distribution changes. Smaller windows adapt faster but may be less stable. Set to None for an unlimited window (uses all historical data). Typical range: 10-100 embeddings.

The Refs column indicates whether the property can be parametrised with dynamic values available at workflow runtime. See Bindings for more info.

Available Connections

Compatible Blocks

Check what blocks you can connect to Identify Outliers in version v1.

Input and Output Bindings

The available connections depend on the block's binding kinds. Check what binding kinds Identify Outliers in version v1 has.

Bindings
  • input

    • embedding (embedding): Embedding vector representing the current data point's features. Typically from embedding models like CLIP or Perception Encoder. The embedding is normalized to unit length for directional statistical analysis using von Mises-Fisher distribution. Must be a numerical vector of any dimension.
    • threshold_percentile (float_zero_to_one): Percentile threshold for outlier detection, range 0.0-1.0. Embeddings below this percentile or above (1 - threshold_percentile) are flagged as outliers. Lower values (e.g., 0.01) are very strict and detect only extreme outliers. Higher values (e.g., 0.1) are more sensitive and detect more moderate deviations. Default 0.05 means bottom 5% and top 5% are outliers. Adjust based on expected variation in your data.
    • warmup (integer): Number of initial data points required before outlier detection begins. During warmup, no outliers are identified (is_outlier=False) to allow baseline establishment. Must be at least 2 for statistical analysis. Typical range: 3-100 samples. Higher values provide more stable baselines but delay outlier detection. Lower values enable faster detection but may be less accurate initially.
    • window_size (integer): Maximum number of historical embeddings to maintain in the sliding window. The block keeps the most recent window_size embeddings for statistical comparison. When exceeded, oldest embeddings are removed (FIFO). Larger windows provide more stable statistics but adapt more slowly to distribution changes. Smaller windows adapt faster but may be less stable. Set to None for an unlimited window (uses all historical data). Typical range: 10-100 embeddings.
  • output

Example JSON definition of step Identify Outliers in version v1
{
    "name": "<your_step_name_here>",
    "type": "roboflow_core/identify_outliers@v1",
    "embedding": "$steps.clip.embedding",
    "threshold_percentile": 0.05,
    "warmup": 3,
    "window_size": 32
}
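
For context, the step definition can be placed inside a complete workflow definition. The sketch below builds one as a plain Python dictionary and prints it as JSON. Everything outside the Identify Outliers step is an assumption made for illustration: the input name image, the step names clip and outliers, the CLIP step's type identifier and its data field, and the output field names should all be checked against the documentation of the blocks you actually use.

```python
import json

# The step documented on this page, with a concrete (hypothetical) name.
identify_outliers_step = {
    "name": "outliers",
    "type": "roboflow_core/identify_outliers@v1",
    "embedding": "$steps.clip.embedding",
    "threshold_percentile": 0.05,
    "warmup": 3,
    "window_size": 32,
}

workflow_definition = {
    "version": "1.0",
    "inputs": [
        {"type": "WorkflowImage", "name": "image"},
    ],
    "steps": [
        # Assumed embedding step; verify its type identifier and field names.
        {
            "name": "clip",
            "type": "roboflow_core/clip@v1",
            "data": "$inputs.image",
        },
        identify_outliers_step,
    ],
    "outputs": [
        {
            "type": "JsonField",
            "name": "is_outlier",
            "selector": "$steps.outliers.is_outlier",
        },
        {
            "type": "JsonField",
            "name": "percentile",
            "selector": "$steps.outliers.percentile",
        },
        {
            "type": "JsonField",
            "name": "warming_up",
            "selector": "$steps.outliers.warming_up",
        },
    ],
}

print(json.dumps(workflow_definition, indent=2))
```

Exposing warming_up alongside is_outlier lets downstream consumers ignore results produced before the baseline has been established.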