Config
InferenceSDKDeprecationWarning
Bases: Warning
Warning category used to flag deprecated features in the Inference SDK.
Source code in inference_sdk/config.py, lines 106–109
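Since the page only tells us this is a Warning subclass, a minimal sketch of how such a category is defined and emitted may help; the class body and the deprecation message below are assumptions for illustration, not the SDK's actual code:

```python
import warnings

# Reconstructed from the "Bases: Warning" note above; the real class
# lives in inference_sdk/config.py.
class InferenceSDKDeprecationWarning(Warning):
    """Warning category for deprecated features in the Inference SDK."""

# Emitting the warning from a deprecated code path (hypothetical message):
warnings.warn(
    "this parameter is deprecated and will be removed in a future release",
    InferenceSDKDeprecationWarning,
    stacklevel=2,
)
```

Because it is a distinct Warning subclass, users can silence or escalate the whole category, e.g. `warnings.filterwarnings("ignore", category=InferenceSDKDeprecationWarning)`.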
RemoteProcessingTimeCollector
Thread-safe collector for GPU processing times from remote execution responses.
A single instance is shared by all worker threads handling one request; each entry records a model_id alongside its processing time.
Uses threading.Lock (not asyncio.Lock) because add() is only called from synchronous worker threads (ThreadPoolExecutor). The middleware reads via drain() after await call_next() returns, at which point all worker threads have completed, so there is no contention in the async context.
Source code in inference_sdk/config.py, lines 12–54
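The locking pattern described above can be sketched as follows; this is a minimal reconstruction from the docstrings, not the SDK's actual implementation (entry layout and method bodies are assumptions):

```python
import threading
from typing import List, Tuple

class RemoteProcessingTimeCollector:
    """Sketch: thread-safe collector of (model_id, processing_time) entries."""

    def __init__(self) -> None:
        # threading.Lock, not asyncio.Lock: add() runs on ThreadPoolExecutor
        # worker threads, never inside the event loop itself.
        self._lock = threading.Lock()
        self._entries: List[Tuple[str, float]] = []

    def add(self, model_id: str, processing_time: float) -> None:
        # Called from synchronous worker threads while a request is in flight.
        with self._lock:
            self._entries.append((model_id, processing_time))

    def drain(self) -> List[Tuple[str, float]]:
        # Atomically hand back all entries and reset the internal list,
        # so no entry is ever returned twice or lost.
        with self._lock:
            entries = self._entries
            self._entries = []
            return entries
```

Swapping the list reference under the lock (rather than copying and clearing) keeps drain() O(1) apart from the handover.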
drain()
Atomically return all entries and clear the internal list.
Source code in inference_sdk/config.py, lines 32–37
summarize(max_detail_bytes=4096)
Atomically drain entries and return (total_time, entries_json_or_none).
Returns the total processing time and a JSON string of the individual entries. If the serialized JSON exceeds max_detail_bytes, the detail string is omitted and None is returned in its place.
Source code in inference_sdk/config.py, lines 43–54
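The drain-then-cap behavior can be sketched in a self-contained form; the JSON entry keys and internal layout here are assumptions chosen for illustration, not the SDK's actual schema:

```python
import json
import threading
from typing import List, Optional, Tuple

class RemoteProcessingTimeCollector:
    """Sketch focusing on summarize(); internals are assumed, not actual."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: List[Tuple[str, float]] = []

    def add(self, model_id: str, processing_time: float) -> None:
        with self._lock:
            self._entries.append((model_id, processing_time))

    def summarize(self, max_detail_bytes: int = 4096) -> Tuple[float, Optional[str]]:
        # Drain under the lock so the total and the detail string
        # describe the same atomic snapshot of entries.
        with self._lock:
            entries, self._entries = self._entries, []
        total = sum(t for _, t in entries)
        detail = json.dumps(
            [{"model_id": m, "processing_time": t} for m, t in entries]
        )
        # Drop the per-entry detail if it would blow the byte budget;
        # the total is always returned.
        if len(detail.encode("utf-8")) > max_detail_bytes:
            return total, None
        return total, detail
```

A middleware would call `total, detail = collector.summarize()` after `await call_next()` returns, attaching the total to a response header and the detail only when it fits.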