scene_service.ingest.perception_vlm¶

VLM-based object detection: take an RGB frame, ask an OpenAI-compatible vision model to enumerate visible objects with approximate image coordinates, parse the JSON response, and (when depth is available) reproject to world coordinates.

This lives inside scene/ deliberately — system/ services should not reverse-depend on a service-layer perception package. The detector calls the same OpenAI-compatible endpoint that pilot already uses (VLM_BASE_URL / VLM_API_KEY / VLM_MODEL), so no new credentials.

v1 keeps the implementation simple: small prompt, JSON-only response, no streaming, no caching. Failures are logged and the perception loop keeps running on the next tick.

Classes

VLMObjectDetector(*, rgb_fetcher, ...[, ...])

Runs the RGB-poll → VLM-call → Detection-list pipeline as one asyncio task.

class scene_service.ingest.perception_vlm.VLMObjectDetector(*, rgb_fetcher: Callable[[], bytes | None], chassis_pose_fn: Callable[[], tuple[float, float, float, float] | None], on_detections: Callable[[list[Detection]], Awaitable[None]], period_s: float = 3.0, camera_frame_id: str = 'head_front_camera_rgb_optical_frame', intrinsics: _CamIntrinsics | None = None)[source]¶

Bases: object

Runs the RGB-poll → VLM-call → Detection-list pipeline as one asyncio task. Calls back into on_detections with a batch of Detection objects at each successful tick.

The detector reads camera/snapshot via the existing PrimitivePoller machinery (passed in as rgb_fetcher) so we don’t duplicate the atlas connect logic.

async start() → None[source]¶

async stop() → None[source]¶