agents.components.mllm
Module Contents
Classes
MLLM: This component uses multimodal large language models (e.g. Llava) to process text and image data.
API
- class agents.components.mllm.MLLM(*, inputs: List[Union[agents.ros.Topic, agents.ros.FixedInput]], outputs: List[agents.ros.Topic], model_client: Optional[agents.clients.model_base.ModelClient] = None, config: Optional[agents.config.MLLMConfig] = None, db_client: Optional[agents.clients.db_base.DBClient] = None, trigger: Union[agents.ros.Topic, List[agents.ros.Topic], float, agents.ros.Event] = 1.0, component_name: str, **kwargs)
Bases: agents.components.llm.LLM
This component uses multimodal large language models (e.g. Llava) to process text and image data.
- Parameters:
inputs (list[Topic | FixedInput]) – The input topics or fixed inputs for the MLLM component. This should be a list of Topic objects or FixedInput instances, limited to String and Image types.
outputs (list[Topic]) – The output topics for the MLLM component. This should be a list of Topic objects. String, Detections2D and PointsOfInterest2D types are handled automatically.
model_client (Optional[ModelClient]) – The model client for the MLLM component. This should be an instance of ModelClient. Optional if enable_local_model is set to True in the config.
config (Optional[MLLMConfig]) – Optional configuration for the MLLM component. This should be an instance of MLLMConfig. If not provided, defaults to MLLMConfig().
trigger (Union[Topic, list[Topic], float, Event]) – The trigger for the MLLM component. This can be a single Topic object, a list of Topic objects, an Event object, or a float value for a timed component. Defaults to 1.0.
component_name (str) – The name of the MLLM component. This should be a string and defaults to “mllm_component”.
Example usage:
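from agents.clients.model_base import ModelClient
from agents.components.mllm import MLLM
from agents.config import MLLMConfig
from agents.models import TransformersMLLM  # import path assumed
from agents.ros import Topic

# Input topics: a text query and an image stream
text0 = Topic(name="text0", msg_type="String")
image0 = Topic(name="image0", msg_type="Image")
# Output topic for the generated text
text1 = Topic(name="text1", msg_type="String")
config = MLLMConfig()
model = TransformersMLLM(name='idefics')
model_client = ModelClient(model=model)
mllm_component = MLLM(inputs=[text0, image0],
                      outputs=[text1],
                      model_client=model_client,
                      config=config,
                      component_name='mllm_component')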
Example usage with local model:
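from agents.components.mllm import MLLM
from agents.config import MLLMConfig
from agents.ros import Topic

text0 = Topic(name="text0", msg_type="String")
image0 = Topic(name="image0", msg_type="Image")
text1 = Topic(name="text1", msg_type="String")
# No model client is needed when the local model is enabled
config = MLLMConfig(enable_local_model=True)
mllm_component = MLLM(inputs=[text0, image0],
                      outputs=[text1],
                      config=config,
                      component_name='local_vlm')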
- set_task(task: Literal["general", "pointing", "affordance", "trajectory", "grounding"]) → None
Set a task for the MLLM component. This is useful when using a multimodal LLM that has been trained on specific tasks. This method can be invoked as an action, in response to an event, to change the task at runtime. For an example, check out RoboBrain2.0, available on RoboML.
- Parameters:
task – The task to set. Must be one of “general”, “pointing”, “affordance”, “trajectory”, or “grounding”.
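Because set_task accepts only a fixed set of task names, it can help to check a name before passing it on, for example when it comes from user input or a configuration file. The sketch below is a minimal illustration using only the standard library; validate_task is a hypothetical helper, not part of the agents package, and the task names are copied from the signature above.

```python
from typing import Literal, get_args

# Task names copied from the set_task signature above.
Task = Literal["general", "pointing", "affordance", "trajectory", "grounding"]


def validate_task(task: str) -> str:
    """Hypothetical helper: return `task` unchanged if it is a valid task name."""
    allowed = get_args(Task)  # tuple of the literal values
    if task not in allowed:
        raise ValueError(f"Unknown task {task!r}; expected one of {allowed}")
    return task
```

With an MLLM instance such as mllm_component from the examples above, a call might then look like mllm_component.set_task(validate_task("pointing")).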
- describe(topic_name: str, query: str = 'Describe what you see in the image.', timeout: float = 0.5) → str
Capture a frame from an image topic and describe it.
Grabs the latest frame from the specified image input topic, runs VLM inference with the given query, and publishes the text result to the component’s String output topics.
- Parameters:
topic_name (str) – Name of the image input topic to capture from.
query (str) – Question or instruction about the image. Defaults to 'Describe what you see in the image.'.
timeout (float) – Seconds to wait for a frame. Defaults to 0.5.
- Returns:
True if successful, False otherwise.
- Return type:
bool
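The capture, infer, publish flow that describe documents can be pictured with a small stand-alone sketch. This is not the library's implementation: LatestFrame, describe_sketch, infer, and publish are hypothetical names introduced for illustration, and the sketch only mirrors the documented behaviour of waiting up to timeout seconds for a frame and reporting success as a boolean.

```python
import threading
from typing import Any, Callable, Optional


class LatestFrame:
    """Hypothetical stand-in for an image topic that keeps only the newest frame."""

    def __init__(self) -> None:
        self._frame: Optional[Any] = None
        self._ready = threading.Event()
        self._lock = threading.Lock()

    def put(self, frame: Any) -> None:
        # Overwrite any older frame and signal that one is available.
        with self._lock:
            self._frame = frame
        self._ready.set()

    def get(self, timeout: float) -> Optional[Any]:
        # Wait up to `timeout` seconds for a first frame; None if none arrived.
        if not self._ready.wait(timeout):
            return None
        with self._lock:
            return self._frame


def describe_sketch(
    source: LatestFrame,
    infer: Callable[[Any, str], str],
    publish: Callable[[str], None],
    query: str = "Describe what you see in the image.",
    timeout: float = 0.5,
) -> bool:
    """Grab the latest frame, run inference on it, and publish the text result."""
    frame = source.get(timeout)
    if frame is None:
        return False  # no frame arrived within the timeout
    publish(infer(frame, query))
    return True
```

Here infer stands in for the VLM call and publish for writing to the component's String output topics; in the real component both are handled internally.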