# Status & Fallbacks

**All robots can fail, but smart robots recover.**

EMOS components are **Self-Aware** and **Self-Healing** by design. The Health Status system allows every component to explicitly declare its operational state --- not just "Alive" or "Dead," but *how* it is functioning. When failures are detected, the Fallback system automatically triggers pre-configured recovery strategies, keeping the robot operational without human intervention.

---

## Health Status

The **Health Status** is the heartbeat of an EMOS component. Unlike standard ROS2 nodes, EMOS components differentiate between a math error (Algorithm Failure), a hardware crash (Component Failure), and a missing input (System Failure). These reports are broadcast back to the system to trigger:

* {material-regular}`notifications;1.2em;sd-text-warning` **Alerts:** Notify the operator of specific issues.
* {material-regular}`flash_on;1.2em;sd-text-primary` **Reflexes:** Trigger [Events](events-and-actions.md) to handle the situation.
* {material-regular}`healing;1.2em;sd-text-success` **Self-Healing:** Execute automatic [Fallbacks](#fallback-strategies) to recover the node.

### Status Hierarchy

EMOS defines distinct failure levels to help you pinpoint the root cause of an issue.

- {material-regular}`check_circle;1.5em;sd-text-success` **HEALTHY**

  **"Everything is awesome."** The component executed its main loop successfully and produced valid output.

- {material-regular}`warning;1.5em;sd-text-warning` **ALGORITHM_FAILURE**

  **"I ran, but I couldn't solve it."** The node is healthy, but the logic failed.
  *Examples:* Path planner couldn't find a path; object detector found nothing; optimization solver did not converge.

- {material-regular}`error;1.5em;sd-text-danger` **COMPONENT_FAILURE**

  **"I am broken."** An internal crash or hardware issue occurred within this specific node.
  *Examples:* Memory leak; exception raised in a callback; division by zero.
- {material-regular}`link_off;1.5em;sd-text-primary` **SYSTEM_FAILURE**

  **"I am fine, but my inputs are broken."** The failure is caused by an external dependency.
  *Examples:* Input topic is empty or stale; network is down; disk is full.

### Reporting Status

Every `BaseComponent` has an internal `self.health_status` object. You interact with this object inside your `_execution_step` or callbacks to declare the current state.

#### The Happy Path

Always mark the component as healthy at the end of a successful execution. This resets any previous error counters.

```python
self.health_status.set_healthy()
```

#### Declaring Failures

When things go wrong, be specific. This helps the Fallback system decide whether to *Retry* (Algorithm), *Restart* (Component), or *Wait* (System).

**Algorithm Failure:**

```python
# Optional: list the specific algorithm that failed
self.health_status.set_fail_algorithm(algorithm_names=["A_Star_Planner"])
```

**Component Failure:**

```python
# Report that this component crashed
self.health_status.set_fail_component()

# Or blame a sub-module
self.health_status.set_fail_component(component_names=["Camera_Driver_API"])
```

**System Failure:**

```python
# Report missing data on specific topics
self.health_status.set_fail_system(topic_names=["/camera/rgb", "/odom"])
```

### Automatic Broadcasting

You do not need to manually publish the status message. EMOS automatically broadcasts the status at the start of every execution step. This ensures a consistent "Heartbeat" frequency, even if your algorithm blocks or hangs (up to the threading limits).

:::{tip}
If you need to trigger an immediate alert from a deeply nested callback or a separate thread, you *can* force a publish:

`self.health_status_publisher.publish(self.health_status())`
:::

### Implementation Pattern

Here is a robust pattern for writing an execution step using Health Status. This pattern enables the **Self-Healing** capabilities of EMOS.

```python
def _execution_step(self):
    try:
        # 1. Check Pre-conditions (System Level)
        if self.input_image is None:
            self.get_logger().warn("Waiting for video stream...")
            # Note: self.input_image is None here, so we cannot read its
            # name attribute. Report the missing topic explicitly instead.
            self.health_status.set_fail_system(topic_names=["/camera/rgb"])
            return

        # 2. Run Logic
        result = self.ai_model.detect(self.input_image)

        # 3. Check Logic Output (Algorithm Level)
        if result is None or len(result.detections) == 0:
            self.health_status.set_fail_algorithm(algorithm_names=["yolo_detector"])
            return

        # 4. Success!
        self.publish_result(result)
        self.health_status.set_healthy()

    except ConnectionError:
        # 5. Handle Crashes (Component Level)
        # This will trigger the 'on_component_fail' fallback (e.g., Restart)
        self.get_logger().error("Camera hardware disconnected!")
        self.health_status.set_fail_component(component_names=["hardware_interface"])
```

---

## Fallback Strategies

Fallbacks are the **Self-Healing Mechanism** of an EMOS component. They define the specific set of [Actions](events-and-actions.md#actions) to execute automatically when a failure is detected in the component's Health Status.

Instead of crashing or freezing when an error occurs, a Component can be configured to attempt intelligent recovery strategies:

* {material-regular}`swap_horiz;1.2em;sd-text-warning` *Algorithm stuck?* $\rightarrow$ **Switch** to a simpler backup.
* {material-regular}`restart_alt;1.2em;sd-text-danger` *Driver disconnected?* $\rightarrow$ **Re-initialize** the hardware.
* {material-regular}`autorenew;1.2em;sd-text-primary` *Sensor timeout?* $\rightarrow$ **Restart** the node.

```{figure} /_static/images/diagrams/fallbacks_dark.png
:class: dark-only
:alt: fig-fallbacks
:align: center
```

```{figure} /_static/images/diagrams/fallbacks_light.png
:class: light-only
:alt: fig-fallbacks
:align: center

The Self-Healing Loop
```

### The Recovery Hierarchy

When a component reports a failure, EMOS doesn't just panic. It checks for a registered fallback strategy in a specific order of priority.
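Conceptually, the lookup works like a priority dispatch: the handler dedicated to the reported failure type wins, and a generic catch-all covers everything else. The following is a minimal, purely illustrative sketch of that idea in plain Python --- the names mirror the handler hooks documented below (`on_system_fail`, `on_component_fail`, `on_algorithm_fail`, `on_fail`) but this is *not* the actual ros_sugar implementation:

```python
# Illustrative sketch only -- NOT the ros_sugar implementation.
# Maps a reported failure status to its dedicated handler hook.
_SPECIFIC_HANDLER = {
    "SYSTEM_FAILURE": "on_system_fail",
    "COMPONENT_FAILURE": "on_component_fail",
    "ALGORITHM_FAILURE": "on_algorithm_fail",
}


def resolve_fallback(status: str, registered: dict):
    """Return the strategy registered for `status`, falling back to the
    generic 'on_fail' catch-all; None if nothing matches."""
    specific = _SPECIFIC_HANDLER.get(status)
    for hook in (specific, "on_fail"):
        if hook is not None and hook in registered:
            return registered[hook]
    return None
```

So a component that only registered a catch-all still recovers: `resolve_fallback("ALGORITHM_FAILURE", {"on_fail": "stop_robot"})` selects the `on_fail` strategy, while a component with a dedicated handler gets the more specific response.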
This allows you to define granular responses for different types of errors.

- {material-regular}`link_off;1.5em;sd-text-primary` **1. System Failure** (`on_system_fail`)

  **The Context is Broken.** External failures like missing input topics or a full disk.
  *Example Strategy:* Wait for data, or restart the data pipeline.

- {material-regular}`error;1.5em;sd-text-danger` **2. Component Failure** (`on_component_fail`)

  **The Node is Broken.** Internal crashes or hardware disconnects.
  *Example Strategy:* Restart the component lifecycle or re-initialize drivers.

- {material-regular}`warning;1.5em;sd-text-warning` **3. Algorithm Failure** (`on_algorithm_fail`)

  **The Logic is Broken.** The code ran but couldn't solve the problem (e.g., path not found).
  *Example Strategy:* Reconfigure parameters (looser tolerance) or switch algorithms.

- {material-regular}`help_center;1.5em;sd-text-secondary` **4. Catch-All** (`on_fail`)

  **Generic Safety Net.** If no specific handler is found above, this fallback is executed.
  *Example Strategy:* Log an error or stop the robot.

### Recovery Strategies

A Fallback isn't just a single function call. It is a robust policy defined by **Actions** and **Retries**.

#### The Persistent Retry (Single Action)

*Try, try again.* The system executes the action repeatedly until it returns `True` (success) or `max_retries` is reached.

```python
# Try to restart the driver up to 3 times
driver.on_component_fail(fallback=restart(component=driver), max_retries=3)
```

#### The Escalation Ladder (List of Actions)

*If at first you don't succeed, try something stronger.* You can define a sequence of actions. If the first one fails (after its retries), the system moves to the next one.

1. **Clear Costmaps** (low cost, fast)
2. **Reconfigure Planner** (medium cost)
3. **Restart Planner Node** (high cost, slow)

```python
# Tiered recovery for a navigation planner
planner.on_algorithm_fail(
    fallback=[
        Action(method=planner.clear_costmaps),      # Step 1
        Action(method=planner.switch_to_fallback),  # Step 2
        restart(component=planner)                  # Step 3
    ],
    max_retries=1  # Try each step once before escalating
)
```

#### The "Give Up" State

If all strategies fail (all retries of all actions exhausted), the component enters the **Give Up** state and executes the `on_giveup` action. This is the "End of Line", usually used to park the robot safely or alert a human.

### How to Implement Fallbacks

#### Method A: In Your Recipe (Recommended)

You can configure fallbacks externally without touching the component code. This makes your system modular and reusable.

```python
from ros_sugar.actions import restart, log
from ros_sugar.core import BaseComponent

# 1. Define component
lidar = BaseComponent(component_name='lidar_driver')

# 2. Attach Fallbacks
# If it crashes, restart it (unlimited retries)
lidar.on_component_fail(fallback=restart(component=lidar))

# If data is missing (System), just log it and wait
lidar.on_system_fail(fallback=log(msg="Waiting for Lidar data..."))

# If all else fails, scream
lidar.on_giveup(fallback=log(msg="LIDAR IS DEAD. STOPPING ROBOT."))
```

#### Method B: In Component Class (Advanced)

For tightly coupled recovery logic (like re-handshaking a specific serial protocol), you can define custom fallback methods inside your class.

:::{tip}
Use the `@component_fallback` decorator. It ensures the method is only called when the component is in a valid state to handle it.
:::

```python
from ros_sugar.core import Action, BaseComponent, component_fallback


class MyDriver(BaseComponent):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Register the custom fallback internally
        self.on_system_fail(
            fallback=Action(self.try_reconnect),
            max_retries=3
        )

    def _execution_step(self):
        try:
            self.hw.read()
            self.health_status.set_healthy()
        except ConnectionError:
            # This trigger starts the fallback loop!
            self.health_status.set_fail_system()

    @component_fallback
    def try_reconnect(self) -> bool:
        """Custom recovery logic."""
        self.get_logger().info("Attempting handshake...")
        if self.hw.connect():
            return True  # Recovery succeeded!
        return False  # Recovery failed, will retry...
```
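To make the retry-and-escalation policy above concrete, here is a minimal, self-contained sketch of the control flow (Persistent Retry, then Escalation Ladder, then Give Up). All names here (`run_fallback`, the demo actions) are hypothetical and purely illustrative; this is *not* the ros_sugar scheduler:

```python
# Illustrative sketch of the fallback policy -- NOT the ros_sugar scheduler.
from typing import Callable, List


def run_fallback(actions: List[Callable[[], bool]],
                 max_retries: int,
                 on_giveup: Callable[[], None]) -> bool:
    """Try each action up to `max_retries` times; escalate to the next
    action when one keeps failing. If every action is exhausted, enter
    the 'Give Up' state and fire the on_giveup handler."""
    for action in actions:
        for _ in range(max_retries):
            if action():
                return True  # Recovered: stop escalating
    on_giveup()  # End of line: all strategies exhausted
    return False


# Demo: a two-rung escalation ladder
attempts = []

def clear_costmaps() -> bool:
    attempts.append("clear")
    return False  # always fails in this demo

def restart_node() -> bool:
    attempts.append("restart")
    return True   # recovery succeeds

recovered = run_fallback([clear_costmaps, restart_node],
                         max_retries=2,
                         on_giveup=lambda: attempts.append("give_up"))
# attempts -> ["clear", "clear", "restart"]; recovered -> True
```

The cheap action is retried to exhaustion before the system escalates to the expensive one, and `on_giveup` fires only when the whole ladder fails, matching the "Give Up" state described above.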