Local Models

EMOS can run all AI components entirely on-device using built-in local models. No Ollama, no RoboML, no cloud API — just set enable_local_model=True in the component config and you’re running inference locally.

This is useful for:

  • Offline robots that operate without network access

  • Edge deployment where latency to a remote server is unacceptable

  • Quick prototyping when you don’t want to set up a model serving platform

Note

Models are auto-downloaded from HuggingFace on first use. Subsequent runs load from cache.

Dependencies

Local models require one additional pip package depending on the component type:

  • LLM / VLM: pip install llama-cpp-python

  • STT / TTS: pip install sherpa-onnx

These are pre-installed in EMOS Docker containers.
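For a bare-metal install running the full voice-and-vision pipeline later in this page, both packages are needed. Note that the stock llama-cpp-python wheel is CPU-only; if you intend to set device_local_model="cuda", that library's own install docs describe requesting a CUDA-enabled build at install time. A sketch (the CMAKE_ARGS flag is taken from llama-cpp-python's documentation, not from EMOS):

```shell
# CPU-only backends (sufficient for every example below)
pip install llama-cpp-python sherpa-onnx

# Optional: rebuild llama-cpp-python with CUDA support
# (flag per llama-cpp-python's install docs; requires the CUDA toolkit)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```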

Local LLM

The simplest possible EMOS agent — an LLM running entirely on-device with no model client:

from agents.components import LLM
from agents.config import LLMConfig
from agents.ros import Topic, Launcher

config = LLMConfig(
    enable_local_model=True,
    device_local_model="cpu",  # or "cuda" for GPU
    ncpu_local_model=4,
)

query = Topic(name="user_query", msg_type="String")
response = Topic(name="response", msg_type="String")

llm = LLM(
    inputs=[query],
    outputs=[response],
    config=config,
    trigger=query,
    component_name="local_brain",
)

launcher = Launcher()
launcher.add_pkg(components=[llm])
launcher.bringup()

The default model is Qwen3 0.6B (GGUF format). To use a different model, set local_model_path to any HuggingFace GGUF repo ID or a local file path:

config = LLMConfig(
    enable_local_model=True,
    local_model_path="Qwen/Qwen3-1.7B-GGUF",  # larger model
)
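Since local_model_path also accepts a plain filesystem path, you can skip the HuggingFace download entirely — useful on air-gapped robots. A minimal sketch; the path below is a placeholder for wherever you have copied a GGUF file:

```python
from agents.config import LLMConfig

# Hypothetical on-disk location -- substitute the actual path to your GGUF file
config = LLMConfig(
    enable_local_model=True,
    local_model_path="/opt/models/qwen3-0.6b-q4_k_m.gguf",
)
```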

Local VLM

A vision-language model that processes both text and images on-device:

from agents.components import VLM
from agents.config import VLMConfig
from agents.ros import Topic, Launcher

config = VLMConfig(enable_local_model=True)

text_in = Topic(name="text_query", msg_type="String")
image_in = Topic(name="image_raw", msg_type="Image")
text_out = Topic(name="response", msg_type="String")

vlm = VLM(
    inputs=[text_in, image_in],
    outputs=[text_out],
    config=config,
    trigger=text_in,
    component_name="local_vision",
)

launcher = Launcher()
launcher.add_pkg(components=[vlm])
launcher.bringup()

The default model is Moondream2 (GGUF format).

Warning

Streaming output (stream=True) is not supported with local VLM models. The component will return the complete response once inference finishes.

Local Speech-to-Text

Convert spoken audio to text using an on-device Whisper model:

from agents.components import SpeechToText
from agents.config import SpeechToTextConfig
from agents.ros import Topic, Launcher

config = SpeechToTextConfig(
    enable_local_model=True,
    enable_vad=True,  # voice activity detection
)

audio_in = Topic(name="audio0", msg_type="Audio")
text_out = Topic(name="transcription", msg_type="String")

stt = SpeechToText(
    inputs=[audio_in],
    outputs=[text_out],
    config=config,
    trigger=audio_in,
    component_name="local_stt",
)

launcher = Launcher()
launcher.add_pkg(components=[stt])
launcher.bringup()

The default model is Whisper tiny.en (via sherpa-onnx). For other languages or larger models, see the sherpa-onnx pretrained models and set local_model_path accordingly.
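Pointing the component at an alternative model is a one-line config change. A sketch, assuming the same local_model_path field as the LLM examples; the model ID below is a placeholder, not a real repo — pick an actual entry from the sherpa-onnx pretrained models list:

```python
from agents.config import SpeechToTextConfig

# Placeholder ID -- replace with a real sherpa-onnx pretrained model
# (HuggingFace repo ID or local path), e.g. a multilingual Whisper variant
config = SpeechToTextConfig(
    enable_local_model=True,
    enable_vad=True,
    local_model_path="<sherpa-onnx-model-repo-or-local-path>",
)
```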

Warning

Streaming output (stream=True) is not supported with local STT models. Use a WebSocket client (e.g. RoboMLWSClient) if you need streaming transcription.

Local Text-to-Speech

Synthesize speech on-device and play it through the robot’s speakers:

from agents.components import TextToSpeech
from agents.config import TextToSpeechConfig
from agents.ros import Topic, Launcher

config = TextToSpeechConfig(
    enable_local_model=True,
    play_on_device=True,  # play audio on the robot's speakers
)

text_in = Topic(name="text_input", msg_type="String")

tts = TextToSpeech(
    inputs=[text_in],
    outputs=[],
    config=config,
    trigger=text_in,
    component_name="local_tts",
)

launcher = Launcher()
launcher.add_pkg(components=[tts])
launcher.bringup()

The default model is Kokoro EN (via sherpa-onnx).

Warning

Streaming output (stream=True) is not supported with local TTS models.

Complete Example: Local Conversational Agent

Here is a full conversational agent — speech-to-text, vision-language model, and text-to-speech — running entirely on local models. No Ollama, no RoboML, no cloud API.

Fully Local Conversational Agent
from agents.components import VLM, SpeechToText, TextToSpeech
from agents.config import SpeechToTextConfig, VLMConfig, TextToSpeechConfig
from agents.ros import Topic, Launcher

# --- Speech-to-Text (Whisper tiny.en via sherpa-onnx) ---
audio_in = Topic(name="audio0", msg_type="Audio")
text_query = Topic(name="text0", msg_type="String")

stt_config = SpeechToTextConfig(
    enable_local_model=True,
    enable_vad=True,
)

speech_to_text = SpeechToText(
    inputs=[audio_in],
    outputs=[text_query],
    config=stt_config,
    trigger=audio_in,
    component_name="speech_to_text",
)

# --- VLM (Moondream2 via llama-cpp-python) ---
image_in = Topic(name="image_raw", msg_type="Image")
text_answer = Topic(name="text1", msg_type="String")

vlm_config = VLMConfig(enable_local_model=True)

vlm = VLM(
    inputs=[text_query, image_in],
    outputs=[text_answer],
    config=vlm_config,
    trigger=text_query,
    component_name="vision_brain",
)

# --- Text-to-Speech (Kokoro via sherpa-onnx) ---
tts_config = TextToSpeechConfig(
    enable_local_model=True,
    play_on_device=True,
)

text_to_speech = TextToSpeech(
    inputs=[text_answer],
    outputs=[],
    config=tts_config,
    trigger=text_answer,
    component_name="text_to_speech",
)

# --- Launch ---
launcher = Launcher()
launcher.add_pkg(components=[speech_to_text, vlm, text_to_speech])
launcher.bringup()

This recipe creates the same pipeline as the Conversational Agent tutorial, but runs fully offline. The trade-off is that on-device models are smaller and less capable than hosted alternatives — they work well for simple interactions but may struggle with complex reasoning.

See also