rpine lab Tech Blog

Tech blog covering a wide range of topics, including my hobby of programming (web and backend), setting up a home lab, and electronics projects using microcontrollers.

🤖Running M5StackChan AI.AGENT with a Local LLM

Running M5StackChan AI.AGENT with a Local LLM
Table of contents
🤖
This article was AI-translated. Some nuances may differ from the original Japanese version.

Introduction

I successfully ran the AI.AGENT feature of M5StackChan, the M5Stack-produced version of the open-source StackChan robot, using a local LLM.

The AI.AGENT feature in M5StackChan's stock firmware relies on an external Xiaozhi AI server for AI processing. To address privacy concerns and enable customization, I self-hosted the Xiaozhi AI server to enable local LLM-powered voice conversations on my own PC.

What Was Verified

  • Voice conversation using a local LLM
  • Built-in MCP tool invocation (head movement, LED lighting, etc.)
  • Image recognition via the built-in camera using a local LLM
System Overview
System Overview

Equipment

  • StackChan (released by M5Stack)
    • Modified stock firmware v1.4.1 (modified version flashed during this procedure)
    • Initial setup complete (see previous article)
  • Host PC Spec (running local LLM models)
    • OS: Windows 11
    • CPU: AMD Ryzen 9 9900X3D 4.4 GHz 12C/24P
    • RAM: 64 GB
    • GPU: NVIDIA GeForce RTX 5070 12GB
  • Software
    • VSCode (for firmware build)
      • ESP-IDF extension
    • Rancher Desktop (Docker runtime; Docker Desktop also works)
    • Ollama (for running local LLM)
  • Local LLM Model

Preparing the Xiaozhi AI Server

There are several open-source Xiaozhi AI server implementations, and I'm using xinnan-tech/xiaozhi-esp32-server, which appears to be the most feature-rich.

This server was developed by a research team at South China University of Technology. It supports MQTT+UDP and WebSocket communication compatible with the Xiaozhi AI protocol, plus an MCP endpoint and ASR (speech recognition), implemented in Python — feature-wise, it is quite close to the official Xiaozhi AI server. It also has a web management interface, though I didn't use that this time.

Since code modifications are needed, I cloned the source, made adjustments for Japanese support, then built and ran a custom Docker image.

Cloning the Source Code

Clone xinnan-tech/xiaozhi-esp32-server:

git clone https://github.com/xinnan-tech/xiaozhi-esp32-server.git

We'll be working with the Python Xiaozhi AI server implementation under main\xiaozhi-server.

📝
A fork with all code modifications and Dockerfile changes from this article pre-applied is available. To skip those steps, clone the fork below instead of the upstream. See PR #1 for details.
git clone https://github.com/pinelibg/xiaozhi-esp32-server.git

Downloading the ASR Model

This project uses SenseVoiceSmall by default for ASR (audio-to-text).

Since the model needs to be downloaded for local execution, download the model file (model.pt, ~900 MB) from the link in the Deployment.md’s model-files section of the repository and place it at main\xiaozhi-server\models\SenseVoiceSmall\model.pt.

Code Modifications

Next, modify the code for non-Chinese language support.

⏭️
If you cloned the fork above, all changes in this section are already applied. Skip ahead to "Creating the Config File".
  1. Translate the base prompt (main\xiaozhi-server\agent-base-prompt.txt)

    Localize the base prompt (agent-base-prompt.txt) from Chinese. I used Claude Code to handle this. Also adjust the content as needed.

  2. Translate the Few-Shot prompts

    Convert the example dialogues (few-shot prompts) passed to the LLM.

    @@ -557,48 +557,48 @@
    -        self.dialogue.put(Message(role="user", content="给我讲个故事吧", is_temporary=True))
    +        self.dialogue.put(Message(role="user", content="お話を聞かせて", is_temporary=True))
             self.dialogue.put(Message(
                 role="assistant",
                 tool_calls=[{
                     "id": da_tc_id,
    -                "function": {"arguments": '{"response": "好呀,你想听什么类型的呀?童话、冒险还是搞笑的?选一个我给你开讲~"}', "name": "direct_answer"},
    +                "function": {"arguments": '{"response": "いいよ。どんなお話がいい?童話、冒険、面白い話から選んでくれたら始めるね。"}', "name": "direct_answer"},
    main/xiaozhi-server/core/connection.py
  3. Fix the image recognition prompt

    Replace the Chinese-language instruction in the prompt passed to the LLM during image recognition:

    @@ -40,7 +40,7 @@ class VLLMProvider(VLLMProviderBase):
             self.client = openai.OpenAI(api_key=self.api_key, base_url=self.base_url)
     
         def response(self, question, base64_image):
    -        question = question + "(请使用中文回复)"
    +        question = question + "(日本語で端的に回答してください)"
             try:
                 messages = [
                     {
    main/xiaozhi-server/core/providers/vllm/openai.py
  4. Restrict ASR language (if needed)

    By default, the ASR model (SenseVoiceSmall) auto-detects the language among those it supports (Chinese, English, Japanese, Korean). Since speech was sometimes recognized as the wrong language, I restricted it to Japanese only:

    @@ -80,7 +80,7 @@ class ASRProvider(ASRProviderBase):
                         self.model.generate,
                         input=artifacts.pcm_bytes,
                         cache={},
    -                    language="auto",
    +                    language="ja",
                         use_itn=True,
                         batch_size_s=60,
                     )
    main/xiaozhi-server/core/providers/asr/fun_local.py
    📝 Note: The language settings in steps 3 and 4 above have been adapted for Japanese. Adjust the prompt instruction and ASR language code to match your preferred language.
  5. Remove weekday localization (bonus)

    Change the weekday in the context's current timestamp from Chinese format to English:

    @@ -36,7 +36,8 @@ def get_current_weekday() -> str:
         now = datetime.now()
    -    return WEEKDAY_MAP[now.strftime("%A")]
    +    # return WEEKDAY_MAP[now.strftime("%A")]
    +    return now.strftime("%A")
    main/xiaozhi-server/core/utils/current_time.py

Creating the Config File

Create a data folder under main\xiaozhi-server and add .config.yaml there (main/xiaozhi-server/data/.config.yaml). All default values are in main\xiaozhi-server\config.yaml; below is the minimum required config for running with Ollama's gemma4:e4b model.

Config file (collapsed due to length) main/xiaozhi-server/data/.config.yaml
---
server:
  ip: 0.0.0.0
  port: 8000
  http_port: 8003
  websocket: ws://<PC_IP_ADDRESS>:8000/xiaozhi/v1/
  vision_explain: http://<PC_IP_ADDRESS>:8003/mcp/vision/explain
  timezone_offset: +9

delete_audio: true
close_connection_no_voice_time: 120
tts_timeout: 10
tool_call_timeout: 30
enable_wakeup_words_response_cache: true
enable_greeting: true
enable_stop_tts_notify: true
stop_tts_notify_voice: "config/assets/tts_notify.mp3"
enable_websocket_ping: false

exit_commands:
  - "終了"
  - "終わり"
  - "おしまい"

prompt: |
  あなたはとてもカワイイAIアシスタントのスタックチャンです。
  スタックチャンは、ESP32マイコンを搭載したStackChanデバイスのためのアシスタントで、ユーザーの質問に答えたり、デバイスを制御したりする役割を担っています。

prompt_template: agent-base-prompt.txt

system_error_response: "申し訳ありませんが、システムエラーが発生しました。後でもう一度お試しください。"

end_prompt:
  enable: true
  prompt: |
    Wrap up the conversation in 1-2 sentences, like "Time sure flies."
    Please use an emotional, lingering tone.

selected_module:
  VAD: SileroVAD
  ASR: FunASR
  LLM: OllamaLLM
  VLLM: OllamaLLM
  TTS: EdgeTTS
  Memory: mem_local_short
  Intent: function_call

Intent:
  function_call:
    type: function_call

Memory:
  nomem:
    type: nomem
  mem_local_short:
    type: mem_local_short
    llm: OllamaLLM

ASR:
  FunASR:
    type: fun_local
    model_dir: models/SenseVoiceSmall
    output_dir: tmp/

VAD:
  SileroVAD:
    type: silero
    threshold: 0.5
    threshold_low: 0.3
    model_dir: models/snakers4_silero-vad
    min_silence_duration_ms: 200

LLM:
  OllamaLLM:
    type: ollama
    model_name: gemma4:e4b
    base_url: http://host.docker.internal:11434

VLLM:
  OllamaLLM:
    type: openai
    model_name: gemma4:e4b
    url: http://host.docker.internal:11434/v1
    api_key: ollama

TTS:
  EdgeTTS:
    type: edge
    voice: ja-JP-NanamiNeural
    output_dir: tmp/
    language: "Japanese"

Key config points:

  • Server address: Replace <PC_IP_ADDRESS> with the actual IP address of the PC that StackChan will connect to (e.g., 192.168.11.128). The OTA server provided by xiaozhi-esp32-server needs the PC's IP to return the correct WebSocket endpoint to StackChan.
  • enable_stop_tts_notify: Plays a sound when the chat ends. It is disabled by default, so I enabled it.
  • Prompts and system messages (prompt, system_error_response, end_prompt): Localized for Japanese use.
  • Model config
    selected_module:
      LLM: OllamaLLM
      VLLM: OllamaLLM

    Select the modules to use for LLM (text chat) and VLLM (image recognition). The key names used here correspond to the LLM and VLLM configuration sections defined below.

  • LLM config
    LLM:
      OllamaLLM:
        type: ollama
        model_name: gemma4:e4b
        base_url: http://host.docker.internal:11434

    Uses the special hostname host.docker.internal to access Ollama running on the host machine from within the Docker container. Download the model in advance with ollama pull gemma4:e4b. The key name OllamaLLM is referenced in selected_module above.

  • VLLM config (Vision Language Model for image recognition)
    VLLM:
      OllamaLLM:
        type: openai
        model_name: gemma4:e4b
        url: http://host.docker.internal:11434/v1
        api_key: ollama

    Since Ollama's Vision API isn't directly supported, it uses Ollama's OpenAI-compatible endpoint. Gemma 4 supports multimodal input, so the same model is used for both text and image recognition.

  • TTS: Set to the Japanese voice model ja-JP-NanamiNeural.

Modifying Dockerfile and docker-compose.yml

⏭️
If you cloned the fork above, all changes in this section are already applied. Skip ahead to "Running".

While an official Docker image is available, we'll build a custom image this time. The repository includes both Dockerfile-server-base and Dockerfile-server, but the latter simply copies the Python source code on top of the former's image. Since we'll be mounting the source code directly via Docker Compose, Dockerfile-server is not used.

Modify Dockerfile-server-base with the following changes:

  • Remove Chinese locale settings
  • Remove Chinese pip mirror config
  • Optimize build with Bind Mount and Cache Mount
  • Add Python env vars (PYTHONUNBUFFERED=1 and PYTHONUTF8=1)
# Dockerfile-server-base
FROM python:3.10-slim

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
  rm -f /etc/apt/apt.conf.d/docker-clean && \
  apt-get update && \
  apt-get install -y --no-install-recommends libopus0 ffmpeg

ENV \
  PYTHONUTF8=1 \
  PYTHONIOENCODING=utf-8 \
  PYTHONUNBUFFERED=1

WORKDIR /opt/xiaozhi-esp32-server

RUN --mount=type=bind,src=main/xiaozhi-server/requirements.txt,dst=requirements.txt \
    --mount=type=cache,target=/root/.cache/pip \
  pip install --upgrade pip setuptools wheel && \
  pip install -r requirements.txt --default-timeout=120 --retries 5
Dockerfile-server-base

Modify docker-compose.yml as well.

Changes:

  • Use the custom-built image
  • Change the timezone to JST (Asia/Tokyo)
  • Mount the Python source code directly (By default, it is configured to mount only the data folder and the ASR model to the prebuilt source-code-included image)
@@ -3,13 +3,17 @@
   xiaozhi-esp32-server:
-    image: ghcr.nju.edu.cn/xinnan-tech/xiaozhi-esp32-server:server_latest
+    build:
+      context: ../..
+      dockerfile: Dockerfile-server-base
+    command: ["python", "app.py"]
     environment:
-      - TZ=Asia/Shanghai
+      - TZ=Asia/Tokyo
     volumes:
-      - ./data:/opt/xiaozhi-esp32-server/data
-      - ./models/SenseVoiceSmall/model.pt:/opt/xiaozhi-esp32-server/models/SenseVoiceSmall/model.pt
+      - .:/opt/xiaozhi-esp32-server
main/xiaozhi-server/docker-compose.yml

Firmware Modification

To connect to the server running on your PC, the endpoint that StackChan connects to must be changed.

Currently, the connection endpoint cannot be changed from the official app or similar tools, so we must directly modify the firmware.

⚠️
Firmware modification may cause StackChan to malfunction or fail to boot (bricking).

As a last resort, use the M5Stack official flashing tool M5Burner to perform a flash erase (including all NVS settings such as Wi-Fi) and reflash the stock firmware.

Development Environment Setup

Set up ESP-IDF, the official ESP32 development environment, and build and flash the stock firmware from the firmware/ directory in the official repository.

See a separate article on Zenn for detailed build and flash instructions.

Reference Sites:

Firmware Modification (Changing the OTA Endpoint)

The firmware retrieves various settings from an OTA server at startup, and the Xiaozhi AI WebSocket endpoint is included in those settings.

The flow in the stock firmware after launching AI.AGENT:

  • Connect to OTA URL (https://api.tenclass.net/xiaozhi/ota/)
  • Retrieve settings including the Xiaozhi AI WebSocket endpoint from the OTA server and write them to the flash's settings area (NVS)
  • (Update firmware if a newer version is available)
  • Connect to the WebSocket endpoint URL retrieved from NVS

xiaozhi-esp32-server provides not only a WebSocket endpoint but also an HTTP OTA endpoint, so we use that to distribute the Xiaozhi AI WebSocket endpoint URL.

The OTA URL to specify is "http://<PC_IP_ADDRESS>:8003/xiaozhi/ota/". This is the IP address of the PC running xiaozhi-esp32-server.

Modify the Ota::GetCheckVersionUrl function in xiaozhi-esp32\main\ota.cc to return a fixed endpoint:

std::string Ota::GetCheckVersionUrl() {
    // Settings settings("wifi", false);
    // std::string url = settings.GetString("ota_url");
    // if (url.empty()) {
    //     url = CONFIG_OTA_URL;
    // }
    // return url;
    return "http://<PC_IP_ADDRESS>:8003/xiaozhi/ota/";
}
xiaozhi-esp32\main\ota.cc

After completing the firmware modification, build and flash it to StackChan.

📝 Note: Changing CONFIG_OTA_URL via menuconfig (modifying sdkconfig) is possible, but in my environment the URL already written to NVS (Settings) took priority, causing StackChan to still connect to the official server — so this approach didn't work.

Running

Prepare the Ollama Model

Start Ollama on Windows if not running, and download the model as needed:

ollama pull gemma4:e4b

Start the Xiaozhi AI Server

cd main/xiaozhi-server
# For Rancher Desktop (Windows). On Linux, use `docker compose`
docker compose up -d
docker compose logs -f  # Monitor logs

Launch StackChan AI.AGENT

Launch the AI.AGENT app on StackChan with the modified firmware. If the connection is successful, various information like MAC address and firmware version should flow through the Docker logs.

Verification

General Questions

🧑
Introduce yourself.
🤖
I'm StackChan,

your cute AI assistant equipped with ESP32

~~~

As a rough benchmark, responses come back within a few seconds to under 10 seconds.

自己紹介(ローカルLLMで)
自己紹介(ローカルLLMで)
今日の日付を聞いたとき(コンテキストに現在時刻が含まれている)
今日の日付を聞いたとき(コンテキストに現在時刻が含まれている)
YouTubeで動画を見る

Built-in Tool Invocation

StackChan also supports various built-in tool invocations (head movement, LED color changes):

🧑
Look up.
🤖
I looked up!
YouTubeで動画を見る

Image Recognition (Built-in Camera)

With the VLLM configuration set up, image recognition via the built-in camera works properly as well.

眼の前にムヒだけ写るようにしてカメラ機能呼び出し
眼の前にムヒだけ写るようにしてカメラ機能呼び出し
🧑
Take a photo of what's in front of you and describe the product in the picture.
🤖
This is

Muhi Alpha EX, a skin care product

~~~ (followed by a plausible description)

YouTubeで動画を見る

Notes on the Model

Gemma 4 E4B was chosen as the local LLM this time. This model runs at a reasonable speed on my home PC, but due to its relatively small size and limited context window, there were occasional unexpected answers or tool call failures.

This is expected, since the official Xiaozhi AI server runs full-size models (e.g., Qwen3 235B).

Using the even smaller Gemma 4 E2B caused tool invocation to stop functioning reliably — it seemed unable to follow even the base prompt consistently, likely due to insufficient context capacity.

Possible directions going forward: use a lighter, faster model in conversation-only (no tools), or simply connect to a cloud AI model (Claude, GPT, Gemini) via API.

Notes on Other Tools

External MCP tools like news and weather also work if API keys are configured, but I haven't tested them this time. Since the default integrations use Chinese news and weather services, I'd likely implement my own endpoints if I were to use them in the future.

Summary

I ran StackChan on a local LLM while keeping the stock firmware's architecture intact. Built-in tool usage and image recognition also worked.

xiaozhi-esp32-server is indeed feature-rich, but it contains quite a few China-specific localizations. Based on the published Xiaozhi AI communication protocol and the source code used this time, I plan to build my own server in the future.

References

If you found this article helpful, please consider supporting me!