Create a Wyoming server for Home Assistant, Part 2 - STT - wyoming-funasr on arm64

Published Feb 5, 2026

A Wyoming protocol server for the FunASR speech-to-text system (wyoming-funasr, arm64).

FunASR: A Fundamental End-to-End Speech Recognition Toolkit.

To make an STT server work with Home Assistant, the server needs to speak the Wyoming protocol, which Home Assistant uses to talk to voice services such as speech-to-text.

Step 1. Development Environment Setup
- Create Python virtual environment
Step 2. Install
Step 3. Download and test a model (example: paraformer-zh)
- SenseVoice - Speech Recognition (Non-streaming)
Step 4. FunASR + Wyoming STT full server
- Install wyoming
Strategies to reduce latency

Step 1. Development Environment Setup

Create Python virtual environment

mkdir -p /funasr-wyoming

cd /funasr-wyoming

python3 -m venv venv

source venv/bin/activate

python --version

Python 3.11.2

pip3 install wyoming==1.8.0

pip3 install funasr==1.3.0

pip3 install torch

pip3 install torchaudio

apt list --installed

(venv) root@raspberrypi:/funasr-wyoming# pip3 show funasr
Name: funasr
Version: 1.3.0
Summary: FunASR: A Fundamental End-to-End Speech Recognition Toolkit
Home-page: https://github.com/alibaba-damo-academy/FunASR.git
Author: Speech Lab of Alibaba Group
Author-email: [email protected]
License: The MIT License
Location: /funasr-wyoming/venv/lib/python3.11/site-packages
Requires: editdistance, hydra-core, jaconv, jamo, jieba, kaldiio, librosa, modelscope, oss2, pytorch_wpe, PyYAML, requests, scipy, sentencepiece, soundfile, tensorboardX, torch_complex, tqdm, umap_learn

Requirements

python>=3.8
torch>=1.13
torchaudio

Step 2. Install

(venv) root@raspberrypi:/funasr-wyoming# pip3 --version

pip 23.0.1 from /funasr-wyoming/venv/lib/python3.11/site-packages/pip (python 3.11)

Install torch via PyPI

pip3 install torch==2.1.0  # CPU-only

output

Installing collected packages: mpmath, sympy, networkx, MarkupSafe, fsspec, jinja2, torch
Successfully installed MarkupSafe-3.0.3 fsspec-2026.1.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.6.1 sympy-1.14.0 torch-2.1.0

If ffmpeg is not installed, torchaudio is used to load audio.

pip3 install torchaudio==2.1.0  # CPU-only

output

Successfully installed torchaudio-2.1.0

You will need the wyoming and funasr libraries.

Install FunASR 1.3.0 via PyPI

pip3 install -U funasr==1.3.0

This will pull:

Downloading https://www.piwheels.org/simple/threadpoolctl/threadpoolctl-3.6.0-py3-none-any.whl (18 kB)

Installing collected packages: jieba, jamo, jaconv, crcmod, antlr4-python3-runtime, urllib3, typing_extensions, tqdm, threadpoolctl, six, sentencepiece, PyYAML, pycryptodome, pycparser, protobuf, platformdirs, packaging, numpy, msgpack, llvmlite, joblib, jmespath, idna, filelock, editdistance, decorator, charset_normalizer, certifi, audioread, torch_complex, tensorboardX, soxr, scipy, requests, pytorch_wpe, omegaconf, numba, lazy_loader, kaldiio, cffi, soundfile, scikit-learn, pooch, modelscope, hydra-core, cryptography, pynndescent, librosa, aliyun-python-sdk-core, umap_learn, aliyun-python-sdk-kms, oss2, funasr

output

Successfully installed PyYAML-6.0.3 aliyun-python-sdk-core-2.16.0 aliyun-python-sdk-kms-2.16.5 antlr4-python3-runtime-4.9.3 audioread-3.1.0 certifi-2026.1.4 cffi-2.0.0 charset_normalizer-3.4.4 crcmod-1.7 cryptography-46.0.3 decorator-5.2.1 editdistance-0.8.1 filelock-3.20.3 funasr-1.3.0 hydra-core-1.3.2 idna-3.11 jaconv-0.4.1 jamo-0.4.1 jieba-0.42.1 jmespath-0.10.0 joblib-1.5.3 kaldiio-2.18.1 lazy_loader-0.4 librosa-0.11.0 llvmlite-0.46.0 modelscope-1.34.0 msgpack-1.1.2 numba-0.63.1 numpy-2.3.5 omegaconf-2.3.0 oss2-2.19.1 packaging-26.0 platformdirs-4.5.1 pooch-1.8.2 protobuf-6.33.4 pycparser-3.0 pycryptodome-3.23.0 pynndescent-0.6.0 pytorch_wpe-0.0.1 requests-2.32.5 scikit-learn-1.8.0 scipy-1.17.0 sentencepiece-0.2.1 six-1.17.0 soundfile-0.13.1 soxr-1.0.0 tensorboardX-2.6.4 threadpoolctl-3.6.0 torch_complex-0.4.4 tqdm-4.67.1 typing_extensions-4.15.0 umap_learn-0.5.11 urllib3-2.6.3

Details: https://pypi.org/project/funasr

sudo apt install ffmpeg

output

ffmpeg is already the newest version (8:5.1.8-0+deb12u1+rpt1).


Verify installation

python - << 'EOF'
from funasr import AutoModel
print("FunASR imported OK")
EOF

output

FunASR imported OK
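Because the steps above pin specific versions, it can help to confirm what actually ended up in the virtual environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is illustrative and is not part of FunASR or Wyoming.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["funasr", "wyoming", "torch", "torchaudio"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Run inside the activated venv, this should report the same versions as pip3 show or pip3 list.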

Step 3. Download and test a model (example: paraformer-zh)

test.py

from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",
    model_revision="v2.0.4",
    vad_model="fsmn-vad",
    vad_model_revision="v2.0.4",
    punc_model="ct-punc",
    punc_model_revision="v2.0.4",
)

res = model.generate(input="test.wav")
print(res)

res = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav")
print(res)
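test.py expects a local test.wav. The Paraformer model downloaded in this step has 16k in its name, so a 16 kHz mono recording is a safe choice. As a small sketch (describe_wav is a hypothetical helper, and the standard wave module only reads uncompressed PCM WAV files), the format of a file can be checked before it is passed to the model:

```python
import wave


def describe_wav(path):
    """Return (sample_rate, channels, sample_width_bytes, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return rate, wav.getnchannels(), wav.getsampwidth(), frames / rate
```

For example, a result starting with (16000, 1, 2, ...) for describe_wav("test.wav") indicates 16 kHz, mono, 16-bit audio.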

SenseVoice - Speech Recognition (Non-streaming)

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cpu",  # the Raspberry Pi has no CUDA; the upstream example uses "cuda:0"
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,  # merge short VAD segments, up to merge_length_s seconds
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)

See: https://github.com/modelscope/FunASR?tab=readme-ov-file#sensevoice

python3 test.py

Models are cached in:

/root/.cache/modelscope/hub/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
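The downloaded models are large (model.pt for Paraformer is 944M in the output below), which matters on a Raspberry Pi SD card. A rough sketch for listing what is in the cache and how much space each model takes follows; model_cache_sizes is a hypothetical helper, and the org/model directory layout is assumed from the path shown above.

```python
from pathlib import Path


def model_cache_sizes(root):
    """Return {"<org>/<model>": total bytes} for each model directory under root."""
    root = Path(root)
    sizes = {}
    if not root.is_dir():
        return sizes
    for model_dir in sorted(p for p in root.glob("*/*") if p.is_dir()):
        total = sum(f.stat().st_size for f in model_dir.rglob("*") if f.is_file())
        sizes[model_dir.relative_to(root).as_posix()] = total
    return sizes


if __name__ == "__main__":
    cache_root = Path.home() / ".cache" / "modelscope" / "hub" / "models"
    for name, size in model_cache_sizes(cache_root).items():
        print(f"{size / 1e6:8.1f} MB  {name}")
```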

output

Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch


2026-01-25 08:58:28,492 - modelscope - INFO - Use user-specified model revision: v2.0.4
2026-01-25 08:58:28,595 - modelscope - INFO - Got 11 files, start to download ...
Downloading [fig/res.png]: 100%|███████████████████████████████████████████████████| 192k/192k [00:00<00:00, 386kB/s]
Downloading [am.mvn]: 100%|█████████████████████████████████████████████████████| 10.9k/10.9k [00:00<00:00, 21.7kB/s]
Downloading [example/hotword.txt]: 100%|███████████████████████████████████████████| 7.00/7.00 [00:00<00:00, 11.9B/s]
Downloading [config.yaml]: 100%|████████████████████████████████████████████████| 3.34k/3.34k [00:00<00:00, 5.66kB/s]
Downloading [configuration.json]: 100%|███████████████████████████████████████████████| 478/478 [00:00<00:00, 766B/s]
Downloading [README.md]: 100%|██████████████████████████████████████████████████| 11.3k/11.3k [00:00<00:00, 18.2kB/s]
Downloading [example/asr_example.wav]: 100%|███████████████████████████████████████| 141k/141k [00:00<00:00, 208kB/s]
Downloading [fig/seaco.png]: 100%|█████████████████████████████████████████████████| 167k/167k [00:00<00:00, 296kB/s]
Downloading [tokens.json]: 100%|█████████████████████████████████████████████████| 91.5k/91.5k [00:00<00:00, 165kB/s]
Downloading [seg_dict]: 100%|███████████████████████████████████████████████████| 7.90M/7.90M [00:03<00:00, 2.76MB/s]
Downloading [model.pt]: 100%|█████████████████████████████████████████████████████| 944M/944M [01:31<00:00, 10.8MB/s]
Processing 11 items: 100%|████████████████████████████████████████████████████████| 11.0/11.0 [01:31<00:00, 8.34s/it]
2026-01-25 09:00:00,347 - modelscope - INFO - Download model 'iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' successfully.
WARNING:root:trust_remote_code: False

python3 -c "from funasr import AutoModel; AutoModel(model='paraformer-zh', device='cpu')"

Step 4. FunASR + Wyoming STT full server

Install wyoming

pip3 install wyoming==1.8.0

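output (this capture shows wyoming 1.5.0, apparently from an earlier install; the pip3 list further down reports 1.8.0)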

Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
Collecting wyoming==1.5.0
  Downloading wyoming-1.5.0-py3-none-any.whl (23 kB)
Installing collected packages: wyoming
Successfully installed wyoming-1.5.0

A Wyoming server consists of an AsyncServer and an AsyncEventHandler. The handler processes events such as Describe, Transcribe, AudioStart, AudioChunk, and AudioStop.

python3 server.py


Describe event

Listen for a Describe event and answer it with the service information, which tells Home Assistant that this is an STT service.

AudioStart event

Sent by the Home Assistant client:

{
  "type": "audio.start",
  "rate": 16000,
  "width": 2,
  "channels": 1
}

In the handler, this event is detected with AudioStart.is_type(event.type).
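The three fields announced in audio.start are enough to interpret every byte that follows. As a rough sketch, the byte rate and the duration of a buffer can be derived from them. AudioFormat and parse_audio_start are hypothetical helpers written against the JSON shown above; they are not the wyoming library's own AudioStart class.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    rate: int      # samples per second
    width: int     # bytes per sample
    channels: int  # number of interleaved channels

    @property
    def bytes_per_second(self):
        return self.rate * self.width * self.channels

    def duration(self, num_bytes):
        """Seconds of audio contained in num_bytes of raw PCM."""
        return num_bytes / self.bytes_per_second


def parse_audio_start(message):
    """Build an AudioFormat from the audio.start JSON shown above."""
    fields = json.loads(message)
    if fields.get("type") != "audio.start":
        raise ValueError(f"not an audio.start event: {fields.get('type')!r}")
    return AudioFormat(fields["rate"], fields["width"], fields["channels"])


fmt = parse_audio_start(
    '{"type": "audio.start", "rate": 16000, "width": 2, "channels": 1}'
)
print(fmt.bytes_per_second)  # 32000
print(fmt.duration(95040))   # 2.97
```

The 95040 bytes used here is the size of the first utterance in the server log further down, which reports the same 2.97 s.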

AudioChunk event

The AudioChunk event carries the raw PCM data. Receive each AudioChunk and append it to a buffer.

AudioStop event

The AudioStop event marks the end of the utterance. This is where inference is triggered: the SenseVoice model runs on the buffered audio.
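The AudioStart, AudioChunk, and AudioStop flow described above can be sketched without the wyoming library as a small buffer object plus a PCM decoder. UtteranceBuffer and pcm16_to_floats are hypothetical names used only for illustration, and the sketch assumes 16-bit little-endian mono PCM, matching width 2 and channels 1 in the audio.start example. In a real handler the decoded samples would more likely be a NumPy array passed to the model.

```python
import struct


class UtteranceBuffer:
    """Collects raw PCM between AudioStart and AudioStop (illustrative helper)."""

    def __init__(self):
        self._pcm = bytearray()

    def start(self):
        # AudioStart: drop anything left over from a previous utterance.
        self._pcm.clear()

    def add_chunk(self, chunk):
        # AudioChunk: append the raw bytes; no decoding yet.
        self._pcm.extend(chunk)

    def stop(self):
        # AudioStop: hand back the whole utterance and reset the buffer.
        data = bytes(self._pcm)
        self._pcm.clear()
        return data


def pcm16_to_floats(pcm):
    """Convert little-endian 16-bit PCM to floats in [-1.0, 1.0)."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [s / 32768.0 for s in samples]


buf = UtteranceBuffer()
buf.start()
buf.add_chunk(b"\x00\x40")  # one sample with value 16384
buf.add_chunk(b"\x00\xc0")  # one sample with value -16384
print(pcm16_to_floats(buf.stop()))  # [0.5, -0.5]
```

Decoding only once, at AudioStop, keeps the per-chunk work to a cheap byte append.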

funasr-wyoming# python3 server.py
funasr version: 1.3.0.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
New version is available: 1.3.1.
Please use the command "pip install -U funasr" to upgrade.
2026-02-19 14:15:51.321 - INFO - root - download models from model hub: ms
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/iic/SenseVoiceSmall
2026-02-19 14:15:52.591 - WARNING - root - trust_remote_code: False
2026-02-19 14:15:54.679 - INFO - root - Loading pretrained params from /root/.cache/modelscope/hub/models/iic/SenseVoiceSmall/model.pt
2026-02-19 14:15:54.688 - INFO - root - ckpt: /root/.cache/modelscope/hub/models/iic/SenseVoiceSmall/model.pt
2026-02-19 14:16:10.752 - INFO - root - scope_map: ['module.', 'None']
2026-02-19 14:16:10.871 - INFO - root - excludes: None
2026-02-19 14:16:11.383 - INFO - root - Loading ckpt: /root/.cache/modelscope/hub/models/iic/SenseVoiceSmall/model.pt, status: <All keys matched successfully>
2026-02-19 14:16:11.461 - INFO - root - Building VAD model.
2026-02-19 14:16:11.461 - INFO - root - download models from model hub: ms
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch
2026-02-19 14:16:13.308 - WARNING - root - trust_remote_code: False
2026-02-19 14:16:13.652 - INFO - root - Loading pretrained params from /root/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
2026-02-19 14:16:13.653 - INFO - root - ckpt: /root/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
2026-02-19 14:16:13.959 - INFO - root - scope_map: ['module.', 'None']
2026-02-19 14:16:13.959 - INFO - root - excludes: None
2026-02-19 14:16:13.962 - INFO - root - Loading ckpt: /root/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt, status: <All keys matched successfully>
2026-02-19 14:16:14.020 - INFO - wyoming-funasr-stt-server - Wyoming STT Server started on port 10850
2026-02-19 14:16:19.596 - INFO - wyoming-funasr-stt-server - Transcribe received
2026-02-19 14:16:19.597 - INFO - wyoming-funasr-stt-server - AudioStart received
2026-02-19 14:16:22.530 - INFO - wyoming-funasr-stt-server - AudioStop received. Processing...  95040 bytes of audio...
2026-02-19 14:16:22.557 - Audio length: 47520 samples, 2.97 s
2026-02-19 14:16:22.559 - INFO - wyoming-funasr-stt-server - Audio length: 47520 samples, 2.97 s
rtf_avg: 0.578: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.71s/it]
rtf_avg: 1.793: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.72s/it]
rtf_avg: 0.581, time_speech:  2.970, time_escape: 1.727: 100%|████████████████████████████████████████████| 1/1 [00:01<00:00,  1.73s/it]
2026-02-19 14:16:26.056 - Filtering out emotion/event tags that SenseVoice may output
2026-02-19 14:16:26.056 - INFO - wyoming-funasr-stt-server - Raw recognition result (may contain emotion/event tags) - Result: <|zh|><|NEUTRAL|><|Speech|><|withitn|>换货。
2026-02-19 14:16:26.057 - INFO - wyoming-funasr-stt-server - Result: 换货。
2026-02-19 14:16:26.057 - Recognition result: 换货。
2026-02-19 14:16:35.711 - INFO - wyoming-funasr-stt-server - Transcribe received
2026-02-19 14:16:35.711 - INFO - wyoming-funasr-stt-server - AudioStart received
2026-02-19 14:16:38.141 - INFO - wyoming-funasr-stt-server - AudioStop received. Processing...  78400 bytes of audio...
2026-02-19 14:16:38.143 - Audio length: 39200 samples, 2.45 s
2026-02-19 14:16:38.143 - INFO - wyoming-funasr-stt-server - Audio length: 39200 samples, 2.45 s
rtf_avg: 0.028: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14.81it/s]
rtf_avg: 1.693: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.53s/it]
rtf_avg: 0.624, time_speech:  2.450, time_escape: 1.530: 100%|████████████████████████████████████████████| 1/1 [00:01<00:00,  1.53s/it]
2026-02-19 14:16:39.742 - Filtering out emotion/event tags that SenseVoice may output
2026-02-19 14:16:39.742 - INFO - wyoming-funasr-stt-server - Raw recognition result (may contain emotion/event tags) - Result: <|zh|><|NEUTRAL|><|Speech|><|withitn|>关火。
2026-02-19 14:16:39.742 - INFO - wyoming-funasr-stt-server - Result: 关火。
2026-02-19 14:16:39.742 - Recognition result: 关火。
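The log shows the raw SenseVoice result with its <|...|> markers and then the cleaned text that is returned to Home Assistant. One simple way to get from the first to the second is a regular expression, sketched below. strip_tags is a hypothetical helper, not necessarily what server.py does; the rich_transcription_postprocess function imported in the SenseVoice example is FunASR's own post-processing step.

```python
import re

# Matches SenseVoice-style markers such as <|zh|>, <|NEUTRAL|>, <|Speech|>.
TAG_PATTERN = re.compile(r"<\|[^|>]*\|>")


def strip_tags(text):
    """Remove <|...|> markers and any surrounding whitespace."""
    return TAG_PATTERN.sub("", text).strip()


print(strip_tags("<|zh|><|NEUTRAL|><|Speech|><|withitn|>换货。"))  # 换货。
```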

pip3 list

Package                Version
---------------------- --------
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms  2.16.5
antlr4-python3-runtime 4.9.3
audioread              3.1.0
certifi                2026.1.4
cffi                   2.0.0
charset-normalizer     3.4.4
crcmod                 1.7
cryptography           46.0.3
decorator              5.2.1
editdistance           0.8.1
filelock               3.20.3
fsspec                 2026.1.0
funasr                 1.3.0
hydra-core             1.3.2
idna                   3.11
ifaddr                 0.2.0
jaconv                 0.4.1
jamo                   0.4.1
jieba                  0.42.1
Jinja2                 3.1.6
jmespath               0.10.0
joblib                 1.5.3
kaldiio                2.18.1
lazy_loader            0.4
librosa                0.11.0
llvmlite               0.46.0
MarkupSafe             3.0.3
modelscope             1.34.0
mpmath                 1.3.0
msgpack                1.1.2
networkx               3.6.1
numba                  0.63.1
numpy                  1.26.4
omegaconf              2.3.0
oss2                   2.19.1
packaging              26.0
pip                    23.0.1
platformdirs           4.5.1
pooch                  1.8.2
protobuf               6.33.4
pycparser              3.0
pycryptodome           3.23.0
pynndescent            0.6.0
pytorch-wpe            0.0.1
PyYAML                 6.0.3
requests               2.32.5
scikit-learn           1.8.0
scipy                  1.17.0
sentencepiece          0.2.1
setuptools             66.1.1
six                    1.17.0
soundfile              0.13.1
soxr                   1.0.0
sympy                  1.14.0
tensorboardX           2.6.4
threadpoolctl          3.6.0
torch                  2.1.0
torch_complex          0.4.4
torchaudio             2.1.0
tqdm                   4.67.1
typing_extensions      4.15.0
umap-learn             0.5.11
urllib3                2.6.3
wyoming                1.8.0
zeroconf               0.148.0

Strategies to reduce latency

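1. Use smaller models

Smaller models generally decode faster on the Pi's CPU. Check the FunASR model list for smaller variants (for example paraformer-zh-small or paraformer-zh-medium, if available).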
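When comparing models, the real-time factor (RTF) is a convenient yardstick: processing time divided by audio duration, where values below 1.0 mean faster than real time. The sketch below reproduces the figures from the server log, assuming that time_escape in the FunASR progress line is the elapsed processing time and time_speech is the audio duration. real_time_factor is a hypothetical helper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Values taken from the two requests in the server log above.
print(round(real_time_factor(1.727, 2.970), 3))  # 0.581
print(round(real_time_factor(1.530, 2.450), 3))  # 0.624
```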

2. VAD pre-filtering

Skip the silent chunks before and after the speech so that only the speech segment reaches the model: skip silence → speech → skip silence.
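To illustrate the idea of dropping silence before inference, here is a crude peak-based trimmer. It is only a stand-in for a real VAD such as the fsmn-vad model loaded by the server. The names frame_peak and trim_silence and the threshold value are assumptions for illustration, and the audio is assumed to be 16 kHz, 16-bit little-endian mono PCM.

```python
import struct


def frame_peak(frame):
    """Largest absolute 16-bit sample value in a little-endian PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0
    return max(abs(s) for s in struct.unpack(f"<{count}h", frame[: count * 2]))


def trim_silence(pcm, frame_bytes=640, threshold=500):
    """Drop leading and trailing frames whose peak is below threshold.

    frame_bytes=640 is 20 ms at 16 kHz, 16-bit mono; threshold is an
    arbitrary illustrative value that would need tuning per microphone.
    """
    frames = [pcm[i : i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
    voiced = [i for i, f in enumerate(frames) if frame_peak(f) >= threshold]
    if not voiced:
        return b""
    return b"".join(frames[voiced[0] : voiced[-1] + 1])
```

Shorter input means less work for the model, so trimming reduces latency roughly in proportion to the silence removed. Pauses inside the speech are kept on purpose, so that the recognizer still sees the utterance in one piece.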
