Create Wyoming server for Home Assistant, Part 2 - STT - wyoming-funasr arm64
A Wyoming protocol server for the FunASR speech-to-text system.
FunASR: A Fundamental End-to-End Speech Recognition Toolkit.
Step 1. Create Python virtual environment
mkdir -p /funasr-wyoming
cd /funasr-wyoming
python3 -m venv venv
source venv/bin/activate
python --version
Python 3.11.2
apt list --installed
(venv) root@raspberrypi:/funasr-wyoming# pip3 show funasr
Name: funasr
Version: 1.3.0
Summary: FunASR: A Fundamental End-to-End Speech Recognition Toolkit
Home-page: https://github.com/alibaba-damo-academy/FunASR.git
Author: Speech Lab of Alibaba Group
Author-email: [email protected]
License: The MIT License
Location: /funasr-wyoming/venv/lib/python3.11/site-packages
Requires: editdistance, hydra-core, jaconv, jamo, jieba, kaldiio, librosa, modelscope, oss2, pytorch_wpe, PyYAML, requests, scipy, sentencepiece, soundfile, tensorboardX, torch_complex, tqdm, umap_learn
Requirements
python>=3.8
torch>=1.13
torchaudio
Step 2. Install
(venv) root@raspberrypi:/funasr-wyoming# pip3 --version
pip 23.0.1 from /funasr-wyoming/venv/lib/python3.11/site-packages/pip (python 3.11)
Install torch via PyPI
pip3 install torch==2.1.0   # CPU-only
output
Installing collected packages: mpmath, sympy, networkx, MarkupSafe, fsspec, jinja2, torch
Successfully installed MarkupSafe-3.0.3 fsspec-2026.1.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.6.1 sympy-1.14.0 torch-2.1.0
If ffmpeg is not installed, torchaudio is used to load audio.
pip3 install torchaudio==2.1.0   # CPU-only
output
Successfully installed torchaudio-2.1.0
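Optionally verify the CPU-only wheels before continuing, using the same heredoc style as the check further below:
python - << 'EOF'
import torch
import torchaudio
print(torch.__version__, torchaudio.__version__)     # expect 2.1.0 2.1.0
print("CUDA available:", torch.cuda.is_available())  # expect False on a Pi
EOF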
You will need the wyoming and funasr libraries.
Install FunASR 1.3.0 via PyPI
pip3 install -U funasr==1.3.0
This will pull:
Downloading https://www.piwheels.org/simple/threadpoolctl/threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: jieba, jamo, jaconv, crcmod, antlr4-python3-runtime, urllib3, typing_extensions, tqdm, threadpoolctl, six, sentencepiece, PyYAML, pycryptodome, pycparser, protobuf, platformdirs, packaging, numpy, msgpack, llvmlite, joblib, jmespath, idna, filelock, editdistance, decorator, charset_normalizer, certifi, audioread, torch_complex, tensorboardX, soxr, scipy, requests, pytorch_wpe, omegaconf, numba, lazy_loader, kaldiio, cffi, soundfile, scikit-learn, pooch, modelscope, hydra-core, cryptography, pynndescent, librosa, aliyun-python-sdk-core, umap_learn, aliyun-python-sdk-kms, oss2, funasr
output
Successfully installed PyYAML-6.0.3 aliyun-python-sdk-core-2.16.0 aliyun-python-sdk-kms-2.16.5 antlr4-python3-runtime-4.9.3 audioread-3.1.0 certifi-2026.1.4 cffi-2.0.0 charset_normalizer-3.4.4 crcmod-1.7 cryptography-46.0.3 decorator-5.2.1 editdistance-0.8.1 filelock-3.20.3 funasr-1.3.0 hydra-core-1.3.2 idna-3.11 jaconv-0.4.1 jamo-0.4.1 jieba-0.42.1 jmespath-0.10.0 joblib-1.5.3 kaldiio-2.18.1 lazy_loader-0.4 librosa-0.11.0 llvmlite-0.46.0 modelscope-1.34.0 msgpack-1.1.2 numba-0.63.1 numpy-2.3.5 omegaconf-2.3.0 oss2-2.19.1 packaging-26.0 platformdirs-4.5.1 pooch-1.8.2 protobuf-6.33.4 pycparser-3.0 pycryptodome-3.23.0 pynndescent-0.6.0 pytorch_wpe-0.0.1 requests-2.32.5 scikit-learn-1.8.0 scipy-1.17.0 sentencepiece-0.2.1 six-1.17.0 soundfile-0.13.1 soxr-1.0.0 tensorboardX-2.6.4 threadpoolctl-3.6.0 torch_complex-0.4.4 tqdm-4.67.1 typing_extensions-4.15.0 umap_learn-0.5.11 urllib3-2.6.3
Details: https://pypi.org/project/funasr
sudo apt install ffmpeg
output
ffmpeg is already the newest version (8:5.1.8-0+deb12u1+rpt1).
Verify installation
python - << 'EOF'
from funasr import AutoModel
print("FunASR imported OK")
EOF
output
FunASR imported OK
Step 3. Download and test a model (example: paraformer-zh)
test.py
from funasr import AutoModel

# Load Paraformer ASR together with the VAD and punctuation models
model = AutoModel(
    model="paraformer-zh",
    model_revision="v2.0.4",
    vad_model="fsmn-vad",
    vad_model_revision="v2.0.4",
    punc_model="ct-punc",
    punc_model_revision="v2.0.4",
)

# Transcribe a local file, then a remote example
res = model.generate(input="test.wav")
print(res)

res = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav")
print(res)
python3 test.py
Models are cached in:
/root/.cache/modelscope/hub/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
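If /root/.cache is short on space for the ~944 MB model, modelscope should honor the MODELSCOPE_CACHE environment variable (untested here; the path below is only an example):
export MODELSCOPE_CACHE=/mnt/usb/modelscope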
output
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
2026-01-25 08:58:28,492 - modelscope - INFO - Use user-specified model revision: v2.0.4
2026-01-25 08:58:28,595 - modelscope - INFO - Got 11 files, start to download ...
Downloading [fig/res.png]: 100%|███████████████████████████████████████████████████| 192k/192k [00:00<00:00, 386kB/s]
Downloading [am.mvn]: 100%|█████████████████████████████████████████████████████| 10.9k/10.9k [00:00<00:00, 21.7kB/s]
Downloading [example/hotword.txt]: 100%|███████████████████████████████████████████| 7.00/7.00 [00:00<00:00, 11.9B/s]
Downloading [config.yaml]: 100%|████████████████████████████████████████████████| 3.34k/3.34k [00:00<00:00, 5.66kB/s]
Downloading [configuration.json]: 100%|███████████████████████████████████████████████| 478/478 [00:00<00:00, 766B/s]
Downloading [README.md]: 100%|██████████████████████████████████████████████████| 11.3k/11.3k [00:00<00:00, 18.2kB/s]
Downloading [example/asr_example.wav]: 100%|███████████████████████████████████████| 141k/141k [00:00<00:00, 208kB/s]
Downloading [fig/seaco.png]: 100%|█████████████████████████████████████████████████| 167k/167k [00:00<00:00, 296kB/s]
Downloading [tokens.json]: 100%|█████████████████████████████████████████████████| 91.5k/91.5k [00:00<00:00, 165kB/s]
Downloading [seg_dict]: 100%|███████████████████████████████████████████████████| 7.90M/7.90M [00:03<00:00, 2.76MB/s]
Downloading [model.pt]: 100%|█████████████████████████████████████████████████████| 944M/944M [01:31<00:00, 10.8MB/s]
Processing 11 items: 100%|████████████████████████████████████████████████████████| 11.0/11.0 [01:31<00:00, 8.34s/it]
2026-01-25 09:00:00,347 - modelscope - INFO - Download model 'iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' successfully.
WARNING:root:trust_remote_code: False
A quicker smoke test from the command line:
python3 -c "from funasr import AutoModel; AutoModel(model='paraformer-zh', device='cpu')"
Step 4. FunASR + Wyoming STT full server
Install wyoming
pip3 install wyoming==1.8.0
output
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
Collecting wyoming==1.8.0
Downloading wyoming-1.8.0-py3-none-any.whl (23 kB)
Installing collected packages: wyoming
Successfully installed wyoming-1.8.0
A Wyoming server consists of an AsyncServer and an AsyncEventHandler. The handler processes events such as Describe, AudioStart, AudioChunk, and AudioStop, each covered below.
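server.py is not reproduced in full here; the sketch below shows the wiring, assuming wyoming 1.8.0 and the paraformer-zh model from Step 3. The class name FunasrEventHandler and the port 10300 are my own choices, not fixed by the protocol:
import asyncio
from functools import partial

from funasr import AutoModel
from wyoming.event import Event
from wyoming.server import AsyncEventHandler, AsyncServer

class FunasrEventHandler(AsyncEventHandler):
    def __init__(self, model, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.model = model
        self.audio = b""   # PCM buffered between AudioStart and AudioStop
        self.rate = 16000  # overwritten by AudioStart

    async def handle_event(self, event: Event) -> bool:
        # Filled in event by event in the sections below
        return True

async def main() -> None:
    model = AutoModel(model="paraformer-zh", device="cpu")
    server = AsyncServer.from_uri("tcp://0.0.0.0:10300")
    # The server creates one handler instance per client connection
    await server.run(partial(FunasrEventHandler, model))

if __name__ == "__main__":
    asyncio.run(main())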
python3 server.py
Describe event
Listen for a Describe event and reply with an Info event so Home Assistant knows this is an STT service.
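A sketch of the Describe branch of handle_event, using the wyoming.info classes; the description and attribution strings are placeholders:
from wyoming.info import AsrModel, AsrProgram, Attribution, Describe, Info

INFO = Info(
    asr=[
        AsrProgram(
            name="funasr",
            description="FunASR speech-to-text",
            attribution=Attribution(
                name="Speech Lab of Alibaba Group",
                url="https://github.com/alibaba-damo-academy/FunASR",
            ),
            installed=True,
            version="1.3.0",
            models=[
                AsrModel(
                    name="paraformer-zh",
                    description="Paraformer Chinese ASR",
                    attribution=Attribution(
                        name="Speech Lab of Alibaba Group",
                        url="https://www.modelscope.cn",
                    ),
                    installed=True,
                    languages=["zh"],
                    version="v2.0.4",
                )
            ],
        )
    ],
)

# Inside handle_event:
if Describe.is_type(event.type):
    await self.write_event(INFO.event())
    return True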
AudioStart event
Sent by the HA client:
{
  "type": "audio-start",
  "data": {
    "rate": 16000,
    "width": 2,
    "channels": 1
  }
}
Detect it with AudioStart.is_type(event.type).
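A sketch of the corresponding branch, which resets the buffer and records the stream format:
from wyoming.audio import AudioStart

# Inside handle_event:
if AudioStart.is_type(event.type):
    start = AudioStart.from_event(event)
    self.audio = b""
    self.rate = start.rate          # 16000 expected by paraformer-zh
    self.width = start.width        # 2 bytes per sample = 16-bit PCM
    self.channels = start.channels  # 1 = mono
    return True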
AudioChunk event
The AudioChunk event carries the raw PCM data. Receive AudioChunk events and buffer them until the stream ends.
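A sketch of the buffering branch:
from wyoming.audio import AudioChunk

# Inside handle_event:
if AudioChunk.is_type(event.type):
    chunk = AudioChunk.from_event(event)
    self.audio += chunk.audio  # raw PCM bytes
    return True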
AudioStop event
The AudioStop event marks the end of the stream; when it arrives, run FunASR inference on the buffered audio and return a Transcript.
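A sketch of the final branch. It assumes model.generate accepts a float32 waveform with an fs keyword, as in the FunASR examples, and that the result is a list of dicts with a "text" field:
import numpy as np
from wyoming.asr import Transcript
from wyoming.audio import AudioStop

# Inside handle_event:
if AudioStop.is_type(event.type):
    # 16-bit little-endian PCM -> float32 in [-1, 1]
    samples = np.frombuffer(self.audio, dtype=np.int16).astype(np.float32) / 32768.0
    result = self.model.generate(input=samples, fs=self.rate)
    text = result[0]["text"] if result else ""
    await self.write_event(Transcript(text=text).event())
    return False  # close the connection after replying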
pip3 list
Package Version
---------------------- --------
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
antlr4-python3-runtime 4.9.3
audioread 3.1.0
certifi 2026.1.4
cffi 2.0.0
charset-normalizer 3.4.4
crcmod 1.7
cryptography 46.0.3
decorator 5.2.1
editdistance 0.8.1
filelock 3.20.3
fsspec 2026.1.0
funasr 1.3.0
hydra-core 1.3.2
idna 3.11
ifaddr 0.2.0
jaconv 0.4.1
jamo 0.4.1
jieba 0.42.1
Jinja2 3.1.6
jmespath 0.10.0
joblib 1.5.3
kaldiio 2.18.1
lazy_loader 0.4
librosa 0.11.0
llvmlite 0.46.0
MarkupSafe 3.0.3
modelscope 1.34.0
mpmath 1.3.0
msgpack 1.1.2
networkx 3.6.1
numba 0.63.1
numpy 1.26.4
omegaconf 2.3.0
oss2 2.19.1
packaging 26.0
pip 23.0.1
platformdirs 4.5.1
pooch 1.8.2
protobuf 6.33.4
pycparser 3.0
pycryptodome 3.23.0
pynndescent 0.6.0
pytorch-wpe 0.0.1
PyYAML 6.0.3
requests 2.32.5
scikit-learn 1.8.0
scipy 1.17.0
sentencepiece 0.2.1
setuptools 66.1.1
six 1.17.0
soundfile 0.13.1
soxr 1.0.0
sympy 1.14.0
tensorboardX 2.6.4
threadpoolctl 3.6.0
torch 2.1.0
torch_complex 0.4.4
torchaudio 2.1.0
tqdm 4.67.1
typing_extensions 4.15.0
umap-learn 0.5.11
urllib3 2.6.3
wyoming 1.8.0
zeroconf 0.148.0
Strategies to reduce latency
1. Use smaller models
FunASR has paraformer-zh-small or paraformer-zh-medium
2. VAD pre-filtering
Run VAD first and pass only the detected speech segments to the recognizer, skipping silence chunks entirely (see the sketch below).
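A sketch of manual pre-filtering, assuming fsmn-vad returns segments as [[beg_ms, end_ms], ...] the way the FunASR examples show:
import soundfile as sf
from funasr import AutoModel

vad = AutoModel(model="fsmn-vad")
asr = AutoModel(model="paraformer-zh", device="cpu")

wav, rate = sf.read("test.wav", dtype="float32")
# VAD returns speech segments in milliseconds
segments = vad.generate(input=wav, fs=rate)[0]["value"]
for beg_ms, end_ms in segments:
    piece = wav[int(beg_ms * rate / 1000):int(end_ms * rate / 1000)]
    print(asr.generate(input=piece, fs=rate)[0]["text"])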