6. Local Voice Chatbot
In this lesson, you will combine everything you’ve learned — speech recognition (STT), text-to-speech (TTS), and a local LLM (Ollama) — to build a fully offline voice chatbot that runs on your Fusion HAT.
The workflow is simple:
Listen: The microphone captures your speech and Vosk transcribes it.
Think: The text is sent to a local LLM running on Ollama (e.g., llama3.2:3b).
Speak: The chatbot answers aloud using Piper TTS.
This creates a hands-free conversational robot that can understand and respond in real time.
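The three steps above can be sketched as a single turn function. This is only an illustration of the control flow, not the lesson's script: run_turn and the three callables passed to it are hypothetical stand-ins, so the sketch runs without a microphone, Ollama, or Piper.

```python
from typing import Callable, Iterable


def run_turn(
    listen: Callable[[], str],
    think: Callable[[str], Iterable[str]],
    speak: Callable[[str], None],
) -> str:
    """Run one listen -> think -> speak turn and return the spoken reply."""
    heard = listen().strip()
    if not heard:
        return ""  # nothing recognized, so skip the LLM and TTS steps
    reply = "".join(think(heard)).strip()
    if reply:
        speak(reply)
    return reply


# Hypothetical stand-ins for the STT, LLM, and TTS steps.
spoken = []
reply = run_turn(
    listen=lambda: "what is two plus two",
    think=lambda prompt: iter(["Two plus ", "two is ", "four."]),
    speak=spoken.append,
)
print(reply)
```

In the real script, the three roles are played by stt.listen(), llm.prompt(), and tts.say().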
Before You Start
Make sure you have prepared the following:
Tested Piper TTS (1. Testing Piper) and chosen a working voice model.
Tested Vosk STT (2. Test Vosk) and chosen the right language pack (e.g., en-us).
Installed Ollama (1. Install Ollama (LLM) and Download Model) on your Pi or another computer, and downloaded a model such as llama3.2:3b (or a smaller one like moondream:1.8b if memory is limited).
Run the Code
Open the example script:
cd ~/fusion-hat/examples/
sudo nano local_voice_chatbot.py
Update the parameters as needed:
stt = Vosk(language="en-us"): Change this to match your accent/language package (e.g., en-us, zh-cn, es).
tts.set_model("en_US-amy-low"): Replace with the Piper voice model you verified in 1. Testing Piper.
llm = Ollama(ip="localhost", model="llama3.2:3b"): Update both ip and model to your own setup.
  ip: If Ollama runs on the same Pi, use localhost. If Ollama runs on another computer in your LAN, enable Expose to network in Ollama and set ip to that computer’s LAN IP.
  model: Must exactly match the model name you downloaded/activated in Ollama.
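Before editing ip and model, it can help to confirm what the Ollama server actually has installed. The sketch below is a minimal check, assuming Ollama's standard REST endpoint GET /api/tags on its default port 11434; the helper names (tags_url, installed_models, fetch_installed_models) are hypothetical and are not part of fusion_hat, which may connect differently.

```python
import json
from urllib.request import urlopen


def tags_url(ip: str, port: int = 11434) -> str:
    """Build the URL of Ollama's model-listing endpoint (assumed default port)."""
    return f"http://{ip}:{port}/api/tags"


def installed_models(tags_payload: dict) -> list:
    """Extract model names from the JSON returned by /api/tags."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def fetch_installed_models(ip: str, timeout: float = 3.0) -> list:
    """Ask a running Ollama server which models it has (needs network access)."""
    with urlopen(tags_url(ip), timeout=timeout) as resp:
        return installed_models(json.load(resp))


# Offline demonstration with a sample payload shaped like the /api/tags reply.
sample = {"models": [{"name": "llama3.2:3b"}, {"name": "moondream:1.8b"}]}
print("llama3.2:3b" in installed_models(sample))
```

On the Pi, calling fetch_installed_models("localhost") (or the remote computer's LAN IP) should list the exact names you can pass as model; a connection error suggests the server is not running or not exposed to the network.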
Run the script:
cd ~/fusion-hat/examples/
sudo python3 local_voice_chatbot.py
After running, you should see:
The bot greets you with a spoken welcome message.
It waits for speech input.
Vosk transcribes your speech into text.
The text is sent to Ollama, which streams back a reply.
The reply is cleaned (removing hidden reasoning) and spoken aloud by Piper.
Stop the program anytime with Ctrl+C.
Code
import re
import time

from fusion_hat.llm import Ollama
from fusion_hat.stt import Vosk
from fusion_hat.tts import Piper

# Initialize speech recognition
stt = Vosk(language="en-us")

# Initialize TTS
tts = Piper()
tts.set_model("en_US-amy-low")

# Instructions for the LLM
INSTRUCTIONS = (
    "You are a helpful assistant. Answer directly in plain English. "
    "Do NOT include any hidden thinking, analysis, or tags like <think>."
)
WELCOME = "Hello! I'm your voice chatbot. Speak when you're ready."

# Initialize Ollama connection
llm = Ollama(ip="localhost", model="llama3.2:3b")
llm.set_max_messages(20)
llm.set_instructions(INSTRUCTIONS)

# Utility: clean hidden reasoning
def strip_thinking(text: str) -> str:
    if not text:
        return ""
    text = re.sub(r"<\s*think[^>]*>.*?<\s*/\s*think\s*>", "", text, flags=re.DOTALL|re.IGNORECASE)
    text = re.sub(r"<\s*thinking[^>]*>.*?<\s*/\s*thinking\s*>", "", text, flags=re.DOTALL|re.IGNORECASE)
    text = re.sub(r"```(?:\s*thinking)?\s*.*?```", "", text, flags=re.DOTALL|re.IGNORECASE)
    text = re.sub(r"\[/?thinking\]", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+\n", "\n", text).strip()

def main():
    print(WELCOME)
    tts.say(WELCOME)
    try:
        while True:
            print("\n🎤 Listening... (Press Ctrl+C to stop)")
            # Collect final transcript from Vosk
            text = ""
            for result in stt.listen(stream=True):
                if result["done"]:
                    text = result["final"].strip()
                    print(f"[YOU] {text}")
                else:
                    print(f"[YOU] {result['partial']}", end="\r", flush=True)
            if not text:
                print("[INFO] Nothing recognized. Try again.")
                time.sleep(0.1)
                continue
            # Query Ollama with streaming
            reply_accum = ""
            response = llm.prompt(text, stream=True)
            for next_word in response:
                if next_word:
                    print(next_word, end="", flush=True)
                    reply_accum += next_word
            print("")
            # Clean and speak
            clean = strip_thinking(reply_accum)
            if clean:
                tts.say(clean)
            else:
                tts.say("Sorry, I didn't catch that.")
            time.sleep(0.05)
    except KeyboardInterrupt:
        print("\n[INFO] Stopping...")
    finally:
        tts.say("Goodbye!")
        print("Bye.")

if __name__ == "__main__":
    main()
Code Analysis
Imports and global setup
import re
import time
from fusion_hat.llm import Ollama
from fusion_hat.stt import Vosk
from fusion_hat.tts import Piper
Brings in the three subsystems you built earlier: Vosk for speech-to-text (STT), Ollama for the LLM, and Piper for text-to-speech (TTS).
Initialize STT (Vosk)
stt = Vosk(language="en-us")
Loads the Vosk model for US English.
Change the language code (e.g., zh-cn, es) to match your voice pack for better accuracy.
Initialize TTS (Piper)
tts = Piper()
tts.set_model("en_US-amy-low")
Creates a Piper engine and selects a specific voice. Pick a model you’ve tested in 1. Testing Piper. Lower-quality voices are faster and use less CPU.
LLM instructions and welcome line
INSTRUCTIONS = (
    "You are a helpful assistant. Answer directly in plain English. "
    "Do NOT include any hidden thinking, analysis, or tags like <think>."
)
WELCOME = "Hello! I'm your voice chatbot. Speak when you're ready."
Two key UX choices:
Ask for direct, plain-English answers (helps with TTS clarity).
Explicitly forbid hidden “chain-of-thought” tags to reduce noisy outputs.
Connect to Ollama and set conversation scope
llm = Ollama(ip="localhost", model="llama3.2:3b")
llm.set_max_messages(20)
llm.set_instructions(INSTRUCTIONS)
ip="localhost" assumes the Ollama server runs on the same Pi. If it runs on another LAN machine, set ip to that computer’s LAN IP and enable Expose to network in Ollama.
set_max_messages(20) keeps a short conversational history. Lower this if memory/latency is tight.
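To picture what a message cap does, here is a minimal sketch of a bounded history using a deque. It only illustrates the idea of dropping the oldest messages once the limit is reached; how fusion_hat actually stores or trims history (for example, whether the instructions count toward the limit) is not shown in this lesson, so treat the remember helper and the message format as assumptions.

```python
from collections import deque

MAX_MESSAGES = 20  # same number passed to set_max_messages() in the script

# A deque with maxlen silently discards the oldest item when it is full.
history = deque(maxlen=MAX_MESSAGES)


def remember(role: str, content: str) -> None:
    """Append one chat message; the oldest one falls off once the cap is hit."""
    history.append({"role": role, "content": content})


# Simulate 15 question/answer turns (30 messages in total).
for turn in range(15):
    remember("user", f"question {turn}")
    remember("assistant", f"answer {turn}")

print(len(history))
print(history[0]["content"])
```

A smaller cap means less context is sent with each prompt, which is why lowering it can reduce memory use and latency at the cost of the bot forgetting earlier turns sooner.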
Strip hidden reasoning / tags before speaking
def strip_thinking(text: str) -> str:
    if not text:
        return ""
    text = re.sub(r"<\s*think[^>]*>.*?<\s*/\s*think\s*>", "", text, flags=re.DOTALL|re.IGNORECASE)
    text = re.sub(r"<\s*thinking[^>]*>.*?<\s*/\s*thinking\s*>", "", text, flags=re.DOTALL|re.IGNORECASE)
    text = re.sub(r"```(?:\s*thinking)?\s*.*?```", "", text, flags=re.DOTALL|re.IGNORECASE)
    text = re.sub(r"\[/?thinking\]", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+\n", "\n", text).strip()
Some models may emit internal-style tags (e.g., <think>…).
This function removes those so your TTS only speaks the final answer.
Tip: Because raw tokens are streamed to the terminal, these tags may still appear on screen. strip_thinking only cleans the text that is passed to TTS, so the spoken output stays clean.
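A quick way to see what the function does is to run it on a few sample strings. The copy below uses the same patterns as the lesson's function, repeated so the demo runs on its own; the only change is that the code-fence pattern is written with `{3} instead of three literal backticks, which matches the same text. The sample replies are made up for illustration.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE


def strip_thinking(text: str) -> str:
    """Remove <think>/<thinking> blocks, fenced blocks, and [thinking] markers."""
    if not text:
        return ""
    text = re.sub(r"<\s*think[^>]*>.*?<\s*/\s*think\s*>", "", text, flags=FLAGS)
    text = re.sub(r"<\s*thinking[^>]*>.*?<\s*/\s*thinking\s*>", "", text, flags=FLAGS)
    # `{3} is equivalent to three literal backticks.
    text = re.sub(r"`{3}(?:\s*thinking)?\s*.*?`{3}", "", text, flags=FLAGS)
    text = re.sub(r"\[/?thinking\]", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+\n", "\n", text).strip()


tagged = "<think>The user greeted me.</think>Hello! How can I help?"
bracketed = "[thinking]Paris is the capital of France."
unclosed = "<think>still reasoning"

print(strip_thinking(tagged))
print(strip_thinking(bracketed))
print(strip_thinking(unclosed))
```

Note the third case: a tag that is opened but never closed is left in place, because the patterns require a matching closing tag. If your model is cut off mid-thought, some reasoning text could still reach the speaker.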
Main loop: greet once, then listen → think → speak
print(WELCOME)
tts.say(WELCOME)
Greets the user via terminal and speaker. Happens once at startup.
Listen (streaming STT with live partials)
print("\n🎤 Listening... (Press Ctrl+C to stop)")
text = ""
for result in stt.listen(stream=True):
    if result["done"]:
        text = result["final"].strip()
        print(f"[YOU] {text}")
    else:
        print(f"[YOU] {result['partial']}", end="\r", flush=True)
stream=True yields partial transcripts for immediate feedback and a final transcript when the utterance ends.
The final recognized text is stored in text and printed once.
Guard: If nothing was recognized, you skip the LLM call:
if not text:
    print("[INFO] Nothing recognized. Try again.")
    time.sleep(0.1)
    continue
This avoids sending empty prompts to the model (saves time and tokens).
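The listen step and its guard can be tried without a microphone by feeding in fake results. The sketch below assumes only the dictionary keys the script itself reads ("done", "partial", "final"); the final_transcript helper and the sample streams are hypothetical.

```python
from typing import Iterable, Optional


def final_transcript(results: Iterable[dict]) -> Optional[str]:
    """Return the final transcript from a result stream, or None if it is empty."""
    text = ""
    for result in results:
        if result["done"]:
            text = result["final"].strip()
    return text or None  # None tells the caller to skip the LLM call


# Made-up results imitating what the script reads from stt.listen(stream=True).
fake_stream = [
    {"done": False, "partial": "turn on"},
    {"done": False, "partial": "turn on the"},
    {"done": True, "final": " turn on the light "},
]
silence = [{"done": True, "final": "   "}]

print(final_transcript(fake_stream))
print(final_transcript(silence))
```

Returning None for silence or whitespace plays the same role as the if not text guard in the script.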
Think (LLM) with streamed printing
reply_accum = ""
response = llm.prompt(text, stream=True)
for next_word in response:
    if next_word:
        print(next_word, end="", flush=True)
        reply_accum += next_word
print("")
Sends the final transcript to the local LLM and prints tokens as they arrive for low latency.
Meanwhile, you accumulate the full reply in reply_accum for post-processing.
Note: If you’d rather not show raw tokens, set stream=False and just print the final string.
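The streaming pattern can be isolated into a small helper that shows each chunk as it arrives and returns the full reply at the end. This is a sketch with a fake generator in place of llm.prompt(text, stream=True); the accumulate and fake_reply names are hypothetical, and the chunk contents are made up.

```python
from typing import Callable, Iterable, Optional


def accumulate(
    chunks: Iterable[Optional[str]],
    on_chunk: Callable[[str], None] = lambda chunk: None,
) -> str:
    """Pass each non-empty chunk to on_chunk and return the joined reply."""
    parts = []
    for chunk in chunks:
        if chunk:  # skip empty or None chunks, as the script does
            on_chunk(chunk)
            parts.append(chunk)
    return "".join(parts)


def fake_reply():
    """Hypothetical generator standing in for a streamed LLM response."""
    yield from ["The sky ", "", "is ", None, "blue."]


shown = []
full = accumulate(fake_reply(), on_chunk=shown.append)
print(full)
```

In the script, on_chunk corresponds to print(next_word, end="", flush=True), and the returned string is what gets cleaned and spoken.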
Speak (clean first, then TTS once)
clean = strip_thinking(reply_accum)
if clean:
    tts.say(clean)
else:
    tts.say("Sorry, I didn't catch that.")
Cleans the final text to remove hidden tags, then speaks exactly once.
Keeping TTS to a single pass avoids repeated prompts like “[LLM] / [SAY]”.
Exit and teardown
except KeyboardInterrupt:
    print("\n[INFO] Stopping...")
finally:
    tts.say("Goodbye!")
    print("Bye.")
Use Ctrl+C to stop. The bot says a short goodbye to signal a clean exit.
Troubleshooting & FAQ
Model is too large (memory error)
Use a smaller model like moondream:1.8b or run Ollama on a more powerful computer.
No response from Ollama
Make sure Ollama is running (ollama serve or desktop app open). If remote, enable Expose to network and check the IP address.
Vosk not recognizing speech
Verify your microphone works. Try another language pack (zh-cn, es, etc.) if needed.
Piper silent or errors
Confirm the chosen voice model is downloaded and tested in 1. Testing Piper.
Answers too long or off-topic
Edit INSTRUCTIONS to add: “Keep answers short and to the point.”