.. include:: /index.rst
    :start-after: start_hello_message
    :end-before: end_hello_message

.. _mp_hand_count_tts:

12. Adding TTS Voice Broadcast to MediaPipe Projects
=======================================================

-----------------------------------------------------------------
1. Overview
-----------------------------------------------------------------

In :ref:`mp_hand_count` (Section 5), we built a hand gesture counting program that displays the number of raised fingers on screen. In this section, we go one step further and **add Text-to-Speech (TTS) voice broadcast** so the Raspberry Pi can *speak* the detected finger count out loud, making the project more interactive and accessible.

.. image:: img/mp_hand_count.png
    :align: center

This lesson is not just about finger counting: it teaches a **general pattern** for adding TTS to *any* MediaPipe or OpenCV project.

By the end of this lesson, you will know how to:

- Initialize and configure the Fusion HAT+ TTS engine
- Trigger TTS on a key press with debounce protection
- Add visual feedback while the system is speaking
- Apply this pattern to your own computer vision projects

-----------------------------------------------------------------
2. How It Works
-----------------------------------------------------------------

The program builds on the hand-counting pipeline and adds a TTS layer that is activated by a key press:

1. Initialize **MediaPipe Hands** for real-time hand detection.
2. Initialize the **Fusion HAT+ TTS engine** (Espeak).
3. Capture video frames and detect fingers (same as before).
4. Wait for the user to press the ``t`` key.
5. On key press, convert the current finger count into a spoken message.
6. Use **debounce logic** to prevent rapid repeated triggers.
7. Show a **visual flash** on screen while TTS is speaking.
8. The speech plays through the Fusion HAT+ speaker.
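
The control flow in steps 4 to 6 can be sketched without any hardware. The following is a minimal simulation, not the lesson's program: ``simulate_loop``, the ``(timestamp, key, finger_count)`` event tuples, and the ``speak`` callback are hypothetical names introduced for illustration, and the callback stands in for the real TTS call.

.. code-block:: python

    from typing import Callable, Iterable, List, Tuple

    # (timestamp in seconds, key pressed or "", finger count) -- hypothetical event format
    Event = Tuple[float, str, int]


    def simulate_loop(events: Iterable[Event],
                      speak: Callable[[str], None],
                      debounce_interval: float = 1.5) -> None:
        """Replay frame events and call speak() on accepted 't' presses."""
        last_spoken_at = float("-inf")  # nothing spoken yet, so the first press passes
        for timestamp, key, finger_count in events:
            if key == "q":
                break
            if key != "t":
                continue  # detection keeps running; nothing is spoken
            if timestamp - last_spoken_at > debounce_interval:
                last_spoken_at = timestamp
                speak(f"finger count is {finger_count}")


    spoken: List[str] = []
    simulate_loop(
        [(0.0, "t", 2), (0.5, "t", 3), (1.0, "", 3),
         (2.0, "t", 5), (2.1, "q", 5), (9.0, "t", 1)],
        spoken.append,
    )
    print(spoken)  # ['finger count is 2', 'finger count is 5']

The press at 0.5 s falls inside the debounce window and is dropped, and nothing after the ``q`` event is processed.
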

The key design idea is that *TTS is added as a non-blocking layer*: detection runs continuously, and speech is only triggered when the user requests it. This pattern keeps the video pipeline smooth while adding voice output on demand.

-----------------------------------------------------------------
3. The Fusion HAT+ TTS Module
-----------------------------------------------------------------

The ``fusion_hat`` library provides a simple, unified interface for several TTS engines. In this project, we use **Espeak**, a lightweight offline engine that works well on Raspberry Pi.

**Basic usage:**

.. code-block:: python

    from fusion_hat.tts import Espeak

    # Create TTS instance
    tts = Espeak()

    # Configure voice
    tts.set_amp(200)     # volume: 0-200 (default 100)
    tts.set_speed(150)   # speed: 80-260 (default 150)
    tts.set_pitch(80)    # pitch: 0-99 (default 80)

    # Speak
    tts.say("Hello!")

Three parameters let you customize the voice:

- **amp** (amplitude): controls volume. Higher values are louder.
- **speed**: speaking rate in words per minute. 150 is normal.
- **pitch**: voice pitch. 80 is the default; lower values sound deeper.

.. note::

    Fusion HAT+ also supports **Piper** (neural, offline) and **OpenAI TTS** (online, natural voices). See :ref:`tts_piper_openai` for more advanced options.

-----------------------------------------------------------------
4. Key Design: Adding TTS to a Video Loop
-----------------------------------------------------------------

When adding TTS to a real-time video pipeline, there are a few important design considerations. Let's walk through each one.

--------------------------------------------------
4.1 Trigger by Key Press
--------------------------------------------------

Rather than speaking on every frame (which would be chaotic), we use a keyboard key as the trigger:

.. code-block:: python

    key = cv2.waitKey(1) & 0xff
    if key == ord('t'):
        tts.say(message)

The ``t`` key is chosen because it's easy to remember (*t* for *talk*).
You can use any other key instead, such as the space bar (``ord(' ')``), or a GPIO button for physical input.

--------------------------------------------------
4.2 Debounce Protection
--------------------------------------------------

Without protection, holding down the ``t`` key would trigger TTS dozens of times per second, overlapping speech and making it unintelligible.

**Solution: time-based debounce.**

.. code-block:: python

    import time

    DEBOUNCE_INTERVAL = 1.5  # seconds
    last_tts_time = 0

    # In the loop:
    if key == ord('t'):
        now = time.time()
        if now - last_tts_time > DEBOUNCE_INTERVAL:
            last_tts_time = now
            tts.say(message)

After each TTS trigger, further triggers are ignored for 1.5 seconds. This gives a short sentence enough time to finish before the next one starts.

--------------------------------------------------
4.3 Building the Message
--------------------------------------------------

The finger count (an integer) must be converted into a natural-sounding sentence:

.. code-block:: python

    if total_fingers == 0:
        message = "no fingers detected"
    elif total_fingers == 1:
        message = "one finger detected"
    else:
        message = f"{total_fingers} fingers detected"

The zero and one cases get their own wording so the sentence stays grammatical ("no fingers", "one finger"). For numbers greater than one, the digit form works fine with Espeak.

--------------------------------------------------
4.4 Visual Feedback (Green Border Flash)
--------------------------------------------------

When TTS is triggered, we add a visual indicator so the user knows speech is in progress:

.. code-block:: python

    tts_flash_until = now + 1.0  # flash for 1 second

    # Later in the loop:
    if tts_triggered and time.time() < tts_flash_until:
        h, w = frame.shape[:2]  # frame height and width in pixels
        cv2.rectangle(frame, (0, 0), (w - 1, h - 1), (0, 255, 0), 8)
        cv2.putText(frame, "Speaking...", (10, 75),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)

A **green border** appears around the frame and a **"Speaking..."** label is shown. The flash lasts a fixed 1 second rather than tracking the actual speech duration, and both elements disappear automatically when it ends.
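
The flash state can also be tracked with a single expiry timestamp instead of a flag plus a timestamp. The sketch below is one possible way to package that idea. ``FlashTimer`` and its methods are hypothetical helpers, not part of OpenCV or ``fusion_hat``, and timestamps are passed in explicitly so the logic can be checked without a camera.

.. code-block:: python

    class FlashTimer:
        """Tracks whether a fixed-duration on-screen flash is still active."""

        def __init__(self, duration: float = 1.0) -> None:
            self.duration = duration
            self._until = 0.0  # timestamp at which the current flash ends

        def trigger(self, now: float) -> None:
            """Start (or restart) the flash at time `now`."""
            self._until = now + self.duration

        def active(self, now: float) -> bool:
            """Return True while `now` is before the flash end time."""
            return now < self._until


    flash = FlashTimer(duration=1.0)
    flash.trigger(now=10.0)
    print(flash.active(now=10.5))  # True: 0.5 s into a 1 s flash
    print(flash.active(now=11.2))  # False: the flash has expired

In a video loop, this would mean calling ``flash.trigger(time.time())`` when TTS fires and drawing the border only while ``flash.active(time.time())`` is true.
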

This feedback loop is important because:

- TTS takes a moment to complete, so the user needs to know the system heard their command.
- The border disappears when done, so it does not interfere with normal use.

-----------------------------------------------------------------
5. Run the Code
-----------------------------------------------------------------

.. important::

    Before you start, make sure:

    * The Fusion HAT+ is assembled and the speaker is connected
    * You can access the Raspberry Pi desktop
    * The code package is installed
    * MediaPipe and OpenCV are installed

    For detailed instructions, see :ref:`mediapipe_install` and :ref:`opencv_install`.

#. Open the terminal and enter the following command:

   .. code-block:: bash

       sudo python3 ~/ai-lab-kit/mediapipe/mp_hand_count_tts.py

#. After running the program:

   - A window titled "MediaPipe Hand Count + TTS" opens, showing the live camera feed.
   - Hold your hand up to the camera. The finger count appears in the top-left corner.
   - *Press the* ``t`` *key*. The system speaks the current finger count through the Fusion HAT+ speaker.
   - A green border flashes on screen while speaking.

.. hint::

    Try showing different numbers of fingers and pressing ``t`` each time. You should hear: "one finger detected", "3 fingers detected", etc.

    Press ``q`` to exit the program.

--------------------------------------------------
6. Complete Code
--------------------------------------------------

.. code-block:: python

    """
    MediaPipe Hand Detection + TTS Demo
    ====================================
    Detects fingers via the camera in real time.
    Press the 't' key to speak the current finger count using TTS.

    Usage:
        sudo python3 mp_hand_count_tts.py

    Controls:
        't' - speak the detected finger count via TTS
        'q' - quit
    """

    from picamera2 import Picamera2
    import cv2
    import mediapipe.python.solutions.hands as mp_hands
    import mediapipe.python.solutions.drawing_utils as drawing
    import mediapipe.python.solutions.drawing_styles as drawing_styles
    from fusion_hat.tts import Espeak
    import time

    # ======================== Init TTS ========================
    tts = Espeak()
    tts.set_amp(200)    # volume 0-200, default 100
    tts.set_speed(150)  # speed 80-260, default 150
    tts.set_pitch(80)   # pitch 0-99, default 80

    # ======================== Init MediaPipe Hands ========================
    hands = mp_hands.Hands(
        static_image_mode=False,
        max_num_hands=2,
        min_detection_confidence=0.5
    )

    # ======================== Init Camera ========================
    picam2 = Picamera2()
    config = picam2.create_preview_configuration(
        main={"size": (640, 480), "format": "XRGB8888"},
    )
    picam2.configure(config)
    picam2.start()

    # ======================== Constants ========================
    # Landmark indices for the finger tips and the joints they are compared against
    FINGER_TIPS = [4, 8, 12, 16, 20]   # thumb, index, middle, ring, pinky tips
    FINGER_DIPS = [2, 6, 10, 14, 18]   # corresponding middle joints

    # Minimum interval (seconds) between TTS triggers to avoid spamming
    DEBOUNCE_INTERVAL = 1.5

    print("=" * 55)
    print("  MediaPipe Hand Count + TTS")
    print("  Press 't' to speak count | 'q' to quit")
    print("=" * 55)

    # ======================== Main Loop ========================
    last_tts_time = 0       # timestamp of last TTS trigger
    tts_triggered = False   # whether TTS was just fired (for visual flash)
    tts_flash_until = 0     # timestamp at which the flash should end

    while True:
        # ---- 1. Capture frame ----
        frame_bgra = picam2.capture_array()
        frame_bgr = cv2.cvtColor(frame_bgra, cv2.COLOR_BGRA2BGR)

        # ---- 2. Convert to RGB for MediaPipe ----
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        hands_detected = hands.process(frame_rgb)

        # ---- 3. Convert back to BGR for OpenCV display ----
        frame = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2BGR)

        # ---- 4. Count fingers on every detected hand ----
        # Note: the thumb test below assumes a right hand.
        total_fingers = 0

        if hands_detected.multi_hand_landmarks:
            for hand_landmarks in hands_detected.multi_hand_landmarks:
                # Draw hand skeleton
                drawing.draw_landmarks(
                    frame,
                    hand_landmarks,
                    mp_hands.HAND_CONNECTIONS,
                    drawing_styles.get_default_hand_landmarks_style(),
                    drawing_styles.get_default_hand_connections_style(),
                )

                landmarks = hand_landmarks.landmark
                finger_count = 0

                # Thumb: extended when x_tip > x_dip (right hand)
                if landmarks[FINGER_TIPS[0]].x > landmarks[FINGER_DIPS[0]].x:
                    finger_count += 1

                # Other four fingers: tip is above dip when extended (smaller y)
                for i in range(1, 5):
                    if landmarks[FINGER_TIPS[i]].y < landmarks[FINGER_DIPS[i]].y:
                        finger_count += 1

                total_fingers += finger_count

        # ---- 5. Display finger count on screen ----
        display_text = f"Fingers: {total_fingers}"
        cv2.putText(frame, display_text, (10, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)

        # ---- 6. Key handling ----
        key = cv2.waitKey(1) & 0xff

        # 't' key: trigger TTS (with debounce)
        if key == ord('t'):
            now = time.time()
            if now - last_tts_time > DEBOUNCE_INTERVAL:
                last_tts_time = now
                tts_triggered = True
                tts_flash_until = now + 1.0  # flash for 1 second

                if total_fingers == 0:
                    message = "no fingers detected"
                elif total_fingers == 1:
                    message = "one finger detected"
                else:
                    message = f"{total_fingers} fingers detected"

                print(f"[TTS] {message}")
                tts.say(message)

        # 'q' key: quit
        if key == ord('q'):
            break

        # ---- 7. Visual feedback while speaking (green border flash) ----
        if tts_triggered and time.time() < tts_flash_until:
            h, w = frame.shape[:2]
            thickness = 8
            cv2.rectangle(frame, (0, 0), (w - 1, h - 1), (0, 255, 0), thickness)
            cv2.putText(frame, "Speaking...", (10, 75),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)
        else:
            tts_triggered = False

        # ---- 8. Show controls hint at bottom ----
        cv2.putText(frame, "Press 't' to speak count | 'q' to quit",
                    (10, frame.shape[0] - 15),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (180, 180, 180), 1)

        # ---- 9. Show frame ----
        cv2.imshow("MediaPipe Hand Count + TTS", frame)

    # ======================== Cleanup ========================
    picam2.stop_preview()
    picam2.stop()
    cv2.destroyAllWindows()
    print("Exited.")

--------------------------------------------------
7. Code Explanation
--------------------------------------------------

Let's walk through the code section by section, focusing on what's new compared to the basic hand-counting program.

--------------------------------------------------
7.1 Imports and Initialization
--------------------------------------------------

.. code-block:: python

    from fusion_hat.tts import Espeak
    import time

    tts = Espeak()
    tts.set_amp(200)
    tts.set_speed(150)
    tts.set_pitch(80)

Two new imports and a TTS initialization block are the first additions. ``Espeak()`` creates the TTS engine, and the three ``set_*`` calls configure the voice. The ``import time`` is needed for debounce timing.

--------------------------------------------------
7.2 Debounce Constants and State Variables
--------------------------------------------------

.. code-block:: python

    DEBOUNCE_INTERVAL = 1.5
    last_tts_time = 0
    tts_triggered = False
    tts_flash_until = 0

Four new variables are introduced:

- ``DEBOUNCE_INTERVAL``: minimum time between TTS triggers, in seconds. It prevents TTS spam.
- ``last_tts_time``: records when TTS was last triggered.
- ``tts_triggered``: flag for the visual flash effect.
- ``tts_flash_until``: timestamp when the flash should end.

--------------------------------------------------
7.3 Key Handling with Debounce
--------------------------------------------------

.. code-block:: python

    key = cv2.waitKey(1) & 0xff

    if key == ord('t'):
        now = time.time()
        if now - last_tts_time > DEBOUNCE_INTERVAL:
            last_tts_time = now
            tts_triggered = True
            tts_flash_until = now + 1.0

            if total_fingers == 0:
                message = "no fingers detected"
            elif total_fingers == 1:
                message = "one finger detected"
            else:
                message = f"{total_fingers} fingers detected"

            tts.say(message)

This is the core TTS addition. Let's break it down:

1. **Key detection**: ``key == ord('t')`` checks if ``t`` was pressed.
2. **Debounce gate**: ``now - last_tts_time > DEBOUNCE_INTERVAL`` ensures more than 1.5 seconds have passed since the last trigger. If not enough time has passed, the key press is ignored.
3. **Update state**: When the gate passes, we record the current time, set the ``tts_triggered`` flag, and set the flash timer.
4. **Build message**: The finger count is converted into a human-readable sentence.
5. **Speak**: ``tts.say(message)`` sends the text to the speaker.

.. note::

    ``tts.say()`` is **non-blocking**: the program continues processing video frames while speech plays in the background.

--------------------------------------------------
7.4 Visual Feedback
--------------------------------------------------

.. code-block:: python

    if tts_triggered and time.time() < tts_flash_until:
        h, w = frame.shape[:2]
        thickness = 8
        cv2.rectangle(frame, (0, 0), (w - 1, h - 1), (0, 255, 0), thickness)
        cv2.putText(frame, "Speaking...", (10, 75),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)
    else:
        tts_triggered = False

- A green border (8 pixels thick) is drawn around the entire frame.
- A yellow "Speaking..." label appears below the finger count.
- Both persist for 1 second, then disappear automatically.
- When the flash timer expires, ``tts_triggered`` resets to ``False``, ready for the next trigger.

This pattern is reusable: you can add the same feedback to any project that triggers TTS.

-----------------------------------------------------------------
8. Extension Ideas: Applying This Pattern to Other Projects
-----------------------------------------------------------------

The TTS integration pattern you learned here is **generic**. You can add voice broadcast to any MediaPipe, OpenCV, or YOLO project by following these steps:

**Step 1: Import and initialize TTS**

.. code-block:: python

    import time
    from fusion_hat.tts import Espeak

    tts = Espeak()
    tts.set_amp(200)

**Step 2: Add debounce variables (before the loop)**

.. code-block:: python

    DEBOUNCE_INTERVAL = 1.5
    last_tts_time = 0

**Step 3: Add key-triggered TTS (inside the loop)**

.. code-block:: python

    key = cv2.waitKey(1) & 0xff
    if key == ord('t'):
        now = time.time()
        if now - last_tts_time > DEBOUNCE_INTERVAL:
            last_tts_time = now
            # Build your message from detection results
            tts.say(your_message)

Here are some ideas for applying this pattern:

- **MediaPipe Face Detection** (:ref:`mp_face`) → "Face detected at center of frame"
- **MediaPipe Pose** (:ref:`mp_pose`) → "Both arms raised" or "Squat detected, good form!"
- **OpenCV Color Tracking** (:ref:`play_with_opencv`) → "Red object moving left" or "Target locked"
- **YOLO Object Detection** (:ref:`play_with_yolo`) → "Person detected" or "Two cars in view"
- **Hardware Integration** → Replace the ``t`` key with a GPIO button press via ``fusion_hat`` for a keyboard-free experience.

-----------------------------------------------------------------
9. Troubleshooting
-----------------------------------------------------------------

- **No sound from the speaker**

  Make sure the Fusion HAT+ speaker is properly connected and the volume is not muted. Try running a simple TTS test:

  .. code-block:: bash

      sudo python3 -c "from fusion_hat.tts import Espeak; Espeak().say('test')"

  If you hear "test", the TTS engine is working.

- **TTS triggers too many times when holding the key**

  Increase ``DEBOUNCE_INTERVAL`` to a larger value, for example ``2.0`` or ``2.5`` seconds.
  If you want only a single trigger per key press (no repeat when held), track the key state across frames and only fire on the *rising edge* (the transition from not-pressed to pressed).

- **Speech sounds too fast or unclear**

  Lower the speed: ``tts.set_speed(120)``. Adjust the pitch for clarity: ``tts.set_pitch(70)``.

- **Speech overlaps with previous speech**

  Espeak on Fusion HAT+ queues speech by default, so rapid triggers play one after another. To reduce this, increase ``DEBOUNCE_INTERVAL`` so each message can finish before the next one starts, or use a different TTS engine.

- **Visual flash does not appear**

  Check that ``tts_triggered`` is set to ``True`` inside the debounce block and that ``tts_flash_until`` is set to ``now + 1.0``.

-----------------------------------------------------------------
10. Summary
-----------------------------------------------------------------

- This lesson demonstrated how to **add TTS voice broadcast** to a MediaPipe computer vision project.
- The Fusion HAT+ ``Espeak`` engine provides a simple, offline TTS solution on Raspberry Pi.
- **Key design patterns** covered:

  - Triggering TTS by key press (not on every frame)
  - **Debounce protection** to prevent speech overlap
  - **Visual feedback** (green border flash) for user awareness
  - Converting detection results into natural spoken messages

- These patterns are **project-agnostic**: you can apply them to any OpenCV, MediaPipe, or YOLO project to add voice output.
- Adding voice output makes your projects more accessible, opening the door to assistive technology applications and interactive installations.
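
The rising-edge idea mentioned in the Troubleshooting section can be sketched as follows. ``cv2.waitKey`` only reports key presses, not releases, so this sketch treats a key as released once it has not been seen for a short gap. ``KeyEdgeDetector`` and the ``release_gap`` value are assumptions for illustration; the gap needs to be longer than the keyboard's auto-repeat delay, which varies by system.

.. code-block:: python

    class KeyEdgeDetector:
        """Reports a key press once, then ignores repeats until the key is released."""

        def __init__(self, release_gap: float = 0.6) -> None:
            self.release_gap = release_gap   # seconds without the key before it counts as released
            self._last_seen = float("-inf")  # last time the key was reported as pressed

        def update(self, pressed: bool, now: float) -> bool:
            """Return True only on a not-pressed -> pressed transition."""
            if not pressed:
                return False
            was_released = (now - self._last_seen) > self.release_gap
            self._last_seen = now
            return was_released


    detector = KeyEdgeDetector(release_gap=0.6)
    # (timestamp, key seen this frame): a held key with auto-repeat, a pause, then a new press
    samples = [(0.00, True), (0.50, True), (0.55, True), (0.60, False), (2.00, True)]
    edges = [t for t, pressed in samples if detector.update(pressed, t)]
    print(edges)  # [0.0, 2.0]

In the video loop, the trigger condition would then become something like ``detector.update(key == ord('t'), time.time())`` in place of the plain key comparison.
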