ESP32 Speech-to-Text (No API Key Required)

Industrial-grade speech-to-text pipeline for ESP32. This repository provides:

- An ESP32 client that captures audio over I2S and posts WAV to a server.
- A lightweight Flask/Gunicorn server that returns JSON transcriptions via speech_recognition.

Designed for deterministic embedded behavior, clean I2S lifecycle, and zero vendor lock-in.

Overview

Client: ESP32 (Arduino) captures 16-bit mono audio and uploads to a server.
Server: Flask endpoint processes audio and returns transcription (/uploadAudio).
Wake Word (optional): Integrate the MARVIN wake word for hands-free activation.
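
Because the client uploads uncompressed 16-bit mono PCM, the upload size grows linearly with recording time. The sketch below estimates it, assuming a canonical 44-byte WAV header and a 16 kHz default sample rate. The function name and defaults are illustrative and not part of this library.

```python
WAV_HEADER_BYTES = 44  # canonical PCM WAV header (RIFF, fmt and data chunk headers)


def wav_payload_bytes(seconds: float, sample_rate: int = 16000,
                      bits_per_sample: int = 16, channels: int = 1) -> int:
    """Estimate the size in bytes of an uncompressed PCM WAV upload."""
    bytes_per_second = sample_rate * channels * bits_per_sample // 8
    return WAV_HEADER_BYTES + int(bytes_per_second * seconds)


# Example: 5 seconds of 16 kHz, 16-bit mono audio
# 16000 samples/s * 2 bytes * 5 s + 44 header bytes = 160044 bytes
size = wav_payload_bytes(5)
```

At 16 kHz this is roughly 32 KB per second of audio, which is worth keeping in mind when choosing a recording length for boards without PSRAM.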

Server Repository

ESpeechServer: https://github.com/TheZeroHz/ESpeechServer
Deploy locally or to cloud (recommended), e.g. Render.

Wake Word Library (Optional)

Marvin_WakeWord_inferencing: https://github.com/TheZeroHz/Marvin_WakeWord_inferencing

Compatibility

ESP32 Arduino Core: 3.3.1 (recommended/supported)
Arduino IDE: 2.3.x
Boards:
- ESP32-S3 — validated
- ESP32 DOIT DevKit V1 — under test
I2S Microphone: INMP441 (or compatible)

If you previously targeted ESP32 core 2.0.14, upgrade to 3.3.1 for best results.

Features

Low-latency I2S capture with deterministic init/deinit (prevents double driver install).
No accounts, credit cards, or external API keys required.
Simple HTTP interface (POST /uploadAudio) returning JSON.
Easily deployable backend with Gunicorn.
Optional wake word handoff (pause WW → record STT → resume WW).

Demo & Tutorial

Video (all ESP32 boards): https://www.canva.com/design/DAGkKUr6V58/pw6ovNUVmsN3kMa85Zlr7w/watch?utm_content=DAGkKUr6V58&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=h054e4457dc

Requirements

Server

Python: 3.10 (recommended; set PYTHON_VERSION=3.10 on Render)
Packages: Flask, SpeechRecognition, pydub, gunicorn

ESP32

ESP32 Arduino Core: 3.3.1
Arduino IDE: 2.3.x
I2S MIC: INMP441 (or equivalent)
Stable Wi-Fi

Server Setup

Use the ESpeechServer repository.

Clone:

```
git clone https://github.com/TheZeroHz/ESpeechServer.git
cd ESpeechServer
```

Install:
```
pip install -r requirements.txt
```
Run (production style):
```
gunicorn app:app --bind 0.0.0.0:8888
```
Endpoint:
- POST /uploadAudio (content: WAV) → {"transcription": "..."}
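
To check the ESP32 client's networking before a real server is running, a stand-in that accepts the same request can help. The following is a minimal sketch using only the Python standard library. It is not the ESpeechServer implementation: it performs no speech recognition and only mimics the endpoint path and the JSON response shape described above. The names `StubHandler` and `make_stub_server` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubHandler(BaseHTTPRequestHandler):
    """Accepts POST /uploadAudio and replies with a canned transcription."""

    def do_POST(self):
        if self.path != "/uploadAudio":
            self.send_error(404)
            return
        # Read the uploaded bytes so the client sees a completed request.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        reply = {"transcription": f"stub: received {len(audio)} bytes"}
        body = json.dumps(reply).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_stub_server(port: int = 8888) -> HTTPServer:
    """Create (but do not start) a stub server listening on all interfaces."""
    return HTTPServer(("0.0.0.0", port), StubHandler)
```

Calling `make_stub_server().serve_forever()` starts it on port 8888. If the ESP32 reports the byte count back as its transcription, Wi-Fi, the URL, and the upload path are working, and any remaining problem lies with the audio or the real server.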

Deploy on Render (Recommended)

Build Command: pip install -r requirements.txt
Start Command: gunicorn app:app
Environment Variable: PYTHON_VERSION=3.10
Gunicorn binds to the PORT environment variable that Render sets, so no --bind flag is needed here.

ESP32 Client Setup

Open Arduino IDE (2.3.x).
Install ESP32 Arduino Core 3.3.1 via Boards Manager.
Open the SpeechToText_ESP32 example in this repository.
Configure:
- Wi-Fi SSID/PASS
- Server URL (local or Render), e.g.:
```
STT.serverURL("https://<your-espeechserver>/uploadAudio");
```
- I2S pins to match your hardware (SCK/BCK, WS, SD).
Build & flash.

Usage Flow

(Optional) Run wake word detection loop.
On trigger:
- Stop WW loop, deinit I2S cleanly.
- Call STT.recordAudio() to capture STT audio.
- Call STT.getTranscription() to get the transcription as a string, parsed from the server's JSON response.
- Re-init WW loop if required.

Avoid simultaneous ownership of the I2S port to prevent i2s_driver_install errors.

API (Server)

POST /uploadAudio
Body: WAV (binary or multipart).
Response:
```
{ "transcription": "Hello, how are you?" }
```

curl example:
```
curl -X POST http://localhost:8888/uploadAudio --data-binary "@yourfile.wav"
```
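
The same request can be made from Python, which is handy for scripted tests. This is a minimal sketch using only the standard library. The function name `transcribe_wav` is hypothetical, and the `audio/wav` content type is an assumption; the server may accept other types or ignore the header.

```python
import json
import urllib.request


def transcribe_wav(wav_bytes: bytes,
                   url: str = "http://localhost:8888/uploadAudio",
                   timeout: float = 30.0) -> str:
    """POST raw WAV bytes and return the "transcription" field of the reply."""
    request = urllib.request.Request(
        url,
        data=wav_bytes,
        method="POST",
        headers={"Content-Type": "audio/wav"},  # assumption, see note above
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["transcription"]


# Usage, with a server running locally:
#   with open("yourfile.wav", "rb") as f:
#       print(transcribe_wav(f.read()))
```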

Hardware Reference

Example wiring shown for INMP441 + ESP32-S3 (see: SpeechToText/img/HardWareSetUP.png). Ensure your I2S pin mapping in the sketch matches your board.

Troubleshooting

i2s port is in use / i2s_driver_install(...): configuration is invalid
- Ensure the wake word task is stopped and i2s_driver_uninstall() completed before ESpeech initializes I2S.
- Do not install the I2S driver twice.
No transcription returned
- Verify server reachability and URL.
- Confirm WAV parameters (16-bit, mono, 8 kHz or 16 kHz sample rate).
- Check server logs for decoding/engine errors.
Board/core mismatch
- Use ESP32 Arduino Core 3.3.1.
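
For the "No transcription returned" case, the WAV parameters can be checked on a PC by saving an upload and inspecting its header. The sketch below uses Python's standard `wave` module. The function name `check_wav` is hypothetical, and the accepted values simply mirror the parameters listed above.

```python
import io
import wave


def check_wav(wav_bytes: bytes) -> list[str]:
    """Return a list of parameter problems; an empty list means all checks pass."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getframerate() not in (8000, 16000):
            problems.append(f"{wav.getframerate()} Hz, expected 8000 or 16000")
    return problems


# Usage:
#   with open("yourfile.wav", "rb") as f:
#       print(check_wav(f.read()) or "WAV parameters look fine")
```

If `wave.open` raises `wave.Error`, the file is not a valid PCM WAV at all, which usually points to a malformed header on the client side.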

Change Log (Summary)

Align with ESP32 Arduino Core 3.3.1
Deterministic I2S init/deinit to prevent double-install
Example updates for wake word ↔ STT handoff
Cleaned includes and configuration to avoid FS conflicts

License

This repository: MIT (see LICENSE). 3rd-party libraries retain their respective licenses.
