feat: audio speech to text partition #4264
claytonlin1110 wants to merge 19 commits into Unstructured-IO:main from
|
@PastelStorm Would be appreciated if you can review this. Thanks. |
|
@PastelStorm Feel free to review when you have a chance |
PastelStorm
left a comment
PR Review: feat/audio-speech-to-text-partition
Thanks for the contribution! Adding audio partitioning to the pipeline is a great feature. I've done a thorough review and have feedback organized by severity below.
Is the Whisper-based STT approach sound?
The general approach -- Whisper for local speech-to-text, pluggable agent interface mirroring the OCR pattern, optional [audio] extra -- is solid. However, several industry-standard practices are missing that would be needed before this is production-ready. Details below.
Bugs / Dead Code
1. Unreachable code in partition_audio (line 61-62)
else:
    if file is None:
        raise ValueError("Either filename or file must be provided.")
    file.seek(0)
exactly_one(filename=filename, file=file) on line 55 already raises if both are None. When we enter the else branch (filename is None), file is guaranteed to be not-None. This guard is dead code and should be removed.
2. Metadata handling is duplicated with the @apply_metadata decorator
Lines 79-93 of partition_audio manually build and apply filename, metadata_filename, and metadata_last_modified metadata. But the @apply_metadata(FileType.WAV) decorator already handles all of this (see unstructured/partition/common/metadata.py lines 201-209). The decorator runs after the function returns, so it will overwrite whatever the function sets.
The existing partition_text partitioner shows the correct, minimal pattern:
element = NarrativeText(text=text)
element.metadata = ElementMetadata(
    last_modified=get_last_modified_date(filename) if filename else None,
)
element.metadata.detection_origin = "speech_to_text"
return [element]
Only get_last_modified_date() auto-detection and detection_origin need to be set by the partitioner. Everything else (metadata_filename override, metadata_last_modified override, filetype MIME type) is the decorator's job.
3. Dead TYPE_CHECKING block in speech_to_text_interface.py
from typing import TYPE_CHECKING
...
if TYPE_CHECKING:
    pass
Both the import and the block are unused. Looks like leftover scaffolding.
4. Stale comment in model.py (line 48)
"Note not all of these can be partitioned, e.g. WAV and ZIP have no partitioner."
WAV now has a partitioner -- this comment should be updated.
Architecture / Design
5. Single NarrativeText element for entire transcript
This is probably the most significant design issue. Every other partitioner in this codebase produces multiple elements from a document (paragraphs, titles, list items, etc.). This partitioner collapses the entire audio transcript into one monolithic NarrativeText, which:
- Loses the segment-level structure Whisper already provides (result["segments"] includes timestamps, per-segment text)
- Produces one massive text blob for long audio (a 1-hour meeting becomes a single element)
- Undermines downstream chunking, since the chunker has to re-split text that was already segmented by Whisper
I'd suggest producing one NarrativeText per Whisper segment, or at least grouping segments into paragraph-sized elements. The segment timestamps could also be stored in element metadata.
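A minimal sketch of the per-segment approach suggested above. `TranscriptElement` is a hypothetical stand-in for `NarrativeText` plus its metadata, and the `start`/`end` field names are assumptions; the input follows the shape of the `result["segments"]` list mentioned above (dicts carrying `start`, `end`, and `text`).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TranscriptElement:
    """Hypothetical stand-in for NarrativeText plus timestamp metadata."""

    text: str
    start: float
    end: float


def segments_to_elements(segments: list[dict]) -> list[TranscriptElement]:
    """Emit one element per non-empty segment, keeping its timestamps."""
    elements = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip silent or whitespace-only segments
        elements.append(TranscriptElement(text=text, start=seg["start"], end=seg["end"]))
    return elements


# Shaped like the "segments" list in a Whisper transcription result.
example = [
    {"start": 0.0, "end": 2.5, "text": " Hello everyone."},
    {"start": 2.5, "end": 3.0, "text": "  "},
    {"start": 3.0, "end": 6.1, "text": " Let's get started."},
]
elements = segments_to_elements(example)
```

Grouping adjacent segments into paragraph-sized elements would be a variation on the same loop, merging text and keeping the first `start` and last `end`.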
6. Whisper model size is hardcoded to "base" with no configuration
class SpeechToTextAgentWhisper(SpeechToTextAgent):
    def __init__(self, model_size: str = "base") -> None:
        import whisper
        self._model = whisper.load_model(model_size)
There's no env var (e.g. WHISPER_MODEL_SIZE) or any other mechanism to configure this. Model size has dramatic effects on accuracy, latency, and memory (from tiny at ~1GB VRAM to large-v3 at ~10GB). Compare to how the OCR agent configuration is fully driven by environment variables.
7. No GPU/device or FP16 configuration
whisper.load_model() supports a device parameter and transcribe() supports fp16. These are standard levers for production deployments (FP16 gives ~2x GPU speedup). Neither is exposed or configurable here.
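One way items 6 and 7 could be addressed together, sketched with plain `os.environ` rather than the project's `env_config` (whose internals are not shown in this thread). `WHISPER_MODEL_SIZE` is the name floated above; `WHISPER_DEVICE` and `WHISPER_FP16` as used here, the `WhisperSettings` class, and the accepted truthy strings are all assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WhisperSettings:
    """Hypothetical container for the three knobs discussed above."""

    model_size: str
    device: Optional[str]  # None lets the library pick a device
    fp16: bool


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean env var; unset falls back to the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_whisper_settings(cuda_available: bool) -> WhisperSettings:
    """Read settings from the environment; FP16 defaults to on only with CUDA."""
    return WhisperSettings(
        model_size=os.environ.get("WHISPER_MODEL_SIZE", "base"),
        device=os.environ.get("WHISPER_DEVICE") or None,
        fp16=_env_bool("WHISPER_FP16", default=cuda_available),
    )
```

The agent would then pass these through, roughly `whisper.load_model(settings.model_size, device=settings.device)` followed by `model.transcribe(path, fp16=settings.fp16)`.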
8. lru_cache(maxsize=1) is inconsistent with OCR agent pattern
The OCR agent uses @functools.lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE) -- configurable cache size with (module, language) as key. The STT agent hardcodes maxsize=1 with only (module,) as key. This should follow the established pattern with a configurable STT_AGENT_CACHE_SIZE in env_config.
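A rough sketch of a configurable cache size, assuming the value is read once at import time. `STT_AGENT_CACHE_SIZE` is the name proposed above; the default of 1 and the `FakeAgent` stand-in are assumptions, and the real `env_config` accessor is not reproduced here.

```python
import functools
import os

# Env var name proposed in the review; the default of 1 is an assumption.
_CACHE_SIZE = int(os.environ.get("STT_AGENT_CACHE_SIZE", "1"))


class FakeAgent:
    """Stand-in for an STT agent that is expensive to construct."""

    def __init__(self, module: str) -> None:
        self.module = module


@functools.lru_cache(maxsize=_CACHE_SIZE)
def get_instance(agent_module: str) -> FakeAgent:
    """Return a cached agent; repeated calls with the same key reuse it."""
    return FakeAgent(agent_module)
```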
9. Missing get_agent() convenience method
The OCR agent has get_agent(language) (reads config, resolves module) and get_instance(module, language) (does the import). The STT agent only has get_instance(), so the config resolution logic (stt_agent or env_config.STT_AGENT) lives in partition_audio instead of in the agent class where it belongs:
# Currently in partition_audio -- this should be in SpeechToTextAgent
agent_module = stt_agent or env_config.STT_AGENT
agent = SpeechToTextAgent.get_instance(agent_module)
10. WAV-only despite Whisper supporting many formats
Whisper natively handles MP3, M4A, FLAC, OGG, etc. via ffmpeg. The implementation is artificially limited to WAV -- the temp file suffix is hardcoded to .wav, and only FileType.WAV is registered. Even if broader format support isn't in scope for this PR, the temp file suffix should at least be derived from the input rather than hardcoded.
Minor / Nits
11. No error handling around whisper.load_model()
This can fail with network errors (model download), CUDA OOM, or invalid model size strings. A try/except with a helpful error message would improve the developer experience.
12. Temp file is read entirely into memory
file.seek(0)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(file.read())
file.read() loads the entire audio file into memory. For large files, shutil.copyfileobj() would be more memory-efficient.
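A minimal sketch combining this point with the suffix concern from item 10: stream the upload to disk in chunks and derive the suffix from the source name when one is known. The helper name, the `source_name` parameter, and the `.wav` fallback are assumptions.

```python
import io
import os
import shutil
import tempfile
from pathlib import Path
from typing import IO, Optional


def spool_to_temp_file(file: IO[bytes], source_name: Optional[str] = None) -> str:
    """Copy a file-like object to a named temp file and return its path.

    The caller is responsible for deleting the returned path.
    """
    suffix = Path(source_name).suffix if source_name else ""
    file.seek(0)
    with tempfile.NamedTemporaryFile(suffix=suffix or ".wav", delete=False) as tmp:
        shutil.copyfileobj(file, tmp)  # copies in fixed-size chunks
    return tmp.name


path = spool_to_temp_file(io.BytesIO(b"RIFF....WAVE"), source_name="meeting.mp3")
try:
    size = os.path.getsize(path)
finally:
    os.unlink(path)  # always clean up, even if later steps fail
```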
What's good
- Clean integration with the existing FileType/_PartitionerLoader dispatch system
- Correct use of @apply_metadata and @add_chunking_strategy decorators
- Agent interface mirrors the established OCR agent pattern (ABC, dynamic import, whitelist)
- Module whitelist security check for dynamic imports
- Optional dependency via [audio] extra
- Tests correctly mock the agent rather than requiring Whisper to be installed
Summary
The PR adds a solid foundation for audio partitioning, and the structural choices (pluggable agent, optional extra, decorator integration) are the right ones. The main areas to address before merging:
- Must fix: Remove dead code (unreachable guard, empty TYPE_CHECKING), remove duplicated metadata handling, fix stale comment
- Should fix: Produce multiple elements from Whisper segments instead of a single blob; make model size, device, and cache size configurable via env vars; add get_agent() convenience method
- Nice to have: Support more audio formats, streaming temp file copy, error handling on model load
|
@PastelStorm Updated. Please check. |
Please update the branch, run linter and tests locally before pushing. Appreciate your work on this PR! |
5b8bf47 to
67595a3
|
@PastelStorm |
|
@PastelStorm yeah my thought was correct, and re-running CI passed. please check and review. |
|
@PastelStorm Please check and give me feedback. |
PR #4264 Review: Audio Speech-to-Text Partition
Bugs / Will Crash
1. Whisper's FP16 mode only works on CUDA GPUs. The default
Fix: default to
@property
def WHISPER_FP16(self) -> bool:
    import torch
    default = torch.cuda.is_available()
    return self._get_bool("WHISPER_FP16", default)
2. Whisper version upper bound
Design / Architecture Issues
3. Both methods independently build the same
def transcribe(self, audio_path: str, *, language: str | None = None) -> str:
    segments = self.transcribe_segments(audio_path, language=language)
    return " ".join(seg["text"] for seg in segments)
4. The
Recommend separating the import phase from instantiation:
try:
    mod = importlib.import_module(module_name)
    agent_cls = getattr(mod, class_name)
except (ImportError, AttributeError) as e:
    raise RuntimeError("Could not load the SpeechToText agent class...") from e
try:
    return agent_cls()
except Exception as e:
    raise RuntimeError(f"STT agent {class_name} loaded but failed to initialize: {e}") from e
5. Thread safety of cached Whisper model (Medium)
The
If there's any multi-threaded usage path (e.g. a web server), this needs a
6. When elements are consolidated,
7. An entire
last_modified = get_last_modified_date(filename) if filename else None
This matches the pattern used by
8. Should use existing
Test Issues
9. Test name claims cleanup verification but doesn't verify it (High)
10. Mock target is fragile (Medium)
Tests patch
11. File leak in
The temp file is created with
def test_partition_audio_raises_with_both_filename_and_file():
    with pytest.raises(ValueError):
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            partition_audio(filename=tmp.name, file=tmp)
    # tmp leaked -- no cleanup
12. Missing test: transcription failure + temp file cleanup (Medium-High)
No test verifies that the temp file is cleaned up when
13. Missing test coverage (Medium)
Minor / Nits
14.
15. Lazy import in
The
16. Version bump in feature PR (Low)
The PR bumps
What's Good
|
|
@PastelStorm Updated. |
|
@PastelStorm Please review as I have fixed everything. |
7c81d7c to
b6d3ac1
|
@PastelStorm Would you please review this PR ASAP? Since the main branch is updated frequently, I keep having to resolve conflicts and test issues...
Here's the combined PR review message:
PR Review: Audio Speech-to-Text Partition
Overall this is a well-structured PR that follows existing patterns in the codebase -- the STT agent abstraction mirrors the OCR agent, it uses the standard
Bugs
1. Stale docstring still claims WAV is non-partitionable
Design Issues
2.
Consequences:
Recommendation: either scope the public API and docs explicitly to WAV-only (remove the generic
3. Potential infinite recursion foot-gun in the
The base class
This works because Whisper overrides both methods, breaking the cycle. But it creates a subtle trap: any future agent that overrides only
4. STT agent caching/locking creates process-wide serialization bottleneck
Relatedly, the
Scope / Housekeeping
5. Unrelated fixes bundled into the audio PR
Three changes are orthogonal to the audio feature:
These are reasonable changes on their own but inflate the diff and could introduce file-type detection regressions for non-audio formats. Consider splitting them into a separate PR.
6. Consolidation strategy for segment timestamps is DROP
The TODO comment explains this well -- no MIN/MAX consolidation strategy exists yet, so chunking silently discards
Tests
7. Weak assertion in agent config test
8. Duplicated test setup
The test file repeatedly recreates near-identical temp-file setup and patch scaffolding across many tests. Extracting shared fixtures for temp audio file creation and mocked agent wiring would reduce maintenance burden and make behavior-differentiating assertions easier to spot.
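On finding 3: it reads as though the base class defines `transcribe` and `transcribe_segments` in terms of each other. Assuming that reading is right, one way to remove the trap is to make exactly one method abstract and derive the other from it, so an incomplete subclass fails at instantiation instead of recursing. The signatures are simplified and `EchoAgent` is hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SpeechToTextAgent(ABC):
    """Sketch: only transcribe_segments is abstract; transcribe derives from it."""

    @abstractmethod
    def transcribe_segments(self, audio_path: str) -> list[dict]:
        """Return segment dicts, each with at least a "text" key."""

    def transcribe(self, audio_path: str) -> str:
        """Join segment texts; never calls back into a default implementation."""
        return " ".join(seg["text"] for seg in self.transcribe_segments(audio_path))


class EchoAgent(SpeechToTextAgent):
    """Hypothetical agent that returns fixed segments."""

    def transcribe_segments(self, audio_path: str) -> list[dict]:
        return [{"text": "hello"}, {"text": "world"}]
```

With this shape there is no cycle to break: a subclass that forgets `transcribe_segments` cannot be instantiated at all.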
|
@PastelStorm updated. please review. |
|
@PastelStorm Feel free to take a final review :) Thanks. |
There will be many reviews, given how many issues are being detected each time. After you fix the issues below please let Claude/ChatGPT review your changes to speed up the review process. Thank you. Review:
|
| Priority | # | Issue |
|---|---|---|
| High | 1 | except Exception in email/msg — silent data loss, behavioral regression |
| High | 2 | get_instance doesn't verify issubclass(SpeechToTextAgent) |
| Medium | 3 | Audio FileTypes hardcode ["whisper"] dependency despite pluggable agent |
| Medium | 4 | WHISPER_FP16 bypasses _get_bool, inconsistent env-var handling |
| Medium | 5 | Duplicated segment filtering in both agent and partitioner |
| Medium | 6 | transcribe() is unused dead code |
| Low | 7 | No cache_clear() in tests — inter-test state leakage risk |
| Low | 8 | Inconsistent temp file handling / potential leaks in tests |
| Low | 9 | WEBM alias includes video/webm — ambiguous semantics |
| Low | 10 | Misleading control flow around temp file deletion vs. MIME detection |
| Low | 11 | filetype.py fallback is a no-op when both detectors fail |
|
@PastelStorm Would you please review one more time today? Also it would be great if you can let me know how I can let claude review my changes after push. Did you connect Claude with this github? |
I mean your own subscription or a free tier that is available to everyone. |
Will try, i updated though so feel free to review the last changes. |
I now have a complete picture of the updated branch. Here's the re-review. Re-Review:
|
| # | Issue | Status |
|---|---|---|
| 1 | except Exception in email/msg | Fixed -- narrowed to explicit tuple, logs warning with filename |
| 2 | get_instance doesn't verify issubclass | Fixed -- isinstance(loaded_class, type) + issubclass checks added |
| 3 | Audio FileTypes hardcode ["whisper"] | Fixed -- all audio types now use cast(list[str], []), deps validated at runtime |
| 4 | WHISPER_FP16 bypasses _get_bool | Fixed -- now uses os.environ presence check + delegates to _get_bool |
| 5 | Duplicated segment filtering | Fixed -- filtering removed from whisper_stt.py, contract documented on TranscriptionSegment |
| 6 | transcribe() dead code | Fixed -- removed entirely |
| 7 | No cache_clear() fixture | Fixed -- _clear_stt_cache fixture added, applied to TestSpeechToTextAgentInterface |
| 8 | Temp file leaks / inconsistency | Fixed -- tests now use _tmp_audio() consistently |
| 9 | WEBM alias video/webm | Not addressed |
| 10 | Misleading control flow (MIME detection after temp deletion) | Fixed -- MIME type + last_modified now computed before temp file lifecycle |
| 11 | filetype.py fallback no-op | Partially addressed -- comment added, but logic unchanged |
New issues introduced by the fixes
1. (Medium) TypeError from issubclass check escapes unhandled
The new validation in get_instance (lines 61-71) raises TypeError when the loaded class isn't a type or isn't a SpeechToTextAgent subclass. But this TypeError is raised inside a try block whose except only catches (ImportError, AttributeError):
try:
    mod = importlib.import_module(module_name)
    loaded_class = getattr(mod, class_name)
    if not isinstance(loaded_class, type):
        raise TypeError(...)  # <-- not caught below
    if not issubclass(loaded_class, SpeechToTextAgent):
        raise TypeError(...)  # <-- not caught below
except (ImportError, AttributeError) as e:
    # TypeError propagates uncaught past this handler
    raise RuntimeError(...) from e
The TypeError will propagate raw to the caller with no wrapping in RuntimeError and no logging. This is inconsistent with the error UX for ImportError/AttributeError (which get a friendly message about installing the audio extra). It also means the TypeError message will appear to the caller without the surrounding context.
Fix: Either add TypeError to the except clause (with a different message that doesn't suggest installing the audio extra), or move the issubclass checks after the try/except block so they produce a clean, distinct error path.
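A sketch of the second option, with the type checks moved out of the try block so import failures and validation failures take separate paths. The base class here is a local stand-in, and the function name and messages are assumptions rather than the PR's actual code.

```python
import importlib


class SpeechToTextAgent:
    """Stand-in base class for this sketch."""


def load_agent_class(module_name: str, class_name: str) -> type:
    """Import a class by name, then validate it as a separate step."""
    try:
        mod = importlib.import_module(module_name)
        loaded = getattr(mod, class_name)
    except (ImportError, AttributeError) as exc:
        raise RuntimeError(
            f"Could not load {module_name}.{class_name}; is the audio extra installed?"
        ) from exc

    # Validation sits outside the try block, so it gets its own error path.
    if not isinstance(loaded, type) or not issubclass(loaded, SpeechToTextAgent):
        raise TypeError(f"{module_name}.{class_name} is not a SpeechToTextAgent subclass")
    return loaded
```

The two failure modes this guards against (a non-class attribute, and a class that is not a subclass) are also the natural test cases for item 4 below.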
2. (Medium) EXPECTED_ATTACHMENT_ERRORS includes RuntimeError — still too broad
The new explicit tuple in both email.py and msg.py is a big improvement, but RuntimeError is still a very broad exception type:
EXPECTED_ATTACHMENT_ERRORS: Final[tuple[type[BaseException], ...]] = (
    UnsupportedFileFormatError,
    ImportError,
    FileNotFoundError,
    RuntimeError,  # <-- catches any RuntimeError from any code path
)
RuntimeError is commonly raised by unrelated failures, e.g. torch OOM, internal assertion failures in PDF parsing, broken pipe, etc. Including it means a corrupt PDF attachment that triggers a RuntimeError deep in pdfminer will be silently skipped with only a warning log. This is better than the previous except Exception, but the original concern about masking real bugs partially remains.
Recommendation: Consider introducing a specific exception (e.g. PartitioningDependencyError or similar) that the STT agent and other partition code raises for "missing tool/dependency at runtime" scenarios (like ffmpeg not found). Then replace RuntimeError with that specific type. If that's too much scope for this PR, at least add a comment documenting which RuntimeErrors this is intended to catch.
3. (Low) EXPECTED_ATTACHMENT_ERRORS is duplicated between email.py and msg.py
The exact same tuple is defined in both files:
# email.py
EXPECTED_ATTACHMENT_ERRORS: Final[tuple[type[BaseException], ...]] = (
    UnsupportedFileFormatError, ImportError, FileNotFoundError, RuntimeError,
)
# msg.py
EXPECTED_ATTACHMENT_ERRORS: tuple[type[BaseException], ...] = (
    UnsupportedFileFormatError, ImportError, FileNotFoundError, RuntimeError,
)
Note also the minor inconsistency: email.py uses Final[...] while msg.py does not. This is a maintenance hazard: if the set of expected errors changes, both files must be updated in lockstep. This constant should live in unstructured.partition.common (or similar shared module) and be imported by both.
4. (Low) No test coverage for the new issubclass / isinstance validation
The issubclass check is new and addresses a previous review finding, but TestSpeechToTextAgentInterface has no test for it. There's a test for the whitelist rejection (test_get_instance_rejects_non_whitelisted_module) but none for:
- Pointing to a valid, whitelisted module attribute that isn't a class (e.g. a function or constant)
- Pointing to a class that exists but doesn't subclass SpeechToTextAgent
These are the exact scenarios the new code defends against, and they should have tests.
5. (Low) _clear_stt_cache patches get_instance on the real SpeechToTextAgent but two tests in the same class mock get_instance away entirely
TestSpeechToTextAgentInterface uses @pytest.mark.usefixtures("_clear_stt_cache"), which calls SpeechToTextAgent.get_instance.cache_clear() before and after. But test_get_agent_uses_env_config_when_no_module_given and test_get_agent_passes_explicit_module_to_get_instance both _patch.object(SpeechToTextAgent, "get_instance"), which replaces get_instance entirely — so cache_clear() in the fixture teardown operates on the real (unpatched) function, not the mock. This isn't wrong (the mock is scoped with with and restored before teardown), but it means the cache-clear fixture is a no-op for those two tests. Only test_get_instance_rejects_non_whitelisted_module actually benefits from it. This is fine but worth knowing.
Remaining from prior review (unchanged)
6. (Low) WEBM alias includes video/webm
WEBM = ("webm", "audio", ..., "audio/webm", ["video/webm"])
A .webm file containing video will be detected as FileType.WEBM and routed to partition_audio. This is semantically ambiguous since WebM is a container for both audio and video content.
Summary
The fixes are well-executed and address the most serious concerns from the first review. The branch is in much better shape. The remaining issues are:
| Priority | # | Issue |
|---|---|---|
| Medium | 1 | TypeError from new issubclass check propagates unhandled (inconsistent error UX) |
| Medium | 2 | RuntimeError in EXPECTED_ATTACHMENT_ERRORS is still broad |
| Low | 3 | EXPECTED_ATTACHMENT_ERRORS duplicated between email.py and msg.py (+ Final inconsistency) |
| Low | 4 | No test for the new issubclass/isinstance validation paths |
| Low | 5 | _clear_stt_cache is effectively a no-op for 2 of 3 tests in the class |
| Low | 6 | WEBM alias includes video/webm — ambiguous semantics (unchanged) |
|
@PastelStorm I just updated. |
Summary
Enables partitioning of WAV audio files into document elements by transcribing with an optional speech-to-text (STT) agent, defaulting to Whisper.
Closes #4029
Changes:
Usage
pip install "unstructured[audio]" then partition("file.wav") or partition_audio("file.wav", language="en").