feat: audio speech to text partition #4264

Open
claytonlin1110 wants to merge 19 commits into Unstructured-IO:main from claytonlin1110:feat/audio-speech-to-text-partition

Conversation

@claytonlin1110
Contributor

Summary

Enables partitioning of WAV audio files into document elements by transcribing with an optional speech-to-text (STT) agent, defaulting to Whisper.
Closes #4029

Changes:

  • New partition_audio() and routing for FileType.WAV so partition() supports audio.
  • Pluggable STT layer: SpeechToTextAgent interface and SpeechToTextAgentWhisper implementation.
  • Optional extra audio in pyproject.toml (openai-whisper); all-docs includes audio.
  • Config: STT_AGENT (and STT_AGENT_MODULES_WHITELIST) for choosing the STT implementation.

Usage

pip install "unstructured[audio]" then partition("file.wav") or partition_audio("file.wav", language="en").

@claytonlin1110
Contributor Author

@PastelStorm I'd appreciate it if you could review this. Thanks.

@claytonlin1110
Contributor Author

@PastelStorm Feel free to review when you have a chance

@PastelStorm
Contributor

PR Review: feat/audio-speech-to-text-partition

Thanks for the contribution! Adding audio partitioning to the pipeline is a great feature. I've done a thorough review and have feedback organized by severity below.


Is the Whisper-based STT approach sound?

The general approach -- Whisper for local speech-to-text, pluggable agent interface mirroring the OCR pattern, optional [audio] extra -- is solid. However, several industry-standard practices are missing that would be needed before this is production-ready. Details below.


Bugs / Dead Code

1. Unreachable code in partition_audio (line 61-62)

else:
    if file is None:
        raise ValueError("Either filename or file must be provided.")
    file.seek(0)

exactly_one(filename=filename, file=file) on line 55 already raises if both are None. When we enter the else branch (filename is None), file is guaranteed to be not-None. This guard is dead code and should be removed.

2. Metadata handling is duplicated with the @apply_metadata decorator

Lines 79-93 of partition_audio manually build and apply filename, metadata_filename, and metadata_last_modified metadata. But the @apply_metadata(FileType.WAV) decorator already handles all of this (see unstructured/partition/common/metadata.py lines 201-209). The decorator runs after the function returns, so it will overwrite whatever the function sets.

The existing partition_text partitioner shows the correct, minimal pattern:

element = NarrativeText(text=text)
element.metadata = ElementMetadata(
    last_modified=get_last_modified_date(filename) if filename else None,
)
element.metadata.detection_origin = "speech_to_text"
return [element]

Only get_last_modified_date() auto-detection and detection_origin need to be set by the partitioner. Everything else (metadata_filename override, metadata_last_modified override, filetype MIME type) is the decorator's job.

3. Dead TYPE_CHECKING block in speech_to_text_interface.py

from typing import TYPE_CHECKING
...
if TYPE_CHECKING:
    pass

Both the import and the block are unused. Looks like leftover scaffolding.

4. Stale comment in model.py (line 48)

"Note not all of these can be partitioned, e.g. WAV and ZIP have no partitioner."

WAV now has a partitioner -- this comment should be updated.


Architecture / Design

5. Single NarrativeText element for entire transcript

This is probably the most significant design issue. Every other partitioner in this codebase produces multiple elements from a document (paragraphs, titles, list items, etc.). This partitioner collapses the entire audio transcript into one monolithic NarrativeText, which:

  • Loses the segment-level structure Whisper already provides (result["segments"] includes timestamps, per-segment text)
  • Produces one massive text blob for long audio (a 1-hour meeting becomes a single element)
  • Undermines downstream chunking, since the chunker has to re-split text that was already segmented by Whisper

I'd suggest producing one NarrativeText per Whisper segment, or at least grouping segments into paragraph-sized elements. The segment timestamps could also be stored in element metadata.
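A hedged sketch of the suggested shape, with plain dicts standing in for the real NarrativeText/ElementMetadata objects (the function name and metadata keys here are illustrative, not the PR's API):

```python
# Sketch: one element per non-empty Whisper segment, with the segment
# timestamps preserved in metadata rather than discarded.

def elements_from_segments(segments: list[dict]) -> list[dict]:
    elements = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # drop whitespace-only segments
        elements.append({
            "text": text,
            "metadata": {
                "segment_start_seconds": seg["start"],
                "segment_end_seconds": seg["end"],
            },
        })
    return elements

segments = [
    {"start": 0.0, "end": 3.2, "text": " Hello everyone."},
    {"start": 3.2, "end": 7.9, "text": " Welcome to the meeting."},
    {"start": 7.9, "end": 8.0, "text": "   "},
]
elements = elements_from_segments(segments)
```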

6. Whisper model size is hardcoded to "base" with no configuration

class SpeechToTextAgentWhisper(SpeechToTextAgent):
    def __init__(self, model_size: str = "base") -> None:
        import whisper
        self._model = whisper.load_model(model_size)

There's no env var (e.g. WHISPER_MODEL_SIZE) or any other mechanism to configure this. Model size has dramatic effects on accuracy, latency, and memory (from tiny at ~1GB VRAM to large-v3 at ~10GB). Compare to how the OCR agent configuration is fully driven by environment variables.
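A minimal sketch of what env-driven configuration could look like, mirroring the OCR agent pattern (WHISPER_MODEL_SIZE is a proposed variable name, not one that exists in the codebase):

```python
import os

# Proposed env var; "base" stays the default when nothing is configured.
DEFAULT_WHISPER_MODEL_SIZE = "base"

def whisper_model_size() -> str:
    """Return the configured Whisper model size, defaulting to 'base'."""
    return os.environ.get("WHISPER_MODEL_SIZE", DEFAULT_WHISPER_MODEL_SIZE)

os.environ.pop("WHISPER_MODEL_SIZE", None)
default_size = whisper_model_size()
os.environ["WHISPER_MODEL_SIZE"] = "small"
configured_size = whisper_model_size()
```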

7. No GPU/device or FP16 configuration

whisper.load_model() supports a device parameter and transcribe() supports fp16. These are standard levers for production deployments (FP16 gives ~2x GPU speedup). Neither is exposed or configurable here.

8. lru_cache(maxsize=1) is inconsistent with OCR agent pattern

The OCR agent uses @functools.lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE) -- configurable cache size with (module, language) as key. The STT agent hardcodes maxsize=1 with only (module,) as key. This should follow the established pattern with a configurable STT_AGENT_CACHE_SIZE in env_config.

9. Missing get_agent() convenience method

The OCR agent has get_agent(language) (reads config, resolves module) and get_instance(module, language) (does the import). The STT agent only has get_instance(), so the config resolution logic (stt_agent or env_config.STT_AGENT) lives in partition_audio instead of in the agent class where it belongs:

# Currently in partition_audio -- this should be in SpeechToTextAgent
agent_module = stt_agent or env_config.STT_AGENT
agent = SpeechToTextAgent.get_instance(agent_module)

10. WAV-only despite Whisper supporting many formats

Whisper natively handles MP3, M4A, FLAC, OGG, etc. via ffmpeg. The implementation is artificially limited to WAV -- the temp file suffix is hardcoded to .wav, and only FileType.WAV is registered. Even if broader format support isn't in scope for this PR, the temp file suffix should at least be derived from the input rather than hardcoded.


Minor / Nits

11. No error handling around whisper.load_model()

This can fail with network errors (model download), CUDA OOM, or invalid model size strings. A try/except with a helpful error message would improve the developer experience.

12. Temp file is read entirely into memory

file.seek(0)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(file.read())

file.read() loads the entire audio file into memory. For large files, shutil.copyfileobj() would be more memory-efficient.
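A sketch of the chunked alternative: shutil.copyfileobj streams the file-like object to disk in fixed-size chunks rather than materializing it in memory.

```python
import io
import os
import shutil
import tempfile

def copy_to_temp(file, suffix: str = ".wav") -> str:
    """Stream a file-like object to a named temp file; return its path."""
    file.seek(0)
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(file, tmp)  # chunked copy, constant memory
        return tmp.name

src = io.BytesIO(b"RIFF....WAVEdata")
path = copy_to_temp(src)
with open(path, "rb") as f:
    copied = f.read()
os.unlink(path)
```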


What's good

  • Clean integration with the existing FileType / _PartitionerLoader dispatch system
  • Correct use of @apply_metadata and @add_chunking_strategy decorators
  • Agent interface mirrors the established OCR agent pattern (ABC, dynamic import, whitelist)
  • Module whitelist security check for dynamic imports
  • Optional dependency via [audio] extra
  • Tests correctly mock the agent rather than requiring Whisper to be installed

Summary

The PR adds a solid foundation for audio partitioning, and the structural choices (pluggable agent, optional extra, decorator integration) are the right ones. The main areas to address before merging:

  1. Must fix: Remove dead code (unreachable guard, empty TYPE_CHECKING), remove duplicated metadata handling, fix stale comment
  2. Should fix: Produce multiple elements from Whisper segments instead of a single blob; make model size, device, and cache size configurable via env vars; add get_agent() convenience method
  3. Nice to have: Support more audio formats, streaming temp file copy, error handling on model load

@claytonlin1110
Contributor Author

@PastelStorm Updated. Please check.

@PastelStorm
Contributor

@PastelStorm Updated. Please check.

Please update the branch, run linter and tests locally before pushing. Appreciate your work on this PR!

@claytonlin1110 force-pushed the feat/audio-speech-to-text-partition branch from 5b8bf47 to 67595a3 on February 25, 2026 at 21:00
@claytonlin1110
Contributor Author

@PastelStorm
https://github.com/Unstructured-IO/unstructured/actions/runs/22415785328/job/64901325310?pr=4264
Is this CI failure related to my PR? Could you please re-run CI?

@claytonlin1110
Contributor Author

@PastelStorm Yeah, my thought was correct: re-running CI passed. Please check and review.

@claytonlin1110
Contributor Author

@PastelStorm Please check and give me feedback.

@PastelStorm
Contributor

@PastelStorm Please check and give me feedback.

PR #4264 Review: Audio Speech-to-Text Partition

Bugs / Will Crash

1. WHISPER_FP16 defaults to True — crashes on CPU-only machines (High)

Whisper's FP16 mode only works on CUDA GPUs. The default WHISPER_FP16=True combined with WHISPER_DEVICE="" (auto-detect, which resolves to CPU on most dev machines) will cause a RuntimeError on any machine without a CUDA GPU — which is the majority of development environments.

Fix: default to False, or auto-detect:

@property
def WHISPER_FP16(self) -> bool:
    import torch
    default = torch.cuda.is_available()
    return self._get_bool("WHISPER_FP16", default)

2. Whisper version upper bound <20260000 excludes all 2026 releases (Medium)

20260000 is not a valid calendar date (month 00), and since we're already in 2026, any Whisper release this year (e.g. 20260115) would be excluded. Compare with pdfminer.six>=20251230, <20270000 elsewhere in the same file. Should be <20270000.


Design / Architecture Issues

3. transcribe() vs transcribe_segments() code duplication in whisper_stt.py (Medium)

Both methods independently build the same options dict and call self._model.transcribe(). If a new option is added (e.g. temperature, beam_size), it must be updated in two places. transcribe() is also effectively dead code since partition_audio only calls transcribe_segments().

transcribe() should delegate to transcribe_segments():

def transcribe(self, audio_path: str, *, language: str | None = None) -> str:
    segments = self.transcribe_segments(audio_path, language=language)
    return " ".join(seg["text"] for seg in segments)

4. get_instance exception handling doesn't cover constructor failures (Medium)

The except (ImportError, AttributeError) only covers module loading, not cls() invocation. Unlike the OCR agent (whose __init__ is trivial: self.language = language), the Whisper __init__ does heavy lifting — model downloads, GPU allocation — making constructor exceptions expected failure modes. A custom STT agent with required constructor args would also throw TypeError, uncaught.

Recommend separating the import phase from instantiation:

try:
    mod = importlib.import_module(module_name)
    agent_cls = getattr(mod, class_name)
except (ImportError, AttributeError) as e:
    raise RuntimeError("Could not load the SpeechToText agent class...") from e
try:
    return agent_cls()
except Exception as e:
    raise RuntimeError(f"STT agent {class_name} loaded but failed to initialize: {e}") from e

5. Thread safety of cached Whisper model (Medium)

The lru_cache on get_instance means a single SpeechToTextAgentWhisper (and its self._model) is shared across all callers. Unlike Tesseract OCR (which spawns subprocesses), Whisper runs in-process and model.transcribe() is not documented as thread-safe — concurrent calls can corrupt state or cause CUDA errors.

If there's any multi-threaded usage path (e.g. a web server), this needs a threading.Lock or documentation that it requires process-based parallelism.

6. ConsolidationStrategy.DROP for segment timestamps loses temporal data during chunking (Medium)

When elements are consolidated, segment_start_seconds and segment_end_seconds are dropped. Ideally a chunk spanning segments 1-3 would keep min(start) and max(end). The existing ConsolidationStrategy enum doesn't have MIN/MAX variants, so DROP is the safe choice for now — but this should be noted with a TODO and a follow-up issue.
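A sketch of what the hypothetical MIN/MAX consolidation from the follow-up could do for a chunk spanning several segments, keeping the earliest start and latest end (no such ConsolidationStrategy variant exists yet):

```python
def consolidate_timestamps(metas: list[dict]) -> dict:
    """Merge per-segment timestamps into a single chunk-level span."""
    return {
        "segment_start_seconds": min(m["segment_start_seconds"] for m in metas),
        "segment_end_seconds": max(m["segment_end_seconds"] for m in metas),
    }

merged = consolidate_timestamps([
    {"segment_start_seconds": 0.0, "segment_end_seconds": 3.2},
    {"segment_start_seconds": 3.2, "segment_end_seconds": 7.9},
    {"segment_start_seconds": 7.9, "segment_end_seconds": 12.5},
])
```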

7. base_metadata object is wasteful (Nit)

An entire ElementMetadata is constructed just to hold a last_modified value, then detection_origin is set on it but never read from it. This should be a plain variable:

last_modified = get_last_modified_date(filename) if filename else None

This matches the pattern used by partition_rst, partition_org, partition_md, etc.

8. Should use existing is_temp_file_path() utility (Minor)

partition_audio inlines audio_path.startswith(tempfile.gettempdir()), but unstructured/utils.py already provides is_temp_file_path() — used by partition_docx, partition_html, partition_csv, etc. Use the utility for consistency.


Test Issues

9. Test name claims cleanup verification but doesn't verify it (High)

test_partition_audio_from_file_uses_temp_path_and_cleans_up never asserts that the temp file was actually deleted. The test only checks element text and metadata. It should capture the temp file path from the mock call and assert not Path(temp_path).exists().
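A sketch of the missing assertion pattern, with fake_partition_audio standing in for the real partition_audio: capture the temp path the mocked agent received, then assert it no longer exists.

```python
import io
import os
import tempfile
from pathlib import Path
from unittest.mock import MagicMock

def fake_partition_audio(file, agent):
    """Stand-in: write to a temp file, transcribe, always clean up."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(file.read())
        path = tmp.name
    try:
        return agent.transcribe_segments(path)
    finally:
        os.unlink(path)  # the cleanup under test

agent = MagicMock()
agent.transcribe_segments.return_value = [{"text": "hi", "start": 0.0, "end": 1.0}]
fake_partition_audio(io.BytesIO(b"data"), agent)

# The assertion the test is missing:
temp_path = agent.transcribe_segments.call_args.args[0]
cleaned_up = not Path(temp_path).exists()
```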

10. Mock target is fragile (Medium)

Tests patch SpeechToTextAgent.get_instance but partition_audio calls SpeechToTextAgent.get_agent(). This works because get_agent delegates to get_instance, but it couples tests to an internal implementation detail. If get_agent is refactored to stop calling get_instance, every test breaks. Should patch get_agent directly.

11. File leak in test_partition_audio_raises_with_both_filename_and_file (Medium)

The temp file is created with delete=False and never cleaned up:

def test_partition_audio_raises_with_both_filename_and_file():
    with pytest.raises(ValueError):
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            partition_audio(filename=tmp.name, file=tmp)
    # tmp leaked — no cleanup
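A leak-free sketch: keep delete=False but guarantee removal in a finally block (or reuse a shared temp-file helper). The ValueError-raising stand-in mimics partition_audio rejecting both arguments.

```python
import os
import tempfile

def fake_partition_audio(filename=None, file=None):
    """Stand-in mimicking exactly_one() rejecting both inputs."""
    if filename is not None and file is not None:
        raise ValueError("Only one of filename and file may be provided.")

tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
try:
    raised = False
    try:
        fake_partition_audio(filename=tmp.name, file=tmp)
    except ValueError:
        raised = True
finally:
    tmp.close()
    os.unlink(tmp.name)  # the cleanup the original test omits
leaked = os.path.exists(tmp.name)
```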

12. Missing test: transcription failure + temp file cleanup (Medium-High)

No test verifies that the temp file is cleaned up when transcribe_segments() raises an exception. This is the exact scenario the finally block is supposed to handle.

13. Missing test coverage (Medium)

  • No test for stt_agent parameter forwarding to get_agent()
  • No test for explicit language value (only language=None is checked)
  • No test for whitespace-only segments being filtered out
  • No unit tests for SpeechToTextAgent or SpeechToTextAgentWhisper themselves

Minor / Nits

14. cls variable shadows convention in get_instance (Low)

cls = getattr(mod, class_name) uses cls as a local variable name in a @staticmethod. The OCR agent uses loaded_class instead, which is clearer.

15. Lazy import in STT_AGENT property (Low)

The STT_AGENT property does from unstructured.partition.utils.constants import STT_AGENT_WHISPER inside the method body. config.py already imports from constants at module level (line 16: from unstructured.partition.utils.constants import OCR_AGENT_TESSERACT). No circular dependency exists — this should be a module-level import for consistency.

16. Version bump in feature PR (Low)

The PR bumps __version__ from 0.21.7 to 0.21.8. This typically causes merge conflicts when multiple feature PRs target the same release and is better handled in a dedicated release step.


What's Good

  • Clean integration with FileType / _PartitionerLoader dispatch
  • Correct decorator order (@apply_metadata outer, @add_chunking_strategy inner) matching all other partitioners
  • Agent interface properly mirrors the OCR agent pattern (ABC, dynamic import, whitelist)
  • Configurable model size, device, and cache size via env vars (addressing prior review feedback)
  • Per-segment elements with timestamps instead of a single monolithic blob (also addresses prior feedback)
  • _audio_suffix() correctly derives the temp file extension from input
  • shutil.copyfileobj() for streaming file copy (addresses prior feedback about file.read())
  • Weaviate float mapping correctly handles the new Optional[float] metadata fields
  • **kwargs pass-through is consistent with every other partitioner

@claytonlin1110
Contributor Author

@PastelStorm Updated.

@claytonlin1110
Contributor Author

@PastelStorm Please review as I have fixed everything.

@claytonlin1110 force-pushed the feat/audio-speech-to-text-partition branch from 7c81d7c to b6d3ac1 on February 27, 2026 at 19:35
@claytonlin1110
Contributor Author

@PastelStorm Would you please review this PR ASAP? Since the main branch is updated frequently, I need to keep resolving conflicts and test issues...

@PastelStorm
Contributor

@PastelStorm Would you please review this PR ASAP? Since the main branch is updated frequently, I need to keep resolving conflicts and test issues...

Here's the combined PR review message:


PR Review: Audio Speech-to-Text Partition

Overall this is a well-structured PR that follows existing patterns in the codebase — the STT agent abstraction mirrors the OCR agent, it uses the standard @apply_metadata / @add_chunking_strategy decorators, and integrates cleanly with auto-partitioning via FileType metadata. The core architecture is sound. Below are the issues worth addressing before merge:


Bugs

1. Stale docstring still claims WAV is non-partitionable

FileType.is_partitionable docstring at line 173 of unstructured/file_utils/model.py still lists WAV alongside ZIP, EMPTY, and UNK as examples of non-partitionable types. The class-level docstring (line 48) was updated but this one was missed.


Design Issues

2. partition_audio behavior and metadata are scoped to WAV, but the API/docs imply broader audio support

partition_audio() accepts arbitrary audio inputs (including extension inference via _audio_suffix for .mp3, .flac, etc.), and the docstring says "audio file (e.g. WAV)", but @apply_metadata(FileType.WAV) is hardcoded and only FileType.WAV is wired as partitionable in model.py.

Consequences:

  • Direct calls like partition_audio(filename="x.mp3") will stamp metadata.filetype as "audio/wav" regardless of actual format.
  • Auto-routing from partition() won't support common audio formats (.mp3, .flac, etc.) despite the STT stack being format-agnostic — an abstraction mismatch between partitioner capability and the file-type model.

Recommendation: either scope the public API and docs explicitly to WAV-only (remove the generic _audio_suffix inference for other formats), or add first-class FileType entries for supported audio formats with correct metadata propagation. The current middle ground is likely to surprise users.

3. Potential infinite recursion foot-gun in the transcribe / transcribe_segments contract

The base class SpeechToTextAgent defines transcribe() as abstract and transcribe_segments() as a concrete default that calls self.transcribe(). The Whisper subclass then does the inverse: transcribe() delegates to self.transcribe_segments(), which does the real work.

This works because Whisper overrides both methods, breaking the cycle. But it creates a subtle trap: any future agent that overrides only transcribe() and implements it by delegating to self.transcribe_segments() — a natural instinct given the Whisper example — will get infinite recursion at runtime. Consider either making transcribe_segments() abstract too, or adding a clear docstring warning that transcribe() must not delegate to transcribe_segments().
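A sketch of the safer contract: transcribe_segments() is the single abstract method, and transcribe() is derived from it in one direction only, so no subclass can accidentally create a delegation cycle. Names follow the PR's interface; the details are illustrative.

```python
from abc import ABC, abstractmethod

class SpeechToTextAgent(ABC):
    @abstractmethod
    def transcribe_segments(self, audio_path: str) -> list[dict]:
        """Return Whisper-style segment dicts; subclasses must implement this."""

    def transcribe(self, audio_path: str) -> str:
        # One-way delegation: transcribe() builds on transcribe_segments().
        segments = self.transcribe_segments(audio_path)
        return " ".join(seg["text"].strip() for seg in segments)

class FakeAgent(SpeechToTextAgent):
    def transcribe_segments(self, audio_path: str) -> list[dict]:
        return [{"text": " hello "}, {"text": "world"}]

text = FakeAgent().transcribe("example.wav")
```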

4. STT agent caching/locking creates process-wide serialization bottleneck

get_instance() caches one agent via lru_cache, and the Whisper agent serializes all transcribe() calls with a single instance lock. This avoids races but effectively limits the process to one transcription at a time for the default agent — a hidden throughput ceiling under concurrent workloads. Worth documenting prominently (and/or noting that process-based concurrency is the intended scaling path).

Relatedly, the lru_cache key is agent_module (the fully-qualified class name), not the model config. Changing WHISPER_MODEL_SIZE, WHISPER_DEVICE, or WHISPER_FP16 after the first call has no effect without a process restart. This matches the OCR agent pattern but should be documented.


Scope / Housekeeping

5. Unrelated fixes bundled into the audio PR

Three changes are orthogonal to the audio feature:

  • BMP alias MIME types (image/x-bmp, image/x-ms-bmp)
  • HEIC alias MIME type (image/x-heic)
  • filetype.py octet-stream fallback logic (falling back from libmagic to the filetype package)

These are reasonable changes on their own but inflate the diff and could introduce file-type detection regressions for non-audio formats. Consider splitting them into a separate PR.

6. Consolidation strategy for segment timestamps is DROP

The TODO comment explains this well — no MIN/MAX consolidation strategy exists yet, so chunking silently discards segment_start_seconds / segment_end_seconds. Worth noting as a known limitation in the CHANGELOG entry or docstring so users who care about audio timeline alignment aren't surprised.


Tests

7. Weak assertion in agent config test

test_get_agent_uses_env_config_when_no_module_given asserts that the argument "contains SpeechToTextAgent or Whisper", which can pass for incorrect values. Should assert the exact expected qname (or patch env_config.STT_AGENT to a sentinel value and assert exact call).

8. Duplicated test setup

The test file repeatedly recreates near-identical temp-file setup and patch scaffolding across many tests. Extracting shared fixtures for temp audio file creation and mocked agent wiring would reduce maintenance burden and make behavior-differentiating assertions easier to spot.

@claytonlin1110
Contributor Author

@PastelStorm Updated. Please review.

@claytonlin1110
Contributor Author

@PastelStorm Feel free to take a final review :) Thanks.

@PastelStorm
Contributor

@PastelStorm Feel free to take a final review :) Thanks.

There will be many reviews, given how many issues are being detected each time. After you fix the issues below please let Claude/ChatGPT review your changes to speed up the review process. Thank you.


Review: claytonlin1110:feat/audio-speech-to-text-partition


High

1. except Exception in email.py and msg.py is a behavioral regression — silent data loss

Both unstructured/partition/email.py and unstructured/partition/msg.py widen the attachment error handler from except UnsupportedFileFormatError to except Exception: return, with no logging:

except Exception:
    return

The original code only swallowed the specific case where no partitioner exists. The new code silently drops attachments on any failure — parser regressions, TypeError/AttributeError from coding bugs, MemoryError, permission errors, corrupt-data crashes, etc. This turns real bugs into invisible data loss.

Recommendation: Narrow to expected exceptions (UnsupportedFileFormatError, ImportError, possibly RuntimeError) and log at WARNING level with the attachment filename and exception class. At minimum, re-raise BaseException subclasses like KeyboardInterrupt.
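A sketch of the recommended narrowing, with UnsupportedFileFormatError as a local stand-in for the real exception: catch only the expected failures, log at WARNING, and let genuine bugs propagate.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("attachments")

class UnsupportedFileFormatError(Exception):
    pass  # stand-in for the real exception class

def partition_attachment(name: str, partitioner):
    try:
        return partitioner(name)
    except (UnsupportedFileFormatError, ImportError) as e:
        logger.warning("Skipping attachment %r (%s)", name, type(e).__name__)
        return None

def unsupported(name):
    raise UnsupportedFileFormatError(name)

skipped = partition_attachment("slides.xyz", unsupported)

try:
    partition_attachment("doc.txt", lambda name: 1 / 0)
    propagated = False
except ZeroDivisionError:
    propagated = True  # real bugs are no longer silently swallowed
```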


2. SpeechToTextAgent.get_instance() does not verify interface conformance

In speech_to_text_interface.py, get_instance() dynamically imports a class by fully-qualified name and instantiates it, but never checks issubclass(loaded_class, SpeechToTextAgent):

mod = importlib.import_module(module_name)
loaded_class = getattr(mod, class_name)
# ... no issubclass check ...
return loaded_class()

A misconfigured class (e.g. pointing to a random class that happens to exist) will instantiate successfully and fail later at transcribe_segments() with an unhelpful AttributeError.

Recommendation: Add an explicit issubclass check immediately after import, with a clear error message naming the expected base class.


Medium

3. Audio partitioning is tightly coupled to Whisper despite the pluggable STT abstraction

Every audio FileType in model.py hardcodes ["whisper"] as importable_package_dependencies:

FLAC = ("flac", "audio", ["whisper"], "audio", [".flac"], "audio/flac", ["audio/x-flac"])
M4A  = ("m4a",  "audio", ["whisper"], "audio", [".m4a"],  "audio/mp4",  ["audio/x-m4a"])
MP3  = ("mp3",  "audio", ["whisper"], "audio", [".mp3"],  "audio/mpeg", [...])

partition() in auto.py enforces importable_package_dependencies before loading the partitioner. So even if a user configures a non-Whisper STT_AGENT, partition() can fail early unless Whisper is installed — defeating the entire abstraction.

Recommendation: Either make audio FileTypes dependency-free at the FileType level and validate the chosen agent's dependencies at runtime inside the agent itself, or introduce a level of indirection (e.g. ["openai-whisper"] only when Whisper is the selected agent).


4. WHISPER_FP16 bypasses _get_bool, inconsistent with ENVConfig conventions

@property
def WHISPER_FP16(self) -> bool:
    env_val = self._get_string("WHISPER_FP16")
    if env_val:
        return env_val.lower() in ("true", "1", "t")
    try:
        import torch
        return bool(torch.cuda.is_available())
    except ImportError:
        return False

The string-to-bool parsing duplicates _get_bool. More subtly, _get_string returns "" for both "not set" and "explicitly set to empty", making them indistinguishable — WHISPER_FP16= (empty) silently falls through to torch auto-detection rather than being treated as an explicit value.

Recommendation: Use _get_bool for the parsing, with a sentinel or os.environ.get check to distinguish "not set" from "explicitly empty" before falling through to auto-detect.
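A self-contained sketch of the idea: use os.environ membership to distinguish "unset" from "explicitly empty", with one shared bool parser. The torch auto-detect branch is replaced by an auto_default parameter here to keep the sketch standalone.

```python
import os

def _parse_bool(raw: str) -> bool:
    return raw.strip().lower() in ("true", "1", "t", "yes")

def whisper_fp16(auto_default: bool) -> bool:
    if "WHISPER_FP16" in os.environ:      # any explicit value wins
        return _parse_bool(os.environ["WHISPER_FP16"])
    return auto_default                   # only then fall back to auto-detect

os.environ.pop("WHISPER_FP16", None)
unset_result = whisper_fp16(auto_default=True)           # auto-detect path
os.environ["WHISPER_FP16"] = ""
explicit_empty_result = whisper_fp16(auto_default=True)  # explicit -> False
```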


5. Duplicated empty-segment filtering — done in both agent and partitioner

Empty/whitespace segments are stripped and filtered in two places:

  • whisper_stt.py: if not text: continue
  • audio.py: text = seg["text"].strip(); if not text: continue

Every future SpeechToTextAgent implementor has to guess whether to filter empty segments or not. The partitioner will do it regardless, making the agent-side filtering dead code.

Recommendation: Pick one authoritative location (the partitioner, since it owns the contract), document it in the TranscriptionSegment / transcribe_segments interface docstring, and remove the duplicate from whisper_stt.py.


6. transcribe() convenience method is unused dead code

SpeechToTextAgent.transcribe() is defined in the interface but never called anywhere in the codebase — partition_audio always uses transcribe_segments(). The docstring also warns about infinite recursion if a subclass overrides transcribe to delegate back to transcribe_segments, which signals a fragile design. If this is intentional future API surface, it should be marked as such; otherwise it's dead code adding maintenance burden.


Low

7. No cache_clear() fixture for get_instance — inter-test state leakage

The OCR test suite has a _clear_cache fixture that calls OCRAgent.get_instance.cache_clear() before and after each test. The STT tests have no equivalent. Tests like test_get_instance_rejects_non_whitelisted_module call get_instance directly without clearing the lru_cache, risking cached state leaking between tests and masking bugs.

Recommendation: Add a _clear_cache fixture mirroring the OCR pattern, used by any test that touches get_instance directly.
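A sketch of why the fixture matters: without cache_clear(), a cached agent from one test leaks into the next. get_instance here stands in for SpeechToTextAgent.get_instance; a pytest fixture would wrap each test with get_instance.cache_clear().

```python
import functools

constructor_calls = []

@functools.lru_cache(maxsize=1)
def get_instance(module: str) -> str:
    constructor_calls.append(module)  # tracks real constructions
    return f"agent:{module}"

get_instance("whisper")
get_instance("whisper")                 # served from cache
calls_before_clear = len(constructor_calls)

get_instance.cache_clear()              # what the fixture does between tests
get_instance("whisper")
calls_after_clear = len(constructor_calls)
```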


8. Tests leak temp files / inconsistent temp file patterns

  • test_partition_audio_raises_with_both_filename_and_file creates a temp file with delete=False and manual cleanup, when the _tmp_audio context manager already exists for exactly this pattern.
  • Some fixture temp files use delete=False without guaranteed cleanup in finally.

Recommendation: Consistently use _tmp_audio or delete=True throughout, and ensure every delete=False has a corresponding finally: unlink.


9. WEBM alias includes video/webm — ambiguous file-type semantics

WEBM = ("webm", "audio", ["whisper"], "audio", [".webm"], "audio/webm", ["video/webm"])

This means a .webm file containing video (not audio) will be detected as FileType.WEBM and routed to partition_audio. Whisper may extract the audio track, but the file-type model is semantically misleading — WebM is a container for both audio and video.


10. _audio_mime_type called after temp file deletion — misleading control flow

In partition_audio, the temp file is deleted in the finally block, then _audio_mime_type(filename, file, metadata_filename) is called. This works only because _audio_mime_type reads from the original file argument (not the temp file), but the code structure suggests a dependency on the deleted file. Should be reordered or clarified.


11. filetype.py fallback adds complexity with no benefit in the failure case

The new fallback from libmagic to the filetype package when libmagic returns application/octet-stream is a good idea. But if ft.guess_mime() also returns None, the code falls back to returning magic_mime (which is "application/octet-stream") — identical to the original behavior, making the added code path a no-op in that case.


Summary Table

Priority # Issue
High 1 except Exception in email/msg — silent data loss, behavioral regression
High 2 get_instance doesn't verify issubclass(SpeechToTextAgent)
Medium 3 Audio FileTypes hardcode ["whisper"] dependency despite pluggable agent
Medium 4 WHISPER_FP16 bypasses _get_bool, inconsistent env-var handling
Medium 5 Duplicated segment filtering in both agent and partitioner
Medium 6 transcribe() is unused dead code
Low 7 No cache_clear() in tests — inter-test state leakage risk
Low 8 Inconsistent temp file handling / potential leaks in tests
Low 9 WEBM alias includes video/webm — ambiguous semantics
Low 10 Misleading control flow around temp file deletion vs. MIME detection
Low 11 filetype.py fallback is a no-op when both detectors fail

@claytonlin1110
Contributor Author

@PastelStorm Would you please review one more time today? Also, it would be great if you could let me know how I can have Claude review my changes after a push. Did you connect Claude with this GitHub?

@PastelStorm
Contributor

@PastelStorm Would you please review one more time today? Also, it would be great if you could let me know how I can have Claude review my changes after a push. Did you connect Claude with this GitHub?

I mean your own subscription or a free tier that is available to everyone.

@claytonlin1110
Contributor Author

@PastelStorm Would you please review one more time today? Also, it would be great if you could let me know how I can have Claude review my changes after a push. Did you connect Claude with this GitHub?

I mean your own subscription or a free tier that is available to everyone.

Will try. I updated, though, so feel free to review the last changes.

@PastelStorm
Contributor

@PastelStorm Would you please review one more time today? Also, it would be great if you could let me know how I can have Claude review my changes after a push. Did you connect Claude with this GitHub?

I mean your own subscription or a free tier that is available to everyone.

Will try. I updated, though, so feel free to review the last changes.

I now have a complete picture of the updated branch. Here's the re-review.


Re-Review: claytonlin1110:feat/audio-speech-to-text-partition (post-fixes)

The author addressed most of the previous findings substantively. Here's what was fixed, what remains, and what the fixes introduced.


Previously raised issues — status

| # | Issue | Status |
|---|-------|--------|
| 1 | `except Exception` in email/msg | Fixed — narrowed to explicit tuple, logs warning with filename |
| 2 | `get_instance` doesn't verify `issubclass` | Fixed — `isinstance(loaded_class, type)` + `issubclass` checks added |
| 3 | Audio FileTypes hardcode `["whisper"]` | Fixed — all audio types now use `cast(list[str], [])`, deps validated at runtime |
| 4 | `WHISPER_FP16` bypasses `_get_bool` | Fixed — now uses `os.environ` presence check + delegates to `_get_bool` |
| 5 | Duplicated segment filtering | Fixed — filtering removed from `whisper_stt.py`, contract documented on `TranscriptionSegment` |
| 6 | `transcribe()` dead code | Fixed — removed entirely |
| 7 | No `cache_clear()` fixture | Fixed — `_clear_stt_cache` fixture added, applied to `TestSpeechToTextAgentInterface` |
| 8 | Temp file leaks / inconsistency | Fixed — tests now use `_tmp_audio()` consistently |
| 9 | WEBM alias `video/webm` | Not addressed |
| 10 | Misleading control flow (MIME detection after temp deletion) | Fixed — MIME type + `last_modified` now computed before temp file lifecycle |
| 11 | `filetype.py` fallback no-op | Partially addressed — comment added, but logic unchanged |

New issues introduced by the fixes

1. (Medium) TypeError from issubclass check escapes unhandled

The new validation in get_instance (lines 61-71) raises TypeError when the loaded class isn't a type or isn't a SpeechToTextAgent subclass. But this TypeError is raised inside a try block whose except only catches (ImportError, AttributeError):

```python
try:
    mod = importlib.import_module(module_name)
    loaded_class = getattr(mod, class_name)
    if not isinstance(loaded_class, type):
        raise TypeError(...)          # <-- not caught below
    if not issubclass(loaded_class, SpeechToTextAgent):
        raise TypeError(...)          # <-- not caught below
except (ImportError, AttributeError) as e:
    # the TypeError propagates uncaught past this handler
    raise RuntimeError(...) from e
```

The TypeError will propagate raw to the caller with no wrapping in RuntimeError and no logging. This is inconsistent with the error UX for ImportError/AttributeError (which get a friendly message about installing the audio extra). It also means the TypeError message will appear to the caller without the surrounding context.

Fix: Either add TypeError to the except clause (with a different message that doesn't suggest installing the audio extra), or move the issubclass checks after the try/except block so they produce a clean, distinct error path.
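
One way to realize the second suggestion, moving the subclass checks out of the try block, is sketched below. The class name, error messages, and function signature are illustrative stand-ins, not the PR's actual code:

```python
import importlib


class SpeechToTextAgent:
    """Stand-in for the real base class (illustrative only)."""


def get_instance(module_name: str, class_name: str):
    # Import/lookup failures keep the friendly "install the audio extra" message.
    try:
        mod = importlib.import_module(module_name)
        loaded_class = getattr(mod, class_name)
    except (ImportError, AttributeError) as e:
        raise RuntimeError(
            f'Could not load {module_name}.{class_name}; is the "audio" extra installed?'
        ) from e
    # Validation failures get their own error path outside the try block,
    # so the TypeError is never mislabeled as a missing dependency.
    if not isinstance(loaded_class, type) or not issubclass(loaded_class, SpeechToTextAgent):
        raise TypeError(f"{module_name}.{class_name} is not a SpeechToTextAgent subclass")
    return loaded_class()
```

With this shape, a missing module still surfaces as the wrapped `RuntimeError`, while a non-class attribute or a non-subclass raises a clean `TypeError` with its own message.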


2. (Medium) EXPECTED_ATTACHMENT_ERRORS includes RuntimeError — still too broad

The new explicit tuple in both email.py and msg.py is a big improvement, but RuntimeError is still a very broad exception type:

```python
EXPECTED_ATTACHMENT_ERRORS: Final[tuple[type[BaseException], ...]] = (
    UnsupportedFileFormatError,
    ImportError,
    FileNotFoundError,
    RuntimeError,   # <-- catches any RuntimeError from any code path
)
```

RuntimeError is commonly raised by unrelated failures — e.g. torch OOM, internal assertion failures in PDF parsing, broken pipe, etc. Including it means a corrupt PDF attachment that triggers a RuntimeError deep in pdfminer will be silently skipped with only a warning log. This is better than the previous except Exception, but the original concern about masking real bugs partially remains.

Recommendation: Consider introducing a specific exception (e.g. PartitioningDependencyError or similar) that the STT agent and other partition code raises for "missing tool/dependency at runtime" scenarios (like ffmpeg not found). Then replace RuntimeError with that specific type. If that's too much scope for this PR, at least add a comment documenting which RuntimeErrors this is intended to catch.
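
A rough sketch of that recommendation, assuming a hypothetical `PartitioningDependencyError` (the name is invented here, not in the PR, and the tuple is simplified to stay self-contained):

```python
class PartitioningDependencyError(RuntimeError):
    """Hypothetical: raised when a runtime tool needed for partitioning is
    missing (e.g. ffmpeg not on PATH). Not part of the current PR."""


# Narrowed tuple: unrelated RuntimeErrors (torch OOM, parser bugs) now propagate.
EXPECTED_ATTACHMENT_ERRORS: tuple[type[BaseException], ...] = (
    ImportError,
    FileNotFoundError,
    PartitioningDependencyError,
)


def partition_attachment(run):
    """Run one attachment's partitioner, skipping only expected failures."""
    try:
        return run()
    except EXPECTED_ATTACHMENT_ERRORS:
        return None  # the real code would log a warning here
```

Because the new class subclasses `RuntimeError`, existing callers that catch `RuntimeError` keep working, while the attachment loop stops swallowing unrelated ones.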


3. (Low) EXPECTED_ATTACHMENT_ERRORS is duplicated between email.py and msg.py

The exact same tuple is defined in both files:

```python
# email.py
EXPECTED_ATTACHMENT_ERRORS: Final[tuple[type[BaseException], ...]] = (
    UnsupportedFileFormatError, ImportError, FileNotFoundError, RuntimeError,
)

# msg.py
EXPECTED_ATTACHMENT_ERRORS: tuple[type[BaseException], ...] = (
    UnsupportedFileFormatError, ImportError, FileNotFoundError, RuntimeError,
)
```

Note also the minor inconsistency: email.py uses Final[...] while msg.py does not. This is a maintenance hazard — if the set of expected errors changes, both files must be updated in lockstep. This constant should live in unstructured.partition.common (or similar shared module) and be imported by both.
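
A minimal sketch of the shared-module approach; the module path is a suggestion, and `UnsupportedFileFormatError` is left out here only so the sketch stands alone:

```python
# Hypothetical shared home, e.g. unstructured/partition/common/errors.py
from typing import Final

# The real module would also include UnsupportedFileFormatError.
EXPECTED_ATTACHMENT_ERRORS: Final[tuple[type[BaseException], ...]] = (
    ImportError,
    FileNotFoundError,
    RuntimeError,
)

# email.py and msg.py would then both import the one definition:
# from unstructured.partition.common.errors import EXPECTED_ATTACHMENT_ERRORS
```

This also settles the `Final` inconsistency, since the annotation lives in exactly one place.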


4. (Low) No test coverage for the new issubclass / isinstance validation

The issubclass check is new and addresses a previous review finding, but TestSpeechToTextAgentInterface has no test for it. There's a test for the whitelist rejection (test_get_instance_rejects_non_whitelisted_module) but none for:

  • Pointing to a valid, whitelisted module attribute that isn't a class (e.g. a function or constant)
  • Pointing to a class that exists but doesn't subclass SpeechToTextAgent

These are the exact scenarios the new code defends against, and they should have tests.
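
The two missing cases could look roughly like this. The `get_instance` stub below only mimics the validation logic so the tests are self-contained; the test names are suggestions, not the PR's:

```python
import importlib


class SpeechToTextAgent:
    """Stand-in base class so the tests below run standalone."""


def get_instance(module_name, class_name):
    # Minimal stub of the loader's validation logic under test.
    loaded = getattr(importlib.import_module(module_name), class_name)
    if not isinstance(loaded, type) or not issubclass(loaded, SpeechToTextAgent):
        raise TypeError(f"{module_name}.{class_name} is not a SpeechToTextAgent subclass")
    return loaded()


def test_get_instance_rejects_non_class_attribute():
    # a module attribute that exists but is not a class at all
    try:
        get_instance("math", "pi")
    except TypeError:
        return
    raise AssertionError("expected TypeError for a non-class attribute")


def test_get_instance_rejects_class_that_is_not_a_subclass():
    # a real class that does not subclass SpeechToTextAgent
    try:
        get_instance("builtins", "dict")
    except TypeError:
        return
    raise AssertionError("expected TypeError for a non-subclass")
```

In the real suite these would go next to `test_get_instance_rejects_non_whitelisted_module`, pointing at whatever whitelisted modules the test fixtures already use.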


5. (Low) _clear_stt_cache patches get_instance on the real SpeechToTextAgent but two tests in the same class mock get_instance away entirely

TestSpeechToTextAgentInterface uses @pytest.mark.usefixtures("_clear_stt_cache"), which calls SpeechToTextAgent.get_instance.cache_clear() before and after each test. But test_get_agent_uses_env_config_when_no_module_given and test_get_agent_passes_explicit_module_to_get_instance both patch.object(SpeechToTextAgent, "get_instance"), replacing get_instance entirely — so the fixture's cache_clear() in teardown operates on the real (unpatched) function, not the mock. This isn't wrong (the mock is scoped with with and restored before teardown), but it means the cache-clear fixture is effectively a no-op for those two tests; only test_get_instance_rejects_non_whitelisted_module actually benefits from it. This is fine but worth knowing.


Remaining from prior review (unchanged)

6. (Low) WEBM alias includes video/webm

```python
WEBM = ("webm", "audio", ..., "audio/webm", ["video/webm"])
```

A .webm file containing video will be detected as FileType.WEBM and routed to partition_audio. This is semantically ambiguous since WebM is a container for both audio and video content.


Summary

The fixes are well-executed and address the most serious concerns from the first review. The branch is in much better shape. The remaining issues are:

| Priority | # | Issue |
|----------|---|-------|
| Medium | 1 | `TypeError` from new `issubclass` check propagates unhandled (inconsistent error UX) |
| Medium | 2 | `RuntimeError` in `EXPECTED_ATTACHMENT_ERRORS` is still broad |
| Low | 3 | `EXPECTED_ATTACHMENT_ERRORS` duplicated between email.py and msg.py (+ `Final` inconsistency) |
| Low | 4 | No test for the new `issubclass`/`isinstance` validation paths |
| Low | 5 | `_clear_stt_cache` is effectively a no-op for 2 of 3 tests in the class |
| Low | 6 | WEBM alias includes `video/webm` — ambiguous semantics (unchanged) |

@claytonlin1110
Copy link
Contributor Author

claytonlin1110 commented Mar 3, 2026

@PastelStorm I just updated.
Btw, the Low-priority #6 fix is already done.



Development

Successfully merging this pull request may close these issues.

feat/Add speech to text to multimodal agent
