UN-2579 [FIX] Fixed MIME type validation to auto-detect file types instead of checking content-type header #1391

muhammad-ali-e · 2025-07-02T07:01:35Z

What

Fixed MIME type validation to auto-detect file types using Python Magic instead of relying solely on the Content-Type header sent by clients. When clients send
application/octet-stream as the Content-Type, the system now reads the actual file content to detect the correct MIME type.
Refactored code to reduce Cognitive Complexity.

Why

Recent MIME type validation changes caused existing client integrations to fail when they explicitly set Content-Type as application/octet-stream. This resulted in files being
skipped with error messages like "Skipping file sample.pdf due to Unsupported MIME type: Unsupported MIME type 'application/octet-stream'".

How

Added _detect_mime_type() method that uses Python Magic to detect MIME type from file content
Modified file processing to detect MIME type from the first chunk of file data instead of trusting the Content-Type header
Maintained backward compatibility by still logging the original Content-Type header for debugging
Ensured proper error handling and logging throughout the process
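
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `detect_mime_type`, `validate_first_chunk`, `ALLOWED_MIME_TYPES`, and the local `UnsupportedMimeTypeError` class are hypothetical stand-ins, and the `sniff` parameter exists only so a stub detector can be swapped in where python-magic (and libmagic) is not installed. The one real library call is `magic.from_buffer(chunk, mime=True)`, which appears in the diff under review.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# Hypothetical allow-list; in the PR this role is played by AllowedFileTypes.
ALLOWED_MIME_TYPES = {"application/pdf", "text/plain"}


class UnsupportedMimeTypeError(Exception):
    """Local stand-in for the exception of the same name in the PR."""


def _sniff_with_magic(chunk: bytes) -> str:
    # Imported lazily so this module loads even without python-magic installed.
    import magic

    return magic.from_buffer(chunk, mime=True)


def detect_mime_type(
    chunk: bytes,
    file_name: str,
    header_content_type: Optional[str] = None,
    sniff: Callable[[bytes], str] = _sniff_with_magic,
) -> str:
    """Detect the MIME type from content; the header is logged, not trusted."""
    mime_type = sniff(chunk)
    logger.info(
        "File %r: detected MIME type %r (Content-Type header was %r)",
        file_name,
        mime_type,
        header_content_type,
    )
    return mime_type


def validate_first_chunk(
    chunk: bytes,
    file_name: str,
    header_content_type: Optional[str] = None,
    sniff: Callable[[bytes], str] = _sniff_with_magic,
) -> str:
    """Return the detected MIME type, or raise if it is not on the allow-list."""
    mime_type = detect_mime_type(chunk, file_name, header_content_type, sniff)
    if mime_type not in ALLOWED_MIME_TYPES:
        raise UnsupportedMimeTypeError(
            f"Unsupported MIME type '{mime_type}' for file '{file_name}'"
        )
    return mime_type
```

With this shape, a PDF uploaded with an application/octet-stream header passes validation because only the sniffed type is checked against the allow-list.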

Can this PR break any existing features. If yes, please list possible items. If no, please explain why.

No, this PR fixes a regression and restores backward compatibility. It only changes how MIME types are detected (from the header to content analysis), which is more reliable and allows
previously working client integrations to function again.

Database Migrations

None required.

Env Config

None required.

Relevant Docs

None required.

Related Issues or PRs

UN-2579 - Bug handling MIME type for API deployment in the backend with backward compatibility

Dependencies Versions

None changed.

Notes on Testing

Test with files uploaded using application/octet-stream Content-Type
Verify that actual file types (PDF, TXT, etc.) are correctly detected
Confirm that unsupported file types are still properly rejected
Test backward compatibility with existing client integrations

Screenshots

None applicable.

ritwik-g · 2025-07-02T09:01:59Z

backend/workflow_manager/endpoint_v2/source.py

+        try:
+            for chunk in file.chunks(chunk_size=cls.READ_CHUNK_SIZE):
+                if first_iteration:
+                    mime_type = cls._detect_mime_type(chunk, file.name)


@muhammad-ali-e do we check the MIME type for every file, or only when file.content_type is application/octet-stream or missing? I ask because I recall @jagadeeswaran-zipstack working with the magic library in the past: passing only a few bytes was often not enough, and the entire file content had to be passed to detect some MIME types properly. So I'm a bit worried that relying on the first chunk for MIME type detection might not work in all cases.
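
The alternative raised in this question (sniff the content only when the header is generic or missing) could look roughly like the sketch below. It is not what the PR description says the code does, since the PR detects from content for every file. `resolve_mime_type`, `GENERIC_CONTENT_TYPES`, and the `sniff` callable are hypothetical names introduced for illustration; in practice `sniff` might wrap `magic.from_buffer(chunk, mime=True)`.

```python
from typing import Callable, Optional

# Header values that carry no real type information (assumed set).
GENERIC_CONTENT_TYPES = {"", "application/octet-stream"}


def resolve_mime_type(
    header_content_type: Optional[str],
    first_chunk: bytes,
    sniff: Callable[[bytes], str],
) -> str:
    """Trust a specific Content-Type header; sniff content only otherwise."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    declared = (header_content_type or "").split(";", 1)[0].strip().lower()
    if declared not in GENERIC_CONTENT_TYPES:
        return declared
    # Header is missing or generic, so fall back to content-based detection.
    return sniff(first_chunk)
```

The trade-off is that a specific but wrong header is accepted as-is, which is the kind of trust in the client that content-based detection avoids.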

ritwik-g · 2025-07-02T09:03:44Z

backend/workflow_manager/endpoint_v2/source.py

+            str: Detected MIME type (may be unsupported)
+        """
+        # Primary MIME type detection using Python Magic
+        mime_type = magic.from_buffer(chunk, mime=True)


@muhammad-ali-e if the detected MIME type is application/octet-stream, can we add a secondary check using Magika? You can connect with @johnyrahul to see how he handled it. Why? In some cases magic does not detect a PDF file's MIME type properly and reports it as binary; Magika can be used as a secondary check in those cases alone.

@ritwik-g is it necessary to have both approaches? Maybe we could use Magika alone. I remember @shuveb suggested this once too.
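
A rough sketch of the two-stage check discussed here, assuming both detectors are passed in as plain callables that map bytes to a MIME type string. `detect_with_fallback`, `primary`, and `secondary` are hypothetical names; `primary` might wrap `magic.from_buffer(data, mime=True)`, and `secondary` might wrap a Magika call, which is deliberately not spelled out here.

```python
from typing import Callable

OCTET_STREAM = "application/octet-stream"


def detect_with_fallback(
    data: bytes,
    primary: Callable[[bytes], str],
    secondary: Callable[[bytes], str],
) -> str:
    """Use the secondary detector only when the primary reports generic binary."""
    mime_type = primary(data)
    if mime_type != OCTET_STREAM:
        return mime_type
    # The primary detector could not identify the content; ask the second one.
    return secondary(data)
```

Using a single detector, as suggested in the reply, removes the branch entirely; the fallback form only pays for the second detector on files the first one cannot identify.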

ritwik-g · 2025-07-02T09:06:19Z

backend/workflow_manager/endpoint_v2/source.py

-                file_history = FileHistoryHelper.get_file_history(
-                    workflow=workflow, cache_key=file_hash
+            # Handle unsupported files
+            if not success:


It might make sense to rename it to something like process_chunks_success or mime_type_detected, which would be more meaningful.

ritwik-g · 2025-07-02T09:09:01Z

backend/workflow_manager/endpoint_v2/source.py

+                if first_iteration:
+                    mime_type = cls._detect_mime_type(chunk, file.name)
+                    if not AllowedFileTypes.is_allowed(mime_type):
+                        raise UnsupportedMimeTypeError(


NIT: @muhammad-ali-e for unsupported files, I wonder whether we even need to raise this error or could simply return "", mime_type, False. Either should be fine; I'm just wondering which is better. Since we are only handling this specific error, the try/except seems unnecessary and we could perhaps return from here. Not sure which is the better practice.
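
The return-based variant suggested here could be sketched as below. This is an illustration only: `hash_and_validate`, `ALLOWED_MIME_TYPES`, and the `sniff` callable are hypothetical, SHA-256 is an assumed hash choice, and the (file_hash, mime_type, success) tuple shape is inferred from the `return "", mime_type, False` in the comment.

```python
import hashlib
from typing import Callable, Iterable

# Hypothetical allow-list; in the PR this role is played by AllowedFileTypes.
ALLOWED_MIME_TYPES = {"application/pdf", "text/plain"}


def hash_and_validate(
    chunks: Iterable[bytes],
    sniff: Callable[[bytes], str],
) -> tuple[str, str, bool]:
    """Return (file_hash, mime_type, success) without raising for bad types."""
    hasher = hashlib.sha256()  # assumed algorithm, not confirmed by the PR
    mime_type = ""
    saw_chunk = False
    for chunk in chunks:
        if not saw_chunk:
            saw_chunk = True
            # Detect the type once, from the first chunk only.
            mime_type = sniff(chunk)
            if mime_type not in ALLOWED_MIME_TYPES:
                # Early return takes the place of raise plus try/except.
                return "", mime_type, False
        hasher.update(chunk)
    if not saw_chunk:
        # An empty upload has nothing to detect; treated as failure here.
        return "", "", False
    return hasher.hexdigest(), mime_type, True
```

Returning a flag keeps the control flow in one place when only this single failure mode is handled; raising becomes more attractive once callers several frames up need to react to the failure.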

ritwik-g

Overall, the refactoring is a good approach. I think the code will be easier to maintain.

chandrasekharan-zipstack · 2025-07-02T09:49:13Z

backend/workflow_manager/endpoint_v2/source.py

@@ -885,6 +904,117 @@ def load_file(self, input_file_path: str) -> tuple[str, BytesIO]:

        return os.path.basename(input_file_path), file_stream

+    @classmethod
+    def _process_file_chunks(


NIT: @muhammad-ali-e please provide a more suitable name; _process_file_chunks is a bit vague. Also, we aren't accepting file chunks here; rather, we chunk the file ourselves.

chandrasekharan-zipstack · 2025-07-02T09:50:43Z

backend/workflow_manager/endpoint_v2/source.py

+        if file_hash in unique_file_hashes:
+            log_message = f"Skipping file '{file_name}' — duplicate detected within the current request. Already staged for processing."
+            workflow_log.log_info(logger=logger, message=log_message)
+            return True


@muhammad-ali-e didn't we recently discuss not skipping duplicates based on file hash / content, and allowing them to be processed instead?

chandrasekharan-zipstack · 2025-07-02T09:54:33Z

backend/workflow_manager/endpoint_v2/source.py

+        return False
+
+    @classmethod
+    def _get_execution_status(


NIT: Rename to _is_execution_completed() instead

github-actions · 2025-07-02T12:54:16Z

| filepath | function | passed | SUBTOTAL |
| --- | --- | --- | --- |
| runner/src/unstract/runner/clients/test_docker.py | test_logs | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup_skip | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_client_init | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_exists | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config_without_mount | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_run_container | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_for_sidecar | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_sidecar_container | 1 | 1 |
| TOTAL | | 11 | 11 |

sonarqubecloud · 2025-07-02T12:55:09Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

muhammad-ali-e and others added 4 commits July 2, 2025 12:28

UN-2579 [FIX] Fixed MIME type validation to auto-detect file types in…

571841f

…stead of checking content-type header 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

2d0c5c0

for more information, see https://pre-commit.ci

reduced complexity of method add_input_file_to_api_storage

fe0e963

resolve conflicts

c1b10a2

muhammad-ali-e requested review from chandrasekharan-zipstack, jaseemjaskp, vishnuszipstack, ritwik-g, gaya3-zipstack and johnyrahul and removed request for chandrasekharan-zipstack and jaseemjaskp July 2, 2025 07:21

ritwik-g reviewed Jul 2, 2025

Merge branch 'main' into UN-2579-bug-handling-mime-type-for-api-deplo…

03a5cb6

…yment-in-the-backend-with-backward-compatible

chandrasekharan-zipstack reviewed Jul 2, 2025

muhammad-ali-e added 2 commits July 2, 2025 18:23

addressing PR reviews

dd9192f

Merge branch 'UN-2579-bug-handling-mime-type-for-api-deployment-in-th…

99e4560

…e-backend-with-backward-compatible' of github.com:Zipstack/unstract into UN-2579-bug-handling-mime-type-for-api-deployment-in-the-backend-with-backward-compatible
