
@muhammad-ali-e muhammad-ali-e commented Jul 2, 2025

What

  • Fixed MIME type validation to auto-detect file types using Python Magic instead of relying solely on the Content-Type header sent by clients. When clients send application/octet-stream as Content-Type, the system now reads the actual file content to detect the correct MIME type.
  • Refactored code to reduce Cognitive Complexity

Why

Recent MIME type validation changes caused existing client integrations to fail when they explicitly set Content-Type as application/octet-stream. This resulted in files being
skipped with error messages like "Skipping file sample.pdf due to Unsupported MIME type: Unsupported MIME type 'application/octet-stream'".

How

  • Added _detect_mime_type() method that uses Python Magic to detect MIME type from file content
  • Modified file processing to detect MIME type from the first chunk of file data instead of trusting the Content-Type header
  • Maintained backward compatibility by still logging the original Content-Type header for debugging
  • Ensured proper error handling and logging throughout the process
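
As a rough illustration of the approach listed above, the sketch below sniffs the MIME type from the first chunk only and checks it against an allow-list. The names `sniff_mime_type`, `validate_upload`, `ALLOWED_MIME_TYPES`, and the signature table are assumptions made for this example, not the PR's code. The PR itself calls `magic.from_buffer(chunk, mime=True)` from python-magic, which could be passed in as `detector`.

```python
import logging
from collections.abc import Callable, Iterable
from typing import Optional

logger = logging.getLogger(__name__)

# Hypothetical allow-list; the project keeps its own in AllowedFileTypes.
ALLOWED_MIME_TYPES = {"application/pdf", "text/plain"}

# Tiny stand-in signature table so the sketch runs without libmagic installed.
_SIGNATURES = [(b"%PDF-", "application/pdf"), (b"PK\x03\x04", "application/zip")]


def sniff_mime_type(chunk: bytes) -> str:
    """Stand-in for a content-based detector such as magic.from_buffer."""
    for prefix, mime in _SIGNATURES:
        if chunk.startswith(prefix):
            return mime
    try:
        chunk.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def validate_upload(
    chunks: Iterable[bytes],
    header_content_type: Optional[str],
    detector: Callable[[bytes], str] = sniff_mime_type,
) -> tuple[str, bool]:
    """Return (detected MIME type, allowed?) using only the first chunk."""
    first_chunk = next(iter(chunks), b"")
    detected = detector(first_chunk)
    # The client's header is logged for debugging but not trusted.
    logger.info("Content-Type header=%r, detected=%r", header_content_type, detected)
    return detected, detected in ALLOWED_MIME_TYPES
```

With this shape, a PDF uploaded as `application/octet-stream` is accepted because the decision is based on the bytes, while a ZIP archive sent with the same header is still rejected.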

Can this PR break any existing features. If yes, please list possible items. If no, please explain why.

No. This PR fixes a regression and restores backward compatibility. It changes only how MIME types are detected (content analysis instead of the Content-Type header), which is more reliable and lets previously working client integrations function again.

Database Migrations

None required.

Env Config

None required.

Relevant Docs

None required.

Related Issues or PRs

UN-2579 - Bug handling MIME type for API deployment in the backend with backward compatibility

Dependencies Versions

None changed.

Notes on Testing

  • Test with files uploaded using application/octet-stream Content-Type
  • Verify that actual file types (PDF, TXT, etc.) are correctly detected
  • Confirm that unsupported file types are still properly rejected
  • Test backward compatibility with existing client integrations

Screenshots

None applicable.

try:
    for chunk in file.chunks(chunk_size=cls.READ_CHUNK_SIZE):
        if first_iteration:
            mime_type = cls._detect_mime_type(chunk, file.name)

@muhammad-ali-e do we check the MIME type for every file, or only if file.content_type is octet-stream or missing? The reason I ask: I recall @jagadeeswaran-zipstack working with the magic library in the past, and passing only a few bytes was often not enough; the entire file content had to be passed to detect some MIME types properly. So I'm a bit worried that relying on the MIME type detected from the first chunk might not work in all cases.
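
The question above contrasts two policies: sniff every upload, or sniff only when the header is missing or generic. A minimal sketch of the second policy follows. `resolve_mime_type`, `GENERIC_TYPES`, and the `detector` callable are hypothetical names for illustration; `detector` stands in for whatever content-based check is used.

```python
from collections.abc import Callable
from typing import Optional

# Header values that carry no real type information (assumed set).
GENERIC_TYPES = {"", "application/octet-stream"}


def resolve_mime_type(
    header_content_type: Optional[str],
    data: bytes,
    detector: Callable[[bytes], str],
) -> str:
    """Trust a specific header; fall back to content sniffing otherwise."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    declared = (header_content_type or "").split(";")[0].strip().lower()
    if declared not in GENERIC_TYPES:
        return declared
    return detector(data)
```

This keeps content sniffing away from uploads that already declare a specific type, at the cost of trusting a header the client controls. Whether `data` should be the first chunk or the whole file is exactly the open concern in the comment.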

        str: Detected MIME type (may be unsupported)
    """
    # Primary MIME type detection using Python Magic
    mime_type = magic.from_buffer(chunk, mime=True)

@muhammad-ali-e if the MIME type comes back as octet-stream, can we add a secondary check using Magika? You can connect with @johnyrahul to see how he handled it. Why? In some cases a PDF file's MIME type is not detected properly by magic and is reported as binary; Magika could be used as a secondary check in those cases alone.

@ritwik-g is it a necessity to have both approaches? Maybe we could just use magika alone. I remember @shuveb suggested this too once
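
The two options discussed in this thread (a secondary check only when the primary result is generic, or a single detector on its own) differ only in how detectors are chained. A minimal sketch, assuming each detector is a callable from bytes to a MIME string; `detect_with_fallback` and the stub detectors are hypothetical, and no specific python-magic or Magika call is implied.

```python
from collections.abc import Callable, Sequence

GENERIC = "application/octet-stream"


def detect_with_fallback(
    data: bytes, detectors: Sequence[Callable[[bytes], str]]
) -> str:
    """Try detectors in order; return the first non-generic answer."""
    for detect in detectors:
        mime_type = detect(data)
        if mime_type and mime_type != GENERIC:
            return mime_type
    return GENERIC


# Stub detectors for illustration only.
def primary(data: bytes) -> str:
    return GENERIC  # pretend the first library could not classify the file


def secondary(data: bytes) -> str:
    return "application/pdf" if b"%PDF-" in data[:1024] else GENERIC
```

Passing a one-element list corresponds to using a single library alone; passing two gives the "secondary check only when the first result is binary" behavior.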

file_history = FileHistoryHelper.get_file_history(
workflow=workflow, cache_key=file_hash
# Handle unsupported files
if not success:

It might make sense to rename it to something like process_chunks_success or mime_type_detected, which would be more meaningful.

if first_iteration:
    mime_type = cls._detect_mime_type(chunk, file.name)
    if not AllowedFileTypes.is_allowed(mime_type):
        raise UnsupportedMimeTypeError(

NIT: @muhammad-ali-e in the unsupported case, I wonder whether we even need to raise this error or can simply return "", mime_type, False. Either should be fine; I'm just wondering which is better. Since we are only handling this specific error, the try/except seems unnecessary and we could maybe return from here. Not sure which is the better practice.
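
The two styles being weighed here can be put side by side. This is a sketch under assumptions: `ALLOWED`, `check_by_exception`, and `check_by_return` are hypothetical, and only the exception name `UnsupportedMimeTypeError` and the `"", mime_type, False` return shape come from the discussion.

```python
ALLOWED = {"application/pdf", "text/plain"}  # assumed allow-list


class UnsupportedMimeTypeError(Exception):
    """Local stand-in for the project's exception of the same name."""


def check_by_exception(mime_type: str) -> tuple[str, str, bool]:
    """Raise internally, then translate the error into a result tuple."""
    try:
        if mime_type not in ALLOWED:
            raise UnsupportedMimeTypeError(f"Unsupported MIME type '{mime_type}'")
        # The first element's meaning is not shown in the excerpt; placeholder only.
        return "value-placeholder", mime_type, True
    except UnsupportedMimeTypeError:
        return "", mime_type, False


def check_by_return(mime_type: str) -> tuple[str, str, bool]:
    """Early return: same result without the try/except."""
    if mime_type not in ALLOWED:
        return "", mime_type, False
    return "value-placeholder", mime_type, True
```

Both functions return identical tuples. The exception form tends to pay off when the rejection can originate several calls deep; for a single check at one level, the early return is shorter.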

@ritwik-g ritwik-g left a comment

Overall, the refactoring is a good approach. I think the code will be easier to maintain.

…yment-in-the-backend-with-backward-compatible
@@ -885,6 +904,117 @@ def load_file(self, input_file_path: str) -> tuple[str, BytesIO]:

return os.path.basename(input_file_path), file_stream

@classmethod
def _process_file_chunks(

NIT: @muhammad-ali-e provide a more suitable name. _process_file_chunks is a bit vague. Also, we aren't accepting file chunks here; rather, we chunk the file ourselves.

Comment on lines +997 to +1000
if file_hash in unique_file_hashes:
log_message = f"Skipping file '{file_name}' — duplicate detected within the current request. Already staged for processing."
workflow_log.log_info(logger=logger, message=log_message)
return True

@muhammad-ali-e didn't we recently discuss not skipping duplicates based on file hash / content and allowing them to be processed?
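
For context, the behavior under question can be sketched as follows: hash each file's content while reading it in chunks, and skip any file whose hash was already seen in the same request. `hash_chunks` and `stage_files` are hypothetical helpers and SHA-256 is an assumed choice; the excerpt only shows the `unique_file_hashes` membership check.

```python
import hashlib
from collections.abc import Iterable


def hash_chunks(chunks: Iterable[bytes]) -> str:
    """Hash file content incrementally, one chunk at a time."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def stage_files(
    files: dict[str, list[bytes]], skip_duplicates: bool = True
) -> list[str]:
    """Return the names of files staged for processing."""
    unique_file_hashes: set[str] = set()
    staged: list[str] = []
    for file_name, chunks in files.items():
        file_hash = hash_chunks(chunks)
        if skip_duplicates and file_hash in unique_file_hashes:
            continue  # duplicate content within this request
        unique_file_hashes.add(file_hash)
        staged.append(file_name)
    return staged
```

With `skip_duplicates=False`, files with identical content are all staged, which is the alternative raised in the comment.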

return False

@classmethod
def _get_execution_status(

NIT: Rename to _is_execution_completed() instead

…e-backend-with-backward-compatible' of github.com:Zipstack/unstract into UN-2579-bug-handling-mime-type-for-api-deployment-in-the-backend-with-backward-compatible

github-actions bot commented Jul 2, 2025

| filepath | function | passed | SUBTOTAL |
| --- | --- | --- | --- |
| runner/src/unstract/runner/clients/test_docker.py | test_logs | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup_skip | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_client_init | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_exists | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config_without_mount | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_run_container | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_for_sidecar | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_sidecar_container | 1 | 1 |
| TOTAL | | 11 | 11 |

sonarqubecloud bot commented Jul 2, 2025

ritwik-g

This comment was marked as outdated.
