UN-2579 [FIX] Fixed MIME type validation to auto-detect file types instead of checking content-type header #1391
…stead of checking content-type header 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
for more information, see https://pre-commit.ci
try:
    for chunk in file.chunks(chunk_size=cls.READ_CHUNK_SIZE):
        if first_iteration:
            mime_type = cls._detect_mime_type(chunk, file.name)
@muhammad-ali-e do we check the MIME type for every file, or only if file.content_type is octet-stream or missing? The reason I ask: in the past, I recall @jagadeeswaran-zipstack working with the magic library, and passing only a few bytes was often not enough; the entire file content had to be passed to detect some MIME types properly. So I am a bit worried that relying on the MIME type detected from the first chunk might not work in all cases.
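One possible way to address the concern above, sketched under assumptions: buffer leading chunks until a minimum number of bytes is available, and only then run detection. `MIN_SNIFF_BYTES`, `collect_sniff_buffer`, `detect_from_chunks`, and the injected `detect` callable are hypothetical names, not part of this PR; in the real code `detect` would presumably wrap `magic.from_buffer(..., mime=True)`.

```python
from typing import Callable, Iterable

# Hypothetical threshold; the right value depends on the file types involved.
MIN_SNIFF_BYTES = 8192


def collect_sniff_buffer(
    chunks: Iterable[bytes], min_bytes: int = MIN_SNIFF_BYTES
) -> bytes:
    """Concatenate leading chunks until min_bytes are buffered or input ends."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= min_bytes:
            break
    return bytes(buffer)


def detect_from_chunks(
    chunks: Iterable[bytes],
    detect: Callable[[bytes], str],
    min_bytes: int = MIN_SNIFF_BYTES,
) -> str:
    """Run the injected detector on the buffered prefix instead of one chunk."""
    return detect(collect_sniff_buffer(chunks, min_bytes))
```

If the same chunk iterator also feeds the file hash, the caller would need to hash the buffered prefix first and then continue with the remaining chunks.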
        str: Detected MIME type (may be unsupported)
    """
    # Primary MIME type detection using Python Magic
    mime_type = magic.from_buffer(chunk, mime=True)
@muhammad-ali-e if the MIME type is octet-stream, can we have a secondary check using magica? You can connect with @johnyrahul to see how he handled it. Why? In some cases a PDF file's MIME type is not captured properly by magic and is reported as binary; magica can be used as a secondary check in those cases alone.
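A minimal sketch of the shape such a secondary check could take, assuming the primary detector is injected as a callable (for example a wrapper around `magic.from_buffer(data, mime=True)`). The secondary detector here is a plain `%PDF-` signature check, used only as a stand-in; it is not the library mentioned above, and `sniff_pdf_signature` and `detect_with_fallback` are hypothetical names.

```python
from typing import Callable

OCTET_STREAM = "application/octet-stream"


def sniff_pdf_signature(data: bytes) -> str:
    """Stand-in secondary detector: look for the PDF header marker.

    Assumption: the "%PDF-" marker appears within the first 1024 bytes,
    which tolerates a small amount of leading junk before the header.
    """
    return "application/pdf" if b"%PDF-" in data[:1024] else OCTET_STREAM


def detect_with_fallback(
    data: bytes,
    primary: Callable[[bytes], str],
    secondary: Callable[[bytes], str] = sniff_pdf_signature,
) -> str:
    """Consult the secondary detector only when the primary reports binary."""
    mime_type = primary(data)
    if mime_type == OCTET_STREAM:
        return secondary(data)
    return mime_type
```

Because the secondary detector runs only on the octet-stream result, files the primary detector already identifies are unaffected.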
file_history = FileHistoryHelper.get_file_history(
    workflow=workflow, cache_key=file_hash
# Handle unsupported files
if not success:
It might make sense to rename it to something like process_chunks_success or mime_type_detected, which would be more meaningful.
if first_iteration:
    mime_type = cls._detect_mime_type(chunk, file.name)
    if not AllowedFileTypes.is_allowed(mime_type):
        raise UnsupportedMimeTypeError(
NIT: @muhammad-ali-e in the unsupported case, I wonder whether we even need to raise this error, or whether we can simply return "", mime_type, False. Either should be fine; I am just wondering which is better. Since we are only handling this specific error, the try/except seems unnecessary and we could maybe return from here. Not sure which is the better practice.
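For comparison, a rough sketch of the return-based variant suggested above. This is not the PR's code: `hash_if_supported`, the `ALLOWED` set, the injected `detect` callable, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Iterable

# Hypothetical allow-list standing in for AllowedFileTypes.is_allowed().
ALLOWED = {"application/pdf", "text/plain"}


def hash_if_supported(
    chunks: Iterable[bytes], detect: Callable[[bytes], str]
) -> tuple[str, str, bool]:
    """Return (file_hash, mime_type, success) without raising."""
    hasher = hashlib.sha256()  # assumed hash algorithm
    mime_type = ""
    for index, chunk in enumerate(chunks):
        if index == 0:
            # Detect the MIME type once, from the first chunk only.
            mime_type = detect(chunk)
            if mime_type not in ALLOWED:
                # Early return replaces raise plus try/except.
                return "", mime_type, False
        hasher.update(chunk)
    if not mime_type:
        # No chunks were read, so nothing was detected; report failure.
        return "", mime_type, False
    return hasher.hexdigest(), mime_type, True
```

The trade-off is that the early return keeps the control flow local, while raising keeps the unsupported-type case visible to any caller that does not inspect the flag.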
Overall, the refactoring is a good approach. I think the code will be easier to maintain.
…yment-in-the-backend-with-backward-compatible
@@ -885,6 +904,117 @@ def load_file(self, input_file_path: str) -> tuple[str, BytesIO]:

        return os.path.basename(input_file_path), file_stream

    @classmethod
    def _process_file_chunks(
NIT: @muhammad-ali-e provide a more suitable name; _process_file_chunks is a bit vague. Also, we aren't accepting file chunks here; rather, we chunk the file ourselves.
if file_hash in unique_file_hashes:
    log_message = f"Skipping file '{file_name}' — duplicate detected within the current request. Already staged for processing."
    workflow_log.log_info(logger=logger, message=log_message)
    return True
@muhammad-ali-e didn't we recently discuss not skipping duplicates based on file hash / content, and allowing them to be processed?
        return False

    @classmethod
    def _get_execution_status(
NIT: Rename to _is_execution_completed() instead.
…e-backend-with-backward-compatible' of github.com:Zipstack/unstract into UN-2579-bug-handling-mime-type-for-api-deployment-in-the-backend-with-backward-compatible
What
When a request sets application/octet-stream as Content-Type, the system now reads the actual file content to detect the correct MIME type.
Why
Recent MIME type validation changes caused existing client integrations to fail when they explicitly set Content-Type as application/octet-stream. This resulted in files being skipped with error messages like "Skipping file sample.pdf due to Unsupported MIME type: Unsupported MIME type 'application/octet-stream'".
How
_detect_mime_type() method that uses Python Magic to detect the MIME type from file content
Can this PR break any existing features? If yes, please list possible items. If no, please explain why.
No, this PR fixes a regression and restores backward compatibility. It only changes how MIME types are detected (from header to content analysis), which is more reliable and allows previously working client integrations to function again.
Database Migrations
None required.
Env Config
None required.
Relevant Docs
None required.
Related Issues or PRs
UN-2579 - Bug handling MIME type for API deployment in the backend with backward compatibility
Dependencies Versions
None changed.
Notes on Testing
application/octet-stream Content-Type
Screenshots
None applicable.