Security: XXE vulnerability in DOCX pre-processor (ET.fromstring on untrusted input) #1565

@m45537-blip

Security Vulnerability Report

Type: XML External Entity (XXE) Injection
Severity: High
File: packages/markitdown/src/markitdown/converter_utils/docx/pre_process.py, line 45
Commit tested: 4a5340f

Description

The function _convert_omath_to_latex() in pre_process.py uses ET.fromstring() to parse XML content extracted from user-supplied DOCX files:

math_root = ET.fromstring(MATH_ROOT_TEMPLATE.format(str(tag)))

The tag variable originates from BeautifulSoup parsing of the DOCX document.xml, which is user-supplied content. A DOCX file is a ZIP archive containing XML — an attacker can craft a DOCX with malicious Office Math Markup (OMML) tags containing XML external entity declarations.
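To make the "DOCX is a ZIP archive containing XML" point concrete, here is a minimal sketch using only the standard library. It is not markitdown's code: read_document_xml and make_docx_like are hypothetical helpers, and the archive built here contains only the word/document.xml part rather than a complete, valid DOCX.

```python
import io
import zipfile


def make_docx_like(document_xml: bytes) -> bytes:
    """Build a ZIP archive holding only word/document.xml (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def read_document_xml(docx_bytes: bytes) -> bytes:
    """Return the raw bytes of word/document.xml from a DOCX-style archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml")


# Whoever creates the archive controls every byte of the XML part.
payload = b"<w:document>attacker-chosen content</w:document>"
recovered = read_document_xml(make_docx_like(payload))
print(recovered == payload)
```

Nothing in the container format constrains what the XML part holds, which is why anything derived from it should be treated as untrusted input.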

Python's xml.etree.ElementTree is documented as "not secure against maliciously constructed data". While CPython's expat-based parser has limited XXE surface compared to lxml, it is still vulnerable to:

  • Billion Laughs (exponential entity expansion) causing denial of service, particularly when Python is linked against an Expat release older than 2.4.1
  • External entity resolution depending on parser configuration
  • DTD processing attacks
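The entity-expansion mechanism behind the first bullet can be illustrated with a deliberately tiny payload. This sketch is not taken from the report: build_nested_entity_doc is a hypothetical helper, and the depth and fan-out are kept small so the example is harmless. It only shows that xml.etree.ElementTree expands entities declared in an internal DTD subset, and that nesting makes the output grow multiplicatively.

```python
from xml.etree import ElementTree as ET


def build_nested_entity_doc(levels: int, fanout: int) -> str:
    """Return XML whose internal DTD nests entities `levels` deep (hypothetical helper)."""
    entities = ['<!ENTITY e0 "x">']
    for i in range(1, levels + 1):
        # Each entity references the previous one `fanout` times.
        refs = f"&e{i - 1};" * fanout
        entities.append(f'<!ENTITY e{i} "{refs}">')
    dtd = "".join(entities)
    return f"<!DOCTYPE r [{dtd}]><r>&e{levels};</r>"


doc = build_nested_entity_doc(levels=3, fanout=4)
expanded = ET.fromstring(doc).text

# The expanded text holds fanout ** levels characters, here 4 ** 3.
print(len(expanded))
```

A real Billion Laughs payload uses the same shape with a much larger depth and fan-out. Whether it succeeds depends on the Expat version in use, so this should be read as an illustration of the mechanism rather than a working exploit.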

Secondary Finding (Medium)

The exiftool_path parameter reaches subprocess.run() in packages/markitdown/src/markitdown/converters/_exiftool.py (lines 22 and 41) without validation. Although the call uses list-style invocation (not shell=True), which rules out shell injection, the path is not checked for path traversal or symlink attacks.

Recommended Fix

For the XXE:

# Replace:
from xml.etree import ElementTree as ET

# With:
import defusedxml.ElementTree as ET

Alternatively, call defusedxml.defuse_stdlib() at module initialization, which monkey-patches the standard-library XML modules.
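Where adding a dependency is not an option, a similar guard can be approximated with the standard library alone. The following is a rough sketch of a different technique from the defusedxml fix above, not a replacement for it: it runs a first pass with xml.parsers.expat that rejects any DOCTYPE declaration, then hands the text to ET.fromstring. The names DTDForbidden and fromstring_no_dtd are hypothetical.

```python
import xml.parsers.expat
from xml.etree import ElementTree as ET


class DTDForbidden(ValueError):
    """Raised when the input carries a DOCTYPE declaration (hypothetical name)."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # Entity declarations live inside a DOCTYPE, so refusing it blocks them too.
    raise DTDForbidden(f"DOCTYPE declaration not allowed: {name!r}")


def fromstring_no_dtd(text: str) -> ET.Element:
    """Parse `text` with ElementTree after checking that it has no DOCTYPE."""
    checker = xml.parsers.expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    checker.Parse(text, True)  # the handler's exception propagates out of Parse
    return ET.fromstring(text)


print(fromstring_no_dtd("<root><a>1</a></root>").find("a").text)

try:
    fromstring_no_dtd('<!DOCTYPE r [<!ENTITY a "x">]><r>&a;</r>')
except DTDForbidden as exc:
    print("rejected:", exc)
```

This parses the input twice and covers only the DOCTYPE case, so defusedxml remains the more thorough choice where it can be installed.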

For the subprocess issue:

  • Validate exiftool_path against an allowlist or verify it resolves to a known binary using shutil.which()
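The validation suggested above might look like the following sketch. It is an assumption-laden illustration, not markitdown's implementation: resolve_exiftool and ALLOWED_NAMES are hypothetical, and the allowlist assumes the real binary keeps the name exiftool after symlinks are resolved, which may not hold for every installation.

```python
import os
import shutil
from typing import Optional

# Hypothetical allowlist of acceptable binary names after symlink resolution.
ALLOWED_NAMES = {"exiftool", "exiftool.exe"}


def resolve_exiftool(candidate: Optional[str] = None) -> str:
    """Return a validated absolute path to exiftool, or raise."""
    # shutil.which only returns paths that exist and are executable.
    found = shutil.which(candidate or "exiftool")
    if found is None:
        raise FileNotFoundError(f"no executable found for {candidate!r}")

    # Follow symlinks so the check applies to the file that will actually run.
    real = os.path.realpath(found)
    if os.path.basename(real).lower() not in ALLOWED_NAMES:
        raise ValueError(f"refusing to run unexpected binary: {real}")
    return real
```

The returned path, rather than the caller-supplied string, would then be passed as the first element of the argument list given to subprocess.run().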

Impact

markitdown is widely used for converting documents to Markdown. Any application processing untrusted DOCX files is potentially vulnerable, including:

  • Web services accepting document uploads
  • CI/CD pipelines processing documentation
  • AI/LLM pipelines using markitdown for document ingestion
  • The markitdown MCP server (markitdown-mcp)

Disclosure Process

We attempted to report this through [email protected] (bounced — no longer accepted) and the MSRC Researcher Portal. This issue was discovered during an automated scan using the Colosseum deep code analysis platform — 51 gauntlets × 2 platforms × 7 rounds, with 98% cross-platform agreement.

Full scan report: https://battleharden.dev/reports/markitdown
