Description
Security Vulnerability Report
Type: XML External Entity (XXE) Injection
Severity: High
File: packages/markitdown/src/markitdown/converter_utils/docx/pre_process.py, line 45
Commit tested: 4a5340f
Description
The function _convert_omath_to_latex() in pre_process.py uses ET.fromstring() to parse XML content extracted from user-supplied DOCX files:
math_root = ET.fromstring(MATH_ROOT_TEMPLATE.format(str(tag)))
The tag variable originates from BeautifulSoup parsing of the DOCX document.xml, which is user-supplied content. A DOCX file is a ZIP archive containing XML, so an attacker can craft a DOCX with malicious Office Math Markup Language (OMML) tags containing XML external entity declarations.
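To illustrate why document.xml counts as attacker-controlled input, here is a minimal sketch using only the standard library. The helper names build_fake_docx and read_document_xml are hypothetical and are not part of markitdown; the sketch only shows that whatever bytes the author of the archive places in word/document.xml are handed back verbatim to the reader.

```python
import io
import zipfile


def build_fake_docx(document_xml: str) -> bytes:
    """Hypothetical helper: pack arbitrary text into a DOCX-shaped ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # A real DOCX has more parts; only this member matters for the sketch.
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def read_document_xml(docx_bytes: bytes) -> str:
    """Hypothetical helper: return the raw word/document.xml text unchanged."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml").decode("utf-8")


if __name__ == "__main__":
    payload = build_fake_docx("<w:document>anything the sender wants</w:document>")
    print(read_document_xml(payload))
```

Nothing in the ZIP container constrains the XML inside it, so any validation has to happen at parse time.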
The Python documentation describes xml.etree.ElementTree as "not secure against maliciously constructed data". While CPython's expat-based parser has a smaller XXE surface than lxml, it is still vulnerable to:
- Billion Laughs (exponential entity expansion) causing denial of service
- External entity resolution depending on parser configuration
- DTD processing attacks
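As a harmless illustration of the DTD processing mentioned above, the following sketch shows that the standard-library parser accepts an internal DTD subset and expands an entity declared in it. The sample document and the function name expanded_text are made up for this example; the entity here is tiny and benign, and the sketch does not attempt to reproduce the code path in pre_process.py.

```python
from xml.etree import ElementTree as ET

# A benign document whose internal DTD subset declares a single entity.
SAMPLE = '<!DOCTYPE root [<!ENTITY greeting "hello">]><root>&greeting;</root>'


def expanded_text(xml_text: str) -> str:
    """Parse with the stdlib parser and return the root element's text."""
    root = ET.fromstring(xml_text)
    return root.text or ""


if __name__ == "__main__":
    # The entity reference is replaced by its declared value during parsing.
    print(expanded_text(SAMPLE))
```

Because declared entities are expanded by the parser itself, the size of the parsed result is not bounded by the size of the input, which is the mechanism entity-expansion attacks rely on.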
Secondary Finding (Medium)
The exiftool_path parameter reaches subprocess.run() in packages/markitdown/src/markitdown/converters/_exiftool.py (lines 22 and 41) without validation. Although the call uses list-style invocation (not shell=True), the path is not checked against path traversal or symlink attacks.
Recommended Fix
For the XXE:
# Replace:
from xml.etree import ElementTree as ET
# With:
import defusedxml.ElementTree as ET
Or call defusedxml.defuse_stdlib() at module initialization.
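Where adding the defusedxml dependency is not an option, a similar effect can be approximated with the standard library alone by refusing any input that carries a DOCTYPE declaration before it reaches ElementTree. The sketch below is one possible approach under that assumption, not the project's fix; safe_fromstring and DTDForbidden are hypothetical names.

```python
from xml.etree import ElementTree as ET
from xml.parsers import expat


class DTDForbidden(ValueError):
    """Raised when the input carries a DOCTYPE declaration."""


def _refuse_doctype(name, system_id, public_id, has_internal_subset):
    # expat calls this as soon as it sees "<!DOCTYPE", before any entity
    # declared in the subset can be expanded.
    raise DTDForbidden(f"DOCTYPE declaration not allowed: {name!r}")


def safe_fromstring(xml_text: str) -> ET.Element:
    """Hypothetical helper: reject any DOCTYPE, then parse with ElementTree."""
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _refuse_doctype
    # Raises DTDForbidden for a DOCTYPE, or expat.ExpatError if malformed.
    checker.Parse(xml_text, True)
    return ET.fromstring(xml_text)


if __name__ == "__main__":
    print(safe_fromstring("<r>ok</r>").text)
    try:
        safe_fromstring('<!DOCTYPE r [<!ENTITY a "b">]><r>&a;</r>')
    except DTDForbidden as exc:
        print("rejected:", exc)
```

This parses the input twice, which trades some speed for simplicity. Since OMML fragments have no legitimate need for a DTD, rejecting every DOCTYPE outright should not affect well-formed documents.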
For the subprocess issue:
- Validate exiftool_path against an allowlist, or verify that it resolves to a known binary using shutil.which()
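One way the validation above might look is sketched below: resolve the candidate with shutil.which(), follow symlinks with os.path.realpath(), and accept the result only if it lives under an allowlisted directory. The function name resolve_exiftool and the allowed_dirs parameter are assumptions made for this example and do not exist in markitdown.

```python
import os
import shutil
from typing import Optional, Sequence


def resolve_exiftool(candidate: Optional[str], allowed_dirs: Sequence[str]) -> str:
    """Hypothetical helper: return a vetted absolute path to the exiftool binary."""
    # shutil.which() returns None unless the target exists and is executable.
    located = shutil.which(candidate or "exiftool")
    if located is None:
        raise FileNotFoundError(f"no executable found for {candidate!r}")

    # Resolve symlinks and ".." segments so the check sees the real target.
    real_path = os.path.realpath(located)
    for directory in allowed_dirs:
        real_dir = os.path.realpath(directory)
        if os.path.commonpath([real_path, real_dir]) == real_dir:
            return real_path

    raise PermissionError(f"{real_path} is outside the allowed directories")


if __name__ == "__main__":
    try:
        print(resolve_exiftool(None, ["/usr/bin", "/usr/local/bin"]))
    except (FileNotFoundError, PermissionError) as exc:
        print("exiftool rejected:", exc)
```

Passing the returned path, rather than the original string, to subprocess.run() means the binary that was checked is the one that is executed, although a race between the check and the call remains possible if the allowlisted directory is writable by untrusted users.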
Impact
markitdown is widely used for converting documents to Markdown. Any application processing untrusted DOCX files is potentially vulnerable, including:
- Web services accepting document uploads
- CI/CD pipelines processing documentation
- AI/LLM pipelines using markitdown for document ingestion
- The markitdown MCP server (markitdown-mcp)
Disclosure Process
We attempted to report this through [email protected] (bounced — no longer accepted) and the MSRC Researcher Portal. This issue was discovered during an automated scan using the Colosseum deep code analysis platform — 51 gauntlets × 2 platforms × 7 rounds, with 98% cross-platform agreement.
Full scan report: https://battleharden.dev/reports/markitdown