Security: XXE vulnerability in DOCX pre-processor (ET.fromstring on untrusted input)

## Security Vulnerability Report

**Type:** XML External Entity (XXE) Injection
**Severity:** High
**File:** `packages/markitdown/src/markitdown/converter_utils/docx/pre_process.py`, line 45
**Commit tested:** 4a5340f93b2bf1dc11641f921fbfd6d5f016924b

### Description

The function `_convert_omath_to_latex()` in `pre_process.py` uses `ET.fromstring()` to parse XML content extracted from user-supplied DOCX files:

```python
math_root = ET.fromstring(MATH_ROOT_TEMPLATE.format(str(tag)))
```

The `tag` variable comes from BeautifulSoup's parse of the DOCX `document.xml`, which is user-supplied content. A DOCX file is a ZIP archive of XML parts, so an attacker can craft a DOCX whose Office Math Markup (OMML) tags carry XML external entity declarations.

Python's `xml.etree.ElementTree` is [documented as "not secure against maliciously constructed data"](https://docs.python.org/3/library/xml.etree.elementtree.html). CPython's Expat-based parser exposes a smaller XXE surface than lxml, but depending on the Python and Expat versions in use it can still be exposed to:
- **Billion Laughs** (exponential entity expansion), causing denial of service; the Python documentation notes that Expat 2.4.1 and newer mitigates this
- External entity resolution, depending on parser configuration
- DTD processing attacks
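
To gauge exposure on a given interpreter, a small probe along these lines may help. It is a sketch only: it uses a harmless single internal entity rather than a real attack payload, the helper names are hypothetical, and the 2.4.1 threshold is taken from the Python documentation's note on Expat.

```python
from xml.etree import ElementTree as ET
from xml.parsers import expat

# Harmless stand-in for a hostile payload: one internal entity, one reference.
PROBE = '<!DOCTYPE r [<!ENTITY greeting "hello">]><r>&greeting;</r>'


def expands_internal_entities() -> bool:
    """Return True if ET.fromstring substitutes entities declared in a DTD."""
    return ET.fromstring(PROBE).text == "hello"


def expat_has_expansion_limits() -> bool:
    """Return True if the bundled Expat is 2.4.1 or newer."""
    return expat.version_info >= (2, 4, 1)


if __name__ == "__main__":
    print("internal entities expanded:", expands_internal_entities())
    print("Expat version:", expat.EXPAT_VERSION)
    print("Expat >= 2.4.1:", expat_has_expansion_limits())
```

If the first check is true and the second is false, entity-expansion denial of service is the most likely practical risk on that interpreter.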

### Secondary Finding (Medium)

The `exiftool_path` parameter reaches `subprocess.run()` in `packages/markitdown/src/markitdown/converters/_exiftool.py` (lines 22 and 41) without validation. Although the call uses list-style invocation (not `shell=True`), the path is not checked against path traversal or symlink attacks.

### Recommended Fix

For the XXE:
```python
# Replace:
from xml.etree import ElementTree as ET

# With:
import defusedxml.ElementTree as ET
```

Or call `defusedxml.defuse_stdlib()` at module initialization.
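
Where adding `defusedxml` as a dependency is not an option, one stdlib-only alternative is to refuse any input that declares a DOCTYPE before handing it to ElementTree. The following is a minimal sketch of that idea, not markitdown's code and not a substitute for the fix above; `fromstring_no_dtd` and `DTDForbidden` are hypothetical names.

```python
from xml.etree import ElementTree as ET
from xml.parsers import expat


class DTDForbidden(ValueError):
    """Raised when the input declares a DOCTYPE."""


def _reject_doctype(name, sysid, pubid, has_internal_subset):
    # Expat calls this as soon as it sees <!DOCTYPE ...>; raising aborts the parse.
    raise DTDForbidden(f"DOCTYPE declaration not allowed: {name!r}")


def fromstring_no_dtd(xml_text: str) -> ET.Element:
    """Parse xml_text with ElementTree, but only if it carries no DOCTYPE."""
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    checker.Parse(xml_text, True)  # pre-scan; raises DTDForbidden or ExpatError
    return ET.fromstring(xml_text)
```

Without a DOCTYPE there are no entity declarations to expand, so this blocks the entity-based attacks listed earlier at the cost of parsing the input twice. Malformed input surfaces as `expat.ExpatError` from the pre-scan rather than as `ET.ParseError`.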

For the subprocess issue:
- Validate `exiftool_path` against an allowlist or verify it resolves to a known binary using `shutil.which()`
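
A validation step of this kind could be sketched as follows. This is illustrative only: `resolve_exiftool` and `ALLOWED_NAMES` are hypothetical and not part of markitdown, and a real deployment may prefer an allowlist of install directories over a filename check.

```python
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical allowlist of acceptable binary names (compared after lowercasing).
ALLOWED_NAMES = {"exiftool", "exiftool.exe"}


def resolve_exiftool(candidate: Optional[str] = None) -> Optional[str]:
    """Return a fully resolved exiftool path, or None if the candidate is rejected."""
    # shutil.which only returns paths that exist and are executable.
    located = shutil.which(candidate or "exiftool")
    if located is None:
        return None
    # Follow symlinks so the check applies to the file that would actually run.
    resolved = Path(located).resolve()
    if resolved.name.lower() not in ALLOWED_NAMES:
        return None
    return str(resolved)
```

The caller would pass the returned path to `subprocess.run()` and refuse to run when the result is `None`.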

### Impact

markitdown is widely used for converting documents to Markdown. Any application processing untrusted DOCX files is potentially vulnerable, including:
- Web services accepting document uploads
- CI/CD pipelines processing documentation
- AI/LLM pipelines using markitdown for document ingestion
- The markitdown MCP server (markitdown-mcp)

### Disclosure Process

We attempted to report this through secure@microsoft.com (the message bounced; the address no longer accepts reports) and the MSRC Researcher Portal. This issue was discovered during an automated scan using the [Colosseum](https://battleharden.dev) deep code analysis platform: 51 gauntlets × 2 platforms × 7 rounds, with 98% cross-platform agreement.

Full scan report: https://battleharden.dev/reports/markitdown
