HTML inside XLS not parsed properly

Hello! I am having some issues reading XLS files that are actually HTML tables. When you export a report from Salesforce and choose XLS, the report is actually written as an HTML table and saved with an .xls extension. Weird, but whatever. I am handling this use case anyway, in case other systems do the same, so that my program can accept these files as input instead of just throwing an error for the user.
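For context, this is roughly how such a file can be told apart from a genuine binary workbook. It is only a minimal sketch: `looksLikeHtmlXls` is a hypothetical helper name, and it assumes that real binary .xls files start with the OLE2 compound-file signature while the exported files start with markup.

```javascript
// OLE2 compound-file signature that begins a genuine binary .xls file.
const OLE2_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Hypothetical helper: returns true when the bytes look like HTML markup
// rather than a binary workbook. Accepts an array, Uint8Array, or Buffer.
function looksLikeHtmlXls(bytes) {
  const isOle2 =
    bytes.length >= OLE2_MAGIC.length &&
    OLE2_MAGIC.every((b, i) => bytes[i] === b);
  if (isOle2) return false;

  // Skip a UTF-8 byte order mark if one is present.
  let start = 0;
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) start = 3;

  // Decode only a small prefix; the opening tag is all we need to see.
  let head = "";
  const end = Math.min(bytes.length, start + 1024);
  for (let i = start; i < end; i++) head += String.fromCharCode(bytes[i]);

  return /^\s*<(!doctype|html|head|body|table|meta)\b/i.test(head);
}
```

In my case the check runs on the uploaded file's bytes before deciding how to handle it.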

**Issue 1:** The header cells in the HTML table are not being read by the parser; they are ignored. A quick fix was a regex that replaces all `<th>`, `</th>`, and `<th ...>` variations with the corresponding `<td>` tags. Not having to do this would be nice, but it was a simple enough workaround.
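The workaround looks roughly like the sketch below. `thToTd` is a hypothetical name and the patterns are an approximation of what I use; they assume reasonably well-formed markup and intentionally leave `<thead>` alone.

```javascript
// Rewrite header cells as ordinary cells so the parser picks them up.
// Attributes on the opening tag are preserved; <thead> is not matched
// because the tag name must be followed by whitespace or ">".
function thToTd(html) {
  return html
    .replace(/<th(\s[^>]*)?>/gi, "<td$1>")
    .replace(/<\/th\s*>/gi, "</td>");
}
```

The result of `thToTd` is then passed to the parser in place of the original file contents. Encoded text such as `&lt;th&gt;` is untouched, since only literal tags match.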

**Issue 2:** I tested putting HTML tags in field values in my Salesforce export; for example, I set the value of a Project Manager name to `<td>`. Salesforce did the correct thing by HTML-encoding this value to `&lt;td&gt;`. However, when the file is read in code, these values are completely ignored. I'm not sure this will ever come up in a real input file, but I want to protect against it in case it does.

Here is a simple example of an HTML table that I saved as an XLS file that demonstrates both Issue 1 and 2:

```html
<html>
	<table>
		<tr>
			<th>
				Header 1
			</th>
			<th>
				Header 2
			</th>
		</tr>
		<tr>
			<td>Col val 1</td>
			<td>Col val 2</td>
		</tr>
		<tr>
			<td>Col val 3</td>
			<td>Col val 4</td>
		</tr>
		<tr>
			<td>&lt;td&gt;</td>
			<td>&lt;td&gt;</td>
		</tr>
	</table>
</html>
```
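To make the expected result concrete, here is a small dependency-free sketch of the rows I would expect from this file. It is not how the library parses HTML; `expectedRows` and `decodeEntities` are hypothetical helpers, and the regex approach is only adequate for simple, well-formed tables like the one above.

```javascript
// Decode the handful of entities relevant to this example.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

// Naive extraction: one array per <tr>, one trimmed string per <th> or <td>.
function expectedRows(html) {
  const rows = [];
  for (const tr of html.matchAll(/<tr\b[^>]*>([\s\S]*?)<\/tr\s*>/gi)) {
    const cells = [];
    for (const cell of tr[1].matchAll(/<t[hd]\b[^>]*>([\s\S]*?)<\/t[hd]\s*>/gi)) {
      const text = cell[1].replace(/<[^>]*>/g, ""); // drop any nested markup
      cells.push(decodeEntities(text).trim());
    }
    rows.push(cells);
  }
  return rows;
}
```

For the example file, the first row should hold the two header values and the last row should hold the literal text `<td>` in both cells.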


When calling `xlsx.read`, here is the output as inspected in the debugger. As you can see, neither the header values nor the HTML-encoded values are picked up:

![image](https://user-images.githubusercontent.com/68528/42530265-9737b532-8435-11e8-9c97-19c365a51843.png)

Please let me know if you have suggestions for the issues I am experiencing, other than telling the user to download the Salesforce report as a CSV instead :)



HTML inside XLS not parsed properly #1178
