HTML inside XLS not parsed properly #1178

@mishaberman


Hello! I am having some issues with reading XLS files that are actually HTML tables. When exporting a report from Salesforce, if you choose to export as an XLS, the report is actually exported as an HTML table, but saved as an XLS file. Weird, but whatever. I am handling the use case anyway, in case other systems do this, and in order to allow this as an input for my program instead of just throwing an error for the user.
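One way to tell these disguised files apart from real XLS files before parsing is to look at the first bytes. The sketch below is only a heuristic, and `sniffXlsPayload` is a hypothetical helper name, not part of any library: binary XLS files are OLE compound documents that start with a fixed 8-byte signature, while an HTML export starts with `<` after optional whitespace. Other markup (XML-based spreadsheets, for instance) also starts with `<`, so the result is labelled "markup" rather than "html".

```javascript
// Signature that opens every OLE compound file (the container used by binary .xls).
const OLE_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Hypothetical helper: classify the payload of a file named *.xls.
// `bytes` is a Uint8Array (a Node Buffer also works, since it is a Uint8Array).
function sniffXlsPayload(bytes) {
  if (OLE_MAGIC.every((b, i) => bytes[i] === b)) return "xls";

  // Skip a UTF-8 byte-order mark, then leading spaces, tabs, and newlines.
  let i = 0;
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) i = 3;
  while (i < bytes.length && [0x20, 0x09, 0x0a, 0x0d].includes(bytes[i])) i++;

  // 0x3c is "<": the payload looks like HTML or some other markup.
  return bytes[i] === 0x3c ? "markup" : "unknown";
}
```

A caller could use the result to decide whether to pre-process the text (for example, the header-cell workaround in Issue 1) before handing it to the parser.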

Issue 1: The parser does not read the header cells in the HTML table; they are ignored. A quick fix was a regex that replaces every <th>, </th>, and <th ...> variation with the corresponding <td> tag. Not having to do this would be nice, but it was a simple enough workaround.
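For reference, the workaround looks roughly like the sketch below. `thToTd` is a name made up for this example, and the pattern is an approximation of what I run, not a robust HTML rewriter: it keeps any attributes on the tag and uses a lookahead so that <thead> is left alone.

```javascript
// Hypothetical helper: rewrite header cells as ordinary cells so the
// parser picks them up. Handles <th>, </th>, and <th ...attributes>.
function thToTd(html) {
  // (\/?)       optional slash of a closing tag
  // (?=[\s>\/]) "th" must be the whole tag name, so <thead> does not match
  // ([^>]*)     any attributes, copied through unchanged
  return html.replace(/<(\/?)th(?=[\s>\/])([^>]*)>/gi, "<$1td$2>");
}

const fixed = thToTd('<tr><th colspan="2">Header 1</th></tr>');
console.log(fixed);
```

The rewritten string is then passed to xlsx.read in place of the original file contents.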

Issue 2: I tested putting HTML tags in field values in my Salesforce export; for example, I set the value of a Project Manager name to "<td>". Salesforce did the correct thing by HTML-encoding this value to "&lt;td&gt;"; however, when the file is read in code, these values are completely ignored. I'm not sure this will ever come up in a real input file, but I want to protect against it in case it does.
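To spell out what I would expect: an encoded cell should come back as its literal text. The sketch below shows that round trip for the handful of entities involved. `decodeBasicEntities` is a hypothetical helper written for this illustration; it is not how the library decodes cells and it covers only five entities.

```javascript
// Minimal entity table; real HTML has many more named entities.
const ENTITIES = { lt: "<", gt: ">", amp: "&", quot: '"', "#39": "'" };

// Hypothetical helper: decode the basic entities in one pass, so that
// "&amp;lt;" becomes "&lt;" and is not decoded a second time.
function decodeBasicEntities(text) {
  return text.replace(/&(lt|gt|amp|quot|#39);/g, (_, name) => ENTITIES[name]);
}

// The cell content Salesforce wrote, and the value I expect after parsing.
console.log(decodeBasicEntities("&lt;td&gt;")); // <td>
```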

Here is a simple example of an HTML table that I saved as an XLS file that demonstrates both Issue 1 and 2:

| Header 1 | Header 2 |
| --- | --- |
| Col val 1 | Col val 2 |
| Col val 3 | Col val 4 |
| `<td>` | `<td>` |

<table>
	<tr>
		<th>
			Header 1
		</th>
		<th>
			Header 2
		</th>
	</tr>
	<tr>
		<td>Col val 1</td>
		<td>Col val 2</td>
	</tr>
	<tr>
		<td>Col val 3</td>
		<td>Col val 4</td>
	</tr>
	<tr>
		<td>&lt;td&gt;</td>
		<td>&lt;td&gt;</td>
	</tr>
</table>

When calling xlsx.read, here is the output when inspecting in the debugger (as you can see, the header values are not picked up, and the HTML encoded values are not picked up either):

image

Please let me know if you have suggestions for the issues I am experiencing, other than telling the user to download the Salesforce report as a CSV instead :)
