Conversation

@iamllama (Contributor) commented Mar 12, 2025

Resolves #3853

Anki's current CSV delimiter heuristic works by looking for potential delimiters in ascending order of expected frequency (`\t`, `|`, `;`, `:`, `,`, space) in the first 8 KB. While a wrong guess isn't necessarily a problem, since the delimiter can be changed on the import options page, hierarchical tags (`::`) throw a wrench in this, as pointed out by @GithubAnon0000. And it's a common enough use case that Anki probably shouldn't penalise it. I imagine this was also foreseen by Rumov when he left a TODO 3 years ago.
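To make the failure mode concrete, here is a rough sketch of that kind of first-match heuristic (an illustration of the described behaviour, not Anki's actual code), with a sample where a hierarchical tag makes the guess land on `:` instead of the real `,`:

```rust
/// Sketch of a first-match delimiter guess over the candidate list above.
fn guess_delimiter(sample: &str) -> u8 {
    // Candidates checked in ascending order of expected frequency.
    const CANDIDATES: [u8; 6] = [b'\t', b'|', b';', b':', b',', b' '];
    for &delim in &CANDIDATES {
        if sample.as_bytes().contains(&delim) {
            return delim;
        }
    }
    b' '
}

fn main() {
    // The hierarchical tag "science::physics" contains colons, so the guess
    // is ':' even though ',' is the actual delimiter.
    let sample = "front,back,tags\nQ1,A1,science::physics\n";
    assert_eq!(guess_delimiter(sample), b':');
}
```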

This PR proposes pulling in and using the csv-sniffer crate, which correctly deduces the delimiter for the samples in #3588 and #3853 in a cross-platform manner.

EDIT: replaced csv-sniffer with qsv-sniffer, a fork that fixes some of the former's issues. However, not only is it more bloated than csv-sniffer, it still (unsurprisingly) has samples that it fails on: jblondin/csv-sniffer#18. Not sure if this is worth pursuing
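For reference, the general technique such sniffers rely on is scoring candidate delimiters by how consistently they split the sample into fields, rather than taking the first match. The sketch below only illustrates that idea; it is not the csv-sniffer/qsv-sniffer API, and the function and sample names are made up.

```rust
/// Illustrative field-count-consistency sniffer: prefer the candidate that
/// splits every non-empty line into the same number of fields.
fn sniff_delimiter(sample: &str) -> Option<u8> {
    const CANDIDATES: [u8; 6] = [b'\t', b'|', b';', b':', b',', b' '];
    let lines: Vec<&str> = sample.lines().filter(|l| !l.is_empty()).collect();
    CANDIDATES
        .iter()
        .copied()
        .filter_map(|delim| {
            let counts: Vec<usize> = lines
                .iter()
                .map(|l| l.as_bytes().iter().filter(|&&b| b == delim).count())
                .collect();
            let first = *counts.first()?;
            // The delimiter must appear, with the same count on every line.
            (first > 0 && counts.iter().all(|&c| c == first)).then_some((delim, first))
        })
        // Among consistent candidates, prefer the one yielding more fields.
        .max_by_key(|&(_, count)| count)
        .map(|(delim, _)| delim)
}

fn main() {
    // The colon counts vary across lines (0, 2, 0), so ':' loses to ','.
    let sample = "front,back,tags\nQ1,A1,science::physics\nQ2,A2,math\n";
    assert_eq!(sniff_delimiter(sample), Some(b','));
}
```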

@dae (Member) commented Mar 15, 2025

I'd prefer not to add all those extra crates if we can avoid it. I'll follow up on the initial issue.

@iamllama (Contributor, Author) commented:

I've reverted the crate addition and usage in favour of another suggested approach: trimming potential delimiters from non-freeform metadata lines.

> I wonder if we could perhaps hackily solve this by stripping off trailing commas from comment lines, as a concession to spreadsheet users?

Instead of just removing commas, it makes use of the fact that the currently supported delimiters are all non-ASCII-alphanumeric. This allows samples like `#separator:,,,,`, `#separator:Pipe|||`, `#html:true,,,` and `#tags column:8,,,` to be parsed instead of being ignored, as demonstrated in the test.

All the freeform metadata values (#deck, #notetype, #tags, #columns) are left alone to avoid breakage, at the expense of inconsistency with the other lines. #match scope and #if match are left alone because their constrained values can include or +, which unfortunately aren't compatible with this method
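As a rough illustration of the trimming idea (a hypothetical helper, not the code in this PR), the core primitive is just stripping trailing non-ASCII-alphanumeric characters from the constrained values:

```rust
/// Hypothetical helper: strip trailing characters that can only be stray
/// delimiters, relying on the fact that every supported delimiter is
/// non-ASCII-alphanumeric.
fn trim_trailing_delimiters(value: &str) -> &str {
    value.trim_end_matches(|c: char| !c.is_ascii_alphanumeric())
}

fn main() {
    // "#separator:Pipe|||" -> value "Pipe|||" -> "Pipe"
    assert_eq!(trim_trailing_delimiters("Pipe|||"), "Pipe");
    // "#html:true,,," -> value "true,,," -> "true"
    assert_eq!(trim_trailing_delimiters("true,,,"), "true");
    // "#tags column:8,,," -> value "8,,," -> "8"
    assert_eq!(trim_trailing_delimiters("8,,,"), "8");
    // "#separator:,,,," is trickier, since the value itself is a delimiter
    // character; the actual change presumably treats the separator line
    // specially rather than blindly trimming it down to nothing.
}
```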

@iamllama changed the title from "Use a smarter csv delimiter heuristic" to "Loosen csv metadata parsing" on Mar 16, 2025
@dae (Member) commented Mar 19, 2025

One problem with such selective/"magical" approaches is that they can cause confusion. A user may find a simple metadata line in a spreadsheet works, then wonder why other ones don't work correctly when they go to use them in the future. But indiscriminately stripping non-alphanumeric characters from all metadata lines is going to confuse/inconvenience some users too, so it's not an easy call.

Given that your approach is a subset of the other one and doesn't preclude us doing more later if we feel we need to, it seems like the more conservative choice. Thank you @iamllama!

@dae merged commit d8c83ac into ankitects:main on Mar 19, 2025
1 check passed


Development

Successfully merging this pull request may close these issues.

[Bug] File headers not working correctly with .csv files
