Conversation

@iamllama (Contributor) commented Mar 12, 2025

Resolves #3853

Anki's current CSV delimiter heuristic works by looking for potential delimiters in ascending order of expected frequency (`\t`, `|`, `;`, `:`, `,`, space) in the first 8 KB. While a wrong guess isn't necessarily a problem, since the delimiter can be changed on the import options page, hierarchical tags (`::`) throw a wrench in this, as pointed out by @GithubAnon0000. And it's a common enough use case that Anki probably shouldn't penalise it. I imagine this was also foreseen by Rumov when he left a TODO 3 years ago.
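To make the failure mode concrete, here is a rough sketch of that kind of first-match heuristic (an illustration of the described behaviour, not Anki's actual code), with a sample where a hierarchical tag makes the guess land on `:` instead of the real `,`:

```rust
/// Sketch of a first-match delimiter guess over the candidate list above.
fn guess_delimiter(sample: &str) -> u8 {
    // Candidates checked in ascending order of expected frequency.
    const CANDIDATES: [u8; 6] = [b'\t', b'|', b';', b':', b',', b' '];
    for &delim in &CANDIDATES {
        if sample.as_bytes().contains(&delim) {
            return delim;
        }
    }
    b' '
}

fn main() {
    // The hierarchical tag "science::physics" contains colons, so the guess
    // is ':' even though ',' is the actual delimiter.
    let sample = "front,back,tags\nQ1,A1,science::physics\n";
    assert_eq!(guess_delimiter(sample), b':');
}
```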

This PR proposes pulling in and using the csv-sniffer crate, which correctly deduces the delimiter for the samples in #3588 and #3853 in a cross-platform manner.

EDIT: replaced csv-sniffer with qsv-sniffer, a fork that fixes some of the former's issues. However, not only is it more bloated than csv-sniffer, it still (unsurprisingly) has samples that it fails on: jblondin/csv-sniffer#18. Not sure if this is worth pursuing
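For reference, the general technique such sniffers rely on is scoring candidate delimiters by how consistently they split the sample into fields, rather than taking the first match. The sketch below only illustrates that idea; it is not the csv-sniffer/qsv-sniffer API, and the function and sample names are made up.

```rust
/// Illustrative field-count-consistency sniffer: prefer the candidate that
/// splits every non-empty line into the same number of fields.
fn sniff_delimiter(sample: &str) -> Option<u8> {
    const CANDIDATES: [u8; 6] = [b'\t', b'|', b';', b':', b',', b' '];
    let lines: Vec<&str> = sample.lines().filter(|l| !l.is_empty()).collect();
    CANDIDATES
        .iter()
        .copied()
        .filter_map(|delim| {
            let counts: Vec<usize> = lines
                .iter()
                .map(|l| l.as_bytes().iter().filter(|&&b| b == delim).count())
                .collect();
            let first = *counts.first()?;
            // The delimiter must appear, with the same count on every line.
            (first > 0 && counts.iter().all(|&c| c == first)).then_some((delim, first))
        })
        // Among consistent candidates, prefer the one yielding more fields.
        .max_by_key(|&(_, count)| count)
        .map(|(delim, _)| delim)
}

fn main() {
    // The colon counts vary across lines (0, 2, 0), so ':' loses to ','.
    let sample = "front,back,tags\nQ1,A1,science::physics\nQ2,A2,math\n";
    assert_eq!(sniff_delimiter(sample), Some(b','));
}
```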

@dae (Member) commented Mar 15, 2025

I'd prefer not to add all those extra crates if we can avoid it. I'll follow up on the initial issue.

@iamllama (Contributor, Author) commented:

I've reverted the crate addition and usage in favour of another suggested approach: trimming potential delimiters from non-freeform metadata lines.

> I wonder if we could perhaps hackily solve this by stripping off trailing commas from comment lines, as a concession to spreadsheet users?

Instead of just removing commas, it makes use of the fact that the currently supported delimiters are all non-ASCII-alphanumeric. This allows samples like `#separator:,,,,`, `#separator:Pipe|||`, `#html:true,,,` and `#tags column:8,,,` to be parsed instead of being ignored, as demonstrated in the test.

All the freeform metadata values (#deck, #notetype, #tags, #columns) are left alone to avoid breakage, at the expense of inconsistency with the other lines. #match scope and #if match are left alone because their constrained values can include or +, which unfortunately aren't compatible with this method
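As a rough illustration of the trimming idea (a hypothetical helper, not the code in this PR), the core primitive is just stripping trailing non-ASCII-alphanumeric characters from the constrained values:

```rust
/// Hypothetical helper: strip trailing characters that can only be stray
/// delimiters, relying on the fact that every supported delimiter is
/// non-ASCII-alphanumeric.
fn trim_trailing_delimiters(value: &str) -> &str {
    value.trim_end_matches(|c: char| !c.is_ascii_alphanumeric())
}

fn main() {
    // "#separator:Pipe|||" -> value "Pipe|||" -> "Pipe"
    assert_eq!(trim_trailing_delimiters("Pipe|||"), "Pipe");
    // "#html:true,,," -> value "true,,," -> "true"
    assert_eq!(trim_trailing_delimiters("true,,,"), "true");
    // "#tags column:8,,," -> value "8,,," -> "8"
    assert_eq!(trim_trailing_delimiters("8,,,"), "8");
    // "#separator:,,,," is trickier, since the value itself is a delimiter
    // character; the actual change presumably treats the separator line
    // specially rather than blindly trimming it down to nothing.
}
```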

@iamllama changed the title from "Use a smarter csv delimiter heuristic" to "Loosen csv metadata parsing" on Mar 16, 2025
@dae (Member) commented Mar 19, 2025

One problem with such selective/"magical" approaches is that they can cause confusion. A user may find a simple metadata line in a spreadsheet works, then wonder why other ones don't work correctly when they go to use them in the future. But indiscriminately stripping non-alphanumeric characters from all metadata lines is going to confuse/inconvenience some users too, so it's not an easy call.

Given that your approach is a subset of the other one and doesn't preclude us doing more later if we feel we need to, it seems like the more conservative choice. Thank you @iamllama!

@dae merged commit d8c83ac into ankitects:main on Mar 19, 2025
1 check passed


Development

Successfully merging this pull request may close these issues.

[Bug] File headers not working correctly with .csv files
