JSON? #11

@fasiha

I love how compact the JmdictFurigana.txt file format is but it certainly is non-standard and not trivial to parse correctly. I wonder if you've thought about distributing the data as some kind of JSON?

I have a package that parses the compact JmdictFurigana.txt into a line-delimited JSON file, whose lines are for example:

{"text":"アカガエル科","reading":"アカガエルか","furigana":["アカガエル",{"ruby":"科","rt":"か"}]}
{"text":"給料明細","reading":"きゅうりょうめいさい","furigana":[{"ruby":"給","rt":"きゅう"},{"ruby":"料","rt":"りょう"},{"ruby":"明","rt":"めい"},{"ruby":"細","rt":"さい"}]}

Each line is valid JSON with the following schema (in TypeScript notation; the type names themselves, such as Entry or Ruby, never appear in the generated JSON):

type Ruby = {
  ruby: string,
  rt: string,
};

type Furigana = string|Ruby;

type Entry = {
  furigana: Furigana[],
  reading: string,
  text: string,
};
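As a rough illustration of how a consumer might read this format, here is a minimal sketch that parses line-delimited JSON into `Entry` values and sanity-checks each one. The helper names `parseEntries` and `isConsistent` are hypothetical and not part of any existing package. The consistency check assumes that plain-string segments appear verbatim in both `text` and `reading`, which holds for the examples above but may not hold for every entry.

```typescript
type Ruby = { ruby: string; rt: string };
type Furigana = string | Ruby;
type Entry = { furigana: Furigana[]; reading: string; text: string };

// Hypothetical helper: parse line-delimited JSON, one Entry per non-empty line.
function parseEntries(ndjson: string): Entry[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Entry);
}

// Hypothetical sanity check (an assumption, not a documented guarantee):
// joining the base text of every segment should rebuild `text`, and joining
// the plain strings plus every `rt` should rebuild `reading`.
function isConsistent(entry: Entry): boolean {
  const base = entry.furigana
    .map((f) => (typeof f === "string" ? f : f.ruby))
    .join("");
  const kana = entry.furigana
    .map((f) => (typeof f === "string" ? f : f.rt))
    .join("");
  return base === entry.text && kana === entry.reading;
}

const sample =
  '{"text":"給料明細","reading":"きゅうりょうめいさい","furigana":[' +
  '{"ruby":"給","rt":"きゅう"},{"ruby":"料","rt":"りょう"},' +
  '{"ruby":"明","rt":"めい"},{"ruby":"細","rt":"さい"}]}\n';

const parsed = parseEntries(sample);
console.log(parsed.length, isConsistent(parsed[0]));
```

For a file of this size, a streaming line reader would likely be preferable to loading everything into one string; the string-based version is used here only to keep the sketch self-contained.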

I use ruby/rt to match HTML.
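To show why the ruby/rt naming is convenient, here is a small sketch that turns one entry into HTML ruby markup. The function name `entryToHtml` is hypothetical, and the sketch assumes the strings contain no HTML-special characters, so it does no escaping.

```typescript
type Ruby = { ruby: string; rt: string };
type Furigana = string | Ruby;
type Entry = { furigana: Furigana[]; reading: string; text: string };

// Hypothetical helper: plain strings pass through unchanged, and each Ruby
// object becomes a <ruby> element with its reading inside <rt>.
// Assumption: no HTML escaping is needed for the input strings.
function entryToHtml(entry: Entry): string {
  return entry.furigana
    .map((f) =>
      typeof f === "string" ? f : `<ruby>${f.ruby}<rt>${f.rt}</rt></ruby>`
    )
    .join("");
}

const example: Entry = JSON.parse(
  '{"text":"アカガエル科","reading":"アカガエルか",' +
    '"furigana":["アカガエル",{"ruby":"科","rt":"か"}]}'
);

console.log(entryToHtml(example));
// prints: アカガエル<ruby>科<rt>か</rt></ruby>
```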

This particular line-delimited JSON format expands the 8.7 MB original to 24 MB, but after gzip compression the two are 2.5 MB and 3.8 MB respectively over the wire. If we wanted to reduce file size, or to imitate the current JmdictFurigana.txt format, I can imagine replacing the Entry schema above with a simpler array-based one, something like type Entry = [string, string, Furigana[]].
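The array-based alternative could be sketched roughly as follows. The element order `[text, reading, furigana]` is an assumption, since the tuple type above does not say which string comes first, and the names `CompactEntry`, `toCompact`, and `fromCompact` are hypothetical.

```typescript
type Ruby = { ruby: string; rt: string };
type Furigana = string | Ruby;
type Entry = { furigana: Furigana[]; reading: string; text: string };

// Assumed element order: [text, reading, furigana].
type CompactEntry = [string, string, Furigana[]];

// Drop the three top-level keys by relying on position instead.
function toCompact(entry: Entry): CompactEntry {
  return [entry.text, entry.reading, entry.furigana];
}

// Restore the keyed form from the positional form.
function fromCompact([text, reading, furigana]: CompactEntry): Entry {
  return { furigana, reading, text };
}

const entry: Entry = {
  text: "アカガエル科",
  reading: "アカガエルか",
  furigana: ["アカガエル", { ruby: "科", rt: "か" }],
};

const compactLine = JSON.stringify(toCompact(entry));
console.log(compactLine);
// prints: ["アカガエル科","アカガエルか",["アカガエル",{"ruby":"科","rt":"か"}]]
```

This only removes the three top-level keys per line; the `ruby`/`rt` keys inside each segment remain, so the savings per entry are modest.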

Feel free to say no if you've thought about this and don't want to support it! I've now parsed the file in three languages (JavaScript, Clojure, and again TypeScript/JavaScript), and it's sufficiently tricky that I thought I'd ask. Thank you!
