Minimal n-triples toolkit. It can:
- shrink n-triples by applying namespace abbreviations (given some rules)
- convert n-triples to line delimited JSON (.ldj)
To list the abbreviation rules, run:
$ ntto -d
To create an abbreviated NT file from an NT file, run:
$ ntto -o OUTPUT.NT -a FILE.nt
To create an abbreviated JSON file from an NT file, run:
$ ntto -a -j FILE.nt > OUTPUT.LDJ
To create an abbreviated JSON file from an NT file while ignoring conversion errors, run:
$ ntto -a -j -i FILE.nt > OUTPUT.LDJ
To create an abbreviated JSON file from an NT file while ignoring conversion errors and using a custom RULES file, run:
$ ntto -r RULES -a -j -i FILE.nt > OUTPUT.LDJ
RPM and DEB packages can be found under releases.
With a proper Go setup, a
$ go get github.com/miku/ntto/cmd/ntto
should work as well.
$ ntto
Usage: ntto [OPTIONS] FILE
-a abbreviate n-triples using rules
-c dump constructed sed command and exit
-cpuprofile string
write cpu profile to file
-d dump rules and exit
-i ignore conversion errors
-j convert nt to json
-n string
string to indicate empty string replacement (default "<NULL>")
-o string
output file to write result to
-r string
path to rules file, use built-in if none given
-v prints current version and exits
-w int
parallelism measure (default 4)
ntto takes a RULES file (alternatively uses some hardwired rules) to abbreviate
common prefixes in a n-triple file. ntto does not do the replacements itself, but outsources it to external programs, like replace or perl.
With the help of replace ntto can shorten up to 3M lines per second. The resulting
file size can be up to 50% of the size of the original file.
$ cat RULES
# example rules file
dbp http://dbpedia.org/resource/
gnd http://d-nb.info/gnd/
dnbes http://d-nb.info/standards/elementset/gnd#
dnbac http://d-nb.info/standards/vocab/gnd/geographic-area-code#
dnbv http://d-nb.info/standards/vocab/gnd/
viaf http://viaf.org/viaf/
frbr http://rdvocab.info/uri/schema/FRBRentitiesRDA/
rdgr http://rdvocab.info/ElementsGr2/
# empty lines are ignored, as are comments
foaf http://xmlns.com/foaf/0.1/
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs http://www.w3.org/2000/01/rdf-schema#
schema http://schema.org/
dc http://purl.org/dc/elements/1.1/
dcterms http://purl.org/dc/terms/
$ wc -l file.nt
114171541
$ time ntto -o output.nt -a file.nt
real 1m51.202s
user 1m3.626s
sys 0m13.602s
$ time ntto -a -j file.nt > output.ldj
real 15m47.872s
user 16m19.516s
sys 2m3.013s
Sometimes, less is more, but YMMV:
$ time ntto -w 2 -a -j file.nt > output.ldj
real 12m3.619s
user 15m17.422s
sys 2m14.430s