conll-rdf is a tool package for converting between formats of annotated linguistic corpora and linking to external ontologies.
It consists of a number of java modules which can be chained to construct a data processing pipeline.
A variety of CoNLL formats are supported.
- convert any CoNLL-like tsv format (e.g.
.conll) to conll-rdf (.ttl). - perform SPARQL Updates on conll-rdf data.
- visualize conll-rdf structure.
- convert conll-rdf back to conll.
In general, we read data line by line from stdin, process it and write results to stdout.
For quick set-up, we recommend using .sh scripts to pipe your data through the tools.
Each pipeline element can be called via ./run.sh $CLASS [args].
Another method is to configure the pipeline in a config-json and call the entire pipeline with ./run.sh CoNLLRDFManager -c $config-json.
Of course you can also use the provided classes within java as any other library.
Download the repository from GitHub: git clone https://github.com/acoli-repo/CoNLL-RDF.git.
You need java and javac to run and compile the source.
- use
java -versionto check if you have java installed. (Java version 8 or 9 is required for logging with log4j) - check with
javac -versionif the JDK compiler is installed.
The source is compiled automatically once you run the tool.
Note: If you prefer using OpenJDK instead of Oracle Java, it might be necessary to install openjfx. sudo apt-get install openjfx (Thanks to Francesco Mambrini for finding out!)
All required java libraries are contained in lib/.
- you might get an error like
bash: ./../test.sh: Permission deniedwhen trying to run a script. Use this command to change the filemode:chmod +x <SCRIPT> - an error starting like
ERROR CoNLLRDFUpdater :: SPARQL parse exception for Update No. 0: DIRECTUPDATE [...]when running the RDFUpdater can be raised if the path to a sparql query is wrong. Check for extra or missing../.
All relevant classes are in src/. Documentation for them is found in doc/.
A hands-on tutorial with a variety of sample-pipelines, which you can adapt to your needs, can be found in examples/. These convert data found in data/.
In case your corpus directly corresponds to a format found there you can directly convert it with given scripts into conll-rdf.
Suppose we have a corpus example.conll and want to convert it to conll-rdf to make it compatible with a given LLOD technology. We can do this with a simple shell-command:
cat example.conll | ./run.sh CoNLLStreamExtractor my-baseuri.org/example.conll# \
ID WORD LEMMA POS_COARSE POS FEATS HEAD EDGE > example.ttlThis will create a new file example.ttl in conll-rdf by simply providing
- a base-URI (ideally a resolvable URL to adhere to the five stars of LOD).
- the names of the CoNLL columns from left to right.
run.sh is used to make things feel more bash-like. It determines the classpath, updates class files if necessary and runs the specified java class with the provided arguments.
- eg.
cat foo.ttl | ./run.sh CoNLLRDFFormatter > foo_formatted.ttlwould pipefoo.ttlthroughCoNLLRDFFormatterintofoo_formatted.ttl. - In case the respective
.classfiles cannot be found,run.shcallscompile.shto compile the java classes from source. Of course, you may also runcompile.shindependently.
In-depth information on all the classes of conll-rdf can be found in the documentation.
IMPORTANT USAGE HINT: All given data is parsed sentence-wise (if applicable). Meaning that for CoNLL data as input a newline is considered as a sentence boundary marker (in regard to the CoNLL data model). The ID column (if present) must contain sentence internal IDs (if this is not the case this column must be renamed before conversion/parsing) - if no such column is provided sentence internal IDs will be generated. Please refer to the paper mentioned below under Reference.
CoNLLRDFManager processes a pipeline provided as JSON.
Synopsis: CoNLLRDFManager -c [JSON-config]
CoNLLStreamExtractor expects CoNLL from stdin and writes conll-rdf to stdout.
Synopsis: CoNLLStreamExtractor baseURI FIELD1[.. FIELDn] [-s SPARQL_SELECT]
CoNLLRDFUpdater expects conll-rdf from stdin and writes conll-rdf to stdout. It is designed for updating existing conll-rdf files and is able to load external ontologies or RDF data into separate Graphs during runtime. This is especially useful for linking CoNLL-RDF files to other ontologies.
It can also output .dot graph-files (or triples stdout).
Synopsis: CoNLLRDFUpdater -custom [-model URI [GRAPH]] [-updates [UPDATE]]
CoNLLRDFFormatter expects conll-rdf in .ttl and writes to different formats. Can also visualize your data.
Synopsis: CoNLLRDFFormatter [-rdf [COLS]] [-debug] [-grammar] [-semantics] [-conll COLS] [-sparqltsv SPARQL]
It can write:
- canonical conll-rdf as
.ttl. .conllof specified columns.- TSV to
stdoutbased on a given sparql select query. debughighlighted.ttltostderr, e.g. highlighting triples representing conll columns or sentence structure differently.
grammar: writes conll data structure in tree-like format tostdout. (/resp.\are pointing in direction ofconll:HEAD)
semanticsseperate visualization of object properties ofconll:WORDusingterms:namespace, useful for visualizing knowledge graphs.EXPERIMENTAL
- can be used to manually annotate / change annotations in .ttl files.
- will visualize input just like
CoNLLRDFFormatter -grammar. Will not make in-place changes but write the changed file tostdout(e.g../run.sh CoNLLRDFAnnotator file_old.ttl > file_new.ttl)- Note: Piping output into old file is not supported! Will result in data loss.
CoNLL2RDFcontains the central conversion functionality. For practical uses, interface with its functionality through CoNLLStreamExtractor. Arguments to CoNLLStreamExtractor will be passed through.CoNLLRDFVizis an auxiliary class for the future development of debugging and visualizing complex SPARQL Update chains in their effects on selected pieces of CoNLL(-RDF) data- conll-rdf assumes UTF-8.
- Christian Chiarcos - [email protected]
- Christian Fäth - [email protected]
- Benjamin Kosmehl - [email protected]
See also the list of contributors who participated in this project.
- Chiarcos C., Fäth C. (2017), CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge. LDK 2017. pp 74-88.
This repository has been created in context of
- Applied Computational Linguistics (ACoLi)
- Specialised Information Service Linguistics (FID)
- funded by the German Research Foundation (DFG, funding code CH1413/2-1)
- Linked Open Dictionaries (LiODi)
- funded by the German Federal Ministry of Education and Research (BMBF, funding code 01UG1631)
- QuantQual@CEDIFOR (QuantQual)
- funded by the Centre for the Digital Foundation of Research in the Humanities, Social and Educational Science (CEDIFOR, funding code 01UG1416A)
This repository is being published under two licenses. Apache 2.0 is used for code, see LICENSE.main. CC-BY 4.0 for all data from universal dependencies and SPARQL scripts, see LICENSE.data.
├── src/
├── lib/
├── examples/
│ ├── analyze-ud.sh
│ ├── convert-ud.sh
│ ├── link-ud.sh
│ └── parse-ud.sh
├── compile.sh
└── run.sh
├── data/
└── examples/
└── sparql/
Please cite Chiarcos C., Fäth C. (2017), CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge. LDK 2017. pp 74-88.