I have the following code to fix tokenization issues in Spanish AnCora (UniversalDependencies/UD_Spanish-AnCora#6):
    if re.search(r'\w[¡!]$', node.form):
        # Separate the punctuation and attach it to the rest.
        punct = node.create_child()
        punct.shift_after_node(node)
        punct.form = node.form[-1:]
        node.form = node.form[:-1]
        punct.lemma = punct.form
        punct.upos = 'PUNCT'
        punct.xpos = 'faa' if punct.form == '¡' else 'fat'
        punct.feats['PunctType'] = 'Excl'
        punct.feats['PunctSide'] = 'Ini' if punct.form == '¡' else 'Fin'
        punct.misc['SpaceAfter'] = node.misc['SpaceAfter']
        node.misc['SpaceAfter'] = 'No'
        punct.deprel = 'punct'
The method shift_after_node() correctly updates the ids and the basic heads of the nodes that come after the new position of the shifted node. Unfortunately, it fails to also update the enhanced heads when an enhanced representation is present. Hence the following source
    -19 Yahoo! Yahoo! PROPN np0000o _ 16 appos 16:appos ClusterId=CESS-CAST-A-20000503-1687-s5.sn.51|ClusterType=Spec.organization|MentionSpan=19
    -20 con con ADP sps00 _ 21 case 21:case _
    -21 intenciones intención NOUN ncfp000 Gender=Fem|Number=Plur 8 obl 8:obl ClusterId=CESS-CAST-A-20000503-1687-s5.sn.57|ClusterType=Gen|MentionSpan=21-22
results in the following (note the mismatch in the parent of the preposition con: its basic head is now 22, but its enhanced dependency still says 21:case):
    +19 Yahoo Yahoo! PROPN np0000o _ 16 appos 16:appos ClusterId=CESS-CAST-A-20000503-1687-s5.sn.51|ClusterType=Spec.organization|MentionSpan=19|SpaceAfter=No
    +20 ! ! PUNCT fat PunctSide=Fin|PunctType=Excl 19 punct _ _
    +21 con con ADP sps00 _ 22 case 21:case _
    +22 intenciones intención NOUN ncfp000 Gender=Fem|Number=Plur 8 obl 8:obl ClusterId=CESS-CAST-A-20000503-1687-s5.sn.57|ClusterType=Gen|MentionSpan=21-22
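To make the expected behavior concrete, here is a minimal sketch of the renumbering that the DEPS column would need, done at the level of the plain CoNLL-U string. This is not Udapi code and not a proposed patch: shift_id and shift_deps are hypothetical helpers, and the sketch assumes that exactly one new token was inserted at position inserted_at and that empty-node ids (such as 21.1) follow their major number.

```python
def shift_id(node_id: str, inserted_at: int) -> str:
    """Renumber one id after a new token took position `inserted_at`.

    Hypothetical helper: ids at or after the insertion point move up by one.
    Empty-node ids such as "21.1" keep their minor part.
    """
    major, dot, minor = node_id.partition(".")
    number = int(major)
    if number >= inserted_at:
        number += 1
    return f"{number}{dot}{minor}"


def shift_deps(deps: str, inserted_at: int) -> str:
    """Renumber every head in a CoNLL-U DEPS value, e.g. "21:case"."""
    if deps == "_":
        return deps  # no enhanced dependencies on this node
    shifted = []
    for item in deps.split("|"):
        # Split on the first colon only; the relation may contain colons.
        head, _, rel = item.partition(":")
        shifted.append(f"{shift_id(head, inserted_at)}:{rel}")
    return "|".join(shifted)


# The new "!" token was inserted at position 20 in the example above.
print(shift_deps("21:case", 20))   # enhanced head of "con"
print(shift_deps("8:obl", 20))     # head before the insertion point
```

With these assumptions, the DEPS value of con would become 22:case, matching its basic head, while 8:obl and 16:appos would stay unchanged.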
It just occurred to me that the MentionSpan would also need updating, but for that one would probably need to activate the CorefUD sub-API first?
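For illustration only, the same renumbering could be applied to a raw MentionSpan value without going through the coreference API. The sketch below is a hypothetical string-level helper, not part of Udapi; it assumes that a span is a comma-separated list of single ids or start-end ranges, and that one token was inserted at position inserted_at. It only renumbers; whether a mention such as Yahoo (span 19) should be extended to cover the new punctuation is an annotation decision that it does not make.

```python
def shift_id(node_id: str, inserted_at: int) -> str:
    """Renumber one id after a new token took position `inserted_at`."""
    major, dot, minor = node_id.partition(".")
    number = int(major)
    if number >= inserted_at:
        number += 1
    return f"{number}{dot}{minor}"


def shift_span(span: str, inserted_at: int) -> str:
    """Renumber a MentionSpan value such as "21-22" (hypothetical helper).

    Assumed format: comma-separated parts, each a single id or "start-end".
    """
    parts = []
    for part in span.split(","):
        ids = [shift_id(i, inserted_at) for i in part.split("-")]
        parts.append("-".join(ids))
    return ",".join(parts)


# The new "!" token was inserted at position 20 in the example above.
print(shift_span("21-22", 20))  # mention starting at "intenciones"
print(shift_span("19", 20))     # mention "Yahoo", before the insertion point
```

Under these assumptions, MentionSpan=21-22 on intenciones would become 22-23, and MentionSpan=19 would stay as it is.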