Add support MasakhaNER v2 dataset #3013

Merged
alanakbik merged 2 commits into master from add-support-maskhaner-v2 on Dec 14, 2022

@stefan-it (Member) commented Dec 6, 2022

Hi,

this PR adds support for the recently released version 2 of the MasakhaNER dataset!

A new version argument was added and now defaults to "v2" of the dataset. Demo usage:

from flair.datasets import NER_MASAKHANE

v1_all = NER_MASAKHANE(languages="all", version="v1")
v2_all = NER_MASAKHANE(languages="all", version="v2")
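
To see what each version provides for a single language, the split sizes of a loaded corpus can be read off directly. This is a minimal sketch, assuming the corpus object exposes `train`, `dev` and `test` splits that support `len()`; `split_sizes` and `report` are hypothetical helper names that are not part of this PR, and calling `report` downloads the data on first use:

```python
def split_sizes(corpus):
    """Return (train, dev, test) sentence counts for a corpus-like object."""
    # A missing split (None) is counted as zero sentences.
    return tuple(
        len(split) if split is not None else 0
        for split in (corpus.train, corpus.dev, corpus.test)
    )


def report(language="yor", version="v2"):
    """Load one language of MasakhaNER and print its split sizes."""
    # Imported lazily so that split_sizes stays usable without Flair installed.
    from flair.datasets import NER_MASAKHANE

    corpus = NER_MASAKHANE(languages=language, version=version)
    print(language, version, split_sizes(corpus))
```

Calling `report("yor", "v1")` and `report("yor", "v2")` side by side would show how the two versions differ in size for the same language.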

Closes #2971.

Unit tests

This PR also adds unit tests for v1 and v2 of the dataset. Here's a comparison between the number of sentences reported in the paper and the number parsed by this implementation:

Version 1 - bold denotes difference:

| Language | Paper (Train / Dev / Test) | Flair (Train / Dev / Test) |
| --- | --- | --- |
| amh | 1750 / 250 / 500 | 1750 / 250 / 500 |
| hau | 1903 / 272 / 545 | 1903 / 272 / 545 |
| ibo | 2233 / 319 / 638 | 2233 / 319 / **639** |
| kin | 2110 / 301 / 604 | 2110 / 301 / 604 |
| lug | 2003 / 200 / 401 | 2003 / 200 / 401 |
| luo | 644 / 92 / 185 | 644 / 92 / 185 |
| pcm | 2100 / 300 / 600 | 2100 / 300 / 600 |
| swa | 2104 / 300 / 602 | 2104 / 300 / 602 |
| wol | 1871 / 267 / 536 | 1871 / 267 / 536 |
| yor | 2124 / 303 / 608 | 2124 / 303 / 608 |

Version 2 - bold denotes difference:

| Language | Paper (Train / Dev / Test) | Flair (Train / Dev / Test) |
| --- | --- | --- |
| bam | 4462 / 638 / 1274 | 4462 / 638 / 1274 |
| bbj | 3384 / 483 / 966 | 3384 / 483 / 966 |
| ewe | 3505 / 501 / 1001 | 3505 / 501 / 1001 |
| fon | 4343 / 621 / 1240 | 4343 / 621 / 1240 |
| hau | 5716 / 816 / 1633 | 5716 / 816 / 1633 |
| ibo | 7634 / 1090 / 2181 | 7634 / 1090 / 2181 |
| kin | 7825 / 1118 / 2235 | 7825 / 1118 / 2235 |
| lug | 4942 / 706 / 1412 | 4942 / 706 / 1412 |
| mos | 4532 / 648 / 1294 | 4532 / 648 / 1294 |
| pcm | 5646 / 806 / 1613 | 5646 / 806 / 1613 |
| nya | 6250 / 893 / 1785 | 6250 / 893 / 1785 |
| sna | 6207 / 887 / 1773 | 6207 / 887 / 1773 |
| swa | 6593 / 942 / 1883 | 6593 / 942 / 1883 |
| tsn | 3489 / 499 / 996 | 3489 / 499 / 996 |
| twi | 4240 / 605 / 1211 | 4240 / 605 / 1211 |
| wol | 4593 / 656 / 1312 | 4593 / 656 / 1312 |
| xho | 5718 / 817 / 1633 | 5718 / 817 / 1633 |
| yor | 6877 / 983 / 1964 | **6876** / 983 / 1964 |
| zul | 5848 / 836 / 1670 | 5848 / 836 / 1670 |
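
The comparison in these tables can also be done programmatically. The following is only a sketch of the idea, not the unit tests added in this PR; `find_mismatches` is a hypothetical helper, and the sample rows are copied from the Version 1 table:

```python
SPLITS = ("train", "dev", "test")


def find_mismatches(paper, parsed):
    """Return {language: {split: (paper_count, parsed_count)}} for differing counts."""
    mismatches = {}
    for lang, expected in paper.items():
        actual = parsed[lang]
        # Keep only the splits whose sentence counts disagree.
        diff = {
            name: (e, a)
            for name, e, a in zip(SPLITS, expected, actual)
            if e != a
        }
        if diff:
            mismatches[lang] = diff
    return mismatches


# Two rows from the Version 1 table: (train, dev, test) sentence counts.
paper = {"ibo": (2233, 319, 638), "yor": (2124, 303, 608)}
parsed = {"ibo": (2233, 319, 639), "yor": (2124, 303, 608)}

print(find_mismatches(paper, parsed))  # {'ibo': {'test': (638, 639)}}
```

Reporting the differing pair rather than failing on the first mismatch makes off-by-one cases like `ibo` (v1 test) and `yor` (v2 train) easy to spot.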

@alanakbik (Collaborator) commented

@stefan-it thanks for adding this!

@alanakbik merged commit 0cd260e into master on Dec 14, 2022
@alanakbik deleted the add-support-maskhaner-v2 branch on December 14, 2022 10:35