yasuoka's diary: Taking on the EvaHan 2022 Bakeoff with esupar
Rather belatedly, I tried taking on the EvaHan 2022 Bakeoff with esupar. To start, I targeted the "closed modality" class, which uses SikuRoBERTa, and wrote the following for Google Colaboratory (GPU).
!pip install esupar
import os
url="https://github.com/CIRCSE/LT4HALA"
d=os.path.basename(url)
!test -d $d || git clone --depth=1 $url
!cp $d/2022/data_and_doc/EvaHan*.txt $d/2022/data_and_doc/*EvaHan*.py .
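# Drop the first character of each test file (presumably a byte-order mark) and strip carriage returns
!sed '1s/^.//' EvaHan_testa_raw.txt | tr -d '\015' > testa.txt
!sed '1s/^.//' EvaHan_testb_raw.txt | tr -d '\015' > testb.txt
!test -f zuozhuan_train_utf8.txt || unzip $d/2022/data_and_doc/zuozhuan_train_utf8.zip
# Clean the training text the same way, then start a new line after every "。/w"
!sed '1s/^.//' zuozhuan_train_utf8.txt | tr -d '\015' | nawk '{{gsub(/。\/w/,"。/w\n");print}}' > train.txt
# awk program: one CoNLL-U-style block per line, with the word in column 2 and its POS tag in column 4
s='NF>0{OFS="\t";printf("# text = ");for(i=1;i<=NF;i++){split($i,a,"/");printf("%s",a[1])}print"";for(i=1;i<=NF;i++){split($i,a,"/");print i,a[1],"_",a[2],"_","_","_","_","_","SpaceAfter=No"}print""}'
!nawk '{s}' train.txt > train.pos
# Fine-tune SIKU-BERT/sikuroberta on train.pos; the result is loaded below as roberta-han
!python -m esupar.train SIKU-BERT/sikuroberta roberta-han 32 /tmp train.pos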
from transformers import pipeline
tagger=pipeline(task="ner",model="roberta-han",device=0)
for f in ["testa","testb"]:
  with open(f+".txt","r",encoding="utf-8") as r:
    with open(f+"_close.txt","w",encoding="utf-8") as w:
      for s in r:
        d=[]
        if s.strip()!="":
          # Split the line into sentences, keeping the final "。" on each one
          t=s.split("。")
          u=[j+"。" if i<len(t)-1 else j for i,j in enumerate(t) if j!=""]
          v=tagger(u)
          # Collect [tag, text] pairs for every token the tagger returned
          for j,k in zip(u,v):
            d+=[[t["entity"],j[t["start"]:t["end"]]] for t in k]
          # Walk backwards, gluing each I- token onto the B-/I- token before it
          for i in range(len(d)-1,0,-1):
            if d[i][0].startswith("I-"):
              if d[i-1][0].startswith("B-"):
                e=d.pop(i)
                d[i-1]=[d[i-1][0][2:],d[i-1][1]+e[1]]
              elif d[i-1][0].startswith("I-"):
                e=d.pop(i)
                d[i-1][1]=d[i-1][1]+e[1]
          # Strip any B-/I- prefix that is still left
          for i in range(len(d)):
            if d[i][0].startswith("B-") or d[i][0].startswith("I-"):
              d[i][0]=d[i][0][2:]
        # One output line per input line, in word/POS format (empty lines stay empty)
        print(" ".join(t[1]+"/"+t[0] for t in d),file=w)
!python eval_EvaHan_2022_FINAL.py testa_close.txt EvaHan_testa_gold.txt
!python eval_EvaHan_2022_FINAL.py testb_close.txt EvaHan_testb_gold.txt
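The least obvious part of the code above is the backward loop that merges B-/I- tagged tokens into words. The following is a minimal sketch of the same idea written as a forward pass, for illustration only. The function name `merge_bio` and the sample tags are hypothetical; they are not taken from esupar or from real model output.

```python
def merge_bio(tagged):
    """Merge per-token (label, text) pairs into (label, word) pairs.

    Hypothetical helper: a forward-pass rewrite of the backward loop above.
    """
    words = []
    inside = False  # True while the previous token carried a B- or I- prefix
    for label, text in tagged:
        if label.startswith("I-") and inside:
            # Continuation token: glue its text onto the word already open
            words[-1][1] += text
            continue
        # Otherwise start a new word, stripping any B-/I- prefix from the label
        inside = label.startswith(("B-", "I-"))
        words.append([label[2:] if inside else label, text])
    return [(label, word) for label, word in words]

# Made-up tagger output for illustration (not real predictions)
sample = [("B-nr", "鄭"), ("I-nr", "伯"), ("v", "克"), ("n", "段"),
          ("p", "于"), ("ns", "鄢"), ("w", "。")]
line = " ".join(word + "/" + label for label, word in merge_bio(sample))
print(line)
```

As in the original loop, the word takes its label from the token that opens it, and an I- token with nothing to attach to simply starts a new word.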
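That said, esupar has been finely tuned for Universal Dependencies, and its training module is a black box, so even if I call this "closed modality," I doubt anyone will believe me. For what it is worth, on my (Koichi Yasuoka's) setup the following results came out in about 15 minutes.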
The result of testa_close.txt is:
+-----------------+---------+---------+---------+
| Task | P | R | F1 |
+-----------------+---------+---------+---------+
|Word segmentation| 95.3882 | 96.6869 | 96.0332 |
+-----------------+---------+---------+---------+
| Pos tagging | 91.0675 | 92.3074 | 91.6833 |
+-----------------+---------+---------+---------+
The result of testb_close.txt is:
+-----------------+---------+---------+---------+
| Task | P | R | F1 |
+-----------------+---------+---------+---------+
|Word segmentation| 94.5596 | 92.0145 | 93.2697 |
+-----------------+---------+---------+---------+
| Pos tagging | 88.0331 | 85.6636 | 86.8322 |
+-----------------+---------+---------+---------+
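As a quick sanity check on the tables above, here is a small sketch that recomputes each F1 from the reported P and R. It assumes the evaluation script uses the usual harmonic-mean definition, F1 = 2PR/(P+R); the function name `f1` and the dictionary layout are only for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) copied from the tables above
reported = {
    "testa word segmentation": (95.3882, 96.6869, 96.0332),
    "testa POS tagging": (91.0675, 92.3074, 91.6833),
    "testb word segmentation": (94.5596, 92.0145, 93.2697),
    "testb POS tagging": (88.0331, 85.6636, 86.8322),
}

for name, (p, r, f) in reported.items():
    # P and R are already rounded, so allow a small tolerance
    print(name, "recomputed:", round(f1(p, r), 4), "reported:", f)
```

All four recomputed values agree with the reported F1 to within rounding of the inputs, which is consistent with the assumed definition.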
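Unfortunately, this falls just short of the Fudan University team. What is more, with esupar as it was four months ago, I think the gap would have been wider. The bar at the world level really is high.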