
yasuoka's Journal: Taking on the EvaHan 2022 Bakeoff with Various Pretrained Models

Journal by yasuoka

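I took on the EvaHan 2022 Bakeoff in my journal entries of the past two days. This time I rewrote the program further so that the pretrained model can be replaced with roberta-classical-chinese-base-upos. On Google Colaboratory (GPU), it looks like this.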

pretrained_model="KoichiYasuoka/roberta-classical-chinese-base-upos"
!pip install transformers
import os
url="https://github.com/CIRCSE/LT4HALA"
d=os.path.basename(url)
!test -d $d || git clone --depth=1 $url
!cp $d/2022/data_and_doc/EvaHan*.txt $d/2022/data_and_doc/*EvaHan*.py .
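# Drop the first character of line 1 (apparently a byte order mark) and remove carriage returns
!sed '1s/^.//' EvaHan_testa_raw.txt | tr -d '\015' > testa.txt
!sed '1s/^.//' EvaHan_testb_raw.txt | tr -d '\015' > testb.txt
!test -f zuozhuan_train_utf8.txt || unzip $d/2022/data_and_doc/zuozhuan_train_utf8.zip
# Same cleanup for the training data, then break the line after each "。/w" and drop blank lines
!sed '1s/^.//' zuozhuan_train_utf8.txt | tr -d '\015' | nawk '{{gsub(/。\/w/,"。/w\n");print}}' | egrep -v '^ *$' > train.txt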
class EvaHanDataset(object):
  # Reads lines of "word/tag" items and builds one label per tokenizer token
  def __init__(self,file,tokenizer):
    self.ids,self.pos=[],[]
    label,cls,sep=set(),tokenizer.cls_token_id,tokenizer.sep_token_id
    with open(file,"r",encoding="utf-8") as r:
      for t in r:
        w,p=[k.split("/") for k in t.split()],[]
        v=tokenizer([k[0] for k in w],add_special_tokens=False)["input_ids"]
        for x,y in zip(v,w):
          # An item without a tag is given the tag "w"
          if len(y)==1:
            y.append("w")
          # One token: the plain tag; several tokens: B-tag followed by I-tag
          if len(x)==1:
            p.append(y[1])
          elif len(x)>1:
            p.extend(["B-"+y[1]]+["I-"+y[1]]*(len(x)-1))
        # The CLS and SEP tokens at both ends are labeled "w"
        self.ids.append([cls]+sum(v,[])+[sep])
        self.pos.append(["w"]+p+["w"])
        label=set(sum([self.pos[-1],list(label)],[]))
    self.label2id={l:i for i,l in enumerate(sorted(label))}
  __len__=lambda self:len(self.ids)
  __getitem__=lambda self,i:{"input_ids":self.ids[i],"labels":[self.label2id[t] for t in self.pos[i]]}
from transformers import AutoTokenizer,AutoConfig,AutoModel,AutoModelForTokenClassification,DataCollatorForTokenClassification,TrainingArguments,Trainer,pipeline
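tkz=AutoTokenizer.from_pretrained(pretrained_model)
# Load the base model without any task head and save it, so that a new token-classification head for the EvaHan labels can be attached below
mdl=AutoModel.from_pretrained(pretrained_model)
dir="."+pretrained_model.replace("/",".")
mdl.save_pretrained(dir+"/tmp-model")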
trainDS=EvaHanDataset("train.txt",tkz)
cfg=AutoConfig.from_pretrained(dir+"/tmp-model",num_labels=len(trainDS.label2id),label2id=trainDS.label2id,id2label={i:l for l,i in trainDS.label2id.items()})
arg=TrainingArguments(per_device_train_batch_size=32,output_dir="/tmp",overwrite_output_dir=True,save_total_limit=2,save_strategy="epoch")
trn=Trainer(model=AutoModelForTokenClassification.from_pretrained(dir+"/tmp-model",config=cfg),args=arg,train_dataset=trainDS,data_collator=DataCollatorForTokenClassification(tkz))
trn.train()
trn.save_model(dir+"/evahan2022-model")
tkz.save_pretrained(dir+"/evahan2022-model")
tagger=pipeline(task="ner",model=dir+"/evahan2022-model",device=0)
for f in ["testa","testb"]:
  # Cut each raw line into pieces ending in "。"; e holds the separator to print after each piece
  with open(f+".txt","r",encoding="utf-8") as r:
    u,e=[],[]
    for s in r:
      t=s.split("。")
      w=[j+"。" if i<len(t)-1 else j for i,j in enumerate(t) if j!=""]
      if len(w)==0:
        e[-1]=e[-1]+"\n"
      else:
        u.extend(w)
        e.extend([" "]*(len(w)-1)+["\n"])
  with open(f+dir+".txt","w",encoding="utf-8") as w:
    for s,v,z in zip(u,tagger(u),e):
      d=[[t["entity"],s[t["start"]:t["end"]]] for t in v]
      # Working backwards, merge each I- token into the preceding B- or I- token
      # (the popped item is named m so that it does not shadow the separator list e)
      for i in range(len(d)-1,0,-1):
        if d[i][0].startswith("I-"):
          if d[i-1][0].startswith("B-"):
            m=d.pop(i)
            d[i-1]=[d[i-1][0][2:],d[i-1][1]+m[1]]
          elif d[i-1][0].startswith("I-"):
            m=d.pop(i)
            d[i-1][1]=d[i-1][1]+m[1]
      # Strip any B-/I- prefixes that are left over
      for i in range(len(d)):
        if d[i][0].startswith("B-") or d[i][0].startswith("I-"):
          d[i][0]=d[i][0][2:]
      print(" ".join(t[1]+"/"+t[0] for t in d),file=w,end=z)
!python eval_EvaHan_2022_FINAL.py testa{dir}.txt EvaHan_testa_gold.txt
!python eval_EvaHan_2022_FINAL.py testb{dir}.txt EvaHan_testb_gold.txt
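
The program above reduces joint word segmentation and POS tagging to token classification: a word that fits in one token keeps its plain tag, and a longer word becomes B-tag followed by I-tag labels, which are merged back into words after prediction. The following is only a minimal sketch of that round trip, not the code used for the results. It assumes one token per character, and the sample sentence, its tags, and the function names encode and decode are made up for illustration.

```python
def encode(words):
    """Turn (word, tag) pairs into one label per character."""
    labels = []
    for word, tag in words:
        if len(word) == 1:
            labels.append(tag)  # single-character word: plain tag
        else:
            labels.extend(["B-" + tag] + ["I-" + tag] * (len(word) - 1))
    return labels


def decode(text, labels):
    """Merge per-character labels back into (word, tag) pairs."""
    words, open_word = [], False
    for ch, label in zip(text, labels):
        if label.startswith("I-") and open_word:
            # continue the word opened by the preceding B- or I- label
            words[-1] = (words[-1][0] + ch, words[-1][1])
        else:
            # start a new word; only B-/I- labels can be continued
            open_word = label.startswith(("B-", "I-"))
            words.append((ch, label[2:] if open_word else label))
    return words


# Hypothetical sample; the tags are illustrative only
sample = [("鄭伯", "nr"), ("克", "v"), ("段", "nr"), ("于", "p"), ("鄢", "ns"), ("。", "w")]
labels = encode(sample)
text = "".join(word for word, _ in sample)
print(labels)
print(decode(text, labels))
```

As in the program above, a merged word takes the tag of its first token, and an I- label that follows a plain tag starts a new word.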

On my (Koichi Yasuoka's) end, the following results came out in about 10 minutes.

The result of testa.KoichiYasuoka.roberta-classical-chinese-base-upos.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 95.6372 | 96.7829 | 96.2066 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 91.4782 | 92.5740 | 92.0228 |
+-----------------+---------+---------+---------+

The result of testb.KoichiYasuoka.roberta-classical-chinese-base-upos.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 94.6948 | 91.7414 | 93.1947 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 88.1912 | 85.4407 | 86.7942 |
+-----------------+---------+---------+---------+
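
The F1 column in these tables is consistent with the usual harmonic mean of precision (P) and recall (R). The short check below assumes that definition; it is not taken from the evaluation script, and the function name f1 is only illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# P and R from the TestA table above
seg_f1 = f1(95.6372, 96.7829)
pos_f1 = f1(91.4782, 92.5740)
print(round(seg_f1, 4), round(pos_f1, 4))
```

Because P and R are themselves rounded to four decimal places, the recomputed values can differ from the table in the last digit.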

Compared with yesterday's SikuRoBERTa, TestA is slightly better but TestB is worse. Incidentally, when I changed line 1 to pretrained_model="KoichiYasuoka/roberta-classical-chinese-base-char", I got the following results on my end.

The result of testa.KoichiYasuoka.roberta-classical-chinese-base-char.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 95.6555 | 96.6620 | 96.1562 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 91.5679 | 92.5314 | 92.0471 |
+-----------------+---------+---------+---------+

The result of testb.KoichiYasuoka.roberta-classical-chinese-base-char.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 94.5461 | 91.3866 | 92.9395 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 88.3888 | 85.4351 | 86.8869 |
+-----------------+---------+---------+---------+
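
Since only line 1 changes between runs, each model writes to its own output files: the program builds a suffix from the model name by replacing "/" with ".". A small sketch of that naming rule follows; the helper output_names is hypothetical and does not appear in the program.

```python
def output_names(pretrained_model, test_sets=("testa", "testb")):
    """Return the result file names the program would write for a model."""
    suffix = "." + pretrained_model.replace("/", ".")  # same rule as the variable dir
    return [name + suffix + ".txt" for name in test_sets]


names = output_names("KoichiYasuoka/roberta-classical-chinese-base-char")
print(names)
```

These names match the headers printed by the evaluation script, so results from different models do not overwrite each other.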

Alternatively, when I changed line 1 to pretrained_model="KoichiYasuoka/bert-ancient-chinese-base-upos", I got the following results on my end.

The result of testa.KoichiYasuoka.bert-ancient-chinese-base-upos.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 95.8582 | 96.9998 | 96.4256 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 91.6673 | 92.7589 | 92.2098 |
+-----------------+---------+---------+---------+

The result of testb.KoichiYasuoka.bert-ancient-chinese-base-upos.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 94.9103 | 92.2411 | 93.5567 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 88.5648 | 86.0741 | 87.3017 |
+-----------------+---------+---------+---------+

When I changed line 1 to pretrained_model="Jihuai/bert-ancient-chinese", I got the following results on my end.

The result of testa.Jihuai.bert-ancient-chinese.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 96.0153 | 97.0495 | 96.5297 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 91.7915 | 92.7802 | 92.2832 |
+-----------------+---------+---------+---------+

The result of testb.Jihuai.bert-ancient-chinese.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 94.9311 | 92.1185 | 93.5037 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 88.4839 | 85.8624 | 87.1534 |
+-----------------+---------+---------+---------+

When I changed line 1 to pretrained_model="ethanyt/guwenbert-base", I got the following results on my end.

The result of testa.ethanyt.guwenbert-base.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 91.8585 | 94.1737 | 93.0017 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 83.9147 | 86.0296 | 84.9590 |
+-----------------+---------+---------+---------+

The result of testb.ethanyt.guwenbert-base.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 90.2871 | 88.8548 | 89.5652 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 80.1835 | 78.9115 | 79.5424 |
+-----------------+---------+---------+---------+

When I changed line 1 to pretrained_model="uer/gpt2-chinese-ancient", I got the following results on my end.

The result of testa.uer.gpt2-chinese-ancient.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 87.7262 | 93.0433 | 90.3066 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 80.3559 | 85.2263 | 82.7195 |
+-----------------+---------+---------+---------+

The result of testb.uer.gpt2-chinese-ancient.txt is:
+-----------------+---------+---------+---------+
|       Task      |    P    |    R    |    F1   |
+-----------------+---------+---------+---------+
|Word segmentation| 88.7446 | 91.5074 | 90.1048 |
+-----------------+---------+---------+---------+
|   Pos tagging   | 77.3378 | 79.7455 | 78.5232 |
+-----------------+---------+---------+---------+

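From a quick comparison, bert-ancient-chinese gives the best values on TestA, and bert-ancient-chinese-base-upos gives the best values on TestB. TestA is apparently taken from the Zuozhuan (春秋左氏伝) and TestB from part of the Shiji (史記), but how should results like these be interpreted?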
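
The comparison can be checked by collecting the F1 values from the tables in this entry and picking the best model per column. The numbers below are copied from those tables; the dictionary layout and the helper best are only an illustrative sketch.

```python
# F1 values from the tables above:
# (TestA segmentation, TestA POS, TestB segmentation, TestB POS)
F1 = {
    "KoichiYasuoka/roberta-classical-chinese-base-upos": (96.2066, 92.0228, 93.1947, 86.7942),
    "KoichiYasuoka/roberta-classical-chinese-base-char": (96.1562, 92.0471, 92.9395, 86.8869),
    "KoichiYasuoka/bert-ancient-chinese-base-upos": (96.4256, 92.2098, 93.5567, 87.3017),
    "Jihuai/bert-ancient-chinese": (96.5297, 92.2832, 93.5037, 87.1534),
    "ethanyt/guwenbert-base": (93.0017, 84.9590, 89.5652, 79.5424),
    "uer/gpt2-chinese-ancient": (90.3066, 82.7195, 90.1048, 78.5232),
}
COLUMNS = ("TestA seg", "TestA pos", "TestB seg", "TestB pos")


def best(column):
    """Return the model with the highest F1 in the given column."""
    index = COLUMNS.index(column)
    return max(F1, key=lambda model: F1[model][index])


for column in COLUMNS:
    print(column, best(column))
```

Each value comes from a single training run, so the small gaps between the top models may not be significant.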
