
yasuoka's diary: Tackling Named Entity Recognition on the Classical Chinese (Kanbun) C-CLUE with RoBERTa-Classical-Chinese
While reading Zhongqing JIANG, Zengqing WU, and Chuan XIAO's "Token-Free Cross-Lingual Named Entity Recognition for Classical Chinese" (15th Forum on Data Engineering and Information Management, 1b-6-2, March 6, 2023), I got the feeling that RoBERTa-Classical-Chinese might squeeze out a bit more accuracy. So I decided to try named entity recognition on the Classical Chinese (Kanbun) C-CLUE dataset with Transformers' run_ner.py and roberta-classical-chinese-base-char. On Google Colaboratory (GPU), it goes something like this.
!pip install transformers datasets evaluate seqeval accelerate
!test -d C-CLUE || git clone --depth=1 https://github.com/jizijing/C-CLUE
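# pin the transformers checkout to the installed package's version, so the bundled run_ner.py matches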
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip list | awk '{s}'` https://github.com/huggingface/transformers
def makejson(token_file,tag_file,json_file):
  # convert C-CLUE's space-separated token and tag files into the JSON Lines format run_ner.py expects
  with open(token_file,"r",encoding="utf-8") as r1, open(tag_file,"r",encoding="utf-8") as r2, open(json_file,"w",encoding="utf-8") as w:
    for s,t in zip(r1,r2):
      print('{"tokens":["'+s.rstrip().replace(' ','","')+'"],"tags":["'+t.rstrip().replace(' ','","')+'"]}',file=w)
makejson("C-CLUE/data_ner/source.txt","C-CLUE/data_ner/target.txt","train.json")
makejson("C-CLUE/data_ner/dev.txt","C-CLUE/data_ner/dev-label.txt","dev.json")
makejson("C-CLUE/data_ner/test1.txt","C-CLUE/data_ner/test_tgt.txt","test.json")
!python transformers/examples/pytorch/token-classification/run_ner.py --model_name_or_path KoichiYasuoka/roberta-classical-chinese-base-char --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./roberta-classical-chinese-base-ner --do_train --do_eval --do_predict
In my (Koichi Yasuoka's) environment, roberta-classical-chinese-base-ner was finished in about five minutes, with the following metrics.
***** train metrics *****
epoch = 3.0
train_loss = 0.2081
train_runtime = 0:02:18.68
train_samples = 1902
train_samples_per_second = 41.145
train_steps_per_second = 5.149
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9087
eval_f1 = 0.626
eval_loss = 0.3011
eval_precision = 0.5595
eval_recall = 0.7103
eval_runtime = 0:00:02.10
eval_samples = 238
eval_samples_per_second = 113.254
***** predict metrics *****
predict_accuracy = 0.9124
predict_f1 = 0.6612
predict_loss = 0.2924
predict_precision = 0.5743
predict_recall = 0.7792
predict_runtime = 0:00:02.06
predict_samples_per_second = 115.185
predict_steps_per_second = 14.519
With an F1-score of 66.12, precision of 57.43, and recall of 77.92, the precision still looks insufficient. That said, nobody seems to use this C-CLUE and it is no longer maintained, so to be honest I am not very keen on tuning specifically for it.
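Still, the fine-tuned model itself is easy to poke at with a token-classification pipeline over the output directory; a minimal sketch (aggregation_strategy="simple" and the test sentence 孟子見梁惠王 are just illustrative choices, not anything taken from C-CLUE):
from transformers import pipeline
ner=pipeline("token-classification",model="./roberta-classical-chinese-base-ner",aggregation_strategy="simple")
print(ner("孟子見梁惠王"))
Each dict in the output gives the predicted entity_group, the matched span, and a confidence score.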