
yasuoka's diary: Tackling Named Entity Recognition on the Classical Chinese (Kanbun) C-CLUE with RoBERTa-Classical-Chinese
While reading Zhongqing JIANG, Zengqing WU, and Chuan XIAO's "Token-Free Cross-Lingual Named Entity Recognition for Classical Chinese" (15th Forum on Data Engineering and Information Management, 1b-6-2, March 6, 2023), I got the feeling that RoBERTa-Classical-Chinese might squeeze out a bit more accuracy. So I decided to try named entity recognition on the Classical Chinese (Kanbun) C-CLUE dataset with Transformers' run_ner.py and roberta-classical-chinese-base-char. On Google Colaboratory (GPU), it goes something like this.
!pip install transformers datasets evaluate seqeval accelerate
!test -d C-CLUE || git clone --depth=1 https://github.com/jizijing/C-CLUE
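# pin the transformers checkout to the installed package's version, so the bundled run_ner.py matches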
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip list | awk '{s}'` https://github.com/huggingface/transformers
def makejson(token_file,tag_file,json_file):
  # convert C-CLUE's space-separated token and tag files into the JSON Lines format run_ner.py expects
  with open(token_file,"r",encoding="utf-8") as r1, open(tag_file,"r",encoding="utf-8") as r2, open(json_file,"w",encoding="utf-8") as w:
    for s,t in zip(r1,r2):
      print('{"tokens":["'+s.rstrip().replace(' ','","')+'"],"tags":["'+t.rstrip().replace(' ','","')+'"]}',file=w)
makejson("C-CLUE/data_ner/source.txt","C-CLUE/data_ner/target.txt","train.json")
makejson("C-CLUE/data_ner/dev.txt","C-CLUE/data_ner/dev-label.txt","dev.json")
makejson("C-CLUE/data_ner/test1.txt","C-CLUE/data_ner/test_tgt.txt","test.json")
!python transformers/examples/pytorch/token-classification/run_ner.py --model_name_or_path KoichiYasuoka/roberta-classical-chinese-base-char --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./roberta-classical-chinese-base-ner --do_train --do_eval --do_predict
In my (Koichi Yasuoka's) environment, roberta-classical-chinese-base-ner was finished in about five minutes, with the following metrics.
***** train metrics *****
epoch = 3.0
train_loss = 0.2081
train_runtime = 0:02:18.68
train_samples = 1902
train_samples_per_second = 41.145
train_steps_per_second = 5.149
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9087
eval_f1 = 0.626
eval_loss = 0.3011
eval_precision = 0.5595
eval_recall = 0.7103
eval_runtime = 0:00:02.10
eval_samples = 238
eval_samples_per_second = 113.254
***** predict metrics *****
predict_accuracy = 0.9124
predict_f1 = 0.6612
predict_loss = 0.2924
predict_precision = 0.5743
predict_recall = 0.7792
predict_runtime = 0:00:02.06
predict_samples_per_second = 115.185
predict_steps_per_second = 14.519
With an F1-score of 66.12, precision of 57.43, and recall of 77.92, the precision still looks insufficient. That said, nobody seems to use this C-CLUE and it is no longer maintained, so to be honest I am not very keen on tuning specifically for it.
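Still, the fine-tuned model itself is easy to poke at with a token-classification pipeline over the output directory; a minimal sketch (aggregation_strategy="simple" and the test sentence 孟子見梁惠王 are just illustrative choices, not anything taken from C-CLUE):
from transformers import pipeline
ner=pipeline("token-classification",model="./roberta-classical-chinese-base-ner",aggregation_strategy="simple")
print(ner("孟子見梁惠王"))
Each dict in the output gives the predicted entity_group, the matched span, and a confidence score.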