Building a Japanese long-unit-word morphological analyzer with Transformers, bert-base-japanese-char-extended, and UD_Japanese-GSD | yasuoka's diary


Diary entry by yasuoka, October 19, 2021, 19:34

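The topic came up at today's NINJAL Salon, so I decided to prototype a Japanese long-unit-word (長単位) morphological analyzer. The approach is to take the LUWPOS annotations in UD_Japanese-GSD and recast them as a character-level sequence labeling task over long-unit words, using run_ner.py from Transformers together with bert-base-japanese-char-extended. On Google Colaboratory (GPU runtime), it looks like this.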

!pip install transformers datasets seqeval
!test -d UD_Japanese-GSD || git clone https://github.com/universaldependencies/UD_Japanese-GSD
# Fetch the run_ner.py that matches the installed transformers version
!test -f run_ner.py || curl -LO https://raw.githubusercontent.com/huggingface/transformers/v`pip list | sed -n 's/^transformers *\([^ ]*\) *$/\1/p'`/examples/pytorch/token-classification/run_ner.py
# Convert each CoNLL-U split into JSON lines of characters and B-/I- LUWPOS tags
for d in ["train","dev","test"]:
  with open("UD_Japanese-GSD/ja_gsd-ud-"+d+".conllu","r",encoding="utf-8") as f:
    r=f.read()
  with open(d+".json","w",encoding="utf-8") as f:
    tokens=[]
    tags=[]
    for s in r.split("\n"):
      t=s.split("\t")
      if len(t)==10 and not s.startswith("#"):
        for c in t[1]:
          tokens.append(c)
        # First character after "LUWBILabel=" is B or I
        b=t[9][t[9].index("LUWBILabel=")+11]
        p=[u[7:] for u in t[9].split("|") if u.startswith("LUWPOS=")][0]
        if p=="名詞-普通名詞-副詞可能":
          p="名詞-普通名詞-一般"
        p=p.replace("-","/")
        tags.extend([b+"-"+p]+["I-"+p]*(len(t[1])-1))
      else:
        # Any non-token line ends the current sentence
        if len(tokens)>0:
          print("{\"tokens\":[\""+"\",\"".join(tokens)+"\"],\"tags\":[\""+"\",\"".join(tags)+"\"]}",file=f)
        tokens=[]
        tags=[]
    if len(tokens)>0:
      print("{\"tokens\":[\""+"\",\"".join(tokens)+"\"],\"tags\":[\""+"\",\"".join(tags)+"\"]}",file=f)
!python run_ner.py --model_name_or_path KoichiYasuoka/bert-base-japanese-char-extended --train_file train.json --validation_file dev.json --test_file test.json --output_dir ja_luw.pos --do_train --do_eval --do_predict
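To make the conversion step concrete, here is a minimal standalone sketch of how CoNLL-U rows turn into character-level tags. The three rows are hypothetical (made up for illustration, not copied from the treebank), `row_to_char_tags` is a name introduced here, and the sketch assumes every LUWBILabel value is a single letter, B or I. The "-" in LUWPOS is replaced by "/", presumably so that the only hyphen left in a tag is the one after the B/I prefix.

```python
import json

def row_to_char_tags(row):
    """Turn one CoNLL-U token row into (characters, tags)."""
    t = row.split("\t")
    form, misc = t[1], t[9]
    # MISC is a "|"-separated list of Key=Value pairs
    feats = dict(u.split("=", 1) for u in misc.split("|") if "=" in u)
    b = feats["LUWBILabel"]          # assumed to be "B" or "I"
    p = feats["LUWPOS"]
    if p == "名詞-普通名詞-副詞可能":   # same relabeling as in the script above
        p = "名詞-普通名詞-一般"
    p = p.replace("-", "/")
    # The first character inherits B/I from the token; the rest are always I
    return list(form), [b + "-" + p] + ["I-" + p] * (len(form) - 1)

# Hypothetical rows: only FORM (column 2) and MISC (column 10) matter here
rows = [
    "1\t小\t_\t_\t_\t_\t_\t_\t_\tLUWBILabel=B|LUWPOS=名詞-普通名詞-一般",
    "2\t学校\t_\t_\t_\t_\t_\t_\t_\tLUWBILabel=I|LUWPOS=名詞-普通名詞-一般",
    "3\tの\t_\t_\t_\t_\t_\t_\t_\tLUWBILabel=B|LUWPOS=助詞-格助詞",
]

tokens, tags = [], []
for row in rows:
    chars, char_tags = row_to_char_tags(row)
    tokens.extend(chars)
    tags.extend(char_tags)

# One JSON object per sentence, in the shape the script above writes out
line = json.dumps({"tokens": tokens, "tags": tags}, ensure_ascii=False)
print(line)
```

Using `json.dumps` instead of string concatenation is a design choice of this sketch: it escapes quotes and backslashes in the text automatically.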

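One LUWPOS value, 名詞-普通名詞-副詞可能, appeared in only a single place, which bothered me, so I changed it to 名詞-普通名詞-一般. On my (Koichi Yasuoka's) machine, ja_luw.pos was ready in about 15 minutes, with the following metrics.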

***** train metrics *****
  epoch                    = 3.0
  train_loss               = 0.2678
  train_runtime            = 0:13:01.48
  train_samples            = 7050
  train_samples_per_second = 27.064
  train_steps_per_second   = 3.386
***** eval metrics *****
  epoch                   = 3.0
  eval_accuracy           = 0.9637
  eval_f1                 = 0.9705
  eval_loss               = 0.1588
  eval_precision          = 0.9678
  eval_recall             = 0.9734
  eval_runtime            = 0:00:06.57
  eval_samples            = 507
  eval_samples_per_second = 77.053
  eval_steps_per_second   = 9.727
***** predict metrics *****
  predict_accuracy           = 0.9669
  predict_f1                 = 0.9672
  predict_loss               = 0.1594
  predict_precision          = 0.9624
  predict_recall             = 0.9721
  predict_runtime            = 0:00:06.92
  predict_samples_per_second = 78.466
  predict_steps_per_second   = 9.826
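As a quick sanity check on these numbers, the sketch below recomputes F1 from the reported precision and recall. It assumes the reported F1 is the harmonic mean of the reported precision and recall. Because the inputs are already rounded to four digits, the recomputed values can only be expected to agree to about three decimal places.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1), copied from the metrics above
reported = {
    "eval": (0.9678, 0.9734, 0.9705),
    "predict": (0.9624, 0.9721, 0.9672),
}

for name, (p, r, f) in reported.items():
    print(name, "recomputed:", round(f1(p, r), 4), "reported:", f)
```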

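Accuracy and F1 come out at around 96 to 97%, which is not bad. Let's use the resulting ja_luw.pos to analyze the sentence 「全学年にわたって小学校の国語の教科書に大量の挿し絵が用いられている」 ("Large numbers of illustrations are used in elementary-school Japanese textbooks across all grades").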

from transformers import AutoModelForTokenClassification,AutoTokenizer,TokenClassificationPipeline
mdl=AutoModelForTokenClassification.from_pretrained("ja_luw.pos")
tkz=AutoTokenizer.from_pretrained("ja_luw.pos")
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,aggregation_strategy="simple")
d=nlp(inputs="全学年にわたって小学校の国語の教科書に大量の挿し絵が用いられている")
print(d)

On my machine, the result was as follows.

[{'entity_group': '名詞/普通名詞/一般', 'score': 0.99953014, 'word': '全学年', 'start': 0, 'end': 3},
 {'entity_group': '助詞/格助詞', 'score': 0.99852294, 'word': 'にわたって', 'start': 3, 'end': 8},
 {'entity_group': '名詞/普通名詞/一般', 'score': 0.9992971, 'word': '小学校', 'start': 8, 'end': 11},
 {'entity_group': '助詞/格助詞', 'score': 0.99976945, 'word': 'の', 'start': 11, 'end': 12},
 {'entity_group': '名詞/普通名詞/一般', 'score': 0.99958575, 'word': '国語', 'start': 12, 'end': 14},
 {'entity_group': '助詞/格助詞', 'score': 0.99973565, 'word': 'の', 'start': 14, 'end': 15},
 {'entity_group': '名詞/普通名詞/一般', 'score': 0.99962956, 'word': '教科書', 'start': 15, 'end': 18},
 {'entity_group': '助詞/格助詞', 'score': 0.99977046, 'word': 'に', 'start': 18, 'end': 19},
 {'entity_group': '形状詞/一般', 'score': 0.98628527, 'word': '大量', 'start': 19, 'end': 21},
 {'entity_group': '助詞/格助詞', 'score': 0.99945784, 'word': 'の', 'start': 21, 'end': 22},
 {'entity_group': '名詞/普通名詞/一般', 'score': 0.9996125, 'word': '挿し絵', 'start': 22, 'end': 25},
 {'entity_group': '助詞/格助詞', 'score': 0.99980253, 'word': 'が', 'start': 25, 'end': 26},
 {'entity_group': '動詞/一般/上一段/ア行', 'score': 0.9828689, 'word': '用い', 'start': 26, 'end': 28},
 {'entity_group': '助動詞/助動詞/レル', 'score': 0.99877167, 'word': 'られ', 'start': 28, 'end': 30},
 {'entity_group': '助動詞/上一段/ア行', 'score': 0.99854565, 'word': 'ている', 'start': 30, 'end': 33}]
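A result list like this can be turned into a plain segmentation with a few lines of Python. The following is a minimal sketch: `segments` and `covers_text` are hypothetical helper names, and `sample` holds only the first three entries of the output above so that the block runs without the trained model.

```python
def segments(result):
    """Return (word, tag) pairs from aggregated token-classification output."""
    return [(e["word"], e["entity_group"]) for e in result]

def covers_text(result, text):
    """Check that the spans are contiguous and reproduce the text exactly."""
    pos = 0
    for e in result:
        if e["start"] != pos or text[e["start"]:e["end"]] != e["word"]:
            return False
        pos = e["end"]
    return pos == len(text)

# First three entries copied from the output above
sample = [
    {'entity_group': '名詞/普通名詞/一般', 'score': 0.99953014, 'word': '全学年', 'start': 0, 'end': 3},
    {'entity_group': '助詞/格助詞', 'score': 0.99852294, 'word': 'にわたって', 'start': 3, 'end': 8},
    {'entity_group': '名詞/普通名詞/一般', 'score': 0.9992971, 'word': '小学校', 'start': 8, 'end': 11},
]
text = "全学年にわたって小学校"

for word, tag in segments(sample):
    print(word, tag, sep="\t")
print(covers_text(sample, text))
```

With the real pipeline, `segments(d)` would be applied to the full list `d`. The coverage check is useful because a gap between spans would mean that some characters were not assigned to any long-unit word.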

The segmentation is 「全学年」「にわたって」「小学校」「の」「国語」「の」「教科書」「に」「大量」「の」「挿し絵」「が」「用い」「られ」「ている」, and as far as I can see, the analysis is correct. I wonder whether this will work even better once UD_Japanese-GSDLUW is released.
