Dependency adjacency-matrix logits for Classical Chinese (Kanbun), as seen through roberta-classical-chinese-base-ud-goeswith | yasuoka's diary


Diary entry by yasuoka, October 25, 2022, 23:11

Following on from yesterday's entry, I have also modified the Classical Chinese model roberta-classical-chinese-base-ud-goeswith. For a start, let's run it on Google Colaboratory.

    !pip install transformers ufal.chu_liu_edmonds deplacy
    from transformers import pipeline
    nlp=pipeline(task="universal-dependencies",trust_remote_code=True,
      model="KoichiYasuoka/roberta-classical-chinese-base-ud-goeswith",
      aggregation_strategy="simple")
    doc=nlp("孟子見梁惠王")
    import deplacy
    deplacy.serve(doc,port=None)

When I ran dependency parsing on 「孟子見梁惠王」, I (Koichi Yasuoka) got the following result on my machine.

    # text = 孟子見梁惠王
    1	孟子	_	PROPN	_	NameType=Prs	2	nsubj	_	SpaceAfter=No
    2	見	_	VERB	_	_	0	root	_	SpaceAfter=No
    3	梁	_	PROPN	_	Case=Loc|NameType=Nat	5	nmod	_	SpaceAfter=No
    4	惠	_	PROPN	_	NameType=Prs	5	compound	_	SpaceAfter=No
    5	王	_	NOUN	_	_	2	obj	_	SpaceAfter=No
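To make the ten CoNLL-U columns easier to read, here is a minimal sketch that parses the result above with the standard library only. The helper name `parse_conllu` and the field list are my own illustration, not part of deplacy or the model's pipeline, and the sketch assumes plain integer token IDs (no multiword-token ranges such as `1-2`).

```python
# Column names of the CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Return one dict per token line; comment and blank lines are skipped.

    Hypothetical helper for illustration; assumes integer IDs only.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        row = dict(zip(CONLLU_FIELDS, cols))
        row["id"] = int(row["id"])
        row["head"] = int(row["head"])
        rows.append(row)
    return rows

# The parse of 孟子見梁惠王 quoted in this post, rebuilt with explicit tabs.
SAMPLE_ROWS = [
    ["1", "孟子", "_", "PROPN", "_", "NameType=Prs", "2", "nsubj", "_", "SpaceAfter=No"],
    ["2", "見", "_", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["3", "梁", "_", "PROPN", "_", "Case=Loc|NameType=Nat", "5", "nmod", "_", "SpaceAfter=No"],
    ["4", "惠", "_", "PROPN", "_", "NameType=Prs", "5", "compound", "_", "SpaceAfter=No"],
    ["5", "王", "_", "NOUN", "_", "_", "2", "obj", "_", "SpaceAfter=No"],
]
SAMPLE = "# text = 孟子見梁惠王\n" + "\n".join("\t".join(r) for r in SAMPLE_ROWS) + "\n"

tokens = parse_conllu(SAMPLE)
forms = {t["id"]: t["form"] for t in tokens}
for t in tokens:
    # head 0 means the token is the root of the sentence
    head_form = "ROOT" if t["head"] == 0 else forms[t["head"]]
    print(t["form"], "<-" + t["deprel"] + "-", head_form)
```

Each printed line reads "dependent <-relation- head", so 孟子 is the nsubj of 見 and 王 is its obj.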

deplacy renders the same tree as SVG. Of course, with this model too, the internal dependency adjacency matrix can be inspected in the raw. Let's take a look.

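    !pip install transformers
    import torch,numpy
    from transformers import AutoTokenizer,AutoModelForTokenClassification
    brt="KoichiYasuoka/roberta-classical-chinese-base-ud-goeswith"
    txt="孟子見梁惠王"
    tkz=AutoTokenizer.from_pretrained(brt)
    mdl=AutoModelForTokenClassification.from_pretrained(brt)
    # v: tokenized input; l: label id -> label string
    v,l=tkz(txt,return_offsets_mapping=True),mdl.config.id2label
    # w: token ids with the leading and trailing special tokens; u: surface string of each token
    w,u=v["input_ids"],[txt[s:e] for s,e in v["offset_mapping"] if s<e]
    # one input per token: mask that token and append its id at the end
    x=[w[:i]+[tkz.mask_token_id]+w[i+1:]+[j] for i,j in enumerate(w[1:-1],1)]
    with torch.no_grad():
      # drop the first position and the last two, leaving one column per token
      m=mdl(input_ids=torch.tensor(x)).logits.numpy()[:,1:-2,:]
    # r: 1 for label 0, -1 for labels ending in "|root", 0 otherwise
    r=[1 if i==0 else -1 if l[i].endswith("|root") else 0 for i in range(len(l))]
    # NaN out everything except "|root" labels on the diagonal and non-root labels off it
    m+=numpy.where(numpy.add.outer(numpy.identity(m.shape[0]),r)==0,0,numpy.nan)
    # d: best logit per cell; p: index of the label that attains it
    d,p=numpy.nanmax(m,axis=2),numpy.nanargmax(m,axis=2)
    print(" ".join(x.rjust(12-len(x)) for x in u))
    for i,j in enumerate(u):
      print("\n"+" ".join("{:12.3f}".format(x) for x in d[i])," ",j)
      print(" ".join(l[x].split("|")[-1][:12].rjust(12) for x in p[i]))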

The logits (log-odds) of the adjacency matrix came out as follows.

              孟           子           見           梁           惠           王

           1.614       13.447        3.614        1.796        2.638        3.227   孟
            root     goeswith    parataxis         flat         flat         conj

           2.562        1.379        2.898        1.769        2.058        2.189   子
        goeswith         root     goeswith         flat         flat         conj

          12.649        3.503       14.121        1.530        1.870       12.823   見
           nsubj        nsubj         root          obj        nsubj          obj

           1.686        3.282        2.806        2.027        5.426        4.830   梁
            flat     goeswith         case         root     goeswith     goeswith

           1.872        4.209        2.639        6.933        1.752        6.175   惠
        goeswith     goeswith     goeswith         nmod         root     goeswith

           1.496        3.133        2.452       10.543       11.329        2.845   王
            conj     goeswith         amod         nmod     compound         root
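As a rough way to read this matrix, the sketch below picks, for every column, the row with the largest logit. It assumes that a row is the candidate head and a column is the dependent, with the diagonal standing for root; that reading is inferred from the printout, where the row labelled 見 carries nsubj and obj toward the other characters. The function `greedy_heads` is a hypothetical helper, and this per-column argmax is a simplification: it is not guaranteed to yield a tree in general. The first snippet installs ufal.chu_liu_edmonds, which suggests the actual pipeline uses a maximum-spanning-tree decoder instead, but that is my assumption.

```python
TOKENS = ["孟", "子", "見", "梁", "惠", "王"]

# Values copied from the printout in this post.
# LOGITS[h][d]: row h = candidate head, column d = dependent (assumed reading).
LOGITS = [
    [1.614, 13.447, 3.614, 1.796, 2.638, 3.227],
    [2.562, 1.379, 2.898, 1.769, 2.058, 2.189],
    [12.649, 3.503, 14.121, 1.530, 1.870, 12.823],
    [1.686, 3.282, 2.806, 2.027, 5.426, 4.830],
    [1.872, 4.209, 2.639, 6.933, 1.752, 6.175],
    [1.496, 3.133, 2.452, 10.543, 11.329, 2.845],
]
LABELS = [
    ["root", "goeswith", "parataxis", "flat", "flat", "conj"],
    ["goeswith", "root", "goeswith", "flat", "flat", "conj"],
    ["nsubj", "nsubj", "root", "obj", "nsubj", "obj"],
    ["flat", "goeswith", "case", "root", "goeswith", "goeswith"],
    ["goeswith", "goeswith", "goeswith", "nmod", "root", "goeswith"],
    ["conj", "goeswith", "amod", "nmod", "compound", "root"],
]

def greedy_heads(logits, labels):
    """For each dependent (column), choose the head (row) with the highest logit.

    Returns a list of (head, deprel) with 1-based heads; 0 means root.
    Greedy illustration only, not a spanning-tree decoder.
    """
    n = len(logits)
    arcs = []
    for d in range(n):
        h = max(range(n), key=lambda k: logits[k][d])
        head = 0 if h == d else h + 1  # the diagonal cell stands for root
        arcs.append((head, labels[h][d]))
    return arcs

for tok, (head, rel) in zip(TOKENS, greedy_heads(LOGITS, LABELS)):
    print(tok, rel, "ROOT" if head == 0 else TOKENS[head - 1])
```

For this sentence the greedy choice already forms a tree: 見 is the root, 孟 and 王 attach to 見, 梁 and 惠 attach to 王, and 子 attaches to 孟 by goeswith.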

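「孟子」 is split into two tokens, 「孟」 and 「子」, and we can see that internally they are linked as 「孟」=goeswith⇒「子」. In fact, since this is a one-character-per-token model, any Chinese word that spans several characters ends up connected internally by goeswith (泣き別れ, "torn apart"). I have written up how the model was built elsewhere, so readers with a GPU at hand are encouraged to give it a try.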
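The step from per-character arcs to word-level output can be sketched as follows. The input arcs are the per-character analysis implied by the matrix in this post (the strongest entry in each column). `merge_goeswith` is a hypothetical helper, not necessarily how the model's own pipeline does it, and it assumes that every goeswith character directly follows the piece it belongs to.

```python
def merge_goeswith(tokens, arcs):
    """Glue goeswith characters onto the preceding word and renumber heads.

    tokens: list of characters; arcs: list of (head, deprel), heads 1-based, 0 = root.
    Returns a list of (form, head, deprel) at word level.
    Assumes each goeswith piece immediately follows the word it continues.
    """
    words = []    # each entry: [form, old_head, deprel]
    word_of = {}  # old 1-based token id -> new 1-based word id
    for i, (tok, (head, rel)) in enumerate(zip(tokens, arcs), 1):
        if rel == "goeswith" and words:
            words[-1][0] += tok          # extend the previous word
        else:
            words.append([tok, head, rel])
        word_of[i] = len(words)
    # head 0 (root) is not in word_of and stays 0
    return [(form, word_of.get(head, 0), rel) for form, head, rel in words]

chars = ["孟", "子", "見", "梁", "惠", "王"]
char_arcs = [(3, "nsubj"), (1, "goeswith"), (0, "root"),
             (6, "nmod"), (6, "compound"), (3, "obj")]

for idx, (form, head, rel) in enumerate(merge_goeswith(chars, char_arcs), 1):
    print(idx, form, head, rel)
```

The merged result has five words, with 孟子 as a single PROPN-sized unit whose head is 見, which agrees with the ID, HEAD, and DEPREL columns of the CoNLL-U output shown earlier.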
