yasuoka's Diary: Can ixaKat's Basque Dependencies Be Converted to Universal Dependencies?
Following up on ixaKat, which I introduced in yesterday's entry: with one eye on "Dependentzia Unibertsalen eredura egokitutako euskarazko zuhaitz-bankua", I tried working out a conversion to Universal Dependencies on Google Colaboratory.
!test -d ixa-pipe-dep-eu || ( curl -L http://ixa2.si.ehu.es/ixakat/downloads/ixa-pipe-dep-eu-v2.0.0.tgz | tar xzf - )
!test -d dep-eu-resources-v2.0.0 || ( curl -L http://ixa2.si.ehu.es/ixakat/downloads/dep-eu-resources-v2.0.0.tgz | tar xzf - )
!test -d ixa-pipe-pos-eu || ( curl -L http://ixa2.si.ehu.es/eustagger/download/ixa-pipe-pos-eu-x86-64.tar.bz2 | tar xjf - )
!test -d ixa-pipes-1.1.1 || ( curl -L http://ixa2.si.ehu.es/ixa-pipes/models/ixa-pipes-1.1.1.tar.gz | tar xzf - )
!pip install deplacy
!echo Euskaldun izatea lan extra bat izatea da. | sh ixa-pipe-pos-eu/ixa-pipe-pos-eu.sh | java -jar ixa-pipe-dep-eu/ixa-pipe-dep-eu-2.0.0-exec.jar -b dep-eu-resources-v2.0.0 -c tmp.conll | sed -e '/<terms>/,/<\/terms>/d' -e '/<deps>/,/<\/deps>/d' | java -jar ixa-pipes-1.1.1/ixa-pipe-pos-1.5.1-exec.jar tag -m ixa-pipes-1.1.1/ud-morph-models-1.5.0/eu/eu-pos-perceptron-ud.bin -lm ixa-pipes-1.1.1/ud-morph-models-1.5.0/eu/eu-lemma-perceptron-ud.bin > tmp.xml
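# Read the XML written by the pipeline above
with open("tmp.xml","r",encoding="utf-8") as f:
    xml=f.read()
startchar=[]
endchar=[]
upos=[]
for s in xml.split("\n"):
    # <wf> lines carry the character offset and length of each word form
    if s.find("<wf id=")>=0:
        i=s.index('offset="')
        j=int(s[i+8:s.index('"',i+8)])
        startchar.append(j)
        i=s.index('length="')
        endchar.append(j+int(s[i+8:s.index('"',i+8)]))
    # <term> lines carry the UD part-of-speech tag in morphofeat
    if s.find("<term id=")>=0:
        i=s.index('morphofeat="')
        j=s[i+12:s.index('"',i+12)]
        if j=="CONJ":
            j="CCONJ"
        upos.append(j)
# Sentinel so that startchar[i+1] also exists for the last token
startchar.append(0)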
# Read the CoNLL output written by ixa-pipe-dep-eu (-c tmp.conll)
with open("tmp.conll","r",encoding="utf-8") as f:
    conll=f.read()
# Direct mapping from ixaKat dependency labels to UD relations
p={"apocmod":"parataxis","apoxmod":"parataxis","aponcmod":"appos","auxmod":"aux","ccomp_obj":"ccomp","ccomp_subj":"csubj","entios":"flat","galdemod":"aux","gradmod":"advmod","haos":"compound","itj_out":"vocative","itjout":"vocative","lot":"conj","lot_at":"discourse","lotat":"discourse","menos":"mark","ncobj":"obj","ncsubj":"nsubj","nczobj":"iobj","postos":"nmod","prtmod":"aux","xcomp_obj":"ccomp","xcomp_subj":"csubj","xcomp_zobj":"advcl","xpred":"aux"}
# Relations decided by the dependent's UPOS alone
u={"PUNCT":"punct","NUM":"nummod","DET":"det"}
doc=""
i=0
for s in conll.split("\n"):
    t=s.split("\t")
    if len(t)==10:
        t[3]=upos[i]
        t[9]="SpaceAfter=No" if endchar[i]==startchar[i+1] else "_"
        # Fallback: keep the unmapped ixaKat label under a "dep:" prefix
        dep="dep:"+t[7]
        if t[6]=="0":
            dep="root"
        elif t[7] in p:
            dep=p[t[7]]
        elif upos[i] in u:
            dep=u[upos[i]]
        elif t[7] in {"cmod","xmod"}:
            # upos[int(t[6])-int(t[0])+i] is the UPOS of this token's head
            dep="advcl" if upos[int(t[6])-int(t[0])+i] in {"VERB","AUX","ADJ","ADV"} else "acl"
        elif t[7]=="ncmod":
            dep="nmod"
            if upos[i]=="ADV":
                dep="advmod"
            elif upos[i]=="ADJ":
                dep="amod"
            elif upos[int(t[6])-int(t[0])+i] in {"VERB","AUX","ADJ","ADV"}:
                dep="obl"
        elif t[7]=="ncpred":
            dep="ccomp" if upos[i]=="VERB" else "obj"
        t[7]=dep
        i+=1
    # Non-token lines (sentence breaks) are passed through unchanged
    doc+="\t".join(t)+"\n"
# Collapse runs of blank lines left over from the split
doc=doc.replace("\n\n\n","\n\n")
import deplacy
deplacy.render(doc)
deplacy.serve(doc,port=None)
This got rather long, but on my (Koichi Yasuoka's) machine, the dependency parse of "Euskaldun izatea lan extra bat izatea da." came out as follows.
Euskaldun PROPN <╗           obl
izatea    VERB  ═╝<══════╗   csubj
lan       NOUN  ═╗═╗<╗   ║   obl
extra     NOUN  <╝ ║ ║   ║   nmod
bat       NUM   <══╝ ║   ║   nummod
izatea    VERB  ═════╝<╗ ║   ccomp
da        VERB  ═══════╝═╝═╗ root
.         PUNCT <══════════╝ punct
1 Euskaldun euskaldun PROPN ADJ KAS=ZERO|CLUSTER=01010111|CLUSTERM=0101|ATZIZKIA=Null 2 obl _ _
2 izatea izan VERB ADI_SIN KAS=ABS|ERL=KONPL|ADM=ADIZE|CLUSTER=0110100|CLUSTERM=0110|ATZIZKIA=Null 7 csubj _ _
3 lan lan NOUN IZE_ARR KAS=ZERO|CLUSTER=1011110111010|CLUSTERM=1011|ATZIZKIA=Null 6 obl _ _
4 extra extra NOUN ADJ KAS=ZERO|CLUSTER=01111110100|CLUSTERM=0111|ATZIZKIA=Null 3 nmod _ _
5 bat bat NUM DET_DZH CLUSTER=1011010|CLUSTERM=1011|ATZIZKIA=Null 3 nummod _ _
6 izatea izate VERB IZE_ARR KAS=ABS|NUM=S|CLUSTER=0110100|CLUSTERM=0110|ATZIZKIA=a 7 ccomp _ _
7 da izan VERB ADT ASP=PNT|MDN=A1|DADUDIO=NOR|NOR=HURA|CLUSTER=0110100|CLUSTERM=0110|ATZIZKIA=Null 0 root _ SpaceAfter=No
8 . . PUNCT PUNT_PUNT _ 7 punct _ _
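The SpaceAfter=No on token 7 comes from comparing each token's end offset with the start offset of the token that follows it. Here is a minimal sketch of just that check, assuming the `<wf>` elements carry `offset` and `length` attributes as the script above expects. The sample XML is hand-written for illustration and is not actual ixaKat output, and `space_after` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption, not real ixaKat output): it only has
# the offset/length attributes that the script above reads from tmp.xml.
SAMPLE = """<text>
<wf id="w1" offset="0" length="2">da</wf>
<wf id="w2" offset="2" length="1">.</wf>
<wf id="w3" offset="4" length="3">Bai</wf>
</text>"""

def space_after(xml_text):
    """Return one MISC value per <wf>, in document order."""
    spans = []
    for wf in ET.fromstring(xml_text).iter("wf"):
        start = int(wf.get("offset"))
        spans.append((start, start + int(wf.get("length"))))
    misc = []
    for k, (_, end) in enumerate(spans):
        # The last token has no successor, so it never gets SpaceAfter=No
        nxt = spans[k + 1][0] if k + 1 < len(spans) else None
        misc.append("SpaceAfter=No" if nxt == end else "_")
    return misc

print(space_after(SAMPLE))
```

Using an XML parser instead of `str.index` is a design choice of this sketch, not what the script above does. It avoids depending on each `<wf>` sitting on its own line.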
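Viewed as SVG, it looks like this. I was a little unsure whether "da" should be treated as a copula, but the structure "izatea" ⇐csubj= "izatea" =cop⇒ "da" is hard to read, so I made both instances of "izatea" hang from "da". Now, will this work for other example sentences too?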
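One way to try the relabeling rules on other inputs without rerunning the whole pipeline is to pull them out of the loop into a function. The sketch below restates the rules from the script above under that assumption. The function name `ud_deprel` and its parameters are hypothetical, and the ixaKat labels in the usage lines are picked for illustration rather than taken from real parser output.

```python
# Tables copied from the script above
P = {"apocmod":"parataxis","apoxmod":"parataxis","aponcmod":"appos","auxmod":"aux",
     "ccomp_obj":"ccomp","ccomp_subj":"csubj","entios":"flat","galdemod":"aux",
     "gradmod":"advmod","haos":"compound","itj_out":"vocative","itjout":"vocative",
     "lot":"conj","lot_at":"discourse","lotat":"discourse","menos":"mark",
     "ncobj":"obj","ncsubj":"nsubj","nczobj":"iobj","postos":"nmod","prtmod":"aux",
     "xcomp_obj":"ccomp","xcomp_subj":"csubj","xcomp_zobj":"advcl","xpred":"aux"}
U = {"PUNCT":"punct","NUM":"nummod","DET":"det"}
PREDICATE_LIKE = {"VERB","AUX","ADJ","ADV"}

def ud_deprel(label, upos, head_upos, is_root=False):
    """Map one ixaKat label to a UD relation (hypothetical helper).

    label     -- ixaKat dependency label of the token
    upos      -- UPOS of the token itself
    head_upos -- UPOS of the token's head
    is_root   -- True when the token's head ID is 0
    """
    if is_root:
        return "root"
    if label in P:
        return P[label]
    if upos in U:
        return U[upos]
    if label in {"cmod", "xmod"}:
        return "advcl" if head_upos in PREDICATE_LIKE else "acl"
    if label == "ncmod":
        if upos == "ADV":
            return "advmod"
        if upos == "ADJ":
            return "amod"
        return "obl" if head_upos in PREDICATE_LIKE else "nmod"
    if label == "ncpred":
        return "ccomp" if upos == "VERB" else "obj"
    # Unmapped labels stay visible under a "dep:" prefix
    return "dep:" + label

# Illustrative inputs (assumed labels, not real ixaKat output)
print(ud_deprel("ncmod", "PROPN", "VERB"))
print(ud_deprel("ncmod", "NOUN", "NOUN"))
print(ud_deprel("unknownlabel", "NOUN", "VERB"))
```

Because unmapped labels come back as `dep:...`, scanning the converted output for that prefix is a quick way to spot ixaKat labels the tables do not yet cover.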