Re: Does the roberta-base-japanese-with-auto-jumanpp tokenizer not need sentencepiece? | yasuoka's diary

Diary entry by yasuoka, October 18, 2022, 00:28

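This continues yesterday's diary entry. As of this writing, the vocab.txt of roberta-base-japanese-with-auto-jumanpp appears to be broken. For now, while reporting that problem, I worked out a way to repair vocab.txt with the help of AlbertTokenizer. Let's run it on Google Colaboratory.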

# Build and install Juman++ unless it is already present
!test -d jumanpp-2.0.0-rc3 || curl -L https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz | tar xJf -
!test -x /usr/local/bin/jumanpp || ( mkdir jumanpp-2.0.0-rc3/build && cd jumanpp-2.0.0-rc3/build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install )
!pip install transformers pyknp sentencepiece
from transformers import AlbertTokenizer,BertJapaneseTokenizer,AutoModelForMaskedLM,FillMaskPipeline
from transformers.utils import cached_file
# Read the pieces from the spiece.model shipped with the model
tkz=AlbertTokenizer(cached_file("nlp-waseda/roberta-base-japanese-with-auto-jumanpp","spiece.model"))
# Rewrite them as a WordPiece-style vocab.txt: pieces starting with U+2581 lose that marker,
# entries starting with "[" or "<" are kept as-is, and every other piece gets a "##" prefix
with open("vocab.txt","w",encoding="utf-8") as w:
    print("\n##".join(tkz.convert_ids_to_tokens(range(len(tkz)))).replace("\n##[","\n[").replace("\n##<","\n<").replace("\n##\u2581","\n"),file=w)
# Load the tokenizer with the repaired vocab.txt and run fill-mask
tokenizer=BertJapaneseTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp",vocab_file="vocab.txt")
model=AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
fmp=FillMaskPipeline(model=model,tokenizer=tokenizer)
print(fmp("国境の[MASK]トンネルを抜けると雪国であった。"))
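To see what the join-and-replace step in the program above does, here is a minimal sketch that applies the same string operations to a small hand-made list of pieces. The function name spiece_to_wordpiece_vocab and the toy pieces are my own illustration, not the model's real vocabulary, and nothing is downloaded.

```python
def spiece_to_wordpiece_vocab(pieces):
    """Convert SentencePiece-style pieces to WordPiece-style vocab entries.

    Same string operations as the one-liner in the program above:
    every piece after the first gets a "##" prefix, then the prefix is
    removed again for entries starting with "[" or "<", and for
    word-initial pieces (those starting with U+2581), which also lose
    the U+2581 marker. The first piece is never touched.
    """
    text = "\n##".join(pieces)
    text = text.replace("\n##[", "\n[")
    text = text.replace("\n##<", "\n<")
    text = text.replace("\n##\u2581", "\n")
    return text.split("\n")


# Hypothetical toy pieces, chosen only to exercise each rule
pieces = ["[PAD]", "[UNK]", "[MASK]", "<unk>",
          "\u2581国境", "\u2581の", "トンネル", "\u2581長い", "い"]

for entry in spiece_to_wordpiece_vocab(pieces):
    print(entry)
```

Word-initial pieces such as 「▁国境」 come out as plain 「国境」, while word-internal pieces such as 「トンネル」 come out as 「##トンネル」, which is the form a WordPiece vocabulary uses for continuation pieces.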

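With vocab.txt repaired, I had the model fill in the [MASK] in 「国境の[MASK]トンネルを抜けると雪国であった。」 (roughly, "Beyond the [MASK] tunnel at the border lay the snow country."). On my machine (Koichi Yasuoka) the result was as follows.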

[{'score': 0.16705404222011566, 'token': 2244, 'token_str': '地下', 'sequence': '国境の地下トンネルを抜けると雪国であった。'},
 {'score': 0.14420612156391144, 'token': 2309, 'token_str': '長い', 'sequence': '国境の長いトンネルを抜けると雪国であった。'},
 {'score': 0.02641996741294861, 'token': 509, 'token_str': '北', 'sequence': '国境の北トンネルを抜けると雪国であった。'},
 {'score': 0.023077642545104027, 'token': 526, 'token_str': '南', 'sequence': '国境の南トンネルを抜けると雪国であった。'},
 {'score': 0.0190455112606287, 'token': 577, 'token_str': '山', 'sequence': '国境の山トンネルを抜けると雪国であった。'}]
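The output is a list of dictionaries ordered by score, so a candidate's rank can be read off with a few lines of Python. The sketch below copies only the score and token_str fields from the result above; the helper name rank_of is my own, not part of transformers.

```python
# Scores and token strings copied from the result above (other fields omitted)
results = [
    {"score": 0.16705404222011566, "token_str": "地下"},
    {"score": 0.14420612156391144, "token_str": "長い"},
    {"score": 0.02641996741294861, "token_str": "北"},
    {"score": 0.023077642545104027, "token_str": "南"},
    {"score": 0.0190455112606287, "token_str": "山"},
]


def rank_of(results, token_str):
    """Return the 1-based rank of token_str by descending score, or None if absent."""
    ordered = sorted(results, key=lambda r: r["score"], reverse=True)
    for rank, r in enumerate(ordered, start=1):
        if r["token_str"] == token_str:
            return rank
    return None


print(rank_of(results, "長い"))
```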

「長い」 ("long") comes in second. I think that is good enough. Still, is it really all right to use WordpieceTokenizer here rather than sentencepiece?
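For reference, WordPiece tokenization as used by BERT-style tokenizers splits each word by greedy longest-match-first lookup, with "##" marking continuation pieces. The following is a simplified sketch of that idea in plain Python, not the transformers implementation; the function name wordpiece and the toy vocabulary are assumptions made for illustration only.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens.

    Pieces after the first are looked up with a "##" prefix. If any part
    of the word cannot be matched, the whole word becomes the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the model's real vocab.txt
vocab = {"雪国", "雪", "##国", "トン", "##ネル"}

print(wordpiece("雪国", vocab))
print(wordpiece("トンネル", vocab))
print(wordpiece("国境", vocab))
```

A sentencepiece model chooses its segmentation differently, so the two approaches need not always split a word the same way even when they share a vocabulary. That difference is what the question above is about.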


vocab.txt has been fixed (Score: 2)

by yasuoka (21275) on October 18, 2022, 13:15 (#4345778)

It appears that vocab.txt has now been fixed [huggingface.co]. With that, the following program for Google Colaboratory now works "correctly".

!test -d jumanpp-2.0.0-rc3 || curl -L https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz | tar xJf -
!test -x /usr/local/bin/jumanpp || ( mkdir jumanpp-2.0.0-rc3/build && cd jumanpp-2.0.0-rc3/build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install )
!pip install transformers pyknp
from transformers import pipeline
fmp=pipeline("fill-mask","nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(fmp("国境の[MASK]トンネルを抜けると雪国であった。"))
