I got involved with Japanese tokenization (MeCab, kuromoji) and became confused about what a "word" actually is in Japanese.
Scenario 1:
"たべません" analyzed by a tokenizer, outputs the following tokens:
たべ 動詞,自立,*,* たべる
ませ 助動詞,*,*,* ます
ん 助動詞,*,*,* ん
While the result is three tokens, I think this would generally be considered a single word. In that case the word "たべません" consists of a verb with two auxiliary verbs (助動詞) attached to it.
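For reference, I get this output with roughly the following (a minimal sketch using MeCab's Python binding with an IPAdic-style dictionary; kuromoji gives me equivalent part-of-speech features):

```python
import MeCab  # pip install mecab-python3 plus a dictionary, e.g. ipadic

tagger = MeCab.Tagger()

def tokens(text):
    """Return (surface, feature_list) pairs, one per token."""
    result = []
    for line in tagger.parse(text).splitlines():
        if line == "EOS":
            break
        surface, feature = line.split("\t")
        result.append((surface, feature.split(",")))
    return result

for surface, features in tokens("たべません"):
    # IPAdic feature order: 品詞, 品詞細分類1-3, 活用型, 活用形, 基本形 (index 6), 読み, 発音
    print(surface, features[0], features[1], features[6])
```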
Scenario 2:
"たべています" analyzed by a tokenizer, outputs the following tokens:
たべ 動詞,自立,*,* たべる
て 助詞,接続助詞,*,* て
い 動詞,非自立,*,* いる
ます 助動詞,*,*,* ます
I think "たべています" would usually be considered two separate words, "たべて" and "います". In that case "たべて" consists of a verb plus a particle (助詞), and "います" consists of a non-independent (非自立) verb with an auxiliary verb attached to it.
Scenario 3:
"たべてて" (a contraction of "たべていて") analyzed by a tokenizer, outputs the following tokens:
たべ 動詞,自立,*,* たべる
て 動詞,非自立,*,* てる
て 助詞,接続助詞,*,* て
While I think "たべていて" would be separated into "たべて" and "いて", I have no clue what to do with "たべてて". Following the logic from scenario 2, I would split before the て whose base form is てる, since it is a non-independent verb, just like the い of "います" in scenario 2. On the other hand, it seems inconsistent to me that we would then end up with "たべ" by itself...
Scenario 4:
"たべながら" analyzed by a tokenizer, outputs the following tokens:
たべ 動詞,自立,*,* たべる タベ タベ
ながら 助詞,接続助詞,*,* ながら ナガラ ナガラ
I think "たべながら" would usually be considered two separate words, "たべ" and "ながら". However, this would contradict the logic in scenario 2, since "ながら" is classified as a particle (助詞), just like "て".
Conclusion and Problem
As scenarios 3 and 4 show, I can't come up with a good, consistent way to split "words" in Japanese.
I want to process large amounts of text programmatically and get meaningful results. If I treat every token as its own word, the data ends up confusing and unpresentable. On the other hand, if I merge every particle and auxiliary verb into the preceding word, I end up with far too many unique words.
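For concreteness, the "merge everything" extreme looks roughly like this (my own ad-hoc heuristic, not taken from any library: glue every auxiliary verb, non-independent verb, and conjunctive particle onto the preceding chunk):

```python
def merge_tokens(tokens):
    """Glue dependent tokens onto the preceding chunk.

    tokens: list of (surface, pos, pos_sub1) tuples as produced by MeCab/IPAdic.
    A token counts as "dependent" here if it is an auxiliary verb (助動詞),
    a non-independent verb (動詞/非自立), or a conjunctive particle (助詞/接続助詞).
    """
    words = []
    for surface, pos, sub in tokens:
        dependent = (
            pos == "助動詞"
            or (pos == "動詞" and sub == "非自立")
            or (pos == "助詞" and sub == "接続助詞")
        )
        if dependent and words:
            words[-1] += surface
        else:
            words.append(surface)
    return words

# Scenario 1: everything collapses back into a single "word"
print(merge_tokens([("たべ", "動詞", "自立"), ("ませ", "助動詞", "*"), ("ん", "助動詞", "*")]))
# -> ['たべません']

# Scenario 4: ながら gets glued on too, even though I would rather split it off
print(merge_tokens([("たべ", "動詞", "自立"), ("ながら", "助詞", "接続助詞")]))
# -> ['たべながら']
```

This matches my intuition for scenario 1, but the moment I drop 接続助詞 from the rule so that "ながら" gets split off, scenario 2 comes out as "たべ" and "ています" instead of "たべて" and "います", which is exactly the inconsistency described above.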
I am hoping you can show me a middle ground between these two extremes.
Is my intuition bad as far as the suggested "splits" go? Am I missing some important rule here? Are the tokenizer results debatable? Or is there simply no hope?