I got involved with Japanese tokenization (MeCab, kuromoji) and became confused about what a "word" actually is in Japanese.
Scenario 1:
"たべません" analyzed by a tokenizer, outputs the following tokens:
たべ 動詞,自立,*,* たべる
ませ 助動詞,*,*,* ます
ん 助動詞,*,*,* ん
While the result is three tokens, I think this would generally be considered a single word. In that case the word "たべません" consists of a verb with two auxiliary verbs (助動詞) attached to it.
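For reference, I get this output with roughly the following (a minimal sketch using MeCab's Python binding with an IPAdic-style dictionary; kuromoji gives me equivalent part-of-speech features):

```python
import MeCab  # pip install mecab-python3 plus a dictionary, e.g. ipadic

tagger = MeCab.Tagger()

def tokens(text):
    """Return (surface, feature_list) pairs, one per token."""
    result = []
    for line in tagger.parse(text).splitlines():
        if line == "EOS":
            break
        surface, feature = line.split("\t")
        result.append((surface, feature.split(",")))
    return result

for surface, features in tokens("たべません"):
    # IPAdic feature order: 品詞, 品詞細分類1-3, 活用型, 活用形, 基本形 (index 6), 読み, 発音
    print(surface, features[0], features[1], features[6])
```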
Scenario 2:
"たべています" analyzed by a tokenizer, outputs the following tokens:
たべ 動詞,自立,*,* たべる
て 助詞,接続助詞,*,* て
い 動詞,非自立,*,* いる
ます 助動詞,*,*,* ます
I think "たべています" would usually be considered two separate words, "たべて" and "います". In that case "たべて" consists of a verb plus a particle (助詞), and "います" consists of a non-independent (非自立) verb with an auxiliary verb attached to it.
Scenario 3:
"たべてて" (a contraction of "たべていて") analyzed by a tokenizer, outputs the following tokens:
たべ 動詞,自立,*,* たべる
て 動詞,非自立,*,* てる
て 助詞,接続助詞,*,* て
While I think "たべていて" would be separated into "たべて" and "いて", I have no clue what to do with "たべてて". Following the logic from scenario 2, I would split before the て whose base form is てる, since it is a non-independent verb, just like the い of "います" in scenario 2. On the other hand, it seems inconsistent to me that we would then end up with "たべ" by itself...
Scenario 4:
"たべながら" analyzed by a tokenizer, outputs the following tokens:
たべ 動詞,自立,*,* たべる タベ タベ
ながら 助詞,接続助詞,*,* ながら ナガラ ナガラ
I think "たべながら" would usually be considered two separate words, "たべ" and "ながら". However, this would contradict the logic in scenario 2, since "ながら" is classified as a particle (助詞), just like "て".
Conclusion and Problem
As scenarios 3 and 4 show, I can't come up with a good, consistent way to split "words" in Japanese.
I want to process large amounts of text programmatically and get meaningful results. If I treat every token as its own word, the data ends up confusing and unpresentable. On the other hand, if I merge every particle and auxiliary verb into the preceding word, I end up with far too many unique words.
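For concreteness, the "merge everything" extreme looks roughly like this (my own ad-hoc heuristic, not taken from any library: glue every auxiliary verb, non-independent verb, and conjunctive particle onto the preceding chunk):

```python
def merge_tokens(tokens):
    """Glue dependent tokens onto the preceding chunk.

    tokens: list of (surface, pos, pos_sub1) tuples as produced by MeCab/IPAdic.
    A token counts as "dependent" here if it is an auxiliary verb (助動詞),
    a non-independent verb (動詞/非自立), or a conjunctive particle (助詞/接続助詞).
    """
    words = []
    for surface, pos, sub in tokens:
        dependent = (
            pos == "助動詞"
            or (pos == "動詞" and sub == "非自立")
            or (pos == "助詞" and sub == "接続助詞")
        )
        if dependent and words:
            words[-1] += surface
        else:
            words.append(surface)
    return words

# Scenario 1: everything collapses back into a single "word"
print(merge_tokens([("たべ", "動詞", "自立"), ("ませ", "助動詞", "*"), ("ん", "助動詞", "*")]))
# -> ['たべません']

# Scenario 4: ながら gets glued on too, even though I would rather split it off
print(merge_tokens([("たべ", "動詞", "自立"), ("ながら", "助詞", "接続助詞")]))
# -> ['たべながら']
```

This matches my intuition for scenario 1, but the moment I drop 接続助詞 from the rule so that "ながら" gets split off, scenario 2 comes out as "たべ" and "ています" instead of "たべて" and "います", which is exactly the inconsistency described above.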
I am hoping you can show me a middle ground between these two extremes.
Is my intuition bad as far as the suggested "splits" go? Am I missing some important rule here? Are the tokenizer results debatable? Or is there simply no hope?