Japanese parsing

Parsing using the reader’s knowledge

Consider the following sign:

Sequences of kanji are not separated

English separates words with spaces, but Japanese uses no separators between words written in kanji or hiragana; separators do often appear between words written in katakana, though:

Tokyo Medical Univ. Hospital
thank you very much
Albert Einstein

toukyou ika daigaku byouin
doumo arigatou gozai-masu
aruberuto ainshutain

Japanese common form

Sequences of words written in either kanji or hiragana, like ‘Tokyo Medical University Hospital’ or ‘thank you very much’, are not separated; instead, the reader must already know how to split the text, i.e., how to parse the sentence into words. It is like reading ‘tokyomedicaluniversityhospital’ and ‘thankyouverymuch’ in English: cumbersome but doable, and with practice we would get used to it.
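The idea that the reader splits unspaced text using words they already know can be sketched as a greedy longest-match lookup. This is only an illustration, not how readers or real segmenters actually work: the function name `segment` and both small vocabularies are hypothetical, and the kanji spelling 東京医科大学病院 for ‘toukyou ika daigaku byouin’ is supplied here as an assumption because the original Japanese form is not reproduced in the text.

```python
def segment(text, vocabulary):
    """Split unspaced text into words by greedy longest-match lookup."""
    words = []
    start = 0
    longest = max(len(word) for word in vocabulary)
    while start < len(text):
        # Try the longest candidate first, then shrink until a known word fits.
        for end in range(min(len(text), start + longest), start, -1):
            if text[start:end] in vocabulary:
                words.append(text[start:end])
                start = end
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            words.append(text[start])
            start += 1
    return words


# Hypothetical mini-vocabularies standing in for the reader's knowledge.
ENGLISH_WORDS = {"tokyo", "medical", "university", "hospital",
                 "thank", "you", "very", "much"}
JAPANESE_WORDS = {"東京", "医科", "大学", "病院"}

print(segment("tokyomedicaluniversityhospital", ENGLISH_WORDS))
print(segment("thankyouverymuch", ENGLISH_WORDS))
print(segment("東京医科大学病院", JAPANESE_WORDS))
```

With an empty or incomplete vocabulary the function falls back to single characters, which mirrors the point above: without prior knowledge of the words, the text cannot be split reliably.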

Katakana is a different story because a Japanese reader may not already know the foreign words and so may not be able to split the text. Hence, words in katakana are often separated by a middle dot, ‘・’, which plays a role similar to the hyphen, ‘-’, in English. Thus, reading the dot-separated katakana version of ‘Albert Einstein’ is similar to us reading ‘albert-einstein’, i.e., not a problem.
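Because the middle dot is an explicit separator, splitting a katakana name needs no vocabulary at all. The short sketch below assumes the dot is the Unicode character U+30FB (KATAKANA MIDDLE DOT) and that ‘Albert Einstein’ is spelled アルベルト and アインシュタイン; the function name `split_katakana_name` is hypothetical.

```python
MIDDLE_DOT = "\u30fb"  # the katakana middle dot used between foreign words


def split_katakana_name(name):
    """Split a katakana phrase at each middle dot."""
    return name.split(MIDDLE_DOT)


# Assumed katakana spelling of 'Albert Einstein', joined with the middle dot.
einstein = "アルベルト" + MIDDLE_DOT + "アインシュタイン"
print(split_katakana_name(einstein))
```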

Writing systems as word separators

Everything in Japanese can be written entirely in either hiragana or katakana; either would suffice. Normally, however, nouns and the stems of adjectives and verbs are written in kanji, while verb inflections and particles are written in hiragana. Thus, the combination of kanji and kana aids in parsing sentences because each script roughly signals what kind of word it represents. For example:

Japanese common form

Mr. Tanaka drinks coffee.
tanaka san wa kouhii wo nomi-masu

In this sentence, the noun ‘Tanaka’ (田中) and the stem of the verb ‘drinks’ (飲) are written in kanji; the honorific ‘Mr.’ (さん), the topic marker (は) and object marker (を) particles, and the verb conjugation of ‘drinks’ (みます) are written in hiragana; and the foreign-origin word ‘coffee’ (コーヒー) is written in katakana. Hence, the writing system cues the reader about the nature of the words.
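A rough way to see these cues is to group consecutive characters by script. The sketch below assumes the sentence is written 田中さんはコーヒーを飲みます (assembled from the pieces quoted above) and classifies characters by three common Unicode blocks only; the helper names `script_of` and `script_runs` are hypothetical, and rarer kanji outside the main CJK block would fall into "other".

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the long-vowel mark in コーヒー
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # main CJK Unified Ideographs block
        return "kanji"
    return "other"


def script_runs(text):
    """Group consecutive characters that share a script."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]


for script, run in script_runs("田中さんはコーヒーを飲みます"):
    print(script, run)
```

Note that the runs are not the same as words: the honorific さん and the particle は come out as the single hiragana run さんは. A change of script is a hint about where words begin and end, not a word boundary in itself.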

There are many exceptions to these general rules, though. Consider the following newspaper article:

Various punctuation marks

Sometimes part of a noun is written in kanji and the rest in hiragana. In a red box in the article we find the word ‘kodomo’ (children); the first character of the word 子ども (children) is the kanji 子 (ko), while the second and third characters are the hiragana ども (domo). So even though ‘children’ is a noun, the mix of kanji and kana here doesn’t help us identify the start or end of the word.

Although not so common, a single word can also be a combination of hiragana and katakana, or even kanji, hiragana, and katakana. For example, the word keshigomu (eraser) combines the Japanese word ‘keshi’ (けし – erasing), in hiragana, and the foreign-origin word ‘gomu’ (ゴム – gum, or rubber), in katakana. Furthermore, since ‘keshi’ comes from the verb ‘kesu’ (to erase), we can also write it with a kanji and a hiragana: 消し. Thus, we can write ‘eraser’ as either:

  • けしゴム – hiragana + katakana
  • 消しゴム – kanji + hiragana + katakana
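These exceptions can be checked mechanically by listing the scripts that occur inside a single word. The sketch below relies on Python's standard `unicodedata.name`, whose result begins with `CJK`, `HIRAGANA`, or `KATAKANA` for the characters used here; the function name `script_sequence` is hypothetical.

```python
import unicodedata


def script_sequence(word):
    """Return the scripts used in a word, in order, without consecutive repeats."""
    sequence = []
    for ch in word:
        # For these characters the Unicode name starts with the script,
        # e.g. 'CJK UNIFIED IDEOGRAPH-5B50' or 'HIRAGANA LETTER DO'.
        script = unicodedata.name(ch).split()[0]
        if not sequence or sequence[-1] != script:
            sequence.append(script)
    return sequence


for word in ("子ども", "けしゴム", "消しゴム"):
    print(word, script_sequence(word))
```

Each of the three words yields more than one script, so a script change inside them does not mark the start or end of a word.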

Punctuation marks

In addition to the use of kanji vs. kana for word parsing, Japanese uses punctuation marks that parse sentences and sentence fragments. As we can see in the annotated newspaper article, the following are the Japanese equivalents of some of the Roman punctuation marks:

  1. periods: 。。。(little circles instead of dots)
  2. commas: 、、、(the Japanese comma slants from upper left to lower right, the opposite of the Roman comma)
  3. quotation marks: 「  」(corner brackets; type them with [ and ] when in hiragana mode)
  4. dash marks: ・・・ (middle dots; for example, when writing down a telephone number)
  5. tilde: 〜 (as shown in the schedule box; similar to the English ~)
  6. parentheses: ( ) (oriented in the direction of the text)
  7. brackets: [ ] (oriented in the direction of the text)
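Since the period and comma mark sentence and clause boundaries explicitly, text can be split at them without any knowledge of the words. The sketch below assumes the period is U+3002 and the comma is U+3001; the function names and the two-sentence sample text are made up for illustration.

```python
import re

FULL_STOP = "\u3002"  # the Japanese period (little circle)
COMMA = "\u3001"      # the Japanese comma


def split_sentences(text):
    """Split text into sentences, keeping each period with its sentence."""
    return re.findall(f"[^{FULL_STOP}]+{FULL_STOP}?", text)


def split_clauses(sentence):
    """Split one sentence into comma-separated fragments, dropping the period."""
    return sentence.rstrip(FULL_STOP).split(COMMA)


# Sample text invented for this example: two sentences, one with a comma.
sample = ("田中さんは" + COMMA + "コーヒーを飲みます" + FULL_STOP
          + "どうもありがとうございます" + FULL_STOP)

for sentence in split_sentences(sample):
    print(sentence, split_clauses(sentence))
```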

Japanese does not have hyphens to indicate a word split between two consecutive text lines. For example, in the article, the word 大学生 (dai-gaku-sei – ‘college student’) is split between two different columns, but there is no punctuation mark similar to the hyphen in the first column to indicate that the word is split and that it ends in the second column.