Japanese parsing

Parsing using the reader’s knowledge

Consider the kanji in the following sign:

[Image: a sign with eight kanji and a translation underneath. Caption: Sequences of kanji are not separated]

As the translation under the kanji shows, these eight kanji represent four words, but which kanji correspond to each word? Some words are a single kanji long, like 日 (hi – sun); others are two kanji long, like 日本 (ni-hon – Japan); and others are three kanji long, like 日本語 (ni-hon-go – the Japanese language). So how do we know which characters are part of a word and which are not? How do Japanese readers split a sequence of characters, whether kanji or kana, into separate words?

The process of separating a sequence of characters or sounds into a meaningful sequence of individual units (technically called ‘tokens’) is called ‘parsing’, and it is one of the main difficulties of understanding a new language, especially when listening. The reason we pronounce words separately when we are trying to be very clear is to make it easier for the listener to parse what we say; pauses in spoken language play the role that spaces play in written language. We start getting really good at understanding a new language when we are able to do this parsing on the fly.

English separates words with spaces, but Japanese doesn’t use any separators between words written in kanji or hiragana; separators often appear between words written in katakana, though:


| English | romaji | Japanese common form |
| --- | --- | --- |
| Tokyo Medical Univ. Hospital | toukyou ika daigaku byouin | 東京医科大学病院 |
| thank you very much | doumo arigatou gozai-masu | どうもありがとうございます |
| Albert Einstein | arubaato ainshutain | アルバート・アインシュタイン |


Sequences of words written in either kanji or hiragana, like ‘Tokyo Medical University Hospital’ or ‘thank you very much’, are not separated; instead, the reader must already know how to split the text, i.e., how to parse the sentence into words. It is like reading ‘tokyomedicaluniversityhospital’ and ‘thankyouverymuch’ in English: cumbersome but doable, and with practice we would get used to it.
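
To make the idea of parsing from previous knowledge concrete, here is a minimal Python sketch that splits unspaced text using a list of known words. The `segment` function and the two tiny vocabularies are hypothetical, written only for this illustration; a real reader (or a real program) relies on a far larger vocabulary and on context.

```python
def segment(text, vocabulary):
    """Split unspaced text into known words; return None if that is impossible."""
    # best[i] holds one way to split text[:i] into known words, or None
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                best[end] = best[start] + [word]
                break
    return best[len(text)]


# Hypothetical mini-vocabularies, just large enough for the examples in the text
english = {"thank", "you", "very", "much",
           "tokyo", "medical", "university", "hospital"}
japanese = {"東京", "医科", "大学", "病院"}

print(segment("thankyouverymuch", english))
print(segment("東京医科大学病院", japanese))
```

With these vocabularies, the first call splits the text into ‘thank’, ‘you’, ‘very’, ‘much’ and the second into 東京, 医科, 大学, 病院. Text made of words that are not in the vocabulary cannot be split at all, which is the situation of a reader facing unfamiliar words.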

Katakana is a different story because a Japanese reader may have no previous knowledge of the foreign words and so may be unable to parse the text. Hence, we often find that words in katakana are separated by a dot, ‘・’, which is the Japanese equivalent of the dash, ‘-’, in English. Reading the dot-separated katakana version of ‘Albert Einstein’ is similar to reading ‘albert-einstein’; this is easy because the text is already parsed, with dashes instead of spaces.

Writing systems as word separators

Everything in Japanese can be written entirely in either hiragana or katakana; either of them would suffice. Normally, however, nouns, adjectives, and the stems of verbs are written in kanji, while verb conjugations and particles are written in hiragana. Thus, given the absence of spaces, the mix of kanji and kana aids in parsing sentences because we roughly know what each writing system generally represents. For example:


| English | romaji | Japanese common form |
| --- | --- | --- |
| Mr. Tanaka drinks coffee. | tanaka san wa kouhii wo nomi-masu | 田中さんはコーヒーを飲みます |


In this sentence, the noun ‘Tanaka’ (田中) and the stem of the verb ‘drinks’ (飲) are written in kanji; the honorific ‘Mr.’ (さん), the topic marker (は) and object marker (を) particles, and the verb conjugation of ‘drinks’ (みます) are written in hiragana; and the foreign-origin word ‘coffee’ (コーヒー) is written in katakana. Hence, the writing systems cue the reader about the nature of the words. To make this clear, imagine that switching from one writing system to another is equivalent to capitalizing a letter in English; we can now see how this switching helps parse the sentence:


| A single writing style | Multiple writing systems |
| --- | --- |
| たなかさんはこうひいをのみます | 田中さんはコーヒーを飲みます |
| tanakasanwakouhiiwonomimasu | TanakaSanwaKouhiiWoNoMimasu |
| mrtanakadrinkscoffee | MrTanakaDrinksCoffee |


The parsing is not perfect, though. Using multiple writing systems does not separate ‘san’ and ‘wa’ into different words, and it splits ‘nomimasu’, which is a single word, into ‘no’ and ‘mimasu’. Thus, the reader still needs some knowledge of the language, but the text is now much easier to parse.
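
The cue given by switching writing systems can be sketched in a few lines of Python. This is only an illustration: the function names `script_of` and `split_by_script` are our own, and the Unicode ranges are a simplification that covers the common hiragana, katakana, and kanji blocks but not rarer characters.

```python
from itertools import groupby


def script_of(char):
    """Classify one character by its Unicode block (simplified)."""
    code = ord(char)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the long-vowel mark ー
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # the main block of kanji
        return "kanji"
    return "other"


def split_by_script(text):
    """Cut the text wherever the writing system changes."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]


for script, piece in split_by_script("田中さんはコーヒーを飲みます"):
    print(script, piece)
```

The pieces are 田中, さんは, コーヒー, を, 飲, and みます, the same boundaries as the capitalized romaji: ‘san’ and ‘wa’ stay glued together, and ‘nomimasu’ is cut in two.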

Multiple writing systems in a word

Consider the following newspaper article:

[Image: an annotated newspaper article. Caption: Various punctuation marks]

Usually the root of the verb is written in kanji and the conjugation in hiragana, as in 飲みます (nomimasu – to drink). Nouns are often written in kanji, but sometimes part of a noun is written in kanji and the rest in hiragana. In a red box in the article we find the word ‘kodomo’ (children): the first character of 子ども is the kanji 子 (ko), while the second and third characters are the hiragana ども (domo). So even though ‘children’ is a noun, the mix of kanji and kana here doesn’t help us identify the start or end of the word.

Although less common, a single word can also combine hiragana and katakana, or even kanji, hiragana, and katakana. For example, the word ‘keshigomu’ (eraser) combines the Japanese word ‘keshi’ (けし – to erase), in hiragana, and the foreign-origin word ‘gomu’ (ゴム – gum, or rubber), in katakana. Furthermore, since ‘keshi’ comes from a verb, we can also write it with a kanji and a hiragana: 消し. Thus, we can write ‘eraser’ in two ways. Roman letters have also found their way into a few Japanese words; e.g., we can write ‘T-shirt’ either entirely in katakana or with the roman letter T:


| English | romaji | Japanese |
| --- | --- | --- |
| eraser | keshi-gomu | けしゴム (hiragana + katakana) |
| eraser | keshi-gomu | 消しゴム (kanji + hiragana + katakana) |
| T-shirt | tiishatsu | ティーシャツ (katakana) |
| T-shirt | tiishatsu | Tシャツ (roman + katakana) |
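
To check which writing systems a word mixes, we can look at the official Unicode name of each character. The sketch below uses Python's standard `unicodedata` module; the function `scripts_in` and its list of name prefixes are our own simplification for this illustration, not a complete classifier.

```python
import unicodedata

# Prefix of the Unicode character name -> label used in this article
PREFIXES = [
    ("HIRAGANA", "hiragana"),
    ("KATAKANA", "katakana"),  # also matches the long-vowel mark ー
    ("CJK UNIFIED IDEOGRAPH", "kanji"),
    ("LATIN", "roman"),
]


def scripts_in(word):
    """Return the writing systems used in a word, in order of first appearance."""
    found = []
    for char in word:
        name = unicodedata.name(char, "")
        for prefix, script in PREFIXES:
            if name.startswith(prefix):
                if script not in found:
                    found.append(script)
                break
    return found


for word in ["けしゴム", "消しゴム", "Tシャツ"]:
    print(word, " + ".join(scripts_in(word)))
```

For these three words the function reports hiragana + katakana, kanji + hiragana + katakana, and roman + katakana, respectively.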


Punctuation marks

In addition to the use of kanji vs. kana for word parsing, Japanese uses punctuation marks that separate sentences and sentence fragments. As we can see in the annotated newspaper article, the following are the Japanese equivalents of some roman punctuation marks:

  1. periods: 。。。(little circles instead of dots)
  2. commas: 、、、(Japanese commas point forward, roman commas point backwards)
  3. single quotes: 「  」(type them with [ and ] when in hiragana mode)
  4. dash marks: ・・・ (for example, when writing down a telephone number)
  5. tilde: 〜 (as shown with the schedule-box, similar to the English ~)
  6. parentheses: ( ) (oriented in the direction of the text)
  7. brackets: [ ] (oriented in the direction of the text)
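
Because these marks are explicit, splitting on them needs no previous knowledge of the words. Here is a minimal Python sketch, assuming we only care about the period, the comma, and the dot; the function names are hypothetical, and the sample sentence is assembled from the examples in this article with a comma added for illustration.

```python
import re


def split_sentences(text):
    """Split on the Japanese period 。 and drop empty pieces."""
    return [s for s in text.split("。") if s]


def split_fragments(text):
    """Split on both the Japanese comma 、 and the period 。."""
    return [f for f in re.split("[、。]", text) if f]


def split_on_dot(text):
    """Split dot-separated katakana words."""
    return text.split("・")


sample = "田中さんは、コーヒーを飲みます。どうもありがとうございます。"
print(split_sentences(sample))
print(split_fragments(sample))
print(split_on_dot("アルバート・アインシュタイン"))
```

The sample splits into two sentences and three fragments, and the name splits into アルバート and アインシュタイン. Inside each fragment the words are still unseparated, so the reader's knowledge is needed from there on.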

Japanese doesn’t use hyphens to indicate that a word is split between two consecutive lines of text. For example, in the article, the word 大学生 (dai-gaku-sei – ‘college student’) is split between two columns, but no mark similar to a hyphen at the end of the first column indicates that the word continues in the second.