Japanese word parsing

Japanese doesn’t separate words with spaces. Instead, readers ‘parse’ a text into words either by noticing where the writing system changes from one word to the next, or by using punctuation marks.

Various punctuation marks

Writing systems as word separators

In Japanese, writing some words in kanji and others in kana often helps us figure out where a word starts or ends. However, a word is often written partly in kanji and partly in hiragana. For example, in the article, the first character of the word 子ども (kodomo, child) is the kanji 子 (ko), while the second and third characters are the hiragana ども (domo), so here the switch between kanji and kana doesn’t help to identify the start or end of the word. This mix of kanji and hiragana also happens with verbs, where the verb’s stem is usually written with a kanji and the suffix that conjugates it is in hiragana:

  • 食べる – eat (casual)
  • 食べます – eat (formal)
  • 食べない – don’t eat (casual)
  • 食べません – don’t eat (formal)

A single word can also be a combination of hiragana and katakana, or even kanji, hiragana, and katakana. For example, the word keshigomu (eraser) combines the Japanese word keshi (けし, from the verb ‘to erase’), written in hiragana, and the foreign word gomu (ゴム – gum or rubber), written in katakana. Furthermore, keshi, being a form of a Japanese verb, can be written with a kanji followed by a hiragana character: 消し. Thus, we will find the word ‘eraser’ written as either:

  • けしゴム (hiragana + katakana)
  • 消しゴム (kanji + hiragana + katakana)
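
The point that a change of writing system does not always mark a word boundary can be checked with a small program. The following is only a rough sketch in Python: the names script_of and script_runs are made up for this illustration, and the Unicode ranges are a simplification that covers the common hiragana, katakana and kanji characters but not every character used in Japanese text.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Classify one character by its Unicode code point (simplified ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:                   # hiragana letters
        return "hiragana"
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:  # katakana letters and the long-vowel mark
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:                   # CJK Unified Ideographs (most kanji)
        return "kanji"
    return "other"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a writing system."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]


for word in ("けしゴム", "消しゴム", "子ども"):
    print(word, script_runs(word))
```

Each of these is a single word, yet 消しゴム comes back as three runs (kanji, hiragana, katakana) and 子ども as two (kanji, hiragana), so splitting a text wherever the script changes would cut these words apart.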

Punctuation marks

Japanese does not use hyphens to show that a word continues on the next line or column. For example, in the article, the word 大学生 (daigakusei – ‘college student’) is split between two columns, but nothing in the first column indicates that the word is split and finishes in the second.

The following are the Japanese equivalents of some Roman punctuation marks, which do separate words:

  1. periods: 。。。(little circles instead of dots)
  2. commas: 、、、 (Japanese commas point forward, Roman commas point backward)
  3. single quotes: 「  」(type [ and ] when in hiragana mode)
  4. dash marks: ・・・ (for example, when writing down a telephone number)
  5. tilde: 〜 (as shown with the schedule-box, similar to the English ~)
  6. parentheses: ( ) (oriented in the direction of the text)
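
Because these marks always fall between words, they can be used to cut a text into smaller pieces. Here is a minimal sketch in Python, assuming only the marks listed above count as separators; the names SEPARATORS and split_on_punctuation are invented for this example, and the sample sentence is made up rather than taken from the article.

```python
import re

# The marks from the list above, written as escapes so look-alike characters are not confused.
SEPARATORS = (
    "\u3002"          # 。 period
    "\u3001"          # 、 comma
    "\u300c\u300d"    # 「 」 quotation marks
    "\u30fb"          # ・ middle dot
    "\u301c\uff5e"    # 〜 wave dash and the full-width tilde that resembles it
    "()\uff08\uff09"  # ( ) in ASCII and full-width forms
)
_SPLIT_RE = re.compile("[" + re.escape(SEPARATORS) + "]+")


def split_on_punctuation(text: str) -> list[str]:
    """Split text at the listed marks and drop empty pieces."""
    return [chunk for chunk in _SPLIT_RE.split(text) if chunk]


print(split_on_punctuation("今日は、「子ども」と大学生。"))
```

The pieces are only a starting point: a piece such as と大学生 still contains more than one word (the particle と and 大学生), so punctuation narrows down where words begin and end without finding every boundary.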