Session 4-3
Textual Encoding for Government-Designated Textbooks (Kokutei Tokuhon)
- Akihito Kawase (National Institute for Japanese Language and Linguistics)
- Toshinobu Ogiso (National Institute for Japanese Language and Linguistics)
This study examines possibilities for converting several textbooks to XML format, which will provide fundamental documents for analyzing readers written from the late Meiji period to the early Showa period. The documents selected for the analysis are the Kokutei Tokuhon, government-designated textbooks that were used in elementary schools through the late 1940s.
The Kokutei Tokuhon are the textbooks that the Japanese government (Ministry of Education) designated for national language education in elementary schools. They were repeatedly revised and republished, from the first edition in 1903 (Meiji 36) to the sixth and last edition in 1947 (Showa 22).
The revisions between editions are thought to mainly reflect the social influences and cultural awareness of their times, each with distinctive characteristics: (Edition 1) the sudden change in social life after the victory in the First Sino-Japanese War; (Edition 2) the change in social conditions and the spread of nationalism after the victory in the Russo-Japanese War; (Edition 3) the development of the education campaign and the child-centered educational philosophy after World War I; (Edition 4) nationalism after the Manchurian Incident; (Edition 5) strain under the Pacific War; (Edition 6) reform of the educational system with the establishment of the Fundamental Law of Education after World War II.
Therefore, describing the Kokutei Tokuhon in a machine-readable manner has profound significance, not only for providing resources for studying the fundamental lifestyle changes during the social modernization through the Meiji, Taisho, and Showa eras, but also for providing indispensable documents to clarify the flow and tendencies of national language education. It is expected that the output of this study will create a fundamental resource for supporting studies on the background of the establishment of the modern Japanese language.
Furthermore, since the National Institute for Japanese Language and Linguistics (NINJAL) has been conducting digitization and morphological analysis of Japanese classics, this study will help capture diachronic changes in the Japanese language by comparing morphological data from the Kokutei Tokuhon with data from those previous studies from a corpus-linguistic viewpoint.
Specifically, we encode the Kokutei Tokuhon as follows:
1. The fundamental text structure (elements from <text> to <s> level)
In the case of the Kokutei Tokuhon, the number of volumes and their content vary by edition, but the fundamental text structure is largely common across editions. Here, we treat each textbook as the smallest file unit and encode a total of 71 textbooks as <text> elements. Each <text> element is a combination of three parts: the front matter, the body, which contains text with illustrations, and the back matter.
Since this structure is well aligned with the composition of an orthodox Western manuscript, we can mark up most of the elements with reference to the TEI P5 Guidelines. Table 1 lists the elements used to mark up the textbooks.
Element | Role | Element | Role |
---|---|---|---|
<text> | whole text | <q> | quote |
<front> | front matter | <s> | sentence unit |
<body> | body matter | <rs> | referencing string |
<back> | back matter | <pb> | page break |
<div> | divisions | <cb> | column break |
<head> | title | <lb> | line break |
<p> | paragraph | #PCDATA | character data |
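As an illustration of the three-part structure described above, the following is a minimal sketch in Python using only the elements listed in Table 1. The sample content is a hypothetical placeholder, not text from the Kokutei Tokuhon, and the TEI namespace is omitted for brevity; the function name `sentences_by_part` is likewise an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one textbook encoded as a single <text> element.
# All content is placeholder text; the TEI namespace is omitted for brevity.
SAMPLE = """\
<text>
  <front><head>Title page (placeholder)</head></front>
  <body>
    <div>
      <head>Lesson 1 (placeholder)</head>
      <pb n="1"/>
      <p><s>First sentence.</s><lb/><s>Second sentence.</s></p>
    </div>
  </body>
  <back><p><s>Colophon (placeholder).</s></p></back>
</text>
"""


def sentences_by_part(xml_source):
    """Return {part name: [sentence text, ...]} for front, body and back."""
    root = ET.fromstring(xml_source)
    result = {}
    for part in ("front", "body", "back"):
        node = root.find(part)
        if node is None:
            result[part] = []
        else:
            # iter("s") visits every <s> descendant at any depth
            result[part] = ["".join(s.itertext()) for s in node.iter("s")]
    return result


parts = sentences_by_part(SAMPLE)
print(parts)
```

Grouping sentences by `<front>`, `<body>`, and `<back>` in this way would let front and back matter be included in or excluded from later analysis as needed.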
2. Elements from <s> to #PCDATA level
The following two issues remain in the process of encoding the Kokutei Tokuhon: (A) structuring ruby annotations and (B) editing prolonged sound marks, a Japanese symbol which indicates a long vowel (e.g. the symbol “ー” in おとーさん, which is pronounced otoh-san).
If we give priority to morphological analysis or to constructing a dictionary for modern Japanese, one solution to issues (A) and (B) might be to revise the original text below the sentence level using the TEI element <seg> (arbitrary segment), which represents any segmentation of text below the ‘chunk’ level.
However, to conduct string matching or structural comparisons based on the original text, both the corrected text (the optimal characters for morphological analysis) and the original appearance information must be preserved. Throughout the study, we would like to have constructive discussions and come up with practical suggestions for resolving these issues.
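One way to keep both forms side by side can be sketched as follows. This is an assumption for illustration, not a method proposed in this study: it uses the TEI P5 elements `<choice>`, `<orig>`, and `<reg>` to pair the printed form of a prolonged sound mark with a regularized form. The sample sentence and the helper `render` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence: <choice> pairs the printed form (<orig>) with a
# regularized form (<reg>) intended for morphological analysis.
SENTENCE = (
    "<s><choice><orig>おとーさん</orig><reg>おとうさん</reg></choice>"
    "が きました。</s>"
)


def render(elem, keep):
    """Flatten elem to a string, keeping only `keep` ('orig' or 'reg')
    inside each <choice> and all text elsewhere."""
    pieces = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            chosen = child.find(keep)
            if chosen is not None:
                pieces.append(render(chosen, keep))
        else:
            pieces.append(render(child, keep))
        # Text following the child element belongs to the parent
        pieces.append(child.tail or "")
    return "".join(pieces)


s = ET.fromstring(SENTENCE)
original = render(s, "orig")      # string as printed in the textbook
regularized = render(s, "reg")    # string for morphological analysis
print(original)
print(regularized)
```

With an encoding of this kind, string matching could run against the `original` reading while morphological analysis uses the `regularized` one, both derived from a single source file.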
- Keywords
- Kokutei Tokuhon, modern Japanese, XML