Module nltk.corpus.reader.sinica_treebank

Sinica Treebank Corpus Sample

http://rocling.iis.sinica.edu.tw/CKIP/engversion/treebank.htm

10,000 parsed sentences, drawn from the Academia Sinica Balanced Corpus of Modern Chinese. Parse tree notation is based on Information-based Case Grammar. Tagset documentation is available at http://www.sinica.edu.tw/SinicaCorpus/modern_e_wordtype.html

Language and Knowledge Processing Group, Institute of Information Science, Academia Sinica

The corpus sample is distributed with the Natural Language Toolkit under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike License [http://creativecommons.org/licenses/by-nc-sa/2.5/].

References:

Feng-Yi Chen, Pi-Fang Tsai, Keh-Jiann Chen, and Chu-Ren Huang (1999) The Construction of Sinica Treebank. Computational Linguistics and Chinese Language Processing, 4, pp 87-104.

Chu-Ren Huang, Feng-Yi Chen, Keh-Jiann Chen, Zhao-Ming Gao, and Kuang-Yu Chen (2000) Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface. Proceedings of 2nd Chinese Language Processing Workshop, Association for Computational Linguistics.

Keh-Jiann Chen and Yu-Ming Hsieh (2004) Chinese Treebanks and Grammar Extraction. Proceedings of IJCNLP-04, pp 560-565.

Classes

SinicaTreebankCorpusReader
Reader for the Sinica Treebank.

Variables

IDENTIFIER = re.compile(r'^#\S+\s')

APPENDIX = re.compile(r'(?<=\))#.*$')
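
As a rough illustration of how IDENTIFIER and APPENDIX might be applied, the sketch below strips the leading identifier and the trailing appendix from a single line. The sample line is hypothetical, written only to resemble the treebank notation, and the helper name strip_wrapping is an assumption for this example, not part of the module.

```python
import re

# Patterns copied from the module variables listed above.
IDENTIFIER = re.compile(r'^#\S+\s')
APPENDIX = re.compile(r'(?<=\))#.*$')


def strip_wrapping(line):
    """Hypothetical helper: drop the '#id ' prefix and the '#...' suffix."""
    # Removes a leading '#', the run of non-space characters after it,
    # and the single whitespace character that follows.
    line = IDENTIFIER.sub('', line)
    # Removes '#' and everything after it, but only when that '#'
    # directly follows a closing parenthesis.
    line = APPENDIX.sub('', line)
    return line


# Hypothetical line, not taken from the corpus.
sample = '#1:1.[0] NP(property:Nab:tea|Head:Nab:cup)#'
print(strip_wrapping(sample))
```

The lookbehind in APPENDIX means a '#' is treated as the start of the appendix only when it immediately follows ')', so the '#' that opens the identifier is left for IDENTIFIER to handle.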

TAGWORD = re.compile(r':([^:\|]+):([^:\|]+)')

WORD = re.compile(r':[^:\|]+:([^:\|]+)')
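
To make the capture groups of TAGWORD and WORD concrete, here is a minimal sketch that runs both patterns over a hypothetical, bracket-free fragment of role:tag:word fields joined by '|'. The fragment and the (word, tag) ordering of the result are assumptions made for illustration, not a description of how the reader class uses these patterns.

```python
import re

# Patterns copied from the module variables listed above.
TAGWORD = re.compile(r':([^:\|]+):([^:\|]+)')
WORD = re.compile(r':[^:\|]+:([^:\|]+)')

# Hypothetical fragment, not taken from the corpus.
fragment = 'property:Nab:tea|Head:Nab:cup'

# TAGWORD has two groups, so findall returns (tag, word) tuples.
tag_word_pairs = TAGWORD.findall(fragment)

# WORD has one group, so findall returns the words alone.
words = WORD.findall(fragment)

# Swap each pair into (word, tag) order (an illustrative choice).
tagged = [(word, tag) for tag, word in tag_word_pairs]

print(tagged)
print(words)
```

Both character classes exclude only ':' and '|', so any other character adjacent to a word, such as a closing parenthesis, would be captured along with it; the fragment above avoids that case.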