Frequency lists in nihongoresources

Saturday 27th of September 2008, 05:57:40 am

Japanese word frequency lists are hard to come by. Or rather, nearly impossible. There is one list on the Monash Nihongo ftp Archive, but it's based on newspapers... and we all know the kind of language you find in newspapers isn't exactly what you find "in real life".

So instead I decided to just run my own frequency analysis, and make the resulting files available (I need them for my own nihongoresources work too). I hit up Perfect Dark, downloaded as many novels as it would let me, and ran those through ChaSen (a morphological analyser that chunks sentences into tagged words).
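For the curious, the per-novel analysis step is essentially one ChaSen call per file. A minimal sketch in Python, assuming chasen is on your PATH and was built with the classic EUC-JP encoding (adjust the decode if yours speaks UTF-8):

```python
import subprocess

def chasen_analyse(path):
    """Run ChaSen over one text file and return its raw analysis output.

    Each output line is a tab-separated record (surface form, reading,
    base form, POS, ...); sentences are terminated by an "EOS" line.
    """
    result = subprocess.run(["chasen", path], capture_output=True, check=True)
    return result.stdout.decode("euc-jp", errors="replace")
```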

The result got tallied in two ways: term frequency, and term base frequency.

1) Term frequency simply tallied all the distinct chunks, keyed on the word/pronunciation pair (both tallies are sketched in code after this list). This means that for words such as 十分, which are pronounced differently depending on what they mean (じゅうぶん, "sufficient", versus じゅっぷん, "ten minutes"), there will be two or more distinct entries. It also considers different verbal conjugations distinct entries: して and した are both forms of する, but term tallying counts them separately. For every distinct use, a term has a POS tag in its fourth 'column' (columns being tab-delimited).

2) Term base frequency is a more compact tally, which ignores all the different roles or pronunciations a word might have and just tallies the word's base form. This means that 十分 will be the combined tally of all its readings, and して, した, します, etc. will all count towards する's total. Every term that falls in multiple POS categories gets a comma-separated list in its POS 'column'.
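To make the two tallies concrete, here is a minimal sketch in Python, assuming ChaSen's default tab-separated output (surface form, reading, base form, POS, ...) with "EOS" lines marking sentence ends; the field order may differ if your ChaSen uses a custom output format:

```python
from collections import Counter, defaultdict

def tally(chasen_output):
    """Build the term and term-base frequency tables from raw ChaSen output."""
    term_freq = Counter()        # keyed on (surface, reading): "word pronunciation"
    base_freq = Counter()        # keyed on base form only
    base_pos = defaultdict(set)  # every POS category seen for each base form

    for line in chasen_output.splitlines():
        if line == "EOS" or not line.strip():
            continue  # sentence terminator or blank line
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed records
        surface, reading, base, pos = fields[:4]
        term_freq[(surface, reading)] += 1  # して and した stay distinct here...
        base_freq[base] += 1                # ...but both count towards する here
        base_pos[base].add(pos)

    return term_freq, base_freq, base_pos
```

The base list's POS 'column' then falls out as ",".join(sorted(base_pos[term])) for each term.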

A bit of statistics: the data preprocessing involved throwing away any file that wasn't .txt (rtf, pdf, etc. were not considered), as well as any .txt file that was only a handful of pages long; in practice, anything under 30 KB went out. This resulted in 1322 text files of varying length, from each of which I stripped the first and last 25 lines to make sure no "header" information got boosted. The resulting 1242 files were chunked, and the decompositions then tallied.
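The filtering and stripping is simple enough to sketch too; a minimal version, assuming one novel per .txt file in a single directory (the encoding is a guess, since the originals may well have been Shift-JIS or EUC-JP rather than UTF-8):

```python
from pathlib import Path

MIN_SIZE = 30 * 1024  # skip .txt files under 30 KB
STRIP = 25            # lines trimmed from each end to drop "header" text

def preprocess(directory):
    """Yield the body text of every sufficiently large .txt file."""
    for path in Path(directory).glob("*.txt"):
        if path.stat().st_size < MIN_SIZE:
            continue  # too short to be a novel
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        yield "\n".join(lines[STRIP:-STRIP])
```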

If you want to know which novels were used to compile these lists: too bad. I have no idea. Part of "grabbing them off Perfect Dark" means that in some countries this might be considered against the law (luckily, not mine). While the word frequencies ultimately don't allow the original texts to be reconstructed in any way, I figured it safest not to bother recording the titles of the novels I used. Nor did I find it important to keep them after I was done (I can think of better uses for 3 GB of space =).

Of course, for those who care, the two data files can be downloaded: click on the relevant link for either the term aggregates or base aggregates.

And if you end up using them for something interesting, drop me a line! I like interesting things, especially if I contributed to making them happen in some way ^_^