Some interesting statistics in nihongoresources
Friday 3rd of October 2008, 06:10:57 am
As some of you may know I'm working on ordering the Japanese kanji in such a way that they make sense to learn in that order. The jouyou set is a nice idea, but it intended to be taught in Japanese schools from the age of "young", and most people learning Japanese learn it when they are well in their teens or older.
This means that there is far more sense in ordering kanji in such a way that with every kanji or small set of kanji you learn, you can read a significantly larger portion of any Japanese that is tossed at you. For this purpose, I abstracted word frequency data - and consequently kanji frequency and importance - from a large collection of digital copies of Japanese novels.
This has brought to light some interesting statistics. For instance, do you know how many kanji you need to know to be able to read (on average) 50% of all the kanji in Japanese novel material?
Half the number of jouyou kanji you might think. that's 1945 divided by two. So, 970ish kanji roughly? No.
Okay... 750?
No.
500?
No, keep going lower.
400?
No seriously, go lower.
250??
No... lower still.
"wtf?"Yeah that's what I thought too. Here's the list of how many kanji you need to know in order to have covered which percentage of kanji-use (on average) in about 1300 novels:
- up to 10%: 11 kanji
- up to 20%: 20 more kanji: 31
- up to 30%: 33 more kanji: 64
- up to 40%: 50 more kanji: 114
- up to 50%: 74 more kanji: 188
- up to 60%: 103 more kanji: 291
- up to 70%: 148 more kanji: 439
- up to 80%: 222 more kanji: 661
- up to 90%: 348 more kanji: 1045
that's right, with only 1000 kanji you can read pretty much 90% of written Japanese novels. Now for the fun part. As is to be expected, the upper 10% is the problem, but how much of a problem:
- up to 95%: 377 more kanji: 1422
- up to 99%: 768 more kanji: 2190
that's only slightly more than the number of jouyou kanji to read 99% of kanji used Japanese novels. Now here's the scary part:
"Oh my god, Mike, that's horrible!" I hear you say. Perhaps... perhaps it is horrible. After all, 3773 kanji to learn? But then again, there is another interesting statistic to look at here. Not every kanji is used equally often (obviously), but what if we do a "cap off" criterium: what if we want to know how many kanji we need to learn so that we can read all kanji that are at least once in every book, on average?
- 0.9% more: 847 kanji, 3037 to get to 99.9%
- that final 0.1%: 736 kanji. grand total: 3373 kanji
Well then it turns out the jouyou number is not that crazy after all: you will get by just fine with a knowledge of roughly 1900 kanji, putting us at a roughly 97.8% kanji comprehension level.
So, how to go from here? how to order the kanji in such a way that it makes sense to learn them in that order? Well, that's where the "blue data" comes in. There no real reason I call it this, the "blue data" is a collection of data that says which kanji have how many strokes, which readings, and how they visually decompose in simpler kanji shapes. Combining this data allows the lists to be ordered on a few features:
- how many strokes are there in a kanji,
- how many subcomponents does the kanji consist of,
- how many words is this kanji used in (plus how important those words are), and finally:
- how simple is the kanji to learn, given the kanji we already learnt in this ordering
This will be a bit of a puzzle, and will yield equally interesting statistics on how many words, and which, you need to know to be able to understand what part of written Japanese.
I'll keep you posted!