Some more statistics in nihongoresources
Friday 3rd of October 2008, 07:49:19 am
In the previous post I talked about the distribution of kanji in novels, and how that affects the number of kanji you need to learn in order to comprehend a given share of the kanji used in novels.
Let's do the same with words and see where we end up. Of course, with words you need to be careful: if you know "the", "a" and "it", then you probably "understand" 10% of the English language, and it won't do you any good at all (although they most certainly are important words to learn).
Japanese has a similar problem. If you know the 'word' た, you "understand" a bit over 6% of the language. It also does you no good at all to just know 'た', so the first "percentages" are pretty much completely useless. For this purpose, I decided to run the word frequency/importance tests based on all the words found in around 1300 novels, minus punctuation and particles. Particles cover a whopping near-50% of all written Japanese, but only about 25 of them are really frequently used, so they're relatively pointless to include in a "how important are they to know" list: if you don't know particles, you cannot read Japanese, the end.
So that said, let's have a mosey through statistics lane, and see what 1300ish novels teach us about how important Japanese words are to the written language:
- up to 10% requires knowing 1 word
- up to 20% requires knowing 5 more words, for a total of 6
- up to 30% requires knowing 20 more words, total of 26
- up to 40% requires knowing 59 more words: 85
- up to 50%, 155 more words, 240 words total
- up to 60%, 376 more words, 616 words total
- up to 70%, 814 more words, 1430 words total
- up to 80%, 1911 more words, 3341 words total
And then things get funky.
- up to 90%, 5341 more words, 8682 words total
- up to 95%, 7495 more words, 16177 words total
And then things get
- up to 99%, 21574 more words, 37751 words total
- up to 99.9%, 29012 more words, 66763 words total
- all of it, 22551 more words, 89314 words total
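For the curious, a list like the one above can be computed roughly as follows. This is a minimal sketch, not the script actually used for these numbers: the function name `words_needed` and the toy counts are made up for illustration, and it assumes "up to N%" means "the most frequent words whose combined occurrences reach at least N% of all word occurrences".

```python
from collections import Counter

def words_needed(counts, thresholds):
    """Map each coverage threshold (0..1) to the number of top-ranked
    words whose combined occurrences reach at least that share."""
    total = sum(counts.values())
    targets = sorted(thresholds)
    needed = {}
    covered = 0
    t = 0
    # Walk the words from most to least frequent, accumulating coverage.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        # One word can push us past several thresholds at once.
        while t < len(targets) and covered >= targets[t] * total:
            needed[targets[t]] = rank
            t += 1
    return needed

# Toy data, invented for illustration (not the real 1300-novel corpus).
toy = Counter({"する": 4, "言う": 2, "猫": 1, "稀": 1})
print(words_needed(toy, [0.5, 0.75, 1.0]))
```

With real data you would feed it the word counts for the whole corpus, after stripping punctuation and particles, and ask for thresholds like 0.1, 0.2, ... 0.999.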
What does this tell us?
Frankly, not a whole lot. This data is less useful than the kanji data presented in the previous post, because 'words' are hard to classify: they comprise quite a few word classes, not all of which contribute equally to a mastery of the language.
Japanese is a verb-oriented language, in that sentences consisting of a single verb are contextually "padded" by a listener or reader to make sense. If someone asks me whether I read a particular book and I say "read" in Japanese, then they will understand that to mean "yes, I read the book you mention".
However, for a verb to make sense, it typically needs to have been linked to a nominal of some sort at some point.
Okay, so verbs and nouns, restrict the list to those two... but hang on, adjectives are important too. Sure, "car" is a more essential concept than "black car" or "old car", but adjectives are important to know if you want to do more than just babble like a child. In fact, one could even argue that without adjectives, one could never properly refer to something, while with adjectives all you need is the generic nominals "thing" and "concept" (for tangibles and intangibles) and then as long as you have enough adjectives, you can describe whatever thing or concept you mean.
So things are tricky. What do these word statistics tell us? Foremost, they tell us we need to make a choice. At which point is "partial coverage" enough to get the gist of a text? Opinions vary, but I'm going to take a bold step and say "80% minimum for any kind of meaningful understanding". And 80% minimum means that in a sentence, you will not understand a word for every four you do understand. That's pretty limited, but at least you should be able to get the gist. Sort of.
A better value would be 95%, which means that in every 20 words you read, there will be one word you don't know. Now you can read most sentences and only need to reach for a dictionary a few times, although you will probably understand what the word meant if you keep reading, using context to fill in the blanks - like we all do when we read texts in our native language(s).
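The arithmetic behind "one unknown word in every N" is simple enough to sketch. The function name `words_per_unknown` is made up for illustration, and it assumes unknown words are spread evenly through a text, which in a real novel they won't be.

```python
def words_per_unknown(coverage_percent):
    """With coverage_percent of the running words known,
    one word in every this-many words is unknown."""
    return 100 / (100 - coverage_percent)

for pct in (80, 95, 99):
    print(pct, words_per_unknown(pct))
```

So 80% coverage means one unknown word in every five (one for every four you do know), and 95% means one in every twenty.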
So that puts us at a base vocabulary of around sixteen thousand words. "Holy crap, Mike, that's a lot!" I can hear you say. Perhaps... perhaps it sounds like a lot but really, it isn't. The average English speaker (and the average English speaker is not going to learn a second language, so your count will be higher) knows around twelve thousand words. If you've enjoyed a college/university education, then that number goes up to around seventeen thousand.
Does this mean that a university-level educated English speaker needs to learn a vocabulary the size of their native vocabulary?
No.
Statistics is a tricky thing, and we have to be really careful with what we conclude from what are just numbers. Although sixteen thousand certainly sounds high, let's again use a "cap", and see how many words we are left with when we want words that, on average, are found at least twice per novel. This gives us a completely different number: only around 2000 words.
If we up the count to words that occur, on average, at least once per novel (to bring it in line with our criterion for useful kanji), then we get a vocabulary of close to 3800.
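This kind of "cap" is easy to sketch too. The function name `core_vocabulary` and the toy counts below are hypothetical, and the sketch assumes "on average at least N times per novel" means "total occurrences across the corpus divided by the number of novels is at least N".

```python
def core_vocabulary(counts, num_novels, min_per_novel):
    """Return the words that occur, on average, at least
    min_per_novel times per novel, most frequent first."""
    cutoff = min_per_novel * num_novels
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [word for word, count in ranked if count >= cutoff]

# Toy counts, invented for illustration (not the real corpus data).
toy_counts = {"見る": 3000, "猫": 1300, "稀": 12}
print(core_vocabulary(toy_counts, num_novels=1300, min_per_novel=1))
print(core_vocabulary(toy_counts, num_novels=1300, min_per_novel=2))
```

Note that an average says nothing about spread: a word that shows up thousands of times in a handful of novels and never anywhere else would still pass this cap.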
That's not so bad, really. Do a mere 10 words a day, keep it up for only a year, and you're pretty much done.
That's right: one year of doing 10 words and 6 kanji a day, and you're done with the language... how incredible does that sound?
You're right, "too incredible", because you also need to learn the patterns of grammar that tell you how to interpret sentences, and the patterns of Japanese culture that tell you how to interpret the interplay between certain words, not to mention learning to recognise idiomatic expressions... but the previous bit of fact stands: you can learn to read and write Japanese in a year, with frankly fairly little exercise a day. Half an hour to an hour a day to learn a new language in a year?
I should charge big bucks for that approach, sir =P