Writing a shoutbox

Thursday 22nd of July 2010, 09:55:58 pm

Because I'm migrating some of the functionality of nihongoresources to a new domain, dictionaries.nihongoresources.com, I figured this was a good opportunity to let people voice their opinion in a lower-barrier, more direct way. Shoutbox style.

I tried a few premade ones, but they all had their own issues, so I figured I have enough technical skill, why not write one myself? After all, how hard can it be? This not being Top Gear, the answer is "not really that hard": you just need to make sure to do things in such a way that other people will be able to customise it easily. Because that means you'll be able to as well.

The result is a pretty simple PHP SQLITE backend / JS CSS frontend shoutbox that uses the remarkably simple on-page code "<div id='shoutbox'></div>" and that's all you have to do (well, and load the shoutbox.js script of course).

It works well enough for me. To see an example in action, just head over to the aforementioned dictionaries.nihongoresources.com!

Download: here

0 comments - view/add comments

Where's that sequence, Poms?!

Sunday 26th of October 2008, 01:18:20 pm

To answer that question: intangibly located in hypothesis space. Basically, decomposing kanji is the single most boring job I can think of doing, and I had to do a lot of it.

4,299 glyphs later, and the decomposition job is done. At least, you'd think, but secretly it turns out this is not the case. Yes, it is interesting data and I'd love to see it turn out to be valuable to other people too, but the bottom line is this: I finished manually decomposing over four thousand glyphs. This data will have mistakes.

And it won't just have the silly typo mistakes, it probably has some of those really nasty, very-hard-to-track errors: all the right subcomponents, all the right decomposition markers, and then not quite in the right order. Other than running through each kanji/glyph all over again, I am not going to catch this myself.

So a request: got free time? Need a bit of mind numbing?

HELP OUT!

To see what I'm talking about I've put the "indigo" data (which is what the decomposition data is known as, on my computers) online for downloading - note that while released under a creative commons license (v3.0, by/nc), this file is considered a "beta" version of sorts.

I can probably use the data as it is now to sequence the kanji in a pretty useful way, and the errors that *are* left in it are unlikely to have a massive impact on the sequencing (and of course the later in the sequence, the less of an impact), but I'd be far happier if I can guarantee - with help from some other people =) - that this data is correct.

You know where to contact me! =D

On a functional note: the download is a zip file with three files in it:

So, enjoy, and I hope to post the sequenced kanji series soon (I am in the process of moving at the moment too, so that might frustrate a timely delivery a bit more. Have to be out of my current house before November >.>;)

- Pomax, signing off for the day

4 comments - view/add comments

Done with kanji decomposing

Saturday 18th of October 2008, 05:47:43 am

The title might be misleading - I am done decomposing all 3774 kanji that rolled out of the data that I discussed in my earlier post. That still leaves a small task of adding in the descriptor characters to indicate *how* the decompositions form the original characters, and then seeing which of the characters that were used in the decompositions don't actually exist in the original data set, so that I need to treat those too (I expect this to be only a few hundred at most).

As for the progress: I have added descriptor characters to all kanji up to stroke count 10, which means about a third of the kanji are "done", with about two thirds still waiting to be "described". This is a much quicker process than writing up decompositions, so I expect this to take no more than until next weekend and we're good to go for the real task:

Ordering the kanji based on which are most important to the language, optimised compositionally, so that you get to learn kanji that you will see used all over the place, before you learn ones that are rare, in such a way that you will mostly learn new kanji that use kanji you already know as components.

Sounds good? It does to me. Never did like the jouyou ordering >.>

2 comments - view/add comments

Pomax <3 Ideographic Description Characters

Saturday 11th of October 2008, 10:19:01 am

Last time I made the "blue file", a file describing which kanji is decomposed (visually) in what way, I did not have the knowledge that I do now:

There is an Ideographic Description Characters block in the unicode 5.0 planes, which means that describing kanji has become a lot less problematic!

What are they? Well, if your font supports it, they look like this: ⿰, ⿱, ⿲, ⿳, ⿴, ⿵, ⿶, ⿷, ⿸, ⿹, ⿺ and ⿻. These can then be used to describe kanji fairly easily. Say I want to describe 寺, I could write ⿱土寸 and Bob's my uncle - the description is complete.

This is great! Except of course for one thing: overlapping kanji. How do you describe 丸? That's what ⿻ is for, which indicates an overlap of components: ⿻九ヽ works, although that doesn't quite tell you where the overlap is, so I use ⿻⿱九ヽ to indicate there is an overlap, and broadly it can be considered a vertical split with "nothing" and "a tick mark". This is not ideal, but certainly better than using brackets and parentheses.
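For anyone who wants to process descriptions like these by machine, here is a minimal Python sketch of a parser. It assumes the standard arities (⿲ and ⿳ combine three components, the other description characters combine two); the function name `parse_ids` and the nested-tuple output are my own choices for illustration, not part of any existing tool.

```python
# Arity of each Ideographic Description Character (U+2FF0..U+2FFB):
# U+2FF2 and U+2FF3 combine three components, all the others combine two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY[chr(0x2FF2)] = 3
IDC_ARITY[chr(0x2FF3)] = 3


def parse_ids(sequence):
    """Parse a description sequence into a nested (operator, parts) tree."""
    def parse_at(pos):
        if pos >= len(sequence):
            raise ValueError("description ends too early")
        char = sequence[pos]
        if char not in IDC_ARITY:
            return char, pos + 1  # a plain component, not an operator
        parts = []
        pos += 1
        for _ in range(IDC_ARITY[char]):
            part, pos = parse_at(pos)
            parts.append(part)
        return (char, parts), pos

    tree, end = parse_at(0)
    if end != len(sequence):
        raise ValueError("trailing characters after a complete description")
    return tree


print(parse_ids("⿱土寸"))  # ('⿱', ['土', '寸'])
```

Note that under these arities the ⿻⿱九ヽ shorthand is not a complete sequence (⿻ is left waiting for a second component), so this sketch raises a ValueError for it. A parser for the data as described in this post would need a special case for that notation.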

That only leaves my woe that there is no "commonly used ideographs" block, that simply contains all the graphemes used in kanji/hanzi. It gets really annoying having to describe 艮 without that ′ at the bottom right.

0 comments - view/add comments

CJK Unified Ideographs Extension B

Thursday 9th of October 2008, 03:24:26 pm

It is my serious conjecture you probably never bothered looking through the full set of CJK unicode blocks. I cannot fault you on that, there are better ways to spend your time. Playing a video game, thinking up a lolcat caption, maybe playing 'doctor' with the ol' girlfriend...

If you're not doing any of those things, and you happen to be working on graphical decompositions of kanji, then I believe it is safe to say you're likely going to be running into this block. Now, a word of warning.

   IT IS HUGE

Yes, the regular Unified Ideographs block is pretty big. It has about 20,000 characters in there, but for the most part, these will be characters that you can go "okay... I can see how that happens" to.

Not so with the scary Extension B block. That shit be crazy. Is your browser UTF8 supportive? You sure? Got a good unicode font installed? No I mean a good unicode font, not something wimpy like Arial Unicode, or even the respectable Code2000/Code2001 combination... Okay here we go:

   𠙴 𠤬 𡤼 𡰣 𡿧 𣁬 𣱱 𤕪 𥫘 𩇨

What? What is up with these? Missing strokes? Mirror images? Too many strokes? Wait, there's more:

   𡆢 𡦹 𢀓

What ARE these? O_o

No, no, wait for it, there is more... Oh there is more:

   𡕜 𡬞

Does that seem "too many" for you? No? Yeah I didn't think so, I've been saving this just in case you weren't weirded out yet...

   𦣩 - symmetry in writing...

   𤴐 - have some boxes.

   𣡽 ... okay I just... I just don't know. HOW MANY TREES DO YOU NEED?!

Extension B block is just *filled* with this stuff: there are about twice as many Chinese characters in the Extension B block as there are in the *actually used* regular Unified Ideographs block. I'm serious, there are over 40,000 characters in that block that you are going to be going "wth?" at.

Still here? Then go download Babelmap, make sure you have plenty of font bindings installed (bitstream cyberbit, code 2000, maybe a bit of simsun-ExtB... or a lot of it) and happy viewing.

Maddening... why was I using this block, you ask? Well, annoyingly the ideograph system in unicode is not based on "combining graphemes" like it is for things like "è", which can be written as a unicode sequence of the letter 'e' followed by the 'grave' diacritic mark.

Instead, all Chinese characters have been given their own space... And not all of them were in the "regular" block, so it is not unusual to look at a kanji - say, a simple one like 是 - and then have to trawl through the CJK Unified Ideographs block, then the CJK Unified Ideographs Extension A block, and then the CJK Unified Ideographs Extension B block just to find 𤴓 (it's in B... of course it is).
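Both points can be checked with a few lines of Python. This is only a sketch: the helper name `cjk_block` is made up for illustration, and the block list covers just the three blocks mentioned above (ranges taken from the Unicode code charts), so anything else comes back as None.

```python
import unicodedata

# First and last code point of the three blocks discussed in this post.
CJK_BLOCKS = [
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
]


def cjk_block(char):
    """Return the name of the CJK block a character sits in, or None."""
    code_point = ord(char)
    for name, first, last in CJK_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


# A letter with a diacritic can be split into base letter + combining mark...
decomposed = unicodedata.normalize("NFD", "è")
print([unicodedata.name(c) for c in decomposed])

# ...but an ideograph is always a single code point in one of the blocks.
print(cjk_block("是"))
print(cjk_block("𤴓"))
```

This saves the trawling once you have the character itself, but of course it does not help you find a glyph you cannot type yet.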

I'm doing this mostly for me, but it makes coping easier if I pretend I'm doing this for you >.>

5 comments - view/add comments

10,000 hits a day O_O!

Monday 6th of October 2008, 12:59:33 pm

Well, this is something novel.

When I started out with Nihongoresources I was getting 300 hits a week, and that was awesome. Then it became 3000 a week, and that was pretty impressive. Then I reached 100,000 a month and I was duly impressed, but now it's getting slightly insane. Nihongoresources just passed the "10,000 page requests a day" mark!

A DAY! O_O!!

No, I don't make any kind of money off of it (donations go to the site, buying Japanese dictionaries, books, etc, paying for hosting). Wish I was, but then I'd need to turn it into a business model somehow and I don't know if I can (or want to) do so without destroying its "it's free" aspect.

We'll see how well the new kanji ordering approach will work out, and the book revision.

Damn, 10,000 hits a day (and proper hits, not just page hits but unique page requests)

Just... you monster.

1 comment - view/add comments

Some more statistics

Friday 3rd of October 2008, 07:49:19 am

In the previous post I talked about the distribution of kanji in novels, and how that affects the number of kanji you need to learn in order to comprehend a given share of the kanji used in novels.

Let's do the same with words and see where we end up. Of course, with words you need to be careful: if you know "the", "a" and "it", then you probably understand 10% of the English language and it won't do you any good at all (although they most certainly are important words to learn).

Japanese has a similar problem. If you know the 'word' た, you "understand" a bit over 6% of the language. It also does you no good at all to just know 'た', so the first "percentages" are pretty much completely useless. For this purpose, I decided to run the word frequency/importance tests based on all the words found in around 1300 novels, minus interpunction and particles - particles cover a whopping near 50% of all written Japanese, but comprise only about 25 really frequently used ones, so they're relatively pointless to include in a "how important are they to know" list: if you don't know particles, you cannot read Japanese, the end.

So that said, let's have a mosey through statistics lane, and see what 1300ish novels teach us about how important Japanese words are to the written language:

  • up to 10% requires knowing 1 word
  • up to 20% requires knowing 5 more words, for a total of 6
  • up to 30% requires knowing 20 more words, total of 26
  • up to 40% requires knowing 59 more words: 85
  • up to 50%, 155 more words, 240 words total
  • up to 60%, 376 more words, 616 words total
  • up to 70%, 814 more words, 1430 words total
  • up to 80%, 1911 more words, 3341 words total

And then things get funky.

  • up to 90%, 5341 more words, 8682 words total
  • up to 95%, 7495 more words, 16177 words total

And then things get really funky:

  • up to 99%, 21574 more words, 37751 words total
  • up to 99.9%, 29012 more words, 66763 words total
  • all of it, 22551 more words, 89314 words total
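For the curious, a list like the one above can be produced from any word tally with a running total. The snippet below is a minimal sketch of that idea, not the script that produced these numbers: the function name `words_needed` and the toy counts are made up, and targets are given as whole percentages so the comparison stays in exact integer arithmetic.

```python
def words_needed(frequencies, targets):
    """Map each target percentage to the number of most-frequent words
    needed before their combined count reaches that share of the total."""
    counts = sorted(frequencies.values(), reverse=True)
    total = sum(counts)
    remaining = sorted(targets)
    needed = {}
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        # One word can push the running total past several targets at once.
        while remaining and running * 100 >= remaining[0] * total:
            needed[remaining.pop(0)] = rank
    return needed


# Toy tally: ten occurrences in total, spread over three words.
toy_tally = {"a": 5, "b": 3, "c": 2}
print(words_needed(toy_tally, [50, 80, 100]))  # {50: 1, 80: 2, 100: 3}
```

The "more words" column is then just the difference between two consecutive totals.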

What does this tell us?

Frankly, not a whole lot. This data is less useful than the kanji data presented in the previous post, because 'words' are hard to classify, because they comprise quite a few word classes, not all of which contribute equally to a mastery of the language.

Japanese is a verb oriented language, in that sentences consisting of a single verb are contextually "padded" by a listener or reader to make sense. If someone asks me whether I read a particular book and I say "read" in Japanese, then they will understand that to mean "yes, I read the book you mention".

However, for a verb to make sense, it typically needs to have been linked to a nominal of some sort at some point.

Okay, so verbs and nouns, restrict the list to those two... but hang on, adjectives are important too. Sure, "car" is a more essential concept than "black car" or "old car", but they are important to know if you want to do more than just babble childlike. In fact, one could even argue that without adjectives, one could never properly refer to something, while with adjectives all you need is the generic nominals "thing" and "concept" (for tangibles and intangibles) and then as long as you have enough adjectives, you can describe whatever thing or concept you mean.

So things are tricky. What do these word statistics tell us? Foremost, they tell us we need to make a choice. At which point is "partial coverage" enough to get the gist of a text? Opinions vary, but I'm going to take a bold step and say "80% minimum for any kind of meaningful understanding". And 80% minimum means that in a sentence, you will not understand a word for every four you do understand. That's pretty limited, but at least you should be able to get the gist. Sort of.

A better value would be 95%, which means that in every 20 words you read, there will be one word you don't know. Now you can read most sentences and only need to reach for a dictionary a few times, although you will probably understand what the word meant if you keep reading, using context to fill in the blanks - like we all do when we read texts in our native language(s).

So that puts us at a base vocabulary of around sixteen thousand words. "Holy crap, Mike, that's a lot!" I can hear you say. Perhaps... perhaps it sounds like a lot but really, it isn't. The average English speaker (and the average English speaker is not going to learn a second language, so your count will be higher) knows around twelve thousand words. If you've enjoyed a college/university education, then that number goes up to around seventeen thousand.

Does this mean that a university-level educated English speaker needs to learn a vocabulary the size of their native vocabulary?

No.

Statistics is a tricky thing, and we have to be really careful with what we conclude from what are just numbers. Although sixteen thousand certainly sounds high, let's again use a "cap", and see how many words we are left with when we want words that, on average, are found at least twice per novel. This gives us a completely different number: only around 2000 words.

If we up the count to words that occur, on average, at least once per novel (to bring it in line with our criterion for useful kanji), then we get a vocabulary of close to 3800.
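The "cap" itself is a one-line filter once you have the tally. The sketch below reads "on average at least N times per novel" as total occurrences divided by the number of novels, which is an assumption on my part about how to phrase it in code; the function name and the toy numbers are made up for illustration.

```python
def frequent_terms(total_counts, novel_count, min_per_novel=1.0):
    """Keep the terms whose average number of occurrences per novel
    (total occurrences / number of novels) meets the threshold."""
    return [term for term, count in total_counts.items()
            if count / novel_count >= min_per_novel]


# Toy tally over 1300 novels; the counts are invented.
toy_counts = {"a": 2600, "b": 1300, "c": 10}
print(frequent_terms(toy_counts, 1300))       # ['a', 'b']
print(frequent_terms(toy_counts, 1300, 2.0))  # ['a']
```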

That's not so bad, really. Do a mere 10 words a day, keep it up for only a year, and you're pretty much done.

That's right: one year of doing 10 words and 6 kanji a day, and you're done with the language... how incredible does that sound?

You're right, "too incredible", because you also need to learn the patterns of grammar that tell you how to interpret sentences, and the patterns of Japanese culture that tell you how to interpret the interplay between certain words, not to mention learning to recognise idiomatic expressions... but the previous bit of fact stands: you can learn to read and write Japanese in a year, with frankly fairly little exercise a day. Half an hour to an hour a day to learn a new language in a year?

I should charge big bucks for that approach, sir =P

0 comments - view/add comments

Some interesting statistics

Friday 3rd of October 2008, 06:10:57 am

As some of you may know I'm working on ordering the Japanese kanji in such a way that they make sense to learn in that order. The jouyou set is a nice idea, but it is intended to be taught in Japanese schools from the age of "young", and most people learning Japanese learn it when they are well in their teens or older.

This means that there is far more sense in ordering kanji in such a way that with every kanji or small set of kanji you learn, you can read a significantly larger portion of any Japanese that is tossed at you. For this purpose, I abstracted word frequency data - and consequently kanji frequency and importance - from a large collection of digital copies of Japanese novels.
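If you just want a rough kanji tally of your own, counting characters directly gets you most of the way. Note that this is a simpler technique than deriving kanji frequency from word frequency data, and that the sketch below only looks at the regular CJK Unified Ideographs block; both are simplifying assumptions, and the sample sentences are made up.

```python
from collections import Counter


def is_kanji(char):
    # Assumption for brevity: only the regular CJK Unified Ideographs block.
    return 0x4E00 <= ord(char) <= 0x9FFF


def kanji_frequencies(texts):
    """Tally every kanji across a collection of text strings."""
    tally = Counter()
    for text in texts:
        tally.update(char for char in text if is_kanji(char))
    return tally


sample = ["日本語の本を読む", "本を読んだ"]
print(kanji_frequencies(sample).most_common(2))  # [('本', 3), ('読', 2)]
```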

This has brought to light some interesting statistics. For instance, do you know how many kanji you need to know to be able to read (on average) 50% of all the kanji in Japanese novel material?

Half the number of jouyou kanji, you might think. That's 1945 divided by two. So, roughly 970ish kanji? No.

Okay... 750?

  No.

500?

  No, keep going lower.

400?

  No seriously, go lower.

250??

  No... lower still.

"wtf?"

Yeah that's what I thought too. Here's the list of how many kanji you need to know in order to have covered which percentage of kanji-use (on average) in about 1300 novels:

  • up to 10%: 11 kanji
  • up to 20%: 20 more kanji: 31
  • up to 30%: 33 more kanji: 64
  • up to 40%: 50 more kanji: 114
  • up to 50%: 74 more kanji: 188
  • up to 60%: 103 more kanji: 291
  • up to 70%: 148 more kanji: 439
  • up to 80%: 222 more kanji: 661
  • up to 90%: 384 more kanji: 1045

That's right, with only about 1000 kanji you can read pretty much 90% of written Japanese novels. Now for the fun part. As is to be expected, the upper 10% is the problem, but how much of a problem?

  • up to 95%: 377 more kanji: 1422
  • up to 99%: 768 more kanji: 2190

That's only slightly more than the number of jouyou kanji to read 99% of the kanji used in Japanese novels. Now here's the scary part:

  • 0.9% more: 847 kanji, 3037 to get to 99.9%
  • that final 0.1%: 736 kanji. grand total: 3773 kanji

"Oh my god, Mike, that's horrible!" I hear you say. Perhaps... perhaps it is horrible. After all, 3773 kanji to learn? But then again, there is another interesting statistic to look at here. Not every kanji is used equally often (obviously), but what if we apply a "cap off" criterion: what if we want to know how many kanji we need to learn so that we can read all kanji that are used at least once in every book, on average?

Well then it turns out the jouyou number is not that crazy after all: you will get by just fine with a knowledge of roughly 1900 kanji, putting us at a roughly 97.8% kanji comprehension level.

So, how to go from here? How to order the kanji in such a way that it makes sense to learn them in that order? Well, that's where the "blue data" comes in. There is no real reason I call it this; the "blue data" is a collection of data that says which kanji have how many strokes, which readings, and how they visually decompose in simpler kanji shapes. Combining this data allows the lists to be ordered on a few features:

  1. how many strokes are there in a kanji,
  2. how many subcomponents does the kanji consist of,
  3. how many words is this kanji used in (plus how important those words are), and finally:
  4. how simple is the kanji to learn, given the kanji we already learnt in this ordering
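To make the four features a bit more concrete, here is one possible way they could be combined: a greedy ordering that always picks the kanji with the fewest not-yet-learnt components, breaking ties by frequency and then by stroke count. This is only a sketch of the idea, not the actual ordering algorithm, and the frequency figures in the example are invented purely for illustration.

```python
def order_kanji(entries):
    """entries maps a kanji to a dict with 'strokes', 'components' and
    'frequency'. Returns the kanji in a greedy learning order."""
    known = set()
    remaining = dict(entries)
    ordering = []
    while remaining:
        def score(kanji):
            info = remaining[kanji]
            unknown = sum(1 for part in info["components"]
                          if part != kanji and part not in known)
            # Fewer unknown parts first, then more frequent, then simpler.
            return (unknown, -info["frequency"], info["strokes"], kanji)

        best = min(remaining, key=score)
        ordering.append(best)
        known.add(best)
        del remaining[best]
    return ordering


# Hypothetical data: the frequencies are made up for this example.
toy_entries = {
    "土": {"strokes": 3, "components": [], "frequency": 50},
    "寸": {"strokes": 3, "components": [], "frequency": 10},
    "日": {"strokes": 4, "components": [], "frequency": 90},
    "寺": {"strokes": 6, "components": ["土", "寸"], "frequency": 40},
    "時": {"strokes": 10, "components": ["日", "寺"], "frequency": 100},
}
print(order_kanji(toy_entries))  # ['日', '土', '寸', '寺', '時']
```

Even though 時 has the highest (made up) frequency here, it ends up last because it builds on 寺, which in turn builds on 土 and 寸. How heavily to weigh frequency against "already know the parts" is exactly the puzzle.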

This will be a bit of a puzzle, and will yield equally interesting statistics on how many words, and which, you need to know to be able to understand what part of written Japanese.

I'll keep you posted!

13 comments - view/add comments

Frequency lists

Saturday 27th of September 2008, 05:57:40 am

Japanese word frequency lists are hard to come by. Or rather, impossible. There is one list on the Monash Nihongo ftp Archive, but it's based on newspapers... and we all know the kind of language you find in newspapers isn't exactly what you find "in real life".

So instead I decided to just run my own frequency analysis, and make those files available (I need them for my own nihongoresources work too). I hit up Perfect Dark, downloaded as many novels as it would let me, and ran those through ChaSen (a morphological chunker).

The result got tallied in two ways: term frequency, and term base frequency.

1) Term frequency simply tallied all the different chunks, looking for "word pronunciation". This means that for words such as 十分, with different pronunciations depending on what they are used to mean, there will be two or more distinct entries. It also considers different verbal conjugations different entries. So して and した are both forms of する, but term tallying ignores this. For every distinct use, a term will have a POS tag in its fourth 'column' (columns being tab delimited).

2) Term base frequency is a more compact tally, which ignores all the different roles or pronunciations a word might have, and just tallies the word's base form. This means that 十分 will be the combined tally of all its readings, and して, した, します, etc will all count towards する's total. Every term that falls in multiple POS categories will have a comma-separated list for its POS 'column'.
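The difference between the two tallies is easiest to see in code. The sketch below works on hand-written (surface, reading, base form, POS) tuples that follow the examples above; they are not real ChaSen output, the POS tags are plain English placeholders, and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hand-written chunks: (surface form, reading, base form, part of speech).
chunks = [
    ("して", "シテ", "する", "verb"),
    ("した", "シタ", "する", "verb"),
    ("十分", "ジュウブン", "十分", "adverb"),
    ("十分", "ジップン", "十分", "noun"),
]


def tally_terms(chunks):
    """Term frequency: every distinct surface/reading/POS is its own entry."""
    return Counter((surface, reading, pos)
                   for surface, reading, _base, pos in chunks)


def tally_bases(chunks):
    """Base frequency: one entry per base form, with a comma-separated
    list of every POS category that base form was seen in."""
    counts = Counter()
    tags = defaultdict(set)
    for _surface, _reading, base, pos in chunks:
        counts[base] += 1
        tags[base].add(pos)
    return {base: (count, ",".join(sorted(tags[base])))
            for base, count in counts.items()}


print(len(tally_terms(chunks)))  # 4
print(tally_bases(chunks))
```

With this input the term tally has four entries of one occurrence each, while the base tally collapses them into two entries of two occurrences each.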

A bit of statistics: the data preprocessing involved throwing away any file that wasn't .txt (rtf, pdf, etc were not considered), as well as any .txt file that had only a handful of pages. I threw away any .txt under 30kb. This resulted in 1322 text files of varying length, of which I stripped the first and last 25 lines to make sure no "header" information got boosted. The resulting 1242 files were chunked, and the decompositions then tallied.
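In Python, that preprocessing step could look something like the sketch below. Several things in it are assumptions rather than facts about what I ran: that "30kb" means 30 × 1024 bytes, that the files are UTF-8 (Japanese text files are often Shift-JIS or EUC-JP instead), and that files too short to survive the trimming are simply dropped to an empty list. The function names are made up.

```python
from pathlib import Path

MIN_BYTES = 30 * 1024  # assumption: "30kb" read as 30 * 1024 bytes
TRIM_LINES = 25        # lines stripped from both the start and the end


def trim_lines(lines, trim=TRIM_LINES):
    """Drop the first and last `trim` lines; too-short input becomes empty."""
    if len(lines) <= 2 * trim:
        return []
    return lines[trim:len(lines) - trim]


def load_corpus(folder, encoding="utf-8"):
    """Yield (file name, trimmed lines) for every large enough .txt file."""
    for path in sorted(Path(folder).glob("*.txt")):
        if path.stat().st_size < MIN_BYTES:
            continue  # only a handful of pages, skip it
        text = path.read_text(encoding=encoding, errors="replace")
        yield path.name, trim_lines(text.splitlines())
```

The trimmed lines would then be handed to the chunker, one file at a time.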

If you want to know which novels were used to compile these lists: too bad. I have no idea. Part of the "grabbing them off Perfect Dark" means that in some countries this might be considered against the law (Luckily, not mine). While ultimately the word frequencies do not allow the original data to be generated in any way, I figured it safest if I didn't bother recording the titles of the novels I used. Nor did I find it important to keep them after I was done (I can think of better uses for 3Gb of space =).

Of course, for those who care, the two data files can be downloaded: click on the relevant link for either the term aggregates or base aggregates.

And if you end up using them for something interesting, drop a line! I like interesting things, especially if I contributed to making it happen in some way ^_^

4 comments - view/add comments