CJK Unified Ideographs Extension B in nihongoresources

Thursday 9th of October 2008, 03:24:26 pm

It is my serious conjecture that you have probably never bothered looking through the full set of CJK Unicode blocks. I cannot fault you for that; there are better ways to spend your time. Playing a video game, thinking up a lolcat caption, maybe playing 'doctor' with the ol' girlfriend...

If you're not doing any of those things, and you happen to be working on graphical decompositions of kanji, then I believe it is safe to say you're likely going to be running into this block. Now, a word of warning.


Yes, the regular Unified Ideographs block is pretty big. It has about 20,000 characters in there, but for the most part, these will be characters that you can go "okay... I can see how that happens" to.

Not so with the scary Extension B block. That shit be crazy. Does your browser support UTF-8? You sure? Got a good Unicode font installed? No, I mean a good Unicode font, not something wimpy like Arial Unicode, or even the respectable Code2000/Code2001 combination... Okay, here we go:

   𠙴 𠤬 𡤼 𡰣 𡿧 𣁬 𣱱 𤕪 𥫘 𩇨

What? What is up with these? Missing strokes? Mirror images? Too many strokes? Wait, there's more:

   𡆢 𡦹 𢀓

What ARE these? O_o

No, no, wait for it, there is more... Oh there is more:

   𡕜 𡬞

Does that seem "too many" for you? No? Yeah I didn't think so, I've been saving this just in case you weren't weirded out yet...

   𦣩 - symmetry in writing...

   𤴐 - have some boxes.

   𣡽 ... okay I just... I just don't know. HOW MANY TREES DO YOU NEED?!

The Extension B block is just *filled* with this stuff; there are about twice as many Chinese characters in Extension B as there are in the *actually used* regular Unified Ideographs block. I'm serious, there are over 40,000 characters in that block that you are going to be going "wth?" at.

Still here? Then go download BabelMap, make sure you have plenty of fonts installed (Bitstream Cyberbit, Code2000, maybe a bit of SimSun-ExtB... or a lot of it) and happy viewing.

Maddening... Why was I using this block, you ask? Well, annoyingly, the ideograph system in Unicode is not based on "combining graphemes" the way it is for things like "è", which can be written as a Unicode sequence of the letter 'e' followed by the combining 'grave' diacritic mark.
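To make that difference concrete, here is a minimal Python sketch using the standard unicodedata module. The helper name nfd_code_points is made up for this example; it just lists the code points left over after canonical decomposition (NFD). The accented letter falls apart into a base letter plus a combining mark, the kanji does not.

```python
import unicodedata

def nfd_code_points(text):
    """Return the code points of `text` after canonical decomposition (NFD)."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", text)]

e_grave = "\u00e8"  # precomposed è (LATIN SMALL LETTER E WITH GRAVE)
kanji = "\u662f"    # 是

# è splits into 'e' plus COMBINING GRAVE ACCENT.
print(nfd_code_points(e_grave))  # ['U+0065', 'U+0300']

# 是 has no decomposition, so it comes back as the same single code point.
print(nfd_code_points(kanji))    # ['U+662F']
```

So whatever graphical decomposition you want for kanji, you will not get it from Unicode normalization; you have to bring your own data.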

Instead, all Chinese characters have been given their own code point... And not all of them are in the "regular" block, so it is not unusual to look at a kanji - say, a simple one like 是 - and then have to trawl through the CJK Unified Ideographs block, then the CJK Unified Ideographs Extension A block, and then the CJK Unified Ideographs Extension B block just to find its component 𤴓 (it's in B... of course it is).
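The trawling can at least be automated. What follows is a rough sketch, not anything official: the table CJK_BLOCKS and the function cjk_block are names invented for this example, and the table only lists the three blocks mentioned above (the ranges are the block boundaries from the Unicode block list; there are other CJK blocks it ignores).

```python
# Hypothetical lookup table: (block name, first code point, last code point).
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
]

def cjk_block(ch):
    """Return the name of the listed block containing `ch`, or None."""
    cp = ord(ch)
    for name, first, last in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None

for ch in ("是", "𤴓"):
    # Characters outside the Basic Multilingual Plane, which includes all of
    # Extension B, take four bytes in UTF-8 instead of three.
    print(ch, f"U+{ord(ch):04X}", cjk_block(ch), len(ch.encode("utf-8")), "bytes")
```

That four-byte encoding is also why the "does your browser support UTF-8" question matters: software that quietly assumes three bytes or one 16-bit unit per character tends to mangle exactly these characters.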

I'm doing this mostly for me, but it makes coping easier if I pretend I'm doing this for you >.>