
Thursday, August 7, 2014

Python - Finding Codepoints

Each Unicode character is identified by a unique codepoint. You can find information on character codepoints on official Unicode Web sites, but a quick way to look at visual forms of characters is by generating an HTML page with charts of Unicode characters. The script below does this:
mk_unicode_chart.py
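# Create an HTML chart of Unicode characters by codepoint
# (Python 2 script: relies on unichr and byte-string output)
import sys
head = '<html><head><title>Unicode Code Points</title>\n' +\
       '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n' +\
       '</head><body>\n<h1>Unicode Code Points</h1>'
foot = '</body></html>'
fp = sys.stdout
fp.write(head)
num_blocks = 32  # Up to 256 in theory, but IE5.5 is flaky
for block in range(0,256*num_blocks,256):
    # Each block covers 256 codepoints: block through block+255
    fp.write('\n\n<h2>Range %5d-%5d</h2>' % (block,block+255))
    # Column header: offsets 0-15 within each row
    fp.write('\n<pre>     ')
    for col in range(16): fp.write(str(col).ljust(3))
    fp.write('</pre>')
    # One row of 16 characters per line, labeled with its offset in the block
    for offset in range(0,256,16):
        fp.write('\n<pre>')
        fp.write('+'+str(offset).rjust(3)+' ')
        line = '  '.join([unichr(n+block+offset) for n in range(16)])
        fp.write(line.encode('UTF-8'))
        fp.write('</pre>')
fp.write(foot)
fp.close()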
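
The script above targets Python 2. As a rough sketch only, the same chart could be built under Python 3, where chr replaces unichr and strings are Unicode by default. The function names chart_block and make_chart are made up for this illustration, and two details are assumptions not present in the original: characters are passed through html.escape so that '<' and '&' do not disturb the markup, and surrogate codepoints are shown as '?' because they cannot be encoded as UTF-8.

```python
import html


def chart_block(block, width=16, rows=16):
    """Return the HTML for one block of width*rows codepoints (hypothetical helper)."""
    last = block + width * rows - 1
    lines = ['<h2>Range %5d-%5d</h2>' % (block, last)]
    # Column header: offsets 0..width-1 within each row
    header = '     ' + ''.join(str(col).ljust(3) for col in range(width))
    lines.append('<pre>' + header + '</pre>')
    for offset in range(0, width * rows, width):
        cells = []
        for n in range(width):
            cp = block + offset + n
            # Assumption: surrogates are replaced, since they cannot be UTF-8 encoded
            ch = '?' if 0xD800 <= cp <= 0xDFFF else chr(cp)
            cells.append(html.escape(ch))
        lines.append('<pre>+' + str(offset).rjust(3) + ' ' + '  '.join(cells) + '</pre>')
    return '\n'.join(lines)


def make_chart(num_blocks=32):
    """Return a complete HTML page charting the first num_blocks blocks of 256."""
    head = ('<html><head><title>Unicode Code Points</title>\n'
            '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
            '</head><body>\n<h1>Unicode Code Points</h1>')
    body = '\n\n'.join(chart_block(b) for b in range(0, 256 * num_blocks, 256))
    return head + '\n\n' + body + '\n</body></html>'
```

To view the result, the returned string can be written to a file opened with encoding='utf-8' and then loaded in a browser.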

Exactly what you see in the generated HTML page depends on the Web browser and OS platform used to view it, as well as on installed fonts and other factors. Generally, any character that the browser cannot render appears as some sort of square, dot, or question mark; anything that is rendered is generally accurate. Once a character is visually identified, further information can be obtained with the unicodedata module:
 
>>> import unicodedata
>>> unicodedata.name(unichr(1488))
'HEBREW LETTER ALEF'
>>> unicodedata.category(unichr(1488))
'Lo'
>>> unicodedata.bidirectional(unichr(1488))
'R'
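
The session above uses Python 2's unichr. A minimal Python 3 sketch of the same lookups, gathered into one helper, might look like the following. The name describe and the '<unnamed>' placeholder are inventions for this example; unicodedata.lookup goes the other way, from a character name back to the character.

```python
import unicodedata


def describe(ch):
    """Return basic Unicode properties of a single character (hypothetical helper)."""
    return {
        'codepoint': ord(ch),
        'hex': 'U+%04X' % ord(ch),
        # Control characters and unassigned codepoints have no name,
        # so a placeholder default avoids a ValueError
        'name': unicodedata.name(ch, '<unnamed>'),
        'category': unicodedata.category(ch),
        'bidirectional': unicodedata.bidirectional(ch),
    }


# Reverse direction: find the character (and so the codepoint) from its name
alef = unicodedata.lookup('HEBREW LETTER ALEF')
info = describe(alef)
```

Here info['codepoint'] is 1488, matching the value passed to unichr in the session above.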

A variant here would be to include the information provided by unicodedata within a generated HTML chart, although such a listing would be far more verbose than the example above.
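
One way such a verbose variant could look is sketched below for Python 3. This is only an illustration: the function names verbose_rows and verbose_chart, the table layout, and the choice to skip codepoints that have no name are all assumptions rather than anything specified above.

```python
import html
import unicodedata


def verbose_rows(start, stop):
    """Yield one HTML table row per named codepoint in range(start, stop)."""
    for cp in range(start, stop):
        ch = chr(cp)
        name = unicodedata.name(ch, '')
        if not name:
            # Assumption: codepoints without a name (such as controls,
            # surrogates, and unassigned values) are left out of the listing
            continue
        yield ('<tr><td>U+%04X</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>'
               % (cp, html.escape(ch), html.escape(name),
                  unicodedata.category(ch), unicodedata.bidirectional(ch)))


def verbose_chart(start, stop):
    """Return an HTML table listing character, name, category, and bidi class."""
    header = ('<tr><th>Codepoint</th><th>Char</th><th>Name</th>'
              '<th>Category</th><th>Bidi</th></tr>')
    rows = '\n'.join(verbose_rows(start, stop))
    return '<table>\n' + header + '\n' + rows + '\n</table>'
```

For example, verbose_chart(1488, 1515) would list the Hebrew letters one per row. With one row per character instead of sixteen characters per line, the output grows quickly, which is the verbosity noted above.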
