Or, the absolute minimum every software developer / linguist absolutely, positively must know about Unicode and character sets (no excuses!)

Note: This text was written as part of a larger programming tutorial in Python, and the code samples are taken from an interactive session using the Jupyter notebook. As a consequence, there are digressions here and there about playing with text data in Python. These might seem:

  1. useless if what you came for is just the part about text encoding;
  2. long-winded if you already know some Python;
  3. or confusing if, on the contrary, you're not familiar with programming at all, much less with Python.

If any of these applies to you, my advice is: ignore the code and focus on the comments around it; they're more than enough to follow the thread of the explanation. Though if you've got a little more time, why not try some of these out in an interactive Python session? ;) And now, without further ado...

Much like any other piece of data inside a digital computer, text is represented as a series of binary digits (bits), i.e. 0's and 1's. A mapping between sequences of bits and characters is called an encoding. How many different characters your encoding can handle depends on how many bits you allow per character:

  • with 1 bit you can have 2^1 = 2 characters (one is represented by 0, the other by 1)
  • with 2 bits you can have 2^2 = 4 characters (represented by 00, 01, 10 and 11)
  • etc.

The oldest encoding still in widespread use (it's what makes the Internet and the web tick) is ASCII, which is a 7-bit encoding:

In [1]:
2**7
Out[1]:
128

This means it can represent 128 different characters, which comfortably fits the basic Latin alphabet (both lowercase and uppercase), Arabic numerals, punctuation and some "control characters" which were primarily useful on the old teletype terminals for which ASCII was designed. For instance, the letter "A" corresponds to the number 65 (1000001 in binary, see below).

"ASCII" stands for "American Standard Code for Information Interchange" -- which explains why there are no accented characters, for instance.

Nowadays, ASCII is represented using 8 bits (== 1 byte), because that's the unit of computer memory which has become ubiquitous (in terms of both hardware and software assumptions), but it still uses only 7 bits' worth of information: the leading bit of every ASCII byte is 0.

In [2]:
2**8
Out[2]:
256
In [3]:
# how to find out the binary representation of a decimal number?
"{:b}".format(65)
Out[3]:
'1000001'
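
Padded to the full byte it occupies in memory nowadays, that's:

"{:08b}".format(65)  # -> '01000001': 7 bits of information in an 8-bit byte
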
In [4]:
# Digression/explanation: the format() method
#
# the format() string method inserts its arguments into the string
# wherever there is a "{}"
"{} {} {}".format("foo", "bar", "baz")
Out[4]:
'foo bar baz'
In [5]:
# you can also specify a different order by using (zero-based) 
# positional indices -- or even repeating them
"{1} {0} {1}".format("foo", "bar")
Out[5]:
'bar foo bar'
In [6]:
# for long strings with many insertions, where you might mess up the
# order of arguments, keyword arguments are also available
"{foo_arg} {bar_arg}".format(bar_arg="bar", foo_arg="foo")
Out[6]:
'foo bar'
In [7]:
# and you can also request various formatting adjustments or conversions
# to be made by specifying them after a ":" -- e.g. "b" prints a given
# number in its binary representation
"{:b}".format(45)
Out[7]:
'101101'
In [8]:
# or simply
bin(45)
# but that has an ugly "0b" in front, and we would've missed out on
# format() if we'd used that directly!
Out[8]:
'0b101101'

What happens in the range [128; 256) is not covered by the ASCII standard. In the late 1980s and the 1990s, many encodings were standardized which used this range for their own purposes, usually representing additional accented characters used in a particular region. E.g. the Czech (and Slovak, Polish...) alphabets can be represented using the ISO Latin-2 encoding (ISO 8859-2), or Microsoft's cp1250. Encodings which stick with the same character mappings as ASCII in the range [0; 128) and represent them physically in the same way (as 1 byte), while potentially adding more character mappings beyond that, are called ASCII-compatible.
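
We can check this with Python: a byte below 128 decodes to the same character under any ASCII-compatible encoding, whereas a byte in the range [128; 256) depends entirely on which encoding we pick. A small sketch, using the encodings just mentioned:

for encoding in ("ascii", "latin2", "cp1250", "utf-8"):
    print(encoding, "->", bytes([65]).decode(encoding))   # always 'A'

for encoding in ("latin2", "cp1250"):
    print(encoding, "->", bytes([190]).decode(encoding))  # 'ž' in latin2 vs. 'ľ' in cp1250
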

ASCII compatibility is a good thing™, because when you start reading a character stream in a computer, there's no way to know in advance what encoding it is in (unless it's a file you've encoded yourself). So in practice, a heuristic has been established to start reading the stream assuming it is ASCII by default, and switch to a different encoding if evidence becomes available that motivates it. For instance, HTML files should all start something like this:

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8"/>
  ...

This way, whenever a program wants to read a file like this, it can start off with ASCII, waiting to see if it reaches the charset (i.e. encoding) attribute, and once it does, it can switch from ASCII to that encoding (UTF-8 here) and restart reading the file, now fairly sure that it's using the correct encoding. This trick works only if we can assume that whatever encoding the rest of the file is in, the first few lines can be considered as ASCII for all practical intents and purposes.
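
Here's a rough sketch of that heuristic in Python -- emphatically not a full HTML parser, and "page.html" is just a hypothetical file name for illustration:

import re

def sniff_charset(path, default="ascii"):
    # read the first chunk of the file as raw bytes and decode it
    # leniently as ASCII, just to look for a charset declaration
    with open(path, "rb") as file:
        head = file.read(1024).decode("ascii", errors="replace")
    match = re.search(r'charset="?([\w-]+)', head)
    return match.group(1) if match else default

# encoding = sniff_charset("page.html")
# text = open("page.html", encoding=encoding).read()
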

Without the charset attribute, the only way to know if the encoding is right would be for you to look at the rendered text and see if it makes sense; if it did not, you'd have to resort to trial and error, manually switching the encodings and looking for the one in which the numbers behind the characters stop coming out as gibberish and are actually translated into intelligible text.
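
To get a feel for what that gibberish looks like, we can simulate the mistake ourselves: take the UTF-8 bytes for "č" and decode them with the wrong encoding (cp1250, mentioned above, will do):

raw = "č".encode("utf-8")     # the two bytes UTF-8 uses for "č"
print(raw.decode("cp1250"))   # wrong encoding -> mojibake: ÄŤ
print(raw.decode("utf-8"))    # right encoding -> č
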

In [9]:
# Let's take a look at printable characters in the latin-2 character
# set. Each mapping is called a "codepoint": it is a correspondence
# between an integer and a character.
import codecs

latin2 = []
for codepoint in range(256):
    byte = bytes([codepoint])
    character = codecs.decode(byte, encoding="latin2")
    if character.isprintable():
        latin2.append((codepoint, character))

latin2
Out[9]:
[(32, ' '),
 (33, '!'),
 (34, '"'),
 (35, '#'),
 (36, '$'),
 (37, '%'),
 (38, '&'),
 (39, "'"),
 (40, '('),
 (41, ')'),
 (42, '*'),
 (43, '+'),
 (44, ','),
 (45, '-'),
 (46, '.'),
 (47, '/'),
 (48, '0'),
 (49, '1'),
 (50, '2'),
 (51, '3'),
 (52, '4'),
 (53, '5'),
 (54, '6'),
 (55, '7'),
 (56, '8'),
 (57, '9'),
 (58, ':'),
 (59, ';'),
 (60, '<'),
 (61, '='),
 (62, '>'),
 (63, '?'),
 (64, '@'),
 (65, 'A'),
 (66, 'B'),
 (67, 'C'),
 (68, 'D'),
 (69, 'E'),
 (70, 'F'),
 (71, 'G'),
 (72, 'H'),
 (73, 'I'),
 (74, 'J'),
 (75, 'K'),
 (76, 'L'),
 (77, 'M'),
 (78, 'N'),
 (79, 'O'),
 (80, 'P'),
 (81, 'Q'),
 (82, 'R'),
 (83, 'S'),
 (84, 'T'),
 (85, 'U'),
 (86, 'V'),
 (87, 'W'),
 (88, 'X'),
 (89, 'Y'),
 (90, 'Z'),
 (91, '['),
 (92, '\\'),
 (93, ']'),
 (94, '^'),
 (95, '_'),
 (96, '`'),
 (97, 'a'),
 (98, 'b'),
 (99, 'c'),
 (100, 'd'),
 (101, 'e'),
 (102, 'f'),
 (103, 'g'),
 (104, 'h'),
 (105, 'i'),
 (106, 'j'),
 (107, 'k'),
 (108, 'l'),
 (109, 'm'),
 (110, 'n'),
 (111, 'o'),
 (112, 'p'),
 (113, 'q'),
 (114, 'r'),
 (115, 's'),
 (116, 't'),
 (117, 'u'),
 (118, 'v'),
 (119, 'w'),
 (120, 'x'),
 (121, 'y'),
 (122, 'z'),
 (123, '{'),
 (124, '|'),
 (125, '}'),
 (126, '~'),
 (161, 'Ą'),
 (162, '˘'),
 (163, 'Ł'),
 (164, '¤'),
 (165, 'Ľ'),
 (166, 'Ś'),
 (167, '§'),
 (168, '¨'),
 (169, 'Š'),
 (170, 'Ş'),
 (171, 'Ť'),
 (172, 'Ź'),
 (174, 'Ž'),
 (175, 'Ż'),
 (176, '°'),
 (177, 'ą'),
 (178, '˛'),
 (179, 'ł'),
 (180, '´'),
 (181, 'ľ'),
 (182, 'ś'),
 (183, 'ˇ'),
 (184, '¸'),
 (185, 'š'),
 (186, 'ş'),
 (187, 'ť'),
 (188, 'ź'),
 (189, '˝'),
 (190, 'ž'),
 (191, 'ż'),
 (192, 'Ŕ'),
 (193, 'Á'),
 (194, 'Â'),
 (195, 'Ă'),
 (196, 'Ä'),
 (197, 'Ĺ'),
 (198, 'Ć'),
 (199, 'Ç'),
 (200, 'Č'),
 (201, 'É'),
 (202, 'Ę'),
 (203, 'Ë'),
 (204, 'Ě'),
 (205, 'Í'),
 (206, 'Î'),
 (207, 'Ď'),
 (208, 'Đ'),
 (209, 'Ń'),
 (210, 'Ň'),
 (211, 'Ó'),
 (212, 'Ô'),
 (213, 'Ő'),
 (214, 'Ö'),
 (215, '×'),
 (216, 'Ř'),
 (217, 'Ů'),
 (218, 'Ú'),
 (219, 'Ű'),
 (220, 'Ü'),
 (221, 'Ý'),
 (222, 'Ţ'),
 (223, 'ß'),
 (224, 'ŕ'),
 (225, 'á'),
 (226, 'â'),
 (227, 'ă'),
 (228, 'ä'),
 (229, 'ĺ'),
 (230, 'ć'),
 (231, 'ç'),
 (232, 'č'),
 (233, 'é'),
 (234, 'ę'),
 (235, 'ë'),
 (236, 'ě'),
 (237, 'í'),
 (238, 'î'),
 (239, 'ď'),
 (240, 'đ'),
 (241, 'ń'),
 (242, 'ň'),
 (243, 'ó'),
 (244, 'ô'),
 (245, 'ő'),
 (246, 'ö'),
 (247, '÷'),
 (248, 'ř'),
 (249, 'ů'),
 (250, 'ú'),
 (251, 'ű'),
 (252, 'ü'),
 (253, 'ý'),
 (254, 'ţ'),
 (255, '˙')]

Using the 8th bit (and thus the codepoint range [128; 256)) solves the problem of handling languages with character sets different from that of American English, but it introduces a lot of complexity -- whenever you come across a text file with an unknown encoding, it might be in any one of literally dozens of encodings. Additional drawbacks include:

  • how to handle multilingual text with characters from many different alphabets, which are not part of the same 8-bit encoding?
  • how to handle writing systems which have way more than 256 "characters", e.g. Chinese, Japanese and Korean (CJK) ideograms?

For these purposes, a standard known as Unicode was developed which strives for universal coverage of all character sets in use. Unicode is much bigger than the encodings we've seen so far -- its most frequently used subset, the Basic Multilingual Plane, has 2^16 codepoints, but the full codepoint space holds over a million codepoints, most of which are still unassigned, leaving plenty of room for future additions.

In [10]:
2**16
Out[10]:
65536
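
For the record, the full Unicode codepoint space consists of 17 "planes" of this size, which is where the "over a million" figure above comes from:

17 * 2**16  # -> 1114112 codepoints in total, the highest being 0x10FFFF
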

Now, the most straightforward representation for 2^16 codepoints is what? Well, it's simply using 16 bits per character, i.e. 2 bytes. That encoding exists, it's called UTF-16, but consider the drawbacks:

  • we've lost ASCII compatibility by the simple fact of using 2 bytes per character instead of 1 (encoding "a" as 01100001 or 01100001|00000000, with the | indicating an imaginary boundary between bytes, is not the same thing)
  • encoding a string in a character set which uses a "reasonable" number of characters (like any European language) now takes twice as much space without any added benefit (which is probably not a good idea, given the general dominance of English -- one of those "reasonable character set size" languages -- in electronic communication)
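
Both drawbacks are easy to see from Python; "utf-16-le" below is the variant of UTF-16 with a fixed (little-endian) byte order, which keeps the example free of an extra byte order mark:

"a".encode("ascii")               # -> b'a': 1 byte
"a".encode("utf-16-le")           # -> b'a\x00': 2 bytes, so no ASCII compatibility
len("hello".encode("utf-16-le"))  # -> 10: twice the space for plain English
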

Looks like we'll have to think outside the box. The box in question here is called fixed-width encodings -- all of the encoding schemes we've encountered so far were fixed-width, meaning that each character was represented by a fixed number of bits, be it 7, 8 or 16. In other words, you could jump around the string in multiples of 7, 8 or 16 and always land at the beginning of a character. (Not exactly true for UTF-16, because it is something more than just a "16-bit ASCII": it has ways of handling characters beyond 2^16 using so-called surrogate sequences -- but you get the gist.)

"UTF" stands for "Unicode Transformation Format".

The smart idea that some bright people have come up with was to use a variable-width encoding. The most ubiquitous one currently is UTF-8, which we've already met in the HTML example above. UTF-8 is ASCII-compatible, i.e. the 1's and 0's used to encode text containing only ASCII characters are the same regardless of whether you use ASCII or UTF-8: it's a sequence of 8-bit bytes. But UTF-8 can also handle many more additional characters, as defined by the Unicode standard, by using progressively longer and longer sequences of bits.

In [11]:
def print_as_binary_utf8(string):
    """Prints binary representation of string as encoded by UTF-8.
    
    """
    binary_bytes = []
    # encode the string as UTF-8 and iterate over the bytes
    for byte in string.encode("utf-8"):
        # generate a string of general format "0b101...", which
        # is the binary representation of the byte
        binary = bin(byte)
        # remove the leading "0b"
        binary = binary[2:]
        # pad the representation with leading zeros to the size of
        # a full byte (= a sequence of 8 1's and 0's) if necessary
        binary_byte = binary.rjust(8, "0")
        binary_bytes.append(binary_byte)
    print("'{}' encoded in UTF-8 is: {}".format(string, binary_bytes))

print_as_binary_utf8("A")   # the representations...
print_as_binary_utf8("č")   # ... keep...
print_as_binary_utf8("字")  # ... getting longer.
'A' encoded in UTF-8 is: ['01000001']
'č' encoded in UTF-8 is: ['11000100', '10001101']
'字' encoded in UTF-8 is: ['11100101', '10101101', '10010111']

How does it achieve that? The obvious problem here is that with a fixed-width encoding, you just chop up the string at regular intervals (7, 8, 16 bits) and you know that each interval represents one character. So how do you know where to chop up a variable-width-encoded string, if each character can take up a different number of bits? We won't go into the full details, but essentially, the trick is to use some of the bits in the representation of a codepoint to store information not about which character it is (whether it's an "A" or a "字"), but about how many bits it occupies. In other words, if you want to skip ahead 10 characters in a string with a variable-width encoding, you can't just skip 10 * 7 or 8 or 16 bits; you have to read all the intervening characters to figure out how much space they take up. A sketch of how UTF-8 does this follows below.
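
To make this concrete, here's what that looks like in UTF-8: the leading bits of a character's first byte announce how many bytes the character occupies in total, and every following byte of the same character is marked as a continuation byte. The bit patterns below are genuine UTF-8 structure; the helper function itself is just an illustrative sketch, not production code.

def utf8_char_width(first_byte):
    """Infer a character's width in bytes from its first UTF-8 byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: a 1-byte (ASCII) character
        return 1
    elif first_byte >> 5 == 0b110:    # 110xxxxx: first of 2 bytes
        return 2
    elif first_byte >> 4 == 0b1110:   # 1110xxxx: first of 3 bytes
        return 3
    elif first_byte >> 3 == 0b11110:  # 11110xxx: first of 4 bytes
        return 4
    else:                             # 10xxxxxx: a continuation byte
        raise ValueError("not the first byte of a character")

for char in "Ač字":
    first_byte = char.encode("utf-8")[0]
    print(char, "occupies", utf8_char_width(first_byte), "byte(s)")
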

There's much more to Unicode than this simple introduction, for instance the various ways diacritics are handled: "č" can be represented either as a single codepoint (LATIN SMALL LETTER C WITH CARON -- all Unicode codepoints have cute names like this) or as a sequence of two codepoints, the character "c" and a combining diacritic mark (COMBINING CARON). You can look up the codepoints corresponding to Unicode characters e.g. in the official code charts at unicode.org, and play with them in Python using the chr(0xXXXX) built-in function or the special string escape sequence \uXXXX (where XXXX is the hexadecimal representation of the codepoint) -- both are ways to get the character corresponding to the given codepoint:

In [12]:
# "č" as LATIN SMALL LETTER C WITH CARON, codepoint 010D
print(chr(0x010D))
print("\u010D")
č
č
In [13]:
# "č" as a sequence of LATIN SMALL LETTER C, codepoint 0063, and
# COMBINING CARON, codepoint 030c
print(chr(0x0063) + chr(0x030c))
print("\u0063\u030c")
č
č

Hexadecimal is just a more convenient way of representing sequences of bits: each hexadecimal digit can be a number between 0 and 15 (10--15 are represented by the letters A--F). Each hexadecimal digit can thus represent 16 different values, and therefore it can stand in for a sequence of 4 bits (2^4 == 16). Without worrying too much about the details right now, our old friend ASCII uppercase "A" can be thought of equivalently either as decimal 65, binary 1000001, or hexadecimal 0x41 (the "0x" prefix is there just to say "this is a hexadecimal number").

Binary and hexadecimal numbers are often written padded with leading zeros to some fixed width, but these have no effect on the value, much like decimal 42 and 00000042 are effectively the same number.
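
For instance, all of these are the same number:

65 == 0b1000001 == 0x41  # -> True: "A" in decimal, binary and hexadecimal
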

In [14]:
# use hex() to find out the hexadecimal representation of a decimal
# integer...
hex(99)
Out[14]:
'0x63'
In [15]:
# ... and to go back; note that 0x63 below is already an integer
# literal, so int() is redundant here -- to convert a hexadecimal
# *string*, you'd use int("63", 16)
int(0x63)
Out[15]:
99

This means you have to be careful when working with languages that use accents, because for a computer, the two possible representations are of course different strings, even though for you, they're conceptually the same:

In [16]:
s1 = "\u010D"
s2 = "\u0063\u030c"
# s1 and s2 look the same to the naked eye...
print(s1, s2)
č č
In [17]:
# ... but in the eternal realm of Plato's Ideas, they're not
s1 == s2
Out[17]:
False

Watch out, they even have different lengths! This might come back to bite you if you're trying to compute the length of a word in letters.

In [18]:
print("s1 is", len(s1), "character(s) long.")
print("s2 is", len(s2), "character(s) long.")
s1 is 1 character(s) long.
s2 is 2 character(s) long.

Generally, most text out there will use the first, single-codepoint approach whenever possible, and pre-packaged linguistic corpora will try to be consistent about this (unless they come from the web, which always warrants being suspicious and defensive about your material). If you're worried about inconsistencies in your data, you can perform a normalization:

In [19]:
from unicodedata import normalize

# NFC stands for Normal Form C; this normalization applies a canonical
# decomposition (into a multi-codepoint representation) followed by a
# canonical composition (into a single-codepoint representation)
s1 = normalize("NFC", s1)
s2 = normalize("NFC", s2)

s1 == s2
Out[19]:
True
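
Conversely, if you ever need the decomposed variant, NFD (Normal Form D) applies just the canonical decomposition:

from unicodedata import normalize

decomposed = normalize("NFD", "\u010D")
print(len(decomposed))                                 # -> 2
print(["{:04x}".format(ord(c)) for c in decomposed])   # -> ['0063', '030c']
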

Let's wrap things up by saying that Python itself uses Unicode internally, and when reading files, it assumes your locale's preferred encoding by default, which on most modern systems is UTF-8. So if you're using UTF-8, as is increasingly the case (and you should be), you won't have to worry too much about encodings, except perhaps for normalization.
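
That said, rather than relying on the default, it's safest to state the encoding explicitly whenever you open a file. A minimal sketch (the file names are made up for illustration):

# read a UTF-8 encoded file...
with open("corpus.txt", encoding="utf-8") as file:
    text = file.read()

# ... and write one back out, again stating the encoding explicitly
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(text)
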

In [20]:
# a good idea when dealing with Unicode text from an unknown and
# unreliable source is to look at the set of codepoints contained
# in it and eliminate or replace those that shouldn't be there
import unicodedata


def inspect_codepoints(text):
    # set(text) gives the unique characters; sort them for stable output
    for char in sorted(set(text)):
        info = r"{} (\u{:04x}): {} (category: {})".format(
            char, ord(char), unicodedata.name(char),
            unicodedata.category(char))
        print(info)
        

# depending on your font configuration, it may be very hard to spot
# the two intruders in the sentence below that look like regular
# letters but really are specialized variants; you might want
# to replace them before doing further text processing...
inspect_codepoints("Intruders here, good 𝗍hinɡ I checked.")
  (\u0020): SPACE (category: Zs)
, (\u002c): COMMA (category: Po)
. (\u002e): FULL STOP (category: Po)
I (\u0049): LATIN CAPITAL LETTER I (category: Lu)
c (\u0063): LATIN SMALL LETTER C (category: Ll)
d (\u0064): LATIN SMALL LETTER D (category: Ll)
e (\u0065): LATIN SMALL LETTER E (category: Ll)
g (\u0067): LATIN SMALL LETTER G (category: Ll)
h (\u0068): LATIN SMALL LETTER H (category: Ll)
i (\u0069): LATIN SMALL LETTER I (category: Ll)
k (\u006b): LATIN SMALL LETTER K (category: Ll)
n (\u006e): LATIN SMALL LETTER N (category: Ll)
o (\u006f): LATIN SMALL LETTER O (category: Ll)
r (\u0072): LATIN SMALL LETTER R (category: Ll)
s (\u0073): LATIN SMALL LETTER S (category: Ll)
t (\u0074): LATIN SMALL LETTER T (category: Ll)
u (\u0075): LATIN SMALL LETTER U (category: Ll)
ɡ (\u0261): LATIN SMALL LETTER SCRIPT G (category: Ll)
𝗍 (\u1d5cd): MATHEMATICAL SANS-SERIF SMALL T (category: Ll)
In [21]:
# ... because of course, for a computer, the word "thing" written with
# two different variants of "g" is really just two different words, which
# is probably not what you want
"thing" == "thinɡ"
Out[21]:
False

In any case, here's what happens when processing text with Python ("Unicode" in the central box stands for Python's internal representation of Unicode, which is neither UTF-8 nor UTF-16):

[Image: Text IO in Python]

(Image shamelessly hotlinked from / courtesy of the NLTK Book. Go check it out, it's an awesome intro to Python programming for linguists!)
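
In code, that decode-on-input / encode-on-output cycle looks something like this (a minimal sketch; the byte string stands in for whatever you might read from a file or a socket):

raw = b"\xc4\x8desk\xc3\xa1"        # raw UTF-8 bytes, e.g. fresh off a disk
text = raw.decode("utf-8")          # DECODE: bytes -> Python's internal Unicode
processed = text.capitalize()       # process the text as a regular string
output = processed.encode("utf-8")  # ENCODE: Unicode -> bytes, ready for output
print(text, "->", processed, "->", output)
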

A terminological postscript: we've been using some terms a bit informally and for the most part it's okay, but it's good to get the distinctions straight in one's head at least once. So, a character set is a mapping between codepoints (integers) and characters. We may for instance say that in our character set, the integer 99 corresponds to the character "c".

On the other hand, an encoding is a mapping between a codepoint (an integer) and a physical sequence of 1's and 0's that represent it in memory. With fixed-width encodings, this mapping is generally straightforward -- the 1's and 0's directly represent the given integer, only in binary and padded with zeros to fit the desired width. With variable-width encodings, as the necessity creeps in to include the information about how many bits are spanned by the current character, this straightforward correspondence breaks down.

A comparison might be helpful here: as encodings, UTF-8 and UTF-16 both use the same character set -- the same integers corresponding to the same characters. But since they're different encodings, when the time comes to turn these integers into sequences of bits to store in a computer's memory, each of them generates a different one.
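
Back in Python, this is easy to verify: the codepoint of "č" is the same either way, but the physical bytes differ:

print(hex(ord("č")))             # -> 0x10d: one codepoint, shared character set
print("č".encode("utf-8"))       # -> b'\xc4\x8d'
print("č".encode("utf-16-le"))   # -> b'\r\x01', i.e. 0x010D in 2 bytes, little-endian
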

For more on Unicode, a great read already hinted at above is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). To make the discussion digestible for newcomers, I sometimes slightly distorted facts about how things are "really really" done. And some inaccuracies may be genuine mistakes. In any case, please let me know in the comments! I'm grateful for feedback and looking to improve this material; I'll fix the mistakes and consider ditching some of the simplifications if they prove untenable :)

