Or, the absolute minimum every software developer / linguist absolutely, positively must know about Unicode and character sets (no excuses!)

Note: This text was written as part of a larger programming tutorial in Python, and the code samples are taken from an interactive session using the Jupyter notebook. As a consequence, there are digressions here and there about playing with text data in Python. These might seem:

  1. useless if what you came for is just the part about text encoding;
  2. long-winded if you already know some Python;
  3. or confusing if, on the contrary, you're not familiar with programming at all, much less with Python.

If any of these applies to you, my advice is: ignore the code and focus on the comments around it; they're more than enough to follow the thread of the explanation. Though if you've got a little more time, why not try some of these out in an interactive Python session? ;) And now, without further ado...

Much like any other piece of data inside a digital computer, text is represented as a series of binary digits (bits), i.e. 0's and 1's. A mapping between sequences of bits and characters is called an encoding. How many different characters your encoding can handle depends on how many bits you allow per character:

  • with 1 bit you can have 2^1 = 2 characters (one is represented by 0, the other by 1)
  • with 2 bits you can have 2^2 = 4 characters (represented by 00, 01, 10 and 11)
  • etc.

The oldest encoding still in widespread use (it's what makes the Internet and the web tick) is ASCII, which is a 7-bit encoding:

In [1]:
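# ASCII is a 7-bit encoding, so the number of different characters
# it can represent is:
2 ** 7
128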

This means it can represent 128 different characters, which comfortably fits the basic Latin alphabet (both lowercase and uppercase), Arabic numerals, punctuation and some "control characters" which were primarily useful on the old teletype terminals for which ASCII was designed. For instance, the letter "A" corresponds to the number 65 (1000001 in binary, see below).

"ASCII" stands for "American Standard Code for Information Interchange" -- which explains why there are no accented characters, for instance.

Nowadays, ASCII is represented using 8 bits (== 1 byte), because that's the unit of computer memory which has become ubiquitous (in terms of both hardware and software assumptions), but still uses only 7 bits' worth of information.

In [2]:
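# e.g. the number corresponding to the letter "A"
ord("A")
65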
In [3]:
# how to find out the binary representation of a decimal number?
In [4]:
# Digression/explanation: the format() method
# the format() string method inserts its arguments into the string
# wherever there is a "{}"
"{} {} {}".format("foo", "bar", "baz")
'foo bar baz'
In [5]:
# you can also specify a different order by using (zero-based) 
# positional indices -- or even repeating them
"{1} {0} {1}".format("foo", "bar")
'bar foo bar'
In [6]:
# for long strings with many insertions, where you might mess up the
# order of arguments, keyword arguments are also available
"{foo_arg} {bar_arg}".format(bar_arg="bar", foo_arg="foo")
'foo bar'
In [7]:
# and you can also request various formatting adjustments or conversions
# to be made by specifying them after a ":" -- e.g. "b" prints a given
# number in its binary representation
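"{:b}".format(65)
'1000001'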
In [8]:
# or simply
# but that has an ugly "0b" in front, and we would've missed out on
# format() if we'd used that directly!
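bin(65)
'0b1000001'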

What happens in the range [128; 256) is not covered by the ASCII standard. In the 1990s, many encodings were standardized which used this range for their own purposes, usually representing additional accented characters used in a particular region. E.g. the Czech (and Slovak, Polish...) alphabets can be represented using the ISO Latin-2 encoding, or Microsoft's cp1250. Encodings which stick to the same character mappings as ASCII in the range [0; 128) and represent them physically in the same way (as 1 byte), while potentially adding more character mappings beyond that, are called ASCII-compatible.

ASCII compatibility is a good thing™, because when you start reading a character stream in a computer, there's no way to know in advance what encoding it is in (unless it's a file you've encoded yourself). So in practice, a heuristic has been established to start reading the stream assuming it is ASCII by default, and switch to a different encoding if evidence becomes available that motivates it. For instance, HTML files should all start something like this:

<!DOCTYPE html>
  <meta charset="utf-8"/>

This way, whenever a program wants to read a file like this, it can start off with ASCII, waiting to see if it reaches the charset (i.e. encoding) attribute, and once it does, it can switch from ASCII to that encoding (UTF-8 here) and restart reading the file, now fairly sure that it's using the correct encoding. This trick works only if we can assume that whatever encoding the rest of the file is in, the first few lines can be considered as ASCII for all practical intents and purposes.

Without the charset attribute, the only way to know if the encoding is right would be for you to look at the rendered text and see if it makes sense; if it did not, you'd have to resort to trial and error, manually switching the encodings and looking for the one in which the numbers behind the characters stop coming out as gibberish and are actually translated into intelligible text.
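Here's a quick illustration of the problem (a sketch; 0xA9 happens to be one of the bytes where ISO Latin-2 and cp1250 disagree):

byte = bytes([0xA9])
print(byte.decode("latin2"))   # -> Š
print(byte.decode("cp1250"))   # -> © -- same bits, different character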

Let's take a look at printable characters in the Latin-2 character set. The character set consists of mappings between integers and characters; each one of these is called a "codepoint". The Latin-2 encoding then defines how to encode each of these integers as a series of bits (1's and 0's) in the computer's memory.

In [9]:
latin2 = []
# the Latin-2 character set has 256 codepoints, corresponding to
# integers from 0 to 255
for codepoint in range(256):
    # the Latin-2 encoding is simple: each codepoint is encoded
    # as the byte corresponding to that integer in binary
    byte = bytes([codepoint])
    character = byte.decode(encoding="latin2")
    if character.isprintable():
        latin2.append((codepoint, character))

# the list of printable (codepoint, character) pairs
latin2

[(32, ' '),
 (33, '!'),
 (34, '"'),
 (35, '#'),
 (36, '$'),
 (37, '%'),
 (38, '&'),
 (39, "'"),
 (40, '('),
 (41, ')'),
 (42, '*'),
 (43, '+'),
 (44, ','),
 (45, '-'),
 (46, '.'),
 (47, '/'),
 (48, '0'),
 (49, '1'),
 (50, '2'),
 (51, '3'),
 (52, '4'),
 (53, '5'),
 (54, '6'),
 (55, '7'),
 (56, '8'),
 (57, '9'),
 (58, ':'),
 (59, ';'),
 (60, '<'),
 (61, '='),
 (62, '>'),
 (63, '?'),
 (64, '@'),
 (65, 'A'),
 (66, 'B'),
 (67, 'C'),
 (68, 'D'),
 (69, 'E'),
 (70, 'F'),
 (71, 'G'),
 (72, 'H'),
 (73, 'I'),
 (74, 'J'),
 (75, 'K'),
 (76, 'L'),
 (77, 'M'),
 (78, 'N'),
 (79, 'O'),
 (80, 'P'),
 (81, 'Q'),
 (82, 'R'),
 (83, 'S'),
 (84, 'T'),
 (85, 'U'),
 (86, 'V'),
 (87, 'W'),
 (88, 'X'),
 (89, 'Y'),
 (90, 'Z'),
 (91, '['),
 (92, '\\'),
 (93, ']'),
 (94, '^'),
 (95, '_'),
 (96, '`'),
 (97, 'a'),
 (98, 'b'),
 (99, 'c'),
 (100, 'd'),
 (101, 'e'),
 (102, 'f'),
 (103, 'g'),
 (104, 'h'),
 (105, 'i'),
 (106, 'j'),
 (107, 'k'),
 (108, 'l'),
 (109, 'm'),
 (110, 'n'),
 (111, 'o'),
 (112, 'p'),
 (113, 'q'),
 (114, 'r'),
 (115, 's'),
 (116, 't'),
 (117, 'u'),
 (118, 'v'),
 (119, 'w'),
 (120, 'x'),
 (121, 'y'),
 (122, 'z'),
 (123, '{'),
 (124, '|'),
 (125, '}'),
 (126, '~'),
 (161, 'Ą'),
 (162, '˘'),
 (163, 'Ł'),
 (164, '¤'),
 (165, 'Ľ'),
 (166, 'Ś'),
 (167, '§'),
 (168, '¨'),
 (169, 'Š'),
 (170, 'Ş'),
 (171, 'Ť'),
 (172, 'Ź'),
 (174, 'Ž'),
 (175, 'Ż'),
 (176, '°'),
 (177, 'ą'),
 (178, '˛'),
 (179, 'ł'),
 (180, '´'),
 (181, 'ľ'),
 (182, 'ś'),
 (183, 'ˇ'),
 (184, '¸'),
 (185, 'š'),
 (186, 'ş'),
 (187, 'ť'),
 (188, 'ź'),
 (189, '˝'),
 (190, 'ž'),
 (191, 'ż'),
 (192, 'Ŕ'),
 (193, 'Á'),
 (194, 'Â'),
 (195, 'Ă'),
 (196, 'Ä'),
 (197, 'Ĺ'),
 (198, 'Ć'),
 (199, 'Ç'),
 (200, 'Č'),
 (201, 'É'),
 (202, 'Ę'),
 (203, 'Ë'),
 (204, 'Ě'),
 (205, 'Í'),
 (206, 'Î'),
 (207, 'Ď'),
 (208, 'Đ'),
 (209, 'Ń'),
 (210, 'Ň'),
 (211, 'Ó'),
 (212, 'Ô'),
 (213, 'Ő'),
 (214, 'Ö'),
 (215, '×'),
 (216, 'Ř'),
 (217, 'Ů'),
 (218, 'Ú'),
 (219, 'Ű'),
 (220, 'Ü'),
 (221, 'Ý'),
 (222, 'Ţ'),
 (223, 'ß'),
 (224, 'ŕ'),
 (225, 'á'),
 (226, 'â'),
 (227, 'ă'),
 (228, 'ä'),
 (229, 'ĺ'),
 (230, 'ć'),
 (231, 'ç'),
 (232, 'č'),
 (233, 'é'),
 (234, 'ę'),
 (235, 'ë'),
 (236, 'ě'),
 (237, 'í'),
 (238, 'î'),
 (239, 'ď'),
 (240, 'đ'),
 (241, 'ń'),
 (242, 'ň'),
 (243, 'ó'),
 (244, 'ô'),
 (245, 'ő'),
 (246, 'ö'),
 (247, '÷'),
 (248, 'ř'),
 (249, 'ů'),
 (250, 'ú'),
 (251, 'ű'),
 (252, 'ü'),
 (253, 'ý'),
 (254, 'ţ'),
 (255, '˙')]

Using the 8th bit (and thus the codepoint range [128; 256)) solves the problem of handling languages with character sets different from that of American English, but introduces a lot of complexity -- whenever you come across a text file with an unknown encoding, it might be in one of literally dozens of encodings. Additional drawbacks include:

  • how to handle multilingual text with characters from many different alphabets, which are not part of the same 8-bit encoding?
  • how to handle writing systems which have way more than 256 "characters", e.g. Chinese, Japanese and Korean (CJK) ideograms?

For these purposes, a standard character set known as Unicode was developed which strives for universal coverage of (ultimately) all characters ever used in the history of writing, even adding new ones like emojis. Unicode is much bigger than the character sets we've seen so far -- its most frequently used subset, the Basic Multilingual Plane, has 2^16 codepoints, but overall the number of codepoints is past 1M and there's room to accommodate many more.

In [10]:
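# the number of codepoints in the Basic Multilingual Plane
2 ** 16
65536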

Now, the most straightforward representation for 2^16 codepoints is what? Well, it's simply using 16 bits per character, i.e. 2 bytes. That encoding exists, it's called UTF-16, but consider the drawbacks:

  • we've lost ASCII compatibility by the simple fact of using 2 bytes per character instead of 1 (encoding "a" as 01100001 or 01100001|00000000, with the | indicating an imaginary boundary between bytes, is not the same thing)
  • encoding a string in a character set which uses a "reasonable" number of characters (like any European language) now takes twice as much space without any added benefit (which is probably not a good idea, given the general dominance of English -- one of those "reasonable character set size" languages -- in electronic communication)
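The first point is easy to check in Python (a sketch; UTF-16LE is the variant of UTF-16 without a byte order mark, which keeps the output easy to read):

print("a".encode("ascii"))      # -> b'a' (1 byte)
print("a".encode("utf-16-le"))  # -> b'a\x00' (2 bytes: not ASCII-compatible)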

Looks like we'll have to think outside the box. The box in question here is called fixed-width encodings -- all of the encoding schemes we've encountered so far were fixed-width, meaning that each character was represented by either 7, 8 or 16 bits. In other words, you could jump around the string in multiples of 7, 8 or 16 bits and always land at the beginning of a character. (Not exactly true for UTF-16, because it is something more than just a "16-bit ASCII": it has ways of handling characters beyond 2^16 using so-called surrogate sequences -- but you get the gist.)
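For the curious, here's a glimpse of those surrogate sequences: a codepoint beyond the Basic Multilingual Plane doesn't fit into 16 bits, so UTF-16 spends two 16-bit units on it (a sketch, again using the UTF-16LE variant):

emoji = chr(0x1F600)                    # 😀, codepoint 0x1F600 > 2**16
print(len(emoji.encode("utf-16-le")))   # -> 4 bytes, i.e. a surrogate pair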

"UTF" stands for "Unicode Transformation Format".

The smart idea that some bright people have come up with was to use a variable-width encoding. The most ubiquitous one currently is UTF-8, which we've already met in the HTML example above. UTF-8 is ASCII-compatible, i.e. the 1's and 0's used to encode text containing only ASCII characters are the same regardless of whether you use ASCII or UTF-8: it's a sequence of 8-bit bytes. But UTF-8 can also handle many more additional characters, as defined by the Unicode standard, by using progressively longer and longer sequences of bits.
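It's easy to check the ASCII compatibility claim for yourself:

# for pure-ASCII text, the ASCII and UTF-8 encodings yield the same bytes
"foo bar baz".encode("ascii") == "foo bar baz".encode("utf-8")
True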

In [11]:
def print_as_binary_utf8(char):
    """Prints binary representation of character as encoded by UTF-8.
    # encode the string as UTF-8 and iterate over the bytes;
    # iterating over a sequence of bytes yields integers in the
    # range [0; 256); the formatting directive "{:08b}" does two
    # things:
    #   - "b" prints the integer in its binary representation
    #   - "08" pads the binary representation with 0's to a total
    #     width of 8 characters, which is the width of a byte
    binary_bytes = ["{:08b}".format(byte) for byte in char.encode("utf8")]
    print("{!r} encoded in UTF-8 is: {}".format(char, binary_bytes))

print_as_binary_utf8("A")   # the representations...
print_as_binary_utf8("č")   # ... keep...
print_as_binary_utf8("字")  # ... getting longer.
'A' encoded in UTF-8 is: ['01000001']
'č' encoded in UTF-8 is: ['11000100', '10001101']
'字' encoded in UTF-8 is: ['11100101', '10101101', '10010111']

How does it achieve that? The obvious problem here is that with a fixed-width encoding, you just chop up the string at regular intervals (7, 8, 16 bits) and you know that each interval represents one character. So how do you know where to chop up a string in a variable-width encoding, if each character can take up a different number of bits?

Essentially, the trick is to use some of the bits in the representation of a codepoint to store information not about which character it is (whether it's an "A" or a "字"), but about how many bits it occupies. In other words, if you want to skip ahead 10 characters in a string encoded with a variable-width encoding, you can't just skip 10 * 7 or 8 or 16 bits; you have to read all the intervening characters to figure out how much space they take up. Take the following example:

In [12]:
for char in "Básník 李白":
    print_as_binary_utf8(char)
'B' encoded in UTF-8 is: ['01000010']
'á' encoded in UTF-8 is: ['11000011', '10100001']
's' encoded in UTF-8 is: ['01110011']
'n' encoded in UTF-8 is: ['01101110']
'í' encoded in UTF-8 is: ['11000011', '10101101']
'k' encoded in UTF-8 is: ['01101011']
' ' encoded in UTF-8 is: ['00100000']
'李' encoded in UTF-8 is: ['11100110', '10011101', '10001110']
'白' encoded in UTF-8 is: ['11100111', '10011001', '10111101']

Notice that the initial bits in each byte of a character follow a pattern depending on how many bytes in total that character has:

  • if it's a 1-byte character, that byte starts with 0
  • if it's a 2-byte character, the first byte starts with 110 and the following one with 10
  • if it's a 3-byte character, the first byte starts with 1110 and the following ones with 10

This makes it possible to find out which bytes belong to which characters, and also to spot invalid strings, as the leading byte in a multi-byte sequence always "announces" how many continuation bytes (= starting with 10) should follow.
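For instance, a lone continuation byte can never start a valid character, and UTF-8 decoders will complain about it (a sketch; the exact error message may vary between Python versions):

# a stray continuation byte (leading 10) with nothing to continue:
bytes([0b10000000]).decode("utf-8")  # raises UnicodeDecodeError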

So much for a quick introduction to UTF-8 (= the encoding), but there's much more to Unicode (= the character set). While UTF-8 defines only how integer numbers corresponding to codepoints are to be represented as 1's and 0's in a computer's memory, Unicode specifies how those numbers are to be interpreted as characters, what their properties and mutual relationships are, what conversions (i.e. mappings between (sequences of) codepoints) they can undergo, etc.

Consider for instance the various ways diacritics are handled: "č" can be represented either as a single codepoint (LATIN SMALL LETTER C WITH CARON -- all Unicode codepoints have cute names like this) or as a sequence of two codepoints, the character "c" and a combining diacritic mark (COMBINING CARON). You can look up the codepoints corresponding to Unicode characters e.g. in the code charts at unicode.org, and play with them in Python using the chr(0xXXXX) built-in function or the special string escape sequence \uXXXX (where XXXX is the hexadecimal representation of the codepoint) -- both are ways to get the character corresponding to the given codepoint:

In [13]:
# "č" as LATIN SMALL LETTER C WITH CARON, codepoint 010d
In [14]:
# "č" as a sequence of LATIN SMALL LETTER C, codepoint 0063, and
# COMBINING CARON, codepoint 030c
print(chr(0x0063) + chr(0x030c))
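č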

Hexadecimal is just a more convenient way of writing down sequences of bits, where each of the X's is a digit between 0 and 15 (10--15 being represented by the letters A--F). Each hexadecimal digit can thus take on 16 different values, and therefore it can stand in for a sequence of 4 bits (2^4 == 16). Without worrying too much about the details right now, our old friend ASCII uppercase "A" can be thought of equivalently either as decimal 65, binary 0b1000001, or hexadecimal 0x41 (the "0b" / "0x" prefixes are there just to say "this is a binary / hexadecimal number").

Binary and hexadecimal numbers are often written padded with leading zeros to some fixed width (e.g. a whole number of bytes), but the padding has no effect on the value, much like decimal 42 and 00000042 are effectively the same number.

In [15]:
# use hex() to find out the hexadecimal representation of a decimal
# integer...
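hex(65)
'0x41'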
In [16]:
# ... and int() to go back...
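int("0x41", 16)
65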
In [17]:
# ... or just evaluate the hexadecimal number
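0x41
65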
In [18]:
# of course, chr() also works with decimal numbers
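chr(65)
'A'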

This means you have to be careful when working with languages that use accents, because for a computer, the two possible representations are of course different strings, even though for you, they're conceptually the same:

In [19]:
s1 = "\u010d"
s2 = "\u0063\u030c"
# s1 and s2 look the same to the naked eye...
print(s1, s2)
č č
In [20]:
# ... but in the eternal realm of Plato's Ideas, they're not
s1 == s2
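False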

Watch out, they even have different lengths! This might come back to bite you if you're trying to compute the length of a word in letters.

In [21]:
print("s1 is", len(s1), "character(s) long.")
print("s2 is", len(s2), "character(s) long.")
s1 is 1 character(s) long.
s2 is 2 character(s) long.

For this reason, even though we've been informally calling these Unicode entities "characters", it is more accurate and less confusing to use the technical term "codepoints".

Generally, most text out there will use the first, single-codepoint approach whenever possible, and pre-packaged linguistic corpora will try to be consistent about this (unless they come from the web, which always warrants being suspicious and defensive about your material). If you're worried about inconsistencies in your data, you can perform a normalization:

In [22]:
from unicodedata import normalize

# NFC stands for Normal Form C; this normalization applies a canonical
# decomposition (into a multi-codepoint representation) followed by a
# canonical composition (into a single-codepoint representation)
s1 = normalize("NFC", s1)
s2 = normalize("NFC", s2)

s1 == s2
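True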

Let's wrap things up by saying that Python itself uses Unicode internally, but the encoding it defaults to when opening an external file depends on the locale of the system (broadly speaking, the set of region, language and character-encoding related settings of the operating system). On most modern Linux and macOS systems, this will probably be a UTF-8 locale and Python will therefore assume UTF-8 as the encoding by default. Unfortunately, Windows is different. To be on the safe side, whenever opening files in Python, you can specify the encoding explicitly:

In [23]:
with open("unicode.ipynb", encoding="utf-8") as file:
    # e.g. read the file's contents, decoding the bytes as UTF-8
    contents = file.read()
In [24]:
# a good idea when dealing with Unicode text from an unknown and
# unreliable source is to look at the set of codepoints contained
# in it and eliminate or replace those that shouldn't be there
import unicodedata

def inspect_codepoints(text):
    charset = set(text)
    for char in sorted(charset):
        info = r"{} (\u{:04x}): {} (category: {})".format(
            char, ord(char), unicodedata.name(char),
            unicodedata.category(char))
        print(info)

# depending on your font configuration, it may be very hard to spot
# the two intruders in the sentence below that look like regular
# letters but really are specialized variants; you might want
# to replace them before doing further text processing...
inspect_codepoints("Intruders here, good 𝗍hinɡ I checked.")
  (\u0020): SPACE (category: Zs)
, (\u002c): COMMA (category: Po)
. (\u002e): FULL STOP (category: Po)
I (\u0049): LATIN CAPITAL LETTER I (category: Lu)
c (\u0063): LATIN SMALL LETTER C (category: Ll)
d (\u0064): LATIN SMALL LETTER D (category: Ll)
e (\u0065): LATIN SMALL LETTER E (category: Ll)
g (\u0067): LATIN SMALL LETTER G (category: Ll)
h (\u0068): LATIN SMALL LETTER H (category: Ll)
i (\u0069): LATIN SMALL LETTER I (category: Ll)
k (\u006b): LATIN SMALL LETTER K (category: Ll)
n (\u006e): LATIN SMALL LETTER N (category: Ll)
o (\u006f): LATIN SMALL LETTER O (category: Ll)
r (\u0072): LATIN SMALL LETTER R (category: Ll)
s (\u0073): LATIN SMALL LETTER S (category: Ll)
t (\u0074): LATIN SMALL LETTER T (category: Ll)
u (\u0075): LATIN SMALL LETTER U (category: Ll)
ɡ (\u0261): LATIN SMALL LETTER SCRIPT G (category: Ll)
𝗍 (\u1d5cd): MATHEMATICAL SANS-SERIF SMALL T (category: Ll)
In [25]:
# ... because of course, for a computer, the word "thing" written with
# two different variants of "g" is really just two different words, which
# is probably not what you want
"thing" == "thinɡ"

In any case, here's what happens when processing text with Python ("Unicode" in the central box stands for Python's internal representation of Unicode, which is neither UTF-8 nor UTF-16):

Text IO in Python

(Image shamelessly hotlinked from / courtesy of the NLTK Book. Go check it out, it's an awesome intro to Python programming for linguists!)

A terminological postscript: we've been using some terms a bit informally, but now that we have a practical intuition for what they mean, it's good to get the definitions straight in one's head. So, a character set is a mapping between codepoints (integers) and characters. We may for instance say that in our character set, the integer 99 corresponds to the character "c".

On the other hand, an encoding is a mapping between a codepoint (an integer) and a physical sequence of 1's and 0's that represent it in memory. With fixed-width encodings, this mapping is generally straightforward -- the 1's and 0's directly represent the given integer, only in binary and padded with zeros to fit the desired width. With variable-width encodings, which have to explicitly encode information about how many bits are spanned by each codepoint, this straightforward correspondence breaks down.

A comparison might be helpful here: as encodings, UTF-8 and UTF-16 both use the same character set -- the same integers corresponding to the same characters. But since they're different encodings, when the time comes to turn these integers into sequences of bits to store in a computer's memory, each of them generates a different one.
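Here's that difference in the flesh (a sketch, once more using the UTF-16LE variant so that the byte order is predictable):

print("č".encode("utf-8"))      # -> b'\xc4\x8d'
print("č".encode("utf-16-le"))  # -> b'\r\x01', i.e. the integer 0x010d in 16 bits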

For more on Unicode, a great read already hinted at above is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Another great piece of material is the Characters, Symbols and the Unicode Miracle video by the Computerphile channel on YouTube. To make the discussion digestible for newcomers, I sometimes slightly distorted facts about how things are "really really" done. And some inaccuracies may be genuine mistakes. In any case, please let me know in the comments! I'm grateful for feedback and looking to improve this material; I'll fix the mistakes and consider ditching some of the simplifications if they prove untenable :)

