The Mystery of Ctrl+[

Tommy Bennett gave a great lightning talk at CppCon 2016 entitled “Algorithm Mnemonics: Increase Your Productivity With STL Algorithms”. This was really two lightning talks jammed into one. In the second half, Tommy talks about how to increase your productivity with the vim editor simply by learning to use Ctrl+[ instead of ESC because of the distance traveled by your fingers on the keyboard. In the talk, Tommy wonders why Ctrl+[ works for ESC. This short post will explain the mystery of Ctrl+[ along with other interesting characters you can type on your keyboard.

What Characters Look Like to a Computer

A digital computer only understands how to manipulate zeros and ones. So how does a computer deal with a “character” (letter, digit, punctuation symbol, etc.)? The problem of encoding characters predates digital computers and is the basis for Morse code used in telegraphy. Morse code uses two symbols (“dot” and “dash”) and a variable number of symbols per character. Frequently occurring characters are encoded with a shorter sequence of symbols and less frequently used characters are encoded with a longer sequence of symbols. As the telegraph network evolved into the telex network, a switch was made from Morse code to Baudot code. Baudot code is a 5-bit binary code. Because 5 bits can encode only 32 different symbols, Baudot code uses two special codes to shift between two different interpretations of the remaining 30 values, the so-called “figures” and “letters” shift codes. A stream of Baudot coded characters in a message would “shift to letters” and then contain a sequence of letters and minimal control characters such as carriage return and line feed. When punctuation or numbers were needed, a “shift to figures” code would be sent. A message containing a mixture of letters and figures would alternately shift back and forth.
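To make the shifting idea concrete, here is a minimal sketch of a decoder that keeps track of the current shift state. The code values and the tiny two-table mapping are illustrative placeholders I picked for the example, not a faithful copy of the real Baudot table; only the mechanism (two shift codes selecting between two interpretations of the same values) comes from the description above.

```python
# Toy shift-state decoder. All numeric code values below are assumed for
# illustration only and are not guaranteed to match the real Baudot table.
LTRS = 0x1F  # hypothetical "shift to letters" code
FIGS = 0x1B  # hypothetical "shift to figures" code

# The same 5-bit value means different things depending on the shift state.
LETTERS = {0x01: "E", 0x03: "A", 0x10: "T", 0x04: " "}
FIGURES = {0x01: "3", 0x03: "-", 0x10: "5", 0x04: " "}


def decode(codes):
    """Decode a sequence of 5-bit codes, honouring the shift codes."""
    table = LETTERS  # assume the stream starts in letters mode
    out = []
    for code in codes:
        if code == LTRS:
            table = LETTERS
        elif code == FIGS:
            table = FIGURES
        else:
            out.append(table[code])
    return "".join(out)


message = [LTRS, 0x10, 0x01, 0x03, FIGS, 0x01, 0x10]
print(decode(message))
```

Note that the values 0x01 and 0x10 appear twice in the message but decode to different characters, depending on which shift code was seen last.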

ASCII: The One Code to Rule Them All (sort of)

By the early 1960s, computers from different manufacturers were exchanging data and needed to agree upon the codes used to represent characters. If the computers from different manufacturers used the same coding scheme for character data, then no translation would be needed to exchange data between machines. This culminated in the first ASCII (American Standard Code for Information Interchange) standard in 1963. Most computers adopted ASCII. If we take a look at how the characters in ASCII are distributed among the code values, we will see some interesting patterns:

USASCII code chart (Source: Wikipedia)

Notice how the columns labelled 2 and 3 roughly correspond to the “figures” of Baudot code and columns 4 and 5 to “letters”. (Baudot code could not represent lower case letters.) Also notice that the position of any upper case letter within its column matches the position of its lower case counterpart within its own column. Finally, notice that the codes for letters increase with the order of the letters within the alphabet, unlike Baudot code. These characteristics of USASCII are intentional and allow for the easy manipulation of character codes in order to perform certain operations such as case conversion and sorting. They also facilitate certain tricks in keyboard design. For a letter key, a shift key merely has to toggle bit b6 shown in the figure in order to switch between the code for the upper case symbol and the code for the lower case symbol.
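Here is a small sketch of the case trick in code. In the chart's b7..b1 labelling, bit b6 has the value 0x20, so flipping it moves a letter between the upper case and lower case columns. The function name `toggle_case` is mine, chosen for the example.

```python
CASE_BIT = 0x20  # bit b6 in the chart's b7..b1 labelling


def toggle_case(ch):
    """Flip the case of an ASCII letter by toggling bit b6.

    Anything that is not an ASCII letter is returned unchanged.
    """
    code = ord(ch)
    # With the case bit cleared, every ASCII letter lands in 'A'..'Z'.
    if 0x41 <= (code & ~CASE_BIT) <= 0x5A:
        return chr(code ^ CASE_BIT)
    return ch


print(toggle_case("a"), toggle_case("Q"), toggle_case("7"))

# Because letter codes increase with alphabetical order, sorting the
# codes sorts the letters.
print(sorted("MYSTERY"))
```

The same layout is why comparing two upper case words code by code gives alphabetical order without any lookup table.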

The Mystery of Ctrl+[ Revealed

Now let’s take a look at those mysterious symbolic names in the first two columns. These are the columns that represent control codes. These are non-printing codes designed to issue controlling commands to the receiving device. Because they are non-printing codes, their operation is normally hidden from view. Some of them are already familiar to you: BS is backspace, LF is line feed and CR is carriage return. These control codes are the origin of the Ctrl key on your keyboard. When you hold Ctrl, you are telling the keyboard “make sure bits b6 and b7 are zero”. To see all the different codes you can generate by holding the Ctrl key, look at columns 4 and 5 of the ASCII chart: holding Ctrl while typing one of those characters produces the control code in the same row of column 0 or 1.

Do you need to type a NUL character? Easy, just type Ctrl+@ (Ctrl+Shift+2 on most PC-101 style keyboards).

Now take a look at the control character corresponding to the position of [ in column 5. It lines up with the code for ESC in column 1. So that’s why Ctrl+[ works as a synonym for ESC in vim. Tommy mentions “vim has this support built-in all along!”, not realizing that it is a trick of the keyboard in conjunction with the ASCII character encoding. Vim is completely out of the loop; as far as it is concerned an ESC was typed.
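The Ctrl trick can be sketched in a few lines. Clearing bits b7 and b6 is the same as masking the code with 0x1F, which maps every character in columns 4 and 5 onto the control code in the same row of columns 0 and 1. The helper name `ctrl` is my own, and real keyboards and terminals handle more cases than this sketch does.

```python
ESC = 0x1B
CTRL_MASK = 0x1F  # keeps b5..b1, forces b7 and b6 to zero


def ctrl(key):
    """Return the control code produced by Ctrl+key.

    Only characters from columns 4 and 5 of the chart ('@' through '_')
    are accepted in this sketch.
    """
    code = ord(key)
    if not 0x40 <= code <= 0x5F:
        raise ValueError(f"{key!r} is not in columns 4 and 5 of the chart")
    return code & CTRL_MASK


print(hex(ord("[")), "->", hex(ctrl("[")))
print(ctrl("[") == ESC)
print(ctrl("@"), ctrl("H"), ctrl("M"))
```

Running it shows that `[` (0x5b) becomes 0x1b, which is ESC, and that Ctrl+@, Ctrl+H and Ctrl+M produce NUL (0), BS (8) and CR (13).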

If you play around with these control characters you’ll realize that not all of them can be typed directly into a command prompt, as the command interpreter and operating system use some of these codes for their own purposes. Ctrl+C serves as an “interrupt” command to console programs on both Unix and Windows. Ctrl+D is the end-of-file character on Unix, while Ctrl+Z is the similar character on Windows. Ctrl+H serves as backspace, Ctrl+M serves as carriage return and so on. It is possible to programmatically receive these characters on both operating systems, but it isn’t the default situation.

Now you know the mystery behind Ctrl+[.

Does ASCII Rule Them All?

Above I said that ASCII was the one code to rule them all, sort of. One glaring exception is IBM, which had an investment in its own character code called EBCDIC. To stay backwards compatible with its existing equipment and installed base, IBM didn’t make much equipment that used ASCII. EBCDIC is less convenient for some operations than ASCII. For instance, the codes making up the alphabet are not contiguous: there is a gap between I and J and between R and S. This makes lexicographical sorting more difficult. The order of lower case codes and upper case codes is also swapped with respect to ASCII. While the same keyboard tricks can be played with shifting between upper case and lower case, the corresponding ASCII control codes are sprinkled about in non-consecutive codes.
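You can see the gaps for yourself with a few lines of code. This sketch assumes code page 037, one common EBCDIC variant that ships with Python as the `cp037` codec; other EBCDIC code pages may differ in details. The helper names are mine.

```python
import string


def ebcdic(ch):
    """Return the cp037 (an EBCDIC code page) value of a character."""
    return ch.encode("cp037")[0]


def alphabet_gaps(letters=string.ascii_uppercase):
    """List adjacent letter pairs whose codes are not consecutive."""
    return [
        (a, b)
        for a, b in zip(letters, letters[1:])
        if ebcdic(b) - ebcdic(a) != 1
    ]


print(alphabet_gaps())

# Lower case sorts before upper case in EBCDIC, the reverse of ASCII.
print(hex(ebcdic("a")), hex(ebcdic("A")))
print(hex(ord("a")), hex(ord("A")))
```

In ASCII the same gap check comes back empty, because the 26 letters of each case occupy consecutive codes.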

ASCII was clearly designed for use in the United States; after all, the first letter does stand for American. Notice that ASCII is a 7-bit code, but characters are typically one-to-one with bytes, which hold 8 bits. The remaining 128 codes in a byte were put to use for so-called “national character sets”. European languages used these additional code points for accented characters and other special symbols (mostly currency symbols). This means the same character code is used differently in different regions, again hindering data interchange. The problem is further compounded for Asian languages that have many more symbols than can fit in an 8-bit code.
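Here is a small illustration of that interchange problem. It decodes the same byte under three 8-bit character sets from the ISO 8859 family; I picked these three only because Python ships codecs for them, and the helper name `decode_everywhere` is my own.

```python
# Three 8-bit character sets that all extend 7-bit ASCII (chosen for
# illustration): Western European, Cyrillic and Greek.
CODECS = ["latin-1", "iso8859_5", "iso8859_7"]


def decode_everywhere(raw, codecs=CODECS):
    """Decode the same bytes under each character set."""
    return {name: raw.decode(name) for name in codecs}


# A byte with the top bit set means something different in each set.
for name, text in decode_everywhere(bytes([0xE9])).items():
    print(name, repr(text))

# Bytes in the 7-bit ASCII range mean the same thing in all of them.
print(decode_everywhere(b"Ctrl+["))
```

The byte 0xE9 comes out as three different characters, while the pure ASCII bytes decode identically everywhere. A receiver that guesses the wrong character set gets the wrong text with no error to warn it.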

A number of European language national character set standards followed ASCII, along with encoding schemes designed for Asian languages such as JIS X 0208 for Japanese and the Big5 encoding for traditional Chinese.

To make a long story short, all these competing encoding schemes complicate life for software and users that want to exchange data between machines. Unicode, in one form or another, is the dominant character encoding in modern computing. (EBCDIC is still heavily used within the mainframe world.) If there is any one character encoding that can be said to rule them all in the modern era, it is Unicode.
