Tuesday, May 3, 2011

CHARACTER FORMATS

We think of computing as work with numbers, but in fact most computing operates on character data rather
than numeric data: names, addresses, order numbers, gender, birthdates, and the like are usually represented
by strings of characters rather than by numeric values.
Characters are mapped to integer numbers, and there have been many character-to-integer mappings
over the years. IBM invented a mapping called binary coded decimal (BCD), and later the Extended
Binary Coded Decimal Interchange Code (EBCDIC), which became a de facto standard with IBM's early
success in the computer market.
The American Standard Code for Information Interchange (ASCII) was defined in the 1960s and became the
choice of most computer vendors, aside from IBM. Today Unicode is becoming popular because it is backward
compatible with ASCII and allows the encoding of more complex alphabets, such as those used for Russian,
Chinese, and other languages. We will use ASCII to illustrate the idea of character encoding, since it is
still widely used and simpler to describe than Unicode.
In ASCII each character is assigned a 7-bit integer value. For instance, ‘A’ = 65 (1000001), ‘B’ = 66
(1000010), ‘C’ = 67 (1000011), etc. The 8th bit in a character byte is intended to be used as a parity bit, which
allows for a simple error detection scheme.
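Most programming languages expose this mapping through their standard libraries. As a minimal sketch in Python (ord and format are built-ins; nothing here is specific to any particular system), the codes above can be printed directly:

# Print the ASCII code for each character, in decimal and as 7 bits.
for ch in "ABC":
    code = ord(ch)                        # ord() maps a character to its integer code
    print(ch, code, format(code, "07b")) # decimal value and 7-bit binary

# Prints:
# A 65 1000001
# B 66 1000010
# C 67 1000011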
If parity is used, the 8th or parity bit is used to force the sum of the bits in the character to be an
even number (even parity) or an odd number (odd parity). Thus, the 8 bits for the character ‘B’ could take
these forms:
01000010 even parity
11000010 odd parity
01000010 no parity
If parity is being used, and a noisy signal causes one of the bits of the character to be misinterpreted, the
communication device will see that the parity of the character no longer checks. The data transfer can then be
retried, or an error announced. This topic is more properly discussed under the heading of data communications,
but since we had to mention the 8th bit of the ASCII code, we didn’t want you to be left completely in the dark
about what parity bits and parity checking are.
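As a concrete sketch of the idea, the following Python function (with_parity is an illustrative name, not a standard routine) computes the parity bit for a 7-bit ASCII character and reproduces the three forms of 'B' listed above:

# A sketch of adding a parity bit (the msb) to a 7-bit ASCII character.
def with_parity(ch, mode="even"):
    code = ord(ch)                        # the 7-bit ASCII value
    ones = bin(code).count("1")           # number of 1 bits in the character
    if mode == "even":
        parity = ones % 2                 # force an even total count of 1 bits
    elif mode == "odd":
        parity = (ones + 1) % 2           # force an odd total count of 1 bits
    else:
        parity = 0                        # "no parity": msb simply left at 0
    return format((parity << 7) | code, "08b")

print(with_parity("B", "even"))   # 01000010
print(with_parity("B", "odd"))    # 11000010
print(with_parity("B", "none"))   # 01000010

A receiver checks parity the same way: it counts the 1 bits in the received byte and verifies that the total is even (or odd), flagging an error if it is not.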
The lowercase characters are assigned a different set of numbers: ‘a’ = 97 (1100001), ‘b’ = 98 (1100010),
‘c’ = 99 (1100011), etc. In addition, many special characters are defined: ‘$’ = 36 (0100100), ‘+’ = 43
(0101011), ‘>’ = 62 (0111110), etc.
A number of “control characters” are also defined in ASCII. Control characters do not print, but can be used
in streams of characters to control devices. For example, 'line feed' = 10 (0001010), 'tab' = 9 (0001001),
'backspace' = 8 (0001000), etc.
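In Python, the usual escape sequences name some of these control characters, and sending them in an output stream has the expected effect on the device (a small sketch; the visible result depends on the terminal):

print(ord("\b"), ord("\t"), ord("\n"))   # 8 9 10: backspace, tab, line feed
print("name\tage")                       # the tab advances to the next tab stop
print("one\ntwo")                        # the line feed starts a new line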
For output, to send the string “Dog” followed by a line feed, the following sequence of bytes would be sent
(the msb is the parity bit; in this example parity is ignored and the parity bit is set to 0):
01000100 01101111 01100111 00001010
D o g lf (line feed)
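This byte sequence can be reproduced by encoding the string (a Python sketch; encode("ascii") yields the 7-bit codes with the high bit left at 0):

data = "Dog\n".encode("ascii")                    # 'D', 'o', 'g', line feed
print(" ".join(format(b, "08b") for b in data))
# 01000100 01101111 01100111 00001010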
Likewise for input, if a program is reading from a keyboard, the keyboard will send a sequence of integer
values that correspond to the letters being typed.
How does a program know whether to interpret a series of bits as an integer, a character, or a floating-point
number? Bits are bits, and there is no label on a memory location saying this location holds an integer/character/real.
The answer is that the program will interpret the bits based on its expectation.
If the program expects to find a character, it will try to interpret the bits as a character. If the bit pattern
doesn’t make sense as a character encoding, either the program will fail or an error message will result.
Likewise, if the program expects an integer, it will interpret the bit pattern as an integer, even if the bit pattern
originally encoded a character. It is incumbent on the programmer to be sure that the program’s handling of data
is appropriate.
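As an illustration, the same four bytes from the “Dog” example read very differently depending on the type the program expects. This Python sketch uses the standard struct module to reinterpret the bytes; the integer and float values are perfectly legal but meaningless:

import struct

raw = bytes([0b01000100, 0b01101111, 0b01100111, 0b00001010])

print(raw.decode("ascii"))            # as characters: "Dog" plus a line feed
print(struct.unpack(">i", raw)[0])    # as a big-endian 32-bit integer: 1148151562
print(struct.unpack(">f", raw)[0])    # as a 32-bit IEEE 754 float: about 957.6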
