Network Working Group                                         F. Yergeau
Request for Comments: 2044                             Alis Technologies
Category: Informational                                     October 1996


        UTF-8, a transformation format of Unicode and ISO 10646

Status of this Memo

This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.

Abstract

The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993 jointly
define a 16-bit character set which encompasses most of the world's
writing systems. 16-bit characters, however, are not compatible with
many current applications and protocols, and this has led to the
development of a few so-called UCS transformation formats (UTF), each
with different characteristics. UTF-8, the object of this memo, has
the characteristic of preserving the full US-ASCII range: US-ASCII
characters are encoded in one octet having the usual US-ASCII value,
and any octet with such a value can only be a US-ASCII character.
This provides compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other
values.

Introduction

The Unicode Standard, version 1.1 [UNICODE], and ISO/IEC 10646-1:1993
[ISO-10646] jointly define a 16-bit character set, UCS-2, which
encompasses most of the world's writing systems. ISO 10646 further
defines a 31-bit character set, UCS-4, with currently no assignments
outside of the region corresponding to UCS-2 (the Basic Multilingual
Plane, BMP). The UCS-2 and UCS-4 encodings, however, are hard to use
in many current applications and protocols that assume 8-bit or even
7-bit characters. Even newer systems able to deal with 16-bit
characters cannot process UCS-4 data. This situation has led to the
development of so-called UCS transformation formats (UTF), each with
different characteristics.

UTF-1 has only historical interest, having been removed from ISO
10646. UTF-7 has the quality of encoding the full Unicode repertoire
using only octets with the high-order bit clear (7-bit US-ASCII
values, [US-ASCII]), and is thus deemed a mail-safe encoding
([RFC1642]). UTF-8, the object of this memo, uses all bits of an
octet, but has the quality of preserving the full US-ASCII range:

Yergeau                      Informational                      [Page 1]

RFC 2044                        UTF-8                       October 1996

US-ASCII characters are encoded in one octet having the normal
US-ASCII value, and any octet with such a value can only stand for a
US-ASCII character, and nothing else.

UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
into a pair of UCS-2 values from a reserved range. UTF-16 affects
UTF-8 in that UCS-2 values from the reserved range must be treated
specially in the UTF-8 transformation.

UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
octets, where the number of octets, and the value of each, depend on
the integer value assigned to the character in ISO 10646. This
transformation format has the following characteristics (all values
are in hexadecimal):

- Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
  correspond to octets 00 to 7F (7-bit US-ASCII values).

- US-ASCII values do not appear otherwise in a UTF-8 encoded
  character stream. This provides compatibility with file systems or
  other software (e.g. the printf() function in C libraries) that
  parse based on US-ASCII values but are transparent to other values.

- Round-trip conversion is easy between UTF-8 and either of UCS-4,
  UCS-2 or Unicode.

- The first octet of a multi-octet sequence indicates the number of
  octets in the sequence.

- Character boundaries are easily found from anywhere in an octet
  stream.

- The lexicographic sorting order of UCS-4 strings is preserved. Of
  course this is of limited interest since the sort order is not
  culturally valid in either case.

- The octet values FE and FF never appear.
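
As an illustration of the boundary-finding property, the following
Python sketch backs up from an arbitrary offset to the first octet of
the enclosing character. It assumes the octet layout defined in this
memo, in which every non-initial octet of a sequence has the form
10xxxxxx. The names is_continuation and char_start are hypothetical
and chosen only for this example.

```python
def is_continuation(octet: int) -> bool:
    """True for octets of the form 10xxxxxx, which never start a character."""
    return (octet & 0xC0) == 0x80


def char_start(data: bytes, pos: int) -> int:
    """Back up from an arbitrary offset to the first octet of its character."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


# "Hi Mom <WHITE SMILING FACE>!" as UTF-8 octets (example from this memo).
data = bytes.fromhex("48 69 20 4D 6F 6D 20 E2 98 BA 21")

# Offset 9 falls inside the three-octet sequence E2 98 BA at offsets 7..9.
print(char_start(data, 9))  # 7

# No octet of the multi-octet sequence lies in the US-ASCII range 00-7F.
print(all(octet >= 0x80 for octet in data[7:10]))  # True
```

Because at most five continuation octets can follow a first octet, the
loop never needs to look back more than five positions in well-formed
data.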

UTF-8 was originally a project of the X/Open Joint
Internationalization Group (XOJIG) with the objective to specify a
File System Safe UCS Transformation Format [FSS_UTF] that is
compatible with UNIX systems, supporting multilingual text in a single
encoding. The original authors were Gary Miller, Greger Leijonhufvud
and John Entenmann. Later, Ken Thompson and Rob Pike did significant
work for the formal UTF-8.

A description can also be found in Unicode Technical Report #4
[UNICODE]. The definitive reference, including provisions for UTF-16
data within UTF-8, is Annex R of ISO/IEC 10646-1 [ISO-10646].

UTF-8 definition

In UTF-8, characters are encoded using sequences of 1 to 6 octets.
The only octet of a "sequence" of one has the higher-order bit set to
0, the remaining 7 bits being used to encode the character value. In
a sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the value of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.

The table below summarizes the format of these different octet types.
The letter x indicates bits available for encoding bits of the UCS-4
character value.

UCS-4 range (hex.)    UTF-8 octet sequence (binary)
0000 0000-0000 007F   0xxxxxxx
0000 0080-0000 07FF   110xxxxx 10xxxxxx
0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx

0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
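
The table implies that the length of a sequence can be read off its
first octet by counting leading 1 bits. The Python sketch below shows
one way to do this; the function name sequence_length is hypothetical,
and the error handling is an assumption of this example, since the
memo does not say how invalid octets should be reported.

```python
def sequence_length(first_octet: int) -> int:
    """Return the number of octets in the sequence starting with first_octet."""
    if not 0 <= first_octet <= 0xFF:
        raise ValueError("not an octet")
    # Count the leading 1 bits of the octet.
    ones = 0
    mask = 0x80
    while first_octet & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxxxxxx: a single-octet US-ASCII character
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation octet, not a first octet")
    if ones > 6:
        raise ValueError("FE and FF never appear in UTF-8")
    return ones


for octet in (0x41, 0xCE, 0xE2, 0xF0, 0xFD):
    print(f"{octet:02X} -> {sequence_length(octet)}")
# 41 -> 1
# CE -> 2
# E2 -> 3
# F0 -> 4
# FD -> 6
```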

Encoding from UCS-4 to UTF-8 proceeds as follows:

1) Determine the number of octets required from the character value
   and the first column of the table above.

2) Prepare the high-order bits of the octets as per the second column
   of the table.

3) Fill in the bits marked x from the bits of the character value,
   starting from the lower-order bits of the character value and
   putting them first in the last octet of the sequence, then the
   next to last, etc. until all x bits are filled in.
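
The three encoding steps can be sketched in Python as follows. This is
a minimal illustration under the assumption that the input is a single
UCS-4 value in the range 0000 0000 to 7FFF FFFF; the names _RANGES and
encode_ucs4 are hypothetical, and no special treatment of the values
D800 to DFFF is attempted here.

```python
# (upper bound of UCS-4 range, number of octets, high-order bits of first octet)
_RANGES = [
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 character value as a UTF-8 octet sequence."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    # Step 1: find the number of octets from the first column of the table.
    for upper, count, lead in _RANGES:
        if value <= upper:
            break
    octets = bytearray(count)
    # Steps 2 and 3: fill continuation octets (10xxxxxx) from the last one
    # backwards, consuming the lower-order bits of the value first.
    for i in range(count - 1, 0, -1):
        octets[i] = 0x80 | (value & 0x3F)
        value >>= 6
    # The remaining bits go into the first octet after its marker bits.
    octets[0] = lead | value
    return bytes(octets)


print(encode_ucs4(0x263A).hex(" ").upper())  # E2 98 BA
```

The printed octets match the encoding of WHITE SMILING FACE (263A) in
the examples of this memo.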

   The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
   obtained from the above, in principle, by simply extending each
   UCS-2 character with two zero-valued octets. However, UCS-2 values
   between D800 and DFFF, being actually UCS-4 characters transformed
   through UTF-16, need special treatment: the UTF-16 transformation
   must be undone, yielding a UCS-4 character that is then
   transformed as above.
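
This memo does not give the arithmetic for undoing the UTF-16
transformation. The Python sketch below uses the usual UTF-16
surrogate-pair formula, which comes from the UTF-16 definition and is
an assumption as far as this memo is concerned. The function names are
hypothetical. The final encoding step relies on Python's built-in
UTF-8 codec, which for values up to 0010 FFFF produces the four-octet
form shown in the table above.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Undo UTF-16: turn a pair of reserved UCS-2 values into one UCS-4 value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not in D800-DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not in DC00-DFFF")
    # Each value of the pair carries 10 bits; the result starts at 0001 0000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_ucs2_pair(high: int, low: int) -> bytes:
    """Encode a UTF-16 pair as a single four-octet UTF-8 sequence."""
    return chr(combine_surrogates(high, low)).encode("utf-8")


print(f"{combine_surrogates(0xD800, 0xDC00):08X}")        # 00010000
print(encode_ucs2_pair(0xD800, 0xDC00).hex(" ").upper())  # F0 90 80 80
```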

Decoding from UTF-8 to UCS-4 proceeds as follows:

1) Initialize the 4 octets of the UCS-4 character with all bits set
   to 0.

2) Determine which bits encode the character value from the number of
   octets in the sequence and the second column of the table above
   (the bits marked x).

3) Distribute the bits from the sequence to the UCS-4 character,
   first the lower-order bits from the last octet of the sequence and
   proceeding to the left until no x bits are left.

   If the UTF-8 sequence is no more than three octets long, decoding
   can proceed directly to UCS-2 (or equivalently Unicode).
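
A decoder along these lines might look like the Python sketch below.
It accumulates bits from left to right with shifts, which yields the
same result as distributing them from the last octet backwards. The
names decode_one and decode are hypothetical, and the checks are
limited to the octet patterns in the table above; other validity rules
are outside the scope of this example.

```python
from typing import List, Tuple


def decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode the sequence starting at pos; return (UCS-4 value, next position)."""
    first = data[pos]
    # Count the leading 1 bits of the first octet to get the sequence length.
    count = 0
    mask = 0x80
    while first & mask:
        count += 1
        mask >>= 1
    if count == 0:
        return first, pos + 1  # 0xxxxxxx: one-octet sequence
    if count == 1 or count > 6:
        raise ValueError(f"octet {first:02X} cannot start a sequence")
    if pos + count > len(data):
        raise ValueError("truncated sequence")
    # Step 1 and 2: start from zero and keep only the x bits of the first octet.
    value = first & (mask - 1)
    # Step 3: append the 6 x bits of each following octet.
    for octet in data[pos + 1:pos + count]:
        if octet & 0xC0 != 0x80:
            raise ValueError(f"octet {octet:02X} is not of the form 10xxxxxx")
        value = (value << 6) | (octet & 0x3F)
    return value, pos + count


def decode(data: bytes) -> List[int]:
    """Decode a whole octet stream into a list of UCS-4 values."""
    values = []
    pos = 0
    while pos < len(data):
        value, pos = decode_one(data, pos)
        values.append(value)
    return values


octets = bytes.fromhex("E6 97 A5 E6 9C AC E8 AA 9E")
print([f"{v:04X}" for v in decode(octets)])  # ['65E5', '672C', '8A9E']
```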

A more detailed algorithm and formulae can be found in [FSS_UTF],
[UNICODE] or Annex R to [ISO-10646].

Examples

The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
002E) may be encoded as follows:

    41 E2 89 A2 CE 91 2E

The Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (0048, 0069,
0020, 004D, 006F, 006D, 0020, 263A, 0021) may be encoded as follows:

    48 69 20 4D 6F 6D 20 E2 98 BA 21

The Unicode sequence representing the Han characters for the Japanese
word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:

    E6 97 A5 E6 9C AC E8 AA 9E
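
These examples can be checked mechanically. The Python sketch below
compares each sequence with the output of Python's built-in UTF-8
codec, which agrees with the table in this memo for the values used
here. The octets listed for the first sequence are obtained by
applying the encoding rules of this memo to 0041, 2262, 0391 and 002E.
The variable names are hypothetical.

```python
# Each Unicode sequence paired with its expected UTF-8 octets in hexadecimal.
examples = {
    "A\u2262\u0391.": "41 E2 89 A2 CE 91 2E",
    "Hi Mom \u263A!": "48 69 20 4D 6F 6D 20 E2 98 BA 21",
    "\u65E5\u672C\u8A9E": "E6 97 A5 E6 9C AC E8 AA 9E",
}

for text, expected in examples.items():
    encoded = text.encode("utf-8")
    # Print the octets and whether they match the expected sequence.
    print(encoded.hex(" ").upper(), encoded == bytes.fromhex(expected))
```

Each printed line ends in True when the codec output matches the
expected octets.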

MIME registration

This memo is meant to serve as the basis for registration of a MIME
character encoding (charset) as per [RFC1521]. The proposed charset
parameter value is "UTF-8". This string would label media types
containing text consisting of characters from the repertoire of ISO
10646-1 encoded to a sequence of octets using the encoding scheme
outlined above.

Security Considerations

Security issues are not discussed in this memo.

Acknowledgements

The following have participated in the drafting and discussion of
this memo:

    James E. Agenbroad    Andries Brouwer
    Martin J. Duerst      David Goldsmith
    Edwin F. Hart         Kent Karlsson
    Markus Kuhn           Michael Kung
    Alain LaBonte         Murray Sargent
    Keld Simonsen         Arnold Winkler

Bibliography

[FSS_UTF]   X/Open CAE Specification C501 ISBN 1-85912-082-2 28cm.
            22p. pbk. 172g. 4/95, X/Open Company Ltd., "File System
            Safe UCS Transformation Format (FSS_UTF)", X/Open
            Preliminary Specification, Document Number P316. Also
            published in Unicode Technical Report #4.

[ISO-10646] ISO/IEC 10646-1:1993. International Standard --
            Information technology -- Universal Multiple-Octet Coded
            Character Set (UCS) -- Part 1: Architecture and Basic
            Multilingual Plane. UTF-8 is described in Annex R,
            adopted but not yet published. UTF-16 is described in
            Annex Q, adopted but not yet published.

[RFC1521]   Borenstein, N., and N. Freed, "MIME (Multipurpose
            Internet Mail Extensions) Part One: Mechanisms for
            Specifying and Describing the Format of Internet Message
            Bodies", RFC 1521, Bellcore, Innosoft, September 1993.

[RFC1641]   Goldsmith, D., and M. Davis, "Using Unicode with
            MIME", RFC 1641, Taligent, Inc., July 1994.

[RFC1642]   Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe
            Transformation Format of Unicode", RFC 1642,
            Taligent, Inc., July 1994.

[UNICODE]   The Unicode Consortium, "The Unicode Standard --
            Worldwide Character Encoding -- Version 1.0",
            Addison-Wesley, Volume 1, 1991, Volume 2, 1992. UTF-8 is
            described in Unicode Technical Report #4.

[US-ASCII]  Coded Character Set--7-bit American Standard Code for
            Information Interchange, ANSI X3.4-1986.

Author's Address

100, boul. Alexis-Nihon

Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561
EMail: fyergeau@alis.com