+++ /dev/null
-
-
-
-
-
-
-Network Working Group T. Howes
-Request for Comments: 1488 University of Michigan
- S. Kille
- ISODE Consortium
- W. Yeong
- Performance Systems International
- C. Robbins
- NeXor Ltd.
- July 1993
-
-
- The X.500 String Representation of Standard Attribute Syntaxes
-
-Status of this Memo
-
- This RFC specifies an IAB standards track protocol for the Internet
- community, and requests discussion and suggestions for improvements.
- Please refer to the current edition of the "IAB Official Protocol
- Standards" for the standardization state and status of this protocol.
- Distribution of this memo is unlimited.
-
-Abstract
-
- The Lightweight Directory Access Protocol (LDAP) [9] requires that
- the contents of AttributeValue fields in protocol elements be octet
- strings. This document defines the requirements that must be
- satisfied by encoding rules used to render Directory attribute
- syntaxes into a form suitable for use in the LDAP, then goes on to
- define the encoding rules for the standard set of attribute syntaxes
- defined in [1,2] and [3].
-
-1. Attribute Syntax Encoding Requirements
-
- This section defines general requirements for lightweight directory
- protocol attribute syntax encodings. All documents defining attribute
- syntax encodings for use by the lightweight directory protocols are
- expected to conform to these requirements.
-
- The encoding rules defined for a given attribute syntax must produce
- octet strings. To the greatest extent possible, encoded octet
- strings should be usable in their native encoded form for display
- purposes. In particular, encoding rules for attribute syntaxes
- defining non-binary values should produce strings that can be
- displayed with little or no translation by clients implementing the
- lightweight directory protocols.
-
-
-
-
-
-
-Howes, Kille, Yeong & Robbins [Page 1]
-\f
-RFC 1488 X.500 Syntax Encoding July 1993
-
-
-2. Standard Attribute Syntax Encodings
-
- For the purposes of defining the encoding rules for the standard
- attribute syntaxes, the following auxiliary BNF definitions will be
- used:
-
- <a> ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' |
- 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' |
- 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' | 'A' |
- 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' |
- 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' |
- 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
-
- <d> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
-
- <hex-digit> ::= <d> | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' |
- 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
-
- <k> ::= <a> | <d> | '-'
-
- <p> ::= <a> | <d> | ''' | '(' | ')' | '+' | ',' | '-' | '.' |
- '/' | ':' | '?' | ' '
-
- <CRLF> ::= The ASCII newline character with hexadecimal value 0x0A
-
- <letterstring> ::= <a> | <a> <letterstring>
-
- <numericstring> ::= <d> | <d> <numericstring>
-
- <keystring> ::= <a> | <a> <anhstring>
-
- <anhstring> ::= <k> | <k> <anhstring>
-
- <printablestring> ::= <p> | <p> <printablestring>
-
- <space> ::= ' ' | ' ' <space>
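The auxiliary productions above map directly onto regular expressions. The following is a minimal sketch, not part of the memo; the helper names (is_keystring, is_numericstring, is_printablestring) are hypothetical and chosen only for illustration.

```python
import re

# Character classes taken from the auxiliary BNF:
#   <a> = letters, <d> = digits, <k> = <a> | <d> | '-',
#   <p> = letters, digits and ' ( ) + , - . / : ? space
_KEYSTRING = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
_NUMERICSTRING = re.compile(r"[0-9]+")
_PRINTABLESTRING = re.compile(r"[A-Za-z0-9'()+,\-./:? ]+")


def is_keystring(text):
    """True if text matches <keystring>: a letter, then letters, digits or '-'."""
    return _KEYSTRING.fullmatch(text) is not None


def is_numericstring(text):
    """True if text matches <numericstring>: one or more decimal digits."""
    return _NUMERICSTRING.fullmatch(text) is not None


def is_printablestring(text):
    """True if text matches <printablestring>: one or more <p> characters."""
    return _PRINTABLESTRING.fullmatch(text) is not None
```

Note that '$' is not a <p> character, which is why the later syntaxes can use it as a field separator between printable strings.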
-
-2.1. Undefined
-
- Values of type Undefined are encoded as if they were values of type
- Octet String.
-
-2.2. Case Ignore String
-
- A string of type caseIgnoreStringSyntax is encoded as the string
- value itself.
-
-2.3. Case Exact String
-
- The encoding of a string of type caseExactStringSyntax is the string
- value itself.
-
-2.4. Printable String
-
- The encoding of a string of type printableStringSyntax is the string
- value itself.
-
-2.5. Numeric String
-
- The encoding of a string of type numericStringSyntax is the string
- value itself.
-
-2.6. Octet String
-
- The encoding of a string of type octetStringSyntax is the string
- value itself.
-
-2.7. Case Ignore IA5 String
-
- The encoding of a string of type caseIgnoreIA5String is the string
- value itself.
-
-2.8. IA5 String
-
- The encoding of a string of type iA5StringSyntax is the string value
- itself.
-
-2.9. T61 String
-
- The encoding of a string of type t61StringSyntax is the string value
- itself.
-
-2.10. Case Ignore List
-
- Values of type caseIgnoreListSyntax are encoded according to the
- following BNF:
-
- <caseignorelist> ::= <caseignorestring> |
- <caseignorestring> '$' <caseignorelist>
-
- <caseignorestring> ::= a string encoded according to the rules
- for Case Ignore String as above.
-
-2.11. Case Exact List
-
- Values of type caseExactListSyntax are encoded according to the
- following BNF:
-
- <caseexactlist> ::= <caseexactstring> |
- <caseexactstring> '$' <caseexactlist>
-
- <caseexactstring> ::= a string encoded according to the rules for
- Case Exact String as above.
-
-2.12. Distinguished Name
-
- Values of type distinguishedNameSyntax are encoded to have the
- representation defined in [5].
-
-2.13. Boolean
-
- Values of type booleanSyntax are encoded according to the following
- BNF:
-
- <boolean> ::= "TRUE" | "FALSE"
-
- Boolean values have an encoding of "TRUE" if they are logically true,
- and have an encoding of "FALSE" otherwise.
-
-2.14. Integer
-
- Values of type integerSyntax are encoded as the decimal
- representation of their values, with each decimal digit represented
- by its character equivalent. So the digit 1 is represented by the
- character '1'.
-
-2.15. Object Identifier
-
- Values of type objectIdentifierSyntax are encoded according to the
- following BNF:
-
- <oid> ::= <descr> | <descr> '.' <numericoid> | <numericoid>
-
- <descr> ::= <keystring>
-
- <numericoid> ::= <numericstring> | <numericstring> '.' <numericoid>
-
- In the above BNF, <descr> is the syntactic representation of an
- object descriptor. When encoding values of type
- objectIdentifierSyntax, the first encoding option should be used in
- preference to the second, which should be used in preference to the
- third wherever possible. That is, in encoding object identifiers,
- object descriptors (where assigned and known by the implementation)
- should be used in preference to numeric oids to the greatest extent
- possible. For example, in encoding the object identifier representing
- an organizationName, the descriptor "organizationName" is preferable
- to "ds.4.10", which is in turn preferable to the string "2.5.4.10".
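As an illustration of the three <oid> alternatives, the sketch below classifies an encoded value by the form it uses. The function name classify_oid and the returned labels are hypothetical; the check is purely syntactic and does not verify that a descriptor is actually assigned.

```python
import re

# <descr> ::= <keystring>; <numericoid> ::= digits separated by '.'
_DESCR = r"[A-Za-z][A-Za-z0-9-]*"
_NUMERICOID = r"[0-9]+(?:\.[0-9]+)*"

# Listed in the order of preference given in the text.
_FORMS = [
    ("descr", re.compile(_DESCR)),
    ("descr.numericoid", re.compile(_DESCR + r"\." + _NUMERICOID)),
    ("numericoid", re.compile(_NUMERICOID)),
]


def classify_oid(text):
    """Return which <oid> alternative the encoded value uses."""
    for name, pattern in _FORMS:
        if pattern.fullmatch(text):
            return name
    raise ValueError("not a valid <oid> encoding: %r" % (text,))
```

With the example from the text, "organizationName" is classified as "descr", "ds.4.10" as "descr.numericoid", and "2.5.4.10" as "numericoid".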
-
-2.16. Telephone Number
-
- Values of type telephoneNumberSyntax are encoded as if they were
- Printable String types.
-
-2.17. Telex Number
-
- Values of type telexNumberSyntax are encoded according to the
- following BNF:
-
- <telex-number> ::= <actual-number> '$' <country> '$' <answerback>
-
- <actual-number> ::= <printablestring>
-
- <country> ::= <printablestring>
-
- <answerback> ::= <printablestring>
-
- In the above, <actual-number> is the syntactic representation of the
- number portion of the TELEX number being encoded, <country> is the
- TELEX country code, and <answerback> is the answerback code of a
- TELEX terminal.
-
-2.18. Teletex Terminal Identifier
-
- Values of type teletexTerminalIdentifier are encoded according to the
- following BNF:
-
- <teletex-id> ::= <printablestring> 0*( '$' <printablestring>)
-
- In the above, the first <printablestring> is the encoding of the
- first portion of the teletex terminal identifier to be encoded, and
- the subsequent 0 or more <printablestrings> are subsequent portions
- of the teletex terminal identifier.
-
-2.19. Facsimile Telephone Number
-
- Values of type FacsimileTelephoneNumber are encoded according to the
- following BNF:
-
- <fax-number> ::= <printablestring> [ '$' <faxparameters> ]
-
- <faxparameters> ::= <faxparm> | <faxparm> '$' <faxparameters>
-
- <faxparm> ::= 'twoDimensional' | 'fineResolution' | 'unlimitedLength' |
- 'b4Length' | 'a3Width' | 'b4Width' | 'uncompressed'
-
- In the above, the first <printablestring> is the actual fax number,
- and the <faxparm> tokens represent fax parameters.
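Because '$' cannot occur inside a <printablestring>, a fax number value can be split on '$' without ambiguity. A minimal sketch, assuming a hypothetical helper named parse_fax_number; the telephone number in the usage note is made up.

```python
# The <faxparm> tokens allowed by the BNF.
FAX_PARAMETERS = frozenset({
    "twoDimensional", "fineResolution", "unlimitedLength",
    "b4Length", "a3Width", "b4Width", "uncompressed",
})


def parse_fax_number(encoded):
    """Split an encoded <fax-number> into (number, list of parameters)."""
    number, *params = encoded.split("$")
    if not number:
        raise ValueError("missing fax number")
    unknown = [p for p in params if p not in FAX_PARAMETERS]
    if unknown:
        raise ValueError("unknown fax parameters: %r" % (unknown,))
    return number, params
```

For example, parse_fax_number("+1 313 555 0100$twoDimensional$fineResolution") yields the number and the two parameter tokens in order. The sketch does not check the number against <printablestring>.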
-
-2.20. Presentation Address
-
- Values of type PresentationAddress are encoded to have the
- representation described in [6].
-
-2.21. UTC Time
-
- Values of type uTCTimeSyntax are encoded as if they were Printable
- Strings with the strings containing a UTCTime value.
-
-2.22. Guide (search guide)
-
- Values of type Guide, such as values of the searchGuide attribute,
- are encoded according to the following BNF:
-
- <guide-value> ::= [ <object-class> '#' ] <criteria>
-
- <object-class> ::= an encoded value of type objectIdentifierSyntax
-
- <criteria> ::= <criteria-item> | <criteria-set> | '!' <criteria>
-
- <criteria-set> ::= [ '(' ] <criteria> '&' <criteria-set> [ ')' ] |
- [ '(' ] <criteria> '|' <criteria-set> [ ')' ]
-
- <criteria-item> ::= [ '(' ] <attributetype> '$' <match-type> [ ')' ]
-
- <match-type> ::= "EQ" | "SUBSTR" | "GE" | "LE" | "APPROX"
-
-2.23. Postal Address
-
- Values of type PostalAddress are encoded according to the
- following BNF:
-
- <postal-address> ::= <t61string> | <t61string> '$' <postal-address>
-
- In the above, each <t61string> component of a postal address value is
- encoded as a value of type t61StringSyntax.
-
-2.24. User Password
-
- Values of type userPasswordSyntax are encoded as if they were of type
- octetStringSyntax.
-
-2.25. User Certificate
-
- Values of type userCertificate are encoded according to the following
- BNF:
-
- <certificate> ::= <signature> '#' <issuer> '#' <validity> '#' <subject>
- '#' <public-key-info>
-
- <signature> ::= <algorithm-id>
-
- <issuer> ::= an encoded Distinguished Name
-
- <validity> ::= <not-before-time> '#' <not-after-time>
-
- <not-before-time> ::= <utc-time>
-
- <not-after-time> ::= <utc-time>
-
- <algorithm-parameters> ::= <null> | <integervalue> |
- '{ASN}' <hex-string>
-
- <subject> ::= an encoded Distinguished Name
-
- <public-key-info> ::= <algorithm-id> '#' <encrypted-value>
-
- <encrypted-value> ::= <hex-string> | <hex-string> '-' <d>
-
- <algorithm-id> ::= <oid> '#' <algorithm-parameters>
-
- <utc-time> ::= an encoded UTCTime value
-
- <hex-string> ::= <hex-digit> | <hex-digit> <hex-string>
-
-2.26. CA Certificate
-
- Values of type cACertificate are encoded as if the values were of
- type userCertificate.
-
-2.27. Authority Revocation List
-
- Values of type authorityRevocationList are encoded according to the
- following BNF:
-
- <certificate-list> ::= <signature> '#' <issuer> '#'
- <utc-time> [ '#' <revoked-certificates> ]
-
- <revoked-certificates> ::= <algorithm> '#' <encrypted-value>
- [ '#' 0*(<revoked-certificate>) '#']
-
- <revoked-certificate> ::= <subject> '#' <algorithm> '#'
- <serial> '#' <utc-time>
-
- The syntactic components <algorithm>, <issuer>, <encrypted-value>,
- <utc-time>, <subject> and <serial> have the same definitions as in
- the BNF for the userCertificate attribute syntax.
-
-2.28. Certificate Revocation List
-
- Values of type certificateRevocationList are encoded as if the values
- were of type authorityRevocationList.
-
-2.29. Cross Certificate Pair
-
- Values of type crossCertificatePair are encoded according to the
- following BNF:
-
- <certificate-pair> ::= <certificate> '|' <certificate>
-
- The syntactic component <certificate> has the same definition as in
- the BNF for the userCertificate attribute syntax.
-
-2.30. Delivery Method
-
- Values of type deliveryMethod are encoded according to the following
- BNF:
-
- <delivery-value> ::= <pdm> | <pdm> '$' <delivery-value>
-
- <pdm> ::= 'any' | 'mhs' | 'physical' | 'telex' | 'teletex' |
- 'g3fax' | 'g4fax' | 'ia5' | 'videotex' | 'telephone'
-
-2.31. Other Mailbox
-
- Values of the type otherMailboxSyntax are encoded according to the
- following BNF:
-
- <otherMailbox> ::= <mailbox-type> '$' <mailbox>
-
- <mailbox-type> ::= an encoded Printable String
-
- <mailbox> ::= an encoded IA5 String
-
- In the above, <mailbox-type> represents the type of mail system in
- which the mailbox resides, for example "Internet" or "MCIMail"; and
- <mailbox> is the actual mailbox in the mail system defined by
- <mailbox-type>.
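Since <mailbox-type> is a Printable String (which cannot contain '$') while <mailbox> is an IA5 String (which can), a decoder should split only at the first '$'. A minimal sketch with a hypothetical function name, parse_other_mailbox, and a made-up mailbox value:

```python
def parse_other_mailbox(encoded):
    """Split an encoded <otherMailbox> into (mailbox_type, mailbox)."""
    # partition() splits at the first '$' only, so any later '$'
    # characters remain part of the IA5 <mailbox> component.
    mailbox_type, separator, mailbox = encoded.partition("$")
    if not separator:
        raise ValueError("missing '$' between mailbox type and mailbox")
    return mailbox_type, mailbox
```

For example, parse_other_mailbox("Internet$user@example.com") returns ("Internet", "user@example.com").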
-
-2.32. Mail Preference
-
- Values of type mailPreferenceOption are encoded according to the
- following BNF:
-
- <mail-preference> ::= "NO-LISTS" | "ANY-LIST" | "PROFESSIONAL-LISTS"
-
-2.33. MHS OR Address
-
- Values of type MHS OR Address are encoded as strings, according to
- the format defined in [10].
-
-2.34. Photo
-
- Values of type Photo are encoded as if they were octet strings
- containing JPEG images in the JPEG File Interchange Format (JFIF), as
- described in [8].
-
-2.35. Fax
-
- Values of type Fax are encoded as if they were octet strings
- containing Group 3 Fax images as defined in [7].
-
-3. Acknowledgements
-
- Many of the attribute syntax encodings defined in this document are
- adapted from those used in the QUIPU X.500 implementation. The
- contributions of the authors of the QUIPU implementation in the
- specification of the QUIPU syntaxes [4] are gratefully acknowledged.
-
-4. Bibliography
-
- [1] The Directory: Selected Attribute Syntaxes. CCITT,
- Recommendation X.520.
-
- [2] Information Processing Systems -- Open Systems Interconnection --
- The Directory: Selected Attribute Syntaxes.
-
- [3] Barker, P., and S. Kille, "The COSINE and Internet X.500 Schema",
- RFC 1274, University College London, November 1991.
-
- [4] The ISO Development Environment: User's Manual -- Volume 5:
- QUIPU. Colin Robbins, Stephen E. Kille.
-
- [5] Kille, S., "A String Representation of Distinguished Names", RFC
- 1485, July 1993.
-
- [6] Kille, S., "A String Representation for Presentation Addresses",
- RFC 1278, University College London, November 1991.
-
- [7] Terminal Equipment and Protocols for Telematic Services -
- Standardization of Group 3 facsimile apparatus for document
- transmission. CCITT, Recommendation T.4.
-
- [8] JPEG File Interchange Format (Version 1.02). Eric Hamilton, C-
- Cube Microsystems, Milpitas, CA, September 1, 1992.
-
- [9] Yeong, W., Howes, T., and S. Kille, "Lightweight Directory Access
- Protocol", RFC 1487, Performance Systems International,
- University of Michigan, ISODE Consortium, July 1993.
-
- [10] Kille, S., "Mapping between X.400(1988)/ISO 10021 and RFC 822",
- RFC 1327, University College London, May 1992.
-
-5. Security Considerations
-
- Security issues are not discussed in this memo.
-
-6. Authors' Addresses
-
- Tim Howes
- University of Michigan
- ITD Research Systems
- 535 W William St.
- Ann Arbor, MI 48103-4943
- USA
-
- Phone: +1 313 747-4454
- EMail: tim@umich.edu
-
-
- Steve Kille
- ISODE Consortium
- PO Box 505
- London
- SW11 1DX
- UK
-
- Phone: +44-71-223-4062
- EMail: S.Kille@isode.com
-
-
- Wengyik Yeong
- PSI, Inc.
- 510 Huntmar Park Drive
- Herndon, VA 22070
- USA
-
- Phone: +1 703-450-8001
- EMail: yeongw@psilink.com
-
-
- Colin Robbins
- NeXor Ltd
- University Park
- Nottingham
- NG7 2RD
- UK
\ No newline at end of file
+++ /dev/null
-
-
-
-
-
-
-Network Working Group T. Howes
-Request for Comments: 1558 University of Michigan
-Category: Informational December 1993
-
-
- A String Representation of LDAP Search Filters
-
-Status of this Memo
-
- This memo provides information for the Internet community. This memo
- does not specify an Internet standard of any kind. Distribution of
- this memo is unlimited.
-
-Abstract
-
- The Lightweight Directory Access Protocol (LDAP) [1] defines a
- network representation of a search filter transmitted to an LDAP
- server. Some applications may find it useful to have a common way of
- representing these search filters in a human-readable form. This
- document defines a human-readable string format for representing LDAP
- search filters.
-
-1. LDAP Search Filter Definition
-
- An LDAP search filter is defined in [1] as follows:
-
- Filter ::= CHOICE {
- and [0] SET OF Filter,
- or [1] SET OF Filter,
- not [2] Filter,
- equalityMatch [3] AttributeValueAssertion,
- substrings [4] SubstringFilter,
- greaterOrEqual [5] AttributeValueAssertion,
- lessOrEqual [6] AttributeValueAssertion,
- present [7] AttributeType,
- approxMatch [8] AttributeValueAssertion
- }
-
- SubstringFilter ::= SEQUENCE {
- type AttributeType,
- SEQUENCE OF CHOICE {
- initial [0] LDAPString,
- any [1] LDAPString,
- final [2] LDAPString
- }
- }
-
-
-
-
-
-Howes [Page 1]
-\f
-RFC 1558 Representation of LDAP Filters December 1993
-
-
- AttributeValueAssertion ::= SEQUENCE {
- attributeType AttributeType,
- attributeValue AttributeValue
- }
-
- AttributeType ::= LDAPString
-
- AttributeValue ::= OCTET STRING
-
- LDAPString ::= OCTET STRING
-
- where the LDAPString above is limited to the IA5 character set. The
- AttributeType is a string representation of the attribute object
- identifier in dotted OID format (e.g., "2.5.4.10"), or the shorter
- string name of the attribute (e.g., "organizationName", or "o"). The
- AttributeValue OCTET STRING has the form defined in [2]. The Filter
- is encoded for transmission over a network using the Basic Encoding
- Rules defined in [3], with simplifications described in [1].
-
-2. String Search Filter Definition
-
- The string representation of an LDAP search filter is defined by the
- following BNF. It uses a prefix format.
-
- <filter> ::= '(' <filtercomp> ')'
- <filtercomp> ::= <and> | <or> | <not> | <item>
- <and> ::= '&' <filterlist>
- <or> ::= '|' <filterlist>
- <not> ::= '!' <filter>
- <filterlist> ::= <filter> | <filter> <filterlist>
- <item> ::= <simple> | <present> | <substring>
- <simple> ::= <attr> <filtertype> <value>
- <filtertype> ::= <equal> | <approx> | <greater> | <less>
- <equal> ::= '='
- <approx> ::= '~='
- <greater> ::= '>='
- <less> ::= '<='
- <present> ::= <attr> '=*'
- <substring> ::= <attr> '=' <initial> <any> <final>
- <initial> ::= NULL | <value>
- <any> ::= '*' <starval>
- <starval> ::= NULL | <value> '*' <starval>
- <final> ::= NULL | <value>
-
- <attr> is a string representing an AttributeType, and has the format
- defined in [1]. <value> is a string representing an AttributeValue,
- or part of one, and has the form defined in [2]. If a <value> must
- contain one of the characters '*' or '(' or ')', these characters
- should be escaped by preceding them with the backslash '\' character.
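The escaping rule can be sketched as a small helper applied to a value before it is placed in a filter string. The name escape_filter_value is hypothetical. The memo names only '*', '(' and ')' as characters to escape and does not say how a literal backslash should be treated, so this sketch leaves backslashes untouched.

```python
def escape_filter_value(value):
    """Precede each '*', '(' or ')' in value with a backslash."""
    return "".join("\\" + ch if ch in "*()" else ch for ch in value)
```

For example, the value `Babs (J*)` becomes `Babs \(J\*\)`, which can then be used as `(cn=Babs \(J\*\))`.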
-
-3. Examples
-
- This section gives a few examples of search filters written using
- this notation.
-
- (cn=Babs Jensen)
- (!(cn=Tim Howes))
- (&(objectClass=Person)(|(sn=Jensen)(cn=Babs J*)))
- (o=univ*of*mich*)
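To show how the prefix grammar nests, here is a minimal recursive-descent sketch that turns a filter string into nested tuples. The function names and the tuple layout are assumptions made for illustration. Values are returned as written (escapes are not removed), and <attr> is not checked against the format defined in [1].

```python
_KINDS = {"=": "equal", "~=": "approx", ">=": "greater", "<=": "less"}


def parse_filter(text):
    """Parse a complete <filter> string into nested tuples."""
    node, pos = _parse_filter(text, 0)
    if pos != len(text):
        raise ValueError("trailing characters after filter")
    return node


def _parse_filter(text, pos):
    if pos >= len(text) or text[pos] != "(":
        raise ValueError("expected '(' at offset %d" % pos)
    pos += 1
    if pos >= len(text):
        raise ValueError("unexpected end of filter")
    op = text[pos]
    if op in "&|":  # <and> / <or>: one or more nested filters
        pos += 1
        children = []
        while pos < len(text) and text[pos] == "(":
            child, pos = _parse_filter(text, pos)
            children.append(child)
        if not children:
            raise ValueError("empty filter list")
        node = ("and" if op == "&" else "or", children)
    elif op == "!":  # <not>: exactly one nested filter
        child, pos = _parse_filter(text, pos + 1)
        node = ("not", child)
    else:
        node, pos = _parse_item(text, pos)
    if pos >= len(text) or text[pos] != ")":
        raise ValueError("expected ')' at offset %d" % pos)
    return node, pos + 1


def _has_unescaped_star(value):
    i = 0
    while i < len(value):
        if value[i] == "\\":
            i += 2  # skip the escaped character
        elif value[i] == "*":
            return True
        else:
            i += 1
    return False


def _parse_item(text, pos):
    # Scan to the unescaped ')' that closes this <item>.
    end = pos
    while end < len(text) and text[end] != ")":
        end += 2 if text[end] == "\\" else 1
    body = text[pos:end]
    idx = body.find("=")
    if idx == -1:
        raise ValueError("missing filter type in %r" % (body,))
    if idx > 0 and body[idx - 1] in "~><":
        attr, op = body[:idx - 1], body[idx - 1:idx + 1]
    else:
        attr, op = body[:idx], "="
    if not attr:
        raise ValueError("missing attribute in %r" % (body,))
    value = body[idx + 1:]
    if op == "=" and value == "*":
        return ("present", attr), end
    if op == "=" and _has_unescaped_star(value):
        return ("substring", attr, value), end
    return (_KINDS[op], attr, value), end
```

Applied to the third example above, parse_filter returns an "and" node whose children are an equality test on objectClass and an "or" node containing an equality test on sn and a substring test on cn.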
-
-4. Security Considerations
-
- Security issues are not discussed in this memo.
-
-5. References
-
- [1] Yeong, W., Howes, T., and S. Kille, "Lightweight Directory Access
- Protocol", RFC 1487, Performance Systems International,
- University of Michigan, ISODE Consortium, July 1993.
-
- [2] Howes, T., Kille, S., Yeong, W., and C. Robbins, "The String
- Representation of Standard Attribute Syntaxes", RFC 1488,
- University of Michigan, ISODE Consortium, Performance Systems
- International, NeXor Ltd., July 1993.
-
- [3] "Specification of Basic Encoding Rules for Abstract Syntax
- Notation One (ASN.1)", CCITT Recommendation X.209, 1988.
-
-6. Author's Address
-
- Tim Howes
- University of Michigan
- ITD Research Systems
- 535 W William St.
- Ann Arbor, MI 48103-4943
- USA
-
- Phone: +1 313 747-4454
- EMail: tim@umich.edu
\ No newline at end of file
+++ /dev/null
-
-
-
-
-
-
-Network Working Group F. Yergeau
-Request for Comments: 2044 Alis Technologies
-Category: Informational October 1996
-
-
- UTF-8, a transformation format of Unicode and ISO 10646
-
-Status of this Memo
-
- This memo provides information for the Internet community. This memo
- does not specify an Internet standard of any kind. Distribution of
- this memo is unlimited.
-
-Abstract
-
- The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993 jointly
- define a 16 bit character set which encompasses most of the world's
- writing systems. 16-bit characters, however, are not compatible with
- many current applications and protocols, and this has led to the
- development of a few so-called UCS transformation formats (UTF), each
- with different characteristics. UTF-8, the object of this memo, has
- the characteristic of preserving the full US-ASCII range: US-ASCII
- characters are encoded in one octet having the usual US-ASCII value,
- and any octet with such a value can only be a US-ASCII character.
- This provides compatibility with file systems, parsers and other
- software that rely on US-ASCII values but are transparent to other
- values.
-
-1. Introduction
-
- The Unicode Standard, version 1.1 [UNICODE], and ISO/IEC 10646-1:1993
- [ISO-10646] jointly define a 16 bit character set, UCS-2, which
- encompasses most of the world's writing systems. ISO 10646 further
- defines a 31-bit character set, UCS-4, with currently no assignments
- outside of the region corresponding to UCS-2 (the Basic Multilingual
- Plane, BMP). The UCS-2 and UCS-4 encodings, however, are hard to use
- in many current applications and protocols that assume 8 or even 7
- bit characters. Even newer systems able to deal with 16 bit
- characters cannot process UCS-4 data. This situation has led to the
- development of so-called UCS transformation formats (UTF), each with
- different characteristics.
-
- UTF-1 has only historical interest, having been removed from ISO
- 10646. UTF-7 has the quality of encoding the full Unicode repertoire
- using only octets with the high-order bit clear (7 bit US-ASCII
- values, [US-ASCII]), and is thus deemed a mail-safe encoding
- ([RFC1642]). UTF-8, the object of this memo, uses all bits of an
- octet, but has the quality of preserving the full US-ASCII range:
-
-
-
-Yergeau Informational [Page 1]
-\f
-RFC 2044 UTF-8 October 1996
-
-
- US-ASCII characters are encoded in one octet having the normal US-
- ASCII value, and any octet with such a value can only stand for a
- US-ASCII character, and nothing else.
-
- UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
- into a pair of UCS-2 values from a reserved range. UTF-16 impacts
- UTF-8 in that UCS-2 values from the reserved range must be treated
- specially in the UTF-8 transformation.
-
- UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
- octets, where the number of octets, and the value of each, depend on
- the integer value assigned to the character in ISO 10646. This
- transformation format has the following characteristics (all values
- are in hexadecimal):
-
- - Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
- correspond to octets 00 to 7F (7 bit US-ASCII values).
-
- - US-ASCII values do not appear otherwise in a UTF-8 encoded
- character stream. This provides compatibility with file systems or
- other software (e.g. the printf() function in C libraries) that
- parse based on US-ASCII values but are transparent to other
- values.
-
- - Round-trip conversion is easy between UTF-8 and either of UCS-4,
- UCS-2 or Unicode.
-
- - The first octet of a multi-octet sequence indicates the number of
- octets in the sequence.
-
- - Character boundaries are easily found from anywhere in an octet
- stream.
-
- - The lexicographic sorting order of UCS-4 strings is preserved. Of
- course this is of limited interest since the sort order is not
- culturally valid in either case.
-
- - The octet values FE and FF never appear.
-
- UTF-8 was originally a project of the X/Open Joint
- Internationalization Group XOJIG with the objective to specify a File
- System Safe UCS Transformation Format [FSS_UTF] that is compatible
- with UNIX systems, supporting multilingual text in a single encoding.
- The original authors were Gary Miller, Greger Leijonhufvud and John
- Entenmann. Later, Ken Thompson and Rob Pike did significant work for
- the formal UTF-8.
-
- A description can also be found in Unicode Technical Report #4
- [UNICODE]. The definitive reference, including provisions for UTF-16
- data within UTF-8, is Annex R of ISO/IEC 10646-1 [ISO-10646].
-
-2. UTF-8 definition
-
- In UTF-8, characters are encoded using sequences of 1 to 6 octets.
- The only octet of a "sequence" of one has the higher-order bit set to
- 0, the remaining 7 bits being used to encode the character value. In
- a sequence of n octets, n>1, the initial octet has the n higher-order
- bits set to 1, followed by a bit set to 0. The remaining bit(s) of
- that octet contain bits from the value of the character to be
- encoded. The following octet(s) all have the higher-order bit set to
- 1 and the following bit set to 0, leaving 6 bits in each to contain
- bits from the character to be encoded.
-
- The table below summarizes the format of these different octet types.
- The letter x indicates bits available for encoding bits of the UCS-4
- character value.
-
- UCS-4 range (hex.) UTF-8 octet sequence (binary)
- 0000 0000-0000 007F 0xxxxxxx
- 0000 0080-0000 07FF 110xxxxx 10xxxxxx
- 0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
-
- 0001 0000-001F FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- 0020 0000-03FF FFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
- 0400 0000-7FFF FFFF 1111110x 10xxxxxx ... 10xxxxxx
-
- Encoding from UCS-4 to UTF-8 proceeds as follows:
-
- 1) Determine the number of octets required from the character value
- and the first column of the table above.
-
- 2) Prepare the high-order bits of the octets as per the second column
- of the table.
-
- 3) Fill in the bits marked x from the bits of the character value,
- starting from the lower-order bits of the character value and
- putting them first in the last octet of the sequence, then the
- next to last, etc. until all x bits are filled in.
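The three encoding steps can be sketched as follows. This is an illustrative implementation of the table as given in this memo (sequences of up to 6 octets), not a reference encoder; the names utf8_encode_char and _RANGES are hypothetical. It takes a UCS-4 value and does not apply any special handling to the D800-DFFF range.

```python
# (upper bound of the UCS-4 range, number of octets, fixed bits of first octet)
_RANGES = [
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def utf8_encode_char(value):
    """Encode one UCS-4 character value as a UTF-8 octet sequence."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the UCS-4 range")
    # Step 1: find the number of octets from the first column of the table.
    for upper, length, prefix in _RANGES:
        if value <= upper:
            break
    octets = bytearray(length)
    # Step 3: fill the x bits from the low-order end, last octet first.
    for i in range(length - 1, 0, -1):
        octets[i] = 0x80 | (value & 0x3F)  # step 2: '10' prefix + 6 bits
        value >>= 6
    octets[0] = prefix | value  # step 2: length marker + remaining bits
    return bytes(octets)
```

For the character 2262 (NOT IDENTICAL TO), this yields the octets E2 89 A2.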
-
- The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
- obtained from the above, in principle, by simply extending each
- UCS-2 character with two zero-valued octets. However, UCS-2
- values between D800 and DFFF, being actually UCS-4 characters
- transformed through UTF-16, need special treatment: the UTF-16
- transformation must be undone, yielding a UCS-4 character that is
- then transformed as above.
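This memo does not spell out how the UTF-16 transformation is undone. The sketch below uses the standard UTF-16 pairing rule (a value in D800-DBFF followed by a value in DC00-DFFF), which is an assumption brought in from the UTF-16 definition rather than from this text; the function name ucs2_to_ucs4 is hypothetical.

```python
def ucs2_to_ucs4(units):
    """Convert a sequence of 16-bit values to UCS-4 values,
    combining each UTF-16 pair into a single character."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # first half of a pair
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired value %04X" % unit)
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:  # second half without a first half
            raise ValueError("unpaired value %04X" % unit)
        else:  # ordinary UCS-2 value: extend with zero-valued high octets
            result.append(unit)
            i += 1
    return result
```

Each resulting UCS-4 value can then be encoded with the general UCS-4 to UTF-8 procedure.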
-
- Decoding from UTF-8 to UCS-4 proceeds as follows:
-
- 1) Initialize the 4 octets of the UCS-4 character with all bits set
- to 0.
-
- 2) Determine which bits encode the character value from the number of
- octets in the sequence and the second column of the table above
- (the bits marked x).
-
- 3) Distribute the bits from the sequence to the UCS-4 character,
- first the lower-order bits from the last octet of the sequence and
- proceeding to the left until no x bits are left.
-
- If the UTF-8 sequence is no more than three octets long, decoding
- can proceed directly to UCS-2 (or equivalently Unicode).
-
- A more detailed algorithm and formulae can be found in [FSS_UTF],
- [UNICODE] or Annex R to [ISO-10646].
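The decoding steps can be sketched in the same way. The name utf8_decode_char is hypothetical, and the sketch follows only the rules stated in this memo: it checks the octet prefixes and sequence length but does not reject sequences that are longer than necessary for the value they carry.

```python
def utf8_decode_char(data, pos=0):
    """Decode one character starting at data[pos].

    Returns (UCS-4 value, position of the next sequence)."""
    first = data[pos]
    if first < 0x80:  # single octet: 0xxxxxxx
        return first, pos + 1
    # The number of leading 1 bits gives the length of the sequence.
    length = 0
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6:
        raise ValueError("invalid first octet %02X" % first)
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    value = first & (mask - 1)  # x bits of the first octet
    for octet in data[pos + 1:pos + length]:
        if octet & 0xC0 != 0x80:  # following octets must be 10xxxxxx
            raise ValueError("invalid following octet %02X" % octet)
        value = (value << 6) | (octet & 0x3F)
    return value, pos + length
```

Calling it repeatedly, each time passing the returned position, walks through a whole octet stream one character at a time.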
-
-3. Examples
-
- The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
- 002E) may be encoded as follows:
-
- 41 E2 89 A2 CE 91 2E
-
- The Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (0048, 0069,
- 0020, 004D, 006F, 006D, 0020, 263A, 0021) may be encoded as follows:
-
- 48 69 20 4D 6F 6D 20 E2 98 BA 21
-
- The Unicode sequence representing the Han characters for the Japanese
- word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:
-
- E6 97 A5 E6 9C AC E8 AA 9E
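The three examples can be cross-checked against an existing codec. The sketch below uses Python's built-in "utf-8" codec, which follows a later revision of the format but agrees with this memo for characters in the Basic Multilingual Plane; the names EXAMPLES and check_examples are hypothetical.

```python
# Each example from this section: (text, expected octets in hexadecimal).
EXAMPLES = [
    ("A\u2262\u0391.", "41 E2 89 A2 CE 91 2E"),
    ("Hi Mom \u263A!", "48 69 20 4D 6F 6D 20 E2 98 BA 21"),
    ("\u65E5\u672C\u8A9E", "E6 97 A5 E6 9C AC E8 AA 9E"),
]


def check_examples():
    """Return True if every example encodes to the listed octets."""
    return all(
        text.encode("utf-8") == bytes.fromhex(expected)
        for text, expected in EXAMPLES
    )
```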
-
-MIME registrations
-
- This memo is meant to serve as the basis for registration of a MIME
- character encoding (charset) as per [RFC1521]. The proposed charset
- parameter value is "UTF-8". This string would label media types
- containing text consisting of characters from the repertoire of ISO
- 10646-1 encoded to a sequence of octets using the encoding scheme
- outlined above.
-
-Security Considerations
-
- Security issues are not discussed in this memo.
-
-Acknowledgments
-
- The following have participated in the drafting and discussion of
- this memo:
-
- James E. Agenbroad Andries Brouwer
- Martin J. Dürst David Goldsmith
- Edwin F. Hart Kent Karlsson
- Markus Kuhn Michael Kung
- Alain LaBonte Murray Sargent
- Keld Simonsen Arnold Winkler
-
-Bibliography
-
- [FSS_UTF] X/Open CAE Specification C501 ISBN 1-85912-082-2 28cm.
- 22p. pbk. 172g. 4/95, X/Open Company Ltd., "File
- System Safe UCS Transformation Format (FSS_UTF)", X/Open
- Preliminary Specification, Document Number P316. Also
- published in Unicode Technical Report #4.
-
- [ISO-10646] ISO/IEC 10646-1:1993. International Standard --
- Information technology -- Universal Multiple-Octet Coded
- Character Set (UCS) -- Part 1: Architecture and Basic
- Multilingual Plane. UTF-8 is described in Annex R,
- adopted but not yet published. UTF-16 is described in
- Annex Q, adopted but not yet published.
-
- [RFC1521] Borenstein, N., and N. Freed, "MIME (Multipurpose
- Internet Mail Extensions) Part One: Mechanisms for
- Specifying and Describing the Format of Internet
- Message Bodies", RFC 1521, Bellcore, Innosoft, September
- 1993.
-
- [RFC1641] Goldsmith, D., and M. Davis, "Using Unicode with
- MIME", RFC 1641, Taligent inc., July 1994.
-
- [RFC1642] Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe
- Transformation Format of Unicode", RFC 1642,
- Taligent, Inc., July 1994.
-
- [UNICODE] The Unicode Consortium, "The Unicode Standard --
- Worldwide Character Encoding -- Version 1.0", Addison-
- Wesley, Volume 1, 1991, Volume 2, 1992. UTF-8 is
- described in Unicode Technical Report #4.
-
- [US-ASCII] Coded Character Set--7-bit American Standard Code for
- Information Interchange, ANSI X3.4-1986.
-
-Author's Address
-
- Francois Yergeau
- Alis Technologies
- 100, boul. Alexis-Nihon
- Suite 600
- Montreal QC H4M 2P2
- Canada
-
- Tel: +1 (514) 747-2547
- Fax: +1 (514) 747-2561
- EMail: fyergeau@alis.com