From eeefab745c0fce09b33aa119e069780480f5a949 Mon Sep 17 00:00:00 2001 From: Kurt Zeilenga Date: Mon, 7 Feb 2000 05:48:17 +0000 Subject: [PATCH] Move a few obsolete RFCs to the Attic --- doc/rfc/rfc1488.txt | 619 -------------------------------------------- doc/rfc/rfc1558.txt | 171 ------------ doc/rfc/rfc2044.txt | 339 ------------------------ 3 files changed, 1129 deletions(-) delete mode 100644 doc/rfc/rfc1488.txt delete mode 100644 doc/rfc/rfc1558.txt delete mode 100644 doc/rfc/rfc2044.txt diff --git a/doc/rfc/rfc1488.txt b/doc/rfc/rfc1488.txt deleted file mode 100644 index ecafe75bfd..0000000000 --- a/doc/rfc/rfc1488.txt +++ /dev/null @@ -1,619 +0,0 @@ - - - - - - -Network Working Group T. Howes -Request for Comments: 1488 University of Michigan - S. Kille - ISODE Consortium - W. Yeong - Performance Systems International - C. Robbins - NeXor Ltd. - July 1993 - - - The X.500 String Representation of Standard Attribute Syntaxes - -Status of this Memo - - This RFC specifies an IAB standards track protocol for the Internet - community, and requests discussion and suggestions for improvements. - Please refer to the current edition of the "IAB Official Protocol - Standards" for the standardization state and status of this protocol. - Distribution of this memo is unlimited. - -Abstract - - The Lightweight Directory Access Protocol (LDAP) [9] requires that - the contents of AttributeValue fields in protocol elements be octet - strings. This document defines the requirements that must be - satisfied by encoding rules used to render Directory attribute - syntaxes into a form suitable for use in the LDAP, then goes on to - define the encoding rules for the standard set of attribute syntaxes - defined in [1,2] and [3]. - -1. Attribute Syntax Encoding Requirements - - This section defines general requirements for lightweight directory - protocol attribute syntax encodings. 
All documents defining attribute - syntax encodings for use by the lightweight directory protocols are - expected to conform to these requirements. - - The encoding rules defined for a given attribute syntax must produce - octet strings. To the greatest extent possible, encoded octet - strings should be usable in their native encoded form for display - purposes. In particular, encoding rules for attribute syntaxes - defining non-binary values should produce strings that can be - displayed with little or no translation by clients implementing the - lightweight directory protocols. - - - - - - -Howes, Kille, Yeong & Robbins [Page 1] - -RFC 1488 X.500 Syntax Encoding July 1993 - - -2. Standard Attribute Syntax Encodings - - For the purposes of defining the encoding rules for the standard - attribute syntaxes, the following auxiliary BNF definitions will be - used: - - ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | - 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | - 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' | 'A' | - 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | - 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | - 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' - - ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' - - ::= | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | - 'A' | 'B' | 'C' | 'D' | 'E' | 'F' - - ::= | | '-' - -

::= | | ''' | '(' | ')' | '+' | ',' | '-' | '.' | - '/' | ':' | '?' | ' ' - - ::= The ASCII newline character with hexadecimal value 0x0A - - ::= | - - ::= | - - ::= | - - ::= | - - ::= |
- - ::= ' ' | ' ' - -2.1. Undefined - - Values of type Undefined are encoded as if they were values of type - Octet String. - -2.2. Case Ignore String - - A string of type caseIgnoreStringSyntax is encoded as the string - value itself. - - - - - -Howes, Kille, Yeong & Robbins [Page 2] - -RFC 1488 X.500 Syntax Encoding July 1993 - - -2.3. Case Exact String - - The encoding of a string of type caseExactStringSyntax is the string - value itself. - -2.4. Printable String - - The encoding of a string of type printableStringSyntax is the string - value itself. - -2.5. Numeric String - - The encoding of a string of type numericStringSyntax is the string - value itself. - -2.6. Octet String - - The encoding of a string of type octetStringSyntax is the string - value itself. - -2.7. Case Ignore IA5 String - - The encoding of a string of type caseIgnoreIA5String is the string - value itself. - -2.8. IA5 String - - The encoding of a string of type iA5StringSyntax is the string value - itself. - -2.9. T61 String - - The encoding of a string of type t61StringSyntax is the string value - itself. - -2.10. Case Ignore List - - Values of type caseIgnoreListSyntax are encoded according to the - following BNF: - - ::= | - '$' - - ::= a string encoded according to the rules - for Case Ignore String as above. - - - - - - -Howes, Kille, Yeong & Robbins [Page 3] - -RFC 1488 X.500 Syntax Encoding July 1993 - - -2.11. Case Exact List - - Values of type caseExactListSyntax are encoded according to the - following BNF: - - ::= | - '$' - - ::= a string encoded according to the rules for - Case Exact String as above. - -2.12. Distinguished Name - - Values of type distinguishedNameSyntax are encoded to have the - representation defined in [5]. - -2.13. Boolean - - Values of type booleanSyntax are encoded according to the following - BNF: - - ::= "TRUE" | "FALSE" - - Boolean values have an encoding of "TRUE" if they are logically true, - and have an encoding of "FALSE" otherwise. - -2.14. 
Integer - - Values of type integerSyntax are encoded as the decimal - representation of their values, with each decimal digit represented - by the its character equivalent. So the digit 1 is represented by the - character - -2.15. Object Identifier - - Values of type objectIdentifierSyntax are encoded according to the - following BNF: - - ::= | '.' | - - ::= - - ::= | '.' - - In the above BNF, is the syntactic representation of an - object descriptor. When encoding values of type - objectIdentifierSyntax, the first encoding option should be used in - preference to the second, which should be used in preference to the - - - -Howes, Kille, Yeong & Robbins [Page 4] - -RFC 1488 X.500 Syntax Encoding July 1993 - - - third wherever possible. That is, in encoding object identifiers, - object descriptors (where assigned and known by the implementation) - should be used in preference to numeric oids to the greatest extent - possible. For example, in encoding the object identifier representing - an organizationName, the descriptor "organizationName" is preferable - to "ds.4.10", which is in turn preferable to the string "2.5.4.10". - -2.16. Telephone Number - - Values of type telephoneNumberSyntax are encoded as if they were - Printable String types. - -2.17. Telex Number - - Values of type telexNumberSyntax are encoded according to the - following BNF: - - ::= '$' '$' - - ::= - - ::= - - ::= - - In the above, is the syntactic representation of the - number portion of the TELEX number being encoded, is the - TELEX country code, and is the answerback code of a - TELEX terminal. - -2.18. Teletex Terminal Identifier - - Values of type teletexTerminalIdentifier are encoded according to the - following BNF: - - ::= 0*( '$' ) - - In the above, the first is the encoding of the - first portion of the teletex terminal identifier to be encoded, and - the subsequent 0 or more are subsequent portions - of the teletex terminal identifier. - -2.19. 
Facsimile Telephone Number - - Values of type FacsimileTelephoneNumber are encoded according to the - following BNF: - - ::= [ '$' ] - - - -Howes, Kille, Yeong & Robbins [Page 5] - -RFC 1488 X.500 Syntax Encoding July 1993 - - - ::= | '$' - - ::= 'twoDimensional' | 'fineResolution' | 'unlimitedLength' | - 'b4Length' | 'a3Width' | 'b4Width' | 'uncompressed' - - In the above, the first is the actual fax number, - and the tokens represent fax parameters. - -2.20. Presentation Address - - Values of type PresentationAddress are encoded to have the - representation described in [6]. - -2.21. UTC Time - - Values of type uTCTimeSyntax are encoded as if they were Printable - Strings with the strings containing a UTCTime value. - -2.22. Guide (search guide) - - Values of type Guide, such as values of the searchGuide attribute, - are encoded according to the following BNF: - - ::= [ '#' ] - - ::= an encoded value of type objectIdentifierSyntax - - ::= | | '!' - - ::= [ '(' ] '&' [ ')' ] | - [ '(' ] '|' [ ')' ] - - ::= [ '(' ] '$' [ ')' ] - - ::= "EQ" | "SUBSTR" | "GE" | "LE" | "APPROX" - -2.23. Postal Address - -Values of type PostalAddress are encoded according to the following BNF: - - ::= | '$' - - In the above, each component of a postal address value is - encoded as a value of type t61StringSyntax. - - - - - - - -Howes, Kille, Yeong & Robbins [Page 6] - -RFC 1488 X.500 Syntax Encoding July 1993 - - -2.24. User Password - - Values of type userPasswordSyntax are encoded as if they were of type - octetStringSyntax. - -2.25. User Certificate - - Values of type userCertificate are encoded according to the following - BNF: - - ::= '#' '#' '#' - '#' - - ::= - - ::= an encoded Distinguished Name - - ::= '#' - - ::= - - ::= - - ::= | | - '{ASN}' - - ::= an encoded Distinguished Name - - ::= '#' - - ::= | '-' - - ::= '#' - - ::= an encoded UTCTime value - - ::= | - -2.26. CA Certificate - - Values of type cACertificate are encoded as if the values were of - type userCertificate. 
- -2.27. Authority Revocation List - - Values of type authorityRevocationList are encoded according to the - following BNF: - - - - -Howes, Kille, Yeong & Robbins [Page 7] - -RFC 1488 X.500 Syntax Encoding July 1993 - - - ::= '#' '#' - [ '#' ] - - ::= '#' - [ '#' 0*() '#'] - - ::= '#' '#' - '#' - - The syntactic components , , , - , and have the same definitions as in - the BNF for the userCertificate attribute syntax. - -2.28. Certificate Revocation List - - Values of type certificateRevocationList are encoded as if the values - were of type authorityRevocationList. - -2.29. Cross Certificate Pair - - Values of type crossCertificatePair are encoded according to the - following BNF: - - ::= '|' - - The syntactic component has the same definition as in - the BNF for the userCertificate attribute syntax. - -2.30. Delivery Method - - Values of type deliveryMethod are encoded according to the following - BNF: - - ::= | '$' - - ::= 'any' | 'mhs' | 'physical' | 'telex' | 'teletex' | - 'g3fax' | 'g4fax' | 'ia5' | 'videotex' | 'telephone' - -2.31. Other Mailbox - - Values of the type otherMailboxSyntax are encoded according to the - following BNF: - - ::= '$' - - ::= an encoded Printable String - - ::= an encoded IA5 String - - - -Howes, Kille, Yeong & Robbins [Page 8] - -RFC 1488 X.500 Syntax Encoding July 1993 - - - In the above, represents the type of mail system in - which the mailbox resides, for example "Internet" or "MCIMail"; and - is the actual mailbox in the mail system defined by - . - -2.32. Mail Preference - - Values of type mailPreferenceOption are encoded according to the - following BNF: - - ::= "NO-LISTS" | "ANY-LIST" | "PROFESSIONAL-LISTS" - -2.33. MHS OR Address - - Values of type MHS OR Address are encoded as strings, according to - the format defined in [10]. - -2.34. Photo - - Values of type Photo are encoded as if they were octet strings - containing JPEG images in the JPEG File Interchange Format (JFIF), as - described in [8]. - -2.35. 
Fax - - Values of type Fax are encoded as if they were octet strings - containing Group 3 Fax images as defined in [7]. - -3. Acknowledgements - - Many of the attribute syntax encodings defined in this document are - adapted from those used in the QUIPU X.500 implementation. The - contribu- tions of the authors of the QUIPU implementation in the - specification of the QUIPU syntaxes [4] are gratefully acknowledged. - -4. Bibliography - - [1] The Directory: Selected Attribute Syntaxes. CCITT, - Recommendation X.520. - - [2] Information Processing Systems -- Open Systems Interconnection -- - The Directory: Selected Attribute Syntaxes. - - [3] Barker, P., and S. Kille, "The COSINE and Internet X.500 Schema", - RFC 1274, University College London, November 1991. - - [4] The ISO Development Environment: User's Manual -- Volume 5: - QUIPU. Colin Robbins, Stephen E. Kille. - - - -Howes, Kille, Yeong & Robbins [Page 9] - -RFC 1488 X.500 Syntax Encoding July 1993 - - - [5] Kille, S., "A String Representation of Distinguished Names", RFC - 1485, July 1993. - - [6] Kille, S., "A String Representation for Presentation Addresses", - RFC 1278, University College London, November 1991. - - [7] Terminal Equipment and Protocols for Telematic Services - - Standardization of Group 3 facsimile apparatus for document - transmission. CCITT, Recommendation T.4. - - [8] JPEG File Interchange Format (Version 1.02). Eric Hamilton, C- - Cube Microsystems, Milpitas, CA, September 1, 1992. - - [9] Yeong, W., Howes, T., and S. Kille, "Lightweight Directory Access - Protocol", RFC 1487, Performance Systems International, - University of Michigan, ISODE Consortium, July 1993. - - [10] Kille, S., "Mapping between X.400(1988)/ISO 10021 and RFC 822", - RFC 1327, University College London, May 1992. - -5. Security Considerations - - Security issues are not discussed in this memo. 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -Howes, Kille, Yeong & Robbins [Page 10] - -RFC 1488 X.500 Syntax Encoding July 1993 - - -6. Authors' Addresses - - Tim Howes - University of Michigan - ITD Research Systems - 535 W William St. - Ann Arbor, MI 48103-4943 - USA - - Phone: +1 313 747-4454 - EMail: tim@umich.edu - - - Steve Kille - ISODE Consortium - PO Box 505 - London - SW11 1DX - UK - - Phone: +44-71-223-4062 - EMail: S.Kille@isode.com - - - Wengyik Yeong - PSI, Inc. - 510 Huntmar Park Drive - Herndon, VA 22070 - USA - - Phone: +1 703-450-8001 - EMail: yeongw@psilink.com - - - Colin Robbins - NeXor Ltd - University Park - Nottingham - NG7 2RD - UK - - - - - - - - - - - -Howes, Kille, Yeong & Robbins [Page 11] - \ No newline at end of file diff --git a/doc/rfc/rfc1558.txt b/doc/rfc/rfc1558.txt deleted file mode 100644 index 1bb5bd9647..0000000000 --- a/doc/rfc/rfc1558.txt +++ /dev/null @@ -1,171 +0,0 @@ - - - - - - -Network Working Group T. Howes -Request for Comments: 1558 University of Michigan -Category: Informational December 1993 - - - A String Representation of LDAP Search Filters - -Status of this Memo - - This memo provides information for the Internet community. This memo - does not specify an Internet standard of any kind. Distribution of - this memo is unlimited. - -Abstract - - The Lightweight Directory Access Protocol (LDAP) [1] defines a - network representation of a search filter transmitted to an LDAP - server. Some applications may find it useful to have a common way of - representing these search filters in a human-readable form. This - document defines a human-readable string format for representing LDAP - search filters. - -1. 
LDAP Search Filter Definition - - An LDAP search filter is defined in [1] as follows: - - Filter ::= CHOICE { - and [0] SET OF Filter, - or [1] SET OF Filter, - not [2] Filter, - equalityMatch [3] AttributeValueAssertion, - substrings [4] SubstringFilter, - greaterOrEqual [5] AttributeValueAssertion, - lessOrEqual [6] AttributeValueAssertion, - present [7] AttributeType, - approxMatch [8] AttributeValueAssertion - } - - SubstringFilter ::= SEQUENCE { - type AttributeType, - SEQUENCE OF CHOICE { - initial [0] LDAPString, - any [1] LDAPString, - final [2] LDAPString - } - } - - - - - -Howes [Page 1] - -RFC 1558 Representation of LDAP Filters December 1993 - - - AttributeValueAssertion ::= SEQUENCE - attributeType AttributeType, - attributeValue AttributeValue - } - - AttributeType ::= LDAPString - - AttributeValue ::= OCTET STRING - - LDAPString ::= OCTET STRING - - where the LDAPString above is limited to the IA5 character set. The - AttributeType is a string representation of the attribute object - identifier in dotted OID format (e.g., "2.5.4.10"), or the shorter - string name of the attribute (e.g., "organizationName", or "o"). The - AttributeValue OCTET STRING has the form defined in [2]. The Filter - is encoded for transmission over a network using the Basic Encoding - Rules defined in [3], with simplifications described in [1]. - -2. String Search Filter Definition - - The string representation of an LDAP search filter is defined by the - following BNF. It uses a prefix format. - - ::= '(' ')' - ::= | | | - ::= '&' - ::= '|' - ::= '!' - ::= | - ::= | | - ::= - ::= | | | - ::= '=' - ::= '~=' - ::= '>=' - ::= '<=' - ::= '=*' - ::= '=' - ::= NULL | - ::= '*' - ::= NULL | '*' - ::= NULL | - - is a string representing an AttributeType, and has the format - defined in [1]. is a string representing an AttributeValue, - or part of one, and has the form defined in [2]. 
If a must - contain one of the characters '*' or '(' or ')', these characters - - - -Howes [Page 2] - -RFC 1558 Representation of LDAP Filters December 1993 - - - should be escaped by preceding them with the backslash '\' character. - -3. Examples - - This section gives a few examples of search filters written using - this notation. - - (cn=Babs Jensen) - (!(cn=Tim Howes)) - (&(objectClass=Person)(|(sn=Jensen)(cn=Babs J*))) - (o=univ*of*mich*) - -4. Security Considerations - - Security issues are not discussed in this memo. - -5. References - - [1] Yeong, W., Howes, T., and S. Kille, "Lightweight Directory Access - Protocol", RFC 1487, Performance Systems International, - University of Michigan, ISODE Consortium, July 1993. - - [2] Howes, T., Kille, S., Yeong, W., and C. Robbins, "The String - Representation of Standard Attribute Syntaxes", RFC 1488, - University of Michigan, ISODE Consortium, Performance Systems - International, NeXor Ltd., July 1993. - - [3] "Specification of Basic Encoding Rules for Abstract Syntax - Notation One (ASN.1)", CCITT Recommendation X.209, 1988. - -6. Author's Address - - Tim Howes - University of Michigan - ITD Research Systems - 535 W William St. - Ann Arbor, MI 48103-4943 - USA - - Phone: +1 313 747-4454 - EMail: tim@umich.edu - - - - - - - - - - -Howes [Page 3] - \ No newline at end of file diff --git a/doc/rfc/rfc2044.txt b/doc/rfc/rfc2044.txt deleted file mode 100644 index 22e74522a4..0000000000 --- a/doc/rfc/rfc2044.txt +++ /dev/null @@ -1,339 +0,0 @@ - - - - - - -Network Working Group F. Yergeau -Request for Comments: 2044 Alis Technologies -Category: Informational October 1996 - - - UTF-8, a transformation format of Unicode and ISO 10646 - -Status of this Memo - - This memo provides information for the Internet community. This memo - does not specify an Internet standard of any kind. Distribution of - this memo is unlimited. 
- -Abstract - - The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993 jointly - define a 16 bit character set which encompasses most of the world's - writing systems. 16-bit characters, however, are not compatible with - many current applications and protocols, and this has led to the - development of a few so-called UCS transformation formats (UTF), each - with different characteristics. UTF-8, the object of this memo, has - the characteristic of preserving the full US-ASCII range: US-ASCII - characters are encoded in one octet having the usual US-ASCII value, - and any octet with such a value can only be an US-ASCII character. - This provides compatibility with file systems, parsers and other - software that rely on US-ASCII values but are transparent to other - values. - -1. Introduction - - The Unicode Standard, version 1.1 [UNICODE], and ISO/IEC 10646-1:1993 - [ISO-10646] jointly define a 16 bit character set, UCS-2, which - encompasses most of the world's writing systems. ISO 10646 further - defines a 31-bit character set, UCS-4, with currently no assignments - outside of the region corresponding to UCS-2 (the Basic Multilingual - Plane, BMP). The UCS-2 and UCS-4 encodings, however, are hard to use - in many current applications and protocols that assume 8 or even 7 - bit characters. Even newer systems able to deal with 16 bit - characters cannot process UCS-4 data. This situation has led to the - development of so-called UCS transformation formats (UTF), each with - different characteristics. - - UTF-1 has only historical interest, having been removed from ISO - 10646. UTF-7 has the quality of encoding the full Unicode repertoire - using only octets with the high-order bit clear (7 bit US-ASCII - values, [US-ASCII]), and is thus deemed a mail-safe encoding - ([RFC1642]). 
UTF-8, the object of this memo, uses all bits of an - octet, but has the quality of preserving the full US-ASCII range: - - - -Yergeau Informational [Page 1] - -RFC 2044 UTF-8 October 1996 - - - US-ASCII characters are encoded in one octet having the normal US- - ASCII value, and any octet with such a value can only stand for an - US-ASCII character, and nothing else. - - UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire - into a pair of UCS-2 values from a reserved range. UTF-16 impacts - UTF-8 in that UCS-2 values from the reserved range must be treated - specially in the UTF-8 transformation. - - UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of - octets, where the number of octets, and the value of each, depend on - the integer value assigned to the character in ISO 10646. This - transformation format has the following characteristics (all values - are in hexadecimal): - - - Character values from 0000 0000 to 0000 007F (US-ASCII repertoire) - correspond to octets 00 to 7F (7 bit US-ASCII values). - - - US-ASCII values do not appear otherwise in a UTF-8 encoded charac- - ter stream. This provides compatibility with file systems or - other software (e.g. the printf() function in C libraries) that - parse based on US-ASCII values but are transparent to other val- - ues. - - - Round-trip conversion is easy between UTF-8 and either of UCS-4, - UCS-2 or Unicode. - - - The first octet of a multi-octet sequence indicates the number of - octets in the sequence. - - - Character boundaries are easily found from anywhere in an octet - stream. - - - The lexicographic sorting order of UCS-4 strings is preserved. Of - course this is of limited interest since the sort order is not - culturally valid in either case. - - - The octet values FE and FF never appear. 
- - UTF-8 was originally a project of the X/Open Joint - Internationalization Group XOJIG with the objective to specify a File - System Safe UCS Transformation Format [FSS-UTF] that is compatible - with UNIX systems, supporting multilingual text in a single encoding. - The original authors were Gary Miller, Greger Leijonhufvud and John - Entenmann. Later, Ken Thompson and Rob Pike did significant work for - the formal UTF-8. - - - - - -Yergeau Informational [Page 2] - -RFC 2044 UTF-8 October 1996 - - - A description can also be found in Unicode Technical Report #4 [UNI- - CODE]. The definitive reference, including provisions for UTF-16 - data within UTF-8, is Annex R of ISO/IEC 10646-1 [ISO-10646]. - -2. UTF-8 definition - - In UTF-8, characters are encoded using sequences of 1 to 6 octets. - The only octet of a "sequence" of one has the higher-order bit set to - 0, the remaining 7 bits being used to encode the character value. In - a sequence of n octets, n>1, the initial octet has the n higher-order - bits set to 1, followed by a bit set to 0. The remaining bit(s) of - that octet contain bits from the value of the character to be - encoded. The following octet(s) all have the higher-order bit set to - 1 and the following bit set to 0, leaving 6 bits in each to contain - bits from the character to be encoded. - - The table below summarizes the format of these different octet types. - The letter x indicates bits available for encoding bits of the UCS-4 - character value. - - UCS-4 range (hex.) UTF-8 octet sequence (binary) - 0000 0000-0000 007F 0xxxxxxx - 0000 0080-0000 07FF 110xxxxx 10xxxxxx - 0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx - - 0001 0000-001F FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 0020 0000-03FF FFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx - 0400 0000-7FFF FFFF 1111110x 10xxxxxx ... 
10xxxxxx - - Encoding from UCS-4 to UTF-8 proceeds as follows: - - 1) Determine the number of octets required from the character value - and the first column of the table above. - - 2) Prepare the high-order bits of the octets as per the second column - of the table. - - 3) Fill in the bits marked x from the bits of the character value, - starting from the lower-order bits of the character value and - putting them first in the last octet of the sequence, then the - next to last, etc. until all x bits are filled in. - - - - - - - - - - -Yergeau Informational [Page 3] - -RFC 2044 UTF-8 October 1996 - - - The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be - obtained from the above, in principle, by simply extending each - UCS-2 character with two zero-valued octets. However, UCS-2 val- - ues between D800 and DFFF, being actually UCS-4 characters trans- - formed through UTF-16, need special treatment: the UTF-16 trans- - formation must be undone, yielding a UCS-4 character that is then - transformed as above. - - Decoding from UTF-8 to UCS-4 proceeds as follows: - - 1) Initialize the 4 octets of the UCS-4 character with all bits set - to 0. - - 2) Determine which bits encode the character value from the number of - octets in the sequence and the second column of the table above - (the bits marked x). - - 3) Distribute the bits from the sequence to the UCS-4 character, - first the lower-order bits from the last octet of the sequence and - proceeding to the left until no x bits are left. - - If the UTF-8 sequence is no more than three octets long, decoding - can proceed directly to UCS-2 (or equivalently Unicode). - - A more detailed algorithm and formulae can be found in [FSS_UTF], - [UNICODE] or Annex R to [ISO-10646]. - -3. Examples - - The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391, - 002E) may be encoded as follows: - - 41 E2 89 A2 CE 91 2E - - The Unicode sequence "Hi Mom <WHITE SMILING FACE>!" 
(0048, 0069, - 0020, 004D, 006F, 006D, 0020, 263A, 0021) may be encoded as follows: - - 48 69 20 4D 6F 6D 20 E2 98 BA 21 - - The Unicode sequence representing the Han characters for the Japanese - word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows: - - E6 97 A5 E6 9C AC E8 AA 9E - - - - - - - - -Yergeau Informational [Page 4] - -RFC 2044 UTF-8 October 1996 - - -MIME registrations - - This memo is meant to serve as the basis for registration of a MIME - character encoding (charset) as per [RFC1521]. The proposed charset - parameter value is "UTF-8". This string would label media types - containing text consisting of characters from the repertoire of ISO - 10646-1 encoded to a sequence of octets using the encoding scheme - outlined above. - -Security Considerations - - Security issues are not discussed in this memo. - -Acknowledgments - - The following have participated in the drafting and discussion of - this memo: - - James E. Agenbroad Andries Brouwer - Martin J. D|rst David Goldsmith - Edwin F. Hart Kent Karlsson - Markus Kuhn Michael Kung - Alain LaBonte Murray Sargent - Keld Simonsen Arnold Winkler - -Bibliography - - [FSS_UTF] X/Open CAE Specification C501 ISBN 1-85912-082-2 28cm. - 22p. pbk. 172g. 4/95, X/Open Company Ltd., "File Sys- - tem Safe UCS Transformation Format (FSS_UTF)", X/Open - Preleminary Specification, Document Number P316. Also - published in Unicode Technical Report #4. - - [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Infor- - mation technology -- Universal Multiple-Octet Coded - Character Set (UCS) -- Part 1: Architecture and Basic - Multilingual Plane. UTF-8 is described in Annex R, - adopted but not yet published. UTF-16 is described in - Annex Q, adopted but not yet published. - - [RFC1521] Borenstein, N., and N. Freed, "MIME (Multipurpose - Internet Mail Extensions) Part One: Mechanisms for - Specifying and Describing the Format of Internet Mes- - sage Bodies", RFC 1521, Bellcore, Innosoft, September - 1993. 
- - [RFC1641] Goldsmith, D., and M. Davis, "Using Unicode with - MIME", RFC 1641, Taligent inc., July 1994. - - - -Yergeau Informational [Page 5] - -RFC 2044 UTF-8 October 1996 - - - [RFC1642] Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe - Transformation Format of Unicode", RFC 1642, - Taligent, Inc., July 1994. - - [UNICODE] The Unicode Consortium, "The Unicode Standard -- - Worldwide Character Encoding -- Version 1.0", Addison- - Wesley, Volume 1, 1991, Volume 2, 1992. UTF-8 is - described in Unicode Technical Report #4. - - [US-ASCII] Coded Character Set--7-bit American Standard Code for - Information Interchange, ANSI X3.4-1986. - -Author's Address - - Francois Yergeau - Alis Technologies - 100, boul. Alexis-Nihon - Suite 600 - Montreal QC H4M 2P2 - Canada - - Tel: +1 (514) 747-2547 - Fax: +1 (514) 747-2561 - EMail: fyergeau@alis.com - - - - - - - - - - - - - - - - - - - - - - - - - - - -Yergeau Informational [Page 6] - -- 2.39.5