#
# $Id: README,v 1.32 1999/11/29 16:41:05 mleisher Exp $
#

                           MUTT UCData Package 2.4
                           -----------------------

This package supports ctype-like operations for Unicode UCS-2 text (and
surrogates), case mapping, and decomposition lookup, and provides a
bidirectional reordering algorithm.  To use it, you will need the latest
"UnicodeData-*.txt" file from the Unicode Web or FTP site.

The character information portion of the package consists of three parts:

  1. A program called "ucgendat" which generates five data files from the
     UnicodeData-*.txt file.  The files are:

     A. case.dat   - the case mappings.
     B. ctype.dat  - the character property tables.
     C. decomp.dat - the character decompositions.
     D. cmbcl.dat  - the non-zero combining classes.
     E. num.dat    - the codes representing numbers.

  2. The "ucdata.[ch]" files which implement the functions needed to check
     whether a character matches groups of properties, to map between upper,
     lower, and title case, to look up the decomposition of a character, to
     look up the combining class of a character, and to get the numeric
     value of a character.

  3. The UCData.java class which provides the same API (with minor changes for
     the numbers) and loads the same binary data files as the C code.

A short reference to the functions available is in the "api.txt" file.
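As a rough picture of what a property check involves, the sketch below does a
binary search over sorted code point ranges.  It is illustrative only: the
demo_range type, the sample table, and the demo_* functions are hypothetical
names that do not come from this package, and the table layout is an
assumption, not the real ctype.dat layout (see "format.txt") or the real API
(see "api.txt").

```c
#include <stddef.h>

/* Hypothetical range record: inclusive [start, end] code points. */
typedef struct {
    unsigned long start;
    unsigned long end;
} demo_range;

/* Illustrative table of decimal digit ranges, sorted by start. */
static const demo_range demo_digits[] = {
    { 0x0030, 0x0039 },   /* ASCII digits */
    { 0x0660, 0x0669 },   /* Arabic-Indic digits */
    { 0x0966, 0x096F }    /* Devanagari digits */
};

/* Return 1 if code falls inside one of the n sorted ranges, else 0. */
int
demo_in_ranges(unsigned long code, const demo_range *r, size_t n)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < r[mid].start)
            hi = mid;
        else if (code > r[mid].end)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Convenience wrapper over the illustrative digit table. */
int
demo_isdigit(unsigned long code)
{
    return demo_in_ranges(code, demo_digits,
                          sizeof(demo_digits) / sizeof(demo_digits[0]));
}
```

Storing ranges rather than one entry per character keeps such a table small,
because most properties cover long contiguous runs of code points.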

Techie Details
==============

The "ucgendat" program parses the files named on the command line, all of
which are in the Unicode Character Database (UCDB) format.  An additional
properties file, "MUTTUCData.txt", provides some extra properties for some
characters.

The program looks for the two character properties fields (2 and 4), the
combining class field (3), the decomposition field (5), the numeric value
field (8), and the case mapping fields (12, 13, and 14).  Fields are counted
from zero, so field 0 is the code point itself.  The decompositions are
recursively expanded before being written out.
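To make the field numbering concrete, here is a minimal sketch of splitting
one semicolon-delimited line into fields.  It is not the parser used by
"ucgendat"; demo_split_fields() and DEMO_MAX_FIELDS are hypothetical names
chosen for this example.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FIELDS 15

/*
 * Split a semicolon-delimited line in place.  Empty fields are preserved
 * (strtok() would skip them).  Returns the number of fields found, storing
 * at most max pointers in fields[].
 */
size_t
demo_split_fields(char *line, char *fields[], size_t max)
{
    size_t n = 0;
    char *p = line;

    while (n < max) {
        fields[n++] = p;
        p = strchr(p, ';');
        if (p == NULL)
            break;
        *p++ = '\0';    /* terminate this field, step past the ';' */
    }
    return n;
}
```

Applied to the UnicodeData entry for U+0041,
"0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;", this yields 15 fields:
field 2 is "Lu", field 3 is "0", field 4 is "L", field 5 (decomposition) is
empty, and field 13 (lower case mapping) is "0061".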

The decomposition table contains all the canonical decompositions, that is,
all decompositions that do not have tags such as "<compat>" or "<font>".
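The following sketch shows one way recursive expansion and the canonical
test could look.  It is an illustration under stated assumptions, not the
code in "ucgendat": the demo_decomp type, the two-entry sample table, and the
demo_* functions are all hypothetical.  The two sample decompositions
themselves are real UnicodeData entries.

```c
#include <stddef.h>

/* Hypothetical table entry: a code point and its canonical decomposition. */
typedef struct {
    unsigned long code;
    unsigned long decomp[2];
    size_t        ndecomp;
} demo_decomp;

/* Sample data only: U+01D6 decomposes to U+00FC, which decomposes again. */
static const demo_decomp demo_table[] = {
    { 0x00FC, { 0x0075, 0x0308 }, 2 },  /* u + combining diaeresis      */
    { 0x01D6, { 0x00FC, 0x0304 }, 2 }   /* U+00FC + combining macron    */
};

/* Linear search is enough for a two-entry sample table. */
static const demo_decomp *
demo_find(unsigned long code)
{
    size_t i;

    for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++)
        if (demo_table[i].code == code)
            return &demo_table[i];
    return NULL;
}

/*
 * Append the full expansion of code to out[], starting at index n.
 * Returns the new length and never writes more than max entries.
 */
size_t
demo_expand(unsigned long code, unsigned long *out, size_t n, size_t max)
{
    const demo_decomp *d = demo_find(code);
    size_t i;

    if (d == NULL) {            /* no decomposition: emit the code itself */
        if (n < max)
            out[n++] = code;
        return n;
    }
    for (i = 0; i < d->ndecomp; i++)
        n = demo_expand(d->decomp[i], out, n, max);
    return n;
}

/* A decomposition field is canonical when it is non-empty and untagged. */
int
demo_is_canonical(const char *field)
{
    return field[0] != '\0' && field[0] != '<';
}
```

With this table, expanding U+01D6 gives the three code points 0075 0308 0304
rather than the two listed in its own entry.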

The data is almost all stored as unsigned longs (32 bits assumed) and the
routines that load the data take care of endian swaps when necessary.  This
also means that code points at or above 0x10000 (those represented by
surrogate pairs) can be placed in the data files the "ucgendat" program
parses.
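An endian swap of this kind can be sketched as follows.  This is not the
loader from "ucdata.c"; the demo_* names are hypothetical, and the use of a
16-bit marker with the values 0xFEFF and 0xFFFE is an assumption made for the
example.  The actual header layout is described in "format.txt".

```c
#include <stddef.h>

/* Reverse the byte order of a 32-bit value held in an unsigned long. */
unsigned long
demo_swap32(unsigned long v)
{
    v &= 0xFFFFFFFFUL;          /* ignore any bits above 32 */
    return ((v & 0x000000FFUL) << 24) |
           ((v & 0x0000FF00UL) << 8)  |
           ((v & 0x00FF0000UL) >> 8)  |
           ((v & 0xFF000000UL) >> 24);
}

/*
 * Fix up n values read from a file.  marker is an assumed 16-bit byte
 * order mark from the file header: 0xFEFF means the file matches the
 * host, 0xFFFE means every value must be swapped.
 * Returns 0 on success, -1 if the marker is not recognized.
 */
int
demo_fix_byte_order(unsigned short marker, unsigned long *vals, size_t n)
{
    size_t i;

    if (marker == 0xFEFF)
        return 0;               /* same byte order as the host */
    if (marker != 0xFFFE)
        return -1;              /* not a recognizable marker */
    for (i = 0; i < n; i++)
        vals[i] = demo_swap32(vals[i]);
    return 0;
}
```

Because every value occupies 32 bits, a code point such as 0x10000 survives
the round trip: swapped it reads 0x00000100, and swapping again restores it.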

The data is written as external files and broken into five parts so it can be
selectively updated at runtime if necessary.

The data files currently generated by the "ucgendat" program total about 56K
in size.

The format of the binary data files is documented in the "format.txt" file.

==========================================================================

                       The "Pretty Good Bidi Algorithm"
                       --------------------------------

The "Pretty Good Bidi Algorithm" (PGBA) provides an alternative to the Unicode
Bidi algorithm.  The difference is that this version of the PGBA does not
handle the explicit directional codes (LRE, RLE, LRO, RLO, PDF).  It should
now produce the same results as the Unicode Bidi algorithm for implicit
reordering.  Included are functions for doing cursor motion in both logical
and visual order.
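For readers new to the term, the toy sketch below shows what implicit
reordering means in the simplest case.  It is neither the PGBA nor the
Unicode algorithm: it assumes a left-to-right base direction, treats
lowercase ASCII as LTR and uppercase ASCII as RTL (a convention often used in
bidi examples), and ignores digits, mirroring, and explicit codes.  The
demo_* names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy direction classes: lowercase = LTR, uppercase = RTL, rest neutral. */
static char
demo_class(char c)
{
    if (islower((unsigned char)c))
        return 'L';
    if (isupper((unsigned char)c))
        return 'R';
    return 'N';
}

/*
 * Reorder logical text into visual order for an LTR base direction.
 * Neutrals between two RTL characters become RTL; all other neutrals take
 * the base direction.  Each maximal RTL run is then reversed.
 * out must not alias in and needs room for strlen(in) + 1 bytes.
 */
void
demo_reorder(const char *in, char *out)
{
    size_t n = strlen(in), i, j, k;
    char dir;

    /* Pass 1: store a direction class per character in out[]. */
    for (i = 0; i < n; i++)
        out[i] = demo_class(in[i]);

    /* Pass 2: resolve each maximal run of neutrals. */
    i = 0;
    while (i < n) {
        if (out[i] != 'N') {
            i++;
            continue;
        }
        for (j = i; j < n && out[j] == 'N'; j++)
            ;
        dir = (i > 0 && out[i - 1] == 'R' && j < n && out[j] == 'R')
            ? 'R' : 'L';
        for (k = i; k < j; k++)
            out[k] = dir;
        i = j;
    }

    /* Pass 3: copy LTR characters as is, reverse each RTL run. */
    i = 0;
    while (i < n) {
        if (out[i] == 'L') {
            out[i] = in[i];
            i++;
            continue;
        }
        for (j = i; j < n && out[j] == 'R'; j++)
            ;
        for (k = i; k < j; k++)
            out[k] = in[j - 1 - (k - i)];
        i = j;
    }
    out[n] = '\0';
}
```

Given the logical string "abc DEF GHI jkl", this produces the visual string
"abc IHG FED jkl": the space between the two RTL words joins their run, while
the spaces next to LTR text stay in place.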

The PGBA implementation in this package is provided to demonstrate an
effective alternate method for implicit reordering.  To make it useful for an
application, it probably needs some changes to the memory allocation and
deallocation, as well as data structure additions for rendering.

Mark Leisher <mleisher@crl.nmsu.edu>
19 November 1999

-----------------------------------------------------------------------------

CHANGES
=======

Version 2.4
-----------
1. Improved some bidi algorithm documentation in the code.

2. Fixed a code mixup that produced a non-working version.

Version 2.3
-----------
1. Fixed a misspelling in the ucpgba.h header file.

2. Fixed a bug which caused trailing weak non-digit sequences to be left out of
   the reordered string in the bidi algorithm.

3. Fixed a problem with weak sequences containing non-spacing marks in the
   bidi algorithm.

4. Fixed a problem with text runs of the opposite direction of the string
   surrounding a weak + neutral text run appearing in the wrong order in the
   bidi algorithm.

5. Added a default overall direction parameter to the reordering function for
   cases of strings with no strong directional characters in the bidi
   algorithm.

6. The bidi API documentation was improved.

7. Added a man page for the bidi API.

Version 2.2
-----------
1. Fixed a problem with the bidi algorithm locating directional section
   boundaries.

2. Fixed a problem with the bidi algorithm starting the reordering correctly.

3. Fixed a problem with the bidi algorithm determining end boundaries for LTR
   segments.

4. Fixed a problem with the bidi algorithm reordering weak (digits and number
   separators) segments.

5. Added automatic switching of symmetrically paired characters when
   reversing RTL segments.

6. Added a missing symmetric character to the extra character properties in
   MUTTUCData.txt.

7. Added support for doing logical and visual cursor traversal.

Version 2.1
-----------
1. Updated the ucgendat program to handle the Unicode 3.0 character database
   properties.  The AL and BN bidi properties get marked as strong RTL and
   Other Neutral, respectively; the NSM, LRE, RLE, PDF, LRO, and RLO controls
   all get marked as Other Neutral.

2. Fixed some problems with testing against signed values in the UCData.java
   code and some minor cleanup.

3. Added the "Pretty Good Bidi Algorithm."

Version 2.0
-----------
1. Removed the old Java stuff for a new class that loads directly from the
   same data files as the C code does.

2. Fixed a problem with choosing the correct field when mapping case.

3. Adjusted some search routines to start their search in the correct
   position.

4. Moved the copyright year to 1999.

Version 1.9
-----------
1. Fixed a problem with an incorrect amount of storage being allocated for the
   combining class nodes.

2. Fixed an invalid initialization in the number code.

3. Changed the Java template file formatting a bit.

4. Added tables and a function for getting decompositions in the Java class.

Version 1.8
-----------
1. Fixed a problem with adding certain ranges.

2. Added two more macros for testing for identifiers.

3. Tested with the UnicodeData-2.1.5.txt file.

Version 1.7
-----------
1. Fixed a problem with looking up decompositions in "ucgendat."

Version 1.6
-----------
1. Added two new properties introduced with UnicodeData-2.1.4.txt.

2. Changed the "ucgendat.c" program a little to automatically align the
   property data on a 4-byte boundary when new properties are added.

3. Changed the "ucgendat.c" program to only generate canonical
   decompositions.

4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
   initial and final punctuation characters.

5. Minor additions and changes to the documentation.

Version 1.5
-----------
1. Changed all file open calls to include binary mode with "b" for DOS/WIN
   platforms.

2. Wrapped the unistd.h include so it won't be included when compiled under
   Win32.

3. Fixed a bad range check for hex digits in ucgendat.c.

4. Fixed a bad endian swap for combining classes.

5. Added code to make a number table and associated lookup functions.
   Functions added are ucnumber(), ucdigit(), and ucgetnumber().  The last
   function is to maintain compatibility with John Cowan's "uctype" package.

Version 1.4
-----------
1. Fixed a bug with adding a range.

2. Fixed a bug with inserting a range in order.

3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.

4. Added the missing unload for the combining class data.

5. Fixed a bad macro placement in ucisweak().

Version 1.3
-----------
1. Bug with case mapping calculations fixed.

2. Bug with empty character property entries fixed.

3. Bug with incorrect type in the combining class lookup fixed.

4. Some corrections done to api.txt.

5. Bug in certain character property lookups fixed.

6. Added a character property table that records the defined characters.

7. Replaced ucisunknown() with ucisdefined() and ucisundefined().

Version 1.2
-----------
1. Added code to ucgendat to generate a combining class table.

2. Fixed an endian problem with the byte count of decompositions.

3. Fixed some minor problems in the "format.txt" file.

4. Removed some bogus "Ss" values from MUTTUCData.txt file.

5. Added API function to get combining class.

6. Changed the open mode to "rb" so binary data files will be opened correctly
   on DOS/WIN as well as other platforms.

7. Added the "api.txt" file.

Version 1.1
-----------
1. Added ucisxdigit() which I overlooked.

2. Added UC_LT to the ucisalpha() macro which I overlooked.

3. Changed uciscntrl() to include UC_CF.

4. Added ucisocntrl() and ucfntcntrl() macros.

5. Added a ucisblank() which I overlooked.

6. Added missing properties to ucissymbol() and ucisnumber().

7. Added ucisgraph() and ucisprint().

8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
   mirroring.

9. Added another property called "Ss" which includes control characters
   traditionally seen as spaces in the isspace() macro.

10. Added a bunch of macros to be API compatible with John Cowan's package.

ACKNOWLEDGEMENTS
================

Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
missing things and giving me stuff, particularly a bunch of new macros.

Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
various bugs.

Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
out that file modes need to have "b" for DOS/WIN machines, pointing out
unistd.h is not a Win 32 header, and pointing out a problem with ucisalnum().

Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
incomplete decompositions to be generated by the "ucgendat" program.

Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
error and an initialization error.