MECCA already handles input text in ISO-8859-1 encoding. However, when the text data originated elsewhere, "special characters" such as copyright, trademark, and left and right double quotes can cause difficulties for the MECCA user. MECCA now provides a UTF-8 input text encoding conversion utility, unimap, to help customers import text data that was prepared on non-MECCA platforms, such as MS Windows.

It can be run as:

    unimap output_file

It is also called automatically by the "bcompose" command, and by MECCA in Batch-Compose and Read-in-a-file for text input. So unless you simply want to test the mapping, you don't need to run it by hand.

A typical situation involves a .DOC file made on Windows. To import such text for composition on MECCA:

If you have MS Word on Windows: Open the .DOC file in Word, then Save As Text, select Custom Encoding, and choose UTF-8 Encoding.

If you do not have Word: Open the .DOC file with Wordpad and save as "Text .txt in Unicode"; then open the saved file with Notepad and save as "Text", but now in "UTF-8 encoding". For Unicode, Wordpad only saves in 2-byte format (each character is written as a 2-byte value), whereas Notepad can save in UTF-8 (in which characters occupy a variable number of bytes, but ASCII characters are preserved as they are); hence the two-step process.

MS text editors such as Notepad and Wordpad write what is known as a BOM (Byte Order Mark) at the beginning of a Unicode text file. The "unimap" utility recognizes the standard UTF-8 BOM directly. Note that text files marked by a BOM cannot be brought over to MECCA and then concatenated together: the BOM will disrupt the text data content.

Some text editors can write a text file in UTF-8 format without writing any BOM (e.g., PSPad does this).
With such an editor, insert this as the very first line in your text mark-up file:

    \* charset=utf-8 \*

which tells "unimap" that the rest of the input file is text data in UTF-8 encoding. You must still make sure that the editor actually saves the file as UTF-8, in one of two ways:

    a. set the text encoding to UTF-8 before saving the file (e.g. for PSPad, it is under the Format menu), or
    b. select UTF-8 encoding in the editor's "Save as" dialog.

If you're preparing text data on Windows, and you know there are special characters in use, you can always mark the very first line as:

    \* charset=utf-8 \*

regardless of whether your editor writes a BOM into the saved UTF-8 encoded file. This approach also works when you're using a text editor on a Unix or MECCA system, provided it lets you enter Unicode characters and save the data in UTF-8 encoding. Such tools normally do not write any BOM, so the "first line comment" will be your only way to inform MECCA that the text characters in the file are in UTF-8 encoding.

Mapping Unicode character to MECCA special character:

The unimap utility itself does not have any built-in mapping. It uses an external mapping file for this:

    /usr/mecca/cfg/unimap

which is a plain text file (that file's characters MUST be in ASCII codes) listing a Unicode character and its replacement string, one pair per line. For example:

    2013 \081^

will map Unicode character U+2013 (the "en dash") to special character \081^ in MECCA. Other commonly used codes, such as "left double quote" and dagger, can be listed as well. The supplied map is a good starting point; customers will no doubt add to it over time as the situation calls for. Whenever "unimap" encounters a Unicode character not listed in the map file, it signals an error.
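Because "unimap" signals an error for any Unicode character missing from the map file, it can help to list the non-ASCII characters in a text file before bringing it over. The following Python sketch is not part of MECCA: the function name list_non_ascii is hypothetical, and it assumes the input really is valid UTF-8. It reports whether a UTF-8 BOM is present and returns each distinct non-ASCII character as hexadecimal digits, the form used in the first column of the map file.

```python
import codecs


def list_non_ascii(data):
    """Inspect UTF-8 encoded bytes.

    Returns (has_bom, codes), where codes is a sorted list of hex strings
    (at least 4 digits each), one per distinct non-ASCII character.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    if has_bom:
        # Skip the 3-byte UTF-8 BOM so it is not reported as a character.
        data = data[len(codecs.BOM_UTF8):]
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = data.decode("utf-8")
    code_points = sorted({ord(ch) for ch in text if ord(ch) > 0x7F})
    return has_bom, [format(cp, "04X") for cp in code_points]


if __name__ == "__main__":
    # Made-up sample: a BOM followed by text with an en dash and curly quotes.
    sample = codecs.BOM_UTF8 + "Pages 10\u201312 \u201cdraft\u201d".encode("utf-8")
    has_bom, codes = list_non_ascii(sample)
    print("BOM present:", has_bom)
    for code in codes:
        print(code)
```

Each code printed this way is a candidate for a line in the map file, if it is not listed there already.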
The syntax for the "/usr/mecca/cfg/unimap" file:

    '#' starts a comment; comment lines and blank lines are ignored;
    the first column lists the Unicode character code, which must contain at least 4 hexadecimal digits (a Unicode character is normally written as "U+xxxx"; list the "xxxx" part here, without the "U+");
    at least one space character must follow the digits, to signal the end of the code digits;
    the next non-space character on the line starts the "replacement string", which runs through the last non-space character before either a '#' or the end of the line. That is, the "replacement string" can contain spaces.

Some examples:

    # this is a comment: do not list normal ASCII characters
    2009 @>T # thin space
    2022 \231^ # bullet
    201c \117^ # left double quote

Please note that dingbat characters produced by other software often do not use standard Unicode codes. The Unicode standard provides a code range "for private use" (E000 - F8FF), and dingbat codes typically fall in that area (albeit not for interchange). For character code information, you can consult the Unicode.org character names at:

    http://www.unicode.org/Public/UNIDATA/NamesList.txt

For illustrative purposes only, and not related to any real-world codes: suppose you found (by looking at it on screen in MS Word) that a right-arrow is in the document, and that it is "U+F1A6" ("unimap" will complain and tell you so if "F1A6" isn't listed in the map file). You can add this pair to your system's map file:

    F1A6 \sf^\cf zd^\453^\rf^ # seems to work for word

which maps U+F1A6 to the Zapfdingbat character \453^, the right-pointing arrow. The SF (save font) and RF (restore font) commands bracket "\cf zd^\453^" so that this mapping works with any current font setting and does not change it.

Again, please note that such dingbat characters do not have a fixed Unicode code assigned to them. This means that you may find the same right-arrow represented with different codes by different software.
You need to anticipate this, and be aware that different codes in your map file can map to the same replacement string. When you list such a mapping, by all means include a comment, at least to record the situation the code mapping was made for, for your own future reference. While different codes can map to the same replacement, the same code cannot map to different replacements: if the same code appears more than once in your map file, the last occurrence of the code mapping is used, and "unimap" does not treat it as an error.
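To check a map file against the syntax rules above before relying on it, the rules can be sketched as a small Python parser. This is only an illustration, not how "unimap" itself is implemented: the function name parse_unimap is hypothetical, and the sketch assumes that the code digits start in the first column, that only space characters separate the code from the replacement, and that a '#' always starts a comment. The duplicate 2022 entry in the sample is made up to show the "last occurrence wins" behavior.

```python
import re

# At least 4 hex digits, one or more spaces, then the replacement string,
# which starts at the next non-space character.
_ENTRY = re.compile(r"^([0-9A-Fa-f]{4,}) +(\S.*)$")


def parse_unimap(text):
    """Parse map-file text into {code point (int): replacement string}."""
    mapping = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Drop the comment, then trailing spaces, so the replacement ends
        # at the last non-space character before '#' or end of line.
        body = raw.split("#", 1)[0].rstrip()
        if not body:
            continue  # blank line or comment-only line
        match = _ENTRY.match(body)
        if match is None:
            raise ValueError(
                "line %d: expected '<hex code> <replacement>'" % lineno
            )
        # A later entry for the same code silently replaces an earlier one.
        mapping[int(match.group(1), 16)] = match.group(2)
    return mapping


SAMPLE = r"""# this is a comment: do not list normal ASCII characters
2009 @>T      # thin space
2022 \231^    # bullet
201c \117^    # left double quote
F1A6 \sf^\cf zd^\453^\rf^ # seems to work for word
2022 \232^    # made-up duplicate: replaces the earlier 2022 entry
"""

if __name__ == "__main__":
    for code, replacement in sorted(parse_unimap(SAMPLE).items()):
        print(format(code, "04X"), replacement)
```

Storing the entries in a dictionary keyed by code point gives the "last occurrence is used" behavior without extra logic, and a line that does not fit the syntax is reported with its line number.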