GNU Coreutils - Multibyte/unicode support
Random notes and pointers regarding the on-going effort to add multibyte and unicode support in GNU Coreutils.
If you’re considering working on multibyte/unicode/utf8 support in GNU coreutils (or other packages) - reading these should bring you up to speed (and hopefully save some time, too).
NOTE: multibyte, multibyte-sequences, unicode, utf-8 are sometimes used interchangeably throughout the document, but the intent is to support all multibyte locales, not just UTF8 encodings.
Content
- Relevant Discussions on coreutils’ mailing lists - long and technical discussions with lots of relevant information.
- Related Bug Reports - various bug reports from past years. Linked here if they contain useful points and/or detailed replies.
- Useful websites - good readings
- Online tools - websites providing useful conversions and information.
- Low-level command-line conversion - using printf,od,uconv,perl.
- invalid sequences - invalid sequences.
- Unicode glyph rendering - interplay between libc and xterm.
- cygwin and 16-bit wchar_t - special handling for systems where wchar_t is 16-bit, and UTF-16 surrogates (cygwin/OpenSolaris/AIX).
- glyph width and wcwidth issues - issues relating to incorrect wcwidth(3) results and different glyph rendering.
- expand - expand topics.
- wc - wc topics.
- cut - cut topics.
- head/tail - head/tail topics.
- tr - tr topics.
- fold/fmt - fold and fmt topics.
- od - od topics.
- unorm - unorm topics.
- Unicode Explained: The Book - pointers to relevant pages.
Relevant Discussions on Coreutils’ mailing lists
-
Pádraig Brady maintains a repository containing RedHat’s incomplete implementation at https://github.com/pixelb/coreutils/tree/i18n, with more details at https://www.pixelbeat.org/docs/coreutils_i18n/.
-
2017-Dec-23: multibyte support for tr. Added support for translation and initial multibyte tests.
-
2017-Dec-10: multibyte support (round 4) - tr. Partial implementation, supporting only delete/squeeze operations.
-
2017-Dec-08: “fold --bytes” non-intuitive with multibyte characters. The current implementation is mostly correct, but the manual/help should be improved. Pádraig warns that the suggested implementation differs from the behaviour of RedHat’s i18n patch, and we should match their behaviour.
-
2017-10-26: Incorrect unicode normalization on Mac by Libiconv - reported by Marcin Sulikowski in great detail. While this is not directly related to coreutils, a reply by gordon shows that coreutils’ proposed unorm program would handle the case correctly.
-
2017-Sep-18: “cut -d” initial patch by Sebastian Kisela.
-
2017-Sep-17: Discussion about tr(1) multibyte support: -c vs -C, handling invalid sequences and backward compatibility.
-
2017-Aug-16: Discussion about adding “cut -d” support: which type of variable to use, and handling of an invalid multibyte sequence for the delimiter.
-
2017-Jun-27: expr(1) multibyte support - committed here and will be included in version 8.28. Previous discussion: expr multibyte support. Patch with expr(1) support for multibyte (also bug#26779).
-
2017-Apr-4: multibyte support (round 3) - Patch with added fold(1) support.
-
2016-Sep-19: multibyte support (round 3) - Patch with added partial cut(1) support.
-
2016-Sep-4: Multibyte support (round 2) - Patch with unorm and expand working with UTF-16. Expanded description of unorm, and issues with expand and the width of Emoji unicode characters.
-
2016-Jul-20: multibyte processing - handling invalid sequences (long) - Thread about handling invalid sequences. Also contains a list of coreutils programs and their multibyte-related needs.
-
2016-Jul-21 - discussion and introduction of mbfix (a precursor of unorm). Following messages in the thread discuss unicode normalization (NFKC, NFKD, NFC, NFD).
-
2015-Sep-25: PATCH: Multibyte support for expand and unexpand v2 - patch by Ondrej Oprala.
-
2011-Feb-2: bug#7971: Bug in libiconv - Lots of details about Cygwin and wchar_t/utf-16.
-
2011-Feb-2: bug#7948: 16-bit wchar_t on Windows and Cygwin (same on mailing list) - Lots of technical details about cygwin (and similar) systems where wchar_t is 16 bits, by Bruno Haible, Eric Blake and others. Also discussed: naming a new typedef wwchar_t, xchar_t etc.
-
2010-Sep-13: PATCH: join: support multi-byte character encodings - patch by Pádraig Brady followed by detailed discussions. Pádraig emphasized (off-list): “Note in that patch the avoidance of startup overhead for printf due to avoiding dynamically linking with libunistring.”
-
2009-Feb-21: Re: new modules for Unicode normalization by Bruno Haible - this is gnulib’s uninorm module.
-
2009-Mar-11: bug in join: case comparisons don’t work in multibyte locales - detailed and technical discussion started by Bruno Haible, which led to the creation of GNU libunistring.
-
2008-May-8: Re: horrible utf-8 performace in wc - coreutils’ wc has supported multibyte characters for a long while. This discussion resulted in processing speed-ups.
-
2006-Jul-31: uniq i18n implementation by Pádraig Brady.
Related Bug Reports
tr
-
bug#20114: tr does not support multibyte characters in the first argument (same on mailing list)
-
bug#26362: tr -cd – Problem with UTF-8? (same on mailing list)
-
tr is handling bytes not characters from 5 Feb 2009.
-
Multibyte Awareness - Pádraig mentions sed as a temporary work-around for the missing tr case-conversion code.
printf
-
Incomplete support of unicode characters in printf \u. The thread continues here. The conclusion seems to be that this should be fixed/improved.
-
bug#17196: UTF-8 printf string formating problem (same on mailing list) - long discussion with many details and examples about printf '%s' and multibyte characters.
-
/usr/bin/printf: invalid universal character name - Jim Meyering explains why printf refuses to print certain escape sequences (a requirement of ‘C99, ISO/IEC 10646’).
wc
-
WC does not count invalid multibyte sequences - this thread started the Coreutils’ Gotcha page.
-
bug#20751: wc -m doesn’t count UTF-8 characters properly (same on mailing list)
Sort
-
bug#17189: Sort bug #2 (same on mailing list) - sort ignores punctuation in non-C locales. Lots of detailed emails from Eric Blake.
-
bug#24601: UTF-8 locale makes lexicographic sort weird (same on mailing list). This message in the thread provides a simple C program to test strcoll behaviour.
-
bug#23677: sort --debug not ignoring punctuation when sort does. Karl Berry raises the issue that sorting in anything except the LC_ALL=C locale is highly problematic, due to the ‘secondary’ role of punctuation.
-
bug#21844: sort behavior unstable based on neighboring elements ? (same on mailing list)
-
bug#8871: Bug with “sort -i” ? (same on mailing list) - Details from Eric, also mentioning RedHat’s i18n patches vs upstream’s lack of multibyte support.
-
bug#9418: Fwd: bug#9418: case sensitivity buggy in sort (same on mailing list) - locale related confusion. Quote:
> > Yes, that is exactly the case - why on earth would someone want that?
> > This results in just some sorting madness!
> > Complaints have been made about glibc's absurd and insane preference for
> case insensitive collation (at least in en and the euro locales) for
> nearly 20 years now. All w/o resolution.
-
Re: sort seems deficient - locale/punctuation related confusion. This message from Jim Meyering contains simple example of the problem.
-
bug#26422: historical feature or grand daddy bug? - Not a bug, but with explanation from Paul Eggert about historical ‘\n’ behaviour in sort.
expand/unexpand
- bug#28038 (and mailing list link): expand(1) lacks MBC support - the current i18n patch does work correctly, no further development is needed.
Other programs
-
bug#24924: pr has no concept of wide characters (same on mailing list). The last message in the thread, from Stephane Chazelas, mentions a pr bug in the BIG5-HKSCS locale.
-
pr considers bytes not presentation width - report by Dan Jacobson with examples in both UTF-8 and Big5 encodings.
-
bug#25630: df Unicode is not supported on mounted (same on mailing list)
-
bug#25550: Apparent unicode bug in uniq 8.26 (same on mailing list)
-
od and unicode - detailed message but discussion did not ensue.
-
Unicode characters in tail and head - discussion about head --chars. Eric Blake’s reply still resonates: “Eventually, when someone contributes a maintainable patch that does not bloat the code size and that is still efficient for unibyte locales, then yes, we would like to support multibyte character processing in the various text-based utilities.”
-
bug#22001: Is it possible to tab separate concatenated files? - not directly unicode related, but the thread turned into a discussion about what constitutes a text file (according to POSIX and whatnot). (same on mailing list)
-
terrible Unicode shattering fold(1) command - with examples in Thai language. Quote from James Youngman:
FWIW, that is hard in languages like Thai, where it’s hard to distinguish which bits are the words and where the reasonable breaks are.
See for example http://cpan.uwinnipeg.ca/htdocs/String-Thai-Segmentation/String/Thai/Segmentation.pm.html
-
TODO:
bug#7960: fmt: fix formatting multibyte text (bug #7372)
Useful websites
-
List of unicode letters in different languages: http://www.ltg.ed.ac.uk/~richard/unicode-sample.html
-
UTF-8 Samples: http://www.columbia.edu/~fdc/utf8/ - Paragraphs and sentences in multiple scripts.
-
The case for UTF8 everywhere: http://utf8everywhere.org/ - Long and interesting read.
-
Examples Of Unicode Usage For Business Applications: http://www.i18nguy.com/unicode/unicode-example-intro.html - files in various formats containing utf-8 data.
-
UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn: https://www.cl.cam.ac.uk/~mgk25/unicode.html - a must read.
-
Markus Kuhn’s UTF-8 decoder capability and stress test https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt - The gold standard in UTF-8 testing.
The unorm/mbbuffer tests in this patch are closely modelled after these stress tests (see test-mbbuffer.c - comments such as /* 4.1.2 */ refer to test 4.1.2 in the UTF-8 stress test).
-
Dark corners of unicode: https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/ - Abandon hope all ye who enter here.
-
GNU UniFont aims to create a universal bitmap font which contains ALL unicode glyphs. Perhaps it’s not as pretty as vector fonts, but it’s very useful and important. They provide a unicode tutorial at http://unifoundry.com/unicode-tutorial.html.
-
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text http://kunststube.net/encoding/ - basic but very good.
-
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ - Similar to the above, by Joel Spolsky.
-
Steven R. Loomis’s home page contains many invaluable resources regarding ICU: IBM’s International Components for Unicode - a comprehensive C++ library for unicode processing.
Online tools
-
Unicode Toys (online conversion): http://qaz.wtf/u/
-
Examine unicode characters:
Low-level command-line conversion
Print unicode with perl:
perl -e 'print "\x{3A3}\n"'
perl -e 'print "\N{U+03A3}\n"'
perl -e 'print "\N{GREEK CAPITAL LETTER SIGMA}\n"'
Using coreutils’ printf (note that other printfs such as FreeBSD do not support \u):
printf '\u03A3\n'
printf '\U000003A3\n'
Print octets with od (i.e. the binary encoding in the current locale; ‘\316\250’ is the octal representation of the UTF-8 encoding for Ψ):
$ printf 'Ψ\n' | LC_ALL=C od -c -An
316 250 \n
For maximum portability, always use LC_ALL=C (because FreeBSD’s od does support multibyte input and will not display the octet’s octal value). Print in octal, as every POSIX-compliant printf can handle octal values:
$ printf '\316\250\n'
Ψ
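The octal escapes need not be looked up by hand, since the UTF-8 bit layout can be computed with shell arithmetic. The following is only a minimal sketch and not part of coreutils: the function name `utf8_octal` is made up for this page, and it assumes a POSIX-like shell whose arithmetic accepts hexadecimal constants.

```shell
# utf8_octal CODEPOINT
# Hypothetical helper (not part of coreutils): print the UTF-8 encoding of
# one Unicode code point as printf-style octal escapes, e.g. \316\250.
utf8_octal() {
  cp=$(($1))
  # Surrogates and values above U+10FFFF have no valid UTF-8 encoding.
  if [ "$cp" -lt 0 ] || [ "$cp" -gt 1114111 ] ||
     { [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; }; then
    echo "utf8_octal: invalid code point: $1" >&2
    return 1
  fi
  if [ "$cp" -lt 128 ]; then            # 1 octet:  0xxxxxxx
    printf '\\%03o' "$cp"
  elif [ "$cp" -lt 2048 ]; then         # 2 octets: 110xxxxx 10xxxxxx
    printf '\\%03o\\%03o' \
      $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
  elif [ "$cp" -lt 65536 ]; then        # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
    printf '\\%03o\\%03o\\%03o' \
      $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) \
      $((0x80 | (cp & 0x3F)))
  else                                  # 4 octets: 11110xxx + 3 continuations
    printf '\\%03o\\%03o\\%03o\\%03o' \
      $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
      $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
  fi
  printf '\n'
}
```

For example, `utf8_octal 0x3A8` prints `\316\250`, and `printf "$(utf8_octal 0x3A8)\n"` should then display Ψ on a UTF-8 terminal.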
Displaying hex unicode codepoints:
$ printf "Σημ" | iconv -t UTF-16LE | od -tx2 -An
03a3 03b7 03bc
The above assumes a little-endian (e.g. Intel) CPU. Change to UTF-16BE for big-endian machines. It also assumes all code points in the input fit into 16 bits (which is not a safe assumption). If some characters require more than 16 bits, a safer option is to use 32 bits for every code point:
$ printf "Σημ" | iconv -t UTF-32LE | od -tx4 -An
000003a3 000003b7 000003bc
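To see why 16 bits are not enough, it helps to compute what UTF-16 does with a code point above U+FFFF: it is split into a high and a low surrogate. The sketch below does that arithmetic in the shell; `utf16_units` is a hypothetical name used only here, and the function does not validate its input.

```shell
# utf16_units CODEPOINT
# Hypothetical helper: print the UTF-16 code unit(s) of one code point in hex.
utf16_units() {
  cp=$(($1))
  if [ "$cp" -lt 65536 ]; then
    # Fits in a single 16-bit unit (BMP).
    printf '%04x\n' "$cp"
  else
    # Subtract 0x10000, then split the remaining 20 bits into two halves.
    v=$((cp - 0x10000))
    printf '%04x %04x\n' $((0xD800 + (v >> 10))) $((0xDC00 + (v & 0x3FF)))
  fi
}
```

`utf16_units 0x3A3` prints `03a3`, while `utf16_units 0x1F466` prints `d83d dc66`, i.e. two 16-bit units for a single character.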
Checking for invalid multibyte-sequences with GNU sed:
The following example works in a UTF-8 locale, and relies on the fact that GNU sed’s regular expressions will not match invalid sequences (i.e. anything that was not replaced by the regex is an invalid octet). If the output isn’t empty, the input had invalid multibyte sequences:
$ printf 'a\xCEc\n' | sed 's/.*//g' | od -tx1c -An
ce 0a
c \n
# Same detection for an input file:
$ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt
$ sed -n 's/.//g ; H ; ${x;s/\n//g;l}' invalid.txt
\316\316$
# With few more commands, the offending line can be printed as well:
$ sed -n 's/.//g;=;l' invalid.txt | paste - - | awk '$2!="$"'
3 \316\316$
# GNU sed in C locale can edit octets directly:
$ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt
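A check that does not depend on sed's regex engine or on the current locale is to pass the data through iconv, which (at least in glibc's implementation, as far as we know) exits with a non-zero status on an illegal or incomplete sequence. This is a sketch only; the helper name `is_valid_utf8` is hypothetical, and other iconv implementations may behave differently.

```shell
# is_valid_utf8: read standard input, succeed only if iconv accepts it as UTF-8.
# Hypothetical helper; assumes an iconv(1) that fails on invalid input.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Example: report on the two inputs used in this section.
if printf 'a\316\250\n' | is_valid_utf8; then echo "first input: valid"; fi
if ! printf 'a\316c\n' | is_valid_utf8; then echo "second input: invalid"; fi
```

Unlike the sed approach, this only answers yes or no; it does not show which octets are at fault.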
Converting UTF to \u sequences (using ‘uconv’ from ICU package):
$ printf "Σημεῖόν" | uconv -x 'hex-any ; any-hex'
\u03A3\u03B7\u03BC\u03B5\u1FD6\u03CC\u03BD
Converting UTF-8 to named unicode characters:
$ printf "Σημεῖόν" | uconv -x 'hex-any ; any-name'
\N{GREEK CAPITAL LETTER SIGMA}\N{GREEK SMALL LETTER ETA}
\N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER EPSILON}
\N{GREEK SMALL LETTER IOTA WITH PERISPOMENI}
\N{GREEK SMALL LETTER OMICRON WITH TONOS}\N{GREEK SMALL LETTER NU}
List of possible transliterations in ‘uconv’/ICU:
uconv -L | tr ' ' '\n' | grep -i any | sort -f | less
# such as:
uconv -x 'hex-any ; any-hex/perl'
uconv -x 'hex-any ; any-hex/java'
uconv -x 'hex-any ; any-hex/c'
ICU’s uconv supports several methods to handle invalid data (called ‘callbacks’ in their man page). This is part of the inspiration for unorm (the proposed coreutils program). Examples:
$ printf 'ab\342cdef' | uconv
Conversion to Unicode from codepage failed at input byte position 2. Bytes: e2 Error: Illegal character found
$ printf 'ab\342cdef' | uconv --callback substitute
ab�cdef
$ printf 'ab\342cdef' | uconv --callback escape-c
ab\xE2cdef
Invalid Sequences
Modified-UTF-8 allows encoding NULL as 0xC0 0x80. This allows the byte with the value of zero, which is now not used for any character, to be used as a string terminator. https://en.wikipedia.org/wiki/Null_character#Encoding.
Invalid Sequences: https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points
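Many of the invalid cases can already be spotted from the first octet of a sequence. The sketch below classifies a single octet according to the strict UTF-8 rules; the name `utf8_lead_class` and its output words are made up for this page. Note that a strict decoder rejects 0xC0 outright, which is exactly why the Modified-UTF-8 NULL (0xC0 0x80) needs special handling.

```shell
# utf8_lead_class OCTET
# Hypothetical helper: print the sequence length announced by a lead octet,
# or 'continuation' / 'invalid' if the octet cannot start a sequence.
utf8_lead_class() {
  b=$(($1))
  if   [ "$b" -le 127 ]; then echo 1              # 0x00-0x7F: ASCII
  elif [ "$b" -le 191 ]; then echo continuation   # 0x80-0xBF: 10xxxxxx
  elif [ "$b" -le 193 ]; then echo invalid        # 0xC0-0xC1: overlong only
  elif [ "$b" -le 223 ]; then echo 2              # 0xC2-0xDF
  elif [ "$b" -le 239 ]; then echo 3              # 0xE0-0xEF
  elif [ "$b" -le 244 ]; then echo 4              # 0xF0-0xF4
  else                        echo invalid        # 0xF5-0xFF: above U+10FFFF
  fi
}
```

For instance, `utf8_lead_class 0xCE` prints `2` and `utf8_lead_class 0xC0` prints `invalid`. Checking the lead octet alone is not sufficient: overlong 3- and 4-octet forms and encoded surrogates also depend on the second octet.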
Unicode glyph rendering
Show different unicode font implementation/support in terminals:
-
Easy case: ‘e’ + combining mark (where a pre-combined ‘é’ exists):
$ printf 'e\u0301\n'
é
Works on gnome-terminal, Mac OS X Terminal and xterm. Doesn’t work on ‘st’ (simple-terminal from st.suckless.org), which prints ‘e’ followed by an empty ‘acute’.
-
Advanced support: any letter (regardless of pre-combined letter support):
$ printf 'x\u0301e\n'
On gnome-terminal and Mac OS X Terminal, prints ‘x’ with an acute accent (nonsensical, but graphically correct) followed by ‘e’.
On xterm and simple-terminal: prints ‘xe’.
Cygwin (and other systems with 16-bit wchar_t)
Cygwin UTF-16 problems: https://cygwin.com/ml/cygwin/2011-02/msg00037.html - long and interesting discussion. First mention of the possibility of wwchar_t and an abstraction layer.
Cygwin Internationalization: https://cygwin.com/cygwin-ug-net/setup-locale.html - keeps recommending UTF8 everywhere.
How cygwin deals internally with windows filenames (which are UTF16): https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
But iswalpha takes wint_t, which IS int32_t - perhaps do the conversion manually? See https://cygwin.com/ml/cygwin/2011-02/msg00039.html and https://cygwin.com/ml/cygwin/2011-02/msg00044.html.
-
printf can generate surrogate code points, but on wchar_t/64-bit systems, mbrtowc(3) can’t decode them:
$ printf '\ud800\n' | iconv -f utf-8
iconv: illegal input sequence at position 0
-
A file containing:
printf '\ud800\udc00\n' > 1.txt
will be interpreted as 6 invalid octets on 64-bit systems, and as either ‘U+10000’ or ‘U+D800 U+DC00’ on cygwin. Which is correct?
-
On Cygwin, this input can be detected (and rejected to maintain consistency) by checking mbstate_t.__count==4. What about other systems?
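For reference, the code point that a surrogate pair stands for follows from the UTF-16 definition and can be computed directly. This is a sketch with a hypothetical function name, `surrogates_to_cp`; it only shows the arithmetic and says nothing about what any libc actually does with such input.

```shell
# surrogates_to_cp HIGH LOW
# Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
surrogates_to_cp() {
  hi=$(($1))
  lo=$(($2))
  # High surrogates are U+D800..U+DBFF, low surrogates are U+DC00..U+DFFF.
  if [ "$hi" -lt 55296 ] || [ "$hi" -gt 56319 ] ||
     [ "$lo" -lt 56320 ] || [ "$lo" -gt 57343 ]; then
    echo "surrogates_to_cp: not a surrogate pair: $1 $2" >&2
    return 1
  fi
  printf 'U+%04X\n' $((0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
}
```

`surrogates_to_cp 0xD800 0xDC00` prints `U+10000`, the first code point outside the Basic Multilingual Plane.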
wchar_t is NOT always UCS4: https://www.gnu.org/software/libunistring/manual/html_node/The-wchar_005ft-mess.html
OpenSolaris
In OpenSolaris, only under unicode locales, “wchar_t” is UTF-32 (good enough?). From https://docs.oracle.com/cd/E36784_01/html/E39536/gmwkm.html:
The ISO/IEC 9899 standard does not specify the form or the
encoding of the contents for the wchar_t data type. Because it is
an implementation-specific data type, it is not portable.
Although many implementations use some Unicode encoding forms for
the contents of the wchar_t data type, do not assume that the
contents of wchar_t are Unicode. Some platforms use UCS-4 or UCS-2
for their wide-character encoding.
In Oracle Solaris, the internal form of wchar_t is specific to a
locale.
In the Oracle Solaris Unicode locales, wchar_t has the UTF-32
Unicode encoding form, and other locales have different
representations.
AIX
On AIX 5.1 and later, the wchar_t datatype is 32–bit in the 64–bit
environment and 16–bit in the 32–bit environment.
The locale methods have been standardized such that in most
locales, the value stored in the wchar_t for a particular
character will always be its Unicode data value. [...] All
locales use Unicode for their wide character code values (process
code), except the IBM-eucTW codeset.
glyph width and wcwidth issues
expand, pr, fold, fmt will have glyph-width related issues.
Some glyphs’ widths can not be determined by libc -
but only by the graphical program that will render them on screen.
In other cases, the glyphs are optionally zero-width combining characters, or stand-alone visible characters. Example: skin-tone modifiers (not zero-width on their own, but optionally combined when following a face/hand emoji). See http://unicode.org/reports/tr51/#Diversity.
This is ‘EMOJI MODIFIER FITZPATRICK TYPE-1-2’ (U+1F3FB):
# With space (non-emoji) before the modifier, it is rendered as a normal character:
$ printf '\U0001F466 \U0001F3FB\n'
👦 🏻
# With an emoji preceding it, it is combined:
$ printf '\U0001F466\U0001F3FB\n'
👦🏻
# NOTE for readers: whether the above is rendered as a single
# face depends on your web-browser or text editor.
The mbbuffer-debug program (from this patch) is used below to examine multibyte input. The W column shows the result of wcwidth() for the character.
When checking with wcwidth, it gives a width of 1 on Mac OS X (‘1’ is not always accurate, depending on later combining rendering):
$ printf '\U0001F3FB' | ./src/mbbuffer-debug -r
ofs line colB colC V wc(dec) wc(hex) Ch W n octets
0 1 1 1 y 127995 0x1f3fb = 1 4 0xf0 0x9f 0x8f 0xbb
Yet on glibc, wcwidth returns -1 for all SMP codepoints (as wcwidth returns -1 for all non-printables):
$ printf '\U0001F3FB' | ./src/mbbuffer-debug -r
ofs line colB colC V wc(dec) wc(hex) Ch W n octets
0 1 1 1 y 127995 0x1f3fb = -1 4 0xf0 0x9f 0x8f 0xbb
It is not at all clear if there’s a “correct” width, as the visualization results differ based on the rendering environment (the following looks different between Mac OSX terminal and gnome-terminal 3.6.2 and safari and chrome):
$ printf 'a\U0001F466\U0001F3FBaa\tb\naaaa\tb\n'
a👦🏻aa b
aaaa b
Should we use ‘wcswidth’, or alternatively, process the “EmojiModifiers” property? (see https://unicode.org/reports/tr51/#Data_Files but then, the list of possible specific properties is endless).
Other “Modifier Symbols” (Category Sk): https://codepoints.net/search?gc=Sk
See also: Multi-Person Grouping: https://unicode.org/reports/tr51/#Multi_Person_Groupings Can be rendered as multiple icons or one combined icon (taking one or more characters).
More examples
A joiner such as U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH, which acts as an overlaid glyph to indicate a prohibition or “NO”:
$ printf '\U0001F52b\u20E0\n' # no guns
🔫⃠
$ printf '\U0001F399\u20E0\n' # no microphones
🎙⃠
Some characters are ‘combining’, and wcwidth does indicate they have zero width (which is good for expand/pr/fold/fmt), but when rendered they actually consume visual space on the screen, messing up alignment. This is COMBINING ENCLOSING KEYCAP (U+20E3):
$ printf 'a\u20E3aa\tb\naaaa\tb\n'
a⃣aa b
aaaa b
# this is multibyte-aware expand, still incorrect output:
$ printf 'a\u20E3aa\tb\naaaa\tb\n' | ./src/expand
a⃣aa b
aaaa b
# Despite wcwidth giving W=0, the character is rendered
# wider than a single column on the screen:
$ printf 'X\u20E3\n' | ./src/mbbuffer-debug -r
ofs line colB colC V wc(dec) wc(hex) Ch W n octets
0 1 1 1 y 88 0x00058 X 1 1 X
1 1 2 2 y 8419 0x020e3 ⃣ 0 3 0xe2 0x83 0xa3
4 1 5 3 y 10 0x0000a = -1 1 0x0a
# NOTE to readers:
# On Mac OS X terminal, Safari and Firefox,
# the above is rendered as 'a' surrounded by a square. YMMV.
TODO:
- How many characters are NON-PRINTABLE (i.e. wcwidth()==-1), but in expand we do not treat them properly? when adding columns in ‘expand’, ensure wcwidth>0 ?? Do all “word_break” or “line_break” characters have wcwidth()==-1 ?
- check wcwidth() on CJK ideograms (does it return 2?)
Not bugs, but worth knowing
Combined letters
In Latin Extended-B: some glyphs are 2 letters squeezed into a width of 1. wcwidth on glibc and Mac OS seems to handle it correctly (returning width==1), and on xterm and Mac Terminal they are indeed rendered in a width of 1 (there seems to be a problem on gnome-terminal/ubuntu-14.04, but that’s a font issue).
Example: DŽ:
$ printf '\u01c4a\nbc\n'
DŽa
bc
Full Width glyphs
Some glyphs are designated as full-width, meaning they consume a width of 2 characters, and can be used for easy alignment with CJK characters. See https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms#In_Unicode
This is Full-width capital B U+FF22:
$ printf 'a\uFF22c\nabc\n'
aBc
abc
Both glibc and MacOS-X wcwidth gives width==2 for these (which is good for expand/pr/fmt/fold):
$ printf '\uFF22' | ./src/mbbuffer-debug -r
ofs line colB colC V wc(dec) wc(hex) Ch W n octets
0 1 1 1 y 65314 0x0ff22 B 2 3 0xef 0xbc 0xa2
From Cygwin’s internationalization page:
There's a class of characters in the Unicode character set, called
the "CJK Ambiguous Width" characters. For these characters, the
width returned by the wcwidth/wcswidth functions is
usually 1. This can be a problem with East-Asian languages, which
historically use character sets where these characters have a
width of 2. Therefore, wcwidth/wcswidth return 2 as the width of
these characters when an East-Asian charset such as GBK or SJIS is
selected, or when UTF-8 is selected and the language is specified
as "zh" (Chinese), "ja" (Japanese), or "ko" (Korean). This is not
correct in all circumstances, hence the locale modifier
"@cjknarrow" can be used to force wcwidth/wcswidth to return 1 for
the ambiguous width characters.
expand
See width/wcwidth issues above.
wc
- Add a special option in ‘wc’ to count non-zero width characters? (But then, what about optional modifiers, e.g. skin-color and family-joiner?)
- What about counting SMP characters (for which wcwidth() gives -1)?
cut
An option to never cut in the middle of a combining mark? (Or technically, only cut at a clear grapheme boundary, e.g. never before ZWJ, BIDI mark, etc.?)
head/tail
How to treat zero-width joiners, and how to treat combined characters? Based on width?
tr
Multibyte sequence case conversion (the following works on Mac/FreeBSD, not yet in coreutils):
$ printf '\u0103\n'
ă
$ printf '\u0103\n' | /usr/bin/tr '[:lower:]' '[:upper:]'
Ă
Pádraig Brady wrote (privately):
> I’ve also noticed interesting chars like the titlecase letter ‘Lj’ (U+01C8)
> which is neither upper or lower but does have an upper case (U+01C7),
> or the fact the there are only 2 code points Ⱥ (U+023A) and Ⱦ (U+023E)
> that increase in length (2 to 3 bytes) when lower-cased.
Deleting multibyte-sequences:
$ printf '\u0103b\u0106d\n'
ăbĆd
$ printf '\u0103b\u0106d\n' | LC_ALL=C od -tc -An
304 203 b 304 206 d \n
# On Mac/FreeBSD, it works as expected:
$ printf '\u0103b\u0106d\n' | tr -d 'ă' | LC_ALL=C od -tc -An
b 304 206 d \n
# Coreutils 'tr' treats the argument as two independent octets and deletes
# both instances of \304, resulting in invalid output:
$ printf '\u0103b\u0106d\n' | tr -d 'ă' | LC_ALL=C od -tc -An
b 206 d \n
POSIX tr(1) says (in “Extended Description”):
“\octal - […] Multi-byte characters require multiple, concatenated escape sequences of this type, including the leading <backslash> for each byte."
Based on my understanding of the above, the following ‘should work’:
# It doesn't work at all on Mac/FreeBSD:
$ printf '\u0103b\u0106d\n' | tr -d '\304\203' | LC_ALL=C od -tc -An
304 203 b 304 206 d \n
# It doesn't work on GNU tr (since, as above, it treats the octets independently):
$ printf '\u0103b\u0106d\n' | tr -d '\304\203' | LC_ALL=C od -tc -An
b 206 d \n
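The byte-wise behaviour does not depend on a UTF-8 locale being installed: forcing the C locale and spelling everything in octal shows the same result. The sketch below wraps the pipeline in a function (the name `bytewise_delete` is made up here) and prints the surviving octets in hex with the whitespace removed.

```shell
# bytewise_delete: feed the octets of 'ăbĆd\n' through a byte-oriented
# tr -d and print what is left as one hex string.
# Hypothetical helper for illustration only.
bytewise_delete() {
  printf '\304\203b\304\206d\n' |
    LC_ALL=C tr -d '\304\203' |    # deletes every \304 and every \203 octet
    LC_ALL=C od -An -tx1 |
    tr -d ' \n'                    # squash od's spacing into one token
  echo
}

bytewise_delete
```

With a tr that works on single octets in the C locale this should print `6286640a`: ‘b’, the orphaned continuation octet 0x86 (octal 206), ‘d’ and the newline. The second character has lost its lead octet, so the output is no longer valid UTF-8.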
-
Should the uppercase mapping of ligatures turn into two letters?
## U+FB01 LATIN SMALL LIGATURE FI (fi)
-
German Capital Sharp S is a similar issue.
$ printf '\uFB01' | tr '[:lower:]' '[:upper:]'
-
POSIX says character ranges are unspecified in non-POSIX locales:
> "c-c"
> "In locales other than the POSIX locale, this construct has unspecified behavior."
How to handle these when we do implement multibyte support?
-
What exactly does -C (upper-case C) do according to POSIX?
> "The ISO POSIX-2:1993 standard had a -c option that behaved
> similarly to the -C option, but did not supply functionality
> equivalent to the -c option specified in POSIX.1-2008." The
> earlier version also said that octal sequences referred to
> collating elements and could be placed adjacent to each
> other to specify multi-byte characters. However, it was
> noted that this caused ambiguities because tr would not be
> able to tell whether adjacent octal sequences were intending
> to specify multi-byte characters or multiple single byte
> characters. POSIX.1-2008 specifies that octal sequences
> always refer to single byte binary values when used to
> specify an endpoint of a range of collating elements. "
-
Equivalence classes: The following is supposed to work (i.e. also replace the umlaut-a with X), but does not work on FreeBSD-10.3/OpenBSD-6, which are supposed to support it (TODO: check on musl-libc):
$ printf 'abc \303\244\303\202 def\n' | LC_ALL=en_US.UTF-8 tr '[=a=]' X
Xbc ä def
There is no portable way for an application to determine ‘equivalence class’ without knowledge of libc internals. FreeBSD’s tr is supposed to be able to do it by assuming it knows the internals of its libc: https://github.com/freebsd/freebsd/blob/master/usr.bin/tr/str.c#L212.
A lot depends on the system’s libc. For example, the following works on glibc but not on Mac (in both cases using GNU sed):
# works on glibc with gnu sed:
$ printf 'abc \303\244\303\202 def\n' | LC_ALL=en_US.UTF-8 sed 's/[[=a=]]/X/g'
Xbc XX def
Implementation issues: The critical structures in the tr code. Other places also assume only 256 different values (e.g. enum { N_CHARS = UCHAR_MAX + 1 };).
fold/fmt
character ‘WJ’ (word-joiner) - special treatment in ‘fold / fmt’?
Does any ‘space’ character count as a space, or only those matching ‘iswspace’, or only ASCII 0x20, 0x09, 0x0d?
join
FreeBSD’s join bails out on invalid sequences: see function ‘mbssep()’ in https://github.com/freebsd/freebsd/blob/master/usr.bin/join/join.c#L362.
Currently join DOES support some locale-comparison, as fields are compared with gnulib’s memcoll (which uses strcoll(3) internally).
Two things that are not supported (and are partially implemented in redhat’s i18n patch):
-
multibyte field delimiters - but the patch turns the global delimiter variable into a string, making processing slower in all cases.
-
Case-insensitive comparison - the patch allocates new buffers for every key (in every line) and iterates with mbrtowc+towupper.
Also, risk of collating into the same order (cf. Karl Berry’s surprising results from sort in bug#23677).
od
In GNU:
$ printf "\u03a8\n" | od -tx1c -An
ce a8 0a
316 250 \n
In Mac/FreeBSD:
$ printf "\u03A8\n" | od -t x1c -An
ce a8 0a
Ψ ** \n
In Mac/FreeBSD, with invalid multibyte sequences:
$ printf "\xce\xce\n" | od -t x1c
ce ce 0a
316 316 \n
Implementation problem: POSIX says the FIRST octet of a valid multibyte sequence should display the character, and the following octets should show ‘**’. But the first octet might appear as the LAST item of the line, and the ‘**’ should then be displayed on the following line.
In FreeBSD, it ‘just works’:
$ printf "aaaaaaaaaaaaaaa\316\250bb\n" | od -An -tc
a a a a a a a a a a a a a a a Ψ
** b b \n
They (FreeBSD) have implemented a ‘peek’ option following a multibyte octet: https://github.com/freebsd/freebsd/blob/master/usr.bin/hexdump/conv.c#L98.
On GNU coreutils’ od, it seems (IIUC) that the implementation reads exactly the (known) amounts of octets needed to display each line, and adding ‘peeking’ feature will be tricky: https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/od.c#n1360
unorm
TODO: organize this mess…
Check normalization according to NormalizationTest.txt
Check compatibility of:
U+00B5 MICRO SIGN
U+03BC GREEK SMALL LETTER MU
U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ANGSTROM SIGN
U+03B2 GREEK SMALL LETTER BETA
U+00DF LATIN SMALL LETTER SHARP S
U+03A9 GREEK CAPITAL LETTER OMEGA
U+2126 OHM SIGN
U+03B5 GREEK SMALL LETTER EPSILON
U+2208 ELEMENT OF0xEA
U+005C REVERSE SOLIDUS
U+FF3C FULLWIDTH REVERSE SOLIDUS
Unexpected?? U+00E6 LATIN SMALL LETTER AE (æ) is NOT decomposed:
“Originally a ligature representing a Latin diphthong, it has been promoted to the full status of a letter in the alphabets of some languages, including Danish, Norwegian, Icelandic and Faroese.” - https://codepoints.net/U+00E6
However this is decomposable:
U+FB01 LATIN SMALL LIGATURE FI (fi)
<https://codepoints.net/U+FB01>
Check decomposition of: wcwidth() on Latin Extended-B characters:
\u01c4 Latin Capital Letter DZ with caron
\u01c5 Latin Capital Letter D with Small Letter Z with caron
\u01c6 Latin Small Letter DZ with caron
\u01c7 Latin Capital Letter LJ
\u01c8 Latin Capital Letter L with Small Letter J
\u01c9 Latin Small Letter LJ
\u01ca Latin Capital Letter NJ
\u01cb Latin Capital Letter N with Small Letter J
\u01cc Latin Small Letter NJ
\u01f1 Latin Capital Letter DZ
\u01f2 Latin Capital Letter D with Small Letter Z
\u01f3 Latin Small Letter DZ
\u01f6 Latin Capital Letter Hwair (seems to work, with wcwidth==1)
\u01fc Latin Capital Letter AE with acute Ǽ (seems to work with wcwidth==1)
\u0238 Latin Small Letter DB Digraph ȸ (seems ok) - despite being called digraphs
\u0239 Latin Small Letter QP Digraph ȹ (seems ok)
ESPECIALLY the “DZ with Caron” - is it decomposed to D,Z,Caron or D,z-with-caron ?
and these two: does decomposition result in O,dot,macron ?
\u0230 Latin Capital Letter O with dot above and macron Ȱ -
\u0231 Latin Small Letter O with dot above and macron ȱ
Check decomposition/compatibility of the IPA block, e.g.
\u02A3 Latin Small Letter DZ Digraph ʣ - does this translate to DZ, or
to the Latin-Extended-B 'DZ' letter?
up to and including:
\u02AB Latin Small Letter LZ Digraph
Check compatibility and decomposition of ‘fullwidth’ characters, see https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms#In_Unicode e.g. does ‘\uFF21’ (full-width A) decompose to ASCII ‘A’?
Check compatibility/decomposition of entire block: https://en.wikipedia.org/wiki/Alphabetic_Presentation_Forms
mbbuffer
TODO: organize this mess…
Modified NULL (\xC0\x80)
(Unicode book page 282): Unicode conformance
UNAssigned Code Points (C4)
Test unassigned codes (don't generate, don't change) in all programs.
Test non-characters (U+FFFE, U+FFFF)
Test surrogate codes
Surrogate codepoints treated as invalid on “normal” unixes:
$ printf '\uD800\n' | ./src/mbbuffer-test -r
ofs line colB colC V wc(dec) wc(hex) Ch w n octets
0 1 1 1 n * * * * 1 0xed
1 1 2 2 n * * * * 1 0xa0
2 1 3 3 n * * * * 1 0x80
3 1 4 4 y 10 0x0000a = -1 1 0x0a
but on Cygwin:
Administrator@WIN-9FFSHRJAFVN ~/coreutils-8.25.71-1437c
$ printf '\uD800\n' | ./src/mbbuffer-test -r
ofs line colB colC V wc(dec) wc(hex) Ch W n octets
0 1 1 1 y 55296 0x0d800 = -1 3 0xed 0xa0 0x80
3 1 4 2 y 10 0x0000a = -1 1 0x0a
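For comparison, the same three octets can be fed to a strict UTF-8 decoder and to a deliberately lenient one. This is only a sketch in Python (not how `mbbuffer-test` or `mbrtowc` work internally). The strict decoder behaves like the "normal" unix case above, and the `surrogatepass` error handler mimics the Cygwin result of a single code point 0xD800.

```python
RAW = b"\xed\xa0\x80"  # the octets printf emits for \uD800

def decode_strict(data):
    """Return the decoded text, or None if the octets are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

strict = decode_strict(RAW)
lenient = RAW.decode("utf-8", errors="surrogatepass")

print("strict :", strict)
print("lenient:", hex(ord(lenient)), "from", len(RAW), "octets")
```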
page 437: Check all special characters and their effects in various programs.
TODO for Book: Show examples of conversion cases in page 501/502.
sed (and grep/gawk) NEVER match regular expressions against invalid
multibyte sequences. To force matching, use LC_ALL=C.
$ printf '\xe1\xbc\x11' | LC_ALL=C ./sed/sed 's/./X/g' | od -tx1
0000000 58 58 58
0000003
TODO: Special handling for “modified UTF-8” with NULL encoded as
UTF-8 "\xC0\x80"?
The system's native mbrtowc does not handle it,
and will return -1.
TODO: prepare for all types of invalid sequences:
https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences :
----
Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:
* the red invalid bytes in the above table
* an unexpected continuation byte
* a leading byte not followed by enough continuation bytes (can happen in
simple string truncation, when a string is too long to fit when copying it)
* an overlong encoding as described above
* a sequence that decodes to an invalid code point as described below
https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points :
------
Since RFC 3629 (November 2003), the high and low
surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code
points not encodable by UTF-16 (those after U+10FFFF) are not
legal Unicode values, and their UTF-8 encoding must be treated
as an invalid byte sequence.
Not decoding surrogate halves makes it impossible to store
invalid UTF-16, such as Windows filenames, as UTF-8. Therefore,
detecting these as errors is often not implemented and there are
attempts to define this behavior formally (see WTF-8 and CESU
below).
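A small table of test inputs, one per category quoted above, may be handy when preparing tests. The sketch below uses Python's strict UTF-8 decoder as a stand-in validator; the case labels and the helper `is_valid_utf8` are illustrative names, not part of any coreutils test suite.

```python
# One sample octet string per invalid-sequence category.
CASES = {
    "byte that never appears in UTF-8": b"\xff",
    "unexpected continuation byte": b"\x80",
    "leading byte without enough continuation bytes": b"\xe2\x82",
    "overlong encoding (modified-UTF-8 NUL)": b"\xc0\x80",
    "surrogate half U+D800": b"\xed\xa0\x80",
    "code point above U+10FFFF": b"\xf4\x90\x80\x80",
}

def is_valid_utf8(data):
    """True if a strict UTF-8 decoder accepts the octets."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for label, octets in CASES.items():
    status = "valid" if is_valid_utf8(octets) else "INVALID"
    print(f"{status:8} {octets.hex(' '):12} {label}")
```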
from https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt:
Section 3.3 Sequences with last continuation byte missing:
All bytes of an incomplete sequence should be signalled as a single
malformed sequence, i.e., you should see only a single replacement
character in each of the next 10 tests. (Characters as in section 2).
Mbbuffer currently reports EACH invalid octet instead of just one per incomplete sequence.
TODO: Is an incomplete sequence in the middle of the file reported as incomplete (mbrtowc==-2) or as invalid (mbrtowc==-1)?
If we report on the FIRST octet (including line,byte/char offset), the user (needing low-level processing) won’t be able to tell the differences without further processing. By reporting all octets, we provide easier work-arounds (but we also ‘pollute’ stdout with more “invalid char” markers than needed). Perhaps add this as an option?
On Ubuntu 14.04 with xterm 322, terminal prints only one “invalid char”:
$ printf '\ud800\n'
�
On Ubuntu 14.04 with gnome-terminal 3.6.2, nothing is printed.
Mac OS X terminal prints 3 question marks:
$ printf '\ud800\n'
???
Web browsers print 3 characters: visit https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt and view section 5.1.1.
On Linux: Google Chrome 51.0, Firefox 48.0.
On Mac: Safari 8.0.7, Chrome 47.0, IceCat 37.0.
$ printf '\367\277\277\n' | ./src/mbbuffer-test.exe -r
ofs line colB colC V wc(dec) wc(hex) Ch W n octets
0 1 1 1 n * * * * 1 0xf7
1 1 2 2 n * * * * 1 0xbf
2 1 3 3 n * * * * 1 0xbf
3 1 4 4 y 10 0x0000a = -1 1 0x0a
Also see section 5.1 (Single UTF-16 surrogates), where he claims each invalid sequence should result in ONE “invalid character” output, e.g.
5.1.1 U+D800 = ed a0 80 = "���"
In such cases ‘mbrtowc’ returns -1 THREE times (or is it because we reset mbstate_t after each failure?)
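To see how another decoder splits these inputs, one can count replacement characters. The sketch below relies on Python's `errors="replace"` handler, which (as far as I know) emits one U+FFFD per maximal invalid subsequence. That is an assumption about CPython's decoder and says nothing about what `mbrtowc` does. The helper name `replacement_count` is hypothetical.

```python
def replacement_count(data):
    """Number of U+FFFD markers a replacing UTF-8 decoder emits for data."""
    return data.decode("utf-8", errors="replace").count("\ufffd")

SAMPLES = {
    "truncated 3-byte sequence": b"\xe2\x82",
    "single surrogate U+D800": b"\xed\xa0\x80",
    "octal 367 277 277": b"\xf7\xbf\xbf",
}

for label, octets in SAMPLES.items():
    print(f"{label:26} {octets.hex(' '):9} -> {replacement_count(octets)} marker(s)")
```

Under that assumption the truncated sequence yields a single marker, while the surrogate and the `\367\277\277` input yield three each, matching the browser behaviour noted above rather than the one-marker expectation of the test file.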
TODO:
"[...] U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4
data. UTF-8 decoders should treat them like malformed or overlong
sequences for safety reasons."
(http://m.blog.csdn.net/article/details?id=50910387)
Unicode Book
HIGHLY RECOMMENDED
Unicode Explained
Internationalize Documents, Programs, and Web Sites
By Jukka K. Korpela
https://shop.oreilly.com/product/9780596101213.do
The following are pointers and notes from the above book which seemed (IMHO) relevant for coreutils’ multibyte implementation (or for testing).
page 11: Finnish har har har - change with char-classes, regex, normalization, upper/lower cases
page 12: ligature “fi”: change for normalization, char classes
page 23: German lower-case “strasse” (sharp-S?) becomes “SS” in upper-case (two characters). Also differs from the Greek “beta” glyph. ‘\u00DF’
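The length change is easy to demonstrate, and it matters for any tool that assumes case conversion is one-to-one (as `tr` does). A minimal sketch using Python's built-in full case mapping, which is an assumption about what a multibyte-aware tool might want, not a description of current coreutils behaviour:

```python
import unicodedata

word = "stra\u00dfe"   # "strasse" written with U+00DF LATIN SMALL LETTER SHARP S
upper = word.upper()   # full case mapping: sharp s becomes two letters

print(word, len(word))
print(upper, len(upper))

# Sharp s and Greek beta are unrelated code points, and normalization
# does not map one to the other.
sharp_s, beta = "\u00df", "\u03b2"
print(unicodedata.name(sharp_s), "|", unicodedata.name(beta))
```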
page 23: 0-with-cross is diameter in mechanical writing, or a letter in Nordic languages?
Length of BIDI markers (zero width?)
page 29 (2nd paragraph from bottom): Greek Sigma in middle vs final form. If there’s no equivalence between them, how about sort order ?
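A quick check of the sigma question, as a sketch: normalization keeps the two forms distinct, while case folding makes them equal. Plain code-point order (what a C-locale `sort` would use) places the final form first. Locale collation via `strcoll` is not covered here.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
MEDIAL_SIGMA = "\u03c3"   # GREEK SMALL LETTER SIGMA

# No normalization form maps one sigma onto the other.
nfkc_equal = (unicodedata.normalize("NFKC", FINAL_SIGMA)
              == unicodedata.normalize("NFKC", MEDIAL_SIGMA))

# Case folding, by contrast, treats them as the same letter.
folded_equal = FINAL_SIGMA.casefold() == MEDIAL_SIGMA.casefold()

# Code-point order of the two forms.
by_codepoint = sorted([MEDIAL_SIGMA, FINAL_SIGMA])

print("equal under NFKC:    ", nfkc_equal)
print("equal under casefold:", folded_equal)
print("code-point order:    ", [f"U+{ord(c):04X}" for c in by_codepoint])
```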
page 29 (top): initial/middle/final/separate contextual forms (e.g. Hebrew/Arabic). Sort order?
Page 143: Transcoding tools http://www.unicode.org/Public/MAPPINGS TODO: Download these for offline processing.
Page 145: Repertoire requirements. Characters in each language: http://www.eki.ee/letter
Page 169: Named Sequences http://www.unicode.org/Public/UNIDATA/NamedSequences.txt
Page 178: Table 4.3: Code-point Classification. TODO: Test unassigned, Surrogate, Private-Use input. Ensure no bugs; should be passed as-is. What about “wc” and “cut”?
Page 182: Digraph, e.g. “ll” in Spanish, “Ch” in some others - two distinct characters logically treated as one by native language speakers. VS æ (\u00E6), which is one character for “ae”, and ĳ (\u0133), which is the small Latin ligature “ij”. TODO: Check unicode-normalization-decomposition.
Page 185: unicode standard - chapters Chapter 5: Implementation Guidelines
Page 194: Variation Selectors. Unicode markers affecting the preceding code point, e.g. ∩︀ (\u2229 - “intersection” symbol) followed by \uFE00 (“variation selector” VS1). Does it affect the font in the application? Is this a zero-width character? TODO: check with ‘expand’, ‘cut’, ‘wc’.
Page 195: Ligatures. In Danish/Norwegian, æ (\u00E6) is an independent letter, vs. just a ligature of the two letters “ae” in other languages. TODO: Test sort order with such input in Danish-vs-other locales. TODO: in a Danish locale, should Unicode normalization NOT decompose it?? Unicode “Alphabetic Presentation Forms” block (U+FB00..U+FB4F).
TODO: Test decomposition of ligatures in that block (e.g. hebrew ligatures?)
$ printf '\ufb00\n'
ff
$ printf '\ufb03\n'
ffi
$ printf '\ufb4a\n'
תּ
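As a starting point for the ligature TODO, the three characters printed above (plus æ for contrast) can be run through canonical and compatibility decomposition. This sketch uses Python's `unicodedata` module, so it reflects the Unicode data bundled with Python rather than `unorm`. The helper name `expand_forms` is hypothetical.

```python
import unicodedata

def expand_forms(ch):
    """Return (NFD, NFKD) of a single character."""
    return (unicodedata.normalize("NFD", ch),
            unicodedata.normalize("NFKD", ch))

def codepoints(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)

# ff ligature, ffi ligature, Hebrew tav with dagesh, and ae
for ch in "\ufb00\ufb03\ufb4a\u00e6":
    nfd, nfkd = expand_forms(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    print("  NFD :", codepoints(nfd))
    print("  NFKD:", codepoints(nfkd))
```

With the data this was checked against, the Latin ligatures only decompose under the compatibility forms, the Hebrew presentation form decomposes canonically (so even NFD changes it), and æ has no decomposition in any form, regardless of locale.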
ZWJ (\u200d) should instruct the application to join the
characters before/after it into a ligature.
Doesn't seem to work (on Mac OS):
$ printf 'f\u200Di\n'
f i
Similarly, ZWNJ (\u200C) should prevent joining.
TODO: test ZWJ,ZWNJ (zero width or "invisible control" chars?)
in cut/expand/wc.
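Before testing the tools, it may help to see how the Unicode database classifies these characters. A sketch with Python's `unicodedata` follows; the suggestion that a width function could key off the general category is my assumption, not something the book or libc prescribes.

```python
import unicodedata

INVISIBLES = (0x200C, 0x200D, 0xFE00)  # ZWNJ, ZWJ, variation selector VS1

for cp in INVISIBLES:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch):24} "
          f"category={unicodedata.category(ch)}")

# The 'f' + ZWJ + 'i' example: three code points, five octets, and
# (if the terminal honours ZWJ as zero-width) two columns at most.
sample = "f\u200di"
print(len(sample), "code points,", len(sample.encode("utf-8")), "octets")
```

ZWJ and ZWNJ are format characters (Cf) and VS1 is a non-spacing mark (Mn), so byte counts, character counts and display columns all differ for such input.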
Page 196: Vowels vs Marks Hebrew+Arabic: Nikud.
Hindi (Devanagari script):
\u092A (pa) followed by \u0942 (uu) appears as one glyph (puu).
Page 211, Table 5-1: General Category Values.
Page 216: Character property ‘ea’ = East Asian Width (Full, Half, Narrow). Affects ‘expand’?
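The ‘ea’ property can be queried directly. The sketch below prints it for a few characters and derives a rough column count; the rule "Fullwidth or Wide means two columns" in the hypothetical helper `approx_columns` is a common approximation that I am assuming here, and it ignores combining marks and ambiguous-width characters.

```python
import unicodedata

def approx_columns(ch):
    """Very rough display width: 2 for Fullwidth/Wide, otherwise 1.

    Assumption for illustration only; wcwidth(3) may disagree.
    """
    return 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1

# ASCII A, fullwidth A, halfwidth katakana A, a CJK ideograph
for ch in "A\uff21\uff71\u4e2d":
    print(f"U+{ord(ch):04X} ea={unicodedata.east_asian_width(ch):2} "
          f"columns~{approx_columns(ch)}")
```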
Page 216: Grapheme Clusters? for ‘fold/fmt/cut’ ?
Page 219,220: Use the ‘WB’ (Word-Break), ‘WS’ (Whitespace), or ‘SB’ (Sentence-Break) properties for counting words in ‘wc’?
Page 220: Property ‘SFC’ (Simple-Case-Folding): Upper/Lower case are simple.
Page 227: Canonical vs Compatibility mapping. Canonical: different encodings for the SAME symbol. Compatibility: fundamentally similar characters that differ in rendering/usage (and sometimes in meaning).
Examples in Book
\u2126 = \u03a9
Page 231: Iterative decomposition:
ANGSTROM (U+212b):
$ printf '\u212b\n'
Å
Is canonical-mapped to 'A-with-Ring U+00C5':
$ printf '\u00c5\n'
Å
Which is canonical mapped to 'A + combining mark ring (U+030a)':
$ printf 'A\u30a\n'
Å
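The two-step chain can be followed programmatically. This is a sketch with Python's `unicodedata` (not `unorm`): `decomposition()` returns a single step of the mapping, while `normalize("NFD", ...)` applies the steps repeatedly.

```python
import unicodedata

ANGSTROM = "\u212b"   # ANGSTROM SIGN
A_RING = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
OHM = "\u2126"        # OHM SIGN, the book's \u2126 = \u03a9 example

# One decomposition step at a time.
print("U+212B ->", unicodedata.decomposition(ANGSTROM))
print("U+00C5 ->", unicodedata.decomposition(A_RING))

# Full (iterated) canonical decomposition, and the recomposed form.
nfd = unicodedata.normalize("NFD", ANGSTROM)
nfc = unicodedata.normalize("NFC", ANGSTROM)
print("NFD:", [f"U+{ord(c):04X}" for c in nfd])
print("NFC:", [f"U+{ord(c):04X}" for c in nfc])
```

Note that NFC does not return to U+212B: the recomposed form is U+00C5, so the ANGSTROM SIGN never survives a normalization round trip.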
Page 233:
Decomposition of 'VULGAR HALF', 'MICRO SIGN', 'E WITH GRAVE':
$ printf '\u00BD\u00b5\u00e8\n'
½µè
Becomes:
VULGAR HALF => 1 'FRACTION SLASH' 2
MICRO SIGN => greek mu
'E WITH GRAVE' => 'E' 'COMBINING GRAVE ACCENT'
With decomposition (E + combining grave accent):
$ printf '\u00BD\u00b5\u00e8\n'| ./src/unorm -n nfkd | iconv -t ucs-2le | od -tx2 -An
0031 2044 0032 03bc 0065 0300 000a
Without decomposition ('E WITH GRAVE' stays as-is):
$ printf '\u00BD\u00b5\u00e8\n'| ./src/unorm -n nfkc | iconv -t ucs-2le | od -tx2 -An
0031 2044 0032 03bc 00e8 000a
TODO: for 'sort', 'uniq', 'join':
Test the above strings as 'equivalent' (strxfrm/strcoll) ?
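One way to phrase the equivalence test is to compare normalized keys. The sketch below does that with Python's `unicodedata`; it only illustrates canonical versus compatibility equivalence and is not a substitute for locale-aware `strxfrm`/`strcoll`. The helper names are hypothetical.

```python
import unicodedata

def canonical_key(s):
    """Key under which canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFD", s)

def compat_key(s):
    """Key under which compatibility-equivalent strings compare equal."""
    return unicodedata.normalize("NFKD", s)

PAIRS = [
    ("\u00e8", "e\u0300"),       # E WITH GRAVE vs e + combining grave
    ("\u00bd", "1\u20442"),      # VULGAR HALF vs 1 FRACTION SLASH 2
    ("\u00b5", "\u03bc"),        # MICRO SIGN vs Greek mu
]

for a, b in PAIRS:
    print(f"{a!r:10} {b!r:12} "
          f"canonical={canonical_key(a) == canonical_key(b)} "
          f"compat={compat_key(a) == compat_key(b)}")
```

Only the first pair is canonically equivalent; the other two match only under the compatibility forms, which is a distinction that 'sort', 'uniq' and 'join' tests would need to cover separately.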
Page 249: Collation order: there is no official collation order. Unicode Technical Standard #10 http://www.unicode.org/reports/tr30/tr30-4.html
Page 256: Text Boundaries. See the files in /Users/gordon/projects/unicode-mapping/www.unicode.org/Public/9.0.0/ucd/auxiliary, like WordBreakProperty.txt (includes test files). TODO: for wc, fold, fmt, cut? TODO: instead of ‘iswspace’, use the Unicode ‘Alphabetic’ property?
For Book: document exceptions for Thai/Lao/Hiragana ?
Page 276: Line-Breaking rules for fold/fmt?
Page 282: Unicode Conformance requirements TODO: Test unassigned codes (don’t generate, don’t change) in all programs.
Page 285: Conformance: C12a: unorm is conformant.
Page 286: Conformance: C14,15,16 (normalization): unorm is conformant.
page 287: Conformance: When mentioning normalization, use proper terms (for unorm)
Page 299: UTF-8 vs ISO-8859-1 For Book
Page 300: Duplicate Octet Range table, add octal.
Page 392: Duplicate table of control characters, add octal; mention sed, printf, od. For book/website.
Page 414: Fixed-width characters (e.g. em/en dashes). TODO: How to treat in ‘expand’?
Page 426: Line-break characters in Unicode for fold/fmt; what about ‘wc’? LS (U+2028) Line Separator, PS (U+2029) Paragraph Separator. For Book/website.
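The gap between "newline count" and "Unicode line count" shows up with a three-line sample. In this sketch, counting `\n` stands in for what `wc -l` does, and Python's `str.splitlines()` stands in for a Unicode-aware line splitter (it treats LS and PS as line boundaries).

```python
# "one" LS "two" PS "three" newline
text = "one\u2028two\u2029three\n"

newline_count = text.count("\n")   # what a newline-based counter sees
unicode_lines = text.splitlines()  # splits on LS/PS as well as \n

print("newlines:", newline_count)
print("lines   :", unicode_lines)
```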
Page 426: mathematical and technical symbols. For Book/website: canonical compatibility with other chars.
Page 438: ‘other’ non alphabetical markers, should they be counted as words?
$ env printf '\ufff9assaf\ufffagordon\ufffb\n' | wc
1 1 21
in HTML this would be rendered as two words, assaf/gordon.
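The `wc` numbers above can be reproduced, and the alternative reading tried out, in a few lines. This is a sketch: whitespace splitting stands in for `wc -w`, and treating every format character (general category Cf) as a word separator is a hypothetical rule of mine, not something `wc` or the book specifies.

```python
import unicodedata

# U+FFF9..U+FFFB are the interlinear annotation anchor/separator/terminator.
data = "\ufff9assaf\ufffagordon\ufffb\n"

lines = data.count("\n")
words = len(data.split())            # whitespace-delimited words
octets = len(data.encode("utf-8"))
print(lines, words, octets)

def words_breaking_on_format_chars(s):
    """Split on whitespace AND on general-category-Cf characters."""
    spaced = "".join(" " if unicodedata.category(c) == "Cf" else c for c in s)
    return spaced.split()

print(words_breaking_on_format_chars(data))
```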
Page 468: Invisible characters ?
Page 469: Markup vs plaintext: Table 9-2: should these characters be counted in ‘wc’, skipped in ‘expand’, non-break with ‘cut’?
Are these considered "word break" properties?
Should sed's "\b \B \< \>" regex operators support them?
Page 592: Patterns, regex patterns. TODO: ensure test cases according to page 594.
page 597: “Basic Unicode Support” TODO: Check which coreutils fall under the requirements, and whether they comply.
Last but not least
Support the cause: Adopt a Unicode Character! http://www.unicode.org/consortium/adopted-characters.html
Unorganized (yet)
OpenBSD removes non-utf8 locales: http://marc.info/?l=openbsd-cvs&m=143956261214725&w=2
TODO: learn from grep’s multibyte-white-space test