This patch fixes the UTF-8 sequence validation logic in teken_utf8_bytes_to_codepoint and the fallback behaviour in ttydisc_rubchar when an invalid UTF-8 sequence is encountered. The code previously used a bit count to extract the sequence length from the leading byte. However, that assumption breaks for code points that have additional bits set in the first half of the leading byte (e.g. Cyrillic characters), which led to incorrect behaviour when deleting those characters with backspace. The code now counts the consecutive set bits in the leading byte starting from the MSB, as per RFC 3629.
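To illustrate the difference between the two approaches, here is a minimal sketch in C. It is not the code from the patch: utf8_seq_len and high_nibble_popcount are hypothetical names, and the bit-count variant is only an assumed stand-in for the kind of logic being replaced. The sketch relies on the GCC/Clang builtins __builtin_clz and __builtin_popcount.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): return the total length of
 * the UTF-8 sequence introduced by `lead`, or -1 if `lead` cannot start
 * a sequence. Overlong leads (0xC0, 0xC1) and 0xF5..0xF7 are not
 * rejected here; a full RFC 3629 validator would need those checks too.
 */
static int
utf8_seq_len(uint8_t lead)
{
	unsigned int inv, ones;

	/* Invert the byte so that leading ones become leading zeros. */
	inv = (uint8_t)~lead;
	if (inv == 0)
		return (-1);	/* 0xFF; __builtin_clz(0) is undefined */

	/* __builtin_clz works on unsigned int; discount the bits above the byte. */
	ones = __builtin_clz(inv) - (sizeof(unsigned int) * CHAR_BIT - 8);

	switch (ones) {
	case 0:
		return (1);	/* 0xxxxxxx: single-byte (ASCII) */
	case 2:
		return (2);	/* 110xxxxx */
	case 3:
		return (3);	/* 1110xxxx */
	case 4:
		return (4);	/* 11110xxx */
	default:
		return (-1);	/* 10xxxxxx continuation, or 5+ leading ones */
	}
}

/*
 * Assumed illustration of a bit-count heuristic: count every set bit in
 * the upper nibble, whether or not the bits are consecutive.
 */
static int
high_nibble_popcount(uint8_t lead)
{
	return (__builtin_popcount(lead >> 4));
}
```

For the Cyrillic leading byte 0xD0 (1101 0000), utf8_seq_len returns 2 because only the top two bits are consecutive ones, whereas high_nibble_popcount returns 3 because it also counts the payload bit.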
Details
Diff Detail
- Lint: Lint Skipped
- Unit: Tests Skipped
Event Timeline
The commit message should say RFC 3629 instead of 2629.
Inline comments: sys/teken/teken_wcwidth.h, lines 131–134
Address @christos's comments.
- Add more detailed explanation of the use of __builtin_clz
Other fixes:
- The codepoint calculation for two-byte sequences was missing one bit in the mask applied to the leading byte; this is now fixed
- ttydisc_rubchar now falls back to non-UTF-8 behaviour if teken_wcwidth returns an error
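As a rough illustration of the two-byte mask fix, the sketch below decodes a two-byte sequence in C. It is not the function from the patch: utf8_decode2 is a hypothetical name, and the -1 return is just one assumed way to signal an error that a caller could use to fall back to byte-wise handling. A two-byte leading byte has the form 110xxxxx, so it carries five payload bits and the mask is 0x1f; a narrower mask would silently drop the top payload bit.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): decode a two-byte UTF-8
 * sequence. Returns the code point, or -1 if either byte is malformed.
 * Overlong encodings (leads 0xC0 and 0xC1) are not rejected here.
 */
static int32_t
utf8_decode2(uint8_t lead, uint8_t cont)
{
	if ((lead & 0xe0) != 0xc0)	/* lead must be 110xxxxx */
		return (-1);
	if ((cont & 0xc0) != 0x80)	/* continuation must be 10xxxxxx */
		return (-1);

	/* Five payload bits from the lead, six from the continuation. */
	return (((int32_t)(lead & 0x1f) << 6) | (cont & 0x3f));
}
```

With the bytes 0xD0 0xB0 (Cyrillic small letter a), the five-bit mask yields U+0430. If the mask dropped the top payload bit of 0xD0, the lead would contribute nothing and the result would be 0x30 instead.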