Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,120 of 4,675    |
|    Robert Prins to All    |
|    More UTF-8 woes - UTF-8 to "\uN" RTF    |
|    28 Nov 17 20:48:43    |
      From: robert@nospicedham.prino.org

      I've got working code to convert UTF-8 to "\uN" RTF escape sequences (as the
      format used by LibreOffice is simply way too hard to recreate).

      The code was originally written as an ISPF edit macro in REXX on z/OS, and,
      bypassing the fact that both FTP and the proprietary IBM IND$FILE function
      don't really have a clue about UTF-8 to EBCDIC translation, I just look for
      the translated characters 194 to 244, concatenate the required additional 1,
      2, or 3 characters, bit-fiddle them to decimal, and then do a global change
      of the 2-, 3-, or 4-byte strings to the appropriate '\uN ' string. Works
      perfectly, and, using global change in the editor on z/OS, I never convert
      the same 2, 3, or 4 characters again! (Which is nice, because using the
      ISPF editor in batch munches CPU...)

      Now, the same approach would obviously work on Intel:

      - search a buffer holding the entire file for the required characters,
      - delete them, and
      - insert the appropriate RTF escape sequence

      or, more optimally, insert the extra characters required and overlay the
      whole set with the RTF escape sequence.
      Needless to say, the amount of data having to be shifted around would be
      pretty horrible - the input file currently contains 11,869 (167 char + crlf)
      lines, with "only" 4,876 of them containing UTF-8 encoded characters (for a
      total of 10,787 UTF-8 characters).

      What I'm doing right now, as usual in Pascal, is pretty simple, and, needless
      to say, not very efficient:

      _i:= 194;
      repeat
        c1:= char(_i);
        _p:= pos(c1, _line);

        if _p <> 0 then
        begin
          c2:= _line[_p + 1];

          if _i <= 223 then
          begin
            _u:= ((_i and $0000001f) shl 6) or
                 (longint(c2) and $0000003f);
            _l:= 2;
          end
          else
          if _i <= 239 then
          begin
            c3:= _line[_p + 2];
            _u:= (((_i and $0000000f) shl 6) or
                  (longint(c2) and $0000003f)) shl 6 or
                 (longint(c3) and $0000003f);
            _l:= 3;
          end
          else
          begin
            c3:= _line[_p + 2];
            c4:= _line[_p + 3];
            _u:= ((((_i and $00000007) shl 6) or
                  (longint(c2) and $0000003f)) shl 6 or
                  (longint(c3) and $0000003f)) shl 6 or
                 (longint(c4) and $0000003f);
            _l:= 4;
          end;

          str(_u, rtf);
          rtf:= '\u' + rtf + ' ';
          delete(_line, _p, _l);
          insert(rtf, _line, _p);
        end
        else
          inc(_i);
      until _i = 245;

      because I check every (relevant) line for every possible first byte of a
      UTF-8 encoded character, which is made worse by the fact that the lines to
      scan have a length of 167 characters, but the UTF-8 characters can only
      appear from position 69 onwards, and are (currently) limited to two strings
      of up to 48 characters, a number that bears an interesting multiplicative
      resemblance to 16, which happens to be the size of an XMM register.
      Using the masking technique seen here many times, I think it would be pretty
      straight-forward (Is it? ASCII chars with the 7th bit set would also end up
      as non-zero) to check if a line contains any '11xx xxxx' bytes, and use that
      info to obtain the full 2/3/4 character UTF-8 character, removing the use of
      a call to "Pos()", which is now used at least 51 times per line (Ouch!)

      However, even with this minor optimization, I'm left with the problem that
      this process has to be repeated for every line, and includes the repeated
      construction of the '\uN ' escape. There are as many possible escapes as
      there are UTF-8 characters (> 1,000,000), so pre-storing them is likely to
      be more expensive than computing them on-the-fly.

      And yes, the actual code that currently does the conversion takes all of 0.2
      seconds, but given that it takes only 0.14 seconds to convert the other 7
      files, that's a bit too long. ;)

      So, any suggestions as to how I can speed up the process?

      Robert
      --
      Robert AH Prins
      robert(a)prino(d)org

      --- SoupGate-Win32 v1.05
       * Origin: you cannot sedate... all the things you hate (1:229/2)    |
(c) 1994, bbs@darkrealms.ca