Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,120 of 4,675    |
|    Robert Prins to All    |
|    More UTF-8 woes - UTF-8 to "\uN" RTF    |
|    28 Nov 17 20:48:43    |
      From: robert@nospicedham.prino.org

      I've got working code to convert UTF-8 to "\uN" RTF escape sequences (as the
      format used by LibreOffice is simply way too hard to recreate).

      The code was originally written as an ISPF edit macro in REXX on z/OS, and,
      bypassing the fact that both FTP and the proprietary IBM IND$FILE function
      don't really have a clue about UTF-8 to EBCDIC translation, I just look for
      the translated characters 194 to 244, concatenate the required additional 1,
      2, or 3 characters, bit-fiddle them to decimal, and then do a global change
      of the 2-, 3-, or 4-byte strings to the appropriate '\uN ' string. Works
      perfectly, and, using global change in the editor on z/OS, I never convert
      the same 2, 3, or 4 characters again! (Which is nice, because using the
      ISPF editor in batch munches CPU...)

      Now, the same approach would obviously work on Intel:

      - search a buffer holding the entire file for the required characters,
      - delete them, and
      - insert the appropriate RTF escape sequence

      or, more optimally, insert the extra characters required and overlay the
      whole set with the RTF escape sequence.
      Needless to say, the amount of data having to be shifted around would be
      pretty horrible - the input file currently contains 11,869 (167 char + crlf)
      lines, with "only" 4,876 of them containing UTF-8 encoded characters (for a
      total of 10,787 UTF-8 characters).

      What I'm doing right now, as usual in Pascal, is pretty simple, and, needless
      to say, not very efficient:

      _i:= 194;
      repeat
        c1:= char(_i);
        _p:= pos(c1, _line);

        if _p <> 0 then
        begin
          c2:= _line[_p + 1];

          if _i <= 223 then
          begin
            _u:= ((_i and $0000001f) shl 6) or
                 (longint(c2) and $0000003f);
            _l:= 2;
          end
          else
          if _i <= 239 then
          begin
            c3:= _line[_p + 2];
            _u:= (((_i and $0000000f) shl 6) or
                  (longint(c2) and $0000003f)) shl 6 or
                 (longint(c3) and $0000003f);
            _l:= 3;
          end
          else
          begin
            c3:= _line[_p + 2];
            c4:= _line[_p + 3];
            _u:= ((((_i and $00000007) shl 6) or
                  (longint(c2) and $0000003f)) shl 6 or
                  (longint(c3) and $0000003f)) shl 6 or
                 (longint(c4) and $0000003f);
            _l:= 4;
          end;

          str(_u, rtf);
          rtf:= '\u' + rtf + ' ';
          delete(_line, _p, _l);
          insert(rtf, _line, _p);
        end
        else
          inc(_i);
      until _i = 245;

      because I check every (relevant) line for every possible first byte of a
      UTF-8 encoded character, which is made worse by the fact that the lines to
      scan have a length of 167 characters, but the UTF-8 characters can only
      appear from position 69 onwards, and are (currently) limited to two strings
      of up to 48 characters, a number that bears an interesting multiplicative
      resemblance to 16, which happens to be the size of an XMM register.
      Using the masking technique seen here many times, I think it would be pretty
      straight-forward (Is it? ASCII chars with the 7th bit set would also end up
      as non-zero) to check if a line contains any '11xx xxxx' bytes, and use that
      info to obtain the full 2/3/4 character UTF-8 character, removing the use of
      a call to "Pos()", which is now used at least 51 times per line (Ouch!)

      However, even with this minor optimization, I'm left with the problem that
      this process has to be repeated for every line, and includes the repeated
      construction of the '\uN ' escape. There are as many possible escapes as
      there are UTF-8 characters (> 1,000,000), so pre-storing them is likely to
      be more expensive than computing them on-the-fly.

      And yes, the actual code that currently does the conversion takes all of 0.2
      seconds, but given that it takes only 0.14 seconds to convert the other 7
      files, that's a bit too long. ;)

      So, any suggestions as to how I can speed up the process?

      Robert
      --
      Robert AH Prins
      robert(a)prino(d)org

      --- SoupGate-Win32 v1.05
       * Origin: you cannot sedate... all the things you hate (1:229/2)    |
(c) 1994, bbs@darkrealms.ca