comp.lang.asm.x86
Robert Prins to Terje Mathisen
Re: More UTF-8 woes - UTF-8 to "\uN" RTF
29 Nov 17 19:59:56
   From: robert@nospicedham.prino.org   
      
   On 2017-11-29 12:32, Terje Mathisen wrote:   
   > Robert Prins wrote:   
   >> I've got working code to convert UTF-8 to "\uN" RTF escape sequences   
   >> (as the format used by LibreOffice is simply way too hard to   
   >> recreate).   
   > [snip]   
   >> Needless to say, the amount of data having to be shifted around
   >> would be pretty horrible - input file currently contains 11,869 (167   
   >> char + crlf) lines with "only" 4,876 of them containing UTF-8   
   >> encoded characters (for a total of 10,787 UTF-8 characters)   
   > [snip]   
   >> However, even with this minor optimization, I'm left with the   
   >> problem that this process has to be repeated for every line, and   
   >> includes the repeated construction of the '\Un ' escape. There are as   
   >> many as there are UTF-8 characters (> 1,000,000) so pre-storing them   
   >> is likely to be more expensive than computing them on-the-fly.   
   >>   
   >> And yes, the actual code that currently does the conversion takes all   
   >> of 0.2 seconds, but given that it takes only 0.14 seconds to convert   
   >> the other 7 files, that's a bit too long. ;)   
   >>   
   >> So, any suggestions as to how I can speed up the process?   
   >   
   > Robert, you should be able to do this as a stream process, creating a new
   > line to replace one that needs translation, discarding it if nothing was
   > changed:
      
   Hammer & nail!   
      
   I realized that while pulling in the actual in-line code this afternoon, and
   while doing so I also found out that some of my UTF-8 to "\Un" conversions
   break the 255-byte limit on the ancient Pascal strings. So I will have to go
   back to the drawing board, not just to figure out how to detect bytes with
   the two top bits set (I guess a variation on Alan Mycroft's "#has_nullbyte"),
   where they are set, and how to check YMM registers for zero, but also how to
   handle strings longer than 255 bytes. That last part should be pretty
   straightforward, given that I'm already bypassing Write(Ln) and moving most
   of my output data directly into an extended buffer added to the output file.
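
A minimal sketch in C of one such variation, for illustration only. Mycroft's
original test for a zero byte is (x - 0x01010101) & ~x & 0x80808080; for
"both top bits set" a simpler form works, because shifting the word left by
one moves bit 6 of each byte onto bit 7 of the same byte. The names
lead_byte_mask and find_lead_byte are made up for this example, and the
position is found with a plain byte scan so the code does not depend on byte
order or on compiler builtins.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HI_BITS 0x8080808080808080ULL

/* Returns 0x80 in every byte lane of w whose value is >= 0xC0,
   i.e. whose two top bits are both set; all other lanes are 0. */
static uint64_t lead_byte_mask(uint64_t w)
{
    /* (w << 1) puts bit 6 of each byte on bit 7 of the same byte.
       The bit that leaks in from the lower neighbour lands on bit 0
       and is discarded by the final AND with HI_BITS. */
    return w & (w << 1) & HI_BITS;
}

/* Hypothetical helper: index of the first byte in p[0..n) with both
   top bits set, or n if there is none. */
size_t find_lead_byte(const unsigned char *p, size_t n)
{
    size_t i = 0;

    /* Skip eight bytes at a time while no lane qualifies. */
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, 8);          /* unaligned-safe load */
        if (lead_byte_mask(w) != 0)
            break;                     /* pin down the position below */
    }

    /* Byte-wise scan of the hit block or of the tail. */
    for (; i < n; i++)
        if ((p[i] & 0xC0) == 0xC0)
            return i;
    return n;
}
```

The same AND-with-shifted-self idea should carry over to wider registers; for
the "is this YMM register all zero" part, the VPTEST instruction sets ZF when
the AND of its two operands is zero.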
      
   > As you note, it should be trivial to convert utf8-encoded data to \U+decimal   
   > value, you just need to find them:   
   >   
   > A SIMD approach where you load 16 bytes and look for any starting utf8
   > character should be easy, you just need to look for bytes in a short range.
   >   
   > If you find at least one, start by copying everything up to this point, then
   > go into decoding mode and write the results back out as \U escapes.
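
The copy-then-decode loop described above could look roughly like the
following C sketch. It is scalar rather than SIMD and is not the code from
either poster. Assumptions: the function name utf8_to_rtf is invented, the
output is the RTF control word \u followed by a decimal value and a single
space, values above 32767 are written as negative numbers (RTF treats N as a
signed 16-bit value), and the RTF header is assumed to set \uc0 so that no
fallback character has to follow each escape. Only 2- and 3-byte sequences
are handled; 4-byte sequences are rejected.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Converts src[0..n) to dst, replacing every 2- or 3-byte UTF-8
   sequence with "\uN ". Returns the number of bytes written (not
   counting the terminating NUL), or -1 on malformed or unsupported
   input or if dst is too small. */
long utf8_to_rtf(const unsigned char *src, size_t n, char *dst, size_t cap)
{
    size_t i = 0, o = 0;

    if (cap == 0)
        return -1;

    while (i < n) {
        /* Find the next byte with both top bits set. */
        size_t run = i;
        while (run < n && (src[run] & 0xC0) != 0xC0)
            run++;

        /* Copy everything up to that point unchanged. */
        if (o + (run - i) >= cap)
            return -1;
        memcpy(dst + o, src + i, run - i);
        o += run - i;
        i = run;
        if (i == n)
            break;

        /* Decoding mode: the lead byte gives the sequence length. */
        unsigned cp;
        size_t len;
        unsigned char b = src[i];
        if ((b & 0xE0) == 0xC0)      { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else return -1;              /* 4-byte sequences: not in this sketch */

        if (i + len > n)
            return -1;               /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((src[i + k] & 0xC0) != 0x80)
                return -1;           /* not a continuation byte */
            cp = (cp << 6) | (src[i + k] & 0x3F);
        }

        /* Write the escape, using a negative value above 32767. */
        int val = cp > 32767 ? (int)cp - 65536 : (int)cp;
        int w = snprintf(dst + o, cap - o, "\\u%d ", val);
        if (w < 0 || (size_t)w >= cap - o)
            return -1;
        o += (size_t)w;
        i += len;
    }
    dst[o] = '\0';
    return (long)o;
}
```

Lines without any multi-byte sequence go through a single memcpy, so the
caller can compare the returned length with n to see whether anything
changed. Because the output buffer is caller-supplied, nothing here is tied
to a 255-byte string limit.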
      
   For historical reasons, the PL/I version of the same program on z/OS uses FBA
   datasets, so I generate all output with a leading blank (and boxes with the
   original IBM CP437 characters). I might just as well do the UTF-8 decoding
   during this phase!
      
   > You should be easily able to get the speed up to at least a GB/s, so a
   > couple of ms for the ~2MB of input data.
      
   The current 20ms isn't too shabby, but just as I like to sit in fast cars(*), I   
   like my code to be fast. ;)   
      
   Robert   
      
   (*) Highest top speed: 300 km/h in a Mercedes AMG E63 (in Poland!), highest   
   average speed 196.1 km/h (304 km in 1:33) in an Audi RS7 Quattro (obviously in   
   Germany).   
   --   
      
   Robert AH Prins   
   robert(a)prino(d)org   
      