comp.lang.asm.x86

Terje Mathisen to Robert Prins

Re: More UTF-8 woes - UTF-8 to "\uN" RTF

29 Nov 17 13:32:15

   From: terje.mathisen@nospicedham.tmsw.no   
      
   Robert Prins wrote:   
   > I've got working code to convert UTF-8 to "\uN" RTF escape sequences   
   > (as the format used by LibreOffice is simply way too hard to   
   > recreate).   
   [snip]   
   > Needless to say, the amount of data having to be shifted around   
   > would be pretty horrible - input file currently contains 11,869 (167   
   > char + crlf) lines with "only" 4,876 of them containing UTF-8   
   > encoded characters (for a total of 10,787 UTF-8 characters)   
   [snip]   
   > However, even with this minor optimization, I'm left with the   
   > problem that this process has to be repeated for every line, and   
   > includes the repeated construction of the '\uN ' escape. There are as   
   > many as there are UTF-8 characters (> 1,000,000) so pre-storing them   
   > is likely to be more expensive than computing them on-the-fly.   
   >   
   > And yes, the actual code that currently does the conversion takes all   
   > of 0.2 seconds, but given that it takes only 0.14 seconds to convert   
   > the other 7 files, that's a bit too long. ;)   
   >   
   > So, any suggestions as to how I can speed up the process?   
      
   Robert, you should be able to do this as a stream process: create a   
   new line to replace each one that needs translation, and discard it if   
   nothing was changed.   
      
   As you note, it should be trivial to convert UTF-8 encoded data to   
   \uN escapes (N being the decimal value); you just need to find the   
   encoded characters first.   
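
A minimal sketch of that conversion step in C, assuming valid and complete UTF-8 input. The names utf8_decode and rtf_escape are hypothetical, not taken from Robert's code. It emits the '\uN ' form with a trailing space, as in the quoted question, which presumably relies on \uc0 being in effect so that no fallback character follows. RTF takes N as a signed 16-bit decimal, so values above 32767 are written as negative numbers and code points above U+FFFF become a UTF-16 surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one UTF-8 sequence at s (assumed valid and complete).
   Stores the code point in *cp and returns the bytes consumed. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {                       /* plain ASCII */
        *cp = s[0];
        return 1;
    }
    if (s[0] < 0xE0) {                       /* 2-byte sequence */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if (s[0] < 0xF0) {                       /* 3-byte sequence */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18)    /* 4-byte sequence */
        | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Write one 16-bit unit as "\uN " with N as a signed 16-bit decimal. */
size_t rtf_put_unit(char *out, uint32_t unit)
{
    int n = unit > 32767 ? (int)unit - 65536 : (int)unit;
    return (size_t)sprintf(out, "\\u%d ", n);
}

/* Write cp as one escape, or two (a surrogate pair) above U+FFFF.
   out needs room for at least 20 bytes, including the terminating NUL. */
size_t rtf_escape(char *out, uint32_t cp)
{
    if (cp < 0x10000)
        return rtf_put_unit(out, cp);
    cp -= 0x10000;
    size_t n = rtf_put_unit(out, 0xD800 + (cp >> 10));
    return n + rtf_put_unit(out + n, 0xDC00 + (cp & 0x3FF));
}
```

For example, the bytes C3 A9 decode to U+00E9 and come out as "\u233 ".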
      
   A SIMD approach, where you load 16 bytes at a time and look for any byte   
   that starts a multi-byte UTF-8 character, should be easy: you just need   
   to look for bytes in a short range (lead bytes are 0xC2..0xF4).   
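
A rough sketch of such a scan using SSE2 intrinsics in C, with a scalar fallback when SSE2 is not available at compile time. The function name first_non_ascii is hypothetical. As a simplification it tests only the top bit of each byte rather than a lead-byte range; in valid UTF-8 the first byte with the top bit set is always a lead byte, so the result is the same for well-formed input.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Return the offset of the first byte >= 0x80 in buf[0..len),
   or len if the buffer is pure ASCII. */
size_t first_non_ascii(const unsigned char *buf, size_t len)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= len; i += 16) {
        /* Unaligned 16-byte load; bit k of mask is the top bit of byte k. */
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(v);
        if (mask != 0) {
            size_t k = 0;
            while (!(mask & 1)) {   /* locate the lowest set bit */
                mask >>= 1;
                k++;
            }
            return i + k;
        }
    }
#endif
    /* Scalar tail, and the whole scan when SSE2 is unavailable. */
    for (; i < len; i++)
        if (buf[i] & 0x80)
            return i;
    return len;
}
```

In assembly the same idea is a MOVDQU, a PMOVMSKB and a BSF per 16-byte block; lines for which the function returns len need no translation at all.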
      
   If you find at least one, start by copying everything up to this point,   
   then go into decoding mode and write the results back out as \uN escapes.   
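
The copy-then-decode loop could look roughly like this in C. This is a sketch under assumptions: the names line_to_rtf and put_escape are hypothetical, the input is taken to be valid and complete UTF-8, the scan is a plain byte loop rather than SIMD, and the output buffer must hold 5 * len + 1 bytes (the worst case is a 4-byte character expanding to two escapes of up to 9 characters each). The changed flag lets the caller discard the new line when nothing was translated.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Append cp to out as one or two "\uN " escapes; returns characters written. */
size_t put_escape(char *out, unsigned long cp)
{
    if (cp >= 0x10000) {            /* above U+FFFF: UTF-16 surrogate pair */
        cp -= 0x10000;
        size_t n = put_escape(out, 0xD800 + (cp >> 10));
        return n + put_escape(out + n, 0xDC00 + (cp & 0x3FF));
    }
    /* RTF takes N as a signed 16-bit decimal. */
    return (size_t)sprintf(out, "\\u%ld ",
                           cp > 32767 ? (long)cp - 65536 : (long)cp);
}

/* Convert in[0..len) into out; returns the number of bytes written.
   *changed is set to 1 if at least one escape was produced. */
size_t line_to_rtf(const unsigned char *in, size_t len, char *out, int *changed)
{
    size_t i = 0, o = 0;
    *changed = 0;
    while (i < len) {
        /* Copy mode: move the whole ASCII run in one go. */
        size_t run = i;
        while (run < len && in[run] < 0x80)
            run++;
        memcpy(out + o, in + i, run - i);
        o += run - i;
        i = run;
        if (i == len)
            break;

        /* Decode mode: in[i] is a lead byte followed by 1 to 3 more bytes. */
        unsigned char b = in[i];
        int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;
        unsigned long cp = b & (0x3F >> extra);
        for (int k = 1; k <= extra; k++)
            cp = (cp << 6) | (in[i + k] & 0x3F);
        i += (size_t)extra + 1;
        o += put_escape(out + o, cp);
        *changed = 1;
    }
    return o;
}
```

When changed comes back 0, the output is a byte-for-byte copy and the original line can be written out instead.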
      
   You should easily be able to get the speed up to at least 1 GB/s, so a   
   couple of ms for the ~2 MB of input data (11,869 lines of 169 bytes each).   
      
   Terje   
      
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      