comp.lang.asm.x86
Robert Prins to Robert Prins
Re: More UTF-8 woes - UTF-8 to "\uN" RTF
01 Dec 17 00:18:06
   From: robert@nospicedham.prino.org   
      
   On 2017-11-29 19:59, Robert Prins wrote:   
   > On 2017-11-29 12:32, Terje Mathisen wrote:   
   >> Robert Prins wrote:   
   >> Robert, you should be able to do this as a stream process, creating a new line
   >> to replace one that needs translation, discarding it if nothing was changed:
   >   
   > Hammer & nail!   
   >   
   > I realized that while pulling in the actual in-line code this afternoon, and
   > while doing so I also found out that some of my UTF-8 to "\uN" conversions break
   > the 255-byte limit on the ancient Pascal strings, so I will have to go back to
   > the drawing board, not just to figure out how to detect bytes with the two top
   > bits set (I guess a variation on Alan Mycroft's "#has_nullbyte") and where they
   > are set, and how to check YMM registers for zero, but also how to handle strings
   > longer than 255 bytes, although that should be pretty straightforward, given
   > that I'm already bypassing Write(Ln) and moving most of my output data directly
   > into an extended buffer added to the output file.
   >   
   >> As you note, it should be trivial to convert utf8-encoded data to \U+decimal   
   >> value, you just need to find them:   
   >>
   >> A SIMD approach where you load 16 bytes and look for any starting utf8
   >> character should be easy, you just need to look for bytes in a short range.
      
   For now just one-by-one, with a test eax >= 194 (I know my data is valid)   
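A minimal C sketch of such a one-by-one pass, not the actual assembly from the program. It assumes valid UTF-8 input, as stated above, and the names `emit_u` and `utf8_to_rtf` are hypothetical. Bytes below 194 are copied unchanged; anything else starts a multi-byte sequence that is decoded and written as an RTF `\uN` escape followed by a `?` fallback character. RTF takes N as a signed 16-bit decimal, so values above 32767 are written as negative numbers and code points above U+FFFF as a surrogate pair. Escaping of `\`, `{` and `}` is left out.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write one UTF-16 code unit as "\uN?" and
   return the number of characters written. */
static size_t emit_u(char *dst, unsigned unit)
{
    /* RTF expects a signed 16-bit decimal value. */
    int n = unit > 32767 ? (int)unit - 65536 : (int)unit;
    return (size_t)sprintf(dst, "\\u%d?", n);
}

/* Convert len bytes of valid UTF-8 in src to RTF text in dst.
   dst must be large enough (up to 9 output bytes per input byte
   in the worst case, plus the terminating NUL).
   Returns the number of bytes written, excluding the NUL. */
size_t utf8_to_rtf(const unsigned char *src, size_t len, char *dst)
{
    size_t i = 0, o = 0;

    while (i < len) {
        unsigned c = src[i];
        unsigned cp;

        if (c < 194) {              /* plain byte: copy as-is */
            dst[o++] = (char)c;
            i++;
            continue;
        }
        if (c < 0xE0) {             /* 2-byte sequence */
            cp = ((c & 0x1F) << 6) | (src[i + 1] & 0x3F);
            i += 2;
        } else if (c < 0xF0) {      /* 3-byte sequence */
            cp = ((c & 0x0F) << 12) | ((src[i + 1] & 0x3Fu) << 6)
               | (src[i + 2] & 0x3F);
            i += 3;
        } else {                    /* 4-byte sequence */
            cp = ((c & 0x07) << 18) | ((src[i + 1] & 0x3Fu) << 12)
               | ((src[i + 2] & 0x3Fu) << 6) | (src[i + 3] & 0x3F);
            i += 4;
        }
        if (cp >= 0x10000) {        /* outside the BMP: surrogate pair */
            cp -= 0x10000;
            o += emit_u(dst + o, 0xD800 + (cp >> 10));
            o += emit_u(dst + o, 0xDC00 + (cp & 0x3FF));
        } else {
            o += emit_u(dst + o, cp);
        }
    }
    dst[o] = '\0';
    return o;
}
```

Because one input byte can expand to several output bytes, the output buffer has to be sized for the expansion, which is the same effect that breaks the 255-byte Pascal string limit mentioned above.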
      
   >> If you find at least one, start by copying everything up to this point, then   
   >> go into decoding mode and write the results back out as \U escapes.   
   >   
   > For historical reasons, the PL/I version of the same program on z/OS uses FBA   
   > datasets, I generate all output with a leading blank (and boxes with the   
   > original IBM CP437 characters), so I might just as well do the UTF-8 decoding   
   > during this phase!   
   >   
   >> You should be easily able to get the speed up to at least a GB/s, so a couple
   >> of ms for the ~2MB of input data.
   >   
   > The current 20ms isn't too shabby, but just as I like to sit in fast cars(*), I
   > like my code to be fast. ;)
      
   Moved the conversion code in-stream, at the stage where I'm shifting the data to
   the left, and the time for converting the 2 MB file out of the main lift program
   into .RTF format has gone down from 20 ms to between 2 and 3 ms. Going further
   and using the YMM registers (I don't want to use the XMMs, as the transition from
   YMM to XMM and back to YMM apparently takes a long time) might make things a bit
   faster, but the 85-90% reduction in running time is, at least for now,
   acceptable. ;)
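Short of YMM registers, the "bytes with the two top bits set" test from the quoted post can be sketched with plain 64-bit integer operations, in the spirit of the null-byte trick mentioned there. This is an illustration in C under stated assumptions, not code from the program; the function name `first_lead_byte` is hypothetical. Shifting the word left by one moves bit 6 of every byte into bit 7, so ANDing with the original word and the mask 0x80 per byte leaves a marker only where both top bits were set (byte value >= 0xC0).

```c
#include <stdint.h>

/* Return the index (0..7) of the first byte >= 0xC0 among the
   8 bytes at p, or -1 if there is none. */
int first_lead_byte(const unsigned char *p)
{
    uint64_t x = 0, m;
    int k;

    /* Assemble the word with byte 0 in the lowest bits, so the
       result does not depend on the machine's endianness. */
    for (k = 0; k < 8; k++)
        x |= (uint64_t)p[k] << (8 * k);

    /* Bit 7 of each byte survives only if bits 7 and 6 were both set.
       The bit that spills into the next byte lands in bit 0 and is
       masked away. */
    m = x & (x << 1) & 0x8080808080808080ULL;
    if (m == 0)
        return -1;

    /* Locate the lowest marked byte. */
    k = 0;
    while (!(m & 0x80)) {
        m >>= 8;
        k++;
    }
    return k;
}
```

The test is for >= 0xC0 rather than >= 194, but 0xC0 and 0xC1 never occur in valid UTF-8, so on valid data both find the same bytes. The same AND-with-shifted-copy idea carries over to wider registers.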
      
   And yes, I'm moving data from the readln() read-in line directly to the buffer   
   of the output file, completely bypassing the writeln RTL procedure.   
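The general shape of writing straight into an output buffer can be sketched in C as follows. This is only an analogy to the Pascal file buffer described above: the struct `outbuf`, its 64 KB capacity, and the functions `out_append` and `out_flush` are all hypothetical. Data is copied into one large buffer and handed to the file in big chunks, instead of going through a per-line library call.

```c
#include <stdio.h>
#include <string.h>

enum { OUT_CAP = 1 << 16 };     /* assumed buffer size: 64 KB */

struct outbuf {
    FILE  *fp;                  /* destination file */
    size_t used;                /* bytes currently held in data[] */
    char   data[OUT_CAP];
};

/* Write out whatever is buffered and mark the buffer empty. */
void out_flush(struct outbuf *ob)
{
    if (ob->used > 0) {
        fwrite(ob->data, 1, ob->used, ob->fp);
        ob->used = 0;
    }
}

/* Append n bytes, flushing first if they would not fit. */
void out_append(struct outbuf *ob, const char *src, size_t n)
{
    if (n > OUT_CAP - ob->used)
        out_flush(ob);
    if (n > OUT_CAP) {          /* larger than the whole buffer */
        fwrite(src, 1, n, ob->fp);
        return;
    }
    memcpy(ob->data + ob->used, src, n);
    ob->used += n;
}
```

A converted line would be passed to `out_append` as soon as it is ready, with one `out_flush` before the file is closed.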
      
   Robert   
   --   
   Robert AH Prins   
   robert(a)prino(d)org   
      