comp.lang.asm.x86 | Ahh, the lost art of x86 assembly | 4,675 messages
Message 3,146 of 4,675
Robert Prins to Robert Prins
Re: More UTF-8 woes - UTF-8 to "\uN" RTF
01 Dec 17 00:18:06
From: robert@nospicedham.prino.org

On 2017-11-29 19:59, Robert Prins wrote:
> On 2017-11-29 12:32, Terje Mathisen wrote:
>> Robert Prins wrote:
>> Robert, you should be able to do this as a stream process, creating a new line
>> to replace one that needs translation, discarding it if nothing was changed:
>
> Hammer & nail!
>
> I realized that while pulling in the actual in-line code this afternoon, and
> while doing so I also found out that some of my UTF-8 to "\Un" conversions break
> the 255-byte limit on the ancient Pascal strings, so I will have to go back to
> the drawing board, not just to figure out how to detect bytes with the two top
> bits set (I guess a variation on Alan Mycrofts "#has_nullbyte") and where they
> are set, and how to check YMM registers for zero, but also how to handle strings
> > 255 bytes, although that should be pretty straight-forward, given that I'm
> already bypassing Write(Ln) and moving most of my output data directly into
> extended buffer added to the output file.
>
>> As you note, it should be trivial to convert utf8-encoded data to \U+decimal
>> value, you just need to find them:
>>
>> A SIMD approach where you load 16 bytes and look for any starting utf8
>> character should be easy, you just need to look for bytes in a short range.

For now just one-by-one, with a test eax >= 194 (I know my data is valid)

>> If you find at least one, start by copying everything up to this point, then
>> go into decoding mode and write the results back out as \U escapes.
>
> For historical reasons, the PL/I version of the same program on z/OS uses FBA
> datasets, I generate all output with a leading blank (and boxes with the
> original IBM CP437 characters), so I might just as well do the UTF-8 decoding
> during this phase!
>
>> You should be easily able to get the speed up to at least a GB/s, so a couple
>> of ms for the ~2MB of input data.
>
> The current 20ms isn't too shabby, but just as I like to sit in fast cars(*), I
> like my code to be fast. ;)

Moved the conversion code in-stream, at the stage I'm shifting the data to the
left, and the time for converting the 2Mb file out of the main lift program into
.RTF format has gone down from 20ms to between 2 and 3 ms. Going further down to
using the YMM registers (don't want to use the XMM's as the transit from YMM to
XMM and back to YMM apparently takes a long time) might make things a bit
faster, but the 85-90% reduction of running time is, at least for now,
acceptable. ;)

And yes, I'm moving data from the readln() read-in line directly to the buffer
of the output file, completely bypassing the writeln RTL procedure.

Robert
--
Robert AH Prins
robert(a)prino(d)org

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
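
For anyone following along without the assembly at hand, here is a rough
Python sketch of the approach described in this thread: copy plain ASCII
straight through, and when a lead byte (>= 194) turns up, decode the sequence
and write it back out as an RTF \uN escape. This is not Robert's code. The
names utf8_to_rtf, has_top_two_bits and the "?" fallback character are made up
for illustration, the input is assumed to be valid UTF-8 (as stated in the
post), and escaping of the RTF specials \ { } is deliberately left out. As far
as I know \uN takes a signed 16-bit decimal, so values above 32767 go out as
negative numbers and code points above U+FFFF as a surrogate pair.

```python
HIGH_BITS = 0x8080808080808080  # bit 7 of every byte in a 64-bit word


def has_top_two_bits(chunk: bytes) -> bool:
    """True if any byte of an 8-byte chunk has both top bits set (11xxxxxx).

    Word-at-a-time test in the spirit of the "has null byte" trick:
    shifting left by one lines bit 6 up with bit 7 of the same byte.
    """
    word = int.from_bytes(chunk, "little")
    return (word & (word << 1) & HIGH_BITS) != 0


def _utf16_units(cp: int):
    """Split a code point into one or two UTF-16 code units."""
    if cp < 0x10000:
        return (cp,)
    cp -= 0x10000
    return (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF))


def utf8_to_rtf(line: bytes, fallback: bytes = b"?") -> bytes:
    """Copy ASCII through; rewrite each UTF-8 sequence as \\uN + fallback."""
    out = bytearray()
    i, n = 0, len(line)
    while i < n:
        b = line[i]
        if b < 0xC2:  # assumption: valid input, so this byte is plain ASCII
            out.append(b)
            i += 1
            continue
        # Lead byte: the top bits give the sequence length.
        if b < 0xE0:
            length, cp = 2, b & 0x1F
        elif b < 0xF0:
            length, cp = 3, b & 0x0F
        else:
            length, cp = 4, b & 0x07
        for cont in line[i + 1 : i + length]:
            cp = (cp << 6) | (cont & 0x3F)  # fold in 6 bits per continuation
        i += length
        for unit in _utf16_units(cp):
            if unit > 32767:  # \uN is written as a signed 16-bit value
                unit -= 65536
            out += b"\\u%d" % unit + fallback
    return bytes(out)


if __name__ == "__main__":
    print(utf8_to_rtf("café €".encode("utf-8")))  # b'caf\\u233? \\u8364?'
```

The output grows by several bytes per converted character, which is the same
effect that pushes converted lines past the 255-byte Pascal string limit
mentioned above. Writing into a growable buffer instead of a fixed-length
string sidesteps that here.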