|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
|    Message 3,130 of 4,675    |
|    Robert Prins to Terje Mathisen    |
|    Re: More UTF-8 woes - UTF-8 to "\uN" RTF    |
|    29 Nov 17 19:59:56    |
From: robert@nospicedham.prino.org

On 2017-11-29 12:32, Terje Mathisen wrote:
> Robert Prins wrote:
>> I've got working code to convert UTF-8 to "\uN" RTF escape sequences
>> (as the format used by LibreOffice is simply way too hard to
>> recreate).
> [snip]
>> Needlessly to say, the amount of data having to be shifted around
>> would be pretty horrible - input file currently contains 11,869 (167
>> char + crlf) lines with "only" 4,876 of them containing UTF-8
>> encoded characters (for a total of 10,787 UTF-8 characters)
> [snip]
>> However, even with this minor optimization, I'm left with the
>> problem that this process has to be repeated for every line, and
>> includes the repeated construction of the '\Un ' escape. There are as
>> many as there are UTF-8 characters (> 1,000,000) so pre-storing them
>> is likely to be more expensive than computing them on-the-fly.
>>
>> And yes, the actual code that currently does the conversion takes all
>> of 0.2 seconds, but given that it takes only 0.14 seconds to convert
>> the other 7 files, that's a bit too long. ;)
>>
>> So, any suggestions as to how I can speed up the process?
>
> Robert, you should be able to do this as a stream process, creating a
> new line to replace one that needs translation, discarding it if
> nothing was changed:

Hammer & nail!

I realized that while pulling in the actual in-line code this afternoon, and
while doing so I also found out that some of my UTF-8 to "\uN" conversions
break the 255-byte limit on the ancient Pascal strings. So I will have to go
back to the drawing board, not just to figure out how to detect bytes with the
two top bits set (I guess a variation on Alan Mycroft's "#has_nullbyte") and
where they are set, and how to check YMM registers for zero, but also how to
handle strings > 255 bytes. That last part should be pretty straightforward,
given that I'm already bypassing Write(Ln) and moving most of my output data
directly into an extended buffer added to the output file.

> As you note, it should be trivial to convert utf8-encoded data to
> \U+decimal value, you just need to find them:
>
> A SIMD approach where you load 16 bytes and look for any starting utf8
> character should be easy, you just need to look for bytes in a short
> range.
>
> If you find at least one, start by copying everything up to this point,
> then go into decoding mode and write the results back out as \U escapes.

For historical reasons (the PL/I version of the same program on z/OS uses FBA
datasets), I generate all output with a leading blank (and boxes with the
original IBM CP437 characters), so I might just as well do the UTF-8 decoding
during this phase!

> You should be easily able to get the speed up to at least a GB/s, so a
> couple of ms for the ~2MB of input data.

The current 20ms isn't too shabby, but just as I like to sit in fast cars(*), I
like my code to be fast. ;)

Robert

(*) Highest top speed: 300 km/h in a Mercedes AMG E63 (in Poland!), highest
average speed 196.1 km/h (304 km in 1:33) in an Audi RS7 Quattro (obviously in
Germany).

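A rough C sketch of the copy-then-decode stream idea discussed above, using a
64-bit integer variation of the Mycroft-style mask in place of a 16-byte SIMD
load. This is not the code from the thread: the names lead_byte_mask,
put_escape and utf8_to_rtf are hypothetical, the input is assumed to be
well-formed UTF-8, and the trailing space after each "\uN" assumes the RTF
document sets \uc0 so that no fallback character has to follow. Escaping of
the RTF specials (backslash and braces) is left out.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Nonzero if any of the 8 bytes in w has both top bits set (>= 0xC0),
   i.e. is a UTF-8 lead byte. Shifting left by one moves bit 6 of each
   byte onto bit 7; the bit that crosses into the next byte lands on
   bit 0 there and is masked away. */
uint64_t lead_byte_mask(uint64_t w)
{
    return w & (w << 1) & 0x8080808080808080ULL;
}

/* Write one UTF-16 code unit as "\uN " where N is a signed 16-bit
   decimal (values above 32767 become negative). Returns the number of
   characters written; sprintf also stores a terminating NUL. */
size_t put_escape(char *dst, unsigned unit)
{
    int n = unit > 32767u ? (int)unit - 65536 : (int)unit;
    return (size_t)sprintf(dst, "\\u%d ", n);
}

/* Convert len bytes of UTF-8 from src into dst and return the output
   length (no NUL is counted). Assumption: dst has room for about five
   bytes per input byte plus one, which covers the worst case of a
   4-byte sequence expanding into two 9-character escapes. */
size_t utf8_to_rtf(const unsigned char *src, size_t len, char *dst)
{
    size_t i = 0, o = 0;

    while (i < len) {
        /* Fast path: copy 8 bytes at a time while no lead byte shows up. */
        while (len - i >= 8) {
            uint64_t w;
            memcpy(&w, src + i, 8);
            if (lead_byte_mask(w))
                break;
            memcpy(dst + o, src + i, 8);
            i += 8;
            o += 8;
        }
        if (i == len)
            break;

        unsigned char b = src[i];
        if (b < 0xC0) {                 /* plain byte: copy as-is */
            dst[o++] = (char)b;
            i++;
            continue;
        }

        /* Decoding mode: sequence length follows from the lead byte. */
        size_t need = b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        if (need > len - i)             /* truncated sequence: stop */
            break;

        unsigned long cp = b & (0xFFu >> (need + 1));
        for (size_t k = 1; k < need; k++)
            cp = (cp << 6) | (src[i + k] & 0x3Fu);
        i += need;

        if (cp >= 0x10000UL) {          /* outside the BMP: surrogate pair */
            cp -= 0x10000UL;
            o += put_escape(dst + o, 0xD800u + (unsigned)(cp >> 10));
            o += put_escape(dst + o, 0xDC00u + (unsigned)(cp & 0x3FFu));
        } else {
            o += put_escape(dst + o, (unsigned)cp);
        }
    }
    return o;
}
```

Since a 2-byte sequence can grow to 7 output bytes and a 3-byte sequence to 9,
a 167-character line with enough non-ASCII characters can exceed 255 bytes,
which is why the sketch writes into a caller-supplied buffer rather than a
length-prefixed string. The same mask test extends to wider registers; on AVX
hardware, VPTEST of a YMM register against itself sets ZF when the register is
all zero.
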
--
Robert AH Prins
robert(a)prino(d)org

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
(c) 1994, bbs@darkrealms.ca