Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
|    Message 3,017 of 4,675    |
|    Robert Prins to Terje Mathisen    |
|    Re: Translating blocks of memory contain    |
|    11 Oct 17 11:25:23    |
From: robert@nospicedham.prino.org

On 2017-10-11 07:29, Terje Mathisen wrote:
> Robert Wessel wrote:
>> On Mon, 9 Oct 2017 21:38:32 +0000, Robert Prins
>>> However, given that I'm now also working with UTF8 data inside the
>>> box (no pun intended), I really don't want to translate box
>>> characters that are the first character of UTF8 characters. I can
>>> (obviously) naively test for all of the valid
>>> boxchar-followed-by-boxchar variations, but I wonder if there is a
>>> faster way of doing this?
>>>
>>> Yes, the obvious thing to do is to no longer use CP437... However,
>>> that would make detecting the first and final lines of boxes rather
>>> a lot harder, so I'm not really ready for that.
>>
>> I don't think you can completely unambiguously tell the two apart,
>> but I suspect you'd be better off testing to see if something is a
>> valid UTF-8 sequence (instead of box/linedraw characters), which I
>> think would be a tighter test.
>>
>
> I agree: First you detect that you are within a text box, you do that by
> first seeing the starting line and then at the same offsets you should
> expect vertical bar lines before and after the text, right?
>
> If you cannot have anything past the text box then it becomes much easier,
> otherwise you need to verify the initial guess by decoding the internal
> text and count the number of utf8 characters.
>
> The latter is actually quite easy to do since all such multi-byte utf8 chars
> will have a prefix from a small range of chars, followed by zero or more
> intermediate chars and a final character which is always from a separate set
> than those previous. This feature of the encoding is so that you can always
> know where you are if you start at a random byte of a long utf8 text.

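[Editorial aside on the quoted suggestion to test for a valid UTF-8 sequence. This is a minimal C sketch, not code from the thread. The helper name utf8_multibyte_len is hypothetical. It checks only the lead-byte and continuation-byte bit patterns and deliberately does not reject overlong forms or surrogates, so treat it as a starting point. It is relevant here because the CP437 box characters from 0xC0 upwards overlap the UTF-8 two-byte lead bytes, while 0xB3 (the vertical bar) has the bit pattern of a continuation byte.]

```c
#include <stddef.h>

/* Hypothetical helper, not from the thread.
 * Returns the length (2..4) of a multi-byte UTF-8 sequence starting at p,
 * or 0 if the bytes at p do not form one. Only the bit patterns are
 * checked: overlong encodings and surrogates are NOT rejected. */
static size_t utf8_multibyte_len(const unsigned char *p, size_t avail)
{
    size_t len, i;

    if (avail == 0)
        return 0;

    if ((p[0] & 0xE0) == 0xC0)        /* 110xxxxx: lead byte of 2 */
        len = 2;
    else if ((p[0] & 0xF0) == 0xE0)   /* 1110xxxx: lead byte of 3 */
        len = 3;
    else if ((p[0] & 0xF8) == 0xF0)   /* 11110xxx: lead byte of 4 */
        len = 4;
    else
        return 0;                     /* ASCII, or a stray continuation byte */

    if (len > avail)
        return 0;                     /* sequence would run past the buffer */

    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)    /* every trailing byte must be 10xxxxxx */
            return 0;

    return len;
}
```

[For example, the byte pair 0xC3 0xB3 is both a CP437 "boxchar-followed-by-boxchar" pair and the UTF-8 encoding of U+00F3, so the helper reports a 2-byte sequence. A lone 0xB3 followed by a space is reported as not UTF-8.]
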
Knowing that a vertical box char will always be preceded by a space, I've opted
for a quick and dirty solution:

        mov ecx, reasd_bytes
        mov edx, buffer

@01:
        movzx eax, byte ptr [edx]

      only lines containing UTF8 characters must start with a vertical bar!

        cmp al, "│"
        je @04

@02:
        mov al, byte ptr [eax + offset xlat]
        mov [edx], al
        inc edx
        dec ecx
        jnz @01
        jmp @05

@03:
        movzx eax, byte ptr [edx]

      back to the translate-all loop at EOL

        cmp al, $0d
        je @02

        cmp al, "│"
        jne @06

        cmp byte ptr [edx - 1], " "
        jne @06

@04:
        mov byte ptr [edx], "|"

@06:
        inc edx
        dec ecx
        jnz @03

@05:

Maybe not the most optimal solution, but sweet and simple, and for the data in
this particular file it works like a charm.

Robert
--
Robert AH Prins
robert(a)prino(d)org

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
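[Editorial aside: the two loops in the post can be read as a small two-state machine. The following is a C sketch of that reading, not the author's code. The function name translate_buffer, the in_text_line flag and the CP437_VBAR constant are hypothetical names introduced here. The contents of the xlat table are not shown in the thread, so the table is taken as a caller-supplied parameter. 0xB3 is assumed to be the byte value behind the vertical bar literal, as it is in CP437.]

```c
#include <stddef.h>

#define CP437_VBAR 0xB3u   /* vertical bar box character in CP437 */

/* Sketch of the posted loop as a two-state machine.
 * State 0 (labels @01/@02): translate every byte through xlat.
 * State 1 (labels @03..@06): leave bytes alone, except that a bar
 * preceded by a space becomes an ASCII '|'. */
static void translate_buffer(unsigned char *buf, size_t n,
                             const unsigned char xlat[256])
{
    int in_text_line = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (!in_text_line) {
            if (c == CP437_VBAR) {
                /* A bar seen while translating switches to text mode. The
                 * post relies on this happening only at the start of a
                 * line that contains UTF-8 text. */
                buf[i] = '|';
                in_text_line = 1;
            } else {
                buf[i] = xlat[c];
            }
        } else if (c == 0x0D) {
            /* CR: translate it and fall back to translate-all mode. */
            buf[i] = xlat[c];
            in_text_line = 0;
        } else if (c == CP437_VBAR && i > 0 && buf[i - 1] == ' ') {
            /* Bar preceded by a space: treated as a box edge. */
            buf[i] = '|';
        }
        /* Every other byte in text mode is left untouched, so a 0xB3
         * that follows a UTF-8 lead byte survives. */
    }
}
```

[Unlike the assembly, which decrements ecx before testing it, this version also copes with an empty buffer. Like the original, it is a heuristic that depends on the layout of the data and is not a general UTF-8 detector.]
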
(c) 1994, bbs@darkrealms.ca