

   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   


   Message 3,017 of 4,675   
   Robert Prins to Terje Mathisen   
   Re: Translating blocks of memory contain   
   11 Oct 17 11:25:23   
   
   From: robert@nospicedham.prino.org   
      
   On 2017-10-11 07:29, Terje Mathisen wrote:   
   > Robert Wessel wrote:   
   >> On Mon, 9 Oct 2017 21:38:32 +0000, Robert Prins   
   >>> However, given that I'm now also working with UTF8 data inside the   
   >>> box (no pun intended), I really don't want to translate box   
   >>> characters that are the first character of UTF8 characters. I can   
   >>> (obviously) naively test for all of the valid   
   >>> boxchar-followed-by-boxchar variations, but I wonder if there is a   
   >>> faster way of doing this?   
   >>>   
   >>> Yes, the obvious thing to do is to no longer use CP437... However,   
   >>> that would make detecting the first and final lines of boxes rather   
   >>> a lot harder, so I'm not really ready for that.   
   >>   
   >> I don't think you can completely unambiguously tell the two apart,   
   >> but I suspect you'd be better off testing to see if something is a   
   >> valid UTF-8 sequence (instead of box/linedraw characters), which I   
   >> think would be a tighter test.
   >>   
   >   
   > I agree: First you detect that you are within a text box, you do that by
   > first seeing the starting line and then at the same offsets you should
   > expect vertical bar lines before and after the text, right?
   >   
   > If you cannot have anything past the text box then it becomes much easier,
   > otherwise you need to verify the initial guess by decoding the internal
   > text and counting the number of utf8 characters.
   >   
   > The latter is actually quite easy to do since all such multi-byte utf8 chars   
   > will have a prefix from a small range of chars, followed by zero or more   
   > intermediate chars and a final character which is always from a separate set   
   > than those previous. This feature of the encoding is so that you can always
   > know where you are if you start at a random byte of a long utf8 text.
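
   To make the "is this a valid UTF-8 sequence?" test suggested above concrete,
   here is a minimal sketch in C rather than assembly. It assumes the standard
   UTF-8 byte ranges (lead bytes 0xC2..0xF4, continuation bytes 0x80..0xBF);
   the function name utf8_multibyte_len is hypothetical and not from this
   thread. It is a structural check only, so overlong forms and surrogates are
   not rejected.

```c
#include <stddef.h>

/* Hypothetical helper, not from the thread.
 * Returns the length (2..4) of a structurally valid multi-byte UTF-8
 * sequence starting at p, or 0 if the bytes at p do not form one.
 * Structural check only: overlong encodings and surrogates pass. */
static size_t utf8_multibyte_len(const unsigned char *p, size_t avail)
{
    size_t len, i;

    if (avail == 0)
        return 0;

    /* Classify the lead byte. */
    if (p[0] >= 0xC2 && p[0] <= 0xDF)
        len = 2;
    else if (p[0] >= 0xE0 && p[0] <= 0xEF)
        len = 3;
    else if (p[0] >= 0xF0 && p[0] <= 0xF4)
        len = 4;
    else
        return 0;               /* ASCII, continuation byte, or invalid lead */

    if (avail < len)
        return 0;               /* sequence would run past the buffer */

    /* Every following byte must be a continuation byte (10xxxxxx). */
    for (i = 1; i < len; i++)
        if (p[i] < 0x80 || p[i] > 0xBF)
            return 0;

    return len;
}
```

   For example, the UTF-8 encoding of "ó" is the byte pair C3 B3, which read
   as CP437 is "├" followed by "│". The helper reports a 2-byte sequence for
   that pair, while a lone B3 followed by a space is reported as 0, because
   B3 can only be a continuation byte in UTF-8.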
      
   Knowing that a vertical box char will always be preceded by a space, I've opted
   for a quick and dirty solution:
      
      mov   ecx, reasd_bytes   
      mov   edx, buffer   
      
   @01:   
      movzx eax, byte ptr [edx]   
      
   // only lines containing UTF8 characters must start with a vertical bar!
      
      cmp   al, "│"   
      je    @04   
      
   @02:   
      mov   al, byte ptr [eax + offset xlat]   
      mov   [edx], al   
      inc   edx   
      dec   ecx   
      jnz   @01   
      jmp   @05   
      
   @03:   
      movzx eax, byte ptr [edx]   
      
   // back to the translate-all loop at EOL
      
      cmp   al, $0d   
      je    @02   
      
      cmp   al, "│"   
      jne   @06   
      
      cmp   byte ptr [edx - 1], " "   
      jne   @06   
      
   @04:   
      mov   byte ptr [edx], "|"   
      
   @06:   
      inc   edx   
      dec   ecx   
      jnz   @03   
      
   @05:   
      
   Maybe not the most optimal solution, but sweet and simple, and for the data in   
   this particular file it works like a charm.   
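
   For readers who do not follow the assembly easily, the same control flow
   can be sketched in C. This is only an illustration of the loop above, not
   the code actually in use: the function name translate_boxes and the
   CP437_VBAR constant are hypothetical, and the table is passed in as a
   parameter instead of being a global xlat.

```c
#include <stddef.h>

#define CP437_VBAR 0xB3         /* the vertical bar box character in CP437 */

/* Hypothetical C rendering of the assembly loop above.
 * Outside a boxed line every byte goes through the translation table.
 * A vertical bar switches to "boxed line" mode, where only vertical bars
 * preceded by a space are replaced, so UTF-8 bytes are left untouched.
 * A CR ends the boxed line and is itself translated via the table. */
static void translate_boxes(unsigned char *buf, size_t n,
                            const unsigned char xlat[256])
{
    size_t i = 0;

    while (i < n) {
        if (buf[i] != CP437_VBAR) {     /* @01/@02: translate-all mode */
            buf[i] = xlat[buf[i]];
            i++;
            continue;
        }

        buf[i++] = '|';                 /* @04: bar that opens a boxed line */

        while (i < n && buf[i] != 0x0D) {   /* @03/@06: boxed-line mode */
            if (buf[i] == CP437_VBAR && buf[i - 1] == ' ')
                buf[i] = '|';
            i++;
        }
        /* On CR the outer loop takes over and translates it via xlat. */
    }
}
```

   With a table that maps 0xC3 to "+" and 0xC4 to "-", a top border such as
   C3 C4 is translated to "+-", while inside a boxed line the pair C3 B3
   (UTF-8 "ó") is left alone because its B3 is not preceded by a space.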
      
   Robert   
   --   
   Robert AH Prins   
   robert(a)prino(d)org   
      



(c) 1994,  bbs@darkrealms.ca