

   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   


   Message 3,015 of 4,675   
   Terje Mathisen to Robert Wessel   
   Re: Translating blocks of memory contain   
   11 Oct 17 09:29:35   
   
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Robert Wessel wrote:   
   > On Mon, 9 Oct 2017 21:38:32 +0000, Robert Prins   
   >> However, given that I'm now also working with UTF8 data inside the   
   >> box (no pun intended), I really don't want to translate box   
   >> characters that are the first character of UTF8 characters. I can   
   >> (obviously) naively test for all of the valid   
   >> boxchar-followed-by-boxchar variations, but I wonder if there is a   
   >> faster way of doing this?   
   >>   
   >> Yes, the obvious thing to do is to no longer use CP437... However,   
   >> that would make detecting the first and final lines of boxes rather   
   >> a lot harder, so I'm not really ready for that.   
   >   
   > I don't think you can completely unambiguously tell the two apart,   
   > but I suspect you'd be better off testing to see if something is a   
   > valid UTF-8 sequence (instead of box/linedraw characters), which I   
   > think would be a tighter test.   
   >   
      
   I agree: First you detect that you are within a text box. You do that by   
   first seeing the starting line; then, at the same offsets, you should   
   expect vertical bar characters before and after the text, right?   
      
   If you cannot have anything past the text box then it becomes much   
   easier; otherwise you need to verify the initial guess by decoding the   
   internal text and counting the number of utf8 characters.   
      
   The latter is actually quite easy to do since every multi-byte utf8   
   char starts with a lead byte from a small range (0xC2-0xF4), followed   
   by one to three continuation bytes, which always come from a separate   
   range (0x80-0xBF) that never overlaps the lead bytes. This feature of   
   the encoding is so that you can always know where you are if you start   
   at a random byte of a long utf8 text.   
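
The counting step described above could be sketched roughly as follows, in C rather than assembly to keep it short. This is only an illustration: the function names `utf8_trail_count` and `utf8_count_chars` are made up for this example, and the check is structural only (lead byte plus the right number of continuation bytes), so it does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Number of continuation bytes implied by a UTF-8 lead byte:
   0 for plain ASCII, -1 if the byte cannot start a sequence.
   (Hypothetical helper, named for this sketch only.) */
static int utf8_trail_count(unsigned char b)
{
    if (b < 0x80) return 0;                  /* ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 1;    /* 2-byte sequence */
    if (b >= 0xE0 && b <= 0xEF) return 2;    /* 3-byte sequence */
    if (b >= 0xF0 && b <= 0xF4) return 3;    /* 4-byte sequence */
    return -1;                               /* 0x80..0xC1, 0xF5..0xFF */
}

/* Count the characters in buf[0..len), or return -1 as soon as the
   bytes stop looking like UTF-8. Structural check only: overlong
   forms and surrogates are not rejected. */
long utf8_count_chars(const unsigned char *buf, size_t len)
{
    long count = 0;
    size_t i = 0;

    while (i < len) {
        int trail = utf8_trail_count(buf[i]);
        if (trail < 0)
            return -1;                       /* not a valid lead byte */
        i++;
        for (; trail > 0; trail--, i++) {
            /* Continuation bytes are always 10xxxxxx (0x80..0xBF). */
            if (i >= len || (buf[i] & 0xC0) != 0x80)
                return -1;
        }
        count++;
    }
    return count;
}
```

Run over the bytes between the two vertical bars, a result of -1 suggests CP437 rather than utf8, and a non-negative result gives the character count to compare against the box width. It also shows why the test cannot be completely unambiguous: the CP437 pair 0xC9 0xBB (top-left corner followed by top-right corner) passes as one valid two-byte sequence, while a run such as 0xCD 0xCD (two horizontal double bars) is rejected.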
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca