|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
|    Message 3,015 of 4,675    |
|    Terje Mathisen to Robert Wessel    |
|    Re: Translating blocks of memory contain    |
|    11 Oct 17 09:29:35    |
From: terje.mathisen@nospicedham.tmsw.no

Robert Wessel wrote:
> On Mon, 9 Oct 2017 21:38:32 +0000, Robert Prins
>> However, given that I'm now also working with UTF8 data inside the
>> box (no pun intended), I really don't want to translate box
>> characters that are the first character of UTF8 characters. I can
>> (obviously) naively test for all of the valid
>> boxchar-followed-by-boxchar variations, but I wonder if there is a
>> faster way of doing this?
>>
>> Yes, the obvious thing to do is to no longer use CP437... However,
>> that would make detecting the first and final lines of boxes rather
>> a lot harder, so I'm not really ready for that.
>
> I don't think you can completely unambiguously tell the two apart,
> but I suspect you'd be better off testing to see if something is a
> valid UTF-8 sequence (instead of box/linedraw characters), which I
> think would be a tighter test.
>

I agree: first you detect that you are within a text box. You do that by
first seeing the starting line, and then at the same offsets you should
expect vertical bar characters before and after the text, right?

If you cannot have anything past the text box then it becomes much
easier; otherwise you need to verify the initial guess by decoding the
internal text and counting the number of utf8 characters.

The latter is actually quite easy to do, since every multi-byte utf8
char starts with a lead byte from a small range (0xC2-0xF4) and is
followed by one to three continuation bytes, all of which come from a
separate range (0x80-0xBF) that never overlaps the lead bytes. This
feature of the encoding is so that you can always know where you are if
you start at a random byte of a long utf8 text.

Terje
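To make the "is this a valid UTF-8 sequence?" test concrete, here is a minimal sketch in Python rather than x86 assembly. It is only an illustration of the idea discussed above, not anyone's actual code: the names utf8_seq_len and count_chars are made up for this example, and it assumes that any byte which does not start a well-formed UTF-8 sequence counts as one single-byte CP437 character.

```python
def utf8_seq_len(buf: bytes, i: int) -> int:
    """Return the length (2..4) of the well-formed multi-byte UTF-8
    sequence starting at buf[i], or 0 if there is none."""
    lead = buf[i]
    if 0xC2 <= lead <= 0xDF:            # lead byte of a 2-byte sequence
        need, lo, hi = 1, 0x80, 0xBF
    elif 0xE0 <= lead <= 0xEF:          # lead byte of a 3-byte sequence
        need = 2
        lo = 0xA0 if lead == 0xE0 else 0x80   # reject overlong forms
        hi = 0x9F if lead == 0xED else 0xBF   # reject UTF-16 surrogates
    elif 0xF0 <= lead <= 0xF4:          # lead byte of a 4-byte sequence
        need = 3
        lo = 0x90 if lead == 0xF0 else 0x80   # reject overlong forms
        hi = 0x8F if lead == 0xF4 else 0xBF   # reject values above U+10FFFF
    else:
        return 0                        # ASCII, continuation or invalid lead

    tail = buf[i + 1 : i + 1 + need]
    if len(tail) < need or not (lo <= tail[0] <= hi):
        return 0
    if any(not (0x80 <= b <= 0xBF) for b in tail[1:]):
        return 0
    return need + 1


def count_chars(buf: bytes) -> int:
    """Count characters, treating each well-formed UTF-8 sequence as one
    character and every other byte as one CP437 character (assumption)."""
    i = n = 0
    while i < len(buf):
        i += utf8_seq_len(buf, i) or 1
        n += 1
    return n
```

With this, a row of horizontal bars such as b"\xc4\xc4" is not mistaken for UTF-8, because 0xC4 is a lead byte but not a continuation byte. A pair like b"\xc4\xb3" (horizontal bar followed by vertical bar in CP437) is also well-formed UTF-8, which is the ambiguity Robert Wessel mentions. Comparing count_chars() of an interior line against the width of the box's top line is one way to check the initial guess.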
(c) 1994, bbs@darkrealms.ca