From: cr88192@gmail.com   
      
   On 10/31/2025 9:48 AM, Anton Ertl wrote:   
   > Michael S writes:   
   >> On Thu, 30 Oct 2025 22:19:18 GMT   
   >> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
   >>> My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style   
   >>> instructions, were already present in EV4.   
   > ...   
   >> I didn't consider these instructions as SIMD. May be, I should have.   
   >   
   > They definitely are, but they were not touted as such at the time, and   
   > they use the GPRs, unlike most SIMD extensions to instruction sets.   
   >   
   >> Looks like these instructions are intended to accelerated string   
   >> processing. That's unusual for the first wave of SIMD extensions.   
   >   
   > Yes. This was pre-first-wave. The Alpha architects just wanted to   
   > speed up some common operations that would otherwise have been   
   > relatively slow thanks to Alpha initially not having BWX instructions.   
   > Ironically, when Alpha showed a particularly good result on some   
   > benchmark (maybe Dhrystone), someone claimed that these string   
   > instructions gave Alpha an unfair advantage.   
   >   
      
   Most likely Dhrystone:   
   It shows disproportionate impact from the relative speed of things like   
   "strcmp()" and integer divide.   
      
      
   I had experimented with special instructions for packed search, which
   could be used to help either with string compare or with implementing
   dictionary objects in my usual way.
      
      
   Though, I had later fallen back to a more generic way of implementing
   "strcmp()" that allows a fairer comparison between my own ISA and
   RISC-V. There, one instead makes the determination based on how
   efficiently the ISA can handle various pieces of C code (rather than
   on the use of niche instructions that typically require hand-written
   ASM or similar).
      
      
      
   Generally, it makes more sense to use helper instructions that have a
   general impact on performance; say, for example, affecting how quickly
   a new image can be drawn into VRAM.
      
   For example, in my GUI experiments:   
   Most of the programs are redrawing the screen as, say, 320x200 RGB555.
      
   Well, except ROTT, which uses 384x200 8-bit, on top of a bunch of code   
   to mimic planar VGA behavior. In this case, for the port it was easier   
   to write wrapper code to fake the VGA weirdness than to try to rewrite   
   the whole renderer to work with a normal linear framebuffer (like what   
   Doom and similar had used).   
      
      
   In a lot of the cases, I was using an 8-bit indexed color or color-cell
   mode. For indexed color, one needs to send each image through a palette
   conversion (to the OS color palette), or run a color-cell encoder.
   This was mostly because the display HW used 128K of VRAM.
      
   And, even if RAM backed, there are bandwidth problems with going bigger;
   so the higher resolutions typically worked by reducing the bits per pixel:
    320x200: 16 bpp   
    640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)   
    800x600: 2 or 4 bpp color-cell   
    1024x768: 1 bpp monochrome, other experiments (*1)   
    Or, use the 2 bpp mode, for 192K.   
      
   *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes   
   the color);   
   One possibility also being to use an indexed color pair for every 8x8,   
   allowing for a 1.25 bpp color cell mode.   
      
   Though, thus far the 1024x768 mode is still mostly untested on real   
   hardware.   
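
   To make the VRAM pressure concrete, the sizes in the list above follow
   from one line of arithmetic. A trivial sketch (the function name is just
   illustrative):

```c
#include <stdint.h>

/* Bytes of framebuffer needed for a given mode: width * height * bpp / 8.
 * Shows why ~128K of VRAM forces bits-per-pixel down as resolution goes up. */
static uint32_t mode_bytes(uint32_t w, uint32_t h, uint32_t bpp)
{
    return (w * h * bpp) / 8;
}
```

   For example, 320x200 at 16 bpp and 640x400 at 4 bpp both come to 128000
   bytes (just under 128K), 640x400 at 8 bpp comes to 256000 bytes (just
   within 256K), and 1024x768 at 2 bpp comes to 196608 bytes (exactly 192K).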
      
   I had experimented some with special instructions to speed up the
   indexed-color conversion and color-cell encoding, but have mostly gone
   back and forth between using helper instructions and plain C logic,
   without settling on which exact route to take.
      
   At one point I had a helper instruction for "convert 4 RGB555 colors to
   4 indexed colors using a hardware palette", but this broke when I later
   ended up modifying the system palette for better results (which was a
   critical weakness of this approach). Also, the naive strategy of using
   a 32K lookup table isn't great, as the table doesn't fit into the L1
   cache.
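
   The naive 32K-table approach looks something like the following (a
   sketch; the names are illustrative, and a real version would do a
   nearest-color search against the actual palette rather than the RGB332
   stand-in used here):

```c
#include <stdint.h>

/* Naive RGB555 -> palette-index conversion via a 32K lookup table,
 * one byte per possible 15-bit color. The table is 32768 bytes,
 * which is why it tends to thrash a typical 16K-32K L1 data cache. */
static uint8_t rgb555_to_index[32768];

/* Build the table; here we just fold to RGB332 as a stand-in for a
 * real nearest-color search over the OS palette. */
static void build_lut(void)
{
    for (int c = 0; c < 32768; c++) {
        int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
        rgb555_to_index[c] =
            (uint8_t)(((r >> 2) << 5) | ((g >> 2) << 2) | (b >> 3));
    }
}

/* Convert 4 pixels at a time, the granularity the helper instruction
 * had operated at. */
static void conv4(const uint16_t *src, uint8_t *dst)
{
    for (int i = 0; i < 4; i++)
        dst[i] = rgb555_to_index[src[i] & 0x7FFF];
}
```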
      
      
   So, for 4 bpp color cell:
   Generally, each block of 4x4 pixels is understood as 2 RGB555 endpoints
   and 2 selector bits per pixel. Though, in VRAM, 4 of these are packed
   into a logical 8x8 pixel block, rather than in a linear ordering like
   in DXT1 or similar (specifics differ, but the general concept is
   similar to DXT1/S3TC).
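
   A minimal decoder for one such 4x4 cell might look like this (a sketch
   only: the struct layout and names are illustrative, and the real format
   packs four cells per 8x8 block in VRAM rather than storing them like
   this):

```c
#include <stdint.h>

/* DXT1-like 4x4 color cell: two RGB555 endpoints plus 2 selector bits
 * per pixel (2 bits x 16 pixels = 32 bits of selectors). */
typedef struct {
    uint16_t color_a, color_b;  /* RGB555 endpoints */
    uint32_t selectors;         /* 2 bits per pixel, pixel 0 in the LSBs */
} ColorCell4x4;

/* Blend two 5-bit channel values by t/3 (t in 0..3). */
static int lerp5(int a, int b, int t)
{
    return a + ((b - a) * t) / 3;
}

static uint16_t blend_rgb555(uint16_t a, uint16_t b, int t)
{
    int r = lerp5((a >> 10) & 31, (b >> 10) & 31, t);
    int g = lerp5((a >>  5) & 31, (b >>  5) & 31, t);
    int v = lerp5( a        & 31,  b        & 31, t);
    return (uint16_t)((r << 10) | (g << 5) | v);
}

/* Decode one cell into 16 RGB555 pixels (raster order within the 4x4). */
static void decode_cell(const ColorCell4x4 *c, uint16_t out[16])
{
    for (int i = 0; i < 16; i++) {
        int sel = (c->selectors >> (2 * i)) & 3;  /* 0 = color_a .. 3 = color_b */
        out[i] = blend_rgb555(c->color_a, c->color_b, sel);
    }
}
```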
      
   The 2 bpp mode generally has each 8x8 pixel block encoded as 1 bpp in
   raster order (same order as a character cell, with the MSB in the
   top-left corner and the LSB in the lower-right corner), and then
   typically 2x RGB555 over the 8x8 block. IIRC, I had also experimented
   with letting each 4x4 sub-block use a pair of RGB232 colors, but it was
   harder to get good results.
      
   But, to help with this process, it was useful to have helper operations   
   for, say:   
    Map RGB555 values to a luma value;   
    Select minimum and maximum RGB555 values for block;   
    Map luma values to 1 or 2 bit selectors;   
    ...   
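
   In plain C, those helper operations can be sketched roughly as follows
   (the luma weights and rounding are illustrative choices, not the exact
   ones any hardware helper used):

```c
#include <stdint.h>

/* Map an RGB555 value to a cheap integer luma approximation (0..217),
 * weighting green most heavily: 2R + 4G + B. */
static int rgb555_luma(uint16_t c)
{
    int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
    return 2 * r + 4 * g + b;
}

/* Select the minimum- and maximum-luma colors over a block of pixels,
 * to use as the two RGB555 endpoints. */
static void block_minmax(const uint16_t *px, int n,
                         uint16_t *lo, uint16_t *hi)
{
    int ymin = rgb555_luma(px[0]), ymax = ymin;
    *lo = *hi = px[0];
    for (int i = 1; i < n; i++) {
        int y = rgb555_luma(px[i]);
        if (y < ymin) { ymin = y; *lo = px[i]; }
        if (y > ymax) { ymax = y; *hi = px[i]; }
    }
}

/* Map a pixel's luma to a 2-bit selector between the endpoint lumas,
 * with rounding, clamped to 0..3. */
static int luma_to_sel2(int y, int ymin, int ymax)
{
    int t;
    if (ymax <= ymin)
        return 0;
    t = ((y - ymin) * 3 + (ymax - ymin) / 2) / (ymax - ymin);
    return t < 0 ? 0 : (t > 3 ? 3 : t);
}
```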
      
      
   Internally, the GUI mode had worked by drawing everything to an RGB555
   framebuffer (~512K or 1MB) and then using a bitmap to track which
   blocks had been modified and so needed to be re-encoded and sent over
   to VRAM. Partly this worked by first flagging blocks during window
   redraw, then comparing against a previous version of the framebuffer to
   refine the selection of blocks that actually differ, copying blocks
   over as needed to keep these buffers in sync.
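
   The diff-and-flag part of that scheme can be sketched like so (sizes
   and names are illustrative; the real version also merges in the blocks
   flagged during window redraw):

```c
#include <stdint.h>
#include <string.h>

#define SCR_W  320
#define SCR_H  200
#define BLK    8
#define BLKS_X (SCR_W / BLK)
#define BLKS_Y (SCR_H / BLK)

/* One bit per 8x8 pixel block: set = block changed, needs re-encode. */
static uint8_t dirty[(BLKS_X * BLKS_Y + 7) / 8];

static void mark_dirty(int bx, int by)
{
    int idx = by * BLKS_X + bx;
    dirty[idx >> 3] |= (uint8_t)(1 << (idx & 7));
}

static int is_dirty(int bx, int by)
{
    int idx = by * BLKS_X + bx;
    return (dirty[idx >> 3] >> (idx & 7)) & 1;
}

/* Diff the current RGB555 framebuffer against the previously-sent copy,
 * flag each 8x8 block that changed, and copy changed rows into prev so
 * the next frame diffs against the state actually sent to VRAM. */
static void diff_frame(uint16_t *prev, const uint16_t *cur)
{
    for (int by = 0; by < BLKS_Y; by++)
        for (int bx = 0; bx < BLKS_X; bx++)
            for (int y = 0; y < BLK; y++) {
                const uint16_t *c = cur  + (by * BLK + y) * SCR_W + bx * BLK;
                uint16_t       *p = prev + (by * BLK + y) * SCR_W + bx * BLK;
                if (memcmp(p, c, BLK * sizeof(uint16_t)) != 0) {
                    mark_dirty(bx, by);
                    memcpy(p, c, BLK * sizeof(uint16_t));
                }
            }
}
```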
      
   The process wasn't particularly efficient (and performance is
   considerably worse than what Win3.x or Win9x seemed to give).
      
      
      
   As for the packed-search instructions, there were 16-bit versions as
   well, which could be used either to help with UTF-16 operations or for
   dictionary objects.
      
   Where, a common way I implement dictionary objects is to use arrays of   
   16-bit keys with 64-bit values (often tagged values or similar).   
      
   Though this does put a limit on the maximum number of unique symbols
   that can be used as dictionary keys, it is not often an issue in
   practice. Generally these are not QNames or C function names, which
   somewhat reduces the risk of running out of symbol names.
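
   The dictionary layout and the search the packed-search instruction
   would accelerate can be sketched as below (all names are illustrative;
   the 4-wide variant emulates in plain C roughly what a 16-bit packed
   compare could do in one step):

```c
#include <stdint.h>

/* Dictionary object: parallel arrays of 16-bit symbol-ID keys and
 * 64-bit (tagged) values; the 16-bit key caps unique symbols at 65536. */
typedef struct {
    uint16_t *keys;
    uint64_t *vals;
    int       count;
} ObjDict;

/* Plain linear search; returns the index of key, or -1 if absent. */
static int dict_find(const ObjDict *d, uint16_t key)
{
    for (int i = 0; i < d->count; i++)
        if (d->keys[i] == key)
            return i;
    return -1;
}

/* Check 4 keys per step against the key replicated into each 16-bit
 * lane of a 64-bit word; a zero lane after XOR marks a hit. This is
 * the software analog of a 16-bit packed-search instruction. */
static int dict_find4(const ObjDict *d, uint16_t key)
{
    uint64_t pat = key * 0x0001000100010001ULL;
    int i = 0;
    for (; i + 4 <= d->count; i += 4) {
        uint64_t lanes = (uint64_t)d->keys[i]
                       | ((uint64_t)d->keys[i + 1] << 16)
                       | ((uint64_t)d->keys[i + 2] << 32)
                       | ((uint64_t)d->keys[i + 3] << 48);
        uint64_t x = lanes ^ pat;
        for (int j = 0; j < 4; j++)
            if (((x >> (16 * j)) & 0xFFFF) == 0)
                return i + j;
    }
    for (; i < d->count; i++)       /* leftover tail, scalar */
        if (d->keys[i] == key)
            return i;
    return -1;
}
```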
      
   One can also differ, though, on how much sense it makes to have
   ISA-level helpers for working with tagrefs and similar (or to get the
   ABI involved in these matters, e.g. defining in the ABI the encodings
   for things like fixnum/flonum/etc).
      
   ...   
      
      
   > - anton   
      