
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   


   Message 130,335 of 131,241   
   BGB to Robert Finch   
   Re: Tonights Tradeoff (3/3)   
   22 Nov 25 04:54:00   
   
   [continued from previous message]   
      
        + Significantly faster than Adaptive Huffman   
        + Significantly faster than Range Coding   
        - Slower for large data and worse compression vs Huffman.   
      
   Where, STF+AdRice is mostly:   
      Have a table of symbols;   
      Whenever a symbol is encoded, swap it forwards;   
        Next time, it may potentially be encoded with a smaller index.   
      Encode indices into table using Adaptive Rice Codes.   
   Or, basically, using a lookup table to allow AdRice to pretend to be   
   Huffman. Also reasonably fast and simple.   
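
   Roughly, the swap-towards-front side can be sketched like this (a
   generic sketch of the idea; the struct and function names are made
   up, not the exact UPIC table logic):

```c
#include <assert.h>
#include <stdint.h>

/* Generic sketch of the swap-towards-front idea: symbols live in a
   table, each coded symbol emits its current index and then swaps one
   slot towards the front, so frequent symbols drift to small indices
   over time. */
typedef struct {
    uint8_t sym[256];  /* index -> symbol */
    uint8_t pos[256];  /* symbol -> index (inverse mapping) */
} StfTable;

static void stf_init(StfTable *t) {
    for (int i = 0; i < 256; i++) {
        t->sym[i] = (uint8_t)i;
        t->pos[i] = (uint8_t)i;
    }
}

/* Returns the index to be entropy-coded, then swaps the symbol
   forwards by one slot. */
static int stf_encode(StfTable *t, uint8_t s) {
    int i = t->pos[s];
    if (i > 0) {
        uint8_t other = t->sym[i - 1];
        t->sym[i - 1] = s;
        t->sym[i]     = other;
        t->pos[s]     = (uint8_t)(i - 1);
        t->pos[other] = (uint8_t)i;
    }
    return i;
}

/* Bit cost of a Rice code for value v with parameter k:
   (v >> k) unary bits, a stop bit, and k remainder bits.
   An adaptive variant would nudge k up or down based on the size
   of the unary part (v >> k). */
static int rice_len(int v, int k) { return (v >> k) + 1 + k; }
```

   The decoder mirrors the same table: it decodes the index, reads the
   symbol, and performs the same swap, so both sides stay in sync
   without any side information.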
      
      
   Block-Haar vs DCT:   
      + Block-Haar is faster and easily reversible (lossless);   
      + Mostly a drop-in replacement for DCT/IDCT in the design;   
      + Also faster than WHT (Walsh-Hadamard Transform).   
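
   The lossless reversibility comes from doing the Haar butterfly as a
   lifting step (the classic S-transform); a minimal sketch, noting
   the actual Block-Haar layout may differ:

```c
#include <assert.h>

/* One reversible Haar butterfly via lifting (a.k.a. the S-transform).
   Integer-exact: haar_inv() recovers (a,b) bit-for-bit, which is what
   makes a Haar-based block transform lossless. The arithmetic right
   shift acts as floor division by 2, including for negative d. */
static void haar_fwd(int a, int b, int *s, int *d) {
    *d = b - a;            /* difference (detail) */
    *s = a + (*d >> 1);    /* average (smooth), rounded towards -inf */
}
static void haar_inv(int s, int d, int *a, int *b) {
    *a = s - (d >> 1);
    *b = *a + d;
}
```

   Applying this recursively across rows and columns of a block gives
   the 2D transform; since each butterfly round-trips exactly, so does
   the whole block.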
      
   RCT vs YCbCr:   
      RCT is both slightly faster, and also reversible;   
      Had experimented with YCoCg, but saw no real advantage over RCT.   
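
   For reference, the textbook RCT (as defined in JPEG 2000), which
   round-trips exactly in integers, unlike a float/fixed YCbCr (the
   variant used here may differ in detail):

```c
#include <assert.h>

/* The standard RCT (reversible color transform, as in JPEG 2000).
   The >> 2 is an arithmetic shift, i.e. floor division by 4, which
   is what makes the inverse exact for all integer inputs. */
static void rct_fwd(int r, int g, int b, int *y, int *cb, int *cr) {
    *y  = (r + 2 * g + b) >> 2;
    *cb = b - g;
    *cr = r - g;
}
static void rct_inv(int y, int cb, int cr, int *r, int *g, int *b) {
    *g = y - ((cb + cr) >> 2);
    *r = cr + *g;
    *b = cb + *g;
}
```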
      
      
      
   The existence of BTIC5x was mostly because BTIC1H and BTIC4B were   
   too computationally demanding to decode 320x200 at 16Hz on a 50MHz   
   BJX2 core.   
      
   MS-CRAM was fast to decode, but needed too much bitrate (the SD card   
   couldn't keep the decoder fed with any semblance of image quality).   
      
      
   So, 5A and 5B were aimed at giving tolerable quality-per-bitrate   
   (Q/bpp) at more CRAM-like decoding speeds.   
      
   Also, while reasonably effective (and fast by desktop PC standards),   
   one other drawback of the 4B design (and to a lesser degree 1H) was   
   that the design was overly complicated (and thus the code is large   
   and bulky).   
      
   Part of this was due to having too many block formats.   
      
      
   If my UPIC format were put into my older naming scheme, it would   
   likely be called 2G. The design is kinda similar to 2F, but replaces   
   Huffman with STF+AdRice.   
      
      
   As for RP2 and TKuLZ:   
      RP2 is a byte-oriented LZ77 variant, like LZ4,   
        but on-average compresses slightly better than LZ4.   
      TKuLZ is sorta like a simplified/tuned Deflate variant.   
        Uses a shorter max symbol length,   
          borrows some design elements from LZ4.   
      
   Can note, some past experiments with LZ decompression (at Desktop PC   
   speeds), with entropy scheme, and len/dist limits:   
      LZMA   : ~   35 MB/sec (Range Coding,   273/   4GB)   
      Zstd   : ~   60 MB/sec (tANS,          16MB/ 128MB)   
      Deflate: ~  175 MB/sec (Huffman,        258/ 32767)   
      TKuLZ  : ~  300 MB/sec (Huffman,      65535/262143)   
      RP2    : ~ 1100 MB/sec (Raw Bytes,      512/131071)   
      LZ4    : ~ 1300 MB/sec (Raw Bytes,    16383/ 65535)   
      
      
   While Zstd is claimed to be fast, my testing tended to show it closer to   
   LZMA speeds than to Deflate, but it does give compression closer to   
   LZMA. The tANS strategy seems to under-perform claims IME (and is   
   notably slower than static Huffman). Also it is the most complicated   
   design among these.   
      
      
   A lot of my older stuff used Deflate, but often Deflate wasn't fast   
   enough, so it has mostly been displaced by RP2 in my uses.   
      
   TKuLZ is an intermediate: generally faster than Deflate, with an   
   option to gain some speed (at the expense of compression) by using   
   fixed-length symbols in some cases. This can push it to around   
   500 MB/sec, but it is hard to get much faster (or anywhere near RP2   
   or LZ4).   
      
   Whether RP2 or LZ4 is faster seems to depend on target:   
      BJX2 Core, RasPi, and Piledriver: RP2 is faster.   
        Mostly things with in-order cores.   
        And Piledriver, which behaved almost more like an in-order machine.   
      Zen+, Core 2, and Core i7: LZ4 is faster.   
      
   LZ4 typically needs multiple chained memory accesses for each LZ   
   run, whereas for RP2 the match length/distance and raw count are   
   typically all available via a single memory load (then maybe a few   
   bit-tests and conditional branches).   
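
   The difference can be illustrated with a made-up tag layout (this is
   NOT the actual RP2 bitstream, just the general single-load pattern;
   field widths here are arbitrary):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration of the "one load per run" decode pattern:
   pack the literal count, match length, and match distance into a
   single 32-bit tag, so the decoder gets everything with one fetch
   and then just shifts/masks, with no chained dependent loads.
   (Made-up field layout, NOT the actual RP2 format.) */
static uint32_t tag_pack(int raw, int len, int dist) {
    return (uint32_t)raw | ((uint32_t)len << 5) | ((uint32_t)dist << 13);
}
static void tag_unpack(uint32_t tag, int *raw, int *len, int *dist) {
    *raw  = (int)( tag        & 0x1F);  /*  5 bits: literal run  */
    *len  = (int)((tag >>  5) & 0xFF);  /*  8 bits: match length */
    *dist = (int)( tag >> 13);          /* 19 bits: distance     */
}
```

   In an LZ4-style token, by contrast, the extra length bytes and the
   distance follow the token at positions that depend on earlier
   fields, so each fetch waits on the previous one.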
      
   ...   
      
      
      
   > A while ago I wrote a set of graphics routines in assembler that were   
   > quite fast. One format I have dealt with is the .flic file format used to   
   > render animated graphics. I wanted to write my own CIV style game. It   
   > took a little bit of research and some reverse engineering. Apparently,   
   > the authors used a modified version of the format making it difficult to   
   > use the CIV graphics in my own game. I never could get it to render as   
   > fast as the game’s engine. I wrote the code for my game in C or C++;   
   > the original game’s engine code was likely in a different language.   
   >   
      
   This sort of thing is almost inevitable with this stuff.   
      
   Usually I just ended up using C for nearly everything.   
      
      
   > *****   
   >   
   > Been working on vectors for the ISA. I split the vector length register   
   > into eight sections to define up to eight different vector lengths. The   
   > first five are defined for integer, float, fixed, character, and address   
   > data types. I figure one may want to use vectors of different lengths at   
   > the same time, for instance to address data using byte offsets, while   
   > the data itself might be a float. The vector load / store instructions   
   > accept a data type to load / store and always use the address type for   
   > address calculations.   
   >   
   > There is also a vector lane size register split up the same way. I had   
   > thought of giving each vector register its own format for length and   
   > lane size. But thought that is a bit much, with limited use cases.   
   >   
   > I think I can get away with only two load and two store instructions.   
   > One to do a strided load and a second to do a vector indexed load   
   > (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].   
   > Where Rindex is used as the stride when scalar or as a supplier of the   
   > lane offset when Rindex is a vector.   
   >   
   > Writing the RTL code to support the vector memory ops has been   
   > challenging. Using a simple approach ATM. The instruction needs to be   
   > re-issued for each vector lane accessed. Unaligned vector loads and   
   > stores are also allowed, adding some complexity when the operation   
   > crosses a cache-line boundary.   
   >   
   > I have the max vector length and max vector size constants returned by   
   > the GETINFO instruction which returns CPU specific information.   
   >   
      
   I don't get it...   
      
   Usually it makes sense to treat vectors as opaque blobs of bits that   
   are then interpreted as one of the available formats for a specific   
   operation.   
      
   In my case, I have a SIMD setup:   
      2 or 4 elements in a GPR or GPR pair;   
      Most other operations are just the normal GPR operations.   
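
   The "elements in a GPR" style can even be mimicked in plain C with
   SWAR tricks; e.g. a carry-isolated packed 16-bit add (generic
   technique, not tied to the BJX2 ISA's actual SIMD ops):

```c
#include <assert.h>
#include <stdint.h>

/* SWAR sketch of SIMD-in-a-GPR: treat a 64-bit register as 4x16-bit
   lanes and add lanewise without carries leaking across lanes.
   Mask off each lane's MSB, add the low 15 bits (which cannot carry
   out of a lane), then restore the MSBs with an XOR. */
static uint64_t padd16x4(uint64_t a, uint64_t b) {
    const uint64_t m = 0x7FFF7FFF7FFF7FFFull;
    return ((a & m) + (b & m)) ^ ((a ^ b) & ~m);
}
```

   Each lane wraps modulo 2^16 independently, which is exactly the
   behavior a packed-add instruction would give.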
      
   ...   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca