
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 130,338 of 131,241   
   Robert Finch to BGB   
   Re: Tonights Tradeoff (3/4)   
   22 Nov 25 12:45:57   
   
   [continued from previous message]   
      
   > The major goal for UPIC was mostly to address the core use-cases but
   > also for the decoder to be small and relatively cheap. Still sorta JPEG   
   > competitive despite being primarily cost-optimized to try to make it   
   > more viable for use in programs running on the BJX2 core (where JPEG   
   > decoding is slow and expensive).   
   >   
   > As for Static Huffman vs STF+AdRice:   
   >    Huffman:   
   >      + Slightly faster for larger payloads   
   >      + Optimal for a static distribution   
   >      - Higher memory cost for decoding (storing decoder tables)   
   >      - High initial setup cost (setting up decoder tables)   
   >      - Higher constant overhead (storing symbol lengths)   
   >      - Need to provision for storing Huffman tables   
   >    STF+AdRice:   
   >      + Very cheap initial setup (minimal context)   
   >      + No need to transmit tables   
   >      + Better compression for small data   
   >      + Significantly faster than Adaptive Huffman   
   >      + Significantly faster than Range Coding   
   >      - Slower for large data and worse compression vs Huffman.   
   >   
   > Where, STF+AdRice is mostly:   
   >    Have a table of symbols;   
   >    Whenever a symbol is encoded, swap it forwards;   
   >      Next time, it may potentially be encoded with a smaller index.   
   >    Encode indices into table using Adaptive Rice Codes.   
   > Or, basically, using a lookup table to allow AdRice to pretend to be   
   > Huffman. Also reasonably fast and simple.   
   >   
   >   
   > Block-Haar vs DCT:   
   >    + Block-Haar is faster and easily reversible (lossless);   
   >    + Mostly a drop-in replacement for DCT/IDCT in the design.   
   >    + Also faster than WHT (Walsh-Hadamard Transform)   
   >   
   > RCT vs YCbCr:   
   >    RCT is both slightly faster, and also reversible;   
   >    Had experimented with YCoCg, but saw no real advantage over RCT.   
   >   
   >   
   >   
   > The existence of BTIC5x was mostly because:   
   > BTIC1H and BTIC4B were too computationally demanding to do 320x200 16Hz   
   > on a 50MHz BJX2 core;   
   >   
   > MS-CRAM was fast to decode, but needed too much bitrate (SDcard couldn't   
   > keep the decoder fed with any semblance of image quality).   
   >   
   >   
   > So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more CRAM-   
   > like decoding speeds.   
   >   
   > Also, while reasonably effective (and fast by desktop PC standards),
   > one other drawback of the 4B design (and to a lesser degree 1H) was
   > that the design was overly complicated (and thus the code is large
   > and bulky).
   >   
   > Part of this was due to having too many block formats.   
   >   
   >   
   > If my UPIC format were put into my older naming scheme, it would
   > likely be called 2G. The design is kinda similar to 2F, but replaces
   > Huffman with STF+AdRice.
   >   
   >   
   > As for RP2 and TKuLZ:   
   >    RP2 is a byte-oriented LZ77 variant, like LZ4,   
   >      but on-average compresses slightly better than LZ4.   
   >    TKuLZ: Is sorta like a simplified/tuned Deflate variant.   
   >      Uses a shorter max symbol length,   
   >        borrows some design elements from LZ4.   
   >   
   > Can note, some past experiments with LZ decompression (at Desktop PC   
   > speeds), with entropy scheme, and len/dist limits:   
   >    LZMA   : ~   35 MB/sec (Range Coding,   273/   4GB)   
   >    Zstd   : ~   60 MB/sec (tANS,          16MB/ 128MB)   
   >    Deflate: ~  175 MB/sec (Huffman,        258/ 32767)   
   >    TKuLZ  : ~  300 MB/sec (Huffman,      65535/262143)   
   >    RP2    : ~ 1100 MB/sec (Raw Bytes,      512/131071)   
   >    LZ4    : ~ 1300 MB/sec (Raw Bytes,    16383/ 65535)   
   >   
   >   
   > While Zstd is claimed to be fast, my testing tended to show it closer to   
   > LZMA speeds than to Deflate, but it does give compression closer to   
   > LZMA. The tANS strategy seems to under-perform its claims IME (and
   > is notably slower than static Huffman). It is also the most
   > complicated design among these.
   >   
   >   
   > A lot of my older stuff used Deflate, but often Deflate wasn't fast
   > enough, so it has mostly been displaced by RP2 in my uses.
   >   
   > TKuLZ is an intermediate: generally faster than Deflate, with an
   > option to gain some speed (at the expense of compression) by using
   > fixed-length symbols in some cases. This can push it to around 500
   > MB/sec, but it is hard to get much faster (or anywhere near RP2 or
   > LZ4).
   >   
   > Whether RP2 or LZ4 is faster seems to depend on target:   
   >    BJX2 Core, RasPi, and Piledriver: RP2 is faster.   
   >      Mostly things with in-order cores.   
   >      And Piledriver, which behaved almost more like an in-order machine.   
   >    Zen+, Core 2, and Core i7: LZ4 is faster.   
   >   
   > LZ4 typically needs multiple chained memory accesses for each LZ
   > run, whereas for RP2, the match length/distance and raw count are
   > typically all available via a single memory load (then maybe a few
   > bit-tests and conditional branches).
   >   
   > ...   
   >   
   >   
   >   
   >> A while ago I wrote a set of graphics routines in assembler that were   
   >> quite fast. One format I have dealt with is the .flic file format used
   >> to render animated graphics. I wanted to write my own CIV style game.   
   >> It took a little bit of research and some reverse engineering.   
   >> Apparently, the authors used a modified version of the format making   
   >> it difficult to use the CIV graphics in my own game. I never could get   
   >> it to render as fast as the game’s engine. I wrote the code for my
   >> game in C or C++; the original game's engine code was likely in a
   >> different language.
   >>   
   >   
   > This sort of thing is almost inevitable with this stuff.   
   >   
   > Usually I just ended up using C for nearly everything.   
   >   
   >   
   >> *****   
   >>   
   >> Been working on vectors for the ISA. I split the vector length   
   >> register into eight sections to define up to eight different vector   
   >> lengths. The first five are defined for integer, float, fixed,   
   >> character, and address data types. I figure one may want to use   
   >> vectors of different lengths at the same time, for instance to address   
   >> data using byte offsets, while the data itself might be a float. The   
   >> vector load / store instructions accept a data type to load / store   
   >> and always use the address type for address calculations.   
   >>   
   >> There is also a vector lane size register split up the same way. I had   
   >> thought of giving each vector register its own format for length and   
   >> lane size. But that seemed a bit much, with limited use cases.
   >>   
   >> I think I can get away with only two load and two store instructions.   
   >> One to do a strided load and a second to do a vector-indexed load
   >> (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].   
   >> Where Rindex is used as the stride when scalar or as a supplier of the   
   >> lane offset when Rindex is a vector.   
   >>   
   >> Writing the RTL code to support the vector memory ops has been   
   >> challenging. Using a simple approach ATM. The instruction needs to be   
   >> re-issued for each vector lane accessed. Unaligned vector loads and   
   >> stores are also allowed, adding some complexity when the operation   
   >> crosses a cache-line boundary.   
   >>   
   >> I have the max vector length and max vector size constants returned by   
   >> the GETINFO instruction which returns CPU specific information.   
   >>   
   >   
   > I don't get it...   
   >   
   > Usually makes sense to treat vectors as opaque blobs of bits that are   
   > then interpreted as one of the available formats for a specific operation.   
   >   
   > In my case, I have a SIMD setup:   
   >    2 or 4 elements in a GPR or GPR pair;   
   >    Most other operations are just the normal GPR operations.   
   >   
   > ...   
   >   
   >   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca