

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 130,342 of 131,241   
   BGB to Robert Finch   
   Re: Tonights Tradeoff (3/5)   
   22 Nov 25 14:29:23   
   
   [continued from previous message]   
      
   >> (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly   
   >> reversible, at the cost of speed).   
   >>   
   >>   
   >> In the early 2010s, I was pretty bad about massively over-engineering   
   >> everything.   
   >>   
   >> Later on, some ideas were reused in 2F and UPIC.   
   >> Though, 2F and UPIC were much less over-engineered.   
   >>   
   >> Did specify possible use as video codecs, but thus far both were used   
   >> only as still image formats.   
   >>   
   >> The major goal for UPIC was mostly to address the core use-cases,
   >> but also for the decoder to be small and relatively cheap. It is
   >> still sorta JPEG-competitive despite being primarily cost-optimized
   >> to make it more viable for programs running on the BJX2 core (where
   >> JPEG decoding is slow and expensive).
   >>   
   >> As for Static Huffman vs STF+AdRice:   
   >>    Huffman:   
   >>      + Slightly faster for larger payloads   
   >>      + Optimal for a static distribution   
   >>      - Higher memory cost for decoding (storing decoder tables)   
   >>      - High initial setup cost (setting up decoder tables)   
   >>      - Higher constant overhead (storing symbol lengths)   
   >>      - Need to provision for storing Huffman tables   
   >>    STF+AdRice:   
   >>      + Very cheap initial setup (minimal context)   
   >>      + No need to transmit tables   
   >>      + Better compression for small data   
   >>      + Significantly faster than Adaptive Huffman   
   >>      + Significantly faster than Range Coding   
   >>      - Slower for large data and worse compression vs Huffman.   
   >>   
   >> Where, STF+AdRice is mostly:   
   >>    Have a table of symbols;   
   >>    Whenever a symbol is encoded, swap it forwards;   
   >>      Next time, it may potentially be encoded with a smaller index.   
   >>    Encode indices into table using Adaptive Rice Codes.   
   >> Or, basically, using a lookup table to allow AdRice to pretend to be   
   >> Huffman. Also reasonably fast and simple.   
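[Editor's sketch of the swap-towards-front + adaptive Rice idea described above. This is my reading of the scheme, not the actual UPIC code; the names, table size, and adaptation rule are all guesses.]

```c
#include <stdint.h>

#define NSYM 256

typedef struct {
    uint8_t table[NSYM];   /* index -> symbol */
    int k;                 /* adaptive Rice parameter */
} StfCtx;

void stf_init(StfCtx *c) {
    for (int i = 0; i < NSYM; i++) c->table[i] = (uint8_t)i;
    c->k = 0;
}

/* Find the symbol's current index (this is what gets Rice-coded),
   then swap it one slot towards the front of the table, so frequent
   symbols drift towards small indices. */
int stf_index(StfCtx *c, uint8_t sym) {
    int i = 0;
    while (c->table[i] != sym) i++;
    if (i > 0) {
        c->table[i] = c->table[i - 1];
        c->table[i - 1] = sym;
    }
    return i;
}

/* Bit length of a Rice code for value v with parameter k:
   (v >> k) + 1 unary bits plus k remainder bits. */
int rice_len(int v, int k) { return (v >> k) + 1 + k; }

/* Crude per-symbol adaptation: grow k when codes run long,
   shrink it when they come out short. */
void rice_adapt(int *k, int v) {
    if ((v >> *k) > 1) { if (*k < 7) (*k)++; }
    else if ((v >> *k) == 0 && *k > 0) (*k)--;
}
```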
   >>   
   >>   
   >> Block-Haar vs DCT:   
   >>    + Block-Haar is faster and easily reversible (lossless);   
   >>    + Mostly a drop-in replacement for DCT/IDCT in the design.   
   >>    + Also faster than WHT (Walsh-Hadamard Transform)   
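[For reference, the usual integer-reversible Haar butterfly (S-transform form), which is the building block a lossless Block-Haar would be made of. A generic sketch, not BGB's actual code; shifts on negative values assume arithmetic shift, as on all common compilers.]

```c
/* Forward: hi = difference, lo = floor of the mean.
   Integer-exact, so the inverse recovers a and b losslessly. */
void haar_fwd(int a, int b, int *lo, int *hi) {
    *hi = a - b;
    *lo = b + (*hi >> 1);   /* == floor((a + b) / 2) */
}

void haar_inv(int lo, int hi, int *a, int *b) {
    *b = lo - (hi >> 1);
    *a = *b + hi;
}
```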
   >>   
   >> RCT vs YCbCr:   
   >>    RCT is both slightly faster, and also reversible;   
   >>    Had experimented with YCoCg, but saw no real advantage over RCT.   
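[RCT here presumably being the JPEG-2000-style reversible color transform; a sketch of its usual integer form, for reference (not the poster's code).]

```c
/* Forward RCT: luma is an integer average; chroma are green-relative
   differences. Exactly invertible in integer arithmetic. */
void rct_fwd(int r, int g, int b, int *y, int *u, int *v) {
    *u = r - g;
    *v = b - g;
    *y = (r + 2 * g + b) >> 2;
}

void rct_inv(int y, int u, int v, int *r, int *g, int *b) {
    *g = y - ((u + v) >> 2);
    *r = u + *g;
    *b = v + *g;
}
```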
   >>   
   >>   
   >>   
   >> The existence of BTIC5x was mostly because:
   >> BTIC1H and BTIC4B were too computationally demanding to do 320x200
   >> at 16Hz on a 50MHz BJX2 core;
   >>   
   >> MS-CRAM was fast to decode, but needed too much bitrate (SDcard   
   >> couldn't keep the decoder fed with any semblance of image quality).   
   >>   
   >>   
   >> So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more
   >> CRAM-like decoding speeds.
   >>   
   >> Also, while reasonably effective (and fast by desktop PC standards),
   >> one other drawback of the 4B design (and to a lesser degree 1H) was
   >> that the design was overly complicated (and thus the code is large
   >> and bulky).
   >>   
   >> Part of this was due to having too many block formats.   
   >>   
   >>   
   >> If my UPIC format were put into my older naming scheme, it would
   >> likely be called 2G. The design is kinda similar to 2F, but replaces
   >> Huffman with STF+AdRice.
   >>   
   >>   
   >> As for RP2 and TKuLZ:   
   >>    RP2 is a byte-oriented LZ77 variant, like LZ4,   
   >>      but on-average compresses slightly better than LZ4.   
   >>    TKuLZ: Is sorta like a simplified/tuned Deflate variant.   
   >>      Uses a shorter max symbol length,   
   >>        borrows some design elements from LZ4.   
   >>   
   >> Can note some past experiments with LZ decompression (at desktop PC
   >> speeds), with entropy scheme and max len/dist limits:
   >>    LZMA   : ~   35 MB/sec (Range Coding,   273/   4GB)   
   >>    Zstd   : ~   60 MB/sec (tANS,          16MB/ 128MB)   
   >>    Deflate: ~  175 MB/sec (Huffman,        258/ 32767)   
   >>    TKuLZ  : ~  300 MB/sec (Huffman,      65535/262143)   
   >>    RP2    : ~ 1100 MB/sec (Raw Bytes,      512/131071)   
   >>    LZ4    : ~ 1300 MB/sec (Raw Bytes,    16383/ 65535)   
   >>   
   >>   
   >> While Zstd is claimed to be fast, my testing tended to show it
   >> closer to LZMA speeds than to Deflate, though it does give
   >> compression closer to LZMA. The tANS strategy seems to under-perform
   >> its claims IME (and is notably slower than static Huffman). It is
   >> also the most complicated design among these.
   >>   
   >>   
   >> A lot of my older stuff used Deflate, but often Deflate wasn't fast
   >> enough, so it has mostly been displaced by RP2 in my uses.
   >>   
   >> TKuLZ is an intermediate: generally faster than Deflate, with an
   >> option to gain some speed (at the expense of compression) by using
   >> fixed-length symbols in some cases. This can push it to around
   >> 500 MB/sec, but it is hard to get much faster (or anywhere near RP2
   >> or LZ4).
   >>   
   >> Whether RP2 or LZ4 is faster seems to depend on target:   
   >>    BJX2 Core, RasPi, and Piledriver: RP2 is faster.   
   >>      Mostly things with in-order cores.   
   >>      And Piledriver, which behaved almost more like an in-order machine.   
   >>    Zen+, Core 2, and Core i7: LZ4 is faster.   
   >>   
   >> LZ4 typically needs multiple chained memory accesses for each LZ
   >> run, whereas for RP2, the match length/distance and raw count are
   >> typically all available via a single memory load (then maybe a few
   >> bit-tests and conditional branches).
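[To illustrate the "single load" point, here is a toy decoder op with a tag layout of my own invention (NOT the real RP2 format): raw count, match length, and distance packed into one 16-bit word, so a single load yields everything the op needs.]

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical tag layout: bits 15..12 = raw count (0..15),
   bits 11..8 = match length - 3, bits 7..0 = distance - 1.
   Decodes one op: copy 'raw' literal bytes, then an LZ match. */
size_t lz_decode_op(const uint8_t *src, uint8_t *dst, size_t pos) {
    uint16_t tag = (uint16_t)(src[0] | (src[1] << 8));  /* single load */
    int raw  = (tag >> 12) & 15;
    int len  = ((tag >> 8) & 15) + 3;
    int dist = (tag & 255) + 1;
    memcpy(dst + pos, src + 2, (size_t)raw);
    pos += (size_t)raw;
    for (int i = 0; i < len; i++) {   /* byte-wise, so matches may overlap */
        dst[pos] = dst[pos - (size_t)dist];
        pos++;
    }
    return pos;
}
```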
   >>   
   >> ...   
   >>   
   >>   
   >>   
   >>> A while ago I wrote a set of graphics routines in assembler that
   >>> were quite fast. One format I have dealt with is the .flic file
   >>> format used to render animated graphics. I wanted to write my own
   >>> CIV style game. It took a little bit of research and some reverse
   >>> engineering. Apparently, the authors used a modified version of the
   >>> format, making it difficult to use the CIV graphics in my own game.
   >>> I never could get it to render as fast as the game's engine. I
   >>> wrote the code for my game in C or C++; the original game's engine
   >>> code was likely written in a different language.
   >>>   
   >>   
   >> This sort of thing is almost inevitable with this stuff.   
   >>   
   >> Usually I just ended up using C for nearly everything.   
   >>   
   >>   
   >>> *****   
   >>>   
   >>> Been working on vectors for the ISA. I split the vector length   
   >>> register into eight sections to define up to eight different vector   
   >>> lengths. The first five are defined for integer, float, fixed,   
   >>> character, and address data types. I figure one may want to use   
   >>> vectors of different lengths at the same time, for instance to   
   >>> address data using byte offsets, while the data itself might be a   
   >>> float. The vector load / store instructions accept a data type to   
   >>> load / store and always use the address type for address calculations.   
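[One plausible way to model the split vector-length register just described: eight packed per-type length fields. The field order, width, and names are my guesses, not Robert's actual encoding.]

```c
#include <stdint.h>

/* Eight 8-bit length fields packed into one 64-bit register; the
   first five are the typed lengths described above. */
enum { VL_INT, VL_FLOAT, VL_FIXED, VL_CHAR, VL_ADDR };

uint64_t vl_set(uint64_t vl, int field, unsigned len) {
    int sh = field * 8;
    return (vl & ~((uint64_t)0xFF << sh)) | ((uint64_t)(len & 0xFF) << sh);
}

unsigned vl_get(uint64_t vl, int field) {
    return (unsigned)((vl >> (field * 8)) & 0xFF);
}
```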
   >>>   
   >>> There is also a vector lane size register split up the same way. I
   >>> had thought of giving each vector register its own format for
   >>> length and lane size, but thought that was a bit much, with limited
   >>> use cases.
   >>>   
   >>> I think I can get away with only two load and two store
   >>> instructions: one to do a strided load and a second to do a
   >>> vector-indexed load (gather/scatter). The addressing mode in use is
   >>> d[Rbase+Rindex*Scale], where Rindex is used as the stride when
   >>> scalar, or as a supplier of the per-lane offset when Rindex is a
   >>> vector.
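[A scalar C model of the gather case of that addressing mode, as I understand it (illustrative only; 32-bit lanes and the function name are assumptions, not Robert's RTL).]

```c
#include <stdint.h>
#include <string.h>

/* dst[i] = load32(base + idx[i] * scale): when Rindex is a vector,
   each lane supplies its own offset. memcpy keeps the per-lane load
   alignment-safe. */
void vgather32(uint32_t *dst, const uint8_t *base,
               const uint32_t *idx, int scale, int lanes) {
    for (int i = 0; i < lanes; i++) {
        uint32_t v;
        memcpy(&v, base + (size_t)idx[i] * (size_t)scale, sizeof v);
        dst[i] = v;
    }
}
```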
   >>>   
   >>> Writing the RTL code to support the vector memory ops has been   
   >>> challenging. Using a simple approach ATM. The instruction needs to be   
   >>> re-issued for each vector lane accessed. Unaligned vector loads and   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   


