Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 130,342 of 131,241    |
|    BGB to Robert Finch    |
|    Re: Tonights Tradeoff (3/5)    |
|    22 Nov 25 14:29:23    |
      [continued from previous message]

>> (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly
>> reversible, at the cost of speed).
>>
>>
>> In the early 2010s, I was pretty bad about massively over-engineering
>> everything.
>>
>> Later on, some ideas were reused in 2F and UPIC.
>> Though, 2F and UPIC were much less over-engineered.
>>
>> Did specify possible use as video codecs, but thus far both were used
>> only as still image formats.
>>
>> The major goal for UPIC was mostly to address the core use-cases,
>> but also for the decoder to be small and relatively cheap. Still sorta
>> JPEG competitive despite being primarily cost-optimized to try to make
>> it more viable for use in programs running on the BJX2 core (where
>> JPEG decoding is slow and expensive).
>>
>> As for Static Huffman vs STF+AdRice:
>>   Huffman:
>>   + Slightly faster for larger payloads
>>   + Optimal for a static distribution
>>   - Higher memory cost for decoding (storing decoder tables)
>>   - High initial setup cost (setting up decoder tables)
>>   - Higher constant overhead (storing symbol lengths)
>>   - Need to provision for storing Huffman tables
>>   STF+AdRice:
>>   + Very cheap initial setup (minimal context)
>>   + No need to transmit tables
>>   + Better compression for small data
>>   + Significantly faster than Adaptive Huffman
>>   + Significantly faster than Range Coding
>>   - Slower for large data, and worse compression vs Huffman.
>>
>> Where, STF+AdRice is mostly:
>>   Have a table of symbols;
>>   Whenever a symbol is encoded, swap it forwards;
>>   Next time, it may potentially be encoded with a smaller index.
>>   Encode indices into the table using Adaptive Rice codes.
>> Or, basically, using a lookup table to allow AdRice to pretend to be
>> Huffman. Also reasonably fast and simple.
>>
>>
>> Block-Haar vs DCT:
>>   + Block-Haar is faster and easily reversible (lossless);
>>   + Mostly a drop-in replacement for DCT/IDCT in the design.
>>   + Also faster than WHT (Walsh-Hadamard Transform)
>>
>> RCT vs YCbCr:
>>   RCT is both slightly faster, and also reversible;
>>   Had experimented with YCoCg, but saw no real advantage over RCT.
>>
>>
>>
>> The existence of BTIC5x was mostly because:
>>   BTIC1H and BTIC4B were too computationally demanding to do 320x200
>>   16Hz on a 50MHz BJX2 core;
>>
>>   MS-CRAM was fast to decode, but needed too much bitrate (SDcard
>>   couldn't keep the decoder fed with any semblance of image quality).
>>
>>
>> So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more
>> CRAM-like decoding speeds.
>>
>> Also, while reasonably effective (and fast by desktop PC standards),
>> one other drawback of the 4B design (and to a lesser degree 1H) was
>> the design being overly complicated (and thus the code is large and
>> bulky).
>>
>> Part of this was due to having too many block formats.
>>
>>
>> If my UPIC format were put into my older naming scheme, would likely
>> be called 2G. Design is kinda similar to 2F, but replaces Huffman with
>> STF+AdRice.
>>
>>
>> As for RP2 and TKuLZ:
>>   RP2 is a byte-oriented LZ77 variant, like LZ4,
>>   but on average compresses slightly better than LZ4.
>>   TKuLZ: Is sorta like a simplified/tuned Deflate variant.
>>   Uses a shorter max symbol length,
>>   borrows some design elements from LZ4.
>>
>> Can note, some past experiments with LZ decompression (at Desktop PC
>> speeds), with entropy scheme, and len/dist limits:
>>   LZMA   : ~   35 MB/sec  (Range Coding,    273 /    4GB)
>>   Zstd   : ~   60 MB/sec  (tANS,           16MB /  128MB)
>>   Deflate: ~  175 MB/sec  (Huffman,         258 /  32767)
>>   TKuLZ  : ~  300 MB/sec  (Huffman,       65535 / 262143)
>>   RP2    : ~ 1100 MB/sec  (Raw Bytes,       512 / 131071)
>>   LZ4    : ~ 1300 MB/sec  (Raw Bytes,     16383 /  65535)
>>
>>
>> While Zstd is claimed to be fast, my testing tended to show it closer
>> to LZMA speeds than to Deflate, though it does give compression closer
>> to LZMA. The tANS strategy seems to under-perform claims IME (and is
>> notably slower than static Huffman). Also it is the most complicated
>> design among these.
>>
>>
>> A lot of my older stuff used Deflate, but often Deflate wasn't fast
>> enough, so it has mostly gotten displaced by RP2 in my uses.
>>
>> TKuLZ is an intermediate: generally faster than Deflate, with an
>> option to gain some speed (at the expense of compression) by using
>> fixed-length symbols in some cases. This can push it to around
>> 500 MB/sec, but it is hard to get much faster (or anywhere near RP2
>> or LZ4).
>>
>> Whether RP2 or LZ4 is faster seems to depend on target:
>>   BJX2 Core, RasPi, and Piledriver: RP2 is faster.
>>     Mostly things with in-order cores.
>>     And Piledriver, which behaved almost more like an in-order machine.
>>   Zen+, Core 2, and Core i7: LZ4 is faster.
>>
>> LZ4 typically needs multiple chained memory accesses for each LZ run,
>> whereas for RP2, match length/distance and raw count are typically all
>> available via a single memory load (then maybe a few bit-tests and
>> conditional branches).
>>
>> ...
>>
>>
>>
>>> A while ago I wrote a set of graphics routines in assembler that were
>>> quite fast. One format I have dealt with is the .flic file format used
>>> to render animated graphics. I wanted to write my own CIV-style game.
>>> It took a little bit of research and some reverse engineering.
>>> Apparently, the authors used a modified version of the format, making
>>> it difficult to use the CIV graphics in my own game. I never could
>>> get it to render as fast as the game’s engine. I wrote the code for
>>> my game in C or C++; the original game’s engine code was likely in a
>>> different language.
>>>
>>
>> This sort of thing is almost inevitable with this stuff.
>>
>> Usually I just ended up using C for nearly everything.
>>
>>
>>> *****
>>>
>>> Been working on vectors for the ISA. I split the vector length
>>> register into eight sections to define up to eight different vector
>>> lengths. The first five are defined for integer, float, fixed,
>>> character, and address data types. I figure one may want to use
>>> vectors of different lengths at the same time, for instance to
>>> address data using byte offsets, while the data itself might be a
>>> float. The vector load / store instructions accept a data type to
>>> load / store and always use the address type for address calculations.
>>>
>>> There is also a vector lane size register split up the same way. I
>>> had thought of giving each vector register its own format for length
>>> and lane size. But thought that was a bit much, with limited use cases.
>>>
>>> I think I can get away with only two load and two store instructions:
>>> one to do a strided load and a second to do a vector-indexed load
>>> (gather/scatter). The addressing mode in use is
>>> d[Rbase+Rindex*Scale].
>>> Where Rindex is used as the stride when scalar,
>>> or as a supplier of the lane offset when Rindex is a vector.
>>>
>>> Writing the RTL code to support the vector memory ops has been
>>> challenging. Using a simple approach ATM. The instruction needs to be
>>> re-issued for each vector lane accessed. Unaligned vector loads and

      [continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)    |
(c) 1994, bbs@darkrealms.ca