Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 130,338 of 131,241    |
|    Robert Finch to BGB    |
|    Re: Tonights Tradeoff (3/4)    |
|    22 Nov 25 12:45:57    |
      [continued from previous message]

> The major goal for UPIC was mostly to address the core use-cases, but
> also for the decoder to be small and relatively cheap. Still sorta JPEG
> competitive despite being primarily cost-optimized, to try to make it
> more viable for use in programs running on the BJX2 core (where JPEG
> decoding is slow and expensive).
>
> As for Static Huffman vs STF+AdRice:
> Huffman:
> + Slightly faster for larger payloads
> + Optimal for a static distribution
> - Higher memory cost for decoding (storing decoder tables)
> - High initial setup cost (setting up decoder tables)
> - Higher constant overhead (storing symbol lengths)
> - Need to provision for storing Huffman tables
> STF+AdRice:
> + Very cheap initial setup (minimal context)
> + No need to transmit tables
> + Better compression for small data
> + Significantly faster than Adaptive Huffman
> + Significantly faster than Range Coding
> - Slower for large data, and worse compression vs Huffman.
>
> Where STF+AdRice is mostly:
> Have a table of symbols;
> Whenever a symbol is encoded, swap it forwards;
> Next time, it may potentially be encoded with a smaller index.
> Encode indices into the table using Adaptive Rice codes.
> Or, basically, using a lookup table to allow AdRice to pretend to be
> Huffman. Also reasonably fast and simple.
>
>
> Block-Haar vs DCT:
> + Block-Haar is faster and easily reversible (lossless);
> + Mostly a drop-in replacement for DCT/IDCT in the design;
> + Also faster than the WHT (Walsh-Hadamard Transform).
>
> RCT vs YCbCr:
> RCT is both slightly faster, and also reversible;
> Had experimented with YCoCg, but saw no real advantage over RCT.
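[The STF+AdRice scheme described above can be sketched roughly as below. This is a minimal illustration, not BGB's actual code: the one-step adjacent swap and the particular k-adaptation rule are assumptions, and only the bit *cost* of the Rice code is modeled, not actual bitstream I/O.]

```c
#include <assert.h>
#include <stdint.h>

#define STF_NSYM 256

static uint8_t stf_tab[STF_NSYM];  /* index -> symbol */
static int rice_k = 3;             /* adaptive Rice parameter */

static void stf_init(void) {
    for (int i = 0; i < STF_NSYM; i++)
        stf_tab[i] = (uint8_t)i;
    rice_k = 3;
}

/* Find the symbol's current index, then swap it one step toward the
   front so a repeated symbol gets a smaller index next time. */
static int stf_encode_index(uint8_t sym) {
    int i = 0;
    while (stf_tab[i] != sym)
        i++;
    if (i > 0) {
        uint8_t t = stf_tab[i - 1];
        stf_tab[i - 1] = stf_tab[i];
        stf_tab[i] = t;
    }
    return i;  /* this index is what gets Rice-coded */
}

/* Bits to Rice-code value v: q ones plus a stop bit for the quotient,
   then k remainder bits; afterwards nudge k toward the magnitude of
   recent values (one simple adaptation rule). */
static int adrice_bits(int v) {
    int q = v >> rice_k;
    int bits = q + 1 + rice_k;
    if (q == 0 && rice_k > 0) rice_k--;  /* small values: shrink k */
    else if (q > 1) rice_k++;            /* large values: grow k */
    return bits;
}
```

[A frequently-used symbol migrates toward index 0, where the code spends only k+1 bits, which is how the lookup table lets AdRice "pretend to be Huffman" with no tables to set up or transmit.]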
>
>
> The existence of BTIC5x was mostly because:
> BTIC1H and BTIC4B were too computationally demanding to do 320x200 16Hz
> on a 50MHz BJX2 core;
>
> MS-CRAM was fast to decode, but needed too much bitrate (the SD card
> couldn't keep the decoder fed with any semblance of image quality).
>
>
> So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more
> CRAM-like decoding speeds.
>
> Also, while reasonably effective (and fast by desktop PC standards), one
> other drawback of the 4B design (and to a lesser degree 1H) was the
> design being overly complicated (and thus the code is large and bulky).
>
> Part of this was due to having too many block formats.
>
>
> If my UPIC format were put into my older naming scheme, it would likely
> be called 2G. The design is kinda similar to 2F, but replaces Huffman
> with STF+AdRice.
>
>
> As for RP2 and TKuLZ:
> RP2 is a byte-oriented LZ77 variant, like LZ4,
> but on average compresses slightly better than LZ4.
> TKuLZ is sorta like a simplified/tuned Deflate variant:
> uses a shorter max symbol length,
> borrows some design elements from LZ4.
>
> Can note, some past experiments with LZ decompression (at desktop PC
> speeds), with entropy scheme, and len/dist limits:
>   LZMA   : ~   35 MB/sec  (Range Coding,    273 /      4GB)
>   Zstd   : ~   60 MB/sec  (tANS,           16MB /    128MB)
>   Deflate: ~  175 MB/sec  (Huffman,         258 /    32767)
>   TKuLZ  : ~  300 MB/sec  (Huffman,       65535 /   262143)
>   RP2    : ~ 1100 MB/sec  (Raw Bytes,       512 /   131071)
>   LZ4    : ~ 1300 MB/sec  (Raw Bytes,     16383 /    65535)
>
>
> While Zstd is claimed to be fast, my testing tended to show it closer to
> LZMA speeds than to Deflate, but it does give compression closer to
> LZMA.
> The tANS strategy seems to under-perform claims IME (and is notably
> slower than static Huffman). It is also the most complicated design
> among these.
>
>
> A lot of my older stuff used Deflate, but often Deflate wasn't fast
> enough, so it has mostly been displaced by RP2 in my uses.
>
> TKuLZ is an intermediate: generally faster than Deflate, and it had an
> option to gain some speed (at the expense of compression) by using
> fixed-length symbols in some cases. This can push it to around 500
> MB/sec, but it is hard to get much faster (or anywhere near RP2 or LZ4).
>
> Whether RP2 or LZ4 is faster seems to depend on the target:
> BJX2 core, RasPi, and Piledriver: RP2 is faster.
>   Mostly things with in-order cores.
>   And Piledriver, which behaved almost more like an in-order machine.
> Zen+, Core 2, and Core i7: LZ4 is faster.
>
> LZ4 typically needs multiple chained memory accesses for each LZ run,
> whereas for RP2, match length/distance and raw count are typically all
> available via a single memory load (then maybe a few bit-tests and
> conditional branches).
>
> ...
>
>
>
>> A while ago I wrote a set of graphics routines in assembler that were
>> quite fast. One format I have dealt with is the .flic file format used
>> to render animated graphics. I wanted to write my own CIV-style game.
>> It took a little bit of research and some reverse engineering.
>> Apparently, the authors used a modified version of the format, making
>> it difficult to use the CIV graphics in my own game. I never could get
>> it to render as fast as the game's engine. I wrote the code for my
>> game in C or C++; the original game's engine code was likely in a
>> different language.
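[On the RP2-vs-LZ4 point above: a token layout where raw count, match length, and distance all come out of one memory load can be sketched as below. The field widths and layout here are entirely invented for illustration; this is not RP2's actual format.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit token: everything the decode loop needs comes
   from a single 4-byte load (field layout invented for illustration):
     bits  0..4  : raw literal count  (0..31)
     bits  5..13 : match length       (0..511)
     bits 14..31 : match distance     (0..262143) */
typedef struct { int raw, len, dist; } LzTok;

static LzTok lz_read_token(const uint8_t *p) {
    uint32_t w;
    memcpy(&w, p, 4);   /* one load, unaligned-safe */
    LzTok t;
    t.raw  =  w        & 0x1F;
    t.len  = (w >> 5)  & 0x1FF;
    t.dist =  w >> 14;
    return t;
}
```

[After this one load, only shifts, masks, and a couple of branches remain, versus the chained loads a byte-by-byte token/extension scheme tends to need.]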
>>
>
> This sort of thing is almost inevitable with this stuff.
>
> Usually I just ended up using C for nearly everything.
>
>
>> *****
>>
>> Been working on vectors for the ISA. I split the vector length
>> register into eight sections to define up to eight different vector
>> lengths. The first five are defined for integer, float, fixed,
>> character, and address data types. I figure one may want to use
>> vectors of different lengths at the same time, for instance to address
>> data using byte offsets, while the data itself might be a float. The
>> vector load / store instructions accept a data type to load / store
>> and always use the address type for address calculations.
>>
>> There is also a vector lane size register, split up the same way. I
>> had thought of giving each vector register its own format for length
>> and lane size, but thought that a bit much, with limited use cases.
>>
>> I think I can get away with only two load and two store instructions:
>> one to do a strided load, and a second to do a vector indexed load
>> (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale],
>> where Rindex is used as the stride when scalar, or as a supplier of
>> the lane offset when Rindex is a vector.
>>
>> Writing the RTL code to support the vector memory ops has been
>> challenging. Using a simple approach ATM: the instruction needs to be
>> re-issued for each vector lane accessed. Unaligned vector loads and
>> stores are also allowed, adding some complexity when the operation
>> crosses a cache-line boundary.
>>
>> I have the max vector length and max vector size constants returned by
>> the GETINFO instruction, which returns CPU-specific information.
>>
>
> I don't get it...
>
> Usually makes sense to treat vectors as opaque blobs of bits that are
> then interpreted as one of the available formats for a specific
> operation.
>
> In my case, I have a SIMD setup:
> 2 or 4 elements in a GPR or GPR pair;
> Most other operations are just the normal GPR operations.
>
> ...
>
>
>              [continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
(c) 1994, bbs@darkrealms.ca