... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.c
Meh, in C you gotta define EVERYTHING
243,242 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 242,604 of 243,242
BGB to All
Re: srand(0)
25 Dec 25 15:29:59
   From: cr88192@gmail.com   
      
   On 12/25/2025 1:31 PM, Lawrence D’Oliveiro wrote:   
   > On Thu, 25 Dec 2025 03:07:03 -0600, BGB wrote:   
   >   
   >> One entropy-mining process is to use "clock()" or similar and then   
   >> spin in a loop for a certain amount of time effectively building a   
   >> hash of the values returned by clock. The exact timing when the   
   >> values change will tend to carry a certain amount of entropy.   
   >   
   > The turbulence of the air/gas inside disk drives is apparently a good   
   > source of randomness.   
      
   Yeah, but one doesn't easily have access to this information.   
   Likewise to access from the low order bits of CPU thermometers or   
   similar, etc.   
      
   For some of my targets, there is also no HDD (typically, everything runs   
   off of SD cards).   
      
      
   FWIW, in my own CPU design, there is actually a hardware RNG where   
   internal signals are basically gathered up and fed around the bus in a   
   special noise channel and used to continuously feed into a hardware RNG   
   for which a value can be read with a special CPU instruction.   
      
   But, alas, mainline CPUs lack such a feature.   
      
   On x86, it is also possible to get some level of entropy from mining   
   RDTSC, but this is non-portable.   
      
      
      
      
   But, yeah, tested out a few more RNG designs, and ATM:   
   	seed1 ^= ~(seed2>>47);	seed2 ^= ~(seed1>>43);  // 4 cycles   
   	seed1 ^=  (seed1<<13);	seed2 ^=  (seed2>>11);  // 4 cycles   
   	seed1 ^=  (seed1>>19);	seed2 ^=  (seed2<<17);  // 4 cycles   
   	val = ((seed1 ^ seed2) >> 32) & 0x7FFF;         // 6 cycles   
   Seems to be working pretty OK (decent randomness), and is moderately fast.   
      
   Add cost of +4 cycles for LD (2c penalty), +2 ST   
   Est cost: Around 24 clock cycles.   
      
      
   Though, breaking up the shifts and xors using temporaries could be used   
   to micro-optimize it a little more (vs trying to rely on compile-time   
   instruction shuffling).   
      
   Downside as that this particular approach (XOR'ing values with   
   themselves and modifying the original variable before the next step),   
   creates a lot of dependencies which limits the potential ILP (can't get   
   ILP over 2 in this case).   
      
   Where, the interleaved "seed1 = (seed1<>SH2);" pattern   
   allows for slightly higher ILP, but seemingly gets less randomness per   
   shift (so the total dependency length ends up similar).   
      
   In the CPU in question, would effectively need 4 seed values to get   
   better ILP, say:   
   	seed1 ^= ~(seed2>>47);	seed2 ^= ~(seed3>>43);   
   	seed3 ^= ~(seed4>>53);	seed4 ^= ~(seed1>>41);   
              // ~ 4 cycles   
   	seed1 ^=  (seed1<<13);	seed2 ^=  (seed2>>11);   
   	seed3 ^=  (seed3<< 7);	seed4 ^=  (seed4>> 5);   
   	seed1 ^=  (seed1>>19);	seed2 ^=  (seed2<<17);   
   	seed3 ^=  (seed3>>23);	seed4 ^=  (seed4<<29);   
              // ~ 6 cycles   
   	val = ((seed1 ^ seed2) >> 32) & 0x7FFF;  // 6 cycles   
            //+ 8 cycles (Global LD/ST)   
   Est Cost: Around 24 clock cycles.   
      
   (24 cycle estimate assuming compiler shuffles instructions into a   
   roughly optimal order).   
      
      
   Though, with more operations still likely to be slower, despite the   
   higher ILP potential (the higher ILP would merely reduce the slowdown   
   from having more operations).   
      
   In this case, ideally one wants to be able to have groups of 6   
   non-dependent ALU operations on a 3-wide CPU with 2-cycle ALU latency.   
   Though, in this case, achieving peak ILP would actually require ~ 6 or 8   
   seed registers. Though, in this case, would require using XG3, for   
   RISC-V it would also suffer due to RV64 only having 32 GPRs rather than   
   64, so trying to use 8 seeds in parallel would run out of GPRs on RISC-V   
   (don't want to spill, this will hurt a lot worse than the lost ILP).   
      
   Where, if ILP is maximized, one can get to around 0.3 clock cycles per   
   operation.   
      
      
      
   But, on the target in question, 2 or 4 seeds is still likely to be   
   considerably faster than, say:   
   	seed = seed * 0x0000FC4BA2F7ACABULL + 1;   
      
   Which suffers from the "great evil" known as "slow 64-bit multiply"   
   (needs 68 cycles to do a 64-bit multiply, similar to performing a divide).   
      
   Where, the cost of the 64-bit integer multiply will stomp all over the   
   cycle cost of the former (well, and not like "rand()" tends to be used   
   all that heavily in any case).   
      
      
   Well, contrast to the naive lookup-table approach:   
      index=(index+1)&255; //latency = 8 cycles (3 LD, 2 ADD, 2 AND, 1 ST)   
      val=table[index];    //latency = 5 cycles (2 ADD, 3 LD).   
   So: ~ 13 cycles of latency.   
      
   So, looks simple, but not great in this case (both weak, and latency   
   isn't great for its weakness).   
      
      
   Then again:   
      seed = seed * 0x0000FC4BA2F7ACABULL + 1;   
        //75 cycles: 3 LD, 1 const, 68 MUL, 2 ADD, 1 ST   
      ret = (myseed >> 48) & 0x7FFF;  //4 cycles: 2 SHR, 2 AND   
   Cost: Around 80 cycles.   
      
      
   ...   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]