comp.arch
Robert Finch to Michael S
Re: A useless machine
21 Feb 26 01:56:03
   From: robfi680@gmail.com   
      
   On 2026-02-19 5:10 a.m., Michael S wrote:   
   > On Wed, 18 Feb 2026 22:56:19 -0500   
   > Robert Finch wrote:
   >   
   >> On 2026-02-18 11:22 a.m., Michael S wrote:   
   >>> On Sun, 15 Feb 2026 10:56:51 -0500   
   >>> Robert Finch wrote:
   >>>   
   >>>> I have coded a version that tests up to 8192 numbers simultaneously.
   >>>> It is supposed to be able to run at over 200 MHz. Some of the
   >>>> numbers would take a large number of iterations though.
   >>>>   
   >>>> I found something strange: the code to count for more numbers does
   >>>> not increase the size of the core very much. (But it takes longer
   >>>> to build.) I am not sure if this occurs because there is a mistake
   >>>> somewhere, or if the tools are able to figure out how to share all
   >>>> the logic. I am thinking the logic for adders is shared somehow.
   >>>> 128-bit arithmetic is being used.   
   >>>>   
   >>>   
   >>> There is a mistake in your code. No other explanation is possible.   
   >>>   
   >>> You should look for 3 sorts of mistakes:
   >>> 1. Core itself.   
   >>> The most likely mistake is failing to look for one of the ending   
   >>> conditions. You have to look for 3 conditions:   
   >>> a) - value produced at current step is smaller than initial value.   
   >>>    That is successful ending. Here you signal to the scheduler that   
   >>> you are ready to accept the next value.   
   >>> b) - overflow. Next value exceeds the width of register.   
   >>> c) - timeout. You were running for more than N steps. N=8192 looks
   >>> like a reasonable value.
   >>> b) and c) are suspected failures.
   >>> The handling for b) and c) is similar - you raise an alert signal and
   >>> stop. Your test unit has to provide the supervisor circuitry with the
   >>> means to fetch the initial value that caused the suspected failure and
   >>> to clear the alert condition.
   >>> After the alert is cleared your unit returns to the same state as after
   >>> success - waiting for the next initial value.
   >>   
   >> I found the mistake. Core was working, but groups of cores were not.   
   >> Done signals were missing.   
   >>   
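The three ending conditions listed above can be written down as a small software model. This is only a sketch in Python, not the RTL: the function name `run_test_unit`, the status strings, and the `width` and `max_steps` parameters are made up for illustration, with defaults taken from the 128-bit registers and N=8192 mentioned in the thread.

```python
def run_test_unit(start, width=128, max_steps=8192):
    """Model of one test unit; returns (status, steps_taken).

    status is "success", "overflow" or "timeout" (hypothetical names).
    """
    limit = (1 << width) - 1          # largest value the register can hold
    n = start
    for step in range(1, max_steps + 1):
        n = n >> 1 if n % 2 == 0 else 3 * n + 1
        if n > limit:                 # b) next value exceeds the register width
            return "overflow", step
        if n < start:                 # a) value dropped below the initial value
            return "success", step
    return "timeout", max_steps       # c) ran for max_steps without finishing
```

With the defaults, `run_test_unit(7)` reports success after 11 steps, while a start of 1 never drops below itself and ends in timeout. Only the overflow and timeout results would raise the alert.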
   >>>   
   >>> 2. Scheduler.   
   >>> In a proof-of-concept design the scheduler does not have to be very
   >>> realistic. For example, it does not have to be efficiently
   >>> pipelined. But it should be able to monitor the ready signals of all
   >>> test units and to supply different initial data to different units.
   >>> Supplying the same data to all units simultaneously is the sort of
   >>> mistake that would explain your observed behavior.
   >>>   
   >> There is really simple scheduling ATM. There is just a bunch of done
   >> signals that are gated together, meaning the slowest calculation of
   >> the group will slow everything else down. Once all the dones are true
   >> the next set of numbers is loaded.
   >>   
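A rough way to see what gating all the done signals together costs is to model a batch in software. The sketch below is an assumption-laden illustration, not a measurement of the design: `steps_to_drop` and `batch_cost` are hypothetical helper names, and each core is assumed to take one clock per iteration until its value drops below its starting value.

```python
def steps_to_drop(start, limit=8192):
    """Iterations until the value falls below start (capped at limit)."""
    n = start
    for step in range(1, limit + 1):
        n = n >> 1 if n % 2 == 0 else 3 * n + 1
        if n < start:
            return step
    return limit

def batch_cost(starts):
    """Return (cycles for the whole batch, fraction of core-cycles doing work).

    With all done signals gated together, the batch lasts as long as its
    slowest member; every other core idles for the difference.
    """
    per_core = [steps_to_drop(s) for s in starts]
    batch_cycles = max(per_core)
    utilization = sum(per_core) / (batch_cycles * len(per_core))
    return batch_cycles, utilization
```

For example, `batch_cost([4, 7])` gives a batch length of 11 cycles, because 4 finishes after one step but has to wait for 7. A scheduler that reloads each core as soon as its own done goes true would avoid that idle time.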
   >>> 3. Supervisor   
   >>> Again, in a proof-of-concept design the supervisor does not have to be
   >>> very realistic. But it should be able to selectively exercise all
   >>> core features related to alerts, including fetching of initial data
   >>> and clearing of the alert.
   >>>   
   >>> After you have coded everything correctly, do not even try to compile
   >>> a test with 8000 units. That is totally unrealistic. Even if you find
   >>> a huge FPGA in which such a number of testers could fit (and I am not
   >>> sure that such an FPGA exists) the compilation will take many hours,
   >>> possibly days.
   >>>   
   >>>   
   >>>   
   >> Yeah 8,000 was definitely out of the picture. I have since coded a 24x24
   >> matrix (576) and although the logic would fit, it turned out not to
   >> be routable. So, it is down to 20x20. 20x20 with no real
   >> scheduling fits.
   >>   
   >> I tried displaying progress as an on-screen bitmap, but that just
   >> appeared noisy. I have been working on an SCP so I may be able to use
   >> that to manage the scheduling with some custom logic.
   >>   
   >>   
   >>   
   >   
   > Does 20x20 mean 400 cores with something like 80-bit initial values
   > and 128-bit processing?
   Yes. 128-bit processing.   
      
   > If you didn't make further mistakes then it should be rather big, on the
   > order of 300K LEs. It would not fit in the biggest low-end Altera device
   > (Cyclone 10GX 10CX220).
      
   IIRC it is about 150k LUTs. It uses the simplest approach, which is not
   very fast iteration-wise, but it allows a lot of cores and a fast clock
   cycle time. One core is very small and has a fast iteration (1 clock
   cycle). But it takes a lot more iterations than would be needed with a more
   brainiac approach.
      
   Using a more brainiac approach would likely cut the performance in half   
   and use a lot more LUTs.   
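The trade-off in iteration counts can be illustrated in software. The post does not say what the more brainiac approach would be; one common possibility, assumed here purely for illustration, is to fold the halving that always follows an odd step into the same iteration, since 3n+1 is even whenever n is odd. The function names are hypothetical, and both functions expect a start of 2 or more.

```python
def steps_simple(start):
    """One operation per iteration: n/2 if even, 3n+1 if odd."""
    n, steps = start, 0
    while n >= start:
        n = n >> 1 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

def steps_fused(start):
    """Assumed variant: an odd step computes (3n+1)/2 in one iteration."""
    n, steps = start, 0
    while n >= start:
        n = n >> 1 if n % 2 == 0 else (3 * n + 1) >> 1
        steps += 1
    return steps
```

Starting from 7, the simple form needs 11 iterations to drop below 7 and the fused form needs 7. The fused step puts a shift after the adder in the same clock, which is the kind of extra logic depth that could cost clock speed and LUTs.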
      
   > On the Xilinx side, it would not fit in any Spartan and in most Artix
   > devices. Maybe it would fit in the biggest Artix UltraScale+ (AU25P).
   > But more realistically for 400 test units one would want Arria 10 on
   > the Altera side or Kintex on the Xilinx side. Both are not prohibitively
   > expensive, but not cheap.
   >   
   > What device are you using?   
   I am not using a particularly inexpensive device. I have a Kintex-325 with about
   200k LUTs.
      
   > If it is smaller than Kintex KU5P then I'd strongly suspect that you   
   > didn't clean out all mistakes.   
   >   
   >   
   >   
   There could be more mistakes for sure. But I am sure the lowest level
   core works. It seemed to in simulation. I have made a slight mod to it since
   I last tested it: checking n < startn.
   Counts are not used so the logic would be stripped out, but it's good for
   verification.
      
   1 Core is:   
      
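   // Tests one starting value; done goes high once n has dropped below startn.
   // startn must be held stable while the core runs: it is compared every cycle.
   module Collatz_conjecture(rst, clk, startn, count, done);
   input rst;
   input clk;
   input [127:0] startn;
   output reg [127:0] count;
   output reg done;
   reg [127:0] n;
      
   // TRUE and FALSE are not defined in the posted code; assumed to be 1-bit constants.
   localparam TRUE = 1'b1;
   localparam FALSE = 1'b0;
      
   always_ff @(posedge clk)
   if (rst) begin
   	n <= startn;
   	count <= 0;
   	done <= FALSE;
   end
   else begin
   	if (!done) begin
   		if (~n[0])
   			n <= n >> 1;	// even: halve
   		else
   			n <= n+n+n+1;	// odd: 3n+1 (wraps silently past 128 bits)
   		// n is the value before this clock edge, so done is set
   		// one cycle after n first drops below startn.
   		if (n < startn)
   			done <= TRUE;
   		count <= count + 1;
   	end
   end
      
   endmodule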
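
For checking the count output against simulation, a cycle-by-cycle software model of the core can help. The following is a minimal sketch in Python, assuming reset has just loaded `startn` and that `startn` is held constant. The name `simulate_core` and the `max_cycles` cap are inventions for this example and do not exist in the RTL.

```python
MASK128 = (1 << 128) - 1  # registers are 128 bits wide and wrap silently

def simulate_core(startn, max_cycles=8192):
    """Return the count register once done is set, or None if it never is."""
    n, count, done = startn & MASK128, 0, False   # state right after reset
    for _ in range(max_cycles):
        # Nonblocking assignments: every right-hand side uses the old n.
        next_n = (n >> 1) if (n & 1) == 0 else (3 * n + 1) & MASK128
        next_done = n < startn
        n, count, done = next_n, count + 1, next_done
        if done:
            return count   # count stops advancing once done is set
    return None
```

`simulate_core(7)` returns 12, although 7 first drops below itself (to 5) after 11 iterations. The extra cycle comes from comparing the value of n from before the clock edge, so count ends up one higher than the number of iterations actually needed.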
      