

   sci.logic      Logic -- math, philosophy & computationa      262,912 messages   


   Message 261,459 of 262,912   
   Ross Finlayson to All   
   Re: Meta: a usenet server just for sci.m   
   29 Nov 25 03:54:06   
   
   [continued from previous message]   
      
   Thanks Claude. I'd suggest that the usual idea is that memory is copied
   to the nearest neighbor: the original DMA initiation points at a receive
   buffer that is neighborly (NUMA-local) to the thread that runs the
   re-routine body. Since the re-routine runs through and accesses the
   already-received memos, it's figured that they are mostly evaluated
   linearly within the routine (if thoroughly "non-blocking") by the
   threads that execute the body of the re-routine. Remote NUMA memory is
   at least accessible, if slower to access; the idea then is that mostly
   each core has its own re-routine executor, so the handlers each run on
   their own core, while devices that can't share the bus are handled
   according to the limits of the device protocol. Since the memo is
   mostly pointers to memory, it's unbounded, holding whatever the payload
   is in the example; those would generally come from a freelist of the
   FSM's memo states, as a list of pointers. The idea of the re-routine
   at first seems unintuitive: run the routines repeatedly and keep
   exiting out. Yet it pays off later, when the asynchronous routine is
   writable as the synchronous routine, including flow-of-control and
   exception handling, with some limitations or considerations about how
   to handle vector routines and the processing of ordered and un-ordered
   data. Here it's figured that these types of transactions fit within the
   model of the re-routine, free of "callback hell" yet also nicely
   non-blocking, though basically facing branch mis-prediction, usually
   for a "non-predictable" hint in the machine instruction, and their
   traces get involved by associating trace segments instead of a usual
   call stack. Then, mostly, it's nice that the definition of the actual
   logic is otherwise that of a synchronous implementation, for example to
   compose with a blocking backing, and to implement reference routines
   with or without automatic asynchronous concurrency.
      
      
   Thanks RF. This crystallizes several critical implementation details   
   that make re-routines practical.   
   **NUMA-Aware Memory Placement:**

   The DMA target selection is topology-aware from the start:

   ```
   // At re-routine submission:
   ReRoutine rr = {
       .micc      = vend_micc(),
       .home_core = current_core_id,
       .memo      = alloc_memo_on_numa_node(current_core_id),
       // DMA buffers pointed to by memo are also NUMA-local
   };

   // DMA completion handler routes to the home core's slique:
   void dma_done(MICC micc, PhysAddr payload) {
       int core_id = micc.home_hint;   // Encoded in the M-ICC
       write_to_slique(core_slique[core_id], (SliqueEntry){ micc, payload });
       send_ipi_to_core(core_id);      // Interrupt the home core
   }
   ```

   The "nearest neighbor" copying happens only on a re-route to a
   different core (rare, for load balancing). In the normal case, the memo
   and payload stay NUMA-local and are accessed at L2/L3 cache speed.
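   One way to make the handler's `micc.home_hint` lookup work without any
   shared table is to pack the home core into the identifier when it is
   vended. A minimal sketch, assuming a 48-bit sequence / 16-bit hint
   split (the layout and names are illustrative, not from the source):

   ```c
   #include <stdint.h>

   // Hypothetical M-ICC layout: low 48 bits are a per-core sequence
   // number, high 16 bits encode the home core, so a completion handler
   // can route the memo without consulting any shared state.
   typedef uint64_t MICC;

   #define MICC_HINT_SHIFT 48

   static inline MICC micc_make(uint16_t home_core, uint64_t seq) {
       return ((uint64_t)home_core << MICC_HINT_SHIFT)
            | (seq & ((1ULL << MICC_HINT_SHIFT) - 1));
   }

   static inline uint16_t micc_home_hint(MICC micc) {
       return (uint16_t)(micc >> MICC_HINT_SHIFT);
   }

   static inline uint64_t micc_seq(MICC micc) {
       return micc & ((1ULL << MICC_HINT_SHIFT) - 1);
   }
   ```

   Decoding is then a single shift in the interrupt path, which is why the
   hint can afford to live in the identifier itself.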
   **Per-Core Re-Routine Executors:**

   Each core runs its own independent executor:

   ```
   // Core 0:
   while (true) {
       rr = pop_local_queue(0);
       if (rr) execute_reroutine(rr);
   }

   // Core 1:
   while (true) {
       rr = pop_local_queue(1);
       if (rr) execute_reroutine(rr);
   }
   ```
      
   No shared queue contention - only the M-ICC vending is serialized (but   
   amortized via pre-allocated slabs per core).   
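   The "amortized via pre-allocated slabs" point can be sketched as each
   core refilling a private range of IDs from one shared counter, so the
   serialized step happens once per slab rather than once per vend (the
   slab size and the atomic counter are illustrative assumptions):

   ```c
   #include <stdint.h>
   #include <stdatomic.h>

   #define SLAB 1024  // IDs taken from the shared counter per refill (assumed)

   static _Atomic uint64_t global_next = 0;  // the only serialized state

   typedef struct {
       uint64_t next;  // next ID to hand out from the current slab
       uint64_t end;   // one past the last ID in the current slab
   } CoreVendor;       // one per core, touched only by its owning core

   // Vend a unique ID; touches the shared counter only once per SLAB vends.
   uint64_t vend_id(CoreVendor *v) {
       if (v->next == v->end) {  // slab exhausted: refill from shared counter
           v->next = atomic_fetch_add(&global_next, SLAB);
           v->end  = v->next + SLAB;
       }
       return v->next++;
   }
   ```

   The common path is a plain increment on core-private state, which is
   what keeps the vending off the contended bus.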
   For exclusive-access devices (e.g., a single UART, a non-SR-IOV NIC),
   the re-routine affinity is pinned:

   ```
   // Device with exclusive access:
   uart_reroutines → always execute on CORE_0 (device owner)

   // SR-IOV NIC with multiple queues:
   nic_queue[0] → CORE_0
   nic_queue[1] → CORE_1
   // No contention, perfect parallelism
   ```
      
   **Memo as Pointer List (Unbounded):**

   Now I understand: the memo isn't a fixed-size struct, it's a managed
   heap object:

   ```
   struct Memo {
       MICC     owner_micc;
       uint16_t step_count;  // How many memo steps filled so far
       void   **pointers;    // Array of pointers to payloads
   };

   // Allocated from a per-FSM-type freelist:
   Memo* alloc_memo(FSMType type) {
       return freelist_pop(memo_pools[type]);
   }

   // Each pointer in the memo points to actual data:
   memo->pointers[STEP_TCP_SOCK]  → Socket*     (8 bytes)
   memo->pointers[STEP_RX_BUF]    → Buffer*     (points to a 4KB payload)
   memo->pointers[STEP_HEADER]    → HTTPHeader* (variable size)
   ```
      
   The unbounded aspect is acceptable because:

   - Each FSM type has a known max depth (TCP connect = 5 steps, HTTP GET = 8 steps)
   - The freelist pre-allocates the worst case (e.g., 100 concurrent TCP connections × 5 pointers = 4KB total)
   - If the freelist exhausts → admission control (reject new re-routines)
   The pointers themselves are NUMA-local to the home core, so access is fast.   
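   The freelist-plus-admission-control policy can be sketched as a fixed
   pool whose allocator returns NULL on exhaustion, which the submit path
   treats as "reject the new re-routine" (the pool size and names here are
   illustrative):

   ```c
   #include <stddef.h>

   #define POOL_CAP 4  // worst-case concurrent memos for this FSM type (assumed)

   typedef struct Memo { struct Memo *next_free; void *pointers[8]; } Memo;

   static Memo  pool[POOL_CAP];
   static Memo *free_head = NULL;

   void memo_pool_init(void) {
       free_head = NULL;
       for (int i = 0; i < POOL_CAP; i++) {  // thread all slots onto the freelist
           pool[i].next_free = free_head;
           free_head = &pool[i];
       }
   }

   // Returns NULL when the pool is exhausted: the caller must reject the
   // submission (admission control) rather than block or heap-allocate.
   Memo *memo_alloc(void) {
       Memo *m = free_head;
       if (m) free_head = m->next_free;
       return m;
   }

   void memo_free(Memo *m) {
       m->next_free = free_head;
       free_head = m;
   }
   ```

   Because capacity is fixed up front, overload shows up as a clean reject
   at submission instead of unbounded memory growth mid-flight.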
   **Why Re-Routines Avoid Callback Hell:**

   Traditional async code:

   ```
   // Callback hell (unreadable):
   void http_get(URL url, Callback cb) {
       tcp_connect(url.host, 80, [url, cb](Socket s) {
           send_request(s, "GET /", [s, cb]() {
               recv_response(s, [cb](Response r) {
                   cb(r);
               });
           });
       });
   }
   ```
      
   Re-routine equivalent:

   ```
   // Linear, readable (synchronous-style):
   Response http_get_reroutine(URL url, Memo* m) {
       Socket s = tcp_connect(url.host, 80);  // May throw MEMO_INCOMPLETE
       send_request(s, "GET /");              // May throw MEMO_INCOMPLETE
       Response r = recv_response(s);         // May throw MEMO_INCOMPLETE
       return r;
   }
   ```
      
   The compiler transformation automatically inserts memo checkpoints at   
   each blocking call site. Developer writes synchronous code, runtime   
   provides async execution.   
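   What that transformation amounts to can be sketched by hand with a
   macro: each blocking call site becomes a memo check that either yields
   the cached value or bails out with MEMO_INCOMPLETE, so re-executing the
   function from the top is cheap and idempotent (the macro, step
   numbering, and names are illustrative):

   ```c
   #include <stddef.h>

   enum { STEP_SOCK = 0, STEP_RESP = 1, MAX_STEPS = 2 };
   enum { OK = 0, MEMO_INCOMPLETE = -1 };

   typedef struct {
       int   filled[MAX_STEPS];
       void *value[MAX_STEPS];
   } Memo;

   // Expansion of one "blocking" call site: use the memoized result if
   // present, otherwise exit; a later re-execution will pass this point.
   #define AWAIT(m, step, out)          \
       do {                             \
           if (!(m)->filled[step])      \
               return MEMO_INCOMPLETE;  \
           (out) = (m)->value[step];    \
       } while (0)

   // Transformed body: written linearly, re-run from the top each time.
   int http_get_reroutine(Memo *m, void **result) {
       void *sock, *resp;
       AWAIT(m, STEP_SOCK, sock);   // tcp_connect(...) checkpoint
       AWAIT(m, STEP_RESP, resp);   // recv_response(...) checkpoint
       (void)sock;
       *result = resp;
       return OK;
   }

   // A responder fills a step, letting the next re-execution advance.
   void memo_fill(Memo *m, int step, void *v) {
       m->value[step]  = v;
       m->filled[step] = 1;
   }
   ```

   Each re-execution is a fast replay over filled steps until it hits the
   first unfilled one, which is the "run repeatedly and keep exiting out"
   behavior described earlier.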
   **Branch Misprediction & Non-Predictable Hints:**

   This is the performance cost you accept. Every memo check is a branch:

   ```
   if (!memo->has(STEP_N)) {
       // Cold path - first time through
       throw MEMO_INCOMPLETE;
   }
   // Hot path - memo filled
   value = memo->get(STEP_N);
   ```
      
   On first execution: branch mispredicted (took exception path).   
   On retry after memo fill: branch correctly predicted (took fast path).   
   For truly unpredictable branches (e.g., packet reordering), you hint
   the CPU:

   ```
   ; x86-64:
   cmp  [memo + offset], 0
   jnz  .filled
   ; Hint: this branch is unpredictable
   .byte 0x2E  ; CS segment override (acts as a branch hint on some CPUs)
   ```
      
   Modern CPUs (Intel Alder Lake+, ARM Neoverse) support explicit branch   
   prediction hints in microarchitecture. The TTASL compiler emits these.   
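   In portable C, the closest widely available analogue is GCC/Clang's
   `__builtin_expect`, which steers code layout (and some static
   predictors) toward the hot retry path. A sketch of the memo check with
   the cold first pass marked unlikely (names illustrative):

   ```c
   // GCC/Clang: mark the "memo not yet filled" path as cold, so the
   // retry path (memo filled) is laid out fall-through. This is a
   // compiler-level analogue of the machine-level hint above.
   #define LIKELY(x)   __builtin_expect(!!(x), 1)
   #define UNLIKELY(x) __builtin_expect(!!(x), 0)

   enum { MEMO_INCOMPLETE = -1 };

   int memo_get(const int *filled, const int *values, int step, int *out) {
       if (UNLIKELY(!filled[step]))  // cold: only the first execution
           return MEMO_INCOMPLETE;
       *out = values[step];          // hot: every re-execution after the fill
       return 0;
   }
   ```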
      
   **Trace Segments Replace Call Stacks:**   
      
   Traditional debugging:   
   ```   
   Call stack:   
   http_get+0x45   
   tcp_connect+0x12   
   bind+0x08   
   [blocked in kernel]   
   ```   
      
   Re-routine debugging:   
   ```   
   Trace segments keyed by M-ICC 0x123456:   
   [T0] SUBMIT http_get_reroutine STEP=0   
   [T1] EXECUTE http_get_reroutine STEP=0 → MEMO_INCOMPLETE (STEP_TCP)   
   [T2] EXECUTE tcp_connect_reroutine STEP=0 → MEMO_INCOMPLETE (STEP_SOCK)   
   [T3] RESPONDER socket_ready M-ICC=0x123456 filled STEP_SOCK   
   [T4] EXECUTE tcp_connect_reroutine STEP=1 → COMPLETE   
   [T5] EXECUTE http_get_reroutine STEP=1 → MEMO_INCOMPLETE (STEP_SEND)   
   ...
   ```
   
   The trace segments form a logical call graph that survives across   
   re-executions. You can reconstruct the full execution history by joining   
   on M-ICC, even if the re-routine migrated cores or was preempted 100 times.   
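   That "joining on M-ICC" reconstruction can be sketched as filtering a
   flat, append-only trace log (the record layout is illustrative):

   ```c
   #include <stdint.h>
   #include <string.h>

   typedef struct {
       uint64_t    micc;   // key: survives core migration and preemption
       const char *event;  // e.g. "SUBMIT", "EXECUTE", "RESPONDER"
       int         step;
   } TraceSeg;

   #define TRACE_CAP 64
   static TraceSeg trace_log[TRACE_CAP];
   static int      trace_len = 0;

   // Emit one trace segment; in a real system this would be per-core
   // and lock-free, here it's a simple global buffer.
   void trace_emit(uint64_t micc, const char *event, int step) {
       if (trace_len < TRACE_CAP)
           trace_log[trace_len++] = (TraceSeg){micc, event, step};
   }

   // Join: copy all segments for one M-ICC, in emission order, into out.
   // Returns the number of segments found.
   int trace_join(uint64_t micc, TraceSeg *out, int cap) {
       int n = 0;
       for (int i = 0; i < trace_len && n < cap; i++)
           if (trace_log[i].micc == micc)
               out[n++] = trace_log[i];
       return n;
   }
   ```

   Because the key is the M-ICC rather than a stack pointer, the joined
   sequence is the same no matter which cores the executions ran on.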
   **Reference Implementation: Sync + Async Versions:**

   The beauty is that you can compose both:

   ```
   // Synchronous blocking version (for testing):
   Response http_get_sync(URL url) {
       Socket s = tcp_connect(url.host, 80);  // Blocks until complete
       send_request(s, "GET /");              // Blocks
       return recv_response(s);               // Blocks
   }

   // Asynchronous non-blocking version (production):
   Response http_get_async(URL url, Memo* m) {
       Socket s = tcp_connect_reroutine(url.host, 80, m->submemo(0));
       send_request_reroutine(s, "GET /", m->submemo(1));
       return recv_response_reroutine(s, m->submemo(2));
   }
   ```
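   A sketch of how an executor could drive the async version to
   completion: re-execute the body on every memo fill until it returns OK,
   which is what makes the same linear logic behave synchronously under
   the hood (the two-step plan and names are illustrative):

   ```c
   enum { OK = 0, MEMO_INCOMPLETE = -1 };

   typedef struct { int filled[4]; int value[4]; } Memo;

   // The re-routine body: linear, re-entered from the top each time.
   static int body(Memo *m, int *result) {
       if (!m->filled[0]) return MEMO_INCOMPLETE;  // "tcp_connect" step
       if (!m->filled[1]) return MEMO_INCOMPLETE;  // "recv_response" step
       *result = m->value[1];
       return OK;
   }

   // Executor loop: each iteration simulates one responder filling the
   // next step, then re-runs the body. Returns how many executions ran.
   int drive_to_completion(Memo *m, int *result) {
       int runs = 0, next_fill = 0;
       for (;;) {
           runs++;
           if (body(m, result) == OK) return runs;
           m->filled[next_fill] = 1;                // responder fills a step
           m->value[next_fill]  = next_fill + 100;  // fake payload
           next_fill++;
       }
   }
   ```

   With two pending steps, the body runs three times: two early exits and
   one completing pass, mirroring the trace-segment history above.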
      
   The same logic, just a different execution model. The TTASL compiler can
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca