comp.lang.c
Meh, in C you gotta define EVERYTHING
243,242 messages
Message 242,104 of 243,242
BGB to bart
Re: _BitInt(N) (1/2)
24 Nov 25 13:12:54
   From: cr88192@gmail.com   
      
   On 11/24/2025 8:21 AM, bart wrote:   
   > On 24/11/2025 13:35, Keith Thompson wrote:   
   >> bart  writes:   
   >> [...]   
   >>> There are two kinds of BitInts: those smaller than 64 bits; and those   
   >>> larger than 64 bits, sometimes /much/ larger.   
   >>   
   >> As far as I know, the standard makes no such distinction.   
   >   
   > *I* am making the distinction. From an implementation point of view (and   
   > assuming 64-bit hardware), they are quite different.   
   >   
   > And that leads to different kinds of language features.   
   >   
      
   As noted, as I understand it there is no reason for the storage to be   
   smaller than the next power-of-2 size.   
      
   Supporting odd-sized values in memory would have added a lot more of a   
   pain in terms of making things efficient (it is a lot more of an issue   
   to store a 24-bit or 40-bit item to memory than 32 or 64).   
      
   Though, one possibility could be "__packed _BitInt(n)" where in this   
   case it would handle them as the nearest multiple of 8 bits rather than   
   as the nearest power-of-2.   
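
A minimal sketch of the two storage policies described above, for comparison. Both helper names are hypothetical (they are not part of any compiler or ABI), and real ABIs may pick yet another rule for very wide types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes of storage if _BitInt(nbits) is padded up to
   the next power-of-2 size (1, 2, 4, 8, 16, ... bytes). */
size_t bitint_bytes_pow2(unsigned nbits)
{
    assert(nbits >= 1);
    size_t bytes = 1;
    while (bytes * 8 < nbits)
        bytes *= 2;
    return bytes;
}

/* Hypothetical helper: bytes of storage for a "__packed" variant that only
   rounds up to the next multiple of 8 bits. */
size_t bitint_bytes_packed(unsigned nbits)
{
    assert(nbits >= 1);
    return ((size_t)nbits + 7) / 8;
}
```

For a 24-bit value this gives 4 bytes padded versus 3 bytes packed; for 40 bits, 8 versus 5. The packed sizes are exactly the ones that no single power-of-2 load or store covers.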
      
      
   At least on my ISA design, Load/Store ops are mostly only available in
   power-of-2 sizes, and the direct displacement case is limited to natural   
   alignment (though using RISC-V encodings can sidestep this limitation in   
   the case of the XG3 variant, or if targeting RISC-V, *).   
      
      
   *: In my case, the ISA has split into multiple variants:   
      XG1: Its original form.   
        16/32/64/96 bit instructions.   
        Mostly 5-bit register fields.   
      XG2: Modified.   
        Loses 16-bit encodings;   
        Gains slightly larger immediate values;   
        All register fields expand to 6 bits;   
        Encoding scheme is slightly dog-chewed.   
      XG3:   
        Instructions were repacked to be compatible with RISC-V;   
        Register numbering was made compatible with RISC-V;   
        Un-dog-chewed the encoding scheme some vs its predecessors;   
        Instruction stream can be mixed/matched with RV64G.   
          However, while both RV64G and XG3 ops support superscalar
          operation, for reasons, my CPU core can't co-issue RV64 and
          XG3 instructions.
            So, it is more like the ISA can flip/flop every clock-cycle.
      
   However, can note that RISC-V also still lacks NPOT memory operations.   
      
   And, if your memory store looks like:   
      SRLI  X6, X10, 16
      SH    X10, 13(X12)
      SB    X6, 15(X12)
      
   This isn't great; don't want to pay these sorts of penalties without reason.
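
For reference, a rough C-level sketch of what a 3-byte store and load amount to when only power-of-2 accesses exist. The names store24_le and load24_le are hypothetical, and little-endian byte order is assumed. It is written byte-by-byte for portability; a compiler for a little-endian target may merge the first two bytes into one 16-bit access, leaving a 16-bit piece plus an 8-bit piece:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store the low 24 bits of v at p (any alignment),
   little-endian, touching exactly 3 bytes. */
void store24_le(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);          /* bits 0..7   */
    p[1] = (unsigned char)((v >> 8) & 0xFFu);   /* bits 8..15  */
    p[2] = (unsigned char)((v >> 16) & 0xFFu);  /* bits 16..23 */
}

/* Hypothetical helper: load 24 bits from p, zero-extended to 32 bits. */
uint32_t load24_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16);
}
```

The point is just that the neighbouring byte must not be touched, so a plain 32-bit store is not an option here.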
      
   For odd-sized _BitInt, one pays the cost mostly by using sign/zero   
   extension on certain operations.   
      
   In basic forms of both ISAs, this can be done via a pair of shift   
   instructions, say, zero-extending 24 bits:   
      SLLI  X10, X10, 40   
      SRLI  X10, X10, 40   
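
The same thing at the C level, as a sketch (zext_bits and sext_bits are hypothetical names). The zero-extension is the shift pair directly; the sign-extension uses an xor/subtract form here to avoid depending on right-shifting a negative value, whereas hardware would typically use a shift-left plus arithmetic-shift-right pair:

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extend the low `bits` bits of x: shift left, then logical shift
   right by the same amount (the SLLI/SRLI pattern). */
uint64_t zext_bits(uint64_t x, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    unsigned sh = 64 - bits;
    return (x << sh) >> sh;
}

/* Sign-extend the low `bits` bits of x. The final unsigned-to-signed
   conversion assumes the usual two's complement wraparound behaviour. */
int64_t sext_bits(uint64_t x, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    uint64_t sign = (uint64_t)1 << (bits - 1);
    uint64_t low  = zext_bits(x, bits);
    return (int64_t)((low ^ sign) - sign);
}
```

So zext_bits(x, 24) keeps only bits 0..23, and sext_bits(0xFFFFFF, 24) comes out as -1.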
      
   In my case, there is an optional feature that can allow this to be   
   encoded as a single instruction. Although the instruction in question   
   uses a 64-bit encoding; so doesn't save any code-size over the pair of   
   shifts, but is faster; partly also because in my CPU core most   
   instructions have a minimum latency of 2 clock cycles; which isn't ideal   
   for a lot of RISC-V's patterns.   
      
   Though, on the CPU in question, the ideal scheduling isn't so much to   
   try to reuse a register immediately, but if possible to put around 5   
   instructions between modifying a register and trying to access its value   
   again (but, this case really sucks for some constructs in RV).   
      
   Like, one can't optimally schedule an array index load (needs 3   
   instructions in RV64G) when such scheduling will most likely exceed the   
   total length of the loop body (and trying to modulo-schedule array-loads   
   is just kinda absurd).   
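
To spell out the 3-instruction case, here is a sketch of an array index load written as its separate steps. The function name is hypothetical, and the SLLI/ADD/LW mapping in the comments is the usual lowering one would expect for a base RV64G target, not output from any particular compiler:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* a[i] for a 32-bit element, split into three dependent steps:
   scale the index, add the base, then load. */
int32_t load_indexed_i32(const int32_t *base, size_t i)
{
    assert(base != NULL);
    size_t offset = i << 2;                         /* SLLI: i * 4     */
    const unsigned char *addr =
        (const unsigned char *)base + offset;       /* ADD: base + off */
    int32_t value;
    memcpy(&value, addr, sizeof value);             /* LW              */
    return value;
}
```

Each step depends on the one before it, which is what makes the chain awkward to spread out when the loop body is short.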
      
   Well, technically, CPU isn't VLIW (at least for RV64 and XG3, XG1 and   
   XG2 were "LIW"), but being 3-wide in-order, optimal case for performance   
   is still to try to schedule things as-if they were (V)LIW.   
      
   Though, the spacing drops to 3 intermediate instructions if scheduling   
   for 2-wide; which may make sense either if there isn't sufficient ILP to   
   optimize for 3-wide scheduling (most of the time) or the code is doing   
   things that hinder 3-wide operation (minority case; but can happen as   
   the 3rd lane in this case only does basic ALU instructions and is   
   "eaten" by certain instructions, such as indexed-store, etc).   
      
   ...   
      
      
   My compiler still doesn't deal with all of this well (and sorta blows it   
   off in the case of targeting RV64G or RV64GC), but this sort of thing   
   seems to be sort of a pain case in general (and it sorta helps if the   
   programmer also writes their code in a way that helps the compiler along
   here; but helps some if ISA design limitations don't actively hinder the   
   ability to generate efficient code in this area).   
      
   ...   
      
      
   Though, had noted that (curiously) writing code as-if one were targeting   
   a modulo-scheduled VLIW seems to help with x86-64 as well, even if   
   x86-64 has nowhere near enough registers to benefit here (it is almost   
   as-if x86-64 has a mechanism in place to cheapen the cost of stack   
   spills and reloads).   
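
As an illustration of what "writing code as-if for a modulo-scheduled machine" can look like in C, here is a small hand-pipelined dot product. This is only a sketch with a hypothetical function name; whether it actually helps depends on the compiler and target:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product where the loads for iteration i are issued before the
   multiply-add for iteration i-1, so the loaded values have a full loop
   trip of distance before they are used. */
int64_t dot_pipelined(const int32_t *a, const int32_t *b, size_t n)
{
    if (n == 0)
        return 0;
    assert(a != NULL && b != NULL);

    int64_t sum = 0;
    /* Prologue: start the loads for iteration 0. */
    int32_t a_cur = a[0];
    int32_t b_cur = b[0];

    for (size_t i = 1; i < n; i++) {
        /* Stage 1 of iteration i: loads, independent of the work below. */
        int32_t a_next = a[i];
        int32_t b_next = b[i];
        /* Stage 2 of iteration i-1: consume values loaded one trip ago. */
        sum += (int64_t)a_cur * b_cur;
        a_cur = a_next;
        b_cur = b_next;
    }
    /* Epilogue: finish the last iteration. */
    sum += (int64_t)a_cur * b_cur;
    return sum;
}
```

Note the extra live values (a_next/b_next alongside a_cur/b_cur); deeper pipelining multiplies this, which is where register count starts to matter.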
      
   In my case, I had instead used 64 GPRs (from the RV64G POV, it is just   
   the X and F register spaces glued together). Where 64 is mostly enough   
   to competently modulo-schedule things and not run out of registers.   
      
   Though, it is only some kinds of code that can benefit from the power of   
   64 GPRs.   
      
      
   But, yeah, in any case, I guess the main issue is that NPOT loads/stores   
   would suck here in the absence of dedicated CPU instructions (in a   
   similar way to how much RV64G is hurt by lacking indexed-load/store;
   where array operations are often very common in the types of code one   
   might want to optimize via modulo scheduling the loop).   
      
   But, you don't really want to add NPOT Load/Store instructions either,   
   because this mostly just offloads the pain onto the CPU.
      
   ...   
      
      
      
   > If the possibilities above 64 bits were less ambitious (say i128 and   
   > i256), then the concept might be stretched to cover both. But not when   
   > you can also have i1234567.
   >   
   > It would be having a GETBITS macro, which is not limited to a 1- to 63-   
   > bit bitfield of a u64 value, but could return a slice of an arbitrarily   
   > large array.   
   >   
      
   I added some Verilog style notation, which can in principle be used for
   large _BitInts. However this case is untested and very likely runs into   
   an "implementation hole" for types larger than 128 bits.   
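
For the general GETBITS case mentioned in the quoted text, a sketch of a bit-slice extract over an arbitrarily large value stored as an array of 64-bit words. The function name and the word layout (word 0 holds bits 0..63) are assumptions for illustration, and this is not how any particular compiler implements it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extract `width` bits (1..64) starting at bit `lsb` from an array of
   64-bit words, where word 0 holds bits 0..63. */
uint64_t getbits_wide(const uint64_t *words, size_t nwords,
                      size_t lsb, unsigned width)
{
    assert(width >= 1 && width <= 64);
    assert(lsb + width <= nwords * 64);

    size_t   word = lsb / 64;
    unsigned off  = (unsigned)(lsb % 64);

    uint64_t value = words[word] >> off;
    /* If the slice straddles a word boundary, pull in the high part. */
    if (off != 0 && off + width > 64)
        value |= words[word + 1] << (64 - off);

    if (width < 64)
        value &= ((uint64_t)1 << width) - 1;
    return value;
}
```

The result is still limited to 64 bits per call; returning a wider slice would mean looping this over the destination words.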
      
      
   >>   
   >>> I had been responding to the claim that those smaller types save   
   >>> memory, compared to using sizes 8/16/32 bits which are commonly   
   >>> available and have better hardware support.   
   >>   
   >> I don't recall any such claim.  Do you have a citation (other than   
   >> the FPGA-specific wording in N2709)?   
   >   
   > This is where it came up in this thread:   
   >   
   > On 23/11/2025 11:46, Philipp Klaus Krause wrote:   
   >  > Am 22.10.25 um 14:45 schrieb Thiago Adams:   
   >  >>   
   >  >>   
   >  >> Is anyone using or planning to use this new C23 feature?   
      
   [continued in next message]   
      