From: anton@mips.complang.tuwien.ac.at   
      
   scott@slp53.sl.home (Scott Lurndal) writes:   
   >Thomas Koenig writes:   
   >>Anton Ertl schrieb:   
   >>> Thomas Koenig writes:   
   >>>>I recently heard that CS graduates from ETH Zürich had heard about   
   >>>>pipelines, but thought it was fetch-decode-execute.   
   >>>   
   >>> Why would a CS graduate need to know about pipelines?   
   >   
   >So they can properly simulate a pipelined processor?
      
   Sure, if a CS graduate works in an application area, they need to   
   learn about that application area, whatever it is.   
      
   But why would knowledge about processor pipelines be part of their CS   
   curriculum?   
      
   >When I got my MSCS, computer engineering courses were   
   >required, including basic logic elements and overviews   
   >of processor design.   
      
   For me, too. I even learned something about processor pipelines, in a   
   specialized elective course.   
      
   >>Why would anybody know the basics of what they are doing?   
      
   Processor pipelines are not the basics of what a CS graduate is doing.   
   They are an implementation detail in computer engineering.   
      
   >Indeed, a programmer that doesn't understand the underlying   
   >hardware is crippled.   
      
   I certainly have a lot of sympathy for that point of view. However,   
   there are a lot of abstractions whose cost a programmer should   
   understand if they intend to write efficient code, e.g., the memory   
   hierarchy or system calls.   
      
   But CPU pipelines have the nice property that they are mostly   
   transparent. What you need to understand for performance is the   
   latency of various instructions, and the costs of branch   
   misprediction. I teach a course "Efficient programs", and I do not   
   discuss hardware pipelining, but I do explain these performance   
   characteristics.   
      
   If anything, understanding OoO execution and its effect on
   performance is more relevant. But judging by the dearth of textbooks,
   and the fact that Henry Wong did his thesis on his own initiative,
   that topic seems to be of little interest even among computer
   engineering professors.
      
   Back to programmers: There is also the other POV that programmers   
   should never concern themselves with low-level details and should   
   always leave that to compilers, which supposedly can do all those   
   things better than programmers (I call that the compiler supremacy   
   position). Compiler supremacy is wishful thinking, but wishful   
   thinking has a strong influence in the world.   
      
   A few more examples where compilers are not as good as even I expected:   
      
   Just today, I compiled   
      
   u4 = u1/10;   
   u3 = u1%10;   
      
   (plus some surrounding code) with gcc-14 in three contexts. Here's   
   the code for two of them (the third one is similar to the second one):   
      
   movabs $0xcccccccccccccccd,%rax    movabs $0xcccccccccccccccd,%rsi
   sub $0x8,%r13                      mov %r8,%rax
   mul %r8                            mov %r8,%rcx
   mov %rdx,%rax                      mul %rsi
   shr $0x3,%rax                      shr $0x3,%rdx
   lea (%rax,%rax,4),%rdx             lea (%rdx,%rdx,4),%rax
   add %rdx,%rdx                      add %rax,%rax
   sub %rdx,%r8                       sub %rax,%r8
   mov %r8,0x8(%r13)                  mov %rcx,%rax
   mov %rax,%r8                       mul %rsi
                                      shr $0x3,%rdx
                                      mov %rdx,%r9
      
   The major difference is that in the left context, u3 is stored into   
   memory (at 0x8(%r13)), while in the right context, it stays in a   
   register. In the left context, gcc managed to base its computation of   
   u1%10 on the result of u1/10; in the right context, gcc first computes   
   u1%10 (computing u1/10 as part of that), and then computes u1/10   
   again.   
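
   For readers who do not want to decode the assembly, here is a rough
   C rendering of what the left sequence computes. It is only a sketch:
   the function names are made up for illustration, and unsigned
   __int128 is a GCC/Clang extension, assumed here to get at the high
   half of the product that the one-operand MUL leaves in %rdx.

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of a 64x64-bit product, i.e. what the one-operand
   x86-64 MUL leaves in %rdx. unsigned __int128 is a GCC/Clang
   extension (assumption: a 64-bit target that supports it). */
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* u/10 by multiplying with the reciprocal constant from the listing
   and shifting the high half right by 3 (mul, then shr $0x3). */
uint64_t div10(uint64_t u)
{
    return mulhi64(u, 0xcccccccccccccccdULL) >> 3;
}

/* u%10 derived from an already computed quotient, as in the left
   context: lea forms q*5, add doubles it to q*10, sub takes it from u. */
uint64_t mod10_from_quot(uint64_t u, uint64_t q)
{
    return u - q * 10;
}

/* Compare the sketch against the plain C operators for one value. */
void check_divmod10(uint64_t u)
{
    uint64_t q = div10(u);
    assert(q == u / 10);
    assert(mod10_from_quot(u, q) == u % 10);
}
```

   The right sequence performs the multiply and shift of div10() twice,
   once on the way to the remainder and once more for the quotient.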
      
   Then I checked whether there is an unsigned equivalent of ldiv(), but
   there is none, supposedly because compilers manage to combine the /
   and % operations by themselves.
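
   To make concrete what such a function could look like, here is a
   minimal sketch modeled on ldiv_t and ldiv(). The names udiv_result
   and udiv64 are hypothetical; nothing like them exists in the C
   standard library. The remainder is derived from the quotient, so
   only one division appears in the source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, modeled on ldiv_t from <stdlib.h>. */
typedef struct {
    uint64_t quot;
    uint64_t rem;
} udiv_result;

/* Hypothetical unsigned counterpart of ldiv(): one division, with the
   remainder computed from the quotient by a multiply and a subtract. */
udiv_result udiv64(uint64_t numer, uint64_t denom)
{
    udiv_result r;
    assert(denom != 0);               /* division by zero is undefined */
    r.quot = numer / denom;
    r.rem  = numer - r.quot * denom;  /* reuse the quotient */
    return r;
}
```

   A call such as udiv64(u1, 10) would then deliver u4 in .quot and u3
   in .rem. Whether the compiler generates better code for this than
   for the separate / and % is something one would have to check in
   each context.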
      
   I also found that the resulting code was slower on a Rocket Lake than   
   a variant of the code that passes the divisor in a variable, but   
   that's ok: On Skylake and earlier CPUs division is so slow that the   
   replacement code is probably faster.   
      
   - anton   
   --   
   'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'   
    Mitch Alsup,    
      