muta...@gmail.com wrote:   
   > On Monday, January 31, 2022 at 4:02:01 AM UTC+11, anti...@math.uni.wroc.pl   
   wrote:   
   >   
   > > Well, compiler algorithms were hard to invent. But now   
   > > they are described in books and it is not hard to   
   > > reproduce algorithm from a book.   
   >   
   > It is when you don't understand the algorithm in the book.   
   >   
   > My brain wasn't designed for complex algorithms.   
   >   
   > I have other skills though.   
      
   Hmm, the algorithms in a compiler are really no more complicated
   than the algorithms in an OS, database, BBS or even a text editor.
      
   Actually the main thing is bookkeeping: you need to make sure
   that the right information is at the places that need it.
      
   There is a problem of scale if you go for a full compiler.
   But this is not too bad. A self-compiling toy compiler
   for a subset of C should be doable in 2-3 kloc. At about
   20-25 kloc you should be able to get C90 compliance.
   In our time those are not really big programs.
      
   > > Concerning SubC, it looks like rather bad start for   
   > > a full C compiler.   
   >   
   > "full" meaning what level of the standard? I'm only   
   > interested in C90.   
      
   Already C90 has many things which are problematic to add
   to SubC. For me full support for structs is non-negotiable.
   That does not need much code, but AFAICS it requires changing
   frequently used compiler data structures, and that is a lot
   of code to change.
      
   Floating point is another thing: it is reasonable to skip
   floating point in an initial toy and add it only at a later
   stage. But the compiler structure should support easy addition,
   which means supporting different modes of variables. Which
   essentially means that if you want to add floating point
   later you should support different sizes of short and long
   relatively early.
      
   Another thing is bitfields: personally I make little use   
   of them, but sometimes they are quite useful. And there   
   are a lot of programs using them.   
      
   Let me add that if I were to pick my subset of C it
   probably would include limited VLAs. Namely, AFAIK code
   like:
      
   void mat_add(int n, int a[n][n], int b[n][n]) {
       int i, j;
       for (i = 0; i < n; i++) {
           for (j = 0; j < n; j++) {
               a[i][j] += b[i][j];
           }
       }
   }
      
   is valid C99. And one could write essentially identical
   code in Fortran 66, and it is not very hard to support
   (full VLA support needs much more effort).
      
   > > There are design decisions that   
   > > make it more complex than necessary simultanously   
   > > limiting what it can do. In particular using two symbol   
   > > tables and acumulator model for code generation.   
   >   
   > Could you explain these problems in more detail please?   
   > I'd like to understand what is wrong with it.   
      
   Scope in programming languages is recursive. You may claim
   that in C there are "really" only two scopes: file scope and
   function scope, but the C90 standard says differently. So,
   for a correct implementation of C you need to handle multiple
   scopes. Now, there are well-established methods to handle
   multiple scopes using a single symbol table. The simplest one
   just appends new definitions at the end of the symbol table. It also
   remembers the position in the symbol table corresponding to the start
   of a scope, and when the inner scope ends it simply resets the end to
   the remembered position (which effectively removes all entries from the
   inner scope and restores the entries from the outer scope). There is
   a variation of this using hash tables. So two symbol
   tables in SubC mean that you get a _more_ complicated
   compiler which is less capable than a simpler one.
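   
   To make this concrete, here is a minimal sketch of that single-table
   scheme. The names (sym_add, scope_enter, ...) and the fixed table
   size are just my illustration, not code from SubC or any real compiler:
   
   #include <string.h>
   
   struct sym { char name[32]; int type; };
   
   static struct sym table[1024];    /* the single symbol table */
   static int nsyms = 0;             /* current end of the table */
   
   /* entering a scope just records the current end of the table */
   int scope_enter(void) { return nsyms; }
   
   /* leaving a scope resets the end, dropping all inner definitions
      and thereby restoring the outer ones */
   void scope_leave(int mark) { nsyms = mark; }
   
   void sym_add(const char *name, int type) {
       strcpy(table[nsyms].name, name);
       table[nsyms].type = type;
       nsyms++;
   }
   
   /* searching backwards finds the innermost definition first */
   struct sym *sym_lookup(const char *name) {
       int i;
       for (i = nsyms - 1; i >= 0; i--)
           if (strcmp(table[i].name, name) == 0)
               return &table[i];
       return 0;
   }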
      
   Actually, the two symbol tables are the tip of the iceberg.
   Natural handling of types is recursive: a pointer type
   contains the type of the thing pointed to, a structure has a list
   of fields and each of the fields has its own type. In a
   compiler it is convenient to have function types;
   such a type has the type of the return value and the argument types
   (C does not have function types but has pointers to
   functions, and a function type in the compiler simplifies
   handling of such pointers). SubC apparently encodes
   several combinations of types as integers and
   uses switches to handle them. Each switch branch
   needs its own code and due to the combinations you
   get more code. With recursive handling you have
   fewer cases, so simpler code and a more capable
   compiler.
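   
   A sketch of what such a recursive type representation might look
   like (again the names are mine, purely for illustration):
   
   struct field;
   
   struct type {
       enum { TY_INT, TY_CHAR, TY_PTR, TY_STRUCT, TY_FUNC } kind;
       struct type *ref;      /* pointee for TY_PTR, return type for TY_FUNC */
       struct field *fields;  /* field list for TY_STRUCT */
       struct type **args;    /* argument types for TY_FUNC */
       int nargs;
   };
   
   struct field {
       char *name;
       struct type *ty;       /* each field carries its own type */
       struct field *next;
   };
   
   One recursive walk over such a structure handles pointers to structs,
   arrays of pointers, pointers to functions and so on, without a
   separate switch case for each combination.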
      
   After a second thought I think there is a reason for the
   unnecessary complexity in SubC. Namely, IIUC
   originally SubC had no support for structures.
   Nils wanted self-compilation, which implied
   no structures in the compiler. The recursive approach
   heavily depends on structures, so Nils probably
   thought that without structures the "flat" approach
   is simpler. But avoidance of structures (I saw
   only one structure type in the SubC source) means
   that the code is more complicated and harder to
   understand.
      
   Concerning the accumulator machine: there are several
   machine models usable for code generation. Basically,
   the main part of the compiler generates instructions for a
   simple abstract machine and the code generator is
   responsible for translating instructions from the
   abstract version to actual machine code. In the
   simplest version each abstract instruction is
   separately translated to one or more machine
   instructions.
      
   Popular abstract machines are stack machines
   and 3-address representation. What is wrong
   with an accumulator machine as the abstract target?
   Well, there is one accumulator, so there are
   a lot of moves to/from the accumulator. Which
   means more work when generating instructions
   for the abstract machine and more work in the code
   generator. And you either get poor object
   code or spend significant effort combining
   abstract instructions into machine instructions.
   Compared to that, 3-address representation
   means that when generating abstract instructions
   you need to allocate space for temporaries
   (which is easy), and apart from the temporaries
   the other parts are easier. When generating
   machine instructions from 3-address
   representation the code generator is simpler
   and has more opportunity to make better use of machine
   instructions.
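   
   For illustration, a 3-address instruction can be as simple as this
   (my own sketch, not taken from any of the compilers mentioned here):
   
   enum op { OP_LOAD, OP_STORE, OP_ADD, OP_SUB, OP_MUL };
   
   /* one abstract instruction: dst = src1 op src2, where the operands
      are slot numbers naming locals or compiler-allocated temporaries */
   struct tac {
       enum op op;
       int dst, src1, src2;
   };
   
   A naive code generator then walks the instruction list and emits a
   fixed machine-code pattern for each entry; a smarter one can look at
   where dst and the sources live and pick better instructions.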
      
   Let me add that I did a really simple code
   generator using 3-address representation
   (as part of a toy compiler). This was mainly
   to see how bad the generated code would be.
   Initially it was really bad, maybe a little
   worse than code from SubC. But I noticed
   a few little tricks. The main one being that
   each expression has some destination and the
   machine code generator directly produces the
   result in this destination. At the abstract
   level a little care ensured that the final
   destination is propagated down, so many
   moves generated by the earlier version are no
   longer needed.
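   
   The destination trick can be sketched roughly as below. gen_expr,
   new_temp and the emit_* helpers are hypothetical names; the point
   is only that the final destination is passed down the recursion:
   
   struct node { int kind, value; struct node *left, *right; };
   enum { N_CONST, N_ADD };
   
   typedef int loc;                   /* a register number or stack slot */
   
   loc new_temp(void);                /* assumed helpers, not shown here */
   void emit_move_imm(loc dst, int value);
   void emit_add(loc dst, loc src);
   
   /* generate code for e, producing the result directly in dest */
   void gen_expr(struct node *e, loc dest) {
       switch (e->kind) {
       case N_CONST:
           emit_move_imm(dest, e->value);
           break;
       case N_ADD: {
           loc t = new_temp();
           gen_expr(e->left, dest);   /* left operand lands in dest */
           gen_expr(e->right, t);
           emit_add(dest, t);         /* dest += t, no extra move */
           break;
       }
       }
   }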
      
   Maybe as another comparison, I also did a machine
   code generator for a "production" compiler (this was an
   addition to an existing compiler, the other parts were already
   done, but it did not support the machines that I wanted). This
   was a fancy language having more features than
   C. The code generator for x86_64 has 2291 lines.
   The code generator for 32-bit ARM has 1579.
   Here the representation is again 3-address representation
   (but there is an explicit stack and an address may be
   on the stack). I dare to say that both generate
      
   [continued in next message]   
      