... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.programming
Programming issues that transcend langua
57,431 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 56,946 of 57,431
Stefan Ram to All
Scanning
19 Jan 23 12:10:36
   From: ram@zedat.fu-berlin.de   
      
     Some idle thoughts about scanning (lexical analysis, or   
     rather what comes before it) ...   
      
     Let's take a very simple task: This scanner for text files   
     has nothing more to do than to return every character,   
     except to strip the spaces at the end of a line.   
      
     It is a function "get_next_token" that on each call will   
     return the next character from a file to its client (caller),   
     except that spaces at the end of a line will skipped.   
      
     So we read the line and strip the spaces. (One line in   
     Python.)   
      
     But how do I know in advance if the line will fit into   
     memory?   
      
     Perhaps because of such fears, traditional scanners¹ do not   
     read lines or, Heaven forbid, files, but only characters!   
      
     They do not use random access with respect to the text to be   
     scanned, but sequential access, although things would be   
     easier with random access.   
      
     So how would you do it with this style of programming (never   
     reading the whole line into memory)?   
      
     "I read a character. If it's a space, I peek at the next   
     character, if that's a space, I start adding spaces to my   
     look-ahead buffer. If an EOL is encountered, the look-ahead   
     buffer is discarded. Otherwise, I have to start feeding my   
     client from the lookahead buffer until the lookahead buffer   
     is empty."   
      
     If I am concerned that a line will not fit in memory, how do   
     I know that the sequence of spaces at the end of a line will   
     fit in memory (the look-ahead buffer)? The look-ahead buffer   
     could be replaced by a counter. If you are paranoid, you   
     would use a 64-bit counter and check it for overflow!   
      
     Is it worth the effort with a look-ahead buffer and   
     sequential access? Should you just read a line, assuming   
     that a line will always fit into memory, and strip the   
     blanks the easy way, i.e., using random access? TIA for any   
     comments!   
      
     1   
      
     an example of a traditional scanner:   
      
     It only ever calls "GetCh", never "GetLine". The code could   
     be easier to write by reading a whole line and then just   
     using functions that can look at that line using random   
     access to get the next symbol (maybe using regular   
     expressions). But a traditional scanner carefully only ever   
     reads a single character and manages a state.   
      
   PROCEDURE GetSym;   
      
   VAR     i          : CARDINAL;   
      
   BEGIN   
     WHILE  ch <= ' '  DO  GetCh  END;   
     IF  ch = '/'  THEN   
       SkipLine;   
       WHILE  ch <= ' '  DO  GetCh  END   
     END;   
     IF  (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A')  THEN   
       i := 0;   
       sym := literal;   
       REPEAT   
         IF  i < IdLength  THEN   
           id [i] := ch;   
           INC (i)   
         END;   
         IF  ch > 'Z' THEN  sym := ident  END;   
         GetCh   
         ...   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]