|    comp.programming    |    Programming issues that transcend langua    |    57,431 messages    |
|    Message 56,947 of 57,431    |
|    Richard Heathfield to Stefan Ram    |
|    Re: Scanning    |
|    19 Jan 23 12:43:45    |
   
   From: rjh@cpax.org.uk   
      
   On 19/01/2023 12:10 pm, Stefan Ram wrote:   
   > Some idle thoughts about scanning (lexical analysis, or   
   > rather what comes before it) ...   
   >   
   > Let's take a very simple task: This scanner for text files   
   > has nothing more to do than to return every character,   
   > except to strip the spaces at the end of a line.   
   >   
   > It is a function "get_next_token" that on each call will   
   > return the next character from a file to its client (caller),   
   > except that spaces at the end of a line will be skipped.   
   >   
   > So we read the line and strip the spaces. (One line in   
   > Python.)   
   >   
   > But how do I know in advance if the line will fit into   
   > memory?   
   >   
   > Perhaps because of such fears, traditional scanners¹ do not   
   > read lines or, Heaven forbid, files, but only characters!   
   >   
   > They do not use random access with respect to the text to be   
   > scanned, but sequential access, although things would be   
   > easier with random access.   
   >   
   > So how would you do it with this style of programming (never   
   > reading the whole line into memory)?   
   >   
   > "I read a character. If it's a space, I peek at the next   
   > character, if that's a space, I start adding spaces to my   
   > look-ahead buffer. If an EOL is encountered, the look-ahead   
   > buffer is discarded. Otherwise, I have to start feeding my   
   > client from the lookahead buffer until the lookahead buffer   
   > is empty."   
   >   
   > If I am concerned that a line will not fit in memory, how do   
   > I know that the sequence of spaces at the end of a line will   
   > fit in memory (the look-ahead buffer)? The look-ahead buffer   
   > could be replaced by a counter. If you are paranoid, you   
   > would use a 64-bit counter and check it for overflow!   
   >   
   > Is it worth the effort with a look-ahead buffer and   
   > sequential access? Should you just read a line, assuming   
   > that a line will always fit into memory, and strip the   
   > blanks the easy way, i.e., using random access? TIA for any   
   > comments!   
   >   
   > ¹ An example of a traditional scanner:   
   >   
   > It only ever calls "GetCh", never "GetLine". The code could   
   > be easier to write by reading a whole line and then just   
   > using functions that can look at that line using random   
   > access to get the next symbol (maybe using regular   
   > expressions). But a traditional scanner carefully only ever   
   > reads a single character and manages a state.   
   >   
   > PROCEDURE GetSym;
   >
   > VAR i : CARDINAL;
   >
   > BEGIN
   >   WHILE ch <= ' ' DO GetCh END;
   >   IF ch = '/' THEN
   >     SkipLine;
   >     WHILE ch <= ' ' DO GetCh END
   >   END;
   >   IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
   >     i := 0;
   >     sym := literal;
   >     REPEAT
   >       IF i < IdLength THEN
   >         id [i] := ch;
   >         INC (i)
   >       END;
   >       IF ch > 'Z' THEN sym := ident END;
   >       GetCh
   >     ...
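   
   The counter idea from the quoted post can be sketched in C roughly as
   follows. This is only an illustration under stated assumptions: the
   "scanner" struct, "scanner_init" and the NO_CHAR sentinel are made-up
   names, only "get_next_token" comes from the post, and spaces before
   end-of-file are treated like spaces before a newline. It reads one
   character at a time and never stores the run of spaces, only its length.

```c
#include <assert.h>
#include <stdio.h>

#define NO_CHAR (-2)  /* hypothetical sentinel; assumes EOF != -2, as on common systems */

typedef struct {
    FILE  *fp;        /* input stream, read strictly sequentially */
    size_t pending;   /* spaces counted but not yet handed to the caller */
    int    held;      /* character that ended the run of spaces, or NO_CHAR */
    int    overflow;  /* set if a run of spaces was too long to count */
} scanner;

void scanner_init(scanner *s, FILE *fp)
{
    assert(s != NULL && fp != NULL);
    s->fp = fp;
    s->pending = 0;
    s->held = NO_CHAR;
    s->overflow = 0;
}

/* Returns the next character, or EOF at end of input.
 * Spaces directly before '\n' (or before EOF) are skipped. */
int get_next_token(scanner *s)
{
    int ch;
    size_t n;

    /* Replay a run of spaces that turned out not to be trailing. */
    if (s->pending > 0) {
        s->pending--;
        return ' ';
    }
    /* Then hand over the character that ended that run. */
    if (s->held != NO_CHAR) {
        ch = s->held;
        s->held = NO_CHAR;
        return ch;
    }

    ch = getc(s->fp);
    if (ch != ' ')
        return ch;

    /* Count the run of spaces instead of buffering it. */
    n = 1;
    while ((ch = getc(s->fp)) == ' ') {
        if (n == (size_t)-1) {   /* the paranoid overflow check */
            s->overflow = 1;
            return EOF;
        }
        n++;
    }

    if (ch == '\n' || ch == EOF)
        return ch;               /* trailing spaces: drop them all */

    s->pending = n - 1;          /* one space is returned right now */
    s->held = ch;
    return ' ';
}
```

   The state is a counter plus a single held character, so memory use
   stays constant however long the line or the run of spaces is. The price
   is the small state machine at the top of the function.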
      
   man 3 realloc   
      
   This was a perennial comp.lang.c topic back in the day.   
      
   My interface looked (and still looks) like this:   
      
   #define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */   
   #define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")   
   #define FGDATA_REDUCE 1   
      
   int fgetline(char **line, size_t *size, size_t maxrecsize,
                FILE *fp, unsigned int flags, size_t *plen);
      
   It's easier to use than it might look:   
      
    char *data = NULL; /* where will the data go? NULL is fine */   
    size_t size = 0; /* how much space do we have right now? */   
    size_t len = 0; /* after call, holds line length */   
      
    while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)   
    {   
      if(len > 0)   
      
   If you want fgetline.c and don't have 20 years of clc archives,   
   just yell.   
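   
   For readers without the archives, the realloc approach can be sketched
   along these lines. This is not fgetline.c: "read_line_grow" is a
   hypothetical stand-in with a reduced signature, and its return values,
   the 64-byte starting size and the doubling policy are all assumptions
   made for the illustration. It does not model maxrecsize or flags.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not fgetline: reads one line of any length into
 * *line, growing the buffer with realloc as needed. The newline is not
 * stored. Returns 0 on success, 1 at end of input with nothing read,
 * and -1 if memory runs out (the old buffer stays valid in that case). */
int read_line_grow(char **line, size_t *size, FILE *fp, size_t *plen)
{
    size_t len = 0;
    int ch;

    assert(line != NULL && size != NULL && fp != NULL && plen != NULL);

    for (;;) {
        /* Make sure there is room for one more character plus '\0'. */
        if (len + 1 >= *size) {
            size_t newsize;
            char *tmp;

            if (*size > (size_t)-1 / 2)
                return -1;                    /* doubling would overflow */
            newsize = (*size == 0) ? 64 : *size * 2;
            tmp = realloc(*line, newsize);    /* realloc(NULL, n) acts like malloc(n) */
            if (tmp == NULL)
                return -1;
            *line = tmp;
            *size = newsize;
        }

        ch = getc(fp);
        if (ch == EOF || ch == '\n')
            break;
        (*line)[len++] = (char)ch;
    }

    (*line)[len] = '\0';
    *plen = len;
    return (ch == EOF && len == 0) ? 1 : 0;
}
```

   The calling pattern is the same as above: start with a NULL pointer and
   a size of 0, loop while the function returns 0, and free the buffer once
   after the loop. Because the buffer is reused between calls, most lines
   cost no allocation at all.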
      
   --   
   Richard Heathfield   
   Email: rjh at cpax dot org dot uk   
   "Usenet is a strange place" - dmr 29 July 1999   
   Sig line 4 vacant - apply within   
      
(c) 1994, bbs@darkrealms.ca