home bbs files messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.forth      Forth programmers eat a lot of Bratwurst      117,927 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 117,734 of 117,927   
   Hans Bezemer to Anton Ertl   
   Re: Back & Forth - CSV is dead, long liv   
   18 Nov 25 16:18:56   
   
   From: the.beez.speaks@gmail.com   
      
   On 18-11-2025 16:03, Anton Ertl wrote:   
   > Hans Bezemer  writes:   
   >> On 15-11-2025 23:28, Paul Rubin wrote:   
   >>> TSV has the same problem I'd expect.   
   >>   
   >> Not really. Watch the video.   
   >   
   > Absolutely not!  If this was a topic where a video made things really   
   > more obvious than a written explanation (say, the performance of a   
   > gymnastic artist), I would find a pointer to a video acceptable,   
   > although it would still be unlikely that I would watch the video.   
   >   
   > But CSV/TSV is not one of these topics.  If you want to convince me   
   > that TSVs, whatever they are, have an advantage over CSVs, provide the   
   > argument in writing.  I can read.   
   >   
   > - anton   
      
   "Maybe you can write." You think? https://thebeezspeaks.blogspot.com/
   A 700-page manual. And this:
      
   "Yeah, I thought: let's do something completely different. So here we   
   are. BTW, if you assumed I hate CSV, you're very wrong. I love CSV. Just   
   load a line from the file into the Terminal Input Buffer - and let's   
   start parsing it. If you got a fairly straightforward CSV file, you'll
   dissect those records with very little effort. Try doing that with JSON   
   or XML without including a massive library that takes the dirty work out   
   of your hands.   
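A straightforward line really is that easy to dissect. As a sketch of the idea (in Python rather than the 4tH shown in the video, and with made-up sample data):

```python
# Naive field splitting: fine for "straightforward" CSV,
# i.e. no quoted fields, no embedded commas or newlines.
line = "Samsung,27in,1920x1080,VA"
fields = line.split(",")
print(fields)  # ['Samsung', '27in', '1920x1080', 'VA']
```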
      
   So no, I'm fine with CSV. It's great. I love it. It's kind to my hands.   
   It tastes like real coffee. Well, there you have it.   
      
   It's also one of the most popular formats for data exchange. Virtually   
   every single application supports it. Except the example spreadsheet   
   that came with Borland's Turbo C, version 2. And that's a shame, because
   it is much older than you think. The IBM Fortran (level H extended)   
   compiler under OS/360 already supported CSV. And we're talking 1972   
   here. And the first version of Turbo C was released in 1987. I'd say
   that's plenty of time to add some CSV support, don't you think?
      
   It's also referenced in the 1983 SuperCalc manual, page 7-35: “Use the
   program Super Data Interchange to convert a SuperCalc data file to a   
   Comma Separated Values file”. Bummer, but I couldn't get this so-called   
   “SDI” program to produce anything remotely useful. But I digress...
      
   It took a full two decades before it was officially formalized by RFC   
   4180. In the meantime a lot of variations had emerged, especially where
   the delimiters were concerned. Commas were replaced by pipes,
   semi-colons and even TABs. According to RFC 4180 the use of these   
   alternative delimiters is not allowed.   
      
   Still, all the European Excels I've encountered use semi-colons rather   
   than commas, probably because in Europe the comma is used as a decimal   
   separator. Which is annoying if you intend to delimit your file with   
   commas. Now, fun fact, RFC 4180 offers a solution to that - simply   
   enclose that data element with double quotes. Which in itself creates   
   the next problem - what if your data contains double quotes as well?   
   Next solution - double the sucker up.   
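Those two RFC 4180 rules - quote the field, then double any embedded quotes - look like this in practice (a Python sketch; the helper name is mine, not from the post):

```python
def quote_field(s: str) -> str:
    """Quote a field per RFC 4180: wrap it in double quotes when it
    contains a comma, a quote or a line break, and double any
    embedded double quotes."""
    if any(c in s for c in ',"\r\n'):
        return '"' + s.replace('"', '""') + '"'
    return s

print(quote_field('3,14'))           # "3,14"
print(quote_field('he said "hi"'))   # "he said ""hi"""
print(quote_field('plain'))          # plain
```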
      
   The consequence of all this ad hoc patching is that if you wanna parse
   these fields, you can neither simply split on commas - nor on quotes. And if
   you do, you have to implement all kinds of corrective counter measures   
   to get it exactly right. But wait - there's more. Another smart move was   
   to allow raw line terminators in fields. So now, you can't even continue   
   the parse - no, you gotta save whatever you got, refill the entire   
   ff-ing buffer again and restart from the beginning. So tell me guys,   
   when did you design this thing? After working hours? While knocking back   
   your tenth beer?   
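To see why a raw line terminator forces that save-refill-restart dance, here is a record that legally spans two physical lines - shown with Python's csv module for illustration, not the 4tH code from the video:

```python
import csv
import io

# A quoted field may legally contain a raw line break, so one
# record can span several physical lines: the parser cannot
# simply stop at the end of the line it is on.
data = 'name,notes\r\nACME,"line one\r\nline two"\r\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # the second record's single "notes" field spans two lines
```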
      
   Another thing you probably didn't realize: all CSV files are
   actually Windows files. Every line has to be terminated by a “carriage
   return - line feed” pair.   
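That CRLF rule comes straight from RFC 4180, and libraries honour it; Python's csv.writer, for instance, terminates records with "\r\n" by default:

```python
import csv
import io

# RFC 4180 mandates CRLF record terminators; Python's default
# csv dialect ("excel") emits '\r\n' accordingly.
buf = io.StringIO()
csv.writer(buf).writerow(["a", "b"])
print(repr(buf.getvalue()))  # 'a,b\r\n'
```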
      
   But enough talk, let's see how it behaves in real life. I've created a   
   spreadsheet with some information about computer screens and exported it   
   to CSV. Seems harmless enough. Now let's try to read and list it. Note,
   we're working with 4tH today.
      
   • First we define a word to open the file. The only thing you have to   
   declare is its name and whether the file is read (that is: input) or   
   written (that is: output);   
      
   • What it leaves on the stack is a 4tH file handle. If the file
   couldn't be opened, it contains an invalid file handle. The word ERROR?   
   adds a flag to the stack;   
      
   • When TRUE, we can call ABORT" to issue an error message and abort;   
      
   • If FALSE, ABORT" will be ignored and the file handle will be   
   duplicated - so it can be USEd. Now all input is sourced from this file.   
   And yes, it will leave an item on the stack - and later we will see why.   
   And that's it, we're done;   
      
   • Now we got to read this file field by field. So we PARSE the line   
   using the comma - just like we've seen before. And if it contained data,   
   we write that string to the console - and we rinse and repeat until   
   there are no more fields left;   
      
   • Now we got to read this file line by line. REFILL reads an entire line   
   to the input buffer - and returns TRUE when it succeeded. If it   
   succeeded we ask our previously defined READ-FIELDS to parse that line.   
   Note we separate each record with an extra carriage-return;   
      
   • We're ready to glue all the components together. First we have to   
   check whether a filename was given on the command line. The program name   
   is always the first argument - so yes, like in C, you got at least one   
   argument. The second argument would be the data file. Conclusion: we   
   expect two arguments. The total number of arguments is put on the stack   
   by the ARGN word;   
      
   • So, we expect two arguments. Less is considered an error. In that   
   case, we call ABORT" to get us out of this mess;   
      
   • Our command line arguments are numbered from zero. So our .CSV file
   is at index 1. Now we know for sure there are at least two arguments, we   
   can retrieve it with the ARGS word. We can feed it to our OPEN-FILE word   
   and subsequently call READ-RECORDS;   
      
   • When we're all done - remember that value we left on the stack? It was   
   the file handle. So all we got to do is call the CLOSE word - and it   
   will use that value to close the file. Easy as pie!   
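The words in that walkthrough (ERROR?, REFILL, ARGN, ARGS and friends) are 4tH-specific, but the shape of the program translates directly. A rough Python equivalent of the same steps, under my reading of the description above (the script name is made up):

```python
import sys

# Rough equivalent of the 4tH walkthrough: check the argument
# count, open the data file, read it line by line, split each
# line on commas, print every field, separate records with a
# blank line, and close the file when done.
def main() -> None:
    if len(sys.argv) < 2:                  # ARGN: program name + data file = 2
        sys.exit("Usage: listcsv <file.csv>")
    handle = open(sys.argv[1])             # like OPEN-FILE; raises on failure
    try:
        for line in handle:                # like REFILL: one line at a time
            # naive split on commas - breaks on quoted fields,
            # which is exactly the problem described below
            for field in line.rstrip("\r\n").split(","):
                print(field)               # write each field to the console
            print()                        # extra newline between records
    finally:
        handle.close()                     # like CLOSE with the saved handle
```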
      
   And here you got the consequences of .CSV when the going gets tough.   
   It's butt ugly. Unusable. Now, can we fix this? Yes, we can. Two
   additional lines and one modified one - that's all it takes. Barely an   
   inconvenience.   
      
   So - let's address the differences. Yeah, the two additional lines   
   include the libraries. That's easy. Now, what do they do?   
      
   • Well, PARSE-CSV takes quoted fields into consideration. So you get   
   exactly what's between the quotes;   
      
   • But if it also contains double quotes, we're not done yet. That's   
   where CSV> comes in. That one “undoubles” the quotes in a field. Now   
   we're done!   
      
   • Usually I define a word called FIELD> where I combine these words. And   
   if needed I can add the filters I deem necessary to perform the task at   
   hand.   
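The “undoubling” that CSV> performs is a one-liner in most languages; a Python sketch of the same idea (the function name is mine, not a 4tH word):

```python
def undouble(field: str) -> str:
    """Reverse the RFC 4180 escape: a doubled quote ('""')
    inside a quoted field becomes a literal '"'."""
    return field.replace('""', '"')

print(undouble('he said ""hi""'))  # he said "hi"
```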
      
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca