
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.misc      General topics about computers not covered elsewhere      21,759 messages


   Message 20,452 of 21,759   
   Ben Collver to All   
   How Unix Spell Ran in 64kB RAM   
   23 Jan 25 15:16:38   
   
   From: bencollver@tilde.pink   
      
   How Unix Spell Ran in 64kB RAM   
   ==============================   
      
   by Abhinav Upadhyay   
      
   How do you fit a 250kB dictionary in 64kB of RAM and still perform   
   fast lookups? For reference, even with modern compression techniques   
   like gzip -9, you can't compress this file below 85kB.   
      
   In the 1970s, Douglas McIlroy faced this exact challenge while   
   implementing the spell checker for Unix at AT&T. The constraints of   
   the PDP-11 computer meant the entire dictionary needed to fit in just   
   64kB of RAM. A seemingly impossible task.   
      
   Instead of relying on generic compression techniques, he took   
   advantage of the properties of the data and developed a compression   
   algorithm that came within 0.03 bits of the theoretical limit of   
   possible compression. To this day, it remains unbeaten.   
      
   The story of Unix spell is more than just historical curiosity. It's   
   a masterclass in engineering under constraints: how to analyze a   
   problem from first principles, leverage mathematical insights, and   
   design elegant solutions that work within strict resource limits.   
      
   If you're short on time, here's the key engineering story:   
      
   * The Unix spell checker started in the 1970s as an afternoon
     prototype by Steve Johnson at AT&T, before Douglas McIlroy rewrote
     it to improve its performance and accuracy.
      
   * McIlroy's first innovation was a clever linguistics-based stemming
     algorithm that reduced the dictionary to just 25,000 words while
     improving accuracy (a toy sketch of the idea appears below).
      
   * For fast lookups, he initially used a Bloom filter--perhaps one of
     its first production uses. Interestingly, Dennis Ritchie provided
     the implementation. They tuned it to have such a low false positive
     rate that they could skip actual dictionary lookups (a minimal
     sketch appears below).
      
   * When the dictionary grew to 30,000 words, the Bloom filter approach   
     became impractical, leading to innovative hash compression   
     techniques.   
      
   * They computed that 27-bit hash codes would keep the collision
     probability acceptably low, but 30,000 codes at 27 bits each still
     take roughly 99kB, so the codes themselves needed compression (see
     the arithmetic below).
      
   * McIlroy's solution was to store differences between sorted hash   
     codes, after discovering these differences followed a geometric   
     distribution.   
      
   * Using Golomb's code, a compression scheme designed for geometric
     distributions, he achieved 13.60 bits per word--remarkably close to
     the theoretical minimum of 13.57 bits (sketched below).
      
   * Finally, he partitioned the compressed data to speed up lookups,   
     trading a small memory increase (final size ~14 bits per word) for   
     significantly faster performance.   
      
   The rest of the article expands each of these points and gives a   
   detailed explanation with all the math and logic behind them.   
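
   A few rough sketches of the techniques above, in Python. First, the
   affix-stripping idea behind the stemming step. The rules below are
   invented for illustration only; spell's real rules were far more
   extensive and linguistically careful.

     # Toy sketch of affix stripping: keep only stems in the dictionary
     # and strip common suffixes at lookup time. (Rules are made up.)
     SUFFIXES = ["ing", "ed", "ly", "s"]

     def is_known(word, stems):
         """Accept a word if it, or a plausible stem of it, is known."""
         if word in stems:
             return True
         for suf in SUFFIXES:
             if word.endswith(suf) and word[:-len(suf)] in stems:
                 return True
         return False

     stems = {"walk", "quick"}
     print(is_known("walked", stems))   # True: "ed" stripped
     print(is_known("quickly", stems))  # True: "ly" stripped
     print(is_known("wlaked", stems))   # False: flagged as misspelled

   Storing only stems, and reconstructing inflected forms at lookup
   time, is what let the word list shrink to about 25,000 entries.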
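
   Next, a minimal Bloom filter in the same spirit as the lookup
   structure described above. The table size, number of hash functions,
   and hashing scheme here are assumptions for the sketch, not the
   original implementation.

     # Minimal Bloom filter sketch (illustrative parameters only).
     import hashlib

     class BloomFilter:
         def __init__(self, num_bits, num_hashes):
             self.num_bits = num_bits
             self.num_hashes = num_hashes
             self.bits = bytearray((num_bits + 7) // 8)

         def _positions(self, word):
             # Derive the bit positions from hashes of the word.
             for i in range(self.num_hashes):
                 h = hashlib.sha256(f"{i}:{word}".encode()).digest()
                 yield int.from_bytes(h[:8], "big") % self.num_bits

         def add(self, word):
             for pos in self._positions(word):
                 self.bits[pos // 8] |= 1 << (pos % 8)

         def might_contain(self, word):
             # May return a false positive, never a false negative.
             return all(self.bits[pos // 8] & (1 << (pos % 8))
                        for pos in self._positions(word))

     bf = BloomFilter(num_bits=400_000, num_hashes=11)
     for w in ["walk", "quick"]:        # stand-in for the stemmed list
         bf.add(w)
     print(bf.might_contain("walk"))    # True
     print(bf.might_contain("wlak"))    # almost certainly False

   A false positive here means a misspelling silently accepted, so the
   filter was tuned until that rate was low enough to drop the backing
   dictionary lookup entirely.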
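
   The choice of 27-bit hashes, and the need to compress them further,
   falls out of a little arithmetic (my own back-of-the-envelope version
   of it, using the figures quoted above):

     from math import comb, log2

     n = 30_000        # dictionary words
     N = 2 ** 27       # possible 27-bit hash codes

     # Chance that a random non-word lands on some real word's code,
     # i.e. the rate at which misspellings would slip through.
     print(f"false positive rate: about 1 in {N / n:,.0f}")   # ~1 in 4,474

     # Raw storage is still too big for 64kB of RAM.
     print(f"uncompressed: {n * 27 / 8 / 1024:.0f} kB")       # ~99 kB

     # Information-theoretic floor for storing n distinct values out
     # of N possible ones: log2(C(N, n)) bits in total.
     print(f"minimum: {log2(comb(N, n)) / n:.2f} bits/word")  # ~13.57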
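
   Finally, the gap-plus-Golomb-coding idea. Sorting the hash codes and
   storing only the differences gives small, geometrically distributed
   numbers, which Golomb codes handle well. The code below uses the
   power-of-two (Rice) special case and made-up parameters; McIlroy's
   actual layout and parameter choice differed.

     def golomb_encode(gaps, m):
         """Each gap becomes unary(quotient) + fixed-width remainder.
         m must be a power of two in this simplified (Rice) version."""
         width = m.bit_length() - 1
         out = []
         for g in gaps:
             q, r = divmod(g, m)
             out.extend([1] * q)      # unary quotient, ...
             out.append(0)            # ... terminated by a zero
             out.extend((r >> i) & 1 for i in reversed(range(width)))
         return out

     def golomb_decode(bits, m, count):
         width = m.bit_length() - 1
         gaps, i = [], 0
         for _ in range(count):
             q = 0
             while bits[i]:           # unary quotient
                 q, i = q + 1, i + 1
             i += 1                   # skip the terminating zero
             r = 0
             for _ in range(width):   # fixed-width remainder
                 r, i = (r << 1) | bits[i], i + 1
             gaps.append(q * m + r)
         return gaps

     codes = sorted([7, 1029, 1063, 5000, 12345])   # stand-in hashes
     gaps = [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]
     bits = golomb_encode(gaps, m=4096)   # real data: mean gap ~4,474
     assert golomb_decode(bits, 4096, len(gaps)) == gaps

   Because decoding a stream like this is sequential, the final step
   described above was to split it into partitions so that a lookup
   only decodes one of them, which is where the last ~0.4 bits per word
   went.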
      
      
   --- SoupGate-DOS v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   


