alt.comp.os.windows-10
Maria Sophia to All
PSA: An offline Windows workflow for Bal
04 Mar 26 11:38:41
   XPost: comp.text.pdf   
   From: mariasophia@comprehension.com   
      
   PSA:   
   An offline Windows workflow for audiobook TTS from PDF   
   (Half based on logic, half on trial-and-error results.)
      
   Software:   
      I. Windows 10 (almost everything will be FOSS, offline)   
     II. Balabolka 2.15.0.811   
    III. Calibre 8.10.0   
     IV. PDF Shaper Free 8.9
      V. Adobe Acrobat 6 (writer)   
      
   The main issue is that we don't want to create an audiobook that speaks the...
    a. headers/footers   
    b. images   
    c. end of line (versus end of sentence, which causes unnatural pauses)   
    d. index, foreword, tables, bibliography, etc.
      
   Main software logic employed was to use the best tool/format for the job...   
    A. Balabolka needs the cleanest text input you can give it. (Balabolka
       can OCR with the Tesseract plugin, but we're not doing OCR here.)
    B. Acrobat tries to preserve the visual layout of the PDF; while   
       Calibre does not try to preserve layout (which is better for TTS).   
    C. Calibre tries to extract logical-reading order (which TTS needs).   
    D. RTF is the better input for Calibre since an RTF from Acrobat has   
         i. Real text (not embedded PDF objects)   
        ii. Fewer hard line breaks (than TXT)   
       iii. Predictable header/footer patterns (that Calibre can strip)   
        iv. No images (so nothing confuses the reading order)   
         v. Consistent encoding (to rebuild paragraphs cleanly)   
    E. But Acrobat 6 embedded JPEGs as hex blobs inside the RTF.   
    F. So PDF Shaper Free was used to remove images.   
    G. And PDF Shaper Free's PDF->RTF conversion is cleaner than Acrobat's
       (which, per E, embeds images as hex blobs inside the RTF).
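
   For anyone who would rather strip the hex-blob images out of an RTF
   directly, here is a minimal Python sketch. This is NOT what was done
   above (PDF Shaper Free removed the images from the PDF instead). It
   assumes the pictures sit in {\pict ...} groups, optionally wrapped in
   {\*\shppict ...}, which is how RTF stores embedded images. The function
   name strip_rtf_pictures is made up for illustration, and the sketch is
   untested against real Acrobat 6 output.

```python
def strip_rtf_pictures(rtf: str) -> str:
    """Drop {\\pict ...} and {\\*\\shppict ...} groups from RTF source text."""
    out = []
    i, n = 0, len(rtf)
    while i < n:
        if rtf.startswith("{\\pict", i) or rtf.startswith("{\\*\\shppict", i):
            # Skip to the brace that closes this group, tracking nesting depth.
            depth = 0
            while i < n:
                ch = rtf[i]
                if ch == "\\" and i + 1 < n:
                    i += 2  # control word start or escaped brace, not structure
                    continue
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        i += 1
                        break
                i += 1
        elif rtf[i] == "\\" and i + 1 < n:
            out.append(rtf[i:i + 2])  # copy backslash pairs whole (\{ stays literal)
            i += 2
        else:
            out.append(rtf[i])
            i += 1
    return "".join(out)


sample = r"{\rtf1 Hello {\pict\jpegblip ffd8ffe0} world}"
print(strip_rtf_pictures(sample))  # -> {\rtf1 Hello  world}
```

   Reading a 177MB RTF into one string is memory-hungry but workable on
   most machines; a streaming version would be the next refinement.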
   But there are still stray artifacts: the first letter of the first word
   of every chapter is set in a big font, followed by a space and then the
   rest of the word; some lines are still chopped; and of course the header
   & footer are still in the text output.
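
   The drop-cap artifact (big first letter, a space, then the rest of the
   word) can be patched with a regex. Below is a rough Python sketch, not a
   step from the tested flow. It assumes the drop cap lands at the start of
   a line, and it deliberately skips "A" and "I" because those are real
   one-letter words, so a chapter opening with "A nd" would be missed. The
   names DROP_CAP and rejoin_drop_caps are made up for illustration.

```python
import re

# One capital letter (but not A or I), a single space, then a lowercase run,
# anchored at the start of a line.
DROP_CAP = re.compile(r"^([B-HJ-Z]) ([a-z]+)", re.MULTILINE)


def rejoin_drop_caps(text: str) -> str:
    """Glue a detached drop-cap letter back onto the rest of its word."""
    return DROP_CAP.sub(r"\1\2", text)


print(rejoin_drop_caps("T he riders came at dawn."))  # -> The riders came at dawn.
```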
      
   Problem Statement:   
    College-aged grandchild wants me to "read" a 212-page 10MB book she has   
    in PDF so she and I can discuss it over the phone; but I want to "listen"   
    to that book PDF because reading scanned/text is miserable for my eyes.   
      
   Problem Document:   
    The book "looks" scanned (i.e., sloppy fonts) but Acrobat can select text   
    and Acrobat search can find a given word, so it's weird text + scanned???   
    At least that means Tesseract OCR plugins to Balabolka are not needed.   
      
   Test Flow:   
    Adobe Acrobat 6 (Writer) was used to convert the original PDF twice.   
    1. First the 10MB (original) PDF was run through "File > Reduce File
       Size", which produced a 27MB PDF (pages 70 & 135 took almost forever,
       but presumably this cleaned up artifacts).
    2. Then the 27MB "reduced-size" PDF was "File > Save as" a 177MB RTF;   
       but the RTF had page-end line breaks, so the RTF was discarded.   
    3. The 27MB reduced-size PDF was fed into PDF Shaper Free   
       "File > Remove Images", which saved a 17MB image-free PDF   
       (which, coincidentally, had *much cleaner* text fonts!).   
    4. Calibre read in the RTF and converted it to TXT, but each line still
       ended with a line break at the right edge of the visible page. Drat.
    5. I tried Sigil to make a cleaner EPUB from the Calibre EPUB; but even   
       though the EPUB has no artificial line breaks, the TXT still had them.   
    6. Merge lines that are not real paragraph breaks   
        Balabolka: Edit > Replace, [x] Use Regular Expressions
        Find: ([^\n])\n([^\n])   
        Replace: \1 \2   
        Save as text   
    7. Fix broken words   
       Find: ([a-z])¡([a-z])   
       Replace: \1\2   
    8. Remove tab characters   
       Find: \t+   
       Replace: (empty)   
    9. Remove page headers (such as "10  WORLD HISTORY"; substitute your
       book's own running title in the pattern)
       Find: ^\s*\d+\s+THE DEVIL’S HORSEMEN\s*$   
       Replace: (empty)   
   10. Normalize paragraph spacing (collapse runs of blank lines into a
       single blank line)
        Find: \n{3,}   
        Replace: \n\n   
   11. Remove soft-hyphen & ligature debris   
        Find: ¡   
        Replace: (empty)   
        Find: ([a-zA-Z])�([a-zA-Z])
        Replace: \1\2
   12. Tell Balabolka to pause slightly before & after italicized words
       (run this before step 8, which deletes the tabs this pattern keys on)
        Find: \t([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)
        Replace:  \1
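
   For a repeatable run outside Balabolka, steps 6 through 11 can be
   sketched in Python. Treat this as an approximation under stated
   assumptions, not the tested flow: Python's re module is not Balabolka's
   regex engine; the function name clean_for_tts and its parameters
   running_title and debris are made up for illustration; and debris
   defaults to the "¡" character shown in the patterns above, which may
   not be the character in your own text. The sketch removes headers
   BEFORE merging lines (unlike the numbered order), so that a header
   sitting between two wrapped lines is not glued into the paragraph.

```python
import re


def clean_for_tts(text: str, running_title: str, debris: str = "\u00a1") -> str:
    """Approximate steps 6-11: strip headers, tabs, debris; unwrap lines."""
    text = text.replace("\r\n", "\n")  # assume Windows line endings may be present

    # Step 9 (moved first): drop "<page number> <running title>" lines,
    # including the line's own newline. [ \t] keeps the match on one line.
    header = re.compile(
        r"^[ \t]*\d+[ \t]+" + re.escape(running_title) + r"[ \t]*\n?",
        re.MULTILINE,
    )
    text = header.sub("", text)

    # Step 8: remove tab characters.
    text = re.sub(r"\t+", "", text)

    # Steps 7 and 11: remove the debris character wherever it appears.
    text = text.replace(debris, "")

    # Step 6: a lone newline between two non-newline characters becomes a space.
    # Lookarounds do not consume the neighbours, so one-character lines merge too.
    text = re.sub(r"(?<=[^\n])\n(?=[^\n])", " ", text)

    # Step 10: collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text


raw = ("The horse\u00a1men rode\nwest at dawn.\n\n"
       "10  WORLD HISTORY\n\n\tThey did not stop.\n")
print(repr(clean_for_tts(raw, "WORLD HISTORY")))
# -> 'The horsemen rode west at dawn.\n\nThey did not stop.\n'
```

   The capture-group pattern from step 6 consumes the character on each
   side of the newline, so in a run like "a", "b", "c" on three one-letter
   lines the second newline is skipped; the lookaround form avoids that.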
      
   Once you have the clean text, the Balabolka step is trivial.   
   Save as MP3.   
     Bitrate: 48 kbps or 64 kbps   
     Mode: Mono   
     Sample rate: 22 kHz or 32 kHz
   Copy to your mobile device and play whenever you want to listen
   to the audiobook.
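
   To judge how much room the MP3 will need on the phone, the arithmetic
   is bitrate times duration. A small Python sketch, assuming constant
   bitrate and ignoring tag/header overhead; the function name mp3_size_mb
   is made up, and the one-hour duration is only a unit for comparison,
   not the length of this book.

```python
def mp3_size_mb(hours: float, kbps: int) -> float:
    """Approximate constant-bitrate MP3 size in megabytes (10**6 bytes)."""
    bytes_per_second = kbps * 1000 / 8  # kilobits per second -> bytes per second
    return bytes_per_second * hours * 3600 / 1_000_000


for rate in (48, 64):
    print(rate, "kbps:", round(mp3_size_mb(1, rate), 1), "MB per hour")
# 48 kbps: 21.6 MB per hour
# 64 kbps: 28.8 MB per hour
```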
   --   
   This is a work in progress but I figured I'd document the steps.   
      