home bbs files messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   alt.comp.os.windows-11      Steaming pile of horseshit Windows 11      4,969 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 3,639 of 4,969   
   Marian to Herbert Kleebauer   
   Re: Tutorial: Notepad++ shortcuts.xml ma   
   31 Dec 25 11:21:22   
   
   XPost: alt.comp.os.windows-10, alt.comp.microsoft.windows   
   From: marianjones@helpfulpeople.com   
      
   Herbert Kleebauer wrote:   
   > On 12/31/2025 9:33 AM, Marian wrote:   
   >   
   >> This line has a sneaky Unicode dash � right here.   
   >> This line has curly quotes �like these�.   
   >> This line has a non-breaking space between words.   
   >   
   > In Thunderbird this didn't arrive as valid uTF-8 code.   
   >   
   > "dash � right" in hex:   
   >   
   > 64 61 73 68 │ 20 FB 20 72 │ 69 67 68 74   
   >   
   > FB is the starting byte of a 4 byte utf-8 code, but the   
   > 3 remaining bytes are missing.   
      
   Hi Herbert,   
      
   Happy New Year!   
      
   Thank you for that information. I only found out after I'd posted.
   I don't "see" most of the tofu, but as you can tell, it was there.   
      
   It happens when I forget to convert Unicode to ASCII before posting.
   Here is the original test file that contains the Unicode characters.   
      
   This line is fine.   
   This line has a sneaky Unicode dash – right here.   
   This line has curly quotes “like these”.   
   This line has a non-breaking space between words.   
      
   Bear in mind there is much more than just visibly funky characters in
   pasted web-page text; Unicode is only the container, and the real
   trouble comes from the variety of characters inside it, such as
   zero-width spaces & joiners, directional control characters, soft
   hyphens, etc.
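If Python were handy, a quick way to see those invisibles would be something like this (my own little lookup table, nowhere near exhaustive):

```python
# A few invisible troublemakers that survive a copy/paste from the web.
INVISIBLES = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200d": "ZERO WIDTH JOINER",
    "\u00a0": "NO-BREAK SPACE",
    "\u00ad": "SOFT HYPHEN",
    "\u202e": "RIGHT-TO-LEFT OVERRIDE",
}

def find_invisibles(text):
    # Return (index, name) for every invisible character found.
    return [(i, INVISIBLES[ch]) for i, ch in enumerate(text) if ch in INVISIBLES]

print(find_invisibles("zero\u200bwidth and a\u00a0no-break space"))
# -> [(4, 'ZERO WIDTH SPACE'), (16, 'NO-BREAK SPACE')]
```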
      
   I sent that exactly as it was copied & pasted from gVim.   
   My Usenet "reader" is a bunch of telnet scripts tied to gVim.   
      
   Whatever is in the From header is picked at random from a dictionary
   lookup, but the character-encoding header is static - it never changes
   to match what's actually in the body.
      
   This is why I try to run all the web page comments (which contain funky   
   characters) through a conversion to ASCII prior to posting.   
      
   Here's that same file after being run through this sequence.   
   c:\> type unicode2ascii.bat   
   @echo off   
   :: unicode2ascii.bat   
   :: This batch file runs a PowerShell script that removes all non-ASCII   
   :: characters from unicode.txt and writes the cleaned output to ascii.txt.   
   powershell -NoProfile -ExecutionPolicy Bypass -File unicode2ascii.ps1   
      
   c:\> type unicode2ascii.ps1   
   # unicode2ascii.ps1   
   # This script reads unicode.txt, removes all characters outside the   
   # 7-bit ASCII range (0x00 to 0x7F), and writes the result to ascii.txt.   
      
   Get-Content unicode.txt | ForEach-Object {   
       ($_ -replace '[^\x00-\x7F]', '')   
   } | Set-Content ascii.txt   
      
   c:\> type ascii.txt   
   This line is fine.   
   This line has a sneaky Unicode dash  right here.   
   This line has curly quotes like these.   
   This line has a non-breaking space between words.   
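For comparison, here's the same strip-everything idea sketched in Python (assuming the same unicode.txt/ascii.txt file names as above; a sketch, not part of my actual toolchain):

```python
# unicode2ascii.py - same idea as unicode2ascii.ps1: delete anything
# outside the 7-bit ASCII range and write the result to ascii.txt.
import os
import re

def strip_non_ascii(text):
    # Mirrors the PowerShell regex [^\x00-\x7F]
    return re.sub(r"[^\x00-\x7F]", "", text)

if __name__ == "__main__" and os.path.exists("unicode.txt"):
    with open("unicode.txt", encoding="utf-8") as f:
        cleaned = strip_non_ascii(f.read())
    with open("ascii.txt", "w", encoding="ascii") as f:
        f.write(cleaned)
```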
   >   
   >> c:\> type unicode2ascii.bat   
   >> @echo off   
   >>:: unicode2ascii.bat   
   >>:: This batch file runs a PowerShell script that removes all non-ASCII   
   >>:: characters from unicode.txt and writes the cleaned output to ascii.txt.   
   >> powershell -NoProfile -ExecutionPolicy Bypass -File unicode2ascii.ps1   
   >   
   > Wouldn't it be simpler to open the file in Notepad and save it with   
   > ANSI encoding instead of UTF-8?   
      
   My use model is to research the bejeezus out of my Usenet posts, so they   
   very often contain funky characters of all sorts due to copy/paste/edit.   
      
   I simply need a quick converter of the pasted text to keyboard ASCII.   
   This notepad conversion started years ago with a few funky characters.   
   Shortcuts.xml grew over time into the behemoth that it currently is.   
      
   Nonetheless, "simpler" and "faster" need to go together.   
   Currently the process is:   
      
   a. Paste results from multiple web sources into a gvim file   
   b. Convert to ascii in Notepad++ using ctrl+A & ctrl+B   
   c. Edit converted results to the final Usenet post content   
      
   It's just a couple of quick keyboard combinations.   
    ctrl+c (to copy the referenced web page text)   
    Runbox > n (to bring up Notepad++)
    ctrl+v (to paste the copied funky text to Notepad++)   
    ctrl+b (to convert the funky text to ascii characters)   
    ctrl+a (to copy the converted ascii)   
    ctrl+v (to paste into gVim for the final edits)   
    ctrl+s (to send off the Usenet post).   
      
   I do it a hundred times a day, all day, every day, so the keystroke
   sequence is efficient, but I'm always open to a better, simpler method.
      
   The first thing I had tried, years ago, was to do it inside of gVim.   
    :%s/[^[:ascii:]]//g   
   Mapped to the F5 key, that turned into this, which partially works.   
    nnoremap <F5> :%s/[^[:ascii:]]//g<CR>
      
   But that will simply "remove" the unwanted funky characters.   
   I need to map them to individual ascii replacements.   
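That map-instead-of-remove idea sketched in Python (the table is a minimal sample I made up, nowhere near the full shortcuts.xml set):

```python
# Replace common funky characters with ASCII stand-ins instead of deleting them.
ASCII_MAP = {
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u00a0": " ",  # no-break space
}
TABLE = str.maketrans(ASCII_MAP)

def funky_to_ascii(text):
    return text.translate(TABLE)

print(funky_to_ascii("curly \u201cquotes\u201d and a dash \u2013 here"))
# -> curly "quotes" and a dash - here
```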
      
   Of course, gVim can map a specific character to another character:   
    nnoremap <key> :silent %s/[xy]/"/ge <Bar> %s/[nm]/'/ge <Bar> %s/[ab]/-/ge<CR>
    where for this post
    x = open curly doublequote
    y = close curly doublequote
    n = open curly singlequote
    m = close curly singlequote
    a = em dash
    b = en dash
   But that turns into yet another "shortcuts.xml" complex conversion set   
   as I run into all sorts of funky unicode characters (such as nbspace).   
      
   Taking your suggestion to heart, I could, I guess, call my powershell   
   conversion script from within gVim, which I hadn't been doing ('cuz I   
   wasn't running the powershell script - it was just a test vehicle).   
    :!powershell -File unicode2ascii.ps1   
    :e ascii.txt   
      
   I could also run some sort of filter out of gVim to Powershell too.   
    :%!powershell -command "$input | foreach { $_ -replace '[^\x00-\x7F]', '' }"
   Mapped, perhaps, to the F7 key for single-stroke keyboard efficiency.   
    noremap <F7> :%!powershell -command "$input | foreach { $_ -replace '[^\x00-\x7F]', '' }"<CR>
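Python could serve as the filter just as well; a sketch (the script name ascii_filter.py is my own invention):

```python
# ascii_filter.py - stdin-to-stdout filter that drops non-ASCII characters,
# suitable as the target of a gVim filter command.
import sys

def only_ascii(text):
    # Keep only code points 0x00-0x7F, same as the regex [^\x00-\x7F].
    return "".join(ch for ch in text if ord(ch) < 128)

if __name__ == "__main__" and not sys.stdin.isatty():
    sys.stdout.write(only_ascii(sys.stdin.read()))
```

In gVim that would be :%!python ascii_filter.py (assuming python is on the PATH).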
      
   If I were on Linux, I'd use the existing converter tools such as   
    iconv -f UTF-8 -t ASCII//TRANSLIT input.txt > output.txt   
    uconv -f utf-8 -t ascii --transliterate input.txt > output.txt   
      
   Looking it up, I can also   
    $txt = Get-Content unicode.txt -Raw
    $norm = $txt.Normalize([Text.NormalizationForm]::FormKD)
    $ascii = -join ($norm.ToCharArray() | Where-Object { [int]$_ -lt 128 })
    $ascii | Set-Content ascii.txt
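Looking further, that FormKD normalization has a direct counterpart in Python's standard unicodedata module (NFKD is the same normalization form):

```python
# NFKD-decompose, then drop whatever still isn't ASCII; accented letters
# fall apart into base letter + combining mark, so the base letter survives.
import unicodedata

def nfkd_to_ascii(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(nfkd_to_ascii("caf\u00e9 \u2460"))  # 'café ①' -> 'cafe 1'
```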
      
   And, looking it up s'more, I see python has a conversion tool also.   
    pip install unidecode   
    from unidecode import unidecode   
    print(unidecode(open("unicode.txt").read()))   
      
   It's such a common need there are lots of solutions.   
   Which is the simplest?   
   --   
   Just paying it forward with clear steps & free tools made simple.   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca