comp.arch
Message 129,601 of 131,241
BGB to All
Random/OT: Low sample rate audio weirdne
06 Sep 25 05:28:16
   From: cr88192@gmail.com   
      
   Just randomly thinking again about some things I noticed with audio at   
   low sample rates.   
      
   For baseline, can note, basic sample rates:   
      44100: Standard, sounds good, but bulky   
      32000: Sounds good   
      22050: Moderate   
      16000: OK, Modest size, acceptable quality.   
        Seems like best tradeoff if not going for high quality.   
      11025: Poor, muffled.   
       8000: Very poor, speech almost unintelligible (normally).   
         But, it is seeming like a "weird hack" may exist here.   
      
   For sample formats:   
      16-bit PCM: Good   
      Binary16: Also good   
      A-Law: Decent (space efficient)   
      8-bit PCM: Sounds poor at all sample rates.
        Tends to introduce an obvious hiss.   
      
   So, at higher sample rates, 16-bit PCM or A-Law are the clearly better   
   options. And, 16000 16-bit sounds better than 44100 8-bit, despite the   
   latter having the higher data-rate, because 8-bit PCM adds a very   
   obvious hiss.   
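
   To put rough numbers on the comparison above, a small sketch. It assumes
   mono audio and uses the textbook ideal-quantizer SNR estimate for a
   full-scale sine (about 6.02 dB per bit plus 1.76 dB); the function names
   are made up for illustration.

```python
def data_rate_bits_per_sec(sample_rate_hz, bits_per_sample, channels=1):
    """Raw (uncompressed) data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def ideal_pcm_snr_db(bits_per_sample):
    """Textbook SNR of an ideal uniform quantizer for a full-scale sine."""
    return 6.02 * bits_per_sample + 1.76


rate_16k_16bit = data_rate_bits_per_sec(16000, 16)  # 256000 bits/s
rate_44k_8bit = data_rate_bits_per_sec(44100, 8)    # 352800 bits/s

snr_8bit = ideal_pcm_snr_db(8)     # about 49.9 dB
snr_16bit = ideal_pcm_snr_db(16)   # about 98.1 dB
```

   So 44100 8-bit costs more bits per second than 16000 16-bit, while the
   8-bit quantization noise floor sits roughly 48 dB higher, which fits
   with the hiss being the dominant problem.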
      
      
   For upsampling, a usual filtering strategy that works well seems to be   
   to use a cubic spline.   
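
   A minimal sketch of cubic-spline upsampling by an integer factor. It
   assumes a Catmull-Rom spline (one common cubic that passes through every
   input sample) and clamps at the edges; the post does not say which cubic
   is used, so treat the spline choice and the names here as assumptions.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Interpolate between p1 and p2 at fraction t in [0, 1]."""
    return 0.5 * (
        2.0 * p1
        + (p2 - p0) * t
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
        + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t * t * t
    )


def upsample_cubic(samples, factor):
    """Upsample by an integer factor; edge samples are repeated (clamped)."""
    n = len(samples)
    out = []
    for i in range(n * factor):
        k, frac = divmod(i, factor)   # k = input index, frac/factor = position
        t = frac / factor
        p0 = samples[max(k - 1, 0)]
        p1 = samples[k]
        p2 = samples[min(k + 1, n - 1)]
        p3 = samples[min(k + 2, n - 1)]
        out.append(catmull_rom(p0, p1, p2, p3, t))
    return out
```

   With t = 0 the result is exactly p1, so every original sample survives
   unchanged in the output.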
      
   For downsampling, there are a few options:   
      Nearest Neighbor:   
        Simplest, poor quality   
        Introduces very weird distortions when going to low sample rates.   
      Box average:   
        Take N samples and average them;   
        Only works well for power-of-2 resampling.   
      Pseudo tricubic:   
        Take a block of N samples;   
        Downsample by half until there is one version above and one below the target rate;
        Weighted sum of the lower one (box average) and the higher one (cubic interpolation).
      Sinc:   
        Theoretically exists, but never made it work well.   
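
   The two simplest options above, sketched for illustration. The function
   names are made up, the box average assumes an integer factor, and a
   trailing partial block is simply dropped.

```python
def downsample_nearest(samples, in_rate, out_rate):
    """Nearest-neighbor (really: floor) pick, no filtering at all."""
    n_out = len(samples) * out_rate // in_rate
    return [samples[i * in_rate // out_rate] for i in range(n_out)]


def downsample_box(samples, factor):
    """Average each block of `factor` input samples into one output sample."""
    out = []
    for start in range(0, len(samples) - factor + 1, factor):
        block = samples[start:start + factor]
        out.append(sum(block) / factor)
    return out
```

   Nearest neighbor throws away the skipped samples entirely, so anything
   above the new Nyquist frequency aliases back down, which is one
   plausible source of the weird distortions. The box average is a crude
   low-pass filter, so it avoids the worst of that.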
      
   As a general strategy, pseudo tricubic had worked well.   
      
      
   So, seemingly, at least when working at 16kHz or above:   
      16-bit PCM, A-Law, or Binary16 is a win;   
      Pseudo tricubic seems to give the best perceptual audio quality.   
      
      
   If going to low sample rates (11 or 8kHz), a problem emerges:   
      Speech becomes muffled and unintelligible.   
   So, 16-bit PCM and A-Law don't work, audio is still muffled.   
      
      
   But, there is something weird I had noticed at low rates (eg, 8kHz):   
      ADPCM encoding seems to increase the intelligibility.   
        Speech seemingly more intelligible after ADPCM than before.   
      More so when not using the "obvious choice" of minimizing error.
        It works better if the encoder is tuned to slightly overshoot.
      The effect is more obvious with 2-bit ADPCM than with 4-bit.
        Like, some sort of weird "less is more" with the quality.   
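
   To make the "tuned to overshoot" idea concrete, a toy 2-bit ADPCM codec.
   This is not the encoder described above, just a sketch: the step
   multipliers (1.6 / 0.8), the starting step, the code layout, and the
   `overshoot` knob are all assumptions. With overshoot = 1.0 the encoder
   picks the code closest to the input sample; values above 1.0 make it aim
   past the sample, in the direction of the change.

```python
import math

STEP_MIN, STEP_MAX, STEP_INIT = 1.0, 8192.0, 16.0


def _apply_code(pred, step, code):
    """Apply one 2-bit code: bit 1 = sign, bit 0 = magnitude (0.5 or 1.5 steps)."""
    mag = 1.5 if (code & 1) else 0.5
    delta = -mag * step if (code & 2) else mag * step
    pred = max(-32768.0, min(32767.0, pred + delta))
    # Grow the step after a large move, shrink it after a small one.
    step = step * (1.6 if (code & 1) else 0.8)
    step = max(STEP_MIN, min(STEP_MAX, step))
    return pred, step


def adpcm2_encode(samples, overshoot=1.0):
    """Encode to 2-bit codes; overshoot > 1.0 aims past each input sample."""
    pred, step = 0.0, STEP_INIT
    codes = []
    for s in samples:
        target = pred + overshoot * (s - pred)
        best = min(range(4),
                   key=lambda c: abs(_apply_code(pred, step, c)[0] - target))
        codes.append(best)
        pred, step = _apply_code(pred, step, best)
    return codes


def adpcm2_decode(codes):
    """Decode 2-bit codes back to sample values (mirrors the encoder state)."""
    pred, step = 0.0, STEP_INIT
    out = []
    for c in codes:
        pred, step = _apply_code(pred, step, c)
        out.append(pred)
    return out


# Usage: a 200 Hz tone at 8 kHz, encoded with a mild overshoot.
signal = [8000.0 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
codes = adpcm2_encode(signal, overshoot=1.25)
decoded = adpcm2_decode(codes)
```

   The decoder never sees the overshoot setting; it only changes which codes
   the encoder picks, so any decoder for the format plays back the result.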
      
   My sense of hearing and the RMSE heuristic somewhat disagree about which
   is "better" quality. RMSE seems to prefer it when the ADPCM encoder tends
   to undershoot, and RMSE also prefers the more muffled versions (which are
   closer to the down-sampled input audio).
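
   For reference, the RMSE metric itself, as a sketch (the function name is
   arbitrary). It only measures sample-by-sample distance from the
   reference, so it has no notion of which errors are audible.

```python
import math


def rmse(reference, candidate):
    """Root-mean-square error between two equal-length sample sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    return math.sqrt(total / len(reference))
```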
      
      
   Similarly (partly inspired by the ADPCM effect), it also seems a new
   contender has arrived on the scene as a resampling algorithm:
   Treat the previously generated sample points as a reference, and try to   
   pick a next point that best fits a line or B-Spline to the intermediate   
   points (in effect, treating each sample point as if it were a control   
   point for a B-Spline fitted to the input samples).   
      
      
   The line-fitting is simpler, but the B-Spline seems to give a similar   
   effect with better quality (even if RMSE does not agree, it sees the   
   error from this as worse than with the other methods if the audio is   
   upsampled using the normal spline method).   
      
   Though, RMSE is lower if the upsampler also treats the audio samples   
   more like the control points in a B-Spline.   
      
   Where, in my usual cubic-spline upsampler, the interpolation passes   
   through each control point (if the interpolated position directly aligns   
   with a control point, it returns this point). This differs from a
   B-Spline, where generally the curve undershoots the control points.
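
   The undershoot can be seen directly from the basis functions. A sketch,
   assuming a uniform cubic B-Spline (the post does not say which B-Spline
   variant is used, and the function name is arbitrary):

```python
def bspline_cubic(p0, p1, p2, p3, t):
    """Uniform cubic B-Spline segment for control points p0..p3, t in [0, 1]."""
    t2 = t * t
    t3 = t2 * t
    b0 = (1.0 - 3.0 * t + 3.0 * t2 - t3) / 6.0
    b1 = (4.0 - 6.0 * t2 + 3.0 * t3) / 6.0
    b2 = (1.0 + 3.0 * t + 3.0 * t2 - 3.0 * t3) / 6.0
    b3 = t3 / 6.0
    return b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3


# At t = 0 the curve value is (p0 + 4*p1 + p2) / 6, not p1 itself.
peak = bspline_cubic(0.0, 1.0, 0.0, 0.0, 0.0)   # about 0.667 for a peak of 1.0
```

   So a control point at 1.0 between two zeros only yields about 0.667 on
   the curve, where an interpolating cubic would return exactly 1.0. Put the
   other way around, control points fitted so that the curve matches the
   input have to overshoot the input at peaks, which is at least consistent
   with the overshoot preference seen with ADPCM.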
      
      
   Then, it seems like for storage, the low-rate audio (control points) can
   be stored in ADPCM (though this time, error-minimization during encoding
   gives the best results).
      
   And, oddly, it seems like the audio (in this low-sample rate,   
   control-points form) actually has higher perceptual audio quality (and   
   things like speech seem more intelligible; despite the low 8kHz sample   
   rate).   
      
      
   But, I am at a loss here as to why any of this would be true in a
   theoretical sense.
      
      
      
   Stuff online mentioning the use of B-Splines for audio seems to work on   
   the assumption of generating control points and then using another   
   B-Spline to generate audio at the target rate (rather than directly   
   listening to the control points as audio).   
      
   Stuff online also mentions needing to low-pass filter the audio before   
   generating the spline, but if any sort of low-pass filtering is applied   
   (before spline generation) then (again) it becomes muffled and
   unintelligible.   
      
   Presumably, the idea would be to filter out things above the Nyquist   
   frequency of the target sample rate (so, say, 4kHz for 8kHz audio), but,   
   as noted, a 4kHz low-pass filter (in general) wrecks intelligibility.   
      
   Then again, maybe the mention of low-pass filtering assumes operating at   
   somewhat higher target sample rates?...   
      
      
   Where, seemingly for speech and frequency ranges:   
      under 1 kHz: mystery range...   
        Filtering out has little effect.   
      1-2 kHz: "fullness"
        Filtering out this range causes a "tinny" sound   
        Filtering this out seems to strongly displease cats.   
      2-4 kHz: Has vowel sounds   
        Filtering out this range makes voices sound robotic.   
        Many of the distinguishing parts of the voice go away.   
      4-8 kHz: Consonants / etc seem to live here   
        Filtering this out removes the "what is being said" part.   
      8-16 kHz: Mostly optional   
        Improves quality, but no effect on intelligibility.
      16 kHz: Upper end of hearing   
        Like CRT TV whistling is up here.   
      
   Where, I had noted that general intelligibility of speech and other   
   audio remains intact with a 4kHz to 8kHz band-pass filter, though with a   
   "robotic" sound, and it is harder to tell people's voices apart (like,
   everyone is speaking with a similar-sounding robotic voice).   
      
   But, with 8kHz audio having a 4kHz Nyquist frequency, this creates a
   problem. Can sort of hear vowel sounds, but sounds are often largely
   undifferentiated. Like, can hear that someone is talking, or whose voice   
   it is, but not really what they are saying.   
      
   Though, does leave a mystery then of why telephony would have used 8kHz,   
   when presumably intelligible speech is the whole point of a telephone?...   
      
   Then again, my actual phone experience has mostly been muffled with a   
   rather obnoxious hiss (like, if the general phone experience wasn't bad   
   enough, they have to punish people for using the phone by having some   
   truly awful audio quality...).   
      
      
      
   But, then, had noted that with the ADPCM hack, or the B-Spline fitting   
   hack, it is again possible to hear what is being said at an 8kHz   
      
   [continued in next message]   
      