|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
|    Message 129,601 of 131,241    |
|    BGB to All    |
|    Random/OT: Low sample rate audio weirdne    |
|    06 Sep 25 05:28:16    |
From: cr88192@gmail.com

Just randomly thinking again about some things I noticed with audio at
low sample rates.

For baseline, can note, basic sample rates:
  44100: Standard, sounds good, but bulky
  32000: Sounds good
  22050: Moderate
  16000: OK, modest size, acceptable quality.
    Seems like the best tradeoff if not going for high quality.
  11025: Poor, muffled.
  8000: Very poor, speech almost unintelligible (normally).
    But, it is seeming like a "weird hack" may exist here.

For sample formats:
  16-bit PCM: Good
  Binary16: Also good
  A-Law: Decent (space efficient)
  8-bit PCM: Sounds poor at all sample rates.
    Tends to introduce an obvious hiss.

So, at higher sample rates, 16-bit PCM or A-Law are the clearly better
options. And, 16000 16-bit sounds better than 44100 8-bit, despite the
latter having the higher data-rate, because 8-bit PCM adds a very
obvious hiss.


For upsampling, a usual filtering strategy that works well seems to be
to use a cubic spline.

For downsampling, there are a few options:
  Nearest Neighbor:
    Simplest, poor quality
    Introduces very weird distortions when going to low sample rates.
  Box average:
    Take N samples and average them;
    Only works well for power-of-2 resampling.
  Pseudo tricubic:
    Take a block of N samples;
    Downsample by half until there are versions both above and below
    the target rate;
    Weighted sum of lower (average) and cubic interpolation (higher).
  Sinc:
    Theoretically exists, but I never made it work well.

As a general strategy, pseudo tricubic had worked well.


So, seemingly, at least when working at 16kHz or above:
  16-bit PCM, A-Law, or Binary16 is a win;
  Pseudo tricubic seems to give the best perceptual audio quality.
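As a rough illustration of two of the simpler building blocks named
above (box-average downsampling and cubic-spline upsampling), here is a
minimal Python sketch. The function names are made up for this example,
and the choice of a Catmull-Rom spline is an assumption: the post only
says "cubic spline". It is not the pseudo tricubic method itself, and
it only handles integer ratios.

```python
def box_downsample(samples, factor):
    """Average each block of `factor` samples (integer ratio only)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(samples) - len(samples) % factor  # drop a partial tail block
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, usable, factor)]


def catmull_rom(p0, p1, p2, p3, t):
    """Catmull-Rom cubic between p1 (t=0) and p2 (t=1).

    Assumption: this is one common interpolating cubic; the post does
    not say which spline was actually used.
    """
    t2 = t * t
    t3 = t2 * t
    return 0.5 * (2.0 * p1
                  + (p2 - p0) * t
                  + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t2
                  + (3.0 * p1 - p0 - 3.0 * p2 + p3) * t3)


def cubic_upsample(samples, factor):
    """Upsample by an integer factor, clamping indices at the edges."""
    n = len(samples)

    def at(i):
        # Repeat the first/last sample beyond the ends of the input.
        return samples[min(max(i, 0), n - 1)]

    out = []
    for i in range(n):
        for k in range(factor):
            out.append(catmull_rom(at(i - 1), at(i), at(i + 1), at(i + 2),
                                   k / factor))
    return out


if __name__ == "__main__":
    print(box_downsample([1, 3, 5, 7], 2))
    print(cubic_upsample([0.0, 1.0, 2.0, 3.0], 2))
```

Every `factor`-th output of `cubic_upsample` is exactly an input
sample, which is the "passes through each point" property of an
interpolating spline.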
If going to low sample rates (11 or 8kHz), a problem emerges:
  Speech becomes muffled and unintelligible.
So, 16-bit PCM and A-Law don't help here; the audio is still muffled.


But, there is something weird I had noticed at low rates (eg, 8kHz):
  ADPCM encoding seems to increase the intelligibility.
    Speech is seemingly more intelligible after ADPCM than before.
  More so when not using the "obvious choice" of minimizing error.
    It works better if the encoder is tuned to slightly overshoot.
  The effect is more obvious with 2-bit ADPCM than with 4-bit.
    Like, some sort of weird "less is more" with the quality.

My sense of hearing and the RMSE heuristic somewhat disagree about which
is "better" quality. RMSE seems to prefer it if the ADPCM encoder tends
to undershoot, and RMSE also prefers more muffled versions (that are
closer to the down-sampled input audio).


Similarly (partly inspired by the ADPCM effect), it also seems a new
contender has arrived on the scene as a resampling algorithm:
Treat the previously generated sample points as a reference, and try to
pick a next point that best fits a line or B-Spline to the intermediate
points (in effect, treating each sample point as if it were a control
point for a B-Spline fitted to the input samples).


The line-fitting is simpler, but the B-Spline seems to give a similar
effect with better quality (even if RMSE does not agree: it sees the
error from this as worse than with the other methods if the audio is
upsampled using the normal spline method).

Though, RMSE is lower if the upsampler also treats the audio samples
more like the control points in a B-Spline.
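One possible reading of the simpler line-fitting variant described
above, as a minimal Python sketch. Everything here is an assumption
made for illustration: the function name, the integer decimation
factor, and the use of a per-segment least-squares fit. The actual
encoder may well differ, and the B-Spline variant is not shown.

```python
def fit_line_downsample(x, n):
    """Greedy line-fit downsampler (hypothetical sketch).

    Keeps the previous output point fixed, then picks the next output
    point so that a straight line from the previous point to the new
    one best matches (least squares) the `n` input samples in between.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    if not x:
        return []
    out = [float(x[0])]                       # first point taken as-is
    weights = [j / n for j in range(1, n + 1)]
    denom = sum(w * w for w in weights)
    for start in range(0, len(x) - n, n):
        prev = out[-1]
        num = 0.0
        for j, w in enumerate(weights, start=1):
            # The line's value at step j is prev*(1-w) + cand*w;
            # solve for the candidate `cand` in the least-squares sense.
            num += w * (x[start + j] - prev * (1.0 - w))
        out.append(num / denom)
    return out


if __name__ == "__main__":
    ramp = [float(i) for i in range(9)]
    step = [0.0] * 5 + [1.0] * 4
    print(fit_line_downsample(ramp, 4))   # a ramp is reproduced exactly
    print(fit_line_downsample(step, 4))   # last value lands above 1.0
```

On the step input the chosen point overshoots the input, since the
fitted line has to "catch up" with samples that already jumped. That
at least resembles the slight-overshoot behavior described for the
ADPCM case, though whether it is the same effect is not established
here.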
Where, in my usual cubic-spline upsampler, the interpolation passes
through each control point (if the interpolated position directly aligns
with a control point, it returns this point). This differs from a
B-Spline, where generally the curve undershoots the control points.


Then, it seems like for storage, the low-rate audio (control points) can
be stored in ADPCM (though this time, with error-minimization during
encoding giving the best results).

And, oddly, it seems like the audio (in this low-sample-rate,
control-points form) actually has higher perceptual audio quality (and
things like speech seem more intelligible, despite the low 8kHz sample
rate).


But, I am at a loss here as to why any of this would be true in a
theoretical sense.



Stuff online mentioning the use of B-Splines for audio seems to work on
the assumption of generating control points and then using another
B-Spline to generate audio at the target rate (rather than directly
listening to the control points as audio).

Stuff online also mentions needing to low-pass filter the audio before
generating the spline, but if any sort of low-pass filtering is applied
(before spline generation) then (again) it becomes muffled and
unintelligible.

Presumably, the idea would be to filter out things above the Nyquist
frequency of the target sample rate (so, say, 4kHz for 8kHz audio), but,
as noted, a 4kHz low-pass filter (in general) wrecks intelligibility.

Then again, maybe the mention of low-pass filtering assumes operating at
somewhat higher target sample rates?...


Where, seemingly for speech and frequency ranges:
  under 1 kHz: mystery range...
    Filtering out has little effect.
  1-2 kHz: "fullness"
    Filtering out this range causes a "tinny" sound
    Filtering this out seems to strongly displease cats.
  2-4 kHz: Has vowel sounds
    Filtering out this range makes voices sound robotic.
    Many of the distinguishing parts of the voice go away.
  4-8 kHz: Consonants / etc seem to live here
    Filtering this out removes the "what is being said" part.
  8-16 kHz: Mostly optional
    Improves quality, but no effect on intelligibility.
  16 kHz: Upper end of hearing
    Like, CRT TV whistling is up here.

Where, I had noted that general intelligibility of speech and other
audio remains intact with a 4kHz to 8kHz band-pass filter, though with a
"robotic" sound, and it is harder to tell people's voices apart (like,
everyone is speaking with a similar-sounding robotic voice).

But, with 8kHz audio having a 4kHz Nyquist frequency, this poses a
problem. Can sort of hear vowel sounds, but sounds are often largely
undifferentiated. Like, can hear that someone is talking, or whose voice
it is, but not really what they are saying.

Though, this does leave a mystery of why telephony would have used 8kHz,
when presumably intelligible speech is the whole point of a telephone?...

Then again, my actual phone experience has mostly been muffled with a
rather obnoxious hiss (like, if the general phone experience wasn't bad
enough, they have to punish people for using the phone by having some
truly awful audio quality...).



But, then, had noted that with the ADPCM hack, or the B-Spline fitting
hack, it is again possible to hear what is being said at an 8kHz

[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)