Speech Synthesis and Perception with Envelope Cue
Overview
This project was completed for the Signals and Systems lab course. It implemented a Tone Vocoder — a system that decomposes speech into frequency sub-bands, extracts the amplitude envelope of each band, re-modulates the envelopes onto sinusoidal carriers, and resynthesizes the signal. This mimics the processing strategy used in cochlear implants, which must transmit speech with a very limited number of independent channels.
Results
- Increasing the number of frequency bands N consistently improved perceptual quality of the resynthesized speech.
- Increasing the low-pass filter cutoff frequency improved envelope fidelity and naturalness.
- Bionic cochlear segmentation (logarithmically spaced bands) outperformed equal-interval segmentation at low N (e.g., N=4), because the low-frequency range carries disproportionately more speech energy.
- At large N, equal-interval segmentation reached a higher quality ceiling, whereas cochlear segmentation became unstable at N≈20 (narrow passbands caused filter instability).
- Added Speech-Shaped Noise (SSN) at varying SNRs and confirmed that envelope-based synthesis degrades gracefully but becomes unintelligible at low SNR.
- Developed a full MATLAB App Designer GUI for real-time parameter exploration.
Technical Details
Tone Vocoder Pipeline:
- Band-pass filtering: Split the 200–7000 Hz speech spectrum into N sub-bands using Butterworth BPFs.
  - Mode 0: Equal-frequency spacing.
  - Mode 1: Cochlear-length mapping (f = 165.4 × (10^(0.06d) − 1)), producing logarithmically spaced bands that match basilar membrane resonance distribution.
- Envelope extraction: Full-wave rectification (abs) followed by a low-pass Butterworth filter (cutoff Cf Hz) to extract the amplitude envelope of each sub-band.
- Carrier modulation: Each envelope multiplied by a sinusoidal carrier at the sub-band midpoint frequency.
- Synthesis & normalization: Sum all modulated sub-bands; normalize energy to match the input signal level.
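The two band-splitting modes above can be sketched as follows. This is a minimal illustration in standard-library Python, not the project's MATLAB code: the function names (`freq_to_place`, `place_to_freq`, `band_edges`) are hypothetical, and the reading of d as a position along the cochlea that is divided into equal steps is an assumption based on the formula given above.

```python
import math

def freq_to_place(f_hz):
    """Invert f = 165.4 * (10**(0.06*d) - 1) to get the place d for a frequency."""
    return math.log10(f_hz / 165.4 + 1.0) / 0.06

def place_to_freq(d):
    """Cochlear-length mapping from the text: f = 165.4 * (10**(0.06*d) - 1)."""
    return 165.4 * (10.0 ** (0.06 * d) - 1.0)

def band_edges(n_bands, f_lo=200.0, f_hi=7000.0, mode=1):
    """Return the n_bands + 1 edge frequencies (Hz) between f_lo and f_hi.

    mode 0: equal spacing in frequency.
    mode 1: equal spacing in cochlear place d (assumed interpretation),
            which gives narrow low bands and wide high bands.
    """
    if n_bands < 1:
        raise ValueError("n_bands must be at least 1")
    if mode == 0:
        step = (f_hi - f_lo) / n_bands
        return [f_lo + i * step for i in range(n_bands + 1)]
    d_lo, d_hi = freq_to_place(f_lo), freq_to_place(f_hi)
    step = (d_hi - d_lo) / n_bands
    return [place_to_freq(d_lo + i * step) for i in range(n_bands + 1)]

if __name__ == "__main__":
    for mode in (0, 1):
        edges = band_edges(4, mode=mode)
        print(mode, [round(e, 1) for e in edges])
```

With N=4, every equal-interval band is 1700 Hz wide, while the lowest cochlear band is much narrower, so more of the channels fall in the low-frequency range.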
Advanced Extensions:
- Carrier frequency variants: Tested geometric mean, harmonic mean, arithmetic mean, and square mean as alternatives to the midpoint frequency, examining effects on reconstruction fidelity.
- SSN generation: Synthesized speech-shaped noise matching the input’s power spectral density using pwelch + fir2, added at a controlled SNR.
- MATLAB App Designer console: Interactive GUI with sliders for band count (0–150) and LPF cutoff (0–200 Hz), BPF mode toggle, SSN on/off switch, and real-time waveform + spectrum display.
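Two of the extensions above reduce to short formulas, sketched here in standard-library Python rather than the project's MATLAB. The function names are hypothetical. "Square mean" is assumed to mean the quadratic (root-mean-square) mean of the band edges, and the SNR is assumed to be defined on mean signal power. The noise shaping itself (pwelch + fir2) is not reproduced; only the scaling step that sets the SNR is shown.

```python
import math

def carrier_candidates(f_lo, f_hi):
    """Candidate carrier frequencies (Hz) for a band with edges f_lo and f_hi."""
    return {
        "arithmetic": (f_lo + f_hi) / 2.0,
        "geometric": math.sqrt(f_lo * f_hi),
        "harmonic": 2.0 * f_lo * f_hi / (f_lo + f_hi),
        # Assumed meaning of "square mean": quadratic (RMS) mean of the edges.
        "square": math.sqrt((f_lo ** 2 + f_hi ** 2) / 2.0),
    }

def mean_power(x):
    """Mean squared value of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Target noise power is p_speech / 10**(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

if __name__ == "__main__":
    print(carrier_candidates(200.0, 800.0))
    print(mix_at_snr([1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5], 0.0))
```

For any band with f_lo < f_hi the four candidates are ordered harmonic < geometric < arithmetic < square, so the choice shifts the carrier toward the lower or upper edge of the band.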
Challenges
- Filter instability at high N: Narrow passbands caused Butterworth BPF coefficients to become numerically unstable; identified N≈20 as the practical upper bound for cochlear-mode segmentation.
- Energy normalization: Without explicit normalization, synthesized speech energy varied significantly with N and Cf, making perceptual comparisons across conditions unreliable.
- Code modularity: Refactored the pipeline into reusable functions (Envelope, getSSN, alter) shared across standalone scripts and the App Designer class, which required careful handling of MATLAB’s function scoping rules.
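The energy normalization mentioned in the challenges above can be illustrated with a small sketch. It is written in standard-library Python, not the project's MATLAB, and assumes that "matching the input signal level" means matching root-mean-square (RMS) amplitude. The names `rms` and `match_rms` are hypothetical.

```python
import math

def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def match_rms(synth, reference):
    """Rescale synth so its RMS equals that of reference (assumed criterion)."""
    r_synth = rms(synth)
    if r_synth == 0.0:
        return list(synth)  # a silent signal cannot be rescaled
    gain = rms(reference) / r_synth
    return [gain * v for v in synth]

if __name__ == "__main__":
    out = match_rms([0.1, -0.1, 0.1, -0.1], [1.0, -1.0, 1.0, -1.0])
    print(rms(out))
```

Applying the same rescaling to every condition removes the loudness differences that would otherwise vary with N and Cf when listening across conditions.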
Reflection and Insights
This project made abstract signal-processing concepts tangible: the effect of filter bank design on speech quality can be heard directly, not just measured. The cochlear-inspired logarithmic spacing illustrates a broader principle — domain-specific knowledge (here, auditory neuroscience) often provides better engineering priors than uniform mathematical choices. The project also demonstrated that building an interactive parameter-exploration tool, even a simple slider-based GUI, dramatically accelerates the insight cycle compared to running scripts with hardcoded values.
Team and Role
- Team: Two-person team.
- My Role: Implemented the core Tone Vocoder pipeline; designed and built the MATLAB App Designer console; led the cochlear segmentation analysis and carrier frequency experiments.