You Can't Round-Trip Audio Through Log-Mel

Why \(x \to \text{mel} \to \hat{x}\) is lossy, why the literature built vocoders to cope, and when that gap is worth closing.

Every TTS and voice pipeline eventually asks the same question: can I encode audio to log-mel, manipulate it, and decode back without loss? No. The forward map is cheap and deterministic; the reverse map is not an inverse. Phase is discarded, frequency bins are merged, and many waveforms share the same mel. That is why the field invented vocoders: learned decoders that guess the information mel never stored. The running example is a real LibriSpeech-derived clip [8].

\[ \Phi:\ x(t)\ \longmapsto\ S \in \mathbb{R}^{M \times T}, \qquad \nexists\ \Phi^{-1}\ \text{s.t.}\ \Phi^{-1}(\Phi(x)) = x \]

1. What mel throws away

Write the forward map as a composition of three steps. Magnitude STFT drops phase; mel projection compresses linear frequency bins; log compression is invertible on its own but cannot undo the first two.

\[ |X(f,t)| = \bigl|\mathrm{STFT}(x)\bigr|, \qquad P = E\,|X|^{\odot 2}, \qquad S = 10\log_{10}\!\left(\frac{P}{\mathrm{ref}}\right) \]
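The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names (`mel_filterbank`, `stft`, `log_mel`), the triangular filter shape, the frame sizes, and the synthetic sine input are all assumptions made for the example, and production code would normally rely on an audio library instead.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters E, shape (n_mels, n_fft // 2 + 1). Illustrative only."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    E = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        E[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return E

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT without padding, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def log_mel(x, sr=16000, n_fft=512, hop=128, n_mels=80, ref=1.0, eps=1e-10):
    mag = np.abs(stft(x, n_fft, hop))           # step 1: magnitude, phase is dropped
    E = mel_filterbank(n_mels, n_fft, sr)
    mel_power = E @ mag**2                      # step 2: mel projection of the power
    return 10.0 * np.log10(np.maximum(mel_power, eps) / ref)  # step 3: log (dB)

sr = 16000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 220.0 * t)         # synthetic stand-in, not real speech
S = log_mel(x, sr=sr)
print(S.shape)                                  # (n_mels, n_frames)
```

The `eps` floor is an addition for numerical safety (it keeps empty bands from producing `-inf`); it is not part of the formula above.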

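Take a real LibriSpeech clip and resynthesize the same magnitude with random phase: the waveforms diverge while the mel stays essentially the same. The collision can also be exact: a global sign flip \(x_2 = -x_1\) leaves \(|X|\) untouched, so \(\Phi(x_1) = \Phi(x_2)\) while \(x_1 \neq x_2\).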
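A single-frame version of this experiment fits in a few lines. The sketch below uses a random noise frame as a stand-in for audio (an assumption, not the LibriSpeech clip) and works on one DFT frame, where matching magnitudes are exact; with overlapping STFT frames, random phase is generally inconsistent across frames and the match is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
x1 = rng.standard_normal(n)                 # stand-in for one audio frame

X1 = np.fft.rfft(x1)
phase = rng.uniform(-np.pi, np.pi, X1.shape)
phase[0] = 0.0                              # DC bin must stay real
phase[-1] = 0.0                             # Nyquist bin must stay real (n is even)

# Same magnitude, new random phase, back to the time domain
X2 = np.abs(X1) * np.exp(1j * phase)
x2 = np.fft.irfft(X2, n=n)

same_magnitude = np.allclose(np.abs(np.fft.rfft(x2)), np.abs(X1))
different_wave = not np.allclose(x1, x2)
print(same_magnitude, different_wave)
```

Anything computed from the magnitude alone, including a mel projection of this frame, cannot tell `x1` and `x2` apart.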

Mel projection is a wide linear map (\(F \gg M\)) with a nontrivial null space: \(E^{+}E \neq I\), so any detail lying in the null space of \(E\) (the orthogonal complement of its row space) is discarded and cannot be read back from \(S\) alone.
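The linear-algebra claim is easy to check numerically. In this sketch `E` is a small random nonnegative matrix standing in for a mel bank (a real one is triangular and much larger), and the sizes are arbitrary; the point is only that two different power spectra can project to the same mel vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mels, n_bins = 8, 64

# Hypothetical stand-in for the mel bank E (M x F with F >> M)
E = rng.uniform(0.0, 1.0, (n_mels, n_bins))
E_pinv = np.linalg.pinv(E)

# E^+ E projects onto the row space of E; it has rank M, so it is not the identity
P = E_pinv @ E
rank = np.linalg.matrix_rank(E)

# Build a direction in the null space of E by removing the row-space component
v = rng.standard_normal(n_bins)
null_dir = v - P @ v

# Two different (still positive) power spectra with the same mel projection
power = rng.uniform(1.0, 2.0, n_bins)
power_alt = power + 0.1 * null_dir / np.abs(null_dir).max()

print(rank, np.allclose(P, np.eye(n_bins)))
print(np.allclose(E @ power, E @ power_alt), np.allclose(power, power_alt))
```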

Composed, audio\(\to\)mel\(\to\)audio cannot be lossless even when \(S\) is copied exactly.

| Stage | Map | Why it cannot be inverted |
| --- | --- | --- |
| Magnitude | \(x \mapsto \lvert X \rvert\) | Phase discarded; infinitely many \(x\) share one \(\lvert X \rvert\) [1] |
| Mel bank | \(\lvert X \rvert \mapsto E\,\lvert X \rvert^{\odot 2}\), \(E \in \mathbb{R}^{M \times F}\) | \(F \gg M\); \(E^{+}E \neq I\) [1] |

2. The naive round-trip still fails

Before neural vocoders, the classical decode was explicit: invert log, pseudo-invert the mel bank, estimate phase with Griffin–Lim [2], then iSTFT. Even at zero mel error (\(\|S - \Phi(x)\| = 0\)), the output is metallic: mel was never a container for the full signal.

\[ \hat{x} = \mathrm{iSTFT}\!\Bigl( \mathrm{GL}\bigl(\sqrt{E^{+}\,\mathrm{ref}\cdot 10^{S/10}}\,\bigr) \Bigr), \qquad \hat{x} \neq x \]
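The classical decode can be sketched end to end in NumPy. Everything here is an assumption made for illustration: the helper names, the triangular filterbank, the frame sizes, the 32 Griffin–Lim iterations, `ref = 1`, and the two-tone synthetic input standing in for speech. It is a toy version of the pipeline, not a tuned implementation.

```python
import numpy as np

N_FFT, HOP, SR, N_MELS = 512, 128, 16000, 80
WINDOW = np.hanning(N_FFT)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1). Illustrative only."""
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = to_hz(np.linspace(to_mel(0.0), to_mel(sr / 2.0), n_mels + 2))
    E = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        E[m] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                  (hi - freqs) / (hi - mid)), 0.0, None)
    return E

def stft(x):
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(X, length):
    """Weighted overlap-add inverse of stft()."""
    frames = np.fft.irfft(X.T, n=N_FFT, axis=1) * WINDOW
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
        norm[i * HOP : i * HOP + N_FFT] += WINDOW**2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, length, n_iter=32, seed=0):
    """Alternate between the target magnitude and the phase of a real signal."""
    rng = np.random.default_rng(seed)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * angles, length)
        angles = np.exp(1j * np.angle(stft(x)))
    return istft(mag * angles, length)

t = np.arange(SR) / SR
x = 0.5 * np.sin(2 * np.pi * 220.0 * t) + 0.3 * np.sin(2 * np.pi * 1370.0 * t)

E = mel_filterbank(N_MELS, N_FFT, SR)
S = 10.0 * np.log10(np.maximum(E @ np.abs(stft(x)) ** 2, 1e-10))   # forward, ref = 1

mel_power = 10.0 ** (S / 10.0)                                     # invert the log
mag_hat = np.sqrt(np.maximum(np.linalg.pinv(E) @ mel_power, 0.0))  # pseudo-invert E
x_hat = griffin_lim(mag_hat, len(x))                               # estimate phase

rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative waveform error: {rel_err:.3f}")
```

The mel passed to the decoder is the exact forward output, so any waveform error comes from the decode itself, not from a mismatch in \(S\).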

3. Why vocoders exist in the literature

Tacotron 2 [3] split the problem: an acoustic model predicts mel; a WaveNet vocoder maps mel to waveform. The vocoder exists because mel\(\to\)wave is one-to-many [4]: one spectrogram \(S\) labels a family of signals differing in phase and null-space detail. MelGAN [5] and HiFi-GAN [6] replaced Griffin–Lim with fast learned decoders. PhaseAug [4] rotates phase during training so GAN vocoders stop memorizing a single ground-truth waveform per mel. Diffusion vocoders sample from the same equivalence class rather than chasing one canonical inverse.

Mel reconstruction loss is a poor proxy for perceptual match: equal mel RMSE can still yield different resynthesized timbres [7]. Judge decoders by ear, not pixel error alone.

\[ \Phi(x_1) = \Phi(x_2) = S,\ \ x_1 \neq x_2 \qquad\Longrightarrow\qquad \hat{x} = G_\theta(S) \approx x_1\ \text{(one choice from many preimages)} \]

4. Is perfect inversion worth solving?

Treat this as a product question. Perfect lossless round-trip (\(\hat{x} = x\)) fights the representation: mel discards detail on purpose.

5. Why we don’t need the phase

The flip side of all this: discarding phase is not a reluctant compromise, it is the right call. The STFT shift property shows why. Delay the signal by \(\tau\) and every bin picks up a pure phase factor:

\[ x(t-\tau)\ \xrightarrow{\ \mathrm{STFT}\ }\ X(f,t-\tau)\,e^{-j2\pi f\tau}, \qquad \bigl|X(f,t-\tau)\,e^{-j2\pi f\tau}\bigr| = |X(f,t-\tau)|, \qquad \angle X \mapsto \angle X - 2\pi f\tau \]

The magnitude (hence the mel) is only slid along the time axis by \(\tau\), which for a shift far shorter than the analysis window leaves it essentially unchanged, while the phase of every bin rotates by \(-2\pi f\tau\). So a sub-sample shift no listener can hear rewrites the entire phase spectrum. That sensitivity makes phase a poor thing to store.
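The shift property can be checked on a single DFT frame. This sketch makes two simplifying assumptions: the frame is random noise rather than speech, and the delay is a circular shift of one whole sample (an idealization in which the magnitude is exactly preserved; a sub-sample delay behaves the same way but needs interpolation to construct).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
frame = rng.standard_normal(n)       # synthetic stand-in for one audio frame
shifted = np.roll(frame, 1)          # circular delay by one sample

X = np.fft.rfft(frame)
Y = np.fft.rfft(shifted)
k = np.arange(X.size)                # bin index; the rotation is -2*pi*k/n per bin

magnitude_unchanged = np.allclose(np.abs(Y), np.abs(X))
matches_shift_theorem = np.allclose(Y, X * np.exp(-2j * np.pi * k / n))
phase_change = np.angle(Y * np.conj(X))   # wrapped phase rotation of each bin

print(magnitude_unchanged, matches_shift_theorem)
print(phase_change[:4])              # small at low bins, approaching -pi near Nyquist
```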

Storing phase means committing to one arbitrary alignment from a perceptually equivalent continuum. Keeping \(|X|\) and letting a vocoder re-pick a self-consistent phase at decode time is the better trade, which is precisely the job vocoders do.

Practical checklist

Golden nugget. You cannot round-trip audio through log-mel without loss: that is why vocoders exist. For recognition, \(\Phi\) is the right compression. For synthesis, \(G_\theta\) is the literature’s admission that \(S\) never held the full signal. Invest in decoder quality when audio is the product; ignore the round-trip problem when it is not.

References

[1] Natsiou & O’Leary, “A sinusoidal signal reconstruction method for the inversion of the mel-spectrogram,” 2022.
[2] Griffin & Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[3] Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2),” 2018.
[4] Lee et al., “PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping,” ICASSP 2023.
[5] Kumar et al., “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” NeurIPS 2019.
[6] Kong et al., “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” NeurIPS 2020.
[7] Natsiou et al., “An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms,” 2023.
[8] Panayotov et al., “LibriSpeech: an ASR corpus based on public domain audio books,” ICASSP 2015.
