
You Can’t Round-Trip Audio Through Log-Mel

Why \(x \to \text{mel} \to \hat{x}\) is lossy, why the literature built vocoders to cope, and when that gap is worth closing.

Erik Quintanilla · June 2026

Every TTS and voice pipeline eventually asks the same question: can I encode audio to log-mel, manipulate it, and decode back without loss? No. The forward map is cheap and deterministic; the reverse map is not an inverse. Phase is discarded, frequency bins are merged, and many waveforms share the same mel. That is why the field invented vocoders: learned decoders that guess the information mel never stored. The demonstrations in this post use a real LibriSpeech-derived clip [8].

\[ \Phi:\ x(t)\ \longmapsto\ S \in \mathbb{R}^{M \times T}, \qquad \nexists\ \Phi^{-1}\ \text{s.t.}\ \Phi^{-1}(\Phi(x)) = x \]
Forward encode path and lossy decode return through log-mel.
Top (blue): encode \(x \to |X| \to E|X|^{\odot 2} \to \log \to S\). Dashed return (violet): any decode yields \(\hat{x} \neq x\). Branching marks information discarded before mel.

1. What mel throws away

Write the forward map as a composition of three steps. The magnitude STFT drops phase; the mel projection merges linear frequency bins; the log compression is invertible on its own but cannot undo the first two.

\[ |X(f,t)| = \bigl|\mathrm{STFT}(x)\bigr|, \qquad M = E\,|X|^{\odot 2}, \qquad S = 10\log_{10}\!\left(\frac{M}{\mathrm{ref}}\right) \]
| Stage | Map | Why \(\Phi \neq \mathrm{id}\) |
|---|---|---|
| Magnitude | \(x \mapsto \lvert X\rvert\) | Phase discarded; \(\infty\)-many \(x\) share one \(\lvert X\rvert\) [1] |
| Mel bank | \(\lvert X\rvert \mapsto E\lvert X\rvert^{\odot 2}\), \(E \in \mathbb{R}^{M \times F}\) | \(F \gg M\); \(E^{+}E \neq I\) [1] |

Composed, audio\(\to\)mel\(\to\)audio cannot be lossless even when \(S\) is copied exactly.
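
The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used for the figures: the Hann window, the 512-point FFT, the hop of 128, the 40 bands, the HTK-style mel formula and the synthetic 440 Hz test tone are all assumptions chosen for the example.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude STFT with a Hann window; shape (F, T) with F = n_fft // 2 + 1."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # np.abs is where phase is dropped

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters E, shape (n_mels, F), evenly spaced on the mel scale."""
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    E = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        E[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return E

def log_mel(x, sr=16000, n_fft=512, hop=128, n_mels=40, ref=1.0, floor=1e-10):
    """Phi: waveform -> log-mel in dB. The floor avoids log(0) on silent bands."""
    mag = stft_mag(x, n_fft, hop)                          # step 1: |X|
    mel_power = mel_filterbank(n_mels, n_fft, sr) @ mag**2  # step 2: E |X|^2
    return 10.0 * np.log10(np.maximum(mel_power, floor) / ref)  # step 3: log

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)   # synthetic stand-in for a speech clip
S = log_mel(x, sr=sr)
print(S.shape)                      # (n_mels, number of frames)
```

Note how far the shape shrinks along the way: 257 linear bins per frame become 40 mel bands, and the complex STFT becomes a real array.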

Take a real LibriSpeech clip and resynthesize the same magnitude with random phase: waveforms diverge, mel matches. Formally, \(\Phi(x_1) = \Phi(x_2)\) while \(x_1 \neq x_2\).

Two waveforms and one shared log-mel spectrogram.
Top: original (blue) vs phase-scrambled (violet) waveforms. Middle: two identical log-mel panels. Bottom: zero-difference strip.
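
The many-to-one property can be checked without any resynthesis step. The sketch below uses polarity inversion, \(x_2 = -x_1\), which shifts every phase by \(\pi\) and leaves every magnitude untouched, so the two log-mels agree to floating-point precision. The white-noise input and all analysis parameters are assumptions for illustration; it is not the LibriSpeech clip or the random-phase experiment shown in the figure.

```python
import numpy as np

def log_mel(x, sr=16000, n_fft=512, hop=128, n_mels=40, floor=1e-10):
    """Compact log-mel: Hann STFT power -> triangular mel bank -> dB."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)).T ** 2   # (F, T), phase dropped

    def to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = to_hz(np.linspace(0.0, to_mel(sr / 2), n_mels + 2))
    rising = (freqs[None, :] - edges[:-2, None]) / (edges[1:-1, None] - edges[:-2, None])
    falling = (edges[2:, None] - freqs[None, :]) / (edges[2:, None] - edges[1:-1, None])
    E = np.clip(np.minimum(rising, falling), 0.0, None)  # (n_mels, F)
    return 10.0 * np.log10(np.maximum(E @ power, floor))

rng = np.random.default_rng(0)
x1 = rng.standard_normal(16000)   # stand-in signal (assumption)
x2 = -x1                          # polarity inversion: same magnitudes, different waveform

S1, S2 = log_mel(x1), log_mel(x2)
print("max |x1 - x2|:", np.max(np.abs(x1 - x2)))   # clearly nonzero
print("max |S1 - S2|:", np.max(np.abs(S1 - S2)))   # zero up to rounding
```

Polarity inversion is the least interesting member of the preimage set, but it makes the point with no approximation: \(\Phi\) cannot tell the two signals apart.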
Mel filterbank triangles and compressed spectral energy.
Top: mel filterbank \(E\) over the linear frequency axis. Bottom: linear spectrum and compressed mel energy shown in separate coordinate panels.

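Mel projection is a wide linear map (\(M \ll F\)) with a nontrivial null space: \(E^{+}E\) is only the projector onto \(E\)’s row space, not the identity, so detail in the null space of \(E\) (the orthogonal complement of that row space) is discarded and cannot be read back from \(S\) alone.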
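
A toy example makes the discarded subspace concrete. The 3-by-7 matrix below is a hypothetical miniature filterbank (three overlapping triangles over seven linear bins), not a real mel bank, and the power values are made up. It shows that \(E^{+}E \neq I\) and that a spectrum can be changed along a null-space direction without moving the mel output at all.

```python
import numpy as np

# Hypothetical toy filterbank: 3 overlapping triangles over 7 linear bins.
E = np.array([
    [0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5],
])

E_pinv = np.linalg.pinv(E)
P = E_pinv @ E                        # projector onto the row space of E (7 x 7)
rank = np.linalg.matrix_rank(E)

# Rows of Vt beyond the rank span the null space of E.
_, _, Vt = np.linalg.svd(E)
null_basis = Vt[rank:]                # shape (7 - rank, 7)

power = np.array([1.0, 4.0, 2.0, 3.0, 5.0, 1.0, 2.0])   # made-up power spectrum
perturbed = power + 0.5 * null_basis[0]                  # move along a null direction

print("rank of E:", rank)
print("E+ E is identity:", np.allclose(P, np.eye(7)))
print("mel unchanged:", np.allclose(E @ perturbed, E @ power))
print("spectrum unchanged:", np.allclose(perturbed, power))
```

The pseudo-inverse returns the minimum-norm spectrum consistent with the mel energies, which is one reasonable guess among infinitely many; nothing in \(S\) says which one was the original.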

2. The naive round-trip still fails

Before neural vocoders, the classical decode was explicit: invert the log, pseudo-invert the mel bank, take the square root to get back from power to magnitude, estimate phase with Griffin–Lim [2], then iSTFT. Even at zero mel error (\(\|S - \Phi(x)\| = 0\)), the output is metallic: mel was never a container for the full signal.

\[ \hat{x} = \mathrm{iSTFT}\!\Bigl( \mathrm{GL}\Bigl(\sqrt{E^{+}\,\mathrm{ref}\cdot 10^{S/10}}\Bigr) \Bigr), \qquad \hat{x} \neq x \]
Classical and neural mel decode paths.
Top row: classical chain \(S \to E^{+} \to |\widehat{X}| \to \mathrm{GL} \to \mathrm{iSTFT}\) (violet terminus = noisy \(\hat{x}\)). Bottom row: neural vocoder \(\hat{x} = G_\theta(S)\) learns a prior over missing detail.
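
The classical chain in the top row can be sketched end to end with NumPy alone. Everything here is an assumption made for the example: the Hann window and frame sizes, the hand-rolled STFT and overlap-add inverse, 32 Griffin–Lim iterations from a random initial phase, clipping negative pseudo-inverse output to zero, and the two-tone test signal. It illustrates the structure of the decode, not its audio quality on real speech.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 16000, 512, 128, 40
WINDOW = np.hanning(N_FFT)

def stft(x):
    """Complex STFT, shape (F, T)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(X):
    """Windowed overlap-add inverse; the floor tames edges where few windows overlap."""
    frames = np.fft.irfft(X.T, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (X.shape[1] - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
        norm[i * HOP : i * HOP + N_FFT] += WINDOW ** 2
    return out / np.maximum(norm, 0.1)

def mel_filterbank():
    """Triangular mel filters E, shape (N_MELS, F)."""
    def to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = np.linspace(0.0, SR / 2, N_FFT // 2 + 1)
    edges = to_hz(np.linspace(0.0, to_mel(SR / 2), N_MELS + 2))
    rising = (freqs[None, :] - edges[:-2, None]) / (edges[1:-1, None] - edges[:-2, None])
    falling = (edges[2:, None] - freqs[None, :]) / (edges[2:, None] - edges[1:-1, None])
    return np.clip(np.minimum(rising, falling), 0.0, None)

def griffin_lim(mag, n_iter=32, seed=0):
    """Alternate between imposing the target magnitude and re-analysing for phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))   # random starting phase
    for _ in range(n_iter):
        X = stft(istft(mag * phase))
        phase = X / np.maximum(np.abs(X), 1e-8)          # keep the phase, drop the magnitude
    return istft(mag * phase)

def decode_classical(S, E, ref=1.0):
    mel_power = ref * 10.0 ** (S / 10.0)                     # invert the log
    power = np.maximum(np.linalg.pinv(E) @ mel_power, 0.0)   # pseudo-invert, clip negatives
    return griffin_lim(np.sqrt(power))                       # power -> magnitude -> waveform

t = np.arange(SR) / SR
x = np.sin(2 * np.pi * 220.0 * t) + 0.5 * np.sin(2 * np.pi * 1760.0 * t)
E = mel_filterbank()
S = 10.0 * np.log10(np.maximum(E @ np.abs(stft(x)) ** 2, 1e-10))   # exact mel, zero mel error

x_hat = decode_classical(S, E)
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print("relative waveform error:", rel_err)
```

The input mel here is exact, so any waveform error comes from the decode itself: the pseudo-inverse smears energy across the bins each filter covers, and Griffin–Lim settles on a phase that is merely self-consistent, not the original one.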

3. Why vocoders exist in the literature

Tacotron 2 [3] split the problem: an acoustic model predicts mel; a WaveNet vocoder maps mel to waveform. The vocoder exists because mel\(\to\)wave is one-to-many [4]: one tile \(S\) labels a family of signals differing in phase and null-space detail. MelGAN [5] and HiFi-GAN [6] replaced Griffin–Lim with fast learned decoders. PhaseAug [4] rotates phase during training so GAN vocoders stop memorizing a single ground-truth waveform per mel. Diffusion vocoders sample from the same equivalence class rather than chasing one canonical inverse.

Mel reconstruction loss is a poor proxy for perceptual match: equal mel RMSE can still yield different resynthesized timbres [7]. Judge decoders by ear, not pixel error alone.

\[ \Phi(x_1) = \Phi(x_2) = S,\ \ x_1 \neq x_2 \qquad\Longrightarrow\qquad \hat{x} = G_\theta(S) \approx x_1\ \text{(one choice from many preimages)} \]
Many waveforms map to one mel; vocoders pick one decode.
Three distinct preimages \(x_1, x_2, x_3\) (blue) collapse to one mel \(S\) (violet center). Dashed outbound paths: different vocoders \(G_{\theta_i}(S)\), a design choice, not a unique inverse.

4. Is perfect inversion worth solving?

Treat this as a product question. A lossless round-trip (\(\hat{x} = x\)) fights the representation: mel discards detail on purpose.

5. Why we don’t need the phase

The flip side of all this: discarding phase is not a reluctant compromise, it is the right call. The STFT shift property shows why. Delay the signal by \(\tau\) and every bin picks up a pure phase factor:

\[ x(t-\tau)\ \xrightarrow{\ \mathrm{STFT}\ }\ X(f,t-\tau)\,e^{-j2\pi f\tau}, \qquad \bigl|X(f,t-\tau)\,e^{-j2\pi f\tau}\bigr| = |X(f,t-\tau)|, \qquad \angle X \mapsto \angle X - 2\pi f\tau \]

The magnitude (hence the mel) is only slid along the time axis by \(\tau\), which for a delay far below the hop size leaves it essentially unchanged, while the phase of every bin rotates by \(-2\pi f\tau\) (with phase referenced to absolute time). So a sub-sample shift no listener can hear rewrites the entire phase spectrum. Three consequences make phase a poor thing to store:

Storing phase means committing to one arbitrary alignment from a perceptually equivalent continuum. Keeping \(|X|\) and letting a vocoder re-pick a self-consistent phase at decode time is the better trade, which is precisely the job vocoders do.
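
The shift property is easy to verify numerically on a single frame. The sketch below uses the discrete, circular form of the theorem, where the identity holds exactly: rolling a frame by \(\tau\) samples multiplies bin \(k\) by \(e^{-j2\pi k\tau/N}\). The frame length, the 3-sample shift and the random test frame are assumptions for illustration; a sliding-window STFT additionally moves the analysis window relative to the signal, which this single-frame check leaves out.

```python
import numpy as np

N = 256
rng = np.random.default_rng(0)
frame = rng.standard_normal(N)      # stand-in for one analysis frame

tau = 3                             # delay in samples
shifted = np.roll(frame, tau)       # shifted[n] = frame[n - tau], wrapping around

X = np.fft.fft(frame)
Y = np.fft.fft(shifted)

k = np.arange(N)
predicted = X * np.exp(-2j * np.pi * k * tau / N)   # pure phase ramp, one factor per bin

print("matches phase ramp:", np.allclose(Y, predicted))
print("magnitudes equal:  ", np.allclose(np.abs(Y), np.abs(X)))
print("phases equal:      ", np.allclose(np.angle(Y), np.angle(X)))
```

A three-sample delay at 16 kHz is under 0.2 ms, yet it changes the phase of nearly every bin while the magnitudes stay put.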

Practical checklist

Golden nugget. You cannot round-trip audio through log-mel without loss: that is why vocoders exist. For recognition, \(\Phi\) is the right compression. For synthesis, \(G_\theta\) is the literature’s admission that \(S\) never held the full signal. Invest in decoder quality when audio is the product; ignore the round-trip problem when it is not.

References

  1. Natsiou & O’Leary, “A sinusoidal signal reconstruction method for the inversion of the mel-spectrogram,” 2022.
  2. Griffin & Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
  3. Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2),” 2018.
  4. Lee et al., “PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping,” ICASSP 2023.
  5. Kumar et al., “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” NeurIPS 2019.
  6. Kong et al., “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” NeurIPS 2020.
  7. Natsiou et al., “An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms,” 2023.
  8. Panayotov et al., “LibriSpeech: an ASR corpus based on public domain audio books,” ICASSP 2015.