Sound Transformation: Applying Image Neural Style Transfer Networks to Audio Spectograms
Feb 05, 2019
Cite:
@inproceedings{liu2019sound,
  title={Sound Transformation: Applying Image Neural Style Transfer Networks to Audio Spectograms},
  author={Liu, Xuehao and Delany, Sarah Jane and McKeever, Susan},
  booktitle={International Conference on Computer Analysis of Images and Patterns},
  pages={330--341},
  year={2019},
  organization={Springer}
}
Paper Link: here
Image style transfer using deep learning has been highly successful. It comes in many variants, each suited to a specific style transfer problem. Four networks are tested here: Slow, Faster, cGAN, and cycleGAN. We trained an audio classification CNN with the VGG-19 architecture, and Slow and Faster are built on it. The basic idea is to treat a spectrogram as a gray-scale image.
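As a minimal sketch of the "spectrogram as a gray-scale image" idea, the snippet below turns a waveform into a log-magnitude spectrogram scaled to [0, 1]. The function name `spectrogram_image`, the FFT size, the hop length, and the test tone are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def spectrogram_image(signal, n_fft=512, hop=128):
    """Return a log-magnitude spectrogram scaled to [0, 1], shape (freq, time)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Magnitude of the one-sided FFT; transpose so rows are frequency bins.
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    log_mag = np.log1p(magnitude)
    # Min-max scale so the array can be handled like a single-channel image.
    span = log_mag.max() - log_mag.min()
    return (log_mag - log_mag.min()) / (span if span > 0 else 1.0)

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
image = spectrogram_image(tone)
```

The resulting 2-D array has one channel, so an image network usually needs only its first layer adjusted (or the channel repeated) to accept it.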

For the Slow and Faster results, the backbone of the network is unchanged. I only made small adjustments so that the network can take spectrograms as input. I also added the energy and frequency constraints, following this paper, to keep the networks from distorting the spectrogram.
(a) is the original flute.
(b) is the original keyboard.
(c) is the result of Slow.
(d) is the result of Faster.
(related blog: the first image style transfer)
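The exact form of the energy and frequency constraints is not spelled out in this post. The sketch below is one plausible reading, assuming the energy constraint compares per-frame energy and the frequency constraint compares per-bin energy between the generated spectrogram and a reference. The function name `envelope_losses` and the mean-squared-error form are assumptions for illustration.

```python
import numpy as np

def envelope_losses(generated, reference):
    """Mean squared differences between temporal and spectral energy profiles.

    Both inputs are spectrogram magnitudes with shape (freq, time).
    This is an assumed formulation, not the one from the paper.
    """
    # Temporal energy envelope: total energy in each time frame.
    energy_gen = generated.sum(axis=0)
    energy_ref = reference.sum(axis=0)
    # Frequency profile: total energy in each frequency bin.
    freq_gen = generated.sum(axis=1)
    freq_ref = reference.sum(axis=1)
    energy_loss = np.mean((energy_gen - energy_ref) ** 2)
    freq_loss = np.mean((freq_gen - freq_ref) ** 2)
    return energy_loss, freq_loss

# Toy example: a generated spectrogram that is uniformly twice as loud.
reference = np.ones((4, 6))
same = envelope_losses(reference.copy(), reference)
louder = envelope_losses(2.0 * reference, reference)
```

In a training loop, terms like these would be added to the style and content losses with small weights, so that the output keeps the overall loudness contour and spectral balance of the reference.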

The spectrograms and sounds above are the results of cGAN, which transfers a flute to a keyboard. The leftmost one is the result; the others are the original flute and keyboard. Strictly speaking, this is a translation rather than a transfer. Even so, we never get exactly the target from the training set. (related blog)

These are the results from cycleGAN.
(a) is the original flute.
(b) is the original keyboard.
(c) is the result transferred from keyboard to flute.
(d) is the result transferred from flute to keyboard.
This is interesting because, in image translation, cycleGAN changes only the parts it believes matter most and keeps the rest of the image intact. In audio, we do not know which parts the network treats as key. (related blog)
For more results, please see here.

Here is the t-SNE mapping of the newly generated audio signals compared with other natural sounds.
The aim of this mapping is to show that these generated sounds are something new. They can be separated from their source sounds.
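A mapping of this kind can be sketched with scikit-learn's `TSNE`. The feature vectors below are synthetic placeholders, and the group names, feature size, and perplexity are assumptions. In practice the features would come from the audio itself, for example from a layer of the classification CNN.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Placeholder features: 3 groups of 20 clips, 64 features each (synthetic data).
groups = ["flute", "keyboard", "generated"]
features = np.concatenate([rng.normal(loc=3.0 * i, scale=1.0, size=(20, 64))
                           for i in range(len(groups))])
labels = np.repeat(groups, 20)

# Project the 64-dimensional features to 2-D for plotting.
embedding = TSNE(n_components=2, perplexity=15, init="pca",
                 random_state=0).fit_transform(features)
```

Plotting `embedding` colored by `labels` then shows whether the generated clips form their own cluster or overlap with their sources.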