Audio demos for our paper on symbolic music generation. The symbolic representation contains five stems (vocal, piano, guitar, bass, and drums); the vocal stem is rendered with a bell sound.

The model is trained on a music audio dataset consisting mostly of Chinese pop songs. We transcribe the audio into the symbolic domain using MIR models (beat tracking, chord detection, structure analysis, five-stem transcription, and music tagging). The MIR outputs are then serialized into a token sequence used to train a transformer-based language model.
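
To make the pipeline concrete, here is a minimal sketch of how MIR outputs might be flattened into one token sequence. The token vocabulary, field names, and bar-level grouping are illustrative assumptions, not the paper's actual encoding:

```python
# Sketch only: token names and the Note fields below are hypothetical,
# chosen to illustrate serializing MIR output into a flat token stream.
from dataclasses import dataclass

@dataclass
class Note:
    stem: str        # "vocal", "piano", "guitar", "bass", or "drums"
    pitch: int       # MIDI pitch number
    onset: float     # onset time within the bar, in beats
    duration: float  # duration in beats

def to_tokens(section: str, chord: str, notes: list[Note]) -> list[str]:
    """Serialize one bar of MIR output into control tokens + note tokens."""
    tokens = [f"<section:{section}>", f"<chord:{chord}>"]
    # Notes are interleaved in time order so the language model sees
    # all five stems as a single sequence.
    for n in sorted(notes, key=lambda n: (n.onset, n.stem)):
        tokens += [
            f"<stem:{n.stem}>",
            f"<onset:{n.onset}>",
            f"<pitch:{n.pitch}>",
            f"<dur:{n.duration}>",
        ]
    return tokens

# Example: one bar of a verse over a C major chord.
bar = to_tokens("verse", "C:maj", [
    Note("bass", 36, 0.0, 1.0),
    Note("piano", 60, 0.0, 2.0),
    Note("vocal", 72, 1.0, 1.0),
])
print(bar[:6])
```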

Some characteristics we think are interesting:

  • The model keeps the theme consistent over an entire song, showing a strong ability to model long-term dependencies.
  • Some generated pieces shift key halfway through, with quite smooth transitions (e.g., the two samples in the last row).

Our model can also take control inputs, such as a chord progression or a song structure. Below is a piece generated with an uncommon chord progression (a loop over the major chords A-B-C-D-E-F-G-A).

Although the input chord progression is quite uncommon, the model still finds a balance between following the input and producing a reasonable melody, which surprised us.
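
As a sketch of how such control could be supplied at generation time, the chord progression can be placed in the prompt as control tokens for the model to continue from. The token names and the `model.generate` call below are assumptions about the interface, not the paper's actual API:

```python
# Hypothetical conditioning sketch: encode the looping A-B-C-D-E-F-G-A
# progression as control tokens and let the model generate notes after it.
progression = ["A:maj", "B:maj", "C:maj", "D:maj",
               "E:maj", "F:maj", "G:maj", "A:maj"]

prompt = ["<song_start>", "<section:verse>"]
for chord in progression:
    # One bar per chord; the progression loops for as long as generation runs.
    prompt += [f"<chord:{chord}>", "<bar>"]

print(prompt)
# tokens = model.generate(prompt, max_new_tokens=4096)  # assumed interface
```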