Here are some audio demos for our paper on symbolic music generation. The model is trained on a music audio dataset consisting mostly of Chinese pop songs. We transcribe the audio into the symbolic domain using MIR models (beat tracking, chord detection, structure analysis, 5-stem transcription, and music tagging). The outputs of these MIR models are then transformed into a token sequence, which is used to train a transformer-based language model.
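As a rough illustration (not the paper's actual vocabulary), the sketch below shows one way such MIR outputs could be flattened into a token sequence; the token names, event ordering, and the Note structure are assumptions made for this example.

```python
# Hypothetical sketch: serialize per-bar MIR annotations into tokens.
# Token names (BAR, SECTION_*, CHORD_*, ...) are illustrative only.
from dataclasses import dataclass

@dataclass
class Note:
    stem: str       # "vocal", "piano", "guitar", "bass", "drum"
    pitch: int      # MIDI pitch number
    start: float    # onset position within the bar, in beats
    duration: float # length in beats

def tokenize_bar(chord: str, section: str, notes: list[Note]) -> list[str]:
    """Flatten one bar of MIR annotations into a flat token sequence."""
    tokens = ["BAR", f"SECTION_{section}", f"CHORD_{chord}"]
    for note in sorted(notes, key=lambda n: (n.start, n.stem)):
        tokens += [
            f"STEM_{note.stem}",
            f"POS_{note.start}",
            f"PITCH_{note.pitch}",
            f"DUR_{note.duration}",
        ]
    return tokens

if __name__ == "__main__":
    bar = [Note("bass", 45, 0.0, 1.0), Note("piano", 64, 0.0, 0.5)]
    print(tokenize_bar(chord="Am", section="verse", notes=bar))
    # ['BAR', 'SECTION_verse', 'CHORD_Am', 'STEM_bass', 'POS_0.0', ...]
```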

The model-generated token sequences contain notes for 5 stems (vocal, piano, guitar, bass, drums) and are converted into MIDI format. We then render the MIDI into audio using Ableton Live; the vocal track is rendered with a bell sound.
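For readers who want to reproduce the MIDI step, here is a minimal sketch of writing decoded 5-stem notes to a multi-track MIDI file with the pretty_midi library; the instrument programs below are placeholder choices, and the actual audio rendering described above is done in Ableton Live.

```python
# Minimal sketch: write decoded 5-stem notes to a multi-track MIDI file.
# Instrument programs are illustrative placeholders, not the paper's choices.
import pretty_midi

STEMS = {
    "vocal":  (14, False),  # tubular bells stand in for the vocal line
    "piano":  (0, False),
    "guitar": (25, False),
    "bass":   (33, False),
    "drum":   (0, True),    # drum track uses the GM drum channel
}

def notes_to_midi(decoded_notes, path="generated.mid"):
    """decoded_notes: list of (stem, pitch, start_sec, end_sec) tuples."""
    pm = pretty_midi.PrettyMIDI()
    tracks = {}
    for stem, (program, is_drum) in STEMS.items():
        tracks[stem] = pretty_midi.Instrument(program=program, is_drum=is_drum, name=stem)
        pm.instruments.append(tracks[stem])
    for stem, pitch, start, end in decoded_notes:
        tracks[stem].notes.append(
            pretty_midi.Note(velocity=90, pitch=pitch, start=start, end=end)
        )
    pm.write(path)

notes_to_midi([("piano", 60, 0.0, 0.5), ("bass", 36, 0.0, 1.0)])
```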

Some characteristics we think are interesting:

  • The model is able to maintain the theme over the entire song, showing a strong ability to model long-term dependencies.
  • Some generated pieces feature a key shift near the end (the first and second demos below).

Our model can also take control inputs, such as a chord progression and song structure. Below is a piece generated with an uncommon chord progression (looping through the major keys A-B-C-D-E-F-G-A).

Although the input chord progression is quite uncommon, the model still manages to find a balance between following the input and producing a reasonable melody, which we find surprising.
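For illustration, here is one way such control inputs could be encoded as a prompt prefix for the language model; the token names and the generate() call are hypothetical and do not reflect the model's actual interface.

```python
# Hedged sketch: prepend chord-progression and structure control tokens
# as a prompt prefix before autoregressive generation.

def build_control_prompt(chords, sections):
    """Interleave per-bar section and chord control tokens as a prefix."""
    prompt = ["<BOS>"]
    for section, chord in zip(sections, chords):
        prompt += [f"SECTION_{section}", f"CHORD_{chord}"]
    prompt.append("<START_NOTES>")
    return prompt

# Looping major-key progression from the demo, assumed one chord per bar.
chords = ["A", "B", "C", "D", "E", "F", "G", "A"] * 4
sections = ["verse"] * 16 + ["chorus"] * 16

prompt_tokens = build_control_prompt(chords, sections)
# generated = model.generate(prompt_tokens, max_new_tokens=4096)  # hypothetical call
print(prompt_tokens[:7])
```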