TikTok’s $300 billion-valued parent company, ByteDance, is one of the world’s busiest AI developers. It plans to spend billions of dollars on AI chips this year, while its tech gives Sam Altman’s OpenAI a run for its money.
ByteDance’s Doubao AI chatbot is currently the most popular AI assistant in China, with 78.6 million monthly active users as of January.
This makes it the world’s second most-used AI app behind OpenAI’s ChatGPT (with 349.4 million MAUs). The recently released Doubao-1.5-pro is claimed to match the performance of OpenAI’s GPT-4o at a fraction of the cost.
As Counterpoint Research notes in this breakdown of Doubao’s positioning and functionality, “much like its international rival ChatGPT, the cornerstone of Doubao’s appeal is its multimodality, offering advanced text, image, and audio processing capabilities”.
It can also generate music.
In September, ByteDance added an AI music generation function to the Doubao app, which apparently “supports more than ten types of music styles and allows you to write lyrics and compose music with one click”.
This, though, isn’t the extent of ByteDance’s fascination with building music AI technologies.
On September 18, ByteDance’s Doubao Team announced the launch of a suite of AI music models dubbed Seed-Music.
Seed-Music, they claimed, would “empower people to explore more possibilities in music creation”.
Established in 2023, the ByteDance Doubao (Seed) Team is “dedicated to building industry-leading AI foundation models”.
According to the official launch announcement for Seed-Music in September, the AI music product “supports score-to-song conversion, controllable generation, music and lyrics editing, and low-threshold voice cloning”.
It also claims that “it cleverly combines the strengths of language models and diffusion models and integrates them into the music composition workflow, making it suitable for different music creation scenarios for both beginners and professionals”.
The official Seed-Music website contains a number of audio clips that demonstrate what it can do.
You can hear some of that below:
More important, though, is how Seed-Music was built.
Luckily, the Doubao Team has published a tech report that explains the inner workings of its Seed-Music project.
MBW has read it cover to cover.
In the introduction to ByteDance’s research paper, which you can read in full here, the company’s researchers state that, “music is deeply embedded in human culture” and that “throughout human history, vocal music has accompanied key moments in life and society: from love calls to seasonal harvests”.
“Our goal is to leverage modern generative modeling technologies, not to replace human creativity, but to lower the barriers to music creation.”
ByteDance research paper for Seed-Music
The intro continues: “Today, vocal music remains central to global culture. However, creating vocal music is a complex, multi-stage process involving pre-production, writing, recording, editing, mixing, and mastering, making it challenging for most people.”
“Our goal is to leverage modern generative modeling technologies, not to replace human creativity, but to lower the barriers to music creation. By offering interactive creation and editing tools, we aim to empower both novices and professionals to engage at different stages of the music production process.”
How Seed-Music works
ByteDance’s researchers explain that the “unified framework” behind Seed-Music “is built upon three fundamental representations: audio tokens, symbolic tokens, and vocoder latents”, each of which corresponds to “a generation pipeline”.
The audio token-based pipeline, as illustrated in the chart below, works like this: “(1) Input embedders convert multi-modal controlling inputs, such as music style description, lyrics, reference audio, or music scores, into a prefix embedding sequence. (2) The auto-regressive LM generates a sequence of audio tokens. (3) The diffusion transformer model generates continuous vocoder latents. (4) The acoustic vocoder produces high-quality 44.1kHz stereo audio.”
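To make those four stages a little more concrete, here is a minimal, purely illustrative Python sketch of how such a pipeline could be wired together. The class and method names below are our own placeholders, not ByteDance’s actual code, and each stage is stubbed out rather than implemented.

```python
# Illustrative sketch only: placeholder components standing in for the four
# stages the paper describes (embedders -> auto-regressive LM -> diffusion
# transformer -> acoustic vocoder). None of these are ByteDance's real APIs.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ControlInputs:
    style_description: str            # e.g. "upbeat synth-pop"
    lyrics: str                       # lyrics to be sung
    reference_audio: Optional[bytes]  # optional style/voice reference
    music_score: Optional[str]        # optional symbolic score


class InputEmbedder:
    def embed(self, inputs: ControlInputs) -> List[float]:
        # (1) Convert multi-modal controls into a prefix embedding sequence.
        return [0.0] * 512  # dummy embedding


class AudioTokenLM:
    def generate(self, prefix: List[float]) -> List[int]:
        # (2) Auto-regressively predict a sequence of discrete audio tokens.
        return list(range(1024))  # dummy token ids


class DiffusionTransformer:
    def denoise(self, audio_tokens: List[int]) -> List[float]:
        # (3) Map the discrete audio tokens to continuous vocoder latents.
        return [0.0] * 2048  # dummy latents


class AcousticVocoder:
    def synthesize(self, latents: List[float]) -> bytes:
        # (4) Render 44.1kHz stereo audio from the latents.
        return b"\x00" * (44100 * 2 * 2)  # one second of silent 16-bit stereo


def generate_song(inputs: ControlInputs) -> bytes:
    prefix = InputEmbedder().embed(inputs)
    tokens = AudioTokenLM().generate(prefix)
    latents = DiffusionTransformer().denoise(tokens)
    return AcousticVocoder().synthesize(latents)


audio = generate_song(ControlInputs("upbeat synth-pop", "La la la", None, None))
print(f"{len(audio)} bytes of (placeholder) audio generated")
```

The key point is the hand-off: discrete audio tokens from the language model are converted into continuous latents before the vocoder renders the final 44.1kHz stereo audio.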
In contrast to the audio token-based pipeline, the symbolic token-based Generator, which you can see in the chart below, is “designed to predict symbolic tokens for better interpretability”, which the researchers state is “crucial for addressing musicians’ workflows in Seed-Music”.
According to the research paper, “Symbolic representations, such as MIDI, ABC notation and MusicXML, are discrete and can be easily tokenized into a format compatible with LMs”.
ByteDance’s researchers add in the paper: “Unlike audio tokens, symbolic representations are interpretable, allowing creators to read and modify them directly. However, their lack of acoustic details means the system has to rely heavily on the Renderer’s ability to generate nuanced acoustic characteristics for musical performance. Training such a Renderer requires large-scale datasets of paired audio and symbolic transcriptions, which are especially scarce for vocal music.”
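To see why symbolic formats slot so naturally into language models, here is a toy example (our own simplification, not the lead sheet tokens described in the paper) that splits a fragment of ABC notation into discrete tokens an LM could consume directly.

```python
import re

# Toy tokenizer for a fragment of ABC notation (our own simplification, not
# the scheme used in the Seed-Music paper). The point is that symbolic music
# is already discrete, so it maps cleanly onto language-model tokens.
abc_fragment = "X:1\nT:Example\nK:C\nC D E F | G A B c | c2 G2 E2 C2 |]"

# Split header fields, bar lines and individual note/duration symbols into
# discrete tokens.
tokens = re.findall(r"[A-Za-z]:[^\n]*|\|\]|\||[A-Ga-gz][,']*\d*", abc_fragment)
print(tokens)
# ['X:1', 'T:Example', 'K:C', 'C', 'D', 'E', 'F', '|', 'G', 'A', 'B', 'c',
#  '|', 'c2', 'G2', 'E2', 'C2', '|]']
```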
The obvious question…
By now, you’re probably asking where music by The Beatles and Michael Jackson comes into all of this.
We’re nearly there. First, we need to talk about MIR models.
According to the Seed-Music research paper, “to extract the symbolic features from audio for training the above system,” the team behind the tech used various “in-house Music Information Retrieval (MIR) models”.
According to this very clear explanation over at Dataloop, MIR “is a subcategory of AI models that focuses on extracting meaningful information from music data, such as audio signals, lyrics, and metadata”.
Aka: it’s an automated song analyzer. Stick a song into the jaws of a MIR model, and it will analyze, predict and present data that might include pitch, beats-per-minute (BPM), lyrics, chords, and more.
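For a feel of what this looks like in practice, here is a short example using the open-source librosa library (which has nothing to do with ByteDance’s in-house models) to estimate tempo and chord-related chroma features from a local audio file.

```python
# Example using the open-source librosa library to pull MIR-style features
# from an audio file. This only illustrates the general idea of MIR;
# ByteDance's in-house models are not public.
import librosa

y, sr = librosa.load("song.wav")  # any local audio file

# Estimate tempo (BPM) and beat positions.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

# Chroma features describe pitch-class energy over time: the raw material
# for chord and key estimation.
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)

print(f"Estimated tempo: {float(tempo):.1f} BPM")
print(f"First beats (seconds): {beat_times[:4]}")
print(f"Chroma matrix shape (12 pitch classes x frames): {chroma.shape}")
```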
Music Information Retrieval research first gained popularity for its ability to help with the digital classification of genres, moods, tempos, etc. – key building blocks for the recommendation systems used by music streaming services.
Now, though, leading generative AI music platforms are reportedly using MIR research to improve their product output.
Can you see where this is going? Yes, of course.
ByteDance’s research team has built its own in-house MIR models, which it has used to “extract the symbolic features from audio” for parts of its Seed-Music system. Among those MIR models is one dedicated to structural analysis.
AI, are you okay? Are you okay, AI?
Taking a deeper dive into the research published by ByteDance for its Structural analysis-focused MIR model, we find a research paper titled:
‘To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions’.
It was published in 2022. You can read it here.
According to the paper: “Conventional music structure analysis algorithms aim to divide a song into segments and to group them with abstract labels (e.g., ‘A’, ‘B’, and ‘C’).
“However, explicitly identifying the function of each segment (e.g., ‘verse’ or ‘chorus’) is rarely attempted, but has many applications”.
In this research paper, they “introduce a multi-task deep learning framework to model these structural semantic labels directly from audio by estimating ‘verseness,’ ‘chorusness,’ and so forth, as a function of time”.
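To illustrate the general idea of estimating ‘verseness’ and ‘chorusness’ as a function of time, here is a minimal PyTorch sketch of a frame-wise structure tagger. It is our own toy illustration, under assumed feature dimensions and an assumed label set, not the multi-task architecture from the ByteDance paper.

```python
# Minimal sketch of frame-wise structural function estimation ("verseness",
# "chorusness", etc. over time). A toy illustration, not the ByteDance model.
import torch
import torch.nn as nn

SECTION_LABELS = ["intro", "verse", "chorus", "bridge", "outro"]  # assumed set

class StructureTagger(nn.Module):
    def __init__(self, n_features=128, hidden=64, n_labels=len(SECTION_LABELS)):
        super().__init__()
        # A bidirectional GRU reads a sequence of per-frame audio features...
        self.encoder = nn.GRU(n_features, hidden, batch_first=True,
                              bidirectional=True)
        # ...and a linear head scores each structural function per frame.
        self.head = nn.Linear(2 * hidden, n_labels)

    def forward(self, features):  # features: (batch, frames, n_features)
        encoded, _ = self.encoder(features)
        return torch.sigmoid(self.head(encoded))  # (batch, frames, n_labels)

# Fake input: one 3-minute song at ~2 frames per second, 128-dim features.
features = torch.randn(1, 360, 128)
scores = StructureTagger()(features)
print(scores.shape)                                   # torch.Size([1, 360, 5])
print(SECTION_LABELS[scores[0, 0].argmax().item()])   # most likely label, frame 0
```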
To conduct this research, the ByteDance team used four “public datasets”, including one called the ‘Isophonics’ dataset, which, it notes, “contains 277 songs from The Beatles, Carole King, Michael Jackson, and Queen.”
The source of the Isophonics dataset used by ByteDance’s researchers appears to be Isophonics.net, described as the home for software and data resources from the Centre for Digital Music (C4DM) at Queen Mary, University of London.
The Isophonics website notes that its “chord, onset, and segmentation annotations have been used by many researchers in the MIR community.”
The website explains that “the annotations published here fall into four categories: chords, keys, structural segmentations, and beats/bars”.
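Structural segmentation annotations of this kind are typically distributed as plain-text files listing a start time, end time, and label for each segment. Below is a small illustrative parser for that general layout; exact file formats vary, so treat it as a sketch rather than a definitive reader for the Isophonics files, and the timings in the example are made up.

```python
# Illustrative parser for a simple annotation layout common in MIR research:
# one line per segment with a start time, end time and label. This is a
# sketch of the general format, not a definitive Isophonics file reader.
from typing import List, Tuple

def parse_segments(text: str) -> List[Tuple[float, float, str]]:
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end, label = line.split(maxsplit=2)
        segments.append((float(start), float(end), label))
    return segments

example = """0.000 12.571 intro
12.571 41.336 verse
41.336 66.104 chorus"""  # made-up timings for illustration

for start, end, label in parse_segments(example):
    print(f"{label:>7}: {start:7.3f}s -> {end:7.3f}s")
```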
In 2022, ByteDance’s researchers published a video presentation of their ‘To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions’ paper for the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
You can see this presentation below.
The video’s caption describes a “novel system/method that segments a song into sections such as chorus, verse, intro, outro, bridge, etc”.
The presentation demonstrates the team’s findings using songs by The Beatles, Michael Jackson, Avril Lavigne and other artists:
We must be careful here about any suggestion that ByteDance’s AI music-generating technology may have been “trained” using songs by popular artists like The Beatles or Michael Jackson.
Yet, as you can see, a dataset containing annotations of such songs has clearly been used as a part of a ByteDance research project in this field.
Any analysis or reference to popular songs and their annotations in research conducted or funded by a multi-billion-dollar technology company will surely raise a number of questions for the music industry – especially for those employed to protect its copyrights.
“We firmly believe that AI technologies should support, not disrupt, the livelihoods of musicians and artists. AI should serve as a tool for artistic expression, as true art always stems from human intention.”
ByteDance’s Seed-Music researchers
There is a section dedicated to Ethics and Safety at the bottom of ByteDance’s Seed-Music research paper.
According to ByteDance’s researchers, they “firmly believe that AI technologies should support, not disrupt, the livelihoods of musicians and artists”.
They add: “AI should serve as a tool for artistic expression, as true art always stems from human intention. Our goal is to present this technology as an opportunity to advance the music industry by lowering barriers to entry, offering smarter, faster editing tools, generating new and exciting sounds, and opening up new possibilities for artistic exploration.”
The ByteDance researchers also outline ethical issues specifically: “We recognize that AI tools are inherently prone to bias, and our goal is to provide a tool that stays neutral and benefits everyone. To achieve this, we aim to offer a wide range of control elements that help minimize preexisting biases.
“By returning artistic choices to users, we believe we can promote equality, preserve creativity, and enhance the value of their work. With these priorities in mind, we hope our breakthroughs in lead sheet tokens highlight our commitment to empowering musicians and fostering human creativity through AI.”
In terms of Safety / ‘deepfake’ concerns, the researchers explain that, “in the case of vocal music, we recognize how the singing voice evokes one of the strongest expressions of individual identity”.
They add: “To safeguard against the misuse of this technology in impersonating others, we adopt a process similar to the safety measures laid out in Seed-TTS. This involves a multistep verification method for spoken content and voice to ensure the enrollment of audio tokens contains only the voice of authorized users.
“We also implement a multi-level watermarking scheme and duplication checks across the generative process. Modern systems for music generation may fundamentally reshape culture and the relationship between artistic creation and consumption.
“We are confident that, with strong consensus between stakeholders, these technologies will revolutionize the music creation workflow and benefit music novices, professionals, and listeners alike.”

Music Business Worldwide