Playing around with Google MusicLM – How good is text-to-music?

Image credit: Google MusicLM

2022 was the year cryptocurrency imploded and 2023 is the year artificial intelligence exploded. Large language models, neural networks, and machine learning have propelled the field of AI forward at warp speed. You can now use AI to create your album art – and maybe even your music.

Google has opened up its AI Test Kitchen for those interested in trying out the new technology. After signing up, I was immediately granted approval, so I took the weekend to play around with the tool and see how well it could generate different genres of music, create musical fusions, or serve a specific purpose. In short: MusicLM is highly proficient in some genres and less fluent in others. Let’s take a look at how the technology works before delving deeper into what it can do.

How do text-to-media AIs work?

Text-to-media generative AI models are built on neural networks, which are essentially vast webs of associations built from metadata. Anything tagged with metadata can be fed into the neural network to help it understand the meaning of descriptive human words and concepts.

You can “teach” a neural network what a ball is, then further refine that concept by teaching it to distinguish a “blue” ball from a “red” ball. These modifiers rely on metadata: the model matches phrases against its learned associations, running thousands of calculations per second to arrive at the final result. The generators themselves are diffusion models, trained on hundreds of millions of images or pieces of music depending on the target medium. The network can infer conceptual relationships between elements and so produce a piece of music with a specific feel.

When you teach a neural network the connections between words, concepts, and descriptions, you create an AI model capable of generating new text, code, images, and now music. Diffusion models generate everything from scratch, starting from noise, guided by the network’s understanding of the concept they are asked to produce.
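To make that idea concrete, here is a minimal, purely illustrative sketch of text-conditioned diffusion sampling in Python. The embed_prompt and denoiser functions are stand-ins for the large trained networks a system like MusicLM would use; nothing here reflects Google’s actual implementation.

```python
import numpy as np

SAMPLES = 64   # length of our toy "audio" vector
DIM = 64       # size of the toy prompt embedding (kept equal to SAMPLES for simplicity)

def embed_prompt(prompt: str) -> np.ndarray:
    """Stand-in text encoder: deterministically hashes the prompt into a vector."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(DIM)

def denoiser(x: np.ndarray, t: int, cond: np.ndarray) -> np.ndarray:
    """Placeholder for a trained network that predicts the noise present in x,
    given the timestep t and the prompt embedding cond."""
    return 0.5 * x - 0.05 * cond  # a real model learns this mapping from data

def generate(prompt: str, steps: int = 50) -> np.ndarray:
    """Start from pure noise and iteratively denoise, guided by the prompt."""
    cond = embed_prompt(prompt)
    x = np.random.default_rng(0).standard_normal(SAMPLES)  # pure noise
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t, cond)
        x = x - (1.0 / steps) * predicted_noise  # remove a little predicted noise each step
    return x  # a real system would decode this into an audio waveform

clip = generate("an upbeat 32-bit chiptune loop")
print(clip.shape)
```

The key point is the loop: the model never copies an existing track, it repeatedly subtracts predicted noise until something matching the prompt’s learned concept emerges.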

Can you generate viral mashups with MusicLM?

No. Google has guardrailed the MusicLM generator to prevent it from creating viral mashups like “Heart On My Sleeve”. If you request music that even sounds or feels like a copyrighted artist, track, or band, the AI will refuse the task. It only generates a 19-second clip and gives you two options to choose from. However, it is very good at following instructions, and if you are good at describing what you want, you can get the ‘Kirkland’-branded version of what you are looking for. Let me give you an example of what I mean.

Currently MusicLM creates electronic music, synthwave, and chiptunes better than any other genre. To test its willingness to imitate a copyrighted work, I challenged MusicLM to “create a song that sounds like it could be taken from the Sonic 3 soundtrack.” Since Sonic 3 is copyrighted property, the AI informed me that this was not possible. Fair enough.

But I grew up with Sonic. Let’s see if I can describe to the AI the essence of what Sonic sounds like and get something that “sounds” like it belongs in a Sonic game without ever telling the AI what it’s recreating. This approach is called jailbreaking in the AI community, and it’s a way of getting a result the developers didn’t intend.
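To picture why naming the game triggers a refusal while describing its vibe does not, here is a deliberately naive sketch of a keyword-based prompt filter. Google has not published how MusicLM’s guardrails actually work, so treat the blocked-terms list and the whole approach as a hypothetical illustration only.

```python
# Hypothetical guardrail: reject prompts that name protected artists, tracks, or franchises.
BLOCKED_TERMS = {"sonic 3", "drake", "the weeknd", "heart on my sleeve"}  # made-up list

def is_allowed(prompt: str) -> bool:
    """Return False if the prompt mentions a blocked name, True otherwise."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

print(is_allowed("a song that sounds like the Sonic 3 soundtrack"))        # False – refused
print(is_allowed("a looping, upbeat 32-bit chiptune with a wistful feel"))  # True – generated
```

A descriptive prompt carries none of the blocked names, which is exactly the gap the jailbreak below exploits.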

My Jailbreak Sonic prompt:

“Create a looping song with a punchy sound using 32-bit chiptunes that are upbeat and fast-paced. The music should sound flowing and inviting while at the same time creating a wistful atmosphere.”

Apart from a rough second or so that could be cleaned up in editing, the track loops well. It creates the upbeat, happy vibe you’d expect from a platformer about moving quickly through levels while collecting rings. It’s pretty good – a bit like having La Croix instead of San Pellegrino, if I’m being honest. Not ideal, but useful in a pinch.

How well does MusicLM create live music?

Generating chiptune music with machine learning is one thing, but what about live music? Suppose I’m building a scene in a video game where the protagonist has to walk through a crowded bar while an incredible band plays in the background. We have a concrete scene we want to score. Let’s see if we can generate it.

My live music prompt:

“Recreate the sound of walking through a noisy pub while a grunge band plays music on stage with drums, electric guitar, bass guitar and an aggressive rhythm.”

The AI model does manage to sound like a live recording here, even if the music doesn’t really match our genre specification. The individual sounds are there, but what we hear isn’t entirely convincing to the listener. Could you cut it down and use it as background or ambient music in an adventure game? Definitely a possibility.

How about creating regional sounds?

This is one area where the AI model manages to nail a specific sound, thanks to metadata. Nothing illustrates this better than the samples I created when I asked the model to recreate the sound of Memphis rap. Memphis rap is characterized by a heavy bass line with sharply delivered rhymes falling in time with the beat. MusicLM understands the “sound” of Memphis rap very well.

My request:

“Create a catchy Memphis hip-hop beat with lots of bass and an aggressively catchy rhythm fused with Atlanta rap.”

Google’s model gets smarter over time as it gathers and asks for feedback from early adopters like me, mechanical-turking our way to better-sounding diffusion-trained neural networks. Every time you prompt the MusicLM AI, it gives you two clips and asks you to rank which one better answers the prompt.

This feedback helps the neural network learn whether its output matches the requested concept. It’s also one of the reasons AI training is happening at lightning speed – with so many people generating metadata, the model gets better almost overnight. These few examples illustrate the current capabilities of Google’s MusicLM, capabilities that will evolve drastically in the coming months.
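For the curious, here is a minimal sketch of what collecting that kind of pairwise feedback could look like in code. The field names and file format are my own assumptions for illustration; Google has not documented how AI Test Kitchen stores these rankings.

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class PreferenceRecord:
    prompt: str        # the text prompt the user submitted
    clip_a_id: str     # identifier of the first generated clip
    clip_b_id: str     # identifier of the second generated clip
    preferred: str     # "a" or "b", whichever the user ranked higher
    timestamp: float   # when the feedback was given

def log_preference(prompt: str, clip_a_id: str, clip_b_id: str, preferred: str,
                   path: str = "preferences.jsonl") -> None:
    """Append one preference judgment to a JSON-lines file for later training."""
    record = PreferenceRecord(prompt, clip_a_id, clip_b_id, preferred, time.time())
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_preference("catchy Memphis hip-hop beat", "clip_001", "clip_002", preferred="a")
```

However it is actually stored, each ranking is one more labeled data point telling the model which of two outputs better matched a human’s intent.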