Mike Sugarbaker

Why AI audio is a different ballgame

(Sorry I'm late.)

In case you missed it, the latest thing we're calling AI (or Machine Learning or whatever) is ethically very problematic! A legal case has been brought by a few professional illustrators who have had their hard-won, marketable styles straight-up ganked by text-to-image generators. The second most discoursed-about medium for ML generation, text, gets criticized less on intellectual-property grounds than for its tendency to present statistically likely nonsense as authoritative-sounding truth. Still, I worry about its contribution to AI's general reputation for ripping off artists and creators, most of whom weren't even given the opportunity to opt out.

Why would I be worried personally? Because I'm enchanted by the sounds of Dance Diffusion, which applies the diffusion approach behind Stable Diffusion to the generation of audio. Like its image-creating sibling, Dance Diffusion starts either from noise - fully random data - or from some starting data with noise partially blended in. It then "denoises" the data toward what its model says is likely. As with image generators, giving the model some starting data can result in what's known as "style transfer" - rendering the preexisting content in a style closer to the model's training data, while keeping the fundamentals of the starting point intact. This is, for the most part, how I've been using DD - to do things like ask a model trained on nothing but drum solos to transform a clip of a rap vocal, and so on. (Models that do text-to-sound generation, that is, creation of audio from written prompts, are beginning to emerge as of this writing but are still fairly limited.)
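For the code-minded, here is a toy sketch of that noise-then-denoise loop. To be loud about it: this is not Dance Diffusion's code or API. Every name in it (`add_noise`, `toy_denoise_step`, `toy_style_transfer`, `pull`, `total_steps`) is made up for illustration, and the "model" is just a fixed waveform that each step pulls toward, where a real diffusion model would use a trained network to predict and remove noise.

```python
import random


def add_noise(signal, noise_level, rng):
    """Blend a signal with Gaussian noise. noise_level=1.0 means pure noise."""
    return [(1.0 - noise_level) * s + noise_level * rng.gauss(0.0, 1.0)
            for s in signal]


def toy_denoise_step(signal, model_target, pull=0.2):
    """Hypothetical stand-in for a trained model: move each sample a
    fraction of the way toward what the "model" considers likely."""
    return [s + pull * (t - s) for s, t in zip(signal, model_target)]


def toy_style_transfer(start, model_target, noise_level, total_steps=20, seed=0):
    """More noise means more denoising steps, so less of `start` survives."""
    rng = random.Random(seed)
    x = add_noise(start, noise_level, rng)
    for _ in range(int(noise_level * total_steps)):
        x = toy_denoise_step(x, model_target)
    return x


hummed_line = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]   # the starting data
model_sound = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # what the "model" likes

untouched = toy_style_transfer(hummed_line, model_sound, noise_level=0.0)
blended = toy_style_transfer(hummed_line, model_sound, noise_level=0.5)
from_noise = toy_style_transfer(hummed_line, model_sound, noise_level=1.0)
```

With a noise level of 0.0 the hummed line comes back unchanged; at 1.0 the result lands very close to the model's sound with almost nothing of the starting point left. The style-transfer zone is everything in between, where some of the starting shape survives the pull toward the model.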

I have mostly done my style transfers with pre-made models trained on relatively small sets of recordings. My favorite results have come from a model called "unlocked-250k," which gets its first name from its training data, the Unlocked Recordings collection on the Internet Archive. This collection, despite what one might assume from its presence on IA for free download, is mostly under copyright, and not under any sort of unusually permissive license. (None of it is "in print" or otherwise commercially available.) So why are these recordings here, in this model that's in the default model selection for this tool? How does it keep not occurring to nerds that this is a problem?

But here's the thing: when it comes to music, we already have the tools to deal with this situation. They're called ASCAP and BMI. In fact, these tools were created in response to a nearly identical problem: technological changes to the way musicians' IP is distributed, which made said distribution much more indirect and, uh, diffuse.

I doubt these institutions are perfect - I'm certainly not at a point in my music career where I need to know much about them. (Also I'm completely eliding the issue of voice actors, for whom style transfer is already becoming a threat - but they do have a union!) But I bet they could be talked into handling artists' contributions to the statistical probability of a piece of music's direction and form as it evolves out of noise. Those contributions are individually small, but we generally know exactly which group of artists ought to be getting paid for them, and for which recordings. And in the case of very composed starting data, like the hummed basslines and melody fragments I've been making so I can transform them into weird blurts of orchestra, the training artists' songwriting is not actually in the picture in the final product (there's some mechanism that gets royalties to arrangers and producers, but I don't know that it's one of the same ones). So I'd expect that what users of audio AIs end up paying, as royalties or fees, wouldn't be as large as if they had sampled something outright. Everybody wins!

I can think of a number of things to quibble back and forth about in such an arrangement (what about consent? You don't get to opt out of having your song played on the radio; is this similar?), but the point is this sort of problem can be understood and handled, and appropriate institutions can be created to attempt to deal with it. Is it conceivable that visual artists could respond to AI by forming a similar layer of institutions? I doubt they have the muscle, plus maybe there is too much work-for-hire happening in illustration, compared to recorded music, for that approach to make sense. But it all puts me in mind of Elinor Ostrom's work debunking (before it was published??!?!) the so-called tragedy of the commons. Everything stays complicated about human beings working together, but invested people with a commitment to each other can work out creative solutions. That gives us an alternative to the simple, absolute cancellation of an entire, fascinating line of research and activity. I hope we take it, in whatever form.