Nvidia unveils 'Swiss Army Knife' of AI Audio Tools: Fugatto

High-powered computer chip maker Nvidia on Monday unveiled a new AI model developed by its researchers that can generate or transform any mix of music, voices and sounds described with instructions using any combination of text and audio files.

The new AI model, called Fugatto (for Foundational Generative Audio Transformer Opus), can create a piece of music based on a text prompt, add or remove instruments in an existing song, change the accent or emotion in a voice, and even produce sounds that have never been heard before.

According to Nvidia, by supporting numerous audio generation and transformation tasks, Fugatto is the first foundational generative AI model to exhibit emergent properties—capabilities resulting from the interaction of its various trained capabilities—and the ability to combine free-form instructions.

“We wanted to create a model that understands and generates sound the way humans do,” Rafael Valle, a manager of applied sound research at Nvidia, said in a statement.

“Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale,” he added.

Nvidia noted that the model is capable of handling tasks it was not pre-trained on, as well as generating sounds that change over time, such as the Doppler effect of thunder as a rainstorm moves through an area.

The company added that unlike most models, which can only recreate the training data they’ve been exposed to, Fugatto allows users to create never-before-seen soundscapes, such as a thunderstorm breaking into dawn with the sound of birds singing.

Breakthrough AI model for audio transformation

“Nvidia’s launch of Fugatto is a significant advance in AI-driven audio technology,” said Kaveh Vahdat, founder and president of RiseOpp, a national CMO services company based in San Francisco.

“Unlike existing models that specialize in specific tasks—such as music composition, voice synthesis, or sound effect generation—Fugatto provides a unified framework capable of handling a diverse array of audio-related functions,” he told TechNewsWorld. “This versatility positions it as a comprehensive tool for sound synthesis and transformation.”

Vahdat explained that Fugatto distinguishes itself by its ability to generate and transform sound based on both text instructions and optional audio input. “This two-input approach allows users to create complex audio outputs that seamlessly blend multiple elements, such as combining a saxophone’s melody with the timbre of a meowing cat,” he said.

Additionally, he continued, Fugatto’s ability to interpolate between instructions allows for nuanced control over attributes such as accent and emotion in voice synthesis, providing a level of customization not commonly found in current AI audio tools.

“Fugatto is an extraordinary step towards AI that can handle multiple modalities simultaneously,” added Benjamin Lee, a professor of engineering at the University of Pennsylvania.

“Using both text and audio input together can produce much more efficient or effective models than using text alone,” he told TechNewsWorld. “The technology is interesting because if you look beyond text alone, it broadens the volumes of training data and the capabilities of generative AI models.”

Nvidia at its best

Mark N. Vena, president and chief analyst at SmartTech Research in Las Vegas, claims that Fugatto represents Nvidia at its best.

“The technology introduces advanced capabilities in AI audio processing by enabling the transformation of existing audio into entirely new forms,” he told TechNewsWorld. “These include converting a piano melody into a human vocal line or changing the accent and emotional tone of spoken words, offering unprecedented flexibility in sound manipulation.”

“Unlike existing AI audio tools, Fugatto can generate new sounds from text descriptions, such as making a trumpet sound like a barking dog,” he said. “These features provide creators in music, film and games with innovative tools for sound design and sound editing.”

Fugatto handles audio holistically, covering sound effects, music, voice and virtually any type of sound, including sounds never heard before, and it does so precisely, said Ross Rubin, the principal analyst with Reticle Research, a consumer technology consulting firm in New York City.

He mentioned the example of Suno, a service that uses AI to generate songs. “They just released a new version that has improvements in how generated human voices sound and other things, but it doesn’t allow the kind of precise, creative changes that Fugatto allows, like adding new instruments to a mix, changing moods from happy to sad, or moving a song from a minor to a major key,” he told TechNewsWorld.

“Its understanding of the world of sound and the flexibility it offers goes beyond the task-specific engines we’ve seen for things like generating a human voice or generating a song,” he said.

Open door for creatives

Vahdat pointed out that Fugatto could be useful in both advertising and language learning. Agencies can create custom audio content that matches brand identities, including voiceovers with specific accents or emotional tones, he noted.

At the same time, in language learning, educational platforms will be able to develop personalized audio material, such as dialogues in various accents or emotional contexts, to help with language acquisition.

“Fugatto technology opens doors to a wide variety of applications in creative industries,” Vena maintained. “Filmmakers and game developers can use it to create unique soundscapes, such as turning everyday sounds into fantastical or immersive effects,” he said. “It also holds potential for personalized audio experiences in virtual reality, assistive technologies and education, tailoring sounds to specific emotional tones or user preferences.”

“In music production,” he added, “it can transform instruments or vocal styles to explore innovative compositions.”

However, further development may be necessary to get better musical results. “All these results are trivial, and some have been around longer – and better,” observed Dennis Bathory-Kitsz, a musician and composer in Northfield Falls, Vt.

“The voice isolation was clunky and unmusical,” he told TechNewsWorld. “The additional instruments were also trivial, and most of the transformations were colorless. The only advantage is that it requires no specific learning, so the development of musicality for the AI user will be minimal.

“It could usher in some new uses – real musicians are already wonderfully inventive – but unless the developers have better musical chops to begin with, the results will be dismal,” he said. “They will be musical slop to join the visual and verbal slop of AI.”

AGI stand-in

With artificial general intelligence (AGI) still very much in the future, Fugatto could be a model for simulating AGI, which ultimately aims to replicate or surpass human cognitive abilities across a wide range of tasks.

“Fugatto is part of a solution that uses generative AI in a collaborative bundle with other AI tools to create an AGI-like solution,” explained Rob Enderle, president and principal analyst at the Enderle Group, a consulting services firm in Bend, Ore.

“Until we get AGI working,” he told TechNewsWorld, “this approach will be the dominant way to create more complete AI projects with much higher quality and interest.”
