AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

Adobe Research¹, Carnegie Mellon University²

TL;DR

One model for editing, understanding, and generating audio stories

✏️
Open-ended instruction following
"Make the door opening sound more dramatic, and make it faster."
Source
Edited
🎧
Multi-speaker scene analysis
Fine-grained captioning of complex audio scenes with multiple speakers and sound effects
🎬
Generate from text prompts
"Create an audio scene for approaching an old castle door."
Generated

Abstract

Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can contain multiple speakers as well as background and foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding and generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying on distribution-based scoring.
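To make the simulated-dialogue idea concrete, here is a minimal sketch of how one multi-turn exchange between a simulated user and a tool-calling system agent might be stored and turned into training examples. It is an illustration under stated assumptions, not the AudioChat data format: the class names, the tool names (add_sound, edit_sound), the audio reference ids, and the to_training_examples helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    # Hypothetical tool invocation emitted by the simulated system agent.
    name: str        # illustrative tool name, e.g. "add_sound"
    arguments: dict  # illustrative arguments for that tool


@dataclass
class Turn:
    role: str                        # "user" or "system"
    text: str                        # instruction, or the system's structured reasoning
    tool_calls: List[ToolCall] = field(default_factory=list)
    audio_ref: Optional[str] = None  # placeholder id for audio produced at this turn


def to_training_examples(dialogue: List[Turn]) -> List[dict]:
    """Pair each system turn with the full dialogue history that precedes it."""
    examples = []
    for i, turn in enumerate(dialogue):
        if turn.role != "system":
            continue
        examples.append({
            "context": [(t.role, t.text) for t in dialogue[:i]],
            "target_text": turn.text,
            "target_tools": [call.name for call in turn.tool_calls],
            "target_audio": turn.audio_ref,
        })
    return examples


# A toy two-round dialogue built from instructions shown on this page.
dialogue = [
    Turn("user", "Create an audio scene for approaching an old castle door."),
    Turn(
        "system",
        "Plan: footsteps on stone, then a heavy door creaking open.",
        [
            ToolCall("add_sound", {"description": "footsteps on stone"}),
            ToolCall("add_sound", {"description": "old door creaking open"}),
        ],
        audio_ref="scene_v1",
    ),
    Turn("user", "Make the door opening sound more dramatic, and make it faster."),
    Turn(
        "system",
        "Plan: regenerate the door sound with a shorter duration.",
        [ToolCall("edit_sound", {"target": "door", "style": "more dramatic"})],
        audio_ref="scene_v2",
    ),
]

examples = to_training_examples(dialogue)
```

Each system turn becomes one example whose context is the whole preceding conversation, which is one simple way to expose a model to multi-turn generation and editing in the same record.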


Left: Interface with AudioChat's structured reasoning. Right: Architectural overview of AudioChat.

Open-ended Audio Editing

We recommend using headphones for an optimal listening experience

Changing the texture of a sound

Instruction Source Generated
Make the door opening sound more dramatic, and make it faster.
Make the generator sound more unstable and older
Let's make the raven caw sound more menacing, and also add the sound of a sword being unsheathed.
Change the axe sound to a chainsaw.
make the spaceship door sound more dramatic
make the waves more dramatic and violent
Can you make the creaking wood sound older and more distressed?
I think the footsteps are a little too heavy. Can we make them lighter and softer?
make the wind less pronounced, and the music box more mysterious
make the tree sound more ancient, less creaky.
Make the gavel sound more chaotic, and more frequent.
Let's make the raven sound more ominous. More drawn out, lower pitched.
Make the creaking floorboards more dramatic and add a slight echo.
Make the lute a harp instead.
make the footsteps more determined and less gravelly

World Knowledge

The model's world knowledge lets it infer appropriate timing and work out what indirect instructions refer to

Instruction Source Generated
Add sounds of applause
Add a subtle musical swell as she enters
Add sounds of someone fidgeting.
make the henchman's voice quieter
Add a strained grunt from the husband

Adding a sound

Instruction Source Generated
Add footsteps on stone
Add a dying scream.

Removing a sound

Instruction Source Generated
Remove the carriage sound.
remove the harp arpeggios.
remove the female speaker

Changing the volume of a sound

Instruction Source Generated
Lower the loudness of the interrogation room hum
make the piano music slightly louder
lower the volume of the second voice

Moving a sound

We recommend using headphones for an optimal listening experience

Instruction Source Generated
Move the church bell slightly left.
Move the first speaker more left.
move the first speakers voice more left

Multi-Operation Instructions

Instruction Source Generated
I love the overall feel. Can we make the chair creak a little more prominently, and maybe add a second creak a bit later on?
Make the clock ticking quieter and move it to the left.

Distance Aware Editing

We recommend using headphones for an optimal listening experience

Instruction Source Generated
Add the sound of horses approaching
Make the crying child sound more faint and distant.

Audio Storytelling

Audio stories generated from scratch, based on high-level text instructions

Instruction Generated
Create an audio scene for approaching an old castle door.
Create a soundscape for a dimly lit study, focusing on a sense of focused thought.
Let's create a soundscape for a bustling medieval marketplace.
I need to build a soundscape for a scene where a storm is building
I need sound effects for a mysterious magical entrance.
Create the sound of a gathering storm and a forming whirlwind
I want to create an immersive soundscape of a hidden, overgrown temple in a tropical rainforest.
I want to create a soundscape of a lonely lighthouse on a rocky coast during a storm.
A small, cozy room. Someone is speaking to another person in hushed tones.
Create a scene in a medieval castle, a knight preparing for battle.
A scene where a man is secretly pleading with someone in a dark alleyway
A dramatic scene of a ship sinking in a storm
Create a soundscape for an old garden, tranquil and welcoming.
She gathered up her reins
I want a soundscape of a small raft on a dark lake, being circled by large reptiles
I want to create a soundscape for a cozy evening scene. Think warm, inviting, and a little bit melancholic.
Create the soundscape for an old wizard's study.
Create a soundscape of a hawk catching a final raindrop.
Create a soundscape of a forest at dusk.
Create the ambience of a Victorian tea room, quiet and refined.
A calm lake at dusk
I need sounds for a futuristic city street at night. Lots of neon and flying cars.
A gritty urban alleyway at night
Create a soundscape reflecting this personality
A scene on a ship

Audio Story Understanding

Multi-Speaker Scene

Single Speaker Scene
