AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing
TL;DR
One model for editing, understanding, and generating audio stories
Abstract
Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can contain multiple speakers as well as background and foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective for training the AudioChat model, which allows it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding and generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying on distribution-based scoring.
Open-ended Audio Editing
We recommend using headphones for an optimal listening experience
Changing the texture of a sound
| Instruction | Source | Generated |
|---|---|---|
| Make the door opening sound more dramatic, and make it faster. | ||
| Make the generator sound more unstable and older | ||
| Let's make the raven caw sound more menacing, and also add the sound of a sword being unsheathed. | ||
| Change the axe sound to a chainsaw. | ||
| make the spaceship door sound more dramatic | ||
| make the waves more dramatic and violent | ||
| Can you make the creaking wood sound older and more distressed? | ||
| I think the footsteps are a little too heavy. Can we make them lighter and softer? | ||
| make the wind less pronounced, and the music box more mysterious | ||
| make the tree sound more ancient, less creaky. | ||
| Make the gavel sound more chaotic, and more frequent. | ||
| Let's make the raven sound more ominous. More drawn out, lower pitched. | ||
| Make the creaking floorboards more dramatic and add a slight echo. | ||
| Make the lute a harp instead. | ||
| make the footsteps more determined and less gravelly | ||
World Knowledge
The model's world knowledge lets it infer appropriate timing and work out what indirect instructions refer to.
| Instruction | Source | Generated |
|---|---|---|
| Add sounds of applause | ||
| Add a subtle musical swell as she enters | ||
| Add sounds of someone fidgeting. | ||
| make the henchman's voice quieter | ||
| Add a strained grunt from the husband | ||
Adding a sound
| Instruction | Source | Generated |
|---|---|---|
| Add footsteps on stone | ||
| Add a dying scream. | ||
Removing a sound
| Instruction | Source | Generated |
|---|---|---|
| Remove the carriage sound. | ||
| remove the harp arpeggios. | ||
| remove the female speaker | ||
Changing the volume of a sound
| Instruction | Source | Generated |
|---|---|---|
| Lower the loudness of the interrogation room hum | ||
| make the piano music slightly louder | ||
| lower the volume of the second voice | ||
Moving a sound
We recommend using headphones for an optimal listening experience
| Instruction | Source | Generated |
|---|---|---|
| Move the church bell slightly left. | ||
| Move the first speaker more left. | ||
| move the first speakers voice more left | ||
Multi-Operation Instructions
| Instruction | Source | Generated |
|---|---|---|
| I love the overall feel. Can we make the chair creak a little more prominently, and maybe add a second creak a bit later on? | ||
| Make the clock ticking quieter and move it to the left. | ||
Distance Aware Editing
We recommend using headphones for an optimal listening experience
| Instruction | Source | Generated |
|---|---|---|
| Add the sound of horses approaching | ||
| Make the crying child sound more faint and distant. | ||
Audio Storytelling
Audio stories generated from scratch from high-level text instructions
| Instruction | Generated |
|---|---|
| Create an audio scene for approaching an old castle door. | |
| Create a soundscape for a dimly lit study, focusing on a sense of focused thought. | |
| Let's create a soundscape for a bustling medieval marketplace. | |
| I need to build a soundscape for a scene where a storm is building | |
| I need sound effects for a mysterious magical entrance. | |
| Create the sound of a gathering storm and a forming whirlwind | |
| I want to create an immersive soundscape of a hidden, overgrown temple in a tropical rainforest. | |
| I want to create a soundscape of a lonely lighthouse on a rocky coast during a storm. | |
| A small, cozy room. Someone is speaking to another person in hushed tones. | |
| Create a scene in a medieval castle, a knight preparing for battle. | |
| A scene where a man is secretly pleading with someone in a dark alleyway | |
| A dramatic scene of a ship sinking in a storm | |
| Create a soundscape for an old garden, tranquil and welcoming. | |
| She gathered up her reins | |
| She gathered up her reins | |
| I want a soundscape of a small raft on a dark lake, being circled by large reptiles | |
| I want to create a soundscape for a cozy evening scene. Think warm, inviting, and a little bit melancholic. | |
| Create the soundscape for an old wizard's study. | |
| Create a soundscape of a hawk catching a final raindrop. | |
| Create a soundscape of a forest at dusk. | |
| Create a soundscape of a hawk catching a final raindrop. | |
| Create the ambience of a Victorian tea room, quiet and refined. | |
| A calm lake at dusk | |
| I need sounds for a futuristic city street at night. Lots of neon and flying cars. | |
| A gritty urban alleyway at night | |
| Create a soundscape reflecting this personality | |
| A scene on a ship | |
Audio Story Understanding
Multi-Speaker Scene
Single Speaker Scene