SAO-Instruct: Free-form Audio Editing using
Natural Language Instructions

NeurIPS 2025

Michael Ungersböck¹
Florian Grötschla¹
Luca A. Lanzendörfer¹
June Young Yi²
Changho Choi³
Roger Wattenhofer¹
¹ETH Zurich
²Seoul National University
³Korea University

Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
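
As an illustration of the triplet structure described above, the following is a minimal sketch of how one training example might be represented. The class name, field names, file paths, and the `source` labels are hypothetical and chosen for illustration only; they are not taken from the released code.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical labels for the three data generation routes named in the abstract.
SOURCES = {"prompt_to_prompt", "ddpm_inversion", "manual"}


@dataclass(frozen=True)
class EditTriplet:
    """One training example: (input audio, edit instruction, output audio)."""
    input_audio: Path   # clip before the edit
    instruction: str    # free-form natural language edit instruction
    output_audio: Path  # clip after the edit
    source: str         # which pipeline produced the pair (assumed field)


def validate(triplet: EditTriplet) -> None:
    """Reject obviously malformed triplets before they reach training."""
    if not triplet.instruction.strip():
        raise ValueError("edit instruction must not be empty")
    if triplet.source not in SOURCES:
        raise ValueError(f"unknown source: {triplet.source!r}")
    if triplet.input_audio == triplet.output_audio:
        raise ValueError("input and output audio must be different files")


# Example with made-up paths.
example = EditTriplet(
    input_audio=Path("clips/0001_in.wav"),
    instruction="add distant thunder",
    output_audio=Path("clips/0001_out.wav"),
    source="manual",
)
validate(example)
```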

Audio Editing Diagram

Examples

SAO-Instruct takes an audio clip along with a free-form edit instruction and outputs the edited audio clip.

  • Caption: "A woman gives a speech"
  • Edit Instruction: "it should be in a large concert hall"
  • SAO-Instruct: "A woman gives a speech in a large concert hall"
  • Caption: "Chirping of birds with wind blowing"
  • Edit Instruction: "remove the background noise"
  • SAO-Instruct: "Chirping of birds"
  • Caption: "A car is passing by with leaves rustling"
  • Edit Instruction: "make the car drive on gravel"
  • SAO-Instruct: "A car is passing by on a gravel road with leaves rustling"
  • Caption: "Frying food is sizzling"
  • Edit Instruction: "add someone doing the dishes"
  • SAO-Instruct: "Frying food is sizzling with someone doing the dishes"
  • Caption: "Muffled sounds followed by metal being hit"
  • Edit Instruction: "make it glass instead"
  • SAO-Instruct: "Muffled sounds followed by glass being hit"
  • Caption: "Ocean waves crashing"
  • Edit Instruction: "it should be a windy day"
  • SAO-Instruct: "Ocean waves crashing on a windy day"
  • Caption: "Birds chirp, wind blows and frogs croak"
  • Edit Instruction: "give it a rainy atmosphere"
  • SAO-Instruct: "Birds chirp, wind blows and frogs croak with a rainy atmosphere"
  • Edit Instruction: "there are footsteps approaching"
  • SAO-Instruct: "Birds chirp, wind blows and frogs croak with footsteps approaching"
  • Edit Instruction: "add a small river going by"
  • SAO-Instruct: "Birds chirp, wind blows and frogs croak with a small river going by"
  • Caption: "A helicopter flying in the distance"
  • Edit Instruction: "add distant thunder"
  • SAO-Instruct: "A helicopter flying in the distance with thunder"
  • Edit Instruction: "there should be fireworks"
  • SAO-Instruct: "A helicopter flying in the distance with fireworks"
  • Edit Instruction: "change it to a plane"
  • SAO-Instruct: "A plane flying in the distance"

Long-form Audio Editing

SAO-Instruct can edit up to 47 seconds of audio.
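
To make the 47-second limit concrete, the sketch below converts it into a sample count and splits a longer recording into windows that each fit the limit. The 44.1 kHz sample rate is an assumption that this page does not state, the function names are hypothetical, and editing windows independently is only one possible workaround, not a procedure described by the authors; it may introduce discontinuities at window boundaries.

```python
MAX_SECONDS = 47.0    # limit stated on this page
SAMPLE_RATE = 44_100  # assumption: 44.1 kHz audio, not stated on this page


def max_samples(sample_rate: int = SAMPLE_RATE,
                max_seconds: float = MAX_SECONDS) -> int:
    """Largest number of samples per channel that fits the duration limit."""
    return int(sample_rate * max_seconds)


def split_into_windows(num_samples: int,
                       sample_rate: int = SAMPLE_RATE,
                       max_seconds: float = MAX_SECONDS) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive windows,
    each no longer than the duration limit."""
    step = max_samples(sample_rate, max_seconds)
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


# A 60-second clip at the assumed sample rate needs two windows.
windows = split_into_windows(60 * SAMPLE_RATE)
```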

  • Caption: "A door is opening and closing and footsteps are occurring"
  • Edit Instruction: "he should walk on snow"
  • SAO-Instruct: "A door is opening and closing and footsteps are occurring on snow"
  • Caption: "People are on the beach"
  • Edit Instruction: "add thunder"
  • SAO-Instruct: "People are on the beach with thunder"
  • Caption: "People are clapping in the foyer"
  • Edit Instruction: "change it to a dog barking"
  • SAO-Instruct: "A dog is barking in the foyer"

Comparison with Baselines

  • Prompt: captions and instructions
  • Input Audio: from AudioCaps
  • ZETA/50: conditioned on full captions
  • ZETA/75: conditioned on full captions
  • AudioEditor: conditioned on full captions
  • SAO-Instruct: conditioned on instruction
  • Input: "Birds chirp as an object strikes a surface"
  • Instruction: "make it a metallic object"
  • Output: "Birds chirp as a metallic object strikes a surface"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "An emergency siren wailing followed by a large truck engine running idle"
  • Instruction: "replace the truck engine with a motorcycle engine"
  • Output: "An emergency siren wailing followed by a motorcycle engine running idle"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "Wind blows and a small bird chirps"
  • Instruction: "make the bird chirping louder"
  • Output: "Wind blows and a small bird chirps loudly"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "A woman speaking with a child speaking"
  • Instruction: "remove the child"
  • Output: "A woman speaking"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "A bus engine slowing down then accelerating"
  • Instruction: "add brakes squealing"
  • Output: "A bus engine slowing down with brakes squealing then accelerating"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "An emergency vehicle has the siren on"
  • Instruction: "add traffic noise"
  • Output: "An emergency vehicle has the siren on with traffic noise"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "Humming and sputtering from an idling engine"
  • Instruction: "make it a motorcycle engine"
  • Output: "Humming and sputtering from an idling motorcycle engine"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "People are laughing"
  • Instruction: "Add clapping"
  • Output: "People are laughing with clapping in the background"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "A vehicle engine starting up then running idle"
  • Instruction: "add the echo of a tunnel"
  • Output: "A vehicle engine starting up then running idle with the echo of a tunnel"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)
  • Input: "A cat meowing as wind blows into a microphone"
  • Instruction: "replace the cat with a dog"
  • Output: "A dog barking as wind blows into a microphone"
Mel spectrograms (Input Audio, ZETA/50, ZETA/75, AudioEditor, SAO-Instruct)

Failure Cases

While the performance of SAO-Instruct can be further improved by per-sample adjustments, such as tuning the classifier-free guidance (CFG) scale or the amount of noise applied to the initial encoded audio, some limitations remain. We observe that the phrasing of the edit instruction can influence the edit quality and accuracy of the model. The model also occasionally struggles to reconstruct coherent speech and may produce edits with significant artifacts.
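
The two per-sample knobs mentioned above can be illustrated with a small sketch. It uses the standard classifier-free guidance formula and a simple additive Gaussian noise model for the encoded audio. Both functions, their names, and the use of plain Python lists as stand-ins for latents are assumptions made for illustration; this page does not specify how SAO-Instruct implements either step.

```python
import random


def cfg_combine(uncond: list[float], cond: list[float],
                cfg_scale: float) -> list[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    A scale of 0 ignores the instruction, 1 uses the conditional prediction
    as is, and larger values push the result further toward the instruction.
    """
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


def noise_initial_latent(latent: list[float], noise_level: float,
                         rng: random.Random) -> list[float]:
    """Add Gaussian noise with standard deviation `noise_level` to the
    encoded input audio (assumed additive noise model).

    Less noise keeps the result closer to the input clip; more noise gives
    the model more freedom to change it.
    """
    return [x + noise_level * rng.gauss(0.0, 1.0) for x in latent]


# Toy values standing in for model predictions and an encoded clip.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], cfg_scale=2.0)
noisy = noise_initial_latent([0.5, -0.5], noise_level=0.3,
                             rng=random.Random(0))
```

In practice, sweeping a few values of each parameter per clip and listening to the results is one way to carry out the per-sample adjustment the paragraph refers to.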

  • Caption: "An alarm beeps while a woman speaks"
  • Edit Instruction: "remove the alarm"
  • SAO-Instruct: "An alarm beeps while a woman speaks"
  • Caption: "An alarm beeps while a woman speaks"
  • Edit Instruction: "the alarm should be silent!"
  • SAO-Instruct: "An alarm beeps while a woman speaks"

Newly added sounds sometimes fail to blend naturally with the background and instead appear overlaid on existing sound elements. Additionally, if a clip contains many distinct elements, the model may fail to alter the targeted sound or may confuse it with another, which leads to unintended edits.

  • Caption: "A cat meowing"
  • Edit Instruction: "add a dog howling"
  • SAO-Instruct: "A cat meowing and a dog howling"
  • Caption: "Drums, footsteps, frogs, and crickets are heard"
  • Edit Instruction: "replace the drums with claps"
  • SAO-Instruct: "Claps, footsteps, frogs, and crickets are heard"

These limitations primarily stem from insufficient data diversity and could be mitigated by training on larger and more diverse datasets.