Examples
SAO-Instruct takes an audio clip along with a free-form edit instruction and outputs the edited audio clip.
A woman gives a speech
it should be in a large concert hall
Chirping of birds with wind blowing
remove the background noise
A car is passing by with leaves rustling
make the car drive on gravel
Frying food is sizzling
add someone doing the dishes
Muffled sounds followed by metal being hit
make it glass instead
Ocean waves crashing
it should be a windy day
Birds chirp, wind blows and frogs croak
give it a rainy atmosphere
there are footsteps approaching
add a small river going by
A helicopter flying in the distance
add distant thunder
there should be fireworks
change it to a plane
Long-form Audio Editing
SAO-Instruct can edit up to 47 seconds of audio.
A door is opening and closing and footsteps are occurring
he should walk on snow
People are on the beach
add thunder
People are clapping in the foyer
change it to a dog barking
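The 47-second limit above can be respected by truncating longer inputs before editing. The snippet below is only a minimal sketch of that preprocessing step, not part of SAO-Instruct itself: the helper name `clip_to_max_duration`, the NumPy array layout (time on the last axis), and the 44.1 kHz sample rate in the usage lines are all assumptions made for illustration.

```python
import numpy as np

MAX_SECONDS = 47.0  # maximum edit length stated on this page


def clip_to_max_duration(samples, sample_rate, max_seconds=MAX_SECONDS):
    """Truncate audio along the last (time) axis to at most max_seconds.

    Hypothetical helper: `samples` is assumed to be a NumPy array shaped
    (num_samples,) or (channels, num_samples).
    """
    max_samples = int(max_seconds * sample_rate)
    return samples[..., :max_samples]


# Usage: a 60-second stereo clip at an assumed 44.1 kHz sample rate.
sample_rate = 44100
clip = np.zeros((2, 60 * sample_rate), dtype=np.float32)
trimmed = clip_to_max_duration(clip, sample_rate)
print(trimmed.shape[-1] / sample_rate)  # duration in seconds after trimming
```

Clips shorter than the limit pass through unchanged, since slicing past the end of an array simply returns the whole array.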
Comparison with Baselines
| Prompt Captions and Instructions | Input Audio from AudioCaps | ZETA/50 conditioned on full captions | ZETA/75 conditioned on full captions | AudioEditor conditioned on full captions | SAO-Instruct conditioned on instruction |
|---|---|---|---|---|---|
Failure Cases
While the performance of SAO-Instruct can be further improved by per-sample adjustments, such as tuning the classifier-free guidance (CFG) scale or the amount of noise applied to the initial encoded audio, some limitations remain. We observe that the phrasing of the edit instruction can influence the quality and accuracy of the edit. The model also occasionally struggles to reconstruct coherent speech and may produce edits with significant artifacts.
An alarm beeps while a woman speaks
remove the alarm
An alarm beeps while a woman speaks
the alarm should be silent!
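To make the two per-sample adjustments mentioned above more concrete, the following is a minimal sketch of what each knob typically controls in a guided generative model. It is not the SAO-Instruct implementation: the function names are hypothetical, the predictions are stand-in arrays rather than real model outputs, and the linear blend used for noising is an assumption that may differ from the model's actual noise schedule.

```python
import numpy as np


def cfg_combine(pred_uncond, pred_cond, cfg_scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the instruction-conditioned one by cfg_scale."""
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)


def noise_initial_latent(latent, noise_level, rng):
    """Blend the encoded input audio with Gaussian noise.

    Assumption: a simple linear blend, where noise_level=0 keeps the
    encoded audio unchanged and noise_level=1 replaces it with pure noise.
    """
    noise = rng.standard_normal(latent.shape)
    return (1.0 - noise_level) * latent + noise_level * noise


# Usage with stand-in arrays (no real model involved).
rng = np.random.default_rng(0)
latent = rng.standard_normal((64, 128))        # hypothetical encoded audio
start = noise_initial_latent(latent, 0.8, rng)  # more noise allows larger edits
guided = cfg_combine(np.zeros(4), np.ones(4), cfg_scale=3.0)
```

In this sketch, a higher `cfg_scale` pushes the output to follow the instruction more strongly, while a lower `noise_level` keeps the result closer to the input clip; both trade edit strength against fidelity to the original audio.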
Newly added sounds sometimes fail to blend naturally with the background and instead appear overlaid on existing sound elements. Additionally, if a clip contains many distinct elements, the model may fail to alter the targeted sound or may confuse it with another element, which leads to unintended edits.
A cat meowing
add a dog howling
Drums, footsteps, frogs, and crickets are heard
replace the drums with claps
These limitations primarily stem from insufficient data diversity and could be mitigated by training on larger and more diverse datasets.