Examples

SAO-Instruct takes an audio clip along with a free-form edit instruction and outputs the edited audio clip.

Caption
A woman gives a speech

SAO-Instruct

A woman gives a speech in a large concert hall

Edit Instruction
it should be in a large concert hall

Caption
Chirping of birds with wind blowing

SAO-Instruct

Edit Instruction
remove the background noise

Caption
A car is passing by with leaves rustling

SAO-Instruct

A car is passing by on a gravel road with leaves rustling

Edit Instruction
make the car drive on gravel

Caption
Frying food is sizzling

SAO-Instruct

Frying food is sizzling with someone doing the dishes

Edit Instruction
add someone doing the dishes

Caption
Muffled sounds followed by metal being hit

SAO-Instruct

Muffled sounds followed by glass being hit

Edit Instruction
make it glass instead

Caption
Ocean waves crashing

SAO-Instruct

Edit Instruction
it should be a windy day

Caption
Birds chirp, wind blows and frogs croak

Birds chirp, wind blows and frogs croak with a rainy atmosphere

Edit Instruction
give it a rainy atmosphere

Birds chirp, wind blows and frogs croak with footsteps approaching

Edit Instruction
there are footsteps approaching

Birds chirp, wind blows and frogs croak with a small river going by

Edit Instruction
add a small river going by

Caption
A helicopter flying in the distance

A helicopter flying in the distance with thunder

Edit Instruction
add distant thunder

A helicopter flying in the distance with fireworks

Edit Instruction
there should be fireworks

Edit Instruction
change it to a plane

Long-form Audio Editing

SAO-Instruct can edit up to 47 seconds of audio.
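As a rough illustration of the 47-second limit, the following minimal sketch trims a clip to that duration before editing. The function name `trim_to_limit` and the plain-list representation of samples are hypothetical and not part of SAO-Instruct; the sample rate is left as a parameter because it is not stated here.

```python
MAX_SECONDS = 47  # maximum edit length stated above


def trim_to_limit(samples, sample_rate, max_seconds=MAX_SECONDS):
    """Return at most `max_seconds` of audio from a sequence of samples.

    `samples` is any sliceable sequence of mono samples (hypothetical layout);
    clips already shorter than the limit are returned unchanged.
    """
    max_len = int(sample_rate * max_seconds)
    return samples[:max_len]


if __name__ == "__main__":
    # 60 seconds of silence at an assumed rate of 100 samples per second.
    clip = [0.0] * (60 * 100)
    print(len(trim_to_limit(clip, sample_rate=100)))
```

Longer recordings would need to be split into windows of this size and edited one window at a time, which this sketch does not attempt.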

Caption
A door is opening and closing and footsteps are occurring

SAO-Instruct

A door is opening and closing and footsteps are occurring on snow

Edit Instruction
he should walk on snow

Caption
People are on the beach

SAO-Instruct

Edit Instruction
add thunder

Caption
People are clapping in the foyer

SAO-Instruct

Edit Instruction
change it to a dog barking

Comparison with Baselines

Prompt Captions and Instructions	Input Audio from AudioCaps	ZETA/50 conditioned on full captions	ZETA/75 conditioned on full captions	AudioEditor conditioned on full captions	SAO-Instruct conditioned on instruction
Input: "Birds chirp as an object strikes a surface" Instruction: "make it a metallic object" Output: "Birds chirp as a metallic object strikes a surface"
Input: "An emergency siren wailing followed by a large truck engine running idle" Instruction: "replace the truck engine with a motorcycle engine" Output: "An emergency siren wailing followed by a motorcycle engine running idle"
Input: "Wind blows and a small bird chirps" Instruction: "make the bird chirping louder" Output: "Wind blows and a small bird chirps loudly"
Input: "A woman speaking with a child speaking" Instruction: "remove the child" Output: "A woman speaking"
Input: "A bus engine slowing down then accelerating" Instruction: "add brakes squealing" Output: "A bus engine slowing down with brakes squealing then accelerating"
Input: "An emergency vehicle has the siren on" Instruction: "add traffic noise" Output: "An emergency vehicle has the siren on with traffic noise"
Input: "Humming and sputtering from an idling engine" Instruction: "make it a motorcycle engine" Output: "Humming and sputtering from an idling motorcycle engine"
Input: "People are laughing" Instruction: "Add clapping" Output: "People are laughing with clapping in the background"
Input: "A vehicle engine starting up then running idle" Instruction: "add the echo of a tunnel" Output: "A vehicle engine starting up then running idle with the echo of a tunnel"
Input: "A cat meowing as wind blows into a microphone" Instruction: "replace the cat with a dog" Output: "A dog barking as wind blows into a microphone"

Failure Cases

While the performance of SAO-Instruct can be further improved by per-sample adjustments, such as tuning the CFG scale or the amount of noise applied to the initial encoded audio, some limitations remain. We observe that the phrasing of the edit instruction can influence the quality and accuracy of the edit. The model also occasionally struggles to reconstruct coherent speech and may produce edits with significant artifacts.
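The two per-sample adjustments mentioned above can be sketched generically. The snippet below shows the standard classifier-free guidance (CFG) combination and one common way to noise an encoded clip before denoising. The function names and the cosine/sine noise schedule are assumptions for illustration, not the confirmed SAO-Instruct implementation.

```python
import math
import random


def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance on two model predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the
    instruction-conditioned one, and larger values extrapolate further
    toward the instruction.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def noise_initial_latent(latent, strength, rng):
    """Mix an encoded clip with Gaussian noise (assumed schedule).

    strength = 0 keeps the latent unchanged, which preserves the input;
    strength close to 1 is dominated by noise, which allows larger edits.
    """
    alpha = math.cos(strength * math.pi / 2)  # weight on the original latent
    sigma = math.sin(strength * math.pi / 2)  # weight on the noise
    return [alpha * x + sigma * rng.gauss(0.0, 1.0) for x in latent]


if __name__ == "__main__":
    rng = random.Random(0)
    latent = [0.5, -0.25, 0.1]
    print(noise_initial_latent(latent, strength=0.6, rng=rng))
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0))
```

Under these assumptions, a lower noise strength keeps more of the input clip intact, while a higher CFG scale makes the output follow the instruction more strongly, so the two values trade off fidelity to the input against edit strength.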

Caption
An alarm beeps while a woman speaks

SAO-Instruct

Edit Instruction
remove the alarm

Caption
An alarm beeps while a woman speaks

SAO-Instruct

Edit Instruction
the alarm should be silent!

Newly added sounds sometimes fail to blend naturally with the background and instead appear overlaid on the existing sound elements. Additionally, if a clip contains many distinct elements, the model may fail to alter the targeted sound or may confuse it with another, which leads to unintended edits.

Caption
A cat meowing

SAO-Instruct

Edit Instruction
add a dog howling

Caption
Drums, footsteps, frogs, and crickets are heard

SAO-Instruct

Claps, footsteps, frogs, and crickets are heard

Edit Instruction
replace the drums with claps

These limitations primarily stem from insufficient data diversity and could be mitigated by training on larger and more diverse datasets.