Examples
SAO-Instruct takes an audio clip along with a free-form edit instruction and outputs the edited audio clip.
 
            A woman gives a speech
 
            it should be in a large concert hall
 
            Chirping of birds with wind blowing
 
            remove the background noise
 
            A car is passing by with leaves rustling
 
            make the car drive on gravel
 
            Frying food is sizzling
 
            add someone doing the dishes
 
            Muffled sounds followed by metal being hit
 
            make it glass instead
 
            Ocean waves crashing
 
            it should be a windy day
 
            Birds chirp, wind blows and frogs croak
 
              give it a rainy atmosphere
 
              there are footsteps approaching
 
              add a small river going by
 
            A helicopter flying in the distance
 
              add distant thunder
 
              there should be fireworks
 
              change it to a plane
Long-form Audio Editing
SAO-Instruct can edit up to 47 seconds of audio.
 
            A door is opening and closing and footsteps are occurring
 
            he should walk on snow
 
            People are on the beach
 
            add thunder
 
            People are clapping in the foyer
 
            change it to a dog barking
Comparison with Baselines
| Prompt Captions and Instructions | Input Audio from AudioCaps | ZETA/50 conditioned on full captions | ZETA/75 conditioned on full captions | AudioEditor conditioned on full captions | SAO-Instruct conditioned on instruction | 
|---|---|---|---|---|---|
Failure Cases
Although per-sample adjustments, such as tuning the CFG scale or the amount of noise applied to the initial encoded audio, can further improve the performance of SAO-Instruct, some limitations remain. We observe that the phrasing of the edit instruction can influence the quality and accuracy of the edit. The model also occasionally struggles to reconstruct coherent speech and may produce edits with significant artifacts.
 
            An alarm beeps while a woman speaks
 
            remove the alarm
 
            An alarm beeps while a woman speaks
 
            the alarm should be silent!
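The two per-sample adjustments mentioned above, the CFG scale and the amount of noise applied to the initial encoded audio, can be illustrated with a minimal sketch. This is not the SAO-Instruct implementation: the function names `cfg_combine` and `noise_initial_latent` are hypothetical, and the linear blend between latent and noise is an assumed simplification, since the actual noising depends on the sampler and noise schedule in use.

```python
import numpy as np


def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: start from the unconditional prediction
    and move toward the instruction-conditioned one by `scale`.
    scale = 1 returns the conditional prediction unchanged;
    larger values follow the instruction more strongly."""
    return uncond + scale * (cond - uncond)


def noise_initial_latent(latent, noise_level, rng):
    """Blend the encoded input audio with Gaussian noise (assumed linear blend).
    noise_level = 0 keeps the input latent as is; noise_level = 1 discards it."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    noise = rng.standard_normal(latent.shape)
    return (1.0 - noise_level) * latent + noise_level * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((64, 128))  # hypothetical (channels, frames) latent
    start = noise_initial_latent(latent, noise_level=0.8, rng=rng)
    # Stand-ins for the two model predictions at one sampling step.
    uncond = rng.standard_normal(start.shape)
    cond = rng.standard_normal(start.shape)
    guided = cfg_combine(uncond, cond, scale=5.0)
    print(guided.shape)
```

In this sketch, a lower noise level keeps more of the input clip, while a higher CFG scale pushes the result further toward the instruction, so the two settings trade off faithfulness to the input against edit strength.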
Newly added sounds sometimes fail to blend naturally with the background and instead appear overlaid on existing sound elements. Additionally, if a clip contains many distinct elements, the model is unable to alter individual sounds or confuses them with one another, which leads to unintended edits.
 
            A cat meowing
 
            add a dog howling
 
            Drums, footsteps, frogs, and crickets are heard
 
            replace the drums with claps
These limitations primarily stem from insufficient data diversity and could be mitigated by training on larger and more diverse datasets.
