AudioAgent: Enhancing Task Performance through Modality-Driven Prompt Optimization

Anonymous Authors

Abstract. Large Language Models (LLMs) show significant potential as controllers in agent systems, where they interpret user instructions and select appropriate tools for audio tasks. However, they rely solely on textual input for this selection, overlooking information in the audio modality that could disambiguate user instructions and improve tool selection. To this end, we introduce AudioAgent, a versatile and adaptable agent framework for the audio domain. It is the first system that emphasizes audio comprehension and uses this information to autonomously refine user-provided instructions with a single fine-tuned LLM. With clearer instructions, the controller can make more precise selections from our comprehensive audio tool library, ultimately enhancing overall task performance. Our framework also allows users to freely register tools and to use any LLM as the core controller. Both subjective and objective metrics validate the effectiveness of our work.

Overview




Prompt Optimization

In this section, we provide examples of our Prompt Optimization module. For each audio sample we list the Raw Instruction, the Ground Truth (GT), and AudioAgent's result (Ours). Raw Instruction is the underspecified instruction that serves as input in our dataset; GT is the target output in the dataset; Ours is the result of modality comprehension and prompt optimization.
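To make the idea of prompt optimization concrete, the following is a minimal sketch, not AudioAgent's actual implementation. AudioAgent uses a fine-tuned LLM for this step; here a simple rule-based function stands in for it. The `AudioInfo` fields and the `optimize_prompt` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioInfo:
    """Hypothetical summary produced by an audio-comprehension step."""
    content_type: str            # e.g. "speech" or "music"
    language: Optional[str]      # e.g. "French"; None if not detected
    noisy: bool                  # True if background noise was detected


def optimize_prompt(raw_instruction: str, info: AudioInfo) -> str:
    """Rule-based stand-in for the fine-tuned LLM: append audio facts."""
    facts = [f"The audio contains {info.content_type}"]
    if info.language:
        facts.append(f"spoken in {info.language}")
    if info.noisy:
        facts.append("with noticeable background noise")
    base = raw_instruction.strip().rstrip(".")
    return f"{base}. {', '.join(facts)}."


if __name__ == "__main__":
    info = AudioInfo(content_type="speech", language="French", noisy=False)
    print(optimize_prompt("Write down what is said", info))
```

The point of the sketch is only that facts recovered from the audio (content type, language, noise) are folded into the instruction before the controller sees it, so that an underspecified request becomes specific enough to pick a tool.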

Audio Sample Raw Instruction GT Ours

Task Enhancement - Speech Transcription

In this section, we provide examples that showcase the improvement gained from AudioAgent's selection of the most suitable tool, here for Speech Transcription. The Prompt is the raw input given to the two compared agents, AudioGPT and Qwen2-Audio. Moreover, the tool in AudioGPT supports only English when given the raw input, so it performs poorly on other languages.

Audio Sample Prompt Optimized_Prompt Target Output AudioGPT Qwen2-Audio AudioAgent

Task Enhancement - Speech Translation

In this section, we provide examples that showcase the improvement gained from AudioAgent's selection of the most suitable tool, here for Speech Translation. Moreover, the tool in AudioGPT cannot complete the task unless the language is specified, so we compare only against Qwen2-Audio.

Audio Sample Content Prompt Optimized_Prompt Target Output Qwen2-Audio AudioAgent

Task Enhancement - Audio Enhancement

In this section, we provide examples that showcase the improvement gained from AudioAgent's selection of the most suitable tool, here for Audio Enhancement. Qwen2-Audio is unable to perform this task.

Audio Sample Prompt Optimized_Prompt AudioGPT AudioAgent

Sequential Selection

In this section, we provide examples of sequential multi-tool selection in AudioAgent.
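As a rough illustration of what registering tools and chaining them in sequence could look like, here is a minimal sketch under stated assumptions. The registry, the `register_tool` decorator, the tool names, and `run_sequence` are all hypothetical; the tools are placeholders that return strings instead of processing audio, and the plan is given by hand rather than chosen by an LLM controller.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry: maps a tool name to a callable.
TOOLS: Dict[str, Callable[[str], str]] = {}


def register_tool(name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that adds a function to the registry under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return decorator


@register_tool("enhance")
def enhance(item: str) -> str:
    # Placeholder: a real tool would denoise the audio.
    return f"enhanced({item})"


@register_tool("transcribe")
def transcribe(item: str) -> str:
    # Placeholder: a real tool would run speech recognition.
    return f"transcript({item})"


def run_sequence(plan: List[str], audio: str) -> str:
    """Apply the named tools in order, feeding each output to the next."""
    result = audio
    for name in plan:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        result = TOOLS[name](result)
    return result


if __name__ == "__main__":
    print(run_sequence(["enhance", "transcribe"], "clip.wav"))
```

In this sketch the controller's job reduces to producing the `plan` list; a clearer, optimized prompt would make it easier to produce the right list in the right order.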

Audio Sample Prompt Optimized Prompt Reply

Multi-turn Interaction

In this section, we provide results from a multi-turn dialogue.

Audio Sample Enhancement Definition Transcription Translate Text to Speech