Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model
Project Page | Code | Model | Dataset | Paper
Audio-Interaction is a unified streaming model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. It formalizes the "perceive-decide-respond" loop, allowing the model to handle conventional offline tasks (ASR, S2TT) while adding online capabilities like proactive intervention and real-time voice chatting.
The model alternates between a LISTENING state, where it consumes one encoder-output chunk per step and emits either KEEP_SILENCE or TEXT_BEGIN, and a SPEAKING state, where it autoregressively generates a text turn until TEXT_END and then returns to listening for the next chunk.
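The alternation described above can be sketched as a small driver loop. This is a minimal illustration, not the project's implementation: `run_session`, `decide`, and `next_token` are hypothetical names standing in for the model's per-chunk decision and per-token generation, and the rule that every emitted token counts against the session budget is an assumption.

```python
from typing import Callable, Iterable, Iterator, List

KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


def run_session(
    chunks: Iterable[object],
    decide: Callable[[object], str],          # hypothetical: LISTENING-state decision
    next_token: Callable[[List[str]], str],   # hypothetical: SPEAKING-state generation
    max_new_tokens: int = 4096,
) -> Iterator[List[str]]:
    """Yield one text turn (a list of tokens) each time a reply is started."""
    budget = max_new_tokens  # assumption: all emitted tokens share one budget
    for chunk in chunks:
        if budget <= 0:
            return
        # LISTENING: consume one chunk and emit exactly one decision token.
        decision = decide(chunk)
        budget -= 1
        if decision == KEEP_SILENCE:
            continue
        if decision != TEXT_BEGIN:
            raise ValueError(f"unexpected decision token: {decision!r}")
        # SPEAKING: generate until TEXT_END (or until the budget runs out),
        # then fall through to LISTENING for the next chunk.
        turn: List[str] = []
        while budget > 0:
            token = next_token(turn)
            budget -= 1
            if token == TEXT_END:
                break
            turn.append(token)
        yield turn
```

With stub callables, `list(run_session(chunks, decide, next_token))` returns one token list per reply, which makes the loop easy to exercise without loading any weights.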
Model Details
- Model name: Audio-Interaction
- Task: Streaming audio-conditioned text generation (audio in, text out)
- Audio encoder: Qwen2.5-Omni audio tower (chunk-wise)
- Audio framing: 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
- Decoding states: LISTENING (emits KEEP_SILENCE/TEXT_BEGIN) and SPEAKING (emits text until TEXT_END)
- Default sampling: temperature 0.3, top-k 3
- Default max new tokens: 4096 per session
- License: Apache-2.0
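The framing numbers above imply some simple arithmetic: 16,000 samples/s × 0.4 s = 6,400 samples per chunk, and 10 encoder frames per chunk works out to 40 ms per frame. The sketch below only reproduces that arithmetic; `chunk_plan` and its field names are hypothetical, and it assumes padding rounds up to the next chunk boundary.

```python
SAMPLE_RATE = 16_000       # Hz, from the model details
CHUNK_SAMPLES = 6_400      # 0.4 s at 16 kHz
CHUNK_MS = 400
FRAMES_PER_CHUNK = 10      # encoder-output frames per chunk


def chunk_plan(num_samples: int) -> dict:
    """Return padding and chunk counts for a clip of `num_samples` samples."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    # Integer ceiling division: round up to the next 6400-sample boundary.
    chunks = -(-num_samples // CHUNK_SAMPLES)
    padded = chunks * CHUNK_SAMPLES
    return {
        "padding_samples": padded - num_samples,
        "chunks": chunks,
        "encoder_frames": chunks * FRAMES_PER_CHUNK,
        "ms_per_frame": CHUNK_MS // FRAMES_PER_CHUNK,
    }
```

For a one-second clip (16,000 samples), this gives 3 chunks, 3,200 samples of padding, and 30 encoder frames.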
Repository Contents
Audio-Interaction/
├── model-00001-of-00004.safetensors   # LM weights, sharded (≈4 GB each)
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json       # Shard index consumed by safetensors loader
├── config.json                        # Top-level model config
├── generation_config.json             # Generation defaults
├── model_config.yaml                  # GPT config consumed by Config.from_file
├── hyperparameters.yaml               # Training-time hyperparameters (reference)
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
├── MiniOmni3_ChunkwisedEncoder.pth    # Audio encoder weights (Qwen2.5-Omni audio tower)
└── qwen25OmniConfig/                  # Audio-encoder config (nested: thinker_config.audio_config)
Intended Use
Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives. Examples include voice assistants that may interject mid-utterance, alarms that respond to ambient sound, and low-latency dialogue systems where waiting for a full utterance before replying is too slow.
Quick Start
Installation
git clone https://github.com/xzf-thu/Audio-Interaction.git
cd Audio-Interaction
conda create -n Audio-Interaction python=3.10 -y
conda activate Audio-Interaction
pip install -r requirements.txt
Download the checkpoint
From the Audio-Interaction project root, pull the weights into checkpoints/:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="zhifeixie/AudioInteraction", local_dir="checkpoints")
snapshot_download is the recommended path: it pulls every file and resumes on interruption.
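After the download finishes, it can be useful to confirm that nothing is missing before running inference. The following is a minimal sketch that checks the directory against the file list in Repository Contents; `missing_checkpoint_entries` is a hypothetical helper, not part of the project, and it only checks that entries exist, not that their contents are intact.

```python
from pathlib import Path
from typing import List

# File names taken from the Repository Contents listing.
EXPECTED_FILES = [f"model-{i:05d}-of-00004.safetensors" for i in range(1, 5)] + [
    "model.safetensors.index.json",
    "config.json",
    "generation_config.json",
    "model_config.yaml",
    "hyperparameters.yaml",
    "tokenizer.json",
    "tokenizer_config.json",
    "MiniOmni3_ChunkwisedEncoder.pth",
]
EXPECTED_DIRS = ["qwen25OmniConfig"]


def missing_checkpoint_entries(checkpoint_dir: str) -> List[str]:
    """Return the expected files/directories that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Directories are reported with a trailing slash to tell them apart from files.
    missing += [f"{name}/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list from `missing_checkpoint_entries("checkpoints")` means every listed entry is present.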
Python Usage
from src.miniomni3.generate.run import run_inference
run_inference(
    checkpoint_dir="checkpoints",
    audio_paths=["/path/to/audio.wav"],  # offline mode: one round per path
    device="cuda:0",                     # or "mps" / "cpu"
)
Streaming Protocol
A single session looks like:
[system prompt tokens]
┌─── LISTENING ────
│ AUDIO_BEGIN PAD*10 ASSISTANT → KEEP_SILENCE (keep listening)
│ AUDIO_BEGIN PAD*10 ASSISTANT → TEXT_BEGIN EMOTION (start replying)
└──────────────────
┌─── SPEAKING ─────
│ … text tokens … TEXT_END (reply finished)
└──────────────────
┌─── LISTENING ──── (next audio chunk)
…
The model is trained to emit at most one TEXT_BEGIN per audio chunk. Each assistant turn begins with TEXT_BEGIN, followed by an emotion token, the reply tokens, and TEXT_END. Turns starting with KEEP_SILENCE indicate the model chose not to respond to that chunk.
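The turn structure described above can be checked mechanically. The sketch below parses a flat list of assistant-emitted tokens into per-chunk decisions. It assumes the prompt-side tokens (AUDIO_BEGIN, the padding, ASSISTANT) have already been stripped and that tokens are plain strings; `ChunkDecision` and `parse_assistant_stream` are hypothetical names, and the emotion token in the usage example is a placeholder because the model's actual emotion vocabulary is not listed here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkDecision:
    replied: bool
    emotion: Optional[str] = None
    text_tokens: Optional[List[str]] = None


def parse_assistant_stream(tokens: List[str]) -> List[ChunkDecision]:
    """Split assistant output into KEEP_SILENCE decisions and reply turns."""
    decisions: List[ChunkDecision] = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "KEEP_SILENCE":
            decisions.append(ChunkDecision(replied=False))
            i += 1
        elif token == "TEXT_BEGIN":
            try:
                end = tokens.index("TEXT_END", i + 1)
            except ValueError:
                raise ValueError("TEXT_BEGIN without a matching TEXT_END") from None
            body = tokens[i + 1:end]
            if not body:
                raise ValueError("reply turn is missing its emotion token")
            # First token after TEXT_BEGIN is the emotion; the rest is the reply.
            decisions.append(ChunkDecision(True, body[0], body[1:]))
            i = end + 1
        else:
            raise ValueError(f"unexpected token outside a turn: {token!r}")
    return decisions
```

For example, `parse_assistant_stream(["KEEP_SILENCE", "TEXT_BEGIN", "<neutral>", "Hello", "TEXT_END"])` yields one silent decision followed by one reply whose emotion is the placeholder `"<neutral>"`.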
Limitations
- The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction.
- The model expects 16 kHz mono audio; non-conforming inputs are resampled, and audio is padded to 0.4-second boundaries.
- Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency.
- Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation.
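To see in advance whether a file will be resampled, its header can be inspected with the standard-library `wave` module. This is a sketch under stated assumptions: `inspect_wav` is a hypothetical helper, it handles only uncompressed PCM WAV files (which is all `wave` reads), and it does not reproduce the project's own loading or resampling code.

```python
import wave

TARGET_RATE = 16_000   # Hz, expected by the model
TARGET_CHANNELS = 1    # mono


def inspect_wav(path: str) -> dict:
    """Read a PCM WAV header and report whether it already matches 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": frames / rate,
        "needs_conversion": rate != TARGET_RATE or channels != TARGET_CHANNELS,
    }
```

A file for which `needs_conversion` is true is not rejected by the model; per the limitation above, it is resampled before encoding.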
Citation
@misc{xie2026audiointeractionmodel,
  title={Audio Interaction Model},
  author={Zhifei Xie and Zihang Liu and Ze An and Xiaobin Hu and Yue Liao and Ziyang Ma and Dongchao Yang and Mingbao Lin and Deheng Ye and Shuicheng Yan and Chunyan Miao},
  year={2026},
  eprint={2606.05121},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2606.05121},
}
Acknowledgements
Audio-Interaction builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.