AI/TLDR

StepFun · 2026-04-28 · notable

Step-Audio-R1.5 — RLHF-Trained Audio Reasoner Closes the 'Verifiable Reward Trap' Gap

StepFun's audio-reasoning LLM pairs verifiable-reward RL with RLHF guided by a rubric reward model. It reports a 77.97 average score, substantially beats Step-Audio-R1 on AudioMultiChallenge, and stays competitive with Gemini 3 Pro on multi-turn spoken dialogue.


An audio reasoner that sounds human in long conversations — RLHF rescues prosody and emotion that RLVR was quietly killing.

What is it?

Step-Audio-R1.5 is the next iteration of StepFun's audio-language model, which performs chain-of-thought reasoning over speech. The technical report argues that the standard recipe for audio reasoners, Reinforcement Learning with Verifiable Rewards (RLVR), produces models that are 'technically accurate but experientially hollow,' losing prosodic naturalness and emotional continuity in multi-turn dialogue.

How does it work?

The team combines RLVR with Reinforcement Learning from Human Feedback (RLHF) using a 'rubric-guided preference reward model' that scores correctness, fluency, and emotional resonance simultaneously. The architecture decouples explicit reasoning traces from final responses, so the RLHF signal can shape conversational quality without flattening the chain-of-thought. Evaluation uses AudioMultiChallenge plus three new in-house benchmarks (step_caption, step_spqa, step_dialogue_understanding) released alongside the paper.
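As a rough illustration of the two ideas above, the sketch below separates a reasoning trace from the final response and collapses three rubric scores into one scalar reward. It is not StepFun's implementation: the `<think>` delimiters, the `RubricScores` type, the function names, and the weights are all hypothetical, and the fixed weighted sum only stands in for the learned preference reward model the report describes. Scoring only the final response is likewise an assumption drawn from the decoupling described above.

```python
from dataclasses import dataclass

# Hypothetical delimiters: the report does not say how traces are marked.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


@dataclass(frozen=True)
class RubricScores:
    """Per-axis scores in [0, 1], as a rubric reward model might emit them."""
    correctness: float
    fluency: float
    emotional_resonance: float


def split_response(output: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_response) from a raw model output."""
    start = output.find(THINK_OPEN)
    end = output.find(THINK_CLOSE)
    if start == -1 or end == -1 or end < start:
        # No well-formed trace: treat the whole output as the final response.
        return "", output.strip()
    trace = output[start + len(THINK_OPEN):end].strip()
    final = (output[:start] + output[end + len(THINK_CLOSE):]).strip()
    return trace, final


def rubric_reward(scores: RubricScores,
                  weights: tuple[float, float, float] = (0.5, 0.25, 0.25)) -> float:
    """Combine rubric axes into one scalar reward (illustrative weights only)."""
    values = (scores.correctness, scores.fluency, scores.emotional_resonance)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("rubric scores must lie in [0, 1]")
    return sum(w * v for w, v in zip(weights, values))


# Only the final response would be passed to the rubric scorer, so the
# preference signal never directly rewrites the chain-of-thought.
trace, final = split_response(
    "<think>count the speakers</think> There are two speakers."
)
reward = rubric_reward(RubricScores(correctness=1.0, fluency=0.5,
                                    emotional_resonance=0.0))
```

In a real pipeline the three scores would come from the trained reward model rather than being supplied by hand, and the aggregation itself may be learned rather than a fixed weighted sum.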

Why does it matter?

Audio reasoning models tend to ace short benchmark turns and then sound robotic in real conversations. The paper formalises that gap as the 'verifiable reward trap' and shows a concrete recipe, rubric-guided RLHF, that preserves analytical reasoning while improving how long-form, interactive conversations feel. Code, paper, and benchmarks are all open under Apache 2.0.

Who is it for?

Audio-LLM researchers, plus voice-agent and TTS teams chasing natural-feeling multi-turn dialogue.

Try it

github.com/stepfun-ai/Step-Audio-R1

Key numbers

  • Average score: 77.97
  • License: Apache 2.0
  • GitHub stars: 647

Tags

  • audio-llm
  • rlhf
  • speech
  • reasoning
  • stepfun
  • open-source
  • apache-2-0
  • audio-reasoning
  • multi-turn-dialogue
