Speech Synthesis for Voice AI

Higgs TTS

A conversational speech system built for live voice chat. It speaks in the moment instead of reading static text, bringing expressive conversational voice capabilities to developers.

What is Higgs TTS?

Higgs TTS is a high-performance text-to-speech system developed by the Boson AI Team. Created specifically for voice chat applications, it is built to speak model responses in real time. Rather than reading static text, it turns model outputs into expressive conversational speech. It supports zero-shot voice cloning and gives developers inline control over emotion, style, prosody, pauses, and sound effects directly from the text stream.

Traditional systems convert text into audio using rigid, pre-recorded patterns, which often results in robotic or monotonous output. Higgs TTS changes this approach. By treating speech generation as a dynamic, controllable text stream, it allows developers to send instructions directly within the text. This control enables the voice to adapt immediately, changing tone, introducing pauses, or modifying expression mid-sentence to reflect the natural flow of human conversation.

The Conversational Speech Model

Voice AI requires a different approach than standard text-to-speech. In a live conversation, speech is not just the final step after text generation. It is how the agent answers, reacts, pauses, emphasizes, and carries the turn.

Higgs TTS is designed for this dynamic environment. It keeps the reliability of a production-level speech system, but it is built to speak model responses in the moment. The output retains the exact timing and expression needed to make an agent feel conversational, addressing a common limitation of traditional speech engines.

Traditional TTS (Reading)

Optimized for reading audiobooks or static articles. Generates a uniform voice pattern with pre-calculated prosody that does not adapt to conversational context.

Higgs TTS (Speaking)

Optimized for interactive dialogue. Responds dynamically to inline tags, adjusts to conversation context, manages turn-taking, and introduces natural conversational cues.

Interactive Space Demo

Test Higgs TTS using the official interactive playground. You can check the voice quality, enter custom text prompts, and experiment with audio responses.

Key Features of Higgs TTS

Direct Text Stream Control

Control the model directly from the text input stream. Insert inline tags to change emotion, style, speed, pitch, pauses, and sound effects mid-utterance. This allows developers to shape how a response is spoken without leaving the generation flow.

Zero-Shot Voice Cloning

Clone a voice using only a short audio reference file. The system extracts the speaker profile instantly, enabling customized voice outputs without the need for model training or custom fine-tuning.

Multilingual Coverage

Supports speech synthesis across 100+ languages, covering both major global languages and lower-resource dialects. On the 111-language Higgs-Multilingual benchmark, Higgs Audio v3 reaches a macro-averaged error rate of 3.61 (x100).

Real-Time Processing

Built to support streaming generation. With single-digit word error rates and fast generation speeds, it fits into voice applications that require immediate speech feedback.

Natural Conversational Cues

Integrates paralinguistic cues and syntactic variations into generated audio. It models natural conversational speech structures, question inflections, and emotional transitions.

Flexible Inference Modes

Choose between using the hosted API endpoint or local hosting. Model weights are available for local deployment using SGLang-Omni, giving developers full control over their infrastructure.

Performance & Benchmarks

Higgs TTS has been evaluated on public multilingual speech suites as well as internal datasets covering 111 languages. The system exhibits competitive Word Error Rates (WER) and Character Error Rates (CER) when compared to other open-source and proprietary platforms.

Multilingual Benchmarks

The table below reports macro-averaged WER and CER (x100). Lower values indicate better performance.

Benchmark	Languages	Higgs Audio v2	Higgs Audio v3	Non-Higgs Best Model
SeedTTS	2	2.10	1.11	1.21 (OmniVoice)
CV3	13	21.19	4.41	4.60 (Fish Audio S2 Pro)
MiniMax-Multilingual	32	49.86	2.74	2.98 (OmniVoice)
Higgs-Multilingual	111	52.24	3.61	3.63 (OmniVoice)
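
As a rough illustration of how scores like these are typically computed, the sketch below derives WER and CER from edit distance and then macro-averages per-language scores. It is not the evaluation code behind the table: the whitespace tokenization, the absence of text normalization, and the function names are all assumptions made for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by reference length, scaled by 100."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    # Assumption: words are separated by whitespace.
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    return error_rate(list(reference), list(hypothesis))


def macro_average(score_by_language):
    """Unweighted mean over languages, so each language counts equally."""
    return sum(score_by_language.values()) / len(score_by_language)
```

Macro-averaging gives a low-resource language the same weight as a high-resource one, which is why a single weak language can move the reported figure noticeably.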

Conversational Behavior Benchmarks (Emergent TTS)

This benchmark measures conversational behaviors that transcript accuracy alone cannot capture. The values are win rates against a baseline given the same reference audio prompt; higher is better.

Category	Higgs Audio v3	Fish Audio S2 Pro	Qwen3-TTS-1.7B	OmniVoice
Overall Win-Rate	53.65%	43.80%	38.84%	40.82%
Emotions	53.75%	53.04%	45.54%	61.07%
Foreign Words	48.75%	33.93%	24.64%	28.75%
Paralinguistics	68.57%	53.75%	44.29%	52.68%
Complex Pronunciation	25.10%	18.16%	30.00%	13.67%
Questions	61.43%	55.00%	53.39%	45.00%
Syntactic Complexity	60.71%	45.71%	34.11%	40.36%

Note: Benchmark results have been verified and reproduced by the SGLang-Omni team.
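
For readers unfamiliar with win rates, here is a minimal sketch of one common way to compute them from pairwise judgments against a baseline. The scoring rule (a tie counts as half a win) and the outcome labels are assumptions for illustration; the source does not state how ties were handled in the table above.

```python
def win_rate(outcomes):
    """Percentage of comparisons won, counting each tie as half a win."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    # Assumed scoring: win = 1, tie = 0.5, loss = 0.
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return 100.0 * sum(points[o] for o in outcomes) / len(outcomes)
```

Under this rule, a system that matches the baseline on every sample scores 50%, so values above 50% indicate that the system was preferred more often than the baseline.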

Installation and Setup

To run Higgs TTS locally, you can serve the model weights using SGLang-Omni. Follow the instructions below to install the necessary packages and deploy the inference engine on your hardware.

Step 1: Install SGLang-Omni

First, install SGLang-Omni, the SGLang package with support for voice models. Run this command in your terminal:

pip install sglang-omni --upgrade

Step 2: Start the Local Speech Server

Run the following command to download the Higgs TTS weights from Hugging Face and start the local inference server:

python -m sglang.launch_server --model-path boson-ai/higgs-audio-v3-tts --port 30000
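
Downloading the weights and loading the model can take a while, so a client script may want to wait until the port accepts connections before sending requests. The helper below is a generic sketch using only the Python standard library; the function name and defaults are illustrative, and a successful TCP connection only shows that something is listening on the port, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=30000, timeout=120.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; retry after a short delay.
            time.sleep(interval)
    return False
```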

Step 3: Run Inference via Python Code

Use the script below to connect to your local endpoint and generate speech using inline tags.

import requests

# Local SGLang-Omni endpoint started in Step 2.
url = "http://localhost:30000/v1/audio/speech"
headers = {"Content-Type": "application/json"}
data = {
    "model": "higgs-audio-v3-tts",
    # Inline tags control pauses and emotion within the utterance.
    "input": "This is normal speech. <pause duration='0.4'/> <emotion name='excited'>And now I am talking with excitement!</emotion>",
    "voice": "default_voice",
    "response_format": "wav"
}

response = requests.post(url, json=data, headers=headers, timeout=60)
# Fail loudly on an HTTP error instead of writing the error body to output.wav.
response.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(response.content)

print("Speech audio saved as output.wav")
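
To sanity-check the saved file, you can read its header with Python's standard wave module. This is a small sketch with a hypothetical helper name; it assumes the server returns uncompressed PCM WAV, which is the only WAV variant the wave module can read.

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return wav.getnchannels(), rate, frames / rate
```

Calling wav_info("output.wav") after the script above should give a duration of roughly the length of the spoken sentence. A wave.Error usually means the response was not a PCM WAV file.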

Developer API Guide

When building production voice applications, you can use the hosted Boson API endpoint. The system supports both blocking and streaming audio generation, so you can pass audio chunks to your application as they are generated.
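
The difference between blocking and streaming consumption can be sketched on the client side as follows. This example uses only the standard library and reads the response body in chunks as it arrives. The endpoint URL and payload are supplied by the caller, and whether the server needs an extra request field to enable streaming is not stated here, so treat the request shape as an assumption.

```python
import json
import urllib.request


def iter_chunks(stream, chunk_size=4096):
    """Yield successive byte chunks from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def stream_speech(url, payload, on_chunk, timeout=30):
    """POST payload as JSON and pass each audio chunk to on_chunk as it arrives.

    Returns the total number of bytes received. urlopen raises HTTPError
    for 4xx and 5xx responses.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    total = 0
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for chunk in iter_chunks(response):
            on_chunk(chunk)
            total += len(chunk)
    return total
```

In a voice chat application, on_chunk would typically push bytes into an audio playback buffer, so playback can begin before generation finishes.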

Inline Control Syntax

You can embed parameters directly inside your text input to format the speech output. Here is a list of the supported tags:

Tag Format	Description	Example
<emotion name="[value]">	Adjusts tone to excited, sad, angry, or happy.	<emotion name="happy">I am glad to help.</emotion>
<pause duration="[seconds]"/>	Inserts a silent pause for the specified duration.	Let me think. <pause duration="1.2"/> Okay, ready.
<style pitch="[value]" speed="[value]">	Changes pitch (+/- semitones) and speech rate.	<style pitch="+1" speed="1.2">Fast text.</style>
<sfx name="[value]"/>	Triggers conversational sound effects (cough, laugh, sigh).	Yes, indeed. <sfx name="cough"/> Excuse me.
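
Building tags by hand is error-prone, so an application might wrap the table above in small helper functions. The sketch below is illustrative only: the helper names are hypothetical, the allowed emotion and sound-effect values are limited to those listed in the table, and no escaping of the wrapped text is attempted because the source does not say how special characters are handled.

```python
# Values taken from the tag table; the model may accept others.
EMOTIONS = {"excited", "sad", "angry", "happy"}
SOUND_EFFECTS = {"cough", "laugh", "sigh"}


def emotion(name, text):
    """Wrap text in an <emotion> tag."""
    if name not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {name!r}")
    return f'<emotion name="{name}">{text}</emotion>'


def pause(seconds):
    """Self-closing <pause> tag with a duration in seconds."""
    if seconds <= 0:
        raise ValueError("pause duration must be positive")
    return f'<pause duration="{seconds:g}"/>'


def style(text, pitch=0, speed=1.0):
    """Wrap text in a <style> tag; pitch is a signed whole number of semitones."""
    return f'<style pitch="{pitch:+d}" speed="{speed:g}">{text}</style>'


def sfx(name):
    """Self-closing <sfx> tag for a conversational sound effect."""
    if name not in SOUND_EFFECTS:
        raise ValueError(f"unsupported sound effect: {name!r}")
    return f'<sfx name="{name}"/>'
```

For example, "Let me think. " + pause(1.2) + " Okay, ready." reproduces the pause example from the table.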

Zero-Shot Voice Cloning Payload

To use a custom voice without fine-tuning, pass a speaker reference block containing base64-encoded audio in the request headers or body payload:

{
  "model": "higgs-audio-v3-tts",
  "input": "This speech uses the audio characteristics of the speaker reference.",
  "speaker_reference": {
    "audio_data": "SGVsbG8gd29ybGQgYmFzZTY0IGRhdGE...",
    "format": "wav"
  }
}
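
A payload of this shape can be assembled in Python as sketched below. The field names mirror the JSON above; the function name and its defaults are hypothetical, and the reference audio is passed in as raw bytes (for example, the contents of a WAV file opened in binary mode).

```python
import base64
import json


def build_clone_payload(text, reference_audio, audio_format="wav",
                        model="higgs-audio-v3-tts"):
    """Build a request body whose speaker reference is base64-encoded audio."""
    return {
        "model": model,
        "input": text,
        "speaker_reference": {
            # JSON cannot carry raw bytes, so encode them as ASCII base64 text.
            "audio_data": base64.b64encode(reference_audio).decode("ascii"),
            "format": audio_format,
        },
    }


def to_json(payload):
    """Serialize the payload for use as a request body."""
    return json.dumps(payload)
```

Base64 inflates the audio by about a third, so a shorter reference clip keeps the request smaller.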

Capabilities and Considerations

Capabilities

Low Word Error Rate (WER) and Character Error Rate (CER) on global multilingual benchmarks.
Direct stream control using simple tags without separate API operations.
Zero-shot voice cloning enables instant generation using custom profiles.
Supports streaming outputs for low-latency voice chat integrations.
Local deployment options keep data on private hardware.

Considerations

Requires compatible hardware (e.g. GPUs) for local real-time streaming operations.
Voice cloning quality depends on the clarity and duration of the provided reference audio.
The online browser playground is still being stabilized.
Inline controls must be formatted correctly to avoid tags being spoken as normal text.
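
One way to catch badly formatted inline controls before sending text to the model is a client-side check. The sketch below treats the input as an XML fragment and reports unclosed or unknown tags. It is an assumption that the tag syntax is XML-compatible, and the check is stricter than the model may be: plain text containing a bare "&" or "<" will also be flagged.

```python
import xml.etree.ElementTree as ET

# Tag names documented in the Inline Control Syntax table.
KNOWN_TAGS = {"emotion", "pause", "style", "sfx"}


def check_inline_tags(text):
    """Return a list of problems found in text; an empty list means none detected."""
    try:
        # Wrap in a dummy root so text with several top-level tags still parses.
        root = ET.fromstring(f"<utterance>{text}</utterance>")
    except ET.ParseError as exc:
        return [f"malformed markup: {exc}"]
    return [
        f"unknown tag: <{el.tag}>"
        for el in root.iter()
        if el is not root and el.tag not in KNOWN_TAGS
    ]
```

Running the check on model output before synthesis lets an application strip or repair broken tags rather than have them read aloud.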