Save up to 90% on cloud cost for Voice AI

Whether you are serving inference for Speech-to-Text, Text-to-Speech, or chatbots, or batch-translating thousands of hours of audio, Salad's consumer GPUs can reduce your cloud cost by up to 90% compared to popular cloud providers and APIs.

Audio AI - automatic speech recognition with Whisper large

Have questions about SCE for your workload?

Book a 15 min call with our team.
Get $50 in testing credits.

Run popular models or bring your own models


For use cases like automatic speech recognition (ASR), translation, captioning, and subtitling, SaladCloud is at least 50% cheaper than big clouds and other APIs.

Per audio minute: save 99% on audio transcription using Whisper-Large-v2 and consumer GPUs.
Seconds per hour: average transcription rate per hour of audio using the Whisper Large v2 model.
Speech to text - Automatic speech recognition with Whisper large on Salad GPU cloud
Text to speech


Save up to 90% on Text-to-Speech (TTS) inference with Salad’s consumer GPUs. The RTX & GTX series GPUs deliver the best cost-performance for TTS inference.

Words per dollar: the RTX 3060 & GTX 1060 deliver almost 39,000 words per dollar for TTS use cases, the best cost-performance of all GPUs.
RTX 30-series GPUs: get the best cost-performance on lower-end 30xx Nvidia GPUs, available at the lowest market cost on Salad.
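To make the words-per-dollar figure concrete, here is a quick sketch of the arithmetic (the 500-word script length is an illustrative assumption, not a figure from this page):

```python
# Translate a words-per-dollar TTS throughput figure into a per-job cost.

def tts_cost(word_count: int, words_per_dollar: float) -> float:
    """Dollar cost to synthesize `word_count` words of speech."""
    return word_count / words_per_dollar

# ~39,000 words per dollar (the RTX 3060 / GTX 1060 figure above);
# a 500-word script is an illustrative assumption.
cost = tts_cost(500, 39_000)
print(f"${cost:.4f} to synthesize a 500-word script")  # $0.0128
```

At that rate, even long-form narration jobs stay in the fractions-of-a-cent to cents range per script.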

The Lowest Cost for Voice AI Inference

Voice AI models are a perfect fit for consumer GPUs, delivering excellent cost-performance and saving thousands of dollars compared to running on public clouds.
Scale easily to thousands of GPU instances worldwide without the need to manage VMs or individual instances, all with a simple usage-based price structure.


Save up to 98% on transcription costs compared to public cloud with about 60X real-time speed on RTX 3090s.
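As a rough sketch of how a real-time factor translates into cost (the GPU hourly rate below is a hypothetical placeholder, not a published Salad price): at roughly 60X real time, one hour of audio is transcribed in about a minute, so the cost per audio hour is the GPU's hourly rate divided by the speed-up.

```python
# Back-of-the-envelope transcription economics at a given real-time factor.
# The GPU hourly rate here is a hypothetical placeholder, not a quoted price.

def cost_per_audio_hour(gpu_rate_per_hour: float, realtime_factor: float) -> float:
    """Cost to transcribe one hour of audio on an hourly-billed GPU.

    At a real-time factor of N, one hour of audio takes 1/N GPU-hours.
    """
    return gpu_rate_per_hour / realtime_factor

# Example: a consumer GPU at a hypothetical $0.30/hr running ~60X real time.
cost = cost_per_audio_hour(0.30, 60)
print(f"${cost:.4f} per audio hour")  # $0.0050 per audio hour
```

The same arithmetic explains the savings claim: a managed transcription API priced per audio minute is typically orders of magnitude above a fraction of a cent per audio hour.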


Get better machine translation economics on Salad's network of GPUs at the lowest market prices.

Captioning / Subtitles

Cut AI captioning/subtitle generation costs by at least 50% on SaladCloud.