Hosting Your Own AI with Two-Way Voice Chat Is Easier Than You Think!

DATE POSTED: January 8, 2025

The integration of LLMs with voice capabilities has created new opportunities in personalized customer interactions.

This guide will walk you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

Prerequisites

Before we begin, make sure you have the following installed:

  • Python: Version 3.9 or higher.
  • PyTorch: For running the models.
  • Transformers: Provides access to the Qwen model.
  • Accelerate: Required in some environments.
  • FFmpeg & pydub: For audio processing.
  • FastAPI: To create the web server.
  • Uvicorn: ASGI server to run FastAPI.
  • Bark: For text-to-speech synthesis.
  • python-multipart & SciPy: For handling form uploads and writing WAV audio.

FFmpeg can be installed via apt install ffmpeg on Linux or brew install ffmpeg on macOS.

You can install the Python dependencies using pip:

pip install torch transformers accelerate pydub fastapi uvicorn bark python-multipart scipy

Step 1: Setting Up the Environment

First, let’s set up our Python environment and choose our PyTorch device:

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

This code checks whether a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly. If no such GPU is available, PyTorch falls back to the CPU, which is much slower.


For newer Apple Silicon devices, the device can also be set to mps to run PyTorch on Metal, but the PyTorch Metal implementation is not comprehensive.
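A minimal device-selection sketch that also tries Metal could look like this (check your workload on mps first, since some operators are still unsupported and fall back to the CPU):

import torch

# Prefer CUDA, then Apple Metal (MPS), then fall back to CPU
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'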

Step 2: Loading the Model

Most open-source LLMs only support text input and text output. However, since we want to create a voice-in-voice-out system, this would require us to use two more models to (1) convert the speech into text before it's fed into our LLM and (2) convert the LLM output back into speech.

By using a multimodal LLM like Qwen Audio, we can get away with a single model that processes speech input directly into a text response, and then only need a second model to convert the LLM output back into speech.

This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, but it also usually yields better results, since the input audio reaches the LLM directly instead of passing through a lossy transcription step.


If you're running on a cloud GPU host like Runpod or Vast, you'll want to point the HuggingFace and Bark cache directories at your volume storage by running export HF_HOME=/workspace/hf and export XDG_CACHE_HOME=/workspace/bark before downloading the models.

from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_name = "Qwen/Qwen2-Audio-7B-Instruct"

# The processor bundles the tokenizer and the audio feature extractor;
# device_map="auto" places the model on the GPU when one is available,
# so no extra .to(device) call is needed here
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto")

We chose the small 7B variant of the Qwen Audio model series here to keep the computational requirements down. However, Qwen may have released stronger and bigger audio models by the time you are reading this article. You can view all the Qwen models on HuggingFace to double-check that you're using their latest one.


For a production environment, you may want to use a fast inference engine like vLLM for much higher throughput.

Step 3: Loading the Bark Model

Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.

from bark import SAMPLE_RATE, generate_audio, preload_models

# Download (on first run) and load Bark's models into memory
preload_models()
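Bark also ships a set of speaker presets and understands a few non-speech cues in the prompt text. A quick sketch that builds on the imports above (the preset name follows the v2/<lang>_speaker_<n> pattern documented in the Bark repository; the output filename is arbitrary):

from scipy.io.wavfile import write as write_wav

# Pick a built-in voice preset and include a non-speech cue
text = "Hello! [laughs] Let me hum something... ♪ la la la ♪"
audio_array = generate_audio(text, history_prompt="v2/en_speaker_6")

# Save the result to disk for a quick listen
write_wav("preset_demo.wav", SAMPLE_RATE, audio_array)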


Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones might perform better, they come at a much higher cost. The TTS Arena keeps an up-to-date comparison.

With both Qwen Audio 7B and Bark loaded into memory, the approximate (V)RAM usage is 24 GB, so make sure your hardware supports this. Otherwise, you can use a quantized version of the Qwen model to save memory.
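As an illustration, here is one way to load the Qwen model in 4-bit precision with bitsandbytes through transformers. This is a sketch under the assumption that you have a CUDA GPU and the bitsandbytes package installed; expect a small quality drop compared to full precision:

import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2AudioForConditionalGeneration

# NF4 4-bit quantization roughly quarters the weight memory footprint
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)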

Step 4: Setting Up the FastAPI Server

We’ll create a FastAPI server with two routes to handle incoming audio or text inputs and return audio responses.

from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

This server accepts audio via POST requests to the /voice endpoint and plain text via POST requests to the /text endpoint.

Step 5: Processing Audio Input

We’ll use pydub (which relies on FFmpeg under the hood) to decode the incoming audio and prepare it for the Qwen model.

from pydub import AudioSegment
from io import BytesIO
import numpy as np

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to 16 kHz mono and scale 16-bit integer samples to the [-1, 1] float range Qwen expects
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    # Decode arbitrary audio bytes (any format FFmpeg understands) into a float32 array
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

Step 6: Generating a Textual Response with Qwen

With the processed audio, we can generate a textual response using the Qwen model. The function needs to handle both text and audio inputs.

The processor will convert our input into the model's chat template (ChatML in Qwen's case).
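For reference, a conversation here is just a list of role/content messages, and a user message's content can mix text and audio parts. A quick way to inspect the rendered prompt (question.wav is a hypothetical local recording; raw bytes are used as the audio_url value because that is what our loader above expects):

# Hypothetical local file, purely for illustration
with open("question.wav", "rb") as f:
    audio_bytes = f.read()

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_bytes},
            {"type": "text", "text": "Please answer the question in the recording."},
        ],
    }
]

# Renders the messages into Qwen's ChatML prompt string (audio becomes placeholder tokens)
print(processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False))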

def generate_response(conversation):
    # Render the conversation into the model's chat template (ChatML for Qwen)
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    # Collect all audio attachments referenced in the conversation
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True).to(device)
    else:
        inputs = processor(text=text, return_tensors="pt", padding=True).to(device)

    # Generate, then drop the prompt tokens so only the new response remains
    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return response


Feel free to play around with the generation parameters, like the temperature, in the model.generate call.
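For instance, switching from greedy decoding to sampling inside generate_response might look like this (the values shown are just reasonable starting points, not recommendations from the Qwen team):

# Replace the model.generate call with a sampled variant
generate_ids = model.generate(
    **inputs,
    max_length=256,
    do_sample=True,    # enable sampling so temperature/top_p take effect
    temperature=0.7,   # lower = more deterministic, higher = more varied
    top_p=0.9,         # nucleus sampling cutoff
)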

Step 7: Converting Text to Speech with Bark

Finally, we’ll convert the generated text response back to speech.

from scipy.io.wavfile import write as write_wav

def text_to_speech(text):
    # Synthesize speech with Bark and return it as an in-memory WAV file
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer

Step 8: Integrating Everything in the APIs

Update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_bytes}
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

You can also add a system message to the conversations to gain more control over the assistant's responses.
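For example, prepending a system message in the /text endpoint could look like this (the instruction text itself is just an illustration):

conversation = [
    {"role": "system", "content": "You are a concise, friendly voice assistant. Keep your answers to one or two sentences."},
    {"role": "user", "content": [{"type": "text", "text": text}]},
]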

Step 9: Testing Things Out

We can use curl to ping our server as follows:

# Audio input
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"

Conclusion

By following these steps, you’ve set up a simple local server capable of two-way voice interactions using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.

Applications

If you’re exploring ways to monetize AI-powered language models, a setup like this is a natural foundation for personalized customer interactions, such as voice-based support agents and interactive assistants.

Full code

import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto")

preload_models()

app = FastAPI()

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True).to(device)
    else:
        inputs = processor(text=text, return_tensors="pt", padding=True).to(device)
    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return response

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_bytes}
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
