Vivox: How to: Access client-side audio buffers

This article explains when and how to access audio buffers in the Vivox SDK for custom audio processing, such as adding effects, integrating third-party engines, or performing audio analysis. It details the callback functions available in the Vivox Core SDK for both capture and render audio, highlights special considerations for Unity and Unreal, and provides guidance on safely implementing these callbacks without impacting performance or user experience.

The Vivox SDK automatically handles the capture and render of voice data, so most implementations never need audio buffer access. Developers might require direct access to the capture or render audio buffers to do the following:

Add an audio effect that modifies the voice (e.g., robot voice, radio distortion, echo, static).
Use a third-party audio engine (e.g., Wwise, FMOD) for render instead of the automatic render provided by Vivox.
Perform phoneme or other audio analysis.

Engine

Vivox Core (Custom Engine)

The Vivox Core SDK provides four hooks for optional callback functions that allow access to the audio buffers:

Capture side
- After capture, but before Vivox audio processing
- After Vivox audio processing, but before being sent to the Vivox server
Render side
- After being received from the Vivox server, but before audio processing and mixdown to a single audio stream
- After Vivox audio processing and mixdown, but before render

Caution: Accessing or modifying the audio buffers incorrectly could have a substantial negative impact on the user experience.

The vx_sdk_config_t structure has six optional callback function pointers that are related to audio buffers and the audio subsystem. This structure is passed in when calling vx_initialize(). Set only the callbacks that are needed; any others can be ignored.

After the callbacks are set, they are called from the Vivox audio processing thread. The callbacks cannot be changed to be called from another thread. Any blocking operations (e.g., writing data to storage) should pass the given data to another thread to perform that operation. The Vivox audio processing thread must not be blocked.
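One common way to hand audio to another thread without blocking is a single-producer, single-consumer ring buffer. The following is a minimal sketch of that technique using C11 atomics. The names pcm_ring_t, pcm_ring_write, pcm_ring_read, and RING_CAPACITY are hypothetical and are not part of the Vivox SDK; the sketch assumes exactly one writer (the audio callback) and one reader (a worker thread).

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical capacity in samples; must be a power of two. */
#define RING_CAPACITY 16384u

typedef struct {
    short samples[RING_CAPACITY];
    atomic_size_t head; /* total samples written; advanced only by the producer */
    atomic_size_t tail; /* total samples read; advanced only by the consumer */
} pcm_ring_t;

/* Producer side: never blocks, so it is suitable for an audio callback.
 * Returns the number of samples stored; samples that do not fit are dropped. */
static size_t pcm_ring_write(pcm_ring_t *ring, const short *src, size_t count)
{
    size_t head = atomic_load_explicit(&ring->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&ring->tail, memory_order_acquire);
    size_t free_space = RING_CAPACITY - (head - tail);
    if (count > free_space) {
        count = free_space;
    }
    for (size_t i = 0; i < count; ++i) {
        ring->samples[(head + i) & (RING_CAPACITY - 1u)] = src[i];
    }
    /* Publish the new samples to the consumer. */
    atomic_store_explicit(&ring->head, head + count, memory_order_release);
    return count;
}

/* Consumer side: called from a worker thread that is allowed to block
 * afterwards (for example, to write the samples to storage).
 * Returns the number of samples copied into dst. */
static size_t pcm_ring_read(pcm_ring_t *ring, short *dst, size_t max_count)
{
    size_t tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&ring->head, memory_order_acquire);
    size_t available = head - tail;
    if (max_count > available) {
        max_count = available;
    }
    for (size_t i = 0; i < max_count; ++i) {
        dst[i] = ring->samples[(tail + i) & (RING_CAPACITY - 1u)];
    }
    /* Release the consumed slots back to the producer. */
    atomic_store_explicit(&ring->tail, tail + max_count, memory_order_release);
    return max_count;
}
```

Inside a callback, the producer call would look like pcm_ring_write(&ring, pcm_frames, (size_t)pcm_frame_count * (size_t)channels_per_frame). Dropping samples when the ring is full is a design choice that favors never stalling the audio thread over never losing data.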

// Called after audio is read from the capture device.
// This is as close to the capture device as possible, but can still
// have audio adjustments due to hardware echo cancellation and other
// factors.
// No blocking operations can occur on this callback.
void(*pf_on_audio_unit_after_capture_audio_read)(
  void *callback_handle,
  const char *session_group_handle,
  const char *initial_target_uri,
  short *pcm_frames,      // frame buffer
  int pcm_frame_count,    // number of frames in buffer
  int audio_frame_rate,   // sample rate
  int channels_per_frame  // channels per frame
);

// Called when an audio processing unit is about to
// send captured audio to the network from the audio processing
// thread.
// No blocking operation can occur on this callback.
void(*pf_on_audio_unit_before_capture_audio_sent)(
  void *callback_handle,
  const char *session_group_handle,
  const char *initial_target_uri,
  short *pcm_frames,      // frame buffer
  int pcm_frame_count,    // number of frames in buffer
  int audio_frame_rate,   // sample rate
  int channels_per_frame  // channels per frame
);

// Called before an audio processing unit mixes the per-participant
// audio data to a single stream from the audio processing thread.
// No blocking operations can occur on this callback.
void(*pf_on_audio_unit_before_recv_audio_mixed)(
  void *callback_handle,
  const char *session_group_handle,
  const char *initial_target_uri,
  vx_before_recv_audio_mixed_participant_data_t *participants_data,
  size_t num_participants
);

// Called when an audio processing unit is about to write received
// audio to the render device from the audio processing thread.
// No blocking operations can occur on this callback.
void(*pf_on_audio_unit_before_recv_audio_rendered)(
  void *callback_handle,
  const char *session_group_handle,
  const char *initial_target_uri,
  short *pcm_frames,      // frame buffer
  int pcm_frame_count,    // number of frames in buffer
  int audio_frame_rate,   // sample rate
  int channels_per_frame, // channels per frame
  int is_silence          // equals 0 if there is renderable audio data
);

// Called when an audio processing unit is started
// from the audio processing thread.
// No blocking operations can occur on this callback.
void(*pf_on_audio_unit_started)(
  void *callback_handle,
  const char *session_group_handle,
  const char *initial_target_uri);

// Called when an audio processing unit is stopped
// from the audio processing thread.
// No blocking operations can occur on this callback.
void(*pf_on_audio_unit_stopped)(
  void *callback_handle,
  const char *session_group_handle,
  const char *initial_target_uri);

Vivox Unity

The Vivox Unity SDK uses the Unity Editor Audio Tap components to access the capture/render audio. For Unity implementations, follow the Unity Editor Audio Tap components documentation instead of this article.

Vivox Unreal

The audio buffers are not exposed in the Vivox Unreal SDK, but the callbacks can be set by adding them to the existing vx_sdk_config_t creation step in the Vivox C++ source code. As Unreal uses C++, the callbacks are similar to the Vivox Core ones.

Callback usage

pf_on_audio_unit_after_capture_audio_read

This callback is the most appropriate to inject audio to replace the captured audio.

This callback returns the data as close to the audio capture device as possible based on the native device. For example, if the device performs hardware echo cancellation, this data is obtained after that step. To inject audio that replaces the captured audio, have this function overwrite the PCM frames with the data to inject. The data is then run through Vivox audio processing, such as Voice Activity Detection (VAD), Acoustic Echo Cancellation (AEC), and Automatic Gain Control (AGC).

Data written to the buffer that pcm_frames points to must be in the following format:

16-bit signed integer
pcm_frame_count number of frames
channels_per_frame number of channels per frame
audio_frame_rate sample rate

Note that all resampling or other audio conditioning of injected data is the developer's responsibility. The buffer must always be filled. If there is no audio data, represent this by 0s in that portion of the buffer.
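The fill rule above can be sketched as a small helper that a capture callback might call. The function name inject_pcm and its src/src_frames parameters are hypothetical; the sketch assumes the injected audio has already been converted to the buffer's sample rate and channel count.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: overwrites the capture buffer with up to src_frames
 * frames of injected audio and zero-fills whatever is left, so the buffer
 * is always completely filled. Returns the number of frames copied. */
static size_t inject_pcm(short *pcm_frames, int pcm_frame_count,
                         int channels_per_frame,
                         const short *src, size_t src_frames)
{
    if (pcm_frames == NULL || pcm_frame_count <= 0 || channels_per_frame <= 0) {
        return 0;
    }
    size_t total_frames = (size_t)pcm_frame_count;
    size_t channels = (size_t)channels_per_frame;
    size_t copy_frames = 0;
    if (src != NULL) {
        copy_frames = src_frames < total_frames ? src_frames : total_frames;
    }
    if (copy_frames > 0) {
        memcpy(pcm_frames, src, copy_frames * channels * sizeof(short));
    }
    /* No more injected audio: represent the remainder as silence (0s). */
    memset(pcm_frames + copy_frames * channels, 0,
           (total_frames - copy_frames) * channels * sizeof(short));
    return copy_frames;
}
```

Both calls are plain memory operations, so the helper does not block the audio processing thread.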

pf_on_audio_unit_before_capture_audio_sent

This callback is the most appropriate for recording applications that are designed to capture a player’s speech.

This callback is called after Vivox audio processing (such as Voice Activity Detection [VAD], Acoustic Echo Cancellation [AEC], and Automatic Gain Control [AGC]) occurs, and before being transmitted to the Vivox server. It is not recommended that developers modify the media payload at this point because the metadata (for example, is_speaking) would no longer match the originally analyzed data.

pf_on_audio_unit_before_recv_audio_mixed

This callback is the most appropriate for adding Digital Signal Processing (DSP) effects to individual participants on the render side.

This callback is called after receiving the audio from the Vivox server, and prior to mixing the audio down to a single stream. Use this callback to gain access to the per-participant audio data. The SDK can call this callback with num_participants set to 0, which indicates that there is currently silence and no audio data to mix.

If the audio frames are zeroed out in this callback, no events for non-local participants in the session are generated because a zeroed-out frame plays silence for that participant.
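As an illustration of a simple per-participant effect, the sketch below applies a gain to a buffer of 16-bit samples and clamps the result so that loud input saturates instead of wrapping around. The function name apply_gain is hypothetical. Because the layout of vx_before_recv_audio_mixed_participant_data_t is not shown in this article, the sketch operates on a bare sample buffer; how each participant's buffer is obtained from participants_data should be taken from the SDK header.

```c
#include <stddef.h>

/* Hypothetical DSP helper: scales each 16-bit sample by gain and clamps the
 * result to the signed 16-bit range. sample_count is the number of samples,
 * that is, frames multiplied by channels per frame. */
static void apply_gain(short *pcm, size_t sample_count, float gain)
{
    for (size_t i = 0; i < sample_count; ++i) {
        float scaled = (float)pcm[i] * gain;
        if (scaled > 32767.0f) {
            scaled = 32767.0f;
        } else if (scaled < -32768.0f) {
            scaled = -32768.0f;
        }
        pcm[i] = (short)scaled; /* truncates toward zero */
    }
}
```

The loop only touches memory that is already in the buffer, so it stays within the no-blocking rule for the audio processing thread.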

pf_on_audio_unit_before_recv_audio_rendered

This callback is the most appropriate for recording applications that are designed to capture what a player hears.

This callback is called before rendering the audio to the render device. This action occurs after the per-participant mixdown and the application of any 3D audio effects.

Taking audio render or capture responsibility / Third-party audio render and capture

If the application's audio engine will be used to render Vivox voice audio retrieved from these callbacks, or if the application's audio engine will be used to provide capture audio to Vivox through these callbacks, then the Vivox render and/or capture device(s) should be set to "No Device". This will prevent Vivox from opening any audio devices for reading/writing.

However, on mobile platforms Vivox relies on hardware acoustic echo cancellation to prevent echo when in speakerphone-like audio configurations. Android and iOS require that actual audio endpoints be opened for voice intentions with the OS for hardware echo cancellation to operate. So on mobile, it is better to NOT set Vivox's audio devices to "No Device". Rather, leave the Vivox audio device settings as they are and then zero or overwrite the audio data in these Vivox audio callbacks to prevent Vivox from rendering the voice audio or to provide substitute capture audio (discarding what was read from the microphone).
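The mobile approach on the render side can be sketched as a callback that hands the mixed audio to the application's audio engine and then zeroes the buffer so that Vivox renders silence. The parameter list matches pf_on_audio_unit_before_recv_audio_rendered as declared earlier; forward_to_game_audio and g_samples_forwarded are hypothetical stand-ins for the engine hand-off, which in a real integration must not block.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hand-off point. A real integration would push the samples
 * into the game's audio engine here without blocking; this stub only
 * counts them so that the sketch stays self-contained. */
static size_t g_samples_forwarded = 0;

static void forward_to_game_audio(const short *pcm, size_t sample_count)
{
    (void)pcm;
    g_samples_forwarded += sample_count;
}

/* Same parameter list as pf_on_audio_unit_before_recv_audio_rendered. */
static void on_before_recv_audio_rendered(
    void *callback_handle,
    const char *session_group_handle,
    const char *initial_target_uri,
    short *pcm_frames,
    int pcm_frame_count,
    int audio_frame_rate,
    int channels_per_frame,
    int is_silence)
{
    (void)callback_handle;
    (void)session_group_handle;
    (void)initial_target_uri;
    (void)audio_frame_rate;

    if (pcm_frames == NULL || pcm_frame_count <= 0 || channels_per_frame <= 0) {
        return;
    }
    size_t sample_count = (size_t)pcm_frame_count * (size_t)channels_per_frame;

    /* is_silence equals 0 when there is renderable audio data. */
    if (is_silence == 0) {
        forward_to_game_audio(pcm_frames, sample_count);
    }
    /* Zero the buffer so the Vivox render device stays open but plays silence. */
    memset(pcm_frames, 0, sample_count * sizeof(short));
}
```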

Additional notes

Callbacks only occur when in a channel session.
Sample rates can switch mid-stream.
Samples are 16-bit signed integers.
pcm_frame_count is the total number of frames for the period, where a frame consists of one sample for each channel. For 32 kHz, the number of frames in a 20 ms period would be 640, regardless of whether the channel is stereo or mono.
Silence is represented by 0s.
In cases where Vivox cannot open the capture device in single channel mode, the results from the microphone are mixed down to a single channel, which is then presented on a capture callback.
On the render side, the audio is stereo interleaved.
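The frame arithmetic and the interleaved layout described in these notes can be sketched as follows. The helper names are hypothetical, and the left-sample-first ordering in deinterleave_stereo is an assumption based on the usual interleaving convention rather than something this article states.

```c
#include <stddef.h>

/* Frames in one period: sample rate (Hz) multiplied by period length (ms),
 * divided by 1000. The channel count does not affect the result. */
static size_t frames_in_period(int audio_frame_rate, int period_ms)
{
    return (size_t)audio_frame_rate * (size_t)period_ms / 1000u;
}

/* Samples in a buffer: one sample per channel in every frame. */
static size_t samples_in_buffer(size_t frame_count, int channels_per_frame)
{
    return frame_count * (size_t)channels_per_frame;
}

/* Splits interleaved stereo into two mono buffers of frame_count samples
 * each. Assumes the order L0, R0, L1, R1, and so on. */
static void deinterleave_stereo(const short *interleaved, size_t frame_count,
                                short *left, short *right)
{
    for (size_t i = 0; i < frame_count; ++i) {
        left[i] = interleaved[2 * i];
        right[i] = interleaved[2 * i + 1];
    }
}
```

For the 32 kHz example, frames_in_period(32000, 20) gives 640 frames, which corresponds to 640 samples for mono and 1280 samples for stereo. Because sample rates can switch mid-stream, values like these should be recomputed from the arguments of each callback invocation rather than cached.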
