Scenario
Game audio should be softened when players are speaking to allow them to be clearly heard, but game audio should return to normal if no one is talking.
Resolution
Ducking is an audio effect where the level of one audio signal is reduced by the presence of another signal. This can typically be achieved by lowering (ducking) the volume of a secondary audio track when the primary track starts, and then lifting the volume again when the primary track is finished.
This article describes how to trigger an audio ducking feature. Note that this article does not go into how to adjust the volume levels for the ducking, but only how to judge when to trigger your ducking implementation.
Core
- Set your participant update frequency to 10 Hz. To do this, modify the participant_property_frequency field in whichever Vivox login request your implementation uses (vx_req_account_anonymous_login_t, vx_req_account_login_t, vx_req_account_authtoken_login_t, or vx_req_account_set_login_properties_t). A 10 Hz event rate is higher than typically recommended, but it is needed for a fast ducking reaction time.
vx_req_account_anonymous_login_t *req;
vx_req_account_anonymous_login_create(&req);
...
req->participant_property_frequency = 5; // Five 20 ms audio periods = 100 ms event periods (10 Hz)
...
- In your implementation's Vivox message handler, add a conditional block to the event handling for the event type evt_participant_updated. Implement your audio ducking control logic there, based on the information in the vx_evt_participant_updated_t event structs.
- Do not use the is_speaking field of vx_evt_participant_updated_t to control ducking. Its timing is determined on the Vivox server side and is not configurable. Monitor only the energy updates of non-self participants. An event is for a non-self participant when its is_current_user field is false.
- Inspect the energy field of vx_evt_participant_updated_t. Consider an energy above ~0.4 as "speaking," with ~0.6 being loud and ~0.75 being nearly the loudest possible. energy is a digital loudness value, so it only correlates with the actual acoustic loudness that loudspeakers produce. A floor value within 0.4-0.435 makes a decent trigger for significant sound. Experiment to find a good floor value for your application.
- When the energy value in a participant updated event is above your threshold, treat it as "speaking" and begin ducking gradually. If the speaking continues, keep ducking until you reach your ducked level.
- When no non-self energy above the threshold is received for some period (for example, 100 ms, which is one event period at 10 Hz), begin to unduck gradually. If non-self energy above the threshold is received again, resume ducking. Otherwise, continue unducking until you reach the normal level.
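The trigger logic in the steps above can be sketched as a small state machine. This is a minimal illustration, not part of the Vivox SDK: the ducker_t type, the ducker_* functions, and all constants are hypothetical names and values chosen for the example. It assumes you call ducker_on_participant_update from your evt_participant_updated handling with the event's is_current_user and energy fields, and call ducker_tick once per 100 ms event period. Applying the returned gain to your game audio is left to your own ducking implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning values; experiment to find what suits your application. */
#define SPEECH_THRESHOLD 0.42 /* non-self energy above this counts as "speaking" */
#define NORMAL_GAIN      1.0  /* game-audio gain when no one is talking */
#define DUCKED_GAIN      0.3  /* game-audio gain when fully ducked */
#define GAIN_STEP        0.1  /* gain change per 100 ms tick (gradual ramp) */
#define HOLD_TICKS       1    /* quiet ticks required before unducking starts */

typedef struct {
    double gain;          /* current game-audio gain */
    int quiet_ticks;      /* consecutive ticks without non-self speech */
    bool speech_in_tick;  /* non-self speech seen since the last tick */
} ducker_t;

void ducker_init(ducker_t *d) {
    assert(d != NULL);
    d->gain = NORMAL_GAIN;
    d->quiet_ticks = HOLD_TICKS; /* start in the "no one is talking" state */
    d->speech_in_tick = false;
}

/* Feed one participant update. Self events and quiet events are ignored. */
void ducker_on_participant_update(ducker_t *d, bool is_current_user, double energy) {
    assert(d != NULL);
    if (!is_current_user && energy > SPEECH_THRESHOLD) {
        d->speech_in_tick = true;
    }
}

/* Advance one event period and return the gain to apply to game audio. */
double ducker_tick(ducker_t *d) {
    assert(d != NULL);
    if (d->speech_in_tick) {
        d->quiet_ticks = 0;
    } else if (d->quiet_ticks < HOLD_TICKS) {
        d->quiet_ticks++;
    }
    d->speech_in_tick = false;

    /* Duck while speech is recent; otherwise head back to normal. */
    double target = (d->quiet_ticks < HOLD_TICKS) ? DUCKED_GAIN : NORMAL_GAIN;

    /* Move one step toward the target, clamping so it is never overshot. */
    if (d->gain > target) {
        d->gain -= GAIN_STEP;
        if (d->gain < target) d->gain = target;
    } else if (d->gain < target) {
        d->gain += GAIN_STEP;
        if (d->gain > target) d->gain = target;
    }
    return d->gain;
}
```

With these example values, continuous speech lowers the gain by 0.1 per tick until it reaches 0.3, and a single quiet tick starts the ramp back toward 1.0. Raising HOLD_TICKS would keep the audio ducked through short pauses between words, at the cost of a slower return to normal.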
Unity
- Set your participant update frequency to 10 Hz. To do this, set ILoginSession.ParticipantPropertyUpdateFrequency prior to login. A 10 Hz event rate is higher than typically recommended, but it is needed for a fast ducking reaction time.
- Register for the IChannelSession.Participants.AfterValueUpdated callback.
- Do not use the SpeechDetected value. Its timing is determined on the Vivox server side and is not configurable. Monitor only the AudioEnergy updates of non-self participants.
- Consider an AudioEnergy above ~0.4 as "speaking," with ~0.6 being loud and ~0.75 being nearly the loudest possible. AudioEnergy is a digital loudness value, so it only correlates with the actual acoustic loudness that loudspeakers produce. A floor value within 0.4-0.435 makes a decent trigger for significant sound. Experiment to find a good floor value for your application.
- When an AudioEnergy update is above your threshold, treat it as "speaking" and begin ducking gradually. If the speaking continues, keep ducking until you reach your ducked level.
- When no non-self AudioEnergy above the threshold is received for some period (for example, 100 ms, which is one event period at 10 Hz), begin to unduck gradually. If non-self AudioEnergy above the threshold is received again, resume ducking. Otherwise, continue unducking until you reach the normal level.
Unreal
- Set your participant update frequency to 10 Hz. To do this, add a line that sets participant_property_frequency in the VivoxNativeSdk::Login function in the VivoxNativeSdk.cpp file, as shown in the following code example. A 10 Hz event rate is higher than typically recommended, but it is needed for a fast ducking reaction time.
vx_req_account_guest_login *req;
vx_req_account_guest_login_create(&req);
req->connector_handle = vx_strdup(TCHAR_TO_UTF8(*connectorHandle));
req->access_token = vx_strdup(TCHAR_TO_UTF8(*accessToken));
req->account_handle = vx_strdup(TCHAR_TO_UTF8(*account.ToString()));
req->participant_property_frequency = 5; // Five 20 ms audio periods = 100 ms event periods (10 Hz)
- Bind an event handler to IChannelSession.EventAfterParticipantUpdated.
- Do not use the SpeechDetected value. Its timing is determined on the Vivox server side and is not configurable. Monitor only the AudioEnergy updates of non-self participants.
- Consider an AudioEnergy above ~0.4 as "speaking," with ~0.6 being loud and ~0.75 being nearly the loudest possible. AudioEnergy is a digital loudness value, so it only correlates with the actual acoustic loudness that loudspeakers produce. A floor value within 0.4-0.435 makes a decent trigger for significant sound. Experiment to find a good floor value for your application.
- When an AudioEnergy update is above your threshold, treat it as "speaking" and begin ducking gradually. If the speaking continues, keep ducking until you reach your ducked level.
- When no non-self AudioEnergy above the threshold is received for some period (for example, 100 ms, which is one event period at 10 Hz), begin to unduck gradually. If non-self AudioEnergy above the threshold is received again, resume ducking. Otherwise, continue unducking until you reach the normal level.
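Whichever integration you use, the floor value is found by experimenting. One possible approach, sketched below in engine-agnostic C, is to record the energy values of non-self participants during a play session and compare how often each candidate floor would have triggered. The fraction_above function is a hypothetical helper written for this example and is not part of any Vivox API. How you collect the samples is up to your implementation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the fraction of recorded energy samples that are strictly above
 * floor_value, i.e. how often that floor would have counted as "speaking".
 * Returns 0.0 when there are no samples.
 */
double fraction_above(const double *energy, size_t n, double floor_value) {
    assert(energy != NULL || n == 0);
    if (n == 0) {
        return 0.0;
    }
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (energy[i] > floor_value) {
            count++;
        }
    }
    return (double)count / (double)n;
}
```

Comparing the result for 0.4 against the result for 0.435 on the same recording gives a rough sense of how sensitive the trigger is to the floor. A floor that fires during stretches where no one is talking is probably too low for that environment.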