[FS-7541] regression in native audio recording Created: 10/May/15  Updated: 18/May/15  Resolved: 18/May/15

Status: Closed
Project: FreeSWITCH
Component/s: freeswitch-core
Affects Version/s: 1.4.18
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Stanislav Sinyagin Assignee: Anthony Minessale II
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Debian Wheezy, on different hosts.

Issue Links:
Related
relates to FS-7538 both -in and -out raw recordings are ... Closed
CPU Architecture:
x86-64
Kernel:
Linux
Userland:
GNU/Linux
Distribution:
Debian
Distribution Version:
Debian 7 wheezy
Compiler:
gcc
FreeSWITCH GIT Revision: 359ff2ac70788594a321eac856c85011baed017d
GIT Master Revision hash:: 359ff2ac70788594a321eac856c85011baed017d

 Description   
At first I noticed that a VM at digitalocean (KVM) had significantly worse audio recorded in native format than in WAV: there were gaps in the audio every few seconds. Later I made a test on the same and several different hosts. KVM hosts showed always the problem. One of Xen servers did not show it, while another Xen server had this problem. A baremetal ARM server also did not show the problem.

The problem is stably reproducible with current master branch and with 1.4.18. Also I rolled back the master branch to dc8c8cc7b8 (Fri Dec 19 02:38:07 2014 -0600, Merge pull request #142 in FS/freeswitch from ~BORDMI/freeswitch-fs-7098-7099:master to master), and the problem disappeared on the DigitalOcean host.

I can provide you access to the server if needed.

Piece of public dialplan:
    <extension name="record">
      <condition field="destination_number" expression="^record_(.+)$">
        <action application="set" data="RECORD_READ_ONLY=true"/>
        <action application="set" data="send_silence_when_idle=400"/>
        <action application="set" data="record_waste_resources=true"/>
        <action application="jitterbuffer" data="60:200:20"/>
        <action application="answer"/>
        <action application="record_session" data="/var/tmp/record_$1"/>
        <action application="playback" data="silence_stream://-1"/>
      </condition>
    </extension>

Command executed on the same server, sending a call to its own IP address:
fs_cli -x 'originate sofia/external/record_01@104.236.177.224:5080 &playback(/var/tmp/ITU-T_P_50_BRITISH_ENGLISH.wav)'

The audio file is a 16-bit, 8KhZ WAV file available at http://murmur.voxserv.ch/media/ITU-T_P_50_BRITISH_ENGLISH.wav

The resulting audio has audible gaps every few seconds.



 Comments   
Comment by Anthony Minessale II [ 10/May/15 ]
You need to do a git bisect.
Comment by Stanislav Sinyagin [ 11/May/15 ]
intermediate bisection results:

bad 359ff2ac7078
bad 5e43c6dd2
bad 74e8db736
bad 6082e1dff

good 302a339fdf
good 5afdffb661
good dc8c8cc7b8
Comment by Stanislav Sinyagin [ 11/May/15 ]
root@do02:/usr/src/freeswitch/src# git bisect bad
b320c525c28f2ec6a9fa345314a7c9ddd33dfb8a is the first bad commit
commit b320c525c28f2ec6a9fa345314a7c9ddd33dfb8a
Author: Anthony Minessale <anthm@freeswitch.org>
Date: Tue Feb 17 20:55:10 2015 -0600

    FS-7066 FS-7253 FS-7231 #this part should be right

:040000 040000 dfc7cbc717210b63aed307cc84147035a0a8a123 d2d6699fa12cfcc29dcc8c770ab13c772a98c97a M src


I'm re-testing the next good (68db163aad), but looking at b320c525, it feels like something related to timer issues.
Comment by Stanislav Sinyagin [ 11/May/15 ]
In 68db163aad , only 2-3 lost frames appear during a 2-minute recording, and in b320c525 the lost frames appear every few seconds.
I started rebuilding acce943b5d and will test it again tomorrow.

Also I preserved wheezy debs, in order to simplify further testing:

freeswitch-all_1.5.15b+git~20150511T214013Z~68db163aad-1~wheezy+1_amd64.deb
freeswitch-all_1.5.15b+git~20150511T204004Z~b320c525c2-1~wheezy+1_amd64.deb
freeswitch-all_1.5.15b+git~20150511T180206Z~2c8e9c80ce-1~wheezy+1_amd64.deb

I can also provide you access to the test machine, if needed.
Comment by Stanislav Sinyagin [ 12/May/15 ]
acce943b5d has also significant audio loss. Looks like the bisection went wrong when I marked 68db163aad as good.
Comment by Stanislav Sinyagin [ 12/May/15 ]
4780a7e2af is definitely the last good version.
Comment by Stanislav Sinyagin [ 12/May/15 ]
the above tests were made on a DigitalOcean VM, Debian Wheezy, kernel 3.2.0-4-amd64

Interesting enough, that on a different machine at DigitalOcean, Ubuntu 14.04.2 LTS, kernel 3.13.0-43-generic, 4780a7e2af still produced choppy audio in native recording. Also dc8c8cc7b8 produced choppy audio.

Both machines are two-core VM's, both show the same processor type: Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
Comment by Anthony Minessale II [ 12/May/15 ]
Sounds inconsistent still?

We can have a look at the commit .
Comment by Stanislav Sinyagin [ 13/May/15 ]
it's quite consistent on Debian. I will do more tests with Ubuntu on different hardware. But the "bad" commits are dealing with the system timers, so it's very likely that they influence the recording:

b320c52 FS-7066 FS-7253 FS-7231 #this part should be right
68db163 FS-7066 FS-7253 FS-7231 #comment not so fast, revert
acce943 FS-7066 FS-7253 FS-7231 #comment remove TFD_TIMER_ABSTIME as it seems to cause some load issues on some machines

Comment by Stanislav Sinyagin [ 13/May/15 ]
also tested on baremetal 8-core Intel(R) Xeon(R) CPU E3-1270 v3 @ 3.50GHz, Ubuntu 14.04.2 LTS, kernel 3.13.0-52-generic, in current master (9eb887af4) and in 359ff2ac, and no choppiness detected.
Comment by Stanislav Sinyagin [ 13/May/15 ]
also I'm running a long-term test on a baremetal Debian Wheezy host, Intel(R) Xeon(R) CPU E31270 @ 3.40GHz, kernel 3.2.0-4-amd64, FreeSWITCH version 1.4.18.

The test calls are made every 5 minutes, and there's up to 5 simultaneous calls in G711. Out of 66 calls that the server made to itself, 3 calls indicate degraded voice. Tomorrow I will switch it to 4780a7e2af.
Comment by Anthony Minessale II [ 13/May/15 ]
The changes you mention have had a subsequent revert, reapplication some more changes all after the commit you mentioned.
0dfd08bc86e3065b2e118d421207485b93f3d5e4 is the final result of that round of changes. So you are just hitting a commit that was already an issue and fixed in the future of the same commit stream. Just look at the commits to switch_time.c

If you are having trouble on VM one big reason is the timerfd is probably not connected to actual timing hardware.
in switch.conf.xml you may want to try playing with this setting: enable-softtimer-timerfd
Try setting it to "broadcast" to have the core share one timerfd for the whole process.
Try setting it to "false" to disable it and revert to the non-timerfd broadcast based timing.




Comment by Stanislav Sinyagin [ 14/May/15 ]
I got new results, which are quite confusing: on the baremetal server, which is a 8-core machine running nothing but the test FreeSWITCH instance, I get even more audio distortions with 4780a7e2af than with release 1.4.18. With 1.4.18, approximately 7% of the calls had dropped frames, and with 4780a7e2af , around 15%.

On the VM, the distortions appeared much more frequent, and almost in every call.

Should I also experiment with softimer parameters on the baremetal server while I have the machine?

Comment by Stanislav Sinyagin [ 14/May/15 ]
here I described the test procedure in detail: https://txlab.wordpress.com/2015/05/14/quality-assurance-for-voip-calls/
Comment by Anthony Minessale II [ 14/May/15 ]
I would mess with those parameters on latest master on whatever systems you are having problems with.
The changes to the time thing are clearly resolved as I had 7 bugs at once during that period all of which were resolved with the commit vowing to never touch that code again.
Comment by Stanislav Sinyagin [ 14/May/15 ]
alright, I have now actually several different servers, baremetal and VM, and the problem appears on all of them, with different frequency, but still always present. I'll start with the baremetal Wheezy machine, as the clock behavior is most predictable on it.
Comment by Stanislav Sinyagin [ 14/May/15 ]
Anthony, could it theoretically be related to the NTP daemon correcting the system time?
Comment by Anthony Minessale II [ 14/May/15 ]
Its possible though I cannot be sure.

I do see you are calling the jitterbuffer app before you answer the call. The app is designed for modifying the jb after a call already has a jb.
You may want to try setting the variable jitterbuffer_msec=60:200" instead of calling the app or move it to after the answer. Also you can get away with setting it to 20:200 these days as its adaptive.
Comment by Anthony Minessale II [ 14/May/15 ]
BTW are we going to get you guys back to ClueCon this year? It's been a while.
Comment by Stanislav Sinyagin [ 14/May/15 ]
only if ClueCon moves to Europe :)
I'm stuck at home with two little children, so no travel allowed :-)
Comment by Stanislav Sinyagin [ 14/May/15 ]
in theory, it should not use the jitterbuffer at all, as the RTP packets are sent to the server's own Gigabit interface, and there's no other traffic so far. But I'll try it too.
Comment by Stanislav Sinyagin [ 14/May/15 ]
I'm now launching a new call every 5 seconds in order to speed up the testing. So, if it were an NTP-related problem, the lost frames would occur in multiple calls at the same time. But so far, this hasn't happened, and the frames were lost in two non-overlapping calls. Now running with jitterbuffer_msec variable.
Comment by Stanislav Sinyagin [ 14/May/15 ]
jitterbuffer_msec variable did not help. Now running without NTP daemon. With a new call every 5 seconds, the problem is detected within a half an hour.
Comment by Stanislav Sinyagin [ 15/May/15 ]
disabling NTP daemon did not change anything
Setting "enable-softtimer-timerfd" to "true" did not change anything
Comment by Stanislav Sinyagin [ 15/May/15 ]
setting ""enable-monotonic-timing" to "true" did not help. Is there something more I can play with?
Comment by Stanislav Sinyagin [ 15/May/15 ]
I can provide you all the scripts or give you access to the server. The problem is quite easy to reproduce, it takes usually 15 to 30 minutes for first bad calls to appear.
Comment by Stanislav Sinyagin [ 15/May/15 ]
The problem is not affected by enabling or disabling the jitterbuffer.

I looked at the source code, and I see that the native file recording is always done within the RTP thread (a separate thread is only used for non-native recording). Also a silence frame is inserted if the timestamp difference between the frames is 3ms longer than the packet interval. So, theoretically if file writing takes longer than 23ms, we are too late for the next frame. Are there any other factors that may influence the interval (or interval measurement)?

Comment by Stanislav Sinyagin [ 15/May/15 ]
writing the recordings into tmpfs did not help.
Comment by Stanislav Sinyagin [ 16/May/15 ]
I disabled the piece of code that inserts the silence frames, and now it's recording perfect audio: out of 1520 test calls, not a single degraded recording:

root@fsqasjc1:/usr/src/freeswitch/src# git show
commit 7b293cdbe73fb9feae45f74642545cdfec7b459a
Author: root <root@debian>
Date: Sat May 16 12:31:03 2015 -0500

    xx

diff --git a/src/switch_ivr_async.c b/src/switch_ivr_async.c
index 1fdd0c3..28c23e8 100644
--- a/src/switch_ivr_async.c
+++ b/src/switch_ivr_async.c
@@ -1234,7 +1234,7 @@ static switch_bool_t record_callback(switch_media_bug_t *bug, void *user_data, s
                        }
                                
                                
- if (rh->last_read_time && rh->last_read_time < now) {
+ if (NULL && rh->last_read_time && rh->last_read_time < now) {
                                diff = ((now - rh->last_read_time) + 3000 ) / rh->read_impl.microseconds_per_packet;
                                
                                if (diff > 1) {

this is running on baremetal server, Debian 7.8 amd64, kernel 3.2.0-4-amd64, FreeSWITCH installed from debs, built by debian/util.sh.

So, it looks like switch_micro_time_now() fails to deliver a precise timestamp, and the recording loop assumes that more than 23ms is lost between the frames.
Comment by Anthony Minessale II [ 16/May/15 ]
That was actually going to be my next suggestion.
We do need that code tho so what if you use swiitch_time_now() instead?

Comment by Stanislav Sinyagin [ 16/May/15 ]
I'll try it now
Comment by Stanislav Sinyagin [ 16/May/15 ]
I think it's still good to replace missing frames with silence -- in this case, the audio would not be skewing. But probably that 3ms constant should be a variable.
Comment by Stanislav Sinyagin [ 16/May/15 ]
so far, 125 calls without failures. I will let it run overnight
Comment by Stanislav Sinyagin [ 17/May/15 ]
running smoothly, over 5000 calls without errors:

diff --git a/src/switch_ivr_async.c b/src/switch_ivr_async.c
index 1fdd0c3..3f90907 100644
--- a/src/switch_ivr_async.c
+++ b/src/switch_ivr_async.c
@@ -1217,7 +1217,7 @@ static switch_bool_t record_callback(switch_media_bug_t *bug, void *user_data, s
                break;
        case SWITCH_ABC_TYPE_TAP_NATIVE_READ:
                {
- switch_time_t now = switch_micro_time_now();
+ switch_time_t now = switch_time_now();
                        switch_time_t diff;

Comment by Stanislav Sinyagin [ 17/May/15 ]
does jitterbuffer actually work in front of session recording?
Comment by Tamas Jalsovszky [ 17/May/15 ]
If there was something wrong with switch_micro_time_now(), would that mean there could be performance issues under virtualization? I don't want to hijack the ticket, just following it as we have performance issues under KVM and haven't found anything usable yet.
Comment by Stanislav Sinyagin [ 17/May/15 ]
This patch has also improved the situation on a KVM machine in Debian Wheezy, although some distortions still occur from time to time.
Comment by Stanislav Sinyagin [ 17/May/15 ]
test results for KVM at digitalocean: Debian: 58 degraded calls out of 2553. Ubuntu: 5 degraded calls out of 350.
Comment by Anthony Minessale II [ 17/May/15 ]
Micro time now is intentionally not very accurate for less important things. Maybe we just need a bigger threshold for how long constitutes missing packets to not get erronous fill ins.
Comment by Stanislav Sinyagin [ 17/May/15 ]
but then these fill-ins would not make any sense, as the timer is not reliable. Then probably no fill-ins would be better.

in the mean time, i also made a new series of tests, writing directly to WAV files. The WAV recording is not inserting fill-ins and not checking the timer on arrival. But on one of the KVM hosts (Ubuntu), I still got audio errors in calls within one server: out of 1477 calls, 8 calls had audio distortions. It looks like the RTP sending timer is not so precise either?
Comment by Stanislav Sinyagin [ 18/May/15 ]
one more test result: on the baremetal platform, the calls were recorded into WAV files. Out of 7030 calls of the server to itself, one was distorted: in the middle of the audio file, there was a 80ms delay, and the rest of the recording was shifted at 80ms.

the daemon is running with standard arguments as launched by the deb package:
/usr/bin/freeswitch -u freeswitch -ncwait -nonat

"top" shows Priority -2 and Nice -10 for the freeswitch process.


Comment by Anthony Minessale II [ 18/May/15 ]
If you have no fill-ins there will be no perceived silence in the files so all silence will be removed from the file. so while it was on hold etc would be out of sync and recorded conversations would be distorted.

Regarding your JB question you 100% must be using a jitterbuffer for reliable recording.
All cases where you terminate voip to something clocked differently like a recording, TDM, a Conference. If you choose not to use a jitter buffer you will have this problem all the time.
The problem comes from the fact that the recording has to write the packets as it gets them and there is contention between treating late packets as late or missing. The whole problem that a jitter buffer solves falls to the core of the recording or muxer where it can not be sure. With a jitter buffer in place, you can then trust it to make sure the packets are in order and can safely assume if a packet has not been received in the expected time that it is missing and could use filling in. Even then if the CPU is busy or the network interface hiccups or power saving or any other reason that the computer might have a side effect. You still run a possible risk of timing issues. The hardware you need for the best timing is more or less a real hardware motherboard with a accurate timing source and the kernel configured to use the best one HPET etc. Some VM like xen made up their own pretend timer because they try to get more out of concurrent VM by sharing or averaging out the timing.

cat /sys/devices/system/clocksource/clocksource0/available_clocksource

see if you see hpet

If you do?

echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource


This is clearly going outside the scope of FS and why we try to limit how much we address load related questions.






 
Comment by Anthony Minessale II [ 18/May/15 ]
master updated with switch_time_now and longer time to determine inserting silence.
Comment by Stanislav Sinyagin [ 18/May/15 ]
thanks Anthony, I will do more tests and document this experience in Confluence
Comment by Stanislav Sinyagin [ 18/May/15 ]
Anthony, after switching the default timer from tsc to hpet on the baremetal platform, I started to get immediately much worse results -- recording into WAV files gave 10 degraded calls out of 436. Shall I move the discussion to the mailing list?
Comment by Stanislav Sinyagin [ 18/May/15 ]
oops - the errors were exactly at the time when the freeswitch build process was creating the debs, so this was most probably due to I/O wait.
Generated at Sun May 26 12:41:55 CDT 2019 using Jira 8.1.0#801000-sha1:2e1cd1bb771978cda2c5e8f3f10539ab180613f6.