[ARGUS] A concrete packet loss example :-)
Peter Van Epp
vanepp at sfu.ca
Sat Oct 16 01:29:30 EDT 2004
Because we are hoping to instrument a large high speed grid ftp file
transfer (which on a good day can saturate the gig link), we ran a netperf
test this afternoon to assess how argus is doing in its current state. It
unfortunately provided a concrete example of the possibilities for packet
loss, what you need to do to find it, and (hopefully) how to fix it :-), as
we have discussed before. The argus trace below covers 5 minutes of netperf
running at ~950 megabits per second. Rounding down to 800 megabits, that is
100 megabytes per second, or around 30 gigabytes over the 5 minute interval.
As we see from the argus trace, argus only thinks it saw about 18 gigabytes.
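The back-of-the-envelope arithmetic above can be checked quickly (a minimal
sketch; the 800 megabit figure is just the rounded-down rate from the text):

```shell
# ~950 Mbit/s rounded down to 800 Mbit/s; 5 minutes = 300 seconds
bits_per_sec=800000000
bytes_per_sec=$((bits_per_sec / 8))    # 100,000,000 bytes/s
total_bytes=$((bytes_per_sec * 300))   # bytes over the 5 minute run
echo "$total_bytes"                    # 30000000000, i.e. ~30 GB
```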
15 Oct 04 15:36:32 * tcp aaa.bb.cc.ddd.22 ?> aaa.bb.cc.eee.59940 55 89 8558 8146 CON
15 Oct 04 15:36:57 * tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 812577 387778 2997080894 25675796 CON
15 Oct 04 15:36:57 d tcp aaa.bb.cc.ddd.53615 -> aaa.bb.cc.eee.12865 6 4 916 788 CON
15 Oct 04 15:37:57 s tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 830294 390797 3156073012 25812974 CON
15 Oct 04 15:38:57 s tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 790515 383177 2799105842 25450438 CON
15 Oct 04 15:39:57 s tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 832011 393695 3171499418 25989794 CON
15 Oct 04 15:40:57 s tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 832456 392123 3175486124 25881510 CON
15 Oct 04 15:41:57 tcp aaa.bb.cc.ddd.53615 -> aaa.bb.cc.eee.12865 3 3 198 454 FIN
15 Oct 04 15:41:57 tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 186 416 1660256 27456 FIN
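To total up what argus thinks it saw, the byte columns can be summed from the
end of each record (a rough sketch, not part of the original run: exact field
positions depend on your ra output configuration, and counting back from NF
sidesteps the optional flag column; `argus.txt` is a hypothetical file holding
output like the trace above):

```shell
# Sum the source and destination byte counts (the two fields just before the
# state column) across all tcp records. Fields are counted from the end of
# the line so records with and without a flag character line up.
awk '/ tcp / { src += $(NF-2); dst += $(NF-1) }
     END { printf "src bytes: %d  dst bytes: %d\n", src, dst }' argus.txt
```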
Yet a dump of the man (management) records says that pcap lost only
about 175 buffers to overwrite-before-read by the application, indicating
(at this point at least; more on that later) that the problem doesn't look
to be pcap loss.
15 Oct 04 15:27:41 man 229.97.122.203 v2.0 1858603 494 5796 0 477018 627 CON
15 Oct 04 15:32:41 man 229.97.122.203 v2.0 1859469 507 5984198 175 1912201834 587 CON
15 Oct 04 15:37:41 man 229.97.122.203 v2.0 1860265 459 5418640 0 1724936348 593 CON
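For reference, records like the ones above come from reading the argus data
back with the ra client; something like the following should select just the
management records (a sketch only; the exact filter syntax may vary between
argus versions, so check ra's man page for yours):

```shell
# Read a saved argus data file and print only the management (man) records,
# which carry the probe's own statistics, including the pcap drop counts.
ra -r argus.out - man
```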
So we then need to look at netstat -i to get the interface statistics.
Here we find our first source of packet loss: input errors on both of the gig
cards capturing on the link. This likely means we are running out of kernel
network buffers (and/or memory or bus bandwidth, which will be harder to deal
with if true), so the first step is going to be building a new kernel with
much larger network buffer sizes to try to cut this down and see what
happens.
%netstat -i
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
sk0 9000 <Link#1> 00:00:5a:9a:26:cc 138756567 5172740 0 0 0
sk1 9000 <Link#2> 00:00:5a:9a:25:1c 499147209 3508673 0 0 0
xl0 1500 <Link#3> 00:e0:81:20:c3:4c 36871843 0 2090017 1 0
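A first pass at enlarging the kernel network buffers can be made with sysctl
before resorting to a kernel rebuild (a sketch for FreeBSD of this vintage;
the knob names and the values shown are assumptions to check against your
release, not measured settings from this machine):

```shell
# Look at current mbuf cluster usage and the configured limit
netstat -m
sysctl kern.ipc.nmbclusters

# kern.ipc.nmbclusters is a boot-time tunable on older FreeBSD, so raising
# it may mean adding a line like the following to /boot/loader.conf and
# rebooting:
#   kern.ipc.nmbclusters="65536"

# Larger socket buffers can help high speed captures too
sysctl kern.ipc.maxsockbuf=8388608
```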
But now we also need to go back and think about the pcap loss number.
Since it wasn't 0, we are likely near the edge of overrunning the buffer and
experiencing high loss. The pcap buffer size has already been sysctl'ed to
the 500+K maximum currently allowed, which suggests that when the net buffers
are boosted it would also be wise to increase the maximum size of the pcap
buffer, since stopping the loss on the interfaces will likely just move it to
the pcap buffer. (The half meg pcap buffer is a lot of why I'm on FreeBSD;
the Linux one looked to be about 64K and not easily changeable except with a
recompile.)
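On FreeBSD the bpf buffer ceiling is itself a sysctl, so raising it should
not need a recompile (a sketch; these knob names existed on FreeBSD of this
era, but verify with `sysctl -a | grep bpf` on your system before relying on
them):

```shell
# Show the current default and maximum bpf buffer sizes
sysctl debug.bpf_bufsize debug.bpf_maxbufsize

# Raise both so libpcap can request a bigger capture buffer
sysctl debug.bpf_bufsize=1048576
sysctl debug.bpf_maxbufsize=4194304
```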
Figuring out a way to clear the error counters wouldn't be a bad idea
either, and of course buying gargoyle is likely the right solution to this
whole problem if this is decided to be useful, especially since an upgrade of
this link to 10GigE was funded this morning :-). I'll keep you posted on
progress (or lack of it, as the case may be :-)).
Peter Van Epp / Operations and Technical Support
Simon Fraser University, Burnaby, B.C. Canada