[ARGUS] A concrete packet loss example :-)
Peter Van Epp
vanepp at sfu.ca
Sat Oct 16 01:29:30 EDT 2004
Because we are hoping to instrument a large high speed grid ftp file
transfer (which on a good day can saturate the gig link), we ran a netperf
test this afternoon to assess how argus is doing in its current state. It
unfortunately provided a concrete example of the possibilities for packet
loss, what you need to do to find it, and (hopefully) how to fix it :-), as
we have discussed before. The argus trace below covers 5 minutes of netperf
running at ~950 megabits per second. Rounding down to 800 megabits, that is
100 megabytes per second, or around 30 gigabytes over the 5 minute interval.
As we see from the argus trace, argus only thinks it saw about 18 gigabytes.
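The back-of-the-envelope arithmetic above can be checked quickly (a minimal
sketch; the 800 megabit figure is just the rounded-down rate from the text):

```shell
# ~950 Mbit/s rounded down to 800 Mbit/s; 5 minutes = 300 seconds
bits_per_sec=800000000
bytes_per_sec=$((bits_per_sec / 8))    # 100,000,000 bytes/s
total_bytes=$((bytes_per_sec * 300))   # bytes over the 5 minute run
echo "$total_bytes"                    # 30000000000, i.e. ~30 GB
```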
15 Oct 04 15:36:32 * tcp aaa.bb.cc.ddd.22 ?> aaa.bb.cc.eee.59940 55 89 8558 8146 CON
15 Oct 04 15:36:57 * tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 812577 387778 2997080894 25675796 CON
15 Oct 04 15:36:57 d tcp aaa.bb.cc.ddd.53615 -> aaa.bb.cc.eee.12865 6 4 916 788 CON
15 Oct 04 15:37:57 s tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 830294 390797 3156073012 25812974 CON
15 Oct 04 15:38:57 s tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 790515 383177 2799105842 25450438 CON
15 Oct 04 15:39:57 s tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 832011 393695 3171499418 25989794 CON
15 Oct 04 15:40:57 s tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 832456 392123 3175486124 25881510 CON
15 Oct 04 15:41:57 tcp aaa.bb.cc.ddd.53615 -> aaa.bb.cc.eee.12865 3 3 198 454 FIN
15 Oct 04 15:41:57 tcp aaa.bb.cc.ddd.53616 -> aaa.bb.cc.eee.63214 186 416 1660256 27456 FIN
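To total up what argus thinks it saw, the byte columns can be summed from the
end of each record (a rough sketch, not part of the original run: exact field
positions depend on your ra output configuration, and counting back from NF
sidesteps the optional flag column; `argus.txt` is a hypothetical file holding
output like the trace above):

```shell
# Sum the source and destination byte counts (the two fields just before the
# state column) across all tcp records. Fields are counted from the end of
# the line so records with and without a flag character line up.
awk '/ tcp / { src += $(NF-2); dst += $(NF-1) }
     END { printf "src bytes: %d  dst bytes: %d\n", src, dst }' argus.txt
```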
Yet a dump of the man (management) records says that pcap lost only
about 175 buffers to overwrite-before-read by the application, indicating
(at this point at least; more on that later) that the problem doesn't look
to be pcap loss.
15 Oct 04 15:27:41 man 229.97.122.203 v2.0 1858603 494 5796 0 477018 627 CON
15 Oct 04 15:32:41 man 229.97.122.203 v2.0 1859469 507 5984198 175 1912201834 587 CON
15 Oct 04 15:37:41 man 229.97.122.203 v2.0 1860265 459 5418640 0 1724936348 593 CON
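For reference, records like the ones above come from reading the argus data
back with the ra client; something like the following should select just the
management records (a sketch only; the exact filter syntax may vary between
argus versions, so check ra's man page for yours):

```shell
# Read a saved argus data file and print only the management (man) records,
# which carry the probe's own statistics, including the pcap drop counts.
ra -r argus.out - man
```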
So we then need to look at netstat -i to get the interface statistics.
Here we find our first source of packet loss: input errors on both of the gig
cards capturing on the link. This likely means we are running out of kernel
network buffers (and/or memory or bus bandwidth, which will be harder to deal
with if true), so the first step is going to be building a new kernel with
much larger network buffer sizes to try to cut this down and see what
happens.
%netstat -i
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
sk0 9000 <Link#1> 00:00:5a:9a:26:cc 138756567 5172740 0 0 0
sk1 9000 <Link#2> 00:00:5a:9a:25:1c 499147209 3508673 0 0 0
xl0 1500 <Link#3> 00:e0:81:20:c3:4c 36871843 0 2090017 1 0
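A first pass at enlarging the kernel network buffers can be made with sysctl
before resorting to a kernel rebuild (a sketch for FreeBSD of this vintage;
the knob names and the values shown are assumptions to check against your
release, not measured settings from this machine):

```shell
# Look at current mbuf cluster usage and the configured limit
netstat -m
sysctl kern.ipc.nmbclusters

# kern.ipc.nmbclusters is a boot-time tunable on older FreeBSD, so raising
# it may mean adding a line like the following to /boot/loader.conf and
# rebooting:
#   kern.ipc.nmbclusters="65536"

# Larger socket buffers can help high speed captures too
sysctl kern.ipc.maxsockbuf=8388608
```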
But now we also need to go back and think about the pcap loss number.
Since it wasn't 0, we are likely near the edge of overrunning the buffer and
experiencing high loss. The pcap buffer size has already been sysctl'ed to
the 500+K maximum currently allowed, which suggests that when the net buffers
are boosted it would also be wise to increase the maximum size of the pcap
buffer, since stopping the loss on the interfaces will likely just move it to
the pcap buffer. (The half meg pcap buffer is a lot of why I'm on FreeBSD;
the Linux one looked to be about 64K and not easily changeable except with a
recompile.)
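On FreeBSD the bpf buffer ceiling is itself a sysctl, so raising it should
not need a recompile (a sketch; these knob names existed on FreeBSD of this
era, but verify with `sysctl -a | grep bpf` on your system before relying on
them):

```shell
# Show the current default and maximum bpf buffer sizes
sysctl debug.bpf_bufsize debug.bpf_maxbufsize

# Raise both so libpcap can request a bigger capture buffer
sysctl debug.bpf_bufsize=1048576
sysctl debug.bpf_maxbufsize=4194304
```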
Figuring out a way to clear the error counters wouldn't be a bad idea
either, and of course buying gargoyle is likely the right solution to this
whole problem if this is decided to be useful, especially since an upgrade of
this link to 10GigE was funded this morning :-). I'll keep you posted on
progress (or lack of it, as the case may be :-)).
Peter Van Epp / Operations and Technical Support
Simon Fraser University, Burnaby, B.C. Canada