Argus server exits with "maximum errors exceeded 200000"

Peter Van Epp vanepp at sfu.ca
Thu Nov 19 19:16:01 EST 2009


On Thu, Nov 19, 2009 at 02:49:18PM -0500, Guy Dickinson wrote:
> Greetings, Argus Developers and Subscribers:
> 
> For some time, I have been attempting to troubleshoot an argus server
> instance sitting atop a ~1Gbps link which has presented some stability
> issues. To date, I have had two issues, one which I think I have solved,
> and one which remains open.
> 
> The first has been described before in a handful of mailing list
> postings, not dissimilar to this one:
> 
> http://thread.gmane.org/gmane.network.argus/5010/focus=5011
> 
> The argus server would run fine, but after a few hours of connection
> from a ra client, it would disconnect without warning with the
> "ArgusWriteOutSocket [...] max queue exceeded 100001" error. I was able
> to suppress this error by changing the size of ArgusMaxListLength in
> ArgusUtil.c:
> 
> int ArgusMaxListLength = 1000000;
> 

	This has likely just increased the time before the error occurs :-).
What the original message is saying is that the argus output task isn't able
to move the data out the output socket fast enough, so the queue is building.
If that's just a traffic burst then it isn't serious, as the queue will drain
again, but if the sustained traffic is more than the output can handle you
will eventually hit another limit (running out of memory, for instance). Are
you by chance writing records to disk on the argus server? That may cause this
(although on a ninja perhaps not, depending on whether it has a capture disk
subsystem installed). It would probably be a good bet to run a netperf/iperf
test between the ninja and the ra box to make sure the interbox link has
decent throughput. The argus output from a gig link should only be around
100 megs or less, but if something is wrong on the intermachine link and
slowing it down, that will do this.
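
	To put a rough number on "increased the time before the error occurs",
here is a minimal sketch of the arithmetic. The function name and the rates in
the usage note are made up for illustration; they are not taken from argus.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds until a queue that starts empty reaches `limit` entries when
 * records arrive at in_per_sec and are drained at out_per_sec.
 * Returns -1 if the queue never fills (drain rate >= arrival rate).
 * Hypothetical helper for illustration only, not an argus function.
 */
int64_t seconds_until_overflow(int64_t limit, int64_t in_per_sec,
                               int64_t out_per_sec)
{
    assert(limit >= 0);
    if (in_per_sec <= out_per_sec)
        return -1;                          /* the output keeps up */
    int64_t growth = in_per_sec - out_per_sec;
    return (limit + growth - 1) / growth;   /* round up to whole seconds */
}
```

With an assumed sustained surplus of 1000 records per second, a limit of
100000 is reached after 100 seconds and a limit of 1000000 after 1000 seconds.
The bigger limit only helps if the surplus is a burst that ends before then.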

> Now, however, I am beginning to see a different problem with the argus
> server. After a day or so of a connected ra client, the argus server
> exits with the debug message
> 
> argus[7386]: 19 Nov 09 14:19:28.712777 ArgusWriteOutSocket(0xad21b008)
> maximum errors exceeded 200000
> 

	This one looks to be too many EINTR returns from the output socket
(which points to a problem of some kind, perhaps not enough tcp buffers
available). It looks to come from argus/ArgusUtil.c
around line 1608 (but my argus is a bit old too so YMMV :-)). A quick google
search tells me that this is the error return for a system call interrupted
by a signal in Linux and the solution is to retry the operation. It looks
like asock->errornum is set to 0 every time there is a successful write to the
socket, so something odd seems to be going on in your case, as 200K signals
without a successful write seems a lot (unless there is a bug and a code path
where the counter isn't being reset correctly, of course)!
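
	For reference, the general retry-on-EINTR pattern with a counter that
is reset on success looks something like the sketch below. This is not the
argus code: write_all_retry and MAX_CONSECUTIVE_ERRORS are names invented
here, and the limit is simply the number from the log message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical limit, chosen to match the number in the log message. */
#define MAX_CONSECUTIVE_ERRORS 200000

/*
 * Write all of buf to fd, retrying whenever write() is interrupted by a
 * signal (EINTR). *errors counts consecutive interrupted attempts and is
 * reset to 0 after every successful write.
 * Returns 0 when everything was written, -1 on any other error or when
 * the consecutive-error limit is exceeded.
 */
int write_all_retry(int fd, const char *buf, size_t len, long *errors)
{
    assert(buf != NULL && errors != NULL);
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n > 0) {
            done += (size_t)n;
            *errors = 0;                  /* success clears the counter */
        } else if (n < 0 && errno == EINTR) {
            if (++*errors > MAX_CONSECUTIVE_ERRORS)
                return -1;                /* too many retries in a row */
        } else {
            return -1;                    /* hard error, give up */
        }
    }
    return 0;
}
```

With a loop of this shape the limit can only be hit if the counter is bumped
200K times with no successful write in between, which is why the message
looks odd.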
	It may be profitable to change the 6 to a 1 at line 1443 in
argus/ArgusUtil.c i.e.:

ArgusDebug (6,"ArgusWriteSocket: write returned %d, errno %d\n",retn, errno);

to

ArgusDebug (1,"ArgusWriteSocket: write returned %d, errno %d\n",retn, errno);

so it shows at your debug level (but be aware there may be a lot of these,
which may cause problems too :-)). Alternatively, waiting to see if Carter can
think of any reason why this may be happening may be the best bet :-).
	Do you know if the tcp buffers in your kernel have been boosted way
up and window scaling enabled on both machines? At full gig (at least on high
latency wide area links) you will otherwise get window starvation and
throughput problems
(~35 megabits per second on a 22 msec latency lightpath that would support
995 megabits per second with proper kernel tuning, although I wouldn't expect
your latency is anywhere near that). I don't have the numbers our HPC guys used
to hand but can get them if needed. If you have an ndt server nearby and
a web browser on the box (which may be more of a problem), its diagnostic
screen will tell you what is limiting performance and by how much, and point
at what you need to boost.
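
	The window starvation arithmetic can be sketched as below. It assumes
the 22 msec figure is the round trip time and ignores protocol overhead and
loss; both function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bandwidth-delay product: bytes that must be in flight to keep a path of
 * the given rate full for one round trip. rtt_usec is in microseconds.
 */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_usec)
{
    return bits_per_sec * rtt_usec / 8000000ULL;
}

/*
 * Best-case throughput in bits per second when at most window_bytes can
 * be in flight per round trip.
 */
uint64_t window_limited_bps(uint64_t window_bytes, uint64_t rtt_usec)
{
    assert(rtt_usec > 0);
    return window_bytes * 8000000ULL / rtt_usec;
}
```

Under those assumptions bdp_bytes(995000000, 22000) gives 2736250, so
roughly 2.7 megabytes of buffer are needed to fill the path, while
bdp_bytes(35000000, 22000) gives 96250, i.e. about 96 kilobytes in flight
matches the ~35 megabit figure. A 65535 byte window without scaling works
out to window_limited_bps(65535, 22000) = 23830909, about 24 megabits per
second.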
 
> Could someone shed some light on these errors and what may be causing
> them? While running the server with debug set to 1, I see these messages
> a few times an hour:
> 
> argus[7386]: 19 Nov 09 11:48:12.456533 ArgusNewFlow() flow key is not
> correct len equals zero
> 

	I think this one points at an argus bug. It is desirable (but possibly
not easy on a link as fast as yours) to get a copy of the pcap buffer that
is causing this to Carter. 
	If you can stand a crash of argus the easiest way to debug this one is
to rebuild argus with .debug and .devel flags (which will slow it down a bit)
and then intentionally cause a segfault by attempting to write through a null
pointer, and use gdb on the core file to look at the data structures. In this
case that would be argus/ArgusModeler.c line 1739, although this may be too
late to see the input packet because I think this is processing off the queue.
	These should be logged to syslog as well, so if they happen reasonably
regularly and your ninja can do (and is allowed to do, which is a whole
different can of worms :-)) full speed capture, you may be able to get a pcap
file that will cause the error when fed to argus. Then filtering things out
until you can isolate the packet causing the error can get it fixed (probably
a protocol argus doesn't recognize or a bug in one of the decoders).
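
	The "filter things out until you isolate the packet" step is in
effect a bisection. A minimal sketch follows, assuming a single packet
triggers the error on its own (which may not hold if the error depends on
earlier packets in the flow) and that you have some way to replay a slice of
the capture. The callback type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical callback: returns nonzero if replaying packets
 * [first, first + count) from the capture reproduces the error.
 */
typedef int (*reproduces_fn)(size_t first, size_t count, void *ctx);

/*
 * Halve the range until one packet is left. Assumes the full capture
 * reproduces the error and exactly one packet is responsible.
 * Returns the zero-based index of that packet.
 */
size_t isolate_packet(size_t total, reproduces_fn reproduces, void *ctx)
{
    assert(total > 0 && reproduces != NULL);
    size_t first = 0, count = total;

    while (count > 1) {
        size_t half = count / 2;
        if (reproduces(first, half, ctx)) {
            count = half;          /* culprit is in the lower half */
        } else {
            first += half;         /* culprit is in the upper half */
            count -= half;
        }
    }
    return first;
}

/*
 * Stand-in for a real replay, for demonstration only: ctx points at the
 * index of the bad packet.
 */
int demo_reproduces(size_t first, size_t count, void *ctx)
{
    size_t bad = *(const size_t *)ctx;
    return bad >= first && bad < first + count;
}
```

In practice the callback would cut the slice out of the pcap file, feed it to
argus and check for the message, so each step costs one replay and a capture
of N packets needs about log2(N) of them.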

> 
> Client and Server Version: 3.0.2
> Network Capture Hardware: Endace DAG 4.5G2
> Client and Server OS: RHEL5.4
> Capture Bandwidth: 700Mbit/sec - 1Gbps
> 
> Both the argus server and ra client are running on some fairly serious
> hardware. The former is running on an Endace NinjaBox and the latter on
> an 8-core box with an awful lot of memory.
> 
> Any help would be greatly appreciated.
> 
> Regards,
> Guy Dickinson
> 
> -- 
> ------------------
> Guy Dickinson, Network Security Analyst
> NYU ITS Technology Security Services
> guy.dickinson at nyu.edu
> (212) 998-3052

Peter Van Epp


