Argus server exits with "maximum errors exceeded 200000"

Guy Dickinson guy.dickinson at nyu.edu
Fri Nov 20 12:59:43 EST 2009


Thanks very much for the detailed response. I'll take each point one at
a time:

> Are you by chance writing records to disk on the argus server?
Nope, the server is strictly listening on a socket, and the client is
essentially just printing records to STDOUT so another process can pick
them up and parse them into another format. During testing, I generally
have ra write to /dev/null. Note that the process connected to ra's
STDOUT isn't writing flow data to disk either.

> It would probably be a good bet to run a netperf/iperf
> test between the ninja and the ra box

iperf rates the connection at somewhere between 280 and 300 Mbit/s,
which sounds about right. I'd be quite surprised if it were really a
network issue between the argus server and ra box; our internal network
is quite robust.

> It may be profitable to change the 6 to a 1 at line 1443 in
> argus/ArgusUtil.c i.e. [...]

I've made that debug change and am currently running the server as such.
Because this error takes some time to surface I'll post the results as
soon as I can duplicate the error.

While I was making the debugging change, however, I did notice that I
was getting the following notice in bursts of about 100 every couple of
seconds:

argus[14706]: 20 Nov 09 12:19:30.312082 setArgusInterfaceStatus(1)

I assume it to be harmless but thought I'd mention it for the sake of
diligence.

Also, just for completeness, I changed the debug level on line 1718 of
ArgusUtil.c:

ArgusDebug (1, "ArgusWriteOutSocket (0x%x) %d records waiting. returning %d\n", asock, list->count, retn);

list->count is almost always 0, with short bursts to no more than 2000,
maybe every 10 minutes for less than a second.
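
For anyone following along, the counter behaviour Peter describes
(reset on every successful write, give up after too many consecutive
failures) can be sketched roughly as follows. This is only an
illustration and not the argus code: write_with_retry() and
MAX_CONSECUTIVE_ERRORS are made-up names, and the limit is simply the
number that appears in the log message.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical limit, copied from the "maximum errors exceeded 200000"
 * log message; not taken from the argus source. */
#define MAX_CONSECUTIVE_ERRORS 200000

/* Write len bytes to fd, retrying writes interrupted by a signal (EINTR).
 * *errcount counts consecutive writes that made no progress and is reset
 * to 0 whenever some bytes are written.
 * Returns 0 on success, -1 on a hard error or once the limit is exceeded. */
int write_with_retry(int fd, const char *buf, size_t len, int *errcount)
{
    size_t off = 0;

    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);

        if (n > 0) {
            off += (size_t)n;   /* progress: advance and clear the counter */
            *errcount = 0;
            continue;
        }
        if (n < 0 && errno != EINTR)
            return -1;          /* anything other than EINTR is a hard error */
        if (++*errcount > MAX_CONSECUTIVE_ERRORS)
            return -1;          /* too many failures in a row */
    }
    return 0;
}
```

With a counter that persists across calls, a function like this only
gives up if write() fails that many times in a row with no successful
write in between, which is why hitting the limit looks odd.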


> Do you know if the tcp buffers in your kernel have been boosted way
> up and window scaling enabled on both machines?

Both systems are running the standard RHEL5 kernel, running just as it
came out of the box. I can investigate tuning it a bit further but I'm a
little hesitant to do so until I can get some testbed hardware online
which may take a little while.
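
As a rough sketch of the window arithmetic behind that question: the
helper names below are hypothetical, and I am assuming the quoted
22 msec figure is a round-trip time. 65535 bytes is the largest window
TCP can advertise without window scaling.

```c
/* Bandwidth-delay product: the number of bytes that must be in flight
 * to keep a link of the given rate busy at the given round-trip time. */
double bdp_bytes(double bits_per_sec, double rtt_sec)
{
    return bits_per_sec * rtt_sec / 8.0;
}

/* Upper bound on TCP throughput (bits per second) when the sender is
 * limited to a fixed window of window_bytes per round trip. */
double window_limited_bps(double window_bytes, double rtt_sec)
{
    return window_bytes * 8.0 / rtt_sec;
}
```

Under those assumptions, bdp_bytes(995e6, 0.022) comes to roughly
2.7 MB of buffer, while window_limited_bps(65535, 0.022) caps out near
24 Mbit/s. At a sub-millisecond LAN round-trip time the same window
allows far more, which fits Peter's caveat about latency.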

>> argus[7386]: 19 Nov 09 11:48:12.456533 ArgusNewFlow() flow key is not
>> correct len equals zero
>> 
> 	I think this one points at an argus bug. It is desirable (but possibly
> not easy on a link as fast as yours) to get a copy of the pcap buffer that
> is causing this to Carter. 

I'll work on extracting a pcap which generates this error. My gdb-fu is
a little weak but I'll see if I can get something to you guys fairly
soon. Our network, like many academic networks, is rife with bizarre
protocols and misconfigured devices so it could easily be one of those.
I'll see what I can track down.

Many thanks again for your assistance!
-Guy


Peter Van Epp wrote:
> On Thu, Nov 19, 2009 at 02:49:18PM -0500, Guy Dickinson wrote:
>> Greetings, Argus Developers and Subscribers:
>>
>> For some time, I have been attempting to troubleshoot an argus server
>> instance sitting atop a ~1Gbps link which has presented some stability
>> issues. To date, I have had two issues, one which I think I have solved,
>> and one which remains open.
>>
>> The first has been described before in a handful of mailing list
>> postings, not dissimilar to this one:
>>
>> http://thread.gmane.org/gmane.network.argus/5010/focus=5011
>>
>> The argus server would run fine, but after a few hours of connection
>> from a ra client, it would disconnect without warning with the
>> "ArgusWriteOutSocket [...] max queue exceeded 100001" error. I was able
>> to suppress this error by changing the size of ArgusMaxListLength in
>> ArgusUtil.c:
>>
>> int ArgusMaxListLength = 1000000;
>>
> 
> 	This has likely just increased the time before the error occurs :-).
> What the original message is saying is that the argus output task isn't able
> to move the data out the output socket fast enough and the queue is building. 
> If that's just a traffic burst then it isn't serious as it will go back down
> again, but if the sustained traffic is more than the output can handle you 
> will eventually hit another limit (running out of memory for instance). Are
> you by chance writing records to disk on the argus server? That may cause this
> (although on a ninja perhaps not too, depending if it has a capture disk 
> subsystem installed). It would probably be a good bet to run a netperf/iperf
> test between the ninja and the ra box to make sure the interbox link has 
> decent throughput (although the argus output from a gig link should only be 
> around 100 megs or less, if there is something wrong on the intermachine link
> that is slowing it down that will do this). 
> 
>> Now, however, I am beginning to see a different problem with the argus
>> server. After a day or so of a connected ra client, the argus server
>> exits with the debug message
>>
>> argus[7386]: 19 Nov 09 14:19:28.712777 ArgusWriteOutSocket(0xad21b008)
>> maximum errors exceeded 200000
>>
> 
> 	This one looks to be too many EINTR returns from the output socket 
> (which is pointing to a problem of some kind, perhaps not enough tcp buffers
> available). It looks to come from argus/ArgusUtil.c
> around line 1608 (but my argus is a bit old too so YMMV :-)). A quick google
> search tells me that this is an error return for a system call interrupted 
> by a signal in Linux and the solution is to retry the operation. It looks
> like asock->errornum is set to 0 every time there is a successful write to the
> socket so something odd seems to be going on in your case as 200K signals 
> without a successful write seems a lot (unless there is a bug and a code path
> where the counter isn't being reset correctly of course)!
> 	It may be profitable to change the 6 to a 1 at line 1443 in
> argus/ArgusUtil.c i.e.:
> 
> ArgusDebug (6,"ArgusWriteSocket: write returned %d, errno %d\n",retn, errno);
> 
> to
> 
> ArgusDebug (1,"ArgusWriteSocket: write returned %d, errno %d\n",retn, errno);
> 
> so it shows at your debug level (but be aware there may be a lot of these
> which may cause problems too :-)). Alternately waiting to see if Carter can
> think of any reason why this may be happening may be the best bet :-). 
> 	Do you know if the tcp buffers in your kernel have been boosted way
> up and window scaling enabled on both machines? At full gig (at least on high 
> latency wide area links) you will get window starvation and throughput problems
> (~35 megabits per second on a 22 msec latency lightpath that would support 
> 995 megabits per second with proper kernel tuning, although I wouldn't expect
> your latency is anywhere near that). I don't have the numbers our HPC guys used
> to hand but can get them if needed. If you have an ndt server nearby and
> a web browser on the box (which may be more of a problem) its diagnostic 
> screen will tell you what and how much performance is being limited and point
> at what you need to boost. 
>  
>> Could someone shed some light on these errors and what may be causing
>> them? While running the server with debug set to 1, I see these messages
>> a few times an hour:
>>
>> argus[7386]: 19 Nov 09 11:48:12.456533 ArgusNewFlow() flow key is not
>> correct len equals zero
>>
> 
> 	I think this one points at an argus bug. It is desirable (but possibly
> not easy on a link as fast as yours) to get a copy of the pcap buffer that
> is causing this to Carter. 
> 	If you can stand a crash of argus the easiest way to debug this one is 
> to rebuild argus with .debug and .devel flags (which will slow it down a bit)
> and then intentionally cause a segfault by attempting to assign a null pointer
> and use gdb on the core file to look at the data structures. In this 
> case that would be argus/ArgusModeler.c line 1739 although this may be too 
> late to see the input packet because I think this is processing off the queue.
> 	These should be logged to syslog as well, so if they happen reasonably
> regularly and your ninja can (and is allowed to, which is a whole different
> can of worms :-)) do full speed capture you may be able to get a pcap file 
> that will cause the error when fed to argus. Then filtering out things until
> you can isolate the packet causing the error can get it fixed (probably a 
> protocol argus doesn't recognize or a bug in one of the decoders). 
> 
>> Client and Server Version: 3.0.2
>> Network Capture Hardware: Endace DAG 4.5G2
>> Client and Server OS: RHEL5.4
>> Capture Bandwidth: 700Mbit/sec - 1Gbps
>>
>> Both the argus server and ra client are running on some fairly serious
>> hardware. The former is running on an Endace NinjaBox and the latter on
>> an 8-core box with an awful lot of memory.
>>
>> Any help would be greatly appreciated.
>>
>> Regards,
>> Guy Dickinson
>>
>> -- 

-- 
------------------
Guy Dickinson, Network Security Analyst
NYU ITS Technology Security Services
guy.dickinson at nyu.edu



