Argus server exits with "maximum errors exceeded 200000"
guy.dickinson at nyu.edu
Mon Dec 21 13:12:07 EST 2009
I wanted to provide the list with an update on this issue, which we seem
to have solved, or at least have made significant progress on. Here's
what we've found after several weeks of testing.
First, as suggested here, we reverted to a plain, 'off the shelf' build
of Argus with ArgusMaxListLength back to its original setting.
We then used argus to monitor itself on a separate instance and saw
approximately 1% loss/retransmission between the argus server and a ra()
client. Carter suggested this performance issue was the root cause of
all our issues and may be the result of a physical network problem.
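As an illustration only (this is not the tooling used in the tests described here), a loss figure like the ~1% above can be cross-checked on a Linux host from the kernel's TCP counters in /proc/net/snmp. The sketch below assumes the usual layout of that file (a "Tcp:" header line followed by a "Tcp:" value line); the function names are made up for this example.

```python
def parse_tcp_counters(snmp_text):
    """Return a dict of the Tcp counters found in /proc/net/snmp-style text."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    # First Tcp: line holds counter names, second holds the matching values.
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_percent(before, after):
    """RetransSegs as a percentage of OutSegs between two snapshots."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    if sent <= 0:
        return 0.0
    return 100.0 * retrans / sent


def read_tcp_counters(path="/proc/net/snmp"):
    """Snapshot the host-wide TCP counters (Linux only)."""
    with open(path) as handle:
        return parse_tcp_counters(handle.read())
```

Take one snapshot with read_tcp_counters(), run the transfer, take a second snapshot, and pass both to retransmit_percent(). The counters are host-wide, so this is only meaningful on an otherwise quiet machine.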
We then spent many hours running iperf between various hosts at
different points on our network to try to isolate any problematic
network hardware. We noticed that we could never establish a reliable
iperf connection between any pair of hosts spanning the subnet our
ra() client is on and the subnet the argus server is on.
We rebooted our ra() client system into a known-good, clean live CD
environment and were immediately able to get sustained network rates of
about 600 Mbit/sec with no TCP retransmissions. That is consistent with
performance across our network core, and it effectively ruled out a
physical network issue.
With extensive consultation from our network engineers, we were able to
track the issue down to a bug in the Cisco Firewall Services Modules
(FWSMs) in the access switches that each subnet is routed through,
related to the way that the FWSMs handle TCP SACK options when sequence
randomization is enabled. The packets were getting dropped by the local
iptables instance at each end of the connection because they didn't
match the "ANY, ESTABLISHED" rule that allows existing
statefully-tracked connections to pass.
The bug is only sparsely documented, although other folks have observed
it in various other scenarios. At this time there does not appear to be
a Cisco fix in place:
There are three possible workarounds to this issue:
1) Disable SACK options on both ends of the connection. This has a
negligible performance impact. In Linux you can do this in
/etc/sysctl.conf by adding:
net.ipv4.tcp_sack = 0
2) Similarly, you can instruct iptables/netfilter/conntrack to "Be
Liberal", which allows for some anomalies in the ESTABLISHED/RELATED
rule, although precisely what this does is poorly documented and it is
probably not recommended on security grounds. You can enable this,
however, by using
# sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
There's a caveat here for RHEL users, though: there's a bug in sysctl
that keeps this setting from persisting across reboots, even if you
put this directive in /etc/sysctl.conf. You'll have to add your own
line to /etc/rc.local or similar. We've reported this to Red Hat and you
can find the ticket here:
3) Finally, you can add static ACCEPT rules to your local firewall
before the ESTABLISHED/RELATED rule so that state-tracking doesn't come
into play. This is hard to scale and raises another security issue, so I
wouldn't recommend this either.
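To confirm that workaround 1 is actually in place across a set of hosts, a small check along these lines may help. It is a minimal sketch with hypothetical helper names: it only parses sysctl.conf-style text and assumes the last assignment of a key wins. The running value can be read separately from /proc/sys/net/ipv4/tcp_sack.

```python
def sysctl_conf_value(conf_text, key):
    """Return the last value assigned to key in sysctl.conf-style text, or None."""
    value = None
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line.startswith(("#", ";")):
            continue
        name, sep, setting = line.partition("=")
        if sep and name.strip() == key:
            value = setting.strip()
    return value


def sack_disabled(conf_text):
    """True if the text sets net.ipv4.tcp_sack to 0."""
    return sysctl_conf_value(conf_text, "net.ipv4.tcp_sack") == "0"
```

For example, sack_disabled(open("/etc/sysctl.conf").read()) reports whether the setting would be applied at boot on that host.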
Having decided on workaround 1, we've managed to keep a ra() session going
for about 96 hours now, which is probably old hat to most folks on this
list but a first for us.
Thanks to Carter for pointing us in the right direction.
Carter Bullard wrote:
> Hey Guy,
> Sorry for the delayed response.
> DAG cards don't really support a socket interface, so the
> is there to fool code farther down the line into thinking that the DAG
> card is up.
> This may chew up cycles, but it should not be a source of errors.
> If we get back to your original thread, you have a producer/consumer
> problem, in
> that argus() is generating more data than your clients can consume. The MAX
> QUEUE EXCEEDED messages are the clue. By adding more depth to those
> queues, you're just delaying the inevitable. We need to solve this
> turn your queue lengths down and all should be well.
> There are many issues that can cause a client to not perform well. If
> it's on the same box as argus(), it could be busy writing to disk, or
> argus() is so busy that the client never gets a time slice and just
> can't consume the load.
> If the client is on another box, the problem could be packet loss
> between argus() and the client program. Because argus uses TCP, it
> must retransmit data that is dropped. Loss can occur due to bad
> cables, out of scope network cards, and limited available network
> capacity. Argus() is a great tool to use to monitor its own transport
> connections. Have you looked at the argus data for the argus data
> transport TCP to see if it's losing packets?
> Also, if the client is on another box, the problem could be flow
> control. TCP allows the client to "shut the transmitter up", so to
> speak. This can happen if the disks that it's writing to (or the
> screen) slow it down such that it can't read the socket. Argus is
> also the tool of choice here. Look at the argus records for the
> transport stream, looking for "S" or "D" indicators in the flags
> field. These indicate source or destination flow control (you should
> see 'S's). If this is the case, you need to beef up your consumer, or
> use filters.
> When a reader can't keep up, argus has only one recourse, to close the
> connection and keep going. So your argus would do well, its just the
> would attach and leave and then attach and leave again. But, by increasing
> the queue length to 1M, you generated a situation where argus can encounter
> a critical error, like out of memory etc... and then it thinks it has to
> Let's lower the queue length back to 200K, and then try to figure out why your
> clients aren't consuming fast enough.
> On Nov 19, 2009, at 2:49 PM, Guy Dickinson wrote:
>> Greetings, Argus Developers and Subscribers:
>> For some time, I have been attempting to troubleshoot an argus server
>> instance sitting atop a ~1Gbps link which has presented some stability
>> issues. To date, I have had two issues, one which I think I have solved,
>> and one which remains open.
>> The first has been described before in a handful of mailing list
>> postings, not dissimilar to this one:
>> The argus server would run fine, but after a few hours of connection
>> from a ra client, it would disconnect without warning with the
>> "ArgusWriteOutSocket [...] max queue exceeded 100001" error. I was able
>> to suppress this error by changing the size of ArgusMaxListLength in
>> int ArgusMaxListLength = 1000000;
>> Now, however, I am beginning to see a different problem with the argus
>> server. After a day or so of a connected ra client, the argus server
>> exits with the debug message
>> argus: 19 Nov 09 14:19:28.712777 ArgusWriteOutSocket(0xad21b008)
>> maximum errors exceeded 200000
>> Could someone shed some light on these errors and what may be causing
>> them? While running the server with debug set to 1, I see these messages
>> a few times an hour:
>> argus: 19 Nov 09 11:48:12.456533 ArgusNewFlow() flow key is not
>> correct len equals zero
>> Client and Server Version: 3.0.2
>> Network Capture Hardware: Endace DAG 4.5G2
>> Client and Server OS: RHEL5.4
>> Capture Bandwidth: 700Mbit/sec - 1Gbps
>> Both the argus server and ra client are running on some fairly serious
>> hardware. The former is running on an Endace NinjaBox and the latter on
>> an 8-core box with an awful lot of memory.
>> Any help would be greatly appreciated.
>> Guy Dickinson
>> Guy Dickinson, Network Security Analyst
>> NYU ITS Technology Security Services
>> guy.dickinson at nyu.edu
>> (212) 998-3052
Guy Dickinson, Network Security Analyst
NYU ITS Technology Security Services
guy.dickinson at nyu.edu