Argus server exits with "maximum errors exceeded 200000"

Carter Bullard carter at qosient.com
Tue Dec 29 09:40:11 EST 2009


Hey Guy,
That is so cool that you found this problem.  This type of performance problem is
what I use argus() for all the time, and it's good to see that it worked for you as well!

Another common problem that I see when using argus() to debug network
performance issues is bugs in NAT devices.  A flow gets set up, and some arbitrary
time later the NAT device just deletes its cache entry for the active flow, and everything stops
from the perspective of the end system.  Sometimes the NAT device starts spitting
out unmodified (un-NATed) packets into the Internet ether.  I'm sure that Internet
Telescopes get a lot of this type of junk.

Anyway, very cool that all is working.

Carter

On Dec 21, 2009, at 1:12 PM, Guy Dickinson wrote:

> Hello,
> I wanted to provide the list with an update on this issue, which we seem
> to have solved, or at least have made significant progress on. Here's
> what we've found after several weeks of testing.
> 
> First, as suggested here, we reverted to a plain, 'off the shelf' build
> of Argus with ArgusMaxListLength back to its original setting.
> 
> We then used argus to monitor itself on a separate instance and saw
> approximately 1% loss/retransmission between the argus server and a ra()
> client. Carter suggested this performance issue was the root cause of
> all our issues and may be the result of a physical network problem.
> 
> We then spent many, many hours running iperf between various hosts at
> different points on our network to try to isolate any problematic
> network hardware. We noticed that we could never actually establish a
> reliable iperf connection between any host on the subnet our ra()
> client is on and any host on the subnet the argus server is on.
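> 
> For reference, the tests were plain TCP iperf runs between pairs of
> hosts, along these lines (hostnames here are just placeholders):
> 
>     iperf -s                            # on the receiving host
>     iperf -c other-host.example.edu     # on the sending host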
> 
> We rebooted our ra() client system into a known-good, clean live CD
> environment and were immediately able to get sustained network rates of
> 600 Mbit/sec or so with no TCP retransmissions, consistent with
> performance across our network core. This effectively ruled out a
> physical network issue.
> 
> With extensive consultation from our network engineers, we were able to
> track the issue down to a bug in the Cisco Firewall Services Modules
> (FWSMs) in the access switches that each subnet is routed through,
> related to the way the FWSMs handle TCP SACK options when sequence
> randomization is enabled: the FWSM rewrites the TCP sequence numbers
> but does not adjust the sequence numbers carried in the SACK options,
> so the end hosts see SACK blocks that fall outside the expected window.
> The packets were getting dropped by the local iptables instance at each
> end of the connection because they no longer matched the "ANY,
> ESTABLISHED" rule that allows existing statefully-tracked connections
> to pass.
> 
> The bug has only scant documentation, but other folks have observed the
> same issue in a variety of scenarios. At this time there does not
> appear to be a Cisco fix in place:
> 
> http://osdir.com/ml/security.firewalls.ipfilter/2004-06/msg00059.html
> http://lkml.indiana.edu/hypermail/linux/kernel/0707.3/2402.html
> 
> There are three possible workarounds to this issue:
> 1) Disable SACK options on both ends of the connection. This has a
> negligible performance impact. In Linux you can do this in
> /etc/sysctl.conf by adding:
> 
> net.ipv4.tcp_sack = 0
> 
> 2) Similarly, you can instruct iptables/netfilter/conntrack to "be
> liberal", which allows some anomalies through the ESTABLISHED/RELATED
> rule. What precisely this does is poorly documented, and it's probably
> not recommended on security grounds. You can enable it, however, with:
> 
> # sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
> 
> There's a caveat here for RHEL users, though: there's a bug in sysctl
> that keeps this setting from persisting across reboots, even if you put
> the directive in /etc/sysctl.conf. You'll have to add your own line to
> /etc/rc.local or similar. We've reported this to Red Hat, and you can
> find the ticket here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=493226
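> 
> In practice that just means a line roughly like this in /etc/rc.local
> (the path to sysctl may vary on your system):
> 
>     /sbin/sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1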
> 
> 3) Finally, you can add static ACCEPT rules to your local firewall
> before the ESTABLISHED/RELATED rule so that state tracking doesn't come
> into play. This is hard to scale and is yet another security concern,
> so I wouldn't recommend it either.
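> 
> For what it's worth, on the ra() client such a rule might look roughly
> like this (the server address and the default argus port, 561, are just
> placeholders for your environment), inserted ahead of the
> ESTABLISHED/RELATED rule:
> 
>     iptables -I INPUT 1 -p tcp -s 192.0.2.10 --sport 561 -j ACCEPT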
> 
> Having decided on solution 1, we've managed to keep a ra() session going
> for about 96 hours now, which is probably old-hat to most folks on this
> list but a first for us.
> 
> Thanks to Carter for pointing us in the right direction.
> 
> Regards,
> Guy
> 
> Carter Bullard wrote:
>> Hey Guy,
>> Sorry for the delayed response.
>> 
>> DAG cards don't really support a socket interface, so the
>> setArgusInterfaceStatus() is there to fool code farther down the line
>> into thinking that the DAG card is up.  This may chew up cycles, but it
>> should not be a source of errors.
>> 
>> If we get back to your original thread, you have a producer/consumer
>> problem, in that argus() is generating more data than your clients can
>> consume.  The MAX QUEUE EXCEEDED messages are the clue.  By adding more
>> depth to those queues, you're just delaying the inevitable.  Once we
>> solve this problem, turn your queue lengths back down and all should
>> be well.
>> 
>> There are many issues that can cause a client to not perform well.  If
>> it's on the same box as argus(), it could be busy writing to disk, or
>> argus() could be so busy that the client never gets a time slice and
>> just can't consume the load.
>> 
>> If the client is on another box, the problem could be packet loss
>> between argus() and the client program.  Because argus uses TCP, it
>> must retransmit data that is dropped.  Loss can occur due to bad
>> cables, out-of-spec network cards, and limited available network
>> capacity.  Argus() is a great tool to use to monitor its own transport
>> connections.  Have you looked at the argus data for the argus data
>> transport TCP to see if it's losing packets?
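>> 
>> Something along these lines would show it (the file name is a
>> placeholder, 561 is the default argus port, and the exact loss field
>> names may vary a bit between client versions):
>> 
>>    ra -r /path/to/argus.out -s stime saddr daddr loss ploss - tcp and port 561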
>> 
>> Also, if the client is on another box, the problem could be flow
>> control.  TCP allows the client to "shut the transmitter up", so to
>> speak.  This can happen if the disks that it's writing to (or the
>> screen) slow it down such that it can't read the socket.  Argus is also
>> the tool of choice here.  Look at the argus records for the transport
>> stream, looking for "S" or "D" indicators in the flags field.  These
>> indicate source or destination flow control (you should see 'S's).  If
>> this is the case, you need to beef up your consumer, or use filters.
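>> 
>> A minimal sketch of that check, again with the file name and the
>> default argus port 561 as placeholders:
>> 
>>    ra -r /path/to/argus.out -s stime flgs saddr sport daddr dport state - tcp and port 561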
>> 
>> When a reader can't keep up, argus has only one recourse: to close the
>> output connection and keep going.  So your argus would do well; it's
>> just that the clients would attach and leave, and then attach and leave
>> again.  But by increasing the queue length to 1M, you created a
>> situation where argus can encounter a critical error, like running out
>> of memory, and then it thinks it has to terminate.
>> 
>> Let's lower the queue length back to 200K, and then try to figure out
>> why your clients aren't consuming fast enough.
>> 
>> Carter
>> 
>> On Nov 19, 2009, at 2:49 PM, Guy Dickinson wrote:
>> 
>>> Greetings, Argus Developers and Subscribers:
>>> 
>>> For some time, I have been attempting to troubleshoot an argus server
>>> instance sitting atop a ~1Gbps link, which has presented some stability
>>> issues. To date, I have had two issues: one that I think I have solved,
>>> and one that remains open.
>>> 
>>> The first has been described before in a handful of mailing list
>>> postings, not dissimilar to this one:
>>> 
>>> http://thread.gmane.org/gmane.network.argus/5010/focus=5011
>>> 
>>> The argus server would run fine, but after a few hours of connection
>>> from a ra client, it would disconnect without warning with the
>>> "ArgusWriteOutSocket [...] max queue exceeded 100001" error. I was able
>>> to suppress this error by changing the size of ArgusMaxListLength in
>>> ArgusUtil.c:
>>> 
>>> int ArgusMaxListLength = 1000000;
>>> 
>>> Now, however, I am beginning to see a different problem with the argus
>>> server. After a day or so with a connected ra client, the argus server
>>> exits with the debug message:
>>> 
>>> argus[7386]: 19 Nov 09 14:19:28.712777 ArgusWriteOutSocket(0xad21b008)
>>> maximum errors exceeded 200000
>>> 
>>> Could someone shed some light on these errors and what may be causing
>>> them? While running the server with debug set to 1, I see these messages
>>> a few times an hour:
>>> 
>>> argus[7386]: 19 Nov 09 11:48:12.456533 ArgusNewFlow() flow key is not
>>> correct len equals zero
>>> 
>>> 
>>> Client and Server Version: 3.0.2
>>> Network Capture Hardware: Endace DAG 4.5G2
>>> Client and Server OS: RHEL5.4
>>> Capture Bandwidth: 700Mbit/sec - 1Gbps
>>> 
>>> Both the argus server and ra client are running on some fairly serious
>>> hardware. The former is running on an Endace NinjaBox and the latter on
>>> an 8-core box with an awful lot of memory.
>>> 
>>> Any help would be greatly appreciated.
>>> 
>>> Regards,
>>> Guy Dickinson
>>> 
>>> -- 
>>> ------------------
>>> Guy Dickinson, Network Security Analyst
>>> NYU ITS Technology Security Services
>>> guy.dickinson at nyu.edu
>>> (212) 998-3052
>>> 
>> 
> 
> 
> -- 
> ------------------
> Guy Dickinson, Network Security Analyst
> NYU ITS Technology Security Services
> guy.dickinson at nyu.edu
> (212) 998-3052
> 
