[Ntop-misc] Direction and IP/TCP timeout settings

Craig Merchant cmerchant at responsys.com
Fri Jul 26 17:10:08 EDT 2013


I'm running the argus clients that you released on the 22nd...  3.0.7.12

I'll send you all of the configs and scripts we use offline.  I'll also send you syslog files with all of the argus-related events in them.

The weird thing I just noticed is that radium and rastream start normally.  At the end of each five minute interval, rastream kicks off a script that runs racluster and feeds our data into Splunk.  It will run normally for a few to many iterations and then I start seeing messages from racluster that it can't find the file that rastream tells it to process.  That behavior is new since the latest version and nothing about those scripts has changed...

Thanks for all your help, Carter.  Let me know if there is anything I can do.

Craig

From: Carter Bullard [mailto:carter at qosient.com]
Sent: Friday, July 26, 2013 8:52 AM
To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [Ntop-misc] [ARGUS] Direction and IP/TCP timeout settings

So, the max queue exceeded message indicates that an argus data reader was attached
to the radium, but it did not process records fast enough, and the output queue hit its
buffer limit, 0.5M records.   Not a show stopper, but does indicate that you were
generating more records than your client could consume.
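As a rough back-of-the-envelope illustration of the point above (not radium's actual logic): if records arrive faster than the client drains them, the backlog grows at the difference of the two rates. The rates below are made-up numbers; only the 500,000-record limit comes from the log message.

```python
QUEUE_LIMIT = 500_000  # records; the limit reported in the radium log line


def seconds_until_overflow(in_rate, out_rate, limit=QUEUE_LIMIT):
    """Return how long a backlog takes to reach `limit`, or None if it never does.

    in_rate and out_rate are records per second (hypothetical values).
    """
    backlog_growth = in_rate - out_rate
    if backlog_growth <= 0:
        return None  # the client keeps up, so the queue does not grow
    return limit / backlog_growth


# Hypothetical: radium emits 12,000 records/s, the client consumes 10,000/s.
print(seconds_until_overflow(12_000, 10_000))  # 250.0
```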

The racluster problems, didn't we fix those with the absolute most recent version of the clients ???


Carter

On Jul 25, 2013, at 9:48 PM, Craig Merchant <cmerchant at responsys.com> wrote:


Hey, Carter...

I was checking my logs and I found a couple of errors from radium and racluster:

Jul 25 06:12:00 10.230.174.40 Jul 25 13:12:00 ids-manager01-dc1 radium[5831]:13:12:00.035300 ArgusWriteOutSocket(0xfc09ac30) max queue exceeded 500001

Jul 25 16:17:08 10.230.174.40 Jul 25 23:17:08 ids-manager01-dc1 kernel: racluster[15194]general protection ip:467780 sp:7fffbef3d9c0 error:0 in racluster[400000+aa000]

Jul 25 14:03:33 10.230.174.40 Jul 25 21:03:33 ids-manager01-dc1 kernel: racluster[15475]:segfault at ac ip 000000000045f994 sp 00007fff63b4e640 error 4 in racluster[400000+aa000]

Jul 25 01:45:25 10.230.174.40 Jul 25 08:45:25 ids-manager01-dc1 kernel: racluster[8324]:segfault at 0 ip (null) sp 00007fff220d4860 error 14 in racluster[400000+aa000]

Not quite sure what these mean, but just wanted to send them your way...
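For reading the kernel "segfault" lines above, here is a small sketch that pulls out the fields and decodes the error code. It assumes the usual x86 Linux format, where the error value is printed in hex and its bits are the page-fault error-code flags; the helper name and the returned dictionary are just for illustration. The "general protection" line has a different layout and is not handled.

```python
import re

# x86 page-fault error-code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x10, "instruction fetch", None),
]

SEGV = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>\(null\)|[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the decoded fields of a kernel segfault line, or None if no match."""
    m = SEGV.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # assumption: the kernel prints the code in hex
    flags = [on if err & bit else off for bit, on, off in PF_BITS]
    ip = None if m["ip"] == "(null)" else int(m["ip"], 16)
    return {
        "fault_addr": int(m["addr"], 16),
        "flags": [f for f in flags if f],
        # Offset of the faulting instruction from the mapping base in brackets.
        "offset": None if ip is None else ip - int(m["base"], 16),
    }
```

Under those assumptions, "error 4" reads as a user-mode read of an unmapped address (0xac, which looks like a field access through a NULL pointer), and "error 14" with "ip (null)" reads as a jump through a NULL function pointer. The offset is the kind of value one would hand to a symbol lookup tool against an unstripped racluster binary.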

Thx.

Craig

From: Carter Bullard [mailto:carter at qosient.com]
Sent: Thursday, July 25, 2013 10:11 AM
To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [Ntop-misc] [ARGUS] Direction and IP/TCP timeout settings

Hey Craig,
Did you get a chance to change the timeout value for nanosleep() to see if
that helped your CPU utilization ??
All is well and happy ????

Carter

On Jul 23, 2013, at 3:24 PM, Craig Merchant <cmerchant at responsys.com> wrote:



I've successfully compiled the 3.0.7.4 version of argus on both my sensors.  I added the ARGUS_FAR_STATUS_INTERVAL=5 to /etc/argus.conf.  I checked the /root/argus-3.0.7.4/support/Config/argus.conf file for the ARGUS_FAR_STATUS_INTERVAL (and any other new config options), but it wasn't in the file.  Argus started up just fine.

The percentage of flows that Argus can't determine the direction of is about 20%, which is dramatically better than the 40-60% it was doing with previous versions.  The CPU utilization is still really high (90-100% most of the time).  Are there any changes to the ARGUS_FAR_STATUS_INTERVAL that you think would improve it further?

I downloaded the 3.0.7.12 version of the clients and ran configure:  ./configure --with-GeoIP=yes

When I ran make, I got the following error:

In file included from ./raclient.c:48:
./rasqlinsert.h:87:31: error: readline/readline.h: No such file or directory
./raclient.c: In function 'RaProcessEventRecord':
./raclient.c:1717: error: 'Bytef' undeclared (first use in this function)
./raclient.c:1717: error: (Each undeclared identifier is reported only once
./raclient.c:1717: error: for each function it appears in.)
./raclient.c:1717: error: expected expression before ')' token
make[2]: *** [raclient.o] Error 1
make[2]: Leaving directory `/root/argus-clients-3.0.7.12/examples/ramysql'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/root/argus-clients-3.0.7.12/examples'
make: *** [all] Error 2

Thanks.

Craig


From: Carter Bullard [mailto:carter at qosient.com]
Sent: Monday, July 22, 2013 6:50 PM
To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [Ntop-misc] [ARGUS] Direction and IP/TCP timeout settings

Well you should have used the ./support/Config/argus.conf file as a starter
configuration, and it has that variable.  The default is 5 seconds.

You should definitely grab argus-3.0.7.4 and try that.
Grab the current argus-latest.tar.gz.

Carter

On Jul 22, 2013, at 9:43 PM, Craig Merchant <cmerchant at responsys.com> wrote:




I do...  And it looks like the majority of them have direction problems...

My argus.conf doesn't have that setting in it - and neither does /root/argus-3.0.7.3/support/Config/argus.conf.  Is that a configuration option new to the release you just posted today?  I haven't had a chance to download and install it yet.

What about the ARGUS_ENV="PCAP_MEMORY=300000" setting?  I see it's disabled in the default argus.conf file.  If I want to use pf_ring, is there any way that setting could be impacting things?

Thx.

C

From: Carter Bullard [mailto:carter at qosient.com]
Sent: Monday, July 22, 2013 6:12 PM
To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [Ntop-misc] [ARGUS] Direction and IP/TCP timeout settings

The newest version of argus on the dev server fixes the bug you reported where argus seg faults on your packet file.  The bug was introduced when we added larger timeout values trying to fix your problem.

Do any of your records have a " dur gt 5 "  assuming your ARGUS_FAR_STATUS_INTERVAL is 5 seconds ?

Carter

On Jul 22, 2013, at 8:14 PM, Craig Merchant <cmerchant at responsys.com> wrote:
Hey, Carter...

I ran a search on the last 200,000 records that had a "?" in the direction field and only about 7% of them had a "g" in the flags.  If gaps in the packets were the problem - whether from an overloaded port, driver, or asymmetric flows (we are using a pair of Cisco VSS switches, but the NetOps team swears that the SPAN port sees all traffic from both switches) - wouldn't we expect that number to be a lot higher?
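The check described above can be sketched roughly as follows. This is only an illustration: it assumes each record has already been reduced to a (flgs, dir) pair of strings, that a gap shows up as a "g" in the flgs string, and that an undetermined direction shows up as a "?" in the dir string. The function name is made up.

```python
def gap_share(records):
    """Fraction of unknown-direction records that also carry the gap flag.

    records: iterable of (flgs, dir) string pairs (an assumed input format).
    """
    unknown = [flgs for flgs, direction in records if "?" in direction]
    if not unknown:
        return 0.0
    with_gaps = sum(1 for flgs in unknown if "g" in flgs)
    return with_gaps / len(unknown)


# Hypothetical sample: four unknown-direction flows, one of them with gaps.
sample = [
    (" e g  ", " ?>"),
    (" e    ", "<?>"),
    (" e    ", "<? "),
    (" e    ", " ?>"),
    (" e    ", " ->"),  # direction known, ignored
]
print(gap_share(sample))  # 0.25
```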

In your example "ra -S argus.source -M xml - man", can ra read from radium or can it only read from a file?  I presume both are supported since you used the -S switch instead of -r, but when I run it against my radium instance, the command never exits or displays any results.  Do I need to specify an interval for ra to connect?

While I'm doing this testing, I'm running one host with pf_ring and one with the normal Intel ixgbe driver and the directional issues are pretty much even across both hosts.  I've tried connecting my raclients to the argus instances directly (and thus not using radium), but the results are pretty much the same.

When you refer to modifying "sleep timeouts", what configuration option are you referring to?  Is that the IP/TCP timeouts in argus.conf?  I looked through argus.conf, radium.conf, and rarc.conf for "sleep" and didn't find anything...

As for hard-coding destination ports...  Any kind of CSV file or iana-formatted file that you use for ralabel would be easy for me to work with.

Did you have a chance to look at the tcpdump I sent you and see how well Argus picks out the direction from the flows?

Thx.

Craig

From: Carter Bullard [mailto:carter at qosient.com]
Sent: Monday, July 22, 2013 1:40 PM
To: Craig Merchant
Subject: Re: [Ntop-misc] [ARGUS] Direction and IP/TCP timeout settings

Hey Craig,
So, the whole point to this exercise has been to determine if
you are not getting all the packets from the wire, because
you think you are seeing too many " ? " in your TCP direction
field.

When the sensor doesn't see all the packets that it can,
the most important indicator is a " g " in the flgs field.
This indicates that there are packet gaps that the flow
modeler has detected, which are sequence numbers never seen.
You should be seeing " g "s if random packet loss from
the wire to argus is occurring.

If this was/is the case, then changing the sleep timeouts should
help a great deal in reducing the occurrence of " g "s and the
mystery of the apparent lack of SYN and SYN_ACKs would be solved.

If not, but argus is still not reporting all the direction
that you think it should, then selective loss of the SYN
and SYN_ACK packets is a possibility.

pf_ring would be a most natural place to point the finger, in this case.

The argus "man" record reports libpcap packet drop stats,
which count the number of packets that were received and
ready for processing, but were not read.  You can print that
number like this:

   ra -S argus.source -M xml - man

And you will get something like this:

<?xml version ="1.0" encoding="UTF-8"?>
<!--Generated by ra(3.0.7.12) QoSient, LLC-->
<ArgusDataStream
  xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation = "http://qosient.com/argus/Xml/ArgusRecord.3.0.xsd"
  BeginDate = "2013-07-15T13:56:43.109557" CurrentDate = "2013-07-22T16:33:37.086812"
  MajorVersion = "3" MinorVersion = "0" InterfaceType = "DLT_NULL" InterfaceStatus = "Up"
  ArgusSourceId = "192.168.0.68"  NetAddr = "0.0.0.0"  NetMask = "0.0.0.0">

 <ArgusManagementRecord  StartTime = "2013-07-22T16:33:36.982927" Duration = "614213.875000" Flags = "         " Proto = "man" PktsRcvd = "0" Records = "0" BytesRcvd = "0" PktsDropped = "0" State = "STA"></ArgusManagementRecord>
 <ArgusManagementRecord  StartTime = "2013-07-22T16:33:43.194437" Duration = "60.101017" Flags = "         " Proto = "man" PktsRcvd = "52114" Records = "57" BytesRcvd = "47541540" PktsDropped = "0" State = "CON"></ArgusManagementRecord>

The PktsDropped value is something to look for.
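If you want to pull that number out of the XML programmatically, something along these lines should work. It is a sketch only: it assumes each management record sits on one line as in the output above, and it parses records line by line so that a stream without a closing ArgusDataStream tag is not a problem. The function name is made up.

```python
import xml.etree.ElementTree as ET


def drop_stats(lines):
    """Return (StartTime, PktsRcvd, PktsDropped) for each management record line."""
    stats = []
    for line in lines:
        line = line.strip()
        if not line.startswith("<ArgusManagementRecord"):
            continue  # skip the XML header and the ArgusDataStream element
        rec = ET.fromstring(line)
        stats.append(
            (rec.get("StartTime"), int(rec.get("PktsRcvd")), int(rec.get("PktsDropped")))
        )
    return stats


# One record from the output above, trimmed to the attributes used here.
sample = [
    '<?xml version ="1.0" encoding="UTF-8"?>',
    ' <ArgusManagementRecord  StartTime = "2013-07-22T16:33:43.194437" '
    'PktsRcvd = "52114" PktsDropped = "0" State = "CON"></ArgusManagementRecord>',
]
for start, rcvd, dropped in drop_stats(sample):
    print(start, rcvd, dropped)
```

A non-zero PktsDropped in successive records would point at packets being lost between libpcap and argus.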

If there is still a mystery, flows with the " ? " will exist naturally.
Flows that are long lived, with idle periods longer than the TCP timeout
period, will present with the " ? ".  Also when there is asymmetry, such
as load balancing, you may miss the SYN and SYN_ACK completely.
You get what you get, in that case.

We provide some means to control the direction, when it's unknown.
If you want to propose other client based mechanisms, holler away.

Carter


On Jul 21, 2013, at 1:15 AM, Craig Merchant <cmerchant at responsys.com> wrote:

Just an FYI...  Apparently the DNA/libzero drivers from NTOP support pcap_stats().  But I have absolutely no idea how to access those stats...

From: ntop-misc-bounces at listgateway.unipi.it [mailto:ntop-misc-bounces at listgateway.unipi.it] On Behalf Of Alfredo Cardigliano
Sent: Saturday, July 20, 2013 4:03 AM
To: ntop-misc at listgateway.unipi.it
Subject: Re: [Ntop-misc] [ARGUS] Direction and IP/TCP timeout settings

Hi Craig
yes, libpcap over dna cluster queue provides pcap_stats() support.

Alfredo

On Jul 18, 2013, at 9:01 PM, Craig Merchant <cmerchant at responsys.com> wrote:

Alfredo,

I ran both pfcount -i dnacluster:10@28 (the queue argus monitors) and pfcount -i dna0 (when pfdnacluster_master wasn't running).  Both of them showed a 0.1% packet loss.

What about this question that Carter had:

Does the pfdnacluster_master queue provide standard pcap_stats() ?
We should be able to look at the MARs, which will tell us  how
many packets the interface dropped.

I'm not familiar with what pcap_stats() are...

Thanks.

Craig

From: ntop-misc-bounces at listgateway.unipi.it [mailto:ntop-misc-bounces at listgateway.unipi.it] On Behalf Of Alfredo Cardigliano
Sent: Thursday, July 18, 2013 12:44 AM
To: ntop-misc at listgateway.unipi.it
Subject: Re: [Ntop-misc] FW: [ARGUS] Direction and IP/TCP timeout settings

Hi Craig
what do you mean by "Pfcount says that the queue that argus is running on is only dropping 0.1% of packets"? You should look at the stats on the queue argus is using.
Select/poll are not supported by the cluster as we experienced that using usleep behaves better than the poll implementation in this case.

Alfredo

On Jul 16, 2013, at 1:51 AM, Craig Merchant <cmerchant at responsys.com> wrote:

I'm trying to troubleshoot some issues with the argus netflow tool running on top of pfdnacluster_master.  Pfcount says that the queue that argus is running  on is only dropping 0.1% of packets, yet argus can't figure out the direction of about 60% of the flows.  That means for some reason it isn't seeing the SYN and SYNACK of a lot of flows.

The argus developer had a couple questions about the pfdnacluster_master that I can't answer...  They are below.

Thanks.

Craig

From: Carter Bullard [mailto:carter at qosient.com]
Sent: Monday, July 15, 2013 3:13 PM
To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [ARGUS] Direction and IP/TCP timeout settings

Hey Craig,
If radium doesn't keep up, the argi will drop the connections,
so unless you see radium losing its connection and
then re-establishing, I don't think it's radium.  We can measure
all of this, so it's not going to be hard to track down, I don't
think.

If argus is generating the same number of flows, then it's probably
seeing the same traffic.  So, it seems that we are not getting all
the packets, and it doesn't appear to be due to argus running
out of cycles.  Are we running out of memory? How does vmstat look
on the machine ??  Not swapping out ?

To understand this issue, I need to know if the pfdnacluster_master queue
is a selectable packet source, or not.  We want to use select() to get
packets, so that we can leverage select()'s timeout feature to wake
us up, periodically, so we can do some background maintenance, like queue
timeouts, etc...

When we can't select(), we have to poll the interface, and if
there isn't anything there, we could fall into a nanosleep() call,
waiting for packets.  That may be a very bad thing, causing us to
lose packets.
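To make the select() pattern concrete, here is a minimal sketch in Python rather than argus's C, using a plain file descriptor as a stand-in for the packet source. It is not argus code; the function name, the 4096-byte read size, and the on_idle callback are all illustrative. The point is that the wait returns as soon as data is ready, and the timeout only fires when the source is idle, which is when the background maintenance runs.

```python
import os
import select


def read_with_maintenance(fd, timeout_s, max_wakeups, on_idle):
    """Read from fd, calling on_idle() whenever select() times out.

    Stops after max_wakeups wakeups or at end of file; returns the bytes read.
    """
    chunks = []
    for _wakeup in range(max_wakeups):
        readable, _, _ = select.select([fd], [], [], timeout_s)
        if readable:
            data = os.read(fd, 4096)
            if not data:
                break  # writer closed the descriptor
            chunks.append(data)
        else:
            on_idle()  # timeout expired with nothing to read
    return b"".join(chunks)
```

With a fixed sleep between polls, by contrast, data that arrives just after the sleep starts waits for the whole interval, and a small buffer can overflow in the meantime.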

Does the pfdnacluster_master queue provide standard pcap_stats() ?
We should be able to look at the MARs, which will tell us  how
many packets the interface dropped.

Not sure that I understand the problem with multiple argus processes?
You can run 24 copies of argus, and have radium connect to them
all to recreate the single argus data stream, if that is something
you would like to do.

Let's focus on this new interface.  It could be we have to do something
special to get the best performance out of it.

Carter


_______________________________________________
Ntop-misc mailing list
Ntop-misc at listgateway.unipi.it
http://listgateway.unipi.it/mailman/listinfo/ntop-misc

