Direction and IP/TCP timeout settings

Craig Merchant cmerchant at responsys.com
Thu Jul 18 19:24:30 EDT 2013


Carter,

I’ve tried running Argus using the vanilla ixgbe driver from Intel rather than the DNA and pf_ring-aware driver I got from the folks at ntop.  Still getting the same behavior…  60%+ of the flows have direction problems.  The interface statistics show less than 0.1% packet loss.  Argus runs at less than 10% of the CPU…

The longest flow duration over the last hour was 304 seconds; the average is around 22 seconds.  Argus is configured with a two-hour timeout for the TCP cache and an hour for IP.

We do have a Gigamon tap that sits between Argus and the core switch.  I’m investigating what those two devices see on the physical link.  I’m also going to see if it’s possible to get the Gigamon removed from the equation.

The only thing I can think of is using tcpdump to write the data to a file and then seeing whether the argus clients have trouble with direction there.

Can you think of anything else I should try?

The authors of pf_ring said the following about how data is queried:

Select/poll are not supported by the cluster as we experienced that using usleep behaves better than the poll implementation in this case.

I’m still waiting to hear whether their software supports the pcap stats you mentioned (which I don’t understand…).

Let me know how best to proceed.  Thanks again for all your help!

Craig

From: argus-info-bounces+cmerchant=responsys.com at lists.andrew.cmu.edu [mailto:argus-info-bounces+cmerchant=responsys.com at lists.andrew.cmu.edu] On Behalf Of Craig Merchant
Sent: Tuesday, July 16, 2013 12:45 AM
To: Jesse Bowling
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [ARGUS] Direction and IP/TCP timeout settings

Thanks for posting that thread.  I downloaded the latest updates for pf_ring from ntop’s archive.  I compiled and loaded that driver, but it didn’t have any impact on Argus’ ability to see direction properly.

I’m wondering if it would be valuable to run the pfdnacluster_master app with two queues that each have all the traffic.  I could try running both the pf_ring aware and pf_ring unaware versions of tcpdump and see if there are any discrepancies between their output.

I honestly have no idea what the difference is between the two versions of tcpdump, but from what Carter has said previously, it sounds like whatever “DNA/libzero awareness” is, it isn’t built into Argus currently.  Although I’m not even clear if that matters…

I can’t help but wonder if there is something about the DNA/libzero implementation of the network drivers that makes Argus miss the first couple of packets of a flow under load.

I don’t currently have the infrastructure to do something like query the NIC with SNMP for the total volume of data and then compare that to a summation of all of the bytes in all of the Argus flows to see how big the difference is.

Does anyone have any ideas for a thorough testing plan?

Thx.

Craig


From: Jesse Bowling [mailto:jessebowling at gmail.com]
Sent: Monday, July 15, 2013 6:52 PM
To: Craig Merchant
Cc: Carter Bullard; Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [ARGUS] Direction and IP/TCP timeout settings

Hey Craig,

There was recently a pretty bad bug in 5.5.3 of PF_RING...I'd take a look at the output of:
# cat /proc/net/pf_ring/info
and check the stats for "Cluster Fragment Discard"...If you're seeing high numbers there, I'd advise you to update PF_RING...

http://www.gossamer-threads.com/lists/ntop/misc/31502
Cheers,

Jesse


On Mon, Jul 15, 2013 at 7:49 PM, Craig Merchant <cmerchant at responsys.com> wrote:
If radium is dropping connections, will error messages appear in syslog?

Here is the vmstat output from the two sensors…

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
2  0      0 61130260  31220 1112324    0    0     2     1   53   22  4  4 91  0  0

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0  0      0 59848016 200272 2378216    0    0     0     2    5    4 22  4 74  0  0

Doesn’t look like it’s swapping anything out…

As for your questions about pfdnacluster_master, I’ll have to forward those to the ntop list.  We’re using pf_ring 5.5.3…

As far as running multiple instances goes…  pfdnacluster_master doesn’t load-balance the traffic.  It hashes the src/dest (and maybe protocol/port) and uses that as a key to ensure that all traffic between those hosts ends up in the same queue, but the keys are distributed to queues in a round-robin fashion.  At any given time, we’ll have 4-12 snort sensors running at near 100% CPU while others are largely idle.  If argus had to share CPUs with snort instances, some instances would definitely get starved for CPU time.
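Roughly, the affinity idea looks like the following sketch (illustrative only; the real pfdnacluster_master assigns keys to queues round-robin rather than by modulo, and the queue count here is hypothetical, but the symmetric hash is the part that matters):

#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES 28   /* hypothetical queue count */

static unsigned flow_queue(uint32_t src_ip, uint32_t dst_ip,
                           uint16_t src_port, uint16_t dst_port)
{
    /* XOR is symmetric, so hash(a,b) == hash(b,a): both directions
     * of a flow land in the same queue.  Nothing here equalizes the
     * load across queues, which is why some queues run hot. */
    uint32_t key = src_ip ^ dst_ip ^ (uint32_t)(src_port ^ dst_port);
    return key % NUM_QUEUES;
}

int main(void)
{
    /* Same flow, both directions: same queue number. */
    printf("%u\n", flow_queue(0x0a000001, 0x0a000002, 52000, 443));
    printf("%u\n", flow_queue(0x0a000002, 0x0a000001, 443, 52000));
    return 0;
}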

I’ll see if I can get the non-pf_ring driver working and whether that impacts anything.  I’ll let you know what I hear from ntop…

Thanks!

Craig

From: Carter Bullard [mailto:carter at qosient.com]
Sent: Monday, July 15, 2013 3:13 PM

To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [ARGUS] Direction and IP/TCP timeout settings

Hey Craig,
If radium doesn't keep up, the argi will drop the connections,
so unless you see radium losing its connection and
then re-establishing, I don't think it's radium.  We can measure
all of this, so it's not going to be hard to track down, I don't
think.

If argus is generating the same number of flows, then it's probably
seeing the same traffic.  So, it seems that we are not getting all
the packets, and it doesn't appear to be due to argus running
out of cycles.  Are we running out of memory?  How does vmstat look
on the machine?  Not swapping out?

To understand this issue, I need to know if the pfdnacluster_master queue
is a selectable packet source, or not.  We want to use select() to get
packets, so that we can leverage the select()s timeout feature to wake
us up, periodically, so we can do some background maintenance, like queue
timeouts, etc…

When we can't select(), we have to poll the interface, and if
there isn't anything there, we could fall into a nanosleep() call,
waiting for packets.  That may be a very bad thing, causing us to
lose packets.
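The pattern I mean looks roughly like this (just a sketch, not the
argus source; the handler and the 250 ms interval are made up for
illustration):

#include <pcap/pcap.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

static void handler(u_char *user, const struct pcap_pkthdr *h,
                    const u_char *bytes)
{
    (void)user; (void)h; (void)bytes;
    /* ... update flow state for one packet ... */
}

void capture_loop(pcap_t *p)
{
    int fd = pcap_get_selectable_fd(p);   /* -1 if not selectable */

    for (;;) {
        if (fd >= 0) {
            /* Selectable source: the select() timeout wakes us up
             * periodically for background maintenance (queue timeouts,
             * status records) even when no packets arrive. */
            fd_set rfds;
            struct timeval tv = { 0, 250000 };   /* 250 ms, illustrative */
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            if (select(fd + 1, &rfds, NULL, NULL, &tv) > 0)
                pcap_dispatch(p, -1, handler, NULL);
            /* on select() timeout: run queue maintenance here */
        } else {
            /* Non-selectable source (e.g. the pf_ring cluster): poll,
             * and sleep briefly when idle.  The sleep is where packets
             * can be lost if the ring fills while we are asleep. */
            if (pcap_dispatch(p, -1, handler, NULL) == 0)
                usleep(1000);   /* 1 ms, illustrative */
        }
    }
}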

Does the pfdnacluster_master queue provide standard pcap_stats()?
We should be able to look at the MARs, which will tell us how
many packets the interface dropped.
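For example (again just a sketch of the standard libpcap call, not
argus code):

#include <pcap/pcap.h>
#include <stdio.h>

void report_drops(pcap_t *p)
{
    struct pcap_stat ps;

    /* pcap_stats() fills in packets received, dropped by the kernel
     * buffer, and dropped by the interface. */
    if (pcap_stats(p, &ps) == 0)
        printf("received %u  dropped %u  if_dropped %u\n",
               ps.ps_recv, ps.ps_drop, ps.ps_ifdrop);
    else
        fprintf(stderr, "pcap_stats: %s\n", pcap_geterr(p));
}

If the cluster queue doesn't implement pcap_stats(), those counters
won't tell us anything useful.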

Not sure that I understand the problem with multiple argus processes.
You can run 24 copies of argus, and have radium connect to them
all to recreate the single argus data stream, if that is something
you would like to do.

Let's focus on this new interface.  It could be that we have to do something
special to get the best performance out of it.

Carter


On Jul 15, 2013, at 5:34 PM, Craig Merchant <cmerchant at responsys.com> wrote:

The DNA/libzero drivers only allow a single process to connect to the “queues” that the pfdnacluster_master app presents.  The default version of their app will allow you to copy the same flow to multiple queues, but then we’d need to run 28 snort instances and 28 argus instances.  From my experience, Argus wasn’t burning that much CPU, so I opted to take advantage of the work Chris Wakelin did in modifying pfdnacluster_master so that it created a single queue with a copy of all the traffic.

Here’s the weird thing...  When argus is listening to the dna0 interface directly, its CPU usage probably runs at 30-40%.  But when I run it on the pfdnacluster_master queue, the CPU probably runs at about half that.

Yet when I look at the count of flow records from running Argus on the DNA interface vs. the pfdnacluster_master queue, the volume of records is about the same.  It’s tough to test, though, because our traffic volume is pretty variable depending on when customers launch their campaigns.  The only way to test it for sure would be to wire the second 10G interface into the Gigamon tap, send a copy of the traffic there, and then run one instance of argus on the interface and one on pfdnacluster_master and compare them.

Is it possible that radium is getting overwhelmed, and if so, how do I troubleshoot that?  The two argi that it connects to probably see an aggregate volume of 5-15 Gbps…  Since there is a fair bit of traffic between data centers, the dedup features of radium are helpful.

I might be able to put a copy of the non-pf_ring ixgbe driver on the sensor and see how that impacts things.

Thanks for all your help!

Craig

From: Carter Bullard [mailto:carter at qosient.com]
Sent: Monday, July 15, 2013 1:13 PM
To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [ARGUS] Direction and IP/TCP timeout settings

What percent utilization do you have for argus?
Argus could be running out of steam and dropping packets.
So, if you have snort running on 20+ queues to get the performance up,
why not try to do that with argus?

Carter

On Jul 15, 2013, at 3:49 PM, Craig Merchant <cmerchant at responsys.com> wrote:

I recompiled argus after making the change to ArgusModeler.h.  Judging by the memory use, Argus is now able to use a much bigger cache for connections.  Thanks!

It hasn’t had any impact on the direction problem though.

When argus runs on top of the pfdnacluster_master app, it can’t figure out the direction more than 60% of the time.  If I run Argus directly on the dna0 interface, it can’t figure out the direction about 40% of the time.  The pfcount utility that comes with pf_ring says that there is less than 0.1% packet loss when running on pfdnacluster_master and no packet loss when running on dna0 itself.

The interface isn’t dropping anything either:

dna0      Link encap:Ethernet  HWaddr 00:E0:ED:1F:60:38
          inet6 addr: fe80::2e0:edff:fe1f:6038/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:97888412645 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:63700614828375 (57.9 TiB)  TX bytes:0 (0.0 b)
          Memory:feaa0000-feac0000

Can you think of why Argus might have issues with pf_ring and DNA?  Any suggestions for working around it?

Thx.

Craig





From: Carter Bullard [mailto:carter at qosient.com]
Sent: Saturday, July 13, 2013 7:38 AM
To: Craig Merchant
Subject: Re: [ARGUS] Direction and IP/TCP timeout settings

Hey Craig,
So I had capped the largest timeout at 5 minutes.  Easy fix; really sorry for the inconvenience.

The per-flow timeout value is an unsigned short (16 bits), so you can use this patch
to set timeouts up to 65534, in the file ./argus/ArgusModeler.h:

osiris:argus carter$ diff ./argus/ArgusModeler.h ./argus/ArgusModeler.h.orig
84c84
< #define ARGUSTIMEOUTQS                  65534
---
> #define ARGUSTIMEOUTQS                  301


Carter

Carter Bullard
CEO/President
QoSient, LLC
150 E 57th Street Suite 12D
New York, New York  10022

+1 212 588-9133 Phone
+1 212 588-9134 Fax

On Jul 12, 2013, at 2:15 PM, Carter Bullard <carter at qosient.com> wrote:


Hey Craig,
I haven't had a chance to look at the code.
Let me see this afternoon if it's supposed to be working or not.
Carter

Carter Bullard, QoSient, LLC
150 E. 57th Street Suite 12D
New York, New York 10022
+1 212 588-9133 Phone
+1 212 588-9134 Fax

On Jul 12, 2013, at 1:35 PM, Craig Merchant <cmerchant at responsys.com> wrote:
I’ve been running Argus for about 18 hours now with a two-hour timeout setting, and there hasn’t been any change in the number of flows whose direction it is unsure of…

Let me know if there is anything I can do to help test this…

C

From: Carter Bullard [mailto:carter at qosient.com]
Sent: Friday, July 12, 2013 6:37 AM
To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [ARGUS] Direction and IP/TCP timeout settings

Hmmmm, do the new timeouts change the direction problem?
That will be the real test.  If the memory issues aren't showing themselves,
then cool, as long as your traffic looks better.

If not, I'll take a look.  You never know where things break down.
In some cases, we'll try to make the direction indicator match the traffic,
with the central character indicating the confidence.  So, when there is
a "?", the < or > should change to indicate the direction of the traffic, since
the assignment of flow direction isn't certain.

Carter


On Jul 11, 2013, at 7:28 PM, Craig Merchant <cmerchant at responsys.com> wrote:



Hey, Carter…

We’re finding that for about 70% of our flows, Argus can’t figure out the direction.  From previous posts, it would seem that the 60-second TCP session timeout is too short.  If I understand correctly, a flow longer than 60 seconds will have its session time out in the cache, and then argus can’t really determine what the direction is.

The argus.conf file warns of the hit on memory if those settings are adjusted from the defaults.  I’ve been steadily increasing the TCP and IP timeout values and watching to see whether memory consumption jumps up dramatically or whether we’re seeing fewer events where the direction is uncertain.

I’ve gone as high as a two-hour session timeout.  We do something like 2.5-8 Gbps 24 hours a day, so I would expect to see a huge increase in Argus memory consumption when increasing the timeout value.  The machine has something like 64 GB of memory, and top says argus is only using 0.2%.

The settings look like:

ARGUS_IP_TIMEOUT=3600
ARGUS_TCP_TIMEOUT=7200
#ARGUS_ICMP_TIMEOUT=5
#ARGUS_IGMP_TIMEOUT=30
#ARGUS_FRAG_TIMEOUT=5
#ARGUS_ARP_TIMEOUT=5
#ARGUS_OTHER_TIMEOUT=30

Am I doing something wrong here?  Is there some other setting I need to enable to increase that timeout value?

Also, what’s the difference between a direction value of ?> vs <?>?

Thanks!

Craig





--
Jesse Bowling