Direction and IP/TCP timeout settings
Jesse Bowling
jessebowling at gmail.com
Mon Jul 15 21:51:43 EDT 2013
Hey Craig,
There was recently a pretty bad bug in 5.5.3 of PF_RING...I'd take a look
at the output of:
# cat /proc/net/pf_ring/info
and check the stats for "Cluster Fragment Discard"...If you're seeing high
numbers there, I'd advise you to update PF_RING...
http://www.gossamer-threads.com/lists/ntop/misc/31502
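
If you just want to eyeball those counters, something like this should work
(assuming your PF_RING build reports them under that name in that file):

# grep -i "fragment" /proc/net/pf_ring/info
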
Cheers,
Jesse
On Mon, Jul 15, 2013 at 7:49 PM, Craig Merchant <cmerchant at responsys.com> wrote:
> If radium is dropping connections, will error messages appear in syslog?
>
> Here is the vmstat output from the two sensors…
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b swpd     free   buff   cache si so bi bo in cs us sy id wa st
>  2  0    0 61130260  31220 1112324  0  0  2  1 53 22  4  4 91  0  0
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b swpd     free   buff   cache si so bi bo in cs us sy id wa st
>  0  0    0 59848016 200272 2378216  0  0  0  2  5  4 22  4 74  0  0
>
> Doesn’t look like it’s swapping anything out…
>
> As for your questions about pfdnacluster_master, I’ll have to forward
> those to the ntop list. We’re using pf_ring 5.5.3…
>
> As far as running multiple instances goes… pfdnacluster_master doesn’t
> load balance the traffic. It hashes the src/dest (and maybe protocol/port)
> and uses that as a key to ensure that all traffic between those hosts
> ends up in the same queue. But the keys are distributed to queues in a
> round-robin fashion. At any given time, we’ll have 4-12 snort sensors
> running at near 100% CPU while others are largely idle. If argus had to
> share CPUs with snort instances, some instances would definitely get
> starved for CPU time.
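>
> (In rough terms -- this is just my illustration, not the actual
> pfdnacluster_master source -- the assignment works something like this:)
>
>     /* Sketch: symmetric flow hash -> key; keys handed to queues round-robin. */
>     #include <stdint.h>
>
>     #define NUM_QUEUES 28
>
>     static int next_queue;             /* round-robin cursor */
>     static int key_to_queue[65536];    /* 0 = unassigned; stores queue+1 */
>
>     int queue_for_flow(uint32_t src, uint32_t dst, uint8_t proto)
>     {
>         /* same key for both directions, so a flow stays on one queue */
>         uint32_t h = src ^ dst;
>         uint16_t key = (uint16_t)((h >> 16) ^ h ^ proto);
>         if (key_to_queue[key] == 0)
>             key_to_queue[key] = 1 + (next_queue++ % NUM_QUEUES);
>         return key_to_queue[key] - 1;
>     }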
>
> I’ll see if I can get the non-pf_ring driver working and see whether that
> impacts anything. I’ll let you know what I hear from ntop…
>
> Thanks!
>
> Craig
>
> *From:* Carter Bullard [mailto:carter at qosient.com]
> *Sent:* Monday, July 15, 2013 3:13 PM
>
> *To:* Craig Merchant
> *Cc:* Argus (argus-info at lists.andrew.cmu.edu)
> *Subject:* Re: [ARGUS] Direction and IP/TCP timeout settings
>
> Hey Craig,
>
> If radium doesn't keep up, the argi will drop the connections,
> so unless you see radium losing its connection and
> then re-establishing, I don't think it's radium. We can measure
> all of this, so it's not going to be hard to track down, I don't
> think.
>
> If argus is generating the same number of flows, then it's probably
> seeing the same traffic. So, it seems that we are not getting all
> the packets, and it doesn't appear to be due to argus running
> out of cycles. Are we running out of memory? How does vmstat look
> on the machine? Not swapping out?
>
> To understand this issue, I need to know if the pfdnacluster_master queue
> is a selectable packet source, or not. We want to use select() to get
> packets, so that we can leverage select()'s timeout feature to wake
> us up, periodically, so we can do some background maintenance, like queue
> timeouts, etc…
>
> When we can't select(), we have to poll the interface, and if
> there isn't anything there, we could fall into a nanosleep() call,
> waiting for packets. That may be a very bad thing, causing us to
> lose packets.
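>
> (A simplified sketch of those two paths -- not the actual argus source;
> process_packet and do_queue_timeouts are stand-ins:)
>
>     #include <pcap/pcap.h>
>     #include <sys/select.h>
>     #include <time.h>
>
>     static void process_packet(u_char *user, const struct pcap_pkthdr *h,
>                                const u_char *bytes) { /* build flow records */ }
>     static void do_queue_timeouts(void) { /* age out idle flows */ }
>
>     void capture_loop(pcap_t *p)
>     {
>         int fd = pcap_get_selectable_fd(p);   /* -1 if not selectable */
>
>         for (;;) {
>             if (fd >= 0) {
>                 fd_set rfds;
>                 struct timeval tv = { 0, 250000 };   /* periodic wakeup */
>                 FD_ZERO(&rfds);
>                 FD_SET(fd, &rfds);
>                 if (select(fd + 1, &rfds, NULL, NULL, &tv) > 0)
>                     pcap_dispatch(p, -1, process_packet, NULL);
>                 else
>                     do_queue_timeouts();      /* maintenance on timeout */
>             } else {
>                 /* non-selectable source: poll, nanosleep when idle */
>                 if (pcap_dispatch(p, -1, process_packet, NULL) == 0) {
>                     struct timespec ts = { 0, 1000000 };   /* 1 ms */
>                     nanosleep(&ts, NULL);     /* packets can be missed here */
>                 }
>             }
>         }
>     }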
>
> Does the pfdnacluster_master queue provide standard pcap_stats()?
> We should be able to look at the MARs, which will tell us how
> many packets the interface dropped.
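>
> (That check is only a few lines -- a sketch, assuming p is the open
> pcap_t handle:)
>
>     struct pcap_stat ps;
>     if (pcap_stats(p, &ps) == 0)
>         printf("recv %u drop %u ifdrop %u\n",
>                ps.ps_recv, ps.ps_drop, ps.ps_ifdrop);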
>
> Not sure that I understand the problem with multiple argus processes?
> You can run 24 copies of argus, and have radium connect to them
> all to recreate the single argus data stream, if that is something
> you would like to do.
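>
> (A sketch of that layout -- the ports, cluster id, and queue names here
> are illustrative, and I believe radium will take multiple -S sources:)
>
>     argus -i dnacluster:10@0 -P 561 -d
>     argus -i dnacluster:10@1 -P 562 -d
>     ...one per queue...
>     radium -S localhost:561 -S localhost:562 -P 560 -d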
>
> Let's focus on this new interface. It could be we have to do something
> special to get the best performance out of it.
>
> Carter
>
> On Jul 15, 2013, at 5:34 PM, Craig Merchant <cmerchant at responsys.com>
> wrote:
>
> The DNA/libzero drivers only allow a single process to connect to the
> “queues” that the pfdnacluster_master app presents. The default version of
> their app will allow you to copy the same flow to multiple queues, but then
> we’d need to run 28 snort instances and 28 argus instances. From my
> experience, Argus wasn’t burning that much CPU, so I opted to take
> advantage of the work Chris Wakelin did in modifying pfdnacluster_master so
> that it created a single queue with a copy of all the traffic.
>
> Here’s the weird thing... When argus is listening to the dna0 interface
> directly, its CPU probably runs at 30-40%. But when I run it on the
> pfdnacluster_master queue, the CPU probably runs at about half that.
>
> Yet when I look at the count of flow records for running Argus on the DNA
> interface vs the pfdnacluster_master queue, the volume of records is about
> the same. It’s tough to test though because our traffic volume is pretty
> variable depending on when customers launch their campaigns. The only way
> to test it for sure would be to wire the second 10g interface into the
> Gigamon tap, send a copy of the traffic there, and then run one instance of
> argus on the interface and one on pfdnacluster_master and compare them.
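>
> (When I do that comparison, racount from argus-clients should make it
> easy -- the file names here are made up:)
>
>     racount -r dna0-side.argus
>     racount -r cluster-side.argus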
>
> Is it possible that radium is getting overwhelmed? The two argi that it
> connects to probably do an aggregate volume of 5-15 Gbps… Since there is a
> fair bit of traffic between data centers, the dedup features of radium are
> helpful. If so, how do I troubleshoot that?
>
> I might be able to put a copy of the non-pf_ring ixgbe driver on the
> sensor and see how that impacts things.
>
> Thanks for all your help!
>
> Craig
>
> *From:* Carter Bullard [mailto:carter at qosient.com]
> *Sent:* Monday, July 15, 2013 1:13 PM
> *To:* Craig Merchant
> *Cc:* Argus (argus-info at lists.andrew.cmu.edu)
> *Subject:* Re: [ARGUS] Direction and IP/TCP timeout settings
>
> What percent utilization do you have for argus?
> Argus could be running out of steam and dropping packets.
> So, if you have snort running on 20+ queues to get the performance up,
> why not try to do that with argus?
>
> Carter
>
> On Jul 15, 2013, at 3:49 PM, Craig Merchant <cmerchant at responsys.com>
> wrote:
>
> I recompiled argus after making the change to ArgusModeler.h. Judging by
> the memory use, Argus is now able to use a much bigger cache for
> connections. Thanks!
>
> It hasn’t had any impact on the direction problem though.
>
> When argus runs on top of the pfdnacluster_master app, it can’t figure out
> the direction about 60%+ of the time. If I run Argus directly on the dna0
> interface, it can’t figure out the direction about 40% of the time. The
> pfcount utility that comes with pf_ring says that there is less than 0.1%
> packet loss when running on pfdnacluster_master and no packet loss when
> running on dna0 itself.
>
> The interface isn’t dropping anything either:
>
> dna0      Link encap:Ethernet  HWaddr 00:E0:ED:1F:60:38
>           inet6 addr: fe80::2e0:edff:fe1f:6038/64 Scope:Link
>           UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
>           RX packets:97888412645 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:63700614828375 (57.9 TiB)  TX bytes:0 (0.0 b)
>           Memory:feaa0000-feac0000
>
> Can you think of why Argus might have issues with pf_ring and DNA? Any
> suggestions for working around it?
>
> Thx.
>
> Craig
>
> *From:* Carter Bullard [mailto:carter at qosient.com]
> *Sent:* Saturday, July 13, 2013 7:38 AM
> *To:* Craig Merchant
> *Subject:* Re: [ARGUS] Direction and IP/TCP timeout settings
>
> Hey Craig,
>
> So I capped the largest timeout to be 5 minutes. Easy fix, really sorry
> for the inconvenience.
>
> The per-flow timeout value is an unsigned short (16 bits), so you can use
> this patch to set timeouts up to 65534, in the file ./argus/ArgusModeler.h:
>
> osiris:argus carter$ diff ./argus/ArgusModeler.h ./argus/ArgusModeler.h.orig
> 84c84
> < #define ARGUSTIMEOUTQS 65534
> ---
> > #define ARGUSTIMEOUTQS 301
>
> Carter
>
> Carter Bullard
> CEO/President
> QoSient, LLC
> 150 E 57th Street Suite 12D
> New York, New York 10022
> +1 212 588-9133 Phone
> +1 212 588-9134 Fax
>
> On Jul 12, 2013, at 2:15 PM, Carter Bullard <carter at qosient.com> wrote:
>
> Hey Craig,
>
> I haven't had a chance to look at the code.
> Let me see this afternoon if it's supposed to be working or not.
>
> Carter
>
> Carter Bullard, QoSient, LLC
> 150 E. 57th Street Suite 12D
> New York, New York 10022
> +1 212 588-9133 Phone
> +1 212 588-9134 Fax
>
> On Jul 12, 2013, at 1:35 PM, Craig Merchant <cmerchant at responsys.com>
> wrote:
>
> I’ve been running Argus for about 18 hours now with a two-hour timeout
> setting, and there hasn’t been any change in the number of flows whose
> direction it is unsure of…
>
> Let me know if there is anything I can do to help test this…
>
> C
>
> *From:* Carter Bullard [mailto:carter at qosient.com]
> *Sent:* Friday, July 12, 2013 6:37 AM
> *To:* Craig Merchant
> *Cc:* Argus (argus-info at lists.andrew.cmu.edu)
> *Subject:* Re: [ARGUS] Direction and IP/TCP timeout settings
>
> Hmmmm, do the new timeouts change the direction problem?
> That will be the real test. If the memory issues aren't showing themselves,
> then cool, as long as your traffic looks better.
>
> If not, I'll take a look. Never know where things break down.
> In some cases, we'll try to make the direction indicator match the traffic,
> with the central character indicating the confidence. So, when there is
> a " ? ", the < or > should change to indicate the direction of traffic,
> since the assignment of flow direction isn't certain.
>
> Carter
>
> On Jul 11, 2013, at 7:28 PM, Craig Merchant <cmerchant at responsys.com>
> wrote:
>
> Hey, Carter…
>
> We’re finding that for about 70% of our flows, Argus can’t figure out the
> direction. From previous posts, it would seem that the 60-second TCP
> session timeout is too short. If I understand correctly, a flow longer
> than 60 seconds will have its session time out in the cache, and then argus
> can’t really determine what the direction is.
>
> The argus.conf file warns of the hit on memory if those settings are
> adjusted from the defaults. I’ve been steadily increasing the TCP and IP
> timeout values and watching to see if memory consumption jumps up
> dramatically or if we’re seeing fewer events where the direction is
> uncertain.
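>
> (One way to count the uncertain flows -- assuming ra’s dir field prints
> the direction indicator, and the path is made up:)
>
>     ra -r /path/to/argus.out -s dir | sort | uniq -c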
>
> I’ve gone as high as a two-hour session timeout. We do something like
> 2.5-8 Gbps 24 hours a day, so I would expect to see a huge increase in
> Argus memory consumption when increasing the timeout value. The machine
> has 64 GB of memory, and top says argus is only using 0.2%.
>
> The settings look like:
>
> ARGUS_IP_TIMEOUT=3600
> ARGUS_TCP_TIMEOUT=7200
> #ARGUS_ICMP_TIMEOUT=5
> #ARGUS_IGMP_TIMEOUT=30
> #ARGUS_FRAG_TIMEOUT=5
> #ARGUS_ARP_TIMEOUT=5
> #ARGUS_OTHER_TIMEOUT=30
>
> Am I doing something wrong here? Is there some other setting I need to
> enable to increase that timeout value?
>
> Also, what’s the difference between a direction value of ?> vs <?>?
>
> Thanks!
>
> Craig
>
--
Jesse Bowling