Direction and IP/TCP timeout settings
Jesse Bowling
jessebowling at gmail.com
Mon Jul 15 21:51:43 EDT 2013
Hey Craig,
There was recently a pretty bad bug in 5.5.3 of PF_RING...I'd take a look
at the output of:
# cat /proc/net/pf_ring/info
and check the stats for "Cluster Fragment Discard"...If you're seeing high
numbers there, I'd advise you to update PF_RING...
http://www.gossamer-threads.com/lists/ntop/misc/31502
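
If you just want to eyeball those counters, something like this should work
(assuming your PF_RING build reports them under that name in that file):

# grep -i "fragment" /proc/net/pf_ring/info
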
Cheers,
Jesse
On Mon, Jul 15, 2013 at 7:49 PM, Craig Merchant <cmerchant at responsys.com> wrote:
> If radium is dropping connections, will error messages appear in syslog?
>
> Here is the vmstat output from the two sensors…
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b swpd     free   buff   cache si so bi bo in cs us sy id wa st
>  2  0    0 61130260  31220 1112324  0  0  2  1 53 22  4  4 91  0  0
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b swpd     free   buff   cache si so bi bo in cs us sy id wa st
>  0  0    0 59848016 200272 2378216  0  0  0  2  5  4 22  4 74  0  0
>
> Doesn’t look like it’s swapping anything out…
>
> As for your questions about pfdnacluster_master, I’ll have to forward
> those to the ntop list. We’re using pf_ring 5.5.3…
>
> As far as running multiple instances goes… pfdnacluster_master doesn’t
> load balance the traffic. It hashes the src/dest (and maybe protocol/port)
> and uses that as a key to ensure that all traffic between those hosts
> ends up in the same queue. But the keys are distributed to queues in a
> round-robin fashion. At any given time, we’ll have 4-12 snort sensors
> running at near 100% CPU while others are largely idle. If argus had to
> share CPUs with snort instances, some instances would definitely get
> starved for CPU time.
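>
> (In rough terms -- this is just my illustration, not the actual
> pfdnacluster_master source -- the assignment works something like this:)
>
>     /* Sketch: symmetric flow hash -> key; keys handed to queues round-robin. */
>     #include <stdint.h>
>
>     #define NUM_QUEUES 28
>
>     static int next_queue;             /* round-robin cursor */
>     static int key_to_queue[65536];    /* 0 = unassigned; stores queue+1 */
>
>     int queue_for_flow(uint32_t src, uint32_t dst, uint8_t proto)
>     {
>         /* same key for both directions, so a flow stays on one queue */
>         uint32_t h = src ^ dst;
>         uint16_t key = (uint16_t)((h >> 16) ^ h ^ proto);
>         if (key_to_queue[key] == 0)
>             key_to_queue[key] = 1 + (next_queue++ % NUM_QUEUES);
>         return key_to_queue[key] - 1;
>     }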
>
> I’ll see if I can get the non-pf_ring driver working and see whether that
> impacts anything. I’ll let you know what I hear from ntop…
>
> Thanks!
>
> Craig
>
> *From:* Carter Bullard [mailto:carter at qosient.com]
> *Sent:* Monday, July 15, 2013 3:13 PM
>
> *To:* Craig Merchant
> *Cc:* Argus (argus-info at lists.andrew.cmu.edu)
> *Subject:* Re: [ARGUS] Direction and IP/TCP timeout settings
>
> Hey Craig,
>
> If radium doesn't keep up, the argi will drop the connections,
> so unless you see radium losing its connection and
> then re-establishing, I don't think it's radium. We can measure
> all of this, so it's not going to be hard to track down, I don't
> think.
>
> If argus is generating the same number of flows, then it's probably
> seeing the same traffic. So, it seems that we are not getting all
> the packets, and it doesn't appear to be due to argus running
> out of cycles. Are we running out of memory? How does vmstat look
> on the machine? Not swapping out?
>
> To understand this issue, I need to know if the pfdnacluster_master queue
> is a selectable packet source, or not. We want to use select() to get
> packets, so that we can leverage select()'s timeout feature to wake
> us up, periodically, so we can do some background maintenance, like queue
> timeouts, etc…
>
> When we can't select(), we have to poll the interface, and if
> there isn't anything there, we could fall into a nanosleep() call,
> waiting for packets. That may be a very bad thing, causing us to
> lose packets.
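>
> (A simplified sketch of those two paths -- not the actual argus source;
> process_packet and do_queue_timeouts are stand-ins:)
>
>     #include <pcap/pcap.h>
>     #include <sys/select.h>
>     #include <time.h>
>
>     static void process_packet(u_char *user, const struct pcap_pkthdr *h,
>                                const u_char *bytes) { /* build flow records */ }
>     static void do_queue_timeouts(void) { /* age out idle flows */ }
>
>     void capture_loop(pcap_t *p)
>     {
>         int fd = pcap_get_selectable_fd(p);   /* -1 if not selectable */
>
>         for (;;) {
>             if (fd >= 0) {
>                 fd_set rfds;
>                 struct timeval tv = { 0, 250000 };   /* periodic wakeup */
>                 FD_ZERO(&rfds);
>                 FD_SET(fd, &rfds);
>                 if (select(fd + 1, &rfds, NULL, NULL, &tv) > 0)
>                     pcap_dispatch(p, -1, process_packet, NULL);
>                 else
>                     do_queue_timeouts();      /* maintenance on timeout */
>             } else {
>                 /* non-selectable source: poll, nanosleep when idle */
>                 if (pcap_dispatch(p, -1, process_packet, NULL) == 0) {
>                     struct timespec ts = { 0, 1000000 };   /* 1 ms */
>                     nanosleep(&ts, NULL);     /* packets can be missed here */
>                 }
>             }
>         }
>     }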
>
> Does the pfdnacluster_master queue provide standard pcap_stats()?
> We should be able to look at the MARs, which will tell us how
> many packets the interface dropped.
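>
> (That check is only a few lines -- a sketch, assuming p is the open
> pcap_t handle:)
>
>     struct pcap_stat ps;
>     if (pcap_stats(p, &ps) == 0)
>         printf("recv %u drop %u ifdrop %u\n",
>                ps.ps_recv, ps.ps_drop, ps.ps_ifdrop);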
>
> Not sure that I understand the problem with multiple argus processes?
> You can run 24 copies of argus, and have radium connect to them
> all to recreate the single argus data stream, if that is something
> you would like to do.
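>
> (A sketch of that layout -- the ports, cluster id, and queue names here
> are illustrative, and I believe radium will take multiple -S sources:)
>
>     argus -i dnacluster:10@0 -P 561 -d
>     argus -i dnacluster:10@1 -P 562 -d
>     ...one per queue...
>     radium -S localhost:561 -S localhost:562 -P 560 -d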
>
> Let's focus on this new interface. It could be we have to do something
> special to get the best performance out of it.
>
> Carter
>
> On Jul 15, 2013, at 5:34 PM, Craig Merchant <cmerchant at responsys.com>
> wrote:
>
> The DNA/libzero drivers only allow a single process to connect to the
> “queues” that the pfdnacluster_master app presents. The default version of
> their app will allow you to copy the same flow to multiple queues, but then
> we’d need to run 28 snort instances and 28 argus instances. From my
> experience, Argus wasn’t burning that much CPU, so I opted to take
> advantage of the work Chris Wakelin did in modifying pfdnacluster_master so
> that it created a single queue with a copy of all the traffic.
>
> Here’s the weird thing... When argus is listening to the dna0 interface
> directly, its CPU probably runs at 30-40%. But when I run it on the
> pfdnacluster_master queue, the CPU probably runs at about half that.
>
> Yet when I look at the count of flow records for running Argus on the DNA
> interface vs the pfdnacluster_master queue, the volume of records is about
> the same. It’s tough to test though because our traffic volume is pretty
> variable depending on when customers launch their campaigns. The only way
> to test it for sure would be to wire the second 10g interface into the
> Gigamon tap, send a copy of the traffic there, and then run one instance of
> argus on the interface and one on pfdnacluster_master and compare them.
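>
> (When I do that comparison, racount from argus-clients should make it
> easy -- the file names here are made up:)
>
>     racount -r dna0-side.argus
>     racount -r cluster-side.argus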
>
> Is it possible that radium is getting overwhelmed? The two argi that it
> connects to probably do an aggregate volume of 5-15 Gbps… Since there is a
> fair bit of traffic between data centers, the dedup features of radium are
> helpful. If so, how do I troubleshoot that?
>
> I might be able to put a copy of the non-pf_ring ixgbe driver on the
> sensor and see how that impacts things.
>
> Thanks for all your help!
>
> Craig
>
> *From:* Carter Bullard [mailto:carter at qosient.com]
> *Sent:* Monday, July 15, 2013 1:13 PM
> *To:* Craig Merchant
> *Cc:* Argus (argus-info at lists.andrew.cmu.edu)
> *Subject:* Re: [ARGUS] Direction and IP/TCP timeout settings
>
> What percent utilization do you have for argus?
> Argus could be running out of steam and dropping packets.
> So, if you have snort running on 20+ queues to get the performance up,
> why not try to do that with argus?
>
> Carter
>
> On Jul 15, 2013, at 3:49 PM, Craig Merchant <cmerchant at responsys.com>
> wrote:
>
> I recompiled argus after making the change to ArgusModeler.h. Judging by
> the memory use, Argus is now able to use a much bigger cache for
> connections. Thanks!
>
> It hasn’t had any impact on the direction problem though.
>
> When argus runs on top of the pfdnacluster_master app, it can’t figure out
> the direction about 60%+ of the time. If I run Argus directly on the dna0
> interface, it can’t figure out the direction about 40% of the time. The
> pfcount utility that comes with pf_ring says that there is less than 0.1%
> packet loss when running on pfdnacluster_master and no packet loss when
> running on dna0 itself.
>
> The interface isn’t dropping anything either:
>
> dna0      Link encap:Ethernet  HWaddr 00:E0:ED:1F:60:38
>           inet6 addr: fe80::2e0:edff:fe1f:6038/64 Scope:Link
>           UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
>           RX packets:97888412645 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:63700614828375 (57.9 TiB)  TX bytes:0 (0.0 b)
>           Memory:feaa0000-feac0000
>
> Can you think of why Argus might have issues with pf_ring and DNA? Any
> suggestions for working around it?
>
> Thx.
>
> Craig
>
> *From:* Carter Bullard [mailto:carter at qosient.com]
> *Sent:* Saturday, July 13, 2013 7:38 AM
> *To:* Craig Merchant
> *Subject:* Re: [ARGUS] Direction and IP/TCP timeout settings
>
> Hey Craig,
>
> So I capped the largest timeout to be 5 minutes. Easy fix, really sorry
> for the inconvenience.
>
> The per-flow timeout value is an unsigned short (16 bits), so you can use
> this patch to set timeouts up to 65534, in the file ./argus/ArgusModeler.h:
>
> osiris:argus carter$ diff ./argus/ArgusModeler.h ./argus/ArgusModeler.h.orig
> 84c84
> < #define ARGUSTIMEOUTQS 65534
> ---
> > #define ARGUSTIMEOUTQS 301
>
> Carter
>
> Carter Bullard
> CEO/President
> QoSient, LLC
> 150 E 57th Street Suite 12D
> New York, New York 10022
> +1 212 588-9133 Phone
> +1 212 588-9134 Fax
>
> On Jul 12, 2013, at 2:15 PM, Carter Bullard <carter at qosient.com> wrote:
>
> Hey Craig,
>
> I haven't had a chance to look at the code.
> Let me see this afternoon if it's supposed to be working or not.
>
> Carter
>
> Carter Bullard, QoSient, LLC
> 150 E. 57th Street Suite 12D
> New York, New York 10022
> +1 212 588-9133 Phone
> +1 212 588-9134 Fax
>
> On Jul 12, 2013, at 1:35 PM, Craig Merchant <cmerchant at responsys.com>
> wrote:
>
> I’ve been running Argus for about 18 hours now with a two-hour timeout
> setting, and there hasn’t been any change in the number of flows whose
> direction it is unsure of…
>
> Let me know if there is anything I can do to help test this…
>
> C
>
> *From:* Carter Bullard [mailto:carter at qosient.com]
> *Sent:* Friday, July 12, 2013 6:37 AM
> *To:* Craig Merchant
> *Cc:* Argus (argus-info at lists.andrew.cmu.edu)
> *Subject:* Re: [ARGUS] Direction and IP/TCP timeout settings
>
> Hmmmm, do the new timeouts change the direction problem?
> That will be the real test. If the memory issues aren't showing themselves,
> then cool, as long as your traffic looks better.
>
> If not, I'll take a look. Never know where things break down.
> In some cases, we'll try to make the direction indicator match the traffic,
> with the central character indicating the confidence. So, when there is
> a " ? ", the < or > should change to indicate the direction of traffic,
> since the assignment of flow direction isn't certain.
>
> Carter
>
> On Jul 11, 2013, at 7:28 PM, Craig Merchant <cmerchant at responsys.com>
> wrote:
>
> Hey, Carter…
>
> We’re finding that for about 70% of our flows, Argus can’t figure out the
> direction. From previous posts, it would seem that the 60-second TCP
> session timeout is too short. If I understand correctly, a flow longer
> than 60 seconds will have its session time out in the cache, and then argus
> can’t really determine what the direction is.
>
> The argus.conf file warns of the hit on memory if those settings are
> adjusted from the defaults. I’ve been steadily increasing the TCP and IP
> timeout values and watching to see if memory consumption jumps up
> dramatically or if we’re seeing fewer events where the direction is
> uncertain.
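>
> (One way to count the uncertain flows -- assuming ra’s dir field prints
> the direction indicator, and the path is made up:)
>
>     ra -r /path/to/argus.out -s dir | sort | uniq -c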
>
> I’ve gone as high as a two-hour session timeout. We do something like
> 2.5-8 Gbps 24 hours a day, so I would expect to see a huge increase in
> Argus memory consumption when increasing the timeout value. The machine
> has 64 GB of memory, and top says argus is only using 0.2%.
>
> The settings look like:
>
> ARGUS_IP_TIMEOUT=3600
> ARGUS_TCP_TIMEOUT=7200
> #ARGUS_ICMP_TIMEOUT=5
> #ARGUS_IGMP_TIMEOUT=30
> #ARGUS_FRAG_TIMEOUT=5
> #ARGUS_ARP_TIMEOUT=5
> #ARGUS_OTHER_TIMEOUT=30
>
> Am I doing something wrong here? Is there some other setting I need to
> enable to increase that timeout value?
>
> Also, what’s the difference between a direction value of ?> vs <?>?
>
> Thanks!
>
> Craig
>
--
Jesse Bowling