A couple troubleshooting questions...

Tue Jul 29 08:34:47 EDT 2014

A file with primitive data is the best way to debug the TCP 0 issue.
The newest rarc file has an option, RA_PORT_DIRECTION.
You should try that.

Load balancers can cause problems, as can multiple argi looking at multiple streams. Is it possible that 2 sensors are seeing portions of the same flow?  Both will see gaps, and one of them may not see the connection setup.

Carter

> On Jul 28, 2014, at 2:05 PM, Craig Merchant <craig.merchant at oracle.com> wrote:
> 
> I’m positive we don’t have any asymmetric routing going on.  Each data center has a pair of Cisco 6500s in a VSS configuration.  The only things connected to the core switches are top of rack clusters and the firewalls/F5 load balancers.  All of the connections to the core go through TAPs.  So, even if there was asymmetric routing going on, Argus should still be seeing 100% of the traffic.
>  
> Here is the output of ifconfig.  Looks clean to me:
>  
> dna0      Link encap:Ethernet  HWaddr 00:E0:ED:1F:60:38
>           inet6 addr: fe80::2e0:edff:fe1f:6038/64 Scope:Link
>           UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
>           RX packets:228025522518 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:138285659803170 (125.7 TiB)  TX bytes:0 (0.0 b)
>           Memory:feaa0000-feac0000
>  
> I’ve got a call with some of our netops guys and Gigamon on Wednesday.  I’ll let you know what I find.
>  
> I’m not clear on how the label file that RA_LOCAL_DIRECTION uses can be differentiated by port…  The examples are all subnets.  Virtually all of the flows that Argus is unsure of the direction of are between two internal hosts.  Usually hosts running either a database or tomcat.  I’m not sure what RA_LOCAL_DIRECTION will do if both hosts are local.
>  
> Our Argus sensors have like 64 GB of memory and are barely using half.
>  
> I’m happy to upload some of our primitive data if you want to take a look at the TCP 0 issue.  What is the best way to get it to you?
> 
> Thx.
>  
> C
>  
>  
> From: Carter Bullard [mailto:carter at qosient.com] 
> Sent: Friday, July 25, 2014 2:49 PM
> To: Craig Merchant
> Cc: Argus
> Subject: Re: [ARGUS] A couple troubleshooting questions...
>  
> Hey Craig,
> Sorry for the delayed response.
> Your gap reporting suggests that you aren't seeing 50+% of the contents of some flows.
> Are you sure that you don't have some asymmetric routing going on ???
>  
> The PktsDropped value is the number reported by the libpcap interface.  If your PF_RING
> facility doesn't populate these numbers then we won't know if the problem is between
> argus and PF_RING.  If it does populate this value, then argus is keeping up with
> the packets that are presented to it.  But there are many
> other opportunities for packets to be dropped.  Look at the ifconfig and/or netstat
> statistics for the interfaces that argus is using.  They will reveal if there are any
> physical issues.  Gigamons are pretty good, but they can be used incorrectly.
>  
> Timeout values  ... you need to deal with the fact that you will get flow records from old
> timed out flows.  That is not hard...  yes the RA_LOCAL_DIRECTION support is intended
> to put port numbers on the left or right when printing, as well as LOCAL ip addresses,
> and it works well.  I would stick to ports under 1024, so use the RESERVED port setting.
>  
> If you have enough memory to keep all those caches in memory, then the timeout
> values are no problem.  But timeouts on the order of 1000 seconds can take a lot
> of memory at some sites.
>  
> Understand that bad actors will flip the ports around on you.  But figuring that out
> isn't any different than dealing with what you're dealing with now.
>  
> So, the 0 port issue.  This is a problem: if default aggregation is generating records
> with 0 port values where no 0 ports exist, then we've got a big bad bug.
>  
> If you can get a file of primitive records, that when aggregated show the 0 port bug,
> i.e. demonstrate this bug, and you can share the file, I'll fix it as fast as possible.
>  
> If not, I'll go ahead and release 3.0.8 and we'll fix it in 3.0.9 code.
>  
> Carter
>  
>  
>  
> On Jul 24, 2014, at 12:22 PM, Craig Merchant <craig.merchant at oracle.com> wrote:
> 
> 
> Just got some:
>  
> <ArgusManagementRecord StartTime = "1395380563.321" Flags = "         " Proto = "man" PktsRcvd = "0" Records = "0" BytesRcvd = "0" PktsDropped = "0" State = "STA" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217072.001" Flags = "         " Proto = "man" PktsRcvd = "16981734" Records = "818535" BytesRcvd = "11704539046" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217093.007" Flags = "         " Proto = "man" PktsRcvd = "16139016" Records = "877935" BytesRcvd = "9801629160" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217094.000" Flags = "         " Proto = "man" PktsRcvd = "0" Records = "1708990" BytesRcvd = "0" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217132.002" Flags = "         " Proto = "man" PktsRcvd = "17272249" Records = "824336" BytesRcvd = "10814753015" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217153.004" Flags = "         " Proto = "man" PktsRcvd = "17311017" Records = "920161" BytesRcvd = "7858246369" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217154.000" Flags = "         " Proto = "man" PktsRcvd = "0" Records = "1733970" BytesRcvd = "0" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217192.002" Flags = "         " Proto = "man" PktsRcvd = "17468993" Records = "807238" BytesRcvd = "11106467601" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217213.000" Flags = "         " Proto = "man" PktsRcvd = "17292538" Records = "888384" BytesRcvd = "8555571755" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
> <ArgusManagementRecord StartTime = "1406217214.001" Flags = "         " Proto = "man" PktsRcvd = "0" Records = "1695482" BytesRcvd = "0" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
>  
> While these records were being generated, I ran the ra client and grep’d for ‘*\sg’ and I saw a ton of flows with gaps.  So, from what you said earlier in the thread, if the problem is that Argus can’t keep up, PktsDropped would be greater than zero.
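Checking PktsDropped across a long run of management records is easy to script. The following is a minimal sketch, assuming the `-M xml` output has been saved with one record per line (wrapped lines rejoined first); the function name `summarize_man_records` is made up for illustration and is not part of the Argus clients. The two sample records are copied from the output above with whitespace normalized.

```python
import xml.etree.ElementTree as ET

# Two management records copied from this thread (whitespace normalized).
SAMPLE = '''\
<ArgusManagementRecord StartTime = "1406217072.001" Flags = "         " Proto = "man" PktsRcvd = "16981734" Records = "818535" BytesRcvd = "11704539046" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
<ArgusManagementRecord StartTime = "1406217093.007" Flags = "         " Proto = "man" PktsRcvd = "16139016" Records = "877935" BytesRcvd = "9801629160" PktsDropped = "0" State = "CON" SrcUserData = ""></ArgusManagementRecord>
'''


def summarize_man_records(text):
    """Sum PktsRcvd and PktsDropped over one-record-per-line XML text.

    Hypothetical helper: assumes each management record is a complete
    element on its own line.
    """
    received = dropped = 0
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("<ArgusManagementRecord"):
            continue  # skip blank lines, headers, and flow records
        rec = ET.fromstring(line)
        received += int(rec.get("PktsRcvd", "0"))
        dropped += int(rec.get("PktsDropped", "0"))
    return received, dropped


received, dropped = summarize_man_records(SAMPLE)
print(f"PktsRcvd={received} PktsDropped={dropped}")
```

A non-zero dropped total would point at loss between the libpcap interface and argus. A zero total, as here, only says that argus kept up with what it was handed; loss upstream of the capture interface would not show up in these counters.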
>  
> We recently implemented a bunch of Gigamon taps instead of using SPAN ports on the Catalyst 6500s.  So, I’m guessing that is where the problem may lie.  I’ll have to talk to our netops team to see what kind of troubleshooting tools are available in the Gigamon devices.  You probably have a lot of experience with them, so any suggestions are appreciated.
>  
> As far as the timeout values go, if you remember from last year, Argus was having difficulty identifying the direction of flows in about 15-30% of our traffic.  Since we’re a SaaS environment with customers worldwide, a lot of servers in our infrastructure keep flows open for a really, really long time.  I upped those values in the hope that Argus would be able to keep track of longer running flows better.  It didn’t seem to help much, but it also didn’t  seem to cause any performance issues that I’m aware of.
>  
> That was the reason I asked if it was possible for Argus to have some kind of mechanism for hard-coding the direction of the flow based upon a combination of host/port.  I see that there is a RA_LOCAL_DIRECTION option in the .rarc file.  Is there any way that feature could be extended to include protocol/port information?
>  
> I also tried searching for flows with a source port of 0.  I don’t see any of the SMTP flows with a source port of 0 that racluster was generating.  But I am seeing a bunch of UDP flows using both source and destination port 0.  Most are inbound from the Internet, presumably bad guys scanning us.  I’m still not clear why racluster using a 5-tuple aggregation model would produce so many flows with a source of TCP 0, particularly for SMTP traffic.  Any ideas…?
>  
> I’m going to try running Argus again on the standard ixgbe driver instead of PF_RING and see if that impacts the volume of gaps.  Thanks again for all your help!
>  
> Craig
>  
> From: Carter Bullard [mailto:carter at qosient.com] 
> Sent: Thursday, July 24, 2014 3:02 AM
> To: Craig Merchant
> Cc: Argus
> Subject: Re: [ARGUS] A couple troubleshooting questions...
>  
> Hey Craig,
> Did your man records ever print out ???
>  
> I think you should use the default IP and TCP timeouts.  You're holding onto
> caches waaaaayyyyyyy too long.
>  
> Carter
>  
> On Jul 23, 2014, at 7:24 PM, Craig Merchant <craig.merchant at oracle.com> wrote:
> 
> 
> 
> I am aggregating my flows using the standard 5-tuple model every five minutes.  We’ve got Gigamon taps between all of our top of rack switch clusters and the core switches.
> 
> What does TCP 0 mean if the flows are aggregated with racluster and I’m using a 5-tuple model?
>  
> Is it possible to get the management records by connecting to argus or radium rather than reading a file?  I tried:  ra -S argus_ip:561 -M xml - man, but that didn’t give me any records.  I tried the same thing against my radium instance and I got data, but nothing that includes any performance data.  It looked like:
>  
> <ArgusFlowRecord  StartTime = "2014-07-23T15:46:32.000131" Flags = " * g     " Proto = "tcp" SrcAddr = "15.23.223.33" SrcPort = "10503" Dir = "<?>" DstAddr = "23.67.242.93" DstPort = "47536" Pkts = "9" Bytes = "6997" State = "FIN"></ArgusFlowRecord>
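For tallying how many flows carry the gap flag or an uncertain direction, a record like the one above can be picked apart with a few lines of Python. This is only a sketch under stated assumptions: the helper names `parse_flow` and `classify` are hypothetical, and the checks assume that a `g` in Flags marks a gap and a `?` in Dir marks an undetermined direction, as discussed in this thread. Note that the Dir value as printed here, `<?>`, contains a raw `<`, which a strict XML parser rejects inside an attribute value, so the sketch uses a regular expression instead.

```python
import re

# Matches name = "value" pairs; tolerant of the raw "<" inside Dir = "<?>".
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

# The flow record quoted in this thread.
LINE = ('<ArgusFlowRecord  StartTime = "2014-07-23T15:46:32.000131" '
        'Flags = " * g     " Proto = "tcp" SrcAddr = "15.23.223.33" '
        'SrcPort = "10503" Dir = "<?>" DstAddr = "23.67.242.93" '
        'DstPort = "47536" Pkts = "9" Bytes = "6997" State = "FIN">'
        '</ArgusFlowRecord>')


def parse_flow(line):
    """Return the record's attributes as a dict of strings (hypothetical helper)."""
    return dict(ATTR.findall(line))


def classify(rec):
    """Flag gap and uncertain-direction flows (assumed flag meanings)."""
    return {
        "gap": "g" in rec.get("Flags", ""),
        "unsure_dir": "?" in rec.get("Dir", ""),
        "zero_src_port": rec.get("SrcPort") == "0",
    }


rec = parse_flow(LINE)
print(rec["SrcAddr"], rec["SrcPort"], rec["Dir"], classify(rec))
```

Running `classify` over every record in a capture and counting the results gives the gap and direction percentages without relying on grep patterns.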
>  
> My argus.conf looks like this.  Am I missing something?
>  
> ARGUS_FLOW_TYPE="Bidirectional"
> ARGUS_FLOW_KEY="CLASSIC_5_TUPLE"
> ARGUS_DAEMON=no
> ARGUS_MONITOR_ID="argus01"
> ARGUS_ACCESS_PORT=561
> ARGUS_BIND_IP="10.10.10.10"
> ARGUS_INTERFACE=dnacluster:10@28
> ARGUS_GO_PROMISCUOUS=no
> udp://224.0.20.21:561
> ARGUS_SET_PID=yes
> ARGUS_PID_PATH="/var/run"
> ARGUS_FLOW_STATUS_INTERVAL=5
> ARGUS_MAR_STATUS_INTERVAL=60
> ARGUS_IP_TIMEOUT=900
> ARGUS_TCP_TIMEOUT=1800
> ARGUS_GENERATE_RESPONSE_TIME_DATA=yes
> ARGUS_GENERATE_PACKET_SIZE=yes
> ARGUS_GENERATE_APPBYTE_METRIC=yes
> ARGUS_GENERATE_TCP_PERF_METRIC=yes
> ARGUS_GENERATE_BIDIRECTIONAL_TIMESTAMPS=yes
> ARGUS_CAPTURE_DATA_LEN=10
> ARGUS_SELF_SYNCHRONIZE=yes
> ARGUS_KEYSTROKE="yes"
>  
> As you can see from the interface setting, we are using the PF_RING DNA/Libzero drivers.  I compiled the latest ixgbe drivers from Intel and tried those and the packet loss was as bad or worse than with the PF_RING drivers.  We are using the 5.3.3 version of the drivers which, I believe, have that SELECT() bug that causes argus to run at 100% all of the time.
>  
> I grabbed a bunch of non-aggregated flow records that had gaps in the packets and added the sgap and dgap fields.  I’ll send that to you offline.  For whatever reason, the header row didn’t get printed.  The last two fields are sgap and dgap.  I opened it in Excel and the average sgap is 21,903.  The average dgap is 10,812.
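The same averages can be computed without a spreadsheet. The sketch below assumes whitespace-separated text output in which sgap and dgap are the last two fields of every row, as described above; the function name `average_gaps` and the sample rows are made up for illustration and are not real data from this thread.

```python
def average_gaps(lines):
    """Return (mean sgap, mean dgap) over rows whose last two fields are integers.

    Hypothetical helper: rows that do not end in two integers (a header row,
    a blank line) are skipped rather than treated as zero.
    """
    s_total = d_total = count = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            sgap, dgap = int(fields[-2]), int(fields[-1])
        except ValueError:
            continue  # not a data row
        s_total += sgap
        d_total += dgap
        count += 1
    if count == 0:
        return 0.0, 0.0
    return s_total / count, d_total / count


# Made-up rows in the assumed layout: ..., sgap, dgap
rows = [
    "StartTime Proto SrcAddr Sport Dir DstAddr Dport SrcGap DstGap",
    "15:46:32 tcp 10.0.0.1 25 <?> 10.0.0.2 40000 1000 500",
    "15:46:33 tcp 10.0.0.3 25 <?> 10.0.0.4 40001 3000 1500",
]
print(average_gaps(rows))
```

One caveat with this layout: if a gap column is printed blank for some rows, the last two fields shift, so a delimiter-separated export is the safer input.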
>  
> Each argus instance that we’re running probably sees 3-8 Gbps pretty much 24/7. 
>  
>  
>  
> Thanks for your help!
>  
> From: Carter Bullard [mailto:carter at qosient.com] 
> Sent: Wednesday, July 23, 2014 3:23 PM
> To: Craig Merchant
> Cc: Argus
> Subject: Re: [ARGUS] A couple troubleshooting questions...
>  
> Hey Craig,
> Here are a few suggestions on what to look for.  If you do find something
> please send your observations to the list.
>  
> A couple of things first.  Are you basing these observations on primitive
> argus data (data straight from argus) or from processed data (aggregated
> argus flows ??).
>  
> If these observations are coming from primitive Argus data: 
> the ?’s and ‘g’aps can be indications that your Argus is either not
> getting all the packets from the wire, or there is asymmetric routing,
> such that all the packets don’t come down the wire/interface that you’re
> monitoring.
>  
> Argus management records have the argus packet drop rate in them.  If argus
> isn’t getting all the packets, and the libpcap interface is dropping packets
> then the ‘man’ record will show this.  When you print the man records using
> xml, it will show the number of dropped packets during the reporting interval.
>  
>    ra -S argus.source -M xml - man
>    ra -r repository.file(s) -M xml - man
>  
> If the ‘PktsDropped’ number is gt 0, then argus is having problems keeping up
> with the captured load, and the packet loss is between the libpcap interface
> and argus reading packets from the interface.  This is the only place where
> we can directly report on packet capture infrastructure loss.  If the packets
> are lost in the switch that is port mirroring packets, or if they are dropped
> by the sensor’s capture Ethernet interface, there isn’t any way that we 
> can “ know “ that they were dropped.  
>  
> The ‘g’ap tracking is our way of indicating that we are seeing gaps, which means
> we didn’t see all the packets for this flow.  You can print the size of the gaps,
> from the TCP records “ -s +sgap +dgap” in order to understand how much we
> missed, which can help in your understanding of the problem.
>  
> Because some TCP flow idle times do exceed the Argus default TCP idle time,
> there will be TCP status flow records that have the ‘?’ in them.  To understand
> if this is the case, you have to look earlier in your archive to see if you
> saw this flow before, if so, there is an answer, if not, we’re back to thinking
> that we aren’t seeing all the packets.
>  
> All of that can help in figuring out how bad the issue is and where it might be.
> Packet loss in the collection infrastructure is expected above 1G.  If you are
> port mirroring, it can be expected at any speed, depending on how the mirroring
> is being implemented.
>  
>  
> If these observations are coming from aggregated Argus data:
>  
> the 0 TCP port number and ‘g’aps can be expected when using non-default
> aggregation rules.  If so, we have a somewhat long conversation ahead, but
> it is important to email about it, again if this is the case.
>  
>  
> Carter
>  
> On Jul 23, 2014, at 5:41 PM, Craig Merchant <craig.merchant at oracle.com> wrote:
> 
> 
> 
> 
> I’ve been trying to troubleshoot why Argus is having a tough time determining the direction of flows (approximately 40% of flows).  We also seem to be seeing a fairly high number of flows with gaps (approximately 15%).  Although oddly enough, only about 20% of flows with questionable direction have gaps in them.
>  
> What I am seeing is that the overwhelming majority of traffic with gaps in the sequence numbers have either TCP 0 or TCP 25 as the source port or TCP 25 as the destination.  After doing a little reading (http://www.lovemytool.com/blog/2013/08/the-strange-history-of-port-0-by-jim-macleod.html), TCP 0 doesn’t seem to mean that the source port was defined as 0, but that it means a Layer 4 header wasn’t included in the packet.  This article implies that packet fragmentation is often a cause of this, but I’m not seeing TCP flags indicating any kind of fragmentation.
>  
> What does a packet with TCP 0 as a source port mean in Argus?
>  
> Is there anything special about SMTP that might generate a higher volume of gaps than other types of traffic?  We’re an ESP, so we send and receive a ton of email on behalf of our customers.  But I’m also not seeing gaps in other types of traffic (like HTTPS) between us and the Internet.
>  
> Thanks.
>  
> Craig
>  