A couple troubleshooting questions...

Peter Van Epp vanepp at sfu.ca
Fri Jul 25 17:02:30 EDT 2014


On Thu, Jul 24, 2014 at 09:22:51AM -0700, Craig Merchant wrote:
> Just got some:
> 
>  
> 
> <ArgusManagementRecord  StartTime = "1395380563.321"    Flags = "         "     Proto = "man"   PktsRcvd = "0"  Records = "0"   BytesRcvd = "0"         PktsDropped = "0"       State = "STA" SrcUserData = ""></ArgusManagementRecord>
> 
<snip>
> 
> While these records were being generated, I ran the ra client and grep'd for '*\sg' and I saw a ton of flows with gaps.  So, from what you said earlier in the thread, if the problem is that Argus can't keep up, PktsDropped would be greater than zero.
> 
>  
> 
> We recently implemented a bunch of Gigamon taps instead of using SPAN ports on the Catalyst 6500s.  So, I'm guessing that is where the problem may lie.  I'll have to talk to our netops team to see what kind of troubleshooting tools are available in the Gigamon devices.  You probably have a lot of experience with them, so any suggestions are appreciated.
> 

	Welcome to the fun game of "packet, packet, who lost the packet?" :-).
While I'm rusty (I've been retired for 5+ years now), when I first started
using argus 20+ years ago I tested (and verified to my satisfaction) that
argus could accurately count traffic, and some of the methods I used then
should still apply. For background, there used to be some notes on hardware
performance and the various places packet loss can occur on the argus web
site (I no longer have the url to hand). That said, I'd hope that the Gigamon
solution would do better than a span port on a switch, as it is designed to
capture packets rather than doing it as an afterthought when resources are
available, as a span port does.
	While I looked at Gigamon just before I retired, I don't have direct
experience with them. However, I think you should be able to implement what
Netoptics called regen taps with the Gigamon, i.e. two (or more) monitor
ports outputting the same data. I had a 4 port optical regen tap (at gig in
those days) and a gig sniffer that could capture at wire speed (for a small
amount of time, unfortunately :-)) to test with without impacting production,
which was nice. I think (your netops folks willing; I was lucky enough to be
both net engineering and the security guy) that the Gigamon should let you do
the same thing, i.e. give you a test tap of the same data that argus is
seeing on another port, where you can try and see what is happening. I say
try and see because your test setup has a similar problem to argus: if the
limitation is hardware in the interface cards losing packets (as opposed to
the packets being lost in the main real connection or in the Gigamon, which
are also both possible), the test system may lose packets too. However, the
test system isn't also trying to process the packets as argus is, so if the
loss is in fact in the argus system it may do better.
	The ideal situation is to have a capture device (such as a network
sniffer or Endace Ninja) that you are sure can keep up with the wire at least
for a while, and/or a test generator that can generate known traffic in a
test setup (I used to use tcpreplay for that; at 10 gigs I expect you are in
Ninja country). At 10 gigs all of this is challenging, however (as it was
even at gig back in the day :-)). One possibly useful thing (again with the
warning about potential internal performance issues) is counters and
statistics from the network switches and Gigamon. It can be very instructive
(and equally hard to do outside a test environment with known traffic) to
compare the packet and byte counters that your core switches, the Gigamon and
argus report for the same time interval (which is usually the rub: finding a
correct time interval). Here longish sample times tend to even out the
truncation errors caused by uneven start and end times (i.e. loss of a couple
of hundred packets in 10s of thousands is less significant than loss of 100s
of packets in a 1000 packet sample). As noted, the ideal (but possibly too
expensive) way is to have an IP traffic generator that can replay a
repeatable, longish, typical traffic sample (perhaps recorded from the link
with tcpdump) at wire speed in a test setup. That enables you to identify
what is losing traffic: the Gigamon, the network interface cards in the argus
box at a hardware level, the OS level tap software (pcap, pf-ring etc.), or
the argus process itself (or all of them, which is unfortunately also
possible :-)).
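
	As a rough sketch of the counter comparison described above: the
function names and every counter value below are made up for illustration
and do not come from any real switch, Gigamon or argus output. It assumes you
have already collected packet counter deltas from each measurement point over
what you believe is the same interval.

```python
def loss_fraction(upstream_pkts, downstream_pkts):
    """Fraction of upstream packets that never showed up downstream."""
    if upstream_pkts <= 0:
        raise ValueError("upstream packet count must be positive")
    return (upstream_pkts - downstream_pkts) / upstream_pkts


def boundary_error_fraction(sample_pkts, boundary_pkts):
    """Share of a sample that misaligned start/end times alone could explain."""
    if sample_pkts <= 0:
        raise ValueError("sample size must be positive")
    return boundary_pkts / sample_pkts


# Made-up packet counter deltas for one shared interval, in path order.
counters = [("core switch", 50_000), ("gigamon", 50_000), ("argus", 49_100)]

# Compare each measurement point with the one just upstream of it.
hop_loss = {
    f"{up_name} -> {down_name}": loss_fraction(up_pkts, down_pkts)
    for (up_name, up_pkts), (down_name, down_pkts) in zip(counters, counters[1:])
}

# The same 200 packet boundary slop in a short and a long sample.
short_sample = boundary_error_fraction(1_000, 200)
long_sample = boundary_error_fraction(50_000, 200)

for hop, frac in hop_loss.items():
    print(f"{hop}: {frac:.2%} lost")
print(f"boundary error, 1000 packet sample: {short_sample:.1%}")
print(f"boundary error, 50000 packet sample: {long_sample:.1%}")
```

	With these invented numbers the difference appears between the Gigamon
and argus. An apparent loss is only worth chasing when it is clearly larger
than what the boundary error for that sample size could account for.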
	Another useful tool can be a tap (hopefully optical, to lessen tap loss
issues) and a wire speed network switch with rmon or at least port stats that
is doing nothing else, so it has CPU available to give you hopefully accurate
packet and byte counts at various places in the network. The production path
is probably the hardest to arrange, but if you can get a tap installed, the
output of the tap (modulo security, privacy and political issues) is then
available for use without being able to impact the production network. If
resources are available, an Endace Ninja 10 gig capture appliance (hundreds
of K $$$ last I knew :-)) makes an excellent troubleshooting tool. It can
capture at close to wire speed for a reasonable period of time (the one I
specced for the local regional network before I retired had 16 terabytes of
disk storage for captures). With such a capture from the production network,
it should be possible to replay the traffic into a test setup and see exactly
what is happening, which is the ideal troubleshooting environment: known,
repeatable data.
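
	To put a rough number on "a reasonable period of time", here is a back
of the envelope sketch. It assumes decimal terabytes, a link running at a
fixed fraction of 10 gigabits per second, and no capture format or file
system overhead; the function name and the utilization parameter are
hypothetical, not anything a capture appliance provides.

```python
def capture_seconds(disk_bytes, link_bits_per_sec, utilization=1.0):
    """Seconds until the capture disk fills at a steady traffic rate."""
    if link_bits_per_sec <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return disk_bytes * 8 / (link_bits_per_sec * utilization)


DISK_BYTES = 16e12   # 16 terabytes, assumed decimal (10**12 bytes each)
LINK_BPS = 10e9      # 10 gigabit link

# Worst case: the link is completely full for the whole capture.
full_rate = capture_seconds(DISK_BYTES, LINK_BPS)
print(f"at full line rate: {full_rate:.0f} s ({full_rate / 3600:.1f} h)")

# An assumed, lighter average load stretches the capture window accordingly.
light_load = capture_seconds(DISK_BYTES, LINK_BPS, utilization=0.25)
print(f"at 25% utilization: {light_load / 3600:.1f} h")
```

	Under those assumptions 16 terabytes holds roughly three and a half
hours of a completely full 10 gig link, and proportionally longer at lower
average load.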
	As you have probably gathered, this is also a huge amount of work
to add to what you already need to do. I was interested in the results and
thus used to do this on my own time after hours, as I usually couldn't
justify it as directly work related, although I was also fortunate in having
bosses that saw the value in argus and supported it with both capital for
tools and staff time. It comes down to how valuable having accurate (and for
what value of accurate :-)) data is to your bosses, and somewhat to whether
you are interested in figuring out what's happening using tools that you
couldn't afford on your own.
	Hope some of this helps, and good luck!

Peter Van Epp


 


