Detect packet drops
Peter Van Epp
vanepp at sfu.ca
Thu Jan 26 16:19:26 EST 2012
On Thu, Jan 26, 2012 at 11:37:27AM +0100, elof2 at sentor.se wrote:
>
<snip>
>
> Hi Peter.
> Thanks for your input.
>
> Ah, didn't know about the hidden pcap drop counters. I will take a
> look at it.
>
Note it isn't the be-all and end-all :-). At least on FreeBSD (the
only one I've looked at in detail, and that years ago) it only counts
overflows on the copy from kernel memory into user memory; there are still
a lot of loss points that aren't reported. As well, here is an (again fairly
old) layout of the early 3.0 mar record, which I don't think is documented
anywhere (I dug this out of the source):
sport (argus_mar.dropped) is the pcap loss counter.
argus mar format 3.0
stime ArgusPrintStartDate argus_mar.now.tv_sec
ltime ArgusPrintLastDate src.start.tv_sec
ArgusGetIndicatorString blanks
flgs ArgusPrintFlags blanks
proto ArgusPrintProto "man"
saddr ArgusPrintSrcAddr argus_mar.queue
sport ArgusPrintSrcPort argus_mar.dropped
dir ArgusPrintDir blanks (now version number)
daddr ArgusPrintDstAddr argus_mar.bufs
dport ArgusPrintDstPort argus_mar.clients
spkts ArgusPrintSrcPackets argus_mar.pktsRcvd
dpkts ArgusPrintDstPackets argus_mar.records
sbytes ArgusPrintSrcBytes argus_mar.bytesRcvd
dbytes ArgusPrintDstBytes argus_mar.bytes
state ArgusPrintState state (current?)
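For example (a sketch only: the "- man" filter and the field names are from
memory, so check them against your ra install), something like the following
Python should pull that drop counter back out of the man records in a 3.0 file:

#!/usr/bin/env python
# Sketch: read the pcap drop counter, which (per the mapping above) rides
# in the man record's sport field.  Assumes ra is in the PATH; "- man"
# selects management records and -n keeps the sport field numeric rather
# than resolving it as a service name.
import subprocess

ARGUS_FILE = "argus.out"   # hypothetical file name

out = subprocess.check_output(
    ["ra", "-n", "-r", ARGUS_FILE, "-s", "stime", "sport", "-", "man"],
    text=True)

for line in out.splitlines():
    fields = line.split()
    if len(fields) < 2 or not fields[1].isdigit():
        continue                              # skip headers / blank lines
    stime, dropped = fields[0], fields[1]
    print("man record at %s: pcap drops so far = %s" % (stime, dropped))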
> However... Even though I can see the pcap drop count, I still think
> it would be nice if argus could tag individual flows where it has
> detected gaps.
> The tag would give us argus users a notification that not all
> traffic is monitored 100%. An informative tag just like the
> out-of-order tag or ECN tag.
>
I'd agree, as long as detection is easy, i.e. doesn't impact performance
too much (but I'm not sure it is :-)). Part of the problem is where argus is
looking at the data. There can be loss and, as you note, on span ports (which is
why I always prefer passive taps :-)) packet duplication that isn't present
on the monitored connection. The best cure for that is sensor over-engineering
(to ensure minimal loss on the sensor path), but in reality argus may or may not
detect the loss (or notice if extra packets were inserted), and without more
monitor points it can't tell whether the loss occurred on the real connection or
on the path to the sensor.
> I now realise that my suggestion of having tags like "dropped
> externally" and "dropped internally" is not feasable, since there's
> no way to correlate the pcap drop counter to specific flows, so
> ignore this.
There actually is an internal possibility: the argus sensor coming under too
much load, reducing the granularity of detection, and eventually losing
packets. That could, and probably should, be flagged somehow (possibly in man
records, since it is sensor-related rather than connection-related) so that you
know it is happening and can look at increasing the performance of your sensor.
As the code that reduces the granularity (I believe) only kicks in at high load,
this shouldn't add too much of an extra performance hit, even though it comes at
an already strained time.
>
> Apart from simply being informed that the monitored traffic is not
> 100%, I would also very much like to be able to determine if the
> drops occur outside of the sensor, i.e. the switch drops lots of
> packets while the sensor drops nothing.
> With the tag above, and a pcap-drop-counter in the argus man-records
> it should be easier to spot that external drops occur.
> (Naturally, if you have both external drops and internal drops, it
> will be hard to investigate, but that's always the case.) If I'm sure
> I have 0 drops within my sniffing machine, then all flows tagged
> with gaps must be due to drops in the external switch or tap (or
> faulty DAG/DAC drivers that don't report their own drop count, but
> that is a completely different matter).
>
>
> > Comparing the RMON traffic counts reported by the switch feeding your
> >sensor against the argus counts is another way, although synchronizing the two
> >counts can be exciting :-). Both of these only indicate loss of data that makes
> >it as far as your sensor of course, and aren't an indication of loss elsewhere
> >in the path, but that's a start ...
>
> Hehe, this is not possible since in many cases the SPAN port is not
> managed by me. I just manage the sensor receiving the mirrored
> traffic, but it is someone else who has set up the SPAN
> configuration.
Ah, I'm spoiled :-). I was one of the network engineers as well as the
security person, so I had complete control of the path and taps, and
responsibility for both security and correct network operation, which is both
a blessing and a curse (but more blessing than curse :-)).
> So diffing the reported drop numbers is not practically feasible.
>
> > As well using something like tcpreplay from a pcap file with suitable
> >hardware (which can get very hard at high speed of course :-)) feeding into
> >your sensor can give you a known input traffic pattern to estimate sensor loss
> >as well.
>
> Now you're rather talking about detecting local sensor loss. What
> I'm primarily asking for is a way to easily detect that there is
> external packet loss.
>
Yes, that is correct, but that's where I used to start: first make sure
(as closely as possible) that my sensor can keep up at maximum load, so it isn't
losing (much :-)) traffic. As noted, I was spoiled by having complete control
of the network, so I could grab counters (if available and accurate, which is
another problem :-)) from all the switches. As well, I had several multiport
passive taps in the network for diagnostic reasons, on which I could capture
traffic independent of argus onto a hardware equivalent of wireshark, which
also helped.
> Currently I'm sniffing e.g. 100 000 packets with tcpdump, making
> sure nothing is dropped locally. In this case it took 3 seconds to
> gather 100 000 packets. I scp the pcap file to a machine running
> Wireshark. I open up the "Expert Info Composite" and look at "ACKed
> lost segment" and "Previous segment lost".
> In an environment where the traffic is mirrored correctly, these two
> counters give me an estimate as to how many gaps there are in the
> tcp flows in the pcap file (disregarding a couple of false positives
> at capture startup).
> ...that is, I can see if the people feeding me mirrored traffic have
> problems on their end.
>
> This procedure is quite tiresome. Also, it is unreliable when the
> mirrored packets are received out of order (common in
> redundant/load-balanced environments): then Wireshark will tag
> packets as lost even though they exist.
>
Unfortunately I always found trying to verify much of anything
about capture to be tiresome and time consuming (which is worse because, in our
case as I expect everywhere, skilled staff time was in the shortest supply!).
There are so many variables and points where loss can occur, and relatively few
points where you can monitor, that it's hard. Worse, your tools may be
unreliable unless you have DAGs or hardware sniffers that you have tested to
make sure they capture what's really there. Tcpdump (with the probable
exception of DAG cards with a well-engineered capture system such as a Ninja)
is just as prone to losing packets as argus, which means you are standing in
quicksand when attempting to analyse the loss! Although I never did it, I
suspect a good start would be getting the DAGs to keep (or possibly just
supply, since they may already keep it) a count of packets seen. Due to the
on-card CPU and buffering they will usually see everything that comes in (a
packet may get lost for lack of buffers on the path to argus, but the input
side of the DAG should see it all). Comparing that to the argus packet and byte
counts will give one indication of whether the sensor is losing data, which as
noted is one part of the potential loss.
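As a rough sketch of that comparison (the numbers are made up, and how you
actually read the card-side counter depends on which DAG tools you have):

# Sketch only: compare a packet count taken off the capture card against
# the total pktsRcvd argus reports for the same interval.  Both numbers
# here are hypothetical; obtaining the card-side counter is left to
# whatever DAG tooling is available.
dag_packets_seen = 12345678      # packets the card's input side saw
argus_pkts_rcvd = 12301002       # sum of pktsRcvd from the man records

lost = dag_packets_seen - argus_pkts_rcvd
print("sensor-side loss: %d packets (%.3f%%)"
      % (lost, 100.0 * lost / dag_packets_seen))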
There is one additional argus (or more correctly argus + post-
processing) metric that is useful on tcp connections: use ra to output both
the usual packet/byte counts and the app packet/byte counts, and use a perl
(or ?) script to compare the two. On a good connection the normal byte counts
will be ~15% larger than the app byte counts (which measure the tcp data
actually delivered to the application). Numbers much higher than that indicate
packet loss in that flow (although they don't tell you anything about where the
loss is, of course).
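Something along these lines (Python rather than perl, untested, and the
appbyte field names are my guess from memory, so check them against ra on your
version) should do as a starting point:

#!/usr/bin/env python
# Sketch: flag tcp flows whose wire byte counts are much larger than the
# application byte counts, which (per the rule of thumb above) suggests
# retransmission and therefore loss somewhere in the flow.  Assumes ra is
# in the PATH and that sbytes/dbytes/sappbytes/dappbytes are the right
# field names on your argus-clients install.
import subprocess

ARGUS_FILE = "argus.out"   # hypothetical file name
THRESHOLD = 1.30           # well above the ~15% normal header overhead

out = subprocess.check_output(
    ["ra", "-n", "-r", ARGUS_FILE,
     "-s", "saddr", "daddr", "sbytes", "dbytes", "sappbytes", "dappbytes",
     "-", "tcp"],
    text=True)

for line in out.splitlines():
    f = line.split()
    if len(f) < 6 or not f[2].isdigit():
        continue                      # skip headers and unparsable lines
    wire = int(f[2]) + int(f[3])      # bytes on the wire, retransmits included
    app = int(f[4]) + int(f[5])       # bytes delivered to the application
    if app > 0 and wire > THRESHOLD * app:
        print("possible loss: %s -> %s  wire=%d app=%d ratio=%.2f"
              % (f[0], f[1], wire, app, float(wire) / app))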
The best I can say is good luck, and tell us if you find something that
works well (although I doubt that it's going to be easy at all :-)), and I
hope some of this helps somewhat.
Peter Van Epp