Detect packet drops
Peter Van Epp
vanepp at sfu.ca
Thu Jan 26 16:19:26 EST 2012
On Thu, Jan 26, 2012 at 11:37:27AM +0100, elof2 at sentor.se wrote:
>
<snip>
>
> Hi Peter.
> Thanks for your input.
>
> Ah, didn't know about the hidden pcap drop counters. I will take a
> look at it.
>
Note it isn't the be-all and end-all :-). At least on FreeBSD (the
only one I've looked at in detail, and that years ago) it only counts
overflows on the copy from kernel memory into user memory; there are still
a lot of loss points that aren't reported. As well, here is an (again fairly
old) layout of the early 3.0 mar record, which I don't think is documented
anywhere (I dug this out of the source):
sport (argus_mar.dropped) is the pcap loss counter.
argus mar format 3.0
stime ArgusPrintStartDate argus_mar.now.tv_sec
ltime ArgusPrintLastDate src.start.tv_sec
ArgusGetIndicatorString blanks
flgs ArgusPrintFlags blanks
proto ArgusPrintProto "man"
saddr ArgusPrintSrcAddr argus_mar.queue
sport ArgusPrintSrcPort argus_mar.dropped
dir ArgusPrintDir blanks (now version number)
daddr ArgusPrintDstAddr argus_mar.bufs
dport ArgusPrintDstPort argus_mar.clients
spkts ArgusPrintSrcPackets argus_mar.pktsRcvd
dpkts ArgusPrintDstPackets argus_mar.records
sbytes ArgusPrintSrcBytes argus_mar.bytesRcvd
dbytes ArgusPrintDstBytes argus_mar.bytes
state ArgusPrintState state (current?)
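For example (a sketch only: the "- man" filter and the field names are from
memory, so check them against your ra install), something like the following
Python should pull that drop counter back out of the man records in a 3.0 file:

#!/usr/bin/env python
# Sketch: read the pcap drop counter, which (per the mapping above) rides
# in the man record's sport field.  Assumes ra is in the PATH; "- man"
# selects management records and -n keeps the sport field numeric rather
# than resolving it as a service name.
import subprocess

ARGUS_FILE = "argus.out"   # hypothetical file name

out = subprocess.check_output(
    ["ra", "-n", "-r", ARGUS_FILE, "-s", "stime", "sport", "-", "man"],
    text=True)

for line in out.splitlines():
    fields = line.split()
    if len(fields) < 2 or not fields[1].isdigit():
        continue                              # skip headers / blank lines
    stime, dropped = fields[0], fields[1]
    print("man record at %s: pcap drops so far = %s" % (stime, dropped))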
> However... Even though I can see the pcap drop count, I still think
> it would be nice if argus could tag individual flows where it has
> detected gaps.
> The tag would give us argus users a notification that not all
> traffic is monitored 100%. An informative tag just like the
> out-of-order tag or ECN tag.
>
I'd agree, as long as detection is easy, i.e. doesn't impact performance
too much (but I'm not sure it is :-)). Part of the problem is where argus is
looking at the data. There can be loss and, as you note, on span ports (which is
why I always prefer passive taps :-)) packet duplication that isn't present
on the monitored connection. The best cure for that is sensor over-engineering
(to ensure minimal loss on the sensor path), but in reality argus may or may not
detect the loss (or notice if extra packets were inserted), and without more
monitor points it can't tell whether the loss occurred on the real connection or
on the path to the sensor.
> I now realise that my suggestion of having tags like "dropped
> externally" and "dropped internally" is not feasable, since there's
> no way to correlate the pcap drop counter to specific flows, so
> ignore this.
There actually is an internal possibility: the argus sensor coming under too
much load, reducing the granularity of detection, and eventually losing
packets. That could, and probably should, be flagged somehow (possibly in man
records, since it is sensor-related rather than connection-related) so that you
know it is happening and can look at increasing the performance of your sensor.
As the code that reduces the granularity (I believe) only kicks in at high load,
this shouldn't add too much of an extra performance hit, even though it comes at
an already strained time.
>
> Apart from simply being informed that the monitored traffic is not
> 100%, I would also very much like to be able to determine if the
> drops occur outside of the sensor, i.e. the switch drops lots of
> packets while the sensor drops nothing.
> With the tag above, and a pcap-drop-counter in the argus man-records
> it should be easier to spot that external drops occur.
> (Naturally, if you have both external drops and internal drops, it
> will be hard to investigate, but that's always the case.) If I'm sure
> I have 0 drops within my sniffing machine, then all flows tagged
> with gaps must be due to drops in the external switch or tap (or
> faulty DAG/DAC drivers that don't report their own drop count, but
> that is a completely different matter).
>
>
> > Comparing the RMON traffic counts reported by the switch feeding your
> >sensor against the argus counts is another way, although synchronizing the two
> >counts can be exciting :-). Both of these only indicate loss of data that makes
> >it as far as your sensor of course, and aren't an indication of loss elsewhere
> >in the path, but that's a start ...
>
> Hehe, this is not possible since in many cases the SPAN port is not
> managed by me. I just manage the sensor receiving the mirrored
> traffic, but it is someone else who has set up the SPAN
> configuration.
Ah, I'm spoiled :-). I was one of the network engineers as well as the
security person, so I had complete control of the path and taps, and
responsibility for both security and correct network operation, which is both
a blessing and a curse (but more blessing than curse :-)).
> So diffing the reported drop numbers is not practically feasible.
>
> > As well using something like tcpreplay from a pcap file with suitable
> >hardware (which can get very hard at high speed of course :-)) feeding into
> >your sensor can give you a known input traffic pattern to estimate sensor loss
> >as well.
>
> Now you're rather talking about detecting local sensor loss. What
> I'm primarily asking for is a way to easily detect that there is
> external packet loss.
>
Yes, that is correct, but that's where I used to start: first make sure
(as closely as possible) that my sensor can keep up at maximum load, so it isn't
losing (much :-)) traffic. As noted, I was spoiled by having complete control
of the network, so I could grab counters (if available and accurate, which is
another problem :-)) from all the switches. As well, I had several multiport
passive taps in the network for diagnostic reasons, on which I could capture
traffic independent of argus onto a hardware equivalent of wireshark, which
also helped.
> Currently I'm sniffing e.g. 100 000 packets with tcpdump, making
> sure nothing is dropped locally. In this case it took 3 seconds to
> gather 100 000 packets. I scp the pcap file to a machine running
> Wireshark. I open up the "Expert Info Composite" and look at "ACKed
> lost segment" and "Previous segment lost".
> In an environment where the traffic is mirrored correctly, these two
> counters give me an estimate as to how many gaps there are in the
> tcp flows in the pcap file (disregarding a couple of false positives
> at capture startup).
> ...that is, I can see if the people feeding me mirrored traffic have
> problems on their end.
>
> This procedure is quite tiresome. Also, it is unreliable when the
> mirrored packets are received out of order (common in
> redundant/load-balanced environments): then Wireshark will tag
> packets as lost even though they exist.
>
Unfortunately I always found trying to verify much of anything
about capture to be tiresome and time consuming (which is worse because, in our
case as I expect everywhere, skilled staff time was in the shortest supply!).
There are so many variables and points where loss can occur, and relatively few
points where you can monitor, that it's hard. Worse, your tools may be
unreliable unless you have DAGs or hardware sniffers that you have tested to
make sure they capture what's really there. Tcpdump (with the probable
exception of DAG cards with a well-engineered capture system such as a Ninja)
is just as prone to losing packets as argus, which means you are standing in
quicksand when attempting to analyse the loss! Although I never did it, I
suspect a good start would be getting the DAGs to keep (or possibly just
supply, since they may already keep it) a count of packets seen. Due to the
on-card CPU and buffering they will usually see everything that comes in (a
packet may get lost for lack of buffers on the path to argus, but the input
side of the DAG should see it all). Comparing that to the argus packet and byte
counts will give one indication of whether the sensor is losing data, which as
noted is one part of the potential loss.
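As a rough sketch of that comparison (the numbers are made up, and how you
actually read the card-side counter depends on which DAG tools you have):

# Sketch only: compare a packet count taken off the capture card against
# the total pktsRcvd argus reports for the same interval.  Both numbers
# here are hypothetical; obtaining the card-side counter is left to
# whatever DAG tooling is available.
dag_packets_seen = 12345678      # packets the card's input side saw
argus_pkts_rcvd = 12301002       # sum of pktsRcvd from the man records

lost = dag_packets_seen - argus_pkts_rcvd
print("sensor-side loss: %d packets (%.3f%%)"
      % (lost, 100.0 * lost / dag_packets_seen))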
There is one additional argus (or more correctly argus + post-
processing) metric that is useful on tcp connections: use ra to output both
the usual packet/byte counts and the app packet/byte counts, and use a perl
(or ?) script to compare the two. On a good connection the normal byte counts
will be ~15% larger than the app byte counts (which measure the tcp data
actually delivered to the application). Numbers much higher than that indicate
packet loss in that flow (although they don't tell you anything about where the
loss is, of course).
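Something along these lines (Python rather than perl, untested, and the
appbyte field names are my guess from memory, so check them against ra on your
version) should do as a starting point:

#!/usr/bin/env python
# Sketch: flag tcp flows whose wire byte counts are much larger than the
# application byte counts, which (per the rule of thumb above) suggests
# retransmission and therefore loss somewhere in the flow.  Assumes ra is
# in the PATH and that sbytes/dbytes/sappbytes/dappbytes are the right
# field names on your argus-clients install.
import subprocess

ARGUS_FILE = "argus.out"   # hypothetical file name
THRESHOLD = 1.30           # well above the ~15% normal header overhead

out = subprocess.check_output(
    ["ra", "-n", "-r", ARGUS_FILE,
     "-s", "saddr", "daddr", "sbytes", "dbytes", "sappbytes", "dappbytes",
     "-", "tcp"],
    text=True)

for line in out.splitlines():
    f = line.split()
    if len(f) < 6 or not f[2].isdigit():
        continue                      # skip headers and unparsable lines
    wire = int(f[2]) + int(f[3])      # bytes on the wire, retransmits included
    app = int(f[4]) + int(f[5])       # bytes delivered to the application
    if app > 0 and wire > THRESHOLD * app:
        print("possible loss: %s -> %s  wire=%d app=%d ratio=%.2f"
              % (f[0], f[1], wire, app, float(wire) / app))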
The best I can say is good luck, and tell us if you find something that
works well (although I doubt that it's going to be easy at all :-)), and I
hope some of this helps somewhat.
Peter Van Epp