Another vote for packet drop detection

Sun Jan 29 08:56:09 EST 2012

OK, I think that we are already doing all that is needed in argus to report on
suspicion of infrastructure loss.  I have written about this for many years, and
I'm glad that we're talking about this topic now.  We have done so much work
to make sure that we're pretty accurate in reporting loss, and many have put
Argus through the wringer on this.  (If you think you see a bug in simply reporting
loss, please send the list a packet trace that demonstrates it, and I'll fix
it.)  If Argus is doing a good job estimating loss, then we're talking about only
adding another metric to differentiate unobserved traffic from loss, based on
your criteria.

So, let's talk about how we can estimate the amount of unobserved traffic.  TCP
provides you with a number of possibilities for knowing what the offered load is/was,
i.e. the amount of traffic that was actually transmitted.  The most reliable is the total
number of bytes.  This is derived by differencing the closing sequence number
and the initial base sequence number, adjusting for rollover.  This is an excellent
number, as it tells you the exact number of bytes reliably transmitted (Br).
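
As a rough illustration of that differencing (a minimal sketch, not Argus's
actual code), Br can be computed with modulo-2^32 arithmetic.  The function
name, the `wraps` counter for flows larger than 4 GB, and the assumption that
the SYN and FIN each consume one sequence number that has to be subtracted
are all illustrative choices here.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around


def reliable_bytes(base_seq, closing_seq, wraps=0, syn=True, fin=True):
    """Estimate Br, the bytes reliably transmitted in one direction.

    base_seq    -- initial sequence number (taken from the SYN)
    closing_seq -- sequence number just past the last byte or flag sent
    wraps       -- hypothetical count of full 2**32 rollovers seen mid-flow
    syn, fin    -- whether a SYN / FIN was part of the span; each one
                   consumes a sequence number but carries no data
    """
    span = (closing_seq - base_seq) % SEQ_MOD + wraps * SEQ_MOD
    return span - int(syn) - int(fin)


# A flow whose sequence space rolls over once near the top of the range.
print(reliable_bytes(4294967000, 704))
```

The modulo handles a single rollover on its own; anything longer than 4 GB
needs the rollover count tracked separately, which is what `wraps` stands in
for.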

Now this is the TCP bytes.  Argus tracks this stat, and today reports the TCP bytes
observed (Bo).  This number is the TCP bytes for all the transmitted packets observed,
original data (Od) and retransmissions (Rb).  If you compare Od with Br,
you will know how many bytes you didn't see.  How many packets you
didn't see will have to be estimated.
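
To make the arithmetic concrete, here is a minimal sketch, assuming Od is
simply Bo minus Rb and that the missing packets were about the same size as
the ones that were seen.  The function name and the mean-payload estimator
are assumptions for illustration, not something Argus reports today.

```python
import math


def estimate_unobserved(br, bo, rb, observed_data_pkts):
    """Return (unobserved_bytes, estimated_unobserved_packets).

    br -- bytes reliably transmitted, from the sequence number span
    bo -- TCP bytes observed (original data plus retransmissions)
    rb -- retransmitted bytes observed
    observed_data_pkts -- packets that carried original data
    """
    od = bo - rb                      # original data bytes observed
    missing_bytes = max(br - od, 0)   # bytes that never reached the sensor
    if missing_bytes == 0 or od <= 0 or observed_data_pkts <= 0:
        return missing_bytes, None    # nothing to estimate, or no basis for it
    mean_payload = od / observed_data_pkts
    # Assumption: unobserved packets had roughly the mean observed payload.
    return missing_bytes, math.ceil(missing_bytes / mean_payload)


print(estimate_unobserved(br=93440, bo=90520, rb=2920, observed_data_pkts=60))
```

The byte count is exact as far as the inputs are; the packet count is only
as good as the assumption about packet size.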

We're already doing this, so if this works for you, I can report the metric in the
next release.

Gaps.  Argus detects TCP gaps today, as a secondary metric to the windowed
out-of-order sequence numbers it's trying to calculate.  We don't report the gaps,
though.  If you would like Argus to report actual gaps, that is doable, no problem.
With gaps, together with the total sequence bytes transmitted and the actual observed
bytes, you get, I think, the best numbers for estimating unobserved data.
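
For illustration only, one way to find gaps from a single direction is to
sort the observed segments and look for holes in the sequence space.  This
sketch is not how Argus does it internally; it assumes sequence numbers have
already been made relative to the ISN and unwrapped, and the function name
is made up.

```python
def find_gaps(segments, start, end):
    """List the half-open ranges of [start, end) that were never observed.

    segments -- iterable of (rel_seq, length) pairs for observed data
    start    -- first expected relative sequence number
    end      -- relative sequence number just past the last expected byte
    """
    gaps = []
    covered_to = start
    # Sorting means out-of-order arrivals and retransmissions are not
    # mistaken for gaps; only sequence space never seen at all is reported.
    for seq, length in sorted(segments):
        if length <= 0:
            continue
        if seq > covered_to:
            gaps.append((covered_to, min(seq, end)))
        covered_to = max(covered_to, seq + length)
        if covered_to >= end:
            break
    if covered_to < end:
        gaps.append((covered_to, end))
    return gaps


observed = [(1, 100), (301, 100), (101, 100), (301, 100), (501, 100)]
print(find_gaps(observed, start=1, end=701))
```

Summing the lengths of the returned ranges gives the unobserved bytes for
that direction, which should agree with Br minus Od.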

Would this be useful?  What do you want to call it?

In the absence of TCP selective retransmissions, there is a little complexity to
tracking ACK'd sequence numbers at high speeds.  As you know, because
ACKs can be lost, dealing with that single exception generates a bit of complexity.
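
A rough sketch of the bidirectional "ACK'd data not seen" inference follows,
assuming cumulative ACKs only (no SACK), relative and unwrapped sequence
numbers, and events processed in capture order.  The event tuples and the
function name are hypothetical.  Because ACKs are cumulative, a lost ACK is
covered by the next one, but a sensor that sees an ACK before the data it
acknowledges would produce a false positive here.

```python
def acked_unseen_bytes(events, first_seq):
    """Count bytes the receiver acknowledged that the sensor never saw.

    events    -- capture-ordered tuples: ("data", rel_seq, length) from the
                 sender, or ("ack", rel_ack) from the receiver
    first_seq -- relative sequence number of the first data byte
    """
    seen = []             # observed data ranges as (start, end), half-open
    acked_to = first_seq  # highest cumulative ACK processed so far
    unseen = 0
    for ev in events:
        if ev[0] == "data":
            _, seq, length = ev
            if length > 0:
                seen.append((seq, seq + length))
            continue
        ack = ev[1]
        if ack <= acked_to:
            continue      # duplicate or stale ACK adds no information
        pos = acked_to
        # Walk the newly acknowledged range and add up the uncovered parts.
        for s, e in sorted(seen):
            if e <= pos:
                continue
            if s >= ack:
                break
            if s > pos:
                unseen += s - pos
            pos = max(pos, min(e, ack))
        if pos < ack:
            unseen += ack - pos
        acked_to = ack
    return unseen


trace = [("data", 1, 100), ("ack", 101),
         ("data", 201, 100), ("ack", 301),   # bytes 101..200 never seen
         ("data", 301, 100), ("ack", 301), ("ack", 401),
         ("ack", 501)]                       # bytes 401..500 never seen
print(acked_unseen_bytes(trace, first_seq=1))
```

Re-sorting the seen list on every ACK is fine for a sketch but not for a
sensor at high speed, which is part of the complexity mentioned above.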

I think the best solution would be an approach that estimates gaps from the
perspective of a unidirectional TCP sensor.  I think I can do that in a day or so.

Carter

On Jan 28, 2012, at 12:55 PM, Charles Smutz wrote:

> 
> Carter,
> 
> I'd like to put in another plug for packet drop detection in argus. There are many people who could use this. In many cases, people are running sensors where the various places where packet loss is reported all show 0 (pcap drop, ifconfig drop, ethtool -S) but there is still loss (in tapping infrastructure, link overflows not reported by NIC, etc).
> 
> Note that I'm concerned about loss that occurs in the network monitor, not loss in the network. We all know packets are lost and that's dandy, but I want to make sure my network monitor is seeing everything traversing the network (and if it happens to see some things twice because normal drop occurs after my visibility point--that's the least of my worries--especially for network flow data). If packets do traverse the network and I don't see them, I consider that a very bad thing. This can happen in places, as I mentioned before, that are not reported and so often go unnoticed.
> 
> I've discussed methods for doing this in this blog post:
> http://smusec.blogspot.com/2010/06/flushing-out-leaky-taps.html
> 
> Wireshark seems to have the best capabilities for doing this of any network monitoring tool that I know of, but as many have pointed out, these counters are actually often inaccurate :(
> 
> In addition to this thread, see https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=6081
> 
> Note that both Gyorgy and I have been clear in exactly what we're looking for and have even provided solid pcap examples.
> 
> What many want is to be able to have a network monitoring tool report any packets (or tcp streams if you need to be pedantic--you would be guessing on the number of packets) where the network monitor saw the stream data ACK'd but didn't see the stream data itself. In that case I can infer with strong confidence that the endpoint thinks he saw data that the monitor didn't. In the vast majority of cases, that will be because of network visibility or a tapping issue. Despite the inherent limitations, this sort of analysis is extremely valuable for quantifying and debugging loss in network monitoring equipment (especially places where the debugger can't see reported loss or equipment reporting loss lies). Argus doesn't need to try to figure out where the loss occurs; it just needs to be able to detect loss through tcp "ack data not seen" inference. The user can compare this to other places where he can quantify loss--the most interesting being when everything else is zero (usually means bad taps, etc).
> 
> I'm not quite sure how easy this would be to implement in argus, and certainly it would only work in cases where you see (or think you should be seeing) bi-directional data. If argus could do this, possibly as mar record stat, that would make me a very happy man. In my opinion this capability fits within argus at least as well, if not better, than full content analysis tools like wireshark because we're just dealing with layer 4 metadata here--no need to look at content for this. Having argus do this would make it easy to alert on and would allow me to debug stuff that I can't do very easily now. This capability would be useful for people who do go to great lengths to do things right (good taps, good sensors, etc) but who need to verify that everything is working well and alert when it isn't.
> 
> As always, thanks for a great tool,
> 
> Charles
> 
> 
> 
> 
> On 1/28/2012 12:00 PM, argus-info-request at lists.andrew.cmu.edu wrote:
>> Message: 1
>> Date: Fri, 27 Jan 2012 14:16:47 -0500
>> From: Carter Bullard<carter at qosient.com>
>> Subject: Re: [ARGUS] Detect packet drops
>> To: elof2 at sentor.se
>> Cc: Argus Development<argus-info at lists.andrew.cmu.edu>
>> Message-ID:<FC2964C1-C541-4554-844A-48BB224D84A8 at qosient.com>
>> Content-Type: text/plain; charset="us-ascii"
>> 
>> Hey /Elof,
>> OK, as I have mentioned before, we do distinguish between 'skipped' sequence numbers,
>> out of order sequence numbers, and retransmitted numbers (data and ACKs).  Duplicates,
>> such as multiple copies of the exact same packet, are detectable and I put code in to do
>> this, although I don't have any packet files that have the conditions that you describe to
>> verify if they are correct or not, so I haven't finished the support.
>> 
>> The problems in guaranteeing that you can count every drop, in this case for TCP are these.
>> 
>> Because TCP is reliable, there aren't going to be any gaps, if you see all the traffic.  If you
>> see gaps, it is only because the sensor isn't seeing all the packets, and  you have to try
>> to figure out why.  Did the network not pass the traffic past your sensor, because of asymmetric
>> routing, or stripping, load balancing, or path failure, or did the sensor drop a packet that
>> actually came by.  This is an impossible thing to know, unless you can find a pattern for
>> the loss.
>> 
>> If the sensor is watching TCP traffic at a point in the network prior to the loss point,
>> the sensor will see retransmissions, multiple instances of the same sequence numbers.
>> The sender will retransmit traffic because the receiver states that he hasn't seen the traffic.
>> But there is a race condition, where the receiver receives the packet late.  No loss will have
>> occurred but there are retransmissions.
>> 
>> If the sensor is past the loss point, you won't see any drops, because TCP is reliable.  You will
>> see out of order packets.  So out of order is an indication of loss?  Not necessarily, the
>> network can deliver them out of order.  The time domain for the out-of-order is the best
>> way to tell what is going on.
>> 
>> Because there are generally more than one point where loss can occur, your sensor will see
>> all sorts of weird combinations of the above behavior.
>> 
>> The best way to see all indications of loss is to look at the ACK behavior from the receiver.
>> Selective ACK advertisements are the best way to track loss, as you'll get a fine grain reporting
>> of what the receiver didn't receive.  Without selective ACK, you don't know how many packets
>> in a window were lost, you just know that at least 1 was lost.
>> 
>> Argus is doing what you are asking for.  If you want specific counters to try to get more info, I
>> can report them.  But anything outside of what Argus is already doing, I'm thinking, is not possible
>> to detect.
>> 
>> So, tell me what counters you want.  In your example Argus is already doing better than wireshark.
>> 
>> But I would also like to see the discrepancy between Argus and wireshark.  Argus gives a drop
>> count, regardless of how we calculate it.  How does it compare to wireshark's?
>> 
>> So what's the big deal ?  Are you so into the QoS part of this that each packet lost is important
>> to your analysis?
>> 
>> 
>> Carter
>> 
>> On Jan 26, 2012, at 5:37 AM, elof2 at sentor.se wrote:
>> 
>>> On Wed, 25 Jan 2012, Peter Van Epp wrote:
>>>> On Wed, Jan 25, 2012 at 02:02:08PM +0100, elof2 at sentor.se wrote:
>>>>> Any more thoughts or progress with this?
>>>>> 
>>>>> I just realised that I can't even rely on Wireshark for an estimate
>>>>> of dropped packets, since Wireshark's Expert Info "ACKed lost
>>>>> segment" tag out-of-order FIN-packets as "ACKed lost segment".
>>>>> 
>>>>> What I'm looking for is not a 100% accurate system to count every
>>>>> missing packet (which is impossible to determine), but a flag on
>>>>> each session that argus know is missing one or more packets.
>>>>> Just like the flag for retransmission doesn't say how many
>>>>> retransmissions there were in a tcp flow.
>>>> 	Checking the pcap reported loss rate (it's in the man records which
>>>> you have to enable to see these days) will give you an indication. Although
>>>> it is only one of the several ways your sensor can be losing packets, it is one
>>>> good indication of how your sensor is doing. There is an explanation of a
>>>> number of the possible (and usually invisible) loss points in a sensor on
>>>> Carter's web site at http://www.qosient.com/argus/sensorPerformance.shtml as
>>>> well.
>>> 
>>> Hi Peter.
>>> Thanks for your input.
>>> 
>>> Ah, didn't know about the hidden pcap drop counters. I will take a look at it.
>>> 
>>> However... Even though I can see the pcap drop count, I still think it would be nice if argus could tag individual flows where it has detected gaps.
>>> The tag would give us argus users a notification that not all traffic is monitored 100%. An informative tag just like the out-of-order tag or ECN tag.
>>> 
>>> I now realise that my suggestion of having tags like "dropped externally" and "dropped internally" is not feasible, since there's no way to correlate the pcap drop counter to specific flows, so ignore this.
>>> 
>>> Apart from simply being informed that the monitored traffic is not 100%, I would also very much like to be able to determine if the drops occur outside of the sensor, i.e. the switch drops lots of packets while the sensor drops nothing.
>>> With the tag above, and a pcap-drop-counter in the argus man-records it should be easier to spot that external drops occur.
>>> (naturally, if you have both external drops and internal drops, it will be hard to investigate, but that's always the case. If I'm sure I have 0 drops within my sniffing machine, then all flows tagged with gaps must be due to drops in the external switch or tap (or faulty DAG/DAC drivers that don't report their own drop count, but that is a completely different matter)).
>>> 
>>> 
>>>> 	Comparing the RMON traffic counts reported by the switch feeding your
>>>> sensor against the argus counts is another way although synchronizing the two
>>>> counts can be exciting :-). Both of these only indicate loss of data that makes
>>>> it as far as your sensor of course and aren't an indication of loss elsewhere
>>>> in the path but that's a start ...
>>> Hehe, this is not possible since in many cases the SPAN port is not managed by me. I just manage the sensor receiving the mirrored traffic, but it is someone else who has setup the SPAN configuration.
>>> So diffing the reported drop-numbers is practically not feasible.
>>> 
>>>> 	As well using something like tcpreplay from a pcap file with suitable
>>>> hardware (which can get very hard at high speed of course :-)) feeding in to
>>>> your sensor can give you a known input traffic pattern to estimate sensor loss
>>>> as well.
>>> Now you're rather talking about detecting local sensor loss. What I'm primarily asking for is a way to easily detect that there is external packet loss.
>>> 
>>> Currently I'm sniffing e.g. 100 000 packets with tcpdump, making sure nothing is dropped locally. In this case it took 3 seconds to gather 100 000 packets. I scp the pcap file to a machine running Wireshark. I open up the "Expert Info Composite" and look at "ACKed lost segment" and "Previous segment lost".
>>> In an environment where the traffic is mirrored correctly, these two counters give me an estimate as to how many gaps there are in the tcp flows in the pcap file (disregarding a couple of false positives at capture startup).
>>> ...that is, I can see if the people feeding me mirrored traffic have problems in their end.
>>> 
>>> This procedure is quite tiresome. Also, it is unreliable when the mirrored packets are received out of order (common in redundant/loadbalanced environments), then Wireshark will tag packets as lost even though they exist.
>>> 
>>> /Elof
> 
