Detect packet drops

Wed Feb 1 09:10:05 EST 2012

On Fri, 27 Jan 2012, Carter Bullard wrote:
> OK, as I have mentioned before, we do distinguish between 'skipped' sequence numbers,
> out of order sequence numbers, and retransmitted numbers (data and acks)

Great! Then all that's left to do is to set the appropriate output tags.

What I asked for is actually a much simpler function than what has been 
discussed in the spawned "Another vote for packet drop detection"-thread.

See below for my wishes.

> The duplicates,
> such as multiple copies of the exact same packet, are detectable and I put code in to do
> this, although I don't have any packet files that have the conditions that you describe to
> verify if they are correct or not, so I haven't finished the support.

Such a pcap or argus-logfile is easily created by configuring a Cisco SPAN 
to mirror *both* rx and tx for a vlan or *both* rx and tx for two 
switchports:
   _______
A-|1      |
B-|2    48|-Sniffer
   -------

Example of configuration syntax:
monitor session 1 source vlan nnn both   or
monitor session 1 source interface gigabit 1-2 both

A packet from A to B will then be copied:
rx on port 1 --> copy to port 48
tx on port 2 --> copy to port 48
...and vice versa for the replies:
rx on port 2 --> copy to port 48
tx on port 1 --> copy to port 48

...or if the same packet is received by a vlan and then transmitted, two 
identical copies will be sent.
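
The copy rules above can be written down as a toy model. This is only a
sketch to make the doubling visible, not a description of how any switch
implements SPAN; the function name span_copies and its parameters are
made up for illustration.

```python
from typing import List, Set, Tuple


def span_copies(ingress_port: int, egress_port: int,
                monitored: Set[int], direction: str = "both") -> List[Tuple[int, str]]:
    """Return the (port, direction) mirror events for one forwarded packet.

    Hypothetical model: a monitored port mirrors a packet once when it is
    received (rx) and once when it is transmitted (tx), depending on the
    configured direction.
    """
    copies = []
    if ingress_port in monitored and direction in ("rx", "both"):
        copies.append((ingress_port, "rx"))
    if egress_port in monitored and direction in ("tx", "both"):
        copies.append((egress_port, "tx"))
    return copies


# A (port 1) sends to B (port 2); both ports are SPAN sources.
print(len(span_copies(1, 2, {1, 2}, "both")))  # copies sent to port 48
print(len(span_copies(1, 2, {1, 2}, "rx")))    # rx-only mirroring
```

With "both" the sniffer gets two copies of every packet between A and B;
with rx-only mirroring of the same two ports it gets one.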

I have never seen a packet stream to the sniffer where another packet is 
inserted in between the two copies, so the detection need not be 
advanced. Simply check whether the next received packet is identical to the 
current one and was received extremely soon after it. 
Then tag this flow as 'Duplicate packet(s) exist'.

However, if the second packet is identical to the first, but it was seen 
after more than NN milliseconds, don't tag this as a duplicate, but 
handle it under the existing 'TCP Retransmission' tags.
Since the TCP retransmission function introduces a delay before a packet is 
retransmitted, you can distinguish between duplicates and retransmissions.

Result: you have solved the current issue in argus where duplicates are 
incorrectly tagged as retransmissions.
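
A minimal sketch of that rule, assuming the sensor keeps only the previous
packet and its timestamp. This is not argus code; Packet, classify and
window_ms are hypothetical names, and the 0.5 ms used in the example is an
arbitrary stand-in for the unspecified NN.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    ts_ms: float   # capture timestamp in milliseconds
    data: bytes    # raw packet bytes as captured


def classify(prev: Optional[Packet], cur: Packet, window_ms: float) -> str:
    """Classify cur relative to the immediately preceding packet.

    window_ms plays the role of "NN milliseconds"; the caller must choose it.
    """
    if prev is None or cur.data != prev.data:
        return "new"
    if cur.ts_ms - prev.ts_ms <= window_ms:
        return "duplicate"           # tag flow: 'Duplicate packet(s) exist'
    return "retransmission"          # hand over to the existing TCP logic


first = Packet(ts_ms=0.0, data=b"same bytes")
print(classify(first, Packet(ts_ms=0.01, data=b"same bytes"), window_ms=0.5))
print(classify(first, Packet(ts_ms=250.0, data=b"same bytes"), window_ms=0.5))
```

Only the back-to-back copy is tagged as a duplicate; the late copy falls
through to whatever retransmission handling already exists.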

> The problems in guaranteeing that you can count every drop, in this case for TCP are these.

There's no need to count every gap. All I need is a tag to indicate that 
gaps exist. If lots of flows get this tag, I need to investigate where 
packets are dropped.
Just like if a lot of flows are tagged with ECN, Window closures or 
Retransmissions, then I will investigate the situation to make sure that 
it is not my own servers/routers that are the problem.

(but having fields and counters for [sd]gap doesn't hurt. It 
just complements the tag)

If/when the new functions are in place, you need to review all the 
man-pages and make sure to distinguish between the words "loss" and 
"drop" so that it is clear what tags (and counters) are related to the 
real traffic between src and dst, and what tags (and counters) are related 
to the SPAN/pcap/bpf mirroring environment.

> The best way to see all indications of loss is to look at the ACK behavior from the receiver.
> Selective ACK advertisements are the best way to track loss, as you'll get a fine grain reporting
> of what the receiver didn't receive.  Without selective ACK, you don't know how many packets
> in a window were lost, you just know that at least 1 was lost.

As I said, I really don't need a count of unique missing packets. It is 
enough to tag tcp flows with 'Gap in stream' when detected.
If I see lots of flows with this tag I will start looking for 
performance/SPAN-problems. (again, an [sd]gap field with counters is 
interesting as well, but not really necessary if it introduces a 
performance impact, or if you need to do heavy argus rewrites).

I agree that the easiest way, and most accurate, is to look solely at the 
selective ack, so do just that! Describe in the man-page that the 'Gap in 
stream' tag is only watching for and detecting gaps in tcp-flows with 
Selective Ack enabled. All other kinds of packet-loss are too hard to 
detect, or would introduce too large a performance impact in argus.
...or something like that...
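
As a rough sketch of what "looking solely at the selective ack" could mean:
walk the TCP options of each segment and report any SACK blocks (option
kind 5, RFC 2018). This is an illustration in Python, not argus code;
sack_blocks and has_gap are hypothetical names. Note that a SACK block can
also be a D-SACK (RFC 2883) reporting a duplicate segment, and that
reordering can trigger SACKs too, so a real implementation would need to
look closer before setting the tag.

```python
import struct
from typing import List, Tuple


def sack_blocks(options: bytes) -> List[Tuple[int, int]]:
    """Return (left_edge, right_edge) pairs found in a TCP options field."""
    blocks = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:              # End of Option List
            break
        if kind == 1:              # No-Operation, a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                  # truncated option, stop parsing
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                  # malformed option, stop parsing
        if kind == 5:              # SACK: 8 bytes per block
            body = options[i + 2:i + length]
            for off in range(0, len(body) - 7, 8):
                blocks.append(struct.unpack_from("!II", body, off))
        i += length
    return blocks


def has_gap(options: bytes) -> bool:
    """True if this segment carries at least one SACK block."""
    return bool(sack_blocks(options))


# NOP, NOP, SACK with one block covering sequence numbers 1000..2000.
opts = bytes([1, 1, 5, 10]) + struct.pack("!II", 1000, 2000)
print(sack_blocks(opts), has_gap(opts))
```

A flow would get the 'Gap in stream' tag as soon as has_gap() is true for
any of its segments.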

By only looking at selective acks, you're also spared from the problems 
Wireshark has, dealing with drops or out-of-order packets in the SYN phase 
or the FIN phase of the flow.

> So what's the big deal ?  Are you so into the QoS part of this that each packet lost is important
> to your analysis?

Oh no, the big deal is a small deal!

* I just want ra to add a tag to flows that are missing packets.
   Just as ra adds a "v" to all VLAN-tagged flows, I want a tag for flows
   with detected gaps in them.
   If I see this tag frequently, I know something is wrong. Just like if I
   see lots of "ICMP events mapped to this flow", ECN, Window closure or
   Retransmissions...

   As I stated in my previous email, I think you should add a new
   proto-column for this tag.

Another small deal is:

* Argus is currently tagging duplicates as retransmissions. This is wrong.
   It is also misleading since you can have 100% perfect flows with no
   packet loss and no tcp retransmissions, yet argus indicates lots of * s d
   tags on lots of flows due to external forces, i.e. incorrect SPAN setup.

Naturally, my wishes and the more advanced ones in the "Another vote for 
packet drop detection"-thread complement each other. The perfect solution 
would be to have both my new tags AND new fields like [sd]gap.

/Elof

> On Jan 26, 2012, at 5:37 AM, elof2 at sentor.se wrote:
>
>>
>> On Wed, 25 Jan 2012, Peter Van Epp wrote:
>>> On Wed, Jan 25, 2012 at 02:02:08PM +0100, elof2 at sentor.se wrote:
>>>> Any more thoughts or progress with this?
>>>>
>>>> I just realised that I can't even rely on Wireshark for an estimate
>>>> of dropped packets, since Wireshark's Expert Info "ACKed lost
>>>> segment" tags out-of-order FIN-packets as "ACKed lost segment".
>>>>
>>>> What I'm looking for is not a 100% accurate system to count every
>>>> missing packet (which is impossible to determine), but a flag on
>>>> each session that argus knows is missing one or more packets.
>>>> Just like the flag for retransmission doesn't say how many
>>>> retransmissions there were in a tcp flow.
>>>
>>> 	Checking the pcap reported loss rate (it's in the man records, which
>>> you have to enable to see these days) will give you an indication. Although
>>> it is only one of the several ways your sensor can be losing packets, it is one
>>> good indication of how your sensor is doing. There is an explanation of a
>>> number of the possible (and usually invisible) loss points in a sensor on
>>> Carter's web site at http://www.qosient.com/argus/sensorPerformance.shtml as
>>> well.
>>
>>
>> Hi Peter.
>> Thanks for your input.
>>
>> Ah, didn't know about the hidden pcap drop counters. I will take a look at it.
>>
>> However... Even though I can see the pcap drop count, I still think it would be nice if argus could tag individual flows where it has detected gaps.
>> The tag would give us argus users a notification that not all traffic is monitored 100%. An informative tag just like the out-of-order tag or ECN tag.
>>
>> I now realise that my suggestion of having tags like "dropped externally" and "dropped internally" is not feasible, since there's no way to correlate the pcap drop counter to specific flows, so ignore this.
>>
>> Apart from simply being informed that the monitored traffic is not 100%, I would also very much like to be able to determine if the drops occur outside of the sensor, i.e. the switch drops lots of packets while the sensor drops nothing.
>> With the tag above, and a pcap-drop-counter in the argus man-records it should be easier to spot that external drops occur.
>> (Naturally, if you have both external drops and internal drops, it will be hard to investigate, but that's always the case.) If I'm sure I have 0 drops within my sniffing machine, then all flows tagged with gaps must be due to drops in the external switch or tap (or faulty DAG/DAC drivers that don't report their own drop count, but that is a completely different matter).
>>
>>
>>> 	Comparing the RMON traffic counts reported by the switch feeding your
>>> sensor against the argus counts is another way although synchronizing the two
>>> counts can be exciting :-). Both of these only indicate loss of data that makes
>>> it as far as your sensor of course and aren't an indication of loss elsewhere
>>> in the path but that's a start ...
>>
>> Hehe, this is not possible since in many cases the SPAN port is not managed by me. I just manage the sensor receiving the mirrored traffic, but it is someone else who has set up the SPAN configuration.
>> So diffing the reported drop-numbers is practically not feasible.
>>
>>> 	As well using something like tcpreplay from a pcap file with suitable
>>> hardware (which can get very hard at high speed of course :-)) feeding in to
>>> your sensor can give you a known input traffic pattern to estimate sensor loss
>>> as well.
>>
>> Now you're rather talking about detecting local sensor loss. What I'm primarily asking for is a way to easily detect that there is external packet loss.
>>
>> Currently I'm sniffing e.g. 100 000 packets with tcpdump, making sure nothing is dropped locally. In this case it took 3 seconds to gather 100 000 packets. I scp the pcap file to a machine running Wireshark. I open up the "Expert Info Composite" and look at "ACKed lost segment" and "Previous segment lost".
>> In an environment where the traffic is mirrored correctly, these two counters give me an estimate as to how many gaps there are in the tcp flows in the pcap file (disregarding a couple of false positives at capture startup).
>> ...that is, I can see if the people feeding me mirrored traffic have problems on their end.
>>
>> This procedure is quite tiresome. Also, it is unreliable when the mirrored packets are received out of order (common in redundant/loadbalanced environments), because Wireshark will then tag packets as lost even though they exist.
>>
>> /Elof
>
>