Bugs in gap detection
elof2 at sentor.se
elof2 at sentor.se
Tue May 26 15:34:34 EDT 2015
Hi Carter!
I've just upgraded argus to the latest version on my FreeBSD-machine.
Three bug reports:
Issue 1)
Problem: Argus detect gaps when there are none.
ra -Zb -nr ch119_no_gap.log -s +sgap +dgap -
StartTime Flgs Proto SrcAddr Sport Dir
DstAddr Dport SrcPkts DstPkts SrcBytes DstBytes State
SrcGap DstGap
11:36:24.160011 e tcp 10.123.45.5.25768 ->
10.23.23.10.8080 4 3 1037 715
SPA_SPA 0 0
11:36:48.322557 e tcp 10.123.45.5.25768 ->
10.23.23.10.8080 38 81 3201 100099
PA_PA 0 0
11:37:54.333394 e g tcp 10.123.45.5.25768 <?>
10.23.23.10.8080 8 8 7705 1809
PA_PA 2994 0
So, on that last line, argus says there are 2994 src-bytes missing.
'netstat -B' (on my FreeBSD) show that the argus process has zero dropped
packets.
I ran a tcpdump simultaneously and this tcpdump process also had zero
drops.
Therefore I expect to see a few missing packets for the above flow in the
pcap, as reported by argus.
To my surprise, Wireshark does NOT complain about any missing packets at
all!
I have manually compared the three ra-lines to the pcap:
I see the 4 (1037) + 3 (715) packets in the pcap.
I see the 38 (3201) + 81 packets in the pcap.
I see the 8 (7705) + 8 (1809) packets in the pcap.
Both argus and tcpdump saw a total of 142 packets.
So... I guess there's something wrong with the gap detection in argus. :-/
I'm sending you a private email, attaching both the pcap and the argus log
for further debugging.
note: the last two packets from the middle line are two window updates.
Then there is a pause for 1 min 6 seconds before those last 8+8 packets at
11:37:54.
Could the window updates or the >1 minute of silence fool argus into
thinking there are gaps?
Issue 2)
Problem: Argus reports gaps when there are none.
This must be because argus flush its state just before the missing data
was retransmitted on the wire.
Example:
StartTime Flgs Proto SrcAddr Sport Dir
DstAddr Dport SrcPkts DstPkts SrcBytes DstBytes State
LastTime SrcGap DstGap
11:36:05.719501 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 69 43 88147 2958
PA_A 11:36:10.862907 0 0
11:36:10.962535 e tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 71 42 88894 2772
PA_A 11:36:16.111922 0 0
11:36:16.238263 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 71 46 88507 3084
PA_A 11:36:21.338720 0 0
11:36:21.450236 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 71 44 87447 2964
PA_A 11:36:26.560071 0 0
11:36:26.649754 e g tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 71 43 87160 2874
PA_A 11:36:31.751251 857 0
Ok, so there are 71 packets from 123.123.123.88:8800 in this line.
Somehere withing these 71 packets there's a gap.
In my tcpdump pcap I start counting from the packet with timestamp
11:36:26.649754.
Between src packets 68 and 69 there's a missing packet.
After 69, packets 70 and 71 continue in the same high speed, where
packet 71 has a timestamp of 11:36.31.751249.
LastTime of this line is 2us later, 11:36:31.751251 and this correspond
fine with the timestamp of the 43rd packet from 10.123.1.5.
Now, in the second next packet (packet 73), a retransmission of the
missing packet is received by argus.
So there are no *actual* gap in the tcp stream monitored by argus.
However, since argus already flushed the 71+43 packets, packet 72 and 73
instead became packets #1 and #2 in the next line.
A SrcGap of 857 is therefore logged on the previous line.
11:36:31.821865 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 72 42 89425 2784
PA_A 11:36:37.012968 0 0
11:36:37.065894 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 72 45 87654 3030
PA_A 11:36:42.139706 0 0
11:36:42.220030 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 74 49 91817 3342
PA_A 11:36:47.361654 0 0
11:36:47.497642 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 69 43 88775 2934
PA_A 11:36:52.636345 0 0
11:36:52.688988 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 72 47 87834 3210
PA_A 11:36:57.840345 0 0
11:36:57.914371 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 70 43 88106 2934
PA_A 11:37:03.012641 0 0
11:37:03.143249 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 71 44 86827 2988
PA_A 11:37:08.241077 0 0
11:37:08.373448 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 75 48 89773 3264
PA_A 11:37:13.467642 0 0
11:37:13.555818 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 70 46 87370 3144
PA_A 11:37:18.679088 0 0
11:37:18.765883 e tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 73 42 87613 2772
PA_A 11:37:23.814428 0 0
11:37:23.901961 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 71 47 89087 3222
PA_A 11:37:29.082170 0 0
11:37:29.258747 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 69 45 88978 3094
PA_A 11:37:34.375937 0 0
11:37:34.427460 e g tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 69 43 87253 2874
PA_A 11:37:39.516085 30 0
Again, the gap took place in this flow while the retransmission came
just a few packets later, but too late, so it was put in the next flow
line.
11:37:39.637529 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 72 44 87799 3000
PA_A 11:37:44.736910 0 0
11:37:44.831350 e s tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 72 44 88953 2964
PA_A 11:37:49.980516 0 0
11:37:50.102882 e tcp 216.129.114.88.8800 <?>
10.123.1.5.25699 76 42 89443 2772
PA_A 11:37:55.259373 0 0
I think this is a bug. Argus should only list *true* gaps detected in the
flows (and let the retransmission counters show the retransmissions).
Currently, argus can report gaps even though there physically are none.
This makes the gaps field more or less useless.
You can probably come up with a solution to this. Here's my suggestion:
Delay the flush if there are active gaps in this unflushed flow.
if (srcgap!=0 || dstgap!=0)
{
Wait up to 10 seconds for the missing packet to arrive (and hence fill the
gap). Then flush the line with gap values set to zero.
OR if the session is finished (FIN) or reset (RST), then flush the line
immediately with the current gap values.
You could probably build a third case too, in order to exit the delay loop
early. Something like this:
OR wait until N next packets for this stream has been received, where N is
set to a value that covers the current sliding window filled with some
typical packetsize plus a few packets more just for margin.
That is, only wait for as long it would take the Src and Dst to retransmit
the missing packet. If argus hasn't seen it by then, the packet will never
appear, so stop waiting for it and flush the line with the current gap
values. (The tcp session continues between Src and Dst, so a missing
packet was packet loss on the SPAN port = a true gap)
}
Benefit:
Argus show gap values you can trust and understand. I.e. if argus says
zero gaps, that mean that the SPAN is good. No packet loss in any external
equipment nor locally in the sniffer itself.
Drawback:
Sometimes the flushpoint can be extended by up to 10 seconds if there are
gaps, instead of always be ARGUS_FLOW_STATUS_INTERVAL long.
To me, this is a minor problem though, since long-lived flows might be
silent for random x minutes, then active a few seconds, then silent again
for random y minutes...
(Perhaps you can build an even better delayed_flush() function than the
above. One that flush all lines periodically as usual, but with corrected
gap values, if the gap was filled later with a retransmission.)
Issue 3)
Problem: Argus seem to be 1 byte off when reporting the size of gaps.
Example:
11:36:26.649754 e g tcp 123.123.123.88.8800 <?>
10.123.1.5.25699 71 43 87160 2874
PA_A 857 0
seq analysis show:
packet 68 show Next seq=244713
packet 69 has seq=245571
245571-244713=858 bytes missing, not 857.
11:37:34.427460 e g tcp 123.123.123.88.8800 <?>
10.123.1.5.25699 69 43 87253 2874
PA_A 30 0
seq1334074-1334043=31, not 30.
/Elof
More information about the argus
mailing list