Bugs in gap detection

elof2 at sentor.se elof2 at sentor.se
Tue May 26 15:34:34 EDT 2015


Hi Carter!

I've just upgraded argus to the latest version on my FreeBSD-machine.

Three bug reports:



Issue 1)
Problem: Argus detect gaps when there are none.

ra -Zb -nr ch119_no_gap.log -s +sgap +dgap -
       StartTime      Flgs  Proto            SrcAddr  Sport Dir 
DstAddr  Dport  SrcPkts  DstPkts     SrcBytes     DstBytes         State 
SrcGap   DstGap
11:36:24.160011  e           tcp        10.123.45.5.25768   -> 
10.23.23.10.8080          4        3         1037          715 
SPA_SPA        0        0
11:36:48.322557  e           tcp        10.123.45.5.25768   -> 
10.23.23.10.8080         38       81         3201       100099 
PA_PA        0        0
11:37:54.333394  e g         tcp        10.123.45.5.25768  <?> 
10.23.23.10.8080          8        8         7705         1809 
PA_PA     2994        0

So, on that last line, argus says there are 2994 src-bytes missing.

'netstat -B' (on my FreeBSD) show that the argus process has zero dropped 
packets.
I ran a tcpdump simultaneously and this tcpdump process also had zero 
drops.
Therefore I expect to see a few missing packets for the above flow in the 
pcap, as reported by argus.

To my surprise, Wireshark does NOT complain about any missing packets at 
all!

I have manually compared the three ra-lines to the pcap:
I see the 4 (1037) + 3 (715) packets in the pcap.
I see the 38 (3201) + 81 packets in the pcap.
I see the 8 (7705) + 8 (1809) packets in the pcap.
Both argus and tcpdump saw a total of 142 packets.

So... I guess there's something wrong with the gap detection in argus. :-/

I'm sending you a private email, attaching both the pcap and the argus log 
for further debugging.

note: the last two packets from the middle line are two window updates. 
Then there is a pause for 1 min 6 seconds before those last 8+8 packets at 
11:37:54.
Could the window updates or the >1 minute of silence fool argus into 
thinking there are gaps?




Issue 2)

Problem: Argus reports gaps when there are none.
This must be because argus flush its state just before the missing data 
was retransmitted on the wire.

Example:

       StartTime      Flgs  Proto            SrcAddr  Sport Dir 
DstAddr  Dport  SrcPkts  DstPkts     SrcBytes     DstBytes         State 
LastTime   SrcGap   DstGap
11:36:05.719501  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        69       43        88147         2958 
PA_A    11:36:10.862907        0        0
11:36:10.962535  e           tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        71       42        88894         2772 
PA_A    11:36:16.111922        0        0
11:36:16.238263  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        71       46        88507         3084 
PA_A    11:36:21.338720        0        0
11:36:21.450236  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        71       44        87447         2964 
PA_A    11:36:26.560071        0        0
11:36:26.649754  e g         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        71       43        87160         2874 
PA_A    11:36:31.751251      857        0
   Ok, so there are 71 packets from 123.123.123.88:8800 in this line. 
Somehere withing these 71 packets there's a gap.

   In my tcpdump pcap I start counting from the packet with timestamp 
11:36:26.649754.
   Between src packets 68 and 69 there's a missing packet.
   After 69, packets 70 and 71 continue in the same high speed, where 
packet 71 has a timestamp of 11:36.31.751249.
   LastTime of this line is 2us later, 11:36:31.751251 and this correspond 
fine with the timestamp of the 43rd packet from 10.123.1.5.

   Now, in the second next packet (packet 73), a retransmission of the 
missing packet is received by argus.
   So there are no *actual* gap in the tcp stream monitored by argus.
   However, since argus already flushed the 71+43 packets, packet 72 and 73 
instead became packets #1 and #2 in the next line.
   A SrcGap of 857 is therefore logged on the previous line.

11:36:31.821865  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        72       42        89425         2784 
PA_A    11:36:37.012968        0        0
11:36:37.065894  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        72       45        87654         3030 
PA_A    11:36:42.139706        0        0
11:36:42.220030  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        74       49        91817         3342 
PA_A    11:36:47.361654        0        0
11:36:47.497642  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        69       43        88775         2934 
PA_A    11:36:52.636345        0        0
11:36:52.688988  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        72       47        87834         3210 
PA_A    11:36:57.840345        0        0
11:36:57.914371  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        70       43        88106         2934 
PA_A    11:37:03.012641        0        0
11:37:03.143249  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        71       44        86827         2988 
PA_A    11:37:08.241077        0        0
11:37:08.373448  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        75       48        89773         3264 
PA_A    11:37:13.467642        0        0
11:37:13.555818  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        70       46        87370         3144 
PA_A    11:37:18.679088        0        0
11:37:18.765883  e           tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        73       42        87613         2772 
PA_A    11:37:23.814428        0        0
11:37:23.901961  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        71       47        89087         3222 
PA_A    11:37:29.082170        0        0
11:37:29.258747  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        69       45        88978         3094 
PA_A    11:37:34.375937        0        0
11:37:34.427460  e g         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        69       43        87253         2874 
PA_A    11:37:39.516085       30        0

   Again, the gap took place in this flow while the retransmission came 
just a few packets later, but too late, so it was put in the next flow 
line.

11:37:39.637529  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        72       44        87799         3000 
PA_A    11:37:44.736910        0        0
11:37:44.831350  e s         tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        72       44        88953         2964 
PA_A    11:37:49.980516        0        0
11:37:50.102882  e           tcp     216.129.114.88.8800   <?> 
10.123.1.5.25699        76       42        89443         2772 
PA_A    11:37:55.259373        0        0

I think this is a bug. Argus should only list *true* gaps detected in the 
flows (and let the retransmission counters show the retransmissions).

Currently, argus can report gaps even though there physically are none. 
This makes the gaps field more or less useless.

You can probably come up with a solution to this. Here's my suggestion:
Delay the flush if there are active gaps in this unflushed flow.
if (srcgap!=0 || dstgap!=0)
{
Wait up to 10 seconds for the missing packet to arrive (and hence fill the 
gap). Then flush the line with gap values set to zero.
OR if the session is finished (FIN) or reset (RST), then flush the line 
immediately with the current gap values.
You could probably build a third case too, in order to exit the delay loop 
early. Something like this:
OR wait until N next packets for this stream has been received, where N is 
set to a value that covers the current sliding window filled with some 
typical packetsize plus a few packets more just for margin.
That is, only wait for as long it would take the Src and Dst to retransmit 
the missing packet. If argus hasn't seen it by then, the packet will never 
appear, so stop waiting for it and flush the line with the current gap 
values. (The tcp session continues between Src and Dst, so a missing 
packet was packet loss on the SPAN port = a true gap)
}

Benefit:
Argus show gap values you can trust and understand. I.e. if argus says 
zero gaps, that mean that the SPAN is good. No packet loss in any external 
equipment nor locally in the sniffer itself.

Drawback:
Sometimes the flushpoint can be extended by up to 10 seconds if there are 
gaps, instead of always be ARGUS_FLOW_STATUS_INTERVAL long.
To me, this is a minor problem though, since long-lived flows might be 
silent for random x minutes, then active a few seconds, then silent again 
for random y minutes...

(Perhaps you can build an even better delayed_flush() function than the 
above. One that flush all lines periodically as usual, but with corrected 
gap values, if the gap was filled later with a retransmission.)




Issue 3)

Problem: Argus seem to be 1 byte off when reporting the size of gaps.

Example:

11:36:26.649754  e g         tcp     123.123.123.88.8800   <?> 
10.123.1.5.25699        71       43        87160         2874 
PA_A      857        0
   seq analysis show:
   packet 68 show Next seq=244713
   packet 69 has seq=245571
   245571-244713=858 bytes missing, not 857.
11:37:34.427460  e g         tcp     123.123.123.88.8800   <?> 
10.123.1.5.25699        69       43        87253         2874 
PA_A       30        0
   seq1334074-1334043=31, not 30.


/Elof




More information about the argus mailing list