Why is rabins() "ramping up" counts?

Matt Brown matthewbrown at gmail.com
Wed Jul 31 14:29:20 EDT 2013


Hey Carter...

Thanks for replying quickly.

Hope you're ready for one of my novels...

If not, this message can be summarized in a single question:

Should I be "throwing away" any data returned within the first
ARGUS_FLOW_STATUS_INTERVAL when using rabins(), since that data appears
to be reported inaccurately?


--
/etc/argus.conf: ARGUS_FLOW_STATUS_INTERVAL=60
Does 60 seconds fall into the "very large in comparison to 5 seconds"
category?
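
For concreteness, here is my own arithmetic (assumed numbers, nothing
measured from the wire):

# a 60s status interval carved into 5s bins spans 60/5 = 12 bins, so
# each status record's metrics get spread across roughly 12 bins:
awk 'BEGIN { printf "%d bins per status record\n", 60 / 5 }'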

--
I definitely have a small number of _flows_ per five second interval for
this specific BPF.
Am I right to assume that rabins() with `-M hard` will take whatever flows
occur within each bin and treat them on their own, not discarding them in
the next bins (this is what `-M nomodify` is for, right?)?

--
Here is the outcome of the exercise you described to help me understand rabins():
1) grab the `seq` of the `-N 1` record:
#ra -N 1 -r ~/working/2013-07-25_argus_09\:00\:00 -s seq - port 5432 and
src host 192.168.10.22
   Seq
   12187458

2) write the single flow record to an argus binary file:
#ra -N 1 -r ~/working/2013-07-25_argus_09\:00\:00 -w - - port 5432 and src
host 192.168.10.22 > ~/temp.argus

3) If I look at a summation field (`pkts`) [not itself an aggregate,
as `rate` is], without using field aggregation (`-m`), I get the
TotPkts:
#ra -r ~/temp.argus -s seq ltime pkts - port 5432 and src host 192.168.10.22
   Seq                        LastTime  TotPkts
   12187458 2013-07-25 09:59:17.698748    59326

4) If I then look at the output of rabins() running against the same `seq`,
it appears that rabins() shows `pkts` within each bin, whose sum IS equal
to the TotPkts above:
#rabins -M hard time 5s -r ~/temp.argus -s seq ltime pkts - port 5432 and
src host 192.168.10.22
...snipped output...

Cool!  A summation works with a field that isn't, itself, an aggregate.
 [Note the output is the same with or without `-B 5s`]
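
In case it helps to reproduce the check, this is the one-liner I'd use
(a sketch; it assumes the pkts column is the last field and skips the
header row):

rabins -M hard time 5s -r ~/temp.argus -s seq ltime pkts \
    - port 5432 and src host 192.168.10.22 |
  awk 'NR > 1 { sum += $NF } END { print sum }'
# should print 59326, the TotPkts reported by plain ra() above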


What about a field that is, itself, an aggregate (`rate`)?

#ra -r ~/temp.argus -s seq ltime rate - port 5432 and src host 192.168.10.22
   Seq                        LastTime         Rate
   12187458 2013-07-25 09:59:17.698748    16.675105

#rabins -M hard time 5s -B 5s -r ~/temp.argus -s seq ltime rate - port 5432
and src host 192.168.10.22

Cool! If I average the resultant Rates, I get 16.4646067416... not
exactly the same, but good enough(?). [Note the output is the same with
or without `-B 5s`.]
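
(The same style of check for the average; again a sketch that assumes
Rate is the last field and skips the header row:)

rabins -M hard time 5s -r ~/temp.argus -s seq ltime rate \
    - port 5432 and src host 192.168.10.22 |
  awk 'NR > 1 { sum += $NF; n++ } END { print sum / n }'
# prints ~16.46, close to (but not exactly) the 16.675105 from ra()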

--
The (`-m`) aggregator does not cause the "ramp up"...

proof:
I see no difference in rabins() output between using an aggregator
(`-m saddr`; my BPF considers a single src host) and not using one, when
reading an aggregate field (`rate`) from a live feed:

#timeout 60s rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -s seq ltime
saddr rate - port 5432 and src host 192.168.10.22 > ~/rabins_aggr.out &
timeout 60s rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -m saddr -s seq
ltime saddr rate - port 5432 and src host 192.168.10.22 >
~/rabins_aggr_saddr.out

I can manually add up the per-bin output of the rabins() run without the
`-m` aggregator, and the sums match the `-m saddr` values [+/- ~0.5].

So aggregation is not the cause of the "ramp up".


--
"Ramp up" is exhibited on both aggregated fields and non-aggregated fields.

proof:
# rabins -M hard time 5s -B 5s -m saddr -S 127.0.0.1:561 -s seq ltime pkts
- port 5432 and src host 192.168.10.22
   Seq                        LastTime  TotPkts
   15103267 2013-07-31 13:46:35.000000       41
   14983890 2013-07-31 13:46:40.000000       75
   14983890 2013-07-31 13:46:45.000000      144
   14983890 2013-07-31 13:46:50.000000      255
   14983890 2013-07-31 13:46:55.000000      377
   14983890 2013-07-31 13:47:00.000000      368
   15103267 2013-07-31 13:47:05.000000      373
   14983890 2013-07-31 13:47:10.000000      446
   14983890 2013-07-31 13:47:15.000000      570
   14983890 2013-07-31 13:47:20.000000      567
   14983890 2013-07-31 13:47:25.000000      575
   14983890 2013-07-31 13:47:30.000000      637
   15103267 2013-07-31 13:47:35.000000      647

# rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -m saddr -s seq ltime saddr
rate - port 5432 and src host 192.168.10.22
   Seq                        LastTime            SrcAddr         Rate
   14667433 2013-07-31 13:43:45.000000       192.168.10.22    15.200000
   14667433 2013-07-31 13:43:50.000000       192.168.10.22    38.600000
   14667433 2013-07-31 13:43:55.000000       192.168.10.22    61.800000
   14667433 2013-07-31 13:44:00.000000       192.168.10.22    61.000000
   14667433 2013-07-31 13:44:05.000000       192.168.10.22    60.600000
   14667433 2013-07-31 13:44:10.000000       192.168.10.22    75.200000
   14667433 2013-07-31 13:44:15.000000       192.168.10.22    99.400000
   14667433 2013-07-31 13:44:20.000000       192.168.10.22    99.200000
   14667433 2013-07-31 13:44:25.000000       192.168.10.22   101.400000
   14667433 2013-07-31 13:44:30.000000       192.168.10.22   113.400000
   14667433 2013-07-31 13:44:35.000000       192.168.10.22   123.400000
   14667433 2013-07-31 13:44:40.000000       192.168.10.22   130.600000
   14667433 2013-07-31 13:44:45.000000       192.168.10.22   129.800000



FINAL QUESTION:
Should I simply be "throwing away" any data returned within the first
ARGUS_FLOW_STATUS_INTERVAL when using rabins(), since that data appears
to be reported inaccurately?


SORRY ONE MORE... :)
Also, does ra() only report flows (`seq`) that have flow records reporting,
while rabins() (with `-M hard`) reports all flows that have any activity
within each bin?



Thanks,

Matt

On Jul 30, 2013, at 4:32 PM, Carter Bullard <carter at qosient.com> wrote:

Hey Matt,
I'd have to see the data that generated the output to know if
there is a problem.

The key here is the ARGUS_FLOW_STATUS_INTERVAL.  If it is
very large in comparison to your bin size, and you
have a small number of records, then this kind of
skewing can occur.  But I'd have to see the data.

Your rabins() call will cut flow records into 5 second bins,
normally distributing the metrics (pkts, bytes, appbytes, etc…),
and then when it's time to output the bins, it will apply the
aggregation strategy to all the flow records that are in
each bin.
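
For example (illustrative numbers only, not taken from your data): a 60
second status record carrying 600 packets, cut into 5 second bins,
contributes 600 * (5/60) = 50 packets to each full bin it overlaps:

awk 'BEGIN { print 600 * (5 / 60) }'   # => 50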

Your -B 5s will throw away records that precede the apparent
start time of the stream, and is only used when reading live data.
Don't use the "-B secs" option when reading files.
That may clear up your problem.

So grab a single flow record's status records, writing them out to a file.
Then run rabins() to see how it carves up the flow record.
You should see that it processes well.

Carter

On Jul 30, 2013, at 4:19 PM, Matt Brown <matthewbrown at gmail.com> wrote:

Hello all,

Does rabins() "ramp up to normal" over N bins?

I'd like to start working on calculating moving averages to help
identify performance outliers (like "spikes" in `loss` or `rate`).

For this purpose, I believe grabbing data from the output of rabins()
would serve me well.

For example, if I take historic argus data and run it through the
following rabins() invocation, I see some odd things that can only be
described as "ramping up":

for f in $(ls -m1 ~/working/*) ; do (
  rabins -M hard time 5s -B 5s -r $f -m saddr -s ltime rate - port 5432
  and src host 192.168.10.22
) >> ~/aggregated_rate ; done
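
(As an aside, the kind of moving average I have in mind would be something
like this sketch over the ~/aggregated_rate file built above; the 12-bin,
i.e. one minute, window is my own arbitrary choice:)

# trailing 12-bin moving average of the rate column, skipping header lines:
awk '$NF ~ /^[0-9.]+$/ { buf[++n % 12] = $NF
     if (n >= 12) { s = 0; for (i in buf) s += buf[i]; print $1, $2, s / 12 } }' ~/aggregated_rate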



The first few and the last few resulting records per file do not seem
to be reported correctly.

For example, these dudes at 192.168.10.22 utilize a postgres DB
replication package called bucardo.  During idle time, bucardo sends
heartbeat info, and appears to hold at about 47-49 packets per
second (rate).

However, I am seeing the following in my rabins() resultant data (note
that the presence of a field label header == the start of a new rabins()
run from the above for-loop):


2013-07-25 00:59:25.000000    47.400000
2013-07-25 00:59:30.000000    47.400000
2013-07-25 00:59:35.000000    48.000000
2013-07-25 00:59:40.000000    48.000000
2013-07-25 00:59:45.000000    40.600000
2013-07-25 00:59:50.000000    21.400000
2013-07-25 00:59:55.000000    15.400000
2013-07-25 01:00:00.000000     5.000000
2013-07-25 01:00:05.000000     0.000000
               LastTime         Rate
2013-07-25 01:00:05.000000     0.200000
2013-07-25 01:00:10.000000     0.600000
2013-07-25 01:00:15.000000     0.400000
2013-07-25 01:00:35.000000     0.400000
2013-07-25 01:00:40.000000     1.000000
2013-07-25 01:00:45.000000     6.200000
2013-07-25 01:00:50.000000    25.400000
2013-07-25 01:00:55.000000    32.400000
2013-07-25 01:01:00.000000    41.800000
2013-07-25 01:01:05.000000    47.600000
2013-07-25 01:01:10.000000    48.600000


[The source files were written with rastream().]

It is well worth noting that if I start rabins() reading from the
argus() socket with the following invocation, the same sort of thing
occurs:

# rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -m saddr -s ltime rate
- port 5432 and src host 192.168.10.22

               LastTime         Rate
2013-07-30 15:42:55.000000     1.400000
2013-07-30 15:43:00.000000     0.600000
2013-07-30 15:43:05.000000    33.800000
2013-07-30 15:43:10.000000    47.400000
2013-07-30 15:43:15.000000    58.600000
2013-07-30 15:43:20.000000    87.600000
2013-07-30 15:43:25.000000    96.200000
2013-07-30 15:43:30.000000    96.000000
2013-07-30 15:43:35.000000   134.200000
2013-07-30 15:43:40.000000   137.200000
2013-07-30 15:43:45.000000   137.400000
2013-07-30 15:43:50.000000   136.600000
2013-07-30 15:43:55.000000   139.800000
2013-07-30 15:44:00.000000   136.200000 <-- `rate` averages about here going forward



It doesn't matter which field I use; the same thing occurs:

# rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -m saddr -s ltime load
- port 5432 and src host 192.168.10.22

               LastTime     Load

2013-07-30 15:50:15.000000 1461.19*
2013-07-30 15:50:20.000000 42524.7*
2013-07-30 15:50:25.000000 54329.5*
2013-07-30 15:50:30.000000 55244.8*
2013-07-30 15:50:35.000000 90164.8*
2013-07-30 15:50:40.000000 92539.1*
2013-07-30 15:50:45.000000 94827.1*
2013-07-30 15:50:50.000000 95292.7*
2013-07-30 15:50:55.000000 96286.3*
2013-07-30 15:51:00.000000 94857.6*
2013-07-30 15:51:05.000000 130699.*
2013-07-30 15:51:10.000000 149979.*
2013-07-30 15:51:15.000000 149320.*
[killed]

# rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -m saddr -s
ltime load - port 5432 and src host 192.168.2.22

               LastTime     Load
2013-07-30 15:52:35.000000 33894.4*
2013-07-30 15:52:40.000000 3134.84*
2013-07-30 15:52:45.000000 39262.4*
2013-07-30 15:52:50.000000 40024.0*
2013-07-30 15:52:55.000000 41188.7*
2013-07-30 15:53:00.000000 40259.2*
2013-07-30 15:53:05.000000 75057.6*
2013-07-30 15:53:10.000000 97160.0*
2013-07-30 15:53:15.000000 106520.*
2013-07-30 15:53:20.000000 138504.*
2013-07-30 15:53:25.000000 153835.*
2013-07-30 15:53:30.000000 152892.*
2013-07-30 15:53:35.000000 154017.* <-- `load` averages here going forward


This happens whether or not I perform field aggregation (`-m saddr`).

Why is this happening?

This seems like it will really screw up calculating moving averages
(figuring out spikes, etc.) from the rabins() resultant data.

Thanks!

Matt