Why is rabins() "ramping up" counts?

Thu Aug 1 16:39:48 EDT 2013

Hey Carter,

I appreciate the thorough responses as always.  I was five steps
behind you, but I now understand.

Is it worth adding documentation (in the man page) on how the
"calculated fields" are calculated?  I think that would have avoided
some of my confusion.

I was also confused because the specific flow I was looking at is a
long-lived flow that spans every bin and, in fact, the entire time
span contained in the source data file, meaning the duration is
always 5 seconds regardless of whether I'm using `-M hard`.

Here is a summary of what you described (and correct me if I'm wrong):

`-M hard`:
- Modifies each record's stime and ltime so that its duration is set
to that of the bin timespan.

`-M nomodify`:
- doesn't modify anything about the records.  Specifically, doesn't
modify time-wise fields.
Q: On the NSM wiki, it's stated that `-M nomodify` is used when you
wish to have a flow only appear in a single bin.  Is this correct?
Would it appear in the single bin at its stime?

Given the following two points:
1) rate=pkts/(ltime - stime)
2) dur=ltime-stime
...the reason why `-M hard` makes `rate` "ramp up" and "ramp down":
- The `dur` of each record is forced to be equal to the bin timespan.
-- In this case, the `pkts` sent within the first and last bins cover
only a small section of the time within the bin timespan (`dur`)...
--- Therefore, the `rate` calculation yields an abnormally small
value, because the forced `dur` is not the real `dur` of the
shorter-lived flow.
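
To make the arithmetic in the summary above concrete, here is a minimal
Python sketch.  It only models my reading of `rate = pkts/dur` as
described above, not how rabins() works internally.  The function
names and the steady 48 pkts/s figure are assumptions chosen for
illustration.

```python
# Sketch only: models the summary above, not rabins() internals.
BIN_SECS = 5.0     # bin timespan, as in "-M time 5s"
TRUE_RATE = 48.0   # assumed steady pkts/s while the flow is active


def rate_with_forced_dur(pkts, bin_secs=BIN_SECS):
    """rate = pkts / dur, with dur forced to the whole bin timespan."""
    return pkts / bin_secs


def rate_with_real_dur(pkts, active_secs):
    """rate = pkts / dur, with dur = time the flow was active in the bin."""
    return pkts / active_secs


# A bin fully covered by the flow, then two partially covered edge bins.
for active_secs in (5.0, 2.5, 0.5):
    pkts = TRUE_RATE * active_secs
    print(active_secs,
          rate_with_forced_dur(pkts),
          rate_with_real_dur(pkts, active_secs))
```

Under these assumptions, the fully covered bin reports the same rate
either way, while the partially covered bins report a lower rate when
`dur` is forced to 5 seconds.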

Q: Wouldn't this indicate that you cannot use a field like `rate`
(what I called an aggregate field, but what you called a "calculated
field") with `-M hard` and expect the first report to be reliable?
Q: How can I effectively use `-M hard`?

1) `-B` must always be larger than ARGUS_FLOW_STATUS_INTERVAL since
`-B` is a buffer time that waits for all data to arrive for a bin.
2) `-B` should be used to set rabins() to hang out while all the
status records arrive for a bin (from the argus cache?).  The status
records are delivered at the interval set with
argus.conf:ARGUS_FLOW_STATUS_INTERVAL.
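
The hold-time constraint in the two points above can be sketched in
Python, using the 60s status interval and 5s bins from the quoted
reply below.  This is a simplified model, not rabins() code: it
assumes a status record spanning [now - 60s, now] is carved into 5s
chunks, and that a chunk only has a bin to aggregate into if it starts
no earlier than `now - hold_secs`.  The function names are
hypothetical.

```python
# Simplified model of the "-B" hold buffer; not rabins() internals.
def split_into_bins(start, end, bin_secs):
    """Carve [start, end) into consecutive bin-sized (start, end) chunks."""
    return [(t, min(t + bin_secs, end)) for t in range(start, end, bin_secs)]


def chunks_with_a_slot(chunks, now, hold_secs):
    """Assumption: only bins starting at or after now - hold_secs are held."""
    oldest_held = now - hold_secs
    return [c for c in chunks if c[0] >= oldest_held]


NOW = 0               # arbitrary reference time, in seconds
STATUS_INTERVAL = 60  # ARGUS_FLOW_STATUS_INTERVAL from the thread
BIN_SECS = 5          # "-M time 5s"

chunks = split_into_bins(NOW - STATUS_INTERVAL, NOW, BIN_SECS)
for hold in (5, 65):
    kept = chunks_with_a_slot(chunks, NOW, hold)
    print(f"-B {hold}s: {len(kept)} of {len(chunks)} chunks have a bin")
```

In this model `-B 5s` keeps 1 of the 12 chunks, and `-B 65s` keeps all
12, which matches the 11/12 loss described in the reply.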

Q: I think, by what you've stated, when used with a `-B` timespan
value that is less than ARGUS_FLOW_STATUS_INTERVAL, rabins() would
report no data, but it clearly reports data.  Are you trying to say
that the data is unpredictable and therefore can't be relied upon?

Clearly, my problem was a mix of misusing `-M hard` and `-B`.
Q: I will be lowering my ARGUS_FLOW_STATUS_INTERVAL from 60 to 5
(reportedly acceptable, as it is the default).  What is the cost of this?

Thanks again,

Matt

On Jul 31, 2013, at 11:23 PM, Carter Bullard <carter at qosient.com> wrote:

> Matt,
>
> The reason you are having problems is the " -B 5s " option.
>
> Don't use it when reading data from a file.  It's not intended
> for this use, and while it won't hurt anything, it represents
> a lack of understanding of the option.
>
> When you use it with a live argus data source, it must be
> greater than the ARGUS_FLOW_STATUS_INTERVAL to have its effect.
>
> Let's try to understand what this option is doing.  The " -B secs "
> option is saying how long you have to hold a bin in memory
> in order to guarantee that all the data arrives for that time
> bin.
>
> With these options " -M time 5s -B 5s ", you will process
> only 2 bins at a time, the current bin, whose range is
> [now - (now + 5s)] and the hold buffer, which is [(now - 5s) - now].
>
> With an ARGUS_FLOW_STATUS_INTERVAL=60s, you will receive data
> whose time range could be [(now - 60s) - now].  rabins(), when
> carving up the record into 5 second chunks, will generate, at
> most 12 records whose ranges are:
>   1. [((now - 60s)+0*5s) - ((now - 60s)+1*5s)]
>   2. [((now - 60s)+1*5s) - ((now - 60s)+2*5s)]
>   3. [((now - 60s)+2*5s) - ((now - 60s)+3*5s)]
> …
>  12. [((now - 60s)+11*5s) - ((now - 60s)+12*5s)]
>
> With a " -B 5s " option, only the 12th record will have a slot
> to aggregate into.  So you will be throwing away 11/12 of
> all your data.
>
> Increase your "-B sec" option to
>   (max(ARGUS_FLOW_STATUS_INTERVAL) + someDelay)
>
> So I would use at a minimum " -B 65s ".
>
>
> OK, let's try to keep the email short in the future,
> if it's feasible to do so.  One topic at a time…
>
> Carter
>
>
>
> On Jul 30, 2013, at 4:19 PM, Matt Brown <matthewbrown at gmail.com> wrote:
>
>> Hello all,
>>
>>
>> Does rabins() "ramp up to normal" over N bins?
>>
>>
>> I'd like to start working on calculating moving averages to help
>> identify performance outliers (like "spikes" in `loss` or `rate`).
>>
>> For this purpose, I believe grabbing data from the output of rabins()
>> would serve me well.
>>
>>
>> For example, if I take historic argus data and run it through the
>> following rabins() invocation, I see some odd things that can only be
>> noted as "ramping up":
>>
>>
>> for f in $(ls -m1 ~/working/*) ; do (
>> rabins -M hard time 5s -B 5s -r $f -m saddr -s ltime rate - port 5432 \
>> and src host 192.168.10.22
>> ) >> ~/aggregated_rate ; done
>>
>>
>> The first few and the last few resulting records per file seem to not
>> be reporting correctly.
>>
>> For example, these dudes at 192.168.10.22 utilize a postgres DB
>> replication package called bucardo.  During idle time, bucardo sends
>> heartbeat info, and appears to be holding at about 47-49 packets per
>> second (rate).
>>
>> However, I am seeing the following in my rabins() resultant data (note
>> that the presence of the field label header marks the start of a new
>> rabins() run from the above for loop):
>>
>> 2013-07-25 00:59:25.000000    47.400000
>> 2013-07-25 00:59:30.000000    47.400000
>> 2013-07-25 00:59:35.000000    48.000000
>> 2013-07-25 00:59:40.000000    48.000000
>> 2013-07-25 00:59:45.000000    40.600000
>> 2013-07-25 00:59:50.000000    21.400000
>> 2013-07-25 00:59:55.000000    15.400000
>> 2013-07-25 01:00:00.000000     5.000000
>> 2013-07-25 01:00:05.000000     0.000000
>>                LastTime         Rate
>> 2013-07-25 01:00:05.000000     0.200000
>> 2013-07-25 01:00:10.000000     0.600000
>> 2013-07-25 01:00:15.000000     0.400000
>> 2013-07-25 01:00:35.000000     0.400000
>> 2013-07-25 01:00:40.000000     1.000000
>> 2013-07-25 01:00:45.000000     6.200000
>> 2013-07-25 01:00:50.000000    25.400000
>> 2013-07-25 01:00:55.000000    32.400000
>> 2013-07-25 01:01:00.000000    41.800000
>> 2013-07-25 01:01:05.000000    47.600000
>> 2013-07-25 01:01:10.000000    48.600000
>>
>> [The source files were written with rastream().]
>>
>>
>> It is well worth noting that if I start an rabins() reading from the
>> argus() socket with the following invocation, the same sort of thing
>> occurs:
>> # rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -m saddr -s ltime rate \
>> - port 5432 and src host 192.168.10.22
>>                LastTime         Rate
>> 2013-07-30 15:42:55.000000     1.400000
>> 2013-07-30 15:43:00.000000     0.600000
>> 2013-07-30 15:43:05.000000    33.800000
>> 2013-07-30 15:43:10.000000    47.400000
>> 2013-07-30 15:43:15.000000    58.600000
>> 2013-07-30 15:43:20.000000    87.600000
>> 2013-07-30 15:43:25.000000    96.200000
>> 2013-07-30 15:43:30.000000    96.000000
>> 2013-07-30 15:43:35.000000   134.200000
>> 2013-07-30 15:43:40.000000   137.200000
>> 2013-07-30 15:43:45.000000   137.400000
>> 2013-07-30 15:43:50.000000   136.600000
>> 2013-07-30 15:43:55.000000   139.800000
>> 2013-07-30 15:44:00.000000   136.200000 <-- `rate` averages about here going forward
>>
>>
>> It's irrelevant which field I use; the same thing occurs:
>> # rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -m saddr -s ltime load \
>> - port 5432 and src host 192.168.10.22
>>                LastTime     Load
>> 2013-07-30 15:50:15.000000 1461.19*
>> 2013-07-30 15:50:20.000000 42524.7*
>> 2013-07-30 15:50:25.000000 54329.5*
>> 2013-07-30 15:50:30.000000 55244.8*
>> 2013-07-30 15:50:35.000000 90164.8*
>> 2013-07-30 15:50:40.000000 92539.1*
>> 2013-07-30 15:50:45.000000 94827.1*
>> 2013-07-30 15:50:50.000000 95292.7*
>> 2013-07-30 15:50:55.000000 96286.3*
>> 2013-07-30 15:51:00.000000 94857.6*
>> 2013-07-30 15:51:05.000000 130699.*
>> 2013-07-30 15:51:10.000000 149979.*
>> 2013-07-30 15:51:15.000000 149320.*
>> [killed]
>> # rabins -M hard time 5s -B 5s -S 127.0.0.1:561 -m saddr -s ltime load \
>> - port 5432 and src host 192.168.2.22
>>                LastTime     Load
>> 2013-07-30 15:52:35.000000 33894.4*
>> 2013-07-30 15:52:40.000000 3134.84*
>> 2013-07-30 15:52:45.000000 39262.4*
>> 2013-07-30 15:52:50.000000 40024.0*
>> 2013-07-30 15:52:55.000000 41188.7*
>> 2013-07-30 15:53:00.000000 40259.2*
>> 2013-07-30 15:53:05.000000 75057.6*
>> 2013-07-30 15:53:10.000000 97160.0*
>> 2013-07-30 15:53:15.000000 106520.*
>> 2013-07-30 15:53:20.000000 138504.*
>> 2013-07-30 15:53:25.000000 153835.*
>> 2013-07-30 15:53:30.000000 152892.*
>> 2013-07-30 15:53:35.000000 154017.* <-- `load` averages here going forward
>>
>> This happens whether or not I perform field aggregation (`-m saddr`).
>>
>>
>> Why is this happening?
>>
>>
>> This seems like it will really screw up calculating moving averages
>> (figuring out spikes, etc.) from the rabins() resultant data.
>>
>>
>> Thanks!
>>
>> Matt
>