ragraph and unsorted files

Rafael Barbosa rrbarbosa at gmail.com
Tue May 3 04:36:23 EDT 2011


Hi,

I simply noticed the 'bug' but did not think about the whys. The problem of
generating a time series in a single pass is clear. Maybe it helps if I
explain how I arrived at the example I sent.

I need to generate statistics on a per-flow basis; for that I created a
racluster.conf file that aggregates all records using a 5-minute timeout
interval (filter="" status=0 idle=300). As my data consists of a few fairly
large flows (spanning a few days), this problem occurs: the difference in
start time between the 1st and 2nd records in my aggregated file is 4(!)
days.
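Concretely, the setup was along these lines (the file names here are hypothetical; the configuration line is the one quoted above):

```
# racluster.conf -- aggregate all records, 5-minute idle timeout
filter="" status=0 idle=300
```

followed by something like `racluster -f racluster.conf -r flows.argus -w agg.argus`, assuming -f names the aggregation configuration file, -r the input, and -w the output.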

There are several possible solutions for my specific problem:
- use the '-t' option to define the range.
- keep 'status' reports in the files used to generate the time series.
- use a 2-pass approach, as you suggested; that would also solve the
problem, but I think my case is a rare one, and it's not worth the
performance impact.

The biggest problem, in my opinion, is that the user might not be aware of
this issue and thus think the graph generated from the unsorted data is
correct. So it would be good to emit a warning message whenever a flow with
a start time earlier than the first bin is found while generating the time
series. What do you think of this?

Rafael Barbosa
http://www.vf.utwente.nl/~barbosarr/



On Mon, May 2, 2011 at 6:47 PM, Carter Bullard <carter at qosient.com> wrote:

> Hey Rafael,
> ragraph() is just a front end to rabins(), and so any problems will be
> caused by rabins().
>
> I think this is a bug, so I'll take a look to see what I can do.
> rabins() is our time-series engine, so it has a lot of bells and
> whistles in it.  ragraph() doesn't need all the stuff that makes
> rabins() complicated, so it may be that there is a better strategy.
>
> The reason there is some complexity to the problem is that we want the
> approach rabins() uses for bin management to work with both streaming
> data and file-based data.  With infinite streaming data, you need to be
> concerned with memory management, so our strategy is to use a "sliding
> window" type of data processing for aggregation, etc.  As you suggested,
> the problem is that rabins() does not allow for a large window when
> processing files.
>
> If you were to give ragraph() an explicit time range to graph, this problem
> would go away.
>
> So, the client library supports the notion of multi-pass processing of
> files.  If you look at the source code, all clients have a variable
> ArgusPassNum, and if, in your own client's initialization routine, you
> define that to be 2, as an example, we would process the input file list
> twice.  I could use that to simply scan the data from the file list on
> the first pass to set the time-series start and stop times, and then run
> the data through again to tally the results, but the performance can be
> pretty bad if I do that as a general strategy.  Still, that would be
> faster than having to sort the data prior to graphing it.
>
> I'll look to see if this is a bug, or a feature.  How wildly out of order
> are the records?
>
> Carter
>
>
> On May 2, 2011, at 11:26 AM, Rafael Barbosa wrote:
>
> Hi all,
>
> I ran into something today that might be considered a bug: ragraph does
> not handle files that are not ordered by 'stime' well. Basically, it
> seems that ragraph uses the info in the first record to initialize the
> time series, so flows that occur earlier in time (but appear later in
> the file) are ignored, or at least processed erroneously.
>
> I uploaded the file 'ragraph-unsorted.zip', which contains an example,
> to ftp://qosient.com/incoming.
>
> An easy workaround is to make sure that the file is sorted with
> rasort() before using ragraph, e.g.:
> rasort -m stime -r flows.argus -w sorted.argus
>
> Best regards,
> Rafael Barbosa
> http://www.vf.utwente.nl/~barbosarr/
>
>
>