Long time series

Rafael Barbosa rrbarbosa at gmail.com
Thu Sep 27 10:53:41 EDT 2012


Hi Carter,

> How big are your files and how much memory do you have?


Around 2000 files of ~30 MB each. The traffic load is not high (local
traffic only). I ran some tests on my local machine with 4 GB of RAM, but
the server where I will run it now has 32 GB.

> Can you use rasplit() to create the hourly bins, and then run racluster()
> to generate your hourly aggregates?


Yes, this seems to be the way to go. For some reason I was worried about
what would happen at the edges of the bins, with connections that span
multiple bins. But I think rasplit should handle that just fine.
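
Roughly what I have in mind, with placeholder file names (an untested
sketch, not what I am running yet):

$> rasplit -M time 1h -r ../data/* -w 'bins/%Y.%m.%d.%H.arg'
$> racluster -f racluster.conf -r bins/<one-hour-file>.arg -w <one-hour-file>.agg

That is, first split everything into hourly files, then aggregate each
hourly file on its own, so only one hour's worth of flows has to sit in
memory at a time.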

> what is it that you are actually trying to do with your time series data?


At the moment I am simply looking at connectivity graphs: who talks to
whom, and using which service? But soon I will look a little deeper,
including numbers of packets, bytes, and flows. So I wanted the solution
to be flexible.
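
For the "who talks to whom over which service" part, something along
these lines is roughly what I have in mind (the aggregation fields and
printed columns are just an example):

$> racluster -m saddr daddr proto dport -r hourly.arg -w - | ra -r - -s saddr daddr proto dport pkts bytes > graph.txt

That collapses the data to one record per (source, destination, service)
combination, and the packet/byte counts are already there for when I want
to dig deeper.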

> The -R option works fine for me with rabins().  What kind of behavior are
> you getting?


There is something odd:

$> rabins -M hard time 1h -R ../data/
rabins[11151]: 16:50:10.929848 no input files
$> cd ..
$> rabins -M hard time 1h -R data/
<Runs just fine>
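
So for now I will just run it from the parent directory as above; I assume
an absolute path, e.g. something like

$> rabins -M hard time 1h -R /abs/path/to/data/

would also avoid the problem, but I have not checked.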

--
Rafael Barbosa
http://www.ewi.utwente.nl/~barbosarr/



On Thu, Sep 27, 2012 at 2:08 PM, Carter Bullard <carter at qosient.com> wrote:

> Hey Rafael,
> Not sure what you're trying to accomplish, so hard to make a good
> recommendation.
> How big are your files and how much memory do you have ?
>
> The rabins() call you're doing isn't really doing any real data reduction.
>  I would
> suspect that your average transaction duration is under 1 second, so
> "binning"
> the data, holding it all in memory, before outputting it, just isn't going
> to do much
> for you.  Can you aggregate the data with a data reduction key, instead of
> the
> default?
>
> Can you use rasplit() to create the hourly bins, and then run racluster()
> to generate
> your hourly aggregates?
>
> Because you are probably looking to work with only one metric in your
> output,
> you can throw all the DSRs away, except for the few that matter, like the
> time, flow
> and metric dsrs.  That will save you a massive amount of memory.
>
>    rabins -M dsrs="flow,time,metric"
>
> what is it that you are actually trying to do with your time series data?
>
> I have always advocated 5 minute files, and I recommend this to you.  If
> you need
> hourly aggregate data, it can be generated from processed 5 minute files.
> As a test, I would recommend that you take one current file, and use
> rasplit() to
> generate 5 minute files, aggregate the 5 minute files with racluster(),
> and then
> run racluster() again on the 5 minute aggregates, to generate a 1 hour
> aggregate.
> The resulting files will give you an indication of how many flows you're
> dealing
> with, and how much memory will be required to do the job.
>
> The -R option works fine for me with rabins().  What kind of behavior are
> you getting?
>
> Carter
>
> On Sep 26, 2012, at 8:05 AM, Rafael Barbosa <rrbarbosa at gmail.com> wrote:
>
> Hi all,
>
> What is the recommended way to generate large time series with rabins?
>
> Some context. I am running:
> $> rabins -f racluster.conf -M hard time 1h -r ../data/* - some-filter >
> time-series.txt
>
> And:
> $> cat racluster.conf
> filter="" status=0 idle=300
>
> In the 'data' folder I have around 3 months of data, each file with
> roughly 40-60 min worth of traffic. However, I rapidly run out of memory
> and I can't afford the swapping. Is there a way to do this with argus
> using less memory? Or should I start generating multiple time series
> (e.g. one per day) and 'stitch' them together afterwards?
>
> I tried setting -B 5s, but it seems to have little impact, if any.
>
> Extra: why does rabins not accept the -R option?
>
> Rafael Barbosa
> http://www.ewi.utwente.nl/~barbosarr/
>
>
>