Huge argus files and racluster

Carter Bullard carter at qosient.com
Tue Feb 7 11:08:25 EST 2012


Hey Marco,
Regardless of what time range you work with, there will always be
a flow that extends beyond that range.  You have to figure out what
you are trying to say with the data to decide if you need to count
every connection only once.

If 5, 10, or 15 minute files aren't attractive, racluster.1 provides
configuration options so you can efficiently track long term flows, but
it depends on finding an effective idle timeout that makes persistent
tracking work within your memory limits.  See racluster.5.  Most flows are
finished in less than a second, so keeping all of those flows in memory
is a waste.  Figuring out a good idle timeout strategy, however, is an art.

By default, racluster's idle timeout is "infinite" and so it holds each flow in
memory until the end of processing.  If you decide that 600 seconds
of idle time is sufficient to decide that the flow is done (120 works for
most, except Windows boxes, which can send TCP Resets for
connections that have been closed for well over 300 seconds), then
a simple racluster.conf file of:

racluster.conf
    filter="" model="saddr daddr proto sport dport" status=0 idle=600

may keep you from running out of memory.  If a flow hasn't seen any
activity in 600 seconds, racluster.1 will report the flow and release
its memory. 

    racluster -f racluster.conf -r your.files -w single.output.file
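
To make the idle timeout idea concrete, here is a minimal Python sketch of a
flow cache that reports a flow and frees its memory once it has been idle
too long.  It is a model for illustration only, not how racluster.1 is
implemented; the function name cluster() and the (time, key, nbytes) record
shape are assumptions made up for this example.

```python
from collections import OrderedDict


def cluster(records, idle=600.0):
    """Merge time-ordered (time, key, nbytes) records into flows.

    A flow is reported and dropped from memory once it has been idle
    for more than `idle` seconds. Illustrative model only; this is not
    racluster's actual implementation.
    """
    cache = OrderedDict()  # key -> [start, last, nbytes], oldest last-seen first
    for t, key, nbytes in records:
        # Report every cached flow whose idle time exceeds the timeout.
        while cache:
            oldest = next(iter(cache))
            start, last, total = cache[oldest]
            if t - last <= idle:
                break
            del cache[oldest]
            yield (oldest, start, last, total)
        if key in cache:
            flow = cache[key]
            flow[1] = t
            flow[2] += nbytes
            cache.move_to_end(key)  # keep the cache ordered by last-seen time
        else:
            cache[key] = [t, t, nbytes]
    # End of input: report whatever is still held in memory.
    for key, (start, last, total) in cache.items():
        yield (key, start, last, total)


# Hypothetical input: key "B" goes quiet for longer than the timeout.
records = [(0, "A", 10), (1, "B", 5), (100, "A", 10), (700, "C", 1), (705, "B", 2)]
flows = list(cluster(records, idle=600))
```

In this run key "B" is reported twice, once when it times out and once at
the end of input, which is the price of bounding memory.  The flows also
come out ordered by last-seen time rather than start time.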

Improving on the aggregation model would include protocol and port
specific idle time strategies, such as:

racluster.better.conf
    filter="udp and port domain" model="saddr daddr proto sport dport" status=0 idle=10
    filter="udp" model="saddr daddr proto sport dport" status=0 idle=60
    filter="" model="saddr daddr proto sport dport" status=0 idle=600
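
One way to read a configuration like this is as an ordered rule list in
which the first filter that matches a record decides its idle timeout.  The
Python sketch below models that reading; the first-match behaviour is an
assumption here (check racluster.5 for the exact semantics), and RULES and
idle_for() are hypothetical names, not part of any argus tool.

```python
# Hypothetical rule table mirroring racluster.better.conf:
# (predicate, idle timeout in seconds), most specific rule first.
RULES = [
    # "udp and port domain": UDP with port 53 on either side.
    (lambda f: f["proto"] == "udp" and 53 in (f["sport"], f["dport"]), 10),
    # "udp": any other UDP flow.
    (lambda f: f["proto"] == "udp", 60),
    # "": the empty filter matches everything.
    (lambda f: True, 600),
]


def idle_for(flow):
    """Return the idle timeout of the first rule that matches `flow`."""
    for matches, idle in RULES:
        if matches(flow):
            return idle


dns = {"proto": "udp", "sport": 40000, "dport": 53}
ntp = {"proto": "udp", "sport": 123, "dport": 123}
web = {"proto": "tcp", "sport": 51000, "dport": 80}
timeouts = [idle_for(dns), idle_for(ntp), idle_for(web)]
```

Under that assumption the order of the lines matters: putting the empty
filter first would give every flow the 600 second timeout.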

The output data stream of this type of processing will be semi-sorted
in last-time-seen order, rather than start-time order, so that may be a
consideration for you.  Sorting is currently a memory hog, so don't
expect to sort these records after you generate the single output file
without some strategy, such as using rasplit.1.
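
As a sketch of what a split-then-sort strategy can look like, the Python
below writes each record into a per-time-bin file and then sorts one bin at
a time, so only a single bin has to fit in memory.  This is an illustration
of the general idea, not of what rasplit.1 or rasort.1 do internally; the
"stime" field, the JSON-lines files, and the function name are all
assumptions made for the example.

```python
import json
import os
import tempfile


def sort_in_time_bins(records, workdir, bin_seconds=300):
    """Yield records in start-time order, sorting one time bin at a time."""
    # Pass 1: append each record to the file for its start-time bin.
    bins = set()
    for rec in records:
        b = int(rec["stime"] // bin_seconds)
        bins.add(b)
        with open(os.path.join(workdir, f"bin.{b}.jsonl"), "a") as fh:
            fh.write(json.dumps(rec) + "\n")
    # Pass 2: only one bin is loaded and sorted at any moment.
    for b in sorted(bins):
        with open(os.path.join(workdir, f"bin.{b}.jsonl")) as fh:
            chunk = [json.loads(line) for line in fh]
        chunk.sort(key=lambda r: r["stime"])
        yield from chunk


# Hypothetical records arriving out of start-time order.
unsorted = [
    {"stime": 610.0, "id": "c"},
    {"stime": 5.0, "id": "a"},
    {"stime": 299.0, "id": "b"},
    {"stime": 601.5, "id": "d"},
]
with tempfile.TemporaryDirectory() as workdir:
    ordered_ids = [r["id"] for r in sort_in_time_bins(unsorted, workdir)]
```

Memory use is then set by the busiest bin rather than by the whole data
set, at the cost of writing the data out once more.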

Using state, such as TCP closing state to declare that a flow is done, is
an attractive approach, but it has huge problems, and I don't recommend it.

rasqlinsert.1 is the tool of choice if you really would like to have 1 flow
record per flow, and you're running out of resources.

Using argus-clients-3.0.5.31 from the developers thread of code,
use rasqlinsert.1 with the caching option.

   rasqlinsert -M cache -r your.files -w mysql://user@localhost/db/raOutfile

This causes rasqlinsert.1 to use a database table as its flow cache.
It's pretty efficient: it's not going to do a database transaction per
record when records can be aggregated, so you do get some wins.
When it's finished processing, create your single file with:

   rasql -r mysql://user@localhost/db/raOutfile -w single.output.file
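
The idea of using a database table as the flow cache can be sketched in a
few lines of Python.  This example swaps in the standard library's sqlite3
module in place of MySQL, and its table layout, column names, and helper
functions are assumptions for illustration; it does not reproduce the
schema or the batching that rasqlinsert.1 uses.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Create a flow table keyed on the 5-tuple (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS flows (
               saddr TEXT, daddr TEXT, proto TEXT, sport INTEGER, dport INTEGER,
               stime REAL, ltime REAL, pkts INTEGER,
               PRIMARY KEY (saddr, daddr, proto, sport, dport))"""
    )
    return db


def merge_record(db, key, stime, ltime, pkts):
    """Fold one record into the row for its flow key, inserting if new."""
    cur = db.execute(
        "UPDATE flows SET stime = MIN(stime, ?), ltime = MAX(ltime, ?), "
        "pkts = pkts + ? "
        "WHERE saddr = ? AND daddr = ? AND proto = ? AND sport = ? AND dport = ?",
        (stime, ltime, pkts, *key),
    )
    if cur.rowcount == 0:  # no existing row for this key
        db.execute(
            "INSERT INTO flows VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (*key, stime, ltime, pkts),
        )


# Two status records for the same hypothetical long-lived flow.
db = open_cache()
key = ("10.0.0.1", "10.0.0.2", "tcp", 40000, 80)
merge_record(db, key, 0.0, 10.0, 5)
merge_record(db, key, 400.0, 420.0, 3)
db.commit()
rows = db.execute("SELECT stime, ltime, pkts FROM flows").fetchall()
```

The flow state lives on disk instead of in process memory, so the number of
concurrent flows is no longer limited by RAM.  Unlike the sketch, which
issues a statement for every record, a real tool would want to aggregate in
memory first and write to the table less often.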


There are problems with any approach that aggregates over long periods
of time, because systems reuse the 5-tuple flow attributes that make
up a flow key much faster than you would think.  This results in many situations
where multiple independent sessions will be reported as a single very
long lived flow.  This is particularly evident with DNS, where if you aggregate
over months, you find that you get fewer and fewer DNS transactions (they
tend to approach somewhere around 32K) between host and server, and
instead of lasting around 0.025 seconds, they seem to last for months.
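
The effect is easy to reproduce in a toy model.  In the Python sketch below
a client cycles through a small pool of source ports for its DNS queries,
and the records are aggregated by 5-tuple with no idle timeout.  The
addresses, the pool size, the one-query-per-second spacing, and the strict
round-robin port choice are all made-up assumptions; real stacks pick ports
differently, but the number of distinct flow keys is still bounded by the
number of ports in use.

```python
def dns_transactions(count, port_pool, spacing=1.0):
    """Yield (time, 5-tuple) pairs for `count` independent DNS queries."""
    for i in range(count):
        sport = 1024 + (i % port_pool)  # hypothetical round-robin port reuse
        yield (i * spacing, ("10.0.0.1", "10.0.0.53", "udp", sport, 53))


def aggregate_forever(transactions):
    """Aggregate by 5-tuple with no idle timeout at all."""
    flows = {}  # key -> [start, last, transaction count]
    for t, key in transactions:
        if key in flows:
            flows[key][1] = t
            flows[key][2] += 1
        else:
            flows[key] = [t, t, 1]
    return flows


flows = aggregate_forever(dns_transactions(1000, port_pool=100))
first = flows[("10.0.0.1", "10.0.0.53", "udp", 1024, 53)]
duration = first[1] - first[0]
```

A thousand independent queries collapse into only as many "flows" as there
are ports in the pool, and each of those appears to last almost as long as
the whole capture.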

I like 5 minute files, and if I need to understand what is going on just at
the edge of two 5 minute boundaries, I read them both, and focus on the edge
time boundary.  Anything longer than that is another type of time domain,
and there are lots of processing strategies for developing data at that scale,
that may be useful.

Carter


On Feb 7, 2012, at 9:45 AM, Marco wrote:

> 
> Thanks. But what about long-lived flows that last more than 5 minutes?
> Will they be merged or will they appear once per 5-minute file in the
> result? The whole point of clustering is having a single entry for
> each of them, AFAIK.
