Huge argus files and racluster

Marco listaddr at gmail.com
Tue Feb 7 11:38:20 EST 2012


Thanks for the detailed answer. I suppose a bit more background on
what I'm trying to do is in order here. Basically, I've been handed
a 50GB pcap monster and told to "make sense of it". It contains all
the traffic to and from the Internet seen on a particular LAN.
"Making sense of it" means, concretely, finding out:

- global bandwidth usage (incoming, outgoing)
- bandwidth usage by protocol (http, smtp, dns, etc.), again incoming
and outgoing
- traffic between specific source/destination hosts (possibly
including detailed protocol usage within that specific traffic)

Ideally, I'd like to graph some or all of that information, but for
now I'm ok with running some command line query using racluster/rasort
to get textual tabular output.
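For the graphing part, my understanding is that argus-clients also ships
ragraph.1 (an rrdtool front end); something like the following is what I
had in mind (the file name and bin size are just placeholders, and the
exact syntax is my guess from the man page):

```shell
# Hypothetical sketch: plot total bytes sent and received over time,
# in 1-minute bins, from a merged argus file named capture.argus.
ragraph sbytes dbytes -M 1m -r capture.argus -w bandwidth.png
```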

So, based on what I read, the first thing I tried was to summarize the
pcap data into an argus file to use as a starting point. That file
should ideally contain exactly one entry per flow (where
flow == saddr daddr proto sport dport), because otherwise (if I
understand correctly) packets, bytes, etc. belonging to a single flow
would be counted multiple times, which is not what I want (though it's
entirely possible that I'm misunderstanding how argus works).
Note that I'm mostly interested in aggregated numbers here rather than
detailed flow analysis. For example: I'd like to get all flows where
the protocol is TCP and dport is 80, then obtain aggregated sbytes and
dbytes for all those flows. Same for other well-known destination
ports.
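To make this concrete, here is the kind of query I have in mind. The
command lines are my best guess from the man pages, and the file names
are placeholders, so please correct me if the flags are wrong:

```shell
# One record per (proto, dport) pair, keeping only web traffic, then
# print the aggregated byte counters. Filters follow the usual ra.1 form.
racluster -r capture.argus -m proto dport -w - - tcp and dst port 80 | \
  ra -r - -s proto dport sbytes dbytes

# Or rank all destination ports by bytes returned from the server side.
racluster -r capture.argus -m proto dport -w - | \
  rasort -r - -m dbytes -s proto dport sbytes dbytes
```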

As it's probably clear by now, I'm a novice to argus, so any help
would be appreciated (including pointers to examples or other material
to study). Thanks for your help.

2012/2/7 Carter Bullard <carter at qosient.com>:
> Hey Marco,
> Regardless of what time range you work with, there will always be
> a flow that extends beyond that range.  You have to figure out what
> you are trying to say with the data to decide if you need to count
> every connection only once.
>
> If 5, 10, or 15 minute files aren't attractive, racluster.1 provides you
> configuration options so you can efficiently track long term flows, but
> it is based on finding an effective idle timeout that will make persistent
> tracking work for your memory limits.  See racluster.5.  Most flows are
> finished in less than a second, and so keeping all of those flows in memory
> is a waste.  Figuring out a good idle timeout strategy, however, is an art.
>
> By default, racluster's idle timeout is "infinite" and so it holds each flow in
> memory until the end of processing.  If you decide that 600 seconds
> of idle time is sufficient to decide that the flow is done (120 works for
> most, except Windows boxes, which can send TCP Resets for
> connections that have been closed for well over 300 seconds), then
> a simple racluster.conf file of:
>
> racluster.conf
>    filter="" model="saddr daddr proto sport dport" status=0 idle=600
>
> may keep you from running out of memory.  If a flow hasn't seen any
> activity in 600 seconds, racluster.1 will report the flow and release
> its memory.
>
>    racluster -f racluster.conf -r your.files -w single.output.file
>
> Improving on the aggregation model would include protocol and port
> specific idle time strategies, such as:
>
> racluster.better.conf
>    filter="udp and port domain" model="saddr daddr proto sport dport" status=0 idle=10
>    filter="udp" model="saddr daddr proto sport dport" status=0 idle=60
>    filter="" model="saddr daddr proto sport dport" status=0 idle=600
>
> The output data stream of this type of processing will be semi-sorted
> in last time seen order, rather than start time order, so that may be a
> consideration for you.  Sorting is currently a memory hog, so don't
> expect to sort these records after you generate the single output file
> without some strategy, like using rasplit.1.
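One way to read that rasplit.1 suggestion (file names and the 5-minute
bin size are illustrative, not prescribed by the thread):

```shell
# Split the semi-sorted stream into 5-minute files, then sort each
# small piece by start time so no single sort has to hold everything.
rasplit -M time 5m -r single.output.file -w 'argus.%Y.%m.%d.%H.%M.%S'
for f in argus.2*; do
  rasort -m stime -r "$f" -w "$f.sorted"
done
```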
>
> Using state, such as TCP closing state to declare that a flow is done, is
> an attractive approach, but it has huge problems, and I don't recommend it.
>
> rasqlinsert.1 is the tool of choice if you really would like to have 1 flow
> record per flow, and you're running out of resources.
>
> Using argus-clients-3.0.5.31 from the developers thread of code,
> use rasqlinsert.1 with the caching option.
>
>   rasqlinsert -M cache -r your.files -w mysql://user@localhost/db/raOutfile
>
> This causes rasqlinsert.1 to use a database table as its flow cache.
> It's pretty efficient: it won't do a database transaction per record
> when records aggregate, so you do get some wins.
> When it's finished processing, create your single file with:
>
>   rasql -r mysql://user@localhost/db/raOutfile -w single.output.file
>
>
> There are problems with any approach that aggregates over long periods
> of time, because systems reuse the 5-tuple flow attributes that make
> up a flow key much faster than you would think.  This results in many situations
> where multiple independent sessions will be reported as a single very
> long lived flow.  This is particularly evident with DNS, where if you aggregate
> over months, you find that you get fewer and fewer DNS transactions (they
> tend to approach somewhere around 32K) between host and server, and
> instead of lasting around 0.025 seconds, they seem to last for months.
>
> I like 5 minute files, and if I need to understand what is going on just at
> the edge of two 5 minute boundaries, I read them both, and focus on the edge
> time boundary.  Anything longer than that is another type of time domain,
> and there are lots of processing strategies for developing data at that
> scale that may be useful.
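For that longer time domain, rabins.1 is one option worth mentioning:
it aggregates into fixed time bins, which also sidesteps the 5-tuple
reuse problem described above, since a reused 5-tuple in a later bin
becomes a separate record. A sketch, with assumed file names and a
5-minute bin size:

```shell
# Aggregate into hard 5-minute bins; a reused 5-tuple in a later bin
# shows up as a new record instead of extending a months-long "flow".
rabins -M hard time 5m -m saddr daddr proto sport dport \
       -r single.output.file -w binned.output.file
```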
>
> Carter
>
>
> On Feb 7, 2012, at 9:45 AM, Marco wrote:
>
>>
>> Thanks. But what about long-lived flows that last more than 5 minutes?
>> Will they be merged or will they appear once per 5-minute file in the
>> result? The whole point of clustering is having a single entry for
>> each of them, AFAIK.


