Huge argus files and racluster

Tue Feb 7 13:26:41 EST 2012

Hey Marco,
Argus is very good at neither overcounting nor undercounting packets, so don't
worry about how the aggregation model affects accuracy; that has been worked
over very thoroughly.

Since you are interested in making sense of it all, you should run racount.1 first:

   racount -r files -M proto addr

You should be doing some very large aggregations, such as:

   racluster -m matrix/16 -r files -s stime dur saddr daddr pkts bytes - ip

This will show you which CIDR /16 networks are talking to whom.
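
If it helps to see what a matrix/16 aggregation means, here is a toy Python sketch of the idea. It is not how racluster is implemented; the function names and the flow records are made up for illustration, and it keys on the (source, destination) network pair exactly as written, without trying to reproduce how the real tool handles flow direction.

```python
import ipaddress
from collections import defaultdict

def net16(addr):
    """Return the /16 network containing addr, e.g. 10.1.2.3 -> 10.1.0.0/16."""
    return str(ipaddress.ip_network(f"{addr}/16", strict=False))

def matrix16(records):
    """Sum pkts and bytes per (source /16, destination /16) pair."""
    totals = defaultdict(lambda: {"pkts": 0, "bytes": 0})
    for saddr, daddr, pkts, nbytes in records:
        key = (net16(saddr), net16(daddr))
        totals[key]["pkts"] += pkts
        totals[key]["bytes"] += nbytes
    return dict(totals)

# Made-up flow records: (saddr, daddr, pkts, bytes)
records = [
    ("10.1.2.3", "192.168.5.9", 10, 1200),
    ("10.1.9.9", "192.168.7.7", 5, 600),
    ("192.168.5.9", "10.1.2.3", 3, 300),
]
result = matrix16(records)
for (src, dst), t in sorted(result.items()):
    print(src, dst, t["pkts"], t["bytes"])
```

The first two records collapse into one 10.1.0.0/16 to 192.168.0.0/16 entry, which is why the output stays small even on a huge input.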

If you want to know the list of IP addresses that are active:

   racluster -M rmon -m saddr -r files -w addrs.out - ip

Then you can aggregate for the networks, or the countries, or whatever:

   racluster -r addrs.out -m saddr/24 -s stime dur saddr spkts dpkts sbytes dbytes - ip
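
Roughly, the two steps (per-address records first, then per-network totals) work like the toy Python sketch below. This only illustrates the idea, on the assumption that rmon mode turns each bidirectional flow into one record per endpoint; the helper names rmon_split and aggregate and the sample flows are hypothetical, not argus code.

```python
import ipaddress
from collections import defaultdict

def rmon_split(flow):
    """Turn one bidirectional flow into two records, one per endpoint.

    Each output record is (addr, pkts_sent, pkts_received).
    """
    saddr, daddr, spkts, dpkts = flow
    return [(saddr, spkts, dpkts), (daddr, dpkts, spkts)]

def aggregate(records, prefix_len=32):
    """Sum sent/received packets per address (/32) or per network prefix."""
    totals = defaultdict(lambda: [0, 0])
    for addr, sent, rcvd in records:
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        # Use the bare address for /32 keys so they can be re-aggregated later.
        key = str(net.network_address) if prefix_len == 32 else str(net)
        totals[key][0] += sent
        totals[key][1] += rcvd
    return {k: tuple(v) for k, v in totals.items()}

# Made-up flows: (saddr, daddr, spkts, dpkts)
flows = [
    ("10.0.0.1", "10.0.1.5", 4, 6),
    ("10.0.0.2", "10.0.1.5", 1, 2),
]

# Step 1: one entry per active address.
per_host = aggregate([r for f in flows for r in rmon_split(f)])

# Step 2: roll the per-address entries up into /24 networks.
per_net = aggregate([(a, s, r) for a, (s, r) in per_host.items()], prefix_len=24)
```

The point of doing it in two steps is that the second aggregation reads the small per-address result rather than the original data.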

If you want to aggregate based on the country code, you need to use ralabel to set the
country codes.  Check out 'man ralabel' and 'man 5 ralabel' to see how to do that, and you can
do that with the IP address file you created above:

   ralabel -f ralabel.country.code.conf -r addrs.out -w - |  racluster -m sco -w - | \
      rasort -m sco -v -s stime dur sco spkts dpkts sbytes dbytes

There are also the perl scripts:

   rahosts -r files
   raports -r files

These are pretty informative, and will serve you well.
That should get you started.

Carter

On Feb 7, 2012, at 11:38 AM, Marco wrote:

> Thanks for the detailed answer. I suppose a bit more of background on
> what I'm trying to do is in order here. Basically, I've been handed
> that 50GB pcap monster and been told to "make sense of it".
> Essentially, it contains all the traffic to and from the Internet seen
> on a particular LAN.
> "making sense of it" basically means, in simple terms, finding out:
> 
> - global bandwidth usage (incoming, outgoing)
> - bandwidth usage by protocol (http, smtp, dns, etc.), again incoming
> and outgoing
> - traffic between specific source/destination hosts (possibly
> including detailed protocol usage within that specific traffic)
> 
> Ideally, I'd like to graph some or all of that information, but for
> now I'm ok with running some command line query using racluster/rasort
> to get textual tabular output.
> 
> So, based on what I read, the first thing I was doing was trying to
> summarize the pcap data into an argus file to use as a starting point,
> and that file should ideally include exactly one entry per flow (where
> flow==saddr daddr proto sport dport), because otherwise (if I
> understand correctly) packets, bytes, etc. belonging to a specific
> flow would be counted multiple times, which is not what I want (it's
> entirely possible that I'm misunderstanding how argus works though).
> Note that I'm mostly interested in aggregated numbers here rather than
> detailed flow analysis. For example: I'd like to get all flows where
> the protocol is TCP and dport is 80, then obtain aggregated sbytes and
> dbytes for all those flows. Same for other well-known destination
> ports.
> 
> As it's probably clear by now, I'm a novice to argus, so any help
> would be appreciated (including pointers to examples or other material
> to study). Thanks for your help.
> 
> 2012/2/7 Carter Bullard <carter at qosient.com>:
>> Hey Marco,
>> Regardless of what time range you work with, there will always be
>> a flow that extends beyond that range.  You have to figure out what
>> you are trying to say with the data to decide if you need to count
>> every connection only once.
>> 
>> If 5 or 10 or 15 minute files aren't attractive, racluster.1 provides you
>> configuration options so you can efficiently track long term flows, but
>> it is based on finding an effective idle timeout that will make persistent
>> tracking work for your memory limits.  See racluster.5.  Most flows are
>> finished in less than a second, and so keeping all of those flows in memory
>> is a waste.  Figuring out a good idle timeout strategy, however, is an art.
>> 
>> By default, racluster's idle timeout is "infinite" and so it holds each flow in
>> memory until the end of processing.  If you decide that 600 seconds
>> of idle time is sufficient to decide that the flow is done (120 works for
>> most, except Windows boxes, which can send TCP Resets for
>> connections that have been closed for well over 300 seconds), then
>> a simple racluster.conf file of:
>> 
>> racluster.conf
>>    filter="" model="saddr daddr proto sport dport" status=0 idle=600
>> 
>> may keep you from running out of memory.  If a flow hasn't seen any
>> activity in 600 seconds, racluster.1 will report the flow and release
>> its memory.
>> 
>>    racluster -f racluster.conf -r your.files -w single.output.file
>> 
>> Improving on the aggregation model would include protocol and port
>> specific idle time strategies, such as:
>> 
>> racluster.better.conf
>>    filter="udp and port domain" model="saddr daddr proto sport dport" status=0 idle=10
>>    filter="udp" model="saddr daddr proto sport dport" status=0 idle=60
>>    filter="" model="saddr daddr proto sport dport" status=0 idle=600
>> 
>> The output data stream of this type of processing will be semi-sorted
>> in last time seen order, rather than start time order, so that may be a
>> consideration for you.  Sorting currently is a memory hog, so don't
>> expect to sort these records after you generate the single output file,
>> without some strategy, like using rasplit.1.
>> 
>> Using state, such as TCP closing state to declare that a flow is done, is
>> an attractive approach, but it has huge problems, and I don't recommend it.
>> 
>> rasqlinsert.1 is the tool of choice if you really would like to have 1 flow
>> record per flow, and you're running out of resources.
>> 
>> Using argus-clients-3.0.5.31 from the developers thread of code,
>> use rasqlinsert.1 with the caching option.
>> 
>>   rasqlinsert -M cache -r your.files -w mysql://user@localhost/db/raOutfile
>> 
>> This causes rasqlinsert.1 to use a database table as its flow cache.
>> It's pretty efficient, so it's not going to do a database transaction per
>> record, if there would be aggregation, so you do get some wins.
>> When it's finished processing, then create your single file with:
>> 
>>   rasql -r mysql://user@localhost/db/raOutfile -w single.output.file
>> 
>> 
>> There are problems with any approach that aggregates over long periods
>> of time, because systems do reuse the 5-tuple flow attributes that make
>> up a flow key much faster than you would think.  This results in many situations
>> where multiple independent sessions will be reported as a single very
>> long lived flow.  This is particularly evident with DNS, where if you aggregate
>> over months, you find that you get fewer and fewer DNS transactions (they
>> tend to approach somewhere around 32K) between host and server, and
>> instead of lasting around 0.025 seconds, they seem to last for months.
>> 
>> I like 5 minute files, and if I need to understand what is going on just at
>> the edge of two 5 minute boundaries, I read them both, and focus on the edge
>> time boundary.  Anything longer than that is another type of time domain,
>> and there are lots of processing strategies for developing data at that scale,
>> that may be useful.
>> 
>> Carter
>> 
>> 
>> On Feb 7, 2012, at 9:45 AM, Marco wrote:
>> 
>>> 
>>> Thanks. But what about long-lived flows that last more than 5 minutes?
>>> Will they be merged or will they appear once per 5-minute file in the
>>> result? The whole point of clustering is having a single entry for
>>> each of them, AFAIK.
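
For reference, the idle-timeout idea from the quoted racluster.conf discussion can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the racluster implementation: the function name, the record layout (time, key, pkts) and the sample data are all hypothetical, and it scans the whole cache on every record, which a real tool would not do.

```python
def aggregate_with_idle_timeout(records, idle=600):
    """Merge records per flow key, flushing a flow once it has been idle.

    records: iterable of (time, key, pkts), sorted by time.
    Returns a list of (key, first_seen, last_seen, pkts) in flush order.
    """
    cache = {}
    out = []
    for t, key, pkts in records:
        # Report and release every cached flow that has been idle too long.
        expired = [k for k, e in cache.items() if t - e["last"] > idle]
        for k in expired:
            e = cache.pop(k)
            out.append((k, e["first"], e["last"], e["pkts"]))
        entry = cache.get(key)
        if entry is None:
            cache[key] = {"first": t, "last": t, "pkts": pkts}
        else:
            entry["last"] = t
            entry["pkts"] += pkts
    # End of input: flush whatever is left, in last-seen order.
    for k, e in sorted(cache.items(), key=lambda kv: kv[1]["last"]):
        out.append((k, e["first"], e["last"], e["pkts"]))
    return out

# Made-up records: (time in seconds, flow key, pkts)
records = [(0, "A", 1), (5, "B", 2), (100, "A", 3), (800, "B", 1), (900, "A", 4)]

with_timeout = aggregate_with_idle_timeout(records, idle=600)
no_timeout = aggregate_with_idle_timeout(records, idle=float("inf"))
```

With idle=600 the cache never holds more than two flows, and each key is reported twice, once per burst of activity. With an infinite timeout every flow stays in memory until the end, and the separate bursts are merged into one long-lived record per key, which is the same effect as the reused 5-tuple problem described in the quoted text. In both cases the output comes out in last-seen order rather than start-time order.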
