Clustering flows within a specific time interval

Wed Jan 11 10:33:20 EST 2012

Hey Manaf,
The tool for this is rasqltimeindex(), but it is poorly documented.  This program uses
MySQL and builds "Filename" and "Seconds" tables that hold the byte offsets of
argus data records for the start of every second in the file.  rasql(), with a time filter,
then accesses the tables to find the records from the specified time range.
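
To illustrate the idea only (this is not the actual rasqltimeindex() code or its
table schema), a per-second byte-offset index could be sketched in Python roughly
as below. The function names and the (start time, byte offset) input pairs are
hypothetical, and the sketch assumes the records are already in start-time order.

```python
import bisect

def build_second_index(records):
    """Build a per-second index from (start_time, byte_offset) pairs.

    Assumes `records` is in start-time order (an assumption of this sketch).
    Returns two parallel lists: the seconds seen, and the byte offset of the
    first record that starts in each of those seconds.
    """
    seconds, offsets = [], []
    for stime, offset in records:
        sec = int(stime)
        # Keep only the first record of each new second.
        if not seconds or sec > seconds[-1]:
            seconds.append(sec)
            offsets.append(offset)
    return seconds, offsets

def offset_for(seconds, offsets, start):
    """Byte offset of the first indexed second >= start, or None if past the end."""
    i = bisect.bisect_left(seconds, start)
    return offsets[i] if i < len(offsets) else None
```

A reader would then seek to offset_for(seconds, offsets, start) and stop reading
once it passes the end of the requested range, instead of scanning the whole file.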

This program is designed to work with standard argus archives, where the files are
persistent, and so the tools allow for finding data pretty quickly in very large repositories,
but it could be used in a more dynamic way.

I'm not sure that it's usable in its current state without some dialog.  I will try to put
together a "HowTo" description on how to use it before I get back from FloCon.

Until then, most sites use rasplit.1 to divide the large data files into more manageable
time periods. rasplit.1 is well documented, so it may be the best approach for you.
I split all of my data streams into 5-minute files, and then my Perl scripts take the
"-t timerangefilter" and find the files that need to be processed to find the data.
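
As a rough sketch of that file-selection step (in Python rather than my Perl, and
not my actual scripts), the function below lists the 5-minute files that overlap a
time range. The file-name pattern, the use of UTC, and the function name are all
assumptions made for illustration; adjust them to however your files are named.

```python
from datetime import datetime, timezone

def files_for_range(start, end, pattern="argus.%Y.%m.%d.%H.%M.%S", step=300):
    """Return the names of the `step`-second files overlapping [start, end).

    `start` and `end` are epoch seconds. Each file is assumed to be named
    after the UTC start of its time bin, using the strftime `pattern`.
    """
    t = start - (start % step)   # round down to the start of the first bin
    names = []
    while t < end:
        names.append(datetime.fromtimestamp(t, tz=timezone.utc).strftime(pattern))
        t += step
    return names

# A 60-second window starting at 1293864155 falls inside a single 5-minute file.
print(files_for_range(1293864155, 1293864155 + 60))
# prints ['argus.2011.01.01.06.40.00']
```

Only the files returned need to be passed to racluster, along with the same
"-t" time range.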

Let me improve the rasqltimeindex() approach so that it can be useful for you.

Carter

On Jan 11, 2012, at 3:31 AM, manaf gharaibeh wrote:

> Hi,
> 
> I have huge Argus files (each with records of flows for an entire day). I am trying to gather statistics like the number of flows, number of different sources, or source packets that target the same destination within a given interval of time like 1 minute. I use the following command line within a Perl script to cluster flows based on destination then sort the result of that based on the number of source packets to destinations:
> `racluster -nw - @arglist -m daddr -t @timeIneterval |rasort -u -m spkts -s daddr stime ltime dur spkts srate -c, > spktsSorted.dat`; 
> 
> where @arglist contains user command-line options, mainly the name of the input argus file. And @timeIneterval contains a time interval in a form like i1293864155+60s. The result is saved to spktsSorted.dat file in a comma separated format.
> 
> Now here is my problem: The argus files I have are originally sorted based on the ending time of a flow rather than the starting time of that flow. So when I run the racluster command, it has no clue where the flows that fall within the specified interval are. It simply searches through the whole argus file, which is very expensive with huge files like the ones I'm working with. I used the option -N to limit the number of flows that racluster should find, and that reduced the time needed by the command significantly. But this is not a good solution, since I might lose some flows. And if the integer given with -N is larger than the number of flows that satisfy the specified constraints, then I have the original expensive exhaustive search problem again.
> 
> So the question is: how can I cluster flows based on destination host IP within a specific time interval in a reasonable time, that is to cluster flows that were active during an interval that starts at x and ends at y based on their destination IP addresses?  
>  
> -Manaf
