Clustering flows within a specific time interval

Wed Jan 18 01:39:42 EST 2012

Hi Carter,

I got the argus clients installed from argus-clients-3.0.5.29.tar.gz. I can see the man page for the rasqltimeindex tool, but the tool itself doesn't seem to be with the rest of the ra* tools in the bin directory, so I couldn't give it a try. Maybe the makefiles need to be updated.

Thanks,
-Manaf

________________________________
 From: Carter Bullard <carter at qosient.com>
To: manaf gharaibeh <manafhgh at yahoo.com> 
Cc: "argus-info at lists.andrew.cmu.edu" <argus-info at lists.andrew.cmu.edu> 
Sent: Wednesday, January 11, 2012 2:40 PM
Subject: Re: [ARGUS] Clustering flows within a specific time interval

Hey Manaf,
I uploaded argus-clients-3.0.5.29.tar.gz to the developers site.
It has a rasqltimeindex.1 man page, and I fixed a few things with the code, so
it's now ready to go.  It does have a mysql requirement, so if you could try it
out and tell me if it's working, that would be great.  This is what you would do
for a given file:

   rasqltimeindex -r file -w mysql://user@localhost/db

where user is the mysql account, I use 'root', and db is the database name
you will use to hold the tables.  Then you use rasql() to make the queries:

   rasql -r mysql://user@localhost/db -t timeFilter

If you want to cluster that data, you will pipe the rasql output to racluster:

   rasql -r mysql://user@localhost/db -t timeFilter-timeFilter -w - | racluster ……

You can index whole days at a time, or entire archives at once.  rasqltimeindex()
will not index the same file twice, so, you should be able to experiment.
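For anyone scripting this, here is a rough Python sketch of the two steps above. It only assembles the command lines shown in this message and pipes rasql into racluster. The helper names (`index_cmd`, `query_cmds`, `run_pipeline`) are hypothetical, the `-m daddr` clustering key is borrowed from the original question, and the time filter is passed through unchanged as an opaque string. It assumes the tools are on the PATH and the MySQL database is reachable.

```python
import subprocess

def index_cmd(argus_file, user, db):
    """argv for indexing one argus file into MySQL (flags as shown above)."""
    return ["rasqltimeindex", "-r", argus_file,
            "-w", f"mysql://{user}@localhost/{db}"]

def query_cmds(user, db, time_filter, cluster_args=("-m", "daddr")):
    """argv pair for `rasql ... -w - | racluster ...`.

    cluster_args is an assumption taken from the original question.
    """
    rasql = ["rasql", "-r", f"mysql://{user}@localhost/{db}",
             "-t", time_filter, "-w", "-"]
    racluster = ["racluster", *cluster_args]
    return rasql, racluster

def run_pipeline(rasql, racluster):
    """Run rasql and pipe its output into racluster; return racluster's stdout."""
    p1 = subprocess.Popen(rasql, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(racluster, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # so rasql sees a closed pipe if racluster exits early
    out, _ = p2.communicate()
    p1.wait()
    return out
```

Building the argv lists separately from running them makes it easy to print and check the commands before touching a large archive.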

Do give this a try, and if you have any problem, holler.

Carter

On Jan 11, 2012, at 3:01 PM, manaf gharaibeh wrote:

>Thanks Carter,
>The rasqltimeindex() tool sounds interesting. It would be nice to have the HowTo for it. I'll also try with rasplit.1 and see what I can get.
>
>
>Cheers, 
>-Manaf
>
>
>
>________________________________
> From: "argus-info-request at lists.andrew.cmu.edu" <argus-info-request at lists.andrew.cmu.edu>
>To: argus-info at lists.andrew.cmu.edu 
>Sent: Wednesday, January 11, 2012 8:33 AM
>Subject: Argus-info Digest, Vol 77, Issue 4
> 
>Today's Topics:
>
>   1.  Argus ralabel (CS Lee)
>   2. Re:  (no subject) (Bruce Hawkins)
>   3.  Clustering flows within a specific time interval
>      (manaf gharaibeh)
>   4. Re:  Clustering flows within a specific time interval
>      (Carter Bullard)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Wed, 11 Jan 2012 10:37:29 +0800
>From: CS Lee <geek00l at gmail.com>
>Subject: [ARGUS] Argus ralabel
>To: Argus <argus-info at lists.andrew.cmu.edu>
>Message-ID:
>    <CABWd2irzQbO9QSu96FpOc=CTmHEMZk3X2qJM92isZD6TtS731A at mail.gmail.com>
>Content-Type: text/plain; charset="iso-8859-1"
>
>hi Carter,
>
>Maxmind has released ipv6 to AS mapping where you can find here -
>
>http://geolite.maxmind.com/download/geoip/database/asnum/
>
>Will you add support for ipv6 to AS for ralabel, that would be something
>good to have!
>
>Cheers!
>
>-- 
>Best Regards,
>
>CS Lee<geek00L[at]gmail.com>
>
>http://geek00l.blogspot.com
>http://defcraft.net
>
>------------------------------
>
>Message: 3
>Date: Wed, 11 Jan 2012 00:31:19 -0800 (PST)
>From: manaf gharaibeh <manafhgh at yahoo.com>
>Subject: [ARGUS] Clustering flows within a specific time interval
>To: "argus-info at lists.andrew.cmu.edu"
>    <argus-info at lists.andrew.cmu.edu>
>Message-ID:
>    <1326270679.37971.YahooMailNeo at web33807.mail.mud.yahoo.com>
>Content-Type: text/plain; charset="iso-8859-1"
>
>Hi,
>
>I have huge Argus files (each with records of flows for an entire day). I am trying to gather statistics like the number of flows, number of different sources, or source packets that target the same destination within a given interval of time like 1 minute. I use the following command line within a Perl script to cluster flows based on destination then sort the result of that based on the number of source packets to destinations:
>`racluster -nw - @arglist -m daddr -t @timeIneterval |rasort -u -m spkts -s daddr stime ltime dur spkts srate -c, > spktsSorted.dat`;
>
>where @arglist contains user command-line options, mainly the name of the input argus file, and @timeIneterval contains a time interval in a form like i1293864155+60s. The result is saved to the spktsSorted.dat file in comma-separated format.
>
>Now here is my problem: The argus files I have are originally sorted based on the ending time of a flow rather than the starting time of that flow. So when I run the racluster command, it has no clue where the flows that fall within the specified interval are. It simply searches through the whole argus file, which is very expensive with huge files like the ones I'm working with. I used the -N option to limit the number of flows that racluster should find, and that reduced the time needed by the command significantly. But this is not a good solution since I might lose some flows. And if the integer given with -N is larger than the number of flows that satisfy the specified constraints, then I am back to the original expensive exhaustive search problem.
>
>So the question is: how can I cluster flows based on destination host IP within a specific time interval in a reasonable time? That is, how can I cluster flows that were active during an interval that starts at x and ends at y, based on their destination IP addresses?
>
>-Manaf
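To make the question concrete, here is a small pure-Python sketch of the computation being asked for: keep the flows that were active at any point in [x, y], group them by destination address, and report flow count, distinct sources, and total source packets, sorted by source packets. The record layout (dicts with `stime`, `ltime`, `saddr`, `daddr`, `spkts`) and the sample data are made up for illustration. Whether the argus `-t` filter applies exactly this overlap rule is not stated in this thread.

```python
from collections import defaultdict

def cluster_by_daddr(flows, x, y):
    """Aggregate flows active in [x, y] per destination address.

    A flow counts as active if its [stime, ltime] span overlaps [x, y] at all.
    Returns rows of (daddr, flow_count, distinct_sources, total_spkts).
    """
    stats = defaultdict(lambda: {"flows": 0, "sources": set(), "spkts": 0})
    for f in flows:
        if f["stime"] <= y and f["ltime"] >= x:  # interval overlap test
            s = stats[f["daddr"]]
            s["flows"] += 1
            s["sources"].add(f["saddr"])
            s["spkts"] += f["spkts"]
    rows = [(d, s["flows"], len(s["sources"]), s["spkts"])
            for d, s in stats.items()]
    # Largest total source-packet count first, like sorting on spkts.
    return sorted(rows, key=lambda r: r[3], reverse=True)

# Hypothetical sample flows (times in seconds).
flows = [
    {"stime": 100, "ltime": 130, "saddr": "10.0.0.1", "daddr": "10.0.0.9", "spkts": 5},
    {"stime": 150, "ltime": 170, "saddr": "10.0.0.2", "daddr": "10.0.0.9", "spkts": 7},
    {"stime": 10,  "ltime": 20,  "saddr": "10.0.0.3", "daddr": "10.0.0.8", "spkts": 99},
    {"stime": 155, "ltime": 400, "saddr": "10.0.0.1", "daddr": "10.0.0.7", "spkts": 20},
]
result = cluster_by_daddr(flows, 120, 180)
```

The overlap test needs both the start and the end time of each flow, which is why a file ordered only by ending time forces a scan of the whole file unless some index exists.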
>
>------------------------------
>
>Message: 4
>Date: Wed, 11 Jan 2012 10:33:20 -0500
>From: Carter Bullard <carter at qosient.com>
>Subject: Re: [ARGUS] Clustering flows within a specific time interval
>To: manaf gharaibeh <manafhgh at yahoo.com>
>Cc: "argus-info at lists.andrew.cmu.edu"
>    <argus-info at lists.andrew.cmu.edu>
>Message-ID: <60E5372D-346C-4381-AC5C-97C5AB5D1FEA at qosient.com>
>Content-Type: text/plain; charset="iso-8859-1"
>
>Hey Manaf,
>The tool for this is rasqltimeindex(), but it is poorly documented.  This program uses
>mysql and builds "Filename" and "Seconds" tables, that hold the byte offsets of
>argus data records for the start of every second in the file.  rasql(), with a time filter,
>then accesses the tables, to find the records from the specified time range.
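As a toy illustration of the idea described above (not the actual rasqltimeindex schema, which is not shown in this thread), the following Python sketch keeps an in-memory map from each second to the byte offset of the first record that starts in that second, then looks up the offsets for a time range. The function names and the sample records are hypothetical.

```python
import bisect

def build_seconds_index(records):
    """records: iterable of (start_time_seconds, byte_offset) pairs.

    Returns parallel sorted lists (seconds, offsets) holding, for each second
    present, the smallest byte offset of a record starting in that second.
    """
    first = {}
    for stime, offset in records:
        sec = int(stime)
        if sec not in first or offset < first[sec]:
            first[sec] = offset
    seconds = sorted(first)
    return seconds, [first[s] for s in seconds]

def offsets_for_range(index, start, end):
    """Byte offsets of the indexed seconds that fall in [start, end]."""
    seconds, offsets = index
    lo = bisect.bisect_left(seconds, start)
    hi = bisect.bisect_right(seconds, end)
    return offsets[lo:hi]

# Hypothetical records: (start time, byte offset in the file).
records = [(1000.2, 0), (1000.9, 128), (1001.5, 256), (1003.0, 384), (1003.4, 512)]
index = build_seconds_index(records)
```

With such an index, a reader can seek directly to the returned offsets instead of scanning the file from the beginning.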
>
>This program is designed to work with standard argus archives, where the files are
>persistent, and so the tools allow for finding data pretty quickly in very large repositories,
>but it could be used in a more dynamic way.
>
>I'm not sure that it's usable in its current state without some dialog.  I will try to put
>together a "HowTo" description on how to use it before I get back from FloCon.
>
>Until then, most sites use rasplit.1 to divide the large data files into more manageable
>time periods. rasplit.1 is well documented, so it may be the best approach for you.
>I split all of my data streams into 5 minute files, and then my Perl scripts take the
>"-t timerangefilter" and find the files that need to be processed to find the data.
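The file-selection step can be sketched roughly as follows, in Python rather than Perl. Given a time range in epoch seconds, it lists the 5-minute files whose time bins overlap that range. The file naming pattern is purely an assumption for illustration, since the thread does not say how the split files are named. The sketch also ignores flows that started in an earlier bin, so a real script may need to widen the range.

```python
import time

BIN = 300  # seconds per file: 5-minute files, as described above

def files_for_range(start, end, pattern="argus.%Y.%m.%d.%H.%M.%S"):
    """Names of the 5-minute files whose bins overlap [start, end).

    start and end are epoch seconds (UTC); pattern is a hypothetical
    strftime-style naming scheme for the split files.
    """
    first_bin = start - (start % BIN)  # round down to the bin boundary
    return [time.strftime(pattern, time.gmtime(t))
            for t in range(first_bin, end, BIN)]

# A 60 second interval starting at epoch second 1293864155.
one_minute = files_for_range(1293864155, 1293864155 + 60)
# An interval that straddles a 5-minute boundary.
straddle = files_for_range(1293864290, 1293864310)
```

Only the files returned need to be passed to racluster with the time filter, which keeps each run proportional to the interval rather than to the whole day.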
>
>Let me improve the rasqltimeindex() approach so that it can be useful for you.
>
>Carter