racluster issue

Carter Bullard carter at qosient.com
Thu Mar 7 10:56:50 EST 2013


Hey Craig,
If you've got so many cores, you should write the primitive data to a file, with rasplit() or
rastream() (assuming, of course, that it's argus data, not netflow data), and then have the
cores do a lot of parallel work on that single file to generate many different views of
the original data.  By writing the data into a file, you get some independence from the
data stream, and you don't have to worry about your only collector getting into trouble
because it got 100M flows all at once and ran out of memory.
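
For example, a minimal capture sketch (the collector address, archive path, and script
path are all illustrative):

   rastream -S 10.10.10.10:561 -M time 5m -B 10s -w /archive/%Y/%m/%d/argus.%Y.%m.%d.%H.%M.%S -f /usr/local/argus/rastream.sh -d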

Definitely test rastream() with -D3, to find out why it's not doing what you want it to do.  Don't
run it as a daemon; just try to find out why it's not working.  It works great for me.
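
For instance, a foreground debugging run might look like this (source and script path
illustrative):

   rastream -D3 -S 10.10.10.10:561 -M time 1m -B 10s -w /tmp/%s.argus -f /usr/local/argus/rastream.sh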

rastream() calls your script, and passes it these parameters: " -r /full/path/to/the/completed/file ".
You should use ./support/Config/rastream.sh as an example of how to deal with the
passed parameters.  The example handles all the appropriate issues.

With -D3, it will print out all the commands and results from trying to
fork and run your script.  You can also append echoes from your script into a log file, so
you can see that your script is running and what it's doing.  You control that!
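
Something as simple as this at the top of your script will do (log path illustrative):

   echo "`date`: rastream.sh called with: $*" >> /tmp/rastream.log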

We have so many ways of parallelizing the processing once the data is in a file.  Using
filters, one core processes the http traffic, one does all the other tcp traffic,
one does all the other ip traffic, and one processes the non-ip traffic.  The single
file ends up in memory, cached, with all these cores processing subsets
of the file.
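
A sketch of that layout (file names and filter expressions are illustrative):

   racluster -r argus.file -w http.out  - tcp and port 80 &
   racluster -r argus.file -w tcp.out   - tcp and not port 80 &
   racluster -r argus.file -w ip.out    - ip and not tcp &
   racluster -r argus.file -w nonip.out - not ip &
   wait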

You can carve the file up into sections using record counts, " -N x-y ", or byte offsets,
" -r file::24000:295670 ", if you know the record offsets.  So many options.  I use
rasqltimeindex() to provide the byte offsets for time boundaries, through rastream().
rastream() develops the hard split points, say on 5 minute boundaries, which become
my quantum of time processing.
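
A sketch of that carving, with illustrative file names and ranges:

   racluster -r argus.file -N 0-1000000       -w part1.out &
   racluster -r argus.file -N 1000000-2000000 -w part2.out &
   wait
   racluster -r argus.file::24000:295670 -w slice.out    # byte offsets, e.g. from rasqltimeindex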

I have rastream() writing all of my data, from dozens of argi, into a persistent archive,
establishing 5 minute files per sensor per day, and within seconds after the cutoff,
rastream() runs this script:

------ begin rastream.sh ----
PATH="/usr/local/bin:$PATH"; export PATH
package="argus-clients"
version="3.0.7"

OPTIONS="$*"
FILES=
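# walk the argument list that rastream passes in, and pick up the file name after -r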
while  test $# != 0
do
    case "$1" in
    -r) shift; FILES="$1"; break;;
    esac
    shift
done

rasqltimeindex -r "$FILES" -w mysql://root@localhost/ratop
exit 0
------ end rastream.sh ----

Then when I want to run an analytic that is based on time, I grab the data from the
archive using rasql(), in a lazy fashion (maybe 30-40 seconds after the time period
has passed).  This is only an example:

   rasql -t -10m+5m -w - | rawhatever -r - .........

The reason I do this is that many rasqls can run in parallel, from multiple
cores / machines, etc., giving you the ability to use all that you have.
Find something of interest in the 5m stream, then do it again with a real time range to get
the data of specific interest.
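
For instance, a sketch of that fan-out (the downstream analytics are illustrative):

   rasql -t -10m+5m -w - | racluster -r - -m saddr daddr -w matrix.out &
   rasql -t -10m+5m -w - | racluster -r - -w tcp.out - proto TCP &
   wait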

But,  I'm all about finding new things.  You are setting up a " find what you are looking for "
system, which may not benefit from keeping the data around, outside of splunk.

Some folks out there are getting good results using gearman, http://gearman.org/gearman,
to do large parallelization of argus aggregation, and there are lots of hadoop / flow data
papers and talks out there, but they are all processing files.  So do think about creating some
files, and then processing them.

Carter


On Mar 7, 2013, at 1:01 AM, Craig Merchant <cmerchant at responsys.com> wrote:

> It's all right...  End user support is always aggravating, particularly when you aren't paid for it!  I appreciate the help.  You're both patient with and dedicated to your user community.
> 
> From your last email...
> 
> Why 1 minute?  Because it allowed me to test more quickly.  
> 
> What is the output file name all about?  %s is the strftime value for Unix epoch time.  Splunk uses unix epoch time and it also seemed like the best time format in a file name for scripting.  List a directory, grep the file name from the results, if StartTime < filename < EndTime, then do...
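> 
> A rough sketch of that selection logic (START and END are illustrative epoch bounds):
> 
> 	for f in /ssd/argus/*.argus; do
> 	    t=`basename $f .argus`
> 	    if [ "$t" -ge "$START" ] && [ "$t" -le "$END" ]; then echo "$f"; fi
> 	done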
> 
> Clustering on 5 minutes worth of data is arbitrary.  Based on the size of the files I've seen and my history with Splunk, it seemed like the right number.
> 
> Rastream successfully output the file in both my "%s.argus" format and the "argus.%Y.%m.%d.%H.%M.%S" format, but I never saw /usr/local/argus/rastream.sh run, nor did the racluster command contained in rastream.sh run:
> 
> 	/usr/local/bin/racluster -r $1 -c "@" -p 3 -u -Z b -s "+0ltime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+spkts,+dpkts,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > /ssd/argus/test.csv
> 	exit 0
> 
> I can run "/usr/local/argus/rastream.sh filename.argus" and it successfully runs and produces the correct CSV output.  
> 
> I ran a few tests on a file containing five minutes worth of binary data just to get a rough idea of what the performance was of different options.  The results were run against a binary argus file of about 1.35 GB.  It would seem that 4 vs 5 tuples for aggregation, the number of fields output, and filters on proto/host/net/port all impact the performance and efficiency.  My machine has 32 cores in it, so if I can get the -f script functionality working for rabins, I should be able to use filters to divide the traffic up and spread it over more CPUs.
> 
> We want to use this data for a number of things in Splunk.  I can build statistical models for different server roles or networks, search for unusual TCP flags, alert on poorly performing hosts...  We probably don't need the granularity of five tuple aggregation, though it wouldn't be too tough to write logic in Splunk that could correct flows with a "?" in the direction by comparing the address/port combinations against our Nessus database and look for open services.  The Netops team can use it for diagnostics without having to learn the complexities of Argus...
> 
> Some of the tests I ran earlier tonight are below.  It shows the execution time for the search, the size (in bytes) of output (both CSV and binary formats), and how large the file is compared to the original.  Obviously the CSV to binary comparison isn't apples-to-apples, but...
> 
> Thanks.
> 
> Craig
> 
> racluster -r 1362619200.argus -p 3 -u -Z b -n > racluster_5tuple_basic.csv
> 	racluster_5tuple_basic.csv:  2:15 min	248532175	17% of original
> 
> racluster -r 1362619200.argus -p 3 -u -Z b -n -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > racluster_5tuple_full.csv
> 	racluster_5tuple_full.csv:  5:15 min	1378792663	94% of original
> 
> # 4 tuple flow -- proto,saddr,daddr,dport	
> racluster -r 1362619200.argus -p 3 -u -Z b -n -m proto saddr daddr dport > racluster_4tuple_basic.csv
> 	racluster_4tuple_basic.csv:  1:55 min	115532659	7% of original
> 
> racluster -r 1362619200.argus -p 3 -u -Z b -n -m proto saddr daddr dport -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > racluster_4tuple_full.csv
> 	racluster_4tuple_full.csv:  2:45 min	640858306	43% of original
> 
> racluster -r 1362619200.argus -p 3 -u -Z b -n -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" - host 10.150.4.8 > racluster_5tuple_full_one_host.csv
> 	racluster_5tuple_full_one_host.csv:  35 sec		1171198	
> 
> racluster -r 1362619200.argus -p 3 -u -Z b -n - proto TCP > racluster_5tuple_tcp.csv
> 	racluster_5tuple_tcp.csv:  3:30	min	176794489	12% of original
> 
> #  Binary output of racluster
> racluster -r 1362619200.argus -p 3 -u -Z b -n -w racluster_5tuple_basic.argus
> 	racluster_5tuple_basic.argus:  668236064	45% of original
> 
> racluster -r 1362619200.argus -p 3 -u -Z b -n -m proto saddr daddr dport -w racluster_4tuple_basic.argus
> 	racluster_4tuple_basic.argus:  304422084	20% of original
> 
> -----Original Message-----
> From: Carter Bullard [mailto:carter at qosient.com] 
> Sent: Wednesday, March 06, 2013 3:55 PM
> To: Craig Merchant
> Cc: Argus (argus-info at lists.andrew.cmu.edu)
> Subject: Re: [ARGUS] racluster issue
> 
> OK, sorry about the rant.  Not your fault.  Maybe a bad day.
> So... let's break it down a bit.
> 
> Converting 5sec status data to 5 minute flow data won't buy you that much,
> except the occasional direction correction, because most 5-tuple transactions
> are short-lived.  Argus will have gotten most of the data consolidated out of the sensor.
> 
> So the question is, does default aggregation actually buy you anything, regarding
> data size and semantics.  This is easy to calculate.  Start with a file of your own primitive
> data, aggregate it, and look to see how many records are reduced.  If the new
> file size is < 60% of the primitive data, then you win; if not, it's maybe too expensive for
> what you get.
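> 
> For example (file names illustrative):
> 
> 	racluster -r primitive.file -w clustered.file
> 	ls -l primitive.file clustered.file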
> 
> You want to load primitive flow data into splunk.  So what is splunk going to do
> with this data?  What are you looking for?  The answer should help you determine
> if aggregation is useful, and at what granularity.  Usually, you change the
> aggregation keys, rather than use default aggregation, to get rid of the source port, or you
> go down to matrix data, just the IP address pairs, and let load thresholds and flag indicators
> key you to anomalous behavior that helps you go back to the archive, with
> time filters to limit the search.
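> 
> A sketch of both options (file names illustrative):
> 
> 	racluster -r primitive.file -m proto saddr daddr dport -w keys-no-sport.out
> 	racluster -r primitive.file -m saddr daddr -w matrix.out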
> 
> If you're having resource problems with racluster() reading a file, your racluster -T will
> have the same problems, possibly worse.  Try to think through what you're trying to
> accomplish / find, and find an aggregation strategy that will reduce the data load.
> 
> I still don't think the " -T secs " option is what you're looking for.
> 
> Carter
> 
> On Mar 6, 2013, at 3:39 PM, Carter Bullard <carter at qosient.com> wrote:
> 
>> Hmmmm, you must have changed every aspect of my example,
>> and now you're saying that it doesn't work.   Not surprising.
>> 
>> So where did 1 minute come from ?
>> What is that output file name all about ?
>> Why isn't the script completing / running ?  Did you test it at all ?
>> 
>> Do you know why you're clustering 5 minutes of data?  Is there a purpose?
>> 
>> Carter
>> 
>> On Mar 6, 2013, at 3:03 PM, Craig Merchant <cmerchant at responsys.com> wrote:
>> 
>>> Hey, Carter...
>>> 
>>> So, I tried the first approach, but the script never executes.  I should also say my scripting skills are minimal, so I apologize for any noob-related errors in advance.
>>> 
>>> I invoked the rastream command as follows:
>>> 
>>> rastream -S 10.10.10.10:561 -M time 1m -B 10s -w /ssd/argus/%s.argus -f /usr/local/argus/rastream.sh -d
>>> 
>>> /usr/local/argus/rastream.sh is pretty simple:
>>> 
>>> /usr/local/bin/racluster -r $1 -c "@" -p 3 -u -Z b -s "+0ltime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+spkts,+dpkts,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > /root/test.csv
>>> 
>>> After testing racluster against the files produced above, I'm not sure this process will work for us.  The binary argus records produced by rastream are averaging about 250-300 MB per minute and during that testing window our data centers were only generating about 4-5 Gbps of traffic (peak volume could be in the 12-15 Gbps range).  
>>> 
>>> Both the binary argus files and the CSV output are being read/written to an array of SSDs, so the I/O is pretty fast.  Running racluster against a 1m binary file takes 50-75 seconds.  So, at that rate, I can't write the binary with rastream and aggregate the data with racluster with enough time remaining to import it into Splunk and do my analytics.
>>> 
>>> The problem I'm having with rabins is that if I try and generate ASCII output with something like "rabins -S data.source -M time 5m -B 10s > output.csv", each bin is appended to output.csv instead of overwriting it.  If I can get the scripting to work, I can probably have the script remove the old file and then create a symbolic link to the new file and have Splunk use the symbolic link to import the flows (for my use case, Splunk would require reconfiguration each time the CSV file name changes).
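>>> 
>>> A rough sketch of that rotation (paths illustrative):
>>> 
>>> 	NEW=/ssd/argus/splunk/racluster.`date +%s`.csv
>>> 	/usr/local/bin/racluster -r "$1" -c "," -p 3 -u -Z b > "$NEW"
>>> 	ln -sf "$NEW" /ssd/argus/splunk/current.csv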
>>> 
>>> So...  unless you've got a better idea, having radium handle my labeling and have racluster connect to radium for -T seconds seems like the easiest way to get a fresh CSV file of aggregated flows every X minutes.
>>> 
>>> As an aside, I noticed that rastream will convert %s.argus to the unix epoch value, but rabins just writes the file as "%s.argus"...
>>> 
>>> Thanks!
>>> 
>>> Craig
>>> 
>>> -----Original Message-----
>>> From: Carter Bullard [mailto:carter at qosient.com] 
>>> Sent: Tuesday, March 05, 2013 8:03 AM
>>> To: Craig Merchant
>>> Cc: Argus (argus-info at lists.andrew.cmu.edu)
>>> Subject: Re: [ARGUS] racluster issue
>>> 
>>> Hey Craig,
>>> You are starting to realize the same issues that caused us to create rasplit() and rastream().
>>> Flow records span whatever ARGUS_FLOW_STATUS_INTERVAL period there is, so without some
>>> record processing, the output from your methods will have irregular start and stop times.
>>> 
>>> Now, the assumption is you are processing argus records, where argus has a good
>>> configuration, meaning that the ARGUS_FLOW_STATUS_INTERVAL is reasonable,
>>> like 1-15 seconds.  With this, you should use either rabins() or rastream().
>>> 
>>> I think you should relax your requirement that rejects an intermediate argus data file.
>>> If you can do that, use rastream() to output records into a file with the date in its name,
>>> and after a brief wait once your time boundary passes, have rastream() run a shell
>>> script containing your commands against that data file.  You can delete the file when
>>> you're done, so that you aren't piling up a lot of data.
>>> 
>>> You can also use radium to label your traffic so that you don't need to do it yourself in the
>>> scripts.  But let's stay with your example:
>>> 
>>> OK assume an ARGUS_FLOW_STATUS_INTERVAL = 5 secs
>>> 
>>> rastream -M time 5m -B 10s -S data.source -w /tmp/argus.data/argus.%Y.%m.%d.%H.%M.%S -f rastream.sh -d
>>> 
>>> This will get the data into a file structure that will be useful, and 10 seconds after each 5 min time boundary,
>>> rastream will run the rastream.sh shell script, passing the file as the single parameter.  Use the
>>> ./support/Config/rastream.sh as a guide, and in the script have something like:
>>> 
>>> racluster -r $1 -w - | ralabel -f ralabel.conf -F ralabel.script.conf > /ssd/argus/splunk/racluster.csv
>>> 
>>> where ralabel.script.conf has all your particulars in it, like comma separated, and the fields.
>>> Not sure what your " -M dsrs="+metric,+agr..." " is doing; I would remove that.
>>> 
>>> This will give you a new /ssd/argus/splunk/racluster.csv 10 seconds after each 5 minute period.
>>> Check the last write time to see that it's changed, and then feed it into whatever.
>>> 
>>> rabins() is being used by most sites to generate periodic ASCII output of aggregated data.
>>> Gloriad does this for their spinning globe. 
>>> 
>>> See http://www.gloriad.org/gloriaddrupal/
>>> 
>>> So in your example, you would have radium() do the labeling, so that you don't have to pipe
>>> anything in your terminal analytic.  This should work:
>>> 
>>>  rabins -S data.source -M time 5m -B 10s -F ralabel.script.conf
>>> 
>>> rabins() will sit there, and then 10 seconds after each 5 minute period, like 05:00:10, it will write out
>>> all its clustered data, starting with a START MAR and ending with a STOP MAR, which can be used
>>> to recognize the beginning and the end of the time period.  So no intermediate files
>>> of any kind.  I don't necessarily like this, as you hold a lot of data in memory before writing out the
>>> time period results, creating a bit of a pipeline issue.
>>> 
>>> So what do you think, which one will you use ?
>>> 
>>> Carter
>>> 
>>> On Mar 4, 2013, at 11:05 PM, Craig Merchant <cmerchant at responsys.com> wrote:
>>> 
>>>> Carter,
>>>> 
>>>> Here's what I'm trying to do and I may not be going about it the smartest way...  I would like racluster, rabins, or rastream to output a csv file containing five minutes of flow data, aggregated using proto, saddr, daddr, sport, and dport.  That CSV file will be imported into Splunk for analysis every five minutes.  I would prefer for the CSV file to be overwritten each time the argus client outputs five minutes of aggregated flows.  I would also prefer to avoid writing to an argus binary file as an intermediary step.
>>>> 
>>>> The way I've been doing it is to set up an entry in the crontab file that looks like:
>>>> 
>>>> 00,05,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/bin/racluster -S 10.10.10.10:561 -T 300 -p 3 -u -Z b -w - | /usr/local/bin/ralabel -r - -f /usr/local/argus/ralabel.conf -c "," -M dsrs=+metric,+agr,+psize,+cocode -n -p 3 -u -Z b -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > /ssd/argus/splunk/racluster.csv
>>>> 
>>>> The problem is that when I'm checking the timestamp on the racluster.csv file, it's always on the 01,06,11,... minute.  So, it looks like even though racluster is set to connect to radium for 300 seconds, it's writing out the results after < 120 seconds.  I also tried just running the racluster part of the above command on the command-line and it is also writing the results out before the full five minutes has elapsed.
>>>> 
>>>> Is there a smarter way to accomplish my goal?  If not, how can I figure out why racluster isn't connecting for the full length of time specified in the -T flag?
>>>> 
>>>> Thanks.
>>>> 
>>>> Craig
>>> 
>>> 
>> 
> 
> 
