racluster issue

Thu Mar 7 01:01:53 EST 2013

It's all right...  End user support is always aggravating, particularly when you aren't paid for it!  I appreciate the help.  You're both patient with and dedicated to your user community.

From your last email...

Why 1 minute?  Because it allowed me to test more quickly.  

What is the output file name all about?  %s is the strftime value for Unix epoch time.  Splunk uses unix epoch time and it also seemed like the best time format in a file name for scripting.  List a directory, grep the file name from the results, if StartTime < filename < EndTime, then do...
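A minimal sketch of that selection logic, assuming every file in one directory is named <epoch>.argus.  The function name select_argus_files is hypothetical and not part of argus; the comparison is strict on both ends, matching StartTime < filename < EndTime.

```shell
#!/bin/sh
# Print the <epoch>.argus files in a directory whose epoch name falls
# strictly between a start and an end time (both in Unix epoch seconds).
# select_argus_files is a hypothetical helper, not an argus tool.
select_argus_files() {
    dir=$1
    start=$2
    end=$3
    for path in "$dir"/*.argus; do
        [ -e "$path" ] || continue          # no matches: the glob stays literal
        name=$(basename "$path" .argus)
        case $name in
            ''|*[!0-9]*) continue ;;        # skip names that are not pure digits
        esac
        if [ "$name" -gt "$start" ] && [ "$name" -lt "$end" ]; then
            printf '%s\n' "$path"
        fi
    done
}
```

Called as "select_argus_files /ssd/argus 1362619000 1362619400", it would print /ssd/argus/1362619200.argus if that file exists, and the result can be fed to whatever runs next.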

Clustering on 5 minutes worth of data is arbitrary.  Based on the size of the files I've seen and my history with Splunk, it seemed like the right number.

rastream successfully wrote the file in both my "%s.argus" format and the "argus.%Y.%m.%d.%H.%M.%S" format, but I never saw /usr/local/argus/rastream.sh run, nor did the racluster command contained in rastream.sh:

	/usr/local/bin/racluster -r $1 -c "@" -p 3 -u -Z b -s "+0ltime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+spkts,+dpkts,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > /ssd/argus/test.csv
	exit 0

I can run "/usr/local/argus/rastream.sh filename.argus" and it successfully runs and produces the correct CSV output.  

I ran a few tests on a file containing five minutes worth of binary data just to get a rough idea of how different options perform.  The tests were run against a binary argus file of about 1.35 GB.  It would seem that 4 vs 5 tuples for aggregation, the number of fields output, and filters on proto/host/net/port all impact the performance and efficiency.  My machine has 32 cores in it, so if I can get the -f script functionality working for rabins, I should be able to use filters to divide the traffic up and spread it over more CPUs.
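A rough sketch of the divide-by-filter idea, assuming one racluster process per filter is acceptable and that each one writes its own binary output with -w.  The helper name split_by_filter, the part<N>.argus output names, and the RACLUSTER override variable are all hypothetical; only the -r, -w and "- filter" usage comes from the tests in this email.  Whether this actually beats a single run is untested.

```shell
#!/bin/sh
# Run one racluster per filter in the background, then wait for all of them.
# split_by_filter is a hypothetical helper. RACLUSTER can be overridden,
# for example with a stub, to check the generated command lines.
RACLUSTER=${RACLUSTER:-/usr/local/bin/racluster}

split_by_filter() {
    infile=$1
    outdir=$2
    shift 2
    i=0
    for filter in "$@"; do
        i=$((i + 1))
        # $filter is left unquoted on purpose so "proto tcp" becomes two words.
        "$RACLUSTER" -r "$infile" -w "$outdir/part$i.argus" - $filter &
    done
    wait    # block until every background job has finished
}
```

Each filter argument is passed after the "-" separator exactly as in the "- proto TCP" and "- host 10.150.4.8" tests.  The filters would need to be mutually exclusive so that no flow lands in two parts.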

We want to use this data for a number of things in Splunk.  I can build statistical models for different server roles or networks, search for unusual TCP flags, alert on poorly performing hosts...  We probably don't need the granularity of five tuple aggregation, though it wouldn't be too tough to write logic in Splunk that could correct flows with a "?" in the direction by comparing the address/port combinations against our Nessus database and look for open services.  The Netops team can use it for diagnostics without having to learn the complexities of Argus...

Some of the tests I ran earlier tonight are below.  It shows the execution time for the search, the size (in bytes) of output (both CSV and binary formats), and how large the file is compared to the original.  Obviously the CSV to binary comparison isn't apples-to-apples, but...

Thanks.

Craig

racluster -r 1362619200.argus -p 3 -u -Z b -n > racluster_5tuple_basic.csv
	racluster_5tuple_basic.csv:  2:15 min	248532175	17% of original

racluster -r 1362619200.argus -p 3 -u -Z b -n -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > racluster_5tuple_full.csv
	racluster_5tuple_full.csv:  5:15 min	1378792663	94% of original

# 4 tuple flow -- proto,saddr,daddr,dport	
racluster -r 1362619200.argus -p 3 -u -Z b -n -m proto saddr daddr dport > racluster_4tuple_basic.csv
	racluster_4tuple_basic.csv:  1:55 min	115532659	7% of original

racluster -r 1362619200.argus -p 3 -u -Z b -n -m proto saddr daddr dport -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > racluster_4tuple_full.csv
	racluster_4tuple_full.csv:  2:45 min	640858306	43% of original

racluster -r 1362619200.argus -p 3 -u -Z b -n -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" - host 10.150.4.8 > racluster_5tuple_full_one_host.csv
	racluster_5tuple_full_one_host.csv:  35 sec	1171198

racluster -r 1362619200.argus -p 3 -u -Z b -n - proto TCP > racluster_5tuple_tcp.csv
	racluster_5tuple_tcp.csv:  3:30 min	176794489	12% of original

#  Binary output of racluster
racluster -r 1362619200.argus -p 3 -u -Z b -n -w racluster_5tuple_basic.argus
	racluster_5tuple_basic.argus:  668236064	45% of original

racluster -r 1362619200.argus -p 3 -u -Z b -n -m proto saddr daddr dport -w racluster_4tuple_basic.argus
	racluster_4tuple_basic.argus:  304422084	20% of original
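
For reference, the "% of original" figures above can be reproduced with a few lines of shell, and the same arithmetic can be checked against the 60% rule of thumb in Carter's message quoted below.  This is only a sketch: the helper names percent_of_original and worth_aggregating are hypothetical, and the exact byte count of the original file has to be supplied, since it is only given here as about 1.35 GB.

```shell
#!/bin/sh
# Print the size of an aggregated file as a whole percentage of the original,
# truncating any fraction. percent_of_original is a hypothetical helper.
percent_of_original() {
    orig_bytes=$1
    new_bytes=$2
    if [ "$orig_bytes" -le 0 ]; then
        echo "original size must be positive" >&2
        return 1
    fi
    echo $(( new_bytes * 100 / orig_bytes ))
}

# Succeeds when the aggregated file is under the 60% threshold.
worth_aggregating() {
    [ "$(percent_of_original "$1" "$2")" -lt 60 ]
}
```

Both helpers take the original size first and the aggregated size second, in bytes.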

-----Original Message-----
From: Carter Bullard [mailto:carter at qosient.com] 
Sent: Wednesday, March 06, 2013 3:55 PM
To: Craig Merchant
Cc: Argus (argus-info at lists.andrew.cmu.edu)
Subject: Re: [ARGUS] racluster issue

OK, sorry about the rant.  Not your fault.  Maybe a bad day.
So...  let's break it down a bit.

Converting 5 sec status data to 5 minute flow data won't buy you that much,
except the occasional direction correction, because most 5-tuple transactions
are short lived.  Argus will have gotten most of the data consolidated out of the sensor.

So the question is, does default aggregation actually buy you anything regarding
data size and semantics?  This is easy to calculate.  Start with a file of your own primitive
data, aggregate it, and look to see by how much the records are reduced.  If the new
file size is < 60% of the primitive data, then you win; if not, it may be too expensive for
what you get.

You want to load primitive flow data into Splunk.  So what is Splunk going to do
with this data?  What are you looking for?  The answer should help you to determine
if aggregation is useful, and at what granularity level.  Usually, you change the
aggregation keys, not default aggregation, to get rid of the source port, or you blow
in matrix data, just the IP address pairs, and let load thresholds and flag indicators
key you to some anomalous behavior that helps you to go back to the archive, with
time filters to limit the search.

If you're having resource problems with racluster() reading a file, your racluster -T will
have the same problems, possibly worse.  Try to think through what you're trying to
accomplish / find, and find an aggregation strategy that will reduce the data load.

I still don't think the " -T secs " option is what you're looking for.

Carter

On Mar 6, 2013, at 3:39 PM, Carter Bullard <carter at qosient.com> wrote:

> Hmmmm, you must have changed every aspect of my example,
> and now you're saying that it doesn't work.   Not surprising.
> 
> So where did 1 minute come from ?
> What is that output file name all about ?
> Why isn't the script completing / running ?  Did you test it at all ?
> 
> Do you know why you're clustering 5 minutes of data?  Is there a purpose?
> 
> Carter
> 
> On Mar 6, 2013, at 3:03 PM, Craig Merchant <cmerchant at responsys.com> wrote:
> 
>> Hey, Carter...
>> 
>> So, I tried the first approach, but the script never executes.  I should also say my scripting skills are minimal, so I apologize for any noob-related errors in advance.
>> 
>> I invoked the rastream command as follows:
>> 
>> rastream -S 10.10.10.10:561 -M time 1m -B 10s -w /ssd/argus/%s.argus -f /usr/local/argus/rastream.sh -d
>> 
>> /usr/local/argus/rastream.sh is pretty simple:
>> 
>> /usr/local/bin/racluster -r $1 -c "@" -p 3 -u -Z b -s "+0ltime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+spkts,+dpkts,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > /root/test.csv
>> 
>> After testing racluster against the files produced above, I'm not sure this process will work for us.  The binary argus records produced by rastream are averaging about 250-300 MB per minute and during that testing window our data centers were only generating about 4-5 Gbps of traffic (peak volume could be in the 12-15 Gbps range).  
>> 
>> Both the binary argus files and the CSV output are being read/written to an array of SSDs, so the I/O is pretty fast.  Running racluster against a 1m binary file takes 50-75 seconds.  So, at that rate, I can't write the binary with rastream and aggregate the data with racluster with enough time remaining to import it into Splunk and do my analytics.
>> 
>> The problem I'm having with rabins is that if I try and generate ASCII output with something like "rabins -S data.source -M time 5m -B 10s > output.csv", each bin is appended to output.csv instead of overwriting it.  If I can get the scripting to work, I can probably have the script remove the old file and then create a symbolic link to the new file and have Splunk use the symbolic link to import the flows (for my use case, Splunk would require reconfiguration each time the CSV file name changes).
>> 
>> So...  unless you've got a better idea, having radium handle my labeling and have racluster connect to radium for -T seconds seems like the easiest way to get a fresh CSV file of aggregated flows every X minutes.
>> 
>> As an aside, I noticed that rastream will convert %s.argus to the unix epoch value, but rabins just writes the file as "%s.argus"...
>> 
>> Thanks!
>> 
>> Craig
>> 
>> -----Original Message-----
>> From: Carter Bullard [mailto:carter at qosient.com] 
>> Sent: Tuesday, March 05, 2013 8:03 AM
>> To: Craig Merchant
>> Cc: Argus (argus-info at lists.andrew.cmu.edu)
>> Subject: Re: [ARGUS] racluster issue
>> 
>> Hey Craig,
>> You are starting to realize the same issues that caused us to create rasplit() and rastream().
>> Flow records span whatever ARGUS_FLOW_STATUS_INTERVAL period there is, so without some
>> record processing, the output from your methods will have irregular start and stop times.
>> 
>> Now, the assumption is you are processing argus records, where argus has a good
>> configuration, meaning that the ARGUS_FLOW_STATUS_INTERVAL is reasonable,
>> like 1-15 seconds.  With this, you should use either rabins() or rastream().
>> 
>> I think you should relax your requirement that rejects an intermediate argus data file.
>> If you can do that, use rastream() to output records into a file with the date in its name,
>> and after a brief wait time after your time boundary passes, have rastream() run a shell
>> script containing your commands against that data file.  You can delete the file when
>> you're done, so that you aren't piling up a lot of data.  
>> 
>> You can also use radium to label your traffic so that you don't need to do it yourself in the
>> scripts.  But let's stay with your example:
>> 
>> OK assume an ARGUS_FLOW_STATUS_INTERVAL = 5 secs
>> 
>>  rastream -M time 5m -B 10s -S data.source -w /tmp/argus.data/argus.%Y.%m.%d.%H.%M.%S -f rastream.sh -d
>> 
>> This will get the data into a file structure that will be useful, and 10 seconds after each 5 min time boundary,
>> rastream will run the rastream.sh shell script, passing the file as the single parameter.  Use the
>> ./support/Config/rastream.sh as a guide, and in the script have something like:
>> 
>>  racluster -r $1 -w - | ralabel -f ralabel.conf -F ralabel.script.conf > /ssd/argus/splunk/racluster.csv
>> 
>> where ralabel.script.conf has all your particulars in it, like comma separated, and the fields.
>> Not sure what your " -M dsrs="+metric,+agr..." is doing, I would remove that.
>> 
>> This will give you a new /ssd/argus/splunk/racluster.csv 10 seconds after each 5 minute period.
>> Check the last write time to see that it's changed, and then feed it into whatever.
>> 
>> rabins() is being used by most sites to generate periodic ASCII output of aggregated data.
>> Gloriad does this for their spinning globe. 
>> 
>>  See http://www.gloriad.org/gloriaddrupal/
>> 
>> So in your example, you would have radium() do the labeling, so that you don't have to pipe
>> anything in your terminal analytic.  This should work:
>> 
>>   rabins -S data.source -M time 5m -B 10s -F ralabel.script.conf
>> 
>> rabins() will sit there, and then 10 seconds after each 5 minute period, like 05:00:10, it will write out
>> all its clustered data, starting with a START MAR and ending with a STOP MAR, which can be used
>> to mark the beginning and the end of this time period.  So no intermediate files
>> of any kind.  I don't like this, necessarily, as you hold a lot of data in memory before writing out the
>> time period results, creating a bit of a pipeline issue.
>> 
>> So what do you think, which one will you use ?
>> 
>> Carter
>> 
>> On Mar 4, 2013, at 11:05 PM, Craig Merchant <cmerchant at responsys.com> wrote:
>> 
>>> Carter,
>>> 
>>> Here's what I'm trying to do and I may not be going about it the smartest way...  I would like racluster, rabins, or rastream to output a csv file containing five minutes of flow data, aggregated using proto, saddr, daddr, sport, and dport.  That CSV file will be imported into Splunk for analysis every five minutes.  I would prefer for the CSV file to be overwritten each time the argus client outputs five minutes of aggregated flows.  I would also prefer to avoid writing to an argus binary file as an intermediary step.
>>> 
>>> The way I've been doing it is to set up an entry in the crontab file that looks like:
>>> 
>>> 00,05,10,15,20,25,30,35,40,45,55 * * * * /usr/local/bin/racluster -S 10.10.10.10:561 -T 300 -p 3 -u -Z b -w - | /usr/local/bin/ralabel -r - -f /usr/local/argus/ralabel.conf -c "," -M dsrs=+metric,+agr,+psize,+cocode -n -p 3 -u -Z b -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > /ssd/argus/splunk/racluster.csv
>>> 
>>> The problem is that when I'm checking the timestamp on the racluster.csv file, it's always on the 01,06,11,... minute.  So, it looks like even though racluster is set to connect to radium for 300 seconds, it's writing out the results after < 120 seconds.  I also tried just running the racluster part of the above command on the command-line and it is also writing the results out before the full five minutes has elapsed.
>>> 
>>> Is there a smarter way to accomplish my goal?  If not, how can I figure out why racluster isn't connecting for the full length of time specified in the -T flag?
>>> 
>>> Thanks.
>>> 
>>> Craig
>> 
>> 
>