racluster issue

Wed Mar 6 18:54:49 EST 2013

OK, sorry about the rant.  Not your fault.  Maybe a bad day.
So........  let's break it down a bit.

Converting 5 sec status data to 5 minute flow data won't buy you that much,
except the occasional direction correction, because most 5-tuple transactions
are short lived.  Argus will already have consolidated most of the data coming out of the sensor.

So the question is, does default aggregation actually buy you anything, regarding
data size and semantics?  This is easy to calculate.  Start with a file of your own primitive
data, aggregate it, and look to see how much the record count is reduced.  If the new
file size is < 60% of the primitive data, then you win; if not, it's maybe too expensive for
what you get.
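
A minimal sketch of that size check. The helper names and file names are made up for illustration, and it assumes the aggregated file was produced with something like "racluster -r primitive.argus -w aggregated.argus". The percentage is integer math, rounded down.

```shell
#!/bin/sh
# Sketch only: helper names and file paths are illustrative assumptions.

# Print the aggregated size as an integer percentage of the primitive size.
size_ratio_pct() {
    primitive_bytes=$1
    aggregated_bytes=$2
    # Refuse to divide by zero on an empty primitive file.
    [ "$primitive_bytes" -gt 0 ] || return 1
    echo $(( aggregated_bytes * 100 / primitive_bytes ))
}

# Succeed when the aggregated data is under 60% of the primitive data.
aggregation_pays_off() {
    pct=$(size_ratio_pct "$1" "$2") || return 1
    [ "$pct" -lt 60 ]
}

# Print the size of a file in bytes.
file_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Example usage (requires the argus clients, so left commented out):
# racluster -r primitive.argus -w aggregated.argus
# if aggregation_pays_off "$(file_bytes primitive.argus)" "$(file_bytes aggregated.argus)"; then
#     echo "aggregation wins"
# else
#     echo "maybe too expensive for what you get"
# fi
```

Run it against a file that is representative of your peak traffic, since the reduction depends on how long-lived your flows are.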

You want to load primitive flow data into splunk.  So what is splunk going to do
with this data?  What are you looking for?  The answer should help you to determine
if aggregation is useful, and at what granularity level.  Usually, you change the
aggregation keys, not default aggregation, to get rid of the source port, or you blow
in matrix data, just the IP address pairs, and let load thresholds and flag indicators
key you to some anomalous behavior that helps you to go back to the archive, with
time filters to limit the search.
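
A rough sketch of those two key changes, assuming racluster's "-m" option is used to set the aggregation fields. The function names, the file names, and the RACLUSTER override variable are illustrative only, not part of argus.

```shell
#!/bin/sh
# Sketch only: file names are placeholders, and RACLUSTER is an
# illustrative override so the command can be swapped out for a dry run.
RACLUSTER="${RACLUSTER:-racluster}"

# Aggregate on everything in the 5-tuple except the source port.
cluster_without_sport() {
    "$RACLUSTER" -r "$1" -m saddr daddr proto dport -w "$2"
}

# Matrix data: aggregate on just the IP address pairs.
cluster_matrix() {
    "$RACLUSTER" -r "$1" -m saddr daddr -w "$2"
}

# Example usage (requires the argus clients, so left commented out):
# cluster_without_sport primitive.argus nosport.argus
# cluster_matrix primitive.argus matrix.argus
```

Comparing the output sizes of the two against the primitive file gives a feel for how much each key set reduces the load.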

If you're having resource problems with racluster() reading a file, your racluster -T will
have the same problems, possibly worse.  Try to think through what you're trying to
accomplish / find, and find an aggregation strategy that will reduce the data load.

I still don't think the " -T secs " option is what you're looking for.

Carter

On Mar 6, 2013, at 3:39 PM, Carter Bullard <carter at qosient.com> wrote:

> Hmmmm, you must have changed every aspect of my example,
> and now you're saying that it doesn't work.   Not surprising.
> 
> So where did 1 minute come from ?
> What is that output file name all about ?
> Why isn't the script completing / running ?  Did you test it at all ?
> 
> Do you know why you're clustering 5 minutes of data?  Is there a purpose?
> 
> Carter
> 
> On Mar 6, 2013, at 3:03 PM, Craig Merchant <cmerchant at responsys.com> wrote:
> 
>> Hey, Carter...
>> 
>> So, I tried the first approach, but the script never executes.  I should also say my scripting skills are minimal, so I apologize for any noob-related errors in advance.
>> 
>> I invoked the rastream command as follows:
>> 
>> rastream -S 10.10.10.10:561 -M time 1m -B 10s -w /ssd/argus/%s.argus -f /usr/local/argus/rastream.sh -d
>> 
>> /usr/local/rastream.sh is pretty simple:
>> 
>> /usr/local/bin/racluster -r $1 -c "@" -p 3 -u -Z b -s "+0ltime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+spkts,+dpkts,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > /root/test.csv
>> 
>> After testing racluster against the files produced above, I'm not sure this process will work for us.  The binary argus records produced by rastream are averaging about 250-300 MB per minute and during that testing window our data centers were only generating about 4-5 Gbps of traffic (peak volume could be in the 12-15 Gbps range).  
>> 
>> Both the binary argus files and the CSV output are being read/written to an array of SSDs, so the I/O is pretty fast.  Running racluster against a 1m binary file takes 50-75 seconds.  So, at that rate, I can't write the binary with rastream and aggregate the data with racluster with enough time remaining to import it into Splunk and do my analytics.
>> 
>> The problem I'm having with rabins is that if I try and generate ASCII output with something like "rabins -S data.source -M time 5m -B 10s > output.csv", each bin is appended to output.csv instead of overwriting it.  If I can get the scripting to work, I can probably have the script remove the old file and then create a symbolic link to the new file and have Splunk use the symbolic link to import the flows (for my use case, Splunk would require reconfiguration each time the CSV file name changes).
>> 
>> So...  unless you've got a better idea, having radium handle my labeling and have racluster connect to radium for -T seconds seems like the easiest way to get a fresh CSV file of aggregated flows every X minutes.
>> 
>> As an aside, I noticed that rastream will convert %s.argus to the unix epoch value, but rabins just writes the file as "%s.argus"...
>> 
>> Thanks!
>> 
>> Craig
>> 
>> -----Original Message-----
>> From: Carter Bullard [mailto:carter at qosient.com] 
>> Sent: Tuesday, March 05, 2013 8:03 AM
>> To: Craig Merchant
>> Cc: Argus (argus-info at lists.andrew.cmu.edu)
>> Subject: Re: [ARGUS] racluster issue
>> 
>> Hey Craig,
>> You are starting to realize the same issues that caused us to create rasplit() and rastream().
>> Flow records span whatever ARGUS_FLOW_STATUS_INTERVAL period there is, so without some
>> record processing, the output from your methods will have irregular start and stop times.
>> 
>> Now, the assumption is that you are processing argus records where argus has a good
>> configuration, meaning that the ARGUS_FLOW_STATUS_INTERVAL is reasonable,
>> like 1-15 seconds.  With this, you should use either rabins() or rastream().
>> 
>> I think you should relax your requirement that rejects an intermediate argus data file.
>> If you can do that, use rastream(), to output records into a file with the date in its name,
>> and after a brief wait time after your time boundary passes, have rastream() run a shell
>> script containing your commands, against that data file.  You can delete the file when
>> you're done, so that you aren't piling up a lot of data.
>> 
>> You can also use radium to label your traffic so that you don't need to do it yourself in the
>> scripts.  But let's stay with your example:
>> 
>> OK assume an ARGUS_FLOW_STATUS_INTERVAL = 5 secs
>> 
>>  rastream -M time 5m -B 10s -S data.source -w /tmp/argus.data/argus.%Y.%m.%d.%H.%M.%S -f rastream.sh -d
>> 
>> This will get the data into a file structure that will be useful, and 10 seconds after each 5 min time boundary,
>> rastream will run the rastream.sh shell script, passing the file as the single parameter.  Use the
>> ./support/Config/rastream.sh as a guide, and in the script have something like:
>> 
>>  racluster -r $1 -w - | ralabel -f ralabel.conf -F ralabel.script.conf > /ssd/argus/splunk/racluster.csv
>> 
>> where ralabel.script.conf has all your particulars in it, like comma separated, and the fields.
>> Not sure what your " -M dsrs="+metric,+agr..." is doing, I would remove that.
>> 
>> This will give you a new /ssd/argus/splunk/racluster.csv 10 seconds after each 5 minute period.
>> Check the last write time to see that it's changed, and then feed it into whatever.
>> 
>> rabins() is being used by most sites to generate periodic ASCII output of aggregated data.
>> Gloriad does this for their spinning globe. 
>> 
>>  See http://www.gloriad.org/gloriaddrupal/
>> 
>> so in your example, you would have radium() do the labeling, so that you don't have to pipe
>> anything in your terminal analytic.  This should work
>> 
>>   rabins -S data.source -M time 5m -B 10s -F ralabel.script.conf
>> 
>> rabins() will sit there, and then 10 seconds after each 5 minute period, like 05:00:10, it will write out
>> all its clustered data, starting with a START MAR and ending with a STOP MAR, which can be used
>> to recognize the beginning and the end of the time period.  So no intermediate files
>> of any kind.  I don't like this, necessarily, as you hold a lot of data in memory before writing out the
>> time period results, creating a bit of a pipeline issue.
>> 
>> So what do you think, which one will you use ?
>> 
>> Carter
>> 
>> On Mar 4, 2013, at 11:05 PM, Craig Merchant <cmerchant at responsys.com> wrote:
>> 
>>> Carter,
>>> 
>>> Here's what I'm trying to do and I may not be going about it the smartest way...  I would like racluster, rabins, or rastream to output a csv file containing five minutes of flow data, aggregated using proto, saddr, daddr, sport, and dport.  That CSV file will be imported into Splunk for analysis every five minutes.  I would prefer for the CSV file to be overwritten each time the argus client outputs five minutes of aggregated flows.  I would also prefer to avoid writing to an argus binary file as an intermediary step.
>>> 
>>> The way I've been doing it is to set up an entry in the crontab file that looks like:
>>> 
>>> 00,05,10,15,20,25,30,35,40,45,55 * * * * /usr/local/bin/racluster -S 10.10.10.10:561 -T 300 -p 3 -u -Z b -w - | /usr/local/bin/ralabel -r - -f /usr/local/argus/ralabel.conf -c "," -M dsrs=+metric,+agr,+psize,+cocode -n -p 3 -u -Z b -s "+0ltime,+1stime,+trans,+dur,+runtime,+mean,+stddev,+sum,+sco,+dco,+pkts,+spkts,+dpkts,+bytes,+sbytes,+dbytes,+load,+sload,+dload,+loss,+sloss,+dloss,+ploss,+sploss,+dploss,+rate,+srate,+drate,+appbytes,+sappbytes,+dappbytes,+label:200" > /ssd/argus/splunk/racluster.csv
>>> 
>>> The problem is that when I'm checking the timestamp on the racluster.csv file, it's always on the 01,06,11,... minute.  So, it looks like even though racluster is set to connect to radium for 300 seconds, it's writing out the results after < 120 seconds.  I also tried just running the racluster part of the above command on the command-line and it is also writing the results out before the full five minutes has elapsed.
>>> 
>>> Is there a smarter way to accomplish my goal?  If not, how can I figure out why racluster isn't connecting for the full length of time specified in the -T flag?
>>> 
>>> Thanks.
>>> 
>>> Craig
>> 
>> 
> 
