racluster

Fri Feb 8 02:22:20 EST 2008

Hey Torbjörn,
The program of choice for this scenario is going to be rastream().
It should be ready now, so if you would like to give it a try, that
would be excellent!!!!!

rasplit() is a simple, fast program.  It takes in a record, generates
a file pathname, opens the file if needed, writes out the record, and
closes the file if needed.  So, as time goes on, rasplit() leaves a trail
of files that are quasi-sorted and blocked in time.  But in the data
flow of your application, it lacks in just one area: it doesn't do the
processing you would like done on each file when it is "done".
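
To make the "record in, pathname out, append" loop concrete, here is a
toy sketch in shell.  It is not how rasplit() is implemented: it assumes
text lines of the form "epoch-seconds payload" instead of binary argus
records, and the function name block_stream and the "block.N" file names
are made up for illustration.

```shell
#!/bin/bash
# Toy illustration only: split a stream of "epoch payload" lines into
# one file per time bin.  block_stream and block.<start> are hypothetical.
block_stream() {
    local outdir=$1 width=$2 ts rest bin
    mkdir -p "$outdir"
    while read -r ts rest; do
        [ -n "$ts" ] || continue            # skip blank lines
        bin=$(( ts - ts % width ))          # start of the bin holding ts
        # Append the record to that bin's file; >> opens and closes the
        # file for each record, which is the simplest possible strategy.
        printf '%s %s\n' "$ts" "$rest" >> "$outdir/block.$bin"
    done
}
```

Feeding it three lines with times 10, 299 and 300 and a width of 300
would leave two files behind, one for the bin starting at 0 and one for
the bin starting at 300.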

rastream() is rasplit(), but with the added functionality to know
when a file is "done", and to launch a script against the file at the
moment the file is finished.  It also sorts the traffic, based on stime,
which is very nice when you get data from multiple sources (whose
times need to agree to some level).  You basically provide it with a
timer that specifies how long to wait after a time boundary before the
file will be completed.  Say you are generating 5 minute files.  Well,
with argus, flow activity data that crosses a given 5 minute boundary
will be out of the probes and into your distribution tree within the
ARGUS_FAR_STATUS_INTERVAL, which is by default 5 seconds.
So, depending on your distribution system delay, your file will be done
within 6-10 seconds after any given 5 minute boundary.
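
As a rough worked example of that timing, the arithmetic can be sketched
in shell.  The helper names bin_start and file_done_at are hypothetical,
and the sketch assumes bins are aligned to multiples of the bin width in
epoch seconds, which may not match how the real programs pick their
boundaries.

```shell
#!/bin/bash
# Hypothetical helpers, for illustration only.

# Start of the time bin containing epoch second $1, for bin width $2.
bin_start() {
    echo $(( $1 - $1 % $2 ))
}

# Earliest moment the file for that bin would be considered finished:
# the end of the bin plus a hold time of $3 seconds.
file_done_at() {
    echo $(( $1 - $1 % $2 + $2 + $3 ))
}

bin_start 1000 300          # prints 900
file_done_at 1000 300 10    # prints 1210
```

So a record seen at second 1000 lands in the bin that starts at 900, and
with a 10 second hold time that file would be finished at second 1210.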

If you ran rasplit() like this:
    rasplit -S radium -M time 5m -w /path/\$srcid/%Y/%m/%d/argus.%Y.%m.%d.%H.%M.%S
or
    rasplit rasplitOptions

you would run rastream() like this:
    rastream  -B 10s -f rastream.sh  rasplitOptions

(There is a sample rastream.sh in ./support/Config that just
compresses the file.)  This assumes that all your probes are time sync'd.

The "-B time" is how long rastream() will wait after a time boundary has
passed before it will close the file and then run the script.
Rastream() calls the script with the pathname to the file like this:

    script -r /full/path/to/the/data/file

and then takes note of the return code of the script to see if it
completed.  It will run only one script at a time, so if you've got a
rasplit() strategy that is writing out 100s of files at a time (say
you're getting data from 100 remote argi), you will get 100 files to
close and process at the same time, and that could be a bit much.
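
For a feel of what such a script could look like, here is a minimal
sketch.  It is not the sample shipped in ./support/Config; it only
assumes the calling convention described above (a single "-r pathname"
argument) and that gzip is installed, and it returns a non-zero status
when something goes wrong so the caller can tell the file was not
processed.

```shell
#!/bin/bash
# Hypothetical post-processing script, written for illustration.
# Assumption: it is invoked as "script -r /full/path/to/the/data/file".

main() {
    local file="" opt
    local OPTIND=1
    while getopts "r:" opt; do
        case "$opt" in
            r) file=$OPTARG ;;
            *) echo "usage: $0 -r file" >&2; return 2 ;;
        esac
    done

    if [ -z "$file" ] || [ ! -f "$file" ]; then
        echo "rastream.sh: no such file: '$file'" >&2
        return 1
    fi

    # Compress in place; gzip's exit status becomes our return code.
    gzip -f -- "$file"
}

# Run only when arguments were supplied, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    main "$@"
    exit $?
fi
```

Since only one script runs at a time, anything slow in here (a long
racluster run, say) will hold up the files queued behind it, so it is
worth keeping the per-file work short.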

This program is referred to as a "stream block processor".  It takes
in an infinite stream, and it blocks the data.  The block provides
some scoping, and bounds the processing.  This was a very big topic in
cryptography and database research 10 years ago, and it still is today.

Give it a try, and if you have any problems, give me a holler!!!

Carter

On Feb 7, 2008, at 8:02 PM, Torbjorn.Wictorin at its.uu.se wrote:

>> What are you trying to do?
>
> argus--+
> argus--+
> argus--+--- radium --- racluster
> argus--+
> and restarting racluster/moving away logs periodically.
>
> So, you suggest
>
> argus--+
> argus--+
> argus--+--- radium --- rasplit -> files -> racluster -> clustered files
> argus--+
>
> Which was my first approach, but because of the previous problem
> with radium I have tried a lot of variants.
>
> Now back to the first method.
> ...
> Seems to work now... Thanks, Carter, for putting me on the right
> track again...
>
> One small bug: rasplit does not die when killed 1 or 15.
>
> Torbjörn
>
>