flocon 2010 presentations on the web

Peter Van Epp vanepp at sfu.ca
Sat Feb 6 22:24:01 EST 2010


On Sat, Feb 06, 2010 at 09:06:37AM -0500, Carter Bullard wrote:
> Hey Peter,
> I'll add it to the argus-3.0.3 tree.  Do we need any documentation?
> Carter
> 

	We always need more documentation :-). So I added some, moved the
setting of the PATH to an earlier part of the script, and changed the default
post processing options from "yes" to "no" so that by default the script
acts mostly as the original argusarchive did and doesn't try to call perl
scripts that aren't present :-). I've attached both a new copy of the script
and a README for it.

Peter Van Epp
-------------- next part --------------
#!/bin/sh
#  Argus Software
#  Copyright (c) 2000-2009 QoSient, LLC
#  All rights reserved.
# 
#  QoSIENT, LLC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS
#  SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
#  FITNESS, IN NO EVENT SHALL QoSIENT, LLC BE LIABLE FOR ANY
#  SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER
#  RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF
#  CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
#  CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
# 
#*/

#  14 Jan 2010 - Peter Van Epp (vanepp at sfu.ca):
#  	Modified for perl traffic scripts and to make it work again

# In case we are running from cron set an appropriate PATH 

PATH=/bin:/usr/bin:/usr/local/bin

# If there is an argument on the command line set it as the argus prefix which
# will modify the names of the various data files (for the case where a machine
# is collecting for more than one argus instance). If there is no argument, 
# then the prefix is set to argus so as to be compatible with the original
# version of argusarchive.

if [ "$1"x = x ]; then
  INSTANCE=argus
else
  INSTANCE="$1_argus"
fi

# User settable options:

#
# Try to use $ARGUSDATA and $ARGUSARCHIVE where possible.
# If these are available, the only thing that we need to
# know is what is the name of the argus output file.
#
# If $ARGUSDATA is already set we don't need to define it below. For
# cron scripts, however, $ARGUSDATA may not be defined, so set it
# here if it isn't already.

# where to find the data argus is writing

if [ "$ARGUSDATA"x = x ]; then
  ARGUSDATA=/var/log/argus			# not set by user, so set it
fi

if [ "$ARGUSARCHIVE"x = x ]; then
  ARGUSARCHIVEBASE=/usr/local/argus		# not set by user so set it
else 
  ARGUSARCHIVEBASE=$ARGUSARCHIVE		# else use the user's value
fi

DATAFILE=${INSTANCE}.out   # argus must be writing data to $ARGUSDATA/$DATAFILE

# set the program paths for your OS (this is FreeBSD)

ARGUSBIN=/usr/local/bin	# location of argus programs 
AWK=/usr/bin/awk
MV=/bin/mv
MKDIR=/bin/mkdir
CHOWN=/usr/sbin/chown
SU=/usr/bin/su
TOUCH=/usr/bin/touch

# Data file compression

COMPRESS=yes		# compress the archived data files yes or no

# pick one of the below

COMPRESSOR=/usr/bin/gzip  # using this compression program
COMPRESSFILEEX=gz

#COMPRESSOR=/usr/bin/bzip2  # using this compression program
#COMPRESSFILEEX=bz2

#COMPRESSOR=/usr/bin/compress  # using this compression program
#COMPRESSFILEEX=Z

# options for perl traffic processing scripts

ARGUSREPORTS=$ARGUSDATA	# post processing directory
POSTPROCESS=no		# run the traffic scripts 
SPOOL=$ARGUSREPORTS/spool	# spool directory name		
POSTPROG=/usr/local/bin/argus3_post_drv.pl
POSTLOG=/var/log/argus.logs/argus3_post_drv.log
ACCOUNT=argus		# account used to run the post scripts (needs only
			# read access to the archive files)

# optionally anonymize the argus data before post processing it

ANONPOSTPROCESS=no	# run the traffic scripts on the anon data as well
ANONYMIZE=$ARGUSBIN/ranonymize		# $ARGUSBIN/ranonymize or no 
ANONCONF=$ARGUSDATA/ranonymize.conf  	# using this config file if anonymizing
ANONDATADIR=$ARGUSREPORTS/anondata	# anonymized data storage directory name

# end of options

# Set ARGUSARCHIVE according to the settings above

ARGUSARCHIVE=$ARGUSARCHIVEBASE/${INSTANCE}.archive


if [ -d $ARGUSDATA ] ; then
   cd $ARGUSDATA
   echo "cd $ARGUSDATA"
   if [ $ARGUSDATA != `pwd` ]; then 
     echo "couldn't change to directory $ARGUSDATA, got `pwd` instead"
     exit
   fi
else
   echo "argus data directory $ARGUSDATA not found"
   exit
fi

# In order to have the archive be date consistent (i.e. the first file of the
# day starts at or close to midnight instead of 23:00 of the day before as
# was originally the case), take the archive file name from a file called
# $ARGUSDATA/${INSTANCE}.start.date (which is supposed to be created by the
# startup scripts at boot, and every time this script is run). To provide for
# the case where the file doesn't exist when this script runs, set the file
# to the current time (with a .0 appended to the end) and the next cycles file
# name to the current time with a .1 appended. This makes sure that the two
# close to identically named files sort in the correct date order for processing
# even after the compression suffix is tagged on the end. All the files need
# to have the .0 appended to them so they remain the same length and thus 
# sort correctly. 

if [ ! -f $ARGUSDATA/${INSTANCE}.start.date ]; then 

  # File doesn't exist so create a current archive file with the current time
  # and a .0 suffix, and the new archive file (for next cycle) with the current
  # time and a .1 suffix. The purpose of the suffixes is to maintain file 
  # time order on a sort after the compression suffix is appended to the 
  # file name. Without the suffixes at the next cycle the script would 
  # overwrite the data we archived this time (bad!) because the file names
  # would be identical.

  echo "$ARGUSDATA/${INSTANCE}.start.date doesn't exist creating files"
  ARCHIVE=${INSTANCE}.`date '+%Y.%m.%d.%H.%M.%S'`.0
  NEWARCHIVE=${INSTANCE}.`date '+%Y.%m.%d.%H.%M.%S'`.1

else

  # The file exists, so check the contents are of the form 
  # $INSTANCE.yyyy.mm.dd.hh.mm.ss.0|1 as it should be. If not, set both file names
  # as above to create a correct pair of file names and log the invalid 
  # contents of the file.

  ARCHIVE=`cat $ARGUSDATA/${INSTANCE}.start.date`

  # a \$ inside double quotes passes a literal $ (the end-of-line anchor)
  # through to egrep

  RESULT=`egrep -c "^$INSTANCE\.[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]\.[0-9]\$" $ARGUSDATA/${INSTANCE}.start.date`

  if [ "$RESULT" = "1" ]; then

    # the file appears valid so use the contents as the current archive name
    # and create the next one from the current time with .0 appended. This 
    # should be the normal case when all is well.

    NEWARCHIVE=${INSTANCE}.`date '+%Y.%m.%d.%H.%M.%S'`.0

  else

    # The format of the saved file looks invalid (perhaps because someone 
    # external messed with it), so recreate a proper current and new archive
    # file. Log the corrupted version.
 
    echo "$ARCHIVE is invalid, recreated"
    ARCHIVE=${INSTANCE}.`date '+%Y.%m.%d.%H.%M.%S'`.0
    NEWARCHIVE=${INSTANCE}.`date '+%Y.%m.%d.%H.%M.%S'`.1

  fi
fi

TIMESTAMP=`date '+%Y.%m.%d.%H.%M.%S'`

# Write the next cycle's archive file name out for use on the next run.

echo $NEWARCHIVE > $ARGUSDATA/${INSTANCE}.start.date

echo "$TIMESTAMP ${INSTANCE}_argusarchive started"

YEAR=`echo $ARCHIVE | $AWK 'BEGIN {FS="."}{print $2}'`
MONTH=`echo $ARCHIVE | $AWK 'BEGIN {FS="."}{print $3}'`
DAY=`echo $ARCHIVE | $AWK 'BEGIN {FS="."}{print $4}'`


if [ ! -d $ARGUSARCHIVE ] ; then
   $MKDIR $ARGUSARCHIVE
   if [ ! -d $ARGUSARCHIVE ] ; then
      echo "could not create archive directory $ARGUSARCHIVE"
      exit
   else
      echo "archive directory $ARGUSARCHIVE created"
   fi
else
   echo "archive directory $ARGUSARCHIVE found"
fi

ARGUSARCHIVE=$ARGUSARCHIVE/$YEAR

if [ ! -d $ARGUSARCHIVE ]; then
   $MKDIR $ARGUSARCHIVE
   if [ ! -d $ARGUSARCHIVE ]; then
      echo "could not create archive directory structure."
      exit
   fi
fi

ARGUSARCHIVE=$ARGUSARCHIVE/$MONTH

if [ ! -d $ARGUSARCHIVE ]; then
   $MKDIR $ARGUSARCHIVE
   if [ ! -d $ARGUSARCHIVE ]; then
      echo "could not create archive directory structure."
      exit
   fi
fi

ARGUSARCHIVE=$ARGUSARCHIVE/$DAY

if [ ! -d $ARGUSARCHIVE ]; then
  $MKDIR $ARGUSARCHIVE
   if [ ! -d $ARGUSARCHIVE ]; then
      echo "could not create archive directory structure."
      exit
   fi
fi

# Presumably this is for mysql, but I don't know how to create it so 
# it is currently commented out 

# if [ ! -d $ARGUSARCHIVE/$INDEX ]; then
#   $MKDIR $ARGUSARCHIVE/$INDEX
#   if [ ! -d $ARGUSARCHIVE/$INDEX ]; then
#      echo "could not create archive index directory."
#      exit
#   fi
# fi

if [ -f $ARGUSDATA/$DATAFILE ] ; then
   if [ -f $ARGUSARCHIVE/$ARCHIVE ] ; then
      echo "argus archive file $ARGUSARCHIVE/$ARCHIVE exists, leaving data"
      exit
   else
      $MV $ARGUSDATA/$DATAFILE $ARGUSARCHIVE/$ARCHIVE 2>/dev/null
   fi
else
   echo "argus data file $ARGUSDATA/$DATAFILE not found"
   exit
fi

TIMESTAMP=`date '+%Y.%m.%d.%H.%M.%S'`

if [ -f $ARGUSARCHIVE/$ARCHIVE ]; then
   echo "$TIMESTAMP argus data file $ARGUSARCHIVE/$ARCHIVE moved successfully"
else 
   echo "argus data file $ARGUSDATA/$DATAFILE move failed"
   exit
fi

# Now compress and/or post process the data file if that has been requested

# save a copy of the archive filename (which will change and be updated if
# compression is requested) for later processing

ARCHIVEFILE=$ARCHIVE
ARCHIVEPATHFILE=$ARGUSARCHIVE/$ARCHIVE

# compression first if requested

if [ $COMPRESS = yes ]; then
   if [ "$COMPRESSOR"x = x ]; then
     echo "Compression requested but COMPRESSOR not set"
     exit
   fi

   if [ -f $ARGUSARCHIVE/$ARCHIVE.$COMPRESSFILEEX ]; then
     echo "Compressed file $ARGUSARCHIVE/$ARCHIVE.$COMPRESSFILEEX already exists, leaving data file"
     exit
   fi

   $COMPRESSOR $ARGUSARCHIVE/$ARCHIVE

   TIMESTAMP=`date '+%Y.%m.%d.%H.%M.%S'`

   if [ -f $ARGUSARCHIVE/$ARCHIVE ]; then
     echo "$TIMESTAMP Original data file $ARGUSARCHIVE/$ARCHIVE still exists, compression failed?"
     exit
   fi

   if [ -f $ARGUSARCHIVE/$ARCHIVE.$COMPRESSFILEEX ]; then
     echo "$TIMESTAMP $ARGUSARCHIVE/$ARCHIVE.$COMPRESSFILEEX compression completed"

     # so update the data file name for further processing if requested

     ARCHIVE=$ARCHIVE.$COMPRESSFILEEX
     ARCHIVEPATHFILE=$ARGUSARCHIVE/$ARCHIVE

   else
     echo "$TIMESTAMP no compressed file $ARGUSARCHIVE/$ARCHIVE.$COMPRESSFILEEX, compression failed?"
     exit
   fi
fi

# if we got this far things seem to have worked correctly so do the 
# anonymizing and post processing if requested

if [ $POSTPROCESS = yes ] || [ $ANONPOSTPROCESS = yes ]; then

 # check the reports directories creating as needed and requested

 if [ ! -d $ARGUSREPORTS ] ; then

   $MKDIR $ARGUSREPORTS

   if [ ! -d $ARGUSREPORTS ] ; then

     echo "could not create reports directory $ARGUSREPORTS"
     exit
   else
     echo "report directory $ARGUSREPORTS created"
   fi
 fi

 # check and create the spool directory if needed

 if [ ! -d $SPOOL ] ; then

   $MKDIR $SPOOL

   if [ ! -d $SPOOL ] ; then

     echo "could not create spool directory $SPOOL"
     exit
   else
     echo "spool directory $SPOOL created"
   fi
 fi
fi

if [ $POSTPROCESS = yes ]; then

  # If postprocessing of the unanonymized files has been requested create
  # the appropriate workfile in the spool directory. 

  echo "$ARCHIVEPATHFILE" > $ARGUSREPORTS/spool/w$ARCHIVE
fi

# If anonymization has been requested, anonymize and (if requested) compress
# the data file 

if [ $ANONYMIZE = $ARGUSBIN/ranonymize ]; then

  # Check and create the data directory as needed

  if [ ! -d $ANONDATADIR ] ; then

    $MKDIR $ANONDATADIR

    if [ ! -d $ANONDATADIR ] ; then

      echo "could not create anonymized data directory $ANONDATADIR"
      exit
    else
      echo "anonymized data directory $ANONDATADIR created"
    fi
  fi

  # anonymize the data file as requested and save it (compressed if requested)
  # into the anondata directory with an "anon" prefix (to differentiate it
  # from the unanonymized data file)

  ARCHIVEFILE=anon$ARCHIVEFILE

  $ANONYMIZE -f $ANONCONF -r $ARGUSARCHIVE/$ARCHIVE -w $ANONDATADIR/$ARCHIVEFILE

  # update the full path to the now anonymized data file to pass to post 
  # processing if requested

  ARCHIVEPATHFILE=$ANONDATADIR/$ARCHIVEFILE

  if [ $COMPRESS = yes ]; then

    if [ -f $ANONDATADIR/$ARCHIVEFILE.$COMPRESSFILEEX ]; then
      echo "Compressed file  $ANONDATADIR/$ARCHIVEFILE.$COMPRESSFILEEX already exists, leaving data file"
      exit
    fi

    TIMESTAMP=`date '+%Y.%m.%d.%H.%M.%S'`
    echo "$TIMESTAMP starting compression of $ANONDATADIR/$ARCHIVEFILE"

    $COMPRESSOR $ANONDATADIR/$ARCHIVEFILE

    TIMESTAMP=`date '+%Y.%m.%d.%H.%M.%S'`

    if [ -f $ANONDATADIR/$ARCHIVEFILE ]; then
      echo "$TIMESTAMP Original data file $ANONDATADIR/$ARCHIVEFILE still exists, compression failed?"
      exit
    fi

    if [ -f $ANONDATADIR/$ARCHIVEFILE.$COMPRESSFILEEX ]; then
      echo "$TIMESTAMP compression of $ANONDATADIR/$ARCHIVEFILE.$COMPRESSFILEEX completed"
      ARCHIVEFILE=$ARCHIVEFILE.$COMPRESSFILEEX
    else 
      echo "$TIMESTAMP compression of $ANONDATADIR/$ARCHIVEFILE failed"
      exit
    fi

    # update the full path to the now anonymized data file to pass to post 
    # processing if requested

    ARCHIVEPATHFILE=$ANONDATADIR/$ARCHIVEFILE
  fi  # end of anon compression

 if [ $ANONPOSTPROCESS = yes ]; then

   # Write the workfile into the spool directory so that this file is
   # post processed when the post processing script is run later.

   echo "$ARCHIVEPATHFILE" > $ARGUSREPORTS/spool/w$ARCHIVEFILE
 fi
fi

# At this point the appropriate work files have been written to the spool
# directory so change the ownership of the files to the post processing
# user and launch post processing if requested

if [ $POSTPROCESS = yes ] || [ $ANONPOSTPROCESS = yes ]; then

 # Check for and try to create an appropriate log file

 if [ ! -f $POSTLOG ]; then

   $TOUCH $POSTLOG
   if [ ! -f $POSTLOG ]; then
     echo "Log file $POSTLOG can't be created"
     exit
   fi
 fi

 # Correct the ownership of the directories we have been writing as root to
 # the post processing user

 $CHOWN $ACCOUNT $POSTLOG
 $CHOWN -R $ACCOUNT $ARGUSREPORTS/spool
 $CHOWN -R $ACCOUNT $ANONDATADIR

 # then run the post processing command

 TIMESTAMP=`date '+%Y.%m.%d.%H.%M.%S'`

 echo "$TIMESTAMP Post processing started"

 $SU $ACCOUNT -c "$POSTPROG >> $POSTLOG"

 TIMESTAMP=`date '+%Y.%m.%d.%H.%M.%S'`

 echo "$TIMESTAMP Post processing completed"

fi

TIMESTAMP=`date '+%Y.%m.%d.%H.%M.%S'`
echo "$TIMESTAMP argusarchive completed successfully"

-------------- next part --------------
Feb 2010

	Some minimal documentation for the argusarchive script. First off
(ignoring the new post processing options for now), the purpose of this
script is to be run from cron on the argus archive machine (which may be
the same as the sensor machine if the load is low enough) and create an
archive of argus files. I usually use a 1 hour cycle; depending on your
link speed you may need a much shorter one. This is the cron entry I've
been using for many years now (although parts of this script are new, I've
been using one like it for 10 years or more):

0 * * * * 	/usr/local/bin/argusarchive >> /var/log/argus.logs/argusarchive.log 2>&1

which creates an archive like this:

ls /usr/local/argus

argus.archive

ls /usr/local/argus/argus.archive

2009 2010

ls /usr/local/argus/argus.archive/2010

01 02 

ls /usr/local/argus/argus.archive/2010/02

01 02 03 04 05 06

ls /usr/local/argus/argus.archive/2010/02/04

argus.2010.02.04.00.00.00.0.gz
argus.2010.02.04.01.00.00.0.gz
...
argus.2010.02.04.23.00.00.0.gz

The script creates new directories as needed and, via a file (by default
in /var/log/argus), remembers the start time of the current output file
(by the time the file is processed, only the end time is available).
This is done so that the first file of the day starts at midnight rather
than at 23:00 of the day before, which is what you get if the file is
named from the time the script runs. The start-time file is also the
reason there is a ".0" in the file name before the ".gz". On a reboot or
restart, the startup script that starts argus should write a new
start-time file, both to log that the restart happened and to mark the
correct hour for the new data file (in case the outage crosses an hour
boundary). The startup script also needs to call argusarchive to move
the current output file (using the current value in the start.date file)
to the archive, and then write the current time into the start.date file
as the start time of the current record. If the crash / restart occurs
just as the current file changes, the two file names could be identical.
To avoid that, a ".0" is normally appended to the archive file name; if
that file already exists when the restart happens, the new file gets a
".1" extension instead, so that the files still sort in the correct time
order in the archive. This also covers an outage that crosses an hour
boundary: the first file has the correct start time (but possibly less
than an hour's data), and the new file has the correct (possibly several
hours later) start time, again with possibly less than an hour's data.
In normal operation all files end in ".0.gz". This is all that the
original argusarchive script did (although, as noted, it used the end
time and thus ran from 23:00 on day-1 until 23:00 on the day). By
default this script matches the original (other than the times).
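The sorting behaviour the .0/.1 suffixes provide can be illustrated with a
quick shell sketch (the timestamp here is made up):

```shell
#!/bin/sh
# Illustration of the .0/.1 suffix scheme (timestamp is made up).
# Two archive files created with the same timestamp differ only in the
# suffix, so they still sort into the correct time order even after the
# compression extension is added.
STAMP=2010.02.04.13.00.00
printf '%s\n%s\n' "argus.$STAMP.1.gz" "argus.$STAMP.0.gz" | sort
```

which prints the ".0" file before the ".1" file, i.e. in the order the
data was captured.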
	I added some features to suit how I was running argus. You may or
may not want to use some of them. First, this script dates from the days
of argus 2.0.6 (or possibly earlier :-), around 2002 or so.

	There wasn't an argus ID option in 2.0.6 and I was running multiple
argus instances on a single machine. Thus if you add an instance name to the
command run from cron like this:

0 * * * * 	/usr/local/bin/argusarchive com >> /var/log/argus.logs/argusarchive.log 2>&1

then the instance supplied (com for commodity Internet as opposed to CA*net
in this case) is prepended to all the archive names so the data is separated
by instance name (the "com" is replaced by "c4" for the CA*net link in my
case):

ls /usr/local/argus

argus.archive

would become 

ls /usr/local/argus

com_argus.archive

and

ls /usr/local/argus/argus.archive/2010/02/04

argus.2010.02.04.00.00.00.0.gz

would become 

ls /usr/local/argus/com_argus.archive/2010/02/04

com_argus.2010.02.04.00.00.00.0.gz

	Then, because processing an entire day's worth of argus records at
one time was taking too much time and memory, I added code to launch the
post processing (perl scripts in my case) against the file just archived
each time argusarchive runs. There is also the option of running the post
processing on this machine (probably not advisable if it is also your
capture machine) or of transferring the file to a remote machine via ssh
and processing it there. You can modify this to your taste.
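The remote variant might look something like the sketch below; the host
name, directory, and the ".done" marker convention are placeholders of
mine, not part of the script:

```shell
#!/bin/sh
# Hypothetical sketch of shipping the just-archived file to a remote
# analysis machine instead of processing it locally. REMOTE and
# REMOTEDIR are made-up placeholders, not part of the original script.
REMOTE=analysis.example.com
REMOTEDIR=/data/argus/incoming
ARCHIVEPATHFILE=/usr/local/argus/argus.archive/2010/02/04/argus.2010.02.04.00.00.00.0.gz
FILENAME=`basename $ARCHIVEPATHFILE`

# Copy the file, then create a marker file so the remote job knows the
# transfer is complete before it starts processing.
scp "$ARCHIVEPATHFILE" "$REMOTE:$REMOTEDIR/" && \
  ssh "$REMOTE" "touch $REMOTEDIR/$FILENAME.done"
```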
	New in this version is the option to anonymize the archived data
using ranonymize and pass the anonymized data to the post processing
programs (which, incidentally, can and should be run under an id that only
has read access to the archived data!). This allows traffic analysis of
the data (as only the IPs are anonymized in my case) without the analyst
knowing which machines are creating the data. This is useful when a client
has a traffic problem but isn't comfortable (or possibly even permitted)
letting the analyst see the real IP addresses. If the client runs the same
non-anonymized traffic through the post processing scripts, they should
get the same reports with the real IP addresses, from which they can take
appropriate action.
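For reference, the ranonymize invocation the script makes looks like this,
using the default paths from the script's option block (illustrative;
adjust to your installation). The anonymized copy gets an "anon" prefix so
it can't be confused with the real data:

```shell
#!/bin/sh
# The ranonymize call as issued by the script, with the default paths
# from the option block filled in (illustrative only).
ARCHIVE=argus.2010.02.04.00.00.00.0
/usr/local/bin/ranonymize -f /var/log/argus/ranonymize.conf \
  -r /usr/local/argus/argus.archive/2010/02/04/$ARCHIVE \
  -w /var/log/argus/anondata/anon$ARCHIVE
```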

Peter Van Epp (vanepp at sfu.ca)

