rasql, rasqlinsert, rasqltimeindex

Carter Bullard carter at qosient.com
Tue May 28 19:19:43 EDT 2013


Hey John,

The concept of rasql() is to read argus data using the database, either
to fetch raw data when its in a database table, or to find it in an indexed
filesystem, or both at the same time.  This support is integrated into
rasql() and its turned on by using a " -t time-range " filter.

The time filter is used by two algorithms to select the source of the
data.  The first algorithm is designed to test if rasqltimeindex() has
been at work indexing files and poking its data into THIS database.
To do this, rasql() will look in the specified mysql database, which is
provided either on the command line or defined in the RA_DATABASE
rarc variable, for a " Seconds " table.  If that table exists, then the algorithm
will query the database for any files that may contain any seconds in
the time range, and from that list we just read the data that intersects
the time range.

I have my .rarc file set up so this works for me to find data from the indexed files.

   % rasql -t -7d+1m

This is really equivalent to:

   % rasql -t -7d+1m -r mysql://localhost:dbuser[:passwd]/ratop

The existence of the " Seconds " table in a particular database is the
only hook between rasql() and rasqltimeindex().  The " Seconds " table
refers to a " Filename " table which has the file data, its path, filename,
creation time, access time, status, size, and MD5 checksum, so we
can know if the index is good for that specific file.  rasql() knows to look
at the status field to see if it can read the file.  The file can be compressed
after the index is developed and rasql() knows how to decompress it,
etc…..

Now, for the second algorithm, its oriented around the features of
rasqlinsert().  rasqlinsert() can insert data into database tables that have
time and date elements in the table names.  This form of database
partitioning is pretty good, as it supports simple data removal, based on
time, which made sense to me.  Keep primitive data in a database table
for some retention time, and then roll it out, into a filesystem based
archive, and index it, until its time to roll that out.

If rasql() has any  " -M time xxx -r mysql://host:user/db/table_%whatever "
options, it knows that " strftime() " directives are involved, and it will take the
time range filter and use it generate all the possible database table names, 
and then test to see if any exist in the database.   It there are table matches,
rasql() will look at the table(s) schema to see if the "stime or ltime" variables
are attributes in the table.   If either exist, it will query for time specific data
from the tables, by adding a range to the SQL query statement.
If the "stime or ltime" values are not in the table, rasql() will read the entire
table and apply its time filter on the data it reads from the database.

So, …., for me, I have rasqlinsert() write primitive data into a table
that has the date in the name, and I make sure that the " ltime "
field is in the database schema.

  % rasqlinsert -d -S localhost:562 -m none -s -stime -s +0ltime \
                -M time 1d -w mysql://root@localhost/ratop/primitiveTable_%Y_%m_%d 
                

So to read data from specific tables you use the same options as rasqlinsert(),

  % rasql -M time 1d -r mysql://root@localhost/ratop/primitiveTable_%Y_%m_%d \
          -t -1w+20m

rasql() uses the same algorithm that rasqlinsert() uses to formulate the table names,
so the concept is that you are changing the " -w … " to " -r …" and you voila,
you can read the data that was written.

Seems complicated, but rasql() is designed to find data that may be
in either strategy at the same time.  If we rolled data out of a
database table, into a file system, and then indexed it before removing
it, we can migrate data from one archive system to the next without
losing the ability to find the data.

When its time to roll data out of the primitiveTable(s), I read the entire table using
rasql(), and write its output to rasplit() which will generate the output file(s), (should
only be one) and when its done, I time index the file(s), and if successful, I drop
the database table, else rollback.

As I indicated above, my ~/.rarc file allows me to make queries like this and
get data regardless of what system the data is in:

     % rasql -t -12w+1d -M time 1d -r mysql://primitiveTable_%Y_%m_%d

Going for data in the filesystem first attempts to get older data first,
thinking that the data that maybe in the tables would be more recent.

OK, have to go to dinner, hopefully that helps !!!!!

Carter



On May 28, 2013, at 4:40 PM, John Gerth <gerth at graphics.stanford.edu> wrote:

> This is a very nice example of how to use the rasql* tools which I'd like to
> see completely fleshed out as it reveals your innovative data management approach.
> 
> The piece I'm not grokking yet, is how to leverage the rasqltimeindex tables
> against a full argus archive.  The man page for 'rasql' talks about retrieving
> argus records from a DB, but rasqltimeindex is all about indexing records
> from argus files which implies that the data is left in the files and not
> duplicated in the database (a Very Good Thing).  Perhaps I'll understand
> if you augment the example to show how to list all the flows for 69.74.153.46
> 
> John Gerth      gerth at graphics.stanford.edu  Gates 378   (650) 725-3273 fax 725-6949
> 
> On 5/28/2013 8:10 AM, Carter Bullard wrote:
>> Hey Russell,
>> Yes, a lot is going on, but I'm sure you'll catch up very quickly !!!!
>> 
>> About merging flows from multiple sensors using radium.
>> 
>> This is a great thing to do, as it consolidates your data into a single stream, so
>>   1) a single tool can get a bigger look at what is going on
>>   2) you can develop a single repository that is being updated in near realtime
>>   3) you can have multiple points to access live data streams (probe and radii)
>> 
>> To make this work, the key is that all your argus data sources MUST have
>> unique source id's, so you can discriminate the source of the data, either 
>> when ingesting the data or when you do your analytics.
>> 
>> I use rastream(), which is working very well for me, to split the single stream
>> into 5 minute files, separated by source id, and the rastream script indexes
>> the data for time.  This establishes and maintains a single repository, which
>> can be accessed very quickly.
>> 
>> The radium() that provides geolocation data, and process correlation for
>> all my data is on my localhost port 562, so…..
>> 
>>   % rastream -S localhost:562 -f /usr/local/bin/rastream.sh -B 10s -M time 5m \
>>      -w /Volumes/Data/Archive/QoSient/\$srcid/%Y/%m/%d/argus.%Y.%m.%d.%H.%M.%S -d
>> 
>> The rastream.sh file is:
>> 
>> ----- Begin included file -----
>> 
>> #!/bin/sh
>> #
>> #  Argus Client Software.  Tools to read, analyze and manage Argus data.
>> #  Copyright (C) 2000-2013 QoSient, LLC.
>> #  All Rights Reserved
>> #
>> # Script called by rastream, to process files.
>> #
>> # Since this is being called from rastream(), it will have only a single
>> # parameter, filename,
>> #
>> # Carter Bullard <carter at qosient.com>
>> #
>> 
>> PATH="/usr/local/bin:$PATH"; export PATH
>> package="argus-clients"
>> version="3.0.6"
>> 
>> OPTIONS="$*"
>> FILES=
>> while  test $# != 0
>> do
>>    case "$1" in
>>    -r) shift; FILES="$1"; break;;
>>    esac
>>    shift
>> done
>> 
>> rasqltimeindex -r $FILES -w mysql://root@localhost/ratop
>> exit 0
>> 
>> 
>> ----- End included file -----
>> 
>> Very simple, just time index the data in the files.
>> 
>> I have a specific purpose for my repository, to help me identify new 
>> network assets as they appear, to realize who they are and who they
>> have been talking to.
>> 
>> To do this, I have multiple rasqlinsert()'s attached to the same radium()
>> that are establishing and maintaining the pre-processed views into the
>> database.  I have one to track mac address / IP address pairs:
>> 
>>   rasqlinsert -d -M time 1d -S localhost:562 -w mysql://root@localhost/ratop/etherHost_%Y_%m_%d \
>>      -M rmon cache -m srcid smac saddr -s ltime dur srcid smac saddr spkts dpkts sappbytes dappbytes
>> 
>> That is my go to set of data for finding IP addresses.  rasqlinsert() will maintain this table,
>> in real-time, so any address that pops up is immediately in the table, and the result
>> of rasqlinsert()s processing, I end up with a table for every day.
>> 
>> To track the who is talking to who data, I've got another rasqlinsert() doing the IP matrix data:
>> 
>>   rasqlinsert -d -M time 1d -S localhost:562 -w mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d \
>>      -M cache -m srcid matrix -s ltime dur srcid saddr daddr bytes - ip
>> 
>> These two provide all the information I need.  I end up with references to every IP address seen
>> in the complete argus data, and I can find the data in seconds.
>> 
>> So when I need to look up a specific IP address, some random address in one of my tables
>> as an example,  I make this call:
>> 
>>  % time rasql -t -365d+365d -M time 1d -r mysql://root@localhost/ratop/etherHost_%Y_%m_%d -M sql='saddr="69.74.153.46"'
>>                  LastTime        Dur              SrcId             SrcMac            SrcAddr  SrcPkts  DstPkts    SAppBytes    DAppBytes 
>> 2012/06/04.09:04:18.107813   0.000000      207.237.36.98  00:21:a0:ce:0c:d9       69.74.153.46        1        0           20            0
>> 
>> real	0m2.753s
>> user	0m0.029s
>> sys	0m0.013s
>> 
>> 
>> 2.7 seconds to scan an entire years set of data from 20 argus data sources.  Then if I want to know
>> what addresses that address talked to, knowing that it was just in one day, I can use my ipMatrix
>> table to search for the list of addresses:
>> 
>> time rasql -t 2012/06/04 -M time 1d -r mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d -M sql="saddr='69.74.153.46'" 
>>                  LastTime        Dur              SrcId            SrcAddr            DstAddr   TotBytes 
>> 2012/06/04.09:04:18.107813   0.000000      207.237.36.98       69.74.153.46     207.237.190.64         70
>> 
>> real	0m0.110s
>> user	0m0.026s
>> sys	0m0.008s
>> 
>> Not bad.
>> 
>> 
>> OK, so I suggest getting a sense of what you want to do, get all the data into a single stream,
>> and point a bunch of rasqlinsert()s at it, then write some scripts to can the basic queries to
>> get your answers out as fast as you can.  That will get you started.
>> 
>> Holler, if there is anything that doesn't work as you expect….
>> 
>> Carter
>> 
> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://pairlist1.pair.net/pipermail/argus/attachments/20130528/95f9c341/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 6837 bytes
Desc: not available
URL: <https://pairlist1.pair.net/pipermail/argus/attachments/20130528/95f9c341/attachment.bin>


More information about the argus mailing list