Database design concerns

Dave Edelman dedelman at iname.com
Fri Oct 26 23:58:57 EDT 2012


Carter, 

 

What type of insertion rate are you able to achieve?  I just pointed
mysqlinsert at a day's worth of one sensor (changed to -S localhost:562 to
-r *) and it is chugging along.  I have spent no time at all tuning the
database (fresh install from source) MyISAM engine, pretty good size server.
Do you have any tips for tuning MySQL for best insertion rates?

 

--Dave

 

 

 

 

 

From: argus-info-bounces+dedelman=iname.com at lists.andrew.cmu.edu
[mailto:argus-info-bounces+dedelman=iname.com at lists.andrew.cmu.edu] On
Behalf Of Carter Bullard
Sent: Friday, October 26, 2012 11:43 AM
To: Paul Schmehl
Cc: Argus List
Subject: Re: [ARGUS] Database design concerns

 

* PGP - S/MIME Signed by an unverified key: 10/26/2012 at 11:42:33 AM

Hey Paul,

So I thought I'd give you some real data on this type of data strategy.

This is how I use rasqlinsert() and rasql() together to do what you're
describing.

I run rasqlinsert() as a daemon, reading argus data from all my argus
sources.

Currently I have 28 sources, here at QoSient World Headquarters.

 

A single radium attaches to all the sources, and generates a single stream

of realtime argus data.  I run rastream() to write the data to files, and
this

rasqlinsert() to formulate an IP matrix index:

 

   rasqlinsert -M time 1d -S localhost:562 -w
mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d
<mysql://root@localhost/ratop/ipMatrix_%25Y_%25m_%25d>  \

          -M cache -m srcid matrix -s ltime dur srcid saddr daddr bytes -d -
ip

 

This has been running for a little over 20 months, and so I have some

decent data to work with.

 

OK, I'm interested in this address, 208.70.31.236, as I'm looking at live

data, and its something I'm not familiar with.  its owned by " Internet
Archive ".

I have no idea why I would be talking to " Internet Archive ", so how long

have I been talking to this IP?

 

So this is a rasql() call to scan back 365 days, and tell me how long have

I been communicating with this site.

 

   rasql -t -365d -r mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d
<mysql://root@localhost/ratop/ipMatrix_%25Y_%25m_%25d>  -M time 1d \

          -M sql="saddr = '208.70.31.236' or daddr = '208.70.31.236'"

 

This will find every daily IP address instance for this address in my

entire repository.  So what's going on?

 

   rasql - used to read argus " record "s from database tables

 

   -t -365d -  set the start time to be 1 calendar year ago, since I

               don't specify a stop time, stop will be " now ".

 

   -r mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d
<mysql://root@localhost/ratop/ipMatrix_%25Y_%25m_%25d>  

               this sez, use wildcard table names.  You have to

               specify what the time duration is that creates the names.

 

   -M time 1d - This sez use 1 day to formulate the table names.

 

   -M sql="saddr = '208.70.31.236' or daddr = '208.70.31.236'"

               this passes a where clause to the SQL commands

               to find the IP addresses of interest.

   

OK, so how well does it do?  This specific query in my system, covering 450
GB of

argus data,  which is a small repository, returns 169 records covering the
entire

year.  For this respository, there are 365 tables to look at, for instances
of

this IP address.  Since I'm interested in how long its been going on, I'll
pipe

the output to ratimerange() to get the result.  Lets time it:

 

 

  thoth:~ carter% time rasql -t -365d -r
mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d
<mysql://root@localhost/ratop/ipMatrix_%25Y_%25m_%25d>  -M time 1d \

         -M sql="saddr = '208.70.31.236' or daddr = '208.70.31.236'" \

         -w -  | ratimerange

 

  2012/04/16.15:55:18.645652 - 2012/10/26.09:37:54.977133

 

  real 0m0.816s

  user 0m0.054s

  sys  0m0.025s

 

 

Not bad for covering an entire year of data.  This is on a 4 year old Mac
Pro,

with 2x4 core Xeon, with 14GB of memory.  So it started, April 16, at
3:55:18.645652 PM. 

Next steps are pretty simple ..

 

If I pipe the output to racluster(), I get this:

 

thoth:~ carter$ time rasql -t -365d -r
mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d
<mysql://root@localhost/ratop/ipMatrix_%25Y_%25m_%25d>  \

                      -M time 1d -M sql="saddr = '208.70.31.236' or daddr =
'208.70.31.236'" \

                      -w - | racluster -s stime dur:16 srcid saddr daddr
trans pkts bytes

 

                 StartTime              Dur              SrcId
SrcAddr            DstAddr  Trans  TotPkts   TotBytes

2012/04/16.15:55:18.646102  16652556.131783      207.237.36.98
207.237.36.98      208.70.31.236    357      714     110786

2012/04/16.15:55:18.645652  16652556.134256        192.168.0.1
192.168.0.66      208.70.31.236    357      714     110786

2012/04/16.15:55:18.639736  16652556.136862       192.168.0.66
192.168.0.66      208.70.31.236    357      714     110786

 

real   0m0.846s

user   0m0.058s

sys    0m0.025s

thoth:~ carter$ 

 

Here we find the 3 probes that saw the traffic, the end system, the interior
border sensor, and the exterior sensor.

The ipMatrix table has aggregates in them, so we had 169 active days, and a
total of 357 transactions over 

the year. Still not sure what it is, so need to look at the primitive data
to figure it out....

 

OK, so we find out its innocuous, and no big deal, but that didn't take long
at all to find out.

Hope this is useful,

 

Carter 

 

 

On Oct 26, 2012, at 6:53 AM, Carter Bullard <carter at qosient.com> wrote:





Hey Paul,

Its all about your forensics data model.  You're putting all your data

into a database without a model, and then using the DB to do a lot

of work.  Great idea, but for some sites, 120M rows isn't a lot of data.

When you get into the 1.2G rows a day, those queries get 1 or more

additional 0's at the end of your response times.  Thats about 20K

records / sec.   Just putting the primitive argus data into a mysql table

gets to be tough to do.  Now, your column count has to go down to

keep up, and the query taxonomy that the limited flow id's and metrics

( columns)  can support goes down.

 

So we / I approach this by not putting the primitive records into the

database.  By building multiple data models in parallel, and putting

those in the database, we can get good performance on a good

Cyber Forensics query taxonomy over Petabytes of data, over periods

of years.  These are the targets I think are useful.

 

So, YES, being able to find all the nodes that X is / has talked to is

really valuable.  And, we have methods that can maintain a day based

matrix table in realtime.  Its far less expensive that putting every record

in the database.

 

    rasqlinsert -S localhost -m srcid matrix -M cache \

        -w mysql://user@localhost/db/matrix_%Y_%m_%d
<mysql://user@localhost/db/matrix_%25Y_%25m_%25d>  \

        -s srcid stime ltime saddr daddr - ipv4

 

For some, the matrix is expressed as a CIDR addresses:

    -m srcid matrix/24

 

The resulting table has all the X's <-> Y's, by day.  Now you may find

that that granularity doesn't do well for you, but you can adjust that with

the tools we provide, using a " -M hard time 1h " like option and adding

the " stime " to the key, you get hourly granularity in daily tables.
Adding

other metrics can be useful, but we always put a full argus record in each

row, so we get all the metrics that can be useful, without beating up the

DB.

 

The idea is that this table can be done in parallel with other data models

that are also very useful.

 

On the web site, we talk about also having an IP address inventory, that

you can have going, concurrently, in real time:

 

      rasqlinsert -S localhost -M rmon -m srcid smac saddr -M cache \

        -w mysql://user@localhost/db/addrs_%Y_%m_%d
<mysql://user@localhost/db/addrs_%25Y_%25m_%25d>  \

        -s srcid stime ltime smac saddr - ipv4

 

Now this table provides you with ethernet / IP pairs, which will provide you

with some topology information on where X came from.  This table has a

bonus, it will tell you whether your IP came from a router / switch,

or came from an end system, as you can find all the MAC addresses in your

infrastructure, and how many IP's that MAC addr is supporting.   Helps you

to find VM's, wireless access points, and spoofing sources.

 

With all queries in these data models returning start and end times, 

you should be able to get to the original argus records ( primitive data ),
that are

being stored in time based files, very quickly.  But you should be able to

query other data models, bounded in some relevant time based queries, or
these

same models in other sites, to do some really good forensics work, evenly
interactively.

 

So when you get an event that is interesting involving some odd ball X at
time T, you

generally want to know who X was talking to at time (T - Tc) - (T + Tc),
where c is like

60-300 seconds, but the next question is almost always, when did I first see
X, always

going back at least 1 month, but a year is much better.  And then, its
always valuable

to know when did X first start talking to the specific Y's that are involved
in the current

event, and who were those Y's talking to.

 

The whole point is to find the initial penetration event, and attempt to
characterize

the stepping stones ( Markov chain ), to tell you how bad it is.

 

Generally, these types of cross table queries, when these tables are already
built,

are much faster than your response times.  The important fields are already
indexed,

because they are keys.   ( -m fields specify the DB keys ).

 

So what kind of data models do you think may be useful ?  That would be an

interesting conversation, I believe.

 

Carter

 

 

 

On Oct 26, 2012, at 12:48 AM, Paul Schmehl <pschmehl_lists at tx.rr.com> wrote:





--On October 26, 2012 12:21:58 AM -0400 Carter Bullard <carter at qosient.com>
wrote:




Hey Paul,
All the rasql* tools know how to cross day boundaries to find data, so it
works well for ra*. But, that doesn't mean that there aren't better ways
of doing things.

No reason, ..., nobody has asked for column indexing with rasql*
programs.  Probably because it does slow down record insertion.  You have
decided to not insert the binary argus record into the database, so you
don't need any of the rasql* tools to do your database operations.

What kind of queries are you doing ?


Typically the question we're trying to answer is something like this:

Show me what saddr's connected to unique daddr on port num between one time
and another.

Specific example would be for Torpig.

Which IPs connected to 8.8.8.8 on port 53 between 8AM and 12 Noon yesterday.

We might occasionally want to search for all connections for one host over
several days.

The problem we're trying to solve is that searches on the data using ra take
a very long time.  We're archiving the logs in four hour increments,
bzipped, and a search of one four hour period will take 20 minutes or so. A
search through a whole day's logs (6) would take several hours.  A typical
day is between 50 and 75 GB compressed - 90 to 120GB uncompressed.

By putting the critical information into a database we can locate the
connections more quickly, then use ra with the -t switch to hopefully find
the data more quickly.

A typical query would be something like this:

SELECT * from table where saddr between x and y and daddr=z and stime
between z and a.

I just ran that query and it took 25 seconds.  Dropping the saddr range and
just searching for all connections to that destination dropped the time to 1
second.  That same query on a four hour argus file would taken 20 minutes or
so.

Paul Schmehl, Senior Infosec Analyst
As if it wasn't already obvious, my opinions
are my own and not those of my employer.
*******************************************
"It is as useless to argue with those who have
renounced the use of reason as to administer
medication to the dead." Thomas Jefferson
"There are some ideas so wrong that only a very
intelligent person could believe in them." George Orwell



 

 

* Carter Bullard <carter at qosient.com>
* Issuer: "VeriSign - Unverified

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://pairlist1.pair.net/pipermail/argus/attachments/20121026/ef030645/attachment.html>


More information about the argus mailing list