Database design concerns

Paul Schmehl pschmehl_lists at tx.rr.com
Fri Oct 26 15:03:30 EDT 2012


Thanks a lot, Carter.  You've made me realize that 1) I was trying to 
reinvent the wheel and 2) I have no understanding at all of the tools 
you've provided to parse argus records.

Now I have to go read for a few months.  :-)

--On October 26, 2012 6:53:23 AM -0400 Carter Bullard <carter at qosient.com> 
wrote:

> Hey Paul,
> It's all about your forensics data model.  You're putting all your data
> into a database without a model, and then using the DB to do a lot of
> work.  It's a great idea, but for some sites 120M rows isn't a lot of data.
> When you get to 1.2G rows a day, those queries get one or more additional
> zeros at the end of your response times.  That's about 20K records / sec.
> Just putting the primitive argus data into a mysql table gets to be tough
> to do.  At that point your column count has to go down just to keep up,
> and the query taxonomy that the remaining flow ids and metrics ( columns )
> can support shrinks with it.
>
>
>
> So we approach this by not putting the primitive records into the
> database.  By building multiple data models in parallel, and putting
> those into the database instead, we can get good performance on a useful
> cyber-forensics query taxonomy over petabytes of data and over periods
> of years.  Those are the targets I think are worth designing for.
>
>
> So, YES, being able to find all the nodes that X is talking to or has
> talked to is really valuable.  We have methods that can maintain a
> day-based matrix table in real time, and it's far less expensive than
> putting every record in the database.
>
>
>     rasqlinsert -S localhost -m srcid matrix -M cache \
>         -w mysql://user@localhost/db/matrix_%Y_%m_%d \
>         -s srcid stime ltime saddr daddr - ipv4
>
>
> For some sites, the matrix is expressed as CIDR addresses ( network
> prefixes rather than individual hosts ):
>     -m srcid matrix/24
>
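> A minimal sketch of the full command with CIDR aggregation, assembled from
> the pieces above ( the /24 granularity and the table name are just examples,
> not a tested invocation ):
>
>     rasqlinsert -S localhost -m srcid matrix/24 -M cache \
>         -w mysql://user@localhost/db/matrix24_%Y_%m_%d \
>         -s srcid stime ltime saddr daddr - ipv4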
>
> Either way, the resulting table has all the X <-> Y pairs, by day.  You
> may find that that granularity doesn't work well for you, but you can
> adjust it with the tools we provide: using an option like " -M hard time
> 1h " and adding " stime " to the key gives you hourly granularity in daily
> tables ( see the sketch below ).  Adding other metrics can be useful, but
> we always put a full argus record in each row, so we keep every metric
> that might be useful without beating up the DB.
>
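> A sketch of that hourly variation, assuming the same table and fields as
> the daily matrix command above ( the option placement is assembled from
> this description and not verified, so treat it as a starting point ):
>
>     rasqlinsert -S localhost -m srcid stime matrix -M cache -M hard time 1h \
>         -w mysql://user@localhost/db/matrix_%Y_%m_%d \
>         -s srcid stime ltime saddr daddr - ipv4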
>
> The idea is that this table can be maintained in parallel with other data
> models that are also very useful.
>
>
> On the web site we also talk about keeping an IP address inventory, which
> you can have running concurrently, in real time:
>
>
>       rasqlinsert -S localhost -M rmon -m srcid smac saddr -M cache \
>         -w mysql://user@localhost/db/addrs_%Y_%m_%d \
>         -s srcid stime ltime smac saddr - ipv4
>
>
> This table gives you ethernet / IP address pairs, which provides some
> topology information on where X came from.  It has a bonus: it will tell
> you whether an IP came from a router / switch or from an end system,
> because you can see all the MAC addresses in your infrastructure and how
> many IPs each MAC address is supporting.  That helps you find VMs,
> wireless access points, and spoofing sources ( see the query sketch below ).
>
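> For example, a plain SQL query against that inventory table can surface
> the MAC addresses fronting many IPs ( the table name assumes the
> addrs_%Y_%m_%d layout written by the command above ):
>
>     SELECT smac, COUNT(DISTINCT saddr) AS addr_count
>       FROM addrs_2012_10_26
>      GROUP BY smac
>      ORDER BY addr_count DESC
>      LIMIT 20;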
>
> With all queries in these data models returning start and end times, you
> should be able to get to the original argus records ( the primitive data ),
> stored in time-based files, very quickly.  You should also be able to
> query other data models, bounded by the relevant time ranges, or the same
> models at other sites, to do some really good forensics work, even
> interactively.
>
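> As a sketch of that workflow, pull the time bounds from the matrix table
> and hand them to ra against the time-based archive files ( the addresses,
> archive path, table name, and exact -t time format here are illustrative,
> not verified ):
>
>     SELECT stime, ltime FROM matrix_2012_10_26
>      WHERE saddr = '10.1.2.3' OR daddr = '10.1.2.3';
>
>     ra -r /argus/archive/2012/10/26/* \
>        -t 2012/10/26.08:00:00-2012/10/26.09:00:00 - host 10.1.2.3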
>
> So when you get an interesting event involving some oddball X at time T,
> you generally want to know who X was talking to in the window (T - Tc) to
> (T + Tc), where Tc is something like 60-300 seconds.  The next question
> is almost always "when did I first see X?", going back at least a month,
> though a year is much better.  And then it's always valuable to know when
> X first started talking to the specific Y's involved in the current event,
> and who those Y's were talking to ( a rough sketch of these queries
> follows ).
>
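> Since the matrix tables are one per day, a first-seen query is basically a
> sweep across the daily tables; a rough SQL sketch, with illustrative table
> names and addresses:
>
>     -- earliest appearance of X on a given day
>     SELECT MIN(stime) FROM matrix_2012_10_26
>      WHERE saddr = '10.1.2.3' OR daddr = '10.1.2.3';
>
>     -- first contact between X and a specific Y that day ( the pair can be
>     -- stored in either direction in a matrix model )
>     SELECT MIN(stime) FROM matrix_2012_10_26
>      WHERE (saddr = '10.1.2.3' AND daddr = '192.0.2.7')
>         OR (saddr = '192.0.2.7' AND daddr = '10.1.2.3');
>
> Repeat ( or UNION ) across each day's table going back a month or a year.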
>
> The whole point is to find the initial penetration event, and attempt to
> characterize
> the stepping stones ( Markov chain ), to tell you how bad it is.
>
>
> Generally, these types of cross-table queries, when the tables are
> already built, are much faster than the response times you're seeing.
> The important fields are already indexed, because they are keys
> ( the -m fields specify the DB keys ).
>
>
> So what kinds of data models do you think would be useful?  That would be
> an interesting conversation, I believe.
>
>
>
> Carter
>
>
> On Oct 26, 2012, at 12:48 AM, Paul Schmehl <pschmehl_lists at tx.rr.com>
> wrote:
>
> --On October 26, 2012 12:21:58 AM -0400 Carter Bullard
> <carter at qosient.com> wrote:
>
>
> Hey Paul,
> All the rasql* tools know how to cross day boundaries to find data, so
> that works well for the ra* programs.  But that doesn't mean there aren't
> better ways of doing things.
>
> No reason, ..., nobody has asked for column indexing in the rasql*
> programs, probably because it does slow down record insertion.  You've
> decided not to insert the binary argus record into the database, so you
> don't need any of the rasql* tools to do your database operations.
>
> What kind of queries are you doing ?
>
>
>
> Typically the question we're trying to answer is something like this:
>
> Show me which saddr's connected to a particular daddr on a particular port
> between one time and another.
>
> A specific example would be Torpig:
>
> Which IPs connected to 8.8.8.8 on port 53 between 8 AM and 12 noon
> yesterday?
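>
> For reference, against the raw archive that question looks something like
> the ra invocation below ( the archive path and -t time format are
> illustrative, and the filter is the usual tcpdump-style expression ):
>
>     ra -r /argus/archive/2012/10/25/* \
>        -t 2012/10/25.08:00:00-2012/10/25.12:00:00 \
>        - dst host 8.8.8.8 and dst port 53
>
> which is the kind of full-archive scan that, as described below, takes on
> the order of 20 minutes per four-hour file.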
>
> We might occasionally want to search for all connections for one host
> over several days.
>
> The problem we're trying to solve is that searches on the data using ra
> take a very long time.  We're archiving the logs in four-hour increments,
> bzipped, and a search of one four-hour period takes 20 minutes or so.
> A search through a whole day's logs ( six files ) would take several
> hours.  A typical day is between 50 and 75 GB compressed - 90 to 120 GB
> uncompressed.
>
> By putting the critical information into a database we can locate the
> connections of interest quickly, and then use ra with the -t switch to
> pull the matching records from the archive much faster.
>
> A typical query would be something like this:
>
> SELECT * FROM table WHERE saddr BETWEEN x AND y AND daddr = z AND stime
> BETWEEN t1 AND t2;
>
> I just ran that query and it took 25 seconds.  Dropping the saddr range
> and just searching for all connections to that destination dropped the
> time to 1 second.  That same query against a four-hour argus file would
> have taken 20 minutes or so.
>
> Paul Schmehl, Senior Infosec Analyst
> As if it wasn't already obvious, my opinions
> are my own and not those of my employer.
> *******************************************
> "It is as useless to argue with those who have
> renounced the use of reason as to administer
> medication to the dead." Thomas Jefferson
> "There are some ideas so wrong that only a very
> intelligent person could believe in them." George Orwell
>



Paul Schmehl, Senior Infosec Analyst
As if it wasn't already obvious, my opinions
are my own and not those of my employer.
*******************************************
"It is as useless to argue with those who have
renounced the use of reason as to administer
medication to the dead." Thomas Jefferson
"There are some ideas so wrong that only a very
intelligent person could believe in them." George Orwell



