Database design concerns

Fri Oct 26 15:11:25 EDT 2012

Hey Paul,
No problem at all !!!!!  If there is a problem, its that we don't have enough
documentation on this topic for you to read.  I was excited that you brought
up the topics that you did, as I'm hoping that it will generate some thoughts
on the " first pass " support we put into argus-3.0.4+.

I think the best thing for you to do is to keep doing what you're doing, and
if there is anything I can help you with, I'm happy to do so.  I still think you're
onto something with the partitions, but I don't know enough about them to
know if they are a distraction, or something useful.

I just don't want you to think that argus is half done.   I'm thinking its getting
done, but you never know how people like their software cooked ;o)

Carter

On Oct 26, 2012, at 3:03 PM, Paul Schmehl <pschmehl_lists at tx.rr.com> wrote:

> Thanks a lot, Carter.  You've made me realize that 1) I was trying to reinvent the wheel and 2) I have no understanding at all of the tools you've provided to parse argus records.
> 
> Now I have to go read for a few months.  :-)
> 
> --On October 26, 2012 6:53:23 AM -0400 Carter Bullard <carter at qosient.com> wrote:
> 
>> Hey Paul,
>> Its all about your forensics data model.  You're putting all your data
>> into a database without a model, and then using the DB to do a lot
>> of work.  Great idea, but for some sites, 120M rows isn't a lot of data.
>> When you get into the 1.2G rows a day, those queries get 1 or more
>> additional 0's at the end of your response times.  Thats about 20K
>> records / sec.   Just putting the primitive argus data into a mysql table
>> gets to be tough to do.  Now, your column count has to go down to
>> keep up, and the query taxonomy that the limited flow id's and metrics
>> ( columns)  can support goes down.
>> 
>> 
>> 
>> So we / I approach this by not putting the primitive records into the
>> database.  By building multiple data models in parallel, and putting
>> those in the database, we can get good performance on a good
>> Cyber Forensics query taxonomy over Petabytes of data, over periods
>> of years.  These are the targets I think are useful.
>> 
>> 
>> So, YES, being able to find all the nodes that X is / has talked to is
>> really valuable.  And, we have methods that can maintain a day based
>> matrix table in realtime.  Its far less expensive that putting every
>> record
>> in the database.
>> 
>> 
>>    rasqlinsert -S localhost -m srcid matrix -M cache \
>>        -w mysql://user@localhost/db/matrix_%Y_%m_%d \
>>        -s srcid stime ltime saddr daddr - ipv4
>> 
>> 
>> For some, the matrix is expressed as a CIDR addresses:
>>    -m srcid matrix/24
>> 
>> 
>> The resulting table has all the X's <-> Y's, by day.  Now you may find
>> that that granularity doesn't do well for you, but you can adjust that
>> with
>> the tools we provide, using a " -M hard time 1h " like option and adding
>> the " stime " to the key, you get hourly granularity in daily tables.
>> Adding
>> other metrics can be useful, but we always put a full argus record in each
>> row, so we get all the metrics that can be useful, without beating up the
>> DB.
>> 
>> 
>> The idea is that this table can be done in parallel with other data models
>> that are also very useful.
>> 
>> 
>> On the web site, we talk about also having an IP address inventory, that
>> you can have going, concurrently, in real time:
>> 
>> 
>>      rasqlinsert -S localhost -M rmon -m srcid smac saddr -M cache \
>>        -w mysql://user@localhost/db/addrs_%Y_%m_%d \
>>        -s srcid stime ltime smac saddr - ipv4
>> 
>> 
>> Now this table provides you with ethernet / IP pairs, which will provide
>> you
>> with some topology information on where X came from.  This table has a
>> bonus, it will tell you whether your IP came from a router / switch,
>> or came from an end system, as you can find all the MAC addresses in your
>> infrastructure, and how many IP's that MAC addr is supporting.   Helps you
>> to find VM's, wireless access points, and spoofing sources.
>> 
>> 
>> With all queries in these data models returning start and end times,
>> you should be able to get to the original argus records ( primitive data
>> ), that are
>> being stored in time based files, very quickly.  But you should be able to
>> query other data models, bounded in some relevant time based queries, or
>> these
>> same models in other sites, to do some really good forensics work, evenly
>> interactively.
>> 
>> 
>> So when you get an event that is interesting involving some odd ball X at
>> time T, you
>> generally want to know who X was talking to at time (T - Tc) - (T + Tc),
>> where c is like
>> 60-300 seconds, but the next question is almost always, when did I first
>> see X, always
>> going back at least 1 month, but a year is much better.  And then, its
>> always valuable
>> to know when did X first start talking to the specific Y's that are
>> involved in the current
>> event, and who were those Y's talking to.
>> 
>> 
>> The whole point is to find the initial penetration event, and attempt to
>> characterize
>> the stepping stones ( Markov chain ), to tell you how bad it is.
>> 
>> 
>> Generally, these types of cross table queries, when these tables are
>> already built,
>> are much faster than your response times.  The important fields are
>> already indexed,
>> because they are keys.   ( -m fields specify the DB keys ).
>> 
>> 
>> So what kind of data models do you think may be useful ?  That would be an
>> interesting conversation, I believe.
>> 
>> 
>> 
>> Carter
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Oct 26, 2012, at 12:48 AM, Paul Schmehl <pschmehl_lists at tx.rr.com>
>> wrote:
>> 
>> --On October 26, 2012 12:21:58 AM -0400 Carter Bullard
>> <carter at qosient.com> wrote:
>> 
>> 
>> Hey Paul,
>> All the rasql* tools know how to cross day boundaries to find data, so it
>> works well for ra*. But, that doesn't mean that there aren't better ways
>> of doing things.
>> 
>> No reason, ..., nobody has asked for column indexing with rasql*
>> programs.  Probably because it does slow down record insertion.  You have
>> decided to not insert the binary argus record into the database, so you
>> don't need any of the rasql* tools to do your database operations.
>> 
>> What kind of queries are you doing ?
>> 
>> 
>> 
>> Typically the question we're trying to answer is something like this:
>> 
>> Show me what saddr's connected to unique daddr on port num between one
>> time and another.
>> 
>> Specific example would be for Torpig.
>> 
>> Which IPs connected to 8.8.8.8 on port 53 between 8AM and 12 Noon
>> yesterday.
>> 
>> We might occasionally want to search for all connections for one host
>> over several days.
>> 
>> The problem we're trying to solve is that searches on the data using ra
>> take a very long time.  We're archiving the logs in four hour increments,
>> bzipped, and a search of one four hour period will take 20 minutes or so.
>> A search through a whole day's logs (6) would take several hours.  A
>> typical day is between 50 and 75 GB compressed - 90 to 120GB uncompressed.
>> 
>> By putting the critical information into a database we can locate the
>> connections more quickly, then use ra with the -t switch to hopefully
>> find the data more quickly.
>> 
>> A typical query would be something like this:
>> 
>> SELECT * from table where saddr between x and y and daddr=z and stime
>> between z and a.
>> 
>> I just ran that query and it took 25 seconds.  Dropping the saddr range
>> and just searching for all connections to that destination dropped the
>> time to 1 second.  That same query on a four hour argus file would taken
>> 20 minutes or so.
>> 
>> Paul Schmehl, Senior Infosec Analyst
>> As if it wasn't already obvious, my opinions
>> are my own and not those of my employer.
>> *******************************************
>> "It is as useless to argue with those who have
>> renounced the use of reason as to administer
>> medication to the dead." Thomas Jefferson
>> "There are some ideas so wrong that only a very
>> intelligent person could believe in them." George Orwell
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> Paul Schmehl, Senior Infosec Analyst
> As if it wasn't already obvious, my opinions
> are my own and not those of my employer.
> *******************************************
> "It is as useless to argue with those who have
> renounced the use of reason as to administer
> medication to the dead." Thomas Jefferson
> "There are some ideas so wrong that only a very
> intelligent person could believe in them." George Orwell
> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://pairlist1.pair.net/pipermail/argus/attachments/20121026/b0688024/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4367 bytes
Desc: not available
URL: <https://pairlist1.pair.net/pipermail/argus/attachments/20121026/b0688024/attachment.bin>