Database design concerns an Update

Carter Bullard carter at qosient.com
Mon Oct 29 12:27:42 EDT 2012


Hey Dave,
Yes, have radium do the labeling.

If you are reading netflow data, you may need a -B of as much as a day.

The -B option tells rastream how long to delay processing a file, waiting for
the final records to show up.  Flow records belonging to a file that arrive
after the -B time are thrown away (the file has already been processed).
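
A minimal sketch of the sort of invocation involved, assuming a netflow feed on
localhost:561 and five-minute output files; the one-day hold (-B 86400, assuming
the value is in seconds) is illustrative, not a recommendation:

   rastream -S localhost:561 -M time 5m -B 86400 \
            -w /argus/netflow/%Y/%m/%d/ntf.%Y.%m.%d.%H.%M.%S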

Netflow sources can wait a huge amount of time before forwarding flow-specific
data.  Depending on the hardware and the configuration, it could be 12 hours
before you see a specific flow record come out.

Because of this, for netflow data, it is best to run rasplit() for a while, and
then compare each file's creation date with its last-write date to see how long
you will have to wait before you can process the file.  I've seen 15 minutes,
I've seen 1 hour, I've seen 1 day.
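
A sketch of that measurement: rasplit writes five-minute files, and GNU stat
compares each file's birth time (%w, not available on every platform) against
its last-write time (%y):

   rasplit -S localhost:561 -M time 5m -w /tmp/ntf/ntf.%Y.%m.%d.%H.%M.%S
   # once files have gone quiet, see how far the last write trails the creation
   stat -c 'created %w   modified %y   %n' /tmp/ntf/ntf.*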

Carter


On Oct 29, 2012, at 11:25 AM, "Dave Edelman" <dedelman at iname.com> wrote:

> Carter,
>  
> When you say do the labeling as early as possible, I’m thinking of specifying my ralabel.conf as the value for RADIUM_CLASSIFIER_FILE. Is that what you had in mind?
>  
> I’m also looking at rastream() to deal with the file creation and post-processing. Because this dataset is derived from NetFlow, I’m not sure about the -B value. I’m going to try a value of 15 seconds and see what happens.
>  
> --Dave
>  
>  
>  
>  
> From: Carter Bullard [mailto:carter at qosient.com] 
> Sent: Monday, October 29, 2012 9:58 AM
> To: Dave Edelman
> Cc: 'Argus List'
> Subject: Re: [ARGUS] Database design concerns an Update
>  
> Hey Dave,
> Excellent !!!!  There are other dials to tweak if needed, such as the update rate,
> which is how often rasqlinsert() will update caches that have changed.  Raising
> that value can ease the load on the database, but it can also take away from the
> real-time usefulness of the resulting tables.
>  
> I've found that labeling data that is moving is less expensive than labeling
> data at rest; that's why we put labeling as the first function in radium().  So if
> you know ahead of time that you should be labeling, put it in your data ingest,
> using radium.
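>
> A minimal sketch of the relevant radium.conf line, per the RADIUM_CLASSIFIER_FILE
> question above (the path is an assumption):
>
>    # label flows on ingest, before anything downstream sees them
>    RADIUM_CLASSIFIER_FILE=/usr/local/argus/ralabel.conf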
>  
> Carter
>  
> On Oct 28, 2012, at 9:18 AM, "Dave Edelman" <dedelman at iname.com> wrote:
> 
> 
> Some updates from the front lines.
>  
> I have changed three things. I increased the size of the max_allowed_packet value from 1M to 64M; I built the clients without debug enabled; and I took a good look at the data I inserted before I killed the original activity.
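>
> For reference, a sketch of that server-side change in my.cnf (standard MySQL
> syntax; 64M is the value mentioned above):
>
>    [mysqld]
>    max_allowed_packet = 64M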
>  
> With the rasqlinsert parameters -m srcid matrix and -s ltime dur srcid saddr daddr bytes and a filter of - ip, I was aggregating too aggressively and had lost much of the material that I really needed. I also did a bit of thinking about a problem that I encounter with data that are collected over a long time span. In three years, we’ve closed some data centers, opened others, moved to significant amounts of virtualization, and have very effectively disconnected the IP addresses in the Argus records from the current reality. I do have the glue to reconnect most, if not all, of this, but it is not in those records.
>  
> With that last thought in mind, I tried an experiment of running ralabel against the records and piping that into rasqlinsert. I guess that the biggest benefit is that you are guaranteed not to overrun the database, even without optimizations. The insert rate could be measured in furlongs per fortnight :-)
>  
> I abandoned that idea and reset the insert process parameters and filter to something that provided what I actually need.
>                 -m srcid matrix proto
>                 -s ltime dur srcid saddr daddr bytes proto sco dco
>                 - not icmp and not udp
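>
> Assembled into a full command, patterned on the rasqlinsert invocation below; a
> sketch only, and the table name, database, and credentials are assumptions:
>
>     rasqlinsert -M time 1d -r * -w mysql://argus@localhost/argus/sinkMatrix_%Y_%m_%d \
>            -M cache -m srcid matrix proto \
>            -s ltime dur srcid saddr daddr bytes proto sco dco - not icmp and not udp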
>  
> Before anyone yells, some explanation is in order. This specific dataset is not typical and I have other datasets that I will deal with in different ways. I am responsible for the broken parts of a very large network. The parts that are not broken always seem to belong to other people :-( Broken in the sense of both not working and something evil this way comes. In round numbers we have 30,000 routing nodes and about 6.02 * 10^23 users give or take a few. We have a clearly defined DFZ (Default Free Zone) that has full routing tables and the entire network is isolated from the Internet with controlled access through formal authenticating proxy devices. Throughout our DFZ we intentionally inject default routes which point to a set of sinkholes. The sinkholes send NetFlow data to the mother ship and then drop the traffic. It is really not much more than an active darknet detection mechanism. Anything that gets to the sinkhole is misconfigured, broken, or malicious and always requires some sort of attention. The sinkhole NetFlow is really sent to a different system, but by dint of logical explanation, reason, and significant threats I convinced everyone that I should also get a copy of the material. This material is the dataset in question.
>  
> I never see anything but half-flows, because the sinkhole never talks back. I know that the destination address is not routable on the network, and I know that I have had very poor return from ICMP and UDP traffic, so the folks responsible for the other system handle that stuff. One other bit of background. My organization is very concerned about even the hint of a thought of an appearance of communications with a well-defined set of countries so part of this infrastructure is an EPD (Electro-Political Device.)
>  
> The parameters that I used allow me to insert about one day of historical data every 6-7 minutes which is very acceptable for the three years of data that I am loading. I am loading in twelve parallel streams and I haven’t seen any significant degradation in the insertion rate.
>  
> Retrieval, even in the midst of this insertion process, is very quick.
>  
> I still have to figure out what to do about labels both for the data that I’ve processed and new data going forward and I’m always open to suggestions.
>  
> --Dave
>  
>  
> From: Dave Edelman [mailto:dedelman at iname.com] 
> Sent: Saturday, October 27, 2012 2:36 PM
> To: 'Carter Bullard'
> Cc: 'Argus List'
> Subject: RE: [ARGUS] Database design concerns
>  
> The source data are contained in files in one-hour intervals. Using racount and filtering for only ip records, a typical sample has around 1.8 million records, and the resulting table has around 44,000 rows and takes about 32 minutes of elapsed time to populate.
>  
> The command that I used was:
> % rasqlinsert -M time 1d -r * -w mysql://argus@localhost/argus/ipMatrix_%Y_%m_%d -M cache -m srcid matrix -s ltime dur srcid saddr daddr bytes - ip
>  
> Since the source data and the tables are organized by day, I’ve fired up five parallel instances of rasqlinsert, each doing a month’s data by day ( -R ), and I don’t see any significant change in the elapsed time to insert a day’s data, measured by the difference between SHOW TABLE STATUS’s create_time and update_time. That is just under 950 records a second, so I’ve just changed the value of max_allowed_packet from 1M to 64M. Once the current crop is finished, I’ll restart the database and fire up a single set of inserts to get a baseline.
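>
> A sketch of that timing check from the shell (database and table names follow the
> command above; adjust the date suffix):
>
>     mysql argus -e "SHOW TABLE STATUS LIKE 'ipMatrix_2012_10_27'\G" | egrep 'Create_time|Update_time'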
>  
> --Dave
>  
>  
> From: Carter Bullard [mailto:carter at qosient.com] 
> Sent: Saturday, October 27, 2012 10:34 AM
> To: Dave Edelman
> Cc: 'Argus List'
> Subject: Re: [ARGUS] Database design concerns
>  
> Hey Dave,
> Insertion performance is pretty complex: it depends on the number of columns, the
> number of keys, and the argus data model, which dictates the size of the argus
> records.  Pair that with how fast the clients can read data and prepare records
> for insertion, and with what the client is doing, and you get a very complex set
> of contributors to the insertion rate.
>  
> Clients can read around 100-120K records per second, so that is the upper end.
> The standard mysql insertion takes an ASCII string, so rasqlinsert() reads the
> record, processes it, and then, when it's ready to insert, prints the string
> you've specified with either the " -s fields " option or the RA_FIELD_SPECIFIER
> to a buffer, and adds a printable version of the binary record, if it's included.
>  
> We queue the buffers to another thread, which batches the insertions into the
> largest buffer mysql will accept, as long as there are more records, and then
> calls mysql_real_query() to insert the records.  Pretty expensive, but this is the
> normal way to do it.
>  
> So with standard argus records but no keys, " -m none ", and without, say, TCP
> performance or jitter information, I've gotten around 12-14K records per second.
> But that changes with time: as the table gets bigger, things get slower and more
> expensive CPU-wise for a while, and then it speeds up again.
>  
> Now, rasqlinsert() is much more complicated than this, of course.  When using
> the " -M cache " option, you are doing flow aggregation, with timed insertions,
> updates, and possibly deletions.
>  
> In this case, where there are insertions, updates, and deletions, the queued buffers
> are bunched together based on query type, into the maximum size buffer mysql
> will accept.  As a result, we'll have an interesting mix of insertions, then updates,
> then maybe a series of updates, then insertions, a few deletions, etc.....
>  
> With standard flow keys, " -m srcid saddr daddr proto sport dport ", it's 8-10K?
> But with a lot of medium- and long-lived flows, because we're doing aggregation,
> you may find that you can consume 30-40K flows per second and the database
> can keep up, because the eventual database operations reduce to 8-10K ops.
>  
> So how did your day's worth pan out for you?  How did you run rasqlinsert?
>  
> Carter
>  
> On Oct 26, 2012, at 11:58 PM, "Dave Edelman" <dedelman at iname.com> wrote:
>  
> 
> Carter,
>  
> What type of insertion rate are you able to achieve?  I just pointed rasqlinsert at a day’s worth of one sensor (changed -S localhost:562 to -r *) and it is chugging along.  I have spent no time at all tuning the database (fresh install from source), MyISAM engine, pretty good sized server. Do you have any tips for tuning MySQL for best insertion rates?
>  
> --Dave
>  
>  
>  
>  
>  
> From: argus-info-bounces+dedelman=iname.com at lists.andrew.cmu.edu [mailto:argus-info-bounces+dedelman=iname.com at lists.andrew.cmu.edu] On Behalf Of Carter Bullard
> Sent: Friday, October 26, 2012 11:43 AM
> To: Paul Schmehl
> Cc: Argus List
> Subject: Re: [ARGUS] Database design concerns
>  
> Hey Paul,
> So I thought I'd give you some real data on this type of data strategy.
> This is how I use rasqlinsert() and rasql() together to do what you're describing.
> I run rasqlinsert() as a daemon, reading argus data from all my argus sources.
> Currently I have 28 sources, here at QoSient World Headquarters.
>  
> A single radium attaches to all the sources, and generates a single stream
> of realtime argus data.  I run rastream() to write the data to files, and this
> rasqlinsert() to formulate an IP matrix index:
>  
>    rasqlinsert -M time 1d -S localhost:562 -wmysql://root@localhost/ratop/ipMatrix_%Y_%m_%d \
>           -M cache -m srcid matrix -s ltime dur srcid saddr daddr bytes -d - ip
>  
> This has been running for a little over 20 months, and so I have some
> decent data to work with.
>  
> OK, I'm interested in this address, 208.70.31.236, as I'm looking at live
> data, and it's something I'm not familiar with.  It's owned by " Internet Archive ".
> I have no idea why I would be talking to " Internet Archive ", so how long
> have I been talking to this IP?
>  
> So this is a rasql() call to scan back 365 days and tell me how long
> I have been communicating with this site.
>  
>    rasql -t -365d -r mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d -M time 1d \
>           -M sql="saddr = '208.70.31.236' or daddr = '208.70.31.236'"
>  
> This will find every daily IP address instance for this address in my
> entire repository.  So what's going on?
>  
>    rasql - used to read argus " record "s from database tables
>  
>    -t -365d -  set the start time to 1 calendar year ago; since I
>                don't specify a stop time, stop will be " now ".
>  
>    -r mysql://root@localhost/ratop/ipMatrix_%Y_%m_%d 
>                this sez, use wildcard table names.  You have to
>                specify what the time duration is that creates the names.
>  
>    -M time 1d - This sez use 1 day to formulate the table names.
>  
>    -M sql="saddr = '208.70.31.236' or daddr = '208.70.31.236'"
>                this passes a where clause to the SQL commands
>                to find the IP addresses of interest.
>    
> OK, so how well does it do?  This specific query in my system, covering 450 GB of
> argus data, which is a small repository, returns 169 records covering the entire
> year.  For this repository, there are 365 tables to look at for instances of
> this IP address.  Since I'm interested in how long it's been going on, I'll pipe
> the output to ratimerange() to get the result.  Let's time it:
>  
>  
>   thoth:~ carter% time rasql -t -365d -rmysql://root@localhost/ratop/ipMatrix_%Y_%m_%d -M time 1d \
>          -M sql="saddr = '208.70.31.236' or daddr = '208.70.31.236'" \
>          -w -  | ratimerange
>  
>   2012/04/16.15:55:18.645652 - 2012/10/26.09:37:54.977133
>  
>   real 0m0.816s
>   user 0m0.054s
>   sys  0m0.025s
>  
>  
> Not bad for covering an entire year of data.  This is on a 4-year-old Mac Pro,
> with 2 x 4-core Xeons and 14GB of memory.  So it started April 16, at 3:55:18.645652 PM. 
> Next steps are pretty simple ….
>  
> If I pipe the output to racluster(), I get this:
>  
> thoth:~ carter$ time rasql -t -365d -rmysql://root@localhost/ratop/ipMatrix_%Y_%m_%d \
>                       -M time 1d -M sql="saddr = '208.70.31.236' or daddr = '208.70.31.236'" \
>                       -w - | racluster -s stime dur:16 srcid saddr daddr trans pkts bytes
>  
>                  StartTime              Dur              SrcId            SrcAddr            DstAddr  Trans  TotPkts   TotBytes
> 2012/04/16.15:55:18.646102  16652556.131783      207.237.36.98      207.237.36.98      208.70.31.236    357      714     110786
> 2012/04/16.15:55:18.645652  16652556.134256        192.168.0.1       192.168.0.66      208.70.31.236    357      714     110786
> 2012/04/16.15:55:18.639736  16652556.136862       192.168.0.66       192.168.0.66      208.70.31.236    357      714     110786
>  
> real   0m0.846s
> user   0m0.058s
> sys    0m0.025s
> thoth:~ carter$ 
>  
> Here we find the 3 probes that saw the traffic, the end system, the interior border sensor, and the exterior sensor.
> The ipMatrix table holds aggregates, so we had 169 active days and a total of 357 transactions over 
> the year. Still not sure what it is, so I need to look at the primitive data to figure it out…...
>  
> OK, so we find out it's innocuous, and no big deal, but that didn't take long at all to find out.
> Hope this is useful,
>  
> Carter 
>  
>  
> On Oct 26, 2012, at 6:53 AM, Carter Bullard <carter at qosient.com> wrote:
> 
> 
> 
> 
> Hey Paul,
> It's all about your forensics data model.  You're putting all your data
> into a database without a model, and then using the DB to do a lot
> of work.  Great idea, but for some sites, 120M rows isn't a lot of data.
> When you get into 1.2G rows a day, those queries get 1 or more
> additional 0's at the end of your response times.  That's about 20K
> records / sec.  Just putting the primitive argus data into a mysql table
> gets to be tough to do.  At that point your column count has to go down to
> keep up, and the query taxonomy that the limited flow id's and metrics
> ( columns ) can support goes down.
>  
> So we / I approach this by not putting the primitive records into the
> database.  By building multiple data models in parallel, and putting
> those in the database, we can get good performance on a good
> Cyber Forensics query taxonomy over Petabytes of data, over periods
> of years.  These are the targets I think are useful.
>  
> So, YES, being able to find all the nodes that X is / has talked to is
> really valuable.  And we have methods that can maintain a day-based
> matrix table in realtime.  It's far less expensive than putting every record
> in the database.
>  
>     rasqlinsert -S localhost -m srcid matrix -M cache \
>         -w mysql://user@localhost/db/matrix_%Y_%m_%d \
>         -s srcid stime ltime saddr daddr - ipv4
>  
> For some, the matrix is expressed as CIDR addresses:
>     -m srcid matrix/24
>  
> The resulting table has all the X's <-> Y's, by day.  Now you may find
> that that granularity doesn't do well for you, but you can adjust it with
> the tools we provide: using an option like " -M hard time 1h " and adding
> " stime " to the key, you get hourly granularity in daily tables (sketched
> below).  Adding other metrics can be useful, but we always put a full argus
> record in each row, so we get all the metrics that can be useful without
> beating up the DB.
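>
> A sketch combining those options with the matrix command above to get hourly rows
> in daily tables; how best to compose them is my assumption, not a tested recipe:
>
>     rasqlinsert -S localhost -M cache -M hard time 1h -m srcid stime matrix \
>         -w mysql://user@localhost/db/matrix_%Y_%m_%d \
>         -s srcid stime ltime saddr daddr - ipv4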
>  
> The idea is that this table can be done in parallel with other data models
> that are also very useful.
>  
> On the web site, we talk about also having an IP address inventory, that
> you can have going, concurrently, in real time:
>  
>       rasqlinsert -S localhost -M rmon -m srcid smac saddr -M cache \
>         -w mysql://user@localhost/db/addrs_%Y_%m_%d \
>         -s srcid stime ltime smac saddr - ipv4
>  
> Now this table provides you with ethernet / IP pairs, which will give you
> some topology information on where X came from.  This table has a
> bonus: it will tell you whether your IP came from a router / switch
> or from an end system, since you can find all the MAC addresses in your
> infrastructure and how many IP's each MAC addr is supporting.  That helps you
> find VM's, wireless access points, and spoofing sources.
>  
> With all queries in these data models returning start and end times, 
> you should be able to get to the original argus records ( primitive data ), which are
> stored in time-based files, very quickly.  And you should be able to
> query other data models, bounded by some relevant time-based queries, or these
> same models at other sites, to do some really good forensics work, even interactively.
>  
> So when you get an interesting event involving some odd-ball X at time T, you
> generally want to know who X was talking to at time (T - Tc) - (T + Tc), where Tc is like
> 60-300 seconds.  But the next question is almost always: when did I first see X?  That means
> going back at least 1 month, though a year is much better.  And then it's always valuable
> to know when X first started talking to the specific Y's that are involved in the current
> event, and who those Y's were talking to.
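>
> The " when did I first see X " question is the rasql pattern shown earlier in this
> thread; a sketch, with X standing in for the address of interest:
>
>     rasql -t -365d -r mysql://user@localhost/db/matrix_%Y_%m_%d -M time 1d \
>           -M sql="saddr = 'X' or daddr = 'X'" -w - | ratimerange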
>  
> The whole point is to find the initial penetration event, and attempt to characterize
> the stepping stones ( Markov chain ), to tell you how bad it is.
>  
> Generally, these types of cross-table queries, when the tables are already built,
> are much faster than your current response times.  The important fields are already
> indexed, because they are keys ( the -m fields specify the DB keys ).
>  
> So what kind of data models do you think may be useful ?  That would be an
> interesting conversation, I believe.
>  
> Carter
>  
>  
>  
> On Oct 26, 2012, at 12:48 AM, Paul Schmehl <pschmehl_lists at tx.rr.com> wrote:
> 
> 
> 
> 
> --On October 26, 2012 12:21:58 AM -0400 Carter Bullard <carter at qosient.com> wrote:
> 
> 
> 
> 
> Hey Paul,
> All the rasql* tools know how to cross day boundaries to find data, so it
> works well for ra*. But, that doesn't mean that there aren't better ways
> of doing things.
> 
> No reason, ..., nobody has asked for column indexing with rasql*
> programs.  Probably because it does slow down record insertion.  You have
> decided to not insert the binary argus record into the database, so you
> don't need any of the rasql* tools to do your database operations.
> 
> What kind of queries are you doing ?
> 
> 
> Typically the question we're trying to answer is something like this:
> 
> Show me which saddrs connected to a given daddr on a given port between one time and another.
> 
> Specific example would be for Torpig.
> 
> Which IPs connected to 8.8.8.8 on port 53 between 8AM and 12 Noon yesterday.
> 
> We might occasionally want to search for all connections for one host over several days.
> 
> The problem we're trying to solve is that searches on the data using ra take a very long time.  We're archiving the logs in four-hour increments, bzipped, and a search of one four-hour period will take 20 minutes or so. A search through a whole day's logs (6 files) would take several hours.  A typical day is between 50 and 75 GB compressed, 90 to 120 GB uncompressed.
> 
> By putting the critical information into a database we can locate the connections more quickly, then use ra with the -t switch to hopefully find the data more quickly.
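>
> For example, a sketch of the second step; the archive layout and the exact -t
> range syntax here are assumptions, with the times taken from the DB hit:
>
>    bzcat /archive/argus.2012.10.25.08.ra.bz2 | \
>        ra -r - -t 2012/10/25.08:00-2012/10/25.12:00 - dst host 8.8.8.8 and port 53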
> 
> A typical query would be something like this:
> 
> SELECT * from table where saddr between x and y and daddr = z and stime between t1 and t2.
> 
> I just ran that query and it took 25 seconds.  Dropping the saddr range and just searching for all connections to that destination dropped the time to 1 second.  That same query on a four-hour argus file would have taken 20 minutes or so.
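>
> Since this table isn't maintained by the rasql* tools, a plain secondary index on
> the hot columns should help the first query; a sketch, with db and table names as
> placeholders:
>
>    mysql argusdb -e "CREATE INDEX daddr_stime_idx ON flows (daddr, stime)"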
> 
> Paul Schmehl, Senior Infosec Analyst
> As if it wasn't already obvious, my opinions
> are my own and not those of my employer.
> *******************************************
> "It is as useless to argue with those who have
> renounced the use of reason as to administer
> medication to the dead." Thomas Jefferson
> "There are some ideas so wrong that only a very
> intelligent person could believe in them." George Orwell
> 
> 
> 
>  
>  
