mysql database organization and schema

Tue Jun 26 03:31:26 EDT 2012

As one of those "Stanford guys", I've spent a fair amount of time over
the last half decade putting argus flows into a variety of relational databases
In each RDBMS I've encountered my initial attempts left lots of room for improvement
so, as usual, one should planon devoting substantial time to prototyping and measurement.
>From the dialogue so far, I can see that you are already tuned in to many of
the important issues

Since my interests have revolved around building interactive tools, most of my effort
goes into performance analysis. Organizing the DB for performance is critically dependent
on the queries your apps make. If these are disparate enough, you may find that you need
to store information a couple of different ways.  Fortunately, since flow records are unchanging,
replication isn't the issue it would be in a typical transaction processing system.

Another thing to keep in mind is that the major costs in computation today
are in moving the data from storage across busses and/or networks into RAM and
finally into various caches where it can finally be acted upon.  Whenever
possible, you want to avoid moving data that you won't actively reference.
Over time, this has driven me to use column-oriented stores as my typical
queries depend of filters which reference only a few of the several dozen
fields in a flow record.  This took me away from MySQL a few years ago,
but now there are column-oriented and in-memory engines available.

To minimize data movement, I try to store flow fields in whatever DB type minimizes
its storage.  Since most flow fields are unsigned numerics, MySQL's native support
makes things easy and so I agree with making saddr and daddr unsigned ints and then
use the inet_ntoa()/inet_aton() for conversions. Having them as ints also makes
it easier to express range queries and CIDR masks. However, representing temporal
values is more complicated as MySQL's native storage types limit precision to seconds.
Carter's choice of double precision unixtime is reasonable as the 52 bits of
precision allows lossless retention of microsecond precision and mysql

Using MySQL partitioning looks like it's worth trying, especially if you hoping
to keep up with a 10G link. Make sure you do some simple experiments first and
be careful because all cols of each unique key must also appear in the partitioning key
and you're proposing to use an auto-increment column as a primary key, see:
  http://dev.mysql.com/doc/refman/5.1/en/partitioning-limitations-partitioning-keys-unique-keys.html

Hmmmmm....there a lot of other things to think about, e.g. immediate versus
archival tables, but I've rattled on long enough for one message.

/John

--
John Gerth      gerth at graphics.stanford.edu  Gates 378   (650) 725-3273

On 6/25/12 8:04 AM, CS Lee wrote:
> hi Carter,
> 
> It's a trade off because I'm not keeping argus primitive data, I'm trying to minimize the fields I need to keep already ;)
> 
> This is just the starting work of handling 10G with mysql, my plan is to have mysql handling first hand data, then sync them with hadoop/hive to
> perform big data analysis. Nothing is final, and I would like to hear some thoughts from other argus users in the list as well. 
> 
> By the way Carter, do you have plan to make argus client multi-threaded as well to have better performance in term of analysis, I know this is not
> trivial but just asking.
> 
> Cheers!
> 
> 
> On Mon, Jun 25, 2012 at 10:51 PM, CS Lee <geek00l at gmail.com <mailto:geek00l at gmail.com>> wrote:
> 
>     hi Carter,
> 
>     Yeah initially when I thought of using stime(it's double now and I can alter it to integer anyway) for mysql partition and I may lose fraction of
>     second, then we have dur field in argus which is more useful to know(is it short or long flow) and that makes me think of changing the strategy.
> 
>     Mysql partition doesn't play well with string data type, and i want to check how well it performs compare to default schema that we currently
>     have(providing that the changes of sadddr,daddr,sport and dport changes applied). I'm also looking into have autoid field as the key as well since
>     it is sequential incremental field.
> 
....
> 
>     On Mon, Jun 25, 2012 at 10:10 PM, Carter Bullard <carter at qosient.com <mailto:carter at qosient.com>> wrote:
> 
>         Time is currently defined as a double, so that it can get the fractional part,
>         so you may not need to change that representation, unless double is not
>         a good performer?
> 
>         Please keep sending your preferences for data formats to the list so I can make
>         appropriate changes.  The Stanford guys started this effort by asking for addresses
>         to be in decimal, and with your list, we should be able to do a good job on
>         preliminary database work.  They are using some very high performance
>         commercial databases for their work, and seem rather happy, but that doesn't
>         cover Stanford's 10G monitoring efforts.
> 
>         I will put in the database schema creation support for these formats next, but
>         that will be in argus-clients-3.0.7.1, which i should put up later this week.
> 
>         You will find that the db will not be able to keep up with stuffing primitive data
>         into the database, as the network and argus can generate more data than the
>         db can accept.  But maintaining derived data, like an ethernet/ip address table
>         with rasqlinsert() is no problem, or an ethernet matrix table is a no brainer.
> 
>         I'd be interested in any results that you discover !!!!!
> 
>         Carter
> 
> 
>         On Jun 25, 2012, at 10:00 AM, CS Lee wrote:
> 
>>         hi Carter,
>>
>>         After applying latest patch from you, here's the result -
>>
>>         mysql> SELECT stime, saddr, sport, daddr, dport, proto, state FROM tbl_test;
>>         +-------------------+------------+-------+------------+-------+-------+-------+
>>         | stime             | saddr      | sport | daddr      | dport | proto | state |
>>         +-------------------+------------+-------+------------+-------+-------+-------+
>>         | 1340623505.000000 | 3232235779 <tel:3232235779> |     8 | 3512053643 <tel:3512053643> |  5923 | icmp  | ECO   |
>>         +-------------------+------------+-------+------------+-------+-------+-------+
>>
>>         I will run more testing on other data later. My other plan is to change stime to be integer as well so that it will be very efficient for
>>         partition strategy. String data type can't be used for mysql partition. Currently my tweaking and fine tuning is gearing toward high speed
>>         mysql db for argus and see if my strategy works better than default schema. 
>>
>>         Thank you!
>>