mysql db performance consdierations

John Gerth gerth at graphics.stanford.edu
Thu Jun 28 19:33:52 EDT 2012


A few  quick observations here related to performance (not that performance
should necessarily be the determining factor in schema design).

...STORAGE ENGINES
Although my experience was with MySQL 5.0 and so may have been moot, the
MYISAM engine was a much better performer in write-once, read-forever world
that argus wants.  Another advantage of MYISAM is that a 'database' is literally
a Linux directory and a table is a small set of ordinary files.  Furthermore,
there's no metadata elsewhere so you can use traditional Linux commands to
rename and copy items around.  This is very handy for backups, experiments,
and a host of other things.  MySQL 5.0 also had MERGE tables which were a
bit like partitions in that several tables appeared to be a single one.
Their main drawback was that a query of the MERGE table had to look at
all its members (I think partitions may be smarter than that). Anyway,
the MERGE table definitions were just a plaintext ASCII file so I would
create them on-the-fly, e.g. to search two days out of a month, or to
keep around handy table aliases like "today" and "yesterday" so I could build
apps with unchanging queries.

...DATATYPES
The schema below still has many opportunities to save storage. Obvious ones include
storing Mac addresses as BIGINT (8 bytes rather than 24); 'proto' as a TINYINT UNSIGNED (1 byte);
count fields as INT UNSIGNED - getting >4G bytes in a 5 sec reporting interval would be impressive.
Likewise 'dur' as a FLOAT (32 bits will easily keep microsecond precision for a reporting interval).
When character data is fixed size, like country codes, use CHAR and not VARCHAR.
Anyway, nothing surprising here - compact, fixed length data will save space and load faster.

Allowing a column to be NULL affects performance (and storage) so any column which
must exist in the flow should be declared NOT NULL in the schema.

...VIEWS
You can augment the base table with VIEWS to provide computed columns.  Such columns can
be handy when you want to look at tables with generic SQL tools that don't know about flow data.
For example, you could use views to convert UNIX epoch values in time fields into standard
SQL date and time types.  The idea here is that computation is cheap.  Fetching data from
disk is expensive.

...INDEXING
More importantly, index maintenance can be very expensive so you need to think about that carefully.
It's so expensive that people often don't build indices on live tables, only on historical ones.

The point of a primary index is to avoid having multiple records when only one record was intended.
This is critical for a transaction system, but pretty much irrelevant for an event logging system
like argus which is basically an append-only world. (The schema below defines a multi-column primary key
although the autoid field should be unique already) Will you ever use the autoid field in a query?

The real utility of an index for argus is to speed up a query.  Generally, a DB system will use
at most one index in a query.  The normal recommendation is to build an index when
the number of distinct values is small compared to the total number of rows (so that the DB
can skip reading whole swaths of the table.  Because of this, a DB's query optimizer often ignores
an index and will fall back to a full table scan if the estimate is that more than a few percent of
the table must be looked at.  You can and should use the EXPLAIN statement to see whether a query will
make use of an index.

That said, an index can be great.  You may actually find it helpful to synthesize an index from column values.
For example, you might create an index for the /24 values of your local IPs so that you
can easily extract flows by "subnet".  Or one for the remote ASNs in flows.  The good
news is that you can add and delete an index after populating the table.

The bottom line for performance though is that your decisions should be driven
by the actual nature of the queries you plan to use.

--
John Gerth      gerth at graphics.stanford.edu  Gates 378   (650) 725-3273 fax 725-6949

On 6/28/2012 6:32 AM, CS Lee wrote:
> hi Carter,
> 
> This is the one I'm currently testing -
> 
> mysql> SHOW CREATE TABLE tbl_test \G;
> *************************** 1. row ***************************
>        Table: tbl_test
> Create Table: CREATE TABLE `tbl_test` (
>   `autoid` int(11) NOT NULL AUTO_INCREMENT,
>   `stime` int(10) unsigned DEFAULT NULL,
>   `srcid` varchar(64) NOT NULL DEFAULT '',
>   `proto` varchar(16) NOT NULL,
>   `saddr` int(10) unsigned NOT NULL DEFAULT '0',
>   `sport` smallint(5) unsigned NOT NULL DEFAULT '0',
>   `dir` varchar(3) DEFAULT NULL,
>   `daddr` int(10) unsigned NOT NULL DEFAULT '0',
>   `dport` smallint(5) unsigned NOT NULL DEFAULT '0',
>   `pkts` bigint(20) DEFAULT NULL,
>   `bytes` bigint(20) DEFAULT NULL,
>   `appbytes` bigint(20) DEFAULT NULL,
>   `state` varchar(32) DEFAULT NULL,
>   `flgs` varchar(32) DEFAULT NULL,
>   `dur` double(18,6) NOT NULL,
>   `spkts` bigint(20) DEFAULT NULL,
>   `dpkts` bigint(20) DEFAULT NULL,
>   `sbytes` bigint(20) DEFAULT NULL,
>   `dbytes` bigint(20) DEFAULT NULL,
>   `sappbytes` bigint(20) DEFAULT NULL,
>   `dappbytes` bigint(20) DEFAULT NULL,
>   `smaxsz` smallint(5) unsigned DEFAULT NULL,
>   `dmaxsz` smallint(5) unsigned DEFAULT NULL,
>   `sminsz` smallint(5) unsigned DEFAULT NULL,
>   `dminsz` smallint(5) unsigned DEFAULT NULL,
>   `smac` varchar(24) DEFAULT NULL,
>   `dmac` varchar(24) DEFAULT NULL,
>   `sco` varchar(2) DEFAULT NULL,
>   `dco` varchar(2) DEFAULT NULL,
>   `sas` int(10) unsigned DEFAULT NULL,
>   `das` int(10) unsigned DEFAULT NULL,
>   PRIMARY KEY (`autoid`,`srcid`,`proto`,`saddr`,`sport`,`daddr`,`dport`)
> ) ENGINE=InnoDB AUTO_INCREMENT=910612 DEFAULT CHARSET=latin1
> 
> I haven't checked through the default data type for other field yet unless I'm including them. The most obvious changes here are stime,
> saddr,daddr,sport,dport are all integer and not string data type anymore.
> 
> By the way regarding your question about whether command line or rarc taking precedence, I'm with you where command line specification is always priority.
> 
> -- 
> Best Regards,
> 
> CS Lee<geek00L[at]gmail.com <http://gmail.com>>
> 
> http://geek00l.blogspot.com
> http://defcraft.net



More information about the argus mailing list