Argus Database.

Carter Bullard carter at qosient.com
Tue Mar 15 08:57:45 EST 2005


On the sorting front, there are several tricks that
were used in the beginning of time to sort large
data sets with little memory using tape.  Knuth has a
lot of tricks for doing this type of thing in
volume 3 of The Art of Computer Programming,
"Sorting and Searching".  My position is to not use
large files, but to keep them small from the beginning.

I can recommend a few things.  2 pass sorting.  In
the client library there is support for multi-pass
processing of data (all the clients in the examples
process data in only one pass).  With 2 passes, you
can read each record once, do some housekeeping on
how far out of order the data is, and make some
decisions as to what strategy you might use for
the sort.
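
As a rough illustration, a first pass could look something like
the perl below.  It assumes the records have already been dumped
to ASCII (with ra, say) and that the start time is a numeric
value in a known column; both the column position and the time
format are assumptions you would adjust for your own output.

#!/usr/bin/perl
# Pass 1: read every record once and measure how far out of order
# the stream is, so pass 2 can pick a strategy.
use strict;
use warnings;

my $TIME_FIELD = 0;    # assumption: column holding the numeric start time

my ($total, $out_of_order, $max_lag) = (0, 0, 0);
my $running_max;       # newest start time seen so far

while (my $line = <>) {
    my @f = split ' ', $line;
    my $t = $f[$TIME_FIELD];
    next unless defined $t && $t =~ /^\d+(\.\d+)?$/;
    $total++;
    if (defined $running_max && $t < $running_max) {
        # this record arrived after something newer than itself
        $out_of_order++;
        my $lag = $running_max - $t;
        $max_lag = $lag if $lag > $max_lag;
    } else {
        $running_max = $t;
    }
}

printf "%d of %d records out of order, worst lag %.6f seconds\n",
    $out_of_order, $total, $max_lag;

A small worst-case lag says a small in-memory window is enough in
pass 2; a huge one says go straight to an external merge sort.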

If the sort is based on time, and the data comes from
a probe, then the degree of "out of order" is pretty
low, and as a result you don't need to keep a lot of
records in memory at one time.  But if you're trying
to sort on, say, port number or base TCP sequence number,
then the degree of "out of order" can be extensive.
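
For that nearly time-sorted probe case, pass 2 can get away with a
small window sized from the worst lag measured in pass 1.  A minimal
sketch, with the same made-up time-field layout as the pass 1 example:

#!/usr/bin/perl
# Pass 2 for the nearly-sorted case: hold a small window of records
# and emit anything a late arrival can no longer displace.
use strict;
use warnings;

my $TIME_FIELD = 0;     # assumption: same numeric start-time column as pass 1
my $MAX_LAG    = 5.0;   # assumption: worst lag from pass 1, plus some slack

my @buffer;             # elements are [ start time, original line ]
my $newest = 0;

while (my $line = <>) {
    my @f = split ' ', $line;
    my $t = $f[$TIME_FIELD];
    next unless defined $t;
    push @buffer, [ $t, $line ];
    $newest = $t if $t > $newest;

    # Anything older than ($newest - $MAX_LAG) cannot be beaten by a
    # later record, so it can be written out now, in order.
    @buffer = sort { $a->[0] <=> $b->[0] } @buffer;
    while (@buffer && $buffer[0][0] <= $newest - $MAX_LAG) {
        my $rec = shift @buffer;
        print $rec->[1];
    }
}

# drain whatever is still buffered at end of input
for my $rec (sort { $a->[0] <=> $b->[0] } @buffer) {
    print $rec->[1];
}

Re-sorting the buffer on every record is lazy; a heap would avoid
it, but for probe data the window stays small enough that it
hardly matters.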

So, if we want to tackle the sorting problem, I'd
recommend that we think about 2 pass processing, and
use some of Knuth's tricks for sorting with a little
bit of memory.
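
The classic tape-era trick for the badly out-of-order case is an
external merge sort: sort runs that fit in memory, spill them to
temporary files, and merge the runs.  A minimal perl sketch; taking
the sort key from the first field is just an assumption, swap in
port, sequence number, or whatever you are sorting on:

#!/usr/bin/perl
# External merge sort: bounded memory, any degree of disorder.
use strict;
use warnings;
use File::Temp qw(tempfile);

my $RUN_SIZE = 100_000;      # records held in memory at once
my (@run, @runfiles);

sub spill {
    return unless @run;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $_->[1] for sort { $a->[0] <=> $b->[0] } @run;
    seek $fh, 0, 0;          # rewind the run so it can be merged
    push @runfiles, $fh;
    @run = ();
}

while (my $line = <>) {
    my ($key) = split ' ', $line;    # assumption: numeric key in field 1
    push @run, [ $key, $line ];
    spill() if @run >= $RUN_SIZE;
}
spill();

# Merge: repeatedly emit the smallest head-of-run record.
my @head = map { read_one($_) } @runfiles;
while (1) {
    my $min;
    for my $i (0 .. $#head) {
        next unless defined $head[$i];
        $min = $i if !defined $min || $head[$i][0] < $head[$min][0];
    }
    last unless defined $min;
    print $head[$min][1];
    $head[$min] = read_one($runfiles[$min]);
}

sub read_one {
    my ($fh) = @_;
    defined(my $line = <$fh>) or return undef;
    my ($key) = split ' ', $line;
    return [ $key, $line ];
}

RUN_SIZE is the knob that controls the memory footprint; everything
else lives in the temporary files.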

Carter
 

-----Original Message-----
From: owner-argus-info at lists.andrew.cmu.edu
[mailto:owner-argus-info at lists.andrew.cmu.edu] On Behalf Of Peter Van Epp
Sent: Sunday, March 13, 2005 11:45 PM
To: argus-info at lists.andrew.cmu.edu
Subject: Re: [ARGUS] Argus Database.

	Hmmm, that may be worth trying.  When I read about tie'ing, though,
it seemed to indicate that the hash was still in memory, it just also went
to disk, which seemed to mean I'd have the same problem (exhaustion of the
in-memory portion of the hash), but that may just be a case of unclear
documentation or an unclear reader :-).
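
(For concreteness, the tie'ing under discussion looks roughly like
the perl below.  DB_File is one common backend; the file name and
the flattened key layout are made up for illustration.)

#!/usr/bin/perl
# Tie the flow hash to a disk file so the table lives in
# /var/tmp/flows.db rather than in the perl process's memory.
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

my %flows;
tie %flows, 'DB_File', '/var/tmp/flows.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "tie failed: $!";

# use it like an ordinary hash; updates are written through to disk
my $key = join '|', '10.0.0.1', '192.168.1.5', 80;   # made-up key layout
$flows{$key}++;

untie %flows;

Values have to be plain scalars, which is why nested hashes need to
be flattened into a single key first.
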
	The usual problem is a wide-ranging port scan producing large numbers
of single flows to different hosts.  The index tends to blow up, and while
adding memory would help to some extent, it is still possible to exhaust it
(exhausting disk would be much harder, since disk can easily be much, much
larger).

While the more-memory trick would fix me for now, it wouldn't help someone
like Eric with 5 or 10 times my traffic, and a general solution would be
more desirable.

Peter Van Epp / Operations and Technical Support 
Simon Fraser University, Burnaby, B.C. Canada

> 
> While this isn't a bad idea, I think you should try some simple approaches
> to solve the memory problem before going the whole way to using mysql.
> (And I'm a big mysql guy, so this isn't just mysql bashing.)
> 
> In particular, if you're storing lots of data in hashes, try tie'ing those
> hashes to files on disk, so they don't eat up your memory.  You may have to
> restructure your data format a bit to do this, if you're currently using
> nested hashes, but it may be worth the effort.  tie'ing to a file actually
> gets around some memory (mis)management problems with perl.  We've seen
> code that was running a machine out of memory with an in-memory hash result
> in only a few-megabyte file on disk when tie'd.
> 
> If you still want to go the database approach, I found this page in google
> that indicates that someone else may have already done a bunch of the work
> you're looking for:
> <http://article.gmane.org/gmane.network.argus/2626>
> 
> 
> -David
> 
> David Nolan                    <*>                    vitroth+ at cmu.edu
> curses: May you be forced to grep the termcap of an unclean yacc while
>      a herd of rogue emacs fsck your troff and vgrind your pathalias!





