[flow-tools] performance question [resend]
Craig A. Finseth
fin@finseth.com
Fri, 24 Jan 2003 10:14:36 -0600 (CST)
> > We have installed a netflow collection system and I have developed a
> > set of scripts using flow-tools to analyze the collected data. So
> > far, so good.
> >
> > The problem is, it currently takes about 3 days to analyze each day's
> > data (on a 900 MHz top-of-the-line Sparc-something-impressive).
> Could you give a few more details on what your scripts look like to
> analyze the data? For example, are you running all the data
> through flow-tag and flow-nfilter 500 times?
As I mentioned in an earlier message (which may have yet to wend its way
through the queues...), this problem has been fixed by writing tailored
code for flow-tag.
Basically, we are collecting netflow data on about 20 routers. This
data is in:
/netflow/<router name>/<flow files at 15 minute intervals>
(I'm simplifying and paraphrasing, but the gist will be correct.)
The data is copied to
/netflow/filtered/<router name>/<flow files at 15 minute intervals>
by passing it through a chain of three commands:
flow-nfilter -- remove duplicate flows
flow-tag[*] -- add tags for customer and Internet data
flow-nfilter -- remove all non-tagged flows
The data is then merged into:
/netflow/merged/<flow files at 15 minute intervals>
so that there is one set of files for the whole system.
[*] This is the step that originally took about 25% of the overall
time; I have since sped it up by a factor of 20 or so, and its runtime
is no longer an issue.
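Concretely, the chain and merge described above might be wired as one
pipeline per router. The config file names (filters.cfg, tags.cfg), the
definition names, and the router names below are invented for
illustration, not taken from the actual setup:

```shell
# Hypothetical sketch of the per-router chain; file names, router
# names, and filter/tag definition names are assumptions.
for name in router1 router2; do
    mkdir -p "/netflow/filtered/$name"
    flow-cat "/netflow/$name" \
      | flow-nfilter -f filters.cfg -F dedup  \
      | flow-tag -t tags.cfg -T customers     \
      | flow-nfilter -f filters.cfg -F tagged \
      > "/netflow/filtered/$name/ft-all"
done
# Merge every router's filtered data into one set of files:
flow-cat /netflow/filtered/*/ft-all > /netflow/merged/ft-all
```

The point of the two flow-nfilter passes is that the second one can be
a trivial "tag is nonzero" test, so untagged flows never reach the
merged dataset.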
> One way to make this fast is to tag the data once and use flow-split -g
> or -G to create smaller datasets for each of the 500 customers. Then
> run the various reports on each of the customer datasets.
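Assuming flow-split's -g splits on the tag field and -o names the
output files (the flags are named above; their exact semantics here are
an assumption), the per-customer split might look like:

```shell
# Hypothetical: split the tagged, merged data by tag so each of the
# 500 customers gets its own small dataset (flag semantics assumed).
mkdir -p /netflow/bytag
flow-cat /netflow/merged/* | flow-split -g -o /netflow/bytag/ft-
```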
I couldn't get flow-split to work for me until yesterday, when I found
and fixed a couple of bugs.
The problem with flow-split is that I can't reconstruct the sample times.
For example, if I were to use flow-cat and put all of a day's data into
one file, I could not use flow-split to get it out again. Even if I do:
flow-split -T 900
for example, it does not generate 96 files at 15-minute intervals. Rather,
intervals with zero flows are skipped and the following interval may not
start on a 15-minute boundary. And, without knowing the intervals, I can't
graph the data. Perhaps if flow-split had a "-clock" option to generate
files according to clock time, I could use it.
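The clock-time behavior wanted here is plain integer arithmetic. The
sketch below is not flow-tools code, and the "-clock" option does not
exist; it only shows what such an option would compute: every timestamp
maps to a fixed 900-second clock boundary, and a day always yields 96
boundaries, empty or not.

```shell
# Round an epoch timestamp down to its 15-minute (900 s) clock boundary.
align() { echo $(( $1 / 900 * 900 )); }

midnight=1043366400            # example: 00:00 UTC, 24 Jan 2003
n=0
t=$midnight
while [ "$n" -lt 96 ]; do
    : # a clock-aware split would emit one file for $t .. $((t + 900))
    t=$(( t + 900 ))
    n=$(( n + 1 ))
done
echo "$n intervals; next day starts at $t"
align 1043421276               # -> 1043420400, a 15-minute boundary
```

Because the interval start is derived from the clock rather than from
the first flow seen, empty intervals still get a (zero-flow) slot, which
is exactly what graphing needs.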
> Something else you should look at is the compile options. Flow-tools
> by default is built with only -g. Replace this with -O or -O2 and
> things will really speed up.
I will check into this.
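For an autoconf-style build like flow-tools', the usual way to override
the default -g is via CFLAGS; whether this package's Makefiles honor it
exactly this way is an assumption:

```shell
# Assumed sketch: rebuild with optimization instead of the default -g;
# autoconf-generated configure scripts generally pick up CFLAGS.
CFLAGS="-O2" ./configure
make clean && make
```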
...
> Also, your flow-tag file is not organized well. All the customers should
> be listed in one tag-action, each with a different tag. The way you're
> doing things now, there are 500 tries instead of one, and on average 250
> trie lookups are done per flow. You only need one!
I'm sure you are right. However, the documentation does not provide
any clue that this approach will work. And, since it is a solved problem,
I'm not going to rework it at this time.
Craig