[flow-tools] performance question [resend]

Mark Fullmer <maf@eng.oar.net>
Tue, 28 Jan 2003 15:29:23 -0500


On Fri, Jan 24, 2003 at 10:14:36AM -0600, Craig A. Finseth wrote:

> As I mentioned in an earlier message (which may have yet to wend its way
> through the queues...), this problem has been fixed by writing tailored
> code for flow-tag.

What did you do to flow-tag?  Is this something that would be of general
use?  Is it faster / more functional than the suggestion I made earlier
about reformatting the config file?  Your original post effectively
boiled down to

  foreach customer   # 500 customers
    patricia_trie_lookup()
  done

where all that's necessary is a single

  patricia_trie_lookup()
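
For the archives, here's roughly what I meant by reformatting the config
file: one tag-action holding every customer.  The prefixes and tag values
below are invented for illustration; check flow-tag(1) for the exact
syntax on your snapshot.

  tag-action customers
   type src-prefix
   # one match line per customer, all in the same action
   match 10.1.0.0/16 set-src 0x01
   match 10.2.0.0/16 set-src 0x02

  tag-definition tag-customers
   term
    action customers

A single tag-action builds a single trie, so each flow costs one lookup
no matter how many customers are listed.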

You can send any fixes to the list and I'll try to get them integrated into
the next snapshot.

On the split timing problem...I need to think about this a little more, but
I think it's a bug.  flow-split is probably assuming the flow stream never
goes quiet for more than one split period.  The time-series option in
flow-report probably behaves the same way.
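
If it's what I suspect, the fix is to derive each split boundary from the
flow's own timestamp instead of stepping forward from the previous
boundary.  A sketch of the two approaches, for illustration only -- this
isn't flow-split's actual code:

  /* suspected bug: assumes the stream never goes quiet for > 1 period */
  next_boundary = prev_boundary + period;

  /* clock-aligned: compute the boundary from the flow's unix seconds,
   * so files start on period boundaries even across gaps in the data */
  boundary = unix_secs - (unix_secs % period);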

mark


> 
> Basically, we are collecting netflow data on about 20 routers.  This
> data is in:
> 
> 	/netflow/<router name>/<flow files at 15 minute intervals>
> 
> (I'm simplifying and paraphrasing, but the gist will be correct.)
> 
> The data is copied to
> 
> 	/netflow/filtered/<router name>/<flow files at 15 minute intervals>
> 
> by passing it through a chain of three commands:
> 
> 	flow-nfilter	-- remove duplicate flows
> 	flow-tag[*]	-- add tags for customer and Internet data
> 	flow-nfilter	-- remove all non-tagged flows
> 
> The data is then merged into:
> 
> 	/netflow/merged/<flow files at 15 minute intervals>
> 
> so that there is one set of files for the whole system.
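
[mf: for anyone following along, a chain like the above can be run as a
single pipe per router.  The filter and tag definition names here are
placeholders, not Craig's actual config:

  flow-cat /netflow/rtr1 |
    flow-nfilter -f filters.cfg -F dedup |
    flow-tag -t tags.cfg -T customers |
    flow-nfilter -f filters.cfg -F tagged-only > ft-rtr1-filtered
]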
> 
> [*] This is the step that originally took about 25% of the overall
> time and that I have since sped up by a factor of 20 or so.  Its run
> time is no longer an issue.
> 
>    One way to make this fast is to tag the data once and use flow-split -g
>    or -G to create smaller datasets for each of the 500 customers.  Then
>    run the various reports on each of the customer datasets.
> 
> I couldn't get flow-split to work for me until yesterday, when I found
> and fixed a couple of bugs.
> 
> The problem with flow-split is that I can't reconstruct the sample times.
> For example, if I were to use flow-cat and put all of a day's data into
> one file, I could not use flow-split to get it out again.  Even if I do:
> 
> 	flow-split -T 900
> 
> for example, it does not generate 96 files at 15-minute intervals.  Rather,
> intervals with zero flows are skipped and the following interval may not
> start on a 15-minute boundary.  And, without knowing the intervals, I can't
> graph the data.  Perhaps if flow-split had a "-clock" option to generate
> files according to clock time, I could use it.
> 
>    Something else you should look at is the compile options.  Flow-tools
>    by default is built with only -g.  Replace this with -O or -O2 and
>    things will really speed up.
> 
> I will check into this.
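
[mf: with the usual autoconf setup, something like the following should
do it; exact CFLAGS handling can vary by release:

  CFLAGS="-O2 -g" ./configure
  make clean && make
]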
> 
> 	...
>    Also, your flow-tag file is not organized well.  All the customers should
>    be listed in one tag-action, each with a different tag.  The way you're
>    doing things now there are 500 tries instead of one, and on average 250
>    trie lookups are done per flow.  You only need one!
> 
> I'm sure you are right.  However, the documentation does not provide
> any clue that this approach will work.  And, since it is a solved problem,
> I'm not going to rework it at this time.
> 
> Craig