[flow-tools] performance question [resend]
Mark Fullmer
maf@eng.oar.net
Tue, 28 Jan 2003 15:29:23 -0500
On Fri, Jan 24, 2003 at 10:14:36AM -0600, Craig A. Finseth wrote:
> As I mentioned in an earlier message (which may have yet to wend its way
> through the queues...), this problem has been fixed by writing tailored
> code for flow-tag.
What did you do to flow-tag? Is this something that would be of general
use? Is it faster / more functional than the suggestion I made earlier
about reformatting the config file? Your original post effectively
boiled down to
foreach customer # 500 customers
patricia_trie_lookup()
done
where all that is necessary is a single
patricia_trie_lookup()
Send any fixes to the list; I'll try to get them integrated into the
next snapshot.
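For reference, the consolidated layout I had in mind looks roughly like this. The prefixes and tag values are made up, and the grammar is from memory, so check flow-tag(1) for the exact syntax before copying it:

```
# one tag-action holding every customer: one trie, one lookup per flow
tag-action all-customers
 type source-prefix
 match 10.1.0.0/16 set-src 100    # customer A
 match 10.2.0.0/16 set-src 200    # customer B
 # ... one match line per customer, each with its own tag
```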
On the split timing problem... I need to think about this a little more, but
I think it's a bug: flow-split is probably assuming the flow stream never
goes quiet for more than one split period. The time-series option in
flow-report probably behaves the same way.
mark
>
> Basically, we are collecting netflow data on about 20 routers. This
> data is in:
>
> /netflow/<router name>/<flow files at 15 minute intervals>
>
> (I'm simplifying and paraphrasing, but the gist will be correct.)
>
> The data is copied to
>
> /netflow/filtered/<router name>/<flow files at 15 minute intervals>
>
> by passing it through a chain of three commands:
>
> flow-nfilter -- remove duplicate flows
> flow-tag[*] -- add tags for customer and Internet data
> flow-nfilter -- remove all non-tagged flows
>
> The data is then merged into:
>
> /netflow/merged/<flow files at 15 minute intervals>
>
> so that there is one set of files for the whole system.
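The chain described above would look something like this for one router. Paths, filter and tag definition names are placeholders, following the usual flow-tools convention that a lowercase option names the config file and the matching uppercase option names the definition inside it:

```shell
flow-cat /netflow/rtr1/ft-* \
  | flow-nfilter -f filters.cfg -Fdedup \
  | flow-tag     -t tags.cfg    -Tcustomers \
  | flow-nfilter -f filters.cfg -Ftagged-only \
  > /netflow/filtered/rtr1/ft-combined
```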
>
> [*] This is the step that originally took about 25% of the overall
> time and that I have sped up by a factor of 20 or so. Time is now not
> an issue.
>
> One way to make this fast is to tag the data once and use flow-split -g
> or -G to create smaller datasets for each of the 500 customers. Then
> run the various reports on each of the customer datasets.
>
> I couldn't get flow-split to work for me until yesterday, when I found
> and fixed a couple of bugs.
>
> The problem with flow-split is that I can't reconstruct the sample times.
> For example, if I were to use flow-cat and put all of a day's data into
> one file, I could not use flow-split to get it out again. Even if I do:
>
> flow-split -T 900
>
> for example, it does not generate 96 files at 15-minute intervals. Rather,
> intervals with zero flows are skipped and the following interval may not
> start on a 15-minute boundary. And, without knowing the intervals, I can't
> graph the data. Perhaps if flow-split had a "-clock" option to generate
> files according to clock time, I could use it.
>
> Something else you should look at is the compile options. Flow-tools
> by default is built with only -g. Replace this with -O or -O2 and
> things will really speed up.
>
> I will check into this.
>
> ...
> Also, your flow-tag file is not organized well. All the customers should
> be listed in one tag-action, each with a different tag. The way you're
> doing things now, there are 500 tries instead of one, and on average 250
> trie lookups are done per flow. You only need one!
>
> I'm sure you are right. However, the documentation does not provide
> any clue that this approach will work. And, since it is a solved problem,
> I'm not going to rework it at this time.
>
> Craig