argus suggestions please

Mon Oct 8 12:50:03 EDT 2007

Hey Michael,
Well, you are not the only one that has scaling problems, so if
you don't mind, I would like to tackle some of your processing
problems on the mailing list.  I have several sites that are in
the 1-2M meaningful IP addresses per day range, after culling from
7-8M per day, so you're not really that unique (although we are
all unique and special in our own way ;o).

One of the sites has the resources to actually track them all
in mysql.  Most are just trying to condense their archives
to control disk space, and they aggregate heavily in one class
of address and less heavily in others.
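
To make the "aggregate heavily in one class, less in others" idea
concrete, here is a minimal Python sketch.  It is only an illustration
under assumptions: it supposes the sites keep local addresses at full
resolution and collapse everything else to a /16, and the function names,
the prefix length and the (address, bytes) record shape are all
hypothetical.  It is not how any argus client does it.

```python
import ipaddress
from collections import Counter

def aggregate_key(addr, local_net, remote_prefix=16):
    """Return the key an address is tallied under.

    Assumption: local addresses keep full resolution, everything else
    is collapsed to a network of length remote_prefix.
    """
    ip = ipaddress.ip_address(addr)
    if ip in local_net:
        return str(ip)
    # strict=False masks off the host bits instead of raising an error
    return str(ipaddress.ip_network(f"{ip}/{remote_prefix}", strict=False))

def condense(records, local_net, remote_prefix=16):
    """Sum bytes per aggregation key over (address, bytes) pairs."""
    local = ipaddress.ip_network(local_net)
    totals = Counter()
    for addr, nbytes in records:
        totals[aggregate_key(addr, local, remote_prefix)] += nbytes
    return dict(totals)
```

With millions of remote addresses and only thousands of local ones, the
number of keys kept this way is bounded by the local address count plus
the number of remote /16 networks seen.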

Most of the addresses are of course completely meaningless,
not even worth looking at as they are so amazingly benign.  Most
are scanners or dubious packets (what was once mislabeled
as backscatter traffic), and for many of these sites simple exclusion
filters are all that is needed to deal with the data.

One of the simplest filters is the dark space filter, or the inverse
of the active local IP address list.  This is why rafilteraddr is in
the distribution.  You can give it literally millions of addresses and
it will quickly match records against the list.  The current argus
version is a simple filter front end, but you can hack it easily to
be an aggregator.
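
For anyone who wants to see the dark space idea outside the tool, here
is a minimal Python sketch.  It assumes flow records have already been
reduced to (saddr, daddr, bytes) tuples and that the active local
addresses come one per line; the function names and the record shape are
hypothetical, and this is not how rafilteraddr itself is implemented.

```python
import ipaddress

def load_active(lines):
    """Parse one address per line into a set of integers for fast lookup."""
    active = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            active.add(int(ipaddress.ip_address(line)))
    return active

def drop_dark_space(records, active, local_net):
    """Yield records unless an endpoint is a local address that is not active.

    A local address missing from the active list is treated as dark
    space, so any record touching it is excluded.
    """
    net = ipaddress.ip_network(local_net)
    for saddr, daddr, nbytes in records:
        keep = True
        for addr in (saddr, daddr):
            ip = ipaddress.ip_address(addr)
            if ip in net and int(ip) not in active:
                keep = False
        if keep:
            yield (saddr, daddr, nbytes)
```

Storing the addresses as integers in a set keeps each lookup constant
time, which is what makes a list of millions of addresses workable.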

I am not going to do anything drastic until after argus-3.0.0 is
released, but we do have small-memory-footprint aggregators
that I'll put into argus-3.1 that would be able to handle your
IP address load (at least for small numbers of metrics).  And
we have the ability to do things like automatically load rrds per
IP address.  Millions of rrds don't work that well, but 100Ks
of rrds do, so dealing with the IP address scale is a good thing to
work on.

So, please don't be shy, keep the discussion going!!!!!!

Carter

On Oct 8, 2007, at 12:24 PM, Michael Hornung wrote:

> 7,713,318 unique IPs (both src and dst) seen in one day (10/1, see bottom
> of message for detail).  This is a research institution and I'm
> studying...ahem...active parts of our network.
>
> Where I'm monitoring in the network I do not see end host MAC addresses,
> which is why I have to carve those out in post-processing.
>
> I think my current scripts are giving me the data I need, which is the top
> talkers in bytes.  The way I looked at it was that saving the reports with
> per-flow detail could allow me to return to them at a later time and get
> further helpful information if I no longer have the argus files for that
> period of time.
>
> My recent email starting this thread was to add *additional* reporting to
> what I'm already generating by adding a daily sum of flows per local
> source address, so the other info I'm generating is useful to me.
>
> Here's a terse breakdown of the IPs seen in a day:
>
>            10,687 IPs local to monitored segments
>         7,702,631 other IPs
>         ---------------------------------------
>         7,713,318 IPs total
>
> Thanks for your help Carter.  I'll try out some of your suggestions and
> see how they work.
>
> -Mike
>
> On Mon, 8 Oct 2007 at 10:35, Carter Bullard wrote:
>
> |So Michael,
> |How many IP addresses are we talking about?
> |
> |I may be missing something, but I'm not sure that your example
> |scripts are doing what you'd like?  I think your scripts will
> |generate, at the end of each period, an ascii printout of
> |each flow sorted by total bytes.  Is this what you intend?
> |
> |If all you want is total flows by IP address, why not print the ascii
> |list of just IP addresses, along with the appropriate metric?  And
> |could you print the mac addresses so you have them right there?
> |(unless the mac addresses in the argus records are not the mac
> |addresses you want to tally)?
> |
> |  racluster -M norep -r ${file} -w - - ip |           # generate single flow recs
> |  racluster -m smac saddr proto -M rmon -w - |        # aggregate by IP address
> |  rasort -m bytes smac saddr -w - |                   # sort output
> |  ra -nn -s smac saddr trans spkts dpkts pkts sloss dloss sbytes dbytes bytes > \
> |     ${stats_dir}/.....
> |
> |This will give you (assuming your argus data has MAC address turned on)
> |a report with mac/IP address pairings, the flow count ('trans'), and the
> |'in and out' packets and the loss reported and the byte statistics all
> |based on IP address.  So for each period you get the IP address and the
> |flow counts and a bunch of other interesting data.
> |
> |So how long are these periods?  1, 5 min?
> |
> |If I was doing it, and I wanted to generate your lists and a top 20 talkers
> |list at the end of the day and graph it, and I was challenged on memory,
> |I would do this (assuming the mac addresses in the argus records are
> |the ones you want):
> |
> |  racluster -M norep -r ${file} -w - - ip | \
> |  racluster -m smac saddr proto -M rmon -w - | \
> |  rasort -m bytes smac saddr -w - | \
> |  ra -N 1000 -w ${stats_dir}/...../day/period
> |
> |This will generate at the end of each time period the top 1000 talkers
> |database; then when it's time to generate the top 20 talkers for the day:
> |
> |  rasort -R ${stats_dir}/.../day -m bytes smac saddr -w - |\
> |  ra -N 20 -w top20.talkers.list
> |
> |That would really fly, I suspect.
> |
> |The top 20 daily talker graph is easy, just need to get the list
> |of IP addresses you want and then run ragraph with the appropriate
> |set of IP address data:
> |
> |  ra -s addr -r top20.talkers.list > addrs.list
> |  rafilteraddr -f addrs.list -R ${stats_dir}/..../daily  > /tmp/data
> |  ragraph  spkts dpkts saddr -M 1m -w /tmp/ragraph.png
> |
> |
> |Or at least something like that.
> |
> |
> |Carter
> |
> |
> |On Oct 5, 2007, at 5:55 PM, Michael Hornung wrote:
> |
> |> None of those options works on a whole day's worth of data at once, even
> |> the last when it tries to cluster all the processed files from /tmp.
> |>
> |> As Peter mentioned it is possible to run stats on the individual smaller
> |> files throughout the day and process those later to produce the cumulative
> |> set of results.  I have some Perl that helps me do that as well.
> |>
> |> For example, here is what I'm doing every five minutes after archiving the
> |> previous chunk of data:
> |>
> |> 	racluster -r ${file} -w - |  \
> |> 	rasort -r - -m bytes saddr daddr -w - - ip |  \
> |> 	ra -nn -s saddr sport daddr dport spkts dpkts  \
> |> 		pkts sloss dloss sbytes dbytes bytes  \
> |> 	> ${stats_dir}/${year}/${month}/${day}/${seconds}
> |>
> |> At the end of the day I go through each of the reports generated above and
> |> do several things:
> |>
> |> 	1) Go through our ARP cache records (which we poll regularly) and
> |> 	   associate the IPs to MACs based on the name of the report which
> |> 	   is the time when the report was written (roughly when the
> |> 	   device was talking online).
> |>
> |> 	2) Compile aggregate bytes transferred per MAC address.
> |>
> |> 	3) Publish a top-talkers list to a web page, including a graph
> |> 	   (generated by gnuplot) of packet loss.
> |>
> |> It looks like it will be easiest to re-process each raw file at the same
> |> time I do the argus reporting above and add separate accounting for number
> |> of flows per IP.  Then my end-of-day accounting can go through this
> |> additional data and attribute the flow counts to the appropriate device
> |> (MAC address).  Then I can provide another report view that sorts results
> |> by top flows.
> |>
> |> -Mike
> |>
> |> On Fri, 5 Oct 2007 at 15:58, Carter Bullard wrote:
> |>
> |> |The solution is to not count the flows, but the flow records, or
> |> |run the script against the individual 5m files, and then combine
> |> |the output to generate the final file.
> |> |
> |> |The fastest way is to skip the first racluster, as it is the one that is
> |> |eating the memory.
> |> |
> |> |Try this:
> |> |  racluster -M rmon -m saddr -R archive/2007/10/04 -w - | \
> |> |  rasort -m bytes -s saddr trans sbytes dbytes
> |> |
> |> |That should run if you have less than 1M addresses, give or take
> |> |250K. If that works and you want to still count the unique flows,
> |> |try this variant:
> |> |
> |> |  racluster -M ind -R archive/2007/10/04 -M norep -w - -- ip|\
> |> |  racluster -M rmon -m saddr -w - | \
> |> |  rasort -m bytes -s saddr trans:10 sbytes:14 dbytes:14
> |> |
> |> |The "-M ind" option will cause racluster to process each file
> |> |independently, rather than treating the entire directory structure
> |> |as a single stream.
> |> |
> |> |If none of these are successful then try doing the top x for each
> |> |5 minute file, and then raclustering and rasorting the 5m files.
> |> |
> |> |Using bash:
> |> |  for i in archive/2007/10/04/*; do echo $i; racluster -r $i -w - -- ip | \
> |> |  racluster -M rmon -m saddr -w - | rasort -m bytes -w /tmp/$i.srt; done
> |> |
> |> |  racluster -R /tmp/archive/2007/10/04 -w - | \
> |> |  rasort -m bytes -s saddr trans:10 sbytes:14 dbytes:14
> |> |
> |> |then delete the /tmp/archive/2007/10/04 directory.
> |> |
> |> |Does any of that work?
> |> |
> |> |Carter
> |> |
> |> |
> |> |Michael Hornung wrote:
> |> |> Thanks Carter, this is what I was hoping to hear!  You guessed my setup
> |> |> exactly, though I've got a problem with what you sent, and I suspect it may
> |> |> be related to the amount of data.  The box I'm using to do processing is x86
> |> |> linux (RHEL5) with 2 x dual core 2ghz CPUs and 4GB of RAM.
> |> |>
> |> |> % du -hs archive/2007/10/04
> |> |> 20G     archive/2007/10/04
> |> |>
> |> |> % racluster -R archive/2007/10/04 -M norep -w foo -- ip
> |> |> Segmentation fault
> |> |>
> |> |> strace shows:
> |> |> brk(0x4e0be000)                         = 0x4e09d000
> |> |> mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> |> |> mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> |> |> mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> |> |> mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> |> |> mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> |> |> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> |> |> +++ killed by SIGSEGV +++
> |> |>
> |> |>
> |> |> I ran top while racluster was running and it seems that the process runs out
> |> |> of memory, and I'm nearing the system's limits...so what can be done about
> |> |> this?
> |> |>
> |> |>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> |> |> 7551 argus     20   0 3062m 2.9g  836 R   76 83.2   0:38.64 racluster
> |> |>
> |> |> -Mike
> |> |>
> |> |> On Fri, 5 Oct 2007 at 14:19, Carter Bullard wrote:
> |> |>
> |> |> |Hey Michael,
> |> |> |I think asking these questions is great!!!  It gets examples into the
> |> |> |mailing list, where people can search etc....
> |> |> |
> |> |> |So, you have a daily directory and you want a report based on IP top
> |> |> |talkers.  Let's say the directory is in the standard argus archive
> |> |> |format, and we'll do yesterday.
> |> |> |Here is the set of commands that I would use:
> |> |> |
> |> |> |  racluster -R archive/2007/10/04 -M norep -w - -- ip | \
> |> |> |  racluster -M rmon -m saddr -w - | \
> |> |> |  rasort -m bytes -s saddr trans:10 sbytes:14 dbytes:14
> |> |> |
> |> |> |So what does this do:
> |> |> |  racluster -R archive... -M norep -w - -- ip
> |> |> |      This program will read in a day's worth of IP data and assemble all
> |> |> |      the flow status reports into individual flow reports.  We need to do
> |> |> |      this because you said you wanted to know how many flows there were.
> |> |> |      The "-M norep" option says don't report the merge statistics for
> |> |> |      aggregations.  This allows for a single record to be tallied as a
> |> |> |      single flow.  And we write the output to stdout.
> |> |> |
> |> |> |  racluster -M rmon -m saddr -w -
> |> |> |      This program will read in the stream of single flow reports from
> |> |> |      stdin and generate the top talker stats.  The rmon option pushes
> |> |> |      the identifiers to the src fields, the "-m saddr" option aggregates
> |> |> |      on the source address, and we write the output to stdout.
> |> |> |
> |> |> |  rasort -m bytes -s saddr trans:10 sbytes:14 dbytes:14
> |> |> |      This program sorts the output based on total bytes for each top
> |> |> |      talker, and prints out the IP address, the number of flows, the
> |> |> |      bytes transmitted by the talker and the bytes received.
> |> |> |
> |> |> |  Now if you want the top 20 talkers, you need to select the first 20
> |> |> |  records from the rasort(), to do this:
> |> |> |  racluster -R archive/2007/10/04 -M norep -w - -- ip | \
> |> |> |  racluster -M rmon -m saddr -w - | \
> |> |> |  rasort -m bytes -w - |\
> |> |> |  ra -N 20 -s saddr trans:10 sbytes:14 dbytes:14
> |> |> |
> |> |> |
> |> |> |If you try this and get something weird, send mail!!  It would be
> |> |> |good if we can get a "standard" set of calls that people understand.
> |> |> |
> |> |> |Carter
> |> |> |
> |> |> |Michael Hornung wrote:
> |> |> |> I have an ra reading from a remote argus collector 24x7, and every 5 minutes
> |> |> |> the argus file is archived; at the end of a day I have 290 files representing
> |> |> |> the traffic from that day.
> |> |> |>
> |> |> |> Let's say I want to make a list of the top talkers, sorted by total bytes
> |> |> |> transferred.  Given those top talkers, I want to see the following as text,
> |> |> |> and/or alternately graphed, for each top talker:
> |> |> |>
> |> |> |> IP
> |> |> |> # flows
> |> |> |> # bytes rcvd
> |> |> |> # bytes sent
> |> |> |>
> |> |> |> Can you recommend a command-line that's going to give me this?  The profusion
> |> |> |> of argus utilities and a lack of examples is making this hard for me.
> |> |> |> Thanks.
> |> |> |>
> |> |> |> -Mike
> |> |> |> |
> |> |> |
> |> |>
> |> |>
> |> |
> |> |
> |>
> |
>