argus suggestions please

Michael Hornung hornung at cac.washington.edu
Mon Oct 8 12:24:12 EDT 2007


7,713,318 unique IPs (both src and dst) seen in one day (10/1, see bottom 
of message for detail).  This is a research institution and I'm 
studying...ahem...active parts of our network.

Where I'm monitoring in the network I do not see end host MAC addresses, 
which is why I have to carve those out in post-processing.
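To illustrate the carve-out (this is a sketch with a made-up file layout, not my actual Perl): if each polled ARP cache is saved as a snapshot file named by the epoch second it was taken, with "IP MAC" lines, then matching a report to the nearest-in-time snapshot is a small shell exercise:

```shell
# Sketch only: fabricate two ARP snapshots named by epoch second.
demo=$(mktemp -d)
printf '10.0.0.5 00:11:22:33:44:55\n' > "$demo/1191868500"
printf '10.0.0.5 00:11:22:33:44:66\n' > "$demo/1191869400"

report_ts=1191868800   # report names are roughly when the device was talking

# Pick the snapshot closest in time to the report...
nearest=$(ls "$demo" | awk -v t="$report_ts" '
    { d = ($1 > t) ? $1 - t : t - $1 }
    NR == 1 || d < best { best = d; file = $1 }
    END { print file }')

# ...then look the IP up in that snapshot to get its MAC.
awk -v ip=10.0.0.5 '$1 == ip { print $2 }' "$demo/$nearest"
```

The real version obviously reads the actual snapshot directory instead of the fabricated one.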

I think my current scripts are giving me the data I need, which is the top 
talkers in bytes.  My thinking was that saving the reports with per-flow 
detail lets me return to them later for further useful information even if 
I no longer have the argus files for that period.

My recent email starting this thread was about adding *additional* 
reporting on top of what I already generate, namely a daily sum of flows 
per local source address.  The other info I'm generating is still useful 
to me.

Here's a terse breakdown of the IPs seen in a day:

           10,687 IPs local to monitored segments
        7,702,631 other IPs
        ---------------------------------------
        7,713,318 IPs total

Thanks for your help Carter.  I'll try out some of your suggestions and 
see how they work.

-Mike

On Mon, 8 Oct 2007 at 10:35, Carter Bullard wrote:

|So Michael,
|How many IP addresses are we talking about?
|
|I may be missing something, but I'm not sure that your example
|scripts are doing what you'd like?  I think your scripts will
|generate, at the end of each period, an ascii printout of
|each flow sorted by total bytes.  Is this what you intend?
|
|If all you want is total flows by IP address, why not print the ascii
|list of just IP addresses, along with the appropriate metric?  And
|could you print the mac addresses so you have them right there?
|(unless the mac addresses in the argus records are not the mac
|addresses you want to tally)?
|
|  racluster -M norep -f ${file} -w - - ip | \          /* generate single flow recs
|  racluster -m smac saddr proto -M rmon -w - | \       /* aggregate by IP address
|  rasort -m bytes smac saddr -w - | \                  /* sort output
|  ra -nn -s smac saddr trans spkts dpkts pkts sloss dloss sbytes dbytes bytes >
|     ${stats_dir}/.....
|
|This will give you (assuming your argus data has MAC address turned on)
|a report with mac/IP address pairings, the flow count ('trans'), the
|'in and out' packets, the loss reported, and the byte statistics, all
|based on IP address.  So for each period you get the IP address, the
|flow counts, and a bunch of other interesting data.
|
|So how long are these periods?  1, 5 min?
|
|If I was doing it,  and I wanted to generate your lists and a top 20 talkers
|list at the end of the day and graph it, and I was challenged on memory,
|I would do this (assuming the mac addresses in the argus records are
|the ones you want):
|
|  racluster -M norep -f ${file} -w - - ip | \
|  racluster -m smac saddr proto -M rmon -w - | \
|  rasort -m bytes smac saddr -w - | \
|  ra -N 1000 -w ${stats_dir}/...../day/period
|
|This will generate at the end of each time period the top 1000 talkers
|database: then when it's time to generate the top 20 talkers for the day:
|
|  rasort -R ${stats_dir}/.../day -m bytes smac saddr -w - |\
|  ra -N 20 -w top20.talkers.list
|
|That would really fly, I suspect.
|
|The top 20 daily talker graph is easy, just need to get the list
|of IP addresses you want and then run ragraph with the appropriate
|set of IP address data:
|
|  ra -s addr -r top20.talkers.list > addrs.list
|  rafilteraddr -f addrs.list -R ${stats_dir}/..../daily  > /tmp/data
|  ragraph  spkts dpkts saddr -M 1m -w /tmp/ragraph.png
|
|
|Or at least something like that.
|
|
|Carter
|
|
|On Oct 5, 2007, at 5:55 PM, Michael Hornung wrote:
|
|> None of those options works on a whole day's worth of data at once, even
|> the last when it tries to cluster all the processed files from /tmp.
|> 
|> As Peter mentioned it is possible to run stats on the individual smaller
|> files throughout the day and process those later to produce the cumulative
|> set of results.  I have some Perl that helps me do that as well.
|> 
|> For example, here is what I'm doing every five minutes after archiving the
|> previous chunk of data:
|> 
|> 	racluster -r ${file} -w - |  \
|> 	rasort -r - -m bytes saddr daddr -w - - ip |  \
|> 	ra -nn -s saddr sport daddr dport spkts dpkts  \
|> 		pkts sloss dloss sbytes dbytes bytes  \
|> 	> ${stats_dir}/${year}/${month}/${day}/${seconds}
|> 
|> At the end of the day I go through each of the reports generated above and
|> do several things:
|> 
|> 	1) Go through our ARP cache records (which we poll regularly) and
|> 	   associate the IPs to MACs based on the name of the report which
|> 	   is the time when the report was written (roughly when the
|> 	   device was talking online).
|> 
|> 	2) Compile aggregate bytes transferred per MAC address.
|> 
|> 	3) Publish a top-talkers list to a web page, including a graph
|> 	   (generated by gnuplot) of packet loss.
|> 
|> It looks like it will be easiest to re-process each raw file at the same
|> time I do the argus reporting above and add separate accounting for number
|> of flows per IP.  Then my end-of-day accounting can go through this
|> additional data and attribute the flow counts to the appropriate device
|> (MAC address).  Then I can provide another report view that sorts results
|> by top flows.
|> 
|> -Mike
|> 
|> On Fri, 5 Oct 2007 at 15:58, Carter Bullard wrote:
|> 
|> |The solution is to not count the flows, but the flow records, or to
|> |run the scripts against the individual 5m files and then combine
|> |the output to generate the final file.
|> |
|> |The fastest way is to skip the first racluster, as it is the one
|> |that is eating the memory.
|> |
|> |Try this:
|> |  racluster -M rmon -m saddr -R archive/2007/10/04 -w - | \
|> |  rasort -m bytes -s saddr trans sbytes dbytes
|> |
|> |That should run if you have less than 1M addresses, give or take
|> |250K. If that works and you want to still count the unique flows,
|> |try this variant:
|> |
|> |  racluster -M ind -R archive/2007/10/04 -M norep -w - -- ip|\
|> |  racluster -M rmon -m saddr -w - | \
|> |  rasort -m bytes -s saddr trans:10 sbytes:14 dbytes:14
|> |
|> |The "-M ind" option will cause racluster to process each file
|> |independently, rather than treating the entire directory structure
|> |as a single stream.
|> |
|> |If none of these are successful then try doing the top x for each
|> |5 minute file, and then raclustering and rasorting the 5m files.
|> |
|> |Using bash:
|> |  mkdir -p /tmp/archive/2007/10/04   # the -w output path must exist
|> |  for i in archive/2007/10/04/*; do echo $i; racluster -r $i -w - -- ip | \
|> |  racluster -M rmon -m saddr -w - | rasort -m bytes -w /tmp/$i.srt; done
|> |
|> |  racluster -R /tmp/archive/2007/10/04 -w - | \
|> |  rasort -m bytes -s saddr trans:10 sbytes:14 dbytes:14
|> |
|> |then delete the /tmp/archive/2007/10/04 directory.
|> |
|> |Does any of that work?
|> |
|> |Carter
|> |
|> |
|> |Michael Hornung wrote:
|> |> Thanks Carter, this is what I was hoping to hear!  You guessed my setup
|> |> exactly, though I've got a problem with what you sent, and I suspect it
|> |> may be related to the amount of data.  The box I'm using to do processing
|> |> is x86 linux (RHEL5) with 2 x dual core 2GHz CPUs and 4GB of RAM.
|> |>
|> |> % du -hs archive/2007/10/04
|> |> 20G     archive/2007/10/04
|> |>
|> |> % racluster -R archive/2007/10/04 -M norep -w foo -- ip
|> |> Segmentation fault
|> |>
|> |> strace shows:
|> |> brk(0x4e0be000)                         = 0x4e09d000
|> |> mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
|> |> mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
|> |> mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
|> |> mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
|> |> mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
|> |> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
|> |> +++ killed by SIGSEGV +++
|> |>
|> |>
|> |> I ran top while racluster was running and it seems that the process
|> |> runs out of memory, and I'm nearing the system's limits...so what can
|> |> be done about this?
|> |>
|> |>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
|> |>  7551 argus     20   0 3062m 2.9g  836 R   76 83.2   0:38.64 racluster
|> |>
|> |> -Mike
|> |>
|> |> On Fri, 5 Oct 2007 at 14:19, Carter Bullard wrote:
|> |>
|> |> |Hey Michael,
|> |> |I think asking these questions is great!!!  It gets examples into
|> |> |the mailing list, where people can search etc....
|> |> |
|> |> |So, you have a daily directory and you want a report based on IP top
|> |> |talkers.  Let's say the directory is in the standard argus archive
|> |> |format, and we'll do yesterday.  Here is the set of commands that I
|> |> |would use:
|> |> |
|> |> |  racluster -R archive/2007/10/04 -M norep -w - -- ip | \
|> |> |  racluster -M rmon -m saddr -w - | \
|> |> |  rasort -m bytes -s saddr trans:10 sbytes:14 dbytes:14
|> |> |
|> |> |So what does this do:
|> |> |  racluster -R archive... -M norep -w - -- ip
|> |> |      This program will read in a day's worth of IP data and assemble
|> |> |      all the flow status reports into individual flow reports.  We
|> |> |      need to do this because you said you wanted to know how many
|> |> |      flows there were.  The "-M norep" option sez don't report the
|> |> |      merge statistics for aggregations.  This allows for a single
|> |> |      record to be tallied as a single flow.  And we write the output
|> |> |      to stdout.
|> |> |
|> |> |  racluster -M rmon -m saddr -w -
|> |> |      This program will read in the stream of single flow reports from
|> |> |      stdin and generate the top talker stats.  The rmon option pushes
|> |> |      the identifiers to the src fields, the "-m saddr" option
|> |> |      aggregates on source address, and we write the output to stdout.
|> |> |
|> |> |  rasort -m bytes -s saddr trans:10 sbytes:14 dbytes:14
|> |> |      This program sorts the output based on total bytes for each top
|> |> |      talker, and prints out the IP address, the number of flows, the
|> |> |      bytes transmitted by the talker, and the bytes received.
|> |> |
|> |> |  Now if you want the top 20 talkers, you need to select the first
|> |> |  20 records from the rasort(); to do this:
|> |> |  racluster -R archive/2007/10/04 -M norep -w - -- ip | \
|> |> |  racluster -M rmon -m saddr -w - | \
|> |> |  rasort -m bytes -w - |\
|> |> |  ra -N 20 -s saddr trans:10 sbytes:14 dbytes:14
|> |> |
|> |> |
|> |> |If you try this and get something weird, send mail!!  It would be
|> |> |good if we can get a "standard" set of calls that people understand.
|> |> |
|> |> |Carter
|> |> |
|> |> |Michael Hornung wrote:
|> |> |> I have an ra reading from a remote argus collector 24x7, and every
|> |> |> 5 minutes the argus file is archived; at the end of a day I have
|> |> |> 290 files representing the traffic from that day.
|> |> |>
|> |> |> Let's say I want to make a list of the top talkers, sorted by total
|> |> |> bytes transferred.  Given those top talkers, I want to see the
|> |> |> following as text, and/or alternately graphed, for each top talker:
|> |> |>
|> |> |> IP
|> |> |> # flows
|> |> |> # bytes rcvd
|> |> |> # bytes sent
|> |> |>
|> |> |> Can you recommend a command-line that's going to give me this?  The
|> |> |> profusion of argus utilities and a lack of examples is making this
|> |> |> hard for me.  Thanks.
|> |> |>
|> |> |> -Mike
|> |> |> |
|> |> |
|> |>
|> |>
|> |
|> |
|> 
|



More information about the argus mailing list