Some questions on architecture basics for argus collection

Carter Bullard carter at qosient.com
Fri Aug 24 14:47:42 EDT 2012


Hey Jesse,
Answers inline.
Carter
On Aug 24, 2012, at 11:18 AM, Jesse Bowling <jessebowling at gmail.com> wrote:

> A few questions about how to architect a relatively simple argus
> deployment...A great many of these questions will likely depend on
> underlying hardware and traffic volumes being processed...Let's assume
> that each 'path' being monitored is handling (for long peaks) 1 Gb/s
> of traffic, and the hardware is reasonable...In cases where this
> really matters if you wouldn't mind pointing out that this is a
> controlling factor it would be much appreciated...
> 
> On a server where multiple interfaces need to be monitored (let's say
> 2 pairs of taps connections = 4 physical interfaces), which of the
> following would you recommend?
> 
> 1) Configure one argus server for each pair, with the interface
> specified like "ARGUS_INTERFACE=bond:eth1,eth2", and run a radium on
> the SAME box to collect from the two servers
> 2) Same interface config as 1), but move the radium program off host
> and have it connect to both servers?
> 3) Configure one argus server for both pairs, with two interface config
> lines like "ARGUS_INTERFACE=bond:eth1,eth2" and
> "ARGUS_INTERFACE=bond:eth3,eth4" in argus.conf and move the radium off
> host?

The rule of thumb for high-demand sensors is to have one argus server
per Observation Domain, i.e. SourceID, and you would like to have at least
one core per argus observation domain, if you can.
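
As a sketch (the ports, ids and filenames here are just illustrative), that
works out to one argus per tap pair, each with its own source id, and one
radium collecting from both:

   # argus-tap1.conf - first observation domain
   ARGUS_MONITOR_ID=1
   ARGUS_INTERFACE=bond:eth1,eth2
   ARGUS_ACCESS_PORT=561

   # argus-tap2.conf - second observation domain
   ARGUS_MONITOR_ID=2
   ARGUS_INTERFACE=bond:eth3,eth4
   ARGUS_ACCESS_PORT=562

   # radium.conf - same box for option 1, off-host box for option 2
   RADIUM_ARGUS_SERVER=sensor:561
   RADIUM_ARGUS_SERVER=sensor:562
   RADIUM_ACCESS_PORT=563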

> 
> If one of the goals is to keep a copy of unfiltered data on the host,
> and then have filtering done on clients connecting to radium:
> 
> 1) If the local (on sensor) argus specifies an ARGUS_OUTPUT_FILE
> option, is it possible to have it automatically split up files like
> rasplit does, or should an instance of rasplit be run and have it
> connect to the local server? Will this interfere with radium
> collecting records as well (or stated another way, how many
> connections from clients can an argus server handle)?

Radium on the sensor is all about how many collectors you want to support,
which is really about how many copies of the flow output buffers you have
to generate.  When there is one reader, we don't copy, unless there are
issues with socket performance.  But when there are 2 or more, we have to
generate multiple copies of the flow output buffers, just to schedule them
on the threads that manage the multiple output socket queues.  It's the
memory demand that can get in the way.

You want to avoid output flow filtering in argus if you can.  To filter a
flow, we have to convert the compacted output buffer into a structured
record so the filters can be snappy, and that conversion is a bit onerous.
Better to leave those cycles for packet processing.  So if you'd like to
connect directly to argus and filter what you get, you'll tax argus a bit
to do the filtering.
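
The cheap pattern is to let argus stream everything and run the filter in
the client instead; something like this (host and filter are just examples):

   ra -S sensor:561 - tcp and port 443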

Output filtering is important for limiting the load on the socket
processing, so it's a balance, and radium deals with it a bit better than
argus.  If your architecture is to blow everything off the sensor to a
collection box, then there's no need for radium on the sensor:  only one
consumer of argus records from the sensor, no problem.  Run radium on the
collection box, and let everybody access records there.  If you have lots
of uses for the data on the sensor, or you want to write argus records to
disk and write to clients (multiple consumers), then run radium on the
sensor.

The rasplit() filename function is expensive for argus.  Better to have a
separate rasplit() running on the local sensor doing the disk IO than to
have argus doing it.
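
On the sensor that would look something like this (the path and the 5
minute split are just examples):

   # let rasplit, not argus, do the filename logic and the disk IO
   rasplit -S localhost:561 -M time 5m \
           -w /argus/data/%Y/%m/%d/argus.%Y.%m.%d.%H.%M.%S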

> 
> 2) What are the performance implications of using
> RADIUM_CORRELATE="yes"? For instance, let's say we're monitoring two
> paths to the internet which are in an active-active state (and
> promises of "no asymmetrical routing" have been made), and also
> pulling in Cisco netflow from an internal router and writing out the
> combined flows to file; if asymmetrical routing occurs would this
> setting allow radium to deal gracefully with this issue from a flow
> perspective (putting the correct pieces together into a single
> bi-directional flow), and can records be audited for occurrences of
> this if each path has its own monitor-id? What would happen when the
> same flow is seen via Netflow as is seen on the argus instance in
> terms of the data kept in the record? Would the richer argus data be
> kept, or would the Netflow information be kept?

RADIUM_CORRELATE doesn't do anything in 3.0.6.2, but it will very soon do
quite a bit, in 3.0.7.2.  When it does, think of it as rabins() embedded in
a pipeline processor.  The trick is, when it receives two matching records,
it has to decide whether the records need to be aggregated, because they
are coming from multiple sensors that make up a single observation domain,
or whether they need to be correlated (or diff'ed, meaning that they are
the same flow observed at different points along the path).  When records
are correlated, the differential stats are stored, associated with the
other source id, so you get one flow record, from one source id, and in the
record will be diffs from the other source id.
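
When 3.0.7.2 is out, wiring it up in radium.conf should look roughly like
this (hostnames are illustrative, and remember the directive is a no-op in
3.0.6.2):

   RADIUM_CORRELATE="yes"
   # two sensors, one per path; the same flow may arrive from both
   RADIUM_ARGUS_SERVER=path1-sensor:561
   RADIUM_ARGUS_SERVER=path2-sensor:561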

> 
> 3) If you wish to have radium perform labeling via specifying
> RADIUM_CLASSIFIER_FILE, are there performance hits at some point as
> the size of the classifier grows, and how big would a classifier file
> have to be before performance was impacted? Or would it be more
> related to how complex the labeling requirements were?

Each labeling strategy puts memory demands on the labeler, whether it's
radium() or ralabel().  Some methods cause us to read in large
configurations and build internal data structures to hold the labels.
Some, like the GeoIP methods, can be implemented against real client-server
databases, so the labeler doesn't grow; others just look like that but
cache the whole database, which makes them bigger.  With the current
implementation, the labeler reads everything at startup, so it shouldn't
grow as it ages.
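
As a sketch of how the pieces connect (the ralabel.conf directive names
below follow the sample configuration shipped with the clients, so treat
them, and the paths, as illustrative):

   # radium.conf
   RADIUM_CLASSIFIER_FILE="/usr/local/argus/ralabel.conf"

   # ralabel.conf - each strategy you enable adds its own memory demand
   RALABEL_IANA_ADDRESS="yes"
   RALABEL_IANA_ADDRESS_FILE="/usr/local/argus/iana-address-file"
   RALABEL_GEOIP_ASN="yes"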

> 
> 4) How well does radium deal with disk IO wait? Which is to say, if
> the plan is to have multiple client programs connected to a radium
> instance locally, and writing out files locally as well, does radium
> have any built in buffering strategies in case disk IO becomes a
> bottleneck? I assume that the solution would be to move those clients
> off to their own hosts, but I wanted to have an idea of where one
> might lose data if this became an issue, i.e., would radium drop it?
> Would the client (ra, rasplit, etc) drop it? Would client and server
> hold everything in memory until memory was exhausted and crash the
> box?

Radium has a lot of big internal queues, so it can deal with disk IO better
than most.  Each output, whether it's a file, a socket or whatever, is a
separate thread, and each output has to deal with partial writes of data
and write failures, so they all have multiple queues for buffering data.
So if you use argus-tcp transport (the default), the transport itself gives
radium no occasion to drop a record.  Radium does detect that either the
transport path for output records is faulty, or the consumer of records is
too slow to handle the load, by tracking the number of records in the
output queue.  If the number of records queued up to go exceeds a big
threshold, it can either throw records away, or declare the connection to
be faulty and close it, discarding all the waiting records.  With argus-udp
transport, there isn't any huge potential for queueing, but records can be
lost on the wire, as we don't have any reliability built on top of UDP.

The decision on whether to drop records or drop the connection is in the
code, but right now we drop records in argus, and in radium I believe.
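
For reference, the transport is picked when you point radium at its
sources; a minimal sketch, with illustrative hostnames:

   # argus-tcp (the default): reliable, but output can queue up
   RADIUM_ARGUS_SERVER=sensor:561

   # argus-udp: no queueing backlog, but records can be lost on the wire
   RADIUM_ARGUS_SERVER=argus-udp://sensor:561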

> 
> Thanks for patience on these questions,
> 
> Jesse

Hopefully this is helpful.  If anything doesn't jibe with your thoughts on
how it should work, just holler on the list.


> 
> -- 
> Jesse Bowling
> 
