[ARGUS] ra stops unexpectedly

Thu Sep 30 14:25:21 EDT 2004

Hey Eric,
Many commercial packages have overseer-style processes/threads
that periodically check to see if a process has exited and do
the right thing, either restarting it or generating a message.  We
can do this, but one of the features you mention is to restart if
a process is overloaded.  That is a hard one to detect from the
position of an overseer.  Better for the client to detect this
and adjust or exit, so the overseer doesn't get too complicated.
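
As a rough sketch of the simple "has it exited?" half of an overseer
(not anything argus ships today), something like the following would
do. The function name ensure_running and the idea of passing the
client's command line as a list are assumptions made for illustration.

```python
import subprocess

def ensure_running(cmd, proc=None):
    """Start cmd if there is no live child process.

    cmd  -- command line as a list, e.g. a hypothetical ra invocation
    proc -- the Popen object from the previous call, or None
    Returns (proc, restarted).
    """
    # poll() returns None while the child is still running, and its
    # exit status once it has exited.
    if proc is None or proc.poll() is not None:
        return subprocess.Popen(cmd), True
    return proc, False
```

The overseer would call this once per check interval and log a message
whenever restarted comes back True. Note that it says nothing about
overload, which is the part that is hard to see from the outside.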

In order to know if a process is in a loop, we could have it
periodically touch its PID file, and the overseer can check to
see if it has touched its file in the last whatever time period.
If not, kill it and restart it.
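
A minimal sketch of that PID-file heartbeat, assuming the client and
the overseer agree on a file path and a maximum age. The names touch
and is_stale are made up for this example.

```python
import os
import time

def touch(pid_file):
    """Client side: create the file if needed and refresh its mtime."""
    with open(pid_file, "a"):
        pass
    os.utime(pid_file, None)  # None sets atime/mtime to the current time

def is_stale(pid_file, max_age_seconds, now=None):
    """Overseer side: True if the file is missing or has not been
    touched within the last max_age_seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(pid_file).st_mtime
    except FileNotFoundError:
        return True
    return (now - mtime) > max_age_seconds
```

The client would call touch() from its main loop, and the overseer
would kill and restart the process whenever is_stale() returns True.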

So, the server does report its status and how many records it's
generated in the MAR records (since the R is record, maybe 'records'
is a bit redundant ;) so we should be doing what you suggested.
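
On the collector side, the check you describe could be as small as
comparing the total the sensor reports with the number of records
actually received. This is only a sketch: the function name and the
two counters are hypothetical, and it does not show how the count is
actually carried in a MAR.

```python
def records_missing(reported_total, received_total):
    """Return how many records the collector appears to have lost.

    reported_total -- cumulative count the sensor says it has generated
    received_total -- cumulative count the collector has actually seen
    A reported total below the received total (for example after a
    sensor restart resets its counter) is treated as no loss.
    """
    return max(0, reported_total - received_total)
```

With your example numbers, a sensor reporting 54141 flows against
54100 received would flag 41 missing records.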

Carter 

> From: eric <eric-list-argus at catastrophe.net>
> Organization: Catastrophe.Net <http://www.catastrophe.net/>
> Date: Thu, 30 Sep 2004 13:07:44 -0500
> To: Carter Bullard <carter at qosient.com>
> Cc: <slif at bellsouth.net>, Peter Van Epp <vanepp at sfu.ca>, Argus
> <argus-info at lists.andrew.cmu.edu>
> Subject: Re: [ARGUS] ra stops unexpectedly
> 
> On Thu, 2004-09-30 at 13:56:22 -0400, Carter Bullard proclaimed...
> 
>> None of this means we can't provide a "reconnect on failure"
>> feature, but what are we going to specify when you're connected
>> to 3 remote data sources?  How do we notify the specific client
>> that a source has been lost, or has not ever been connected?
> 
> Hey Carter et al,
> 
> So, let's see. What about tracking each client from a parent,
> master, process? I know this would be a real pain, and may lead to
> some race conditions if you're not careful, but it might solve the
> purpose. So let's say PID 4550 establishes connections to three
> servers for PIDs 4551, 4552, 4553. Essentially you can drop the child
> processes into a privilege-separated jail (adding more security
> too!) and only let them communicate back to the parent through very
> specific calls. The children should be given all rights to write to
> disk, etc., then drop privileges. If the parent notices one dies off
> or becomes overloaded for X amount of time, send a SIGHUP or kill it
> and restart it. Perhaps you can just look for bind problems to the
> server?
> 
> What I've found is that it's more of a pain to actually find out
> when we're losing data if we're still connected. So, that said....
> 
> ...wouldn't it be great if the server summarized how many flow
> records it's gathered and reported that as a status (stop me if we're
> already doing this) in the form of a sequence number?
> 
> So, "Hey collector A, it's sensor B, I've seen 54141 flows, I'll see
> you again in 30 seconds!"
> 
> Then add that same functionality into the ra() tools and report
> errors and oddities.
> 
> This would help scripting restarts of the clients, etc.
> 
> 
> 
>