[ARGUS] argus monitor script

Peter Van Epp vanepp at sfu.ca
Sat Apr 24 00:51:37 EDT 2004


	While I expect this to be of immediate interest to Eric and me (since
it is currently FreeBSD centric :-)) some of the rest of you may be interested
too. This is a Perl script designed to run out of cron to watch, and kill and
restart as required, a 2.0.x argus_bpf on a sensor. It basically looks to see
that all three tasks (sharing a gpid, i.e. process group id) are present and,
if they aren't, kills the tasks that are present and executes the argus boot
time startup script from /etc/rc to restart argus. This is the current crontab
entry. The monitor is offset 3 minutes to avoid trouble when the log file is
switching at the top of the hour.

0 * * * *       /usr/local/bin/argusarchive >> /data/archive.log 2>&1
3-59/5 * * * *  /usr/local/bin/mon_argus.pl >> /data/mon_argus.log 2>&1

	This is a typical "fault" where only the output task dies (via kill
this time, although this actually happened to my test machine and I didn't
notice for 3 days, thus this script :-)). I expect this is why daemontools
isn't tripping during Eric's failures: the prime task is still running. This
is currently lightly tested, but may still be useful to Eric sooner rather than
later (and as demonstrated at least the common case works :-)):

ps auxw | grep argus
root    2485  0.0  0.2  2440 1272  ??  Ss    9:01PM   0:00.02 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out
root    2486  0.0  0.2  2252 1084  ??  S     9:01PM   0:00.04 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out
root    2487  0.0  0.2  2388 1220  ??  S     9:01PM   0:00.02 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out

	"fault" takes out the output task:

test6# kill -9 2487

	leaving us broken:

test6# !ps
ps auxw | grep argus
root    2485  0.0  0.2  2440 1272  ??  Ss    9:01PM   0:00.02 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out
root    2486  0.0  0.2  2248 1080  ??  S     9:01PM   0:00.04 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out

	After the monitor script runs out of cron:

test6# cat mon_argus.log
Fri Apr 23 21:18:00 PDT 2004 Problem detected: initial PID values

2486 2485 0:00.14 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out
 2485 2485 0:00.08 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out


Attempting to restart argus via the startup script
 starting argus
debug.bpf_bufsize: 524288 -> 524288
argus_bpf[2590]: started

test6# ps auxw | grep argus
root    2590  0.0  0.2  2440 1272  ??  Ss    9:18PM   0:00.11 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out
root    2591  0.0  0.2  2252 1088  ??  S     9:18PM   0:00.27 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out
root    2592  0.0  0.2  2388 1224  ??  S     9:18PM   0:00.12 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out

	alive again.
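
	The check the monitor performs (exactly three distinct pids, all sharing
one gpid) can be sketched as a standalone function, which makes it easy to try
against saved ps output. This is only a sketch: argus_healthy is a hypothetical
helper name, not part of mon_argus.pl, and it assumes input lines in the
"PID PGID TIME COMMAND" form that the script logs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of mon_argus.pl): takes lines of the form
# "PID PGID TIME COMMAND" and returns 1 only if there are exactly three
# distinct pids that all share one process group id, otherwise 0.
sub argus_healthy {
    my @lines = @_;
    my (%seen_pid, %seen_pgid);
    for my $line (@lines) {
        my ($pid, $pgid) = split(' ', $line);
        return 0 unless defined $pgid;      # malformed or empty line
        return 0 if $seen_pid{$pid}++;      # duplicate pid
        $seen_pgid{$pgid} = 1;
    }
    return (keys(%seen_pid) == 3 && keys(%seen_pgid) == 1) ? 1 : 0;
}

# The two surviving tasks from the log above: only two pids remain,
# so the check fails and a restart would be triggered.
my @broken = (
    "2486 2485 0:00.14 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out",
    "2485 2485 0:00.08 /usr/local/bin/argus_bpf -dJR -i xl0 -i xl1 -w /data/argus.out",
);
print argus_healthy(@broken) ? "ok\n" : "restart needed\n";
```

	Collecting all pids and gpids first and judging at the end avoids
depending on the order in which ps happens to list the tasks.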

Peter Van Epp / Operations and Technical Support 
Simon Fraser University, Burnaby, B.C. Canada

/usr/local/bin/mon_argus.pl

#!/usr/bin/perl

# Check that all three argus-2.x tasks are present and burning CPU time (and 
# therefore probably running). If there is anything different than 3 tasks 
# sharing a gpid, kill them all then call the boot startup script  to restart 
# argus on the assumption something is wrong. This script is intended to run
# out of cron (as root of course) every 5 minutes or so on the sensor platform.

	# Check for the correct 3 argus tasks (i.e. exactly 3 different pids 
	# but all with the same gpid, and exit silently if found

	# get the current time to log in case of problems.

	$date = `date`;
	chomp $date;

	@pids = &get_pids();

	# save a copy for logging purposes

	@orig_pids = @pids;

	# reset the gpid, count and unique pid variables

	$cur_gpid = "";
	$pid_count = 0;
	%pids = ();

	while (@pids) {
		$line = pop(@pids);
		($pid, $pgid, $time, $command) = split(' ',$line,4);
		if ($cur_gpid eq "") {
			
			# this must be the first pid, so extract the current 
			# gpid to match the rest of the tasks. Mark the pid
			# as seen as a dup check.

			$cur_gpid = $pgid;
			$pids{$pid} = "seen"; 
			$pid_count++;

		} elsif ($cur_gpid ne $pgid) {

			# there is more than one gpid represented so a restart
			# is required; zero the count to force it and bail out.

			$pid_count = 0;
			last;
		} else {

			# gpids match, so if this pid is unique then increase
			# the pid count by one and record this pid.

			if ($pids{$pid} ne "") {
				
				# there is a duplicate pid for some reason
				# zero the count and bail to do a restart.

				$pid_count = 0;
				last;
			} else {

				# otherwise count and record this pid as all
				# is so far well (as it usually should be).

				$pid_count++;
				$pids{$pid} = "seen"; 
			}
		}
	}
	if ($pid_count == 3) {

		# There are 3 unique pids all with the same gpid so it looks
		# like things are fine (later add a test to see that CPU time
		# is increasing at this point, but for now declare success) 

		exit (0);
	}

	# otherwise there is something wrong so kill off the argus tasks in
	# preparation for a restart. If the tasks can't be killed stop_argus
	# will flag that for human intervention currently by writing it to a
	# log file, an email or other timely alert should go here.

	print "$date Problem detected: initial PID values\n\n@orig_pids\n\n";

	$rc = &stop_argus(); 

#	if ($rc == -1) {

		# stop_argus has already complained to stdout about the 
		# unkillable tasks so bail without attempting to restart argus
		# if you like. I think restarting argus anyway may do something
		# and it can't easily get worse so we may as well do it. If 
		# you don't agree, uncomment this part!

#		exit (-1)

#	}

	print "Attempting to restart argus via the startup script\n";

	system("/usr/local/etc/rc.d/argus.sh start");
	exit (0);


# Subroutines to support the above script (currently very FreeBSD specific,
# need modification in the area of the ps command for other OSs ...)
		

sub stop_argus {

	@pids = &get_pids();

	# first do a round of kill -HUP on all the argus pids

	while (@pids) {
		$line = pop(@pids);
		($pid, $rest) = split(' ',$line);
		
		# feed argus a kill -HUP

		kill (1, int($pid));
	}
	
	# give it some time to take effect
	
	sleep(2);

	# then see if there are still argus tasks

	@pids = &get_pids();

	while (@pids) {
		$line = pop(@pids);
		($pid, $rest) = split(' ',$line);
		
		# no more fooling around, feed argus a kill -9

		kill (9, int($pid));
	}
	
	# give it some time to take effect
	
	sleep(2);

	# check yet again

	@pids = &get_pids();

	if ($#pids == -1) {

		# no more pids, argus has been killed successfully so silently
		# exit indicating success.
		return (0);
	} else {
	
		# Weren't able to kill argus, human intervention required so 
		# print a complaint and return failure.

		print "Unable to successfully kill\n\n@pids\n";
		return (-1);
	}
}

sub get_pids {

	local ($_) = @_;
	local (@lines, $line, @pids, $id, $pid, $ppid, $pgid, $ses, $jobc, 
	       $stat, $tt, $time, $command);

	@lines = `ps ajxwwww | grep "/usr/local/bin/argus_bpf" | grep -v "grep"`;
	@pids = ();
	while (@lines) {

		# there is at least one argus task running so extract its PID
		# GPID and CPU time and command to return
		
		$line = pop(@lines);
		($id, $pid, $ppid, $pgid, $ses, $jobc, $stat, $tt, $time, 
		 $command) = split(' ',$line,10); 
		push(@pids,"$pid $pgid $time $command");
	}
	return (@pids);
}
1;
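
	The script's comments mention adding a later test that CPU time is
increasing. A minimal sketch of how that might look follows. It is not part of
the script above: cpu_seconds and stalled_pids are hypothetical names, the TIME
parsing assumes colon-separated fields as in the "0:00.14" values logged
earlier, and saving the previous run's values between cron invocations is left
out. Note that a sensor on a very quiet link might legitimately show no
increase, so treat a stall as a hint rather than proof.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a ps TIME field such as "0:00.14" to seconds. Assumes
# colon-separated fields where each field to the left is worth 60 times
# the one to its right (minutes:seconds, or hours:minutes:seconds).
sub cpu_seconds {
    my ($time) = @_;
    my $secs = 0;
    for my $field (split(/:/, $time)) {
        $secs = $secs * 60 + $field;
    }
    return $secs;
}

# Hypothetical check: given hash refs of CPU seconds keyed by pid from the
# previous run and the current run, return (sorted) the pids present in both
# whose CPU time has not increased.
sub stalled_pids {
    my ($prev, $cur) = @_;
    return sort { $a <=> $b }
           grep { exists $prev->{$_} && $cur->{$_} <= $prev->{$_} }
           keys %$cur;
}

# Illustrative values only: pid 2485 has used more CPU since the last run,
# pid 2486 has not.
my %prev = (2485 => cpu_seconds("0:00.02"), 2486 => cpu_seconds("0:00.04"));
my %cur  = (2485 => cpu_seconds("0:00.08"), 2486 => cpu_seconds("0:00.04"));
print join(" ", stalled_pids(\%prev, \%cur)), "\n";
```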


