radium fails (rc.50 from 2007-08-31)
Peter Van Epp
vanepp at sfu.ca
Mon Sep 3 12:40:41 EDT 2007
On Mon, Sep 03, 2007 at 03:17:18PM +0000, Carter Bullard wrote:
> Sorry for the delayed response. Peter is correct on the radium issues. Radium shares a lot of code from argus, so we're trying to figure out the memory issues, and I'll merge them into radium() when we're happy with it!! Still a holiday here, but I'll start the radium work tonight, at least that is the plan.
>
> So, it seems that we're pretty good, except for the "-R" option? Have we got a 3.0.0 that has been running for days?
>
> Hope everyone (US in particular) are having a great [ holi ] day !!!
>
> Carter
>
> Carter Bullard
> QoSient LLC
> 150 E. 57th Street Suite 12D
> New York, New York 10022
> +1 212 588-9133 Phone
> +1 212 588-9134 Fax
>
Well no .threads, -R or -J has been working for me for 3 days now :-).
As I recall -J and -R both eventually grew, and .threads has at least the
deadlock on a -HUP and I think other problems relating to not logging things
since I don't see the statistics messages or the debug message saying it saw
the -HUP (which no .threads does) even with the antideadlock hack in place.
Putting a syslog message in the antidead lock code where it wil bail if it
can't get a lock doesn't send anything to syslog either which makes me wonder
if the threads are losing the ability to write to syslog and/or the debug file.
ps auxwwww | grep argus
root 27229 3.9 9.1 364868 361352 ? SL Aug31 158:13 argus -P 560 -i eth0 -i eth1 -U 512 -m -F /scratch/argus.conf
vanepp 2457 0.0 0.0 3132 832 pts/0 S+ 08:50 0:00 grep argus
I'll probably stop the prod one and add -J again, then the real test
will come tomorrow when classes start again :-).
I just HUPed the argus above:
hcids:/scratch # tail debug.log
ArgusWarning: argus[27229]: 31 Aug 07 14:49:07.001464 started
ArgusWarning: argus[27229]: 31 Aug 07 14:49:07.001673 ArgusGetInterfaceStatus: interface eth1 is up
ArgusWarning: argus[27229]: 31 Aug 07 14:49:07.001717 ArgusGetInterfaceStatus: interface eth0 is up
ArgusInfo: argus[27229]: 31 Aug 07 14:49:14.828501 connect from test4.ucs.sfu.ca
argus: Time 238204.274230 Flows 89394483 Closed 89383175 Sends 126316632 BSends 11401 Updates 1427005904 Cache 1337578819
eth1
Total Pkts 800607975 Rate 3361.014313
eth0
Total Pkts 625348331 Rate 2625.260747
hcids:/scratch #
If I get time today perhaps I'll poke at adding these stats to the MAR
record output as Carter has suggested doing so we can see how memory is doing
as the argus runs.
Here is the output (since yesterday) of argus-3.0.0 (on an internal
link rather than our Internet link) with .threads enabled (and my hack to
break the signal deadlock):
sniffer1:/spare # !ps
ps auxwwww | grep argus
root 29697 16.9 0.8 156544 35160 ? SLl Sep02 195:57 argus -P 560 -i eth2 -i eth3 -U 512 -m -D2 -F /spare/argus.conf
root 32628 0.0 0.0 3132 832 pts/1 S+ 09:05 0:00 grep argus
sniffer1:/spare # kill -HUP 29697
sniffer1:/spare # !ps
ps auxwwww | grep argus
root 32630 0.0 0.0 3132 832 pts/1 S+ 09:05 0:00 grep argus
took the -HUP without hanging (indicating the deadlock breaker is
working in some manner because it used to hang here) but there isn't the
output saying it had gotten the HUP from ArgusShutdown in the output file
either (nor a message in syslog that says the lock failed either) which I
think means at least a logging problem. Turns out I'm wrong, the message is
there its just much further back than I looked:
argus[29697.0000040000026f50]: 03 Sep 07 09:05:28.208842 ArgusScheduleShutDown(1)
as is the stats data, it just clears out the queues after those
messages at -D 2 and I need to grep in the huge debug output file to find them.
I'll change the argus on the internet link to using the .threads change
rgus[29697.0000040002b98230]: 03 Sep 07 09:05:28.379816 ArgusFreeListRecord (0x46b5a10) returning
argus[29697.0000040002b98230]: 03 Sep 07 09:05:28.379839 ArgusMallocListRecord (672) returning 0x4613860
argus[29697.0000040002b98230]: 03 Sep 07 09:05:28.379865 ArgusOutputProcess() received stop record 0 records on the list
argus[29697.0000040002b98230]: 03 Sep 07 09:05:28.379891 ArgusWriteSocket: write successful 128
argus[29697.0000040002b98230]: 03 Sep 07 09:05:28.379915 ArgusFreeListRecord (0x4613860) returning
argus[29697.0000040002b98230]: 03 Sep 07 09:05:28.379961 ArgusDeleteSocket (0x45780a0) returning
argus[29697.0000040002b98230]: 03 Sep 07 09:05:28.380000 ArgusOutputProcess() exiting
argus[29697.0000040000026f50]: 03 Sep 07 09:05:28.400954 ArgusCloseOutput(0x10200760) done
argus[29697.0000040000026f50]: 03 Sep 07 09:05:28.404738 ArgusShutDown()
For the adventurous this ugly hack fixes two problems in the latest
argus-3.0.0.tar.gz (394674 Aug 31 09:38 argus-3.0.0.tar.gz on qosient.com):
Time stamps come out correctly and a -HUP won't hang if the signal occurs when
there is a lock on ArgusModel in ArgusDebug when .threads is defined.
Then it will be interesting to see if this runs OK on the Intel machines (all
my test machines are PPC) since we were thinking there may be endian issues
before the latest bugs were found. Hopefully we were wrong and I was just not
seeing the bugs the others were at the time (I believe I did later :-)).
Peter Van Epp / Operations and Technical Support
Simon Fraser University, Burnaby, B.C. Canada
*** common/argus_util.c.orig 2007-08-31 14:38:53.000000000 -0700
--- common/argus_util.c 2007-09-03 09:33:44.000000000 -0700
***************
*** 145,156 ****
--- 145,167 ----
pthread_t ptid;
char pbuf[128];
int i;
+ struct timespec tsbuf = {0, 150000000}, *ts = &tsbuf;
bzero(pbuf, sizeof(pbuf));
ptid = pthread_self();
for (i = 0; i < sizeof(ptid); i++) {
snprintf (&pbuf[i*2], 3, "%02hhx", ((char *)&ptid)[i]);
}
+ if (ArgusModel) {
+ i = 10;
+ while (pthread_mutex_trylock(&ArgusModel->lock)) {
+ if (i-- <= 0) {
+ ArgusLog(LOG_WARNING, "ArgusDebug:bail without printing: no lock");
+ return; /* bail without printing if we can't lock! */
+ }
+ nanosleep(ts, NULL);
+ }
+ }
(void) snprintf (buf, 1024, "%s[%d.%s]: %s ", ArgusProgramName, (int)getpid(), pbuf, print_time(&now));
}
#else
***************
*** 168,177 ****
if (fmt[-1] != '\n')
snprintf (ptr, 2, "\n");
}
- #if defined(ARGUS_THREADS)
- if (ArgusModel)
- pthread_mutex_lock(&ArgusModel->lock);
- #endif
fprintf (stderr, "%s", buf);
#if defined(ARGUS_THREADS)
if (ArgusModel)
--- 179,184 ----
***************
*** 1632,1639 ****
char buf[1024], *ptr = buf;
struct timeval now;
- #ifdef ARGUS_SYSLOG
gettimeofday (&now, 0L);
(void) snprintf (buf, 1024, "%s ", print_time(&now));
ptr = &buf[strlen(buf)];
#else
--- 1639,1646 ----
char buf[1024], *ptr = buf;
struct timeval now;
gettimeofday (&now, 0L);
+ #ifdef ARGUS_SYSLOG
(void) snprintf (buf, 1024, "%s ", print_time(&now));
ptr = &buf[strlen(buf)];
#else
More information about the argus
mailing list