Stability problem update
Carter Bullard
carter at qosient.com
Thu Jun 7 11:45:12 EDT 2001
Yes indeed, the possibility of deleting a partially
written record is there. I've made the fixes in
argus-2.0.2, which I should release in a few weeks.
This shouldn't blow up, but let's test it to see
if it fixes the problem. I've included the patches below.
The fix involves increasing ArgusMaxListLength and,
when deleting records from a queue, deleting them
from the back instead of from the front. There is a
more elegant but more complex fix that would address
the real underlying problem, but this accomplishes the
same thing, so let's try it this way.
Carter
Carter Bullard
QoSient, LLC
300 E. 56th Street, Suite 18K
New York, New York 10022
carter at qosient.com
Phone +1 212 588-9133
Fax +1 212 588-9134
http://qosient.com
Index: ArgusUtil.c
===================================================================
RCS file: /usr/local/cvsroot/argus/server/ArgusUtil.c,v
retrieving revision 1.77.2.4
diff -r1.77.2.4 ArgusUtil.c
182a183,196
> void *
> ArgusBackList(struct ArgusListStruct *list)
> {
> void *retn = NULL;
>
> if (list->start)
> retn = list->start->prv->obj;
>
> #ifdef ARGUSDEBUG
> ArgusDebug (6, "ArgusBackList (0x%x) returning 0x%x\n", list, retn);
> #endif
>
> return (retn);
> }
800c814
< int ArgusMaxListLength = 16384;
---
> int ArgusMaxListLength = 262144;
943,944c957,958
< if ((rec = ArgusFrontList(list)) != NULL) {
< ArgusPopFrontList(list);
---
> if ((rec = ArgusBackList(list)) != NULL) {
> ArgusPopBackList(list);
Index: ArgusUtil.h
===================================================================
RCS file: /usr/local/cvsroot/argus/server/ArgusUtil.h,v
retrieving revision 1.25.4.2
diff -r1.25.4.2 ArgusUtil.h
138a139
> void *ArgusBackList (struct ArgusListStruct *);
202a204
> extern void *ArgusBackList (struct ArgusListStruct *);
-----Original Message-----
From: owner-argus-info at lists.andrew.cmu.edu
[mailto:owner-argus-info at lists.andrew.cmu.edu] On Behalf Of Carter
Bullard
Sent: Thursday, June 07, 2001 11:28 AM
To: 'Chris Newton'
Cc: Argus (argus-info)
Subject: RE: Stability problems.
Hey Chris,
Sounds like you're truncating records on the argus end.
One possibility: you are probably forcibly deleting
records at the argus end to control the queue sizes, and
you may be running into a bug with that. A record is partially written,
but because of queue load, we elect to delete it. This would not be
good, as only a partial record is written. The receiving ra can detect
this and recover, but not for a period of time.
Look in your /var/log/messages for argus messages, especially "Queue
Exceeded Max" messages. This would indicate that you are throwing
records away.
Changing the value of ArgusMaxListLength should help.
Use this patch:
Index: ArgusUtil.c
===================================================================
RCS file: /usr/local/cvsroot/argus/server/ArgusUtil.c,v
retrieving revision 1.77.2.4
diff -r1.77.2.4 ArgusUtil.c
800c800
< int ArgusMaxListLength = 16384;
---
> int ArgusMaxListLength = 262144;
The value doesn't have to be a power of two, I just happen
to like them. I'll take a look at the delete logic.
Carter
Carter Bullard
QoSient, LLC
300 E. 56th Street, Suite 18K
New York, New York 10022
carter at qosient.com
Phone +1 212 588-9133
Fax +1 212 588-9134
http://qosient.com
-----Original Message-----
From: Chris Newton [mailto:newton at unb.ca]
Sent: Thursday, June 07, 2001 8:57 AM
To: Carter Bullard
Subject: Stability problems.
Hi Carter.
Since I moved into client/server mode, I have had a few bumps of
instability. I'm running the most current code.
The sensor is a Linux 2.4.x Red Hat 7.1 box: 512 MB RAM, 600 MB swap,
dual 800 MHz CPUs.
The receiving end has dual 1 GHz CPUs and 1.2 GB RAM. Ra is running on
this, dumping to local files.
We are monitoring a link with a possible traffic rate of a full-duplex
100 Mbit connection.
Sometimes we are receiving DoS attacks that cause the server to grow
and grow and grow... it doesn't appear to dump its records to the
attached client at a fast enough rate to keep the box from running out
of memory.
When the server gets into this state, it starts sending invalid records
to the client. Some of these records have incredible durations (one had
a 135-year duration).
Today, I'm not sure what occurred, but:
01-06-07 08:43:34 0.000000 Fs 131 1.4.0.104 <-> 0.144.8.0 991914214 180000 180000 3459164706 CON
0 duration. The other IP involved was 0.144.8.0 (not possible). The
991M src packets is in 30 seconds... but only 180K.
For a number of minutes after an event like this, the ra client has
trouble getting anything meaningful out of the server, often outputting
a flow record file (for 30 seconds) with very few flows in it, the next
file with lots of flows, and so on.
Attached is the flow record file.
Let me know how I can help you track down this problem.
Thanks, Carter
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
Chris Newton, Systems Analyst
Computing Services, University of New Brunswick
newton at unb.ca 506-447-3212(voice) 506-453-3590(fax)
"The best way to have a good idea is to have a lot of ideas." Linus
Pauling (1901 - 1994) US chemist