Stability problem update

Carter Bullard carter at qosient.com
Thu Jun 7 11:45:12 EDT 2001


Yes indeed, the possibility for deleting a partially
written record is there.  I've made the fixes to
argus-2.0.2, which I should release in a few weeks.
This shouldn't blow up, but let's test it to see if it
fixes the problem.  I've included the patches below.

This fix involves increasing ArgusMaxListLength and,
when deleting records from a queue, deleting them
from the back instead of from the front.  There is a
more elegant but more complex fix that would address
the underlying problem; this accomplishes the same
thing, though, so let's try it this way.

Carter

Carter Bullard
QoSient, LLC
300 E. 56th Street, Suite 18K
New York, New York  10022

carter at qosient.com
Phone +1 212 588-9133
Fax   +1 212 588-9134
http://qosient.com 


Index: ArgusUtil.c
===================================================================
RCS file: /usr/local/cvsroot/argus/server/ArgusUtil.c,v
retrieving revision 1.77.2.4
diff -r1.77.2.4 ArgusUtil.c
182a183,196
> void *
> ArgusBackList(struct ArgusListStruct *list)
> {
>    void *retn = NULL;
> 
>    if (list->start)
>       retn = list->start->prv->obj;
> 
> #ifdef ARGUSDEBUG
>    ArgusDebug (6, "ArgusBackList (0x%x) returning 0x%x\n", list, retn);
> #endif
> 
>    return (retn);
> }
800c814
< int ArgusMaxListLength = 16384;
---
> int ArgusMaxListLength = 262144;
943,944c957,958
<                      if ((rec = ArgusFrontList(list)) != NULL) {
<                         ArgusPopFrontList(list);
---
>                      if ((rec = ArgusBackList(list)) != NULL) {
>                         ArgusPopBackList(list);
Index: ArgusUtil.h
===================================================================
RCS file: /usr/local/cvsroot/argus/server/ArgusUtil.h,v
retrieving revision 1.25.4.2
diff -r1.25.4.2 ArgusUtil.h
138a139
> void *ArgusBackList (struct ArgusListStruct *);
202a204
> extern void *ArgusBackList (struct ArgusListStruct *);


-----Original Message-----
From: owner-argus-info at lists.andrew.cmu.edu
[mailto:owner-argus-info at lists.andrew.cmu.edu] On Behalf Of Carter
Bullard
Sent: Thursday, June 07, 2001 11:28 AM
To: 'Chris Newton'
Cc: Argus (argus-info)
Subject: RE: Stability problems.


Hey Chris,
   Sounds like you're truncating records on the argus end.
One possibility: you may be forcibly deleting records at
the argus end to control the queue sizes, and running into
a bug with that.  A record is partially written, but because
of queue load, we elect to delete it.  This would not be
good, as only a partial record is written.  The receiving ra
can detect this and recover, but not for a period of time.

   Look in your /var/log/messages for argus messages, especially "Queue
Exceeded Max" messages.  This would indicate that you are throwing
records away.
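To check for this, something like the following works; the sample log line below is fabricated for illustration, and on the sensor itself you would grep the real syslog file instead:

```shell
# Fabricated sample entry; real ones are written by syslogd on the sensor.
printf 'Jun  7 08:43:12 sensor argus[1234]: Queue Exceeded Max\n' > /tmp/messages.sample

# Count occurrences; on the sensor, grep /var/log/messages instead.
grep -c 'Queue Exceeded Max' /tmp/messages.sample
```

A nonzero count means argus has been discarding records under queue pressure.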

   Changing the value of ArgusMaxListLength should help.
Use this patch:

Index: ArgusUtil.c
===================================================================
RCS file: /usr/local/cvsroot/argus/server/ArgusUtil.c,v
retrieving revision 1.77.2.4
diff -r1.77.2.4 ArgusUtil.c
800c800
< int ArgusMaxListLength = 16384;
---
> int ArgusMaxListLength = 262144;


The value doesn't have to be a power of two; I just happen
to like them.  I'll take a look at the delete logic.

Carter

Carter Bullard
QoSient, LLC
300 E. 56th Street, Suite 18K
New York, New York  10022

carter at qosient.com
Phone +1 212 588-9133
Fax   +1 212 588-9134
http://qosient.com


-----Original Message-----
From: Chris Newton [mailto:newton at unb.ca] 
Sent: Thursday, June 07, 2001 8:57 AM
To: Carter Bullard
Subject: Stability problems.


Hi Carter.

  Since I moved into client/server mode, I have had a few bumps of 
instability.

  I'm running the most current code.

  The sensor is a Linux 2.4.x Red Hat 7.1 box: 512 MB RAM,
600 MB swap, dual 800 MHz CPUs.

  The receiving end has dual 1 GHz CPUs and 1.2 GB RAM.  Ra is
running on this, dumping to local files.

  We are monitoring a link with a possible traffic rate of a
full-duplex 100 Mbit connection.

  Sometimes we receive DoS attacks that cause the server to grow
and grow and grow... it doesn't appear to dump its records to the
attached client at a fast enough rate to keep the box from running
out of memory.

  When the server gets into this state, it starts sending invalid
records to the client.  Some of these records have incredible
duration times (one had a 135-year duration).

  Today, I'm not sure what occurred, but:

01-06-07 08:43:34 0.000000 Fs 131 1.4.0.104  <-> 0.144.8.0  991914214
180000 
180000 3459164706 CON


  0 duration.  The other IP involved was 0.144.8.0 (not possible).
The 991M src packets are in 30 seconds..., but only 180K.

  For a number of minutes after an event like this, the ra client
has trouble getting anything meaningful out of the server... often
outputting one flow record file (for 30 seconds) with very few
flows in it, the next file with lots of flows, and so on.


Attached is the flow record file.

Let me know how I can help you track down this problem.

Thanks Carter

_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/

Chris Newton, Systems Analyst
Computing Services, University of New Brunswick
newton at unb.ca 506-447-3212(voice) 506-453-3590(fax)

"The best way to have a good idea is to have a lot of ideas." Linus
Pauling (1901 - 1994) US chemist




