radium stops passing traffic

Carter Bullard carter at qosient.com
Fri Sep 25 13:33:28 EDT 2009


Hey Jason,
There are a few reasons why radium could stop, and most should leave a
trail in a syslog file somewhere. The most likely is that radium() has
terminated the connection to a client that isn't reading fast enough.
Radium figures this out because its output queue gets too big; it
assumes the remote has gone away or is too slow, and gives up.
Generally, the client (which can be configured to retry the connection)
gets a shutdown message, then reattaches, and all is goodness again.
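The disconnect-on-backlog behavior described above can be modeled in a
few lines. This is a toy Python sketch, not radium's code: ClientQueue,
MAX_QUEUED, and the method names are hypothetical, and the cap of 4
records is arbitrary. It shows why a client that stops reading gets
dropped while one that keeps up stays attached, and why reattaching
clears the condition.

```python
from collections import deque


class ClientQueue:
    """Hypothetical model of a per-client output queue with a size cap.

    MAX_QUEUED is an assumed limit for illustration, not radium's value.
    """

    MAX_QUEUED = 4

    def __init__(self):
        self.pending = deque()
        self.connected = True

    def enqueue(self, record):
        """Queue a record; drop the client if the backlog exceeds the cap."""
        if not self.connected:
            return False
        self.pending.append(record)
        if len(self.pending) > self.MAX_QUEUED:
            # Backlog too large: assume the reader is gone or too slow,
            # discard what is queued, and shut the connection down.
            self.pending.clear()
            self.connected = False
            return False
        return True

    def drain(self, n):
        """Simulate the client reading up to n queued records."""
        out = []
        while self.pending and len(out) < n:
            out.append(self.pending.popleft())
        return out

    def reconnect(self):
        """A client configured to retry reattaches with an empty queue."""
        self.pending.clear()
        self.connected = True
```

A client that never calls drain() is cut off on the fifth record; one
that reads each record as it arrives never reaches the cap.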

Of course, there could be bugs anywhere in this logic, so when this
happens there are a few quick questions to ask.

Does radium have a connection to the client, but no data is being
passed? If this is the case, we definitely have a bug, and the best
strategy is to attach to the running radium() with gdb and step through
to see what the problem is.

Is radium still reading records from the remote argi? If not, radium
may be fine but has no data to transmit, or radium has lost its
connections from the argi because it isn't processing fast enough to
keep up with the record load.

When radium isn't passing records, is it responding to additional
connection requests? This would test whether the radium output thread
is completely dead or spinning in a loop.
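That question can be asked from another shell with a TCP probe. A
minimal Python sketch, assuming you know the host and port radium
listens on; probe() is a hypothetical helper that reports whether the
connection completed and whether any bytes arrived before the timeout.
A completed connection alone is weak evidence, because the kernel can
finish the TCP handshake from the listen backlog even if radium never
calls accept(). Whether radium sends something promptly to a newly
attached client is an assumption here, so treat "connected but no data"
as a hint rather than a verdict.

```python
import socket


def probe(host, port, timeout=5.0):
    """Return (connected, got_data) for a TCP probe of host:port.

    Hypothetical helper; assumes the server sends data soon after a
    client attaches, which may not hold for every radium setup.
    """
    try:
        # create_connection() also applies the timeout to later recv() calls.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or the handshake itself timed out.
        return (False, False)
    with sock:
        try:
            # recv() returns b"" if the peer closed without sending anything.
            return (True, len(sock.recv(1)) > 0)
        except OSError:
            # socket.timeout is an OSError: connected, but nothing arrived.
            return (True, False)
```

A result of (True, False) against a radium that normally streams
records points toward a stuck accept or output path rather than a dead
process.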

So check whether radium() is still alive. Use netstat -na to see if the
remote(s) still have active connections. Check the load to see if
radium() is chewing up 100% of one of the processors (an infinite
loop), and of course, check for any syslog messages from radium()
indicating that the queue limit was reached, that it is disconnecting,
or whatever.
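The netstat part of that check can be scripted. A minimal Python
sketch: established() is a hypothetical helper that filters netstat -na
output for ESTABLISHED TCP connections touching a given port and
returns the Recv-Q and Send-Q columns. The parsing assumes the common
six-column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign
Address, State) and handles both the host:port form printed on Linux
and the host.port form printed on BSD/macOS. A Send-Q that keeps
growing on radium's side of a client connection would be consistent
with the slow-reader case described at the top.

```python
import re


def _port_of(addr):
    """Extract the port from 'host:port' (Linux) or 'host.port' (BSD)."""
    return re.split(r"[:.]", addr)[-1]


def established(netstat_text, port):
    """Return (local, foreign, recv_q, send_q) for ESTABLISHED TCP rows on port.

    Hypothetical helper; assumes six whitespace-separated columns per row.
    """
    rows = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, UDP and Unix-socket rows, and short lines.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, send_q, local, foreign, state = fields[:6]
        if state != "ESTABLISHED":
            continue
        if str(port) in {_port_of(local), _port_of(foreign)}:
            rows.append((local, foreign, int(recv_q), int(send_q)))
    return rows
```

To use it live, feed it subprocess.run(["netstat", "-na"],
capture_output=True, text=True).stdout along with the port your radium
listens on, and run it a few times to see whether Send-Q is climbing.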

If this gets to be too much, let's see if I can log on and make some
sense of it.


Carter

On Sep 25, 2009, at 1:12 PM, Jason Carr wrote:

> Hi guys,
>
> Here's what's going on right now.  We've got two argus processes  
> running on our Bivio unit plus two radium processes.  One radium  
> process runs on the Bivio to essentially multiplex and forward to  
> the external radium process that runs on our long term storage  
> machine.
>
> What happens currently is that after a dynamic amount of time the  
> radium running on Bivio stops passing the traffic to the external  
> radium process.  Killing radium and restarting it fixes the problem  
> immediately.  Running radium in debug mode (-D 999) yields a 6.5G  
> output file, so I don't think I'll be sending that one along.
>
> How much debugging needs to be on to get a good understanding of why  
> this radium process would stop passing traffic?
>
> Thanks,
>
> Jason
>
>
>
