radium stops passing traffic

Jason Carr jcarr at andrew.cmu.edu
Fri Sep 25 15:31:38 EDT 2009


Once radium is in this state, ra -S localhost:561 accepts connections
but no longer produces any output.  Connecting to the arguses works
fine via ra -S 10.10.10.100:561 and does produce output.

I'll check to see if the CPU load is high at the time, but I do not  
believe that it is.

- Jason

On Sep 25, 2009, at 1:33 PM, Carter Bullard wrote:

> Hey Jason,
> There are a few reasons why radium could stop; most should leave a trail in
> a syslog file somewhere.  The most likely is that radium() has terminated the
> connection to a client that isn't reading fast enough.  Radium figures this out
> because its output queue gets too big, assumes the remote has gone or is too
> slow, and gives up.  Generally, the client (which can be configured to retry
> the connection) gets a shutdown message, and then reattaches and all is
> goodness again.
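
To make sure I follow the queue-limit behaviour described above, here is a
minimal sketch of the idea in Python.  This is not radium's code: the class
name, the max_records limit and the reconnect() helper are all hypothetical,
and the real limit and shutdown handling may work quite differently.

```python
from collections import deque


class ClientOutputQueue:
    """Hypothetical per-client output queue with a hard size limit."""

    def __init__(self, max_records):
        self.max_records = max_records  # assumed limit, not a real radium setting
        self.records = deque()
        self.connected = True

    def enqueue(self, record):
        """Queue a record; drop the client if the backlog hits the limit."""
        if not self.connected:
            return False
        if len(self.records) >= self.max_records:
            # The reader is gone or too slow: discard the backlog and give up.
            self.records.clear()
            self.connected = False
            return False
        self.records.append(record)
        return True

    def drain(self, n):
        """Simulate the client reading up to n queued records."""
        taken = []
        while self.records and len(taken) < n:
            taken.append(self.records.popleft())
        return taken

    def reconnect(self):
        """Simulate a client that retries after being shut down."""
        self.connected = True
```

If the real logic is roughly like this, a client that never reattaches after
being dropped would look exactly like "radium stopped passing traffic".
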
>
> Of course there could be bugs anywhere in this logic.  So when this happens
> there are a few quick questions.
>
> Does radium have a connection to the client, but no data is being passed?  If this
> is the case, we definitely have a bug and the best strategy is to attach to the running
> radium() with gdb() and step through to see what the problem is.
>
> Is radium still reading records from the remote argi?  If not, radium may be fine, but
> there isn't any data to transmit, or radium has lost its connections from the argi, as
> it isn't processing fast enough itself to keep up with the record load.
>
> When radium isn't passing records, is it responding to additional connection
> requests?  This would test if the radium output thread is completely dead,
> or spinning in a loop.
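
A rough way to script the "does it still accept connections" check is
sketched below.  The function name accepts_tcp is my own, and the host and
port are whatever radium is listening on (localhost:561 in my case).

```python
import socket


def accepts_tcp(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

One caveat: a successful connect only shows that the kernel completed the TCP
handshake on the listening socket.  A process that is stuck and never calls
accept() can still appear to accept connections this way, so a True result
alone does not prove the output thread is alive.
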
>
> So check if radium() is living.  Use netstat -na to see if the remote(s) still have
> active connections.  Check out the load to see if radium() is chewing up 100% of
> one of the processors (infinite loop), and of course, check to see if there are any
> syslog messages from radium() indicating if the queue limit is reached or if
> it's disconnecting or whatever.
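
For the netstat part of this checklist, a small filter like the one below
could pull out just the connections on the radium port.  It is only a sketch:
it assumes Linux-style "netstat -na" columns (Proto, Recv-Q, Send-Q, local
address, foreign address, state) with addresses written as host:port, and the
function name connections_on_port is made up.  BSD-style output, which writes
host.port, would need a different match.

```python
def connections_on_port(netstat_text, port):
    """Return TCP entries from `netstat -na` output that involve `port`."""
    suffix = ":" + str(port)
    found = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, UDP/unix sockets and anything without a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        proto, recv_q, send_q, local, remote, state = fields[:6]
        if not (recv_q.isdigit() and send_q.isdigit()):
            continue
        if local.endswith(suffix) or remote.endswith(suffix):
            found.append({
                "local": local,
                "remote": remote,
                "state": state,
                "recv_q": int(recv_q),
                "send_q": int(send_q),
            })
    return found
```

Feeding it the captured text of "netstat -na" with port 561 should show
whether the remotes are still ESTABLISHED.  A Send-Q that stays large on an
established connection would suggest the peer is not reading the data.
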
>
> If this gets to be too much, let's see if I can log on and make some sense of it.
>
>
> Carter
>
> On Sep 25, 2009, at 1:12 PM, Jason Carr wrote:
>
>> Hi guys,
>>
>> Here's what's going on right now.  We've got two argus processes  
>> running on our Bivio unit plus two radium processes.  One radium  
>> process runs on the Bivio to essentially multiplex and forward to  
>> the external radium process that runs on our long-term storage
>> machine.
>>
>> What happens currently is that after a variable amount of time the
>> radium running on the Bivio stops passing traffic to the external
>> radium process.  Killing radium and restarting it fixes the problem
>> immediately.  Running radium in debug mode (-D 999) yields a 6.5G
>> output file, so I don't think I'll be sending that one along.
>>
>> How much debugging needs to be on to get a good understanding of  
>> why this radium process would stop passing traffic?
>>
>> Thanks,
>>
>> Jason
>>
>>
>>
>



