radium stops passing traffic

Jason Carr jcarr at andrew.cmu.edu
Mon Sep 28 14:48:41 EDT 2009


I checked the process: it is not using 100% of a CPU when it reaches
the point where it stops working.

What other information might you need?

Thanks,

Jason
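
For reference, the "accepts connections but produces no output" state
described below can be checked for without ra. This is only a rough
sketch: probe() is a made-up helper, it knows nothing about the Argus
record format, and it assumes the server sends bytes without being
asked first, which may not hold for radium. It only reports whether a
TCP connect succeeds and whether any bytes arrive before a timeout.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP endpoint as 'unreachable', 'silent', 'closed' or 'data'.

    Hypothetical helper for illustration; not part of the argus tools.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            # Connected, but no bytes arrived within the timeout.
            return "silent"
        except OSError:
            return "closed"
        # An empty read means the peer closed the connection.
        return "data" if chunk else "closed"


if __name__ == "__main__":
    # localhost:561 is the radium address used in this thread.
    print(probe("localhost", 561))
```

A result of "silent" would correspond to the state I am seeing, while
"unreachable" would suggest the process is no longer listening at all.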

On Sep 25, 2009, at 3:31 PM, Jason Carr wrote:

> Once radium is in this state, a connection from ra -S localhost:561
> is still accepted, but it no longer produces any output.  Connecting
> to the arguses directly via ra -S 10.10.10.100:561 works fine and
> does produce output.
>
> I'll check to see if the CPU load is high at the time, but I do not  
> believe that it is.
>
> - Jason
>
> On Sep 25, 2009, at 1:33 PM, Carter Bullard wrote:
>
>> Hey Jason,
>> There are a few reasons why radium could stop, most should leave a
>> trail in a syslog file somewhere.  The most likely is that radium()
>> has terminated the connection to a client that isn't reading fast
>> enough.  Radium figures this out because its output queue gets too
>> big, assumes the remote has gone or is too slow, and gives up.
>> Generally, the client (which can be configured to retry the
>> connection) gets a shutdown message, and then reattaches and all is
>> goodness again.
>>
>> Of course there could be bugs anywhere in this logic.  So when this
>> happens there are a few quick questions.
>>
>> Does radium have a connection to the client, but no data is being
>> passed?  If this is the case, we definitely have a bug and the best
>> strategy is to attach to the running radium() with gdb() and step
>> through to see what the problem is.
>>
>> Is radium still reading records from the remote argi?  If not,
>> radium may be fine but there isn't any data to transmit, or radium
>> has lost its connections to the argi because it isn't processing
>> fast enough itself to keep up with the record load.
>>
>> When radium isn't passing records, is it responding to additional
>> connection requests?  This would test if the radium output thread is
>> completely dead, or spinning in a loop.
>>
>> So check if radium() is living.  Use netstat -na to see if the
>> remote(s) still have active connections.  Check the load to see if
>> radium() is chewing up 100% of one of the processors (infinite
>> loop), and of course, check to see if there are any syslog messages
>> from radium() indicating that the queue limit was reached, or that
>> it's disconnecting, or whatever.
>>
>> If this gets to be too much, let's see if I can log on and make some
>> sense of it.
>>
>>
>> Carter
>>
>> On Sep 25, 2009, at 1:12 PM, Jason Carr wrote:
>>
>>> Hi guys,
>>>
>>> Here's what's going on right now.  We've got two argus processes  
>>> running on our Bivio unit plus two radium processes.  One radium  
>>> process runs on the Bivio to essentially multiplex and forward to  
>>> the external radium process that runs on our long term storage  
>>> machine.
>>>
>>> What happens currently is that after a variable amount of time the
>>> radium running on the Bivio stops passing traffic to the external
>>> radium process.  Killing radium and restarting it fixes the
>>> problem immediately.  Running radium in debug mode (-D 999) yields
>>> a 6.5G output file, so I don't think I'll be sending that one along.
>>>
>>> How much debugging needs to be on to get a good understanding of  
>>> why this radium process would stop passing traffic?
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>>
>>>
>>
>
>
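
The queue-limit behavior Carter describes above (the output queue for
a client grows too big, so the slow client is dropped and has to
reattach) can be sketched roughly as follows. This is a toy model
only, not radium's actual code: the ClientQueue class, its limit
parameter, and its method names are all invented for illustration.

```python
from collections import deque


class ClientQueue:
    """Toy per-client output queue with a hard size limit (hypothetical)."""

    def __init__(self, limit):
        self.limit = limit        # maximum number of records allowed to wait
        self.records = deque()
        self.connected = True

    def enqueue(self, record):
        """Queue a record; drop the client if the backlog exceeds the limit."""
        if not self.connected:
            return False
        self.records.append(record)
        if len(self.records) > self.limit:
            # The client is not reading fast enough: give up on it.
            self.records.clear()
            self.connected = False
            return False
        return True

    def drain(self, count):
        """Simulate the client reading up to `count` queued records."""
        n = min(count, len(self.records))
        return [self.records.popleft() for _ in range(n)]

    def reconnect(self):
        """Simulate the client reattaching after being dropped."""
        self.connected = True


fast = ClientQueue(limit=3)
slow = ClientQueue(limit=3)
for i in range(10):
    fast.enqueue(i)
    fast.drain(1)      # this client keeps up with the record stream
    slow.enqueue(i)    # this client never reads anything

print(fast.connected, slow.connected)
```

In this model the slow client is dropped as soon as its backlog passes
the limit, and it receives nothing further until it reattaches. If
reattaching never happens, the result would look like a radium that
has stopped passing traffic.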



