radium stops passing traffic
Jason Carr
jcarr at andrew.cmu.edu
Mon Sep 28 14:48:41 EDT 2009
I checked on the process: it is not using 100% CPU when it gets to the
point of no longer working.
What other information might you need?
Thanks,
Jason
On Sep 25, 2009, at 3:31 PM, Jason Carr wrote:
> Once radium is in this state, ra -S localhost:561 accepts connections
> but no longer produces any output. Connecting to the arguses works
> fine via ra -S 10.10.10.100:561 and does produce output.
>
> I'll check to see if the CPU load is high at the time, but I do not
> believe that it is.
>
> - Jason
>
> On Sep 25, 2009, at 1:33 PM, Carter Bullard wrote:
>
>> Hey Jason,
>> There are a few reasons why radium could stop; most should leave a
>> trail in a syslog file somewhere. The most likely is that radium()
>> has terminated the connection to a client that isn't reading fast
>> enough. Radium figures this out because its output queue gets too
>> big, assumes the remote has gone or is too slow, and gives up.
>> Generally, the client (which can be configured to retry the
>> connection) gets a shutdown message, and then reattaches and all is
>> goodness again.
>>
>> Of course there could be bugs anywhere in this logic. So when this
>> happens there are a few quick questions.
>>
>> Does radium have a connection to the client, but no data is being
>> passed? If this is the case, we definitely have a bug and the best
>> strategy is to attach to the running radium() with gdb() and step
>> through to see what the problem is.
>>
>> Is radium still reading records from the remote argi? If not, radium
>> may be fine but have no data to transmit, or radium may have lost its
>> connections from the argi because it isn't processing fast enough
>> itself to keep up with the record load.
>>
>> When radium isn't passing records, is it responding to additional
>> connection requests? This would test if the radium output thread is
>> completely dead, or spinning in a loop.
>>
>> So check if radium() is living. Use netstat -na to see if the
>> remote(s) still have active connections. Check the load to see if
>> radium() is chewing up 100% of one of the processors (infinite loop),
>> and of course, check to see if there are any syslog messages from
>> radium() indicating if the queue limit is reached or if it's
>> disconnecting or whatever.
>>
>> If this gets to be too much, let's see if I can log on and make some
>> sense of it.
>>
>>
>> Carter
>>
>> On Sep 25, 2009, at 1:12 PM, Jason Carr wrote:
>>
>>> Hi guys,
>>>
>>> Here's what's going on right now. We've got two argus processes
>>> running on our Bivio unit plus two radium processes. One radium
>>> process runs on the Bivio to essentially multiplex and forward to
>>> the external radium process that runs on our long term storage
>>> machine.
>>>
>>> What happens currently is that after a variable amount of time the
>>> radium running on Bivio stops passing the traffic to the external
>>> radium process. Killing radium and restarting it fixes the
>>> problem immediately. Running radium in debug mode (-D 999) yields
>>> a 6.5G output file, so I don't think I'll be sending that one along.
>>>
>>> How much debugging needs to be on to get a good understanding of
>>> why this radium process would stop passing traffic?
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>>
>>>
>>
>
>
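One of the checks suggested in the quoted thread is to look at netstat -na and see whether the remotes still have active connections. A minimal sketch of that check in Python follows. It is not part of argus or radium. The names port_of and established_connections are hypothetical, port 561 is taken from the ra -S examples above, and the column layout it assumes (proto, recv-q, send-q, local, remote, state) matches common Linux and BSD netstat output but may differ on other platforms.

```python
ARGUS_PORT = 561  # port used in the thread's "ra -S host:561" examples


def port_of(address):
    """Return the numeric port of a netstat address field, or None.

    Linux prints "10.10.10.100:561"; BSD-style netstat prints
    "10.10.10.100.561". Wildcards such as "0.0.0.0:*" yield None.
    """
    for sep in (":", "."):
        _, found, port = address.rpartition(sep)
        if found and port.isdigit():
            return int(port)
    return None


def established_connections(netstat_text, port=ARGUS_PORT):
    """Return (local, remote) pairs for ESTABLISHED TCP sockets on `port`.

    `netstat_text` is the captured output of "netstat -na".
    """
    pairs = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local remote state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        if port in (port_of(local), port_of(remote)):
            pairs.append((local, remote))
    return pairs
```

To use it, capture the output of netstat -na on the machine running radium (for example with subprocess.run) and pass the text to established_connections. An empty result while radium is still running would suggest that the connections to the arguses or to the downstream radium have been dropped; a non-empty result with no records flowing points back at the "connected but no data" case described in the thread.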