radium stops passing traffic
Jason Carr
jcarr at andrew.cmu.edu
Fri Sep 25 15:31:38 EDT 2009
Once radium is in this state, ra -S localhost:561 accepts connections
but no longer produces any output. Connecting directly to the argus
processes via ra -S 10.10.10.100:561 works fine and does produce output.
I'll check to see if the CPU load is high at the time, but I do not
believe that it is.
- Jason
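
The symptom above (a connection is accepted but nothing arrives) can be
checked without ra. The following is only a rough sketch: the function
names probe and watch_socket are made up for illustration, it counts raw
bytes and does not parse Argus records, and a server may send an initial
record at connect time, so "data" on its own does not prove that flow
records are still moving.

```python
import socket


def watch_socket(sock, wait_seconds):
    """Classify an already connected socket as 'data', 'silent' or 'closed'."""
    sock.settimeout(wait_seconds)
    try:
        chunk = sock.recv(4096)
    except socket.timeout:
        # Nothing arrived within the wait window.
        return "silent"
    except OSError:
        # The peer reset the connection while we were waiting.
        return "closed"
    # An empty read means the peer closed the connection cleanly.
    return "data" if chunk else "closed"


def probe(host, port, wait_seconds=5.0):
    """Connect to host:port and report whether any bytes show up."""
    try:
        sock = socket.create_connection((host, port), timeout=wait_seconds)
    except OSError:
        # Refused, unreachable, or timed out while connecting.
        return "no-connect"
    with sock:
        return watch_socket(sock, wait_seconds)
```

Calling probe("localhost", 561) and probe("10.10.10.100", 561) side by
side would mirror the two ra commands above; "silent" from the first and
"data" from the second would match the state described.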
On Sep 25, 2009, at 1:33 PM, Carter Bullard wrote:
> Hey Jason,
> There are a few reasons why radium could stop; most should leave a
> trail in a syslog file somewhere. The most likely is that radium() has
> terminated the connection to a client that isn't reading fast enough.
> Radium figures this out because its output queue gets too big, assumes
> the remote has gone or is too slow, and gives up. Generally, the client
> (which can be configured to retry the connection) gets a shutdown
> message, and then reattaches and all is goodness again.
>
> Of course there could be bugs anywhere in this logic. So when this
> happens there are a few quick questions.
>
> Does radium have a connection to the client, but no data is being
> passed? If this is the case, we definitely have a bug and the best
> strategy is to attach to the running radium() with gdb() and step
> through to see what the problem is.
>
> Is radium still reading records from the remote argi? If not, radium
> may be fine but there isn't any data to transmit, or radium has lost
> its connections from the argi because it isn't processing fast enough
> itself to keep up with the record load.
>
> When radium isn't passing records, is it responding to additional
> connection requests? This would test if the radium output thread is
> completely dead, or spinning in a loop.
>
> So check if radium() is living. Use netstat -na to see if the
> remote(s) still have active connections. Check out the load to see if
> radium() is chewing up 100% of one of the processors (infinite loop),
> and of course, check to see if there are any syslog messages from
> radium() indicating if the queue limit is reached or if it's
> disconnecting or whatever.
>
> If this gets to be too much, let's see if I can log on and make some
> sense of it?
>
>
> Carter
>
> On Sep 25, 2009, at 1:12 PM, Jason Carr wrote:
>
>> Hi guys,
>>
>> Here's what's going on right now. We've got two argus processes
>> running on our Bivio unit plus two radium processes. One radium
>> process runs on the Bivio to essentially multiplex and forward to
>> the external radium process that runs on our long term storage
>> machine.
>>
>> What happens currently is that after a variable amount of time the
>> radium running on the Bivio stops passing the traffic to the external
>> radium process. Killing radium and restarting it fixes the problem
>> immediately. Running radium in debug mode (-D 999) yields a 6.5G
>> output file, so I don't think I'll be sending that one along.
>>
>> How much debugging needs to be on to get a good understanding of
>> why this radium process would stop passing traffic?
>>
>> Thanks,
>>
>> Jason
>>
>>
>>
>
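
The slow-client handling Carter describes in his reply (radium gives up
on a client once its output queue grows too large) can be pictured with
a small model. This is not radium's code: the class name ClientQueue,
the limit parameter, and the drain method are all hypothetical, and the
real queue limit and shutdown handling may differ.

```python
from collections import deque


class ClientQueue:
    """Toy model of a per-client output queue with a size limit."""

    def __init__(self, limit):
        self.limit = limit        # hypothetical maximum number of queued records
        self.records = deque()
        self.connected = True

    def enqueue(self, record):
        """Queue a record; drop the client if the queue is already full."""
        if not self.connected:
            return False
        if len(self.records) >= self.limit:
            # The reader is too slow or gone: discard the backlog and
            # mark the client as disconnected.
            self.records.clear()
            self.connected = False
            return False
        self.records.append(record)
        return True

    def drain(self, count):
        """Simulate the client reading up to count records."""
        sent = []
        while self.records and len(sent) < count:
            sent.append(self.records.popleft())
        return sent
```

In this model a client that drains more slowly than records are queued
ends up with connected set to False, which corresponds to the point at
which the real client would need to reattach.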