Radium dropping connections to argi

Phillip G Deneault deneault at WPI.EDU
Fri Mar 11 09:12:44 EST 2011


Over 50

Thanks,
Phil

On Thu, 10 Mar 2011, Carter Bullard wrote:

> Hey Phillip,
> How many remote clients are you connecting to?
> Should not be an issue but you never know.
> Carter
>
>
>
> On Mar 10, 2011, at 2:38 PM, Phillip Deneault <deneault at WPI.EDU> wrote:
>
>> I managed to repeat this problem with a sniffer running.  It didn't turn
>> up as much useful information as I would have liked.
>>
>> For my test, I set up all of my sensors to restart argus once a day at
>> the same time via an init.d stop/start and set my tcpdump filter to look
>> like this:
>> port 561 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0
>>
>> I can see two sets of shutdowns, one on the 9th for which all my sensors
>> came back, and one on the 10th when only 17 came back.  All the Argus
>> daemons did restart and come online, but basically for some hosts,
>> radium never even attempted to restart the connection.
>>
>> tcpdump available upon request.
>>
>> Thanks,
>> Phil
>>
>>
>> On 3/5/2011 6:08 PM, Phillip G Deneault wrote:
>>> Actually, I spoke too soon.  It happened again last night after not
>>> happening for weeks.  I'm going to see if I can simulate this behavior
>>> tomorrow or Monday and try to get a packet capture of the behavior as it
>>> occurs.
>>>
>>> Thanks,
>>> Phil
>>>
>>> On Sat, 5 Mar 2011, Carter Bullard wrote:
>>>
>>>> Excellent !!!!   That is great news !!!!
>>>> Carter
>>>>
>>>> On Mar 4, 2011, at 1:29 PM, Phillip Deneault <deneault at WPI.EDU> wrote:
>>>>
>>>>> Carter,
>>>>>
>>>>> I didn't forget about you.  I've been letting this run for a while then
>>>>> ran it for a little more when you released .23.  It seems to have fixed
>>>>> the bug as I still have not had any problems.
>>>>>
>>>>> Thanks,
>>>>> Phil
>>>>>
>>>>> On 2/4/2011 4:10 PM, Carter Bullard wrote:
>>>>>> Hey Phillip,
>>>>>> I did find a problem, and this patch should fix radium() apparently
>>>>>> not attempting to reconnect after a while.  I've got it in the
>>>>>> distribution but give it a try on your machine to see if it doesn't
>>>>>> correct the problem.
>>>>>>
>>>>>> Carter
>>>>>>
>>>>>> ==== //depot/argus/clients/common/argus_client.c#204 -
>>>>>> /Users/carter/argus/clients/common/argus_client.c ====
>>>>>> 2523a2524,2525
>>>>>>>
>>>>>>>                    input->status &= ~ARGUS_CLOSED;
>>>>>>
>>>>>>
>>>>>> On Feb 4, 2011, at 4:01 PM, Carter Bullard wrote:
>>>>>>
>>>>>>> Hey Phillip,
>>>>>>> radium() doesn't have a retry counter; it should keep trying every
>>>>>>> 5 seconds if threaded and every 1 second if non-threaded, and it
>>>>>>> should try forever.  I've recreated a problem where radium() loses
>>>>>>> the connection after the far side has gone away a few times, so
>>>>>>> I'm working on this now.
>>>>>>>
>>>>>>> Carter
>>>>>>>
>>>>>>>
>>>>>>> On Feb 4, 2011, at 12:11 PM, Phillip Deneault wrote:
>>>>>>>
>>>>>>>> On 1/31/2011 1:59 PM, Phillip Deneault wrote:
>>>>>>>>> This might muddle the issue, but I'm having an odd issue with radium
>>>>>>>>> too.  The longer radium is running, the fewer and fewer records seem
>>>>>>>>> to be recorded.  It appears that the radium instance loses its
>>>>>>>>> connection to the argi one at a time and doesn't keep retrying and
>>>>>>>>> doesn't throw an error in the logs about any sort of failed
>>>>>>>>> connection.
>>>>>>>>>
>>>>>>>>> There are quite a few nodes I'm connecting to (all on the local
>>>>>>>>> LAN), and this was happening in the 3.0.2 version of argus-clients
>>>>>>>>> as well as 3.0.3.21 (which I am running now).  I'm running
>>>>>>>>> CentOS 5.5.
>>>>>>>>>
>>>>>>>>> I'm upping the debug level and trying to figure this out, but can
>>>>>>>>> anyone else confirm they see this problem?
>>>>>>>>
>>>>>>>> So I'll assume no one else is seeing this problem.
>>>>>>>>
>>>>>>>> Yesterday we had some network interruption and a number of the nodes
>>>>>>>> once again got disconnected from the radium instance.  It appears the
>>>>>>>> radium instance tried to reconnect 10 times, all with a 'no route to
>>>>>>>> host' before it appeared to stop retrying.
>>>>>>>>
>>>>>>>> The number 10 sounds to me like a nice round number; is this a
>>>>>>>> hardcoded retry count in radium?
>>>>>>>>
>>>>>>>> I might be getting ahead of myself, but should I instead use the -p
>>>>>>>> option to kill radium if I drop a connection and use a process
>>>>>>>> monitor to restart it?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Phil
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>
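
For anyone reading this thread later, the one-line patch quoted above
(input->status &= ~ARGUS_CLOSED;) can be illustrated with a small sketch.
This is not the argus-clients code. It assumes, without confirmation from
the source, that the retry logic skips any input still flagged as closed,
so a flag that is never cleared would silently end the retries. Every name
below (SKETCH_CLOSED, SKETCH_ACTIVE, struct sketch_input, sketch_simulate)
is hypothetical and exists only for this illustration.

```c
/* Hypothetical status bits; the real ARGUS_CLOSED value is defined in the
 * argus-clients headers and is not reproduced here. */
#define SKETCH_CLOSED 0x01u
#define SKETCH_ACTIVE 0x02u

struct sketch_input {
    unsigned int status;   /* bit flags describing the remote source */
    int attempts;          /* reconnect attempts actually made */
};

/* The far side went away: mark the input closed and no longer active. */
static void sketch_close(struct sketch_input *in)
{
    in->status &= ~SKETCH_ACTIVE;
    in->status |= SKETCH_CLOSED;
}

/* Put the input back on the retry list.  With clear_closed set, this
 * mirrors the shape of the patch: status &= ~CLOSED. */
static void sketch_requeue(struct sketch_input *in, int clear_closed)
{
    if (clear_closed)
        in->status &= ~SKETCH_CLOSED;
}

/* Assumption: the retry pass ignores inputs still flagged closed. */
static int sketch_should_retry(const struct sketch_input *in)
{
    return (in->status & SKETCH_CLOSED) == 0;
}

/* Simulate `drops` disconnects and return how many reconnects were tried. */
int sketch_simulate(int clear_closed, int drops)
{
    struct sketch_input in = { SKETCH_ACTIVE, 0 };

    for (int i = 0; i < drops; i++) {
        sketch_close(&in);
        sketch_requeue(&in, clear_closed);
        if (sketch_should_retry(&in)) {
            in.attempts++;
            in.status |= SKETCH_ACTIVE;   /* pretend the reconnect worked */
        }
    }
    return in.attempts;
}
```

Calling sketch_simulate(1, 3) models three drops with the flag cleared on
each requeue, and sketch_simulate(0, 3) models the same drops with the flag
left set. Under the stated assumption, only the first case keeps attempting
to reconnect, which matches the symptom of no retries and no logged error.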



