Radium dropping connections to argi
Carter Bullard
carter at qosient.com
Fri Mar 11 11:02:08 EST 2011
Hey Phillip,
50 is a good number. I'll go through the code to make sure that we are generating
messages to notify you of issues with restarting connections. I'll put that in argus-clients-3.0.5
as soon as we release today, and we'll get started on this problem on Monday.
Carter
On Mar 11, 2011, at 9:12 AM, Phillip G Deneault wrote:
> Over 50
>
> Thanks,
> Phil
>
> On Thu, 10 Mar 2011, Carter Bullard wrote:
>
>> Hey Phillip,
>> How many remote clients are you connecting to?
>> Should not be an issue but you never know.
>> Carter
>>
>>
>>
>> On Mar 10, 2011, at 2:38 PM, Phillip Deneault <deneault at WPI.EDU> wrote:
>>
>>> I managed to repeat this problem with a sniffer running. It didn't turn
>>> up as much useful information as I would have liked.
>>>
>>> For my test, I set up all of my sensors to restart argus once a day at
>>> the same time via an init.d stop/start and set my tcpdump filter to look
>>> like this:
>>> port 561 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0
>>>
>>> I can see two sets of shutdowns: one on the 9th, after which all my
>>> sensors came back, and one on the 10th, when only 17 came back. All the
>>> Argus daemons did restart and come online, but for some hosts radium
>>> never even attempted to restart the connection.
>>>
>>> tcpdump available upon request.
>>>
>>> Thanks,
>>> Phil
>>>
>>>
>>> On 3/5/2011 6:08 PM, Phillip G Deneault wrote:
>>>> Actually, I spoke too soon. It happened again last night after not
>>>> happening for weeks. I'm going to see if I can simulate this behavior
>>>> tomorrow or Monday and try to get a packet capture of the behavior as it
>>>> occurs.
>>>>
>>>> Thanks,
>>>> Phil
>>>>
>>>> On Sat, 5 Mar 2011, Carter Bullard wrote:
>>>>
>>>>> Excellent!!!! That is great news!!!!
>>>>> Carter
>>>>>
>>>>> On Mar 4, 2011, at 1:29 PM, Phillip Deneault <deneault at WPI.EDU> wrote:
>>>>>
>>>>>> Carter,
>>>>>>
>>>>>> I didn't forget about you. I've been letting this run for a while then
>>>>>> ran it for a little more when you released .23. It seems to have fixed
>>>>>> the bug as I still have not had any problems.
>>>>>>
>>>>>> Thanks,
>>>>>> Phil
>>>>>>
>>>>>> On 2/4/2011 4:10 PM, Carter Bullard wrote:
>>>>>>> Hey Phillip,
>>>>>>> I did find a problem, and this patch should fix radium() apparently
>>>>>>> not attempting to reconnect after a while. I've got it in the
>>>>>>> distribution, but give it a try on your machine to see whether it
>>>>>>> corrects the problem.
>>>>>>>
>>>>>>> Carter
>>>>>>>
>>>>>>> ==== //depot/argus/clients/common/argus_client.c#204 -
>>>>>>> /Users/carter/argus/clients/common/argus_client.c ====
>>>>>>> 2523a2524,2525
>>>>>>>>
>>>>>>>> input->status &= ~ARGUS_CLOSED;
>>>>>>>
>>>>>>>
>>>>>>> On Feb 4, 2011, at 4:01 PM, Carter Bullard wrote:
>>>>>>>
>>>>>>>> Hey Phillip,
>>>>>>>> radium() doesn't have a retry counter. It should keep trying every
>>>>>>>> 5 seconds if threaded and every 1 second if non-threaded, and it
>>>>>>>> should try forever. I've recreated a problem where radium() loses
>>>>>>>> the connection after the far side has gone away a few times, so
>>>>>>>> I'm working on this now.
>>>>>>>>
>>>>>>>> Carter
>>>>>>>>
>>>>>>>>
>>>>>>>> On Feb 4, 2011, at 12:11 PM, Phillip Deneault wrote:
>>>>>>>>
>>>>>>>>> On 1/31/2011 1:59 PM, Phillip Deneault wrote:
>>>>>>>>>> This might muddle the issue, but I'm having an odd issue with radium
>>>>>>>>>> too. The longer radium is running, the fewer and fewer records seem
>>>>>>>>>> to be recorded. It appears that the radium instance loses its
>>>>>>>>>> connection to the argi one at a time and doesn't keep retrying and
>>>>>>>>>> doesn't throw an error in the logs about any sort of failed
>>>>>>>>>> connection.
>>>>>>>>>>
>>>>>>>>>> There are quite a few nodes I'm connecting to (all on the local LAN),
>>>>>>>>>> and this was happening in the 3.0.2 version of argus-clients as well
>>>>>>>>>> as 3.0.3.21 (which I am running now). I'm running CentOS 5.5.
>>>>>>>>>>
>>>>>>>>>> I'm upping the debug level and trying to figure this out, but can
>>>>>>>>>> anyone else confirm they see this problem?
>>>>>>>>>
>>>>>>>>> So I'll assume no one else is seeing this problem.
>>>>>>>>>
>>>>>>>>> Yesterday we had some network interruption and a number of the nodes
>>>>>>>>> once again got disconnected from the radium instance. It appears the
>>>>>>>>> radium instance tried to reconnect 10 times, all with a 'no route to
>>>>>>>>> host' before it appeared to stop retrying.
>>>>>>>>>
>>>>>>>>> The number 10 sounds to me like a nice round number. Is this a
>>>>>>>>> hardcoded retry count in radium?
>>>>>>>>>
>>>>>>>>> I might be getting ahead of myself, but should I instead use the -p
>>>>>>>>> option to kill radium if I drop a connection and use a process
>>>>>>>>> monitor to restart it?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Phil
>>>>>>>>>
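Note on the reconnect behavior discussed in this thread: the sketch below is
not code from argus-clients. It only illustrates, under stated assumptions,
how a retry loop with no retry counter might clear a stale closed bit before
each attempt, in the same shape as the quoted one-line patch
(input->status &= ~ARGUS_CLOSED). Every name prefixed sketch_ or SKETCH_, the
flag values, and the simulated connect are hypothetical. The 5-second and
1-second intervals are the ones stated in the thread. The thread does not say
how the stale bit suppressed retries, so the sketch does not try to model
that. The loop is bounded only so the example terminates; per the thread,
radium itself should retry forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status bits; the values are made up for this sketch. */
#define SKETCH_ACTIVE 0x01u
#define SKETCH_CLOSED 0x02u

struct sketch_input {
    unsigned int status; /* bitmask of SKETCH_* flags */
    int attempts;        /* connect attempts made so far */
    int fail_first;      /* simulate this many failed attempts */
};

/* Retry interval in seconds, as described in the thread. */
static unsigned int sketch_retry_interval(int threaded) {
    return threaded ? 5u : 1u;
}

/* Stand-in for a real connect: fails fail_first times, then succeeds. */
static int sketch_connect(struct sketch_input *in) {
    assert(in != NULL);
    in->attempts++;
    return (in->attempts > in->fail_first) ? 0 : -1;
}

/* Mark the input closed, e.g. after the far side goes away. */
static void sketch_close(struct sketch_input *in) {
    assert(in != NULL);
    in->status &= ~SKETCH_ACTIVE;
    in->status |= SKETCH_CLOSED;
}

/* Try to reconnect up to max_passes times. Returns 1 on success, else 0. */
static int sketch_reconnect(struct sketch_input *in, int max_passes) {
    assert(in != NULL);
    for (int pass = 0; pass < max_passes; pass++) {
        /* Same shape as the quoted patch: drop the stale closed bit. */
        in->status &= ~SKETCH_CLOSED;
        if (sketch_connect(in) == 0) {
            in->status |= SKETCH_ACTIVE;
            return 1;
        }
        /* Attempt failed: mark closed again until the next pass. A real
           loop would sleep sketch_retry_interval(...) seconds here. */
        sketch_close(in);
    }
    return 0;
}
```

With fail_first set to 3 and the closed bit initially set, a call such as
sketch_reconnect(&in, 10) succeeds on the fourth attempt and leaves the
closed bit clear.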
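Note on the capture filter quoted in the thread: the expression
port 561 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0 keeps only
packets on port 561 that have SYN, FIN, or RST set, that is, connection
setup and teardown. A rough C equivalent of that test is sketched below,
assuming the caller has already parsed the two ports and the TCP flags byte.
The function and macro names are hypothetical; the bit values are the
standard TCP header flag bits.

```c
#include <stdint.h>

/* Standard TCP header flag bits (names are local to this sketch). */
#define SKETCH_TCP_FIN 0x01u
#define SKETCH_TCP_SYN 0x02u
#define SKETCH_TCP_RST 0x04u

/* Port used in the quoted filter. */
#define SKETCH_PORT 561u

/* Returns 1 if a packet would match the quoted filter, else 0.
   "port 561" in a pcap filter matches either the source or the
   destination port. */
static int sketch_filter_matches(uint16_t sport, uint16_t dport,
                                 uint8_t flags) {
    int on_port = (sport == SKETCH_PORT) || (dport == SKETCH_PORT);
    /* In C, != binds tighter than &, so the mask test needs its own
       parentheses. */
    int setup_or_teardown =
        (flags & (SKETCH_TCP_SYN | SKETCH_TCP_FIN | SKETCH_TCP_RST)) != 0;
    return on_port && setup_or_teardown;
}
```

For example, a SYN (flags 0x02) to port 561 matches, while a plain ACK
(flags 0x10) on the same connection does not.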