Radium dropping connections to argi

Carter Bullard carter at qosient.com
Fri Mar 11 11:02:08 EST 2011


Hey Phillip,
50 is a good number.  I'll go through the code to make sure that we are generating
messages to notify you of issues with restarting connections.  I'll put that in argus-clients-3.0.5
as soon as we release today, and we'll get started on this problem on Monday.

Carter


On Mar 11, 2011, at 9:12 AM, Phillip G Deneault wrote:

> Over 50
> 
> Thanks,
> Phil
> 
> On Thu, 10 Mar 2011, Carter Bullard wrote:
> 
>> Hey Phillip,
>> How many remote clients are you connecting to?
>> Should not be an issue but you never know.
>> Carter
>> 
>> 
>> 
>> On Mar 10, 2011, at 2:38 PM, Phillip Deneault <deneault at WPI.EDU> wrote:
>> 
>>> I managed to repeat this problem with a sniffer running.  It didn't turn
>>> up as much useful information as I would have liked.
>>> 
>>> For my test, I set up all of my sensors to restart argus once a day at
>>> the same time via an init.d stop/start and set my tcpdump filter to look
>>> like this:
>>> port 561 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0
>>> 
>>> I can see two sets of shutdowns, one on the 9th for which all my sensors
>>> came back, and one on the 10th when only 17 came back.  All the Argus
>>> daemons did restart and come online, but basically for some hosts,
>>> radium never even attempted to restart the connection.
>>> 
>>> tcpdump available upon request.
>>> 
>>> Thanks,
>>> Phil
>>> 
>>> 
>>> On 3/5/2011 6:08 PM, Phillip G Deneault wrote:
>>>> Actually, I spoke too soon.  It happened again last night after not
>>>> happening for weeks.  I'm going to see if I can simulate this behavior
>>>> tomorrow or Monday and try to get a packet capture of the behavior as it
>>>> occurs.
>>>> 
>>>> Thanks,
>>>> Phil
>>>> 
>>>> On Sat, 5 Mar 2011, Carter Bullard wrote:
>>>> 
>>>>> Excerrent !!!!   That is great news !!!!
>>>>> Carter
>>>>> 
>>>>> On Mar 4, 2011, at 1:29 PM, Phillip Deneault <deneault at WPI.EDU> wrote:
>>>>> 
>>>>>> Carter,
>>>>>> 
>>>>>> I didn't forget about you.  I've been letting this run for a while then
>>>>>> ran it for a little more when you released .23.  It seems to have fixed
>>>>>> the bug as I still have not had any problems.
>>>>>> 
>>>>>> Thanks,
>>>>>> Phil
>>>>>> 
>>>>>> On 2/4/2011 4:10 PM, Carter Bullard wrote:
>>>>>>> Hey Phillip,
>>>>>>> I did find a problem, and this patch should fix radium() apparently
>>>>>>> not attempting to reconnect after a while.  I've got it in the
>>>>>>> distribution, but give it a try on your machine to see if it
>>>>>>> corrects the problem.
>>>>>>> 
>>>>>>> Carter
>>>>>>> 
>>>>>>> ==== //depot/argus/clients/common/argus_client.c#204 -
>>>>>>> /Users/carter/argus/clients/common/argus_client.c ====
>>>>>>> 2523a2524,2525
>>>>>>>> 
>>>>>>>>                   input->status &= ~ARGUS_CLOSED;
>>>>>>> 
>>>>>>> 
>>>>>>> On Feb 4, 2011, at 4:01 PM, Carter Bullard wrote:
>>>>>>> 
>>>>>>>> Hey Phillip,
>>>>>>>> radium() doesn't have a retry counter; it should keep trying every
>>>>>>>> 5 seconds if threaded and every 1 second if non-threaded, and it
>>>>>>>> should try forever.  I've recreated a problem where radium() loses
>>>>>>>> the connection after the far side has gone away a few times, so I'm
>>>>>>>> working on this now.
>>>>>>>> 
>>>>>>>> Carter
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Feb 4, 2011, at 12:11 PM, Phillip Deneault wrote:
>>>>>>>> 
>>>>>>>>> On 1/31/2011 1:59 PM, Phillip Deneault wrote:
>>>>>>>>>> This might muddle the issue, but I'm having an odd issue with radium
>>>>>>>>>> too.  The longer radium is running, the fewer and fewer records seem
>>>>>>>>>> to be recorded.  It appears that the radium instance loses its
>>>>>>>>>> connection to the argi one at a time and doesn't keep retrying and
>>>>>>>>>> doesn't throw an error in the logs about any sort of failed
>>>>>>>>>> connection.
>>>>>>>>>> 
>>>>>>>>>> There are quite a few nodes I'm connecting to (all on the local
>>>>>>>>>> LAN), and this was happening in the 3.0.2 version of argus-clients
>>>>>>>>>> as well as 3.0.3.21 (which I am running now).  I'm running CentOS 5.5.
>>>>>>>>>> 
>>>>>>>>>> I'm upping the debug level and trying to figure this out, but can
>>>>>>>>>> anyone else confirm they see this problem?
>>>>>>>>> 
>>>>>>>>> So I'll assume no one else is seeing this problem.
>>>>>>>>> 
>>>>>>>>> Yesterday we had some network interruption and a number of the nodes
>>>>>>>>> once again got disconnected from the radium instance.  It appears the
>>>>>>>>> radium instance tried to reconnect 10 times, all with a 'no route to
>>>>>>>>> host' before it appeared to stop retrying.
>>>>>>>>> 
>>>>>>>>> The number 10 sounds to me like a nice round number.  Is this a
>>>>>>>>> hardcoded retry count in radium?
>>>>>>>>> 
>>>>>>>>> I might be getting ahead of myself, but should I instead use the -p
>>>>>>>>> option to kill radium if I drop a connection and use a process
>>>>>>>>> monitor to restart it?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Phil
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 
> 
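
The one-line patch quoted in this thread clears the ARGUS_CLOSED bit on an input's status. The surrounding code in argus_client.c is not shown, so what follows is only a minimal sketch of one plausible reading: a reconnect pass that skips any input still marked closed would stop retrying silently unless the bit is cleared first. Every name here (sketch_input, sketch_connect, sketch_retry_once, sketch_run, SKETCH_CLOSED and its value) is hypothetical and is not taken from the argus sources.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for ARGUS_CLOSED; the real value is not given in the thread. */
#define SKETCH_CLOSED 0x1u

/* Hypothetical per-sensor state, not the real argus input structure. */
struct sketch_input {
    unsigned status;   /* status bits, including the closed flag */
    int attempts;      /* connection attempts made so far */
    int fail_until;    /* simulation: this many attempts fail first */
};

/* Simulated connect: fails for the first fail_until attempts. */
static bool sketch_connect(struct sketch_input *in) {
    in->attempts++;
    return in->attempts > in->fail_until;
}

/* One pass of a reconnect scheduler. Returns true once connected. */
static bool sketch_retry_once(struct sketch_input *in, bool clear_closed) {
    if (in->status & SKETCH_CLOSED) {
        if (!clear_closed)
            return false;              /* still marked closed: skipped, no log */
        in->status &= ~SKETCH_CLOSED;  /* analogous to the quoted patch line */
    }
    if (sketch_connect(in))
        return true;
    /* Report every failed attempt so a stalled sensor is visible. */
    fprintf(stderr, "reconnect attempt %d failed\n", in->attempts);
    in->status |= SKETCH_CLOSED;       /* mark closed until the next pass */
    return false;
}

/* Run up to max_passes passes; return the pass that connected, or -1. */
static int sketch_run(bool clear_closed, int fail_until, int max_passes) {
    struct sketch_input in = { SKETCH_CLOSED, 0, fail_until };
    for (int pass = 1; pass <= max_passes; pass++) {
        if (sketch_retry_once(&in, clear_closed))
            return pass;
        /* A real client would sleep between passes (the thread mentions
           5 seconds threaded, 1 second otherwise). */
    }
    return -1;
}
```

In this sketch, calling sketch_run with clear_closed set to false never makes a connection attempt and never logs anything, which resembles the silent loss of sensors described above. With clear_closed set to true, the loop keeps trying until the simulated far side comes back.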
