Radium dropping connections to argi

Carter Bullard carter at qosient.com
Thu Mar 10 18:04:39 EST 2011


Hey Phillip,
How many remote clients are you connecting to?
Should not be an issue, but you never know.
Carter



On Mar 10, 2011, at 2:38 PM, Phillip Deneault <deneault at WPI.EDU> wrote:

> I managed to repeat this problem with a sniffer running.  It didn't turn
> up as much useful information as I would have liked.
> 
> For my test, I set up all of my sensors to restart argus once a day at
> the same time via an init.d stop/start and set my tcpdump filter to look
> like this:
> port 561 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0
> 
> I can see two sets of shutdowns: one on the 9th, for which all my sensors
> came back, and one on the 10th, when only 17 came back.  All the Argus
> daemons did restart and come online, but for some hosts radium never
> even attempted to restart the connection.
> 
> tcpdump available upon request.
> 
> Thanks,
> Phil
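The capture filter in Phil's message keeps only segments to or from port 561 (argus's default port) that carry SYN, FIN or RST, i.e. connection setup and teardown, which is exactly what shows whether radium tried to reconnect. As a rough illustration, here is the same test written in C against the TCP flags byte. The function names are hypothetical; only the flag bit values come from the TCP header layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits in the TCP header's flags byte (what tcpdump calls
 * tcp[tcpflags], byte offset 13). */
#define TCP_FLAG_FIN 0x01u
#define TCP_FLAG_SYN 0x02u
#define TCP_FLAG_RST 0x04u
#define TCP_FLAG_ACK 0x10u

/* Port the argus daemons serve on in this thread (argus's default). */
#define ARGUS_PORT 561u

/* Same test as tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0:
 * true for segments that open, close, or reset a connection. */
bool is_connection_event(uint8_t tcp_flags)
{
    return (tcp_flags & (TCP_FLAG_SYN | TCP_FLAG_FIN | TCP_FLAG_RST)) != 0;
}

/* Same test as "port 561": either endpoint may use the port. */
bool is_argus_port(uint16_t sport, uint16_t dport)
{
    return sport == ARGUS_PORT || dport == ARGUS_PORT;
}

/* Whole filter: both halves must match. */
bool matches_filter(uint16_t sport, uint16_t dport, uint8_t tcp_flags)
{
    return is_argus_port(sport, dport) && is_connection_event(tcp_flags);
}
```

A plain data segment (PSH|ACK) on port 561 is dropped by this filter, while a SYN-ACK or FIN-ACK on the same port is kept.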
> 
> 
> On 3/5/2011 6:08 PM, Phillip G Deneault wrote:
>> Actually, I spoke too soon.  It happened again last night after not
>> happening for weeks.  I'm going to see if I can simulate this behavior
>> tomorrow or Monday and try to get a packet capture of the behavior as it
>> occurs.
>> 
>> Thanks,
>> Phil
>> 
>> On Sat, 5 Mar 2011, Carter Bullard wrote:
>> 
>>> Excellent!!!!  That is great news!!!!
>>> Carter
>>> 
>>> On Mar 4, 2011, at 1:29 PM, Phillip Deneault <deneault at WPI.EDU> wrote:
>>> 
>>>> Carter,
>>>> 
>>>> I didn't forget about you.  I've been letting this run for a while, then
>>>> ran it a little longer after you released .23.  It seems to have fixed
>>>> the bug, as I still have not had any problems.
>>>> 
>>>> Thanks,
>>>> Phil
>>>> 
>>>> On 2/4/2011 4:10 PM, Carter Bullard wrote:
>>>>> Hey Phillip,
>>>>> I did find a problem, and this patch should fix radium() apparently
>>>>> not attempting to reconnect after a while.  I've got it in the
>>>>> distribution, but give it a try on your machine to see whether it
>>>>> corrects the problem.
>>>>> 
>>>>> Carter
>>>>> 
>>>>> ==== //depot/argus/clients/common/argus_client.c#204 -
>>>>> /Users/carter/argus/clients/common/argus_client.c ====
>>>>> 2523a2524,2525
>>>>>> 
>>>>>>                    input->status &= ~ARGUS_CLOSED;
>>>>> 
>>>>> 
>>>>> On Feb 4, 2011, at 4:01 PM, Carter Bullard wrote:
>>>>> 
>>>>>> Hey Phillip,
>>>>>> radium() doesn't have a retry counter; it should keep trying every
>>>>>> 5 seconds if threaded and every 1 second if non-threaded, and it
>>>>>> should try forever.  I've recreated a problem where radium(), after
>>>>>> the far side has gone away a few times, loses the connection, so
>>>>>> I'm working on this now.
>>>>>> 
>>>>>> Carter
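The reconnect policy Carter describes (a fixed interval, no attempt limit) can be set next to the hard-coded cap Phil suspected in a short C sketch. The function names and the connect hook are hypothetical; only the 5 s / 1 s intervals and the no-counter rule come from the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Interval from the thread: every 5 s when threaded, every 1 s otherwise. */
unsigned int retry_interval(bool threaded)
{
    return threaded ? 5u : 1u;
}

/* Policy as described: no retry counter, so the answer never changes. */
bool keep_retrying(unsigned long failed_attempts)
{
    (void)failed_attempts;
    return true;
}

/* Policy Phil suspected (a hard-coded cap such as 10), for contrast only. */
bool keep_retrying_capped(unsigned long failed_attempts, unsigned long cap)
{
    return failed_attempts < cap;
}

/* Hypothetical connect hook: returns true once the link is up. */
typedef bool (*connect_fn)(void *ctx);

/* Retry until try_connect() succeeds; returns the number of failures. */
unsigned long retry_until_connected(connect_fn try_connect, void *ctx,
                                    bool threaded)
{
    unsigned long failed = 0;

    assert(try_connect != NULL);
    while (!try_connect(ctx)) {
        failed++;
        if (!keep_retrying(failed))   /* never taken under this policy */
            break;
        sleep(retry_interval(threaded));
    }
    return failed;
}
```

Under the uncapped policy, ten "no route to host" failures in a row change nothing; the loop simply keeps going at the same interval.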
>>>>>> 
>>>>>> 
>>>>>> On Feb 4, 2011, at 12:11 PM, Phillip Deneault wrote:
>>>>>> 
>>>>>>> On 1/31/2011 1:59 PM, Phillip Deneault wrote:
>>>>>>>> This might muddle the issue, but I'm having an odd problem with radium
>>>>>>>> too.  The longer radium runs, the fewer records seem to be
>>>>>>>> recorded.  It appears that the radium instance loses its
>>>>>>>> connection to the argi one at a time, doesn't keep retrying, and
>>>>>>>> doesn't log an error about any sort of failed connection.
>>>>>>>> 
>>>>>>>> There are quite a few nodes I'm connecting to (all on the local
>>>>>>>> LAN), and this was happening in version 3.0.2 of argus-clients as
>>>>>>>> well as 3.0.3.21 (which I am running now).  I'm running CentOS 5.5.
>>>>>>>> 
>>>>>>>> I'm upping the debug level and trying to figure this out, but can
>>>>>>>> anyone else confirm they see this problem?
>>>>>>> 
>>>>>>> So I'll assume no one else is seeing this problem.
>>>>>>> 
>>>>>>> Yesterday we had some network interruption, and a number of the nodes
>>>>>>> once again got disconnected from the radium instance.  It appears the
>>>>>>> radium instance tried to reconnect 10 times, each failing with 'no
>>>>>>> route to host', before it appeared to stop retrying.
>>>>>>> 
>>>>>>> The number 10 sounds to me like a nice round number; is this a
>>>>>>> hard-coded retry count in radium?
>>>>>>> 
>>>>>>> I might be getting ahead of myself, but should I instead use the -p
>>>>>>> option to kill radium if I drop a connection, and use a process
>>>>>>> monitor to restart it?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Phil
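Phil's fallback, letting radium exit when a connection drops and having a process monitor restart it, boils down to a supervise-and-restart loop. A minimal POSIX sketch follows; `supervise()` is hypothetical, it assumes nothing about what radium's -p option actually does, and on a CentOS 5 box an existing mechanism such as an /etc/inittab respawn entry would normally do this job instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] and start it again every time it exits.
 * max_runs == 0 means run forever; a nonzero cap only exists so the
 * sketch can be exercised without looping indefinitely.
 * Returns how many times the child was started, or -1 on error. */
long supervise(char *const argv[], unsigned long max_runs,
               unsigned int delay_s)
{
    unsigned long runs = 0;

    assert(argv != NULL && argv[0] != NULL);
    while (max_runs == 0 || runs < max_runs) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {                 /* child: become the command */
            execvp(argv[0], argv);
            perror("execvp");           /* reached only if exec failed */
            _exit(127);
        }
        runs++;

        int status;
        while (waitpid(pid, &status, 0) < 0) {
            if (errno != EINTR) {       /* retry only if interrupted */
                perror("waitpid");
                return -1;
            }
        }
        if (WIFEXITED(status))
            fprintf(stderr, "%s exited with status %d\n",
                    argv[0], WEXITSTATUS(status));
        else if (WIFSIGNALED(status))
            fprintf(stderr, "%s killed by signal %d\n",
                    argv[0], WTERMSIG(status));
        sleep(delay_s);                 /* avoid a tight restart loop */
    }
    return (long)runs;
}
```

For example, `supervise(cmd, 0, 5)` would restart the command five seconds after every exit, with no limit on the number of restarts.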
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
> 
> 


