Argus and rasqlinsert problems

Carter Bullard carter at qosient.com
Mon Apr 18 16:52:32 EDT 2011


Hey Leif,
OK.  Hopefully when it pops, you should be able to type things like:
   (gdb) where
   (gdb) info threads
   (gdb) thread 2
   (gdb) where

That should give us some notion of what is going on.  Thanks for helping
us get to the bottom of this bug.
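
If nobody is at the terminal when it faults, the same commands can be fed to gdb from a command file. What follows is only a sketch: it assumes a gdb that supports -batch, -x and --args, and the helper name write_gdb_cmds and the file names argus.gdb and argus-gdb.log are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a gdb command file that starts the program and,
# once it stops (for example on SIGSEGV), prints the faulting thread's stack,
# the thread list, and a backtrace of every thread.
write_gdb_cmds() {
    # $1 = path of the command file to create
    cat > "$1" <<'EOF'
run
where
info threads
thread apply all bt
EOF
}

# Usage (not executed here):
#   write_gdb_cmds argus.gdb
#   sudo gdb -batch -x argus.gdb --args ./bin/argus -d 2>&1 | tee argus-gdb.log
```

"thread apply all bt" does in one step what switching to each thread and typing "where" does by hand.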

Carter

On Apr 18, 2011, at 4:47 PM, Leif Tishendorf wrote:

>    cd argus-3.0.4/
>    touch .devel .debug
>    ./configure; make clean; make
> 
> and made sure I was running argus in ./bin/argus through gdb.  Right now, since it's running stable, I don't want to go back to a testing config since we actively use the generated data.  However, if it crashes again I'll recompile again and run the test setup.
> 
> -Leif
> 
> 
> On 04/18/2011 01:38 PM, Carter Bullard wrote:
>> When you get that message in gdb() you should be able to type "where" and get a stack dump?
>> But I'm not sure that the radium has the symbols in it, as you should get much better output when
>> there is a problem.
>> 
>> Could you verify that you compiled radium with the correct tags, ".debug" and ".devel"?
>> Carter
>> 
>> 
>> On Apr 18, 2011, at 4:21 PM, Leif Tishendorf wrote:
>> 
>>> Carter,
>>> 
>>> Recompiled with devel and debug.  Everything runs fine and then this shows up in gdb (other than the normal operation chatter):
>>> 
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0x7fffe59ae700 (LWP 6897)]
>>> 0x00007ffff7427f7c in ?? () from /lib/libc.so.6
>>> 
>>> Unfortunately there are no other radium, argus or rasqlinsert errors generated anywhere.
>>> 
>>> But going with the pressure theory I changed how I call rasqlinsert from:
>>> 
>>> rasqlinsert -M cache -S localhost:565 -w mysql://argus@localhost/argus/argus_%Y_%m_%d -M time 1d -d
>>> 
>>> to:
>>> 
>>> rasqlinsert -S localhost:565 -w mysql://argus@localhost/argus/argus_%Y_%m_%d -M time 1d -d -m none
>>> 
>>> in an attempt to alleviate some of the "pressure" on rasqlinsert/mysql and so far it seems to be stable.  Going on about 2 hours uptime across peak traffic times, as opposed to the 30 seconds I was achieving earlier.  CPU and RAM usage have been steadily creeping up but I'm hoping that starts to go back down with traffic later in the day.
>>> 
>>> -Leif
>>> 
>>> On 04/18/2011 06:51 AM, Carter Bullard wrote:
>>>> Hey Leif,
>>>> So I'm not getting anything like your experience here, with quite a bit of testing.
>>>> Could you look at your system logs, to see if radium or argus printed any type
>>>> of log entry when it disconnects?  You're on Debian?
>>>> 
>>>> So, what is basically going on is that rasqlinsert() isn't keeping up with the load, so
>>>> there is back pressure on the last radium in your chain.  The radium will reach a
>>>> threshold of records that are waiting to be written to rasqlinsert(), and it will/should
>>>> decide to drop the connection (it could drop records at this point, rather than drop
>>>> the connection).  The last radium should drop the connection, and that should be
>>>> the end of it.  rasqlinsert() is supposed to finish processing, and then retry the
>>>> connection, and continue on its way.  That is the design.
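
The design described above can be pictured with a toy model. This is purely illustrative and is not radium's code: the function name simulate, the rates and the threshold are invented, and the thread does not say what radium's real threshold is.

```shell
#!/bin/sh
# Toy model of back pressure on one output connection (assumed numbers only).
# simulate IN OUT THRESHOLD TICKS
#   IN        = records arriving per tick
#   OUT       = records the reader manages to take per tick
#   THRESHOLD = pending records tolerated before the connection is dropped
# Prints how many times the connection would have been dropped.
simulate() {
    in=$1 out=$2 threshold=$3 ticks=$4
    pending=0 drops=0 t=0
    while [ "$t" -lt "$ticks" ]; do
        pending=$((pending + in - out))
        if [ "$pending" -lt 0 ]; then
            pending=0
        fi
        if [ "$pending" -gt "$threshold" ]; then
            # Sender gives up on the slow reader; the reader reconnects later
            # and starts again with an empty queue.
            drops=$((drops + 1))
            pending=0
        fi
        t=$((t + 1))
    done
    echo "$drops"
}

simulate 10 10 100 50   # reader keeps up: prints 0
simulate 15 10 20 10    # reader falls behind by 5 per tick: prints 2
```

In this model a slow reader only ever costs its own connection; nothing upstream of the sender is affected, which is the behaviour the design is meant to give.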
>>>> 
>>>> The curiosity is that whatever is going on, it's back pressuring both radium and
>>>> argus, and they are having problems dealing with the condition.  At least that's
>>>> what it sounds like?
>>>> 
>>>> The second radium in your chain, shouldn't affect the first radium in any way.
>>>> That was one of the reasons why I asked you to put it in.  So if radium has any
>>>> log entries, that would be really helpful.
>>>> 
>>>> Carter
>>>> 
>>>> On Apr 15, 2011, at 7:46 PM, Leif Tishendorf wrote:
>>>> 
>>>>> Carter,
>>>>> 
>>>>> An update before the weekend here.  So I reverted everything to the latest stable release and put it back in a working local logging configuration (3 argus on localhost, 1 radium collector, rasplit to files).  Made sure that was all working stable.  Then made a change to stop rasplit on the argus box and fire up rasqlinsert on the remote box (no other changes to Argus config or radium config), with the expected problem: rasqlinsert stops inserting within about a minute and Argus eventually crashes.
>>>>> 
>>>>> I then changed it to have a radium instance on the remote(DB) box connect to the radium instance on the Argus box.  That stayed up and stable, and then as soon as I started rasqlinsert again against the now local instance of radium the same problems returned.  I also noticed this seems to cause a cascading problem where rasqlinsert will stop inserting and the local radium instance stops outputting data and the upstream instance of radium (on the Argus box) stops outputting and eventually the Argus instances crash.
>>>>> 
>>>>> I haven't had a chance to recompile and test with devel and debug enabled but I thought I'd send out that bit of info and see if it lit any light bulbs.  I'll recompile for testing Monday.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> -Leif
>>>>> 
>>>>> On 04/15/2011 01:07 PM, Carter Bullard wrote:
>>>>>> Hey Leif,
>>>>>> Well, that is disappointing.  I would recommend that you shift back,
>>>>>> to get something stable going,  and I'll work with you to get things so
>>>>>> you can go the database route.
>>>>>> 
>>>>>> I am not seeing this type of instability, but that doesn't mean anything.
>>>>>> 
>>>>>> First things first.  We need to fix the argus seg faulting.  Did this start
>>>>>> with argus-3.0.4, or with the radium() connection approach?
>>>>>> 
>>>>>> If you can run gdb(), the best thing would be to run argus under gdb,
>>>>>> after compiling the symbols in, so it will tell us where it is dying.  In
>>>>>> the argus root directory:
>>>>>> 
>>>>>>    % touch .devel .debug
>>>>>>    % ./configure; make clean; make
>>>>>>    % sudo gdb ./bin/argus
>>>>>>    (gdb)
>>>>>> 
>>>>>> Stop your running argus, and then run the argus under gdb.
>>>>>> Assuming that your argus was running as a daemon, use the -d switch
>>>>>> when running argus, so that it won't go into the background while in gdb:
>>>>>> 
>>>>>>    (gdb) run -d
>>>>>> 
>>>>>> Hopefully it will cough up blood and tell us where it was.  That should
>>>>>> help me to fix that.
>>>>>> 
>>>>>> Rather than have rasqlinsert() connect to a remote radium(), you can
>>>>>> run radium() on the database system, connecting to the other radium(), and
>>>>>> have rasqlinsert() attach to a local radium.  That may or may not help,
>>>>>> but it at least leaves record distribution to radium, and lets the other
>>>>>> programs have local access to data.
>>>>>> 
>>>>>> With rasqlinsert(), there are a few possibilities.  When the CPU goes
>>>>>> down, has rasqlinsert() stopped inserting records into the database?
>>>>>> It may be having problems receiving records, or it could be having
>>>>>> problems with mysqld.
>>>>>> 
>>>>>> Are there any error messages in your mysqld error logs?
>>>>>> 
>>>>>> Sometimes it's hard to find where the logs are.  I use:
>>>>>>    lsof -n | fgrep mysql
>>>>>> to show me where the directory is. You may have to be root to see.
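
To cut the lsof listing down to just the directories holding log files, something like the following may help. It is a sketch under assumptions: the helper name log_dirs is made up, it assumes the file name is the last column of the lsof output and contains no spaces, and it assumes the logs end in .log or .err.

```shell
#!/bin/sh
# Hypothetical filter: read lsof output on stdin and print the unique
# directories of open files whose names end in .log or .err.
log_dirs() {
    awk '$NF ~ /\.(log|err)$/ { print $NF }' |   # keep the NAME column of log-like files
        sed 's|/[^/]*$||' |                       # strip the file name, leaving the directory
        sort -u
}

# Usage (not executed here; may need root):
#   lsof -n | grep -F mysql | log_dirs
```

grep -F is the same fixed-string match as fgrep.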
>>>>>> 
>>>>>> How are you calling rasqlinsert?
>>>>>> 
>>>>>> If you would like to take this off the email list, feel free to email me
>>>>>> directly, although it is late on Friday, I'll still read some email this
>>>>>> weekend.
>>>>>> 
>>>>>> Carter
>>>>>> 
>>>>>> On Apr 15, 2011, at 2:42 PM, Leif Tishendorf wrote:
>>>>>> 
>>>>>>> Hey Carter,
>>>>>>> 
>>>>>>> I've changed how we're logging argus data from regular files to a MySQL
>>>>>>> DB.  We used to have 3 Argus instances collected by one Radium
>>>>>>> instance and then logged to disk by rasplit, and it was all working
>>>>>>> fine.  Now everything is the same except instead of rasplit we use
>>>>>>> rasqlinsert, and instead of logging locally, rasqlinsert is running on
>>>>>>> another system connecting to the radium instance via a private address
>>>>>>> direct link.
>>>>>>> 
>>>>>>> The first issue I noticed was every few minutes the argus instances
>>>>>>> were dying (not necessarily at the same time) with the following
>>>>>>> syslog error:
>>>>>>> 
>>>>>>> kernel: [4374754.132368] argus[28333]: segfault at 188 ip
>>>>>>> 00007f27b7e61f7c sp 00007f27a63e7828 error 6 in
>>>>>>> libc-2.12.1.so[7f27b7ddb000+17a000]
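
The kernel line gives both the faulting instruction pointer (ip) and, in brackets, the start address and size of the libc mapping, so the offset of the fault inside libc is a subtraction. A small sketch of that arithmetic follows; the helper name lib_offset is made up, and it takes the two addresses as bare hex exactly as they appear in the syslog line.

```shell
#!/bin/sh
# Hypothetical helper: offset of a faulting address inside a mapped library.
# $1 = instruction pointer, $2 = mapping start, both bare hex without "0x".
lib_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

lib_offset 00007f27b7e61f7c 7f27b7ddb000   # prints 0x86f7c
```

0x86f7c is below the mapping size of 0x17a000, so the address does fall inside that mapping. With a libc that has debug symbols installed, an offset like this can usually be handed to a symbol lookup tool to name the function, although whether the mapping start equals the library's load base depends on how it was mapped.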
>>>>>>> 
>>>>>>> Then the second issue we're having is rasqlinsert will work fine and
>>>>>>> then we'll see CPU/RAM usage decline over about 30 seconds until it's
>>>>>>> eventually no longer inserting new argus records.  We can get it
>>>>>>> working again (without touching the running rasqlinsert instance) by
>>>>>>> sometimes restarting radium and sometimes restarting the argus
>>>>>>> instances and sometimes it takes both.  But after a minute or so it
>>>>>>> all happens again.  The crashes don't coincide with the inserts
>>>>>>> stopping, although they do sometimes fix it when my monitor scripts
>>>>>>> restart the argus instances.
>>>>>>> 
>>>>>>> I'm currently running Argus version 3.0.4 and Argus-clients 3.0.5.5
>>>>>>> 
>>>>>>> Any ideas on where I should start troubleshooting this?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> -Leif
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> --Leif
>>>>> 
>>>> 
>>> 
>>> --
>>> --Leif
>>> 
>> 
> 
> -- 
> --Leif
> 
