Endace DAG 8.1 and Argus problem
Carter Bullard
carter at qosient.com
Fri Feb 25 09:43:42 EST 2011
Hey Leif,
The fact that you had to restart the machine suggests that the problem is outside of argus, but we definitely need
to look into this.
First of all, if argus is not staying up for you, I'd like to fix that. Do you have any sense as to how often it fails?
Hopefully not often. Is it running now?
You have constructed a bit of a data flow machine, or data mesh, that appears to be pumping a lot of data.
Any of the components can have problems, so when a situation like this starts, having an
idea of which component failed first is important. The best I can imagine from your description is that ratop()
had a problem (did it really 'crash'? did it generate a core file?), which must have generated a flow pressure
problem for radium(), as it stopped processing records from the argi. This suggests that radium() failed
and was left in a bad state. The argi look to have done the right thing by closing the output sockets to radium(),
but one of the threads segfaulted, so at least some aspect of the argi was also broken.
In the end you probably had a broken radium that was up but not receiving any data, so you wouldn't get anything
from it, although it at least responded to connection requests. You also probably had broken argi running, which
still responded to connection requests but were totally broken outside of that.
Well, how it breaks can be complicated, but recovering from this type of problem should be pretty simple.
If you were having lingering problems, my first guess is that you didn't successfully kill all
of the broken parts, so relaunching argus or radium was failing without you being aware of it.
If you are running argus/radium with the DAEMON mode turned on, a relaunch could
be failing without you seeing any error messages, because the DAEMON mode turns them off. The messages
will be in the syslog, so be sure to look there for clues when things are really curious and unexpected.
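For example, a rough sketch of that syslog check. The log path is an assumption (it varies by distribution), and recent_daemon_messages is just a helper name made up for this example:

```shell
# Assumed log location; adjust for your distribution.
SYSLOG="${SYSLOG:-/var/log/syslog}"

# Hypothetical helper: print the last 20 lines in the given log file
# that were tagged by an argus or radium process (e.g. "argus[17777]: ").
recent_daemon_messages() {
    grep -E ' (argus|radium)\[[0-9]+\]: ' "$1" | tail -n 20
}

if [ -r "$SYSLOG" ]; then
    recent_daemon_messages "$SYSLOG"
else
    echo "cannot read $SYSLOG" >&2
fi
```

Kernel segfault lines that name argus, like the one in your earlier mail, use the same tag format, so they should show up here too.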
OK, the best way to have recovered from this mess would be to kill all the argi and radium using signal 9
(I'm assuming a linux/unix environment), and to verify with the ps command that all the processes are
completely gone.
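A minimal sketch of that cleanup, assuming pkill is available (plain kill -9 with the pids from ps works just as well). kill_and_verify is a helper name made up for this example:

```shell
# Hypothetical helper: send signal 9 to every process whose name is exactly
# "$1", then use ps to confirm that none are left.
kill_and_verify() {
    name="$1"
    pkill -9 -x "$name" 2>/dev/null || true   # no match is not an error here
    sleep 1                                   # give the kernel a moment to reap
    if ps -e -o comm= | grep -qx "$name"; then
        echo "$name still running"
        return 1
    fi
    echo "$name gone"
}

for daemon in argus radium; do
    kill_and_verify "$daemon"
done
```

If either name is reported as still running, don't relaunch anything until it is gone.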
Restart one or all of the argi, without the DAEMON mode, to see if you can connect and get data. If not, then
kill the argi, and then run just one using the "-D3" option, so you can see that it opens its interface. You can
run with "-D9" to see that it's getting packets. Again, make sure that argus is not in the background.
Once you have confirmed that one works, you can start radium() to see that it connects, and then run ratop()
or ra() to see that you get data.
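As a rough checklist, the sequence might look like the sketch below. It only prints the commands, since each one should be run by hand in its own terminal, in the foreground. The interface, host, and port values are placeholders, staged_restart_plan is a made-up helper name, and it assumes ARGUS_DAEMON is not set to yes in any configuration file that argus reads:

```shell
# Placeholder values (assumptions): adjust the interface and ports for your setup.
IFACE="${IFACE:-dag0:36}"
ARGUS_PORT="${ARGUS_PORT:-561}"
RADIUM_PORT="${RADIUM_PORT:-562}"

# Hypothetical helper: print the staged restart, one command per step.
# Nothing is executed; run each command yourself and watch its output.
staged_restart_plan() {
    cat <<EOF
step 1, one argus with debug output:   argus -D3 -i $IFACE -P $ARGUS_PORT
step 2, check that argus gives data:   ra -S localhost:$ARGUS_PORT
step 3, start radium against argus:    radium -S localhost:$ARGUS_PORT -P $RADIUM_PORT
step 4, check that radium gives data:  ra -S localhost:$RADIUM_PORT
EOF
}

staged_restart_plan
```

Only move on to the next step once the previous one behaves.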
Carter
On Feb 24, 2011, at 5:51 PM, Leif Tishendorf wrote:
> Hey Carter,
>
> At the time, I had my 6 instances of Argus running with Radium connecting to them to aggregate the interfaces, and then rasplit listening to Radium to log the data. After the error I noticed rasplit was no longer logging data to disk. So, to try and figure out what was going on, I attempted to read from the Radium interface directly using ra and ratop and was getting no data out. I shut down Radium and tried reading from each Argus instance directly with the same results. I restarted the instances of Argus and tried the procedure above again with the same results. Starting up new instances of Argus 3.0.2 and 3.0.3.22 yielded the same results. Lastly I restarted the box and was again able to get data out of Argus. No other functions, or programs using the Dag card, were affected like this. Somewhere between the data queue and getting the data out of the queue, there was a disconnect.
>
> By the way, thanks again for all the continued help on this.
>
> -Leif
>
> On 02/24/2011 02:27 PM, Carter Bullard wrote:
>> Hey Leif,
>> Sorry for the delay. Argus is getting packets, it is processing flow data, and it's queueing data to the output sockets. The error messages indicate that each output queue has 200,000 flow records ready to be written out, but no one appears to be reading the data.
>>
>> Long queues and the resulting complaints usually mean that your reader cannot keep up with the output stream. Argus thinks that no one is reading, so it closes the output sockets. The segfault is a real problem, and I'll try to figure that out tonight.
>>
>> Not sure that I understand the current scenario. You say it is not writing anything but you are getting queue exceed errors?
>>
>> Carter
>>
>>
>> On Feb 23, 2011, at 6:54 PM, Leif Tishendorf <ltishend at gmail.com> wrote:
>>
>>> Carter,
>>>
>>> So I ran into an interesting problem this morning. I ran ratop against the new patched 3.0.3.22 for testing and after a couple minutes it crashed. I didn't look any more into it at the time because I was at Jury Judy. Now that I've had some time to do some back checking, it would appear the new argus caused a kernel error, and then shortly after, all 6 running instances of argus 3.0.2 threw the following error in short succession.
>>>
>>> ========================================
>>> Feb 23 08:32:40 goldfinger kernel: [2840126.624223] argus[25584]: segfault at 188 ip 00007f58cf3ccf7c sp 00007f58bdf80718 error 6 in libc-2.12.1.so[7f58cf346000+17a000]
>>> Feb 23 08:32:55 goldfinger argus[17777]: 23 Feb 11 08:32:55.413984 ArgusWriteOutSocket(0xd6fa3010) maximum errors exceeded 200000
>>> Feb 23 08:32:55 goldfinger argus[17777]: 23 Feb 11 08:32:55.414030 ArgusWriteOutSocket(0xd6fa3010) maximum errors exceeded 200001
>>> Feb 23 08:32:59 goldfinger argus[30178]: 23 Feb 11 08:32:59.464853 ArgusWriteOutSocket(0xc2016010) maximum errors exceeded 200000
>>> Feb 23 08:32:59 goldfinger argus[30178]: 23 Feb 11 08:32:59.464893 ArgusWriteOutSocket(0xc2016010) maximum errors exceeded 200001
>>> Feb 23 08:32:59 goldfinger argus[16029]: 23 Feb 11 08:32:59.673235 ArgusWriteOutSocket(0x80fda010) maximum errors exceeded 200000
>>> Feb 23 08:32:59 goldfinger argus[16029]: 23 Feb 11 08:32:59.673276 ArgusWriteOutSocket(0x80fda010) maximum errors exceeded 200001
>>> Feb 23 08:33:00 goldfinger argus[29048]: 23 Feb 11 08:33:00.290119 ArgusWriteOutSocket(0xda789010) maximum errors exceeded 200000
>>> Feb 23 08:33:00 goldfinger argus[29048]: 23 Feb 11 08:33:00.290158 ArgusWriteOutSocket(0xda789010) maximum errors exceeded 200001
>>> Feb 23 08:33:00 goldfinger argus[12721]: 23 Feb 11 08:33:00.683134 ArgusWriteOutSocket(0x5db96010) maximum errors exceeded 200000
>>> Feb 23 08:33:00 goldfinger argus[12721]: 23 Feb 11 08:33:00.683165 ArgusWriteOutSocket(0x5db96010) maximum errors exceeded 200001
>>> Feb 23 08:33:00 goldfinger argus[13677]: 23 Feb 11 08:33:00.925746 ArgusWriteOutSocket(0xc37ba010) maximum errors exceeded 200000
>>> Feb 23 08:33:00 goldfinger argus[13677]: 23 Feb 11 08:33:00.925785 ArgusWriteOutSocket(0xc37ba010) maximum errors exceeded 200001
>>> ===============================================
>>>
>>> I have scripts in place to deal with the crashing and timestamp fits of argus 3.0.2 and so everything kept going, but I just noticed it is no longer actually writing data to the interfaces. Restarting them doesn't change it and 3.0.3.22 is behaving the same way. I also have continued to get the 'ArgusWriteOutSocket(0xd6fa3010) maximum errors exceeded' errors since, which I hadn't gotten before.
>>>
>>> I've double checked with tcpdump to make sure I'm still actually sending out data on the argus DAG interface data streams.
>>>
>>> Thanks,
>>>
>>> Leif
>>>
>>> On 02/21/2011 02:29 PM, Carter Bullard wrote:
>>>> Hey Leif,
>>>> I may have found the bug. We have new compile directives, "HAVE_DAG", that we use when you are
>>>> using the dag drivers, rather than using the libpcap interface. The bug caused us to use the dag specific
>>>> open routines, even though they were not linked in, which caused us to not try to open the interface at all.
>>>>
>>>> The fix is simple, but if you don't use patch() very often, it may be messy. I've included a new ArgusSource.c,
>>>> which you should copy over ./argus/ArgusSource.c. I've included the patch, for those that like patch files.
>>>> If this doesn't help you out, rerun argus using gdb(), and send the output again.
>>>>
>>>> Thanks, and sorry for the inconvenience,
>>>>
>>>> Carter
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ---- included patch ----
>>>> ==== //depot/argus/argus/argus/ArgusSource.c#81 - /Users/carter/argus/argus/argus/ArgusSource.c ====
>>>> 295a296
>>>>> #ifdef HAVE_DAG
>>>> 296a298
>>>>> #endif
>>>> 347a350
>>>>> #ifdef HAVE_DAG
>>>> 349d351
>>>> < #ifdef HAVE_DAG
>>>> 368a371
>>>>> }
>>>> 370d372
>>>> < }
>>>> ---- end patch ----
>>>>
>>>> On Feb 18, 2011, at 12:27 PM, Leif Tishendorf wrote:
>>>>
>>>>> Carter,
>>>>>
>>>>>> Sorry, you need to run without the DAEMON mode on. Also add a -D1 just to verify that there is some activity.
>>>>>> So try:
>>>>>>
>>>>>> run -D1 -d
>>>>>
>>>>> Ah, ok, did that and here's the output now
>>>>>
>>>>> ----
>>>>> Reading symbols from /root/argus-3.0.3.22/bin/argus...done.
>>>>> (gdb) run -D1 -d
>>>>> Starting program: /root/argus-3.0.3.22/bin/argus -D1 -d
>>>>> [Thread debugging using libthread_db enabled]
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:44.777281 ArgusNewModeler() returning 0x671010
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:44.777427 ArgusNewOutput() returning retn 0x671d20
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:44.782451 setArgusID(0x7ffff690f040, 0xac16057b) done
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:44.782472 setArgusID(0x7ffff690f040, 0xac16057b) done
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:44.782478 setArgusID(0x7ffff690f040, 0xac16057b) done
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:44.782503 ArgusParseResourceFile: ArgusBindAddr "(null)"
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:46.990235 ArgusParseResourceFile (/etc/argus.conf) returning
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:46.990277 setArgusInterfaceStatus(0x7ffff690f010, 1)
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:46.991267 ArgusEstablishListen(0x671d20, 0x7fffffffd090) binding: 172.22.5.123:568 family: 2
>>>>> [New Thread 0x7ffff5f61700 (LWP 26405)]
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:46.992196 ArgusInitOutput() done
>>>>> argus[26368]: 18 Feb 11 09:19:46.992222 started
>>>>> argus[26368.0017f6f5ff7f0000]: 18 Feb 11 09:19:46.992246 ArgusOutputProcess(0x671d20) starting
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:46.994594 ArgusOpenInterface(0x7ffff5356010, 'dag0:36') returning 0
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:46.994606 ArgusInitSource: no packet sources for this device.
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:46.994611 ArgusInitSource(0x7ffff5356010) returning 0
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:47.994704 main() ArgusSourceProcess returned: shuting down
>>>>>
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:47.994747 ArgusShutDown(Normal Shutdown)
>>>>>
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:47.994756 ArgusCloseSource(0x7ffff690f010) starting
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:47.994775 ArgusCloseEvents() done
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:47.994783 ArgusCloseOutput(0x671d20) scheduling closure after 0 records
>>>>> argus[26368.0017f6f5ff7f0000]: 18 Feb 11 09:19:48.093424 ArgusOutputProcess(0x671d20) exiting
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:48.100631 ArgusCloseOutput(0x671d20) done
>>>>> [Thread 0x7ffff5f61700 (LWP 26405) exited]
>>>>> argus[26368.00d7fef7ff7f0000]: 18 Feb 11 09:19:48.100705 ArgusShutDown()
>>>>>
>>>>> Program exited normally.
>>>>> ----
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --Leif
>>>>>
>>>>> On 02/18/2011 06:31 AM, Carter Bullard wrote:
>>>>>> Hey Leif,
>>>>>> Sorry, you need to run without the DAEMON mode on. Also add a -D1 just to verify that there is some activity.
>>>>>> So try:
>>>>>>
>>>>>> run -D1 -d
>>>>>>
>>>>>> Carter
>>>>>>
>>>>>>
>>>>>> On Feb 17, 2011, at 4:32 PM, Leif Tishendorf wrote:
>>>>>>
>>>>>>> Carter,
>>>>>>>
>>>>>>> Here is the output from gdb:
>>>>>>>
>>>>>>> ----
>>>>>>> Starting program: /root/argus-3.0.3.22/bin/argus -F ../support/Config/argus.conf
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> [New Thread 0x7ffff5f61700 (LWP 25329)]
>>>>>>> argus[25294]: 17 Feb 11 10:50:06.455798 started
>>>>>>> [Thread 0x7ffff5f61700 (LWP 25329) exited]
>>>>>>>
>>>>>>> Program exited normally.
>>>>>>> ----
>>>>>>>
>>>>>>> Though I've never run anything through gdb before so that's just a straight run command. If there is more you'd like me to do just let me know.
>>>>>>>
>>>>>>> Also, in the debug output I was wondering about the line:
>>>>>>>
>>>>>>> ----
>>>>>>> argus[13042.00172305347f0000]: 16 Feb 11 11:59:48.506743 ArgusOpenInterface(0x7f3402599010, 'dag0:62') returning 0
>>>>>>> ----
>>>>>>>
>>>>>>> Is Argus not finding the dag interface?
>>>>>>>
>>>>>>> --Leif
>>>>>>>
>>>>>>>
>>>>>>> On 02/17/2011 04:17 AM, Carter Bullard wrote:
>>>>>>>> Hey Leif,
>>>>>>>> I suspect that your packet source thread is crashing, and the rest of argus is doing its thing. Run argus under gdb to see if it tells you more about the problem.
>>>>>>>>
>>>>>>>> To compile with symbols, create the development tag and reconfigure and remake:
>>>>>>>> % touch .devel
>>>>>>>> % ./configure
>>>>>>>> % make clean
>>>>>>>> % make
>>>>>>>> % gdb ./bin/argus
>>>>>>>>
>>>>>>>> Be sure to run without daemon mode.
>>>>>>>> Carter
>>>>>>>>
>>>>>>>>
>>>>>>>> On Feb 15, 2011, at 5:21 PM, Leif Tishendorf <ltishend at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Carter,
>>>>>>>>>
>>>>>>>>> I should probably start a different thread for this but it's the same system as the 3.0.3.22 issue and didn't want to clutter things up too much. I just recently installed 3.0.2 on this same box, and originally I thought it was functioning normally. However, after more testing I've noticed there are a couple issues and was wondering if you had any suggestions.
>>>>>>>>>
>>>>>>>>> 1. I have 6 load balanced streams to break up the traffic on a Dag 8.1 card and an argus process on each. Over time the argus processes will exit without error.
>>>>>>>>>
>>>>>>>>> 2. Time stamps over time get extremely skewed (like it starts out putting year ranges from 1912 to 2057). This seems to be worse with higher load. Currently each process is running at about 20% CPU or less (8 core, 16 hyper-threaded). I have Snort, nTop and tcpdump running on other streams and they don't experience the time skew issue.
>>>>>>>>>
>>>>>>>>> Ideally I'd rather be using 3.0.3.22 (3.0.4 when it's released) to take advantage of its multiple interface handling and multi-core support, and not do too much troubleshooting on an older code base. If there is anything I can test or try, or information I can provide, I'd be happy to do so.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> --Leif
>>>>>>>>>
>>>>>>>>> On 02/14/2011 12:31 PM, Carter Bullard wrote:
>>>>>>>>>> Hey Leif,
>>>>>>>>>> It could be a bug. Argus has run on many versions of the DAG, but I don't test
>>>>>>>>>> each dev release against DAGs, as I no longer have access to one.
>>>>>>>>>>
>>>>>>>>>> The easiest test is to make sure tcpdump gets packets from that interface. If
>>>>>>>>>> so, then running argus with the "-D debugLevel" option will give us some detail
>>>>>>>>>> printing on what is happening.
>>>>>>>>>>
>>>>>>>>>> Try with "-D 6" to start, and if that doesn't help, increase to get more info, and don't run
>>>>>>>>>> in daemon mode.
>>>>>>>>>>
>>>>>>>>>> Be sure to put the "-D 6" as the first option, so you get debug printing for parsing the
>>>>>>>>>> command line options, etc.
>>>>>>>>>>
>>>>>>>>>> To compile debug support into argus, in the argus distribution directory:
>>>>>>>>>> % touch .debug
>>>>>>>>>> % ./configure
>>>>>>>>>> % make clean
>>>>>>>>>> % make
>>>>>>>>>>
>>>>>>>>>> Carter
>>>>>>>>>>
>>>>>>>>>> On Feb 14, 2011, at 3:15 PM, Leif Tishendorf wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> I'm running an Endace DAG 8.1 card and I'm having difficulty getting Argus to work with it. I've compiled Argus 3.0.3.22 against the DAG-enabled libpcap files, and Argus will run if I set it to eth0, which is the management interface, but if I set it to a dag stream, e.g. ARGUS_INTERFACE=dag0:8, the daemon says it starts, and prints to syslog that it starts, but it doesn't actually start.
>>>>>>>>>>>
>>>>>>>>>>> I was wondering if anyone may have had a similar issue and be able to offer some pointers.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> --Leif
>>>>>>>>>>>