segfault at 000000000311c000 rip 000000000040fb46rsp 0000007fbffff830 error 4

Gunnar Lindberg Gunnar.Lindberg at chalmers.se
Tue May 12 01:09:40 EDT 2009


Usually it runs for a very long time (like days) but it has crashed
a few times within, say 30 min.

Now, if we just accept gdb's idea of where it crashed, we find:

    257       if (l2->start == NULL)
    258          l2->start = l1->start;
    259       else
->  260          l2->end->nxt = l1->start;

It seems like what happens in ArgusLoadList() is that we're moving
data from one list to another and maybe there is a condition some-
where else in the code where l2->end is not filled in correctly -
maybe at high load? The "ticking bomb" that is so hard to find.

Then I'm surprised thas gcc doesn't barf on "l2->end->nxt" since
I have'nt seen a "nxt" inside a "struct ArgusListStruct", which I
think is what "l2->end" is supposed to point at.

I'm out of the office for the rest of the week, so I'm not going
to replace the current argus with the -g version right away.


Hardware is Dell 2950.

OS is RHEL Linux
/etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 7)

# /bin/uname -a
Linux argc.irt.chalmers.se 2.6.9-78.0.13.ELsmp #1 SMP Wed Jan 7 17:45:52 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

# /bin/arch
x86_64

gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10)


	Gunnar Lindberg

>From carter at qosient.com  Tue May 12 01:25:34 2009
>Cc: argus-info at lists.andrew.cmu.edu
>Message-Id: <5D9EB1B9-7EA5-4631-9B58-FF82865ADE70 at qosient.com>
>From: Carter Bullard <carter at qosient.com>
>To: Gunnar Lindberg <Gunnar.Lindberg at chalmers.se>
>In-Reply-To: <200905112110.n4BLAHXV017643 at grunert.cdg.chalmers.se>
>Subject: Re: [ARGUS] segfault at 000000000311c000 rip 000000000040fb46rsp	0000007fbffff830 error 4
>Date: Mon, 11 May 2009 19:24:46 -0400
>References: <200905112110.n4BLAHXV017643 at grunert.cdg.chalmers.se>

>Hey Gunnar,
>The way you turn on the "-g" option is to do this in the main  
>distribution directory:
>    % touch .devel
>    % ./configure;make clean;make

>That will compile everything with the appropriate flags.

>Well, l2 really looks screwed up.  What kind of machine is this 64-bit  
>thing?
>I think we're having alignment problems, possibly.  Does it run for  
>any amount
>of time before it blows up?

>Carter

>On May 11, 2009, at 5:10 PM, Gunnar Lindberg wrote:

>> First of all, I re-run make with "-g" and used that with the existing
>> core file; I can't tell whether that should be OK but as far as I can
>> see it still makes some kind of sense.
>>
>>
>> [lindberg at argv ~]$ gdb argus-g core.14369
>> (gdb) where
>> #0  0x0000000000410bc2 in ArgusLoadList (l1=0x651460, l2=0x6540a0)
>>    at ArgusUtil.c:260
>> #1  0x000000000041557b in ArgusOutputProcess (arg=Variable "arg" is  
>> not available.
>> ) at ArgusOutput.c:477
>> #2  0x000000000040bb6c in ArgusProcessPacket (src=Variable "src" is  
>> not available.
>> ) at ArgusModeler.c:1324
>> #3  0x000000000040d006 in ArgusEtherPacket (user=0x2a95786010 "",  
>> h=Variable "h" is not available.
>> )
>>    at ArgusSource.c:716
>> #4  0x00000034e2f04bff in ?? () from /usr/lib64/libpcap.so.0.8.3
>> #5  0x0000000000410759 in ArgusGetPackets (src=0x2a95786010)
>>    at ArgusSource.c:2093
>> #6  0x0000000000404f83 in main (argc=1, argv=0x7fbffffe08) at  
>> argus.c:535
>>
>>
>> (gdb) print *l1
>> $3 = {start = 0x18c21f0, end = 0x17e98b0, count = 589, pushed =  
>> 3044164,
>>  popped = 0, loaded = 3043575, outputTime = {tv_sec = 0, tv_usec = 0},
>>  reportTime = {tv_sec = 0, tv_usec = 0}}
>>
>> (gdb) print *l2
>> $5 = {start = 0x85d278d99c8b81d1, end = 0x63caa47a16f1492e,
>>  count = 1579875320, pushed = 1880390013, popped = 2415426777,
>>  loaded = 3722138485, outputTime = {tv_sec = 144115210689264557,
>>    tv_usec = 14930315638210660}, reportTime = {tv_sec =  
>> -7084847654803835648,
>>    tv_usec = -5817086215248780719}}
>>
>> argus/ArgusUtil.c
>>    246 void
>>    247 ArgusLoadList(struct ArgusListStruct *l1, struct  
>> ArgusListStruct *l2)
>>    248 {
>>    249    if (l1 && l2) {
>>    250       int count;
>>    251 #if defined(ARGUS_THREADS)
>>    252       pthread_mutex_lock(&l1->lock);
>>    253       pthread_mutex_lock(&l2->lock);
>>    254 #endif
>>    255       count = l1->count;
>>    256
>>    257       if (l2->start == NULL)
>>    258          l2->start = l1->start;
>>    259       else
>>    260          l2->end->nxt = l1->start;
>>    261
>>    262       l2->end = l1->end;
>>    263       l2->count += count;
>>    264
>>    265       l1->start = NULL;
>>    266       l1->end = NULL;
>>    267       l1->loaded += count;
>>    268       l1->count = 0;
>>    269
>>    270 #if defined(ARGUS_THREADS)
>>    271       pthread_mutex_unlock(&l2->lock);
>>    272       pthread_mutex_unlock(&l1->lock);
>>    273 #endif
>>    274
>>    275 #ifdef ARGUSDEBUG
>>    276    ArgusDebug (5, "ArgusLoadList (0x%x, 0x%x) load %d objects 
>> \n", l1, l2        , count);
>>    277 #endif
>>    278    }
>>    279 }
>>
>>
>> 		Gunnar Lindberg
>>
>>
>>> From SRS0=BzD3OK=BH=qosient.com=carter at srs.bis.na.blackberry.com   
>>> Mon May 11 13:18:08 2009
>>> Message-ID: <2044323243-1242040666-cardhu_decombobulator_blackberry.rim.net-2042564372- at bxe1165.bisx.prod.on.blackberry 
>>> >
>>> Reply-To: carter at qosient.com
>>> References: <E5F8710F-522D-4579-8569-A9DD5E130A06 at qosient.com><200905110551.n4B5pV62007936 at grunert.cdg.chalmers.se 
>>> >
>>> In-Reply-To: <200905110551.n4B5pV62007936 at grunert.cdg.chalmers.se>
>>> Subject: Re: [ARGUS] segfault at 000000000311c000 rip  
>>> 000000000040fb46rsp	0000007fbffff830 error 4
>>> To: "Gunnar Lindberg" <Gunnar.Lindberg at chalmers.se>,
>>>       argus-info-bounces+carter=qosient.com at lists.andrew.cmu.edu,
>>>       "Argus" <argus-info at lists.andrew.cmu.edu>
>>> From: carter at qosient.com
>>> Date: Mon, 11 May 2009 11:19:44 +0000
>>
>>> Hey Gunnar,
>>> The C level debugging in gdb() is very good, and gives you quick  
>>> access to the symbols and stack info.
>>>
>>> I have never seen problems with ArgusLoadList(), so if you have a  
>>> core file, if you could load it into gdb() and type:
>>>
>>> (gdb) where
>>> (gdb) print *l1. (assuming its in AtgusLoadList)
>>> (gdb) print *l2
>>>
>>> If not, if you could run it under gdb() until it stops, and type  
>>> the same, that would give me a good start.
>>>
>>> Carter
>>>
>>> Sent from my Verizon Wireless BlackBerry
>>>
>>> -----Original Message-----
>>> From: Gunnar Lindberg <Gunnar.Lindberg at chalmers.se>
>>>
>>> Date: Mon, 11 May 2009 07:51:31
>>> To: <argus-info at lists.andrew.cmu.edu>
>>> Subject: Re: [ARGUS] segfault at 000000000311c000 rip  
>>> 000000000040fb46
>>> 	rsp	0000007fbffff830 error 4
>>>
>>>
>>> No .threads in argus-3.0.1.beta.3
>>>
>>> My gdb knowledge is limited but I've done quite some amount of
>>> C/machine code debugging in my early days (25 years ago and MC68000
>>> I'd probably been able to write the C code from the optimized
>>> assembler :-). But, this is *86 - "same, same, but different"...
>>>
>>> Based on that I did the "disass" trick and <<<=== indicates the
>>> machine code where the crash occured. What beats me on *86 is
>>> which register is used for which C variable, but there seems to
>>> have been an offset "0x8(%rsi),%r9" involved just before - that
>>> was variables in a C struct on MC68000 and I guess it still is.
>>>
>>> So we picked up something 8 bytes into a C struct and than tried
>>> to us it as a pointer "%r10,(%r9)" - and pooof.
>>>
>>> The most probable thing is that data/pointers got screwed up minutes
>>> ago and then the bomb goes off now because we just got to that data.
>>> However, before going through the linked list of data I'd like to ask
>>> about a line of C code:
>>>
>>> argus/ArgusUtil.c:
>>>
>>> void
>>> ArgusLoadList(struct ArgusListStruct *l1, struct ArgusListStruct *l2)
>>> {
>>> ...
>>>     if (l2->start == NULL)
>>>        l2->start = l1->start;
>>>     else
>>>        l2->end->nxt = l1->start;		<=
>>> ...
>>> }
>>>
>>> The only "nxt" I find is within a "struct ArgusListRecord",
>>> but "l2" and "l2->end" points at a "struct ArgusListStruct".
>>> Could this be it?
>>>
>>> Or is there some condition where l2->end is not correctly set?
>>>
>>> 	Gunnar Lindberg
>>>
>>> May  7 16:33:30 argv kernel: argus[14369] general protection
>>> rip:410bc2 rsp:7fbffff308 error:0
>>>
>>> gdb argus.14369 /core.14369
>>> ...
>>> #0  0x0000000000410bc2 in ArgusLoadList ()
>>> (gdb) where
>>> #0  0x0000000000410bc2 in ArgusLoadList ()
>>> #1  0x000000000041557b in ArgusOutputProcess ()
>>> #2  0x000000000040bb6c in ArgusProcessPacket ()
>>> #3  0x000000000040d006 in ArgusEtherPacket ()
>>> #4  0x00000034e2f04bff in ?? () from /usr/lib64/libpcap.so.0.8.3
>>> #5  0x0000000000410759 in ArgusGetPackets ()
>>> #6  0x0000000000404f83 in main ()
>>> (gdb) disass 0x0000000000410bc2
>>> Dump of assembler code for function ArgusLoadList:
>>> 0x0000000000410ba0 <ArgusLoadList+0>:   test   %rdi,%rdi
>>> 0x0000000000410ba3 <ArgusLoadList+3>:   setne  %dl
>>> 0x0000000000410ba6 <ArgusLoadList+6>:   xor    %eax,%eax
>>> 0x0000000000410ba8 <ArgusLoadList+8>:   test   %rsi,%rsi
>>> 0x0000000000410bab <ArgusLoadList+11>:  setne  %al
>>> 0x0000000000410bae <ArgusLoadList+14>:  test   %eax,%edx
>>> 0x0000000000410bb0 <ArgusLoadList+16>:  je     0x410be9  
>>> <ArgusLoadList+73>
>>> 0x0000000000410bb2 <ArgusLoadList+18>:  cmpq   $0x0,(%rsi)
>>> 0x0000000000410bb6 <ArgusLoadList+22>:  mov    0x10(%rdi),%ecx
>>> 0x0000000000410bb9 <ArgusLoadList+25>:  je     0x410bf0  
>>> <ArgusLoadList+80>
>>> 0x0000000000410bbb <ArgusLoadList+27>:  mov    0x8(%rsi),%r9
>>> 0x0000000000410bbf <ArgusLoadList+31>:  mov    (%rdi),%r10
>>> 0x0000000000410bc2 <ArgusLoadList+34>:  mov    %r10,(%r9)        
>>> <<<===
>>> 0x0000000000410bc5 <ArgusLoadList+37>:  mov    0x8(%rdi),%r11
>>> 0x0000000000410bc9 <ArgusLoadList+41>:  add    %ecx,0x1c(%rdi)
>>> 0x0000000000410bcc <ArgusLoadList+44>:  add    %ecx,0x10(%rsi)
>>> 0x0000000000410bcf <ArgusLoadList+47>:  movq   $0x0,(%rdi)
>>> 0x0000000000410bd6 <ArgusLoadList+54>:  movl   $0x0,0x10(%rdi)
>>> 0x0000000000410bdd <ArgusLoadList+61>:  mov    %r11,0x8(%rsi)
>>> 0x0000000000410be1 <ArgusLoadList+65>:  movq   $0x0,0x8(%rdi)
>>> 0x0000000000410be9 <ArgusLoadList+73>:  repz retq
>>> 0x0000000000410beb <ArgusLoadList+75>:  data16
>>> 0x0000000000410bec <ArgusLoadList+76>:  data16
>>> 0x0000000000410bed <ArgusLoadList+77>:  nop
>>> 0x0000000000410bee <ArgusLoadList+78>:  data16
>>> 0x0000000000410bef <ArgusLoadList+79>:  nop
>>> 0x0000000000410bf0 <ArgusLoadList+80>:  mov    (%rdi),%r8
>>> 0x0000000000410bf3 <ArgusLoadList+83>:  mov    %r8,(%rsi)
>>> 0x0000000000410bf6 <ArgusLoadList+86>:  jmp    0x410bc5  
>>> <ArgusLoadList+37>
>>> 0x0000000000410bf8 <ArgusLoadList+88>:  data16
>>> 0x0000000000410bf9 <ArgusLoadList+89>:  data16
>>> 0x0000000000410bfa <ArgusLoadList+90>:  data16
>>> 0x0000000000410bfb <ArgusLoadList+91>:  nop
>>> 0x0000000000410bfc <ArgusLoadList+92>:  data16
>>> 0x0000000000410bfd <ArgusLoadList+93>:  data16
>>> 0x0000000000410bfe <ArgusLoadList+94>:  data16
>>> 0x0000000000410bff <ArgusLoadList+95>:  nop
>>> End of assembler dump.
>>> (gdb) info registers
>>> rax            0x1      1
>>> rbx            0x174f450        24441936
>>> rcx            0x24d    589
>>> rdx            0x4a02f101       1241706753
>>> rsi            0x6540a0 6635680
>>> rdi            0x651460 6624352
>>> rbp            0x6516c0 0x6516c0
>>> rsp            0x7fbffff308     0x7fbffff308
>>> r8             0x69c6d  433261
>>> r9             0x63caa47a16f1492e       7190740599328295214
>>> r10            0x18c21f0        25960944
>>> r11            0x41a1320        68817696
>>> r12            0x3      3
>>> r13            0x651738 6625080
>>> r14            0x0      0
>>> r15            0x7fbffff510     548682069264
>>> rip            0x410bc2 0x410bc2 <ArgusLoadList+34>
>>> eflags         0x10286  66182
>>> cs             0x33     51
>>> ss             0x2b     43
>>> ds             0x0      0
>>> es             0x0      0
>>> fs             0x0      0
>>> gs             0x0      0
>>>
>>>
>>>
>>>> From carter at qosient.com  Thu May  7 19:00:53 2009
>>>> Cc: argus-info at lists.andrew.cmu.edu
>>>> Message-Id: <E5F8710F-522D-4579-8569-A9DD5E130A06 at qosient.com>
>>>> From: Carter Bullard <carter at qosient.com>
>>>> To: Gunnar Lindberg <Gunnar.Lindberg at chalmers.se>
>>>> In-Reply-To: <200905071507.n47F7xeB026201 at grunert.cdg.chalmers.se>
>>>> Subject: Re: [ARGUS] segfault at 000000000311c000 rip  
>>>> 000000000040fb46 rsp	0000007fbffff830 error 4
>>>> Date: Thu, 7 May 2009 13:00:42 -0400
>>>> References: <200905071507.n47F7xeB026201 at grunert.cdg.chalmers.se>
>>>
>>>> Hey Gunnar,
>>>> The gdb() commands of interest are:
>>>
>>>>   (gdb) where
>>>
>>>> ArgusLoadList() is the routine that passes flow record status  
>>>> reports
>>>> from the
>>>> packet processing engine to the output processor.  This definitely
>>>> shouldn't
>>>> have a problem, so it will be interesting to figure out what the
>>>> problem maybe.
>>>
>>>> Are you running with threads enabled?  (is there a ./.threads file  
>>>> in
>>>> your root directory?)
>>>
>>>> Carter
>>>
>>>
>>



More information about the argus mailing list