new clients rc.62 on the server - description of rastream()

Carter Bullard carter at qosient.com
Thu Nov 1 23:24:23 EDT 2007


Hey Terry,
Ok, one thing that I've discovered in my tests with fprobe()
as a netflow record source is that the hold time for rastream()
may need to be very large - possibly on the order of 2-5 minutes,
rather than 10s.  This is because of netflow's very poor cache
management strategies.

If a record comes in that is outside the range of the "-B secs"
option, rastream() will toss it.  To test, compile the clients with
debug support ("touch .devel .debug; ./configure; make clean; make")
and run rastream() with -D2 to see if it complains about
the range of the input records.  I did find a leak where some
of these out-of-range records were dropped without being
deallocated, so that may have been our problem.
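To illustrate the shape of that fix (with made-up names - this is a
sketch, not the actual argus-clients code): the toss path has to free
its copy of the record, not just drop it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of the "-B secs" hold-window check.  The type
 * and function names are illustrative only.  A record whose start
 * time falls outside [now - hold, now + hold] is tossed, and the
 * leak was that the tossed copy was never freed. */

typedef struct {
    long start_secs;   /* record start time, in seconds */
} record_t;

/* Returns 1 if the record is inside the hold window, 0 otherwise. */
static int in_hold_window(const record_t *rec, long now, long hold_secs)
{
    return rec->start_secs >= now - hold_secs &&
           rec->start_secs <= now + hold_secs;
}

/* Returns 1 if the record was processed, 0 if it was tossed.
 * Either way the caller's copy is released. */
static int process_record(record_t *rec, long now, long hold_secs)
{
    if (!in_hold_window(rec, now, hold_secs)) {
        free(rec);   /* the buggy path dropped rec without this free */
        return 0;
    }
    /* ... queue the record for the current output bin ... */
    free(rec);       /* released after processing in this sketch */
    return 1;
}
```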

I could have rastream() adjust its range timer to accommodate
records that come in way out of range, but I'm not comfortable
with these types of dynamic behaviors: after some time you find
that rastream() has stopped outputting records and is getting
huge, because the hold time has crept up to some ridiculous
value, like 1.5 years (not good).
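If the timer were ever made adaptive, it would at minimum need a hard
ceiling so it can't creep toward values like that.  A hypothetical
guard (the constant and names are made up, not rastream() internals):

```c
#include <assert.h>

/* Hypothetical clamp for an adaptive hold window: grow the window to
 * cover an out-of-range record, but never past a fixed ceiling. */

#define MAX_HOLD_SECS 300L   /* never hold more than 5 minutes */

static long adjust_hold(long current_hold, long record_offset_secs)
{
    long needed = record_offset_secs > current_hold
                      ? record_offset_secs : current_hold;
    return needed > MAX_HOLD_SECS ? MAX_HOLD_SECS : needed;
}
```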

I'll have new code up in the morning, and we'll see if that
doesn't help.

Carter


On Nov 1, 2007, at 10:32 PM, Carter Bullard wrote:

> Hey Terry,
> We're very close to releasing argus-3.0, and it's going to be
> difficult to say the code is good if we have a known memory
> leak in a key component, so it's important to me to get this fixed.
>
> So the question is, "is there a .threads file in your root directory?"
> If so, try removing it and doing the "./configure;make clean;make"
> again, to see if that makes a difference.
>
> There is some clutter that valgrind() will report on that is
> not critical, such as the port names hash table memory, or
> a few strdup'd strings that are left behind.  I'm not worried
> about these.  But real memory leaks that keep you from
> running these programs for a year at a time are very
> important to fix, so thanks for helping me out on this.
>
>
> Carter
>
> On Nov 1, 2007, at 9:57 PM, Terry Burton wrote:
>
>> On Nov 1, 2007 5:39 PM, Carter Bullard <carter at qosient.com> wrote:
>>> Hey Terry,
>>> OK, so looking at your graph and the valgrind output and all
>>> information so far, the system is not hurting for memory.  I'm
>>> working on the potential leak and may have found some things to
>>> clean up, but I'm not thinking that it's the cause of your issues.
>>> It may be that we are running too many concurrent processes, and
>>> the first complaint by fork() (EAGAIN) just maps to an error
>>> message that says there is not enough memory.  I'm going to
>>> change the script scheduling and patch up the memory issues, and
>>> we'll try again.
>>
>> Hi Carter,
>>
>> After performing some more basic tests I have found some information
>> that may help to find the leak. I'm not sure whether this correlates
>> with your current thinking on the problem or not...
>>
>> I run the following collectors:
>>
>> /opt/argus/sbin/argus -X -d -A -i eth2 -P 561
>> /opt/argus/sbin/radium -X -d -C -S 1006 -P 564
>> /opt/argus/sbin/radium -X -d -C -S 1007 -P 565
>>
>> I have another process that aggregates these:
>>
>> /opt/argus/sbin/radium -X -d -S localhost:561 -S localhost:564 -S
>> localhost:565 -P 569
>>
>> Connecting to the SPAN feed does not appear to leak (at least not
>> significantly enough for me to have noticed after one hour):
>>
>> /opt/argus/bin/rastream -X -S localhost:561 -M time 5m -B 10s -f
>> /bin/true -w /srv/argus/archive/%Y-%m-%d/\$srcid-%H:%M:%S.arg
>>
>> Connecting to either the aggregated feed or any of the individual
>> NetFlow feeds leaks rapidly (up to ~10MB/min per NetFlow):
>>
>> /opt/argus/bin/rastream -X -S localhost:569 -M time 5m -B 10s -f
>> /bin/true -w /srv/argus/archive/%Y-%m-%d/\$srcid-%H:%M:%S.arg
>> /opt/argus/bin/rastream -X -S localhost:564 -M time 5m -B 10s -f
>> /bin/true -w /srv/argus/archive/%Y-%m-%d/\$srcid-%H:%M:%S.arg
>> /opt/argus/bin/rastream -X -S localhost:565 -M time 5m -B 10s -f
>> /bin/true -w /srv/argus/archive/%Y-%m-%d/\$srcid-%H:%M:%S.arg
>>
>> So it would appear to be a NetFlow-related problem, possibly with
>> some memory allocated through the call path main ->
>> ArgusReadStream -> ArgusReadStreamSocket -> ArgusHandleDatum ->
>> RaProcessRecord -> RaProcessThisRecord -> ArgusAlignRecord ->
>> ArgusCopyRecordStruct never being freed, as hinted at by the
>> following section of the valgrind output:
>>
>> ==23957== 2,388 bytes in 3 blocks are possibly lost in loss record 12 of 17
>> ==23957==    at 0x401C6CA: calloc (vg_replace_malloc.c:279)
>> ==23957==    by 0x806B4F9: ArgusCalloc (argus_util.c:15011)
>> ==23957==    by 0x80838C8: ArgusCopyRecordStruct (argus_client.c:3493)
>> ==23957==    by 0x8083FC8: ArgusAlignRecord (argus_client.c:7137)
>> ==23957==    by 0x804C7E7: RaProcessThisRecord (rastream.c:894)
>> ==23957==    by 0x804CC50: RaProcessRecord (rastream.c:872)
>> ==23957==    by 0x8077FD4: RaScheduleRecord (argus_util.c:860)
>> ==23957==    by 0x807820D: ArgusHandleDatum (argus_util.c:930)
>> ==23957==    by 0x808C8BE: ArgusReadStreamSocket (argus_client.c:1622)
>> ==23957==    by 0x808D35E: ArgusReadStream (argus_client.c:1997)
>> ==23957==    by 0x80502BC: main (argus_main.c:359)
>>
>> Does this appear to be along the right lines?
>>
>> What is frustrating (from the point of view of debugging) is that I
>> seem to get consistently differing results from valgrind depending
>> upon whether I compile with or without CFLAGS="-O -g -fno-inline".
>> The above trace (with CFLAGS amendments) differs from my previous
>> posting by "possibly" losing ~2KB rather than "definitely" losing
>> 1MB (without CFLAGS mods) over similar 15min runs. Also with the
>> CFLAGS amendments I get this new "definite" leak from a different
>> allocation path:
>>
>> ==23957== 275,838 (275,736 direct, 102 indirect) bytes in 349 blocks
>> are definitely lost in loss record 16 of 17
>> ==23957==    at 0x401C6CA: calloc (vg_replace_malloc.c:279)
>> ==23957==    by 0x806B4F9: ArgusCalloc (argus_util.c:15011)
>> ==23957==    by 0x8075022: setArgusWfile (argus_util.c:18486)
>> ==23957==    by 0x804F1FD: ArgusParseArgs (argus_main.c:1193)
>> ==23957==    by 0x804F9D3: ArgusMainInit (argus_main.c:729)
>> ==23957==    by 0x804FA6F: main (argus_main.c:131)
>>
>> Anyhow, I greatly appreciate your efforts on this and do not want
>> you to take any of this feedback as though I am pressing you for a
>> quick fix - that's not my intention at all, as there is no great
>> urgency on me to get this working.
>>
>> Let me know if there is anything that you would like me to do by way
>> of testing for this problem or anything else.
>>
>>
>> Hope this all helps,
>>
>> Tez
>>
>
