rastream exits after multi day run

Carter Bullard carter at qosient.com
Tue Nov 25 12:27:00 EST 2014


Hey Jesse,
OK !!! I’m on this today, so hopefully we can get to the bottom.
So, to review … you’re having problems on specific files not completing, but not all files.
I have a copy of this one file, and I’m starting to process it now.
Any additional insights that I can consider ???

Hope all is most excellent, and sorry for the inconvenience.

Carter

> On Nov 23, 2014, at 8:19 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
> 
> Doesn’t seem like that helped in terms of time:
> 
> $ /usr/bin/time /usr/local/bin/racluster -f racluster.conf -r problematic.argus -w cluster.problematic.argus
> 903.94user 3560.26system 13:44:30elapsed 9%CPU (0avgtext+0avgdata 38389104maxresident)k
> 1424260656inputs+2088872outputs (33388189major+48470923minor)pagefaults 0swaps
> 
> 
> On Nov 17, 2014, at 8:31 PM, Carter Bullard <carter at qosient.com> wrote:
> 
>> Hey Jesse,
>> Can you try the racluster.conf strategy I suggested earlier  ??
>> That maybe all that is needed.  You mentioned that you could share the file ???  I'm getting back into argus now, and can take a look !!!
>> 
>> Carter
>> 
>> 
>>> On Nov 17, 2014, at 3:46 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>> 
>>> OK, potentially the racluster thing is just that it takes too long to finish...What I’m seeing is that it’s taking about 7 hours to cluster a 5 minute file on this machine; this is certainly not sustainable. This machine has 10 GB of RAM, and a sample file has the following stats:
>>> 
>>> $ racount -r problematic.argus 
>>> racount   records     total_pkts     src_pkts       dst_pkts       total_bytes        src_bytes          dst_bytes
>>>  sum   13938359    208024628      208024628      0              148229729004       148229729004       0                 
>>> 
>>> $ ls -lh problematic.argus 
>>> -rw-r--r-- 1 2.2G Oct 30 10:10 problematic.argus
>>> 
>>> Any recommendations on sizing for clustering this size of file?
>>> 
>>> The benefits of the generic cluster are apparent in file size at least:
>>> 
>>> $ racount -r cluster.problematic.argus 
>>> racount   records     total_pkts     src_pkts       dst_pkts       total_bytes        src_bytes          dst_bytes
>>>  sum   11050945    208024628      208024628      0              148229729004       148229729004       0                 
>>> 
>>> ls -lh cluster.problematic.argus
>>> -rw-r--r-- 1 1020M Nov 11 22:01 cluster.problematic.argus
>>> 
>>> Cheers,
>>> 
>>> Jesse
>>>> On Nov 11, 2014, at 7:50 AM, Carter Bullard <carter at qosient.com> wrote:
>>>> 
>>>> Just a few more thoughts... Its rastream.1 that is having the problem, as it is exiting ... ???  One of the gotchas with stream processing is data source time sync.  We do some extra work if the data doesn't show up in the ' -B secs ' window, and there could be an issue there ???  Are your data sources sync'ed up pretty good ???
>>>> 
>>>> Carter
>>>> 
>>>>> On Nov 11, 2014, at 7:42 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>> 
>>>>> Hey Jesse,
>>>>> Sorry for the delayed response !!!  Been a little busy.
>>>>> The timeout messages are fine ... we call that routine every 1 sec for maintnenace routines, like flushing timeout queues etc ... seems the index for that routine may have changed.
>>>>> 
>>>>> What does this file do as you process it ???  does it take forever  or just a massively long time ??  You mentioned 'stuck' ??
>>>>> 
>>>>> 
>>>>> Does the example racluster.conf file change anything ???
>>>>> If you can share that would be great, just nee to know what the fault is .
>>>>> Carter
>>>>> 
>>>>>> On Nov 3, 2014, at 9:50 AM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>> 
>>>>>> Hey Carter,
>>>>>> 
>>>>>> I don’t think memory is the issue, or at least, host graphs for memory usage don’t seem to indicate memory exhaustion...Oddly the problematic file when run manually exhibits something I’ve not run into:
>>>>>> 
>>>>>> $ /usr/local/bin/racluster -D 3 -f racluster.conf -r problematic.argus -w cluster.problematic.argus
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.388331 ArgusAddFileList (0x4674d010, problematic.argus, 1, -1, -1) returning 1
>>>>>> racluster[61079.00278946357f0000]: 2014-11-03 09:27:07.391338 Argusfinish_parse (0x1d58530)
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.391766 ArgusFilterCompile () done 0
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.397743 ArgusReadConnection() read 16 bytes
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.397843 ArgusReadConnection() read 112 bytes
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.406093 ArgusInitAddrtoname (0x7f354674d010, 0x0, 0x0)
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.406115 ArgusParseInit(0x7f354674d010 0x7f354665b010
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.406126 ArgusReadConnection(0x4665b010, 1) returning 1
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:08.509529 ArgusClientTimeout()
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:09.531465 ArgusClientTimeout()
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:10.535630 ArgusClientTimeout()
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:11.546747 ArgusClientTimeout()
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:12.625805 ArgusClientTimeout()
>>>>>> racluster[61078.00278946357f0000]: 2014-11-03 09:27:13.632538 ArgusClientTimeout()
>>>>>> 
>>>>>> The file is 2.2 GB....I can share the file if that’s helpful!
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Jesse
>>>>>>> On Oct 30, 2014, at 8:02 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>> 
>>>>>>> Hey Jesse,
>>>>>>> The trace indicates that racluster is trying to find a flow in its cache, and if you have a lot of flows, racluster will spend considerable time trying to get through the number of flows that are being cached.  There are ways to speed this up, and we can do this once we get past this first round of issues.
>>>>>>> 
>>>>>>> How big is the file you’re trying to racluster ??  If its too big, then its too big.  If your running out of disk, not much we can do about that.  If you’re running out of memory, then the support in racluster.conf that does a bit of idle flow cache management should help.
>>>>>>> 
>>>>>>> If it is memory, you can try a simple rule like this:
>>>>>>> 
>>>>>>> racluster.conf:
>>>>>>> 
>>>>>>> filter=“tcp or udp”    idle=300
>>>>>>> filter=“” 
>>>>>>> 
>>>>>>> and run against the file you were choking on??  Checking out the amount of memory racluster.1 uses while its doing its thing is the key.
>>>>>>> 
>>>>>>> racluster -f racluster.conf -with -the -options -you -were -using
>>>>>>> 
>>>>>>> Maybe this can help ???
>>>>>>> 
>>>>>>> Carter
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Oct 30, 2014, at 5:14 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>>> 
>>>>>>>> So perhaps rastream is the symptom and not the cause...I checked in on the restarted process just now and noticed that the “process_argus_file.bash” was still working on a file from 10 am this morning...Specifically it seemed stuck on the racluster command. I attached with strace initially and then gdb and it seems to be stuck in this loop:
>>>>>>>> 
>>>>>>>> 5608             unsigned short *ptr1 = (unsigned short *) hstruct->buf;
>>>>>>>> (gdb) 
>>>>>>>> 5609             unsigned short *ptr2 = (unsigned short *) target->hstruct.buf;
>>>>>>>> (gdb) 
>>>>>>>> 5611             if (ptr1 && ptr2) {
>>>>>>>> (gdb) 
>>>>>>>> 5612                for (i = 0, len = hstruct->len/sizeof(unsigned short); i < len; i++)
>>>>>>>> (gdb) 
>>>>>>>> 5613                   if (*ptr1++ != *ptr2++)
>>>>>>>> (gdb) 
>>>>>>>> 5612                for (i = 0, len = hstruct->len/sizeof(unsigned short); i < len; i++)
>>>>>>>> (gdb) 
>>>>>>>> 5613                   if (*ptr1++ != *ptr2++)
>>>>>>>> (gdb) 
>>>>>>>> 5612                for (i = 0, len = hstruct->len/sizeof(unsigned short); i < len; i++)
>>>>>>>> (gdb) 
>>>>>>>> 5613                   if (*ptr1++ != *ptr2++)
>>>>>>>> (gdb) 
>>>>>>>> 5614                      break;
>>>>>>>> (gdb) 
>>>>>>>> 5615                if (i == len) {
>>>>>>>> (gdb) 
>>>>>>>> 5611             if (ptr1 && ptr2) {
>>>>>>>> (gdb) 
>>>>>>>> 5626             target = target->nxt;
>>>>>>>> (gdb) 
>>>>>>>> 5627          } while (target != head);
>>>>>>>> 
>>>>>>>> (gdb) where
>>>>>>>> #0  0x0000000000475c21 in ArgusFindRecord (htable=0x1b74470, hstruct=0x1b75078) at ./argus_client.c:5609
>>>>>>>> #1  0x0000000000405c0d in RaProcessThisRecord (parser=0x7fac4def9010, argus=0x7fac4de07630) at ./racluster.c:720
>>>>>>>> #2  0x000000000040591d in RaProcessRecord (parser=0x7fac4def9010, ns=0x7fac4de07630) at ./racluster.c:667
>>>>>>>> #3  0x0000000000416688 in RaScheduleRecord (parser=0x7fac4def9010, ns=0x7fac4de07630) at ./argus_util.c:2664
>>>>>>>> #4  0x0000000000416ca5 in ArgusHandleRecord (parser=0x7fac4def9010, input=0x7fac4de07010, ptr=0x7fac4d9fa7c4, filter=0x7fac4e01a800) at ./argus_util.c:2776
>>>>>>>> #5  0x0000000000466d2f in ArgusReadStreamSocket (parser=0x7fac4def9010, input=0x7fac4de07010) at ./argus_client.c:440
>>>>>>>> #6  0x0000000000466f3d in ArgusReadFileStream (parser=0x7fac4def9010, input=0x7fac4de07010) at ./argus_client.c:496
>>>>>>>> #7  0x0000000000407841 in main (argc=5, argv=0x7fffc8ebbd68) at ./argus_main.c:253
>>>>>>>> 
>>>>>>>> Meanwhile, the scratch space for the original file gets filled up and rastream has nowhere to write to...Which I believe is why rastream eventually fails. Thoughts? Here’s where the rastream process is stuck:
>>>>>>>> 
>>>>>>>> (gdb) where
>>>>>>>> #0  0x0000003b436db4ad in write () from /lib64/libc.so.6
>>>>>>>> #1  0x0000003b43671ae3 in _IO_new_file_write () from /lib64/libc.so.6
>>>>>>>> #2  0x0000003b436719aa in _IO_new_file_xsputn () from /lib64/libc.so.6
>>>>>>>> #3  0x0000003b43649691 in buffered_vfprintf () from /lib64/libc.so.6
>>>>>>>> #4  0x0000003b4364422e in vfprintf () from /lib64/libc.so.6
>>>>>>>> #5  0x0000003b4364f0f8 in fprintf () from /lib64/libc.so.6
>>>>>>>> #6  0x000000000044fcf4 in ArgusDebug (d=1, fmt=0x4ac7a0 "ArgusProcessScripts(): waitpid(%d) returned for %d") at ./argus_util.c:20325
>>>>>>>> #7  0x00000000004051f2 in ArgusProcessScripts () at ./rastream.c:546
>>>>>>>> #8  0x00000000004053d0 in RaParseComplete (sig=3) at ./rastream.c:601
>>>>>>>> #9  0x000000000045f736 in ArgusWriteNewLogfile (parser=0x7fc202419010, input=0x7fc202098010, wfile=0x1da9200, argus=0x7fffab5bcb50) at ./argus_util.c:25872
>>>>>>>> #10 0x0000000000407535 in RaSendArgusRecord (argus=0x1da8fc0) at ./rastream.c:1345
>>>>>>>> #11 0x00000000004064fa in RaProcessRecord (parser=0x7fc202419010, ns=0x7fc202098630) at ./rastream.c:963
>>>>>>>> #12 0x0000000000422b48 in RaScheduleRecord (parser=0x7fc202419010, ns=0x7fc202098630) at ./argus_util.c:2664
>>>>>>>> #13 0x0000000000423165 in ArgusHandleRecord (parser=0x7fc202419010, input=0x7fc202098010, ptr=0x7fc1f12959e0, filter=0x7fc20253a800) at ./argus_util.c:2776
>>>>>>>> #14 0x000000000046849f in ArgusReadStreamSocket (parser=0x7fc202419010, input=0x7fc202098010) at ./argus_client.c:440
>>>>>>>> #15 0x0000000000469024 in ArgusReadStream (parser=0x7fc202419010, queue=0x1d841f0) at ./argus_client.c:755
>>>>>>>> #16 0x00000000004095e4 in main (argc=16, argv=0x7fffab60d1b8) at ./argus_main.c:387
>>>>>>>> (gdb) n
>>>>>>>> Single stepping until exit from function write,
>>>>>>>> which has no line number information.
>>>>>>>> 
>>>>>>>> ^C
>>>>>>>> Program received signal SIGINT, Interrupt.
>>>>>>>> 0x0000003b436db4ad in write () from /lib64/libc.so.6
>>>>>>>> 
>>>>>>>> I also noticed while watching the -D3 log that I have a LOT of this type of message:
>>>>>>>> 
>>>>>>>> rastream[33066.00e75502c27f0000]: 2014-10-30 10:51:34.829949 RaSendArgusRecord () rejecting late record secs 1414680600 done file secs
>>>>>>>> 
>>>>>>>> Is that normal?? Seems unrelated put potentially problematic?
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> Jesse
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Oct 30, 2014, at 11:18 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hey Jesse,
>>>>>>>>> Looks like rastream() isn’t keeping up and radium.1 is dropping the connection.
>>>>>>>>> Your script is a bit heavy, but that runs in a separate process, and shouldn’t
>>>>>>>>> affect rastream, at least that is the design.
>>>>>>>>> 
>>>>>>>>> In gdb, you can’t run with the daemon flag, so take the ‘-d' out.  You say it crashes ???
>>>>>>>>> If so, can you type ‘where’ so we can see where the fault occurs ??
>>>>>>>>> 
>>>>>>>>> Run rastream with a ‘-D3’.  That will generate messages about the script and any time shift issues.  ‘-D4’ will add the socket read messages, which maybe of interest, but it may
>>>>>>>>> generate quite a bit of messages.  Try -D3 to see if we see anything interesting.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Carter
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Oct 29, 2014, at 8:09 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I’ve had an issue using rastream recently. What I’m seeing is that rastream will run successfully for a while (usually around 2 days or less) and then will simply exit. The only log messages I get are:
>>>>>>>>>> 
>>>>>>>>>> Oct 15 13:06:24 netf-test radium[33140]: 2014-10-15 13:06:24.908361 started
>>>>>>>>>> Oct 15 13:06:29 netf-test radium[33140]: 2014-10-15 13:06:29.599856 connect from localhost[127.0.0.1]
>>>>>>>>>> Oct 16 10:52:59 netf-test radium[33140]: 2014-10-16 10:52:59.537044 ArgusWriteOutSocket(0xb81269d0) client not processing: disconnecting
>>>>>>>>>> 
>>>>>>>>>> I tried running under gdb, but it quickly finishes/crashes...I’d can run it with -D, but unsure what level might catch what’s happening. Any recommendations? 
>>>>>>>>>> 
>>>>>>>>>> Any ideas of why this might be exiting? Is the bash script too heavy? Would it help to call a wrapper script that simply backgrounds the actual script I want to run?
>>>>>>>>>> 
>>>>>>>>>> Config information below.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> 
>>>>>>>>>> Jesse
>>>>>>>>>> 
>>>>>>>>>> $ egrep -v '^#|^[ ]*$' /etc/radium.conf 
>>>>>>>>>> RADIUM_DAEMON="yes"
>>>>>>>>>> RADIUM_MONITOR_ID=10.138.3.168
>>>>>>>>>> RADIUM_MAR_STATUS_INTERVAL=60
>>>>>>>>>> RADIUM_ARGUS_SERVER=cisco://10.138.3.168:9995
>>>>>>>>>> RADIUM_ACCESS_PORT=561
>>>>>>>>>> RADIUM_BIND_IP=127.0.0.1
>>>>>>>>>> RADIUM_CLASSIFIER_FILE=/etc/ralabel.conf
>>>>>>>>>> 
>>>>>>>>>> $ egrep -v '^#|^[ ]*$' /etc/ra.conf
>>>>>>>>>> RA_SET_PID="no"
>>>>>>>>>> RA_PID_PATH="/var/run"
>>>>>>>>>> RA_RUN_TIME=0
>>>>>>>>>> RA_GENERATE_BIN_MAR_RECORDS=yes
>>>>>>>>>> RA_PRINT_LABELS=0
>>>>>>>>>> RA_FIELD_SPECIFIER="stime:25 flgs proto saddr sport sco sas dir daddr dport dco das spkts dpkts sbytes dbytes"
>>>>>>>>>> RA_FIELD_DELIMITER=''
>>>>>>>>>> RA_SEPARATE_ADDR_FROM_PORT_WITH_PERIOD="no"
>>>>>>>>>> RA_PRINT_NAMES='none'
>>>>>>>>>> RA_PRINT_RESPONSE_DATA=no
>>>>>>>>>> RA_PRINT_UNIX_TIME=no
>>>>>>>>>> RA_TIME_FORMAT="%F %T.%f"
>>>>>>>>>> RA_USEC_PRECISION=6
>>>>>>>>>> RA_USERDATA_ENCODE=Ascii
>>>>>>>>>> RA_SORT_ALGORITHMS="dpkts spkts saddr dport daddr sport"
>>>>>>>>>> RA_UPDATE_INTERVAL=2
>>>>>>>>>> RA_DELEGATED_IP="/usr/local/argus/delegated-ipv4-latest"
>>>>>>>>>> RA_RELIABLE_CONNECT=yes
>>>>>>>>>> RA_COLOR_SUPPORT="yes"
>>>>>>>>>> RA_COLOR_CONFIG=/etc/racolor.conf
>>>>>>>>>> RA_LOCAL=/usr/local/argus/local.addrs
>>>>>>>>>> 
>>>>>>>>>> Starting like this:
>>>>>>>>>> 
>>>>>>>>>> /usr/local/sbin/radium -f /etc/radium.conf
>>>>>>>>>> /usr/local/bin/rastream -F /etc/ra.conf -M time 5m -B 10s -S 127.0.0.1 -w /srv/scratch/%Y-%m-%d_%H-%M-%S.argus -f /argus/process_argus_file.bash -d
>>>>>>>>>> 
>>>>>>>>>> Contents of process_argus_file.bash:
>>>>>>>>>> 
>>>>>>>>>> #!/bin/bash
>>>>>>>>>> 
>>>>>>>>>> #
>>>>>>>>>> #  Argus Client Software.  Tools to read, analyze and manage Argus data.
>>>>>>>>>> #  Copyright (C) 2000-2014 QoSient, LLC.
>>>>>>>>>> #  All Rights Reserved
>>>>>>>>>> #
>>>>>>>>>> # Script called by rastream, to process files.
>>>>>>>>>> #
>>>>>>>>>> # Since this is being called from rastream(), it will have only a single
>>>>>>>>>> # parameter, filename,
>>>>>>>>>> #
>>>>>>>>>> # Carter Bullard <carter at qosient.com>
>>>>>>>>>> #
>>>>>>>>>> 
>>>>>>>>>> PATH="/usr/local/bin:$PATH"; export PATH
>>>>>>>>>> package="argus-clients"
>>>>>>>>>> version="3.0.8rc3"
>>>>>>>>>> 
>>>>>>>>>> OPTIONS="$*"
>>>>>>>>>> FILE=
>>>>>>>>>> while test $# != 0
>>>>>>>>>> do
>>>>>>>>>> case "$1" in
>>>>>>>>>> -r) shift; FILE="$1"; break;;
>>>>>>>>>> esac
>>>>>>>>>> shift
>>>>>>>>>> done
>>>>>>>>>> 
>>>>>>>>>> # Cluster ASN level data
>>>>>>>>>> ASN_DIR=/net/nfdata/asn/`date +%Y/%m/%d`
>>>>>>>>>> mkdir -p ${ASN_DIR}
>>>>>>>>>> ASN_FILE=`basename ${FILE}`
>>>>>>>>>> /usr/local/bin/racluster -m sas -r ${FILE} -w ${ASN_DIR}/${ASN_FILE}
>>>>>>>>>> chgrp -R oititso ${ASN_DIR}
>>>>>>>>>> chmod -R g+r ${ASN_DIR}
>>>>>>>>>> 
>>>>>>>>>> BASE=`basename ${FILE}`
>>>>>>>>>> DIR=/net/nfdata/archive/`echo ${BASE}|sed -e 's/_.*//'|awk -F '-' '{print $1"/"$2"/"$3}'`
>>>>>>>>>> mkdir -p $DIR
>>>>>>>>>> /usr/local/bin/racluster -r ${FILE} -w ${DIR}/${BASE} && rm ${FILE}
>>>>>>>>>> chgrp -R oititso ${DIR}
>>>>>>>>>> find /net/nfdata/archive/ -type f -exec chmod -R 0640 {} \;
>>>>>>>>>> find /net/nfdata/archive/ -type d -exec chmod -R 0750 {} \;
>>> 
>>> 
> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 6837 bytes
Desc: not available
URL: <https://pairlist1.pair.net/pipermail/argus/attachments/20141125/7862413b/attachment.bin>


More information about the argus mailing list