rastream exits after multi day run
Jesse Bowling
jessebowling at gmail.com
Mon Nov 3 09:50:43 EST 2014
Hey Carter,
I don’t think memory is the issue, or at least host graphs for memory usage don’t seem to indicate memory exhaustion... Oddly, the problematic file, when run manually, exhibits something I’ve not run into before:
$ /usr/local/bin/racluster -D 3 -f racluster.conf -r problematic.argus -w cluster.problematic.argus
racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.388331 ArgusAddFileList (0x4674d010, problematic.argus, 1, -1, -1) returning 1
racluster[61079.00278946357f0000]: 2014-11-03 09:27:07.391338 Argusfinish_parse (0x1d58530)
racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.391766 ArgusFilterCompile () done 0
racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.397743 ArgusReadConnection() read 16 bytes
racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.397843 ArgusReadConnection() read 112 bytes
racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.406093 ArgusInitAddrtoname (0x7f354674d010, 0x0, 0x0)
racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.406115 ArgusParseInit(0x7f354674d010 0x7f354665b010
racluster[61078.00278946357f0000]: 2014-11-03 09:27:07.406126 ArgusReadConnection(0x4665b010, 1) returning 1
racluster[61078.00278946357f0000]: 2014-11-03 09:27:08.509529 ArgusClientTimeout()
racluster[61078.00278946357f0000]: 2014-11-03 09:27:09.531465 ArgusClientTimeout()
racluster[61078.00278946357f0000]: 2014-11-03 09:27:10.535630 ArgusClientTimeout()
racluster[61078.00278946357f0000]: 2014-11-03 09:27:11.546747 ArgusClientTimeout()
racluster[61078.00278946357f0000]: 2014-11-03 09:27:12.625805 ArgusClientTimeout()
racluster[61078.00278946357f0000]: 2014-11-03 09:27:13.632538 ArgusClientTimeout()
The file is 2.2 GB....I can share the file if that’s helpful!
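In case the number of flows in the file matters here, a quick tally with racount should work (I’m assuming it takes the same -r option as the other ra* clients):
$ /usr/local/bin/racount -r problematic.argus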
Cheers,
Jesse
On Oct 30, 2014, at 8:02 PM, Carter Bullard <carter at qosient.com> wrote:
> Hey Jesse,
> The trace indicates that racluster is trying to find a flow in its cache, and if you have a lot of flows, racluster will spend considerable time trying to get through the number of flows that are being cached. There are ways to speed this up, and we can do this once we get past this first round of issues.
>
> How big is the file you’re trying to racluster ?? If it’s too big, then it’s too big. If you’re running out of disk, not much we can do about that. If you’re running out of memory, then the support in racluster.conf that does a bit of idle flow cache management should help.
>
> If it is memory, you can try a simple rule like this:
>
> racluster.conf:
>
> filter="tcp or udp" idle=300
> filter=""
>
> and run against the file you were choking on?? Checking out the amount of memory racluster.1 uses while it’s doing its thing is the key.
>
> racluster -f racluster.conf -with -the -options -you -were -using
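>
> A rough loop like this can keep an eye on its footprint while it runs (the process name and the 10 second interval are just placeholders):
>
> while pgrep -x racluster > /dev/null; do
>    ps -o pid,rss,vsz,etime -C racluster
>    sleep 10
> done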
>
> Maybe this can help ???
>
> Carter
>
>
>
>> On Oct 30, 2014, at 5:14 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>
>> So perhaps rastream is the symptom and not the cause... I checked in on the restarted process just now and noticed that process_argus_file.bash was still working on a file from 10 AM this morning; specifically, it seemed stuck on the racluster command. I attached with strace initially and then gdb, and it seems to be stuck in this loop:
>>
>> 5608 unsigned short *ptr1 = (unsigned short *) hstruct->buf;
>> (gdb)
>> 5609 unsigned short *ptr2 = (unsigned short *) target->hstruct.buf;
>> (gdb)
>> 5611 if (ptr1 && ptr2) {
>> (gdb)
>> 5612 for (i = 0, len = hstruct->len/sizeof(unsigned short); i < len; i++)
>> (gdb)
>> 5613 if (*ptr1++ != *ptr2++)
>> (gdb)
>> 5612 for (i = 0, len = hstruct->len/sizeof(unsigned short); i < len; i++)
>> (gdb)
>> 5613 if (*ptr1++ != *ptr2++)
>> (gdb)
>> 5612 for (i = 0, len = hstruct->len/sizeof(unsigned short); i < len; i++)
>> (gdb)
>> 5613 if (*ptr1++ != *ptr2++)
>> (gdb)
>> 5614 break;
>> (gdb)
>> 5615 if (i == len) {
>> (gdb)
>> 5611 if (ptr1 && ptr2) {
>> (gdb)
>> 5626 target = target->nxt;
>> (gdb)
>> 5627 } while (target != head);
>>
>> (gdb) where
>> #0 0x0000000000475c21 in ArgusFindRecord (htable=0x1b74470, hstruct=0x1b75078) at ./argus_client.c:5609
>> #1 0x0000000000405c0d in RaProcessThisRecord (parser=0x7fac4def9010, argus=0x7fac4de07630) at ./racluster.c:720
>> #2 0x000000000040591d in RaProcessRecord (parser=0x7fac4def9010, ns=0x7fac4de07630) at ./racluster.c:667
>> #3 0x0000000000416688 in RaScheduleRecord (parser=0x7fac4def9010, ns=0x7fac4de07630) at ./argus_util.c:2664
>> #4 0x0000000000416ca5 in ArgusHandleRecord (parser=0x7fac4def9010, input=0x7fac4de07010, ptr=0x7fac4d9fa7c4, filter=0x7fac4e01a800) at ./argus_util.c:2776
>> #5 0x0000000000466d2f in ArgusReadStreamSocket (parser=0x7fac4def9010, input=0x7fac4de07010) at ./argus_client.c:440
>> #6 0x0000000000466f3d in ArgusReadFileStream (parser=0x7fac4def9010, input=0x7fac4de07010) at ./argus_client.c:496
>> #7 0x0000000000407841 in main (argc=5, argv=0x7fffc8ebbd68) at ./argus_main.c:253
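>>
>> For reference, I attached to the running racluster roughly like this (PID taken from ps; the where/n commands are the same ones shown above):
>>
>> $ gdb /usr/local/bin/racluster <pid>
>> (gdb) where
>> (gdb) n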
>>
>> Meanwhile, the scratch space for the original file gets filled up and rastream has nowhere to write to...Which I believe is why rastream eventually fails. Thoughts? Here’s where the rastream process is stuck:
>>
>> (gdb) where
>> #0 0x0000003b436db4ad in write () from /lib64/libc.so.6
>> #1 0x0000003b43671ae3 in _IO_new_file_write () from /lib64/libc.so.6
>> #2 0x0000003b436719aa in _IO_new_file_xsputn () from /lib64/libc.so.6
>> #3 0x0000003b43649691 in buffered_vfprintf () from /lib64/libc.so.6
>> #4 0x0000003b4364422e in vfprintf () from /lib64/libc.so.6
>> #5 0x0000003b4364f0f8 in fprintf () from /lib64/libc.so.6
>> #6 0x000000000044fcf4 in ArgusDebug (d=1, fmt=0x4ac7a0 "ArgusProcessScripts(): waitpid(%d) returned for %d") at ./argus_util.c:20325
>> #7 0x00000000004051f2 in ArgusProcessScripts () at ./rastream.c:546
>> #8 0x00000000004053d0 in RaParseComplete (sig=3) at ./rastream.c:601
>> #9 0x000000000045f736 in ArgusWriteNewLogfile (parser=0x7fc202419010, input=0x7fc202098010, wfile=0x1da9200, argus=0x7fffab5bcb50) at ./argus_util.c:25872
>> #10 0x0000000000407535 in RaSendArgusRecord (argus=0x1da8fc0) at ./rastream.c:1345
>> #11 0x00000000004064fa in RaProcessRecord (parser=0x7fc202419010, ns=0x7fc202098630) at ./rastream.c:963
>> #12 0x0000000000422b48 in RaScheduleRecord (parser=0x7fc202419010, ns=0x7fc202098630) at ./argus_util.c:2664
>> #13 0x0000000000423165 in ArgusHandleRecord (parser=0x7fc202419010, input=0x7fc202098010, ptr=0x7fc1f12959e0, filter=0x7fc20253a800) at ./argus_util.c:2776
>> #14 0x000000000046849f in ArgusReadStreamSocket (parser=0x7fc202419010, input=0x7fc202098010) at ./argus_client.c:440
>> #15 0x0000000000469024 in ArgusReadStream (parser=0x7fc202419010, queue=0x1d841f0) at ./argus_client.c:755
>> #16 0x00000000004095e4 in main (argc=16, argv=0x7fffab60d1b8) at ./argus_main.c:387
>> (gdb) n
>> Single stepping until exit from function write,
>> which has no line number information.
>>
>> ^C
>> Program received signal SIGINT, Interrupt.
>> 0x0000003b436db4ad in write () from /lib64/libc.so.6
>>
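>> To confirm the scratch angle, I can keep an eye on the partition with something like this (the path is just our scratch mount from the -w option):
>>
>> $ watch -n 30 'df -h /srv/scratch'
>>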
>> I also noticed while watching the -D3 log that I have a LOT of this type of message:
>>
>> rastream[33066.00e75502c27f0000]: 2014-10-30 10:51:34.829949 RaSendArgusRecord () rejecting late record secs 1414680600 done file secs
>>
>> Is that normal?? Seems unrelated but potentially problematic?
>>
>> Cheers,
>>
>> Jesse
>>
>>
>>
>> On Oct 30, 2014, at 11:18 AM, Carter Bullard <carter at qosient.com> wrote:
>>
>>> Hey Jesse,
>>> Looks like rastream() isn’t keeping up and radium.1 is dropping the connection.
>>> Your script is a bit heavy, but that runs in a separate process, and shouldn’t
>>> affect rastream, at least that is the design.
>>>
>>> In gdb, you can’t run with the daemon flag, so take the ‘-d’ out. You say it crashes ???
>>> If so, can you type ‘where’ so we can see where the fault occurs ??
>>>
>>> Run rastream with a ‘-D3’. That will generate messages about the script and any time shift issues. ‘-D4’ will add
>>> the socket read messages, which may be of interest, but it may generate quite a few messages. Try -D3 to see if we see anything interesting.
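>>>
>>> Something like this should do it, using the same options as before but without the -d so it stays in the foreground (I’m assuming the debug messages go to stderr, so a redirect will capture them):
>>>
>>> /usr/local/bin/rastream -D 3 -F /etc/ra.conf -M time 5m -B 10s -S 127.0.0.1 -w /srv/scratch/%Y-%m-%d_%H-%M-%S.argus -f /argus/process_argus_file.bash 2> /tmp/rastream.debug.log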
>>>
>>>
>>> Carter
>>>
>>>
>>>> On Oct 29, 2014, at 8:09 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I’ve had an issue using rastream recently. What I’m seeing is that rastream will run successfully for a while (usually around 2 days or less) and then will simply exit. The only log messages I get are:
>>>>
>>>> Oct 15 13:06:24 netf-test radium[33140]: 2014-10-15 13:06:24.908361 started
>>>> Oct 15 13:06:29 netf-test radium[33140]: 2014-10-15 13:06:29.599856 connect from localhost[127.0.0.1]
>>>> Oct 16 10:52:59 netf-test radium[33140]: 2014-10-16 10:52:59.537044 ArgusWriteOutSocket(0xb81269d0) client not processing: disconnecting
>>>>
>>>> I tried running under gdb, but it quickly finishes/crashes... I can run it with -D, but I’m unsure what level might catch what’s happening. Any recommendations?
>>>>
>>>> Any ideas of why this might be exiting? Is the bash script too heavy? Would it help to call a wrapper script that simply backgrounds the actual script I want to run?
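>>>> For example, something as thin as this (untested, just to illustrate what I mean by a backgrounding wrapper):
>>>>
>>>> #!/bin/bash
>>>> # hypothetical wrapper: hand the real work to the existing script and return to rastream right away
>>>> /argus/process_argus_file.bash "$@" > /dev/null 2>&1 &
>>>> exit 0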
>>>>
>>>> Config information below.
>>>>
>>>> Cheers,
>>>>
>>>> Jesse
>>>>
>>>> $ egrep -v '^#|^[ ]*$' /etc/radium.conf
>>>> RADIUM_DAEMON="yes"
>>>> RADIUM_MONITOR_ID=10.138.3.168
>>>> RADIUM_MAR_STATUS_INTERVAL=60
>>>> RADIUM_ARGUS_SERVER=cisco://10.138.3.168:9995
>>>> RADIUM_ACCESS_PORT=561
>>>> RADIUM_BIND_IP=127.0.0.1
>>>> RADIUM_CLASSIFIER_FILE=/etc/ralabel.conf
>>>>
>>>> $ egrep -v '^#|^[ ]*$' /etc/ra.conf
>>>> RA_SET_PID="no"
>>>> RA_PID_PATH="/var/run"
>>>> RA_RUN_TIME=0
>>>> RA_GENERATE_BIN_MAR_RECORDS=yes
>>>> RA_PRINT_LABELS=0
>>>> RA_FIELD_SPECIFIER="stime:25 flgs proto saddr sport sco sas dir daddr dport dco das spkts dpkts sbytes dbytes"
>>>> RA_FIELD_DELIMITER=''
>>>> RA_SEPARATE_ADDR_FROM_PORT_WITH_PERIOD="no"
>>>> RA_PRINT_NAMES='none'
>>>> RA_PRINT_RESPONSE_DATA=no
>>>> RA_PRINT_UNIX_TIME=no
>>>> RA_TIME_FORMAT="%F %T.%f"
>>>> RA_USEC_PRECISION=6
>>>> RA_USERDATA_ENCODE=Ascii
>>>> RA_SORT_ALGORITHMS="dpkts spkts saddr dport daddr sport"
>>>> RA_UPDATE_INTERVAL=2
>>>> RA_DELEGATED_IP="/usr/local/argus/delegated-ipv4-latest"
>>>> RA_RELIABLE_CONNECT=yes
>>>> RA_COLOR_SUPPORT="yes"
>>>> RA_COLOR_CONFIG=/etc/racolor.conf
>>>> RA_LOCAL=/usr/local/argus/local.addrs
>>>>
>>>> Starting like this:
>>>>
>>>> /usr/local/sbin/radium -f /etc/radium.conf
>>>> /usr/local/bin/rastream -F /etc/ra.conf -M time 5m -B 10s -S 127.0.0.1 -w /srv/scratch/%Y-%m-%d_%H-%M-%S.argus -f /argus/process_argus_file.bash -d
>>>>
>>>> Contents of process_argus_file.bash:
>>>>
>>>> #!/bin/bash
>>>>
>>>> #
>>>> # Argus Client Software. Tools to read, analyze and manage Argus data.
>>>> # Copyright (C) 2000-2014 QoSient, LLC.
>>>> # All Rights Reserved
>>>> #
>>>> # Script called by rastream, to process files.
>>>> #
>>>> # Since this is being called from rastream(), it will have only a single
>>>> # parameter, filename,
>>>> #
>>>> # Carter Bullard <carter at qosient.com>
>>>> #
>>>>
>>>> PATH="/usr/local/bin:$PATH"; export PATH
>>>> package="argus-clients"
>>>> version="3.0.8rc3"
>>>>
>>>> OPTIONS="$*"
>>>> FILE=
>>>> while test $# != 0
>>>> do
>>>>     case "$1" in
>>>>         -r) shift; FILE="$1"; break;;
>>>>     esac
>>>>     shift
>>>> done
>>>>
>>>> # Cluster ASN level data
>>>> ASN_DIR=/net/nfdata/asn/`date +%Y/%m/%d`
>>>> mkdir -p ${ASN_DIR}
>>>> ASN_FILE=`basename ${FILE}`
>>>> /usr/local/bin/racluster -m sas -r ${FILE} -w ${ASN_DIR}/${ASN_FILE}
>>>> chgrp -R oititso ${ASN_DIR}
>>>> chmod -R g+r ${ASN_DIR}
>>>>
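>>>> # Cluster full flow data into the dated archive tree and remove the scratch file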
>>>> BASE=`basename ${FILE}`
>>>> DIR=/net/nfdata/archive/`echo ${BASE}|sed -e 's/_.*//'|awk -F '-' '{print $1"/"$2"/"$3}'`
>>>> mkdir -p $DIR
>>>> /usr/local/bin/racluster -r ${FILE} -w ${DIR}/${BASE} && rm ${FILE}
>>>> chgrp -R oititso ${DIR}
>>>> find /net/nfdata/archive/ -type f -exec chmod -R 0640 {} \;
>>>> find /net/nfdata/archive/ -type d -exec chmod -R 0750 {} \;
>>>>
>>>>
>>>
>>
>>
>