rastream stopped processing

Tue Jul 8 13:08:38 EDT 2014

On Jul 8, 2014, at 12:49 PM, Carter Bullard <carter at qosient.com> wrote:

> Looks like radium is doing the right thing, and that rastream() is falling behind.
> So on the machine where rastream() is running, is it possible that the script
> that it runs eats all the memory on the machine, which will cause rastream to
> slow down in stream processing, backing up to radium, where radium shuts it down.
> 
> When radium hangs up, rastream should just reconnect, and I’m looking into why
> that may not be happening, but what is your rastream script doing ???
> Not sorting I hope ???
> 

Well, the ra.conf I was using did not have the reliable connect option on, but it’s now set thusly:

RA_RELIABLE_CONNECT=yes

so that might at least help with the reconnect. The post-process script looks like this:

#!/bin/bash

#
#  Argus Client Software.  Tools to read, analyze and manage Argus data.
#  Copyright (C) 2000-2014 QoSient, LLC.
#  All Rights Reserved
#
# Script called by rastream, to process files.
#
# Since this is being called from rastream(), it will have only a single
# parameter, filename,
#
# Carter Bullard <carter at qosient.com>
#

PATH="/usr/local/bin:$PATH"; export PATH
package="argus-clients"
version="3.0.8rc3"

OPTIONS="$*"
FILE=
while test $# != 0
do
    case "$1" in
    -r) shift; FILE="$1"; break;;
    esac
    shift
done

RANGE=`/usr/local/bin/ratimerange -p 0 -r ${FILE} | awk -F ' - ' '{print $2}'`
ERANGE=`date +%s -d "${RANGE}"`
NOW=`date +%s`
# Compare epoch seconds numerically; ">" inside [[ ]] is a string comparison
if [[ "${ERANGE}" -gt "${NOW}" ]]
then
  echo "${RANGE}" | /bin/mailx -s "ratimerange reporting bad dates" me@${DAYJOB}
fi

ASN_DIR=/asn/`date +%Y/%m/%d`
mkdir -p ${ASN_DIR}
ASN_FILE=`basename ${FILE}`
/usr/local/bin/racluster -m sas -r ${FILE} -w ${ASN_DIR}/${ASN_FILE}
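
A side note on the date check in a script like this: inside [[ ]], ">" compares strings lexicographically, while "-gt" compares integers. Two epoch timestamps with the same number of digits happen to sort the same way either way, but the numeric form is the safer one. A minimal sketch of the difference (the function name is hypothetical, not part of the argus tools):

```shell
#!/bin/bash
# Hypothetical helper: succeeds when epoch timestamp $1 is later than $2.
# Uses -gt (integer comparison) rather than > (string comparison).
is_later() {
    [[ "$1" -gt "$2" ]]
}

# As strings, "9" sorts after "10", so the string test is misleading:
if [[ "9" > "10" ]]; then echo "string: 9 > 10"; fi

# As integers, 9 is not later than 10:
if ! is_later 9 10; then echo "numeric: 9 <= 10"; fi
```

An empty first argument (for example, when date fails to parse the range) evaluates as 0 under -gt, so the check simply does not fire in that case.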

As an aside, would it be a better strategy to attach an rabins process to radium aggregating on sas and writing out a file on some interval (say one minute)?

Running that racluster manually shows it peaking at roughly 75 MB of resident memory (76848 KB), and for the most part this machine appears to have lots of free memory:

54.51user 0.51system 0:55.42elapsed 99%CPU (0avgtext+0avgdata 76848maxresident)k
0inputs+0outputs (0major+5426minor)pagefaults 0swaps

Here’s a snapshot of the box as it is currently (collecting netflow and not running the post-process script):

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7864       7746        118          0         48       7163
-/+ buffers/cache:        534       7329
Swap:         9919         63       9856

So almost all of its 8 GB is free (aside from file caching, of course).
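
The 7329 figure in the "-/+ buffers/cache" row can be reproduced from the "Mem:" row: free + buffers + cached = 118 + 48 + 7163 = 7329 MB. A minimal sketch, assuming the older procps column layout shown above (newer versions of free print different columns, so the field positions would not apply there); the function name is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: reads old-style `free -m` output on stdin and prints
# free + buffers + cached (fields 4, 6 and 7 of the "Mem:" line), in MB.
available_mb() {
    awk '/^Mem:/ { print $4 + $6 + $7 }'
}

# The "Mem:" line from the snapshot above:
printf 'Mem:          7864       7746        118          0         48       7163\n' | available_mb
```

On a live system with the same layout, this would be fed as: free -m | available_mb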

Cheers,

Jesse

> Carter
> 
> 
> On Jul 8, 2014, at 12:31 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
> 
>> 
>> On Jul 8, 2014, at 11:05 AM, Carter Bullard <carter at qosient.com> wrote:
>> 
>>> Did radium stop collecting or sending ??  We’ve got some
>>> reports on reliable connection failure, so it may be your
>>> rastream() disconnected and didn’t reconnect ????
>> 
>> It seems that radium is collecting; at least I can attach to the radium instance and receive 100 records with “ra -r 127.0.0.1 -N 100”
>> 
>>> check out your system log /var/log/messages /var/log/system.log
>>> to see if radium complained about the client going away, or if
>>> radium stopped reading.  If radium is still running you can just
>>> connect to it, to see if it’s transmitting anything.
>>> 
>> It looks like it must be on the rastream side...??:
>> 
>> Jul  6 22:59:00 test radium[57599]: 2014-07-06 22:59:00.572718 connect from localhost[127.0.0.1]
>> Jul  7 08:00:21 test radium[57599]: 2014-07-07 08:00:21.541077 ArgusWriteOutSocket(0x1269d0) client not processing: disconnecting
>> 
>> Likely unrelated, but I’m also seeing many of these messages in the logs:
>> 
>> Jul  2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.358574 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>> Jul  2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.390583 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>> 
>> 
>>> If there is a problem, and you’ve compiled with symbols in (.devel),
>>> then attach to radium with gdb() and look to see if any of the
>>> threads have terminated.
>>> 
>>> (gdb) attach pid.of.radium
>>> (gdb) info threads
>>> (gdb) thread 1
>>> (gdb) where
>>> (gdb) thread 2
>>> (gdb) where
>>> 
>>> etc ….. may not be exact syntax, but it’s something like that.
>>> With all the various end systems using clang and lldb, I’m kind
>>> of schizophrenic on debugging right now.
>>> 
>> 
>> Radium output:
>> (gdb) info threads
>>  3 Thread 0x7f001f752700 (LWP 57600)  0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>  2 Thread 0x7f001ed51700 (LWP 57601)  0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>> * 1 Thread 0x7f001fc87700 (LWP 57599)  0x0000003b532e15d3 in select () from /lib64/libc.so.6
>> (gdb) thread 1
>> [Switching to thread 1 (Thread 0x7f001fc87700 (LWP 57599))]#0 0x0000003b532e15d3 in select () from /lib64/libc.so.6
>> (gdb) where
>> #0  0x0000003b532e15d3 in select () from /lib64/libc.so.6
>> #1  0x00000000004669ee in ArgusReadStream (parser=0x7f001fb42010, queue=0x19511f0) at ./argus_client.c:738
>> #2  0x000000000040746c in main (argc=3, argv=0x7fff4ae0a728) at ./argus_main.c:387
>> (gdb) thread 2
>> [Switching to thread 2 (Thread 0x7f001ed51700 (LWP 57601))]#0 0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>> (gdb) where
>> #0  0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>> #1  0x0000003b532acb60 in sleep () from /lib64/libc.so.6
>> #2  0x0000000000466455 in ArgusConnectRemotes (arg=0x1951190) at ./argus_client.c:579
>> #3  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>> #4  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>> (gdb) thread 3
>> [Switching to thread 3 (Thread 0x7f001f752700 (LWP 57600))]#0 0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> (gdb) where
>> #0  0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00000000004ac1be in ArgusOutputProcess (arg=0x1953310) at ./argus_output.c:897
>> #2  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>> #3  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>> 
>> Connecting to the failing rastream process gave odd results:
>> 
>> (gdb) detach
>> Detaching from program: /usr/local/bin/radium, process 57599
>> (gdb) attach 57605
>> Attaching to program: /usr/local/bin/radium, process 57605
>> Cannot access memory at address 0x706f636373007064
>> (gdb) where
>> #0  0x0000003b53a0ef3d in ?? ()
>> #1  0x0000000000000000 in ?? ()
>> (gdb) info threads
>> * 1 process 57605  0x0000003b53a0ef3d in ?? ()
>> 
>> What should my next step be? Ensure the reliable connection setting is on? Run rastream under gdb?
>> 
>> Thanks and cheers,
>> 
>> Jesse
>> 
>>> Carter
>>> 
>>> 
>>> On Jul 7, 2014, at 4:28 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> Over the weekend my rastream process stopped processing records for some reason. The current setup is:
>>>> 
>>>> netflow records -> radium -> rastream -M time 5m
>>>> 
>>>> I noticed that records were no longer being written to disk. I connected a new ra instance to radium, and had no problems receiving records. Attaching strace to the rastream process all I could see were calls:
>>>> 
>>>> <snip>
>>>> nanosleep({0, 50000000}, NULL)          = 0
>>>> nanosleep({0, 50000000}, NULL)          = 0
>>>> nanosleep({0, 50000000}, NULL)          = 0 
>>>> <snip>
>>>> 
>>>> Are there any settings I can tweak, or logs I can check, to diagnose or correct the issue? I vaguely recall something about persistent connections where, if the connection were lost, an attempt would be made to reconnect, but my gut says that’s not what’s happening here...
>>>> 
>>>> Cheers,
>>>> 
>>>> Jesse
> 
