rastream stopped processing
Jesse Bowling
jessebowling at gmail.com
Tue Jul 8 13:08:38 EDT 2014
On Jul 8, 2014, at 12:49 PM, Carter Bullard <carter at qosient.com> wrote:
> Looks like radium is doing the right thing, and that rastream() is falling behind.
> So on the machine where rastream() is running, is it possible that the script
> that it runs eats all the memory on the machine, which will cause rastream to
> slow down in stream processing, backing up to radium, where radium shuts it down.
>
> When radium hangs up, rastream should just reconnect, and I’m looking into why
> that may not be happening, but what is your rastream script doing ???
> Not sorting I hope ???
>
Well, the ra.conf I was using did not have the reliable connect option on, but it’s now set thusly:
RA_RELIABLE_CONNECT=yes
so that might at least help with the reconnect issue. The post-process script looks like this:
#!/bin/bash
#
# Argus Client Software. Tools to read, analyze and manage Argus data.
# Copyright (C) 2000-2014 QoSient, LLC.
# All Rights Reserved
#
# Script called by rastream, to process files.
#
# Since this is being called from rastream(), it will have only a single
# parameter, filename,
#
# Carter Bullard <carter at qosient.com>
#
PATH="/usr/local/bin:$PATH"; export PATH
package="argus-clients"
version="3.0.8rc3"
OPTIONS="$*"
FILE=
while test $# != 0
do
case "$1" in
-r) shift; FILE="$1"; break;;
esac
shift
done
RANGE=`/usr/local/bin/ratimerange -p 0 -r ${FILE} | awk -F ' - ' '{print $2}'`
ERANGE=`date +%s -d "${RANGE}"`
NOW=`date +%s`
# Compare epoch seconds numerically; ">" inside [[ ]] is a string comparison.
if [[ "${ERANGE}" -gt "${NOW}" ]]
then
echo "${RANGE}" | /bin/mailx -s "ratimerange reporting bad dates" me@${DAYJOB}
fi
ASN_DIR=/asn/`date +%Y/%m/%d`
mkdir -p ${ASN_DIR}
ASN_FILE=`basename ${FILE}`
/usr/local/bin/racluster -m sas -r ${FILE} -w ${ASN_DIR}/${ASN_FILE}
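A note on the timestamp check in the script: inside `[[ ]]`, `>` compares strings lexicographically, while `-gt` compares integers. For epoch seconds of equal length the two agree, but the numeric form is the safer choice. Here is a minimal sketch of the difference; `is_later` is a hypothetical helper name used only for illustration, not part of argus-clients:

```shell
#!/bin/bash
# Sketch only: is_later is a hypothetical helper, not part of argus-clients.

# Succeeds (exit 0) when epoch timestamp $1 is strictly later than $2.
# -gt compares integers; ">" inside [[ ]] would compare strings instead.
is_later() {
    [[ "$1" -gt "$2" ]]
}

# Lexicographic comparison misorders numbers of different lengths:
# "9" sorts after "10" as a string, although 9 < 10 numerically.
if [[ "9" > "10" ]]; then
    echo "string compare says 9 > 10"
fi
if ! is_later 9 10; then
    echo "numeric compare says 9 is not later than 10"
fi
```

In the script, the same idea applies to the `ERANGE`/`NOW` test that decides whether to send the "bad dates" mail.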
As an aside, would it be a better strategy to attach an rabins process to radium aggregating on sas and writing out a file on some interval (say one minute)?
Running that racluster manually shows it only using a few megs of memory, and for the most part this machine appears to have lots of free memory:
54.51user 0.51system 0:55.42elapsed 99%CPU (0avgtext+0avgdata 76848maxresident)k
0inputs+0outputs (0major+5426minor)pagefaults 0swaps
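To read the peak memory figure off that summary, something like the following could be used. This is a rough sketch that assumes the `maxresident` value is in kilobytes, as GNU time documents it; `maxrss_mb` is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch only: maxrss_mb is a hypothetical helper for reading GNU time output.
# Assumption: the "<N>maxresident" figure is reported in kilobytes.

# Reads a GNU time summary on stdin, extracts the number in front of
# "maxresident", and prints it converted to whole megabytes.
maxrss_mb() {
    sed -n 's/.* \([0-9][0-9]*\)maxresident.*/\1/p' |
        awk '{ printf "%d\n", $1 / 1024 }'
}

echo '54.51user 0.51system 0:55.42elapsed 99%CPU (0avgtext+0avgdata 76848maxresident)k' | maxrss_mb
```

For the run above this works out to roughly 75 MB at peak, which is small next to the 8 GB on the box.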
Here’s a snapshot of the box as it is currently (collecting netflow and not running the post-process script):
$ free -m
             total       used       free     shared    buffers     cached
Mem:          7864       7746        118          0         48       7163
-/+ buffers/cache:        534       7329
Swap:         9919         63       9856
So almost all of its 8 GB is free (aside from file caching, of course).
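For reference, the "free" figure on the `-/+ buffers/cache` line is just free + buffers + cached from the `Mem:` row. A minimal sketch of that arithmetic follows, assuming the older procps `free -m` layout shown above with separate `buffers` and `cached` columns; `reclaimable_mb` is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch only: reclaimable_mb is a hypothetical helper. It assumes the older
# procps "free -m" layout with separate "buffers" and "cached" columns.

# Reads "free -m" output on stdin and prints the memory available to
# applications in MB: free + buffers + cached, which are fields 4, 6 and 7
# of the "Mem:" row.
reclaimable_mb() {
    awk '$1 == "Mem:" { print $4 + $6 + $7 }'
}

# Sample input copied from the snapshot in this message.
reclaimable_mb <<'EOF'
             total       used       free     shared    buffers     cached
Mem:          7864       7746        118          0         48       7163
-/+ buffers/cache:        534       7329
Swap:         9919         63       9856
EOF
```

With the snapshot above that is 118 + 48 + 7163 = 7329 MB, matching the second line of the `free` output.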
Cheers,
Jesse
> Carter
>
>
> On Jul 8, 2014, at 12:31 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>
>>
>> On Jul 8, 2014, at 11:05 AM, Carter Bullard <carter at qosient.com> wrote:
>>
>>> Did radium stop collecting or sending ?? We’ve got some
>>> reports on reliable connection failure, so it may be your
>>> rastream() disconnected and didn’t reconnect ????
>>
>> It seems that radium is collecting; at least I can attach to the radium instance and receive 100 records with "ra -r 127.0.0.1 -N 100"
>>
>>> check out your system log /var/log/messages /var/log/system.log
>>> to see if radium complained about the client going away, or if
>>> radium stopped reading. If radium is still running you can just
>>> connect to it, to see if it’s transmitting anything.
>>>
>> It looks like it must be on the rastream side...??:
>>
>> Jul 6 22:59:00 test radium[57599]: 2014-07-06 22:59:00.572718 connect from localhost[127.0.0.1]
>> Jul 7 08:00:21 test radium[57599]: 2014-07-07 08:00:21.541077 ArgusWriteOutSocket(0x1269d0) client not processing: disconnecting
>>
>> Likely unrelated, but I’m also seeing many of these messages in the logs:
>>
>> Jul 2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.358574 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>> Jul 2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.390583 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>>
>>
>>> If there is a problem, and you’ve compiled with symbols in (.devel),
>>> then attach to radium with gdb() and look to see if any of the
>>> threads have terminated.
>>>
>>> (gdb) attach pid.of.radium
>>> (gdb) info threads
>>> (gdb) thread 1
>>> (gdb) where
>>> (gdb) thread 2
>>> (gdb) where
>>>
>>> etc ….. may not be exact syntax, but it’s something like that.
>>> With all the various end systems using clang and lldb, I’m kind
>>> of schizophrenic on debugging right now.
>>>
>>
>> Radium output:
>> (gdb) info threads
>> 3 Thread 0x7f001f752700 (LWP 57600) 0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> 2 Thread 0x7f001ed51700 (LWP 57601) 0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>> * 1 Thread 0x7f001fc87700 (LWP 57599) 0x0000003b532e15d3 in select () from /lib64/libc.so.6
>> (gdb) thread 1
>> [Switching to thread 1 (Thread 0x7f001fc87700 (LWP 57599))]#0 0x0000003b532e15d3 in select () from /lib64/libc.so.6
>> (gdb) where
>> #0 0x0000003b532e15d3 in select () from /lib64/libc.so.6
>> #1 0x00000000004669ee in ArgusReadStream (parser=0x7f001fb42010, queue=0x19511f0) at ./argus_client.c:738
>> #2 0x000000000040746c in main (argc=3, argv=0x7fff4ae0a728) at ./argus_main.c:387
>> (gdb) thread 2
>> [Switching to thread 2 (Thread 0x7f001ed51700 (LWP 57601))]#0 0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>> (gdb) where
>> #0 0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>> #1 0x0000003b532acb60 in sleep () from /lib64/libc.so.6
>> #2 0x0000000000466455 in ArgusConnectRemotes (arg=0x1951190) at ./argus_client.c:579
>> #3 0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>> #4 0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>> (gdb) thread 3
>> [Switching to thread 3 (Thread 0x7f001f752700 (LWP 57600))]#0 0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> (gdb) where
>> #0 0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1 0x00000000004ac1be in ArgusOutputProcess (arg=0x1953310) at ./argus_output.c:897
>> #2 0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>> #3 0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>>
>> Connecting to the failing rastream process gave odd results:
>>
>> (gdb) detach
>> Detaching from program: /usr/local/bin/radium, process 57599
>> (gdb) attach 57605
>> Attaching to program: /usr/local/bin/radium, process 57605
>> Cannot access memory at address 0x706f636373007064
>> (gdb) where
>> #0 0x0000003b53a0ef3d in ?? ()
>> #1 0x0000000000000000 in ?? ()
>> (gdb) info threads
>> * 1 process 57605 0x0000003b53a0ef3d in ?? ()
>>
>> What should my next step be? Ensure the reliable connection setting is on? Run rastream under gdb?
>>
>> Thanks and cheers,
>>
>> Jesse
>>
>>> Carter
>>>
>>>
>>> On Jul 7, 2014, at 4:28 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Over the weekend my rastream process stopped processing records for some reason. The current setup is:
>>>>
>>>> netflow records -> radium -> rastream -M time 5m
>>>>
>>>> I noticed that records were no longer being written to disk. I connected a new ra instance to radium, and had no problems receiving records. Attaching strace to the rastream process all I could see were calls:
>>>>
>>>> <snip>
>>>> nanosleep({0, 50000000}, NULL) = 0
>>>> nanosleep({0, 50000000}, NULL) = 0
>>>> nanosleep({0, 50000000}, NULL) = 0
>>>> <snip>
>>>>
>>>> Are there any settings I can tweak, or logs I can check, to diagnose or correct the issue? I vaguely recall something about persistent connections where, if the connection is lost, an attempt would be made to reconnect, but my gut says that’s not what’s happening here...
>>>>
>>>> Cheers,
>>>>
>>>> Jesse
>