rastream stopped processing

Carter Bullard carter at qosient.com
Tue Jul 8 14:50:18 EDT 2014


Hey Jesse,
Well, 17K records per second is also a bit snappy.  At that rate,
your queues can back up in possibly 10-20 seconds, which isn’t
that long.  Looks like you hit it in about 16 seconds.

Because the rate is a bit snappy, your problems with rastream()
may be related to the duration of the run of the script.
rastream() doesn’t block to run its script, but there
may be some weird locking such that it doesn’t process its queue
while the script is running.  That isn’t right, so I’ll look into
it.
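For reference, here is a minimal sketch of the kind of rastream() setup being discussed, with 5-minute file splits and a post-processing script handed in via -f.  The paths, port, and -B hold time are assumptions for illustration, not taken from your config:

```shell
# Hypothetical rastream invocation: read from the local radium (port 561),
# split output files on 5-minute boundaries, hold each file briefly past
# its boundary, then hand the finished file to the script via -f.
rastream -S 127.0.0.1:561 -M time 5m -B 15s \
         -w /argus/data/%Y/%m/%d/argus.%Y.%m.%d.%H.%M.%S \
         -f /usr/local/bin/rastream-process.sh
```

The script is invoked once per closed file with a "-r filename" argument, which is why the post-processing script later in this thread parses its arguments looking for -r.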

So, ratop was bumped off because it was slow, just like your
rastream was bumped off, and it’s now just sitting there, wondering
what to do with no input.  Does it show any records on the screen ???

If you don’t care about sorting, which is what is killing ratop(),
you can turn it off.  You should then be able to navigate
and see the 100Ks of records in its cache ???

If you want to run ratop, definitely change the UPDATE variable in
its rarc file, and try running with a “-m” option to reduce the number
of objects that it’s tracking, like:

   -m matrix/16
   -m smac dmac
   -m mac addr (this will kick in the -M rmon option, so print out the smac and saddr fields)

or your -m sas value if you’ve labeled them.
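Concretely, the two knobs above might be combined like this.  The 2-second interval is just the example value from this thread; treat the exact values as a starting point, not a prescription:

```shell
# In your .rarc, update (and therefore sort) the display queue every
# 2 seconds instead of the default tenth of a second:
#   RA_UPDATE_INTERVAL=2

# Then aggregate down to /16 matrix objects so ratop tracks far
# fewer records:
ratop -S 127.0.0.1 -m matrix/16
```

You can also change the interval on the fly with “:u” once ratop is running.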

Carter


On Jul 8, 2014, at 2:24 PM, Jesse Bowling <jessebowling at gmail.com> wrote:

> 
> On Jul 8, 2014, at 2:07 PM, Carter Bullard <carter at qosient.com> wrote:
> 
>> well, 7K records per second is a bit snappy.
>> are you using the latest ratop() as earlier versions have real performance problems ??
>> 
> $ ratop --version
> RaCurses Version 3.0.7.36
> 
> Updated to 3.0.8.rc.1 and now...
> 
> Actually, looking at it now I’m seeing a starting value of around 17K/sec and similar behavior where the RPS falls off constantly and the queues build to around 100K each...then the queues start falling, and although the status remains “Active” records are no longer displayed, and after a :q command ratop doesn’t actually quit:
> 
> Jul  8 14:19:10 test radium[40569]: 2014-07-08 14:19:10.453649 connect from localhost[127.0.0.1]
> Jul  8 14:19:26 test radium[40569]: 2014-07-08 14:19:26.774006 ArgusWriteOutSocket(0x136ce0) max queue exceeded 500001
> Jul  8 14:19:26 test radium[40569]: 2014-07-08 14:19:26.774141 ArgusWriteOutSocket(0x136ce0) max queue exceeded 500001
> 
> (gdb) where
> #0  0x0000003b53a0822d in pthread_join () from /lib64/libpthread.so.0
> #1  0x000000000040519d in main (argc=5, argv=0x7fffa260f248) at ./racurses.c:85
> (gdb) info threads
>  2 Thread 0x7fe176f90700 (LWP 40799)  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
> * 1 Thread 0x7fe178d5c700 (LWP 40798)  0x0000003b53a0822d in pthread_join () from /lib64/libpthread.so.0
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fe176f90700 (LWP 40799))]#0  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
> (gdb) where
> #0  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
> #1  0x00000000004055c2 in ArgusCursesProcess (arg=0x0) at ./racurses.c:234
> #2  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
> (gdb) 
> 
> 
> 
>> And by default on ratop, your 100K records are being sorted 10x per second; if
>> so, you will have some performance issues.  500K records in the queue at 7K records
>> a second is only 70 seconds of build-up.  Still, rastream() should keep up.
>> 
>> To run ratop() at these rates, you will want to modify the RA_UPDATE_INTERVAL in
>> your .rarc file so that it doesn’t update the screen, which means sorting the
>> entire queue, every tenth of a second.  Or use the “:u” option to modify it
>> on the fly, once you start ratop().  Every 2 seconds works pretty well.
>> 
>> Carter
>> 
>> On Jul 8, 2014, at 1:59 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>> 
>>> 
>>> On Jul 8, 2014, at 1:12 PM, Carter Bullard <carter at qosient.com> wrote:
>>> 
>>>> That type of processing shouldn’t cause any problems, but
>>>> you can be doing that level of processing in real time with
>>>> rasqlinsert() or doing your rabins() as another task reading
>>>> from radium.   You can also be running a local radium
>>>> that is collecting from the remote radium, so that many local
>>>> clients can feed from the local radium, but you’ve only written
>>>> the data once onto the wire.  That may be the best way to
>>>> go.
>>>> 
>>>> Hmmmm, so is rastream() busy ??  Lots of records per second ???
>>>> ratop connected to the radium, will print the average rps on 
>>>> the bottom of the screen.
>>>> 
>>> 
>>> rastream doesn’t appear to be that busy, CPU-wise, at the moment; however, ratop definitely seems to have trouble keeping up...
>>> 
>>> Immediately upon connecting (ratop -S 127.0.0.1) the rps is about 7K, and the ProcessQueue and DisplayQueue rapidly increase until they hit about 100K (as the RPS drops during that time)...Once the two queue values drop to 0, the status changes to idle and ratop won’t respond to commands (:q, Ctrl-C, Ctrl-Z)...Additionally, I get the queue-exceeded messages in /var/log/messages:
>>> 
>>> Jul  8 13:51:31 test radium[9015]: 2014-07-08 13:51:31.427787 connect from localhost[127.0.0.1]
>>> Jul  8 13:51:50 test radium[9015]: 2014-07-08 13:51:50.591503 ArgusWriteOutSocket(0x78136ce0) max queue exceeded 500001
>>> Jul  8 13:51:50 test radium[9015]: 2014-07-08 13:51:50.591591 ArgusWriteOutSocket(0x78136ce0) max queue exceeded 500001
>>> 
>>> (no similar messages for rastream since last restart)
>>> 
>>> Any other steps to look at here? 
>>> 
>>> In the meantime I’ll look at using a second radium collecting from the first to feed clients...
>>> 
>>> Cheers,
>>> 
>>> Jesse
>>> 
>>>> Carter
>>>> 
>>>> 
>>>> 
>>>> On Jul 8, 2014, at 1:08 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>> 
>>>>> 
>>>>> On Jul 8, 2014, at 12:49 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>> 
>>>>>> Looks like radium is doing the right thing, and that rastream() is falling behind.
>>>>>> So on the machine where rastream() is running, is it possible that the script
>>>>>> that it runs eats all the memory on the machine, which will cause rastream to
>>>>>> slow down in stream processing, backing up to radium, where radium shuts it down.
>>>>>> 
>>>>>> When radium hangs up, rastream should just reconnect, and I’m looking into why
>>>>>> that may not be happening, but what is your rastream script doing ???
>>>>>> Not sorting I hope ???
>>>>>> 
>>>>> 
>>>>> Well, the ra.conf I was using did not have the reliable connect option on, but it’s now set thusly:
>>>>> 
>>>>> RA_RELIABLE_CONNECT=yes
>>>>> 
>>>>> so that might at least help that...The post-process script looks like this:
>>>>> 
>>>>> #!/bin/bash
>>>>> 
>>>>> #
>>>>> #  Argus Client Software.  Tools to read, analyze and manage Argus data.
>>>>> #  Copyright (C) 2000-2014 QoSient, LLC.
>>>>> #  All Rights Reserved
>>>>> #
>>>>> # Script called by rastream, to process files.
>>>>> #
>>>>> # Since this is being called from rastream(), it will have only a single
>>>>> # parameter, filename,
>>>>> #
>>>>> # Carter Bullard <carter at qosient.com>
>>>>> #
>>>>> 
>>>>> PATH="/usr/local/bin:$PATH"; export PATH
>>>>> package="argus-clients"
>>>>> version="3.0.8rc3"
>>>>> 
>>>>> OPTIONS="$*"
>>>>> FILE=
>>>>> while test $# != 0
>>>>> do
>>>>>    case "$1" in
>>>>>       -r) shift; FILE="$1"; break;;
>>>>>    esac
>>>>>    shift
>>>>> done
>>>>> 
>>>>> RANGE=`/usr/local/bin/ratimerange -p 0 -r ${FILE} | awk -F ' - ' '{print $2}'`
>>>>> ERANGE=`date +%s -d "${RANGE}"`
>>>>> NOW=`date +%s`
>>>>> # NB: use -gt here; ">" inside [[ ]] compares strings, not numbers
>>>>> if [[ "${ERANGE}" -gt "${NOW}" ]]
>>>>> then
>>>>>    echo "${RANGE}" | /bin/mailx -s "ratimerange reporting bad dates" me@${DAYJOB}
>>>>> fi
>>>>> 
>>>>> ASN_DIR=/asn/`date +%Y/%m/%d`
>>>>> mkdir -p ${ASN_DIR}
>>>>> ASN_FILE=`basename ${FILE}`
>>>>> /usr/local/bin/racluster -m sas -r ${FILE} -w ${ASN_DIR}/${ASN_FILE}
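A subtle point in the script quoted above: inside [[ ]], the ">" operator compares strings lexicographically, not numerically; for epoch timestamps like ${ERANGE} and ${NOW}, -gt is the comparison you want.  A plain-bash demonstration of the difference (nothing argus-specific):

```shell
# ">" inside [[ ]] is a string (lexicographic) comparison;
# -gt is the numeric comparison you usually want for timestamps.
a=9
b=10
if [[ "$a" > "$b" ]]; then str="true"; else str="false"; fi
if [[ "$a" -gt "$b" ]]; then num="true"; else num="false"; fi
echo "string compare says 9 > 10 is $str"    # "9" sorts after "1..."
echo "numeric compare says 9 > 10 is $num"
```

Epoch seconds won’t change digit count until the year 2286, so the string comparison happens to work today; -gt removes the dependence on that accident.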
>>>>> 
>>>>> As an aside, would it be a better strategy to attach an rabins process to radium aggregating on sas and writing out a file on some interval (say one minute)?
>>>>> 
>>>>> Running that racluster manually shows it only using a few megs of memory, and for the most part this machine appears to have lots of free memory:
>>>>> 
>>>>> 54.51user 0.51system 0:55.42elapsed 99%CPU (0avgtext+0avgdata 76848maxresident)k
>>>>> 0inputs+0outputs (0major+5426minor)pagefaults 0swaps
>>>>> 
>>>>> Here’s a snapshot of the box as it is currently (collecting netflow and not running the post-process script):
>>>>> 
>>>>> $ free -m
>>>>>          total       used       free     shared    buffers     cached
>>>>> Mem:          7864       7746        118          0         48       7163
>>>>> -/+ buffers/cache:        534       7329
>>>>> Swap:         9919         63       9856
>>>>> 
>>>>> So almost all of its 8 GB is free (aside from file caching, of course)
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Jesse
>>>>> 
>>>>>> Carter
>>>>>> 
>>>>>> 
>>>>>> On Jul 8, 2014, at 12:31 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Jul 8, 2014, at 11:05 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>> 
>>>>>>>> Did radium stop collecting or sending ??  We’ve got some
>>>>>>>> reports on reliable connection failure, so it may be that your
>>>>>>>> rastream() disconnected and didn’t reconnect ????
>>>>>>> 
>>>>>>> It seems that radium is collecting; at least I can attach to the radium instance and receive 100 records with “ra -r 127.0.0.1 -N 100”
>>>>>>> 
>>>>>>>> check out your system log /var/log/messages /var/log/system.log
>>>>>>>> to see if radium complained about the client going away, or if
>>>>>>>> radium stopped reading.  If radium is still running you can just
>>>>>>>> connect to it, to see if it’s transmitting anything.
>>>>>>>> 
>>>>>>> It looks like it must be on the rastream side...??:
>>>>>>> 
>>>>>>> Jul  6 22:59:00 test radium[57599]: 2014-07-06 22:59:00.572718 connect from localhost[127.0.0.1]
>>>>>>> Jul  7 08:00:21 test radium[57599]: 2014-07-07 08:00:21.541077 ArgusWriteOutSocket(0x1269d0) client not processing: disconnecting
>>>>>>> 
>>>>>>> Likely unrelated, but I’m also seeing many of these messages in the logs:
>>>>>>> 
>>>>>>> Jul  2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.358574 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>>>>>>> Jul  2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.390583 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>>>>>>> 
>>>>>>> 
>>>>>>>> If there is a problem, and you’ve compiled with symbols in (.devel),
>>>>>>>> then attach to radium with gdb() and look to see if any of the
>>>>>>>> threads have terminated.
>>>>>>>> 
>>>>>>>> (gdb) attach pid.of.radium
>>>>>>>> (gdb) info threads
>>>>>>>> (gdb) thread 1
>>>>>>>> (gdb) where
>>>>>>>> (gdb) thread 2
>>>>>>>> (gdb) where
>>>>>>>> 
>>>>>>>> etc ….. may not be exact syntax, but it’s something like that.
>>>>>>>> With all the various end systems using clang and lldb, I’m kind
>>>>>>>> of schizophrenic on debugging right now.
>>>>>>>> 
>>>>>>> 
>>>>>>> Radium output:
>>>>>>> (gdb) info threads
>>>>>>> 3 Thread 0x7f001f752700 (LWP 57600)  0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>> 2 Thread 0x7f001ed51700 (LWP 57601)  0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>> * 1 Thread 0x7f001fc87700 (LWP 57599)  0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>> (gdb) thread 1
>>>>>>> [Switching to thread 1 (Thread 0x7f001fc87700 (LWP 57599))]#0 0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>> (gdb) where
>>>>>>> #0  0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>> #1  0x00000000004669ee in ArgusReadStream (parser=0x7f001fb42010, queue=0x19511f0) at ./argus_client.c:738
>>>>>>> #2  0x000000000040746c in main (argc=3, argv=0x7fff4ae0a728) at ./argus_main.c:387
>>>>>>> (gdb) thread 2
>>>>>>> [Switching to thread 2 (Thread 0x7f001ed51700 (LWP 57601))]#0 0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>> (gdb) where
>>>>>>> #0  0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>> #1  0x0000003b532acb60 in sleep () from /lib64/libc.so.6
>>>>>>> #2  0x0000000000466455 in ArgusConnectRemotes (arg=0x1951190) at ./argus_client.c:579
>>>>>>> #3  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>>>>>>> #4  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>>>>>>> (gdb) thread 3
>>>>>>> [Switching to thread 3 (Thread 0x7f001f752700 (LWP 57600))]#0 0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>> (gdb) where
>>>>>>> #0  0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>> #1  0x00000000004ac1be in ArgusOutputProcess (arg=0x1953310) at ./argus_output.c:897
>>>>>>> #2  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>>>>>>> #3  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>>>>>>> 
>>>>>>> Connecting to the failing rastream process gave odd results:
>>>>>>> 
>>>>>>> (gdb) detach
>>>>>>> Detaching from program: /usr/local/bin/radium, process 57599
>>>>>>> (gdb) attach 57605
>>>>>>> Attaching to program: /usr/local/bin/radium, process 57605
>>>>>>> Cannot access memory at address 0x706f636373007064
>>>>>>> (gdb) where
>>>>>>> #0  0x0000003b53a0ef3d in ?? ()
>>>>>>> #1  0x0000000000000000 in ?? ()
>>>>>>> (gdb) info threads
>>>>>>> * 1 process 57605  0x0000003b53a0ef3d in ?? ()
>>>>>>> 
>>>>>>> What should my next step be? Ensure the reliable connection setting is on? Run rastream under gdb?
>>>>>>> 
>>>>>>> Thanks and cheers,
>>>>>>> 
>>>>>>> Jesse
>>>>>>> 
>>>>>>>> Carter
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jul 7, 2014, at 4:28 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> Over the weekend my rastream process stopped processing records for some reason. The current setup is:
>>>>>>>>> 
>>>>>>>>> netflow records -> radium -> rastream -M time 5m
>>>>>>>>> 
>>>>>>>>> I noticed that records were no longer being written to disk. I connected a new ra instance to radium, and had no problems receiving records. Attaching strace to the rastream process all I could see were calls:
>>>>>>>>> 
>>>>>>>>> <snip>
>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0
>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0
>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0 
>>>>>>>>> <snip>
>>>>>>>>> 
>>>>>>>>> Are there any settings I can tweak, or logs to check, to find and correct the issue? I vaguely recall something about persistent connections where, if lost, an attempt would be made to reconnect, but my gut says that’s not what’s happening here...
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Jesse
>>>> 
>>> 
>> 
> 
