rastream stopped processing

Jesse Bowling jessebowling at gmail.com
Tue Jul 8 15:13:47 EDT 2014


On Jul 8, 2014, at 2:50 PM, Carter Bullard <carter at qosient.com> wrote:

> Hey Jesse,
> Well, 17K records per second is also a bit snappy.  At that rate,
> your queues can back up in possibly 10-20 seconds, which isn’t
> that long.  Looks like you hit it in about 16 seconds.
> 
> Because the rate is a bit snappy, your problems with rastream()
> maybe related to the duration of the run of the script.
> So your rastream() doesn’t block to run its script, but there
> maybe some weird locking such that it doesn’t process its queue
> if the script is running.  This isn’t right so I’ll look into
> that.
> 
> So, ratop was bumped off because it was slow, just like your
> rastream was bumped off, and its now just sitting there, wondering
> what to do with no input.  Does it show any records on the screen ???
> 
> If you don’t care about sorting, which is what is killing ratop(),
> you can set 
> 
> You should be able to navigate
> and see the 100Ks of records in its cache ???
> 
> If you want to run ratop, definately change the UPDATE variable in
> its rarc file, and try running with a “-m” option to reduce the number
> of objects that its tracking, like:
> 
>   -m matrix/16
>   -m smac dmac
>   -m mac addr (this will kickin the -M rmon option, so print out the smac saddr fields)
> 
> or your -m sas value if you’ve labeled them.
> 

OK, setting -m sas let’s ratop not backlog...and now I know the actual RPS is around 30K-35K/sec. :)

Nothing crashes, but I’m not convinced I’m seeing everything either...Does this look correct?

ProcessQueue     31 DisplayQueue     31 TotalRecords 15079221  Secs  34473.2227 rps   Status Active                                                                                                                                                                             

The ProcessQueue and DisplayQueue totals don’t really vary (although sometimes it drops to 30)...Is that letting me know what is drawn on the screen (as opposed to a backlog, which is what I assumed at first)?

I also see what seem like data oddities:

                      StartTime      Flgs  Proto            SrcAddr  Sport sCo   sAS   Dir            DstAddr  Dport dCo   dAS  SrcPkts  DstPkts     SrcBytes     DstBytes                                                                                                      
     2014-07-08 14:52:21.698000 N             ip            0.0.0.0         US <snip>   <->            0.0.0.0            20940 2695281* 1060512* 156417443277  88651276430                                                                                                      
2014-07-08 14:52:22.293000 N             ip            0.0.0.0                    <->            0.0.0.0                  51007216      228  32196610358        19574        

Specifically, I doubt that the top entry traffic is all destined for AS 20940; a second run and all traffic is headed to AS 1668, a third 33739...Should I assume that das doesn’t make sense here? Also, I’m collecting Netflow records, so I would expect dst counters (bytes, pkts, etc) to all be zero, but for some of the top ASN’s, it’s not.

As a final touch, when I hit :q to quit, I got on two of three runs (appears to be related to how long I let ratop run):

Segmentation fault (core dumped)

Thanks for all the help and sorry for the likely silly questions; I’ve just not used ratop much...

Cheers,

Jesse

> Carter
> 
> 
> On Jul 8, 2014, at 2:24 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
> 
>> 
>> On Jul 8, 2014, at 2:07 PM, Carter Bullard <carter at qosient.com> wrote:
>> 
>>> well, 7K records per second is a bit snappy.
>>> are you using the latest ratop() as earlier versions have real performance problems ??
>>> 
>> $ ratop --version
>> RaCurses Version 3.0.7.36
>> 
>> Updated to 3.0.8.rc.1 and now...
>> 
>> Actually, looking at it now I’m seeing a starting value of around 17K/sec and similar behavior where the RPS falls off constantly and the queues build to around 100K each...then around then then the queues start falling and although the status remains “Active” records are no longer displayed, and after a :q command ratop doesn’t actually quit:
>> 
>> Jul  8 14:19:10 test radium[40569]: 2014-07-08 14:19:10.453649 connect from localhost[127.0.0.1]
>> Jul  8 14:19:26 test radium[40569]: 2014-07-08 14:19:26.774006 ArgusWriteOutSocket(0x136ce0) max queue exceeded 500001
>> Jul  8 14:19:26 test radium[40569]: 2014-07-08 14:19:26.774141 ArgusWriteOutSocket(0x136ce0) max queue exceeded 500001
>> 
>> (gdb) where
>> #0  0x0000003b53a0822d in pthread_join () from /lib64/libpthread.so.0
>> #1  0x000000000040519d in main (argc=5, argv=0x7fffa260f248) at ./racurses.c:85
>> (gdb) info threads
>> 2 Thread 0x7fe176f90700 (LWP 40799)  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
>> * 1 Thread 0x7fe178d5c700 (LWP 40798)  0x0000003b53a0822d in pthread_join () from /lib64/libpthread.so.0
>> (gdb) thread 2
>> [Switching to thread 2 (Thread 0x7fe176f90700 (LWP 40799))]#0  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
>> (gdb) where
>> #0  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
>> #1  0x00000000004055c2 in ArgusCursesProcess (arg=0x0) at ./racurses.c:234
>> #2  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>> #3  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>> (gdb) 
>> 
>> 
>> 
>>> And by default on ratop, your 100K records are being sorted 10x per second, if
>>> so you will have some performance issues.  500K records in the queue for 7K records
>>> a second, is only 70 seconds of build up.  Still, rastream() should keep up.
>>> 
>>> To run ratop() at these rates, you will want to modify the RA_UPDATE_INTERVAL in
>>> your .rarc file to so that it doesn’t update the screen, which means sort the
>>> entire queue, every tenth of a second.  Or use the “:u” option to modify it
>>> on the fly, once you start ratop().  Every 2 seconds works pretty good.
>>> 
>>> Carter
>>> 
>>> On Jul 8, 2014, at 1:59 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>> 
>>>> 
>>>> On Jul 8, 2014, at 1:12 PM, Carter Bullard <carter at qosient.com> wrote:
>>>> 
>>>>> That type of processing shouldn’t cause any problems, but
>>>>> you can be doing that level of processing in real time with
>>>>> rasqlinsert() or doing your rabins() as another task reading
>>>>> from radium.   You can also be running a local radium
>>>>> that is collecting from the remote radium, so that many local
>>>>> clients can feed from the local radium, but you’ve only written
>>>>> the data once onto the wire.  That maybe the best way to
>>>>> go.
>>>>> 
>>>>> Hmmmm, so is rastream() busy ??  Lots of records per second ???
>>>>> ratop connected to the radium, will print the average rps on 
>>>>> the bottom of the screen.
>>>>> 
>>>> 
>>>> rastream doesn’t appear to be that busy, CPU-wise at the moment, however ratop definitely has issues keeping up it seems...
>>>> 
>>>> Immediately upon connecting (ratop -S 127.0.0.1) the rps is about 7K, and the ProcessQueue and DisplayQueue rapidly increase until they hit about 100K (as the RPS drops during that time)...Once the two queue values drop to 0, the status changes to idle and ratop won’t respond to commands (:q, Ctrl-C, Ctrl-Z)...Additionally I get the queue exceed messages out of /var/log/messages:
>>>> 
>>>> Jul  8 13:51:31 test radium[9015]: 2014-07-08 13:51:31.427787 connect from localhost[127.0.0.1]
>>>> Jul  8 13:51:50 test radium[9015]: 2014-07-08 13:51:50.591503 ArgusWriteOutSocket(0x78136ce0) max queue exceeded 500001
>>>> Jul  8 13:51:50 test radium[9015]: 2014-07-08 13:51:50.591591 ArgusWriteOutSocket(0x78136ce0) max queue exceeded 500001
>>>> 
>>>> (no similar messages for rastream since last restart)
>>>> 
>>>> Any other steps to look at here? 
>>>> 
>>>> In the meantime I’ll look at using a second radium collecting from the first to feed clients...
>>>> 
>>>> Cheers,
>>>> 
>>>> Jesse
>>>> 
>>>>> Carter
>>>>> 
>>>>> 
>>>>> 
>>>>> On Jul 8, 2014, at 1:08 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>> 
>>>>>> 
>>>>>> On Jul 8, 2014, at 12:49 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>>> 
>>>>>>> Looks like radium is doing the right thing, and that rastream() is falling behind.
>>>>>>> So on the machine where rastream() is running, is it possible that the script
>>>>>>> that it runs eats all the memory on the machine, which will cause rastream to
>>>>>>> slow down in stream processing, backing up to radium, where radium shuts it down.
>>>>>>> 
>>>>>>> When radium hangs up, rastream should just reconnect, and I’m looking into why
>>>>>>> that may not be happening, but what is your rastream script doing ???
>>>>>>> Not sorting I hope ???
>>>>>>> 
>>>>>> 
>>>>>> Well, the ra.conf I was using did not have the reliable connect option on, but it’s now set thusly:
>>>>>> 
>>>>>> RA_RELIABLE_CONNECT=yes
>>>>>> 
>>>>>> so that might at least help that...The post-process script looks like this:
>>>>>> 
>>>>>> #!/bin/bash
>>>>>> 
>>>>>> #
>>>>>> #  Argus Client Software.  Tools to read, analyze and manage Argus data.
>>>>>> #  Copyright (C) 2000-2014 QoSient, LLC.
>>>>>> #  All Rights Reserved
>>>>>> #
>>>>>> # Script called by rastream, to process files.
>>>>>> #
>>>>>> # Since this is being called from rastream(), it will have only a single
>>>>>> # parameter, filename,
>>>>>> #
>>>>>> # Carter Bullard <carter at qosient.com>
>>>>>> #
>>>>>> 
>>>>>> PATH="/usr/local/bin:$PATH"; export PATH
>>>>>> package="argus-clients"
>>>>>> version="3.0.8rc3"
>>>>>> 
>>>>>> OPTIONS="$*"
>>>>>> FILE=
>>>>>> while test $# != 0
>>>>>> do
>>>>>> case "$1" in
>>>>>> -r) shift; FILE="$1"; break;;
>>>>>> esac
>>>>>> shift
>>>>>> done
>>>>>> 
>>>>>> RANGE=`/usr/local/bin/ratimerange -p 0 -r ${FILE} | awk -F ' - ' '{print $2}'`
>>>>>> ERANGE=`date +%s -d "${RANGE}"`
>>>>>> NOW=`date +%s`
>>>>>> if [[ "${ERANGE}" > "${NOW}" ]]
>>>>>> then
>>>>>> echo "${RANGE}" | /bin/mailx -s "ratimerange reporting bad dates" me@${DAYJOB}
>>>>>> fi
>>>>>> 
>>>>>> ASN_DIR=/asn/`date +%Y/%m/%d`
>>>>>> mkdir -p ${ASN_DIR}
>>>>>> ASN_FILE=`basename ${FILE}`
>>>>>> /usr/local/bin/racluster -m sas -r ${FILE} -w ${ASN_DIR}/${ASN_FILE}
>>>>>> 
>>>>>> As an aside, would it be a better strategy to attach an rabins process to radium aggregating on sas and writing out a file on some interval (say one minute)?
>>>>>> 
>>>>>> Running that racluster manually shows it only using a few megs of memory, and for the most part this machine appears to have lots of free memory:
>>>>>> 
>>>>>> 54.51user 0.51system 0:55.42elapsed 99%CPU (0avgtext+0avgdata 76848maxresident)k
>>>>>> 0inputs+0outputs (0major+5426minor)pagefaults 0swaps
>>>>>> 
>>>>>> Here’s a snapshot of the box as it is currently (collecting netflow and not running the post-process script):
>>>>>> 
>>>>>> $ free -m
>>>>>>         total       used       free     shared    buffers     cached
>>>>>> Mem:          7864       7746        118          0         48       7163
>>>>>> -/+ buffers/cache:        534       7329
>>>>>> Swap:         9919         63       9856
>>>>>> 
>>>>>> So almost all of it’s 8 GB is free (aside from file caching of course)
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Jesse
>>>>>> 
>>>>>>> Carter
>>>>>>> 
>>>>>>> 
>>>>>>> On Jul 8, 2014, at 12:31 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jul 8, 2014, at 11:05 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>> 
>>>>>>>>> Did radium stop collecting or sending ??  We’ve got some
>>>>>>>>> reports on reliable connection failure, so it maybe your
>>>>>>>>> rastream() disconnected and didn’t reconnect ????
>>>>>>>> 
>>>>>>>> It seems that radium is collecting; art least I can attach to the radium instance and receive 100 records with “ra -r 127.0.0.1 -N 100"
>>>>>>>> 
>>>>>>>>> check out your system log /var/log/messages /var/log/system.log
>>>>>>>>> to see if radium complained about the client going away, or if
>>>>>>>>> radium stopped reading.  If radium is still running you can just
>>>>>>>>> connect to it, to see if its transmitting anything.
>>>>>>>>> 
>>>>>>>> It looks like it must be on the rastream side...??:
>>>>>>>> 
>>>>>>>> Jul  6 22:59:00 test radium[57599]: 2014-07-06 22:59:00.572718 connect from localhost[127.0.0.1]
>>>>>>>> Jul  7 08:00:21 test radium[57599]: 2014-07-07 08:00:21.541077 ArgusWriteOutSocket(0x1269d0) client not processing: disconnecting
>>>>>>>> 
>>>>>>>> Likely unrelated, but I’m also seeing many of these messages in the logs:
>>>>>>>> 
>>>>>>>> Jul  2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.358574 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>>>>>>>> Jul  2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.390583 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> If there is a problem, and you’ve compiled with symbols in (.devel),
>>>>>>>>> then attach to radium with gdb() and look to see if any of the
>>>>>>>>> threads have terminated.
>>>>>>>>> 
>>>>>>>>> (gdb) attach pid.of.radium
>>>>>>>>> (gdb) info threads
>>>>>>>>> (gdb) thread 1
>>>>>>>>> (gdb) where
>>>>>>>>> (gdb) thread 2
>>>>>>>>> (gdb) where
>>>>>>>>> 
>>>>>>>>> etc ….. may not be exact syntax, but its something like that.
>>>>>>>>> With all the various end systems using clang and lldb, I’m kind
>>>>>>>>> of schizophrenic on debugging right now.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Radium output:
>>>>>>>> (gdb) info threads
>>>>>>>> 3 Thread 0x7f001f752700 (LWP 57600)  0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>>> 2 Thread 0x7f001ed51700 (LWP 57601)  0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>>> * 1 Thread 0x7f001fc87700 (LWP 57599)  0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>>> (gdb) thread 1
>>>>>>>> [Switching to thread 1 (Thread 0x7f001fc87700 (LWP 57599))]#0 0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>>> (gdb) where
>>>>>>>> #0  0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>>> #1  0x00000000004669ee in ArgusReadStream (parser=0x7f001fb42010, queue=0x19511f0) at ./argus_client.c:738
>>>>>>>> #2  0x000000000040746c in main (argc=3, argv=0x7fff4ae0a728) at ./argus_main.c:387
>>>>>>>> (gdb) thread 2
>>>>>>>> [Switching to thread 2 (Thread 0x7f001ed51700 (LWP 57601))]#0 0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>>> (gdb) where
>>>>>>>> #0  0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>>> #1  0x0000003b532acb60 in sleep () from /lib64/libc.so.6
>>>>>>>> #2  0x0000000000466455 in ArgusConnectRemotes (arg=0x1951190) at ./argus_client.c:579
>>>>>>>> #3  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>>>>>>>> #4  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>>>>>>>> (gdb) thread 3
>>>>>>>> [Switching to thread 3 (Thread 0x7f001f752700 (LWP 57600))]#0 0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>>> (gdb) where
>>>>>>>> #0  0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>>> #1  0x00000000004ac1be in ArgusOutputProcess (arg=0x1953310) at ./argus_output.c:897
>>>>>>>> #2  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>>>>>>>> #3  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>>>>>>>> 
>>>>>>>> Connecting to the failing rastream process gave odd results:
>>>>>>>> 
>>>>>>>> (gdb) detach
>>>>>>>> Detaching from program: /usr/local/bin/radium, process 57599
>>>>>>>> (gdb) attach 57605
>>>>>>>> Attaching to program: /usr/local/bin/radium, process 57605
>>>>>>>> Cannot access memory at address 0x706f636373007064
>>>>>>>> (gdb) where
>>>>>>>> #0  0x0000003b53a0ef3d in ?? ()
>>>>>>>> #1  0x0000000000000000 in ?? ()
>>>>>>>> (gdb) info threads
>>>>>>>> * 1 process 57605  0x0000003b53a0ef3d in ?? ()
>>>>>>>> 
>>>>>>>> What should my next step be? Ensure the reliable connection setting is on? Run rastream under gdb?
>>>>>>>> 
>>>>>>>> Thanks and cheers,
>>>>>>>> 
>>>>>>>> Jesse
>>>>>>>> 
>>>>>>>>> Carter
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jul 7, 2014, at 4:28 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> Over the weekend my rastream process stopped processing records for some reason. The current setup is:
>>>>>>>>>> 
>>>>>>>>>> netflow records -> radium -> rastream -M time 5m
>>>>>>>>>> 
>>>>>>>>>> I noticed that records were no longer being written to disk. I connected a new ra instance to radium, and had no problems receiving records. Attaching strace to the rastream process all I could see were calls:
>>>>>>>>>> 
>>>>>>>>>> <snip>
>>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0
>>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0
>>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0 
>>>>>>>>>> <snip>
>>>>>>>>>> 
>>>>>>>>>> Is there any settings I can tweak or logs to check for or correct the issue? I vaguely recall something about persistent connections where if lost an attempt would be made to reconnect, but my gut says that’s not what’s happening here...
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> 
>>>>>>>>>> Jesse
>>>>> 
>>>> 
>>> 
>> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 204 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <https://pairlist1.pair.net/pipermail/argus/attachments/20140708/7e7f10d2/attachment.sig>


More information about the argus mailing list