rastream stopped processing

Tue Jul 8 18:39:06 EDT 2014

Hey Jesse,
Yep, there should be 3 threads.  The ArgusCursesInputProcess is missing.
Great maybe all I need !!!!

Carter

On Jul 8, 2014, at 5:01 PM, Jesse Bowling <jessebowling at gmail.com> wrote:

> 
> On Jul 8, 2014, at 3:56 PM, Carter Bullard <carter at qosient.com> wrote:
> 
>> 
>> The ratop() not handling the :q, the curses thread may have failed (that
>> is the thread that is reading the curses input).  If you could attach to
>> it with gdb() and check the threads and see if you can print the stack
>> “where” for the one that is dead, I’ll fix that post haste !!!!
>> 
> 
> Here’s what I see:
> 
> (gdb) info threads 
>  2 Thread 0x7fccbd180700 (LWP 21979)  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
> * 1 Thread 0x7fccbef4c700 (LWP 21978)  0x0000003b53a0822d in pthread_join () from /lib64/libpthread.so.0
> (gdb) where
> #0  0x0000003b53a0822d in pthread_join () from /lib64/libpthread.so.0
> #1  0x000000000040519d in main (argc=17, argv=0x7fffdb0ac868) at ./racurses.c:85
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fccbd180700 (LWP 21979))]#0  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
> (gdb) where
> #0  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
> #1  0x00000000004055c2 in ArgusCursesProcess (arg=0x0) at ./racurses.c:234
> #2  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
> 
> Thanks for the patient explanations! :)
> 
> Cheers,
> 
> Jesse
> 
>> 
>> Carter
>> 
>> On Jul 8, 2014, at 3:34 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>> 
>>> Thanks for that Carter.
>>> 
>>> It looks like with the rmon option ratop can’t keep up as I get:
>>> 
>>> Jul  8 15:25:43 test radium[40569]: 2014-07-08 15:25:43.153935 connect from localhost[127.0.0.1]
>>> Jul  8 15:26:48 test radium[40569]: 2014-07-08 15:26:48.708503 ArgusWriteOutSocket(0x136dc0) max queue exceeded 500001
>>> Jul  8 15:26:48 test radium[40569]: 2014-07-08 15:26:48.708594 ArgusWriteOutSocket(0x136dc0) max queue exceeded 500001
>>> 
>>> along with the symptom that :q won’t quit ratop...
>>> 
>>> As far as the rastream issue I originally reported, I’ll test the theory that having the reconnect option turned on helps...However I’m a bit worried about the implications of a disconnect. Is it fair to assume that a disconnect in my case is implying that rastream can’t keep up with the incoming records? If so, is there anything else I can tweak? 
>>> 
>>> I’m concerned about the timing of my first rastream failure (we were being massively scanned at that time, based on activity to our honeypots which saw a 17 fold increase in attempts) and about later on in the year (when our traffic levels tend to dramatically increase compared to the summer...
>>> 
>>> Cheers,
>>> 
>>> Jesse
>>> 
>>> 
>>> On Jul 8, 2014, at 3:22 PM, Carter Bullard <carter at qosient.com> wrote:
>>> 
>>>> Hey Jesse,
>>>> Looks good.
>>>> 
>>>> ratop() removes idle entries, and the timeout is default 60 seconds.
>>>> 
>>>> So, 31 source originating AS’s active in the last 60 seconds (assuming defaults).
>>>> “ProcessQueue" is total number of caches ratop is tracking, “DisplayQueue”
>>>> is the number on the screen.
>>>> 
>>>> So with a -m sas, you are only tracking the source originating AS numbers.
>>>> If you want to see all AS's, use the “-M rmon” option, so that you track all
>>>> (source and destination) as a single object.
>>>> 
>>>> ratop -M rmon -m sas
>>>> 
>>>> and you will see all the AS’s as singular objects.  The only thing to print
>>>> in that case is:
>>>> 
>>>> -s stime flgs sas spkts:16 dpkts:16 sbytes:20 dbytes:20 pcr
>>>> 
>>>> 
>>>> If you want to track the AS matrix, which unique sAS and dAS’s, DONT use
>>>> the -M rmon option, just add the das to the aggregation mask
>>>> 
>>>> ratop -m sas das
>>>> 
>>>> Now you will see the AS matrix.
>>>> 
>>>> Check out your timeouts using the “:T” value, and change that to say 3600.
>>>> 
>>>> Carter
>>>> 
>>>> 
>>>> 
>>>> On Jul 8, 2014, at 3:13 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>> 
>>>>> 
>>>>> On Jul 8, 2014, at 2:50 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>> 
>>>>>> Hey Jesse,
>>>>>> Well, 17K records per second is also a bit snappy.  At that rate,
>>>>>> your queues can back up in possibly 10-20 seconds, which isn’t
>>>>>> that long.  Looks like you hit it in about 16 seconds.
>>>>>> 
>>>>>> Because the rate is a bit snappy, your problems with rastream()
>>>>>> maybe related to the duration of the run of the script.
>>>>>> So your rastream() doesn’t block to run its script, but there
>>>>>> maybe some weird locking such that it doesn’t process its queue
>>>>>> if the script is running.  This isn’t right so I’ll look into
>>>>>> that.
>>>>>> 
>>>>>> So, ratop was bumped off because it was slow, just like your
>>>>>> rastream was bumped off, and its now just sitting there, wondering
>>>>>> what to do with no input.  Does it show any records on the screen ???
>>>>>> 
>>>>>> If you don’t care about sorting, which is what is killing ratop(),
>>>>>> you can set 
>>>>>> 
>>>>>> You should be able to navigate
>>>>>> and see the 100Ks of records in its cache ???
>>>>>> 
>>>>>> If you want to run ratop, definately change the UPDATE variable in
>>>>>> its rarc file, and try running with a “-m” option to reduce the number
>>>>>> of objects that its tracking, like:
>>>>>> 
>>>>>> -m matrix/16
>>>>>> -m smac dmac
>>>>>> -m mac addr (this will kickin the -M rmon option, so print out the smac saddr fields)
>>>>>> 
>>>>>> or your -m sas value if you’ve labeled them.
>>>>>> 
>>>>> 
>>>>> OK, setting -m sas let’s ratop not backlog...and now I know the actual RPS is around 30K-35K/sec. :)
>>>>> 
>>>>> Nothing crashes, but I’m not convinced I’m seeing everything either...Does this look correct?
>>>>> 
>>>>> ProcessQueue     31 DisplayQueue     31 TotalRecords 15079221  Secs 34473.2227 rps   Status Active                                               
>>>>> 
>>>>> The ProcessQueue and DisplayQueue totals don’t really vary (although sometimes it drops to 30)...Is that letting me know what is drawn on the screen (as opposed to a backlog, which is what I assumed at first)?
>>>>> 
>>>>> I also see what seem like data oddities:
>>>>> 
>>>>>                   StartTime      Flgs  Proto            SrcAddr  Sport sCo sAS   Dir            DstAddr  Dport dCo   dAS  SrcPkts  DstPkts     SrcBytes DstBytes                                                                     
>>>>>  2014-07-08 14:52:21.698000 N             ip            0.0.0.0         US <snip>   <->            0.0.0.0            20940 2695281* 1060512* 156417443277  88651276430                                                    
>>>>> 2014-07-08 14:52:22.293000 N             ip            0.0.0.0                <->            0.0.0.0                  51007216      228  32196610358        19574        
>>>>> 
>>>>> Specifically, I doubt that the top entry traffic is all destined for AS 20940; a second run and all traffic is headed to AS 1668, a third 33739...Should I assume that das doesn’t make sense here? Also, I’m collecting Netflow records, so I would expect dst counters (bytes, pkts, etc) to all be zero, but for some of the top ASN’s, it’s not.
>>>>> 
>>>>> As a final touch, when I hit :q to quit, I got on two of three runs (appears to be related to how long I let ratop run):
>>>>> 
>>>>> Segmentation fault (core dumped)
>>>>> 
>>>>> Thanks for all the help and sorry for the likely silly questions; I’ve just not used ratop much...
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Jesse
>>>>> 
>>>>>> Carter
>>>>>> 
>>>>>> 
>>>>>> On Jul 8, 2014, at 2:24 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Jul 8, 2014, at 2:07 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>> 
>>>>>>>> well, 7K records per second is a bit snappy.
>>>>>>>> are you using the latest ratop() as earlier versions have real performance problems ??
>>>>>>>> 
>>>>>>> $ ratop --version
>>>>>>> RaCurses Version 3.0.7.36
>>>>>>> 
>>>>>>> Updated to 3.0.8.rc.1 and now...
>>>>>>> 
>>>>>>> Actually, looking at it now I’m seeing a starting value of around 17K/sec and similar behavior where the RPS falls off constantly and the queues build to around 100K each...then around then then the queues start falling and although the status remains “Active” records are no longer displayed, and after a :q command ratop doesn’t actually quit:
>>>>>>> 
>>>>>>> Jul  8 14:19:10 test radium[40569]: 2014-07-08 14:19:10.453649 connect from localhost[127.0.0.1]
>>>>>>> Jul  8 14:19:26 test radium[40569]: 2014-07-08 14:19:26.774006 ArgusWriteOutSocket(0x136ce0) max queue exceeded 500001
>>>>>>> Jul  8 14:19:26 test radium[40569]: 2014-07-08 14:19:26.774141 ArgusWriteOutSocket(0x136ce0) max queue exceeded 500001
>>>>>>> 
>>>>>>> (gdb) where
>>>>>>> #0  0x0000003b53a0822d in pthread_join () from /lib64/libpthread.so.0
>>>>>>> #1  0x000000000040519d in main (argc=5, argv=0x7fffa260f248) at ./racurses.c:85
>>>>>>> (gdb) info threads
>>>>>>> 2 Thread 0x7fe176f90700 (LWP 40799)  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
>>>>>>> * 1 Thread 0x7fe178d5c700 (LWP 40798)  0x0000003b53a0822d in pthread_join () from /lib64/libpthread.so.0
>>>>>>> (gdb) thread 2
>>>>>>> [Switching to thread 2 (Thread 0x7fe176f90700 (LWP 40799))]#0 0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
>>>>>>> (gdb) where
>>>>>>> #0  0x0000003b53a0ef3d in nanosleep () from /lib64/libpthread.so.0
>>>>>>> #1  0x00000000004055c2 in ArgusCursesProcess (arg=0x0) at ./racurses.c:234
>>>>>>> #2  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>>>>>>> #3  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>>>>>>> (gdb) 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> And by default on ratop, your 100K records are being sorted 10x per second, if
>>>>>>>> so you will have some performance issues.  500K records in the queue for 7K records
>>>>>>>> a second, is only 70 seconds of build up.  Still, rastream() should keep up.
>>>>>>>> 
>>>>>>>> To run ratop() at these rates, you will want to modify the RA_UPDATE_INTERVAL in
>>>>>>>> your .rarc file to so that it doesn’t update the screen, which means sort the
>>>>>>>> entire queue, every tenth of a second.  Or use the “:u” option to modify it
>>>>>>>> on the fly, once you start ratop().  Every 2 seconds works pretty good.
>>>>>>>> 
>>>>>>>> Carter
>>>>>>>> 
>>>>>>>> On Jul 8, 2014, at 1:59 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jul 8, 2014, at 1:12 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>> 
>>>>>>>>>> That type of processing shouldn’t cause any problems, but
>>>>>>>>>> you can be doing that level of processing in real time with
>>>>>>>>>> rasqlinsert() or doing your rabins() as another task reading
>>>>>>>>>> from radium.   You can also be running a local radium
>>>>>>>>>> that is collecting from the remote radium, so that many local
>>>>>>>>>> clients can feed from the local radium, but you’ve only written
>>>>>>>>>> the data once onto the wire.  That maybe the best way to
>>>>>>>>>> go.
>>>>>>>>>> 
>>>>>>>>>> Hmmmm, so is rastream() busy ??  Lots of records per second ???
>>>>>>>>>> ratop connected to the radium, will print the average rps on 
>>>>>>>>>> the bottom of the screen.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> rastream doesn’t appear to be that busy, CPU-wise at the moment, however ratop definitely has issues keeping up it seems...
>>>>>>>>> 
>>>>>>>>> Immediately upon connecting (ratop -S 127.0.0.1) the rps is about 7K, and the ProcessQueue and DisplayQueue rapidly increase until they hit about 100K (as the RPS drops during that time)...Once the two queue values drop to 0, the status changes to idle and ratop won’t respond to commands (:q, Ctrl-C, Ctrl-Z)...Additionally I get the queue exceed messages out of /var/log/messages:
>>>>>>>>> 
>>>>>>>>> Jul  8 13:51:31 test radium[9015]: 2014-07-08 13:51:31.427787 connect from localhost[127.0.0.1]
>>>>>>>>> Jul  8 13:51:50 test radium[9015]: 2014-07-08 13:51:50.591503 ArgusWriteOutSocket(0x78136ce0) max queue exceeded 500001
>>>>>>>>> Jul  8 13:51:50 test radium[9015]: 2014-07-08 13:51:50.591591 ArgusWriteOutSocket(0x78136ce0) max queue exceeded 500001
>>>>>>>>> 
>>>>>>>>> (no similar messages for rastream since last restart)
>>>>>>>>> 
>>>>>>>>> Any other steps to look at here? 
>>>>>>>>> 
>>>>>>>>> In the meantime I’ll look at using a second radium collecting from the first to feed clients...
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Jesse
>>>>>>>>> 
>>>>>>>>>> Carter
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Jul 8, 2014, at 1:08 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Jul 8, 2014, at 12:49 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Looks like radium is doing the right thing, and that rastream() is falling behind.
>>>>>>>>>>>> So on the machine where rastream() is running, is it possible that the script
>>>>>>>>>>>> that it runs eats all the memory on the machine, which will cause rastream to
>>>>>>>>>>>> slow down in stream processing, backing up to radium, where radium shuts it down.
>>>>>>>>>>>> 
>>>>>>>>>>>> When radium hangs up, rastream should just reconnect, and I’m looking into why
>>>>>>>>>>>> that may not be happening, but what is your rastream script doing ???
>>>>>>>>>>>> Not sorting I hope ???
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Well, the ra.conf I was using did not have the reliable connect option on, but it’s now set thusly:
>>>>>>>>>>> 
>>>>>>>>>>> RA_RELIABLE_CONNECT=yes
>>>>>>>>>>> 
>>>>>>>>>>> so that might at least help that...The post-process script looks like this:
>>>>>>>>>>> 
>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>> 
>>>>>>>>>>> #
>>>>>>>>>>> #  Argus Client Software.  Tools to read, analyze and manage Argus data.
>>>>>>>>>>> #  Copyright (C) 2000-2014 QoSient, LLC.
>>>>>>>>>>> #  All Rights Reserved
>>>>>>>>>>> #
>>>>>>>>>>> # Script called by rastream, to process files.
>>>>>>>>>>> #
>>>>>>>>>>> # Since this is being called from rastream(), it will have only a single
>>>>>>>>>>> # parameter, filename,
>>>>>>>>>>> #
>>>>>>>>>>> # Carter Bullard <carter at qosient.com>
>>>>>>>>>>> #
>>>>>>>>>>> 
>>>>>>>>>>> PATH="/usr/local/bin:$PATH"; export PATH
>>>>>>>>>>> package="argus-clients"
>>>>>>>>>>> version="3.0.8rc3"
>>>>>>>>>>> 
>>>>>>>>>>> OPTIONS="$*"
>>>>>>>>>>> FILE=
>>>>>>>>>>> while test $# != 0
>>>>>>>>>>> do
>>>>>>>>>>> case "$1" in
>>>>>>>>>>> -r) shift; FILE="$1"; break;;
>>>>>>>>>>> esac
>>>>>>>>>>> shift
>>>>>>>>>>> done
>>>>>>>>>>> 
>>>>>>>>>>> RANGE=`/usr/local/bin/ratimerange -p 0 -r ${FILE} | awk -F ' - ' '{print $2}'`
>>>>>>>>>>> ERANGE=`date +%s -d "${RANGE}"`
>>>>>>>>>>> NOW=`date +%s`
>>>>>>>>>>> if [[ "${ERANGE}" > "${NOW}" ]]
>>>>>>>>>>> then
>>>>>>>>>>> echo "${RANGE}" | /bin/mailx -s "ratimerange reporting bad dates" me@${DAYJOB}
>>>>>>>>>>> fi
>>>>>>>>>>> 
>>>>>>>>>>> ASN_DIR=/asn/`date +%Y/%m/%d`
>>>>>>>>>>> mkdir -p ${ASN_DIR}
>>>>>>>>>>> ASN_FILE=`basename ${FILE}`
>>>>>>>>>>> /usr/local/bin/racluster -m sas -r ${FILE} -w ${ASN_DIR}/${ASN_FILE}
>>>>>>>>>>> 
>>>>>>>>>>> As an aside, would it be a better strategy to attach an rabins process to radium aggregating on sas and writing out a file on some interval (say one minute)?
>>>>>>>>>>> 
>>>>>>>>>>> Running that racluster manually shows it only using a few megs of memory, and for the most part this machine appears to have lots of free memory:
>>>>>>>>>>> 
>>>>>>>>>>> 54.51user 0.51system 0:55.42elapsed 99%CPU (0avgtext+0avgdata 76848maxresident)k
>>>>>>>>>>> 0inputs+0outputs (0major+5426minor)pagefaults 0swaps
>>>>>>>>>>> 
>>>>>>>>>>> Here’s a snapshot of the box as it is currently (collecting netflow and not running the post-process script):
>>>>>>>>>>> 
>>>>>>>>>>> $ free -m
>>>>>>>>>>>     total       used       free     shared    buffers     cached
>>>>>>>>>>> Mem:          7864       7746        118          0         48       7163
>>>>>>>>>>> -/+ buffers/cache:        534       7329
>>>>>>>>>>> Swap:         9919         63       9856
>>>>>>>>>>> 
>>>>>>>>>>> So almost all of it’s 8 GB is free (aside from file caching of course)
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> 
>>>>>>>>>>> Jesse
>>>>>>>>>>> 
>>>>>>>>>>>> Carter
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 8, 2014, at 12:31 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jul 8, 2014, at 11:05 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Did radium stop collecting or sending ??  We’ve got some
>>>>>>>>>>>>>> reports on reliable connection failure, so it maybe your
>>>>>>>>>>>>>> rastream() disconnected and didn’t reconnect ????
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It seems that radium is collecting; art least I can attach to the radium instance and receive 100 records with “ra -r 127.0.0.1 -N 100"
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> check out your system log /var/log/messages /var/log/system.log
>>>>>>>>>>>>>> to see if radium complained about the client going away, or if
>>>>>>>>>>>>>> radium stopped reading.  If radium is still running you can just
>>>>>>>>>>>>>> connect to it, to see if its transmitting anything.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> It looks like it must be on the rastream side...??:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jul  6 22:59:00 test radium[57599]: 2014-07-06 22:59:00.572718 connect from localhost[127.0.0.1]
>>>>>>>>>>>>> Jul  7 08:00:21 test radium[57599]: 2014-07-07 08:00:21.541077 ArgusWriteOutSocket(0x1269d0) client not processing: disconnecting
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Likely unrelated, but I’m also seeing many of these messages in the logs:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jul  2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.358574 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>>>>>>>>>>>>> Jul  2 16:31:26 test radium[47571]: 2014-07-02 16:31:26.390583 ArgusWriteOutSocket(0x181269d0) max queue exceeded 500001
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If there is a problem, and you’ve compiled with symbols in (.devel),
>>>>>>>>>>>>>> then attach to radium with gdb() and look to see if any of the
>>>>>>>>>>>>>> threads have terminated.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (gdb) attach pid.of.radium
>>>>>>>>>>>>>> (gdb) info threads
>>>>>>>>>>>>>> (gdb) thread 1
>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>> (gdb) thread 2
>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> etc ….. may not be exact syntax, but its something like that.
>>>>>>>>>>>>>> With all the various end systems using clang and lldb, I’m kind
>>>>>>>>>>>>>> of schizophrenic on debugging right now.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Radium output:
>>>>>>>>>>>>> (gdb) info threads
>>>>>>>>>>>>> 3 Thread 0x7f001f752700 (LWP 57600)  0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>>>>>>>> 2 Thread 0x7f001ed51700 (LWP 57601)  0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>>>>>>>> * 1 Thread 0x7f001fc87700 (LWP 57599)  0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>>>>>>>> (gdb) thread 1
>>>>>>>>>>>>> [Switching to thread 1 (Thread 0x7f001fc87700 (LWP 57599))]#0 0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>> #0  0x0000003b532e15d3 in select () from /lib64/libc.so.6
>>>>>>>>>>>>> #1  0x00000000004669ee in ArgusReadStream (parser=0x7f001fb42010, queue=0x19511f0) at ./argus_client.c:738
>>>>>>>>>>>>> #2  0x000000000040746c in main (argc=3, argv=0x7fff4ae0a728) at ./argus_main.c:387
>>>>>>>>>>>>> (gdb) thread 2
>>>>>>>>>>>>> [Switching to thread 2 (Thread 0x7f001ed51700 (LWP 57601))]#0 0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>> #0  0x0000003b532acced in nanosleep () from /lib64/libc.so.6
>>>>>>>>>>>>> #1  0x0000003b532acb60 in sleep () from /lib64/libc.so.6
>>>>>>>>>>>>> #2  0x0000000000466455 in ArgusConnectRemotes (arg=0x1951190) at ./argus_client.c:579
>>>>>>>>>>>>> #3  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>>>>>>>>>>>>> #4  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>>>>>>>>>>>>> (gdb) thread 3
>>>>>>>>>>>>> [Switching to thread 3 (Thread 0x7f001f752700 (LWP 57600))]#0 0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>> #0  0x0000003b53a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>>>>>>>> #1  0x00000000004ac1be in ArgusOutputProcess (arg=0x1953310) at ./argus_output.c:897
>>>>>>>>>>>>> #2  0x0000003b53a079d1 in start_thread () from /lib64/libpthread.so.0
>>>>>>>>>>>>> #3  0x0000003b532e8b5d in clone () from /lib64/libc.so.6
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Connecting to the failing rastream process gave odd results:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> (gdb) detach
>>>>>>>>>>>>> Detaching from program: /usr/local/bin/radium, process 57599
>>>>>>>>>>>>> (gdb) attach 57605
>>>>>>>>>>>>> Attaching to program: /usr/local/bin/radium, process 57605
>>>>>>>>>>>>> Cannot access memory at address 0x706f636373007064
>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>> #0  0x0000003b53a0ef3d in ?? ()
>>>>>>>>>>>>> #1  0x0000000000000000 in ?? ()
>>>>>>>>>>>>> (gdb) info threads
>>>>>>>>>>>>> * 1 process 57605  0x0000003b53a0ef3d in ?? ()
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What should my next step be? Ensure the reliable connection setting is on? Run rastream under gdb?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks and cheers,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jesse
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Carter
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 7, 2014, at 4:28 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Over the weekend my rastream process stopped processing records for some reason. The current setup is:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> netflow records -> radium -> rastream -M time 5m
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I noticed that records were no longer being written to disk. I connected a new ra instance to radium, and had no problems receiving records. Attaching strace to the rastream process all I could see were calls:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0
>>>>>>>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0
>>>>>>>>>>>>>>> nanosleep({0, 50000000}, NULL)          = 0 
>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Is there any settings I can tweak or logs to check for or correct the issue? I vaguely recall something about persistent connections where if lost an attempt would be made to reconnect, but my gut says that’s not what’s happening here...
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jesse
>>>> 
>>> 
>> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 455 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <https://pairlist1.pair.net/pipermail/argus/attachments/20140708/771054f5/attachment.sig>