lost flows and memory leak in radium

Thu Jan 31 09:00:48 EST 2013

Hey Craig,
None of this makes any sense to me.  Why does it take time for radium to exit?
I can't imagine what would cause radium to be slow exiting.

What does it mean for clients to have to wait?  What are they waiting for?
Dozens of reboots?  Why are you rebooting?  I'm afraid that we're not getting
enough information, and something is terribly wrong, as this is all pretty weird.

Normally you get problems with radium when radium connects to itself, or
if you create a data loop, where a downstream radium connects to an upstream
radium.  Are you doing that?

Carter
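
The data-loop condition described above (a radium that ends up reading its own output, directly or through another radium) can be checked mechanically once you write down which process reads from which. A minimal sketch in Python, assuming a hand-maintained map of collectors to their sources; the host names and the `find_loop` helper are hypothetical and are not part of argus or radium:

```python
# Hypothetical topology map: each key reads flow records from the listed sources.
# The names are illustrative only; fill them in from your own radium.conf files.

def find_loop(sources):
    """Return a list of nodes forming a cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for src in sources.get(node, ()):
            s = state.get(src, WHITE)
            if s == GREY:
                # src is already on the current path, so we have a loop.
                return path[path.index(src):] + [src]
            if s == WHITE:
                loop = visit(src)
                if loop:
                    return loop
        path.pop()
        state[node] = BLACK
        return None

    for node in list(sources):
        if state.get(node, WHITE) == WHITE:
            loop = visit(node)
            if loop:
                return loop
    return None


if __name__ == "__main__":
    ok = {"radium-manager": ["argus-sensor"], "argus-sensor": []}
    looped = {"radium-a": ["radium-b"], "radium-b": ["radium-a"]}
    selfref = {"radium-manager": ["radium-manager"]}
    print(find_loop(ok))
    print(find_loop(looped))
    print(find_loop(selfref))
```

A non-None result names the chain of hosts that feed each other; a radium pointed at itself shows up as a two-element list with the same name twice.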

On Jan 29, 2013, at 4:01 PM, Craig Merchant <cmerchant at responsys.com> wrote:

> I’ve got a few more data points…  Unfortunately, the server takes a long time to gracefully shut down, so any test that causes radium to fail means everyone working on that box has to wait.
>  
> I rebooted the server with a radium.conf file that included support for labels and used our large label file.  Radium’s CPU ran around 15-25%.  That seems pretty normal from what I’ve seen.  I did notice that even though no ra clients have connected to it, its memory usage was slowly ticking up over time.
>  
> I commented out the label support and the RADIUM_ARGUS_SERVER line in radium.conf and rebooted.  When it came up, radium was running at less than 1% CPU.  Memory did not increase at all.  I was able to use the init script to start and stop the service without causing the CPU spike. 
>  
> I then re-enabled the RADIUM_ARGUS_SERVER line and restarted radium.  The CPU spiked to 185%.
>  
> I added label support back in and made the iana label file a single line:  91.189.181.0/24 blacklisted.  I rebooted.  Radium does not appear to have any memory leak.
>  
> I commented out label support and rebooted.  Radium has been running at 15% and the memory does not appear to be increasing, though I’ll have to keep an eye on it for another few hours.
>  
> So, it would appear to me that radium is having issues connecting to an argusd instance after radium has been restarted at least once.  It also seems that radium has a small memory leak when using a large label file, but not with a very small file.  I hope to be able to test the memory leak further to see if the size and composition of the label file affects it per your instructions in the previous post.
>  
> Is it safe to presume that radium requires a restart to pick up modifications to the iana label file?  I’ll probably have to find some off hours times to do those tests since they’ll probably involve dozens of reboots…
>  
> If you want to try replicating my environment and eliminate underlying OS/kernel issues, we’re using the redBorder IDS solution (http://www.redborder.net).  Their sensor and manager can be installed from their bootable ISO.  We run argus on the sensor and radium on the manager.  I’ve gotten both to run under VMware and VirtualBox.  I can send you our setup guide and any packages we used.
>  
> We’ll be adding argus to a second data center in the next week or so that has less traffic than the current data center.  So, that might give us a different volume of data to test against.
>  
> Thx.
>  
> Craig
>  
>  
>  
> From: argus-info-bounces+cmerchant=responsys.com at lists.andrew.cmu.edu [mailto:argus-info-bounces+cmerchant=responsys.com at lists.andrew.cmu.edu] On Behalf Of Craig Merchant
> Sent: Tuesday, January 29, 2013 11:11 AM
> To: Carter Bullard
> Cc: Argus (argus-info at lists.andrew.cmu.edu)
> Subject: Re: [ARGUS] lost flows and memory leak in radium
>  
> The first thing I did was get radium working without labels.  I was able to test various ra clients against it and they worked perfectly.  I then added label support to radium.conf and restarted.  That’s when I saw the high CPU utilization and saw the lost flows.  So, I assumed the problem was with the labels. 
>  
> I tried label files with a small number of networks, a small number of hosts, long labels, short labels, etc.  Each time, restarting radium after I’d changed the label file.
>  
> It wasn’t until I tried rebooting the entire machine that radium started behaving normally again.  I rebooted the server without any label support in radium.conf.  Radium came up working normally.  I restarted radium with the init script and experienced the high CPU utilization and flow loss.
>  
> I enabled label support in radium.conf and used my 1500+ line label file and rebooted again.  Radium came up appearing to run normally: CPU around 20% and 2% memory use.  But after twelve hours or so, the memory use was up to about 55%.  No ra clients connected to radium during that time.
>  
> So, from what I can tell, there are two issues and they don’t seem to be related to labels.  Restarting radium (with labels on or off) causes it to run at like 185% CPU and flows get dropped.  Radium will run normally after a machine reboot (with labels on or off), but there appears to be a slow memory leak.
>  
> Thx.
> 
> Craig 
>  
>  
> From: Carter Bullard [mailto:carter at qosient.com] 
> Sent: Tuesday, January 29, 2013 3:01 AM
> To: Craig Merchant
> Cc: Argus (argus-info at lists.andrew.cmu.edu)
> Subject: Re: [ARGUS] lost flows and memory leak in radium
>  
> Hey Craig,
> Hmmm, you've got too much going on to know what is happening.
> Your problem was that radium() had poor performance when providing labels.  Now it's when you restart radium().  Which problem are we working on ?
>  
> If you restart radium, all the clients will close their connections, and then reconnect.  Is that where the data loss occurs ?  Why are you restarting radium ?
>  
> Carter
> 
> 
> On Jan 29, 2013, at 1:08 AM, Craig Merchant <cmerchant at responsys.com> wrote:
> 
> Carter,
>  
> After much testing, it doesn’t appear that the problem is with the size or makeup of the iana label file.  Restarting radium is the problem.  Restarting the service, even with labeling commented out in the /etc/radium.conf file, causes the spike in CPU and the data loss for ra clients.
>  
> What kind of data can I provide that would be helpful to you?
> 
> Thx.
>  
> Craig
>  
>  
> From: Carter Bullard [mailto:carter at qosient.com] 
> Sent: Saturday, January 26, 2013 3:32 PM
> To: Craig Merchant
> Cc: Argus (argus-info at lists.andrew.cmu.edu)
> Subject: Re: [ARGUS] lost flows and memory leak in radium
>  
> Hey Craig,
> This is interesting, as we haven't had much in the way of pure radium performance
> reports with labeling.  The cycle requirements for labels will vary quite a bit depending
> on the strategy.  Address based labeling will perform the best, as we have a pretty fast
> patricia tree structure for address and label lookup.  The flow based labeling may be
> the worst performing, as we have to switch out the search contexts for each rule.
> And no telling how fast the GeoIP goes, but it's been the most used label to date, so
> I think they do a pretty good job.
>  
> Can you try a few sample label strategies, just to tease out where the loads are?
> Maybe start with a single rule in each label strategy, doing one strategy at a time,
> and then ramp them up with 2, 4, 8, etc... rules, until we get to your complexity.
>  
> A good sample would be a label rule that labels everything, with a small label,
> vs a rule that labels everything with a large label, so that we're accounting for the
> label sizes as an impact on performance.
>  
> That will help.  There are a lot of queues, a lot of buffering, a lot of things going on.
>  
> Can you share your radium.conf file?  and the ralabel.conf style Classifier file?
>  
> Carter
>  
> On Jan 26, 2013, at 2:45 PM, Craig Merchant <cmerchant at responsys.com> wrote:
> 
> 
> 
> I tried rebooting the server with the label options commented out in radium.conf.  When the server came up, radium was running at 11% CPU and there were no pauses or loss of flows when clients connected.  I added the labeling config back to radium.conf and restarted.  The CPU ran at over 190% and the flow loss and pauses returned.
>  
> I commented those lines back out again and restarted radium.  Radium ran at around 150% with flow loss and pauses.  I rebooted the server again and radium was back to normal.
>  
> From: argus-info-bounces+cmerchant=responsys.com at lists.andrew.cmu.edu [mailto:argus-info-bounces+cmerchant=responsys.com at lists.andrew.cmu.edu] On Behalf Of Craig Merchant
> Sent: Friday, January 25, 2013 4:44 PM
> To: Argus (argus-info at lists.andrew.cmu.edu)
> Subject: [ARGUS] lost flows and memory leak in radium
>  
> We’ve got one data center currently running argus on our IDS sensor (CentOS 6.2) and it listens on a DNA/libzero interface thanks to code from Chris Wakelin.  So, we do experience the bug in PF_RING where some select() call causes argusd to run at 100% CPU all the time.
>  
> We probably average between 4-8 Gbps of traffic.  A separate host is running radium and pulls the flows off of the sensor by connecting to tcp 561.  Top shows radium running at 190% CPU most of the time. 
>  
> If I connect any of the ra clients to radium (such as ra -S radium:561), flows will appear for 10-30 seconds and then pause for 30-60 seconds.  If I connect the ra clients directly to the remote argusd instance, they work fine.  We’ll be deploying argus in a second data center soon, so we’d really like to take advantage of radium’s ability to dedup flows.
>  
> Radium’s memory usage slowly climbed whether an ra client was connected or not.
>  
> I tried commenting out the two RADIUM_CLASSIFIER settings and restarted radium.  Our label file is something like 1500 lines long, so I thought that could be causing problems.  Radium uses about 30% less CPU and memory stays at 0.8%.  The intermittent pauses still happen though.
>  
> I then tried setting RADIUM_CLASSIFIER=no instead of commenting it out and the CPU went back up by 30% and the memory usage climbed steadily with no ra clients connected.  Does that not disable labeling in radium?
>  
> I’m not sure how to diagnose it any further.  My argus.conf and radium.conf are in the spreadsheet I sent you earlier.  Let me know what I can do to help diagnose this further.
>  
> Thanks.
> 
> Craig
>  
>  
>  
>  
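
The slow memory growth reported in this thread is easier to compare across label-file sizes if it is reduced to a single number. A minimal sketch, assuming a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line; `parse_vmrss_kb`, `growth_kb_per_hour`, and `sample_pid` are hypothetical helper names and are not part of the argus distribution:

```python
import re
import time


def parse_vmrss_kb(status_text):
    """Extract the resident set size in kB from the text of /proc/<pid>/status."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if m is None:
        raise ValueError("no VmRSS line found")
    return int(m.group(1))


def growth_kb_per_hour(samples):
    """Least-squares slope of (seconds, kB) samples, scaled to kB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den * 3600.0


def sample_pid(pid, count, interval_s):
    """Poll /proc/<pid>/status `count` times, `interval_s` seconds apart."""
    samples = []
    for i in range(count):
        with open(f"/proc/{pid}/status") as fh:
            samples.append((time.time(), parse_vmrss_kb(fh.read())))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

Running `growth_kb_per_hour(sample_pid(pid, 12, 300))` against the radium process once per configuration (no labels, one-line label file, full label file) would give one growth figure per case; a slope near zero suggests no leak for that configuration.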
