racluster() memory control Re: New To Argus

Nick Diel ndiel at engr.colostate.edu
Fri Feb 29 20:45:05 EST 2008


Carter,

Thanks for the information.  I have been playing around with the timeout 
period with great success, though what is the status entry for?  If this 
is documented somewhere, I apologize, but I couldn't find it.

I think the radark() method is quite clever, but in my situation I am 
not able to do that (yet).  I am capturing data at a transit provider 
and immediately anonymizing the data.  I don't have the visibility to know 
which subnets are dark, but I will investigate whether I can find one.  I 
do think this could be a powerful research tool for me.

For now after I merge status flows, I think I will create a filter to 
purge the port scans from some of my outputs.
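
Something along these lines is what I have in mind; this is just a sketch, 
and it assumes the "con" (connected) filter keyword works in an ra filter 
the way it did in the filters quoted below (the file names are placeholders):

   ra -r merged.argus -w noscan.argus - not (tcp and syn and not con)

That should drop TCP flows that tried to connect but never got a response, 
and keep everything else.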

Nick

Carter Bullard wrote:
> The default idle timeout is infinity.
>
> I think if you pre-process the stream with something like radark(),
> which provides you with the IP addresses of scanners, you can
> reject traffic involving those addresses, or simply filter out traffic
> going to IP addresses that don't exist.  Either way, you will do well.
>
> We have limits in the code; I just need to reduce the number so it
> doesn't kill the average machine.  We also have a means of passing the
> limit to the clients, in the .rarc file, so that should be easy
> to do.
>
> Carter
>
>
>
> On Feb 28, 2008, at 4:59 PM, Nick Diel wrote:
>
>> Carter,
>>
>> I am going to start playing around with idle=timeout.  If that 
>> parameter is not specified, is there a default value, or will all 
>> flows stay in the cache?  Either way, this parameter looks very 
>> promising for my use.
>>
>> Where we do most of our capturing we can see millions of port scans 
>> in a 12-hour trace, so that is an issue for us too when we do flow 
>> filtering.  I wonder if a separate timeout would be useful for flows 
>> that only have a syn, basically to purge port scans faster.  Or, in a 
>> memory-constrained model, these flows could be the first picked to be 
>> written out.
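>>
>> Something like this in a racluster.conf is what I am imagining; just a 
>> sketch, assuming the per-entry "idle=" option and the "con" keyword 
>> behave the way the examples quoted below suggest:
>>
>>    filter="tcp and syn and not con" idle=60
>>    filter="" model="saddr daddr proto sport dport" idle=3600
>>
>> That is, unanswered syn-only flows would be flushed after a minute, 
>> while everything else keeps a much longer timeout.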
>>
>> I also think flushing out "complete" tcp flows is a good idea.  Maybe a 
>> second timeout should be in place for these flows (shorter than the 
>> regular timeout, and potentially 0), so that if you wanted you could 
>> still capture anomalies such as duplicate fin-acks.  This second timer 
>> could also be used for flows that have a reset, since it is very common 
>> to see additional packets after a reset (packets still on the wire, the 
>> reset gets lost, etc.).
>>
>> Finally, I can see how a memory limit could be beneficial.  Yes, it 
>> creates the problem that results depend on the amount of memory 
>> available, but it allows processing that may not otherwise be possible 
>> (or at least not easily doable).  When I was producing a list of flows 
>> on port 25, I had to use very aggressive filters to handle memory 
>> issues, and I know I missed some flows anyway.  We ended up with 20 
>> million plus flows for our 12-hour capture.  I would have been willing 
>> to set a memory limit, knowing that possibly not all flows would be 
>> combined properly.  In my case at least, I would expect most flows 
>> written out early due to memory constraints to be port scans and 
>> complete flows that haven't reached their idle timeout yet.  So again, 
>> this would be a site-specific option.  Per your list:
>>
>>    1. filter input data
>>    2. change the flow model definition to reduce the number of flows,
>>    3. use the "idle=timeout" option on items in the racluster.conf file.
>>    4. use a memory limit for very large data sets, knowing it could 
>>       affect the actual output.
>>
>> Basically, a memory limit would be used when the other approaches are 
>> not working.  It just allows for processing that may not otherwise be 
>> easily possible.
>>
>> Nick
>>
>>
>> Carter Bullard wrote:
>>> Hey Nick,
>>> I can put memory limits into racluster(), but then there is the 
>>> possibility that you get different results based on the available 
>>> space.  I'm not sure that is the right way to go, but who knows, it 
>>> may be a great option.
>>>
>>> The trick to keeping racluster memory use down is to:
>>>    1. filter input data
>>>    2. change the flow model definition to reduce the number of flows,
>>>    3. use the "idle=timeout" option on items in the racluster.conf file.
>>>
>>> This all needs to be customized for each site, so working with the
>>> racluster.conf file is the way to go, and running different .conf files
>>> against one sample test data file allows you to fine-tune the
>>> configuration.
>>>
>>> Getting darknet traffic out of the mix is important.  For many sites
>>> "all the flows" are really scans, and 99.999% of the time they should
>>> be ignored.  I track whether something new responds to a scan, not that
>>> the scan exists; there is always a scan, and many of the sites that I
>>> pay attention to see literally 100,000's of scans a day.  As a result,
>>> we want to pay attention to the originator of the scan, the scan type,
>>> whether the addresses involved are real, whether it's coming from inside
>>> or outside, and whether there was a "new" response.  Sloughing scan
>>> traffic off to tools that do scan analysis, and tracking the other
>>> flows, makes this doable, and programs like radark()/rafilteraddr()
>>> help here (but they are just examples).
>>>
>>> For traffic you really want to track, modifying the flow model allows
>>> us to reduce the number of flow caches, say, by ignoring the source
>>> port for flows going to well known servers.  Lines like this:
>>>
>>>    filter="dst host 1.2.3.4 and src pkts eq 1 and dst pkts eq 1 and 
>>> dst port 53" model="saddr/16 daddr proto dport"
>>>
>>> will reduce the number of in memory caches for this DNS server to
>>> just the number of class B networks hitting the server.  The filter
>>> only applies to successful connections, and the resulting aggregation
>>> record will contain stats like the number of connections and avgdur.
>>> (I'm assuming here that you will run racluster() against a file).
>>>
>>> You can have literally 1000's of these filters to generate the data
>>> you really want.
>>>
>>> When a flow hits its idle time, racluster will write out the record (if
>>> needed) and deallocate the memory used for that flow.  So having
>>> aggressive idle times is very helpful.
>>>
>>> But now that I'm thinking about it, we don't really have a way of
>>> flushing records based on the flow state for flows that are being merged
>>> together.  What I mean by that is that if a TCP connection has 12 status
>>> records, and we finally get the last one that "closes" the connection,
>>> there isn't a way, currently, for us to check whether that flow should
>>> be "flushed".
>>>
>>> Possibly we should take the resulting merged record, and run it back
>>> through the filters to see if there is something we need to do with it.
>>>
>>> Well, anyway, keep sending email if any of this is useful.
>>>
>>> Carter
>>>
>>>
>>>
>>> On Feb 28, 2008, at 1:25 PM, Nick Diel wrote:
>>>
>>>> Carter,
>>>>
>>>> Thanks for all of your input.  Also thanks for the updated Argus.
>>>>
>>>> After reading what you said, I can understand why Argus was 
>>>> designed the way it was.  I was just initially evaluating Argus 
>>>> with some very simple and discrete examples.  Looking at some of 
>>>> the source code also helped me wrap my head around Argus.
>>>>
>>>> On to the memory issue.  The system I am using has 2GB in it, and 
>>>> racluster wants to use all of it.  Once racluster's memory use climbs 
>>>> past about 1.7GB, heavy swapping occurs and racluster's CPU usage drops 
>>>> below 25%.  That is why I was thinking out loud about giving racluster 
>>>> a memory limit from the command line.  This way the system could avoid 
>>>> the heavy swapping and racluster would just write out the oldest 
>>>> records before moving on.
>>>>
>>>> Again thanks for putting up with me as I start to understand Argus.
>>>>
>>>> Nick
>>>>
>>>> Carter Bullard wrote:
>>>>> Hey Nick,
>>>>> The problem with packet capture is primarily disk performance.
>>>>> Argus can go as fast as you can collect packets, assuming that
>>>>> you're using Endace cards, and although argus does great in the
>>>>> presence of packet loss, it generates its best results when it gets
>>>>> everything.
>>>>>
>>>>> The best architecture is to run argus on the packet capture box,
>>>>> and to blow argus records to another machine that does the disk
>>>>> operations to store records.  This division of labor works best for
>>>>> the 10Gbps capture facilities.
>>>>>
>>>>> We sort input file names in the ra* programs, so doing this for
>>>>> argus is a cut, copy, paste job.  No problem, I'll put it in this 
>>>>> week.
>>>>>
>>>>> Argus can read from stdin.
>>>>>
>>>>> There are many incantations that can decrease the memory
>>>>> demands of an argus client.  I just really need to know what it is
>>>>> that you want to do.
>>>>>
>>>>> OK, to your question.  
>>>>>>
>>>>>> Now let me ask about what I have been working on (merging flows 
>>>>>> across argus data files).  First, if I was capturing with Argus 
>>>>>> (not reading pcap files, capturing off the wire: argus | rasplit) 
>>>>>> wouldn't I run into the same problem of having flows broken up 
>>>>>> across different argus files?
>>>>>>
>>>>>> If racluster is merging records as it finds them (not reading all 
>>>>>> records into memory first), it seems it might be nice to specify 
>>>>>> a  memory limit for racluster at command line.  Then as racluster 
>>>>>> approaches the memory limit it could remove the oldest records 
>>>>>> from memory and print them to the output.
>>>>>
>>>>> Multiple argus status records spanning files.  Well, yes, that is the
>>>>> actual design goal.  When you think about most types of
>>>>> operations/security/performance analysis, you want to see flow data
>>>>> scoped over some time boundary.  Regardless of what that boundary is,
>>>>> whether it's the last millisecond, second, minute, or hour, you will
>>>>> have flows that span that boundary.  A lot of flows are persistent,
>>>>> so you really can't have a file big enough to hold complete flows.
>>>>>
>>>>> But you don't seem to be too interested in really granular data, so
>>>>> you should modify the ARGUS_FAR_STATUS_INTERVAL value to be something
>>>>> larger than your file duration.  That way argus generates only one
>>>>> record per flow per file.  You use ra() to split files that are
>>>>> complete from those that may continue into the next file, using the
>>>>> "-E" option, and then, after you're done with all the files you have,
>>>>> run racluster() against these continuation files.
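>>>>>
>>>>> In argus.conf terms that would be roughly the following (a sketch; I
>>>>> believe the value is in seconds, and it mirrors the "-S 5m" used on
>>>>> the command line below):
>>>>>
>>>>>    ARGUS_FAR_STATUS_INTERVAL=300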
>>>>>
>>>>>    for i in *pcap; do argus -S 5m -r $i -w $i.argus; done
>>>>>    for i in *argus; do ra -r $i -E $i.cont -w argus.out - tcp and ((syn or synack) and (fin or finack or reset)); done
>>>>>    racluster -r *.cont -w argus.out
>>>>>
>>>>> They won't be sorted, but thats easy to do with an additional step:
>>>>>    rasplit -M nomodify -r argus.out -M time 5m -w data/argus.%Y.%m.%d.%H.%M.%S
>>>>>    rm argus.out
>>>>>    rasort -R data -M replace
>>>>>    ra -R data -w argus.out
>>>>>    rm -rf data
>>>>>
>>>>> Or at least something like that should work.  The "-M nomodify" is
>>>>> critical, as rasplit() will break records up on time boundaries if you
>>>>> don't specify this option, which puts you back in trouble if you're
>>>>> really trying to keep the flows together.
>>>>>
>>>>> Argus clients aren't supposed to consume more than, what, 1GB of
>>>>> memory, so there are limits in the code.  Do you have a smaller
>>>>> machine than that?
>>>>>
>>>>>
>>>>> Carter
>>>>>
>>>>>
>>>>> On Feb 25, 2008, at 2:01 PM, Nick Diel wrote:
>>>>>
>>>>>> Carter,
>>>>>>
>>>>>> First of all thanks for your detailed response and updated 
>>>>>> clients.  And I am glad you like twists.
>>>>>>
>>>>>> Let me tell you a little bit more about the research setup.  The 
>>>>>> research project I am part of (made up of several universities in 
>>>>>> the US) has several collection boxes in different large 
>>>>>> commercial environments.  The boxes were customized specifically 
>>>>>> for high speed packet capturing (RAID, Endace capture card, 
>>>>>> etc.).  We will run a 12-hour capture and then analyze the capture 
>>>>>> for some time, sometimes up to several months.  So I do have time to 
>>>>>> correctly create my argus output files and do any other processing I 
>>>>>> need to do.
>>>>>>
>>>>>> Some of the researchers focus on packet-based research, whereas 
>>>>>> other parts of the group focus more on flow-based analysis.  So 
>>>>>> Argus looks like a great match for us.  Immediately after the 
>>>>>> capture, we can create Argus flow records and do our flow 
>>>>>> analysis with the Argus clients.
>>>>>>
>>>>>> So for my first question: is Argus capable of capturing at high line 
>>>>>> speeds (at least 1Gbit) where a packet capture using libpcap and a 
>>>>>> standard NIC may fail (libpcap dropping packets)?  Or, since Argus is 
>>>>>> flow based, does it not care if it misses packets?  Some of the 
>>>>>> anomalies we research require us to account for almost every packet 
>>>>>> in the anomaly, so dropping every 100th or even every 1000th packet 
>>>>>> could hamper us.  The reason I ask about Argus and high-speed 
>>>>>> captures is that if it is very capable at high speeds, it would allow 
>>>>>> us to deploy more collection boxes (these boxes would then primarily 
>>>>>> be used by the flow-based researchers).  We wouldn't have to buy an 
>>>>>> expensive capture card for each collection box.
>>>>>>
>>>>>> As for reading multiple files into Argus, one easy way to accomplish 
>>>>>> this would be to have Argus read pcap files from stdin.  Then one 
>>>>>> could use a utility such as mergecap or tcpslice to feed Argus a list 
>>>>>> of out-of-order files: mergecap -r /packets/*.pcap -w - | argus -r - ....
>>>>>>
>>>>>> My files are named so that chronological order equals lexical order, 
>>>>>> so argus -r * would work in my case (this helps us with a number of 
>>>>>> utilities we use).  I do understand that actually implementing this 
>>>>>> in Argus would probably require a number of things, such as dying 
>>>>>> when files are out of order and telling the user what order argus is 
>>>>>> reading the files in.  Still, doing this would be quite a bit faster 
>>>>>> than having tcpslice or mergecap feed Argus the pcap files.
>>>>>>
>>>>>> Now let me ask about what I have been working on (merging flows 
>>>>>> across argus data files).  First, if I was capturing with Argus 
>>>>>> (not reading pcap files, capturing off the wire: argus | rasplit) 
>>>>>> wouldn't I run into the same problem of having flows broken up 
>>>>>> across different argus files?
>>>>>>
>>>>>> If racluster is merging records as it finds them (not reading all 
>>>>>> records into memory first), it seems it might be nice to specify 
>>>>>> a  memory limit for racluster at command line.  Then as racluster 
>>>>>> approaches the memory limit it could remove the oldest records 
>>>>>> from memory and print them to the output.
>>>>>>
>>>>>> I was able to use your suggestion successfully to merge most of my 
>>>>>> flows together, though I needed to make a few modifications to the 
>>>>>> filter.  I moved parentheses: "tcp and ((syn or synack) and ((fin or 
>>>>>> finack) or reset))" vs. "tcp and (((syn or synack) and (fin or 
>>>>>> finack)) or reset)".  And I added "not con" to filter out the many, 
>>>>>> many scans, though this also means that syn-synack flows, which exist 
>>>>>> at the end of the argus output files, do not get merged.  This filter 
>>>>>> still caused most of the memory to be used, but not a whole lot of 
>>>>>> time was spent in the upper range where swapping was slowing the 
>>>>>> system to a crawl.  Without "not con" I would reach the upper limits 
>>>>>> of memory usage quite fast and go into a crawl with the swapping.
>>>>>>
>>>>>> Thanks again for all your help,
>>>>>> Nick
>>>>>>
>>>>>>
>>>>>> Carter Bullard wrote:
>>>>>>> Hey Nick,
>>>>>>> The argus project from the very beginning has been trying
>>>>>>> to get people away from capturing packets, and instead
>>>>>>> capturing comprehensive flow records that account for every
>>>>>>> packet on the wire.  This is because capturing packets at modern
>>>>>>> speeds seems impractical, and there are a lot of problems that can
>>>>>>> be worked out without all that data.
>>>>>>>
>>>>>>> So to use argus in the way you want to use argus is a bit of a
>>>>>>> twist on the model.  But I like twists ;o)
>>>>>>>
>>>>>>> >>> To start out with something simple I want to be able to 
>>>>>>> count the number of flows over TCP port 25.
>>>>>>>
>>>>>>> The easiest way to do that right now is to do something like 
>>>>>>> this in bash:
>>>>>>>
>>>>>>>    % for i in pcap*; do argus -r $i -w - - tcp and port 25 | \
>>>>>>>         rasplit -M time 5m -w argus.data/%Y/%m/%d/argus.%Y.%m.%d.%H.%M.%S ; \
>>>>>>>         done
>>>>>>>
>>>>>>> That will put the tcp:25  "micro flow" argus records into a 
>>>>>>> manageable
>>>>>>> set of files.  Now the files themselves need to be processed to
>>>>>>> get the flows merged together:
>>>>>>>
>>>>>>>    % racluster -M replace -R argus.data
>>>>>>>
>>>>>>> So now you'll get the data needed to ask questions, split into 5m
>>>>>>> bins, so to speak.  Changing the "5m" to "1h", "4h", or "1d" may
>>>>>>> generate file structures that you can work with, but eventually you
>>>>>>> will hit a memory wall without doing something clever.
>>>>>>>
>>>>>>> Now that you have these intermediate files, in order to merge the
>>>>>>> tcp flows that span multiple files, you will need to give 
>>>>>>> racluster()
>>>>>>> a different aggregation strategy than the default.  Try a
>>>>>>> racluster.conf file that contains these lines against the argus 
>>>>>>> files
>>>>>>> you have.
>>>>>>>
>>>>>>> ------- start racluster.conf ---------
>>>>>>>
>>>>>>> filter="tcp and ((syn or synack) and ((fin or finack) or 
>>>>>>> reset))"  status=-1 idle=0
>>>>>>> filter="" model="saddr daddr proto sport dport"
>>>>>>>
>>>>>>> ------- end racluster.conf --------
>>>>>>>
>>>>>>> What this will do is:
>>>>>>>    1. any tcp connection that is complete, where we saw the beginning
>>>>>>>       and the end, just pass it through, don't track anything.
>>>>>>>    2. any partial tcp connection, track and merge records that match.
>>>>>>>
>>>>>>> So it only allocates memory for flows that are 'continuation' 
>>>>>>> records.
>>>>>>> The output is unsorted, so you will need to run rasort() if you 
>>>>>>> want
>>>>>>> to do any time oriented operations on the output.
>>>>>>>
>>>>>>> In testing this, I found a problem with parsing "-1" from the 
>>>>>>> status
>>>>>>> field in some weird conditions, so I fixed it.  Grab the newest
>>>>>>> clients from the dev directory if you want to try this method.
>>>>>>>
>>>>>>> ftp://qosient.com/dev/argus-3.0/argus-clients-3.0.0.rc.69.tar.gz
>>>>>>>
>>>>>>> Give that a try, and send email to the list with any kind of result
>>>>>>> you get.
>>>>>>>
>>>>>>> With so many pcap files, we probably need to make some other
>>>>>>> changes.
>>>>>>>
>>>>>>> The easiest way for you to do what you eventually want to do
>>>>>>> would be for you to say something like this:
>>>>>>>    argus -r * -w - | rawhatever
>>>>>>>
>>>>>>> This currently won't work, and there is a reason, but maybe we can
>>>>>>> change it.  Argus can currently read multiple input files, but you
>>>>>>> need to specify each file using a "-r filename -r filename" style
>>>>>>> command line.  With 1000's of files, that is somewhat impractical.
>>>>>>> It is this way on purpose, because argus really does need to see
>>>>>>> packets in time order.
>>>>>>>
>>>>>>> If you try to do something like this:
>>>>>>>
>>>>>>>    argus -r * -w - | rasplit -M time 5m -w argus.out.%Y.%m.%d.%H.%M.%S
>>>>>>>
>>>>>>> which is designed to generate argus record files that represent
>>>>>>> packet behavior with hard cutoffs every 5 minutes, on the hour; if
>>>>>>> the packet files are not read in time order, you get really weird
>>>>>>> results.  It's as if the realtime argus were jumping into the future,
>>>>>>> then into the past, and then back to the future again.
>>>>>>>
>>>>>>> Now, if you name your pcap files so they can be sorted, I can
>>>>>>> make it so "argus -r *" can work.  How do you name your pcap files?
>>>>>>>
>>>>>>>
>>>>>>> Because argus has the same timestamps as the packets in your
>>>>>>> pcap files, the timestamps can be used as an "external key" if
>>>>>>> you will.  If you build a database that has tuples (entries) like:
>>>>>>>
>>>>>>>    "pcap_filename start_time end_time"
>>>>>>>
>>>>>>> then by looking at a single argus record, which has a start time
>>>>>>> and an end time, you can  find the pcap files that contain its 
>>>>>>> packets.
>>>>>>> And with something like perl and tcpdump or wireshark, you can drive
>>>>>>> a simple shell script that looks in those pcap files for packets
>>>>>>> matching this type of filter:
>>>>>>>
>>>>>>>    (ether host $smac and $dmac) and (host $saddr and $daddr) and ports ($sport and $dport)
>>>>>>>
>>>>>>> and you get all the packets that are referenced in the record.
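>>>>>>>
>>>>>>> A rough sketch of that loop, assuming you have already used the index
>>>>>>> to select the candidate pcap files into a candidates.txt, and filled
>>>>>>> in the shell variables from the argus record of interest (names here
>>>>>>> are placeholders):
>>>>>>>
>>>>>>>    for f in $(cat candidates.txt); do
>>>>>>>       tcpdump -r $f -w matched.$(basename $f) \
>>>>>>>          "ether host $smac and ether host $dmac and host $saddr and host $daddr and port $sport and port $dport"
>>>>>>>    done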
>>>>>>>
>>>>>>>
>>>>>>> Carter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Feb 21, 2008, at 4:49 PM, Nick Diel wrote:
>>>>>>>
>>>>>>>> I am new to Argus, but have found it has great potential for 
>>>>>>>> the research project I work on.  We collect pcap files from 
>>>>>>>> several high traffic networks (20k-100k packets/second).  We 
>>>>>>>> collect for approximately 12 hours and have ~1000 pcap files 
>>>>>>>> that are roughly 500MB each.
>>>>>>>> I want to do a number of different flow analyses and think Argus 
>>>>>>>> might be perfect for me.  I am having a hard time grasping some of 
>>>>>>>> the fundamentals of Argus, but I think once I get some of the basics 
>>>>>>>> I will be able to really start to use Argus.
>>>>>>>>
>>>>>>>> To start out with something simple, I want to be able to count the 
>>>>>>>> number of flows over TCP port 25.  I know I need to use racluster 
>>>>>>>> to merge the Argus output (I have one argus file for each pcap 
>>>>>>>> file), so that I can combine identical flow records into one.  I 
>>>>>>>> can do this fine on one argus output file, but I know many flows 
>>>>>>>> span the numerous files I have.  I also know I can't load all the 
>>>>>>>> files at once into racluster, as it fills all available memory.  So 
>>>>>>>> my question is how I can accomplish this while making sure I 
>>>>>>>> capture most flows that span multiple files.
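>>>>>>>>
>>>>>>>> (Once the records are merged, I am guessing something like "racount 
>>>>>>>> -r merged.argus - tcp and port 25" would give me the count, assuming 
>>>>>>>> racount is the right client for tallying records; the file name is 
>>>>>>>> just a placeholder.)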
>>>>>>>>
>>>>>>>> Once I understand this, I hope to be able to do things like create 
>>>>>>>> a list of flow sizes (in bytes) for port 25.  Basically, I will be 
>>>>>>>> asking a lot of questions involving all flows that match a certain 
>>>>>>>> filter, and I am not sure how to account for flows spanning 
>>>>>>>> multiple files.
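>>>>>>>>
>>>>>>>> (For the byte sizes, I am imagining something like "ra -r 
>>>>>>>> merged.argus -s saddr daddr dport bytes - tcp and port 25", assuming 
>>>>>>>> the "-s" option selects the printed fields the way I think it does.)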
>>>>>>>>
>>>>>>>> A separate question.  I don't think Argus has this ability, but 
>>>>>>>> I wanted to know if the community already had a utility for 
>>>>>>>> this.  I am looking into creating a DB of some sort that would 
>>>>>>>> match Argus's flow IDs to pcap file name(s) and packet 
>>>>>>>> numbers.  This way one could extract the packets for a flow 
>>>>>>>> that needed further investigation.
>>>>>>>>
>>>>>>>> And finally, thanks for the great tool.  It does a number of 
>>>>>>>> things I have been doing manually for a while.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
