racluster() memory control Re: New To Argus
Nick Diel
ndiel at engr.colostate.edu
Fri Feb 29 20:45:05 EST 2008
Carter,
Thanks for the information. I have been playing around with the timeout
period with great success, though what is the status entry for? If this
is documented somewhere, I apologize, but I couldn't find it.
I think the radark() method is quite clever, but in my situation I am
not able to do that (yet). I am capturing data at a transit provider
and immediately anonymizing the data. I don't have a way to know which
subnets are dark, but I will investigate whether I can find one. I do
think this could be a powerful research tool for me.
For now, after I merge status flows, I think I will create a filter to
purge the port scans from some of my outputs.
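(Probably something as simple as the "con" filter mentioned below, e.g.
  ra -r merged.argus -w no-scans.argus - con
with my own file names in place of those; I haven't tested it yet.)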
Nick
Carter Bullard wrote:
> The default idle timeout is infinity.
>
> I think if you pre-process the stream with something like radark(),
> which provides you with the IP addresses of scanners, and either
> reject traffic involving those addresses or filter out traffic going
> to IP addresses that don't exist, you will do well.
>
> We have limits in the code; I just need to reduce the number so it
> doesn't kill the average machine. We have a means of passing the
> limit to the clients as well, in the .rarc file, so that should be easy
> to do.
>
> Carter
>
>
>
> On Feb 28, 2008, at 4:59 PM, Nick Diel wrote:
>
>> Carter,
>>
>> I am going to start playing around with idle=timeout. If that
>> parameter is not specified, is there a default value used, or will all
>> flows stay in cache? This parameter looks very promising for
>> my use.
>>
>> Where we do most of our capturing we can see millions of port scans
>> in a 12 hour trace, so that is an issue for us too when we do flow
>> filtering. I wonder if a separate timeout would be useful for flows
>> that just have a syn, basically to purge out port scans
>> faster. Or maybe in a memory-constrained model these flows would be
>> picked first to be written out.
>>
>> I also think flushing out "complete" tcp flows is a good idea.
>> Maybe a second timeout should be in place for these flows (it would
>> be shorter than the regular timeout and potentially 0); that way, if
>> you wanted, you could capture anomalies such as duplicate fin acks.
>> This second timer could also be used for flows that have a reset,
>> since it is very common to see additional packets after a reset
>> (packets still on the wire, reset gets lost, etc.).
>>
>> Finally, I can see how a memory limit could be beneficial. While it
>> does create a problem where results are going to be influenced by the
>> amount of memory available, it will allow for processing that may not
>> otherwise be possible (or at least not easily doable). When I was
>> producing a list of flows on port 25, I had to use very aggressive
>> filters to handle memory issues, and I know I missed some flows
>> anyway. We ended up with 20 million plus flows for our 12 hour
>> capture. I would have been willing to set a memory limit, knowing
>> that possibly not all flows would have been combined properly.
>> In my case at least, I would expect most flows written out early due
>> to memory constraints would have been port scans and complete flows
>> that haven't reached their idle timeout yet. So again this would be a
>> site-specific option. Per your list:
>>
>> 1. filter input data
>> 2. change the flow model definition to reduce the number of flows,
>> 3. use the "idle=timeout" option on items in the racluster.conf file.
>> 4. use a memory limit for very large data sets, knowing that it
>> could affect the actual output.
>>
>> Basically a memory limit would be used when the other options are not
>> enough. It just allows for processing that may not otherwise have
>> been easily possible.
>>
>> Nick
>>
>>
>> Carter Bullard wrote:
>>> Hey Nick,
>>> I can put memory limits into racluster(), but then there is the
>>> possibility that you get different results based on the available
>>> space. I'm not sure that is the right way to go, but who knows, it
>>> may be a great option.
>>>
>>> The trick to keeping racluster memory use down is to:
>>> 1. filter input data
>>> 2. change the flow model definition to reduce the number of flows,
>>> 3. use the "idle=timeout" option on items in the racluster.conf
>>> file.
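>>>
>>> For example, a single racluster.conf entry can combine 2 and 3
>>> (just a sketch; the port and the timeout value are arbitrary):
>>>
>>>    filter="tcp and dst port 80" model="saddr daddr proto dport" idle=300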
>>>
>>> This all needs to be customized for each site, so working with the
>>> racluster.conf file is the way to go, and running different .conf files
>>> against one sample test data file allows you to fine-tune the
>>> configuration.
>>>
>>> Getting darknet traffic out of the mix is important. For many sites
>>> "all the flows" are really scans, and should be ignored, 99.999%
>>> of the time. I track whether something new responds to a scan, not
>>> that the scan exists, because there is always a scan; many of the
>>> sites that I pay attention to have literally 100,000's of scans a day.
>>> As a result, we want to pay attention to the originator of the scan,
>>> the scan type, whether the addresses involved are real, whether it's
>>> coming from inside or outside, and whether there was a "new" response.
>>> Sloughing scan traffic off to tools that do scan analysis, and
>>> tracking the other flows, makes this a doable thing, and programs like
>>> radark()/rafilteraddr() help here (but they are just examples).
>>>
>>> For traffic you really want to track, modifying the flow model allows
>>> us to reduce the number of flow caches, say, by ignoring the source
>>> port for flows going to well known servers. Lines like this:
>>>
>>> filter="dst host 1.2.3.4 and src pkts eq 1 and dst pkts eq 1 and
>>> dst port 53" model="saddr/16 daddr proto dport"
>>>
>>> will reduce the number of in memory caches for this DNS server to
>>> just the number of class B networks hitting the server. The filter
>>> only applies to successful connections, and the resulting aggregation
>>> record will contain stats like the number of connections and avgdur.
>>> (I'm assuming here that you will run racluster() against a file).
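>>>
>>> To run it, you would say something like this (with your own file
>>> names, of course):
>>>
>>>    racluster -f racluster.conf -r argus.file -w dns.out
>>>
>>> or at least something close to that.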
>>>
>>> You can have literally 1000's of these filters to generate the data
>>> you really want.
>>>
>>> When a flow hits its idle time, racluster will write out the record (if
>>> needed) and deallocate the memory used for that flow. So having
>>> aggressive idle times is very helpful.
>>>
>>> But now that I'm thinking about it, we don't really have a way of
>>> flushing records based on the flow state for flows that are being
>>> merged together. What I mean by that is that if a TCP connection has
>>> 12 status records, and we finally get the last one that "closes" the
>>> connection, there isn't a way, currently, for us to check to see if
>>> that flow should be "flushed".
>>>
>>> Possibly we should take the resulting merged record, and run it back
>>> through the filters to see if there is something we need to do with it.
>>>
>>> Well, anyway, keep sending email if any of this is useful.
>>>
>>> Carter
>>>
>>>
>>>
>>> On Feb 28, 2008, at 1:25 PM, Nick Diel wrote:
>>>
>>>> Carter,
>>>>
>>>> Thanks for all of your input. Also thanks for the updated Argus.
>>>>
>>>> After reading what you said, I can understand why Argus was
>>>> designed the way it was. I was just initially evaluating Argus
>>>> with some very simple and discrete examples. Looking at some of
>>>> the source code also helped me wrap my head around Argus.
>>>>
>>>> On to the memory issue. The system I am using has 2GB in it and
>>>> racluster wants to use all of it. When racluster's usage passes
>>>> about 1.7GB, heavy swapping occurs and racluster's CPU usage drops
>>>> below 25%. That is why I was thinking out loud about
>>>> potentially giving racluster a memory limit from the command line.
>>>> That way the system could avoid the heavy swapping and just have
>>>> racluster write out the oldest records before moving on.
>>>>
>>>> Again thanks for putting up with me as I start to understand Argus.
>>>>
>>>> Nick
>>>>
>>>> Carter Bullard wrote:
>>>>> Hey Nick,
>>>>> The problem with packet capture is primarily the disk performance.
>>>>> Argus can go as fast as you can collect packets, assuming that
>>>>> you're using Endace cards, and although argus does great in the
>>>>> presence of packet loss, it generates its best results when it gets
>>>>> everything.
>>>>>
>>>>> The best architecture is to run argus on the packet capture box,
>>>>> and to blow argus records to another machine that does the disk
>>>>> operations to store records. This division of labor works best for
>>>>> the 10Gbps capture facilities.
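>>>>>
>>>>> Something like this, roughly (the port number and interface name
>>>>> are just examples):
>>>>>
>>>>>    capture box:  argus -i dag0 -P 561
>>>>>    storage box:  rasplit -S capturebox:561 -M time 5m -w /data/argus.%Y.%m.%d.%H.%M.%S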
>>>>>
>>>>> We sort input file names in the ra* programs, so doing this for
>>>>> argus is a cut, copy, paste job. No problem, I'll put it in this
>>>>> week.
>>>>>
>>>>> Argus can read from stdin.
>>>>>
>>>>> There are many incantations that work to decrease the memory
>>>>> demands of an argus client. Just really need to know what it is
>>>>> that you want to do.
>>>>>
>>>>> OK, to your question.
>>>>>>
>>>>>> Now let me ask about what I have been working on (merging flows
>>>>>> across argus data files). First, if I was capturing with Argus
>>>>>> (not reading pcap files, capturing off the wire: argus | rasplit)
>>>>>> wouldn't I run into the same problem of having flows broken up
>>>>>> across different argus files?
>>>>>>
>>>>>> If racluster is merging records as it finds them (not reading all
>>>>>> records into memory first), it seems it might be nice to specify
>>>>>> a memory limit for racluster at command line. Then as racluster
>>>>>> approaches the memory limit it could remove the oldest records
>>>>>> from memory and print them to the output.
>>>>>
>>>>> Multiple argus status records spanning files. Well, yes, that is
>>>>> the actual design goal. When you think about most types of
>>>>> operations/security/performance analysis, you want to see flow data
>>>>> scoped over some time boundary. Regardless of what that boundary is,
>>>>> whether it's the last millisecond, second, minute or hour, you will
>>>>> have flows that span that boundary. There are a lot of flows that
>>>>> are persistent, so you can't have a file big enough to hold complete
>>>>> flows, really.
>>>>>
>>>>> But you don't seem to be too interested in really granular data,
>>>>> so you should modify the ARGUS_FAR_STATUS_INTERVAL value to be
>>>>> something larger than your file duration. That way argus generates
>>>>> only one record per flow per file. You use ra() to split the flows
>>>>> that are complete from those that may continue into the next file,
>>>>> using the "-E" option, and after you're done with all the files you
>>>>> have, you run racluster() against these continuation files.
>>>>>
>>>>> for i in *pcap; do argus -S 5m -r $i -w $i.argus; done
>>>>> for i in *argus; do ra -r $i -E $i.cont -w argus.out - \
>>>>>     'tcp and ((syn or synack) and (fin or finack or reset))'; done
>>>>> racluster -r *.cont -w argus.out
>>>>>
>>>>> They won't be sorted, but that's easy to do with an additional step:
>>>>> rasplit -M nomodify -r argus.out -M time 5m -w \
>>>>>     data/argus.%Y.%m.%d.%H.%M.%S
>>>>> rm argus.out
>>>>> rasort -R data -M replace
>>>>> ra -R data -w argus.out
>>>>> rm -rf data
>>>>>
>>>>> Or at least something like that should work. The "-M nomodify" is
>>>>> critical, as rasplit() will break records up on time boundaries if
>>>>> you don't specify this option, which puts you back in trouble if
>>>>> you're really trying to keep the flows together.
>>>>>
>>>>> Argus clients aren't supposed to consume more than, what, 1GB of
>>>>> memory, so there are limits in the code. Do you have a smaller
>>>>> machine than that?
>>>>>
>>>>>
>>>>> Carter
>>>>>
>>>>>
>>>>> On Feb 25, 2008, at 2:01 PM, Nick Diel wrote:
>>>>>
>>>>>> Carter,
>>>>>>
>>>>>> First of all thanks for your detailed response and updated
>>>>>> clients. And I am glad you like twists.
>>>>>>
>>>>>> Let me tell you a little bit more about the research setup. The
>>>>>> research project I am part of (made up of several universities in
>>>>>> the US) has several collection boxes in different large
>>>>>> commercial environments. The boxes were customized specifically
>>>>>> for high speed packet capturing (RAID, Endace capture card,
>>>>>> etc.). We will run a 12 hour capture and then analyze the
>>>>>> capture for some time. Sometimes up to several months. So I do
>>>>>> have time to correctly create my argus output files and do any
>>>>>> other processing I need to do.
>>>>>>
>>>>>> Some of the researchers focus on packet-based research, whereas
>>>>>> other parts of the group focus more on flow-based analysis. So
>>>>>> Argus looks like a great match for us. Immediately after the
>>>>>> capture, we can create Argus flow records and do our flow
>>>>>> analysis with Argus clients.
>>>>>>
>>>>>> So for my first question, is Argus capable of capturing at high
>>>>>> line speeds (at least 1Gbit) where doing a packet capture using
>>>>>> libpcap and a standard NIC may fail (libpcap dropping packets)?
>>>>>> Or, since Argus is flow-based, does it not care if it misses
>>>>>> packets? Some of the anomalies we research require us to account
>>>>>> for almost every packet in the anomaly, so dropping, say, every
>>>>>> 100th or even every 1000th packet could hamper us. The reason I
>>>>>> ask about Argus high-speed captures is that if it is very capable
>>>>>> at high speeds, it would allow us to deploy more collection boxes
>>>>>> (these boxes would then primarily be used by the flow-based
>>>>>> researchers). We wouldn't have to buy an expensive capture card
>>>>>> for each collection box.
>>>>>>
>>>>>> As for reading multiple files into Argus, one easy way to
>>>>>> accomplish this would be to have Argus read pcap files from
>>>>>> stdin. Then one could use a utility such as mergecap or tcpslice
>>>>>> to feed Argus a set of out-of-order files, e.g.:
>>>>>> mergecap -w - /packets/*.pcap | argus -r - ....
>>>>>>
>>>>>> My files are named so chronological order equals lexical order, so
>>>>>> argus -r * would work in my case (this helps us with a number of
>>>>>> utilities we use). I do understand that actually implementing this
>>>>>> in Argus would probably require a number of things, such as dying
>>>>>> when files are out of order and telling the user what order
>>>>>> argus was reading the files. Though doing this would be quite a
>>>>>> bit faster than having tcpslice or mergecap feed Argus the pcap
>>>>>> files.
>>>>>>
>>>>>> Now let me ask about what I have been working on (merging flows
>>>>>> across argus data files). First, if I was capturing with Argus
>>>>>> (not reading pcap files, capturing off the wire: argus | rasplit)
>>>>>> wouldn't I run into the same problem of having flows broken up
>>>>>> across different argus files?
>>>>>>
>>>>>> If racluster is merging records as it finds them (not reading all
>>>>>> records into memory first), it seems it might be nice to specify
>>>>>> a memory limit for racluster at command line. Then as racluster
>>>>>> approaches the memory limit it could remove the oldest records
>>>>>> from memory and print them to the output.
>>>>>>
>>>>>> I was able to use your suggestion successfully to merge most of
>>>>>> my flows together. Though I needed to make a few modifications
>>>>>> to the filter. I moved the parentheses: "tcp and ((syn or synack)
>>>>>> and ((fin or finack) or reset))" vs. "tcp and (((syn or
>>>>>> synack) and (fin or finack)) or reset)." And I added "not con"
>>>>>> to filter out the many, many packet scans, though this also does
>>>>>> not merge syn-synack flows which exist at the end of the argus
>>>>>> output files. This filter still caused most of the memory to be
>>>>>> used, but not a whole lot of time was spent in the upper range
>>>>>> where swapping was slowing the system to a crawl. Without "not
>>>>>> con" I would reach the upper limits of memory usage quite fast
>>>>>> and go into a crawl with the swapping.
>>>>>>
>>>>>> Thanks again for all your help,
>>>>>> Nick
>>>>>>
>>>>>>
>>>>>> Carter Bullard wrote:
>>>>>>> Hey Nick,
>>>>>>> The argus project from the very beginning has been trying
>>>>>>> to get people away from capturing packets, and instead
>>>>>>> capturing comprehensive flow records that account for every
>>>>>>> packet on the wire. This is because capturing packets at modern
>>>>>>> speeds seems impractical, and there are a lot of problems that can
>>>>>>> be worked out without all that data.
>>>>>>>
>>>>>>> So to use argus in the way you want to use argus is a bit of a
>>>>>>> twist on the model. But I like twists ;o)
>>>>>>>
>>>>>>> >>> To start out with something simple I want to be able to
>>>>>>> count the number of flows over TCP port 25.
>>>>>>>
>>>>>>> The easiest way to do that right now is to do something like
>>>>>>> this in bash:
>>>>>>>
>>>>>>> % for i in pcap*; do argus -r $i -w - - tcp and port 25 | \
>>>>>>>     rasplit -M time 5m -w \
>>>>>>>     argus.data/%Y/%m/%d/argus.%Y.%m.%d.%H.%M.%S ; \
>>>>>>>   done
>>>>>>>
>>>>>>> That will put the tcp:25 "micro flow" argus records into a
>>>>>>> manageable
>>>>>>> set of files. Now the files themselves need to be processed to
>>>>>>> get the flows merged together:
>>>>>>>
>>>>>>> % racluster -M replace -R argus.data
>>>>>>>
>>>>>>> So now you'll get the data needed to ask questions, split into
>>>>>>> 5m bins, so to speak. Changing the "5m" to "1h", "4h", or "1d"
>>>>>>> may generate file structures that you can work with, but
>>>>>>> eventually you will hit a memory wall without doing something
>>>>>>> clever.
>>>>>>>
>>>>>>> Now that you have these intermediate files, in order to merge the
>>>>>>> tcp flows that span multiple files, you will need to give
>>>>>>> racluster()
>>>>>>> a different aggregation strategy than the default. Try a
>>>>>>> racluster.conf file that contains these lines against the argus
>>>>>>> files
>>>>>>> you have.
>>>>>>>
>>>>>>> ------- start racluster.conf ---------
>>>>>>>
>>>>>>> filter="tcp and ((syn or synack) and ((fin or finack) or reset))" status=-1 idle=0
>>>>>>> filter="" model="saddr daddr proto sport dport"
>>>>>>>
>>>>>>> ------- end racluster.conf --------
>>>>>>>
>>>>>>> What this will do is:
>>>>>>> 1. any tcp connection that is complete, where we saw the
>>>>>>> beginning and the end, just pass it through, don't track anything.
>>>>>>> 2. any partial tcp connection, track and merge records that match.
>>>>>>>
>>>>>>> So it only allocates memory for flows that are 'continuation'
>>>>>>> records.
>>>>>>> The output is unsorted, so you will need to run rasort() if you
>>>>>>> want to do any time-oriented operations on the output.
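>>>>>>>
>>>>>>> Something like this should do it (rasort sorts on the flow start
>>>>>>> time by default, if I remember right):
>>>>>>>
>>>>>>> rasort -r argus.out -w argus.sorted.out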
>>>>>>>
>>>>>>> In testing this, I found a problem with parsing "-1" from the
>>>>>>> status
>>>>>>> field in some weird conditions, so I fixed it. Grab the newest
>>>>>>> clients from the dev directory if you want to try this method.
>>>>>>>
>>>>>>> ftp://qosient.com/dev/argus-3.0/argus-clients-3.0.0.rc.69.tar.gz
>>>>>>>
>>>>>>> Give that a try, and send email to the list with any kind of result
>>>>>>> you get.
>>>>>>>
>>>>>>> With so many pcap files, we probably need to make some other
>>>>>>> changes.
>>>>>>>
>>>>>>> The easiest way for you to do what you eventually want to do,
>>>>>>> would be for you to say something like this:
>>>>>>> argus -r * -w - | rawhatever
>>>>>>>
>>>>>>> This currently won't work, and there is a reason, but maybe we
>>>>>>> can change it. Argus currently can read multiple input files,
>>>>>>> but you
>>>>>>> need to specify each file using a "-r filename -r filename "
>>>>>>> like command
>>>>>>> line list. With 1000's of files, that is somewhat
>>>>>>> impractical. It is this
>>>>>>> way on purpose, because argus really does need to see packets in
>>>>>>> time order.
>>>>>>>
>>>>>>> If you try to do something like this:
>>>>>>>
>>>>>>> argus -r * -w - | rasplit -M time 5m -w
>>>>>>> argus.out.%Y.%m.%d.%H.%M.%S
>>>>>>>
>>>>>>> which is designed to generate argus record files that represent packet
>>>>>>> behavior with hard cutoffs every 5 minutes, on the hour; if the
>>>>>>> packet files are not read in time order, you get really weird
>>>>>>> results. It's as if the realtime argus was jumping into the
>>>>>>> future and
>>>>>>> then into the past and then back to the future again.
>>>>>>>
>>>>>>> Now, if you name your pcap files so they can be sorted, I can
>>>>>>> make it so "argus -r *" can work. How do you name your pcap files?
>>>>>>>
>>>>>>>
>>>>>>> Because argus has the same timestamps as the packets in your
>>>>>>> pcap files, the timestamps can be used as an "external key" if
>>>>>>> you will. If you build a database that has tuples (entries) like:
>>>>>>>
>>>>>>> "pcap_filename start_time end_time"
>>>>>>>
>>>>>>> then by looking at a single argus record, which has a start time
>>>>>>> and an end time, you can find the pcap files that contain its
>>>>>>> packets.
>>>>>>> And with something like perl and tcpdump or wireshark, you can
>>>>>>> use a simple shell script to look in those pcap files for packets
>>>>>>> with this type of filter:
>>>>>>>
>>>>>>> (ether host $smac and ether host $dmac) and \
>>>>>>> (host $saddr and host $daddr) and \
>>>>>>> (port $sport and port $dport)
>>>>>>>
>>>>>>> and you get all the packets that are referenced in the record.
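>>>>>>>
>>>>>>> Something like this might work to build that table (I'm using
>>>>>>> capinfos from the wireshark tools here just as an example; the
>>>>>>> exact flags and output format may differ a bit on your version):
>>>>>>>
>>>>>>> for i in /packets/*.pcap; do
>>>>>>>     # -a/-e print the first/last packet times of the capture
>>>>>>>     start=$(capinfos -a $i | grep -i time | cut -d: -f2-)
>>>>>>>     end=$(capinfos -e $i | grep -i time | cut -d: -f2-)
>>>>>>>     echo "$i $start $end"
>>>>>>> done > pcap.index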
>>>>>>>
>>>>>>>
>>>>>>> Carter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Feb 21, 2008, at 4:49 PM, Nick Diel wrote:
>>>>>>>
>>>>>>>> I am new to Argus, but have found it has great potential for
>>>>>>>> the research project I work on. We collect pcap files from
>>>>>>>> several high traffic networks (20k-100k packets/second). We
>>>>>>>> collect for approximately 12 hours and have ~1000 pcap files
>>>>>>>> that are roughly 500MB each.
>>>>>>>> I want to do a number of different kinds of flow analysis and
>>>>>>>> think Argus might be perfect for me. I am having a hard time
>>>>>>>> grasping some of the fundamentals of Argus, but I think once I
>>>>>>>> get some of the basics I will be able to really start to use
>>>>>>>> Argus.
>>>>>>>>
>>>>>>>> To start out with something simple I want to be able to count
>>>>>>>> the number of flows over TCP port 25. I know I need to use
>>>>>>>> RACluster to merge the Argus output (I have one argus file for
>>>>>>>> each pcap file I have) so that I can combine identical flow
>>>>>>>> records into one. I can do this fine on one argus output file,
>>>>>>>> but I know many flows span the numerous files I have. I also
>>>>>>>> know I can't load all the files at once into RACluster as it
>>>>>>>> fills all available memory. So my question is how can I
>>>>>>>> accomplish this while making sure I capture most flows that
>>>>>>>> span multiple files.
>>>>>>>>
>>>>>>>> Once I understand this, I hope to be able to do things like
>>>>>>>> create a list of flow sizes (in bytes) for port 25. Basically
>>>>>>>> I will be asking a lot of questions involving all flows that
>>>>>>>> match a certain filter, and I am not sure how to account for
>>>>>>>> flows spanning multiple files.
>>>>>>>>
>>>>>>>> A separate question. I don't think Argus has this ability, but
>>>>>>>> I wanted to know if the community already had a utility for
>>>>>>>> this. I am looking into creating a DB of some sort that would
>>>>>>>> match Argus's flow IDs to pcap file name(s) and packet
>>>>>>>> numbers. This way one could extract the packets for a flow
>>>>>>>> that needed further investigation.
>>>>>>>>
>>>>>>>> And finally, thanks for the great tool. It does a number of
>>>>>>>> things I have been doing manually for a while.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>