n00b Questions

John Kennedy wilson.amajohn at gmail.com
Fri Aug 28 17:28:33 EDT 2009


My average bandwidth today is 84 Meg per hour; during peak times it can reach
120 Meg.  In 40 minutes I collect about 1.5 gig, and in an hour I will see
about 2.3-2.5 gig of "primitive" data, or more.

So in 24 hours I will see about 60 gig of primitive data.  Pumping just this
one sensor's data to my server (which I described before) is more than it can
handle for a daily Top Talker aggregation.  I have to chunk the reports up by
hour to keep the processor available for other reports.

You can see that in just a week I would store close to 420 gigs of primitive
data.  Processing a day's primitive data (from one sensor) tends to bring the
server to its knees just running racluster.  Now throw in 4 other sensors (one
of which sees almost as much traffic) and that is a lot of raw data to process
and store.
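
(The hourly chunking I mentioned looks roughly like this; the paths and the
saddr/bytes fields below are just placeholders, and I'm assuming the standard
racluster/rasort -r/-w/-m options:)

    # aggregate each hour's files separately, then merge the hours and sort
    for h in $(seq -w 0 23); do
        racluster -m saddr -r /archive/sensor1/2009.08.28.$h.* \
                  -w /tmp/toptalk.$h.ra
    done
    racluster -m saddr -r /tmp/toptalk.*.ra -w - | rasort -m bytes -r - > toptalkers.txt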

Clearly I need to make better use of the tools to process the raw data
efficiently.  Hence this email, to see how others may do it.

Regards,

John

On Fri, Aug 28, 2009 at 8:56 AM, Carter Bullard <carter at qosient.com> wrote:

> Hey John,
> Historically, this list has been pretty quiet as to how people are doing
> particular things, so you may not get a lot of responses.  Hopefully I can help.
>
> Most universities and corporations that run argus use it along with snort or
> some other type of IDS at their enterprise border.  They use the IDS as their
> up-front security sensor, and argus as the "cover your behind" technology.
> The two basic strategies are to keep all their argus data to support
> historical forensics, or to toss it after looking at the IDS logs and seeing
> that not much is/was happening.
>
> The first approach is usually chosen by sites that have technically advanced
> security personnel, that have been seriously attacked, or that for some
> reason have a real awareness of the issues and know that the commercial
> IDS/IPS market is lacking.  For sites that are underfunded or less
> technically oriented, argus or argus-like strategies usually aren't being
> used.  If these types of sites are using flow data, it's almost always
> NetFlow data, and they are using a commercial report generator to give the
> data some utility.  These strategies normally do not store significant
> amounts of flow data, as that would be a cost to the customer.
>
> So when a site does collect a lot of flow data, it generally partitions the
> data for scaling (like you are doing).  Universities and small corporations
> generate argus data in the subdomains/workgroups/dorms, where 500 GB can
> store a year's worth of flow data.
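>
> (For what it's worth, a hypothetical way to wire up that kind of partitioned
> collection is a radium() collector on the central server pulling from each
> workgroup sensor; the host names and output path here are made up, and I'm
> assuming the usual ra-style -S source and -w output options:)
>
>    # on the central server: attach to each workgroup argus and merge the
>    # streams into one local archive file (rotation/splitting handled later)
>    radium -S dorm-argus:561 -S lab-argus:561 -w /argus/radium.out -d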
>
> When the point of collection is the enterprise boundary, and a site is
> really using the data and justifying the expense of collecting it all, the
> site invests in storage, but it also does a massive amount of preprocessing
> to get the data load down.
>
> Most sites generate 5m-1h files.  We recommend 5 minutes.  Most sites run
> racluster() with the default settings on their files, sometime early in the
> process, and then gzip the files.  Just running racluster() with the default
> parameters will usually reduce a particular file by 50-70%.  I took
> yesterday's data from one of my small workgroups, clustered it and
> compressed it, and got these listings:
>
>    thoth:tmp carter$ ls -lag data*
>    -rw-r--r--  1 wheel  93096940 Aug 28 10:30 data
>    -rw-r--r--  1 wheel  12534420 Aug 28 10:34 data.clustered
>    -rw-r--r--  1 wheel   2781879 Aug 28 10:30 data.clustered.gz
>
> So, from 93 MB to under 3 MB is pretty good.  Reading these gzip'd files
> performs pretty well, but if you are going to process them repeatedly, then
> delaying compression for a few days is the norm.
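>
> (A minimal sketch of that workflow, assuming rasplit's strftime-style -w
> output path and the standard -r/-w options; the paths and dates are
> placeholders:)
>
>    # carve the incoming stream into 5-minute files
>    rasplit -S localhost:561 -M time 5m -w '/argus/%Y/%m/%d/argus.%H.%M.%S'
>
>    # later: cluster a file with the default aggregation, then compress it
>    racluster -r /argus/2009/08/27/argus.10.05.00 -w data.clustered
>    gzip data.clustered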
>
> Because searching hundreds of GB of primitive data is not very gratifying if
> you're looking for speed, almost all big sites process the data as it comes
> in to generate "derived views" that are their first-glance tables, charts,
> and information systems.  After creating these "derived views", some sites
> toss the primitive data (the data from the probes).  For billing or quota
> verification, most sites generate the daily reports, retain the
> aggregated/processed argus records, and throw away the primitive data.  I've
> seen methods that toss, literally, 99.8% of the data within the first 24
> hours and still retain enough to do a good job on security awareness.
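>
> (One hypothetical shape for such a derived view, keeping only the daily
> aggregate; the aggregation fields and paths are just placeholders, and it
> leans on the fact noted above that the clients read gzip'd files:)
>
>    # build the day's derived view, keep it, and only then drop the primitives
>    racluster -m saddr daddr proto dport -r /archive/2009/08/27/*.gz \
>              -w /derived/2009.08.27.matrix.ra
>    gzip /derived/2009.08.27.matrix.ra && rm /archive/2009/08/27/*.gz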
>
> There was a time when scanning traffic generated most of the flow data
> (> 40%).  That has shifted in the last 3-4 years, but we have filters that
> can very quickly strip out the data destined for your dark address space and
> split it off to other directories.  Some sites use that data; many sites
> toss it.
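>
> (A sketch of that kind of split, assuming the usual tcpdump-style filter
> expression after the "-" and using 10.99.0.0/16 as a made-up dark net:)
>
>    # divert flows aimed at the dark address space into their own directory
>    ra -r argus.file -w /archive/dark/argus.file - dst net 10.99.0.0/16
>    ra -r argus.file -w /archive/live/argus.file - not dst net 10.99.0.0/16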
>
> Some sites want to track their IP address space, because they have found
> that it is important to them; some want to retain flow records only for
> "the dorms".  The argus-clients package has programs to help facilitate all
> of this, but you need to figure out what will work for you.
>
> I'm aware that my response may not answer your questions, but just
> keep asking away, and maybe there will be an answer in there that you can
> use.
>
> As for what kind of hardware to get: well, what's wrong with what you're
> using?
>
> Carter
>
>
> On Aug 28, 2009, at 2:53 AM, John Kennedy wrote:
>
> Reading the argus website's material on system auditing got me thinking:
> with multiple ways to collect, analyze, and store argus data, I am curious
> how some of you have tackled the collection, processing, management, and
> storage of it.  I am always curious about how others do it because, as with
> programming, there is almost always more than one way to do it.  I would
> also like to find out whether there are ways in which I could be more
> efficient.
>
> I use argus strictly for Network Security Monitoring.  In an ArcSight
> webinar I attended the other day, the presenter said, "Your business paints
> a picture every day... is anyone watching?"  For me, argus helps connect the
> dots in order to see the picture(s).  I could throw many more analogies in
> here, but I think you get the point.
>
> It has come time for me to refresh some of the hardware that argus is
> running on.  In order to effectively put together a proposal that will meet
> the needs of my monitoring efforts for the enterprise, I would like to
> understand a little about how those on this list are deploying argus.
>
> For me, processing the data is the hardest hurdle I have to overcome each
> day.  The server I run the reporting from has a dual-core processor with
> 2 GB of RAM and 500 GB of storage.  Is this typical?  Retention is also an
> issue.  On my sensors I run argus and write the data to a file.  Every hour
> a script takes the file, compresses it, and copies it to an archive; every 4
> hours I rsync it to the server.  On the server, scripts process the last
> four hours of files that were just rsynced.  I realize that I could use
> radium() to save files to my server; however, with only a 500 GB RAID it
> gets a little tight with 5 sensors, so I keep archives on the sensors
> themselves to aid retention.  The sensors, by the way, have a 200 GB RAID.
> When I was first working with argus and finding equipment to use, I was sure
> that 500 GB would be plenty... it's 500 gig, for crying out loud.
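>
> (For reference, the hourly job is roughly this; the paths, host names, and
> date handling are placeholders, and it assumes argus is already writing
> hourly files on the sensor:)
>
>    #!/bin/sh
>    # hourly cron job on the sensor: compress the last completed hour's
>    # argus file into the local archive
>    HOUR=$(date -d '1 hour ago' +%Y.%m.%d.%H)   # GNU date syntax
>    gzip -c /var/log/argus/argus.$HOUR > /archive/argus.$HOUR.gz
>
>    # from cron every 4 hours: push new archive files to the report server
>    #   rsync -a /archive/ report-server:/data/sensors/sensor1/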
>
> So, give a n00b some feedback.
>
> Thanks
>
> John
>