200GB a day
Peter Van Epp
vanepp at sfu.ca
Sat Aug 20 16:32:55 EDT 2011
On Thu, Aug 18, 2011 at 09:14:00AM -0700, Eric Gustafson wrote:
> Hey all,
> Just weighing in, and a quick question:
> Not only can Argus itself handle 200GB a day, or 1TB a day, as was
> mentioned, but we are pushing roughly ten times that. This thing can
> scale, given the right hardware! (Bivio 7000 series) No dropped
> packets, no signs of memory issues, running for months and months
> straight.
>
> This leads to the question of how one manages the massive amount of
> data such a setup generates. How do those of you with larger argus
> installs manage your data? Right now, our in-house Perl wizard has
> prepared some scripts to attempt to wrangle (search / compute stats
> on) the trees of datestamped bzips that make up our data. This seems
> far from ideal, but given the size of the data being processed and the
> number of records, I don't know of a better approach. I briefly
> thought about SQL, but even taking a smaller file of ours and running
> it through made a test SQL instance cry and beg for mercy, obviously
> due to the number of records involved.
>
> Is a linear search with ra the best I can do?
>
> (Thankfully, we don't need to do searches and stuff too often!)
>
> Cheers,
> - Eric
>
The linear search is likely the most cost-effective approach :-). Our
data (~1.5 gigabytes per 24-hour day) is in argusarchive format on an archive
machine separate from the sensors. On the few occasions I needed a scan of a
large time period (weeks to months), usually to find the cause of a break-in
after argus detected it and the machine was removed from the network, I'd
manually create a file listing all the argus data for the period of interest,
then feed that list (one input file at a time) and a filter expression to a
perl script that ran ra with the filter across all the files. That ran on the
archive machine, so it didn't affect collection, and could take anywhere from
hours to days (I remember one 6-month run taking about 3 days to complete
:-)). The thing is, I didn't care how long the search took as long as I wasn't
tied up doing it. There was no requirement for fast output either: things that
needed a fast reaction were near real-time in nature and covered by the daily
traffic scripts, not by long-term searches. Overkill (i.e. searching a greater
time period than you think you need) won't normally be a problem in these
kinds of searches, so do what is quick and easy (and minimizes your time),
then let the computer grind away at it :-).
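	Something along these lines would do the per-file ra runs. This is
only a minimal sketch of that kind of wrapper, not the actual script: the
file list name, the output file, and the filter expression below are all
made-up examples.

#!/usr/bin/perl
# Walk a plain-text list of argus archive files (one path per line,
# e.g. produced by a find over the archive tree for the dates of
# interest) and run ra with one filter across each, appending matches
# to a single output file.
use strict;
use warnings;

my $filter  = 'host 192.0.2.1';   # hypothetical example filter
my $outfile = 'matches.txt';      # hypothetical output file

open my $list, '<', 'filelist.txt' or die "can't open filelist.txt: $!\n";
while ( my $file = <$list> ) {
    chomp $file;
    next unless -f $file;    # skip anything that isn't a regular file
    # ra reads one archive file at a time; everything after the lone
    # "-" is the filter expression. Output is appended, so a multi-day
    # run can be inspected while it grinds away.
    system("ra -r $file - $filter >> $outfile") == 0
        or warn "ra failed on $file\n";
}
close $list;

	Since each file is processed independently, a run that gets killed
partway through can be restarted from where it left off. If the archive files
are bzip2-compressed, you'd decompress each one first (or pipe it through
bzcat into ra, assuming your build of ra reads stdin via -r -).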
Peter Van Epp