Removing possibly unused metadata?
MN
m.newton at stanford.edu
Mon Nov 7 19:31:35 EST 2011
Hi Carter -
> Many of these ragator() changes are now incorporated into 3.0 racluster(), so
> you shouldn't have to modify the source.
Agreed and thanks - I was just showing Jason what we had done as he'd
asked for the patches.
> For the rastrip(), looks like you want to blow away the fractional part of the
> timestamp? I can add that to rastrip() later this week, but that won't
> reduce the size of the stored record. How would you want to specify that
> on rastrip()'s command line?
It will not reduce the size of the uncompressed record, but it makes
a _huge_ difference to the compressed record (lots of entropy is
removed => much better compression).
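Roughly the transform I mean, as a sketch on plain struct timeval
(hypothetical helper name, not the actual rastrip() patch or the real
argus record layout):

    /* Hypothetical helper, just to show the idea: drop the fractional
     * part of a flow's timestamps so the low-order bytes stop varying
     * from record to record and the compressor sees long identical runs. */
    #include <sys/time.h>

    static void
    strip_usecs(struct timeval *start, struct timeval *last)
    {
        start->tv_usec = 0;                 /* truncate start time down */
        if (last->tv_usec) last->tv_sec++;  /* round last time up ...   */
        last->tv_usec = 0;                  /* ... so durations never shrink */
    }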
Now, for long-term data (when we do not need precise time stamps),
we convert to ascii, round to Nths of a second, and compress, which
gives quite good ratios. We also create summary files (of seen IPs)
that drastically speed up needle-in-the-haystack searches.
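(The rounding itself is just integer arithmetic on the microseconds
field; a hypothetical sketch with 1/8-second steps - the example
below used 3 decimal places instead:)

    /* Hypothetical sketch of "round to Nths of a second", here N = 8,
     * i.e. one step = 125000 usecs, rounding half up. */
    #include <sys/time.h>

    #define STEP_USEC (1000000 / 8)

    static void
    round_timeval(struct timeval *t)
    {
        long r = ((long) t->tv_usec + STEP_USEC / 2) / STEP_USEC * STEP_USEC;
        if (r >= 1000000) {                 /* carried into the next second */
            t->tv_sec++;
            r = 0;
        }
        t->tv_usec = r;
    }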
Here are the details for one hour today from one tap, with time
rounded to 3 decimal places:
# gunzip < argus.13.gz | wc
9395130 66723491 2688418924
# ls -al *.13*
-r--r--r-- 1 xfer xfer 1180649821 Nov 7 14:00 argus.13.gz
-r--r--r-- 1 xfer xfer 349355238 Nov 7 14:00 ascii-argus.13.gz
-r--r--r-- 1 xfer xfer 820980 Nov 7 15:13 ip_summary.13.gz
So: 2.7 GB --gzip--> 1.2 GB --round/ascii--> 0.35 GB --summarize--> 0.82 MB
(and we're partially moving to xz for better compression)
Many thanks,
- mike
On Mon, Nov 07, 2011 at 07:07:03PM -0500, Carter Bullard wrote:
> Hey Mike,
> Many of these ragator() changes are now incorporated into 3.0 racluster(), so
> you shouldn't have to modify the source.
>
> For the rastrip(), looks like you want to blow away the fractional part of the
> timestamp? I can add that to rastrip() later this week, but that won't
> reduce the size of the stored record. How would you want to specify that
> on rastrip()'s command line?
>
> Zeroing out a value is more in line with ranonymize() than with rastrip().
> I'd suggest that you just strip out the ' net ' DSR to achieve what you're after?
>
> Carter
>
> On Nov 7, 2011, at 6:10 PM, MN wrote:
>
> >
> > Hi Jason -
> >
> > These diffs are based on the 2.0.6 distribution, so are old.
> > They worked well for us over a several-year period. I believe
> > the ragator changes made about a 3% improvement. The timestamp
> > and other rastrip changes made a variable difference depending
> > upon the mask, but were substantial.
> >
> > Now, the asciification and xz compression - and lots more
> > storage - allow us to keep ~400 days.
> >
> > Hope these help,
> > - mike
> >
> >
> > % diff ragator.c raGATOR.c
> > 37a38,39
> >> int fromFilesOnly = 0; /* -- MN */
> >>
> > 132a135,141
> >> /* if we are not "real-time", then do not purge the queue as often -- MN */
> >> if (rflag & !Sflag) {
> >> extern struct timeval RaClientTimeout;
> >> RaClientTimeout.tv_sec = 8;
> >> RaClientTimeout.tv_usec = 0;
> >> }
> >>
> > 223a233
> >> fprintf (stderr, " -H bins[L]:range Do Histogram-related processing (range is value-value, where value is %%d[ums]) [UNDOC'ed]"); /* -- MN */
> > 524,526c534,537
> > <
> > < #define RA_MAXQSCAN 25600
> > < #define RA_MAXQSIZE 250000
> > ---
> >> /* original multipliers were each 1; RA_MAXQSIZE roughly equals the amount of memory (in K) used */
> >> #define RA_MAXQSCAN (1 * 25600)
> >> #define RA_MAXQSIZE (2 * 250000)
> >> /* #define RA_MAXQSIZE 625000 */
> > 527a539,540
> >> /* Note: this is called once every RaClientTimeout whether or not reading from streams. */
> >>
> > 542,559c555,573
> > < while (queue->count > RA_MAXQSIZE) {
> > < obj = (struct ArgusRecordStore *) RaRemoveFromQueue(RaModelerQueue, RaModelerQueue->start->prv);
> > < RaTimeoutArgusStore(obj);
> > < }
> > <
> > < if ((cnt = ((queue->count > RA_MAXQSCAN) ? RA_MAXQSCAN : queue->count)) != 0) {
> > < while (cnt--) {
> > < if ((obj = (struct ArgusRecordStore *) RaPopQueue(queue)) != NULL) {
> > < if (RaCheckTimeout(obj, NULL))
> > < RaTimeoutArgusStore(obj);
> > < else
> > < RaAddToQueue(queue, &obj->qhdr);
> > <
> > < } else
> > < cnt++;
> > < }
> > < }
> > < break;
> > ---
> >> while (queue->count > RA_MAXQSIZE) {
> >> obj = (struct ArgusRecordStore *) RaRemoveFromQueue(RaModelerQueue, RaModelerQueue->start->prv);
> >> RaTimeoutArgusStore(obj);
> >> }
> >>
> >> if ((cnt = ((queue->count > RA_MAXQSCAN) ? RA_MAXQSCAN : queue->count)) != 0) {
> >> while (cnt--) {
> >> if ((obj = (struct ArgusRecordStore *) RaPopQueue(queue)) != NULL) {
> >> if (RaCheckTimeout(obj, NULL))
> >> RaTimeoutArgusStore(obj);
> >> else
> >> RaAddToQueue(queue, &obj->qhdr);
> >>
> >> } else
> >> cnt++;
> >> }
> >> }
> >>
> >> break;
> >
> >
> > % diff rastrip.c raxstrip.c
> > 83a84,88
> >> /* MN: later we should offer these as options, but for now */
> >> int XFall = 1; /* do all record zeroings */
> >> int XFusecs = 1; /* zero synAckuSecs & ackDatauSecs */
> >> int XFtimedescusecs = 1; /* zero .time.start.tv_usec & .time.last.tv_usec */
> >>
> > 197c202,205
> > < fprintf (stderr, "Rastrip Version %s\n", version);
> > ---
> >> fprintf (stderr, "RaXstrip Version %s\n", version);
> >> fprintf (stderr, "Does all rastrip processing and also zeros many non-essential fields,\n");
> >> fprintf (stderr, "which, with bzip2, produces much higher compression ratios.");
> >> fprintf (stderr, " ... zxm.zxnewton at zxstanford.zxedu\n");
> > 222d229
> > < struct ArgusRecord * RaConstructArgusRecord (struct ArgusRecord *);
> > 223a231,252
> >> #ifdef SUBSECOND
> >> /*
> >> * for these, the high 12 bits should never be on (usec < 1000000);
> >> * USEC_BITS_TO_DROP gives number of low order bits dropped
> >> */
> >> #define USEC_BITS_TO_DROP 12
> >> #define USEC_BITS_MASK ((unsigned) (1 << USEC_BITS_TO_DROP) - 1)
> >> #define USEC_UP(x) ((x+USEC_BITS_MASK) & (~USEC_BITS_MASK))
> >>
> >> void
> >> trunc_up_timeval(struct timeval *t)
> >> {
> >> if (t->tv_usec & USEC_BITS_MASK) {
> >> t->tv_usec &= ~USEC_BITS_MASK;
> >> t->tv_usec += (USEC_BITS_MASK+1);
> >> if (t->tv_usec >= 1000000) {
> >> t->tv_sec++;
> >> t->tv_usec = 0;
> >> }
> >> }
> >> }
> >> #endif /* SUBSECOND */
> > 245a275,277
> >> if (XFall) {
> >> newarg->ahdr.seqNumber = 0; /* this may be dangerous */
> >> }
> > 247a280,296
> >> struct ArgusFarStruct *t = (struct ArgusFarStruct *) &((char *)newarg)[newarg->ahdr.length];
> >> if (XFall) { /* MN: clear the microseconds part of the timestamps, others */
> >> t->time.start.tv_usec = 0; /* truncate start time */
> >> if (t->time.last.tv_usec) t->time.last.tv_sec++; /* round up end time */
> >> t->time.last.tv_usec = 0;
> >> t->flow.flow_union.ip.ip_id = 0; /* clear the ID field */
> >>
> >> t->ArgusTransRefNum = 0; /* this may be dangerous */
> >>
> >> if (((argus->ahdr.status & 0xFFFF) != ETHERTYPE_ARP) &&
> >> ((argus->ahdr.status & 0xFFFF) != ETHERTYPE_REVARP)) {
> >> /* these would wipe out ArgusARPAttributes otherwise */
> >> t->attr.attr_union.ip.soptions = t->attr.attr_union.ip.doptions = 0;
> >> t->attr.attr_union.ip.sttl = t->attr.attr_union.ip.dttl = 0;
> >> t->attr.attr_union.ip.stos = t->attr.attr_union.ip.dtos = 0;
> >> }
> >> }
> > 257a307,318
> >> struct ArgusTCPObject *t = (struct ArgusTCPObject *)&((char *)newarg)[newarg->ahdr.length];
> >> if (XFusecs) { /* MN: remove performance stats from ArgusTCPObject */
> >> t->synAckuSecs = t->ackDatauSecs = 0;
> >> t->src.pad = t->dst.pad = 0; /* should be 0 anyway, but just in case */
> >> t->src.win = t->dst.win = 0;
> >> t->src.seqbase = t->dst.seqbase = 0;
> >>
> >> /* more dangerous... */
> >> t->src.ackbytes = t->dst.ackbytes = 0;
> >> t->src.rpkts = t->dst.rpkts = 0;
> >> t->src.bytes = t->dst.bytes = 0;
> >> }
> > 360c421,425
> > <
> > ---
> >> #ifdef NEVERUSED
> >> /*
> >> * MN: as far as I can tell, this is never used - there do not appear to be any calls to it,
> >> * even in the original rastrip.c or libraries used by it.
> >> */
> > 436a502
> >> #endif /* NEVERUSED */
> >
> >
> >
> >
> > On Mon, Nov 07, 2011 at 07:57:39PM +0000, Jason Carr wrote:
> >> Hi Mike,
> >>
> >> If you have any of those scripts handy, I'd appreciate a copy. I'm
> >> actually interested in the conversion to ASCII; I didn't think of that
> >> one. It makes things a little harder if we want to do things like network
> >> range queries, but it might be worth it.
> >>
> >> I'm also trying xz. We are currently using plain gzip for archiving.
> >>
> >> Here's my current result set:
> >>
> >> -rw-r--r-- 1 root root 1146339108 2011-10-27 15:36 core.2011.10.13.14.00
> >>
> >> -rw-r--r-- 1 root root 564863918 2011-10-27 15:37 core.gz
> >>
> >> -rw-r--r-- 1 root root 523034738 2011-10-27 15:37 core.bz2
> >>
> >> -rw-r--r-- 1 root root 358668276 2011-10-27 15:37 core-9.xz
> >> -rw-r--r-- 1 root root 396348980 2011-10-27 15:37 core-6.xz
> >>
> >>
> >> Pretty decent compression with xz. It takes a long time to compress,
> >> though: -9 takes 14 minutes in my tests, -6 takes 11 minutes. For
> >> longer-term compression, it's probably worth it.
> >>
> >>
> >> Thanks Mike,
> >>
> >> Jason
> >>
> >> On 11/4/11 2:46 PM, "MN" <m.newton at stanford.edu> wrote:
> >>
> >>>
> >>> Formerly, for data that we kept long-term, rounding time stamps to the
> >>> nearest 1/4 or 1/8 of a second reduced entropy sufficiently to make a
> >>> significant difference in compressed file sizes (this will not help on
> >>> non-compressed argus files). I can send the old code if desired, but
> >>> it was for an older version of Argus.
> >>>
> >>> Now we save our longer term data in ascii format, saving just the fields
> >>> that we want, and using a combination of -p and RA_TIME_FORMAT.
> >>>
> >>> Consider using xz instead of bzip2, especially if you look at the log
> >>> files frequently, as the decompression time is significantly less - at
> >>> the cost of longer compression times. Note xz defaults to '-6'.
> >>>
> >>> We've been keeping more than a year's worth of data on roughly ten 1-4g/s
> >>> links.
> >>>
> >>> - mike
> >>>
> >>> On Oct 28, 2011, at 5:06 PM, Jason Carr wrote:
> >>>
> >>>> We write argus data into five-minute chunked files. We typically have
> >>>> +1G files for those 5 minutes. Is there any metadata that we might be
> >>>> able to purge to decrease the size significantly?
> >>>>
> >>>> I normally only care about StartTime, flags, proto, src/dst
> >>>> {mac,ip,port}, direction, packets, bytes, state, and user data in either
> >>>> direction.
> >>>>
> >>>> I already gzip-compress the files. I tried using bzip2 on a few test
> >>>> files and got a 1.1G file down to 500M instead of 539M, but I'm
> >>>> looking for a larger compression ratio and/or size difference.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jason
> >>
> >
>