yet another kdd cup question

Mon Sep 30 12:12:49 EDT 2013

Hey Oğuz,
I don't know what the KDD whatever is, and I'm thinking that
it's a pretty odd set of observations that they want to do some
expectation-based anomaly detection on???

These are not well-defined metrics, so I'll try to respond, but
understand that, whatever it is they think they are doing, there
is a better way to do anomaly detection.

Suggestions inline.
Carter

On Sep 30, 2013, at 7:41 AM, Oğuz Yarımtepe <oguzyarimtepe at gmail.com> wrote:

> It seems payload analysis is not a good approach in my situation, because i will be testing the algorithm against DDoS attacks.
> 
> So better to ask how can i calculate the below features by using Argus?
> 
> count: number of connections to the same host as the current connection in the past two second,

So, if you have the current connection, then you'll want to print out the
number of concurrent connections involving the two hosts of that connection?
Over the life of the connection, or for the day, or the 2 seconds prior to this
connection?  What is "prior" for a connection that lasts 1 month: is it still
-2 seconds from SYN time, or is it -2 seconds from SYN/ACK time?

Too weird, so let's count the concurrent connections to every host in a data file,
for every 2 second period, and then you can query that list when you pick a connection
of interest.  At least that file can be the source of some anomaly detection
data that may be useful.

rabins() is the tool of choice here.  We'll split the file into 2 second intervals,
and aggregate the transactions, per interval.  That we do in a single call to rabins().

   % rabins -M time 2s -r file -w transactions.per.2s.period

But already we're going to have to ask: any protocol, any interaction?
For this exercise, you may want to limit it to IP.

   % rabins -M time 2s -r file -w transactions.per.2s.period - ip

Then we'll count the total number of "trans" actions per host, for each 2 second
period.  From that output you can pick the host you're interested in.

   % rabins -M time 2s -r transactions.per.2s.period -M dsrs="-agr" \
          -M rmon -m saddr -w transactions.per.host.2s.period

This second rabins() will account for single objects, " -M rmon ", since
we're only interested in a single IP address.  We'll aggregate based on the
new flow model " -m saddr ", this will count the occurrences of each
different transaction to each specific host.  The ' -M dsrs="-agr" ' is
critical to getting the right numbers, as it clears the aggregation stats
that were generated by the first call to rabins().

What you will have in the end is a file that accounts for the concurrent
transactions for each host, every 2 seconds.  From here, you can pick a
connection, grab the hosts of interest from the connection, and maybe the
start and stop times, and then query the transactions.per.host.2s.period
file for your answer:

   % ra -r transactions.per.host.2s.period -s stime dur trans saddr - host x.y.z.w

This will give you the complete connection history for that host through the
whole accounting period.  If you want a specific time range:

   % ra -t time-range -r transactions.per.host.2s.period -s stime dur trans saddr - host x.y.z.w

The " trans " field will tell you the number of connections.
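
For readers who want to see the bookkeeping spelled out, here is a rough
Python sketch of the same idea: bin flows into 2 second periods, then count
per host, crediting both endpoints of each flow (which is approximately what
the " -M rmon " step is for).  The function name, the tuple layout of the
input, and the sample addresses are all made up for illustration, and the
sketch assigns each flow to the bin of its start time only, so it does not
reproduce how rabins() handles flows that span several bins.

```python
from collections import Counter

BIN_SECONDS = 2  # same bin width as "-M time 2s"


def count_per_host(flows, bin_seconds=BIN_SECONDS):
    """Count flows per (bin start, host), crediting both endpoints of each flow."""
    counts = Counter()
    for stime, saddr, daddr in flows:
        # Align the start time to the beginning of its bin.
        bin_start = int(stime // bin_seconds) * bin_seconds
        # A set avoids double counting a flow whose endpoints are the same host.
        for host in {saddr, daddr}:
            counts[(bin_start, host)] += 1
    return counts


# Hypothetical flows: (start time in seconds, source address, destination address)
flows = [
    (100.2, "10.0.0.1", "10.0.0.2"),
    (100.9, "10.0.0.3", "10.0.0.2"),
    (101.5, "10.0.0.1", "10.0.0.4"),
    (102.1, "10.0.0.1", "10.0.0.2"),
]

counts = count_per_host(flows)
for (bin_start, host), trans in sorted(counts.items()):
    print(bin_start, host, trans)
```

Looking up a (bin start, host) key in the result plays the role of the
" - host x.y.z.w " filter above.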

If you have gargantuan files, and you want to limit the metrics to just 2 seconds at a time,
you'll want to use rasqltimeindex(), to get the range data.  You'll still need the rabins(),
but that's out of the scope of this response...

   % rasql -t time-range -w - - filter | rabins -M time 2s -w - | \
          rabins -M time 2s -M dsrs="-agr" -M rmon -m saddr -w transactions.this.host.time.range

This will go very fast.  You get the time-range (stime-etime) from the
connection of interest.  Now, you have to be sensitive to the ARGUS_FAR_STATUS_INTERVAL,
as you have to ensure that you get all the data that hits your 2 second interval.
So grab the stime, and back up a complete ARGUS_FAR_STATUS_INTERVAL + the 2 seconds, to
get data for the complete range.  You can do this easily with rasql().  Let's try a sample
stime of 2013/04/12.12:04:55 and an ARGUS_FAR_STATUS_INTERVAL of 5 seconds:

      % rasql -t 2013/04/12.12:04:55-7s

Will do the trick.  
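
The arithmetic behind the "-7s" is just the status interval plus the bin
width, subtracted from the stime.  A small Python sketch of that calculation,
using the sample values above (the function and constant names are made up
for illustration):

```python
from datetime import datetime, timedelta

FAR_STATUS_INTERVAL = timedelta(seconds=5)  # ARGUS_FAR_STATUS_INTERVAL from the example
BIN_WIDTH = timedelta(seconds=2)            # the 2 second counting period
TIME_FORMAT = "%Y/%m/%d.%H:%M:%S"           # layout of the stime string used above


def query_start(stime_text):
    """Return how far back the query must reach to cover the full 2 second bin."""
    stime = datetime.strptime(stime_text, TIME_FORMAT)
    return stime - (FAR_STATUS_INTERVAL + BIN_WIDTH)


start = query_start("2013/04/12.12:04:55")
print(start.strftime(TIME_FORMAT))  # 2013/04/12.12:04:48
```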

Add a filter to get data that just deals with the host or hosts of interest.
So, this basically means that if your traffic is coming into a standard argus
archive, and you have rastream() running rasqltimeindex() on the files that
are being ingested, then you can do this.

   With the connection of interest, get its stime, and host address, then run:

% time rasql -t 2013/04/12.12:04:55-7s -w - - ip and host 192.168.0.68 | \
    rabins -M time 2s -w - | \
    rabins -M dsrs="-agr" time 2s hard rmon -m saddr -s stime dur saddr trans

                 StartTime        Dur               Host  Trans 
2013/04/12.12:04:48.000000   2.000000       192.168.0.66      2
2013/04/12.12:04:48.000000   2.000000       192.168.0.68      2
2013/04/12.12:04:50.000000   2.000000       192.168.0.68      3
2013/04/12.12:04:50.000000   2.000000       192.168.0.66      2
2013/04/12.12:04:50.000000   2.000000       192.168.0.70      1
2013/04/12.12:04:52.000000   2.000000       192.168.0.66      2
2013/04/12.12:04:52.000000   2.000000       192.168.0.68      2
2013/04/12.12:04:54.000000   2.000000       192.168.0.66      6
2013/04/12.12:04:54.000000   2.000000       192.168.0.68      6

real	0m0.113s
user	0m0.069s
sys	0m0.026s

> the number of connections whose source IP address and destination IP address are the same to those of the current connection in the past two seconds

This is similar to the one above, but because you want to preserve "X -> Y", you
don't use the " -M rmon " option in the second run of rabins().

   % rabins -M time 2s -r transactions.per.2s.period -M dsrs="-agr" \
         -m saddr daddr -w transactions.per.hostpair.2s.period

This second rabins() will aggregate based on the new flow model " -m saddr daddr ",
this will count the occurrences of direction specific host -> host transactions.
The ' -M dsrs="-agr" ' is critical to getting the right number into the "trans" field.

The " trans " field, again, will tell you the number of connections.

   % ra -r transactions.per.hostpair.2s.period -s stime dur trans saddr - src host x.y.z.w and dst host w.z.y.x
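
As an illustration of what direction-specific counting means, here is a
minimal Python sketch that keys the count on (bin, source, destination), so
that "X -> Y" and "Y -> X" are kept apart.  The function name, the input
tuples, and the addresses are hypothetical, and each flow is assigned to the
bin of its start time only, which is a simplification of what rabins() does.

```python
from collections import Counter


def count_per_pair(flows, bin_seconds=2):
    """Count flows per (bin start, source, destination), keeping direction."""
    counts = Counter()
    for stime, saddr, daddr in flows:
        # Align the start time to the beginning of its bin.
        bin_start = int(stime // bin_seconds) * bin_seconds
        counts[(bin_start, saddr, daddr)] += 1
    return counts


# Hypothetical flows: (start time in seconds, source address, destination address)
flows = [
    (100.2, "10.0.0.1", "10.0.0.2"),
    (100.9, "10.0.0.1", "10.0.0.2"),
    (101.5, "10.0.0.2", "10.0.0.1"),  # reverse direction, counted separately
    (102.1, "10.0.0.1", "10.0.0.2"),
]

pair_counts = count_per_pair(flows)
for key, trans in sorted(pair_counts.items()):
    print(*key, trans)
```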

> 
> serror_rate:  % of connections that have ``SYN'' errors,  % of connections that have “SYN” errors in Count feature
> rerror_rate:  % of connections that have ``REJ'' errors

What are SYN or REJ errors?  Are these connections where you see SYNs but no response?
Or is a SYN that gets a RST a REJ error?  Pretty weird, since there is no REJ
state in TCP.

> same_srv_rate:  % of connections to the same service, % of connections to the same service in Count feature

These are not statistics.  The % of connections to the same service is, by definition, 100%.
It's a selection bias problem:
   sum(connections to same services)/sum(connections to same services)

Do they mean "% service connections" listed by service???  This would give you
a percent of the total going to each service.

   sum(connections to service X)/sum(connections)

What is a service?  A destination port number?

This will do that:

   racluster -r file -w - - udp or tcp | racluster -M dsrs="-agr" rmon -m proto sport -w connections.service

Sort the output by the service port and protocol.

   rasort -M replace -m sport proto -r connections.service

To print out the percentages, to 3 decimal places:

   ra -% -n -p3 -r connections.service -s stime dur proto sport trans
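
To make the ratio sum(connections to service X)/sum(connections) concrete,
here is a short Python sketch.  It assumes a "service" is a (protocol,
destination port) pair, which is only one possible reading of the question,
and the function name and sample records are made up for illustration.

```python
from collections import Counter


def service_percentages(flows):
    """Return each service's share of all connections, in percent."""
    counts = Counter(flows)
    total = sum(counts.values())
    if total == 0:
        return {}  # no connections, so no percentages to report
    return {service: 100.0 * n / total for service, n in counts.items()}


# Hypothetical connections, one (protocol, destination port) pair each
flows = [("tcp", 80), ("tcp", 80), ("tcp", 443), ("udp", 53)]

shares = service_percentages(flows)
for (proto, port), pct in sorted(shares.items()):
    print(f"{proto:>4} {port:>5} {pct:.3f}")
```

The shares add up to 100 across all services, which is what separates this
from the "same service over same service" ratio that is always 100%.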

> diff_srv_rate:  % of connections to different services in Count feature

Again, the % of connections to different services, I have no idea what that means.

> 
> srv_count:  number of connections to the same service as the current connection in the past two seconds 
> 
> srv_serror_rate:  % of connections that have ``SYN'' errors, % of connections that have “SYN” errors in Srv_count(the number of connections whose service type is the same to that of the current connection in the past two seconds) feature
> srv_rerror_rate:  % of connections that have ``REJ'' errors 
> srv_diff_host_rate:  % of connections to different hosts
> 
> Any tip will be great.
> 
> -- 
> Oğuz Yarımtepe
> http://about.me/oguzy
