yet another kdd cup question

John Gerth gerth at graphics.stanford.edu
Mon Sep 30 14:06:06 EDT 2013


KDD (Knowledge Discovery and Data Mining) is one of the premier data mining conferences.
That said, it's a sad commentary that they're still using the ancient 1999 dataset.
The threats in the dataset were cutting edge in 1999, but are now almost totally irrelevant.
Having automated algorithms to discover what's in that dataset is about as important as
wanting to have good ways of detecting buffalo stampedes on the Great Plains.

Unfortunately, the lack of good datasets is probably the major impediment to academics
wanting to do research in network security. This has been an intractable problem
for more than a decade, but getting into that is a whole story in itself.

--
John Gerth      gerth at graphics.stanford.edu  Gates 378   (650) 725-3273

On 9/30/13 9:12 AM, Carter Bullard wrote:
> Hey Oğuz,
> I don't know what the KDD whatever is, and I'm thinking that
> it's a pretty odd set of observations that they want to do some
> expectation-based anomaly detection on?
> 
> These are not well-defined metrics, so I'll try to respond; but
> understand that, whatever it is they think they are doing, there
> is a better way to do anomaly detection.
> 
> Suggestions inline.
> Carter
> 
> 
> On Sep 30, 2013, at 7:41 AM, Oğuz Yarımtepe <oguzyarimtepe at gmail.com <mailto:oguzyarimtepe at gmail.com>> wrote:
> 
>> It seems payload analysis is not a good approach in my situation, because I will be testing the algorithm against DDoS attacks.
>>
>> So, better to ask: how can I calculate the features below using Argus?
>>
>> count: number of connections to the same host as the current connection in the past two seconds,
> 
> So, if you have the current connection, then you'll want to print out the
> number of concurrent connections involving the two hosts of that connection?
> Over the life of the connection, or for the day, or the 2 seconds prior to this
> connection?  And what is "prior" for a connection that lasts 1 month: is it still
> -2 seconds from SYN time, or -2 seconds from SYN/ACK time?
> 
> Too weird, so let's count the concurrent connections to every host in a data file,
> for any 2 second period, and then you can query that list when you pick a connection
> of interest.  At least that file can be the source of some anomaly detection
> data that may be useful.
> 
> rabins() is the tool of choice here.  We'll split the file into 2 second intervals
> and aggregate the transactions per interval.  We do that in a single call to rabins():
> 
>    % rabins -M time 2s -r file -w transactions.per.2s.period
> 
> But already we're going to have to ask: any protocol, any interaction?
> For this exercise, you may want to limit it to IP:
> 
>    % rabins -M time 2s -r file -w transactions.per.2s.period - ip
> 
> 
> Then we'll count the total number of "trans" actions per host, for each 2 second
> period.  From that output you can pick the host you're interested in.
> 
>    % rabins -M time 2s -r transactions.per.2s.period -M dsrs="-agr" \
>           -M rmon -m saddr -w transactions.per.host.2s.period
> 
> This second rabins() will account for single objects, " -M rmon ", since
> we're only interested in a single IP address.  We'll aggregate based on the
> new flow model " -m saddr ", which will count the occurrences of each
> different transaction to each specific host.  The ' -M dsrs="-agr" ' is
> critical to getting the right numbers, as it clears the aggregation stats
> that were generated by the first call to rabins().
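The two rabins() stages above amount to: bin records into 2-second intervals, then count transactions per host. A minimal Python sketch of that logic, using hypothetical (stime, saddr, daddr) tuples rather than real Argus records:

```python
from collections import Counter

# Hypothetical flow records (stime_seconds, saddr, daddr); these stand in
# for Argus records -- the tuples and addresses are made up for illustration.
flows = [
    (0.1, "192.168.0.66", "10.0.0.1"),
    (0.8, "192.168.0.66", "10.0.0.2"),
    (1.5, "192.168.0.68", "10.0.0.1"),
    (2.2, "192.168.0.66", "10.0.0.3"),
]

def per_host_counts(flows, bin_size=2.0):
    """Count transactions per host in each bin_size-second interval.
    Both endpoints are counted, mimicking the rmon-style single-object view."""
    counts = Counter()
    for stime, saddr, daddr in flows:
        b = bin_size * int(stime // bin_size)  # start of the 2s interval
        counts[(b, saddr)] += 1
        counts[(b, daddr)] += 1
    return counts

print(per_host_counts(flows))
```

This is only the shape of the computation; rabins() does the same binning and aggregation directly on the flow records, at much higher speed.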
> 
> What you will have in the end is a file that accounts for the concurrent
> transactions for each host, every 2 seconds.  From here, you can pick a
> connection, grab the hosts of interest from the connection, and maybe the
> start and stop times, and then query the transactions.per.host.2s.period
> file for your answer:
> 
>    % ra -r transactions.per.host.2s.period -s stime dur trans saddr - host x.y.z.w
> 
> This will give you the complete connection history for that host through the
> whole accounting period.  If you want a specific time range:
> 
>    % ra -t time-range -r transactions.per.host.2s.period -s stime dur trans saddr - host x.y.z.w
> 
> The " trans " field will tell you the number of connections.
> 
> 
> If you have gargantuan files, and you want to limit the metrics to just 2 seconds at a time,
> you'll want to use rasqltimeindex(), to get the range data.  You'll still need the rabins(),
> but that's out of the scope of this response...
> 
>    % rasql -t time-range -w - - filter | rabins -M time 2s -w - | \
>           rabins -M time 2s -M dsrs="-agr" -M rmon -m saddr -w transactions.this.host.time.range
> 
> This will go very fast.  You get the time-range (stime-etime) from the
> connection of interest.  Now, you have to be sensitive to the ARGUS_FAR_STATUS_INTERVAL,
> as you have to ensure that you get all the data that hits your 2 second interval.
> So grab the stime, and back up a complete ARGUS_FAR_STATUS_INTERVAL + the 2 seconds, to
> get data for the complete range.  You can do this easily with rasql().  Let's try a sample
> stime of 2013/04/12.12:04:55 and an ARGUS_FAR_STATUS_INTERVAL of 5 seconds:
> 
>       % rasql -t 2013/04/12.12:04:55-7s
> 
> will do the trick.
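The lookback arithmetic above is just the status interval plus the bin size; a sketch using the example values from the text (5-second ARGUS_FAR_STATUS_INTERVAL, 2-second bins):

```python
# Example values from the text; the status interval is configuration-dependent.
FAR_STATUS_INTERVAL = 5  # seconds
BIN = 2                  # seconds

# Back up the query start by the full status interval plus one bin so
# every record overlapping the 2-second window is retrieved.
lookback = FAR_STATUS_INTERVAL + BIN
print(lookback)  # 7 seconds, hence the "-7s" in the rasql time range
```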
> 
> Add a filter to get data that just deals with the host or hosts of interest.
> So, this basically means that if your traffic is coming into a standard argus
> archive, and you have rastream() running rasqltimeindex() on the files that
> are being ingested, then you can do this.
> 
>    With the connection of interest, get its stime, and host address, then run:
> 
> % time rasql -t 2013/04/12.12:04:55-7s -w - - ip and host 192.168.0.68 | \
>     rabins -M time 2s -w - | \
>     rabins -M dsrs="-agr" time 2s hard rmon -m saddr -s stime dur saddr trans
> 
>                  StartTime        Dur               Host  Trans 
> 2013/04/12.12:04:48.000000   2.000000       192.168.0.66      2
> 2013/04/12.12:04:48.000000   2.000000       192.168.0.68      2
> 2013/04/12.12:04:50.000000   2.000000       192.168.0.68      3
> 2013/04/12.12:04:50.000000   2.000000       192.168.0.66      2
> 2013/04/12.12:04:50.000000   2.000000       192.168.0.70      1
> 2013/04/12.12:04:52.000000   2.000000       192.168.0.66      2
> 2013/04/12.12:04:52.000000   2.000000       192.168.0.68      2
> 2013/04/12.12:04:54.000000   2.000000       192.168.0.66      6
> 2013/04/12.12:04:54.000000   2.000000       192.168.0.68      6
> 
> real    0m0.113s
> user    0m0.069s
> sys     0m0.026s
> 
> 
>> the number of connections whose source IP address and destination IP address are the same as those of the current connection in the past two seconds
> 
> This is similar to the one above, but because you want to preserve "X -> Y", you
> don't use the " -M rmon " option in the second run of rabins().
> 
>    % rabins -M time 2s -r transactions.per.2s.period -M dsrs="-agr" \
>          -m saddr daddr -w transactions.per.hostpair.2s.period
> 
> 
> This second rabins() will aggregate based on the new flow model " -m saddr daddr ",
> which will count the occurrences of direction-specific host -> host transactions.
> The ' -M dsrs="-agr" ' is critical to getting the right number into the "trans" field.
> 
> The " trans " field, again, will tell you the number of connections.
> 
>    % ra -r transactions.per.hostpair.2s.period -s stime dur trans saddr - src host x.y.z.w and dst host w.z.y.x
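The direction-preserving variant keys each count on the (saddr, daddr) pair instead of a single address; a minimal Python sketch of that aggregation, again with hypothetical flow tuples in place of Argus records:

```python
from collections import Counter

# Hypothetical (stime_seconds, saddr, daddr) tuples, made up for illustration.
flows = [
    (0.1, "192.168.0.66", "10.0.0.1"),
    (0.8, "192.168.0.66", "10.0.0.1"),
    (1.5, "192.168.0.68", "10.0.0.1"),
    (2.2, "192.168.0.66", "10.0.0.1"),
]

def pair_counts(flows, bin_size=2.0):
    """Direction-specific saddr -> daddr transaction counts per interval,
    i.e. the ' -m saddr daddr ' aggregation without rmon."""
    counts = Counter()
    for stime, saddr, daddr in flows:
        b = bin_size * int(stime // bin_size)  # start of the 2s interval
        counts[(b, saddr, daddr)] += 1
    return counts

print(pair_counts(flows))
```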
> 
>>
>> serror_rate:  % of connections that have ``SYN'' errors,  % of connections that have “SYN” errors in Count feature
>> rerror_rate:  % of connections that have ``REJ'' errors
> 
> What are SYN or REJ errors?  Are these connections where you see SYNs but no response?
> Or is a SYN that gets a RST a REJ error?  Pretty weird, since there is no REJ
> state in TCP.
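For what it's worth, the KDD99 features are commonly read so that a "SYN error" means a connection in state S0 (SYN seen, no reply) and "REJ" means a SYN answered by a RST; that reading is an assumption here, but under it the rates are simple ratios over the window:

```python
def error_rates(states):
    """Compute serror_rate and rerror_rate for a window of connections.
    states holds one TCP summary state per connection:
      "s0"  - SYN seen, no reply      (counted as a SYN error)
      "rej" - SYN answered by a RST   (counted as a REJ error)
      "est" - anything that completed the handshake
    These state names are illustrative, not Argus field values."""
    n = len(states)
    serror = sum(s == "s0" for s in states) / n
    rerror = sum(s == "rej" for s in states) / n
    return serror, rerror

print(error_rates(["s0", "rej", "est", "s0"]))  # (0.5, 0.25)
```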
> 
>> same_srv_rate:  % of connections to the same service, % of connections to the same service in Count feature
> 
> These are not statistics.  The % of connections to the same service is, by definition, 100%.
> It's a selection bias problem:
>    sum(connections to same services)/sum(connections to same services)
> 
> Do they mean "% service connections" listed by service?  This would give you
> a percentage of the total going to each service:
> 
>    sum(connections to service X)/sum(connections)
> 
> What is a service, as destination port number?
> 
> This will do that:
> 
>    racluster -r file -w - - udp or tcp | racluster -M dsrs="-agr" rmon -m proto sport -w connections.service
> 
> Sort the output by the service port and protocol.
> 
>    rasort -M replace -m sport proto -r connections.service
> 
> To print out the percentages, to 3 decimal places:
> 
>    ra -% -n -p3 -r connections.service -s stime dur proto sport trans
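The "percentage of the total going to each service" computation is just per-service counts over the grand total; a minimal Python sketch with hypothetical (proto, port) pairs standing in for Argus service fields:

```python
from collections import Counter

# Hypothetical (proto, dport) pairs; made up for illustration.
conns = [("tcp", 80), ("tcp", 80), ("tcp", 443), ("udp", 53)]

def service_percentages(conns):
    """sum(connections to service X) / sum(connections), per service,
    reported as a percentage rounded to 3 places."""
    totals = Counter(conns)
    n = len(conns)
    return {svc: round(100.0 * cnt / n, 3) for svc, cnt in totals.items()}

print(service_percentages(conns))
```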
> 
> 
> 
>> diff_srv_rate:  % of connections to different services in Count feature
> 
> Again, "% of connections to different services": I have no idea what that means.
> 
>>
>> srv_count:  number of connections to the same service as the current connection in the past two seconds
>>
>> srv_serror_rate:  % of connections that have ``SYN'' errors, % of connections that have “SYN” errors in Srv_count(the number of connections whose
>> service type is the same to that of the current connection in the past two seconds) feature
>> srv_rerror_rate:  % of connections that have ``REJ'' errors
>> srv_diff_host_rate:  % of connections to different hosts
>>
>> Any tip will be great.
>>
>> -- 
>> Oğuz Yarımtepe
>> http://about.me/oguzy
> 


