yet another kdd cup question

Carter Bullard carter at qosient.com
Wed Oct 2 11:32:47 EDT 2013


Hey Oğuz,
The KDD Cup 1999 connection data was not a good data schema
for detecting unauthorized access to computers.  While good
people were involved, what they were doing in the competition
should not be considered a learned approach to cyber intrusion
detection.

Possibly the cup could be described as a good exercise in classifier
analytics based on a curious data type.  If you look at the cup
results, you'll see that it's really the strategies that are
interesting.  But, in the end, none of the methods was significantly
better than the simplest "1-nearest neighbor classifier" strategy.

I personally would not recommend that you spend any time on
the KDD Cup 1999 data schema, unless you want to replicate their
indirect conclusion, that with this type of data, you're not going
to do very well.  The concept of "same host" is just not important.
The concept of "same service" is just not relevant to the mechanisms
of intrusion.  And where did 2 seconds come from ???  It's all a pretty
weird data presentation for a pretty weird set of simulated activity.

I've been a non-contributing member of the ACM KDD SIG for
over 15 years, which means I read the SIGKDD Explorations
journal when it comes out, and I've gone to a few meetings.  So I
am a KDD wannabe.  However, I'm pretty comfortable saying that the
KDD Cup 1999 results, which were very poor, by the way, indicate to
me that the data used for training and testing was a poor starting point.
That hasn't changed in 14 years.  So the winner had a 25.4% failure
rate for detecting "normal" traffic.  This is not useful.

The KDD Cup attracts some serious people to focus on a set of data,
and these guys with this data didn't do a particularly useful job.
I don't think it was their fault.  I think the exercise proved that
with this data schema, you're not going to do a good job.

Now, if you can get the original 4GB tcpdump data files, then
that would be something to spend some time on.

Carter




On Oct 2, 2013, at 7:29 AM, Oğuz Yarımtepe <oguzyarimtepe at gmail.com> wrote:

> Hi,
> 
> I figured it out a bit. A line from the KDD Cup data set, showing the value of each attribute, gave me the idea.
> 
> 0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
> 
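To make the sample line above easier to read, here is a minimal Python sketch that pairs each value with a field name. The column order is an assumption taken from my recollection of the published kddcup.names file, where flag comes fourth (before src_bytes), so please check it against your own copy. The names KDD_COLUMNS and parse_record are made up for this illustration.

```python
# Column order assumed from the published kddcup.names file (verify locally).
KDD_COLUMNS = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files",
    "num_outbound_cmds", "is_host_login", "is_guest_login",
    "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
    "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]


def parse_record(line):
    """Split one KDD Cup 1999 record into a {column: string value} dict."""
    # The label carries a trailing period ("normal."), which is dropped here.
    values = line.strip().rstrip(".").split(",")
    if len(values) != len(KDD_COLUMNS):
        raise ValueError(
            f"expected {len(KDD_COLUMNS)} fields, got {len(values)}")
    return dict(zip(KDD_COLUMNS, values))


SAMPLE = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,"
          "0.00,0.00,0.00,0.00,normal.")

record = parse_record(SAMPLE)
for name in ("duration", "protocol_type", "service", "flag",
             "src_bytes", "dst_bytes", "count", "srv_count", "label"):
    print(f"{name:15s} {record[name]}")
```

Values are kept as strings on purpose; converting the continuous fields to numbers is left to whatever tool consumes them.
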
> 
> Let's check the first attributes I am interested in, and then the last ones I am also interested in.
> 
> duration 	length (number of seconds) of the connection 	continuous
> protocol_type 	type of the protocol, e.g. tcp, udp, etc. 	discrete
> service 	network service on the destination, e.g., http, telnet, etc. 	discrete
> src_bytes 	number of data bytes from source to destination 	continuous
> dst_bytes 	number of data bytes from destination to source 	continuous
> flag 	normal or error status of the connection 	discrete 
> land 	1 if connection is from/to the same host/port; 0 otherwise 	discrete
> wrong_fragment 	number of ``wrong'' fragments 	continuous
> urgent 	number of urgent packets 	continuous
> 
> 
> Looking at these attributes, I think Argus can calculate many of them. I am not sure about the number of wrong fragments. I checked the ra documentation and saw that it is possible to display the flags attributes, which include an Urgent flag value. But wrong_fragment is the one I am not sure about. Any idea how I can calculate it?
> 
> And now the more ambiguous ones
> 
> feature name	description 	type
> count 	number of connections to the same host as the current connection in the past two seconds: Since it is calculated for the current connection, I think it is the number of connections whose source IP address and destination IP address are the same as those of the current connection in the past two seconds, meaning the 2 seconds prior to this connection.	continuous
> 
> Note: The following  features refer to these same-host connections.	
> serror_rate 	% of connections that have ``SYN'' errors: I found some information in the Bro-IDS documentation. This is how they display the status of a connection/flow in conn.log.
> S0: Connection attempt seen, no reply.
> S1: Connection established, not terminated.
> SF: Normal establishment and termination. Note that this is the same symbol as for state S1. You can tell the two apart because for S1 there will not be any byte counts in the summary, while for SF there will be.
> REJ: Connection attempt rejected.
> RSTO: Connection established, originator aborted (sent a RST).
> RSTR: Established, responder aborted.
> So RSTO and RSTR could be SYN errors. REJ is the one mentioned above: a connection attempt is made but it is rejected. Is there a flag to see this event?
> continuous
> rerror_rate 	% of connections that have ``REJ'' errors 	continuous
> same_srv_rate 	% of connections to the same service: This is the percentage, and by service I assume the port number. So within the two-second window, the number of connections or connection attempts to the same port, divided by the count calculated above, will give the percentage, I think
> continuous
> diff_srv_rate 	% of connections to different services: This will be 1 - above_percentage, I think
> continuous
> srv_count 	number of connections to the same service as the current connection in the past two seconds: It is already calculated above 
> continuous
> 
> Note: The following features refer to these same-service connections.	
> srv_serror_rate 	% of connections that have ``SYN'' errors: These will be calculated by looking at the port number and in two seconds period.  
> continuous
> srv_rerror_rate 	% of connections that have ``REJ'' errors 	continuous
> srv_diff_host_rate	% of connections to different hosts 
> 
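The two-second features listed above can be sketched in one pass over time-sorted flow records, before anything is written to a database. This is only an illustration under stated assumptions, not Argus or ra functionality: the Conn record and window_features function are hypothetical, "same host" is read as same destination address (use a (src, dst) key for the source+destination reading), "service" is whatever label or port you store, the current connection counts toward its own window, and the S0-S3 states are treated as ``SYN'' errors while REJ is the ``REJ'' error.

```python
from collections import deque
from dataclasses import dataclass

WINDOW_SECONDS = 2.0
SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}  # assumption: states counted as SYN errors
REJ_FLAGS = {"REJ"}                         # assumption: states counted as REJ errors


@dataclass
class Conn:
    """Hypothetical flow record; fields would come from your own flow export."""
    time: float    # connection timestamp in seconds
    src: str
    dst: str
    service: str   # service label or destination port
    flag: str      # connection state such as SF, S0, REJ


def rate(conns, pred):
    """Fraction of conns satisfying pred; 0.0 for an empty list."""
    return sum(1 for c in conns if pred(c)) / len(conns) if conns else 0.0


def window_features(conns):
    """Yield (conn, features) for connections sorted by ascending time."""
    window = deque()
    for cur in conns:
        window.append(cur)
        # Evict everything older than two seconds before the current connection.
        while window[0].time < cur.time - WINDOW_SECONDS:
            window.popleft()
        same_host = [c for c in window if c.dst == cur.dst]
        same_srv = [c for c in window if c.service == cur.service]
        yield cur, {
            "count": len(same_host),
            "srv_count": len(same_srv),
            "serror_rate": rate(same_host, lambda c: c.flag in SYN_ERROR_FLAGS),
            "rerror_rate": rate(same_host, lambda c: c.flag in REJ_FLAGS),
            "same_srv_rate": rate(same_host, lambda c: c.service == cur.service),
            "diff_srv_rate": rate(same_host, lambda c: c.service != cur.service),
            "srv_serror_rate": rate(same_srv, lambda c: c.flag in SYN_ERROR_FLAGS),
            "srv_rerror_rate": rate(same_srv, lambda c: c.flag in REJ_FLAGS),
            "srv_diff_host_rate": rate(same_srv, lambda c: c.dst != cur.dst),
        }


flows = [
    Conn(0.0, "a", "h1", "http", "SF"),
    Conn(0.5, "a", "h1", "http", "S0"),
    Conn(1.0, "b", "h1", "smtp", "REJ"),
    Conn(1.5, "a", "h2", "http", "SF"),
    Conn(4.0, "a", "h1", "http", "SF"),
]
results = [feats for _, feats in window_features(flows)]
for conn, feats in zip(flows, results):
    print(conn.time, conn.dst, conn.service, feats["count"], feats["srv_count"])
```

Because the window is a deque that only ever holds the last two seconds of records, the same loop could run on stored flow data or on a live feed, and the per-connection dictionary is what would be saved alongside the basic attributes.
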
> My plan was to listen on a mirrored port and save the calculated data to a db. I am not sure whether I should calculate all the properties in one pass and save them to the db. What do you suggest? Should I first capture the GBit traffic, save it in Argus format, then work on it with Argus commands and save the results to the db?
> 
> Or should I save whatever I can calculate with ra directly to the db, and then run some other scripts to calculate the percentages and the two-second features? But saving to the db will use a 1 minute time interval by default, I guess, so I would need to do something about the two-second window. I am really not sure. What do you suggest?
> 
> I am not dying to use these attributes, but unfortunately it is a dataset that is still in use. Just in case, it is better to have some solution for my problem.
> 
> Thank you.
> 
> -- 
> Oğuz Yarımtepe
> http://about.me/oguzy
