new ranonymize() tool

Fri Oct 11 08:58:30 EDT 2002

Hey Peter,
I should describe how ranonymize() anonymizes IPv4 addresses
so we can see what kind of problems might exist.  ranonymize()
provides several methods for address anonymization; I'll describe
the default one to start.

IPv4 addresses are anonymized using a non-cryptographic, Class-based,
24-bit prefix-preserving sequential allocation strategy, which
is not distributable.  So what does this mean?  All IPv4
addresses are treated as having a 24-bit netmask.  Each unique
24-bit network address is translated to a reserved 24-bit network
address from the same Class, sequentially, on a first-come basis.  So a
Class A address is assigned a reserved Class A network part, and a
Class B address is assigned a reserved Class B network part.  These
addresses are allocated sequentially, so the first Class A address
encountered in a stream will get 1.0.1, and the second will get 1.0.2.
Class B's start with 100.0.1, and Class C's start with 197.0.1.
Multicast addresses start with 224.0.1.  You can specify exceptions,
and specific net or complete address translations, so there is some
flexibility.

The 8-bit host part is allocated sequentially, starting with 1.
I've included an example below.

Once an address has been allocated, any occurrence of that address
in any part of an argus record in the stream is translated to the
new anonymized address, using a hashed lookup strategy.
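
To make the allocation concrete, here is a minimal Python sketch of the
default strategy as described above.  It is not the ranonymize() source:
the class name, the dictionary layout and the helper addr_class() are
assumptions made for illustration, and exceptions, user-specified
translations and counter overflow are left out.

```python
# Starting first octet per class, taken from the description above.
CLASS_BASE = {"A": 1, "B": 100, "C": 197, "M": 224}


def addr_class(first_octet):
    """Classful category of an IPv4 address, from its first octet."""
    if first_octet < 128:
        return "A"
    if first_octet < 192:
        return "B"
    if first_octet < 224:
        return "C"
    return "M"  # multicast and above, lumped together in this sketch


class SequentialAnonymizer:
    """Hypothetical first-come, first-served translation table."""

    def __init__(self):
        self.net_map = {}    # real 24-bit prefix -> anonymized prefix
        self.addr_map = {}   # real address -> anonymized address
        self.next_net = {c: 1 for c in CLASS_BASE}  # per-class counter
        self.next_host = {}  # per-prefix counter for the host byte

    def anonymize(self, addr):
        # Hashed lookup: an address already seen keeps its translation.
        if addr in self.addr_map:
            return self.addr_map[addr]
        octets = [int(x) for x in addr.split(".")]
        prefix = tuple(octets[:3])
        if prefix not in self.net_map:
            cls = addr_class(octets[0])
            n = self.next_net[cls]
            self.next_net[cls] = n + 1
            # base.0.1, base.0.2, ...; overflow is not handled here.
            self.net_map[prefix] = (CLASS_BASE[cls], n >> 8, n & 0xFF)
            self.next_host[prefix] = 1
        host = self.next_host[prefix]
        self.next_host[prefix] = host + 1
        new = ".".join(str(x) for x in self.net_map[prefix] + (host,))
        self.addr_map[addr] = new
        return new
```

Because the counters advance in arrival order, feeding the same
addresses in a different order yields a different table, which is why
the default method is not distributable.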

This approach provides a Class preserving, 24-bit prefix preserving
anonymizing strategy that is pseudo-random, but not distributable.

Since addresses arrive in an argus() stream somewhat randomly,
you get a pseudo-random assignment.  This helps to ensure that
two independent anonymizers using the same algorithms, seeds and
everything, anonymizing argus data streams from differing parts of
the network, will not anonymize transactions to the same anonymized
addresses.  However, for research purposes, this may not be what
we're looking for, and a keyed version of this should allow us to
provide distributable anonymization.
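
One way a keyed version could look is sketched below in Python.  This
is purely an assumption about what "keyed" might mean, not something
ranonymize() is known to do: it applies a keyed permutation (a small
Feistel network with HMAC-SHA256 as the round function) to the 24-bit
network part, so two anonymizers sharing a key map the same network to
the same value regardless of arrival order.  The function names are
made up, the host byte is left unchanged, and Class preservation is
not attempted.

```python
import hashlib
import hmac


def _round(key, rnd, half):
    """12-bit keyed round function built from HMAC-SHA256."""
    msg = bytes([rnd]) + half.to_bytes(2, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big") & 0xFFF


def permute_prefix(key, prefix24, rounds=4):
    """Keyed bijection on 24-bit values (balanced Feistel network)."""
    left, right = prefix24 >> 12, prefix24 & 0xFFF
    for rnd in range(rounds):
        # Each round is invertible, so the whole mapping is one-to-one.
        left, right = right, left ^ _round(key, rnd, right)
    return (left << 12) | right


def anonymize(key, addr):
    """Translate the 24-bit network part; keep the host byte as-is."""
    o = [int(x) for x in addr.split(".")]
    prefix = (o[0] << 16) | (o[1] << 8) | o[2]
    p = permute_prefix(key, prefix)
    return "%d.%d.%d.%d" % (p >> 16, (p >> 8) & 0xFF, p & 0xFF, o[3])
```

Since the mapping depends only on the key and the address, no
translation table has to be kept or shared, at the cost of losing the
arrival-order randomness described above.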

Simple method, pretty fast.  It uses memory, so persistent anonymization
will need room for a growing translation table.  So what do you think?
If I gave you the traces below, am I in trouble?  (Byte and packet
counts are the only metrics not anonymized, and differential stats
like transaction duration are also preserved, so there are opportunities
for comparison.  But if the bad guys are not on the same network, it's
going to be a challenge to find common transactions, and if they
break one 24-bit network assignment, they don't get any others.)

Carter

[qosient at isis tmp]$ ra !*
 ra -nr argus.out -p3 -s startime proto saddr sport dir daddr dport status

         StartTime      Type     SrcAddr      Sport Dir     DstAddr Dport State
2002/10/08.15:59:53.759  tcp   192.168.0.161.1661    -> 66.12.27.73.5190   CON
2002/10/08.15:59:54.748  tcp    192.168.0.64.3997    -> 215.92.197.167.110    FIN
2002/10/08.15:59:54.760  tcp    192.168.0.64.3999    -> 216.46.170.10.110    FIN
2002/10/08.16:00:01.124  tcp    192.168.0.64.4002    -> 236.92.197.167.110    FIN
2002/10/08.16:00:35.634  udp    192.168.0.16.1102   <-> 149.192.0.38.53     CON
2002/10/08.16:00:35.657  tcp   192.168.0.161.1835    -> 66.94.185.200.80     CON
2002/10/08.16:00:41.109  tcp   192.168.0.161.1835    -> 66.94.185.200.80     RST
2002/10/08.16:00:48.652  tcp   192.168.0.161.1656    -> 62.124.26.194.5190  CON
2002/10/08.16:00:51.766  tcp   192.168.0.161.1661    -> 61.12.27.73.5190   CON

[qosient at isis tmp]$ ranonymize !*
ranonymize -nr argus.out -p3 -s startime proto saddr sport dir daddr dport status

         StartTime      Type     SrcAddr      Sport Dir     DstAddr Dport State
1996/05/05.00:41:05.297  tcp       197.0.1.3.13461   -> 1.0.2.1.16990  CON
1996/05/05.00:41:06.286  tcp       197.0.1.4.15797   -> 197.0.2.1.110    FIN
1996/05/05.00:41:06.297  tcp       197.0.1.4.15799   -> 197.0.3.1.110    FIN
1996/05/05.00:41:12.661  tcp       197.0.1.4.15802   -> 197.0.2.1.110    FIN
1996/05/05.00:41:47.172  udp       197.0.1.5.12902  <-> 100.0.1.1.53     CON
1996/05/05.00:41:47.194  tcp       197.0.1.3.13635   -> 1.0.3.1.80     CON
1996/05/05.00:41:52.646  tcp       197.0.1.3.13635   -> 1.0.3.1.80     RST
1996/05/05.00:42:00.189  tcp       197.0.1.3.13456   -> 1.0.4.1.16990  CON
1996/05/05.00:42:03.304  tcp       197.0.1.3.13461   -> 1.0.2.1.16990  CON

-----Original Message-----
From: owner-argus-info at lists.andrew.cmu.edu
[mailto:owner-argus-info at lists.andrew.cmu.edu] On Behalf Of Peter Van Epp
Sent: Thursday, October 10, 2002 4:16 PM
To: argus
Subject: Re: new ranonymize() tool

	Without (yet) having looked at Carter's new tool, here are some
thoughts on this subject from a discussion some months ago about putting
Argus up locally and being able to release the traffic traces for
network researchers. Note that in this case we want to keep at least
destination port numbers, to allow researchers to determine what kind of
traffic it was, and keep the time synchronization (possibly offset by a
constant amount to obscure it slightly). A later look over the CAIDA web
site indicates they don't have a solution either; the anonymizer they
use is fairly simple and doesn't appear to address the issues raised
below.

	A fly in the anonymous ointment. Unfortunately I thought about the
issue of anonymizing trace data on the way back to the hill. It is
essentially cryptography (we want to encrypt the data but not decrypt
it), which is unfortunately trivially subject to a chosen plaintext
attack which will defeat the encryption (and thus the anonymity).
	If we postulate the following users: I (innocent victim) and A
(scumbag attacker), and sites AS (attacker's site), IS (innocent
victim's site), P1 (porno site 1) and P2 (porno site 2), then look at
the possibilities in anonymized trace data, we find a problem. Assume we
have anonymized both IP addresses by random translation and shifted time
by a fixed amount to try and defeat traffic pattern analysis, as we
discussed this morning. Unfortunately, since we are on a public network,
if we assume the attacker can identify the victim and determine the IP
address the victim is using, then our entire scheme can be defeated as
follows:

A pings (logging the current time on machine AS) the victim's machine
IS, P1, and P2. He may need to ping in an unusual pattern to make the
pattern stand out in the anonymized logfile. Now the attacker obtains
the anonymized trace file for the time period described above. By
sorting all the data by source and dest IP address he can pick out the
ping pattern that he initiated above. He knows his IP address (and now
what his IP address has translated into in the anonymous trace; no net
gain here). Unfortunately, from the first ping made by his machine
(whose anonymous ID he now knows) he has identified the anonymized IP
address of the victim's machine IS. The next 2 pings give him the
anonymized IP addresses of porn sites P1 and P2. Now a search of the
trace file for connections from anonymized IS to anonymized P1 and P2
will tell the attacker if the victim IP address has accessed the porn
sites, which is what we are trying to prevent. On the way by (given the
time stamps in our trace file and the real time from his local log) he
has also extracted the fixed time offset we used and can trivially
convert the trace file back to real time. I'm not sure that's deadly,
but it does make the time shift idea not really useful for defeating
traffic analysis attacks.
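
The search step of this attack can be sketched in a few lines of
Python. This is an illustration only: the function name find_marker(),
the (timestamp, src, dst) record layout and the tolerance value are
assumptions, not part of any Argus tool. It looks for one source whose
consecutive flows are spaced exactly like the probes the attacker
logged, then reads off the anonymized addresses and the time offset.

```python
from collections import defaultdict


def find_marker(trace, real_times, tol=0.01):
    """Locate the attacker's probe pattern in an anonymized trace.

    trace:      iterable of (timestamp, src, dst), addresses anonymized
    real_times: probe send times from the attacker's own log, in order
    Returns (anon_src, anon_dsts, time_offset), or None if not found.
    """
    # The spacing between probes survives a constant time shift.
    gaps = [b - a for a, b in zip(real_times, real_times[1:])]
    by_src = defaultdict(list)
    for ts, src, dst in trace:
        by_src[src].append((ts, dst))
    n = len(real_times)
    for src, recs in by_src.items():
        recs.sort()
        # Slide a window of n consecutive flows from this source.
        for i in range(len(recs) - n + 1):
            window = recs[i:i + n]
            seen = [b[0] - a[0] for a, b in zip(window, window[1:])]
            if all(abs(s - g) <= tol for s, g in zip(seen, gaps)):
                offset = window[0][0] - real_times[0]
                return src, [dst for _, dst in window], offset
    return None
```

With the returned destination list, the first entry is the anonymized
victim address and the rest are the anonymized sites, so one more pass
over the trace for flows from the former to the latter completes the
attack.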
	This may make an interesting problem for a grad student interested
in crypto since there may be a solution (although I have a sneaking
suspicion that because of the uncontrolled nature of the public net
there isn't ...). We should also ask the CAIDA folks how they deal with
this problem in their traces (or if indeed they have thought of this
issue, although I hope they have). We do need to make the risk clear to
the bosses that have to approve this being done. I'm pretty sure Worth
was assuming that I meant that the data would be anonymous (which I just
demonstrated it isn't) when he said he thought he could get permission
to release our traces. In the end all it may mean is that we have to
restrict distribution of trace files more than we would like (i.e.
researchers in I2 and elsewhere may not be deemed safe enough ...).
	Happy paranoia day :-)

Peter Van Epp / Operations and Technical Support 
Simon Fraser University, Burnaby, B.C. Canada