new ranonymize() tool

Thu Oct 10 08:35:02 EDT 2002

Gentle people,
   I'd like to get some conversation going on ranonymize(). It's
a very interesting tool for scrambling argus data so that the data
retains enough semantics to still be analyzed, but is anonymized
enough that it can be shared as well.

   I think the goal of anonymization is to minimize discovery and
traffic engineering capabilities from argus data when you share
it, or even store it for long periods of time. This means that
obvious identifiers, such as addresses and port numbers, need to
be modified, but so do non-obvious values like TCP base sequence
numbers, ESP spi values, and TTLs.  Because the purpose of sharing
argus data is generally to convey some set of semantics, like the
relationship of addresses and ports that are of interest, or some
aspect of time, you need some flexibility in scrambling the data. 
Maybe you want to demonstrate how two hosts are interacting among
other traffic, so you want to translate two addresses of interest
to known values and randomize the rest.  Or maybe you want to preserve
the concept of local vs. remote hosts, so you need to retain some
aspect of the address hierarchy.  Simply randomizing every 8-, 16-
and 32-bit object in an argus record isn't going to be the most
helpful.

   Ranonymize has a rich set of configuration parameters to provide
anonymization with exceptions.  You can tell ranonymize, "don't
translate these objects", "translate this value to this value",
or "use these techniques to translate these objects".
These options are primarily focused on MAC and IPv4 addresses, port
values, time, IP header fields, such as the ip_id, TOS and TTL,
and dealing with sequence numbers in the various supported protocols.

   There are two basic anonymization techniques that ranonymize
uses: sequential offset (random or fixed) and random allocation.
Sequential offset anonymization involves simply adding a constant to
a data field with carry, which shifts values in a controlled fashion
through a number space.  The constant can be randomly chosen or user
supplied (fixed), so that on each run against the same data, you
either get a randomly different result or the same result.
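
   As a rough illustration of the sequential offset idea (not
ranonymize's actual code), here is a minimal Python sketch.  It
assumes "with carry" means the sum wraps around inside the field's
number space; the names make_offset() and shift() are made up for
this example.

```python
import random

def make_offset(bits, fixed=None):
    """Pick the constant: user supplied (fixed) or randomly chosen."""
    space = 1 << bits
    if fixed is not None:
        return fixed % space          # same result on every run
    return random.randrange(1, space) # different result on each run

def shift(value, offset, bits):
    """Add the constant and wrap around within the number space."""
    return (value + offset) % (1 << bits)

# Fixed offset: repeatable translation of a 16-bit value.
fixed_offset = make_offset(16, fixed=1000)
print(shift(100, fixed_offset, 16))

# Random offset: the translation changes from run to run.
random_offset = make_offset(16)
print(shift(100, random_offset, 16))
```

With a fixed constant two runs over the same data agree; with a
random one they generally do not.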

   The importance of sequential offset anonymization is that it
preserves differential relationships.  So for data items like
timestamps, sequence numbers, port numbers, and TTL values, this
technique shifts the data an unknown amount, but still allows you to
calculate differential metrics such as transaction duration and hop
count, detect breaks in source port allocation, and detect missing
sequence numbers.
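
   To make the "preserves differential relationships" point
concrete, here is a small Python sketch.  The offsets, port numbers
and timestamps are invented for illustration, and wraparound within
the field width is my assumption about how the carry behaves.

```python
def shift(value, offset, bits):
    """Sequential offset: add a constant, wrapping in the field width."""
    return (value + offset) % (1 << bits)

def diff(a, b, bits):
    """Difference b - a measured inside the wrapped number space."""
    return (b - a) % (1 << bits)

PORT_OFFSET = 54321        # hypothetical 16-bit constant
TIME_OFFSET = 4000000000   # hypothetical 32-bit constant

# Source ports from one host; the jump from 40002 to 40005 is a break.
ports = [40000, 40001, 40002, 40005]
anon = [shift(p, PORT_OFFSET, 16) for p in ports]

orig_gaps = [diff(a, b, 16) for a, b in zip(ports, ports[1:])]
anon_gaps = [diff(a, b, 16) for a, b in zip(anon, anon[1:])]

# Start and end time of one transaction, in seconds.
start, end = 1034253302, 1034253317
anon_start = shift(start, TIME_OFFSET, 32)
anon_end = shift(end, TIME_OFFSET, 32)

orig_duration = diff(start, end, 32)
anon_duration = diff(anon_start, anon_end, 32)

print(orig_gaps == anon_gaps, orig_duration == anon_duration)
```

The absolute values change, but the gaps between ports and the
transaction duration come out the same before and after.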

   In random allocation, a number space, such as 0-65535, is randomly
rearranged.  The resulting anonymized space is then used to translate
a particular 16-bit value in an argus record.   This can be used to
anonymize port values, Ip_Id values, the 16-bit host part of a Class
B IPv4 address, or any other 16-bit value.
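
   Random allocation can be pictured as a shuffled lookup table.
The Python sketch below is only an illustration under that
assumption; build_table() and translate() are hypothetical names,
and the seed is there just to make the example repeatable.

```python
import random

def build_table(bits=16, seed=None):
    """Randomly rearrange the number space 0 .. 2**bits - 1."""
    rng = random.Random(seed)   # not a cryptographic generator
    table = list(range(1 << bits))
    rng.shuffle(table)          # every value appears exactly once
    return table

def translate(value, table):
    """Map an original value to its randomly allocated replacement."""
    return table[value]

table = build_table(16, seed=2002)

# The same input always maps to the same output within one table,
# but neighbouring inputs no longer map to neighbouring outputs.
print(translate(80, table), translate(81, table))
```

Because the table is a permutation, distinct values stay distinct,
but unlike sequential offset no differences between values survive.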

   Ranonymize provides some flexibility, for instance with port
randomization.  There are 3 port ranges: "well-known", "registered
port numbers" and the unknown range.  You can specify that you want
to retain or randomize any of these three ranges.  By default,
ranonymize will preserve the "well-known" ports (0-1023), and
anonymize the rest using random sequential offset anonymization, so
you can see how ports are allocated, but the actual values are all
shifted.
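
   Here is one way the default port policy could be sketched in
Python.  This is a guess at the behaviour, not ranonymize's code:
anonymize_port() is a made-up name, and wrapping the shift inside
1024-65535 (so shifted ports never land on a preserved well-known
port) is my own assumption.

```python
import random

WELL_KNOWN_MAX = 1023                 # "well-known" ports are 0-1023
LOW = WELL_KNOWN_MAX + 1              # first port that gets shifted
SIZE = 65536 - LOW                    # size of the shifted space

def anonymize_port(port, offset):
    """Keep well-known ports; shift the rest by a constant offset."""
    if port <= WELL_KNOWN_MAX:
        return port                   # preserved by default
    # Assumption: wrap within 1024-65535 rather than the full
    # 16-bit space, so the mapping stays one-to-one.
    return LOW + (port - LOW + offset) % SIZE

offset = random.randrange(1, SIZE)    # random sequential offset

# A server port is kept, while consecutive client ports are moved
# by the same amount.
for p in (80, 40000, 40001, 40002):
    print(p, anonymize_port(p, offset))
```

Consecutive ports stay consecutive (except where the shift wraps),
so the allocation pattern is visible while the values are hidden.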

   Hopefully this helps to set the stage as to what ranonymize()
is doing.  There are some very cool concepts here, so let's talk
about it!

Hope all is well,

Carter

Carter Bullard
QoSient, LLC
300 E. 56th Street, Suite 18K
New York, New York  10022

carter at qosient.com
Phone +1 212 588-9133
Fax   +1 212 588-9134
http://qosient.com