ranonymize() randomize function
Carter Bullard
carter at qosient.com
Sat Oct 19 11:52:27 EDT 2002
Gentle people,
For completeness I'll be sending descriptions of the new
programs to the list, in hopes that it will stimulate some
conversation on the new concepts and features in the new
tools.
Still working on ranonymize(), which is an important
new tool, this is a brief description of the randomizer
that ranonymize() uses. Of particular interest is the
randomizer for 8 and 16-bit number spaces. In order to get
quick and effective anonymization, lookup tables are used
to replace 8 and 16-bit numbers of interest with an anonymized
value. ranonymize() uses disjoint anonymized spaces,
so that ports numbers and ip_id values, as an example, are
anonymized independently. The more lookup tables used,
the more important it is to have an efficient algorithm for
creating the lookup tables. Don't want to have to wait too
long before you anonymize your first record.
The technique is to build a lookup table that is populated
with a random substitution sequence. The property of the
lookup table is that the contents must be unique while
completely covering the original number space. Ranonymize()
uses a random pick strategy to populate the translation lookup
table.
The algorithm looks something like this:
for (i = 0; i < N; i++) {
newtable[i] = oldtable[random() % (N - i)];
compact(oldtable);
}
Seems pretty straight forward. However, ranonymize uses
a more optimal algorithm as the one above has performance O(N!).
ranonymize() uses a pseudo bin hash algorithm that removes
the need to compact the table on each iteration, so you get
more like O(N log N) but the concept is the same.
This generates a random sequence from which simple value
replacements can be made. So to anonymize the source port
within an argus record, as an example, we just:
{
unsigned short *sport = &argus->argus_far.ip_flow.sport;
*sport = newtable[*sport];
}
The technique is a simple ECB (electronic code book) cipher, it
does have some weaknesses, such that if you figure out one
of the values, every value with that number can be decoded,
but that's it.
Any issues with this strategy? Simple, fast, and provides
good protection for small amounts of data. For large sets of
data, where there are many many records relative to the small
port space, a technique that builds a lookup table per IP address
would be ideal. Expensive but better anonymization.
Comments?
Carter
Carter Bullard
QoSient, LLC
300 E. 56th Street, Suite 18K
New York, New York 10022
carter at qosient.com
Phone +1 212 588-9133
Fax +1 212 588-9134
http://qosient.com
More information about the argus
mailing list