ranonymize too slow?

Sat Dec 6 03:01:21 EST 2014

Fall back to the default of cidr/24 and if that doesn't work completely, set the hash size to 0x10000.  There is some different logic for hash sizes bigger than 0x10000.

Carter

> On Dec 6, 2014, at 7:16 AM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
> 
> Hi,
> 
> I had kicked off a run of ranonymize with the new hash size. Good news: the code doesn't segfault. Bad news: ranonymize quits with the following error after about 12 minutes.
> 
> RaMapNewNetwork: no addresses
> 
> RANON_PRESERVE_NET_ADDRESS_HIERARCHY is set to cidr/8 in the config file.
> 
> Kaustubh
> 
>> On Dec 5, 2014, at 4:38 PM, Christos Papadopoulos <christos at CS.ColoState.EDU> wrote:
>> 
>> Thanks Kaustubh!
>> 
>> Carter, Kaustubh is one of my graduate students and he took it upon himself to look into the problem.
>> 
>> He will do another anonymization run and report back to us with timing results.
>> 
>> This is good progress, thanks all!
>> 
>> Christos.
>> 
>>> On 12/05/2014 04:06 PM, Kaustubh Gadkari wrote:
>>> 
>>> 
>>> On Fri, Dec 5, 2014 at 1:05 PM, Christos Papadopoulos
>>> <christos at cs.colostate.edu> wrote:
>>> 
>>>   On 12/05/2014 12:41 PM, Carter Bullard wrote:
>>> 
>>>       That is about 2500 records per sec.  We should be able to do
>>>       10-50x that.  I have gotten up to 1M rps, but not with open
>>>       source argus.
>>>       The hash size change should make a huge difference !!
>>> 
>>> 
>>>   With the new hash, ranonymize produces 37,615 records in just over a
>>>   second and then promptly crashes with both cidr/8 and cidr/16.
>>> 
>>> 
>>> I think I fixed the segfault issue. The patch is simple:
>>> 
>>> kaustubh at proton:~/argus-clients-3.0.8/clients$ diff ranonymize.c ranonymize.c.new
>>> 1454c1454
>>> <    int RaMapHash = 0;
>>> ---
>>> >    unsigned int RaMapHash = 0;
>>> 
>>> Kaustubh
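
One plausible reading of why the one-line patch above stops the segfault, offered as an assumption because the thread does not show how RaMapHash is used: if the hash value is reduced modulo the table size and then used as an array index, a signed 32-bit value with its high bit set gives a negative remainder in C, and so an out-of-bounds index. The sketch below emulates C's 32-bit arithmetic in Python. TABLE_SIZE and the function names are hypothetical and are not taken from ranonymize.c.

```python
import ctypes

TABLE_SIZE = 0x20000  # hypothetical table size, not taken from ranonymize.c


def c_remainder(a, b):
    """C99 '%' semantics: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def signed_index(addr):
    """Index as computed with a C 'int' hash (32-bit two's complement)."""
    hash_value = ctypes.c_int32(addr).value  # reinterpret the bits as signed
    return c_remainder(hash_value, TABLE_SIZE)


def unsigned_index(addr):
    """Index as computed with a C 'unsigned int' hash."""
    hash_value = ctypes.c_uint32(addr).value
    return hash_value % TABLE_SIZE


if __name__ == "__main__":
    addr = 0xC0A80101  # 192.168.1.1 as a 32-bit value, high bit set
    print("signed index:  ", signed_index(addr))
    print("unsigned index:", unsigned_index(addr))
```

Under these assumptions the signed variant produces a negative index for any address at or above 128.0.0.0, while the unsigned variant always stays within the table.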
>>> 
>>> 
>>>   If you have some quick suggestions I can try them, else it will take
>>>   some time to dig deeper.
>>> 
>>>   Christos.
>>> 
>>> 
>>> 
>>>       Carter
>>> 
>>>           On Dec 5, 2014, at 2:54 PM, Christos Papadopoulos
>>>           <christos at cs.colostate.edu> wrote:
>>> 
>>>           Hi Carter,
>>> 
>>>           You are right, my apologies.
>>> 
>>>           With cidr/8 after three hours it anonymized about 8.8M
>>>           records out of the nearly 1B records in the file. I counted
>>>           this by running wc on the output file, which is a text file.
>>> 
>>>           The machine is a Dell Poweredge 2950, 3GHz Xeon with 8
>>>           cores, 32GB of RAM and about 30TB of directly attached
>>>           storage, running 64bit CentOS 6.6.
>>> 
>>>           I will try running it with cidr/16 and also with the change
>>>           in the hash function you suggested in your other message.
>>> 
>>>           Thanks for your help!
>>> 
>>>           Christos.
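
For scale, the figures in this message can be turned into a rate and a projected run time. This is only a back-of-the-envelope sketch using the numbers quoted above ("about 8.8M records", "three hours", "nearly 1B records"); the function names are illustrative, and the total is rounded up to 1e9.

```python
SECONDS_PER_DAY = 86400


def throughput(records, seconds):
    """Average records processed per second."""
    return records / seconds


def days_to_finish(total_records, rate):
    """Projected wall-clock days to process total_records at a constant rate."""
    return total_records / rate / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = throughput(8.8e6, 3 * 3600)   # figures quoted in the message
    print(f"rate: {rate:.0f} records/s")
    print(f"projected: {days_to_finish(1e9, rate):.1f} days for the whole file")
```

At a constant rate this comes to on the order of 800 records per second, or roughly two weeks for the whole file, which is consistent with the run being described as unusable.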
>>> 
>>>               On 12/05/2014 04:14 AM, Carter Bullard wrote:
>>>               Hey Christos,
>>>               We could be a bit more scientific about this.  How much
>>>               of the file was completed after 3 hours ?
>>>               Did you try cidr/8 and cidr/16 ??   What kind of machine
>>>               is this running on ???
>>> 
>>>               Carter
>>> 
>>>                   On Dec 5, 2014, at 7:53 AM, Christos Papadopoulos
>>>                   <christos at cs.colostate.edu> wrote:
>>> 
>>>                   On 12/04/2014 02:54 AM, Carter Bullard wrote:
>>> 
>>>                       Hey Christos,
>>>                       With CIDR/24 address hierarchy preservation, it
>>>                       may be thrashing trying to find an appropriate
>>>                       CIDR/24 prefix that hasn’t been allocated, when
>>>                       it needs a new one.  I suspect that your 55M
>>>                       addresses are really 55M CIDR/24’s.  You may get
>>>                       some real speed up if you go to CIDR/16,
>>>                       or CIDR/8.  If you could try that, just as an
>>>                       experiment, and see if the output is a bit quicker,
>>>                       I think I can make some changes to improve the
>>>                       allocation.
>>> 
>>> 
>>>                   I tried it by changing the config file to CIDR/8. I
>>>                   don't think it made much of a difference. I let the
>>>                   process run for over 3 hours before I had to kill it
>>>                   again. At that point I saw similar progress as before.
>>> 
>>>                   Sorry!
>>> 
>>>                   Christos.
>>> 
>>> 
>>>                       I suspect that you get decent output at first
>>>                       and then it slows down to a crawl, as it's busy
>>>                       trying to find an address slot that is
>>>                       appropriate for the next CIDR/24.  It's a hash
>>>                       collision and then a search for an open slot,
>>>                       which may not be optimal.  It should be easy to
>>>                       thread out to another processor.
>>> 
>>>                       Carter
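
The slowdown described above (fast at first, then a crawl as free slots get scarce) is the usual behavior of a collision followed by a search for an open slot. The sketch below is a generic linear-probing simulation and is not ranonymize's allocation code; the table size, fill level, and function names are assumptions chosen for illustration.

```python
import random


def probes_to_insert(table, key):
    """Insert key with linear probing; return how many slots were examined."""
    size = len(table)
    slot = key % size
    probes = 1
    while table[slot] is not None:   # collision: walk to the next slot
        slot = (slot + 1) % size
        probes += 1
    table[slot] = key
    return probes


def simulate(size=4096, fill=0.95, seed=1):
    """Fill a table to the given load and compare early and late insert cost."""
    rng = random.Random(seed)
    table = [None] * size
    n = int(size * fill)
    costs = [probes_to_insert(table, rng.getrandbits(32)) for _ in range(n)]
    tenth = n // 10
    early = sum(costs[:tenth]) / tenth    # average over the first 10% of inserts
    late = sum(costs[-tenth:]) / tenth    # average over the last 10% of inserts
    return early, late


if __name__ == "__main__":
    early, late = simulate()
    print(f"avg probes, first 10% of inserts: {early:.2f}")
    print(f"avg probes, last 10% of inserts:  {late:.2f}")
```

In this model the cost per insert stays near one probe while the table is mostly empty and climbs steeply as it fills, which matches the symptom of good initial output followed by a stall.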
>>> 
>>>                           On Dec 2, 2014, at 2:59 PM, Christos
>>>                           Papadopoulos <christos at cs.colostate.edu> wrote:
>>> 
>>>                           On 12/02/2014 12:40 AM, Carter Bullard wrote:
>>> 
>>>                               Hey Christos,
>>>                               Did you specify a ranonymize.conf file,
>>>                               or are you using all defaults ?
>>> 
>>> 
>>>                           I customized the ranonymize.conf file to
>>>                           anonymize IP addresses only. See below.
>>> 
>>>                               You may want to allocate addresses using
>>>                               a different strategy.  Using the default
>>>                               algorithm, the allocation of 55M
>>>                               addresses will take some time, did you
>>>                               get any output at all  ???
>>> 
>>> 
>>>                           I need to use prefix-preserving
>>>                           anonymization, similar to cryptopan. Which
>>>                           algorithm would you suggest?
>>> 
>>>                           I do see the output file growing. It just
>>>                           takes a really long time, to the point where
>>>                           it is unusable for our case.
>>> 
>>>                           Here are the settings I used. Please let me
>>>                           know if I should change anything. I only
>>>                           need IP addresses anonymized,
>>> 
>>>                           RANON_SEED=29384938
>>>                           RANON_TRANSREFNUM_OFFSET=no
>>>                           RANON_SEQNUM_OFFSET=no
>>>                           RANON_TIME_SEC_OFFSET=no
>>>                           RANON_TIME_USEC_OFFSET=no
>>>                           RANON_ETHERNET_ANONYMIZATION=no
>>>                           RANON_PRESERVE_ETHERNET_VENDOR=yes
>>>                           RANON_PRESERVE_ETHERNET_BROADCAST=yes
>>>                           RANON_PRESERVE_ETHERNET_MULTICAST=yes
>>> 
>>>                           RANON_NET_ANONYMIZATION=sequential
>>>                           RANON_HOST_ANONYMIZATION=sequential
>>>                           RANON_AS_ANONYMIZATION=sequential
>>>                           RANON_NETWORK_ADDRESS_LENGTH=24
>>> 
>>>                           RANON_PRESERVE_NET_ADDRESS_HIERARCHY=cidr/24
>>>                           RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>>                           RANON_PRESERVE_MULTICAST_ADDRESS=yes
>>>                           RANON_PRESERVE_IP_ID=none
>>>                           RANON_PRESERVE_ICMPMAPPED_TTL=yes
>>>                           RANON_PRESERVE_IP_TTL=none
>>>                           RANON_PRESERVE_IP_TOS=none
>>>                           RANON_PRESERVE_WELLKNOWN_PORT_NUMS=yes
>>>                           RANON_PRESERVE_REGISTERED_PORT_NUMS=yes
>>>                           RANON_PRESERVE_PRIVATE_PORT_NUMS=yes
>>>                           RANON_PORT_METHOD=no
>>> 
>>>                           Christos.
>>> 
>>> 
>>>                               Carter
>>> 
>>> 
>>> 
>>>                                   On Dec 2, 2014, at 2:38 AM, Christos
>>>                                   Papadopoulos
>>>                                   <christos at cs.colostate.edu> wrote:
>>> 
>>>                                   Hi Carter,
>>> 
>>>                                   We are using the latest version of
>>>                                   the client tools.
>>> 
>>>                                   After letting it run for 4.5 hours I
>>>                                   had to kill it. There are just under
>>>                                   a billion records in the file. When
>>>                                   I killed it, this is what I got. I
>>>                                   have no idea how much longer it
>>>                                   would run.
>>> 
>>>                                   Address Summary
>>>                                      IPv4 Unicast              src 11411339    dst 43953546
>>>                                      IPv4 Unicast Private      src 85          dst 353
>>>                                      IPv4 Unicast Reserved     src 12654028    dst 51692353
>>>                                      IPv4 Multicast Local      src 0           dst 2
>>> 
>>>                                   Christos.
>>> 
>>>                                       On 12/01/2014 11:49 AM, Carter
>>>                                       Bullard wrote:
>>>                                       Hey Christos,
>>>                                       The primary demand in IP address
>>>                                       anonymization is the number of
>>>                                       IP addresses that need to be
>>>                                       anonymized.   So how many
>>>                                       addresses are in the file ??
>>> 
>>>                                           racount -M addr -r big.file
>>> 
>>>                                       What version of clients are you
>>>                                       using ??
>>>                                       Carter
>>> 
>>>                                           On Dec 1, 2014, at 1:14 AM,
>>>                                           Christos Papadopoulos
>>>                                           <christos at cs.colostate.edu>
>>>                                           wrote:
>>> 
>>>                                           Hi folks,
>>> 
>>>                                           I am trying to use
>>>                                           ranonymize for some large
>>>                                           argus files. This is useful
>>>                                           for us because we want to
>>>                                           share some argus data with
>>>                                           fellow researchers, but
>>>                                           anonymize them to protect
>>>                                           the innocent.
>>> 
>>>                                           The file I am trying to
>>>                                           anonymize is large, about
>>>                                           18GB compressed. As you can
>>>                                           imagine, there are millions
>>>                                           of flows in there.
>>> 
>>>                                           I only want IP address
>>>                                           anonymization, so I turned
>>>                                           everything else off in the
>>>                                           ranonymize.conf file.
>>> 
>>>                                           Well, ranonymize has been
>>>                                           running for almost 3 hours
>>>                                           with about 1/20th of the
>>>                                           file done. It is using 100%
>>>                                           of a CPU, but only 4% of
>>>                                           memory in a 32GB machine.
>>>                                           Clearly it's not a memory or
>>>                                           swap issue.
>>> 
>>>                                           I can't figure out why it's
>>>                                           taking so long. I thought it
>>>                                           would be almost as fast as
>>>                                           reading and writing the file
>>>                                           plus some time to
>>>                                           compress/decompress and some
>>>                                           time for checking the hash
>>>                                           for the anonymized addresses.
>>> 
>>>                                           Any idea what's pounding the
>>>                                           CPU and slowing it down? I
>>>                                           can investigate further by
>>>                                           profiling the code, but
>>>                                           thought I'd throw the question
>>>                                           out there first in case
>>>                                           someone else has done it.
>>> 
>>>                                           Thanks!
>>> 
>>>                                           Christos.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Kaustubh Gadkari
> 