ranonymize too slow?

Christos Papadopoulos christos at cs.colostate.edu
Fri Dec 5 18:38:41 EST 2014


Thanks Kaustubh!

Carter, Kaustubh is one of my graduate students and he took it upon 
himself to look into the problem.

He will do another anonymization run and report back to us with timing 
results.

This is good progress, thanks all!

Christos.

On 12/05/2014 04:06 PM, Kaustubh Gadkari wrote:
>
>
> On Fri, Dec 5, 2014 at 1:05 PM, Christos Papadopoulos
> <christos at cs.colostate.edu <mailto:christos at cs.colostate.edu>> wrote:
>
>     On 12/05/2014 12:41 PM, Carter Bullard wrote:
>
>         That is about 2500 records per sec.  We should be able to do
>         10-50x that.  I have gotten up to 1M rps, but not with open
>         source argus.
>         The hash size change should make a huge difference !!
>
>
>     With the new hash, ranonymize produces 37,615 records in just over a
>     second and then promptly crashes with both cidr/8 and cidr/16.
>
>
> I think I fixed the segfault issue. The patch is simple:
>
> kaustubh at proton:~/argus-clients-3.0.8/clients$ diff ranonymize.c ranonymize.c.new
> 1454c1454
> <    int RaMapHash = 0;
> ---
> >    unsigned int RaMapHash = 0;
>
> Kaustubh
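
One plausible reading of why this one-line change stops the crash: if the hash is held in a signed int, addresses with the top bit set turn it negative, and C's % operator keeps the sign of the dividend, so the bucket index can land before the start of the table. The sketch below only illustrates that effect. TABLE_SIZE, bucket_signed and bucket_unsigned are hypothetical names, not code from ranonymize.c, and the real hash computation may differ.

```c
#include <stdint.h>

/* Hypothetical table size for illustration; not taken from ranonymize.c. */
#define TABLE_SIZE 0x10000

/* Signed variant: an address at or above 128.0.0.0 has its top bit set.
 * Converting it to int is implementation-defined and yields a negative
 * value on common two's-complement platforms. Because % keeps the sign
 * of the dividend, the returned index can then be negative. */
int bucket_signed(uint32_t addr)
{
    int hash = (int)addr;
    return hash % TABLE_SIZE;
}

/* Unsigned variant, matching the spirit of the patch: the hash can never
 * be negative, so the index always falls in [0, TABLE_SIZE). */
unsigned int bucket_unsigned(uint32_t addr)
{
    unsigned int hash = addr;
    return hash % TABLE_SIZE;
}
```

With 192.168.0.1 (0xC0A80001) as input, the signed variant would return a negative index on a typical 64-bit Linux build, while the unsigned one stays inside the table.
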
>
>     If you have some quick suggestions I can try them, else it will take
>     some time to dig deeper.
>
>     Christos.
>
>
>
>         Carter
>
>             On Dec 5, 2014, at 2:54 PM, Christos Papadopoulos
>             <christos at cs.colostate.edu
>             <mailto:christos at cs.colostate.edu>> wrote:
>
>             Hi Carter,
>
>             You are right, my apologies.
>
>             With cidr/8 after three hours it anonymized about 8.8M
>             records out of the nearly 1B records in the file. I counted
>             this by running wc on the output file, which is a text file.
>
>             The machine is a Dell Poweredge 2950, 3GHz Xeon with 8
>             cores, 32GB of RAM and about 30TB of directly attached
>             storage, running 64bit CentOS 6.6.
>
>             I will try running it with cidr/16 and also with the change
>             in the hash function you suggested in your other message.
>
>             Thanks for your help!
>
>             Christos.
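
For scale, the figures quoted above (about 8.8M records in three hours, out of nearly 1B) come to roughly 815 records per second, which would put a full run at around two weeks. A minimal sketch of that arithmetic follows; the function names are made up for illustration and the 1B total is the approximate figure from the message, not a measured count.

```c
/* Average throughput: records processed divided by elapsed seconds. */
double records_per_second(double records, double hours)
{
    return records / (hours * 3600.0);
}

/* Projected wall-clock time, in days, to process total_records at the
 * given steady rate. Assumes the rate stays constant, which the thread
 * suggests it may not. */
double projected_days(double total_records, double rate)
{
    return total_records / rate / 86400.0;
}
```

Calling records_per_second(8.8e6, 3.0) and feeding the result into projected_days(1e9, rate) reproduces the estimate.
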
>
>                 On 12/05/2014 04:14 AM, Carter Bullard wrote:
>                 Hey Christos,
>                 We could be a bit more scientific about this.  How much
>                 of the file was completed after 3 hours ?
>                 Did you try cidr/8 and cidr/16 ??   What kind of machine
>                 is this running on ???
>
>                 Carter
>
>                     On Dec 5, 2014, at 7:53 AM, Christos Papadopoulos
>                     <christos at cs.colostate.edu
>                     <mailto:christos at cs.colostate.edu>> wrote:
>
>                     On 12/04/2014 02:54 AM, Carter Bullard wrote:
>
>                         Hey Christos,
>                         With CIDR/24 address hierarchy preservation, it
>                         may be thrashing trying to find an appropriate
>                         CIDR/24 prefix that hasn’t been allocated, when
>                         it needs a new one.  I suspect that your 55M
>                         addresses are really 55M CIDR/24’s.  You may get
>                         some real speed up if you go to CIDR/16,
>                         or CIDR/8.  If you could try that, just as an
>                         experiment, and see if the output is a bit quicker,
>                         I think I can make some changes to improve the
>                         allocation.
>
>
>                     I tried it by changing the config file to CIDR/8. I
>                     don't think it made much of a difference. I let the
>                     process run for over 3 hours before I had to kill it
>                     again. At that point I saw similar progress as before.
>
>                     Sorry!
>
>                     Christos.
>
>
>                         I suspect that you get decent output at first
>                         and then it slows down to a crawl, as it's busy
>                         trying to find an address slot that is
>                         appropriate for the next CIDR/24.  It's a hash
>                         collision and then a search for an open slot,
>                         which may not be optimal.  It should be easy to
>                         thread out to another processor.
>
>                         Carter
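
The slowdown described above (a hash collision followed by a search for an open slot) is the usual behaviour of open addressing as a table fills. The sketch below shows the general effect with plain linear probing. It is not the allocation code in ranonymize, and SLOTS, EMPTY, probe_table, table_init and table_insert are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS 1024u          /* hypothetical table size */
#define EMPTY 0xFFFFFFFFu    /* sentinel for an unused slot; not a valid key */

typedef struct {
    uint32_t slot[SLOTS];
    size_t used;
} probe_table;

/* Mark every slot as unused. */
void table_init(probe_table *t)
{
    for (size_t i = 0; i < SLOTS; i++)
        t->slot[i] = EMPTY;
    t->used = 0;
}

/* Insert key using linear probing. Returns the number of slots examined
 * (1 means no collision), or 0 if the table is full and the key is not
 * already present. */
size_t table_insert(probe_table *t, uint32_t key)
{
    size_t i = key % SLOTS;
    for (size_t probes = 1; probes <= SLOTS; probes++) {
        if (t->slot[i] == key)
            return probes;            /* already stored */
        if (t->slot[i] == EMPTY) {
            t->slot[i] = key;         /* claim the first open slot */
            t->used++;
            return probes;
        }
        i = (i + 1) % SLOTS;          /* collision: try the next slot */
    }
    return 0;
}
```

Keys that hash to the same slot cost one more probe each time, so inserting 0, 1024 and 2048 takes 1, 2 and 3 probes respectively. With tens of millions of prefixes crowding a small table, that per-insert search is the kind of cost that can dominate the run.
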
>
>                             On Dec 2, 2014, at 2:59 PM, Christos
>                             Papadopoulos <christos at cs.colostate.edu
>                             <mailto:christos at cs.colostate.edu>> wrote:
>
>                             On 12/02/2014 12:40 AM, Carter Bullard wrote:
>
>                                 Hey Christos,
>                                 Did you specify a ranonymize.conf file,
>                                 or are you using all defaults ?
>
>
>                             I customized the ranonymize.conf file to
>                             anonymize IP addresses only. See below.
>
>                                 You may want to allocate addresses using
>                                 a different strategy.  Using the default
>                                 algorithm, the allocation of 55M
>                                 addresses will take some time, did you
>                                 get any output at all  ???
>
>
>                             I need to use prefix-preserving
>                             anonymization, similar to cryptopan. Which
>                             algorithm would you suggest?
>
>                             I do see the output file growing. It just
>                             takes a really long time, to the point where
>                             it is unusable for our case.
>
>                             Here are the settings I used. Please let me
>                             know if I should change anything. I only
>                             need IP addresses anonymized.
>
>                             RANON_SEED=29384938
>                             RANON_TRANSREFNUM_OFFSET=no
>                             RANON_SEQNUM_OFFSET=no
>                             RANON_TIME_SEC_OFFSET=no
>                             RANON_TIME_USEC_OFFSET=no
>                             RANON_ETHERNET_ANONYMIZATION=no
>                             RANON_PRESERVE_ETHERNET_VENDOR=yes
>                             RANON_PRESERVE_ETHERNET_BROADCAST=yes
>                             RANON_PRESERVE_ETHERNET_MULTICAST=yes
>
>                             RANON_NET_ANONYMIZATION=sequential
>                             RANON_HOST_ANONYMIZATION=sequential
>                             RANON_AS_ANONYMIZATION=sequential
>                             RANON_NETWORK_ADDRESS_LENGTH=24
>
>                             RANON_PRESERVE_NET_ADDRESS_HIERARCHY=cidr/24
>                             RANON_PRESERVE_BROADCAST_ADDRESS=yes
>                             RANON_PRESERVE_MULTICAST_ADDRESS=yes
>                             RANON_PRESERVE_IP_ID=none
>                             RANON_PRESERVE_ICMPMAPPED_TTL=yes
>                             RANON_PRESERVE_IP_TTL=none
>                             RANON_PRESERVE_IP_TOS=none
>                             RANON_PRESERVE_WELLKNOWN_PORT_NUMS=yes
>                             RANON_PRESERVE_REGISTERED_PORT_NUMS=yes
>                             RANON_PRESERVE_PRIVATE_PORT_NUMS=yes
>                             RANON_PORT_METHOD=no
>
>                             Christos.
>
>
>                                 Carter
>
>
>
>                                     On Dec 2, 2014, at 2:38 AM, Christos
>                                     Papadopoulos
>                                     <christos at cs.colostate.edu
>                                     <mailto:christos at cs.colostate.edu>>
>                                     wrote:
>
>                                     Hi Carter,
>
>                                     We are using the latest version of
>                                     the client tools.
>
>                                     After letting it run for 4.5 hours I
>                                     had to kill it. There are just under
>                                     a billion records in the file. When
>                                     I killed it, this is what I got. I
>                                     have no idea how much longer it
>                                     would run.
>
>                                     Address Summary
>                                        IPv4 Unicast              src 11411339    dst 43953546
>                                        IPv4 Unicast Private      src 85          dst 353
>                                        IPv4 Unicast Reserved     src 12654028    dst 51692353
>                                        IPv4 Multicast Local      src 0           dst 2
>
>                                     Christos.
>
>                                         On 12/01/2014 11:49 AM, Carter
>                                         Bullard wrote:
>                                         Hey Christos,
>                                         The primary demand in IP address
>                                         anonymization is the number of
>                                         IP addresses that need to be
>                                         anonymized.   So how many
>                                         addresses are in the file ??
>
>                                             racount -M addr -r big.file
>
>                                         What version of clients are you
>                                         using ??
>                                         Carter
>
>                                             On Dec 1, 2014, at 1:14 AM,
>                                             Christos Papadopoulos
>                                             <christos at cs.colostate.edu
>                                             <mailto:christos at cs.colostate.edu>>
>                                             wrote:
>
>                                             Hi folks,
>
>                                             I am trying to use
>                                             ranonymize for some large
>                                             argus files. This is useful
>                                             for us because we want to
>                                             share some argus data with
>                                             fellow researchers, but
>                                             anonymize them to protect
>                                             the innocent.
>
>                                             The file I am trying to
>                                             anonymize is large, about
>                                             18GB compressed. As you can
>                                             imagine, there are millions
>                                             of flows in there.
>
>                                             I only want IP address
>                                             anonymization, so I turned
>                                             everything else off in the
>                                             ranonymize.conf file.
>
>                                             Well, ranonymize has been
>                                             running for almost 3 hours
>                                             with about 1/20th of the
>                                             file done. It is using 100%
>                                             of a CPU, but only 4% of
>                                             memory in a 32GB machine.
>                                             Clearly it's not a memory or
>                                             swap issue.
>
>                                             I can't figure out why it's
>                                             taking so long. I thought it
>                                             would be almost as fast as
>                                             reading and writing the file
>                                             plus some time to
>                                             compress/decompress and some
>                                             time for checking the hash
>                                             for the anonymized addresses.
>
>                                             Any idea what's pounding the
>                                             CPU and slowing it down? I
>                                             can investigate further by
>                                             profiling the code, but
>                                             thought I'd throw the question
>                                             out there first in case
>                                             someone else has done it.
>
>                                             Thanks!
>
>                                             Christos.
>
>
>
>
>
>
>
> --
> Kaustubh Gadkari



