ranonymize too slow?

Kaustubh Gadkari kaustubh at cs.colostate.edu
Sat Dec 6 01:16:13 EST 2014


Hi,

I kicked off a run of ranonymize with the new hash size. Good news: the code no longer segfaults. Bad news: ranonymize quits with the following error after about 12 minutes:

RaMapNewNetwork: no addresses

RANON_PRESERVE_NET_ADDRESS_HIERARCHY is set to cidr/8 in the config file.

Kaustubh

> On Dec 5, 2014, at 4:38 PM, Christos Papadopoulos <christos at CS.ColoState.EDU> wrote:
> 
> Thanks Kaustubh!
> 
> Carter, Kaustubh is one of my graduate students and he took it upon himself to look into the problem.
> 
> He will do another anonymization run and report back to us with timing results.
> 
> This is good progress, thanks all!
> 
> Christos.
> 
> On 12/05/2014 04:06 PM, Kaustubh Gadkari wrote:
>> 
>> 
>> On Fri, Dec 5, 2014 at 1:05 PM, Christos Papadopoulos
>> <christos at cs.colostate.edu> wrote:
>> 
>>    On 12/05/2014 12:41 PM, Carter Bullard wrote:
>> 
>>        That is about 2500 records per sec.  We should be able to do
>>        10-50x that.  I have gotten up to 1M rps, but not with open
>>        source argus.
>>        The hash size change should make a huge difference !!
>> 
>> 
>>    With the new hash, ranonymize produces 37,615 records in just over a
>>    second and then promptly crashes with both cidr/8 and cidr/16.
>> 
>> 
>> I think I fixed the segfault issue. The patch is simple:
>> 
>> kaustubh at proton:~/argus-clients-3.0.8/clients$ diff ranonymize.c ranonymize.c.new
>> 1454c1454
>> <    int RaMapHash = 0;
>> ---
>> >    unsigned int RaMapHash = 0;
>> 
>> Kaustubh
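
A note on the patch above: one plausible reason a one-word type change can cure a segfault is that in C (C99 and later) the `%` operator truncates toward zero, so a negative signed hash gives a negative remainder, and a negative remainder is not a valid table index. The sketch below only illustrates that language rule. The function names and the table size are hypothetical and are not taken from ranonymize.c, and how RaMapHash is actually used there is an assumption.

```c
#include <assert.h>

/* Hypothetical helper: bucket index computed from a SIGNED hash value.
 * With a negative hash, the remainder is negative as well, so using the
 * result as an array index would read before the start of the table. */
int bucket_signed(int hash, int table_size) {
    assert(table_size > 0);
    return hash % table_size;
}

/* The same computation with an UNSIGNED hash. The result is always in
 * the range [0, table_size), whatever bit pattern the hash holds. */
unsigned int bucket_unsigned(unsigned int hash, unsigned int table_size) {
    assert(table_size > 0);
    return hash % table_size;
}
```

With a table of 10 slots, `bucket_signed(-7, 10)` evaluates to -7, while `bucket_unsigned` applied to the same bit pattern stays inside the table.
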
>> 
>>    If you have some quick suggestions I can try them, else it will take
>>    some time to dig deeper.
>> 
>>    Christos.
>> 
>> 
>> 
>>        Carter
>> 
>>            On Dec 5, 2014, at 2:54 PM, Christos Papadopoulos
>>            <christos at cs.colostate.edu> wrote:
>> 
>>            Hi Carter,
>> 
>>            You are right, my apologies.
>> 
>>            With cidr/8 after three hours it anonymized about 8.8M
>>            records out of the nearly 1B records in the file. I counted
>>            this by running wc on the output file, which is a text file.
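
A quick check of the arithmetic behind these figures: 8.8M records in three hours is roughly 815 records per second, and at that rate a file of about 1B records would need on the order of two weeks. The helper names below are illustrative only, and the sketch assumes a constant rate, which the rest of the thread suggests may not hold.

```c
#include <assert.h>

/* Average records per second over an elapsed wall-clock interval. */
double records_per_second(double records, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return records / elapsed_seconds;
}

/* Projected number of days needed to process total_records,
 * assuming the given rate (records per second) stays constant. */
double projected_days(double total_records, double rate) {
    assert(rate > 0.0);
    return total_records / rate / 86400.0;
}
```

For example, `records_per_second(8.8e6, 3 * 3600.0)` is a little under 815, and feeding that rate into `projected_days(1e9, rate)` gives a bit more than 14 days.
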
>> 
>>            The machine is a Dell Poweredge 2950, 3GHz Xeon with 8
>>            cores, 32GB of RAM and about 30TB of directly attached
>>            storage, running 64bit CentOS 6.6.
>> 
>>            I will try running it with cidr/16 and also with the change
>>            in the hash function you suggested in your other message.
>> 
>>            Thanks for your help!
>> 
>>            Christos.
>> 
>>                On 12/05/2014 04:14 AM, Carter Bullard wrote:
>>                Hey Christos,
>>                We could be a bit more scientific about this.  How much
>>                of the file was completed after 3 hours ?
>>                Did you try cidr/8 and cidr/16 ??   What kind of machine
>>                is this running on ???
>> 
>>                Carter
>> 
>>                    On Dec 5, 2014, at 7:53 AM, Christos Papadopoulos
>>                    <christos at cs.colostate.edu> wrote:
>> 
>>                    On 12/04/2014 02:54 AM, Carter Bullard wrote:
>> 
>>                        Hey Christos,
>>                        With CIDR/24 address hierarchy preservation, it
>>                        may be thrashing trying to find an appropriate
>>                        CIDR/24 prefix that hasn’t been allocated, when
>>                        it needs a new one.  I suspect that your 55M
>>                        addresses are really 55M CIDR/24’s.  You may get
>>                        some real speed up if you go to CIDR/16,
>>                        or CIDR/8.  If you could try that, just as an
>>                        experiment, and see if the output is a bit quicker,
>>                        I think I can make some changes to improve the
>>                        allocation.
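
To make the prefix lengths concrete: preserving the hierarchy at CIDR/N means addresses are grouped by their leading N bits, so there can be at most 2^N distinct networks (256 for /8, 65,536 for /16, 16,777,216 for /24). Below is a minimal sketch of that grouping, assuming IPv4 addresses held as 32-bit integers in host byte order. `network_prefix` is a hypothetical helper, not a function from the argus clients.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the leading prefix_len bits of an IPv4 address
 * (host byte order) and zero the remaining host bits. */
uint32_t network_prefix(uint32_t addr, unsigned int prefix_len) {
    assert(prefix_len <= 32);
    if (prefix_len == 0) {
        return 0; /* a /0 prefix keeps no bits at all */
    }
    /* Build the mask in 64 bits, then truncate it to 32 bits. */
    uint32_t mask = (uint32_t)(0xFFFFFFFFull << (32 - prefix_len));
    return addr & mask;
}
```

For 192.168.37.5 (0xC0A82505) this yields 0xC0A82500 at /24, 0xC0A80000 at /16 and 0xC0000000 at /8, so coarser prefixes collapse many addresses into far fewer networks.
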
>> 
>> 
>>                    I tried it by changing the config file to CIDR/8. I
>>                    don't think it made much of a difference. I let the
>>                    process run for over 3 hours before I had to kill it
>>                    again. At that point I saw similar progress as before.
>> 
>>                    Sorry!
>> 
>>                    Christos.
>> 
>> 
>>                        I suspect that you get decent output at first
>>                        and then it slows down to a crawl, as it's busy
>>                        trying to find an address slot that is
>>                        appropriate for the next CIDR/24.  It's a hash
>>                        collision and then a search for an open slot,
>>                        which may not be optimal.  It should be easy to
>>                        thread out to another processor.
>> 
>>                        Carter
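
The collision-then-search behaviour described above can be illustrated with a generic linear-probing table. This is only a sketch of the general technique, assuming a small fixed-size table and a hypothetical `claim_slot` helper. It is not the allocation code in ranonymize. It shows why the number of probes per allocation grows as the table fills.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table size, kept tiny so the effect is easy to see. */
#define SLOT_COUNT 8

/* Linear probing: start at hash % SLOT_COUNT and walk forward, wrapping
 * around, until a free slot is found. The number of slots examined is
 * stored in *probes. Returns the claimed slot index, or -1 if every
 * slot is already taken. */
int claim_slot(bool used[SLOT_COUNT], unsigned int hash, size_t *probes) {
    assert(probes != NULL);
    size_t start = (size_t)hash % SLOT_COUNT;
    *probes = 0;
    for (size_t i = 0; i < SLOT_COUNT; i++) {
        size_t slot = (start + i) % SLOT_COUNT;
        (*probes)++;
        if (!used[slot]) {
            used[slot] = true;
            return (int)slot;
        }
    }
    return -1;
}
```

On an empty table the first claim costs one probe. Once most slots are taken, a colliding hash can force a walk across nearly the whole table, which is the kind of slowdown that would look like good initial throughput followed by a crawl.
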
>> 
>>                            On Dec 2, 2014, at 2:59 PM, Christos
>>                            Papadopoulos <christos at cs.colostate.edu> wrote:
>> 
>>                            On 12/02/2014 12:40 AM, Carter Bullard wrote:
>> 
>>                                Hey Christos,
>>                                Did you specify a ranonymize.conf file,
>>                                or are you using all defaults ?
>> 
>> 
>>                            I customized the ranonymize.conf file to
>>                            anonymize IP addresses only. See below.
>> 
>>                                You may want to allocate addresses using
>>                                a different strategy.  Using the default
>>                                algorithm, the allocation of 55M
>>                                addresses will take some time, did you
>>                                get any output at all  ???
>> 
>> 
>>                            I need to use prefix-preserving
>>                            anonymization, similar to cryptopan. Which
>>                            algorithm would you suggest?
>> 
>>                            I do see the output file growing. It just
>>                            takes a really long time, to the point where
>>                            it is unusable for our case.
>> 
>>                            Here are the settings I used. Please let me
>>                            know if I should change anything. I only
>>                            need IP addresses anonymized.
>> 
>>                            RANON_SEED=29384938
>>                            RANON_TRANSREFNUM_OFFSET=no
>>                            RANON_SEQNUM_OFFSET=no
>>                            RANON_TIME_SEC_OFFSET=no
>>                            RANON_TIME_USEC_OFFSET=no
>>                            RANON_ETHERNET_ANONYMIZATION=no
>>                            RANON_PRESERVE_ETHERNET_VENDOR=yes
>>                            RANON_PRESERVE_ETHERNET_BROADCAST=yes
>>                            RANON_PRESERVE_ETHERNET_MULTICAST=yes
>> 
>>                            RANON_NET_ANONYMIZATION=sequential
>>                            RANON_HOST_ANONYMIZATION=sequential
>>                            RANON_AS_ANONYMIZATION=sequential
>>                            RANON_NETWORK_ADDRESS_LENGTH=24
>> 
>>                            RANON_PRESERVE_NET_ADDRESS_HIERARCHY=cidr/24
>>                            RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>                            RANON_PRESERVE_MULTICAST_ADDRESS=yes
>>                            RANON_PRESERVE_IP_ID=none
>>                            RANON_PRESERVE_ICMPMAPPED_TTL=yes
>>                            RANON_PRESERVE_IP_TTL=none
>>                            RANON_PRESERVE_IP_TOS=none
>>                            RANON_PRESERVE_WELLKNOWN_PORT_NUMS=yes
>>                            RANON_PRESERVE_REGISTERED_PORT_NUMS=yes
>>                            RANON_PRESERVE_PRIVATE_PORT_NUMS=yes
>>                            RANON_PORT_METHOD=no
>> 
>>                            Christos.
>> 
>> 
>>                                Carter
>> 
>> 
>> 
>>                                    On Dec 2, 2014, at 2:38 AM, Christos
>>                                    Papadopoulos
>>                                    <christos at cs.colostate.edu> wrote:
>> 
>>                                    Hi Carter,
>> 
>>                                    We are using the latest version of
>>                                    the client tools.
>> 
>>                                    After letting it run for 4.5 hours I
>>                                    had to kill it. There are just under
>>                                    a billion records in the file. When
>>                                    I killed it, this is what I got. I
>>                                    have no idea how much longer it
>>                                    would run.
>> 
>>                                    Address Summary
>>                                       IPv4 Unicast              src 11411339    dst 43953546
>>                                       IPv4 Unicast Private      src 85          dst 353
>>                                       IPv4 Unicast Reserved     src 12654028    dst 51692353
>>                                       IPv4 Multicast Local      src 0           dst 2
>> 
>>                                    Christos.
>> 
>>                                        On 12/01/2014 11:49 AM, Carter
>>                                        Bullard wrote:
>>                                        Hey Christos,
>>                                        The primary demand in IP address
>>                                        anonymization is the number of
>>                                        IP addresses that need to be
>>                                        anonymized.   So how many
>>                                        addresses are in the file ??
>> 
>>                                            racount -M addr -r big.file
>> 
>>                                        What version of clients are you
>>                                        using ??
>>                                        Carter
>> 
>>                                            On Dec 1, 2014, at 1:14 AM,
>>                                            Christos Papadopoulos
>>                                            <christos at cs.colostate.edu>
>>                                            wrote:
>> 
>>                                            Hi folks,
>> 
>>                                            I am trying to use
>>                                            ranonymize for some large
>>                                            argus files. This is useful
>>                                            for us because we want to
>>                                            share some argus data with
>>                                            fellow researchers, but
>>                                            anonymize them to protect
>>                                            the innocent.
>> 
>>                                            The file I am trying to
>>                                            anonymize is large, about
>>                                            18GB compressed. As you can
>>                                            imagine, there are millions
>>                                            of flows in there.
>> 
>>                                            I only want IP address
>>                                            anonymization, so I turned
>>                                            everything else off in the
>>                                            ranonymize.conf file.
>> 
>>                                            Well, ranonymize has been
>>                                            running for almost 3 hours
>>                                            with about 1/20th of the
>>                                            file done. It is using 100%
>>                                            of a CPU, but only 4% of
>>                                            memory in a 32GB machine.
>>                                            Clearly it's not a memory or
>>                                            swap issue.
>> 
>>                                            I can't figure out why it's
>>                                            taking so long. I thought it
>>                                            would be almost as fast as
>>                                            reading and writing the file
>>                                            plus some time to
>>                                            compress/decompress and some
>>                                            time for checking the hash
>>                                            for the anonymized addresses.
>> 
>>                                            Any idea what's pounding the
>>                                            CPU and slowing it down? I
>>                                            can investigate further by
>>                                            profiling the code, but
>>                                            thought I'd throw the question
>>                                            out there first in case
>>                                            someone else has done it.
>> 
>>                                            Thanks!
>> 
>>                                            Christos.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> Kaustubh Gadkari
