ranonymize too slow?
Carter Bullard
carter at qosient.com
Sat Dec 6 03:01:21 EST 2014
Fall back to the default of cidr/24, and if that doesn't work completely, set the hash size to 0x10000. The code uses different logic for hash sizes larger than 0x10000.
Carter
> On Dec 6, 2014, at 7:16 AM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>
> Hi,
>
> I had kicked off a run of ranonymize with the new hash size. Good news: the code doesn't segfault. Bad news: ranonymize quits with the following error after about 12 minutes.
>
> RaMapNewNetwork: no addresses
>
> RANON_PRESERVE_NET_ADDRESS_HIERARCHY is set to cidr/8 in the config file.
>
> Kaustubh
>
>> On Dec 5, 2014, at 4:38 PM, Christos Papadopoulos <christos at CS.ColoState.EDU> wrote:
>>
>> Thanks Kaustubh!
>>
>> Carter, Kaustubh is one of my graduate students and he took it upon himself to look into the problem.
>>
>> He will do another anonymization run and report back to us with timing results.
>>
>> This is good progress, thanks all!
>>
>> Christos.
>>
>>> On 12/05/2014 04:06 PM, Kaustubh Gadkari wrote:
>>>
>>>
>>> On Fri, Dec 5, 2014 at 1:05 PM, Christos Papadopoulos
>>> <christos at cs.colostate.edu> wrote:
>>>
>>> On 12/05/2014 12:41 PM, Carter Bullard wrote:
>>>
>>> That is about 2500 records per sec. We should be able to do
>>> 10-50x that. I have gotten up to 1M rps, but not with open
>>> source argus.
>>> The hash size change should make a huge difference !!
>>>
>>>
>>> With the new hash, ranonymize produces 37,615 records in just over a
>>> second and then promptly crashes with both cidr/8 and cidr/16.
>>>
>>>
>>> I think I fixed the segfault issue. The patch is simple:
>>>
>>> kaustubh at proton:~/argus-clients-3.0.8/clients$ diff ranonymize.c ranonymize.c.new
>>> 1454c1454
>>> < int RaMapHash = 0;
>>> ---
>>> > unsigned int RaMapHash = 0;
>>>
>>> Kaustubh
>>>
>>>
>>> If you have some quick suggestions I can try them, else it will take
>>> some time to dig deeper.
>>>
>>> Christos.
>>>
>>>
>>>
>>> Carter
>>>
>>> On Dec 5, 2014, at 2:54 PM, Christos Papadopoulos
>>> <christos at cs.colostate.edu> wrote:
>>>
>>> Hi Carter,
>>>
>>> You are right, my apologies.
>>>
>>> With cidr/8 after three hours it anonymized about 8.8M
>>> records out of the nearly 1B records in the file. I counted
>>> this by running wc on the output file, which is a text file.
>>>
>>> The machine is a Dell Poweredge 2950, 3GHz Xeon with 8
>>> cores, 32GB of RAM and about 30TB of directly attached
>>> storage, running 64bit CentOS 6.6.
>>>
>>> I will try running it with cidr/16 and also with the change
>>> in the hash function you suggested in your other message.
>>>
>>> Thanks for your help!
>>>
>>> Christos.
>>>
>>> On 12/05/2014 04:14 AM, Carter Bullard wrote:
>>> Hey Christos,
>>> We could be a bit more scientific about this. How much
>>> of the file was completed after 3 hours ?
>>> Did you try cidr/8 and cidr/16 ?? What kind of machine
>>> is this running on ???
>>>
>>> Carter
>>>
>>> On Dec 5, 2014, at 7:53 AM, Christos Papadopoulos
>>> <christos at cs.colostate.edu> wrote:
>>>
>>> On 12/04/2014 02:54 AM, Carter Bullard wrote:
>>>
>>> Hey Christos,
>>> With CIDR/24 address hierarchy preservation, it
>>> may be thrashing trying to find an appropriate
>>> CIDR/24 prefix that hasn’t been allocated, when
>>> it needs a new one. I suspect that your 55M
>>> addresses are really 55M CIDR/24’s. You may get
>>> some real speed up if you go to CIDR/16,
>>> or CIDR/8. If you could try that, just as an
>>> experiment, and see if the output is a bit quicker,
>>> I think I can make some changes to improve the
>>> allocation.
>>>
>>>
>>> I tried it by changing the config file to CIDR/8. I
>>> don't think it made much of a difference. I let the
>>> process run for over 3 hours before I had to kill it
>>> again. At that point I saw similar progress as before.
>>>
>>> Sorry!
>>>
>>> Christos.
>>>
>>>
>>> I suspect that you get decent output at first
>>> and then it slows down to a crawl, as it's busy
>>> trying to find an address slot that is
>>> appropriate for the next CIDR/24. It's a hash collision
>>> and then a search for an open slot, which may
>>> not be optimal. It should be easy to thread
>>> out to another processor.
>>>
>>> Carter
>>>
>>> On Dec 2, 2014, at 2:59 PM, Christos
>>> Papadopoulos <christos at cs.colostate.edu> wrote:
>>>
>>> On 12/02/2014 12:40 AM, Carter Bullard wrote:
>>>
>>> Hey Christos,
>>> Did you specify a ranonymize.conf file,
>>> or are you using all defaults ?
>>>
>>>
>>> I customized the ranonymize.conf file to
>>> anonymize IP addresses only. See below.
>>>
>>> You may want to allocate addresses using
>>> a different strategy. Using the default
>>> algorithm, the allocation of 55M
>>> addresses will take some time, did you
>>> get any output at all ???
>>>
>>>
>>> I need to use prefix-preserving
>>> anonymization, similar to cryptopan. Which
>>> algorithm would you suggest?
>>>
>>> I do see the output file growing. It just
>>> takes a really long time, to the point where
>>> it is unusable for our case.
>>>
>>> Here are the settings I used. Please let me
>>> know if I should change anything. I only
>>> need IP addresses anonymized.
>>>
>>> RANON_SEED=29384938
>>> RANON_TRANSREFNUM_OFFSET=no
>>> RANON_SEQNUM_OFFSET=no
>>> RANON_TIME_SEC_OFFSET=no
>>> RANON_TIME_USEC_OFFSET=no
>>> RANON_ETHERNET_ANONYMIZATION=no
>>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>>> RANON_PRESERVE_ETHERNET_BROADCAST=yes
>>> RANON_PRESERVE_ETHERNET_MULTICAST=yes
>>>
>>> RANON_NET_ANONYMIZATION=sequential
>>> RANON_HOST_ANONYMIZATION=sequential
>>> RANON_AS_ANONYMIZATION=sequential
>>> RANON_NETWORK_ADDRESS_LENGTH=24
>>>
>>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=cidr/24
>>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>> RANON_PRESERVE_MULTICAST_ADDRESS=yes
>>> RANON_PRESERVE_IP_ID=none
>>> RANON_PRESERVE_ICMPMAPPED_TTL=yes
>>> RANON_PRESERVE_IP_TTL=none
>>> RANON_PRESERVE_IP_TOS=none
>>> RANON_PRESERVE_WELLKNOWN_PORT_NUMS=yes
>>> RANON_PRESERVE_REGISTERED_PORT_NUMS=yes
>>> RANON_PRESERVE_PRIVATE_PORT_NUMS=yes
>>> RANON_PORT_METHOD=no
>>>
>>> Christos.
>>>
>>>
>>> Carter
>>>
>>>
>>>
>>> On Dec 2, 2014, at 2:38 AM, Christos
>>> Papadopoulos <christos at cs.colostate.edu> wrote:
>>>
>>> Hi Carter,
>>>
>>> We are using the latest version of
>>> the client tools.
>>>
>>> After letting it run for 4.5 hours I
>>> had to kill it. There are just under
>>> a billion records in the file. When
>>> I killed it, this is what I got. I
>>> have no idea how much longer it
>>> would run.
>>>
>>> Address Summary
>>>   IPv4 Unicast            src 11411339   dst 43953546
>>>   IPv4 Unicast Private    src 85         dst 353
>>>   IPv4 Unicast Reserved   src 12654028   dst 51692353
>>>   IPv4 Multicast Local    src 0          dst 2
>>>
>>> Christos.
>>>
>>> On 12/01/2014 11:49 AM, Carter
>>> Bullard wrote:
>>> Hey Christos,
>>> The primary demand in IP address
>>> anonymization is the number of
>>> IP addresses that need to be
>>> anonymized. So how many
>>> addresses are in the file ??
>>>
>>> racount -M addr -r big.file
>>>
>>> What version of clients are you
>>> using ??
>>> Carter
>>>
>>> On Dec 1, 2014, at 1:14 AM,
>>> Christos Papadopoulos
>>> <christos at cs.colostate.edu> wrote:
>>>
>>> Hi folks,
>>>
>>> I am trying to use
>>> ranonymize for some large
>>> argus files. This is useful
>>> for us because we want to
>>> share some argus data with
>>> fellow researchers, but
>>> anonymize them to protect
>>> the innocent.
>>>
>>> The file I am trying to
>>> anonymize is large, about
>>> 18GB compressed. As you can
>>> imagine, there are millions
>>> of flows in there.
>>>
>>> I only want IP address
>>> anonymization, so I turned
>>> everything else off in the
>>> ranonymize.conf file.
>>>
>>> Well, ranonymize has been
>>> running for almost 3 hours
>>> with about 1/20th of the
>>> file done. It is using 100%
>>> of a CPU, but only 4% of
>>> memory in a 32GB machine.
>>> Clearly it's not a memory or
>>> swap issue.
>>>
>>> I can't figure out why it's
>>> taking so long. I thought it
>>> would be almost as fast as
>>> reading and writing the file
>>> plus some time to
>>> compress/decompress and some
>>> time for checking the hash
>>> for the anonymized addresses.
>>>
>>> Any idea what's pounding the
>>> CPU and slowing it down? I
>>> can investigate further by
>>> profiling the code, but
>>> thought I'd throw the question
>>> out there first in case
>>> someone else has done it.
>>>
>>> Thanks!
>>>
>>> Christos.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Kaustubh Gadkari
>