Anonymization of argus flow data
Carter Bullard
carter at qosient.com
Tue Oct 8 08:45:46 EDT 2013
Hey Kaustubh,
I have not had a chance, but thanks for reminding me.
I'll look at it today !!!! Keep bugging me !!!
Carter
> On Oct 7, 2013, at 12:39 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>
> Hey Carter,
>
> I just wanted to check if you've found any reasons why ranonymize is taking so long to complete on my dataset?
>
> Thanks,
> Kaustubh
>
>> On Sep 10, 2013, at 10:40 AM, Kaustubh Gadkari <kaustubh at CS.ColoState.EDU> wrote:
>>
>>
>>> On Sep 10, 2013, at 9:33 AM, Carter Bullard <carter at qosient.com> wrote:
>>>
>>> Well,
>>> On my system 80% of the cycles are being spent doing the address,
>>> port, mac, AS number mappings (managing allocation of a new object
>>> and caching the values), and a small amount on the lookups.
>>>
>>> I'll work on profiling the mapping logic to see if we've got
>>> something askew.
>>
>> Great. Thanks again for the help.
>>
>>> Hope all is most excellent,
>>
>> And with you too :)
>>
>> Kaustubh
>>
>>> Carter
>>>
>>>
>>>
>>>> On Sep 10, 2013, at 12:22 PM, Kaustubh Gadkari <kaustubh at CS.ColoState.EDU> wrote:
>>>>
>>>>
>>>>> On Sep 10, 2013, at 8:40 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>
>>>>> Hey Kaustubh,
>>>>> I've been profiling ranonymize() with a lot of data, and
>>>>> while I do see opportunities to improve performance, I don't
>>>>> see many massively inefficient parts of the code, when run
>>>>> against my data sets. There are still some things for
>>>>> me to look at, so I wanted you to know that I'm working on
>>>>> your problem.
>>>>
>>>> Thanks for looking at this, Carter.
>>>>
>>>>> Based on what you've sent me so far, your machine is 85%
>>>>> idle. Is ranonymize() using 100% of a single core, or is it
>>>>> sleeping a lot?
>>>>
>>>> top says ranonymize is using 100% of a single core.
>>>>
>>>>> What kind of machine are you running on?? Can you describe the
>>>>> machine a bit? CPUs, memory, disks, etc.
>>>>
>>>> I've been testing this on two machines. One is a Dell PowerEdge 2970, with 2 quad core AMD Opteron processors. The machine has 32GB RAM, a 130GB system disk and 16 8TB RAID5 partitions. The other machine is a Dell PowerEdge 2950. It has 2 quad core Intel Xeon X5450 CPUs, with 32GB RAM, a 140GB system disk and 3 8TB RAID5 partitions.
>>>>
>>>> Thanks,
>>>> Kaustubh
>>>>
>>>>> Carter
>>>>>
>>>>>
>>>>>> On Sep 3, 2013, at 3:05 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>>>>
>>>>>>> On Tue, Sep 3, 2013 at 12:33 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>> Hmmmmmm, well, you're not using the machine much (85% idle)
>>>>>>> so I'm looking into whether we're making any calls to any
>>>>>>> routines that would add some wait states, like name lookups, or
>>>>>>> sleeping somewhere.
>>>>>>>
>>>>>>> Let's assume that there is a big problem, and I'll try to make
>>>>>>> some changes to improve your performance.
>>>>>>
>>>>>> Thanks, Carter.
>>>>>>
>>>>>> Kaustubh
>>>>>>
>>>>>>> Carter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Sep 3, 2013, at 1:57 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>>>>>>
>>>>>>>> Hey Carter,
>>>>>>>>
>>>>>>>>> On Sep 3, 2013, at 11:36 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>
>>>>>>>>> Hey Kaustubh,
>>>>>>>>> If it's still writing records to the output file, it's not in an infinite loop,
>>>>>>>>> although I'm sure that it feels like one. So, no need to print debug msgs
>>>>>>>>> or run under gdb().
>>>>>>>>>
>>>>>>>>> Hmmmmmm, you must have a very large number of IP addresses. racount() isn't doing
>>>>>>>>> anything exotic with the "-M addr" mode. It's hashing and storing each unique
>>>>>>>>> IP address, so that we can report on how many and what types.
>>>>>>>>>
>>>>>>>>> My guess is that you must be short on physical memory, and the programs are swapping,
>>>>>>>>> which means that everything on this machine will be going very slowly.
>>>>>>>>> Run " top " to see if one of our programs is eating all the memory, or
>>>>>>>>> use vmstat() or vm_stat() or whatever to see if there is any paging.
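
To make the memory cost of that hashing concrete, here is a minimal Python sketch of storing every unique address and then reporting counts by type. It is not racount's code: the (src, dst) record shape, the function names, and the three type labels are assumptions made for illustration. The point is only that the set grows with the number of distinct addresses, not with the number of records.

```python
import ipaddress
from collections import Counter


def classify(addr):
    """Return a coarse, illustrative type label for an address."""
    if addr.is_multicast:
        return "multicast"
    if addr.is_private:
        return "private"
    return "other"


def count_unique_addresses(flows):
    """Count distinct addresses in an iterable of (src, dst) strings.

    The (src, dst) tuple is a hypothetical stand-in for a flow record.
    """
    seen = set()  # one hashed entry per distinct address
    for src, dst in flows:
        seen.add(ipaddress.ip_address(src))
        seen.add(ipaddress.ip_address(dst))
    by_type = Counter(classify(a) for a in seen)
    return len(seen), dict(by_type)


if __name__ == "__main__":
    sample = [
        ("10.0.0.1", "8.8.8.8"),
        ("10.0.0.2", "8.8.8.8"),
        ("10.0.0.1", "224.0.0.251"),
    ]
    print(count_unique_addresses(sample))
```

A file with a very large number of distinct addresses makes this set large no matter how the records are ordered, which is why the address count is the first thing worth measuring.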
>>>>>>>>
>>>>>>>> No, the machine is not running out of memory. ranonymize is the largest memory user, and it is using 42.1% of a total of 32GB RAM. The swap usage is only 205MB, which is OK. vmstat shows me the following:
>>>>>>>>
>>>>>>>> kaustubh@proton:~$ sudo vmstat -w
>>>>>>>> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>>>>>>>>  r  b       swpd       free       buff      cache   si   so    bi    bo   in    cs us sy  id wa st
>>>>>>>>  1  0     205916    1638176     101636   16287400    0    0   527   342    1     1 14  0  85  1  0
>>>>>>>>
>>>>>>>> There are no other memory intensive processes running on the box.
>>>>>>>>
>>>>>>>>> If it is a memory problem, then you will need to subdivide the data based
>>>>>>>>> on size, not on time, using rasplit(). And yes, it's easy to merge split files
>>>>>>>>> back to a single file.
>>>>>>>>>
>>>>>>>>> UNFORTUNATELY, because the scope of anonymization is the file, anonymizing a
>>>>>>>>> single big file of records will generate different results compared to
>>>>>>>>> anonymizing a set of split files created from the big file. Address A will be
>>>>>>>>> anonymized potentially to a different address in each file.
>>>>>>>>>
>>>>>>>>> The configuration provides the means to get consistent results between files,
>>>>>>>>> but it's a bit of work to do so.
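
A toy Python sketch of why the scope matters, assuming a simple sequential mapper in which the Nth new address becomes base + N. The class, the function, and the addresses are hypothetical, and this is not how ranonymize or its configuration achieves consistency. It only shows that a fresh mapping per file can give the same address two different pseudonyms, while state carried across files cannot.

```python
import ipaddress


class SequentialAnonymizer:
    """Toy mapper: the Nth distinct address seen becomes base + N."""

    def __init__(self, base="10.0.0.0"):
        self._base = int(ipaddress.IPv4Address(base))
        self._map = {}  # original address -> pseudonym

    def anonymize(self, addr):
        if addr not in self._map:
            offset = len(self._map) + 1
            self._map[addr] = str(ipaddress.IPv4Address(self._base + offset))
        return self._map[addr]


def anonymize_files(files, shared_state):
    """Anonymize each list of addresses, with or without shared state."""
    shared = SequentialAnonymizer()
    out = []
    for records in files:
        # A new mapper per file models "the scope of anonymization is the file".
        anon = shared if shared_state else SequentialAnonymizer()
        out.append([anon.anonymize(a) for a in records])
    return out


# Hypothetical data: one "big file" split into two parts.
big_file = ["192.168.1.5", "172.16.0.9", "192.168.1.5", "172.16.0.9"]
part_a, part_b = big_file[:1], big_file[1:]

if __name__ == "__main__":
    print(anonymize_files([part_a, part_b], shared_state=False))
    print(anonymize_files([part_a, part_b], shared_state=True))
```

With this sample data, 192.168.1.5 becomes 10.0.0.1 in the first split file and 10.0.0.2 in the second when each file gets its own mapping. With shared state it is 10.0.0.1 in both.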
>>>>>>>>>
>>>>>>>>> Do you think you're running out of memory?
>>>>>>>>
>>>>>>>> No, I think I'm ok in terms of memory usage.
>>>>>>>>
>>>>>>>> Kaustubh
>>>>>>>>
>>>>>>>>> Carter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Sep 3, 2013, at 1:11 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 3, 2013 at 8:49 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>> Hmmm, if racount() takes 18min, I would think ranonymize() should take about 20min
>>>>>>>>>>> to complete. You can run " racount -M addr " to get racount() to print out address
>>>>>>>>>>> information, like how many addresses are in the file.
>>>>>>>>>>
>>>>>>>>>> Carter, I ran racount with -M addr, but the process hasn't finished
>>>>>>>>>> yet (it's been running for about 90 min now). I'll let it run for a
>>>>>>>>>> while longer and keep you updated.
>>>>>>>>>>
>>>>>>>>>>> ranonymize() works on a single argus record at a time, reading a single record,
>>>>>>>>>>> anonymizing all the various data elements, and then writing the anonymized
>>>>>>>>>>> record out to the output file. If ranonymize() hasn't written out a record recently,
>>>>>>>>>>> then it's possible that it's in an infinite loop, especially if it's running at 100%, and
>>>>>>>>>>> it's been running for a month, and it seems to have stopped writing into the file.
>>>>>>>>>>> What was the last " modified " time on your output file ???
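
One way to answer that question without watching the file by hand is to compare its modification time with the current time. A minimal Python sketch follows. The function names and the 300-second threshold are arbitrary assumptions, and a recent timestamp only shows that the process is still writing, not that it is making good progress.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago the file at path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_stalled(path, threshold_s=300, now=None):
    """Return True if the file has not been written for threshold_s seconds.

    The threshold is an assumption; pick one that suits the expected
    write rate of the program being watched.
    """
    return seconds_since_modified(path, now) > threshold_s


if __name__ == "__main__":
    import sys

    for output_file in sys.argv[1:]:
        age = seconds_since_modified(output_file)
        state = "stalled?" if looks_stalled(output_file) else "still writing"
        print(f"{output_file}: modified {age:.0f}s ago ({state})")
```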
>>>>>>>>>>
>>>>>>>>>> It hasn't stopped writing to file .. the last modified time is right
>>>>>>>>>> now, since the process is still running.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> If you've compiled debug support into your ra* programs, you can send a USR1
>>>>>>>>>>> signal to the running ranonymize() and it will start writing debug information out
>>>>>>>>>>> to stderr(). Send a USR2 to turn debug output off. Assuming that ranonymize()'s
>>>>>>>>>>> process id is 35122, you can do this:
>>>>>>>>>>>
>>>>>>>>>>> % kill -USR1 35122
>>>>>>>>>>> % kill -USR2 35122
>>>>>>>>>>>
>>>>>>>>>>> If you've compiled development support into your programs, you can attach
>>>>>>>>>>> to ranonymize() using gdb(), and then step through the program to see where
>>>>>>>>>>> it is.
>>>>>>>>>>
>>>>>>>>>> I haven't compiled my ra* programs with debug or development support.
>>>>>>>>>> If you can tell me what I need to change in the Makefiles, I can do so
>>>>>>>>>> and run ranonymize with gdb and see what's happening.
>>>>>>>>>>
>>>>>>>>>> Kaustubh
>>>>>>>>>>
>>>>>>>>>>> % gdb ranonymize 35122
>>>>>>>>>>>
>>>>>>>>>>> This will attach to the program, and stop the active process. If this all seems
>>>>>>>>>>> unfamiliar, send more email, and I'll walk you through one of these strategies.
>>>>>>>>>>>
>>>>>>>>>>> Carter
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Sep 3, 2013, at 9:56 AM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 3, 2013 at 7:19 AM, Kaustubh Gadkari
>>>>>>>>>>>> <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>> On Tue, Sep 3, 2013 at 6:00 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>>>> Hmmmm,
>>>>>>>>>>>>>> There shouldn't be any performance issues with anonymizing a file, if you're just
>>>>>>>>>>>>>> anonymizing the IP addresses. How many addresses are in the file?
>>>>>>>>>>>>>> What does your ranonymize.conf file look like? How much memory is it using?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not quite sure how many IP addresses there are in the file. My
>>>>>>>>>>>>> ranonymize.conf looks like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>>>>>>>>>>>>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>>>>>>>>>>>> RANON_NET_ANONYMIZATION=sequential
>>>>>>>>>>>>> RANON_HOST_ANONYMIZATION=sequential
>>>>>>>>>>>>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=class
>>>>>>>>>>>>>
>>>>>>>>>>>>> I took a look at how much memory ranonymize is using .. the usage is
>>>>>>>>>>>>> about 42% on a machine with 32GB RAM.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> ranonymize() can be a little complex O(nLogN + C), but it should be
>>>>>>>>>>>>>> in the same time frame as racount(). How long does it take for racount()
>>>>>>>>>>>>>> to read the file?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am running racount right now .. I will post results once it finishes.
>>>>>>>>>>>>
>>>>>>>>>>>> racount takes about 18min to run on the file:
>>>>>>>>>>>>
>>>>>>>>>>>> real 17m58.528s
>>>>>>>>>>>> user 17m12.413s
>>>>>>>>>>>> sys 2m0.332s
>>>>>>>>>>>>
>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>
>>>>>>>>>>>>>> Just a rule of thumb. If a ra* program doesn't complete in a few minutes, you
>>>>>>>>>>>>>> should stop it and try to figure out if there is a memory problem or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, I'll keep this in mind :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Carter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 2, 2013, at 2:20 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a set of argus flow data captured at our data capture vantage point,
>>>>>>>>>>>>>> and I want to anonymize the IP addresses (both source and destination) fully
>>>>>>>>>>>>>> i.e. I want to replace both the addresses, using a prefix preserving
>>>>>>>>>>>>>> technique. I have tried using ranonymize, but it is taking an extremely long
>>>>>>>>>>>>>> time to anonymize the file (I started the process a couple of months ago, on
>>>>>>>>>>>>>> a ~125GB file, and the output file size today is only ~30GB).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone suggest the right way to go about anonymizing the data set I
>>>>>>>>>>>>>> have? Is ranonymize the right tool for the job?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Kaustubh Gadkari