Anonymization of argus flow data
Kaustubh Gadkari
kaustubh at cs.colostate.edu
Tue Oct 22 11:19:45 EDT 2013
Carter,
These are the results of racount -M addr on my input file:
racount    records   total_pkts     src_pkts    dst_pkts     total_bytes       src_bytes       dst_bytes
sum      660346226  22569184689  13942009249  8627175440  23277747475402  12592489900324  10685257575078
Address Summary
IPv4 Unicast src 6305958 dst 47659327
IPv4 Unicast This Network src 3 dst 0
IPv4 Unicast Private src 16325 dst 932
IPv4 Unicast Reserved src 9169301 dst 68904256
IPv6 LinkLocal src 742 dst 0
IPv6 Multicast Link Local src 0 dst 737
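
As a quick sanity check on these figures, here is a minimal Python sketch that parses the output above, confirms that the src and dst columns add up to the totals, and sums the per-class address counts. It is not part of argus; the parse_racount helper is hypothetical, and summing the address classes assumes they are disjoint, which I have not confirmed.

```python
import re

# Figures copied verbatim from the racount output above.
RACOUNT_OUTPUT = """\
racount records total_pkts src_pkts dst_pkts total_bytes src_bytes dst_bytes
sum 660346226 22569184689 13942009249 8627175440 23277747475402 12592489900324 10685257575078
IPv4 Unicast src 6305958 dst 47659327
IPv4 Unicast This Network src 3 dst 0
IPv4 Unicast Private src 16325 dst 932
IPv4 Unicast Reserved src 9169301 dst 68904256
IPv6 LinkLocal src 742 dst 0
IPv6 Multicast Link Local src 0 dst 737
"""


def parse_racount(text):
    """Hypothetical helper: split the summary row and the address rows."""
    lines = text.strip().splitlines()
    header = lines[0].split()[1:]                     # drop the "racount" label
    values = [int(v) for v in lines[1].split()[1:]]   # drop the "sum" label
    totals = dict(zip(header, values))

    addr_re = re.compile(r"^(.*?)\s+src\s+(\d+)\s+dst\s+(\d+)$")
    addrs = {}
    for line in lines[2:]:
        m = addr_re.match(line.strip())
        if m:
            addrs[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return totals, addrs


totals, addrs = parse_racount(RACOUNT_OUTPUT)

# Internal consistency: src + dst should equal the reported totals.
pkts_ok = totals["src_pkts"] + totals["dst_pkts"] == totals["total_pkts"]
bytes_ok = totals["src_bytes"] + totals["dst_bytes"] == totals["total_bytes"]

# Rough per-record averages.
pkts_per_record = totals["total_pkts"] / totals["records"]
bytes_per_record = totals["total_bytes"] / totals["records"]

# ASSUMPTION: the address classes are disjoint, so their counts can be added.
src_total = sum(src for src, _ in addrs.values())
dst_total = sum(dst for _, dst in addrs.values())

print("packet columns consistent:", pkts_ok)
print("byte columns consistent:  ", bytes_ok)
print("avg packets per record:    %.1f" % pkts_per_record)
print("avg bytes per record:      %.1f" % bytes_per_record)
print("src address entries:      ", src_total)
print("dst address entries:      ", dst_total)
```

If the disjointness assumption holds, the two address totals give a rough idea of how many entries a per-address table would have to hold for this file.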
Kaustubh
On Oct 15, 2013, at 1:27 PM, Kaustubh Gadkari <kaustubh at CS.ColoState.EDU> wrote:
>
> On Oct 15, 2013, at 1:24 PM, Carter Bullard <carter at qosient.com> wrote:
>
>> Hey Kaustubh,
>> I haven't found anything that would generate obvious delays in the algorithms.
>> How many IP addresses are we talking about??
>>
>> racount -M addr -r input.argus
>>
>
> I am not sure. I have run this before and the process never finished (I let it run for about 2 hours before killing it), so I expect we have quite a large number of IP addresses. I can run racount again, let it finish, and post the results.
>
> Kaustubh
>
>> Carter
>>
>>
>> On Oct 8, 2013, at 2:11 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>
>>> Hey Carter,
>>>
>>> I don't have a .rarc file, and I am not setting RA_PRINT_NAMES explicitly anywhere. My invocation of ranonymize is as follows:
>>>
>>> ranonymize -f /path/to/configfile -r input.argus -w output.argus - <filter expression>
>>>
>>> The config file has the following entries:
>>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>> RANON_NET_ANONYMIZATION=sequential
>>> RANON_HOST_ANONYMIZATION=sequential
>>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=class
>>>
>>> Thanks,
>>> Kaustubh
>>>
>>>
>>> On Oct 8, 2013, at 10:04 AM, Carter Bullard <carter at qosient.com> wrote:
>>>
>>>> Hey Kaustubh,
>>>> There is a chance that if you run ranonymize() with the options to
>>>> print hostnames, either in the .rarc file or using the -nn option
>>>> on the command line, you will hurt ranonymize's performance by doing
>>>> bind lookups on each address before the number is translated.
>>>>
>>>> Any chance that is going on here? What is the value of your RA_PRINT_NAMES
>>>> variable in your .rarc, and/or how are you calling ranonymize() ?
>>>>
>>>> Carter
>>>>
>>>>
>>>> On Oct 8, 2013, at 8:45 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>
>>>>> Hey Kaustubh,
>>>>> I have not had a chance, but thanks for reminding me.
>>>>> I'll look at it today !!!! Keep bugging me !!!
>>>>>
>>>>> Carter
>>>>>
>>>>>> On Oct 7, 2013, at 12:39 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>>>>
>>>>>> Hey Carter,
>>>>>>
>>>>>> I just wanted to check if you've found any reasons why ranonymize is taking so long to complete on my dataset?
>>>>>>
>>>>>> Thanks,
>>>>>> Kaustubh
>>>>>>
>>>>>>> On Sep 10, 2013, at 10:40 AM, Kaustubh Gadkari <kaustubh at CS.ColoState.EDU> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Sep 10, 2013, at 9:33 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>
>>>>>>>> Well,
>>>>>>>> On my system 80% of the cycles are being spent doing the address,
>>>>>>>> port, mac, AS number mappings (managing allocation of a new object
>>>>>>>> and caching the values), and a small amount on the lookups.
>>>>>>>>
>>>>>>>> I'll work on profiling the mapping logic to see if we've got
>>>>>>>> something askew.
>>>>>>>
>>>>>>> Great. Thanks again for the help.
>>>>>>>
>>>>>>>> Hope all is most excellent,
>>>>>>>
>>>>>>> And with you too :)
>>>>>>>
>>>>>>> Kaustubh
>>>>>>>
>>>>>>>> Carter
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Sep 10, 2013, at 12:22 PM, Kaustubh Gadkari <kaustubh at CS.ColoState.EDU> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Sep 10, 2013, at 8:40 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hey Kaustubh,
>>>>>>>>>> I've been profiling ranonymize() with a lot of data, and
>>>>>>>>>> while I do see opportunities to improve performance, I don't
>>>>>>>>>> see many massively inefficient parts of the code, when run
>>>>>>>>>> against my data sets. There are still some things for
>>>>>>>>>> me to look at, so I wanted you to know that I'm working on
>>>>>>>>>> your problem.
>>>>>>>>>
>>>>>>>>> Thanks for looking at this, Carter.
>>>>>>>>>
>>>>>>>>>> Based on what you've sent me so far, your machine is 85%
>>>>>>>>>> idle. Is ranonymize() using 100% of a single core, or is it
>>>>>>>>>> sleeping a lot?
>>>>>>>>>
>>>>>>>>> top says ranonymize is using 100% of a single core.
>>>>>>>>>
>>>>>>>>>> What kind of machine are you running on?? Can you describe the
>>>>>>>>>> machine a bit? CPUs, memory, disks, etc.
>>>>>>>>>
>>>>>>>>> I've been testing this on two machines. One is a Dell PowerEdge 2970, with 2 quad core AMD Opteron processors. The machine has 32GB RAM, a 130GB system disk and 16 8TB RAID5 partitions. The other machine is a Dell PowerEdge 2950. It has 2 quad core Intel Xeon X5450 CPUs, with 32GB RAM, a 140GB system disk and 3 8TB RAID5 partitions.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kaustubh
>>>>>>>>>
>>>>>>>>>> Carter
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Sep 3, 2013, at 3:05 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 3, 2013 at 12:33 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>> Hmmmmmm, well, you're not using the machine much (85% idle)
>>>>>>>>>>>> so I'm looking into whether we're making any calls to any
>>>>>>>>>>>> routines that would add some wait states, like name lookups, or
>>>>>>>>>>>> sleeping somewhere.
>>>>>>>>>>>>
>>>>>>>>>>>> Let's assume that there is a big problem, and I'll try to make
>>>>>>>>>>>> some changes to improve your performance.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Carter.
>>>>>>>>>>>
>>>>>>>>>>> Kaustubh
>>>>>>>>>>>
>>>>>>>>>>>> Carter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 3, 2013, at 1:57 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Carter,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 3, 2013, at 11:36 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Kaustubh,
>>>>>>>>>>>>>> If it's still writing records to the output file, it's not in an infinite loop,
>>>>>>>>>>>>>> although I'm sure that it feels like one. So, no need to print debug msgs
>>>>>>>>>>>>>> or run under gdb().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hmmmmmm, you must have a very large number of IP addresses. racount() isn't doing
>>>>>>>>>>>>>> anything exotic with the "-M addr" mode. It's hashing and storing each unique
>>>>>>>>>>>>>> IP address, so that we can report on how many and what types.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My guess is that you must be short on physical memory, and the programs are swapping,
>>>>>>>>>>>>>> which means that everything on this machine will be going very slowly.
>>>>>>>>>>>>>> Run " top " to see if one of our programs is eating all the memory, or
>>>>>>>>>>>>>> use vmstat() or vm_stat() or whatever to see if there is any paging.
>>>>>>>>>>>>>
>>>>>>>>>>>>> No, the machine is not running out of memory. ranonymize is the largest memory user, and it is using 42.1% of a total of 32GB RAM. The swap usage is only 205MB, which is OK. vmstat shows me the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>> kaustubh at proton:~$ sudo vmstat -w
>>>>>>>>>>>>> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>>>>>>>>>>>>>  r  b       swpd       free       buff      cache   si   so    bi    bo    in   cs  us sy id wa st
>>>>>>>>>>>>>  1  0     205916    1638176     101636   16287400    0    0   527   342     1    1  14  0 85  1  0
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are no other memory intensive processes running on the box.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If it is a memory problem, then you will need to subdivide the data based
>>>>>>>>>>>>>> on size, not on time, using rasplit(). And yes, it's easy to merge split files
>>>>>>>>>>>>>> back to a single file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> UNFORTUNATELY, because the scope of anonymization is the file, anonymizing a
>>>>>>>>>>>>>> single big file of records will generate different results compared to
>>>>>>>>>>>>>> anonymizing a set of split files created from the big file. Address A will be
>>>>>>>>>>>>>> anonymized potentially to a different address in each file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The configuration provides the means to get consistent results between files,
>>>>>>>>>>>>>> but it's a bit of work to do so.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you think you're running out of memory?
>>>>>>>>>>>>>
>>>>>>>>>>>>> No, I think I'm ok in terms of memory usage.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Carter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 3, 2013, at 1:11 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Sep 3, 2013 at 8:49 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>>>>>> Hmmm, if racount() takes 18min, I would think ranonymize() should take about 20min
>>>>>>>>>>>>>>>> to complete. You can run " racount -M addr " to get racount() to printout address
>>>>>>>>>>>>>>>> information, like how many addresses are in the file.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Carter, I ran racount with -M addr, but the process hasn't finished
>>>>>>>>>>>>>>> yet (it's been running for about 90 min now). I'll let it run for a
>>>>>>>>>>>>>>> while longer and keep you updated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ranonymize() works on a single argus record at a time, reading a single record,
>>>>>>>>>>>>>>>> anonymizing all the various data elements, and then writing the anonymized
>>>>>>>>>>>>>>>> record out to the output file. If ranonymize() hasn't written out a record recently,
>>>>>>>>>>>>>>>> then it's possible that it's in an infinite loop, especially if it's running at 100%, and
>>>>>>>>>>>>>>>> it's been running for a month, and it seems to have stopped writing into the file.
>>>>>>>>>>>>>>>> What was the last " modified " time on your output file ???
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It hasn't stopped writing to file .. the last modified time is right
>>>>>>>>>>>>>>> now, since the process is still running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you've compiled debug support into your ra* programs, you can send a USR1
>>>>>>>>>>>>>>>> signal to the running ranonymize() and it will start writing debug information out
>>>>>>>>>>>>>>>> to stderr(). Send a USR2 to turn debug output off. Assuming that ranonymize()'s
>>>>>>>>>>>>>>>> process id is 35122, you can do this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> % kill -USR1 35122
>>>>>>>>>>>>>>>> % kill -USR2 35122
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you've compiled development support into your programs, you can attach
>>>>>>>>>>>>>>>> to ranonymize() using gdb(), and then step through the program to see where
>>>>>>>>>>>>>>>> it is.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't compiled my ra* programs with debug or development support.
>>>>>>>>>>>>>>> If you can tell me what I need to change in the Makefiles, I can do so
>>>>>>>>>>>>>>> and run ranonymize with gdb and see what's happening.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> % gdb ranonymize 35122
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This will attach to the program, and stop the active process. If this all seems
>>>>>>>>>>>>>>>> unfamiliar, send more email, and I'll walk you through one of these strategies.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Carter
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sep 3, 2013, at 9:56 AM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Sep 3, 2013 at 7:19 AM, Kaustubh Gadkari
>>>>>>>>>>>>>>>>> <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>>>>>>> On Tue, Sep 3, 2013 at 6:00 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>>>>>>>>> Hmmmm,
>>>>>>>>>>>>>>>>>>> There shouldn't be any performance issues with anonymizing a file, if you're
>>>>>>>>>>>>>>>>>>> just anonymizing the IP addresses. How many addresses are in the file?
>>>>>>>>>>>>>>>>>>> What does your ranonymize.conf file look like? How much memory is it using?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am not quite sure how many IP addresses there are in the file. My
>>>>>>>>>>>>>>>>>> ranonymize.conf looks like this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>>>>>>>>>>>>>>>>>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>>>>>>>>>>>>>>>>> RANON_NET_ANONYMIZATION=sequential
>>>>>>>>>>>>>>>>>> RANON_HOST_ANONYMIZATION=sequential
>>>>>>>>>>>>>>>>>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=class
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I took a look at how much memory ranonymize is using .. the usage is
>>>>>>>>>>>>>>>>>> about 42% on a machine with 32GB RAM.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ranonymize() can be a little complex O(nLogN + C), but it should be
>>>>>>>>>>>>>>>>>>> in the same time frame as racount(). How long does it take for racount()
>>>>>>>>>>>>>>>>>>> to read the file?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am running racount right now .. I will post results once it finishes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> racount takes about 18min to run on the file:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> real 17m58.528s
>>>>>>>>>>>>>>>>> user 17m12.413s
>>>>>>>>>>>>>>>>> sys 2m0.332s
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Just a rule of thumb. If a ra* program doesn't complete in a few minutes,
>>>>>>>>>>>>>>>>>>> you should stop it and try to figure out if there is a memory problem or not.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks, I'll keep this in mind :)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Carter
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sep 2, 2013, at 2:20 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I have a set of argus flow data captured at our data capture vantage point,
>>>>>>>>>>>>>>>>>>> and I want to anonymize the IP addresses (both source and destination) fully
>>>>>>>>>>>>>>>>>>> i.e. I want to replace both the addresses, using a prefix preserving
>>>>>>>>>>>>>>>>>>> technique. I have tried using ranonymize, but it is taking an extremely long
>>>>>>>>>>>>>>>>>>> time to anonymize the file (I started the process a couple of months ago, on
>>>>>>>>>>>>>>>>>>> a ~125GB file, and the output file size today is only ~30GB).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can anyone suggest the right way to go about anonymizing the data set I
>>>>>>>>>>>>>>>>>>> have? Is ranonymize the right tool for the job?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>>> kaustubh at cs.colostate.edu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Kaustubh Gadkari
>>>>>>>>> kaustubh at cs.colostate.edu
>>>>>>>
>>>>>>> --
>>>>>>> Kaustubh Gadkari
>>>>>>> kaustubh at cs.colostate.edu
>>>>>>
>>>>>> --
>>>>>> Kaustubh Gadkari
>>>>>> kaustubh at cs.colostate.edu
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Kaustubh Gadkari
>>> kaustubh at cs.colostate.edu
>>>
>>
>
> --
> Kaustubh Gadkari
> kaustubh at cs.colostate.edu
>
--
Kaustubh Gadkari
kaustubh at cs.colostate.edu