'packet engine' discussion.

Peter Van Epp vanepp at sfu.ca
Wed Mar 21 12:01:38 EST 2001


> 
> I'm hoping the subject line will attract attention :)

	Yep :-)

> 
>   I would like to get a discussion going about how to build the best platform 
> for monitoring the worst case scenarios in the life of a network, usually DoS 
> attacks.
	
	There are two broad approaches you can take: fast hardware or 
multiplexing. Both cost money (but so do fast networks usually, so we will 
ignore cost for this discussion :-)). Of the two, because of your interest in
DoS / attack scenarios, fast hardware is the preferred solution. The 
multiplexed case is easier / more mainstream, because the same hardware that 
does load sharing for big web farms can also be used to spread the load to 
multiple argi (argus instances) for parallel analysis. Unfortunately, because 
that solution depends on a distribution of load (which isn't necessarily true 
of the incoming pipe), it can break down (or more correctly be broken by 
enemy action) in the attack case. If the attacker concentrates all traffic 
(spoofed and real attack) on the same address (or whatever other selection 
criteria the multiplexor uses), it is possible to overload the mux without 
necessarily overloading the communications pipe, possibly masking the attack 
(because the packets will be delivered to the attacked machine but not 
necessarily to the monitor). This means that where attacks are possible you 
really want the monitor to be able to keep up with the full speed of the 
input pipe, so that an undetectable attack shouldn't be possible. That in 
turn means that Russell's suggestion of the special hardware the CAIDA folks 
are making is likely the profitable place to concentrate our efforts 
(especially since they are trying to do the same basic thing for different 
reasons, i.e. they want to do traffic accounting / analysis on full-bandwidth 
Internet backbone connections at high speeds).
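
	To make the mux weakness concrete, here is a rough sketch (the 
addresses are made up and a toy modulo hash stands in for whatever a real 
mux does) of a balancer that picks a sensor by destination address; when the 
attacker aims everything at one address, every packet lands on the same 
sensor no matter how many sensors sit behind the mux:

/* Rough sketch: a load-balancing mux that picks a sensor by hashing the
 * destination IP.  Concentrated attack traffic to one address all hashes
 * to the same sensor, so one box sees full line rate while the rest idle. */
#include <stdio.h>
#include <stdint.h>

#define NUM_SENSORS 4

static int pick_sensor(uint32_t dst_ip)
{
    return dst_ip % NUM_SENSORS;   /* any per-address hash has the same weakness */
}

int main(void)
{
    uint32_t victim = 0x8E650A01;          /* one attacked address (example value) */
    int load[NUM_SENSORS] = {0};
    int i;

    for (i = 0; i < 1000000; i++)          /* a million attack packets */
        load[pick_sensor(victim)]++;       /* same dst => same sensor every time */

    for (i = 0; i < NUM_SENSORS; i++)
        printf("sensor %d saw %d packets\n", i, load[i]);
    return 0;
}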
	
> 
>   I have an article at home (I'll post the URL tonight), that talks about how 
> a guy got the Intel Etherexpress Pro 1000 (one variant, anyways), with a 
> modified Linux driver to be able to receive (he had stats on send too), 60 
> byte packets at a rate of 680,000/second.  Now, this is only the card... and I 
> assume the machine was very likely consumed, being able to do no real work on 
> these packets it was receiving... but, that is indeed a very good number (I 
> think).

	Some folks from Intel told me the most they've seen is around 700 
megabits per second on a PC from their gig card, and they attributed that 
ceiling to operating system overhead rather than to the card or the machine 
hardware (whether they are correct or not is another issue :-)).
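
	For scale, some back-of-the-envelope arithmetic (mine, nothing 
measured): 680,000 60-byte packets a second is only about 326 megabits of 
frame data, while a gig pipe full of minimum-size frames is just under 1.5 
million packets a second, so the numbers above are still well short of the 
worst case:

/* Back-of-the-envelope check of the figures quoted above. */
#include <stdio.h>

int main(void)
{
    double pps = 680000.0, frame = 60.0;           /* quoted capture rate */
    printf("quoted rate: %.0f Mbit/s of frame data\n",
           pps * frame * 8.0 / 1e6);               /* ~326 Mbit/s */

    /* worst case on gig ether: 64-byte frame + 8 preamble + 12 IFG
     * = 84 bytes = 672 bit times per packet */
    printf("gig line rate, minimum frames: %.0f packets/s\n",
           1e9 / 672.0);                           /* ~1,488,095 pps */
    return 0;
}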

> 
>   I have heard that most Intel network cards will generate 1 interrupt per 
> packet, and that the CPU will start having difficulties at around 20,000 
> interrupts per second.  The Etherexpress pro 1000, and other cards, batch up 
> packets and send them, DMA, to memory on one interrupt, saving lots of CPU 
> over head.  Other cards do some of the TCP/ip header work, on the card.  
> Others do... (fill in cool performance feature here), and so on.  Question:  
> Which card, OS, drivers, features, and setup are best??

	A CPU with low interrupt overhead (which probably means not x86 :-)).
A CPU which saves (by default) nothing but the PSW and program counter 
(letting you decide which other registers you can't avoid saving) is going 
to be best. As I recall, x86 in protected mode saves all kinds of things 
(== eats many memory cycles). We probably want to look at what's popular in 
the various wire-speed routers (besides custom ASICs that avoid going near 
any CPU of course :-)). Static RAM as main memory is also an advantage: it 
is much faster than DRAM and, more importantly, has no refresh requirement 
stealing memory cycles at possibly inconvenient times.
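
	A quick toy model of the interrupt side (the 10 microsecond 
per-interrupt cost is an assumption, picked only to show the shape of the 
curve): one interrupt per packet at the rates above is hopeless, while 
batching packets per interrupt brings the rate back under the ~20,000 per 
second figure quoted above:

/* Toy model: interrupt rate and CPU overhead with and without batching.
 * The per-interrupt cost is an assumed 10 us of save/restore + handler. */
#include <stdio.h>

int main(void)
{
    double pps = 680000.0;                  /* packet rate from the quote above */
    double cost_us = 10.0;                  /* assumed overhead per interrupt */
    int batch;

    for (batch = 1; batch <= 64; batch *= 8) {
        double ints = pps / batch;
        printf("batch %2d: %8.0f ints/s, %6.1f%% of one CPU in overhead\n",
               batch, ints, ints * cost_us / 1e6 * 100.0);
    }
    return 0;
}
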
	Next we probably want to move away from a general-purpose OS (such as
Unix) which has lots of overhead we neither need nor can afford. There is an
open source embedded OS available (I'll dig up the web site later) that would
probably be a good start. The best way to get "zero copy" type performance is
to cheat and never switch out of kernel mode (which is bad in a general 
purpose OS but good when all we care about is processing a packet as fast as 
computerly possible :-)). Then we string the three argus tasks across three 
machines, and the pipes become sockets across a network connection rather 
than internal to the machine, which gives us more time to play with, since 
time is the name of the game when trying to keep up.
	Then we start playing games with the data capture. The interface 
probably needs to buffer the full frame (because it needs to verify the CRC),
but since we don't want all of the packet, stopping the transfer from the 
card to memory once we have seen all we want of the packet saves memory 
cycles (this may require some funky DMA processing and may not be possible, 
depending on the card).
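
	We can't do the stop-the-DMA trick from userland, but the same idea 
one level up is just libpcap's snaplen; a minimal sketch (the interface name 
"eth0" is only a placeholder) that asks for the first 96 bytes of each frame 
so the payload never gets copied into our buffers:

/* Minimal snaplen sketch: capture only the headers of each frame. */
#include <pcap.h>
#include <stdio.h>

static void handler(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("wire len %u, captured %u bytes\n", h->len, h->caplen);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    /* 96-byte snaplen: enough for link + IP + TCP headers, skip the payload */
    pcap_t *p = pcap_open_live("eth0", 96, 1, 1000, errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }
    pcap_loop(p, -1, handler, NULL);
    pcap_close(p);
    return 0;
}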

> 
>   Also, some of the places we will be monitoring, are full duplex.  Putting 
> taps in (like the ones at www.shomiti.com), feed us two 100, or 1000 Mbit 
> wires.  These will have to go into two cards in a server, and have Argus read 
> from both cards, and merge records.  Question:  Whats best, two cards, or 1 
> cards with dual ports??  If two cards, which is best, single, dual, or quad 
> CPU?  Do we tie interrupts from each card to a unique CPU?

	Modulo synchronization between the streams, two separate machines 
(each doing half the work) is the easier route from a performance standpoint.
This only really needs to happen on the sensor machines: two sensors reduce 
the packets to flows and then pass them to two interfaces on the machine 
running the second argus task, which merges the streams back into one. By 
that point the flow reduction has likely cut the required bandwidth enough 
that the machine doing the merge can keep up. Synchronization is going to be 
the exciting part of this one :-).
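
	The merge itself is just a two-way merge by timestamp; a sketch with 
an invented record layout (not argus's) that assumes the two sensors' clocks 
already agree, which of course is the hard part:

/* Sketch: fold two time-ordered half-duplex streams into one by timestamp. */
#include <stdio.h>

struct rec {
    double ts;       /* seconds, from the sensor's clock */
    int dir;         /* 0 = tap side A, 1 = tap side B */
};

static void merge(const struct rec *a, int na, const struct rec *b, int nb)
{
    int i = 0, j = 0;
    while (i < na || j < nb) {
        const struct rec *r;
        if (j >= nb || (i < na && a[i].ts <= b[j].ts))
            r = &a[i++];                   /* side A is next in time */
        else
            r = &b[j++];                   /* side B is next in time */
        printf("%.6f dir %d\n", r->ts, r->dir);
    }
}

int main(void)
{
    struct rec a[] = { {1.0, 0}, {3.0, 0} };   /* one direction of the tap */
    struct rec b[] = { {2.0, 1}, {4.0, 1} };   /* the other direction */
    merge(a, 2, b, 2);
    return 0;
}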

> 
>   Dealing with the packets you get is another issue... i.e.: memory bandwidth. 
> Now, there are new memory technologies coming (and current ones that maybe 
> have no data as to how they affect an application like Argus, ie: Rambus).  
> Which of these is most promising?

	Interleaving and static RAM (mainframes have done this for years). The
DRAM technologies are concentrating on high density rather than high speed.
Again, looking at wire-speed routers I expect you'll find very fast static 
RAM (e.g. 32 Kbyte of 15 nsec static RAM is common, against many megabytes of
120 nsec DRAM). While it is very expensive, it is also very fast, and in this
application we need fast more than we need density. Unfortunately, because it
isn't mainstream, we are back into custom hardware again. We may be able to 
use some of the fancy DRAM technologies (which depend on filling sequential 
bursts of data to get their speed) if the locality of reference is right for 
the application. A cache miss that takes 120 or more nsec to service (against
15 for the SRAM case) can be deadly.
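
	Roughing out the budget (my arithmetic): at gig line rate with minimum
frames a packet arrives every 672 nsec, so a single 120 nsec DRAM miss eats 
nearly a fifth of the per-packet budget, where a 15 nsec SRAM miss eats about
2 percent:

/* Per-packet time budget at gig line rate vs. one memory miss. */
#include <stdio.h>

int main(void)
{
    double budget_ns = 672.0;              /* 84-byte slots at 1 Gbit/s */
    double dram_ns = 120.0, sram_ns = 15.0;

    printf("per-packet budget: %.0f ns\n", budget_ns);
    printf("one DRAM miss: %.1f%% of budget\n", dram_ns / budget_ns * 100.0);
    printf("one SRAM miss: %.1f%% of budget\n", sram_ns / budget_ns * 100.0);
    return 0;
}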

> 
>   With regard to operating system technologies, Linux has some new zero-copy 
> networking patches, that tries to avoid moving network traffic stuff around in 
> memory very much, and there may be others.  Question: does this help us, and 
> if so, what technology like this is the most promising for high speed network 
> monitoring?

	As noted above, an embedded OS rather than a general-purpose one is 
probably better.

> 
>   The CPUs: P3, P4, Itanium (its coming)...

	The i960, R5000, and other CPUs used in wire-speed routers (whose 
designers are solving the same problem we are looking at).


> 
>   The PCI bus...  obviously a bottleneck.  PCI-X is coming, Infiband right 
> after that.  32 bit, or 64 bit current PCI?  We need to move data across the 
> bus to memory/cpu... Question:  Is the optimal machine PCI-X, Infinband, a Sun 
> server?
> 
>   The general thing here I'm trying to weed out is this:  Money no object 
> (lets be realistic though)... what is the best hardware/software/network 
> card/bus/memory/CPUs(dual/quad/single)(P3,p4, Itanium), and _configuration_  
> combination to be able to deal with the stormiest network events... ie: as 
> many tiny packets, or other crap, thrown at your network.  Obviously there is 
> a real hard limit out there... but, how do we get as close as possible to it?
> 
> Thanks,
> 
> Chris

	Of course the first thing to do with all of this is build the tools to
test it. Tcpreplay is probably the best answer here: 10 PC-class machines 
with UDMA/66 IDE disks and 100baseT ether cards into a 100 Mbit switch with a
gig uplink should be able to create a saturated gig port out (actually 
probably 20 machines and 2 switches to get a fully loaded full-duplex gig). 
With that controlled traffic environment you can try various combinations of 
things, see how fast they will go, and find out what the minimum we can get 
away with is.
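
	The sizing behind those numbers, spelled out (again just arithmetic): 
10 100-megabit senders fill one direction of a gig, 20 fill both, and at 
minimum frame size each sender has to sustain roughly 149,000 packets a 
second:

/* Test-rig sizing: how many 100 Mbit senders to fill a gig port. */
#include <stdio.h>

int main(void)
{
    double gig = 1e9, fast = 1e8;            /* bits per second */
    double slot_bits = 84.0 * 8.0;           /* 64B frame + preamble + IFG */

    printf("senders for one direction: %.0f\n", gig / fast);
    printf("senders for full duplex:   %.0f\n", 2.0 * gig / fast);
    printf("min-frame pps per sender:  %.0f\n", fast / slot_bits);
    printf("min-frame pps on the gig:  %.0f\n", gig / slot_bits);
    return 0;
}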

Peter Van Epp / Operations and Technical Support 
Simon Fraser University, Burnaby, B.C. Canada


