Argus detecting historical APT1 activity #3 cont

Thu Apr 11 11:23:21 EDT 2013

Hey Dave,
Yes !!!  Now you're thinking about it !!!   Keys to scalability of this strategy
come when you do dynamic aggregation,  preserving the producer /
consumer relationship in the data, and then seeing which populations
change.  So yes, the entire Internet for most sites SHOULD be a producer,
not a consumer !!!!   Very effective !!!

Now, long response coming on the suggested metric.  Sorry.
Yes, ( AppBytes / TotalBytes ) is a great Cyber Security metric, ...........,
but not for historical APT1 infrastructure detection.  Explanation follows.

The ratio of application bytes delivered to total bytes on the wire has been
referred to as "goodput" by network engineers for 30+ years.  You
actually plot Goodput vs Throughput, and it's THE indicator of transport
efficiency.  I use it constantly in the large scale enterprise QoS assessment
work that I do for a living.

Goodput is influenced by transport type and transport protocol mechanics,
which can be complex, but for protocols like TCP, transport efficiency is mostly
influenced by path issues such as MTU, loss ( thus retransmission) rate,
and fragmentation; all of which have significant impacts on end-to-end
performance. 
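
As a rough illustration of how retransmission drags goodput down, here is a minimal
sketch.  The function name and the packet sizes are assumptions picked for the
example, not anything argus reports directly; it only models data frames and ignores
ACK and handshake overhead.

```python
def goodput_ratio(app_bytes: int, total_bytes: int) -> float:
    """Return application bytes delivered divided by total bytes on the wire."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return app_bytes / total_bytes


# Illustrative numbers only: 1000 data packets, each carrying a 1460-byte
# payload inside a 1514-byte Ethernet frame.
PAYLOAD, FRAME, PACKETS = 1460, 1514, 1000

clean = goodput_ratio(PAYLOAD * PACKETS, FRAME * PACKETS)

# Same payload delivered, but 10% of the frames had to be retransmitted,
# so the wire carried 1100 frames for 1000 payloads' worth of data.
lossy = goodput_ratio(PAYLOAD * PACKETS, FRAME * 1100)

print(f"clean path:         {clean:.3f}")
print(f"10% retransmission: {lossy:.3f}")
```

With these made-up numbers the ratio drops from roughly 0.96 to roughly 0.88, which
is the kind of shift a QoS tracking system would want to notice.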

The metric you are interested in is a QoS metric. Now, from a Cyber Security
perspective, there are classes of attacks that are designed to impact QoS,
through direct and / or indirect intervention.  Denial of Service, is obvious.
Degradation of Service is the real threat, in something like nation state Cyber
scenarios.  A subtle, induced 5-15% packet loss can have devastating
impact on end-to-end applications running over poorly designed architectures,
and can be very difficult to identify, attribute and mitigate.  DISA, the US
Defense Information Systems Agency, identified performance as an information
system asset that needed protection, about 5 years ago, and measuring
QoS things and tracking them, is really important to Assuring QoS.

So, YES, tracking Goodput ( plus a bunch more ) is important for serious
Cyber Security.  Monitoring and tracking path performance to the point
where you can detect ( Identification ) QoS attacks vs. just normal impacts
to QoS, and determine their cause ( Attribution ), to better recover
( Mitigation ), should be a given.

But, QoS oriented metrics are not my first go-to metrics for finding
Stepping Stones or large scale Exfiltration.  Nice thing is when you can
use the same sensor and data to do it all ;O)

Carter

On Apr 9, 2013, at 11:15 PM, "Dave Edelman" <dedelman at iname.com> wrote:

> Carter,
>  
> I’ve had Argus set to collect the application byte count metrics so I dug through the data and compared the source/destination ratios using the application byte counts and using the byte counts with the associated overhead. I understand your point about a malicious actor gaming the system but would it make sense to actually calculate the ratio of the ratios (sappbytes/dappbytes)/(sbytes/dbytes)  as an indicator of malicious obfuscation? I haven’t tried it yet but if workloads for a system didn’t change significantly then I would expect that over some window of time, changes in this ratio would be worth investigation.
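
A minimal sketch of the ratio-of-ratios calculation described above.  The four
field names follow the formula in the message; the function itself and the sample
byte counts are hypothetical.

```python
from typing import Optional


def ratio_of_ratios(sappbytes: int, dappbytes: int,
                    sbytes: int, dbytes: int) -> Optional[float]:
    """Compute (sappbytes/dappbytes) / (sbytes/dbytes).

    Returns None when any count is zero, since the ratio is then undefined
    or infinite and needs separate handling.
    """
    if min(sappbytes, dappbytes, sbytes, dbytes) <= 0:
        return None
    return (sappbytes / dappbytes) / (sbytes / dbytes)


# Hypothetical flow: the application-byte ratio is 10.0 but the wire-byte
# ratio is only 5.0, so protocol overhead is masking half of the asymmetry.
example = ratio_of_ratios(sappbytes=1000, dappbytes=100, sbytes=2000, dbytes=400)
print(example)
```

A value that stays near its historical level says overhead is behaving as usual; a
drift over some window is what would be worth a look.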
>  
> The other idea that came to mind was determining the producer / consumer characteristics of an entire subnet. At least in our environment subnets rarely contain mixed populations and a shift in roles would be very unusual.
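
The subnet idea can be sketched like this, assuming per-address totals have already
been pulled out of the argus data as (address, sappbytes, dappbytes) tuples.  The
tuple layout, the function name, and the sample addresses are all assumptions for
illustration.

```python
import ipaddress
from collections import defaultdict


def subnet_ratios(records, prefix_len=24):
    """Aggregate per-address application byte counts into per-subnet ratios.

    records: iterable of (address, sappbytes, dappbytes) tuples.
    Returns {subnet: sent/received ratio}; inf when nothing was received.
    """
    totals = defaultdict(lambda: [0, 0])
    for addr, sapp, dapp in records:
        # strict=False masks off the host bits to get the enclosing subnet.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        totals[net][0] += sapp
        totals[net][1] += dapp
    return {
        str(net): (sent / recv if recv else float("inf"))
        for net, (sent, recv) in totals.items()
    }


# Hypothetical populations: a user subnet (consumers) and a server subnet.
sample = [
    ("10.1.1.5", 100, 1000),
    ("10.1.1.9", 50, 2000),
    ("10.1.2.7", 9000, 300),
]
ratios = subnet_ratios(sample)
print(ratios)
```

Tracking each subnet's ratio over time then works the same way as tracking a single
node.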
>  
> --Dave
>  
> From: argus-info-bounces+dedelman=iname.com at lists.andrew.cmu.edu [mailto:argus-info-bounces+dedelman=iname.com at lists.andrew.cmu.edu] On Behalf Of Carter Bullard
> Sent: Tuesday, April 02, 2013 5:09 PM
> To: Argus
> Subject: Re: [ARGUS] Argus detecting historical APT1 activity #3 cont
>  
> Gentle people,
> To continue on the Argus and APT1 discussion, I had written that the Mandiant
> APT1 document described two basic nodes in the APT1 attack infrastructure,
> the Hop Point and the Target Attack Nodes.  I'm going to continue to write about
> Hop Points for a few more emails, because, having one of your machines acting
> as an APT1 Hop Point, is possibly the worst thing that can happen to you in the
> APT1 attack life cycle.
>  
> I suggested that the best strategy for identifying APT1 Hop Points is to use Time
> Series Analysis methods, specifically Transfer Function Models, and Intervention
> Analysis, to realize that a node has been transformed, (Identification) and to
> realize who, what, when, how it was transformed (Attribution).  Now, I'm pretty
> sure that most people are not interested in a long discourse on how to use
> 2nd and 3rd order differentials over different time periods, to recognize
> trending discontinuities.  This stuff is pretty complicated, and advanced even
> for complex Time Series forecasting and control methods.  But that is the
> kind of direction you want to go in if you want to do Machine Learning methods,
> or if you want to do unsupervised systems for network behavioral anomaly
> detection, which would be a really cool thing to have.
>  
> In support of this APT1 Hop Point identification process, however, there are more
> direct things you can look at, that don't take a lot of math, and can be done with
> simple, effective, reliable strategies that are easily explained and understood.  
> Let's look briefly at one that should be useful.
>  
> Most nodes that can be transformed to an APT1 Hop Point, are either predominately
> consumers or producers of transport network data.  User driven machines are
> generally transport service consumers, little requests sent, big responses received,
> such as those seen in web browsing and streaming video services.  Machine driven
> machines, such as DNS, Web and Database servers, are generally network
> transport data producers, they receive little requests and send bigger responses.
> Even in Peer-to-Peer networks, you see stable consumer / producer roles emerge,
> where your node, the one you are paying for, generally becomes a network data
> producer, providing services to a lot of machines you don't know.   Peer-to-peer
> networks do present a challenge to this kind of strategy, but not always.
>  
> When a node is transformed to an APT1 Hop Point service provider, and the
> stepping stone function is active, a node will become both a consumer and
> a producer of the data it is transporting.  If it moves a large amount of data,
> as indicated in the Mandiant report, the overall producer / consumer properties
> of the affected node will move from wherever they are, toward 1.0, ...
> a balanced transport node.
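
The arithmetic behind the move toward 1.0 can be worked through directly: relayed
data is counted once as received and once as sent, so it adds the same amount to
both sides of the ratio.  This is a minimal sketch with made-up byte counts; the
function name is hypothetical.

```python
def ratio_after_relay(sent: float, received: float, relayed: float) -> float:
    """Sent/received ratio once `relayed` bytes pass through the node.

    Relayed bytes are both received (from the target) and sent (to the
    attacker), so they are added to numerator and denominator alike.
    """
    return (sent + relayed) / (received + relayed)


MB = 1_000_000

# Hypothetical consumer: normally sends 10 MB and receives 100 MB (ratio 0.1).
for relayed_mb in (0, 100, 1000, 10000):
    ratio = ratio_after_relay(10 * MB, 100 * MB, relayed_mb * MB)
    print(f"relayed {relayed_mb:>6} MB -> ratio {ratio:.3f}")
```

The same thing happens from the other side: a producer with a ratio of 10 that
relays ten times its normal volume ends up just above 1.0.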
>  
> Our job, is to try to identify when a node is transformed to an APT1 Hop Point,
> which means that it will go from whatever it was doing, to being a balanced
> producer / consumer, accepting data from an attack target, and relaying that
> data to the attacker.  If, historically, a node can be determined to be predominately
> a producer or consumer, then detecting when it becomes a large scale
> balanced producer / consumer, will be pretty easy, as the deviation from its
> normal behavior will be pretty dramatic.
>  
> Now, producer / consumer metrics are not a measure of the packets (rate)
> or total bytes (load) sent or received on the wire by a node.  Instead, it's a
> measure of the transport bytes successfully received and sent.
>  
> Protocols like TCP generally present a balance of packets sent and received
> on the wire.  After the 3-packet TCP setup handshake, one side sends data,
> and the other sends TCP overhead ACKs, almost 1-for-1.  So we can't use
> packet counts to indicate producer / consumer roles on the network.
>  
> The total bytes on the wire has a bit more asymmetry that can reveal
> consumer and producer relationships,  but the noise generated by the TCP
> protocol overhead bytes can make the distinction a bit more difficult.
> There are tricks, such as PUSHing one byte at a time through a TCP
> connection, or reducing the allowable window size on a connection, that
> can make the number of transported bytes per packet very small.
> Reporting the actual ACK'd data, rather than the total data on the wire,
> makes this type of analysis possible.
>  
> To measure the application bytes received or sent, argus needs to be configured
> to generate the metric. Set ARGUS_GENERATE_APPBYTE_METRIC=yes in your
> /etc/argus.conf file.  Let's assume that your argus is monitoring your enterprise
> border interface, so that you monitor all the traffic going in and out of your site.
> The resulting argus data will have the information needed to determine all the
> producers and consumers of your enterprise, i.e. those that are bringing data in
> and those that are transporting data out (this is a starting point for developing
> formal Transfer Function Models, by the way, when you get there).
>  
> A simple measure of the producer / consumer role is the ratio of application
> data sent (produced) vs the application data received (consumed).  Using argus
> data, you can calculate the metric on each status record, on each aggregated
> flow, or on any of the various aggregations that you can perform.  So it's trivial to
> calculate the  "sappbytes / dappbytes" whether it's an instantaneous microflow,
> or an entire subnet's traffic aggregated into a single argus flow record.
>  
> To start a simple analysis, let's process a day's worth of data from a single QoSient
> workstation and see what's up.  Let's measure the sent and received application
> bytes of the top IP addresses seen, to assign simple producer and consumer roles,
> and try to use those labels as guides, to see what the trends are, and how
> to interpret the data. SPOILER: In this set of data, there are no APT1
> Hop Points, but....
>  
> Let's look at the IP addresses that the QoSient node 192.168.0.68 talked to, on
> April Fool's Day, 2013. This node resides in the 192.168.0.0/24 network, and is
> a basic workstation, using shared file systems, with email, web browsing, automated
> software updates and cloud services.  What nodes does this node talk to, outside
> or inside our own network, and are they producers or consumers ?  
>  
> This node doesn't provide any services, so we expect all other nodes to be producers,
> not consumers, and we expect the node to be a consumer of network services.  Let's
> see, but to keep the email short, let's just look at the top 10 nodes.
>  
> Grabbing an entire day's worth of data from the collection archive, let's track
> individual IP addresses (so we'll use the " -M rmon " option), preserving the protocol
> and ports used by each address.  We'll use this first pass derived data as the starting data
> for the actual analysis, which we'll generate to report each individual IP address's total
> src application bytes and dst application bytes sent.  We'll take that data, and formulate
> the sappbytes/dappbytes ratio, by hand for this exercise, and if the ratio is > 1.5 then
> we'll label it as a Producer, if the ratio is < 0.95, we'll label it a Consumer, and between
> these numbers, we'll call the transport Balanced.  We'll color the output, so Consumers
> are in red, and Producers whose ratios are HUGE, we'll color blue.
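
The hand labeling described above could be sketched as follows.  The 1.5 and 0.95
thresholds come from the text; the function name, the handling of a zero
denominator, and the choice to reject an all-zero record are assumptions.  The
sample values are rows from the racluster listing in this message.

```python
def classify(sappbytes: int, dappbytes: int,
             producer_min: float = 1.5, consumer_max: float = 0.95):
    """Return (ratio, label) for one aggregated argus record.

    ratio > producer_min  -> "Producer"
    ratio < consumer_max  -> "Consumer"
    otherwise             -> "Balanced"
    """
    if sappbytes == 0 and dappbytes == 0:
        raise ValueError("no application bytes in either direction")
    # Nothing received at all is treated as an infinitely large ratio.
    ratio = float("inf") if dappbytes == 0 else sappbytes / dappbytes
    if ratio > producer_min:
        return ratio, "Producer"
    if ratio < consumer_max:
        return ratio, "Consumer"
    return ratio, "Balanced"


rows = {
    "192.168.0.66": (69805178, 1339356),
    "192.168.0.68": (11872196, 120391392),
    "192.168.0.127": (461937, 0),
}
for addr, (sapp, dapp) in rows.items():
    ratio, label = classify(sapp, dapp)
    print(f"{addr:>15} {ratio:12.4f} {label}")
```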
>  
> Let's look at the top 10 SrcAppByte generators, to see how this might work.
> Here we go....
>  
>  
> thoth:01 carter$  racluster -M rmon -R /archive/192.168.0.68/2013/04/01 -m saddr proto sport -w /tmp/argus.out - ipv4
> thoth:01 carter$ racluster -r /tmp/argus.out -m saddr -w - |  rasort -m sappbytes \
>                        -s stime dur saddr proto sport sappbytes dappbytes -No10
>                  StartTime        Dur            SrcAddr  Proto  Sport    SAppBytes    DAppBytes         Ratio
> 2013/04/01.00:00:00.847207 86399.101*       192.168.0.66     ip            69805178      1339356       52.1185  Producer
> 2013/04/01.15:54:08.964340  25.124109      208.59.201.94    tcp http       27104415          120   225870.1250  Producer
> 2013/04/01.00:01:16.133367 86285.734*        66.39.3.162    tcp imaps      12816471      1012491       12.6584  Producer
> 2013/04/01.00:00:00.847207 86399.101*       192.168.0.68     ip            11872196    120391392        0.0986  Consumer
> 2013/04/01.17:17:37.184721 528.364441       171.67.72.17    tcp ssh         4347072        50746       85.6633  Producer
> 2013/04/01.00:02:51.660475 85447.757*      17.172.208.43    tcp https       2103142       430417        4.8863  Producer
> 2013/04/01.15:55:58.919139 28921.785*       192.168.0.78     ip             1399179      7702570        0.1817  Consumer
> 2013/04/01.09:47:16.282091 43205.253*        17.154.65.1    tcp https        472376        20531       23.0079  Producer
> 2013/04/01.00:05:42.767984 85586.210*      192.168.0.127     ip              461937            0           Inf  Producer
> 2013/04/01.00:29:54.738518 81412.937*      173.194.43.33    tcp *            413487        18616       22.2114  Producer
>  
>  
> Basically, what this data is saying is that, of the top 10 addresses sending data on April Fool's Day,
> most are producers, just as we expected.  And the workstation itself, 192.168.0.68, is 
> a consumer (first line in red), with a sent/recv'd ratio of 0.0986.  We've got some really
> HUGE producers, which indicates purely one-way transfers, the kind we're looking for.
>  
> In this data we're looking for APT1 relay data candidates: large data transfers from a
> remote site to an internal node, that are then relayed to another external node, possibly
> Chinese, possibly not, in real time.
>  
> None of the producers are sending enough data to represent a LARGE exfiltration of data,
> one of the definitions of being an APT1 Hop Point.  But LARGE is a relative term, so we
> need to analyze any potential APT1 traffic candidate.
>  
> From the first remote address in the list, 208.59.201.94, our largest remote producer, we
> received 27MB of data.  The sent / recv'd ratio of 225,870 is just what you would expect
> from a large transfer of data into your infrastructure, and is a good candidate for APT1
> style stepping stone data influx.  Even though it's using HTTP as the protocol, we should
> assume the transport technique to be somewhat clever, so whether it's HTTP, SSH, or
> a mix of protocols, is notable, but potentially insignificant.
>  
> For the purposes of this dialog, to identify this flow as a part of an APT1 Hop Point action,
> we need to find an outflow from our workstation that would transfer the data received from this
> remote node to another node.  In a simple APT1 Hop Point, our workstation would want to
> transfer the 27MB to a remote address, which we don't see in the top 10.  Whew !!!
> If it's a simple relay, you would expect the outgoing flow to sort closely to the flow of
> interest, as they would both be transporting the same amount of application data.
>  
> In a slight variation of the basic APT1 Infrastructure, the Hop Point may relay the exfiltrated
> data to another internal node.  Our simple report indicates that the workstation doesn't
> transmit 27MB to any single node, either external or internal.  
>  
> And in the most complex relay models that could be implemented, where multiple
> endpoints receive portions of the exfiltrated data, our node still does not look to be
> an APT1 Hop Point.  By looking at the entry for 192.168.0.68, our workstation, we see
> that we don't actually send 27MB of total data out of the node for the whole day !!!
>  
>                  StartTime        Dur            SrcAddr  Proto  Sport    SAppBytes    DAppBytes         Ratio
> 2013/04/01.00:00:00.847207 86399.101*       192.168.0.68     ip            11872196    120391392        0.0986  Consumer
>  
> As you can see, we only sent (SAppBytes) 11.8MB total, to all our transport endpoints
> combined, for all of April 1, 2013.  So our candidate 27MB flow, does not look to be relayed.
> Now in the original data, there are about 10K individual flows that could be candidates,
> but the aggregate analysis generated only a few hundred candidate IP addresses.
>  
> An automated system would iterate through all potential candidate transfers and attempt to
> find candidate outflows that could support the relay concept.  That would be the most elegant
> of analytics, and not that expensive, if you have a good aggregation model and
> analytic framework.
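
One way such an automated pass might look, as a sketch only.  It assumes inflows and
outflows have already been reduced to (address, application bytes) pairs, and it
covers just the simple one-to-one relay case.  The function names, the 10% tolerance
and the 1 MB floor are assumptions, not anything the argus tools provide.

```python
def could_have_relayed(inflow_bytes: int, total_sent_bytes: int) -> bool:
    """A node cannot relay more application data than it sent in total."""
    return total_sent_bytes >= inflow_bytes


def find_relay_candidates(inflows, outflows, tolerance=0.10, min_bytes=1_000_000):
    """Pair each large inflow with outflows of similar size to another address.

    inflows:  list of (remote_addr, bytes_received_from_remote)
    outflows: list of (dest_addr, bytes_sent_to_dest)
    Returns a list of (src, dst, in_bytes, out_bytes) tuples.
    """
    matches = []
    for src, in_bytes in inflows:
        if in_bytes < min_bytes:
            continue  # too small to count as a LARGE transfer
        for dst, out_bytes in outflows:
            if dst == src:
                continue  # a relay forwards to a different endpoint
            if abs(out_bytes - in_bytes) <= tolerance * in_bytes:
                matches.append((src, dst, in_bytes, out_bytes))
    return matches


# The 27 MB inflow versus everything the workstation sent that day.
print(could_have_relayed(27104415, 11872196))

# Hypothetical hit, using documentation addresses.
hits = find_relay_candidates(
    [("198.51.100.7", 27_000_000)],
    [("203.0.113.9", 26_500_000), ("203.0.113.10", 40_000)],
)
print(hits)
```

The total-sent check is the cheap test used in this email; the pairwise match is the
more expensive step that only needs to run on nodes that pass it.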
>  
> Now, just looking at an arbitrary day, by itself, you can get some assurance that you aren't
> supporting an APT1 Hop Point type of relay service.  But the strength of argus based network
> activity auditing, is that you have historical data that can support the development of 
> hourly producer / consumer metrics for every IP address in your archive, which could
> be abstractly called a Transport Function Model for all the assets in your observable
> domain.
>  
> I have done this type of analytic for our workstation over the last 2 years, and this
> workstation has maintained the 0.09 sent vs received application byte ratio for
> almost every day.  It has never gone over 0.17.  So this would be a great candidate
> machine for this type of analysis.  While it may receive a lot of data from the outside,
> it doesn't transfer a lot of data.  And if it did, it would be very easy to know it.
>  
> So, all I need is the sent / recv'd ratio for all the end points in my enterprise, and
> if they have had stable ratios that are >> 10 or << 0.1, indicating that they are stable
> producers and/or consumers, then detecting a significant shift toward 1.0, a balanced
> consumer / producer role, is pretty easy.  If you think that the change is significant,
> then you can go through the original flow data, looking at sappbytes and dappbytes
> metrics to figure out what happened.  You're looking for new producer roles for your
> consumers and new consumer roles for your producers, that are contributing 
> to the ratio moving toward 1.0.
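
A minimal sketch of that check, assuming you already have a historical baseline
ratio and a current ratio for a node.  Distance from balance is measured as
|log10(ratio)|, so 0.1 and 10 are equally far from 1.0.  The function name and the
0.5 "significant shift" fraction are assumptions; the factor of 10 mirrors the
>> 10 / << 0.1 rule of thumb in the text.

```python
import math


def shift_toward_balance(baseline_ratio: float, current_ratio: float,
                         stable_factor: float = 10.0,
                         min_shift: float = 0.5) -> bool:
    """Flag a stable producer/consumer whose ratio has moved toward 1.0.

    Only moves toward balance are flagged; a consumer that flips into a
    heavy producer (or the reverse) is a different event and returns False.
    """
    for r in (baseline_ratio, current_ratio):
        if not (r > 0 and math.isfinite(r)):
            raise ValueError("ratios must be positive and finite")
    base = abs(math.log10(baseline_ratio))
    cur = abs(math.log10(current_ratio))
    if base < math.log10(stable_factor):
        return False  # baseline is not a strong producer or consumer
    # Fraction of the baseline's distance from balance that has been lost.
    return (base - cur) / base >= min_shift


# Workstation from this email: baseline 0.09, worst day on record 0.17.
print(shift_toward_balance(0.09, 0.17))
# Hypothetical day where the same node is nearly balanced.
print(shift_toward_balance(0.09, 0.8))
```

With these assumed thresholds the 0.17 day does not trip the check, while a day at
0.8 does.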
>  
> It's a system that can work for the majority of the nodes in your enterprise.  For
> the ones that it doesn't, there are more complex analytics that can be used, but
> that's enough for a single piece of email.
>  
> Reactions, opinions, attitude and flames welcome, 
>  
> Hope all is most excellent,
>  
> Carter
>  
>  
>  
> On Mar 27, 2013, at 12:09 PM, Carter Bullard <carter at qosient.com> wrote:
> 
> 
> Gentle people,
> To continue on the Argus and APT1 discussion, I had written that the Mandiant
> APT1 document described two basic nodes in the APT1 attack infrastructure,
> the Hop Point and the Target Attack Nodes.  I'm going to continue to write about
> Hop Points for a few more emails, because, having one of your machines acting
> as an APT1 Hop Point, is possibly the worst thing that can happen in the APT1
> attack life cycle.
>  
> So far, I've presented that Mandiant's report gives us a lot of detail, trends and
> methods, that allow us to detect overt APT1 behavior using the argus data.  Trends
> such as APT1's establishment and use of well defined attack infrastructure and
> the tendency to access that infrastructure directly, from well defined IP address
> blocks, using specific access methods, and a good description of the attackers'
> intent, exfiltration of large amounts of data.  These trends lead to a set of very
> simple tests for APT1 activity, that can be tested against argus data archives
> to help you realize if you've been had, or not. 
>  
> The APT1 strategies that Mandiant describes are conventional, and the attack
> infrastructure itself is simple, direct, almost optimal (minimal reliable methods,
> 2-3 hops from attacker to target), suggesting that the infrastructure has
> predictable utility, i.e. it may actually work to scale, and work well enough to
> be worth the effort.  The ultimate simplicity of the realized APT1 infrastructure,
> may be the result of a limit in Mandiant's detection capability ( you can only
> see what you are looking for ), but there is no question that what they describe
> is real.
>  
> While Mandiant is very detailed in what it does talk about, there are huge
> gaps in what it doesn't talk about.  I'd like to dive deeper into APT1 Hop Point
> identification, but we're lacking key information.  What kind of systems does
> APT1 use for Hop Points? Linux workstations ? Windows XP machines ? 
> Web Servers ?  Android devices ?  Routers ?  While we have some really
> great patterns to look for, like specific SSH certificates, there are so many
> things we don't have; initial penetration techniques, command and control
> methods, beaconing patterns, persistent vs dynamic access.
>  
> In the absence of real detail, we'll have to develop general strategies for
> detection, and if we want to have any success, we'll need to avoid 
> awareness / detection system pitfalls, such as sampling, and sampling bias
> (looking only at one protocol or one type of OS), and matching complexity.
>  
> One of the simple characteristics that I will try to leverage in my discussions, is
> the intent of the APT1 attack, and the goal of the APT1 Hop Point; to move
> a lot of data, from a remote site to another remote site.  If that really is the
> singular attack goal for APT1, then with good argus data generation and
> analytics, we should be able to find any node that is acting as an APT1
> Hop Point, as well as any of the other APTx Hop Points that may exist.
>  
> The approach that I will try to describe in the next set of emails, is one based
> on a Bell-LaPadula style of analysis, to find nodes that have been transformed
> from being one type of network based node, to another type of network node, 
> in the case of APT1, one that is supporting a demanding network based transport
> service.
>  
> I'm going to use Time Series Analysis methods, specifically Transfer Function
> Models, and Intervention Analysis to realize that a node is doing something
> different.  The Transfer Function Models, are perfect for this, as they are
> generally used to describe input / output dynamic system response,
> and Intervention Analysis is all about the notion that there is an event that
> motivates a dynamic change in system input / output.  So I'm going to try to use
> this strategy to identify a change in input / output, and then to try to find the
> event that correlates with the change.
>  
> If you can imagine that there is an argus running on every node in an
> infrastructure, establishing a generalized network activity audit, that goes
> back quite a ways, then we should have a very rich set of data to perform
> this type of analysis, either automated, or by hand.   The goal will be to
> realize that a node went from being a specific type of producer / consumer,
> to a different kind of producer / consumer, over some period of time.
>  
> OK, that is going to be my strategy, any other approaches that seem to be
> appropriate?    More to come.
>  
> Hope all is most excellent,
>  
> Carter
>  
