Internet traffic analysis

2019/11/05

The plots discussed below can be found in the “internet_plots.pdf” file on the course GitHub site.

Computer networks generate massive amounts of operating data (distinct from the actual content being communicated). Advanced statistical methods are commonly used to make sense of this data. This type of analysis can be used to improve network security and optimize network performance. Network traffic analysis is an active and growing area, and represents an underappreciated opportunity for statisticians.

Background on the internet

We first summarize some of the main concepts and definitions related to how the internet works, and provide links to more detailed discussions. Note that you will only need to understand these topics at a very basic level for this course.

The internet is a network of interconnected “hosts” (devices) that communicate via a communications protocol. For the purposes of this discussion, we are speaking specifically of the internet as it currently exists, built around the internet protocol suite. The key elements of this suite are the TCP, UDP, and IP protocols, which define communication at a level that is abstracted from the underlying hardware.

Hosts on the internet are uniquely identified by an IP address. The basic unit of communication between two hosts on the internet is a packet. A packet is sent from one host to another, usually passing through a sequence of intermediate network nodes (e.g. routers). The packet contains a “header” that specifies the origin and destination host IP addresses, ports (discussed further below), and other information. The packet also contains a “payload” holding the actual information to be transmitted. Packets are “routed” from the origin to the destination through a sequence of intermediate nodes. Routing is a complex topic and we will not need to discuss it further here.

Since the packet payload size is limited (usually to 64KB), a single unit of data (e.g. a file) that is to be transmitted over the internet is usually split into pieces, with each piece placed into a separate packet. Thus, a single transaction involves the exchange of many packets. The first and last packets to be sent establish and close a connection, and may not transmit any of the actual content of the transaction.

The internet protocol (IP) is a low-level protocol that is responsible for the details of sending and receiving packets. TCP and UDP sit on top of IP. Most logical transactions use either TCP or UDP, but not both. UDP is the more basic of the two: it allows data to be sent over a network using IP without requiring the recipient to acknowledge that each packet was received, and without checking for errors in the transmitted data. It is used for applications like video conferencing and voice-over-IP (e.g. Skype), where packet loss and errors can be tolerated. Due to the simplicity of UDP, it is generally possible to achieve a higher transmission rate with lower latency. TCP is used where the sender needs to be able to verify that the data were received and are error free. Most email and HTTP (web page) traffic is carried over TCP.

Communications over the internet are directed to and from ports. Each port is dedicated to a particular type of traffic; for example, port 80 typically handles HTTP traffic, which includes most browser-based web traffic. A packet has both a source port and a destination port. The destination port is the port on the destination machine that will receive the packet. The source port is often a temporary port number that serves as a return address for the transaction. For example, if a client requests a web page from a web server (e.g. as initiated by someone clicking on a hyperlink in a web browser), the packet communicating this request will have destination port 80, and the source port of the packet will be a generated number such as 2435. The packet will arrive at port 80 of the web server, which will then generate a response (containing the web page’s content) sent as packets with destination port 2435 and source port 80 (i.e. the two port numbers are reversed).

Network traffic data

Organizations that operate or conduct research on computer networks generate large amounts of network traffic data by placing sniffers on routers or major network links. Due to privacy considerations, raw network traffic data are very sensitive and cannot be widely shared. For research into network security and performance, it is almost never necessary to observe the payloads, and in any case it would be impractical to store them. Therefore, most network traffic datasets report the size of the payload, not the payload itself. Even the header information is highly sensitive, as people expect privacy in terms of who they are communicating with, not just in the contents of their communications.

There are a number of commonly-used file formats for network traffic data. A very common format for packet-level data is the pcap format. In a pcap file, each row corresponds to one packet and contains all the relevant fields from the packet header. A very popular tool for working with pcap files is tcpdump.
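As a rough illustration, packet headers can also be read programmatically. The sketch below uses the Python dpkt library (one option among several; the course mentions tcpdump as the standard command-line tool), with a hypothetical file name, and assumes Ethernet framing.

```python
import socket
import dpkt

# Minimal sketch: read header fields from a pcap file (hypothetical file name).
with open("sample.pcap", "rb") as f:
    for ts, buf in dpkt.pcap.Reader(f):
        eth = dpkt.ethernet.Ethernet(buf)
        if not isinstance(eth.data, dpkt.ip.IP):
            continue  # skip non-IP frames
        ip = eth.data
        src, dst = socket.inet_ntoa(ip.src), socket.inet_ntoa(ip.dst)
        ports = (None, None)
        if isinstance(ip.data, (dpkt.tcp.TCP, dpkt.udp.UDP)):
            ports = (ip.data.sport, ip.data.dport)
        print(ts, src, dst, ports, len(buf))
```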

A flowtuple file is a summary file that can be derived from a pcap file. We will use flowtuple files that summarize the traffic between each pair of distinct hosts within one hour. Within the hour, the data are aggregated by minute – each row of data in the flowtuple file contains the number of packets sent from one source address to one destination address within one minute. Since they are summaries, flowtuple files are much smaller than the pcap files they are derived from.

Darkspace dataset

The Darkspace dataset is an internet traffic dataset consisting of a sample of all internet traffic in April 2012 that was addressed to destinations in an unassigned region of the IP address space. Traffic to these addresses can arise for at least two reasons: (i) misconfigured application software or servers, and (ii) scanners and other automated tools that search the internet in a way that can generate traffic to nonexistent hosts. Since no legitimate traffic uses these addresses, the data are relatively nonsensitive in terms of privacy.

These data are used in a tutorial on internet traffic analysis, but we will largely conduct an independent analysis here.

Characterizing traffic

One of the main goals of internet traffic analysis is to use statistical methods to characterize the behavior of normal traffic. This characterization can be used as a reference point against which to check potential anomalies. Since nearly all internet traffic takes the form of time series, this means that we will need to focus on the temporal structure of the data.

Below we work through a series of basic analyses using two datasets constructed from the flowtuple files discussed above. These datasets take one day of contiguous data from the larger Darkspace dataset (which covers an entire month). Within this day, we first calculate four values at the 1-minute time scale: total packets, number of unique sources, number of UDP packets, and number of TCP packets. There are 1440 minutes in a day, so each of these variables comprises a time series with 1440 values. Separately, we calculate the number of packets sent to each possible destination port number, again on a 1-minute time scale.
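As a sketch of how such per-minute series could be built, suppose the flowtuple records have been loaded into a pandas data frame; the file name and column names ('time', 'src_ip', 'protocol', 'packets') are hypothetical.

```python
import pandas as pd

# Hypothetical input: one flowtuple record per row, with a timestamp ('time'),
# source address ('src_ip'), IP protocol number ('protocol', 6 = TCP, 17 = UDP),
# and packet count ('packets').
df = pd.read_csv("flowtuples.csv", parse_dates=["time"])
df["minute"] = df["time"].dt.floor("min")

g = df.groupby("minute")
traffic = g["packets"].sum()          # total packets per minute
sources = g["src_ip"].nunique()       # unique source addresses per minute
tcp = df[df["protocol"] == 6].groupby("minute")["packets"].sum()   # TCP packets
udp = df[df["protocol"] == 17].groupby("minute")["packets"].sum()  # UDP packets
```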

Marginal distributions

Plots 1-4 display the four data series as simple time series plots. Although the data are obviously dependent, it is nevertheless informative to explore the marginal distributions of the values in each series. Quantile plots (pages 5-8) are one way to display these distributions.

Port usage

The destination port number determines the port at which a packet is received on the destination host. Most vulnerabilities on a host can only be targeted by sending traffic to a particular port. Many servers refuse traffic directed to certain ports, and since TCP traffic that is received is acknowledged, it is possible to probe a server to see which ports are open. A lot of traffic on the internet results from port scanners searching for computers with open ports that have known vulnerabilities.

We could aim to characterize patterns of variation at each port number, but there are 65536 such numbers to track, so the analysis would become unwieldy. As an alternative, we can aim to summarize the dispersion of traffic over the ports. Not only is this an interesting descriptive statistic, but we would also anticipate that in the event of a major attack, the global distribution of traffic may become noticeably more concentrated, since the malicious traffic would be aimed at only one port.

Entropy is a common measure for dispersion or concentration in discrete distributions. If we have a sample space containing $n$ points, with corresponding probabilities $p_1, p_2, \ldots, p_n$, then the entropy is

$-(p_1\log(p_1) + \ldots + p_n\log(p_n))$.

The entropy is non-negative, and it is equal to zero if and only if one of the $p_i$ is equal to 1 (in which case all the other $p_i$ must be equal to 0). It attains its maximum value, $\log(n)$, when the distribution is uniform ($p_i = 1/n$ for all $i$), so lower entropy indicates a more concentrated distribution and higher entropy indicates a more dispersed one.

For each minute in our dataset, we can calculate the entropy across all ports, by normalizing the packet counts to proportions (so the normalized counts sum to one within each minute, across the 65536 ports). Doing this produces a series of 1440 entropy values, shown in plot 9. The 24 hour data period we are considering here does not contain any known major anomalies, so the distribution of entropy values, roughly from 2 to 9, spans the range that might be seen on a normal day.
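A minimal sketch of this calculation, assuming the per-minute, per-port packet counts are held in an array `port_counts` with one row per minute and one column per destination port (the array below is a random placeholder):

```python
import numpy as np

def entropy(counts):
    """Entropy of a vector of nonnegative counts, ignoring empty cells."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Placeholder data for illustration: 1440 minutes by 65536 destination ports.
rng = np.random.default_rng(0)
port_counts = rng.poisson(0.01, size=(1440, 65536))

port_entropy = np.array([entropy(row) for row in port_counts])  # one value per minute
```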

Memory and time scales of variation

Many time series have the property that values occurring close together in time are dependent. An important property of a time series is the distance in time at which values become approximately independent. Here we will consider a few ways to assess this property.

Variance decomposition

One simple way to get a sense of the time scales of variation in a time series is to break the data into contiguous blocks, and calculate the intraclass correlation with respect to these blocks. This means that we are using the ANOVA identity (law of total variance), which states that $t = w + b$, where $t$ is the total variance, $b$ is the between-block variance, and $w$ is the within-block variance. For balanced data (equal block sizes), the within-block variance is obtained by taking the variance of the data within each block, then averaging these variances over the blocks. The between-block variance is obtained by taking the mean of the data within each block, then calculating the variance of these means over the blocks. Since the total variance is easy to calculate, it is only necessary to calculate $b$ or $w$; the other term can then be obtained by subtraction (i.e. $w = t - b$ or $b = t - w$).

The intraclass correlation coefficient (ICC) is defined to be the ratio $b/t$, which must fall between 0 and 1 (and is equal to $1 - w/t$). If $b/t$ is large, the blocks have very different means and there is relatively little variation within blocks; this means that the blocks explain a lot of the variation in the data. If $b/t$ is small, the variance within blocks is large, which implies that the blocks explain very little of the variation in the data.
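A sketch of this calculation (the helper function name is ours):

```python
import numpy as np

def icc(x, block_size):
    """Intraclass correlation: between-block variance over total variance."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // block_size) * block_size   # drop any incomplete final block
    blocks = x[:n].reshape(-1, block_size)
    t = x[:n].var()                 # total variance
    b = blocks.mean(axis=1).var()   # variance of the block means (between-block)
    return b / t

# Example: 1-hour and 4-hour blocks for a 1440-minute series.
# icc(traffic, 60), icc(traffic, 240)
```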

To illustrate this technique, we calculated the ICC for blocks of duration 1 hour and 4 hours within the 24 hour day:

            1 hour     4 hours
Traffic     0.208      0.103
Sources     0.987      0.899
TCP         0.195      0.088
UDP         0.951      0.755

These results reveal that the total traffic and TCP traffic have relatively little variation over long time scales, and hence most of their variation occurs over shorter time scales; that is, they have shorter-range dependence, as we will explore further below. The number of unique sources and the UDP traffic have much higher ICC values, and hence longer-range dependence.

Autocorrelation

The most basic measure of serial dependence in a time series is the autocorrelation. This is usually defined as the Pearson correlation coefficient between the series and a lagged version of itself; the autocorrelation is a function of this lag. Due to the shift, it is necessary to trim values from the ends of the two series: to calculate the autocorrelation at lag $d$, we compare the values at times $d+1, \ldots, n$ to those at times $1, \ldots, n-d$.
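A sketch of this calculation for a single lag (the helper function name is ours):

```python
import numpy as np

def autocorr(x, d):
    """Pearson autocorrelation of x at lag d >= 1, trimming the overhang."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-d], x[d:])[0, 1]
```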

The autocorrelation calculated from the Pearson correlation coefficient is sensitive to outliers because it involves taking products of deviations. While this isn’t always a bad thing, it may be informative to calculate autocorrelations in a way that is less sensitive to outliers.

The “tau-correlation” for paired data (x, y) is based on the concordance of pairs of the pairs. Specifically, if we have data pairs (x, y) and (x’, y’), then this pair of pairs is concordant if x > x’ and y > y’, or if x < x’ and y < y’. Otherwise, the pair of pairs is discordant. The tau-correlation is defined as the proportion of concordant pairs minus the proportion of discordant pairs.

It should be intuitive that if x tends to increase with y, most pairs of pairs will be concordant, so the tau-correlation will be positive. If x tends to decrease with y, most pairs of pairs will be discordant, so the tau-correlation will be negative. If x does not change consistently with y, the tau-correlation will be close to zero. Thus, the tau-correlation is an intuitive measure of the relationship between two variables x and y, when they are observed as a collection of independent pairs (x, y).

In our case, we have a single time series, not a collection of independent paired values. However, the tau-correlation can be generalized to a tau-autocorrelation as follows. For a given lag d, consider pairs of the form (x(t), x(t+d)). Again, we can define concordance of such a pair with another such pair (x(t’), x(t’+d)) as above. From this, we arrive at a measure of serial dependence that does not involve taking products, and hence may be more robust to outliers than the standard autocorrelation.
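A sketch of the tau-autocorrelation using scipy’s Kendall tau (which by default computes the tau-b variant that adjusts for ties):

```python
import numpy as np
from scipy.stats import kendalltau

def tau_autocorr(x, d):
    """Kendall tau correlation between x(t) and x(t+d), for lag d >= 1."""
    x = np.asarray(x, dtype=float)
    tau, _ = kendalltau(x[:-d], x[d:])
    return tau

# Example: tau-autocorrelations at lags 1 to 240 minutes for one series.
# taus = [tau_autocorr(traffic, d) for d in range(1, 241)]
```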

The estimated tau-dependence values are shown in plot 10, for lags ranging from 1 to 240 minutes. It is evident that the sources and UDP traces have much longer range dependence than the overall traffic and TCP traffic traces.

Differencing

When a time series appears to be non-stationary or to have long-range dependence, it is common to “difference” it to produce a transformed series. If the original series is $y_1, y_2, y_3, \ldots$, then the first-differenced series is $y_2-y_1, y_3-y_2, \ldots$. Differencing again gives the second-differenced series, and so on.
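Differencing is a one-line operation in numpy; the series below is a simulated placeholder (a random walk, which ties in with the discussion that follows):

```python
import numpy as np

y = np.cumsum(np.random.default_rng(0).normal(size=1440))  # placeholder series

d1 = np.diff(y)        # first-differenced series: y[t+1] - y[t]
d2 = np.diff(y, n=2)   # second-differenced series
```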

Plots 11 and 12 show the tau-autocorrelation values for the first-differenced and second-differenced series. The autocorrelations are weak after differencing. We can relate this to the simple “random walk” time series model, in which $y_{t+1} = y_t + \epsilon_t$, where $\epsilon_t$ is an iid sequence. Note that when we difference a random walk, we recover the sequence $\epsilon_t$, which has no autocorrelation at any lag. Although the network traffic series considered here look like random walks from this perspective, they cannot be random walks over a sufficiently long time duration. One reason for this is that the variance of a random walk increases linearly with time (since $y_t$ is a sum of $t$ iid increments), i.e. ${\rm Var}[y_t] \propto t$.

Long range dependence

Time series can be characterized as being “persistent”, “antipersistent”, or as lacking any strong persistence or antipersistence. Roughly speaking, in a persistent time series, a positive change from one point to the next will tend to be followed by one or more additional positive changes; similarly, negative changes tend to be followed by one or more negative changes. An antipersistent time series tends to oscillate, like a clock or a diurnal pattern: a trend can only persist for a limited number of steps in one direction before reverting to the other direction.

One way to estimate the persistence of a time series is through the scaling behavior of the variance of the sample mean. For iid or weakly dependent data, the sample mean for a sample of size $n$ has variance proportional to $1/n$. For strongly dependent data, the variance may decrease at a slower rate than $1/n$. Specifically, if the variance of the mean of $m$ consecutive values decreases at rate $m^{2(H-1)}$, the value of $H$ can be used to quantify the long-range dependence in the data. Note that if $H=1/2$, we have the usual scaling of $1/m$. The parameter $H$ is called the “Hurst parameter”. Values of $H$ greater than $1/2$ correspond to persistence and values of $H$ less than $1/2$ correspond to antipersistence. Note that if $H$ is close to 1, the variance of the mean decreases very slowly (and if $H=1$, not at all) as the sample size grows.

One way to estimate the Hurst parameter is to directly mimic its population definition. For a sequence of values of $m$, calculate block-wise sample means using blocks of length $m$, then take the sample variance of these means. These variances should scale like $m^{2(H-1)}$. To estimate $H$, use simple linear regression to regress the logged variance values on $\log(m)$. The slope $b$ of this regression is related to $H$ via $H = b/2 + 1$.

A second method for computing the Hurst index may be more robust, as it uses absolute values instead of variances. First, center the data around the overall mean. Then, compute averages within blocks of size $m$, as above. Then, take the absolute values of these averages (the absolute deviation from the overall mean), and average these over all blocks. This is a type of dispersion measure (the mean absolute deviation from the mean). It can be shown to scale like $m^{H-1}$, so a log/log regression can be used to estimate $H$.
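A sketch of both estimates (the function name and the set of block sizes are choices of ours):

```python
import numpy as np

def hurst_agg(x, block_sizes, robust=False):
    """Estimate the Hurst parameter H from the scaling of block means:
    the aggregated-variance method, or the aggregated absolute-deviation
    variant when robust=True."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()   # center around the overall mean
    disp = []
    for m in block_sizes:
        n = (len(x) // m) * m
        means = x[:n].reshape(-1, m).mean(axis=1)
        if robust:
            disp.append(np.abs(means).mean())   # scales like m^(H-1)
        else:
            disp.append(means.var())            # scales like m^(2(H-1))
    slope = np.polyfit(np.log(block_sizes), np.log(disp), 1)[0]
    return slope + 1 if robust else slope / 2 + 1

# Example (block sizes are a choice):
# block_sizes = [5, 10, 20, 40, 80, 160]
# hurst_agg(traffic, block_sizes)               # aggregated-variance estimate
# hurst_agg(traffic, block_sizes, robust=True)  # absolute-deviation estimate
```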

Hurst coefficients for the network traffic data are shown below, for data differenced 0, 1, or 2 times. The first column shows the number of times the data were differenced, the second column shows the results calculated using the aggregated variances (method 1 above), and the third column shows the results calculated using the aggregated absolute deviations from the mean (method 2 above). Consistent with our other analyses, the sources and UDP traffic have longer-range dependence than the overall traffic and TCP traffic series. Note that the robust method assigns slightly stronger long-range dependence to the traffic and TCP series, which as observed from the plots are “bursty”. This suggests that the bursts may be weakening the non-robust estimates of long-range dependence.

The Hurst parameters drop substantially after differencing, but the first differences continue to exhibit some evidence of long-range dependence, especially for the sources and UDP data.

      d   Agg. variance   Agg. abs. deviation
Traffic
  0       0.652           0.778
  1       0.069           0.180
  2      -0.236           0.002
Sources
  0       0.996           1.000
  1       0.697           0.677
  2      -0.101          -0.060
UDP
  0       0.987           0.999
  1       0.450           0.442
  2       0.003           0.097
TCP
  0       0.635           0.762
  1       0.060           0.167
  2      -0.228           0.009

Conditional autoregressive structure

Above we have indirectly characterized the time series dependence structure as being much more persistent for the source count and UDP series, and less persistent for the TCP and overall traffic series. We can dig deeper into the time series behavior by fitting models that quantify the serial dependence. One way to do this that allows us to rely on basic regression methods is autoregressive modeling. This means that each point in the time series is regressed on a window of values that directly precede it in time. Since this becomes a regression analysis, we can use any regression technique, including OLS, ridge regression, and the LASSO.

The autocorrelation in the time series results in multicollinearity in the autoregression model. The most basic regression approach for accommodating multicollinearity is ridge regression. In order to also do variable selection, we combine ridge regression and the LASSO, leading to a technique called the “elastic net”.
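A sketch of such an autoregression fit with scikit-learn’s elastic net (the 30-minute window, the standardization step, and the cross-validation settings are choices of ours):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

def autoreg_elastic_net(x, window=30):
    """Regress x(t) on x(t-1), ..., x(t-window) using the elastic net."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Row t of the design matrix holds the `window` values preceding x[t];
    # column j corresponds to lag j+1.
    X = np.column_stack([x[window - j - 1 : n - j - 1] for j in range(window)])
    y = x[window:]
    X = StandardScaler().fit_transform(X)
    model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
    return model.coef_   # one coefficient per lag, 1 through `window`

# Example: lag coefficients for the unique-sources series.
# coefs = autoreg_elastic_net(sources, window=30)
```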

Graphs of the fitted coefficients for the four network traffic time series are shown in plots 13-28. We see that the TCP and overall traffic series have rather short-range dependence by this measure – the autoregression effect vanishes within about 6 minutes in these series. But the UDP series has dependence spanning at least 12 minutes, and the unique sources series has dependence spanning back 30 minutes, which is the longest lag we considered here.