The widespread use of computer networks and the Internet has created a parallel increase in the concern for security and, more specifically, for protection from hostile activity, ranging from viruses, Trojan horses, and denial-of-service attacks up to data theft and outright misrepresentation and fraud. Firewalls, among the earliest types of protective methods, are no longer considered sufficient. Many organizations use an additional layer of protection in the form of intrusion detection systems (IDSs) that monitor activity on the network in order to detect unusual, potentially hostile activity once a break-in has occurred [4, 11]. Although this helps contain the resulting damage and provides information for future security improvements, it does not add much to the sense of security of the network users. Intrusion prevention systems (IPSs), which actively block attacks in real time, provide an additional layer of security. Typically, IPSs operate online by matching network activity patterns to the signatures of known modes of attack. In addition to generating substantial workload on the network management system, IPSs and IDSs have a tendency toward false positives, resulting in even more burden on the system and its users. A conservative estimate is that 30% to 60% of the alerts are false alarms, and the valid alerts get lost in the sheer volume of information. Further, IPSs are effective only against known patterns of attack. New patterns, or even minor modifications of known patterns, will not be detected [1, 12]. Thus, the database of harmful signatures must be updated constantly to reflect new modes of attack. Consequently, IPSs are not well suited to be used in critical networks.
Recently, a new approach to computer security has emerged: Active Intrusion Prevention (AIP), which works as follows. The product with the AIP technology constantly examines all the activity on the network, looking for requests for data that may be used to break into the network or for other similar hostile purposes. In these situations, the AIP system will provide the requested data, except that it will be specially marked. When the attacker tries to use the marked information it has received, the AIP system knows with certainty that this is a valid attack rather than a false alert and will block the attacker either directly or by means of the organizational firewall. In this manner the AIP approach yields very accurate identification of hostile events, virtually free of false positives. Other advantages of this approach include early identification prior to the actual break-in and protection from both familiar and new attack modes.
Further, the AIP system captures information about the originator of the attack, thus enabling the attacker to be blocked completely and future attempts to be prevented. In fact, AIP programs routinely accumulate and compile data on the attacker, the attacked system, and the setting under which the activities occurred. This data may serve for more comprehensive statistical analysis that can provide guidelines for setting up and managing effective security measures for the entire network. One of the first works combining an intrusion detection system and statistical analysis was published in 1999 in the Proceedings of the IEEE Symposium on Security and Privacy; a more recent experiment, carried out by Haines et al. in 2003, demonstrates the value of classifying traffic as "normal" and "abnormal."
The objective of this article is to demonstrate how data obtained from an AIP system may be analyzed with standard statistical techniques and applied to enhance computer network security. To this effect, we describe the development and validation of a predictive model and discuss the practical implications of the variables that turned out to be significant. Although the model was validated only for the computer network from which the data originated, we believe the specific results are indicative if not directly applicable for other networks that operate under similar conditions.
The model was developed using a sample of data collected from a small U.S.-based organization. The sample consists of 65,533 observations or events that occurred in the six-month period from December 2002 until June 2003. Each event in the sample corresponded to an attempt to access the computer network of the organization. The events in the sample were classified as either legitimate probes or hostile intrusions, as determined by the AIP system in place in the organization computer network. In this sample there were 16,662 hostile intrusions, or 25.43% of the events, while the rest of the activity was legitimate.
The predictive model was developed using logit regression analysis. In logit regression, the dependent variable is the logistic transformation of the probability that a dichotomous variable will take one of its two possible values. The logistic transformation is the logarithm of the ratio of the two probabilities corresponding to the two possible values of the dichotomous dependent variable. In our case, these are the probabilities that a given event will be an intrusion or a legitimate probe. The logit transformation ensures the dependent variable will be continuous and unbounded, thus allowing for a better fit of the regression equation. Predictions regarding the actual value of the dependent variable are made by taking the inverse logit transformation, then selecting the event outcome with the largest probability.
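The transformation and its inverse can be sketched in a few lines of Python. This is only an illustration of the definitions in the preceding paragraph; the function names `logit` and `inverse_logit` are assumptions made for the sketch, and the only figures used are the sample counts reported above.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z: float) -> float:
    """Map a log-odds value z back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Share of intrusions in the sample: 16,662 of 65,533 events.
base_rate = 16662 / 65533

# The log-odds are negative because intrusions are the less frequent outcome.
z = logit(base_rate)

# Applying the inverse transformation recovers the original probability.
recovered = inverse_logit(z)
```

A log-odds value of zero corresponds to a probability of exactly 0.5, which is why 0.5 is the natural cut-off when the outcome with the largest probability is selected.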
The independent variables were attributes of the event that are known at the moment the event occurs. In this study we focused on four characteristics of the event: protocol used; geographical location of the event originator; service accessed; and method of access. These independent variables were all measured on categorical scales that had a large number of possible values. In order to better manage the development of the regression model, the original categories were grouped into a smaller number, as described here.
Geographic location. Most events originated in countries with extensive computer networks, such as the U.S., Canada, and China. Because of the diversity of countries, the events were grouped by continent. The only exception was China, which was kept as a separate category because of its large number of events (13%). North America (U.S., Canada, and Mexico) had 36% of the events. Grouping China and the rest of the Asian nations results in 32% of the sample. This is not a trivial amount, because it suggests that even small and medium U.S. organizations are vulnerable to a substantial number of attacks from Asia, almost as many as from North America. It should be noted, incidentally, that China and India are also considered rapidly growing IT markets that are expected to acquire a salient presence on the Internet in the near future. Europe generated only 15% of the events despite the fact that it includes developed countries with large IT systems. After Europe, we find South America with only 5%, and the remainder of the world, mainly Australia/New Zealand and some African countries, appears under Other. Thus, we ended with the following six geographical categories and their respective frequencies: North America (36%); China (13%); Asia (19%); Europe (15%); South America (5%); and Other (11%).
Service accessed. The network of the sample organization runs mainly Windows technology services and a few Web services. The services were grouped according to port families. The Windows group included NetBios (ports 137 and 139), Microsoft DS (port 445), and the popular Microsoft database MS SQL server (ports 1433 and 1434). The Web group included HTTP and HTTPS (ports 80, 8080, and 443). The unassigned ports (8, 17300, and 27374) were grouped as Other, while the remaining ports with known assigned services (such as 21-FTP, 22-SSH) were grouped together as Other Known. The final distribution was as follows: Windows (82.41%); Web (9.77%); Other Known (5.26%); and Other (2.56%).
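The port grouping described above amounts to a simple lookup, sketched here in Python. The set names and the function `service_group` are assumptions for illustration. Only ports 21 and 22 are named in the text as examples of Other Known, so that set is a stand-in rather than the full list used in the study.

```python
# Port families as listed in the text.
WINDOWS_PORTS = {137, 139, 445, 1433, 1434}   # NetBios, Microsoft DS, MS SQL
WEB_PORTS = {80, 8080, 443}                   # HTTP and HTTPS

# ASSUMPTION: stand-in for "ports with known assigned services".
# The text names only 21 (FTP) and 22 (SSH) as examples.
OTHER_KNOWN_PORTS = {21, 22}


def service_group(port: int) -> str:
    """Return the service category for a destination port."""
    if port in WINDOWS_PORTS:
        return "Windows"
    if port in WEB_PORTS:
        return "Web"
    if port in OTHER_KNOWN_PORTS:
        return "Other Known"
    # Everything else, including the unassigned ports 8, 17300, and 27374.
    return "Other"
```

For example, `service_group(445)` falls in the Windows group, while `service_group(27374)` falls in Other.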
Method of access. The most common method used in the initial reconnaissance phase is the port scan (61%). This is not surprising, since it is the most generic method and enables the attacker to gain extensive information about the target. The NetBios probe (Windows probe) is the next most popular (22%), which can be explained mainly by the fact that Windows operating systems have many known vulnerabilities and the majority of personal computers use a Windows operating system. The login group contains all the probe types that were used during a login process to a known service in the network (for example, telnet, SSH, and FTP). The final distribution was as follows: System (4%); Login (6%); NetBios (22%); Port Scan (61%); and HTTP (6%).
A logit regression was run on the categorized dataset described earlier. The logit model extends the principles of generalized linear regression models to better treat the case of dichotomous and polytomous dependent variables. It focuses on association of grouped data, looking at all levels of possible interaction effects. In order to be able to assess the specific impact of each value, the categorical variables were converted into binary variables. Since the individual values of a variable are not mutually independent (an event that belongs to none of the first n-1 categories must belong to the last one), n-1 binary variables are used to represent the n values of the original variable. This is illustrated in Table 1 for the variable Service Accessed. The value coded with all "0" ("Other") is referred to as the base value. The coefficients of the logit regression reflect the impact of the other values relative to the base value.
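The n-1 coding can be illustrated for Service Accessed with a short Python sketch. Table 1 is not reproduced in this text, so the ordering of the categories and the use of category names as column labels are assumptions; only the four categories and the choice of "Other" as the base value come from the article.

```python
# The four Service Accessed categories; "Other" is the base value.
# ASSUMPTION: the column order here need not match Table 1.
SERVICE_LEVELS = ["Windows", "Web", "Other Known", "Other"]
BASE_LEVEL = "Other"


def dummy_code(value: str, levels=SERVICE_LEVELS, base=BASE_LEVEL) -> dict:
    """Encode one categorical value as n-1 binary indicators.

    The base level gets no indicator of its own, so it is represented
    by all indicators being 0.
    """
    if value not in levels:
        raise ValueError(f"unknown category: {value!r}")
    return {level: int(value == level) for level in levels if level != base}
```

With this coding, `dummy_code("Web")` sets only the Web indicator to 1, and `dummy_code("Other")` returns three zeros, which is the base value against which the regression coefficients are read.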
The results of the logit regression are shown in Table 2. The column labeled "b" shows the coefficients of the individual binary variables in the logit regression model; "se" refers to the standard error of the respective coefficient. The column "p" shows the p-value for the test of the hypothesis that b=0. This test was done using the Wald Chi-square test with one degree of freedom for each binary variable.
Most of the variables are significant, that is, their p-values fall below the limit of 0.05. Two geographical location values turned out not to be statistically significant: G1 with p=0.253 and G2 with p=0.931. A possible explanation for this is the fact that these two variables together account for nearly half of the events. G1 groups all the attacks from North America and G2 consists of all the attacks from China. Each of these categories therefore contains a wide variety of events, which may explain why the significance is low. The other binary variable that turned out not to be statistically significant is M3 (events using the NetBios method), with p=0.635.
The quality of the model was examined based on the Cox and Snell R² coefficient, which was 0.551, and the Nagelkerke R² coefficient, which was 0.812. These are standard measures of association used in logit analysis and are calculated automatically by statistical analysis software, such as SPSS. Cox and Snell's R² is an attempt to imitate the interpretation of multiple R² based on the likelihood, but its maximum can be (and usually is) less than 1.0, making it difficult to interpret. Nagelkerke's R² is a further modification of the Cox and Snell coefficient to assure that it can vary from 0 to 1. That is, Nagelkerke's R² divides Cox and Snell's R² by its maximum in order to achieve a measure in the range 0 to 1. Therefore, Nagelkerke's R² will normally be higher than the Cox and Snell measure. These measures are interpreted in a manner similar to the R² of linear regression. Thus, the values obtained here indicate a fairly strong association between the predicted and the actual outcomes. Table 3 presents a more detailed description of the model's fit.
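The relationship between the two coefficients can be checked with a short calculation. The sketch below uses the usual definition, in which the maximum of the Cox and Snell coefficient is determined by the likelihood of the intercept-only (null) model; it assumes that the null model was fitted on the same 65,533 events, which the article does not state explicitly. The variable names are illustrative.

```python
import math

n_events = 65533        # sample size reported in the article
n_intrusions = 16662    # events classified as hostile intrusions
cox_snell = 0.551       # Cox and Snell coefficient reported in the article

# Log-likelihood of the intercept-only model, which predicts the
# overall intrusion rate for every event.
p = n_intrusions / n_events
ll_null = n_events * (p * math.log(p) + (1 - p) * math.log(1 - p))

# Upper bound of the Cox and Snell coefficient: 1 - L0^(2/n).
max_cox_snell = 1 - math.exp(2 * ll_null / n_events)

# Nagelkerke rescales Cox and Snell by that upper bound.
nagelkerke = cox_snell / max_cox_snell
```

Under that assumption the upper bound is roughly 0.68, and the rescaled value agrees with the reported Nagelkerke coefficient of 0.812 to within rounding.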
Table 3 shows that the model gives true negatives in 44,180 cases out of 48,871 probe cases, or 90.4%. On the other hand, only 4,691 cases, or 9.6%, are false positives (type-2 (β) errors, that is, classifying an event as an intrusion when it is only a probe). This is well within the defined limit of 20%, meaning that the results are more than suitable.
False negatives, type-1 (α) errors, that is, classification of an intrusion event as a probe event, cause more concern. The model gave only 581 cases, or 3.5%, of such type-1 errors, while the number of cases it correctly predicted was 16,081 out of 16,662 intrusion events, or 96.5% of correct predictions. Overall the model gives us 92% of correct predictions, which is very good, considering the target and the field of research.
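The rates quoted from Table 3 follow directly from the four cell counts, as the following Python sketch shows. The counts are those given in the text; the variable names are illustrative.

```python
# Cell counts from Table 3 as quoted in the text.
true_negatives = 44180    # probes classified as probes
false_positives = 4691    # probes classified as intrusions
false_negatives = 581     # intrusions classified as probes
true_positives = 16081    # intrusions classified as intrusions

probes = true_negatives + false_positives
intrusions = false_negatives + true_positives
total = probes + intrusions

# Rates among legitimate probes.
true_negative_rate = true_negatives / probes
false_positive_rate = false_positives / probes

# Rates among hostile intrusions.
true_positive_rate = true_positives / intrusions
false_negative_rate = false_negatives / intrusions

# Share of all events classified correctly.
accuracy = (true_negatives + true_positives) / total
```

The row totals reproduce the 48,871 probes and 16,662 intrusions of the sample, and the overall accuracy rounds to the 92% reported above.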
Prediction is based on the generic logit probability function shown in Equation 1. In Equation 1, a denotes the constant in the regression equation (the last row in Table 2), b denotes the vector of coefficients, and the Xi represent the binary variables. Based on the results of Table 2, and ignoring the variables that turned out to be not statistically significant, we obtain the predictive equation shown in Equation 2.
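The equations themselves are not reproduced in this text. For reference, the standard logistic probability function, written with the symbols defined above, has the following form. This is the textbook expression and is offered as a sketch, not as a copy of the article's typeset Equation 1; the label P(intrusion | X) is an assumption about which outcome is modeled.

```latex
P(\text{intrusion} \mid X) \;=\; \frac{1}{1 + e^{-\left(a + \sum_{i} b_i X_i\right)}}
\qquad\Longleftrightarrow\qquad
\ln\frac{P}{1 - P} \;=\; a + \sum_{i} b_i X_i
```

The right-hand form shows the logit (log-odds) as a linear function of the binary variables, which is what the regression estimates.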
In order to predict whether a certain event is a hostile intrusion or a legitimate probe, the values of the variables that describe the event (such as P1) are entered into Equation 2. If the result of the calculation exceeds 0.5, then the event is more likely to be hostile than legitimate, and should be treated as such. Otherwise, the event is more likely to be legitimate. The threshold value of 0.5 is based on pure probabilistic considerations. An organization that wishes to be more (or perhaps less) risk averse may set a lower (higher) threshold value. In that case, of course, the percentage of false alarms and missed attacks will vary.
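The decision rule can be sketched in Python as follows. The coefficients below are HYPOTHETICAL placeholders chosen only to make the example run; the fitted values belong to Table 2 and Equation 2, which are not reproduced in this text. The variable labels other than P1, and the function names, are likewise assumptions.

```python
import math

# HYPOTHETICAL values for illustration only; not the fitted model.
INTERCEPT = -1.0
COEFFICIENTS = {"P1": 0.8, "S1": 1.5, "M4": -0.4}


def intrusion_probability(event: dict) -> float:
    """Logistic probability that an event is a hostile intrusion.

    `event` maps binary variable names to 0 or 1; missing names count as 0.
    """
    z = INTERCEPT + sum(b * event.get(name, 0) for name, b in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(event: dict, threshold: float = 0.5) -> str:
    """Label an event as an intrusion when its probability exceeds the threshold."""
    return "intrusion" if intrusion_probability(event) > threshold else "probe"


# Example event with two indicators switched on.
event = {"P1": 1, "S1": 1, "M4": 0}
default_label = classify(event)                 # default 0.5 cut-off
strict_label = classify(event, threshold=0.9)   # a higher cut-off flags fewer events
```

Lowering the threshold flags more events as intrusions, trading additional false alarms for fewer missed attacks; raising it does the opposite.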
The analysis of the timing of the events was done separately and not as part of the logit model. Initial analysis based on the week time scale showed a uniform distribution, with no specific day or hour during which more events occur. We had expected to find more events during the weekend, but the results indicate slightly more activity in the middle of the week, though the difference was not significant. However, analyzing the time variable according to the local time of the attack reveals a clear trend, as shown in Figure 1. The trend shown in Figure 1 indicates, for each of the main geographical areas (North America, China, rest of Asia, and Europe), the actual frequency of intrusion events by time of the day and the polynomial curve that represents the best fit to the data.
Figure 1 shows that all around the world attackers increase their activity during the night. There are several possible explanations for this interesting finding. One is that system administrators are not present, so the attacker can move around freely during this time. Another possible reason is the fact that office computers left on are easily broken into because no one is around to notice and there are no reboots during the night. Yet another reason is the bandwidth, which is less used at night.
The findings of this research can be summarized in a few main conclusions. The first conclusion relates to the time factor and to one of the most interesting and unexpected findings of the study, namely that attackers increase their activity during the night. This is something that can be used in future research on the behavior of hackers. The second conclusion relates to the geographic findings, suggesting that most of the hostile activity is the work of a local attacker. This was found not only in the U.S. database described here but also in an analysis of the European and Israeli databases that was carried out later.
It is therefore suggested that organizations wishing to strengthen their network security should adhere to the following action items:
Finally, the logit regression could be turned into a network security risk reduction tool that analyzes the traffic and acts accordingly, as depicted in Figure 2. The first phase is to create a model that suits the specific needs of the organization. This is followed by the second phase, in which the organization uses the output of the model as the risk reduction tool. The two phases are circular and should be amended or supplemented from time to time, depending on the nature of the organization and the threats it faces.
Customers for such a model can be divided into two groups: business and academic. In the business sector, organizations could improve their security level by applying the model to their network, customizing it for their particular services and the importance they ascribe to their various resources. This research shows that no matter what the location of the organization or what the physical size of its computer network, the risks exist and a tool that follows this model can improve network security. At the same time, in the academic world, the findings of this research can be studied together with the findings of security research organizations such as Carnegie Mellon University's Software Engineering Institute (SEI), the Honeynet Project (project.honeynet.org/), and the SANS Internet Storm Center (isc.sans.org/index.php) in order to build an accurate model of future attacks and add to the body of knowledge we have on attackers. For further research, the suggested model could be adapted to the type of organization (for example, small or medium enterprises of a specific business segment or a Fortune 500 company). This could help to identify the specific threats to which each segment of the market is most vulnerable.
The main limitation of this research is that too few organizations were available to make up a representative sample. Future research taking a macro view on a collection of 30 (or more) databases from around the world could therefore do much to broaden the perspective in general and give more accurate results concerning the geographic influence in particular.
7. Julisch, K. and Dacier, M. Mining intrusion detection alarms for actionable knowledge. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (July 2002), 263-270.
©2007 ACM 0001-0782/07/0400 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.