[0001] This application claims the benefit of U.S. Provisional Patent Application Serial No. 60/340,197, filed on Dec. 14, 2001, entitled “System for Monitoring and Tracking the Spread of Malicious E-mails,” and U.S. Provisional Patent Application Serial No. 60/312,703, filed Aug. 16, 2001, entitled “Data Mining-Based Intrusion Detection System,” which are hereby incorporated by reference in their entirety herein.
[0003] A computer program listing is submitted in duplicate on CD. Each CD contains a routine Clique_finder, which CD was created on Aug. 15, 2002, and which is 16.8 kB in size. The files on this CD are incorporated by reference in their entirety herein.
[0004] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by any one of the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
[0005] 1. Field of the Invention
[0006] This invention relates to systems and methods for detecting violations of an email security policy in a computer system, and more particularly to the use of probabilistic and statistical models to model the behavior of email transmission through the computer system.
[0007] 2. Background
[0008] Computer systems are constantly under attack by a number of malicious intrusions. For example, malicious software is frequently attached to email. According to NUA Research, email is responsible for the spread of 80 percent of computer virus infections (Postini Corporation, Press release “Postini and Trend Micro Partner to Offer Leading Virus Protection Via Postini's Email Pre-processing Infrastructure,” Online Publication, 2000. http://www.postini.com/company/pr/pr100200.html.) Various estimates place the cost of damage to computer systems by malicious email attachments in the range of 10-15 billion dollars in a single year. Many commercial systems have been developed in an attempt to detect and prevent these attacks. The most popular approach to defend against malicious software is through anti-virus scanners such as Symantec and McAfee, as well as server-based filters that filter email with executable attachments or embedded macros in documents (Symantec Corporation, 20330 Stevens Creek Boulevard, Cupertino, Calif. 95014, Symantec worldwide home page, Online Publication, 2002. http://www.symantec.com/product, and McAfee.com Corporation, 535 Oakmead Parkway, Sunnyvale, Calif. 94085, Macafee home page. Online Publication, 2002. http://www.mcafee.com).
[0009] These approaches have been successful in protecting computers against known malicious programs by employing signature-based methods. However, they do not provide a means of protecting against newly launched (unknown) viruses, nor do they assist in providing information that may help trace those individuals responsible for creating viruses. Only recently have there been approaches to detect new or unknown malicious software by analyzing the payload of an attachment. The methods used include heuristics (as described in Steve R. White, “Open problems in computer virus research,” Online publication, http://www.research.ibm.com/antivirus/SciPapers/White/Problems/Problems.html), neural networks (as described in Jeffrey O. Kephart, “A biologically inspired immune system for computers,”
[0010] In recent years, however, not only have computer viruses increased dramatically in number and begun to appear in new and more complex forms, but the increased inter-connectivity of computers has exacerbated the problem by providing the means of fast viral propagation.
[0011] Moreover, violations in email security policies have occurred which are marked by unusual behaviors of emails or attachments. For example, spam is a major concern on the internet. More than simply an annoyance, it costs corporations many millions of dollars in revenue because spam consumes enormous bandwidth and mail server resources. Spam is typically not detected by methods that detect malicious attachments, as described above, because spam typically does not include attachments.
[0012] Other email security violations may occur where confidential information is being transmitted by an email account to at least one improper addressee. As with spam, such activity is difficult to detect where no known viruses are attached to such emails.
[0013] Accordingly, there exists a need in the art for a technique to detect violations in email security policies which can detect unauthorized uses of email on a computer system and halt or limit the spread of such unauthorized uses.
[0014] An object of the present invention is to provide a technique for detecting violations of email security policies of a computer system by gathering statistics about email transmission through a computer system.
[0015] Another object of the present invention is to provide a technique for modeling the behavior of attachments and/or modeling of the behavior of email accounts on a computer system.
[0016] A further object of the present invention is to provide a technique for generating and comparing profiles of normal or baseline email behavior for an email account and for selected email behavior and for determining the difference between such profiles, and whether such difference represents a violation of email security policy.
[0017] A still further object of the invention is to protect the identity of email account users, while tracking email behavior associated with such users.
[0018] These and other objects of the invention, which will become apparent with reference to the disclosure herein, are accomplished by a system and methods for detecting an occurrence of a violation of an email security policy of a computer system by transmission of selected email through the computer system. The computer system may comprise a server and one or more clients having an email account. The method comprises the step of defining a model relating to prior transmission of email through the computer system derived from statistics relating to the prior emails, and the model is saved in a database. The model may be probabilistic or statistical. Statistics may be gathered relating to the transmission of the selected email through the computer system. The selected email may be subsequently classified as violative of the email security policy based on applying the model to the statistics.
[0019] In a preferred embodiment, the step of defining a model comprises defining a model relating to attachments to the prior emails transmitted through the computer system. Such a model may be created by using a Naive Bayes model trained on features of the attachment. New attachments are extracted from each of the new emails transmitted through the computer system. The attachment may be identified with a unique identifier. According to this embodiment, the step of gathering statistics relating to the transmission of new email through the computer system comprises recording the number of occurrences of the attachment received by the client.
[0020] The step of gathering statistics relating to the transmission of new email through the computer system may comprise, for each attachment that is transmitted by an email account, recording a total number of addresses to which the attachment is transmitted. This step may also include recording a total number of email accounts which transmit the attachment. In addition, this step may include, for each attachment that is transmitted by an email account, defining a model that estimates the probability that an attachment violates an email security policy based on the total number of email addresses to which the attachment is transmitted and the total number of email accounts which transmit the attachment.
[0021] The step of classifying the email may be performed at the client. Alternatively or in addition, the step of classifying the email may be performed at the server. The classification determined at the server may be transmitted to the one or more clients. In addition, the classification determined at the client may be transmitted to the server, and retransmitted to the one or more clients in the system.
[0022] According to another embodiment, the step of defining a model relating to prior transmission of email may comprise defining a model derived from statistics relating to transmission of emails from one of the email accounts. A model may be derived from statistics accumulated over a predetermined time period. For example, a model may be defined relating the number of emails sent by an email account during a predetermined time period. A model may alternatively be derived from statistics accumulated irrespective of a time period. For example, a model may be derived relating to the number of email recipients to which the email account transmits an email. In an exemplary embodiment, such models are represented as histograms. The step of gathering statistics about the transmission of selected email may comprise representing such transmission of selected email as a histogram. Classifying the transmission of selected email may comprise comparing the histogram of prior email transmission with the histogram of selected email transmission. The comparison may be performed by such techniques as Mahalanobis distance, the Chi-Square test, or the Kolmogorov-Smirnov test, for example.
[0023] Advantageously, the step of defining a model relating to transmission of emails from one of the email accounts may comprise defining the model based on the email addresses of recipients to which the emails are transmitted by the email account. Accordingly, the email addresses may be grouped into cliques corresponding to email addresses of recipients historically occurring in the same email. The step of gathering statistics relating to the transmission of email through the computer system may comprise, for email transmitted by the email account, gathering information on the email addresses of the recipients in each email. The email may be classified as violating the email security policy based on whether the email addresses in the email are members of more than one clique.
[0024] The step of defining a model relating to transmission of emails from one of the email accounts may comprise, for emails transmitted from the email account, defining the model based on the time in which the emails are transmitted by the email account. Alternatively, the model may be based on the size of the emails that are transmitted by the email account. As yet another alternative, the model may be based on the number of attachments that are transmitted by the email account.
[0025] The client may comprise a plurality of email accounts and the step of defining a model relating to prior transmission of email may comprise defining a model relating to statistics concerning emails transmitted by the plurality of email accounts. According to this embodiment, the step of defining a probabilistic model may comprise defining a model based on the number of emails transmitted by each of the email accounts. The model may also be defined based on the number of recipients in each email transmitted by each of the email accounts.
[0026] In accordance with the invention, the objects as described above have been met, and the need in the art for a technique which detects violations in an email security policy by modeling the email transmission through the computer system, has been satisfied.
[0027] Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which:
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
[0037] Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.
[0038] This invention will be further understood in view of the following detailed description.
[0039] In accordance with the invention, a system and method for detecting a violation of an email security policy of a computer system is disclosed herein. A violation of an email security policy can be defined in several ways. Such an email security policy may be explicit or implicit, and generally refers to any activity which may be harmful to the computer system. For example, an attachment to an email which contains a virus may be considered a violation of a security policy. Attachments which contain viruses can manifest themselves in several ways, for example, by propagating and retransmitting themselves. Another violation of a security policy may be the act of emailing attachments to addressees who do not have a need to receive such attachments in the ordinary course. Alternatively, the security policy may be violated by “spam” mail, which is typically unsolicited email sent to a large number of email accounts, often by accessing an address book of a host email account. The method disclosed herein detects and tracks such security violations in order to contain them.
[0040] A model is defined which models the transmission of prior email through the computer system. The model may be a statistical model or a probabilistic model. The transmission of emails “through” the system refers to emails transmitted to email accounts in the system, email transmitted by email accounts in the system, and email transmitted between email accounts within the system. The system accumulates statistics relating to various aspects of email traffic flow through the computer system. According to one embodiment, the model is derived from observing the behavior or features of attachments to emails. Another embodiment concerns modeling the behavior of a particular email account. Yet another embodiment models the behavior of the several email accounts on the system to detect “bad” profiles. The model is stored on a database, which may be either at a client or at a server, or at both locations.
[0041] The selected email transmission is typically chosen for some recent time period to compare with the prior transmission of email. Each email and/or its respective attachment is identified with a unique identifier so it may be tracked through the system. Various statistics relating to the emails are gathered. The probability that some aspect of the email transmission, e.g. an attachment, an email transmission, is violative of an email security policy is estimated by applying the model based on the statistics that have been gathered. Whether the email transmission is classified as violative of the email security policy is then transmitted to the other clients.
[0042] The system
[0043] The client
[0044] When integrated with the mail server
[0045] The mail server
[0046] This unique identifier is used to aggregate information about the same attachment propagated in different emails. This step is most effective if the payload, e.g., the content of the email, such as the body, the subject, and/or the content of the attachment, is replicated without change during virus propagation among spreading emails, so that tracking the email attachments via this identifier is possible.
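As an illustration of how an identifier of this kind could aggregate sightings of the same attachment, the following minimal sketch derives the identifier from a cryptographic hash of the attachment bytes. The choice of SHA-256 and the function names `attachment_id` and `count_occurrences` are assumptions made for this example; the text above does not say how the identifier is computed.

```python
import hashlib


def attachment_id(payload: bytes) -> str:
    """Return a hex digest identifying an attachment by its exact bytes.

    Assumption: SHA-256 stands in for whatever identifier the system uses.
    """
    return hashlib.sha256(payload).hexdigest()


def count_occurrences(attachments):
    """Map each identifier to the number of times that payload was seen."""
    counts = {}
    for payload in attachments:
        key = attachment_id(payload)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Because the digest changes whenever a single byte changes, a scheme like this only groups copies whose payload is replicated without modification, which matches the condition stated above.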
[0047] The client
[0048] System
[0049] The system
[0050] In addition, the client
[0051] The system also may define a probabilistic or statistical model relating to the behavior of attachments derived from these statistics or features. This allows a global view of the propagation of malicious attachments and allows the system
[0052] Self-replicating viruses naturally have extremely high birth rates. If a client
[0053] Many self-replicating viruses have a similar method of propagation, i.e., they transmit themselves to email addresses found on the address book of the host computer. This behavior may manifest itself in an extremely high birth rate for the attachment. While in some cases a large birthrate for an attachment would be normal, such as in a broadcast message, self-replicating viruses are characterized in that the message is transmitted from multiple email accounts
[0054] An exemplary method for detecting self-replicating viruses is to classify an attachment as self-replicating if its birth rate is greater than some threshold t and the attachment is sent from at least l email accounts. If an email flow record is above the threshold t, the client
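The two-part rule described here (birth rate above a threshold t, and at least l distinct sending accounts) might be sketched as follows. The `AttachmentFlow` record and the definition of birth rate as copies sent per hour are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass, field


@dataclass
class AttachmentFlow:
    """Hypothetical per-attachment flow record (not defined in the text)."""
    copies_sent: int = 0
    window_hours: float = 1.0
    senders: set = field(default_factory=set)

    def birth_rate(self) -> float:
        # Assumption: birth rate = copies sent per hour within the window.
        return self.copies_sent / self.window_hours


def is_self_replicating(flow: AttachmentFlow, t: float, l: int) -> bool:
    """Apply the rule: birth rate > t AND sent from at least l accounts."""
    return flow.birth_rate() > t and len(flow.senders) >= l
```

The second condition is what separates a virus from a legitimate broadcast: a single account mailing many recipients has a high birth rate but only one sender.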
[0055] The server
[0056] The server
[0057] Screen
[0058] Screen
[0059] Information concerning attachments as illustrated in
[0060] This information may be stored on database
[0061] The server
[0062] When a client
[0063] Additional statistics may be computed for each attachment and stored on databases
[0064] The system
[0065] This profile of behavior patterns may be represented as a histogram, for example. A histogram is a way of graphically showing the characteristics of the distribution of items in a given population of samples. In the exemplary embodiment, histograms are used to model the behavior of particular email accounts. From a training set, e.g., the statistics as discussed above, a histogram is constructed to represent the baseline behavior of an email account. A histogram is also created to represent selected behavior of the email account.
[0066] Histograms may model statistics, e.g., events or operations, which are accumulated over a fixed time period. Each bin in the histogram counts some number of events in fixed time periods. For example, a histogram may record the average number of emails sent by an email account each day during the previous month, wherein each bin represents a day, hour, or other time period. Alternatively, histograms may model statistics accumulated irrespective of a time period. In such case, each bin is not a fixed time period, but some other feature. For example, over a set of emails from an arbitrary time period (gathered over a month, or gathered over a year, etc.), a histogram may record the number of emails sent to each distinct recipient, wherein each bin represents a recipient.
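The two kinds of histogram can be illustrated with a short sketch. The input shapes (a list of send timestamps, and a list of recipient lists) and the function names are assumptions for this example only.

```python
from collections import Counter
from datetime import datetime


def emails_per_day(sent_times):
    """Time-based histogram: one bin per calendar day."""
    return Counter(ts.date() for ts in sent_times)


def emails_per_recipient(emails):
    """Feature-based histogram: one bin per distinct recipient.

    `emails` is an iterable of recipient lists, one list per email sent.
    """
    hist = Counter()
    for recipients in emails:
        hist.update(recipients)
    return hist
```

For instance, `emails_per_recipient([["a", "b"], ["a"]])` yields a count of 2 for recipient `a` and 1 for recipient `b`, regardless of when the two emails were sent.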
[0067]
[0068] A sequential profile can be represented which is irrespective of the quanta of time measured (non-stationary), but which instead uses each email as a measurement point. With continued reference to
[0069] Once such histograms have been created, the histogram of the baseline behavior is compared with the histogram of the selected behavior to determine whether the new behavior represents a deviation that may be classified as a violation of email security policy. There are many known methods to compute histogram dissimilarity. Generally, such methods may be divided into two categories: those that use a histogram distance function, and those that use a statistical test. A histogram can be represented by a vector.
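To make the vector view concrete, the sketch below computes three commonly used histogram distance functions on equal-length vectors X and Y: histogram intersection, the L1 distance, and the L2 distance. These are standard textbook definitions offered as an assumption; they are not guaranteed to match the numbered equations of this disclosure term for term.

```python
import math


def l1_distance(x, y):
    """Sum of absolute bin-by-bin differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean distance between the two histogram vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def intersection_distance(x, y):
    """1 minus the normalized overlap; 0.0 means identical histograms.

    Assumes both histograms are non-empty (non-zero bin totals).
    """
    overlap = sum(min(a, b) for a, b in zip(x, y))
    return 1.0 - overlap / min(sum(x), sum(y))
```

A baseline histogram compared with itself gives a distance of zero under all three functions; a threshold on the distance would then decide whether the selected behavior counts as a deviation.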
[0070] Histograms may be compared with the L
[0071] When the sums of X[i] and Y[i] are equal, the histogram intersection formula of equation (1) may be simplified to the L
[0072] Alternatively, histograms may be compared with the L
[0073] The L
[0074] Other distance equations are the weighted histogram difference equations, e.g., the histogram quadratic distance equation and the histogram Mahalanobis distance equation. The histogram quadratic difference equation (4) considers the difference between different bins.
[0075] In equation (4), A is a matrix and a
[0076] The Mahalanobis distance is a special case of the quadratic distance equation. The matrix A is given by the covariance matrix obtained from a set of training histograms. Here, the elements in the histogram vectors are treated as random variables, i.e., X=[x
[0077] This method requires a sufficiently large training set (of prior email transmission statistics) in order to allow the covariance matrix to accurately represent the training data.
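For reference, the standard forms of the quadratic and Mahalanobis distances between histogram vectors X and Y can be written as below. The symbols M (number of bins), a_{ij} (entries of A), and \Sigma (covariance matrix of the training histograms) are assumptions of this sketch, and the notation may differ from equations (4) and (5). Note that in the usual definition the weighting matrix is the inverse of the covariance matrix.

```latex
% Quadratic distance with weighting matrix A = [a_{ij}]
D_A^2(X, Y) = (X - Y)^{T} A \,(X - Y)
            = \sum_{i=1}^{M} \sum_{j=1}^{M} a_{ij}\,(x_i - y_i)(x_j - y_j)

% Mahalanobis distance: the special case A = \Sigma^{-1}
D_M^2(X, Y) = (X - Y)^{T} \Sigma^{-1} (X - Y)
```

With A equal to the identity matrix, the quadratic form reduces to the squared Euclidean distance, which is why a poorly estimated covariance matrix (too few training histograms) distorts the result.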
[0078] The chi-square test is used to test if a sample of data came from a population with a specific distribution. It can be applied to any univariate distribution for which it is possible to calculate the cumulative distribution function. However, the value of the chi-square test statistic depends on how the data is binned, and it requires a sufficient sample size. The chi-square test is represented by equation (6):
[0079] where k is the number of bins O
[0080] where F is the cumulative distribution function, Y
[0081] The Kolmogorov-Smirnov test (the “KS test”) is a statistical test which is designed to test the hypothesis that a given data set could have been drawn from a given distribution, i.e., that the new behavior could have been drawn from the normal behavior. The KS test is primarily intended for use with data having a continuous distribution, and with data that is independent of arbitrary computational choice, such as bin width. The result D is equal to the maximum difference between the cumulative distributions of the data points.
[0082] and where N is the total number of samples. The KS test does not depend on the underlying cumulative distribution function which is being tested, and it is an exact test (when compared with the Chi-Square test, which depends on an adequate sample size for the approximations to be valid). The KS test may only be applied to continuous distributions; it tends to be more sensitive near the center of the distribution than at the tails.
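The two statistical tests can be sketched on binned data as follows. Both functions use the standard textbook statistics (sum of (O - E)^2 / E for chi-square, maximum gap between cumulative distributions for KS); the function names are hypothetical, and applying the KS statistic to binned histograms is an approximation, since the test is intended for continuous data.

```python
def chi_square_statistic(observed, expected):
    """Sum over bins of (O_i - E_i)^2 / E_i; bins with E_i == 0 are skipped."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def ks_statistic(hist_a, hist_b):
    """Maximum gap between the cumulative distributions of two histograms.

    Assumes both histograms have the same bins and non-zero totals.
    """
    total_a, total_b = sum(hist_a), sum(hist_b)
    cum_a = cum_b = 0.0
    d = 0.0
    for a, b in zip(hist_a, hist_b):
        cum_a += a / total_a
        cum_b += b / total_b
        d = max(d, abs(cum_a - cum_b))
    return d
```

In either case the resulting statistic would be compared against a critical value or a tuned threshold to decide whether the selected behavior departs from the baseline.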
[0083] The modeling of the behavior of an email account may include defining a model based on the time of day in which emails are transmitted by a particular email account.
[0084] Another method for defining a model relating to the transmission of emails from one of the email accounts is based on the email addresses of the recipients of emails transmitted by the particular email account. Thus, another statistic or feature gathered by the method in accordance with the invention is the email addresses of recipients in each email. The recipients of the emails may be grouped into “cliques” corresponding to email addresses historically occurring in the same email.
[0085] A clique is defined as a cluster of strongly related objects in a set of objects. A clique can be represented as a subset of a graph, where nodes in the graph represent the “objects” and arcs or edges between nodes represent the “relationships” between the objects. Further, a clique is a subset of nodes where each pair of nodes in the clique share the relationship but other nodes in the graph do not. There may be many cliques in any graph.
[0086] In this context, the nodes are email addresses (or accounts) and the edges represent the “emails” (and/or the quantity of emails) exchanged between the objects (email accounts). Each email account is regarded as a node, and the relationship between them is determined by the to:, from:, and cc: fields of the emails exchanged between the email accounts. As illustrated in
[0087] The relationship between nodes that induces the cliques can be defined under different periods of time, and with different numbers of emails being exchanged, or other features or properties. For example, an edge (as represented by line
[0088]
[0089] Cliques are determined according to any number of known methods. In the exemplary embodiment, cliques are modeled as described in C. Bron and J. Kerbosch. “Algorithm 457: Finding All Cliques of an Undirected Graph,”
[0090] First, the graph is built by selecting all of the rows from the email table in the database. As illustrated in
[0091] A first step is to check an aliases file against the sender and recipient to map all aliases to a common name. For instance, a single user may have several accounts. This information, if available, would be stored in an aliases file.
[0092] The edge between sender and recipient is updated (or added if it doesn't already exist). (The edge is represented as line
[0093] A next step is pruning the graph. The user inputs a minimum edge weight, or minimum number of emails that must pass between the two accounts to constitute an edge, and any edges that don't meet that weight are eliminated. For example, the minimum number of emails may be determined from the average number of emails sent by the email account over a similar time period.
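The graph-building and pruning steps described above might look like the following sketch. The input shape (pairs of sender and recipient list), the dictionary-based alias map, and the function names are assumptions made for illustration; they do not reproduce the Clique_finder routine.

```python
from collections import defaultdict


def build_graph(emails, aliases=None):
    """Build weighted undirected edges from (sender, recipients) records.

    Returns {frozenset({a, b}): number_of_emails}. `aliases` maps an
    alternate address to the common name used for that user.
    """
    aliases = aliases or {}
    weights = defaultdict(int)
    for sender, recipients in emails:
        s = aliases.get(sender, sender)
        for rcpt in recipients:
            r = aliases.get(rcpt, rcpt)
            if r != s:  # ignore mail a user sends to their own alias
                weights[frozenset((s, r))] += 1
    return weights


def prune(weights, min_weight):
    """Keep only edges carrying at least `min_weight` emails."""
    return {edge: w for edge, w in weights.items() if w >= min_weight}
```

Using a `frozenset` as the edge key treats the edge as undirected, so mail from a to b and from b to a both raise the weight of the same edge.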
[0094] Subsequently, the cliques are determined. Throughout this process, there exist four sets of data: (1) *compsub* represents a stack of email user accounts representing the clique being evaluated. Every account in *compsub* is connected to every other account. (2) *candidates* represents a set of email user accounts whose status is yet to be determined. (3) *not* represents a set of accounts that have earlier served as an extension of the present configuration of *compsub* and are now explicitly excluded. (4) *cliques* represents a set of completed cliques.
[0095] In the exemplary embodiment, these are implemented using the Java Stack and HashSet classes rather than the array structure suggested in Bron & Kerbosch (see the Appendix and the routine Clique_finder attached herein).
[0096] The algorithm is a recursive call to extendClique(). There are five steps in the algorithm: Step 1 is the selection of a candidate, i.e., an email user account which may be prospectively added to the clique. Step 2 involves adding the selected candidate to *compsub*. Step 3 creates new sets *candidates* and *not* from the old sets by removing all points not connected to the selected candidate (to remain consistent with the definition), keeping the old sets intact. Step 4 is calling the extension operator to operate on the sets just formed. The duty of the extension operator is to generate all extensions of the given configuration of *compsub* that it can make with the given set of candidates and that do not contain any of the points in *not*. Upon return, step 5 is the removal of the selected candidate from *compsub* and its addition to the old set *not*.
[0097] When *candidates* and *not* are both empty, a copy of *compsub* is added to *cliques*. (If *not* is non-empty it means that the clique in *compsub* is not maximal and was contained in an earlier clique.) A clique's most frequent subject words are computed by merging and sorting the weighted sets of subject words on each edge in the clique.
[0098] If we reach a point where there is a point in *not* connected to all the points in *candidates*, the clique determination is completed (as discussed in The Appendix). This state is reached as quickly as possible by fixing a point in *not* that has the most connections to points in *candidates* and always choosing a candidate that is not connected to that fixed point.
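The five steps and the pivoting rule can be sketched as a recursive routine. This is a minimal Python rendering of the Bron and Kerbosch approach, not the Java Clique_finder routine itself; the name `find_cliques`, the adjacency-dictionary input, and the choice of pivot from the union of *candidates* and *not* (a common variant) are assumptions of this example.

```python
def find_cliques(adjacency):
    """Return all maximal cliques of an undirected graph as frozensets.

    `adjacency` maps each node to the set of its neighbours (no self-loops).
    """
    cliques = []

    def extend(compsub, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(compsub))  # compsub is maximal
            return
        # Fix the point with the most connections to the candidates, then
        # only try candidates that are NOT connected to that fixed point.
        pivot = max(candidates | excluded,
                    key=lambda n: len(adjacency[n] & candidates))
        for node in list(candidates - adjacency[pivot]):
            compsub.append(node)                       # step 2
            extend(compsub,                            # steps 3 and 4
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            compsub.pop()                              # step 5
            candidates.remove(node)
            excluded.add(node)

    extend([], set(adjacency), set())
    return cliques
```

For a graph with edges a-b, a-c, b-c, and c-d, the routine reports two maximal cliques: {a, b, c} and {c, d}.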
[0099] A clique violation occurs if a user email account sends email to recipients which are in different cliques. If an email
[0100] A strength of the clique violation may be measured by counting the number of such violations in a single email, e.g., the number of recipients who are not themselves part of the same clique, and/or the number of emails being sent, or other features that may be defined (as the system designer's choice) to quantify the severity of the clique violation. (For example, if email account
[0101] Clique violations may also be determined from multiple email messages, rather than from just one email. For example, if a set of emails is sent over some period of time, and each of these emails is “similar” in some way, the set of email accounts contained in those emails can be subjected to clique violation tests. Thus, the email recipients of email sent by a particular user are used as training data to train a model of the email account.
[0102] If a specific email account is being protected by this method of modeling cliques and detecting clique violations, such violations could represent a misuse of the email account in question. For example, this event may represent a security violation if the VP of engineering sends an email to the CEO concurrently with a friend who is not an employee of the VP's company. Similarly, a clique violation would occur when a navy lieutenant sends a secret document to his commanding officer, with his wife's email account in the CC field. These are clique violations that would trigger an alert.
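A clique violation check of this kind could be sketched as below, assuming the cliques for the protected account are already available as sets of addresses. The function names and the particular severity score (number of recipients falling outside the best-matching clique) are hypothetical; the text leaves the strength measure to the system designer.

```python
def is_clique_violation(recipients, cliques):
    """True when no single known clique contains every recipient."""
    rcpts = set(recipients)
    return not any(rcpts <= clique for clique in cliques)


def violation_strength(recipients, cliques):
    """Hypothetical severity: recipients outside the best-matching clique."""
    rcpts = set(recipients)
    if not cliques:
        return len(rcpts)
    return min(len(rcpts - clique) for clique in cliques)
```

With cliques {ceo, cfo, vp} and {friend, spouse}, an email addressed to `ceo` and `friend` is flagged, while one addressed to `ceo` and `cfo` is not.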
[0103] The techniques described herein can also be used a) to detect spam emails (which may or may not, and generally do not, have attachments), and b) to detect spammers themselves. Spam generally has no attachments, so other statistics about email content and email account behavior need to be gathered here by system
[0104] The methods described above generally refer to defining probabilistic or statistical models which define the behavior of individual email accounts. Also useful are models relating to statistics for emails transmitted by the plurality of email accounts on the computer system.
[0105] Detecting email accounts that are being used by spammers may allow an internet service provider or server
[0106] Individual profiles may be represented by histograms in screen
[0107] Detection of a “spammer” may be performed by comparing email account profiles, such as those illustrated in
[0108] Profile
TABLE 1: Average Number of Emails Sent

| | Account A | Account B |
|---|---|---|
| Per minute | 0.5 | 100 |
| Per day | 11 | 12,000 |
[0109] Profile
TABLE 2: Average Number of Recipients of Email by Time of Day

| | Account A | Account B |
|---|---|---|
| Morning | 1 | 15 |
| Day | 5 | 15 |
| Night | 1 | 15 |
[0110] Profile
TABLE 3: Cumulative Distinct Email Account Recipients

| | Account A | Account B |
|---|---|---|
| Email 1 | 1 | 15 |
| Email 2 | 1 | 27 |
| Email 3 | 2 | 43 |
| . . . | . . . | . . . |
| Email 55 | 7 | 1236 |
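A cumulative profile of the kind shown in Table 3 could be computed with a running set of recipients, as in this minimal sketch. The input shape (one recipient list per email, in the order sent) and the function name are assumptions for illustration.

```python
def cumulative_distinct_recipients(emails):
    """Return, after each email, the count of distinct recipients so far."""
    seen = set()
    profile = []
    for recipients in emails:
        seen.update(recipients)
        profile.append(len(seen))
    return profile
```

An account that keeps writing to the same few people produces a nearly flat profile, whereas an account that reaches new addresses with every message produces a steeply rising one.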
[0111] Given these three profiles, Account A appears to have a profile showing very modest use of emails, with few recipients. Account B on the other hand appears to be a heavy transmitter of emails. In addition, there seems to be evidence that the behavior of Account B is indicative of a ‘drone’ spammer. Such determination may be made by comparing the histograms of Account A (considered a “normal” user) with the histograms of Account B, and determining the difference between the two. Equations (1)-(8), above, are useful for this purpose. For example, the histogram of Table 2 indicates that the behavior of Account B may be consistent with running a program that is automatically sending emails to a fixed number of recipients (e.g.,
[0112] It will be understood that the foregoing is only illustrative of the principles of the invention, and that various modifications can be made by those skilled in the art without departing from the scope and spirit of the invention.