[0001] This application is a continuation application and claims priority from U.S. patent application Ser. No. 09/523,446, filed on Mar. 10, 2000, which is hereby incorporated by reference herein.
[0002] The present invention relates generally to data processing and more specifically to an OLAP-based customer behavior profiling method and system.
[0003] Telecommunication fraud is a major problem that costs the telephone service providers many millions of dollars annually. There are generally two types of telecommunication fraud: fake identity and real identity fraud. In fake identity fraud, the impostor uses another's access code to access telephone services (e.g., local or long distance access). In a real identity fraud, the perpetrator uses a real identity, but fails to pay the telephone service providers for services. When the telephone company stops providing service to a real identity fraud perpetrator, the perpetrator either applies for a new number or switches service providers, thereby continuing to defraud the telephone service providers.
[0004] To counteract these problems, telephone service providers currently hire consultants and provide them with past calling records, which typically include all the calling records for a previous year. The consultants then take six months or more to sort through the many millions of records and to generate a report that describes any suspicious activity for the past year. Unfortunately, the prior art tools for fraud detection utilized by the consultants to analyze the records are very limited and employ very crude or coarse threshold detection methods to detect the fraudulent behavior.
[0005] For example, one prior art threshold detection method is based solely on the length of the telephone call. When a particular call exceeds a particular length (e.g., 24 hours), the method informs the consultant that the call is probably fraudulent. Another prior art threshold detection method is based on both the length of the call and the time when the call occurred. When a particular call is more than a particular length of time (e.g., 4 hours), and the call occurs in the evening (e.g., after 10 PM), then this prior art method classifies the call as “fraudulent.”
[0006] These current methods suffer from several disadvantages. First, these tools do not have the ability to generate a specific and personalized caller profile and to use that profile to detect suspicious calling activity that corresponds to a unique calling behavior. As noted, only very coarse thresholds can be established. Personalized profiles are important because calling behavior that may be considered abnormal (e.g., phone calls in the evening that last more than four hours) for a first caller, who normally makes no calls in the evenings, may be normal activity for a second caller, who only makes calls in the evenings that average between five and six hours. Thus, it is desirable to have a mechanism that can establish a personalized threshold or baseline that differs among callers, thereby accommodating different callers, who inevitably have different calling behaviors and patterns. Such a mechanism could then determine what is abnormal calling activity as measured against a baseline of that caller's previous calling behavior.
[0007] Second, the prior art approaches consume much time. Because of the time needed by the consultants to perform the analysis and generate the report, the impostor or perpetrator of telephone fraud will more than likely have moved on to a different telephone service provider or to a new telephone number by the time any fraud has been detected. In addition, there will always be six months to a year or more of unrecoverable profits lost to fraudulent behavior before that behavior is detected, if at all. It is desirable to have a mechanism that reduces the time needed between the fraudulent activity and the detection thereof.
[0008] Third, the prior art methods are also poor at handling the volume of calls. Even if more consultants were hired, and these consultants worked around the clock, they would be unable to handle the sheer volume of calls that are continuously generated. The volume of call data is in the order of millions of call records per day for a particular local geographic area. It is desirable to have a mechanism that can incrementally update an existing profile to reflect information from the new call records.
[0009] Furthermore, the prior art methods are limited to analyzing past calling records and are unable to provide up-to-date reports that reflect current call records and trends. In this regard, it is desirable to develop a system that is scalable (i.e., that can automatically process new records on a periodic basis and generate reports that reflect new information provided by the new records).
[0010] Fourth, these prior art methods use volume data, which is difficult to compare across different time periods. For example, the number of calls made in a single month (e.g., January) cannot be compared to the total number of calls made for an entire year (e.g., 1999). Similarly, a weekly measure of the number of calls made by a particular caller cannot be compared to a monthly measure of the number of calls made by the same caller. In the example given above, suppose the consultant studies the past six months of call records and determines that any caller who makes more than 100 calls for a duration of more than 24 hours in six months is likely to be fraudulent. This information is not useful for determining whether a caller is perpetrating telephone fraud over a time frame different from six months. It is desirable instead to have a mechanism that generates values that can be compared easily across different time periods.
[0011] Accordingly, there remains a need for a method for generating and using caller profiles to detect telecommunication fraud that overcomes the disadvantages set forth previously.
[0012] The present invention discloses an OLAP-based method and system for profiling customer behavior. In one embodiment, the present invention is applied to telecommunication fraud detection and involves processing call records. In this embodiment, the following steps are performed. First, call records are received. Next, a calling profile cube (e.g., a multi-customer profile cube) is generated based on the call records. A volume-based calling pattern cube (e.g., a calling pattern cube for each individual customer) is then generated based on the multi-customer profile cube. The volume-based calling pattern cube is then compared with known fraudulent volume-based calling patterns. If the similarity generated by the comparison reaches or exceeds a predetermined threshold, then the particular caller with the calling pattern being analyzed is considered suspicious. In this manner, suspicious calling activity can be detected, and appropriate remedial actions, such as further investigation or the cancellation of telephone services, can be taken.
[0013] In an alternative embodiment, after the volume-based calling pattern cube (e.g., a calling pattern cube for each individual customer) has been generated, a probability-based calling pattern cube is generated based on the volume-based calling pattern cube. The probability-based calling pattern is then compared with known probability-based fraudulent patterns. If the similarity generated by the comparison reaches or exceeds a predetermined threshold, then the particular caller with the calling pattern being analyzed is considered suspicious. One advantage of the alternative embodiment over the first embodiment described above is that two patterns that cover different time periods can be compared and analyzed.
[0014]
[0015]
[0016]
[0017]
[0018] The subject invention will be described with reference to numerous details set forth below, and the accompanying drawings will illustrate the invention. The following description and the drawings are illustrative of the invention and are not to be construed as limiting the invention. Numerous specific details are described to provide a thorough understanding of the present invention. However, in certain instances, well-known or conventional details are not described in order not to obscure unnecessarily the present invention in detail. In the drawings, the same element is labeled with the same reference numeral.
[0019] One aspect of the present invention is the use of an OLAP-based profile engine as a scalable computation engine to compute, maintain and utilize customer behavior profiles. In other words, the present invention provides an OLAP-based system and method for customer behavior profiling and pattern analysis that powerfully extends the limited capabilities of traditional OLAP tools that were generally directed only to query and analysis of data.
[0020] Another aspect of the present invention is to generate personalized or group-based thresholds that are more precise and useful than generalized thresholds of the prior art. For example, by generating personalized calling behavior profiles, the present invention can determine that calls by John for four hours are considered usual, but calls by Jane for two hours are considered unusual.
[0021] Yet another aspect of the present invention is the use of an OLAP-based method and system to detect telephone fraud by comparing a known fraudulent profile to a customer profile. For example, in one embodiment, the present invention profiles each new customer's calling behavior and compares these profiles against known fraudulent profiles to detect fraud.
[0022] According to yet another aspect of the present invention, profiles and calling patterns are represented as multi-level and multidimensional cubes.
[0023] In one embodiment of the present invention, profiles and calling patterns are based on the probability distribution of call volumes. The present invention can utilize an OLAP-based profile engine to compute these probability distributions.
[0024] The architecture of a data processing system configured in accordance with one embodiment of the present invention is illustrated in
[0025] OLAP-Based Data Processing System
[0026] One substantial challenge for the prior art approaches to caller fraud detection is how to process the sheer volume of call data in order to generate caller profiles and update them. In order to create and update customer behavior profiles, hundreds of millions of call records must be processed every day.
[0027] The present invention overcomes this challenge by providing an OLAP-based architecture or framework that is both scalable and maintainable to support customer behavior profiling. One application of the OLAP-based architecture of the present invention is caller behavior profiling for telecommunication fraud detection.
[0028]
[0029] First, profiling engine
[0030] The data warehouse
[0031] In one embodiment, the data warehouse
[0032]
[0033] The profile engine
[0034] The profile builder and update module (PBUM)
[0035] The behavior pattern generation module (BPGM)
[0036] Below is the Oracle-8 schema of the profile table called “Profile”, where “pc” is the number of calls dimensioned by other attributes.
-- Oracle8 table schema
CREATE TABLE Profile (
    caller   VARCHAR2(10) NOT NULL,
    callee   VARCHAR2(10) NOT NULL,
    duration CHAR(1) NOT NULL,
    time     CHAR(1) NOT NULL,
    dow      CHAR(1) NOT NULL,
    pc       INTEGER
) STORAGE ...;
[0037] The corresponding profile cube (PC)
// Oracle Express cube definition
define PC variable int <sparse <duration time dow callee caller>> inplace
where dow stands for day_of_week (e.g., Monday, ..., Sunday).
[0038] It can be seen that the attributes of the profile table
[0039] Referring to
[0040] In step
[0041] In step
[0042] Data Management
[0043] The data management module
[0044] In order to reduce data redundancy and query cost, it is preferable for the present invention to maintain minimal data in the profile tables
[0045] Calling Cubes
[0046] In one embodiment, the present invention generates and uses two types of calling cubes: (1) multi-customer based profile cubes (e.g., updated profile cube
[0047] Profile Cubes
[0048] A profile cube (e.g., profile cube
define PC variable int <sparse <duration time dow callee caller>> inplace
define PCS variable int <sparse <duration time dow callee caller>> inplace
[0049] where callee is the telephone number of the person being called; caller is the telephone number of the person placing the call; dimension time has values representing time-bins (e.g., ‘morning’, ‘afternoon’, and ‘evening’); dimension duration has values representing duration-bins (e.g., ‘short’, ‘medium’, and ‘long’); and dimension dow has values representing days of week (e.g., ‘MON’, . . . ‘SUN’).
[0050] It is noted that the use of keyword “sparse” in the above definitions instructs Oracle Express to create a composite dimension <duration time dow callee caller>, in order to handle sparseness, particularly between calling and called numbers, in an efficient way. A composite dimension is a list of dimension-value combinations. A combination is an index into one or more sparse data cubes. The present invention uses a composite dimension to store sparse data in a compact form similar to relation tuples.
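The compact, tuple-like storage described above can be illustrated outside of Oracle Express. The following Python sketch is not the Oracle Express implementation; the class name SparseCube, its methods, and the telephone numbers are hypothetical and chosen only for illustration. It stores one entry per dimension-value combination that actually occurs, rather than one cell per possible combination.

```python
class SparseCube:
    """Stores only non-empty cells, keyed by a composite dimension tuple.

    Hypothetical illustration; not an Oracle Express API.
    """

    def __init__(self, dims):
        self.dims = tuple(dims)
        self.cells = {}  # composite key -> call count

    def _key(self, coords):
        # Build the composite key in the declared dimension order.
        return tuple(coords[d] for d in self.dims)

    def add(self, count=1, **coords):
        key = self._key(coords)
        self.cells[key] = self.cells.get(key, 0) + count

    def get(self, **coords):
        # None plays the role of NA for combinations that never occurred.
        return self.cells.get(self._key(coords))


pc = SparseCube(["duration", "time", "dow", "callee", "caller"])
pc.add(duration="short", time="morning", dow="MON",
       callee="5550002", caller="5550001")
pc.add(duration="short", time="morning", dow="MON",
       callee="5550002", caller="5550001")
pc.add(duration="long", time="evening", dow="SAT",
       callee="5550003", caller="5550001")

stored_cells = len(pc.cells)
monday_calls = pc.get(duration="short", time="morning", dow="MON",
                      callee="5550002", caller="5550001")
```

With three call records that fall into two distinct combinations, only two cells are stored, whereas a dense cube over the same dimensions would reserve a cell for every pairing of calling and called numbers.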
[0051] The PBUM
[0052] The PBUM
[0053] In this manner, the PBUM
[0054] Hierarchical Dimensions for Multilevel Pattern Representation
[0055] A hierarchical dimension D contains values at different levels of abstraction. The following is associated with the hierarchical dimension D: 1) dimension DL that describes the levels of the hierarchical dimension D; 2) a relation DL_D that maps each value of the hierarchical dimension D to the appropriate level; and 3) a relation D_D that maps each value of the hierarchical dimension D to its parent value (i.e., the value at the immediate upper level). Let D be an underlying dimension of a numerical cube C, such as a volume-based calling pattern cube. D, together with DL, DL_D and D_D, fully specify a dimension hierarchy. They provide sufficient information to rollup cube C along dimension D, (i.e., to calculate the total of cube data at the upper levels using the corresponding lower-level data). As can be appreciated, the cube C can be rolled up along multiple underlying dimensions.
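As a minimal sketch of how the objects D, DL, DL_D and D_D suffice to roll up a cube, the following Python fragment rolls up a one-dimensional cube along a day-of-week hierarchy. The function name rollup and the dictionary representation are assumptions made for illustration, and the assignment of Tuesday through Friday to 'wkday' and of Sunday to 'wkend' is assumed where the description elides those values.

```python
# D_D: maps each value to its parent value (None at the top level).
DOW_PARENT = {
    "MON": "wkday", "TUE": "wkday", "WED": "wkday", "THU": "wkday",
    "FRI": "wkday", "SAT": "wkend", "SUN": "wkend",
    "wkday": "week", "wkend": "week", "week": None,
}

# DL_D: maps each value to its level.
DOW_LEVEL = {day: "dd" for day in
             ("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")}
DOW_LEVEL.update({"wkday": "ww", "wkend": "ww", "week": "week"})

# DL: the levels, ordered from bottom to top.
DOW_LEVELS = ("dd", "ww", "week")


def rollup(cube, parent, level_of, levels):
    """Compute upper-level totals from bottom-level cell values."""
    rolled = {v: n for v, n in cube.items() if level_of[v] == levels[0]}
    for level in levels[:-1]:
        # Snapshot the items so cells added for the next level are not
        # revisited during this pass.
        for value, n in list(rolled.items()):
            if level_of[value] == level:
                up = parent[value]
                rolled[up] = rolled.get(up, 0) + n
    return rolled


calls_per_day = {"MON": 3, "TUE": 1, "SAT": 4}
rolled = rollup(calls_per_day, DOW_PARENT, DOW_LEVEL, DOW_LEVELS)
```

Here the bottom-level counts are retained, and the totals for 'wkday', 'wkend' and 'week' are derived from them level by level.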
[0056] The BPGM
[0057] Dow Hierarchical Dimension
[0058] In accordance with one embodiment of the present invention, the Day of Week (dow) hierarchy includes the following objects:
[0059] dow(day of week): dimension with values ‘MON’, . . . ‘SUN’ at the lowest level (dd level), ‘wkday’, ‘wkend’ at a higher level (ww level), and ‘week’ at the top level (‘week’ level);
[0060] dowLevel: dimension with values ‘dd’, ‘ww’, ‘week’;
[0061] dow_dow: relation (dow, dow) for mapping each value to its parent value, e.g.,
[0062] dow_dow(dow ‘MON’)=‘wkday’
[0063] . . .
[0064] dow_dow(dow ‘SAT’)=‘wkend’
[0065] dow_dow(dow ‘wkday’)=‘week’
[0066] dow_dow(dow ‘wkend’)=‘week’
[0067] dow_dow(dow ‘week’)=NA;
[0068] dowLevel_dow: relation (dow, dowLevel) for mapping each value to its level, e.g.,
[0069] dowLevel_dow(dow ‘MON’)=‘dd’
[0070] . . .
[0071] dowLevel_dow(dow ‘wkday’)=‘ww’
[0072] dowLevel_dow(dow ‘wkend’)=‘ww’
[0073] dowLevel_dow(dow ‘week’)=‘week’.
[0074] Time Hierarchical Dimension
[0075] In accordance with one embodiment of the present invention, the time hierarchy includes the following objects:
[0076] time: dimension with values ‘night’, ‘morning’, ‘afternoon’ and ‘evening’ at ‘time_bin’ level (bottom-level), and ‘allday’ at the ‘time_all’ level (top-level);
[0077] timeLevel: dimension with values ‘time_bin’ and ‘time_all’;
[0078] time_time: relation (time, time) for mapping each value to its parent value, e.g.,
[0079] time_time(time ‘morning’)=‘allday’
[0080] . . .
[0081] time_time(time ‘allday’)=NA;
[0082] timeLevel_time: relation (time, timeLevel) for mapping each value to its level, e.g.,
[0083] timeLevel_time(time ‘morning’)=‘time_bin’
[0084] timeLevel_time(time ‘allday’)=‘time_all’.
[0085] Duration Hierarchical Dimension
[0086] In accordance with one embodiment of the present invention, the duration hierarchy includes the following objects.
[0087] duration: dimension with values ‘short’, ‘medium’, ‘long’ at the ‘dur_bin’ level (bottom-level), and ‘all’ at the ‘dur_all’ level (top-level);
[0088] durLevel: dimension with values ‘dur_bin’ and ‘dur_all’
[0089] dur_dur: relation (duration, duration) for mapping each value to its parent value, e.g.,
[0090] dur_dur(duration ‘short’)=‘all’
[0091] . . .
[0092] dur_dur(duration ‘all’)=NA
[0093] durLevel_dur: relation (duration, durLevel) for mapping each value to its level, e.g.,
[0094] durLevel_dur(duration ‘short’)=‘dur_bin’
[0095] . . .
[0096] durLevel_dur(duration ‘all’)=‘dur_all’
[0097] When the present invention performs profile storage, combination and updating, only the bottom levels are involved. Thus, rolling up profile cubes, such as PC, is unnecessary. It is noted that the present invention applies the roll up operation to calling pattern cubes for analysis purposes.
[0098] Volume Based Calling Patterns
[0099] In the preferred embodiment of the present invention, a calling pattern cube is associated with a single customer for representing the individual calling behavior of that customer. Since the calling behavior of a customer can be viewed from different aspects, the present invention can define different kinds of calling pattern cubes. These cubes are commonly dimensioned by time, duration and dow (day of week). Cubes that are related to outgoing calls are commonly dimensioned by callee, and cubes that are related to incoming calls are commonly dimensioned by caller. The cell values of these cubes represent the number of calls, the probability distributions, etc. Calling pattern cubes, several examples of which are described below, are derived from profile cubes and then rolled up.
[0100] Cube CB.o represents the outgoing calling behavior of a customer. In Oracle Express, it is defined by the following:
define CB.o variable int <sparse <duration time dow callee>> inplace
[0101] Similarly, cube CB.d, representing incoming calling behavior, is defined by the following:
define CB.d variable int <sparse <duration time dow caller>> inplace
[0102] The cell values of these cubes are the number of calls falling into the given ‘slot’ of time, duration, day of week, etc. When generated, CB.o and CB.d are rolled up along dimensions duration, time and dow. Therefore, CB.o(duration ‘short’, time ‘morning’, dow ‘MON’) measures the number of short-duration calls this customer made to each callee (dimensioned by callee) on Monday mornings during the profiling interval. Similarly, CB.o(duration ‘all’, time ‘allday’, dow ‘week’) measures the number of calls this customer made to each callee (total calls dimensioned by callee) during the profiling interval.
[0103] A Method for Deriving a Calling Pattern Cube
[0104] An exemplary method that can be utilized by the BPGM
[0105] Cube PC is pre-populated using the data retrieved from database table Profile and possibly combined with cube PCS that is generated from loading call data. With the following algorithm, the calling pattern cube, CB.o, is populated for a given customer as specified by parameter customer_callID.
define genCB (customer_callID text)
{
  - if customer_callID is not a value of caller then return
  - remove old cells of CB.o by
      limit dimensions duration, time, dow and callee to all their values
      CB.o = NA
  - limit dimensions duration, time, dow to their bottom level values
  - limit dimension caller to the given customer by
      limit caller to customer_callID
  - limit dimension callee to those being called by the given customer, as
      limit callee to any(PC > 0, callee)
  - form a subcube of PC by selecting only the data related to the given customer (the current value of caller dimension), then transfer (unravel) its cell values to cube CB.o, as
      CB.o = unravel(total(PC, duration time dow callee))
  - rollup CB.o by
      limit duration, time, dow to all their values
      rollup CB.o over duration using dur_dur
      rollup CB.o over time using time_time
      rollup CB.o over dow using dow_dow
}
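The effect of the above algorithm can be sketched in Python under stated assumptions. This is not a translation of the Oracle Express statements: the function names gen_cb_o and ancestors, the dictionary representation of the cubes, the placeholder telephone identifiers, and the hierarchy entries not enumerated in this description are all hypothetical. Instead of performing three successive rollup operations, the sketch adds each bottom-level cell of the selected customer to every combination of its ancestors, which produces the same rolled-up totals.

```python
from itertools import product

# Parent relations; entries elided in the description are assumed.
DUR_PARENT = {"short": "all", "medium": "all", "long": "all", "all": None}
TIME_PARENT = {"night": "allday", "morning": "allday",
               "afternoon": "allday", "evening": "allday", "allday": None}
DOW_PARENT = {"MON": "wkday", "TUE": "wkday", "WED": "wkday",
              "THU": "wkday", "FRI": "wkday", "SAT": "wkend",
              "SUN": "wkend", "wkday": "week", "wkend": "week",
              "week": None}


def ancestors(value, parent):
    """Return the value followed by all of its ancestors, bottom to top."""
    chain = [value]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def gen_cb_o(pc, customer):
    """Derive a rolled-up outgoing calling pattern cube for one customer.

    pc maps (duration, time, dow, callee, caller) to a bottom-level call
    count. The result maps (duration, time, dow, callee) to a count at
    every level; it is empty if the customer never appears as a caller.
    """
    cb = {}
    for (dur, time, dow, callee, caller), count in pc.items():
        if caller != customer or count <= 0:
            continue
        for d, t, w in product(ancestors(dur, DUR_PARENT),
                               ancestors(time, TIME_PARENT),
                               ancestors(dow, DOW_PARENT)):
            key = (d, t, w, callee)
            cb[key] = cb.get(key, 0) + count
    return cb


pc = {
    ("short", "morning", "MON", "B", "A"): 2,
    ("long", "evening", "SAT", "B", "A"): 1,
    ("short", "morning", "MON", "C", "A"): 4,
    ("medium", "afternoon", "TUE", "A", "B"): 5,
}
cb_o = gen_cb_o(pc, "A")
```

In this example, the cell (‘all’, ‘allday’, ‘week’, ‘B’) holds the three calls that customer A made to B, and the record in which A is the callee does not contribute to the outgoing pattern.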
[0106] Behavior Profiling with Probability Distribution
[0107] For customer behavior profiling, the present invention first specifies which features (i.e., dimensions) are relevant. In one embodiment, in connection with calling behavior profiling, the present invention utilizes the following features for a customer's outgoing and incoming calls: the phone-numbers, volume (i.e., the number of calls), duration of the call, time of day the call was made, and day of week the call was made. Second, the present invention also specifies the granularity of each feature. For example, the time of day feature can be represented by the time-bins ‘morning’, ‘afternoon’, ‘evening’ or ‘night’. Similarly, the duration feature can be represented by duration bins, such as ‘short’, ‘medium’, and ‘long.’ Each bin can be defined and set to predetermined values. In one embodiment, all calls that have a duration shorter than 20 minutes are placed into the ‘short’ bin. Also, all calls that have a duration between 20 minutes and 60 minutes are placed into the ‘medium’ bin, and all calls that have a duration longer than 60 minutes are placed into the ‘long’ bin. Third, the present invention specifies a profiling interval, which in a non-limiting example can be 3 months, and the periodicity of the profiles, which in a non-limiting example can be weekly. The profiling interval is that time interval over which the customer profiles are constructed, and the periodicity of the profiles is how often the customer profile is summarized. In this example, the customer's profile is a weekly summarization of his calling behavior during the profiling interval.
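The assignment of a call record to feature bins can be sketched as follows. The duration boundaries follow the embodiment described above; the hour boundaries of the time-of-day bins are not given in this description and are assumed here purely for illustration, as are the function names and the record layout.

```python
from datetime import datetime

DOW = ("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")


def duration_bin(minutes):
    """Bin boundaries from the embodiment: under 20, 20 to 60, over 60."""
    if minutes < 20:
        return "short"
    if minutes <= 60:
        return "medium"
    return "long"


def time_bin(hour):
    """Assumed boundaries; the description does not define them."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


def to_cell(caller, callee, start, minutes):
    """Map one call record to a bottom-level profile cube coordinate."""
    return (duration_bin(minutes), time_bin(start.hour),
            DOW[start.weekday()], callee, caller)


# A 15-minute call placed at 9:30 AM on Friday, Mar. 10, 2000.
cell = to_cell("5550001", "5550002", datetime(2000, 3, 10, 9, 30), 15)
```

Each call record then increments the count of exactly one bottom-level cell of the profile cube.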
[0108] Based on the profiled information, the present invention derives calling patterns of individual customers. The present invention can generate the following three kinds of calling patterns. The first type of calling pattern is a fixed-value based calling pattern. A fixed-value based calling pattern represents a customer's calling behavior with fixed values showing his “average” behavior. TABLE 1 illustrates a profile with simple, fixed values. This profile describes the calling pattern from a first telephone number to a second telephone number during “morning”, “afternoon”, and “evening” periods. On average, calls are of a medium duration during the morning, of a short duration during the afternoon, and of a long duration during the evenings.
TABLE 1
Morning   Afternoon   Evening
Medium    Short       Long
[0109] The second type of calling pattern is a volume-value based calling pattern. A volume based calling pattern summarizes a customer's calling behavior by counting the number of calls of different duration in different time-bins. Referring to
[0110] The third type of calling pattern is a probability distribution based calling pattern. A probability distribution based calling pattern represents a customer's calling behavior with probability distributions. TABLE 2 illustrates a profile with probability distribution values. Specifically, the profile describes the calling pattern or behavior from a first telephone number to a second telephone number in terms of probability values. For example, in the mornings, 10% of the calls were long, 20% of the calls were medium, and 70% of the calls were short. In the afternoons, 30% of the calls were long, 40% of the calls were medium, and 30% of the calls were short. In the evenings, 30% of the calls were long, 50% of the calls were medium, and 20% of the calls were short.
TABLE 2
          Morning   Afternoon   Evening
Short     0.7       0.3         0.2
Medium    0.2       0.4         0.5
Long      0.1       0.3         0.3
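The relationship between a volume-based pattern and the probability values of TABLE 2 can be sketched as follows. The call counts below are hypothetical and were chosen so that normalizing each time-bin reproduces TABLE 2; the function name to_distribution is likewise an assumption.

```python
# Hypothetical call counts per time-bin and duration-bin.
counts = {
    "morning": {"short": 7, "medium": 2, "long": 1},
    "afternoon": {"short": 3, "medium": 4, "long": 3},
    "evening": {"short": 2, "medium": 5, "long": 3},
}


def to_distribution(volume):
    """Divide each count by the total for its time-bin."""
    dist = {}
    for time_bin, by_duration in volume.items():
        total = sum(by_duration.values())
        if total == 0:
            continue  # no calls in this time-bin, so no distribution
        dist[time_bin] = {d: n / total for d, n in by_duration.items()}
    return dist


table2 = to_distribution(counts)
```

Because every column is divided by its own total, each time-bin's probabilities sum to one regardless of how many calls were observed.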
[0111] The VPCM
[0112] Computing Probability Distribution Based Calling Patterns using OLAP
[0113] The present invention represents profiles and calling patterns as cubes. A cube has a set of underlying dimensions, and each cell of the cube is identified by one value from each of these dimensions. The set of values of a dimension D, called the domain of D, may be limited (by the OLAP limit operation) to a subset. A sub-cube (slice or dice) can be derived from a cube C by dimensioning C by a subset of its dimensions, and/or by limiting the value sets of these dimensions.
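Deriving a sub-cube by limiting dimension values can be sketched in Python as follows. This is only an analogy to the OLAP limit operation, not its actual behavior in any product; the function name, the keyword-argument interface, and the three-dimensional example cube are assumptions.

```python
DIMS = ("duration", "time", "dow")


def limit(cube, **allowed):
    """Keep cells whose value on each named dimension is in the given set."""
    index = {name: i for i, name in enumerate(DIMS)}
    return {
        key: value
        for key, value in cube.items()
        if all(key[index[name]] in values
               for name, values in allowed.items())
    }


cube = {
    ("short", "morning", "MON"): 2,
    ("long", "evening", "SAT"): 1,
    ("short", "evening", "MON"): 3,
}
monday = limit(cube, dow={"MON"})
monday_evening = limit(cube, dow={"MON"}, time={"evening"})
```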
[0114] As mentioned above, the profile of a customer can be a weekly summarization of his activities in the profiling period. In the preferred embodiment of the present invention, the information for profiling multiple customers' calling behavior is grouped into a single profile cube with dimensions <duration, time, dow, callee, caller>, where dow stands for day_of_week (e.g., Monday, . . . , Sunday), and callee and caller are the called and calling phone numbers, respectively. The value of a cell in a profile cube measures the volume (i.e., the number of calls) made in the corresponding duration-bin, time-bin in a day, and day of week during the profiling period. In this way, a profile cube records multiple customers' outgoing and incoming calls week by week. From such a multi-customer profile cube, the present invention derives or generates calling pattern cubes of individual customers. The calling pattern cubes of individual customers have similar dimensions to the profile cubes except that a calling pattern cube for outgoing calls is not dimensioned by caller, and a calling pattern cube for incoming calls is not dimensioned by callee because they pertain to a single customer.
[0115] The size of each profile cube may be controlled by partitioning the customers represented in a profile cube by area and by limiting the profiling period. The present invention can generate multiple calling pattern cubes to represent a customer's calling behavior from different aspects. For example, some calling pattern cubes representing probability-based information can be derived from intermediate calling pattern cubes representing volume-based information.
[0116]
[0117] Based on the volume cube
[0118] C
[0119] C
[0120] C
[0121] It is noted that all the above probability cubes, C
[0122] In the above expressions, total is an OLAP operation on cubes with numerical cell values. While total(V) returns the total of the cell values of V, total(V, callee) returns such a total dimensioned by callee, total(V, time, callee) returns such a total dimensioned by time and callee. In fact, a dimensioned total represents a cube. The arithmetic operations on cubes, such as ‘/’ used above, are computed cell-wise.
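The behavior of a dimensioned total and of a cell-wise division can be sketched in Python as follows. The functions total and divide, the dictionary representation, and the example cube V are assumptions for illustration and are not Oracle Express operations. The example cube holds bottom-level cells only, so that no call is counted twice.

```python
DIMS = ("duration", "time", "callee")


def total(cube, *keep):
    """Sum all cells, or return totals dimensioned by the named dimensions."""
    if not keep:
        return sum(cube.values())
    idx = [DIMS.index(name) for name in keep]
    out = {}
    for key, value in cube.items():
        sub = tuple(key[i] for i in idx)
        out[sub] = out.get(sub, 0) + value
    return out


def divide(cube, totals, *keep):
    """Cell-wise division of a cube by a total dimensioned by keep."""
    idx = [DIMS.index(name) for name in keep]
    return {key: value / totals[tuple(key[i] for i in idx)]
            for key, value in cube.items()}


V = {
    ("short", "morning", "B"): 3,
    ("long", "evening", "B"): 1,
    ("short", "morning", "C"): 4,
}
grand_total = total(V)
by_callee = total(V, "callee")
by_time_callee = total(V, "time", "callee")
share_per_callee = divide(V, by_callee, "callee")
```

The dimensioned totals are themselves cubes, which is why they can serve as the divisor of a cell-wise operation.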
[0123] In view of the foregoing, the data management module
[0124] Cubes Representing Probability Distribution Based Calling Patterns
[0125] The Volume-based to Probability-based Conversion Module
[0126] Probability Distribution on All Calls
[0127] Cube P_CB.o for a customer represents the dimensioned probability distribution of outgoing calls over all the outgoing calls made by this customer, and can be derived from CB.o by the following:
define P_CB.o formula decimal <duration time dow callee>
EQ (CB.o/total(CB.o(duration ‘all’, time ‘allday’, dow ‘week’)))
[0128] where total(CB.o(duration ‘all’, time ‘allday’, dow ‘week’)) is the total number of calls this customer made to all callees. Since CB.o has already been rolled up, its top-level value can be utilized. The value of a cell is the above probability corresponding to the underlying dimension values.
[0129] Probability Distribution on Calls to Each Callee
[0130] Cube P1_CB.o is dimensioned by duration, . . . and callee, and represents the probability distribution of a customer's outgoing calls over his total calls to the corresponding callee, and is also derived from CB.o as specified by the following:
[0131] define P1_CB.o formula decimal <duration time dow callee>
[0132] EQ (CB.o/total(CB.o(duration ‘all’, time ‘allday’, dow ‘week’), callee))
[0133] where total(CB.o(duration ‘all’, time ‘allday’, dow ‘week’), callee) is the total number of calls this customer made to each callee (dimensioned by callee). The value of a cell is the above probability corresponding to the underlying dimension values. Calling pattern cubes for incoming calls can be defined similarly.
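The difference between the two probability cubes can be sketched in Python under stated assumptions. The function names, the dictionary representation, and the abbreviated example cube (which lists only a few cells of a rolled-up CB.o and assumes the top-level cell for every callee is present) are hypothetical.

```python
TOP = ("all", "allday", "week")  # top-level duration, time, and dow values


def per_callee_totals(cb_o):
    """Read each callee's total from the top-level cells of a rolled-up cube."""
    return {key[3]: n for key, n in cb_o.items() if key[:3] == TOP}


def p_cb_o(cb_o):
    """Probability over all outgoing calls made by the customer."""
    grand = sum(per_callee_totals(cb_o).values())
    return {key: n / grand for key, n in cb_o.items()}


def p1_cb_o(cb_o):
    """Probability over the customer's calls to the corresponding callee."""
    totals = per_callee_totals(cb_o)
    return {key: n / totals[key[3]] for key, n in cb_o.items()}


# Abbreviated rolled-up cube: (duration, time, dow, callee) -> calls.
cb_o = {
    ("short", "morning", "MON", "B"): 2,
    ("all", "allday", "week", "B"): 3,
    ("short", "morning", "MON", "C"): 1,
    ("all", "allday", "week", "C"): 1,
}
p = p_cb_o(cb_o)
p1 = p1_cb_o(cb_o)
```

In this example, short Monday-morning calls to B account for half of all of the customer's calls but for two thirds of the calls to B.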
[0134] Calling Pattern Similarity Comparison
[0135] The behavior pattern comparison module
[0136] It is noted that similarity of volume-based calling patterns is meaningful only when the patterns cover the same time span. In this regard, the present invention preferably measures the similarity of probability-based calling patterns so that patterns that cover different time spans can be compared meaningfully. For example, the present invention can be utilized to compare a predetermined calling pattern (e.g., a known fraudulent pattern) with an ongoing pattern in real-time.
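The point that probability-based patterns remain comparable across different time spans can be illustrated with a short Python sketch. The similarity measure used here (one minus the total variation distance), the threshold value, and the call counts are assumptions chosen for illustration and should not be read as the measure used by the behavior pattern comparison module.

```python
def normalize(volume):
    """Convert call counts into a probability distribution."""
    total = sum(volume.values())
    return {k: v / total for k, v in volume.items()}


def similarity(p, q):
    """Hypothetical measure: 1.0 for identical patterns, 0.0 for disjoint."""
    keys = set(p) | set(q)
    return 1.0 - 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


THRESHOLD = 0.9  # assumed value of the predetermined threshold

# Same behavior observed over four weeks and over one week (hypothetical).
known_fraud_4_weeks = {("long", "evening"): 40,
                       ("short", "morning"): 8,
                       ("medium", "afternoon"): 32}
customer_1_week = {("long", "evening"): 10,
                   ("short", "morning"): 2,
                   ("medium", "afternoon"): 8}

score = similarity(normalize(known_fraud_4_weeks),
                   normalize(customer_1_week))
suspicious = score >= THRESHOLD
```

Although the raw volumes differ by a factor of four because of the different time spans, the normalized patterns coincide, so the customer's pattern reaches the threshold and would be flagged as suspicious.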
[0137] The foregoing description has provided examples of the present invention. One example has been directed to telecommunication fraud. It will be appreciated that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. For example, the method of generating, updating, and comparing the customer profiles of the present invention can be applied to other areas, such as targeted marketing, targeted promotions, and general fraud detection. In applications where there is a very large collection of transaction data, the present invention can be utilized to generate customer behavior profiles, extract patterns of the activities of the customer, and provide guidelines as to how to meet or otherwise service the needs of the customers.