[0001] This application claims priority under 35 U.S.C. §119(e) from provisional application No. 60/298,622 filed Jun. 15, 2001. The provisional application No. 60/298,622 is incorporated by reference herein, in its entirety, for all purposes.
[0002] The invention relates generally to the processing of data, and more particularly to efficiently and generically aggregating data available on a communication network.
[0003] Recently, the collection and processing of data transmitted over communication networks, like the Internet, have moved to the forefront of business objectives. In fact, with the advent of the Internet, new revenue generating business models have been created to charge for the consumption of content received from a data network (i.e., content-based billing). For example, content distributors, application service providers (ASPs), Internet service providers (ISPs), and wireless Internet providers have realized new opportunities based on the value of the content that they deliver. As a result of this content-billing initiative, it has become increasingly important to intelligently collect and analyze content according to the business needs of the customer.
[0004] Unlike other data collection environments, communication networks like the Internet impose additional burdens on the collection and analysis process. For example, the Internet by its very nature is a network of unlimited data sources and correspondingly unlimited data types. As a result, the data collection and analysis process must be capable of understanding and processing the various types of data. Furthermore, the Internet communicates a vast quantity of data, only some of which may be needed to conduct the desired analysis. To simply store all of the data on the off chance that it may be used for subsequent processing would require a very large data store. Operating such a data store would result in undesirable processing time and wasted memory storage. Therefore, the data collection and analysis process must be capable of determining which of the data is desired, based on user criteria, and intelligently filter and classify the data (i.e., aggregate the data).
[0005] Currently, data aggregation is accomplished using various application specific (i.e., “non-generic”) methods. One method well known in the art, for example, performs aggregation by programming the appropriate filtering and classification techniques within the database operation itself. However, these “hard-coded” databases are limited to specific purposes only, for example, Web server databases. As a result, in the context of content collection and analysis, these “hard-coded” databases are too inflexible to efficiently satisfy the ever-changing face of a communication network like the Internet. For example, once the database is programmed to aggregate certain data, it must be re-programmed to accommodate the new data sources and corresponding new data items often introduced to the Internet.
[0006] These new data sources and new data items may provide information that is greatly desired by a particular organization or business group. Yet, because the required reprogramming necessary to collect and aggregate this new data is so time-consuming and labor-intensive, organizations often forego implementation and continue to use the stagnant “hard-coded” aggregation processes.
[0007] Therefore, there exists a need to provide a technique for allowing customers to create revenue models by recouping costs from network traffic, using scalable and flexible content analysis solutions. There also exists a need to provide a technique for aggregating data from a variety of different sources on the networks in a way that is capable of accommodating new data sources and new data types regularly added to such networks. The data aggregation process may be performed on data stored in non-persistent memory (e.g., RAM).
[0008] The invention describes a method, device and system for increasing the speed of processing data. The inventive method includes filtering the data, classifying the data, and generically applying logical functions to the data without data-specific instructions. Moreover, the steps of filtering, classifying and applying logical functions are based on predetermined criteria. The inventive method further includes storing the data in an in-memory database. The step of classifying may include adjusting the classification of data as a function of the quantity of data classified, and/or compounding classification categories as a function of a logical relation between the categories. The inventive method further may comprise creating data control objects and storing the data control objects outside of the in-memory database, and using pointers to avoid redundant data. The method may create one or more records that describe a transaction.
[0009] The invention further provides a system for collecting and analyzing the transfer of content between two systems on a communication network. The system includes a content collection layer, a transaction layer, and a settlement layer. The content collection layer may include an input data adapter for converting raw data from one or more data sources to sets of relevant attributes. The content collection layer further may include a content data language component for creating new attributes, and a correlator component for grouping data. The content collection layer further may include an aggregator component for filtering and/or classifying the attributes. The transaction layer may include a content detail record database for storing the classified and filtered attributes. The transaction layer further may include a transaction component for capturing predetermined agreements regarding the value of the transferred content among users of the system. The settlement layer may include a rating component for providing a significance (e.g., a price) to the transaction, so as to provide a tangible value to the transaction.
[0010] Other features of the invention are further apparent from the following detailed description of the embodiments of the invention taken in conjunction with the accompanying drawings, of which:
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018] System Overview
[0019]
[0020] In addition, it should be appreciated that the term “content” may be defined as data that is transmitted over the network. In the context of the Internet, content may include .mp3 files, hypertext markup language (html) pages, videoconferencing data, and streaming audio, for example. The terms “producer” and “customer” will be used throughout the description as well. Producer may refer to the primary creator or provider of the content, while customer is the primary recipient of the content. Both the producer and customer may be a human or a computer-based system.
[0021] As shown in
[0022] Content collection layer
[0023] Content collection layer
[0024] Transaction layer
[0025]
[0026] Input data adapter
[0027] Input data adapter
[0028] Content data language
[0029] Content data language
[0030] Furthermore, correlator
[0031] Although system
[0032] Filer
[0033] CDR database
[0034] Transaction component
[0035] Rating component
[0036]
[0037] In step
[0038] As shown in
[0039] Generic Aggregation
[0040] Aggregation is the process of filtering, classifying, and applying logical or mathematical functions to data, based on user criteria. The aggregation process may be accomplished both as the data is received in real-time and offline. The aggregation process may create one or more records that provide information sufficient to adequately describe a transaction or event. As discussed with reference to
[0041] Aggregation Terminology
[0042] Aggregation may apply to any of the “attributes” of the data. As discussed with reference to
[0043] Attributes may be defined by a name or label that identifies the attribute, a unique identifier number that distinguishes one attribute from another, and/or a designation that identifies a type of attribute. For example, one particular attribute may have a label “CONTENT_TYPE,” a unique identifier of “8,” and a type called “STRING” that identifies the attribute as a series of alphanumeric characters. The following is just one example of possible attributes:
TABLE 1
Attribute label     Identifier   Type
DOMAIN              1            APO_TYPE_STRING
HIT_BYTES           2            APO_TYPE_LONG_LONG
MISS_BYTES          3            APO_TYPE_LONG_LONG
TIME_STAMP          4            APO_TYPE_LONG
BYTES               5            APO_TYPE_LONG_LONG
URL                 6            APO_TYPE_STRING
DOMAIN              7            APO_TYPE_STRING
CONTENT_TYPE        8            APO_TYPE_STRING
HIT_FLAG            9            APO_TYPE_STRING
URL_EXTENSION       10           APO_TYPE_STRING
CONTENT_PROTOCOL    11           APO_TYPE_STRING
[0044] Notably, because attributes that are string type values may consume larger portions of memory, a single copy of each string value may exist in the database. If the same string is needed in other locations in the database, a pointer to the single copy may be used, instead of storing an additional copy of the string.
[0045] The classification portion of the aggregation process may be based on one or more “keys.” As is well known to those skilled in the art, a key corresponds to one or more categories in a database table that participates in unique identification of each row of the table. Every attribute that has a key may be represented by an object which is called the “data key” object. For example, if the source address attribute is a key for a particular aggregation, a corresponding data key will be created for this object that contains the object data.
[0046] An aggregation that has multiple attributes as keys may be represented in memory as a collection of “data keys,” where every data key corresponds to a distinct value of the first key attribute. Every data key in that collection points to the collection of data keys that keep the values for the second key attribute. In turn, every element of the second collection points to the collection of data keys that keep the values for the third key attribute, and so on. If a data key contains the value for a key that does not have any subkeys, this data key will be constructed without any pointers to collections. In the case where several aggregations are configured, common keys may be shared among the aggregations.
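By way of illustration only, the nested key structure described above may be sketched as follows. The class name DataKey, the function find_or_create, and the sample addresses are hypothetical and do not appear in the specification; Python object references stand in for the pointers, and a dictionary stands in for each collection.

```python
class DataKey:
    """One distinct value of a key attribute (hypothetical name)."""
    __slots__ = ("value", "subkeys")

    def __init__(self, value, has_subkeys):
        self.value = value
        # A key with no subkeys is built without any pointer to a collection.
        self.subkeys = {} if has_subkeys else None


def find_or_create(root, key_values):
    """Descend one collection per key attribute, creating data keys as needed."""
    collection = root
    node = None
    last = len(key_values) - 1
    for depth, value in enumerate(key_values):
        node = collection.get(value)
        if node is None:
            node = DataKey(value, has_subkeys=(depth != last))
            collection[value] = node
        collection = node.subkeys
    return node


# Keys: source address, then destination address, then protocol (sample values).
root = {}
a = find_or_create(root, ("10.0.0.1", "192.168.0.5", "http"))
b = find_or_create(root, ("10.0.0.1", "192.168.0.9", "http"))
again = find_or_create(root, ("10.0.0.1", "192.168.0.5", "http"))
```

In this sketch the two records share a single data key for the first attribute value, and only the second-level collection grows.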
[0047] Aggregating the data may be based on a set of key attributes and/or a set of counter attributes. Counter attributes are those attributes that are used to contain the current state of an aggregation. For a given set of keys, counters may be aggregated. The counter attributes may be the same as, or different than, the key attributes. For example, the “destination address” key attribute may be used both as a key and as a counter. In the latter case, a function such as LAST_SEEN_VALUE can be applied to a destination address, so that every time aggregation data is output, only the last seen value of destination address is output. Alternatively, “destination address” may be used as an aggregation key, while “cache hit bytes” may be used as a counter. In this instance, when the destination address appears in the cache the counter is updated (i.e., incremented or decremented).
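A minimal sketch of how counter attributes might be updated by named functions is given below. It assumes a lookup table keyed by function name; only SUM and LAST_SEEN_VALUE are named in this description, and the table AGGR_FUNCTIONS, the function update_counters, and the attribute label DESTINATION_ADDRESS are hypothetical.

```python
# Hypothetical table mapping a function name to a rule that merges the
# existing counter value with a newly submitted value.
AGGR_FUNCTIONS = {
    "SUM": lambda old, new: new if old is None else old + new,
    "LAST_SEEN_VALUE": lambda old, new: new,
}


def update_counters(counters, counter_config, record):
    """Apply each counter's configured function to the incoming record."""
    for attribute, function_name in counter_config.items():
        combine = AGGR_FUNCTIONS[function_name]
        counters[attribute] = combine(counters.get(attribute), record[attribute])
    return counters


counter_config = {"HIT_BYTES": "SUM", "DESTINATION_ADDRESS": "LAST_SEEN_VALUE"}
counters = {}
update_counters(counters, counter_config,
                {"HIT_BYTES": 100, "DESTINATION_ADDRESS": "10.0.0.7"})
update_counters(counters, counter_config,
                {"HIT_BYTES": 250, "DESTINATION_ADDRESS": "10.0.0.9"})
```

Because the update rule is looked up by name, the same loop serves any counter without attribute-specific instructions.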
[0048] The following is an example of an aggregation configuration file that helps further define the terms used in the aggregation process of the invention:
TABLE 2
<Aggregation>
    AGGREGATION_NAME CacheCustomer
    AGGREGATE_BY_TIME_INTERVAL yes
    #SUMMARIZE no
    <Keys>
        <Attribute>
            ATTRIBUTE_NAME NCP_ACCOUNT_NO
            ALIAS_NAME CustomerAccount
        </Attribute>
    </Keys>
    <Counters>
        <Attribute>
            ATTRIBUTE_NAME HIT_BYTES
            ALIAS_NAME HitBytes
            AGGR_FUNCTION_NAME SUM
        </Attribute>
    </Counters>
</Aggregation>
[0049] The “<Aggregation>” indicates that what follows is an aggregation. The “AGGREGATION_NAME” specifies the name or label for the particular aggregation. In the above example, the Aggregation name is “CacheCustomer.” If aggregation is to be output to CDR database
[0050] The “<Keys>” denotes the beginning of the section that describes the attributes that serve as keys to the aggregation. The “<Attribute>” denotes the beginning of the aggregation attribute description. The “ATTRIBUTE_NAME” is the name or label of the attribute, as described with reference to Table 1. In the above example, the “ATTRIBUTE_NAME” is “NCP_ACCOUNT_NO.” The “ALIAS_NAME” is the alternative name of the attribute. The “ALIAS_NAME” must coincide with the column name of CDR database
[0051] As discussed above, certain attributes may be used as counters to keep track of certain operations. The “<Counters>” denotes the beginning of the descriptions of those attributes that serve as counters. Therefore, in the example above, the attribute known as “HitBytes” will serve as the first counter. Also, “AGGR_FUNCTION_NAME” is the name of the function to invoke on an existing “HitBytes” data value and a new “HitBytes” data value when new data is submitted to the aggregation. In the above example, “SUM” indicates that the existing and new “HitBytes” values will be added. The “</Counters>” denotes the end of the descriptions of those attributes that serve as counters, and the “</Aggregation>” indicates the end of the “CacheCustomer” aggregation.
[0052] In sum, the aggregation file above represents an aggregation called “CacheCustomer” that aggregates over a predetermined time interval without summarizing the aggregated data. The aggregation is a function of a key that is based on the “CustomerAccount” attribute alone. Therefore, the aggregation will classify the data based on a customer account indicator. For the “CustomerAccount” key, the addition of the existing and new “HitBytes” attributes will serve as a counter. Using this counter, the customer associated with a customer account will be able to determine the value provided by the cache device installed by the service provider. In this example, every cache hit means that a browser request was satisfied very quickly, and thus the cache served its purpose. The number of bytes served after the cache hit is a further measure of the service that the cache device renders to a given customer.
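The following sketch suggests one way the configuration of Table 2 might be read and applied. It rests on several assumptions that the description does not confirm: that each directive occupies its own line, that a leading “#” disables a directive, and that unknown counter functions are rejected. The functions parse_aggregation and aggregate and the account numbers are hypothetical.

```python
CONFIG_TEXT = """\
<Aggregation>
AGGREGATION_NAME CacheCustomer
AGGREGATE_BY_TIME_INTERVAL yes
#SUMMARIZE no
<Keys>
<Attribute>
ATTRIBUTE_NAME NCP_ACCOUNT_NO
ALIAS_NAME CustomerAccount
</Attribute>
</Keys>
<Counters>
<Attribute>
ATTRIBUTE_NAME HIT_BYTES
ALIAS_NAME HitBytes
AGGR_FUNCTION_NAME SUM
</Attribute>
</Counters>
</Aggregation>
"""


def parse_aggregation(text):
    """Read one aggregation definition into a dictionary."""
    config = {"Keys": [], "Counters": []}
    section = None     # "Keys" or "Counters" while inside that block
    attribute = None   # the attribute currently being described
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' marks a disabled directive
        if line in ("<Keys>", "<Counters>"):
            section = line[1:-1]
        elif line in ("</Keys>", "</Counters>"):
            section = None
        elif line == "<Attribute>":
            attribute = {}
        elif line == "</Attribute>":
            config[section].append(attribute)
            attribute = None
        elif line.startswith("<"):
            continue  # the <Aggregation> and </Aggregation> delimiters
        else:
            name, _, value = line.partition(" ")
            target = attribute if attribute is not None else config
            target[name] = value.strip()
    return config


def aggregate(config, records):
    """Classify records by the configured keys and sum the counters."""
    key_names = [key["ATTRIBUTE_NAME"] for key in config["Keys"]]
    totals = {}
    for record in records:
        key = tuple(record[name] for name in key_names)
        row = totals.setdefault(key, {})
        for counter in config["Counters"]:
            if counter["AGGR_FUNCTION_NAME"] != "SUM":
                raise NotImplementedError(counter["AGGR_FUNCTION_NAME"])
            alias = counter["ALIAS_NAME"]
            row[alias] = row.get(alias, 0) + record[counter["ATTRIBUTE_NAME"]]
    return totals


config = parse_aggregation(CONFIG_TEXT)
records = [
    {"NCP_ACCOUNT_NO": "A-1", "HIT_BYTES": 100},
    {"NCP_ACCOUNT_NO": "A-2", "HIT_BYTES": 40},
    {"NCP_ACCOUNT_NO": "A-1", "HIT_BYTES": 60},
]
totals = aggregate(config, records)
```

The aggregation loop refers only to names taken from the configuration, so a different key or counter may be selected by editing the configuration rather than the code.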
[0053] In addition, it should be appreciated that more than one aggregation may be run simultaneously, but with different parameters. For example, the single aggregation process shown in Table 2 may be conducted over two overlapping intervals (e.g., over 5 and over 10 minutes). Also, two or more aggregations may be run simultaneously where, for example, the same aggregation receives data from two distinct data adapters.
[0054] Aggregation “buckets” are storage points that contain the counters associated with a particular key. Therefore, for example, if the key that contains destination address, source address and hit byte only uses hit byte as the counter, there will be an aggregation bucket for the hit byte counter. Also, in order to avoid duplicating data for identical keys, counters for each aggregation are stored in distinct aggregation buckets, under the same key.
[0055] An aggregation thread is an instance of the aggregation. The following is an example of an aggregation thread:
[0056] Thread CacheCustomer
[0057] Filter AccFilter
[0058] Adapter LogFileAdapter
[0059] Aggregation CacheCustomer
[0060] Period
[0061] NonRealTimeInterval
[0062] DataSetPath C:\temp
[0063] FileRetain
[0064] The “Filter” parameter in the aggregation thread specifies that the generic filter with the specified name “AccFilter” must be matched in order for the data to be aggregated. The “Filter” parameter may include multiple names. In this case, all of the designated names must match in order for the data to be aggregated. The “Adapter” parameter in the aggregation thread definition identifies that Data Adapter “LogFileAdapter1” is the adapter that submits data to this aggregation thread. The “Period” parameter identifies how often (e.g., in minutes) the aggregation thread will output a file. The “NonRealTimeInterval” parameter specifies the time interval (e.g., in minutes) over which data needs to be summarized. The “DataSetPath” parameter specifies the top directory under which the file hierarchy for the aggregation files of the aggregation thread will be created. The “FileRetain” parameter specifies the maximum number of files to keep in the output directory for the aggregation thread.
[0065] Aggregation Process and Data Structure
[0066] The process of aggregating data may include factors such as which data is to be collected and which is to be deleted, how the data is to be classified and/or filtered, and how often the data should be aggregated (e.g., real-time, monthly, etc.). Aggregation also may include performing certain operations on the counter attributes, including summing, determining a minimum or maximum, and determining a number of counter updates. In addition, and depending upon the desires of the customer, aggregation may involve a number of other functions including applying filters to delete undesired data and to pass desired data to transaction layer
[0067] As used throughout, the term “in-memory” database refers to the non-permanent memory portion of the database. This non-permanent memory typically is smaller in size, but faster in processing speed than permanent memory. An example of such in-memory may be dynamic random access memory (DRAM) or static random access memory (SRAM).
[0068] Because the aggregation of data is accomplished within the non-permanent memory (i.e., in-memory database), certain considerations are necessary to ensure efficiency and speed. First, the invention uses an “adoptive collection” process. It is well known in the art that certain large collections of data are more suited to a hierarchical scheme (e.g., a binary tree). It is similarly well known in the art that smaller collections of data are more suited to a simpler scheme (e.g., arrays or lists). In fact, large data collections cannot be updated efficiently if implemented as an array, and implementing smaller collections as a binary tree often is an inefficient use of memory resources. The invention, therefore, adapts the scheme to the complexity of the collected data. For example, the invention may first employ a simple array collection scheme for certain data. Once the complexity of the collection reaches a certain threshold (e.g., four elements), however, the invention automatically may adopt a more optimal collection representation, such as a binary tree. Therefore, the invention is able to adapt to a complicated hierarchical collection scheme. This “adoptive collection” can be performed in real-time as the data is received. This is a significant advantage over hard-coded collection schemes that must be re-written in order to accommodate increased or decreased complexity and load of certain collections.
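One possible sketch of such a collection is shown below. It is not the implementation of the invention: a sorted list searched with the standard bisect module stands in for the binary tree, and the class name AdaptiveCollection and its methods are hypothetical. The threshold of four elements is the example value given above.

```python
import bisect


class AdaptiveCollection:
    """Starts as a small unsorted array and switches representation once
    the number of elements reaches THRESHOLD."""

    THRESHOLD = 4  # example threshold from the description

    def __init__(self):
        self._items = []     # (key, value) pairs, scanned linearly
        self._keys = None    # sorted keys, used only after promotion
        self._values = None  # values aligned with _keys

    @property
    def promoted(self):
        return self._keys is not None

    def get(self, key):
        if not self.promoted:
            for k, v in self._items:
                if k == key:
                    return v
            return None
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def put(self, key, value):
        if not self.promoted:
            for i, (k, _) in enumerate(self._items):
                if k == key:
                    self._items[i] = (key, value)
                    return
            self._items.append((key, value))
            if len(self._items) >= self.THRESHOLD:
                self._promote()
            return
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def _promote(self):
        """Switch to the ordered representation (stand-in for a binary tree)."""
        self._items.sort(key=lambda pair: pair[0])
        self._keys = [k for k, _ in self._items]
        self._values = [v for _, v in self._items]
        self._items = None


c = AdaptiveCollection()
for n, address in enumerate(["10.0.0.3", "10.0.0.1", "10.0.0.2"]):
    c.put(address, n)
promoted_at_three = c.promoted
c.put("10.0.0.4", 3)
c.put("10.0.0.0", 9)
```

Callers use the same get and put operations before and after the switch, so the change of representation may occur while data is being received.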
[0069] The invention also benefits from the use of pointers in the key structure that serve to save memory space. As discussed, an aggregation that has multiple attributes as keys may be represented in memory as a collection of “data keys,” where each data key corresponds to a distinct value of the first key attribute. Each data key in the collection points to the collection of data keys that keep the values for the second key attribute. In turn, each element of the second collection points to the collection of data keys that keep the values for the third key attribute, and so on. Therefore, the use of these pointers saves memory space. Moreover, if a particular data key contains a value for a key that does not have any subkeys, this data key will be constructed without any pointers to collections. In the case where several aggregations are configured, common keys may be shared among the aggregations.
[0070] Pointers also may be used for redundant attribute strings. Certain attributes with long string values may consume a great deal of memory. Therefore, the invention may have just one copy of every string value in the database. When an attribute with the same string value needs to be stored in the database, a pointer to the original string is stored instead of the copy, thus conserving additional memory.
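The single-copy string scheme may be sketched as follows, with Python object references standing in for pointers. The class name StringPool is hypothetical.

```python
class StringPool:
    """Keeps one stored copy of each distinct string value."""

    def __init__(self):
        self._pool = {}

    def intern(self, text):
        # setdefault returns the already stored copy when one exists,
        # otherwise it stores and returns the string passed in.
        return self._pool.setdefault(text, text)

    def __len__(self):
        return len(self._pool)


pool = StringPool()
# Two equal strings built separately at run time, as if read from two records.
first = pool.intern("".join(["text/", "html"]))
second = pool.intern("".join(["text/", "html"]))
```

After both calls, first and second refer to the same stored object, and the pool holds a single entry.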
[0071] When several aggregations are configured, certain key attributes may be shared such that multiple data keys do not need to be constructed for the same attribute value. This structure permits certain data keys to point to two or more collections of values of other key attributes.
[0072] The invention also conserves memory space by modifying particular objects based on the way that the object's associated data resides in the database. These modifications may be made based on predetermined data structure decisions made during implementation. By creating objects that are streamlined to their associated data, memory space is further conserved. Therefore, the objects are generic without sacrificing memory space.
[0073] One example of such object modification relates to pointers. Virtual table pointers are well known in the art. When a virtual function is created in an object, the object must keep track of the function. A virtual function table is kept for each type of object, and each object keeps a virtual table pointer, which points to the virtual function table. This allows the object to appear the same, but act differently. However, it is well known by those skilled in the art, that virtual table pointers require a great deal of overhead memory.
[0074] The invention avoids the unnecessary use of such overhead memory by using control objects, instead of requiring the data objects stored in the in-memory database to have virtual functions and corresponding virtual table pointers. The data control objects are created from the configuration, and determine such aspects as: which objects to create, when the objects are to be created, how data is to be extracted from the objects, and how the objects should eventually be destroyed, whether the object is a key or a counter (or both), how many bytes long the object should be, and/or how the object gets updated.
[0075] For example, consider the source address, destination address, and hit byte keys. During configuration each key has a data control object created for it. Therefore, each object has information regarding how it should behave. This intelligence is stored in the control objects and outside of the in-memory database. However, the data (located in the in-memory database) corresponding to the object does not contain this intelligence, and thus reduces the required in-memory space. Therefore, the data control objects result in a memory savings for each data object stored in the in-memory database.
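As a sketch only, the separation between data and control objects might look like the following. The stored record is a plain byte buffer with no per-object behavior, while a control object created from the configuration knows the length of each value and how it is updated. The class CounterControl, the 64-bit layout assumed for APO_TYPE_LONG_LONG, and the byte offsets are assumptions made for illustration.

```python
import struct


class CounterControl:
    """Control object: holds layout and update rule, but no data."""

    def __init__(self, name, fmt, offset, combine):
        self.name = name
        self.fmt = fmt          # struct format; fixes the length in bytes
        self.offset = offset    # position of the value inside the record
        self.combine = combine  # how an existing value absorbs a new one

    @property
    def size(self):
        return struct.calcsize(self.fmt)

    def read(self, buffer):
        return struct.unpack_from(self.fmt, buffer, self.offset)[0]

    def update(self, buffer, new_value):
        merged = self.combine(self.read(buffer), new_value)
        struct.pack_into(self.fmt, buffer, self.offset, merged)


# One control object per configured attribute, shared by every record.
hit_bytes = CounterControl("HIT_BYTES", "<q", 0, lambda old, new: old + new)
miss_bytes = CounterControl("MISS_BYTES", "<q", 8, lambda old, new: old + new)

# The stored data: 16 raw bytes, with no methods or type information.
record = bytearray(hit_bytes.size + miss_bytes.size)
hit_bytes.update(record, 100)
hit_bytes.update(record, 50)
miss_bytes.update(record, 7)
```

However many records are stored, only two control objects exist in this sketch, so the cost of the behavioral information does not grow with the amount of data.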
[0076] Another way that an object may be modified so as to conserve memory space is by using different objects to represent certain types of data keys. For example, as described above, data buckets are used to contain the counters associated with a particular key. The objects may be optimized such that a data key that is not a counter has no intelligence necessary to understand even that buckets exist. Using the previous example, where source address is used as a key but not a counter, the source address control object may not store a pointer to a bucket, nor will it have any intelligence associated with counters or buckets. Therefore, the key-only object may be somewhat different, and perhaps less complex, than the counter-based object. In this way, the particular object is optimized so as to not waste memory space by pointing to buckets (or even having knowledge of buckets) that are nonexistent.
[0077] It should be appreciated that this memory saving tactic can be extended to structures other than data buckets. For example, the invention similarly may conserve memory for keys that do not have subcollections. In this instance, the object for the key may have no intelligence related to the existence or manipulation of subcollections.
[0078] Each aggregation bucket may have multiple counters. The invention's flexibility in using multiple buckets for the same key, instead of multiple keys each having its own aggregation bucket, may provide more efficient use of memory. For example, where one aggregation is configured to occur every five minutes, and another aggregation using the same counter is configured to occur every ten minutes, the counters for the aggregations may be stored under the same key in two different aggregation buckets. This eliminates the need for creating two different keys with the same data.
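The five-minute and ten-minute example may be sketched as below. The classes Bucket and KeyEntry, the flush operation, and the sample address are hypothetical; the sketch assumes that outputting a bucket resets its counters.

```python
class Bucket:
    """Counters for one aggregation interval under a given key."""

    def __init__(self, interval_minutes):
        self.interval_minutes = interval_minutes
        self.counters = {}


class KeyEntry:
    """A single stored key value with one bucket per configured interval."""

    def __init__(self, value, intervals):
        self.value = value
        self.buckets = [Bucket(minutes) for minutes in intervals]

    def update(self, counter_name, amount):
        # The key is located once; every bucket under it is updated.
        for bucket in self.buckets:
            bucket.counters[counter_name] = (
                bucket.counters.get(counter_name, 0) + amount
            )

    def flush(self, interval_minutes):
        """Output and reset the bucket for one interval."""
        for bucket in self.buckets:
            if bucket.interval_minutes == interval_minutes:
                output, bucket.counters = bucket.counters, {}
                return output
        raise KeyError(interval_minutes)


entry = KeyEntry("10.0.0.1", intervals=(5, 10))
entry.update("HIT_BYTES", 100)   # data arriving in the first five minutes
five_a = entry.flush(5)
entry.update("HIT_BYTES", 30)    # data arriving in the next five minutes
five_b = entry.flush(5)
ten = entry.flush(10)
```

The key value is stored once, while each interval accumulates and outputs its own totals.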
[0079] The invention conserves the space in the non-persistent in-memory database using a “compound keying” technique. Compound keying describes the notion of intelligently grouping certain keys based on some logical relation between the keys. As discussed with the “adoptive collection” technique, certain smaller collections of data may be configured to be collected in arrays. However, in cases where the arrays hold more elements than are required, even the arrays represent wasted memory space. For example, a key with only one or two data elements will not efficiently be accommodated by a four-element array. Therefore, where a key is known to have fewer elements than the designated array, compound keys may serve to conserve valuable memory space.
[0080] For example, when aggregating source address and Quality of Service (QoS), a customer may determine that there will be just one QoS value for each source address key. Therefore, during configuration, the source address key and QoS key can be combined into a compound key, where each key is referred to as a “compound key part.” The single compound key data structure contains both source address and QoS. Having a single compound key instead of a key and subkey permits faster access to the QoS element, because there is no need to conduct a search of a subcollection to get the element. When additional flow objects arrive at the aggregation, the aggregation validates that the QoS is the same and the counter is updated. Compound keys are particularly useful where the customer knows in advance that a certain key will only have a certain number of data elements, fewer than the number of elements established for the array.
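A minimal sketch of a compound key for the source address and QoS example follows. The class CompoundKeyAggregation and the sample values are hypothetical, and the choice to raise an error when the QoS value differs is an assumption, since the description states only that the value is validated.

```python
class CompoundKeyAggregation:
    """Stores source address and QoS together in one entry, with no
    subcollection of QoS values to search."""

    def __init__(self):
        self._entries = {}  # source address -> [qos, byte counter]

    def update(self, source_address, qos, byte_count):
        entry = self._entries.get(source_address)
        if entry is None:
            self._entries[source_address] = [qos, byte_count]
            return
        if entry[0] != qos:
            # Assumption: a second QoS value breaks the configured promise.
            raise ValueError(
                "QoS %r differs from stored %r for %s"
                % (qos, entry[0], source_address)
            )
        entry[1] += byte_count

    def lookup(self, source_address):
        qos, total = self._entries[source_address]
        return qos, total


agg = CompoundKeyAggregation()
agg.update("10.0.0.1", "gold", 100)
agg.update("10.0.0.1", "gold", 50)
result = agg.lookup("10.0.0.1")
```

A single lookup by source address yields both the QoS value and the counter in this sketch.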
[0081] The compound keying used in the invention is to be distinguished from similar compound keying performed with hard-coded grouping instructions, because hard-coded grouping results in a loss of generality that is maintained with the invention. This is because the predetermination made by the invention is accomplished during the configuration by setting up the compound keying capability, for example, without having to specify those attributes that require compound keying. The required attributes are then added after the configuration, for example through a graphical user interface. The hard-coded computer instructions, on the other hand, must expressly identify the attributes. Any subsequent changes render the hard-coded instructions useless or at least less efficient.
[0082] Aggregation Process Features
[0083] The following description describes three features of the aggregation process with respect to the operation of the in-memory database: database population, query support, and garbage collection. It should be appreciated, however, that these features are not exclusive, but are meant to further describe the efficiency and speed of the aggregation process on the database operation.
[0084]
[0085] As discussed, these data control objects may be used instead of relying on virtual table pointers within the data objects, so as to conserve memory space. The data control objects determine which data objects to create, when the data objects should be created, how to extract data from the objects, and how to delete the objects, for example. Also, class inheritance may be used in in-memory database population. Class inheritance describes the ability to extend a class definition by declaring a new class that inherits characteristics from the old class. Class inheritance may be used for the data objects to extend base classes for keys, data buckets, and data bucket intervals.
[0086] In step
[0087] In one example, a first aggregation may use keys Source Address and Interface Number, and a second aggregation may use keys Source Address and Destination Address. Assuming the keys are not compound keys (as discussed above), typical data population flow requires that a key with a matching source address is first found for a value of provided flow object source address attribute. Once this occurs, the counters are updated in the subkeys associated with the provided interface number and destination address. If the first aggregation's filter causes a mismatch, it updates the data propagation map to decrement the number of subscriptions to the Interface Number key (e.g., it goes to 0) and to the Source Address key (e.g., it goes to 1). Based on this mapping, the second aggregation will continue to look for matching source addresses in the Source Address collection, and update the key's counters for the provided destination address. However, because the number of subscriptions on Interface Number has been decremented (e.g., to 0), this collection will not be unnecessarily searched and updated.
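The subscription counts in the example above may be sketched as follows. This is an illustrative reading of the data propagation map, not its definitive form: the functions build_propagation_map and collections_to_search and the attribute labels are hypothetical, and the map is modeled simply as a count of interested aggregations per key collection.

```python
from collections import Counter

# The two aggregations of the example and the keys each one uses.
AGGREGATIONS = {
    "first": ("SOURCE_ADDRESS", "INTERFACE_NUMBER"),
    "second": ("SOURCE_ADDRESS", "DESTINATION_ADDRESS"),
}


def build_propagation_map(aggregations, matching):
    """Count, per key collection, the aggregations still interested in a
    flow object after their filters have been evaluated."""
    subscriptions = Counter()
    for keys in aggregations.values():
        subscriptions.update(keys)        # every aggregation starts subscribed
    for name, keys in aggregations.items():
        if name not in matching:
            subscriptions.subtract(keys)  # filter mismatch: decrement
    return subscriptions


def collections_to_search(subscriptions):
    """Only collections with at least one subscription are visited."""
    return sorted(key for key, count in subscriptions.items() if count > 0)


# The first aggregation's filter rejects the flow object; the second accepts.
pmap = build_propagation_map(AGGREGATIONS, matching={"second"})
to_search = collections_to_search(pmap)
```

With these counts the Interface Number collection drops out of the search, while the Source Address collection is still visited on behalf of the second aggregation.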
[0088] The propagation map also permits differentiation between the propagation subscription and the update subscription. Using the above example, if the second aggregation uses keys Source Address, Interface Number, and Quality of Service, the Interface Number key may continue to have subscription for propagation, because the Interface Number must be considered to update the second aggregation. However, the Interface Number key would not have a subscription for updating, because the mismatched flow objects cannot update the first aggregation.
[0089] Returning to
[0090]
[0091] In step
[0092] In step
[0093] Because the invention accomplishes aggregation in non-permanent memory (i.e., in-memory database), data must be efficiently moved to permanent memory (i.e., “garbage collection”).
[0094]
[0095]
[0096] If the predetermined time has not expired, in-memory database continues to be populated with data in step
[0097]
[0098] Also, although these garbage collection techniques are described based on certain conditions (e.g., periodic intervals or amount of available memory space), it should be appreciated that the invention includes other garbage collection techniques that may be accomplished sporadically and unrelated to any preset conditions.
[0099] The invention is directed to a system and method for aggregating data. The invention often was described above in the context of the Internet, but is not so limited to the Internet, regardless of any specific description in the drawing or examples set forth herein. For example, the invention may be applied to wireless networks, as well as non-traditional networks like Voice-over-IP-based networks and/or private networks. It will be understood that the invention is not limited to the use of any of the particular components or devices herein. Indeed, this invention can be used in any application that requires aggregating data. Further, the system disclosed in the invention can be used with the method of the invention or a variety of other applications.
[0100] While the invention has been particularly shown and described with reference to the embodiments thereof, it will be understood by those skilled in the art that the invention is not limited to the embodiments specifically disclosed herein. Those skilled in the art will appreciate that various changes and adaptations of the invention may be made in the form and details of these embodiments without departing from the true spirit and scope of the invention as defined by the following claims.