Title:
SYSTEM AND METHOD FOR PROVENANCE FUNCTION WINDOW OPTIMIZATION
Kind Code:
A1
Abstract:
A system and method for selection of a provenance dependency function in a stream-based data processing infrastructure to optimize backtracing performance in response to a provenance query includes determining performance of a set of dependency functions associated with an analysis component for determining relevancy of each input event received by the analysis component. The relevancy of each input event is determined according to each dependency function, and a record of relevant events is stored according to a recording method. Relevancy results of the dependency functions are aggregated, and the dependency functions are ordered according to a criterion. Data provenance is provided for a given output event using the input event recording method associated with a best dependency function according to the criterion.


Inventors:
Davis II, John Sidney (Arlington, VA, US)
Ebling, Maria Rene (White Plains, NY, US)
Venkatramani, Chitra (Roslyn Heights, NY, US)
Wang, Min (Cortlandt Manor, NY, US)
Application Number:
12/046779
Publication Date:
09/17/2009
Filing Date:
03/12/2008
Primary Class:
1/1
Other Classes:
707/999.005, 707/E17.017
International Classes:
G06F7/06; G06F17/30
Related US Applications:
20080016041; SPREADSHEET-BASED RELATIONAL DATABASE INTERFACE; January, 2008; Frost et al.
20040205044; Method for storing inverted index, method for on-line updating the same and inverted index mechanism; October, 2004; Su et al.
20090300031; AUTOMATIC AD GROUP CREATION IN A NETWORKED ADVERTISING ENVIRONMENT; December, 2009; Lejano et al.
20040019584; Community directory; January, 2004; Greening et al.
20080168023; Web surfing enhancer; July, 2008; Stephens
20080195633; Management of Vertical Sales and Agent Referrals; August, 2008; Rose et al.
20070073703; LDAP to SQL database proxy system and method; March, 2007; Quin
20090216766; Hierarchical Table; August, 2009; Vignet
20090240686; Thread-based web browsing history; September, 2009; Murali
20090319483; GENERATION AND USE OF AN EMAIL FREQUENT WORD LIST; December, 2009; Consul et al.
20030212688; Stacking and unstacking documents; November, 2003; Smith
Attorney, Agent or Firm:
KEUSEY, TUTUNJIAN & BITETTO, P.C. (20 CROSSWAYS PARK NORTH, SUITE 210, WOODBURY, NY, 11797, US)
Claims:
What is claimed is:

1. A method for selection of a provenance dependency function in a stream-based data processing infrastructure to optimize backtracing performance in response to a provenance query, the method comprising: determining performance of a set of dependency functions associated with an analysis component for determining relevancy of each input event received by the analysis component; determining the relevancy of each input event according to each dependency function and storing a record of relevant events according to a recording method; aggregating relevancy results of the dependency functions and ordering the dependency functions according to a criterion; and providing data provenance for a given output event using the input event recording method associated with a best dependency function according to the criterion.

2. The method as recited in claim 1, wherein determining relevancy of each input event received by the analysis component includes determining a measure of relevancy for each input event for each function.

3. The method as recited in claim 1, wherein aggregating relevancy results of the dependency functions and ordering the dependency functions according to a criterion includes computing a relevancy ratio.

4. The method as recited in claim 3, further comprising a relevancy counter, which counts relevant inputs and an input counter which counts input events, the method further comprising computing the relevancy ratio as a ratio of: relevancy counts/input event counts.

5. The method as recited in claim 3, wherein the criterion includes a relevancy threshold and further comprising comparing the relevancy ratio to the relevancy threshold to determine whether an entry is added to a table indexed by a generated output event.

6. The method as recited in claim 5, wherein comparing the relevancy ratio to the relevancy threshold includes, if the relevancy ratio is less than or equal to the relevancy threshold, copying all aggregated entries into the table; otherwise, deleting the entry.

7. The method as recited in claim 1, wherein providing data provenance includes looking up an output event in a table and determining whether a set of input event entries exist which are associated with the output data element.

8. The method as recited in claim 7, wherein if the output event has associated input event entries, returning the set as data provenance for the provenance query.

9. The method as recited in claim 7, wherein if the output event has no associated input event entries, backtracing to determine data provenance for the provenance query.

10. A computer readable medium comprising a computer readable program for selection of a provenance dependency function in a stream-based data processing infrastructure to optimize backtracing performance in response to a provenance query, wherein the computer readable program when executed on a computer causes the computer to perform the steps of: determining performance of a set of dependency functions associated with an analysis component for determining relevancy of each input event received by the analysis component; determining the relevancy of each input event according to each dependency function and storing a record of relevant events according to a recording method; aggregating relevancy results of the dependency functions and ordering the dependency functions according to a criterion; and providing data provenance for a given output event using the input event recording method associated with a best dependency function according to the criterion.

11. The computer readable medium as recited in claim 10, wherein determining relevancy of each input event received by the analysis component includes determining whether the input event satisfies a relevancy criterion.

12. The computer readable medium as recited in claim 10, wherein aggregating relevancy results of the dependency functions and ordering the dependency functions according to a criterion includes computing a relevancy ratio.

13. The computer readable medium as recited in claim 12, further comprising a relevancy counter, which counts relevant inputs and an input counter which counts input events, the method further comprising computing the relevancy ratio as a ratio of: relevancy counts/input event counts.

14. The computer readable medium as recited in claim 12, wherein the criterion includes a relevancy threshold and further comprising comparing the relevancy ratio to the relevancy threshold to determine whether an entry is added to a table indexed by a generated output event.

15. The computer readable medium as recited in claim 14, wherein comparing the relevancy ratio to the relevancy threshold includes, if the relevancy ratio is less than or equal to the relevancy threshold, copying all aggregated entries into the table; otherwise, deleting the entry.

16. The computer readable medium as recited in claim 10, wherein providing data provenance includes looking up an output event in a table and determining whether a set of input event entries exist which are associated with the output data element.

17. The computer readable medium as recited in claim 16, wherein if the output event has associated input event entries, returning the set as data provenance for the provenance query.

18. The computer readable medium as recited in claim 16, wherein if the output event has no associated input event entries, backtracing to determine data provenance for the provenance query.

19. A method for runtime selection of a provenance output/input dependency function in a stream-based data processing infrastructure to optimize backtracing performance in response to a provenance query, the method comprising the steps of: observing the performance of a set of output/input dependency functions associated with an analysis component to determine relevancy for each input event received by the analysis component; calculating the relevancy of each input event according to each output/input dependency function and storing a record of each event that is determined to be relevant according to a recording method; aggregating the relevancy results of output/input dependency function and ordering the output/input dependency functions according to an ordering criterion; and using the input event recording method associated with the best output/input dependency function according to the ordering criterion for use when backtracing from a given output event.

20. A computer readable medium comprising a computer readable program for runtime selection of a provenance output/input dependency function in a stream-based data processing infrastructure to optimize backtracing performance in response to a provenance query, wherein the computer readable program when executed on a computer causes the computer to perform the steps of: observing the performance of a set of output/input dependency functions associated with an analysis component to determine relevancy for each input event received by the analysis component; calculating the relevancy of each input event according to each output/input dependency function and storing a record of each event that is determined to be relevant according to a recording method; aggregating the relevancy results of output/input dependency function and ordering the output/input dependency functions according to an ordering criterion; and using the input event recording method associated with the best output/input dependency function according to the ordering criterion for use when backtracing from a given output event.

Description:

BACKGROUND

1. Technical Field

The present invention relates to data management and query support in data analysis and, more particularly, to techniques for optimizing response time of queries about provenance of data elements that result from the analysis and transformation of input data streams.

2. Description of the Related Art

Data provenance involves the management of metadata about the history, generation and transformation of data. Data provenance is of special importance in large data processing systems in which data is operated on and routed between networked processing elements (PEs). The PEs in a stream processing system perform various operations on input data elements to generate output data elements. These output data elements are referred to as the results of the stream processing system. Examples of input data elements include packets of audio data, email data, computer generated events, network data packets, or readings from sensors, such as environmental, medical or process sensors. Examples of transformations conducted by individual PEs deployed on a stream processing graph include parsing a header of a network packet, aggregating audio samples into an audio segment or performing speech detection on an audio segment, subsampling sensor readings, averaging the readings over a time window of samples, applying spatial, temporal, or frequency filters to extract specific signatures over the audio or video segments, etc. The PEs produce results as a stream of output data elements or may produce individual output elements consumed by some external monitoring applications.

Data provenance applied to stream processing systems involves verification of the origins and causal factors of data produced by the system's PEs. A given data element that has a value of interest might lead to a query about the provenance of that datum, perhaps to determine why the data element has a particular value, or why the element was generated in the first place. The provenance query response requires an analysis of all upstream PEs, and of the data consumed and generated by the upstream PEs, on which the datum of interest is dependent. Given the high data throughput of stream processing systems, a key challenge with managing provenance is the minimization of provenance query response times.

The standard approach for responding to provenance queries is to perform provenance function backtracing. In provenance function backtracing, each PE in a graph of processing elements maintains a provenance function that maps a given output event to a set of input events. When a query about a given output event occurs, the provenance function associated with the PE that generated the event is used to determine the precipitating input events. Once these input events have been identified, the provenance functions of the upstream analysis components which generated the input events are used to determine further upstream events that are indirectly related to the given output event. This process is repeated recursively until all relevant events have been identified.
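The backtracing procedure described above can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration only: `producers` is a hypothetical dictionary mapping each event identifier to the provenance function of the PE that generated it, and source events are simply absent from the dictionary. The sketch uses an explicit stack in place of recursion, which visits the same events.

```python
from typing import Callable, Dict, Hashable, Iterable, Set

# Hypothetical type: a provenance function maps an output event id
# to the ids of the input events it may depend on.
ProvenanceFn = Callable[[Hashable], Iterable[Hashable]]


def backtrace(event: Hashable, producers: Dict[Hashable, ProvenanceFn]) -> Set[Hashable]:
    """Collect every upstream event reachable from `event`."""
    seen: Set[Hashable] = set()
    stack = [event]
    while stack:
        current = stack.pop()
        fn = producers.get(current)
        if fn is None:
            # Source event: no upstream PE generated it.
            continue
        for parent in fn(current):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Illustrative two-stage graph: o1 depends on m1 and m2,
# which in turn depend on source events i1, i2 and i3.
producers = {
    "o1": lambda e: ["m1", "m2"],
    "m1": lambda e: ["i1", "i2"],
    "m2": lambda e: ["i2", "i3"],
}
upstream = backtrace("o1", producers)
```

In this illustration `upstream` holds the two intermediate events and the three source events; the queried event itself is not included.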

Several points about provenance functions are worth noting. Most notably, provenance functions are distinct from the operations performed on input data streams by a processing element in that provenance functions map output data elements to sets of input data elements and, like PE operations, provenance functions can be mathematical functions and not simply relations. The fact that PE operations may not be functions and, more specifically, may not be invertible functions is a key motivator for why provenance functions are needed. Note further that while it is implicitly understood that PE operations are specified by an author of a PE, this may or may not be the case for a provenance function associated with a PE. A provenance function may be specified by the corresponding PE author, it may be specified by an author not responsible for the corresponding PE or the provenance function may be automatically generated using various techniques. These characteristics of provenance functions imply that a given output data element may be deterministically mapped to a specific set of input data elements during a provenance query event, though the corresponding PE operation may be non-invertible or even stochastic.

Provenance function backtracing can result in very inefficient provenance query responses. As described above, provenance functions map output events of a given PE to a set of input events for that PE. Given the time ordered nature of streaming data systems, the set of input events mapped to by provenance functions is referred to as a provenance input window. Due to the characteristics of provenance functions, as outlined above, the provenance input window may be conservatively specified such that only a small portion of the data contained within the window is directly relevant to the corresponding output event. The relevancy ratio is defined as the ratio of the relevant provenance window data count to the provenance window size, where the window size is the cardinality of the set of data events contained in the window. When the relevancy ratio of a provenance window is very small, the result is an unnecessarily large search space of data events to search through in response to a provenance query, and the search space increases exponentially as the query traces upstream.

The degree of inefficiency of a provenance query depends both on the specification of the provenance function as well as the statistics of the input data with respect to the provenance function specification. Consider an example scenario in which a processing element consumes a single input stream of real number-valued data and produces an output event with a value that is equal to the average of the last ten input events that have had values greater than or equal to 50. If the stream of input data is such that most input events have values over 50, then on average the relevancy ratio will be high for each input window. If most input events are below 50, then on average the relevancy ratio will be low for each input window.
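As a hedged illustration of how input statistics drive the relevancy ratio, the following Python sketch computes the ratio for one provenance input window. The function name `relevancy_ratio` and the sample windows are hypothetical; the relevancy test (value greater than or equal to 50) follows the example scenario above.

```python
from typing import Callable, Sequence


def relevancy_ratio(window: Sequence[float], is_relevant: Callable[[float], bool]) -> float:
    """Relevant event count divided by window size (window cardinality)."""
    if not window:
        raise ValueError("window must contain at least one event")
    relevant = sum(1 for value in window if is_relevant(value))
    return relevant / len(window)


def at_least_50(value: float) -> bool:
    return value >= 50


# Hypothetical windows: one dominated by high values, one by low values.
mostly_high = [60, 70, 55, 40, 90, 80, 65, 75, 50, 95]
mostly_low = [10, 20, 60, 5, 30, 15, 25, 35, 45, 0]

high_ratio = relevancy_ratio(mostly_high, at_least_50)  # 9 of 10 events relevant
low_ratio = relevancy_ratio(mostly_low, at_least_50)    # 1 of 10 events relevant
```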

To further refine the example, assume a relevancy ratio of 1%. In this case, backtracing through a single processing element would produce, on average, input windows containing 1000 data events in which only 10 of the input events are directly relevant to a given output event. In a worst case scenario, as backtracing continues recursively upstream, this inefficiency will expand exponentially. Such inefficiencies result in slow provenance query response times since the space of data elements that must be searched to determine the provenance of a given output data event is unnecessarily large. Solutions that avoid this inefficiency are needed.
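The arithmetic behind this example can be sketched as follows. The worst case modeled here is an assumption made for illustration: every event in a window is itself traced through an upstream PE with the same window size, and windows do not overlap. The function names are hypothetical.

```python
def window_size(relevant_count: int, ratio_percent: int) -> int:
    """Window size implied by a relevant-event count and a relevancy ratio in percent."""
    return relevant_count * 100 // ratio_percent


def worst_case_events(window: int, depth: int) -> int:
    """Events examined if every event in each window is traced `depth` PEs upstream."""
    return window ** depth


size = window_size(10, 1)                 # 10 relevant events at a 1% ratio
three_hops = worst_case_events(size, 3)   # whole windows traced through 3 PEs
relevant_only = worst_case_events(10, 3)  # only the relevant events traced
```

Under these assumptions, three levels of backtracing examine on the order of a billion events, whereas tracing only the relevant events would examine about a thousand.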

A significant amount of related work exists on providing solutions for infrastructures that manage provenance data. Such related work considers the best way to store provenance information independent of optimizing response time of data provenance queries. Rather, the focus of much of the previous work on data provenance considers whether provenance information should be stored as annotations attached to the appropriate data elements (see, e.g., K. Muniswamy-Reddy, D. Holland, U. Braun and M. Seltzer, Provenance-Aware Storage Systems, Proc. of the 2006 USENIX Annual Technical Conference, June 2006) or alternatively whether provenance information should be encoded in the processing elements of the data processing system (see, e.g., R. Bose, “A conceptual framework for composing and managing scientific data lineage”, 14th International Conference on Scientific and Statistical Database Management, SSDBM'02, pp. 15-19).

Prior systems do not teach how to store and manage input data elements that were responsible for producing certain final output elements/events so that the data provenance queries can be answered efficiently in a stream processing system. The problem of efficiently querying for provenance information is not addressed. Also, no technique is disclosed or suggested for efficient storage and retrieval of data provenance information for analytic methods whose output elements/events depend on a subset of the input data elements that satisfy certain characteristics.

SUMMARY

A system and method for selection of a provenance dependency function in a stream-based data processing infrastructure to optimize backtracing performance in response to a provenance query includes determining performance of a set of dependency functions associated with an analysis component for determining relevancy of each input event received by the analysis component. The relevancy of each input event is determined according to each dependency function, and a record of relevant events is stored according to a recording method. Relevancy results of the dependency functions are aggregated, and the dependency functions are ordered according to a criterion. Data provenance is provided for a given output event using the input event recording method associated with a best dependency function according to the criterion.

There are key differences between the previous work and the present embodiments. Notably, much of the previous work considers provenance at the granularity of an entire data stream whereas the present work considers provenance at the level of individual data elements. More specifically, as will be shown, the present embodiments consider an optimization that may be applied at runtime or offline, to reduce provenance query response times whereas the previous work focuses on efficient ways for storing provenance data.

These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:

FIG. 1 is a diagram illustrating an exemplary stream processing element and an exemplary relationship between output data elements and a set of input elements that may contribute to a given output element;

FIG. 2 is a block/flow diagram illustrating an exemplary execution of a system/method for performing backtracing in response to a data provenance query;

FIG. 3 is a block/flow diagram illustrating a system/method in accordance with present principles which takes a relevancy criterion at runtime (prior to a provenance query) and a tunable relevancy threshold as input for query processing;

FIG. 4 is a block/flow diagram illustrating a system/method in accordance with present principles for query processing for a data provenance query;

FIG. 5 is a block/flow diagram showing a system/method for runtime selection of a provenance output/input dependency function in a stream-based data processing infrastructure to optimize backtracing performance in response to a provenance query in accordance with the present principles.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The present principles provide for optimizing response time of provenance queries in stream processing systems. At least one provenance function is provided that maps each output data event to a set of input data events. A mechanism for evaluating statistics of input data streams, at runtime and prior to provenance queries, with respect to a given provenance function is provided. An evaluation method is used to select an efficient backtracing method for responding to a provenance query such that the method optimizes response time. A method for further optimizing the backtracing method selection to take into consideration additional resources is also presented. The resources include but are not limited to memory and speed.

A system and method are provided for adaptively optimizing the response times of queries regarding data provenance in stream processing systems. In accordance with one aspect, a method for determining, prior to a provenance query, the most efficient means for mapping a given output event associated with a processing element to a set of predicating input events in response to a provenance query, is based on statistics of the input streams consumed by the processing elements. The processing element has at least one provenance function associated with it, and a method for evaluating the statistics of the input data consumed by the processing element is provided.

During execution, the input data statistics are evaluated to determine an efficient method for associating output events with sets of input events in anticipation of a provenance query regarding a given output event. This may be executed at runtime when the PEs are processing data, or offline on stored output data, after the processing is completed.

In additional embodiments, a relevancy criterion is specified that evaluates the statistics of input data during runtime, but prior to the occurrence of a provenance query. The relevancy criterion is used to dynamically select an efficient provenance function backtracing method during runtime, but prior to the occurrence of a provenance query. The relevancy criterion maps a set of input data events to a Boolean value and maintains a count of the number of True and False outputs to determine the input data statistics by maintaining a value of each input window's relevancy ratio.

In another embodiment, a runtime system for leveraging a relevancy criterion is used for determining if input window data should be cached. An exemplary embodiment of such a system operates by caching a pointer from an output event to a set of input events when the relevancy ratio of the corresponding input window is low, or elects to use the available provenance function for associating an output event with a set of input events when the relevancy ratio of the corresponding input window is high.
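The caching decision described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `record_output`, `provenance`, `cache` and `threshold` are hypothetical, the cache is modeled as a dictionary indexed by output event, and a ratio at or below the threshold is treated as "low".

```python
from typing import Callable, Dict, Hashable, List, Sequence


def record_output(output_id: Hashable,
                  window: Sequence[float],
                  is_relevant: Callable[[float], bool],
                  threshold: float,
                  cache: Dict[Hashable, List[float]]) -> float:
    """Cache the relevant inputs for an output event when the window's ratio is low."""
    if not window:
        raise ValueError("window must contain at least one event")
    relevant = [value for value in window if is_relevant(value)]
    ratio = len(relevant) / len(window)
    if ratio <= threshold:
        # Low ratio: searching the whole window would be wasteful, so cache.
        cache[output_id] = relevant
    return ratio


def provenance(output_id: Hashable,
               cache: Dict[Hashable, List[float]],
               provenance_fn: Callable[[Hashable], List[float]]) -> List[float]:
    """Answer from the cache when possible; otherwise use the provenance function."""
    if output_id in cache:
        return cache[output_id]
    return provenance_fn(output_id)


cache: Dict[Hashable, List[float]] = {}
record_output("o1", [10, 60, 20, 5, 3, 2, 1, 0, 4, 7], lambda v: v > 50, 0.1, cache)
record_output("o2", [60, 70, 10], lambda v: v > 50, 0.1, cache)

from_cache = provenance("o1", cache, lambda o: [])
from_function = provenance("o2", cache, lambda o: [60, 70, 10])
```

Here the first output event is served from the cache, while the second falls back to its provenance function because its window had a high relevancy ratio and was not cached.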

The relevancy criterion may be used to create a cache of input data, in an offline process. That is, the method is used either in a separate process after the stream-processing application has completed, or when a subsequent provenance query (say the first) is being evaluated. This permits the system to create the cache when the needed computation and storage resources are available to do so, and makes all subsequent provenance queries more efficient.

A system for runtime tuning of a provenance backtracing selection system is based on a set of available resources. The resources may include but are not limited to processing system memory and query response time speed. In an exemplary embodiment, a specification of the maximum cache size for storing input events associated with a given output event is provided. The maximum cache size is used to limit the number of input events stored and permits the system to elect a provenance function for query time evaluation when the cache size is exceeded. An alternative exemplary embodiment includes a specification of a maximum query response time used to limit the anticipated query response time. The maximum query response time is used to enforce the use of cached data when such usage will ensure that query response times are kept below the maximum response time limit.

Advantages of the present principles include adapting the runtime execution of provenance management methods in response to changing statistics of input data. As input data statistics vary, the present embodiments cache input data accordingly. A further advantage includes optimizing provenance query response time in accordance with input data statistics as well as specified provenance dependency functions associated with processing elements.

Embodiments of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that may include, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

Referring now to the drawings in which like numerals represent the same or similar elements and initially to FIG. 1, a diagram illustrates an exemplary stream processing element and an exemplary relationship between output data elements and a set of input elements that may contribute to a given output element. An input stream 101 in time interval [t, t+5] is input to a processing element (PE) 102. An output stream 103 is output at time interval [t+2, t+5]. Data elements 108 in input stream 101 may contribute to an output data element 112 in output stream 103, and data elements 110 in stream 101 may contribute to output data element 114 in output stream 103.

For example, it is common that an output data element at time t only depends on the input data elements within a time interval [t−a, t−b], where a≧b. In this example, we have a=2 and b=0, if the output is computed as the running average of the past three input data items.

Note that in some cases, only a subset of the input elements in the time interval [t−a, t−b] really contribute to the output data element at time t. For example, the output data element may be computed as the average of the input data stream's values in time interval [t−a, t−b] that are bigger than, e.g., 50. In this case, only the input elements in time interval [t−a, t−b] with values bigger than 50 contribute to the output data element at time t.

In other cases, the parameters a and b in the dependency interval [t−a, t−b] may not be fixed values. For example, an output data element at time t may be computed as the average of the past 10 input data elements whose value is bigger than 50. In this case, the values of a and b depend on the input data distribution.

Given an output data element at time t and a provenance description that describes the conditions for an input data element to contribute to an output data element, it is desirable to know how to retrieve all the input data elements that really contributed to the output elements. One straightforward way of doing this is through backtracing. In this approach, all the possible input data elements are examined against a provenance description until all of the elements are found.

Referring to FIG. 2, a block/flow diagram illustrates an exemplary execution of a system/method for performing backtracing in response to a data provenance query. In this example, the provenance description, in block 202, states that an output data element at time t is computed as the average of the past 10 input data elements whose value is bigger than 50. The corresponding provenance function (PF) for this example could be, e.g., PF: O(t)→{I(t−i) | i>=0}. A slightly more precise provenance function could be PF: O(t)→{I(t−i) | I(t−i)>50, and i>=0}. A much more precise provenance function for the same example would be PF: O(t)→{I(t−k_j) | 0<=j<10, 0<=k_0<k_1< . . . <k_9, I(t−k_j)>50 for 0<=j<10, and I(t−i)<=50 for any i<k_9 with i≠k_j}. Other criteria are also contemplated and may be employed.

In a second example, the provenance description states that an output data element at time t is computed as the maximum value of the past 10 input data elements. The corresponding provenance function for this example could be PF:O(t)→{I(t−i)|0<=i<10}. In a third example, the provenance description states that an output data element at time t is computed as the running average of the past three values. The provenance function for this case could be PF: O(t)→{I(t), I(t−1), I(t−2)}.
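The provenance functions of the second and third examples, together with the most conservative function for the first example, can be written as a short Python sketch. The assumptions here are illustrative: each function returns the timestamps of the input events in the window rather than the events themselves, the stream is assumed to begin at time 0, and the function names are hypothetical.

```python
from typing import List


def pf_max_of_last_10(t: int) -> List[int]:
    """Second example: O(t) maps to {I(t-i) | 0 <= i < 10}."""
    return [t - i for i in range(10)]


def pf_running_avg_3(t: int) -> List[int]:
    """Third example: O(t) maps to {I(t), I(t-1), I(t-2)}."""
    return [t, t - 1, t - 2]


def pf_all_prior(t: int) -> List[int]:
    """Conservative window for the first example: every input at or before t.

    Assumes the stream starts at time 0.
    """
    return list(range(t, -1, -1))


window_max = pf_max_of_last_10(12)
window_avg = pf_running_avg_3(5)
window_all = pf_all_prior(3)
```

The conservative window grows with t, which is why a more precise function or a cache becomes attractive for long-running streams.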

In block 204, the index i and the “count” are initialized to zero. Next, in block 206, backtracing starts to check the values of the input data elements (indicated as I(t−i) at time t, t−1, t−2, . . . ). In block 208, each input data element is checked to determine if it is larger than 50. If the input data element is larger than 50, the input data element is recorded as an entry in the provenance query result in block 210. The count and index are incremented in block 212. A check is performed in block 214 to determine if 10 values larger than 50 have been found. If 10 such values are found, these 10 values, together with their time stamps, will be returned as the query result. If 10 values are not yet found, the program path returns to block 206. If in block 208 I(t−i) is not larger than 50, the program increments i in block 216, and the count check of block 214 is performed as well.
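The flow of FIG. 2 can be sketched in Python as follows. The block numbers in the comments refer to the description above. Two details are assumptions made for illustration: the input history is modeled as a dictionary from timestamp to value, and the loop also stops when the recorded history is exhausted, a case the flow diagram does not address. The function name `backtrace_fig2` is hypothetical.

```python
from typing import List, Mapping, Tuple


def backtrace_fig2(inputs: Mapping[int, float], t: int,
                   needed: int = 10, limit: float = 50) -> List[Tuple[int, float]]:
    """Walk backward from time t collecting (timestamp, value) pairs above `limit`."""
    i = 0       # block 204: index initialized to zero
    count = 0   # block 204: count initialized to zero
    result: List[Tuple[int, float]] = []
    # Block 214 is the `count < needed` test; the membership test is an
    # added guard for a finite history.
    while count < needed and (t - i) in inputs:
        value = inputs[t - i]               # block 206: examine I(t-i)
        if value > limit:                   # block 208: larger than 50?
            result.append((t - i, value))   # block 210: record the entry
            count += 1                      # block 212: increment count
        i += 1                              # block 212 or 216: increment index
    return result


# Illustrative history; only two matching values are requested to keep it short.
history = {0: 60, 1: 10, 2: 70, 3: 55, 4: 20}
found = backtrace_fig2(history, t=4, needed=2)
```

In this illustration three input events are examined to find two relevant ones, which shows in miniature why a low relevancy ratio lengthens the query.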

A drawback of the backtracing approach is performance: the query response time could be very long since it usually needs to examine many input data elements before it returns the query result.

Referring to FIG. 3, to improve the performance of a data provenance query through optimization, an illustrative system/method in accordance with present principles takes a relevancy criterion (RC, e.g., RC maps an input event to TRUE if its value is bigger than 50 and FALSE otherwise) at runtime (prior to a provenance query) and a tunable relevancy threshold (RT) (for example, RT could be set to 10%) as input in block 302. The system/method determines the input data elements corresponding to every output data element for the lifetime of the stream. This processing begins at the start of the stream at time t=0, set in block 304. Two counters are maintained during the execution of the method: an Input Counter (ic) and a Relevancy Counter (rc), which are also initialized to 0 at t=0 in block 304.

A check is made to determine if the stream is still active and there are more input data elements to process, in block 306. If there are no more input data elements to process, the method ends.

If there are more input data elements to process, then in block 308, the next input data element is processed as follows. In block 310, the input data element is checked to see if it satisfies the Relevancy Criterion (RC). During runtime of a processing element, ic is incremented for every input data element processed, and rc is incremented whenever an input data element satisfies the RC. If the RC is satisfied, the input data element (I(t)) is recorded in a memory element or cache, and ic and rc are incremented in block 312.

Each input data element that results in an increment of rc is cached. If the RC is not satisfied, only ic is incremented in block 314. In block 316, a check is made to see if an output event (O(t)) is generated. For example, if O(t) is defined as an output event with a value equal to the average of the last 10 events that have values over 50, the check determines whether 10 input events with values over 50 have been processed.

When an output event is generated, a relevancy ratio is computed as relevancy counter/input counter in block 318. If the relevancy ratio meets the given threshold RT (e.g., rc/ic≧RT), the cache is labeled (indexed) with the output data element and maintained in a table T for a future data provenance query on this output data element in block 320. The cache is then emptied and ic and rc are set to 0, in block 322. In the case that the relevancy threshold is not satisfied, the cache is similarly emptied in block 322, setting rc and ic to 0. Time t is incremented in block 324 and processing continues to block 306 to determine if there are more input elements to process.

If no output event is generated in block 316, time t is incremented in block 324 and the program path returns to block 306, to determine if there are more input data elements to process. Note that the relevancy threshold RT is a tunable parameter, e.g., between 0 and 100%. In general, RT should be set to a lower (higher) value when the available storage size for the maintained table T is smaller (larger).
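The runtime procedure of FIG. 3 can be sketched as a single pass over the stream. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, the stream is modeled as a finite iterable with one input element per time step, and an output event is assumed to be generated once a fixed number of relevant inputs (`per_output`) has accumulated, following the "average of the last 10 events over 50" example. The comparison rc/ic≧RT follows the description of block 318.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Tuple[int, float]  # (time stamp, value)


def build_provenance_table(
    stream: Iterable[float],
    rc_fn: Callable[[float], bool],
    rt: float,
    per_output: int = 10,
) -> Dict[int, List[Event]]:
    """Return table T mapping output time t to the cached relevant inputs."""
    table: Dict[int, List[Event]] = {}
    cache: List[Event] = []
    ic = rc = 0                                # block 304
    for t, value in enumerate(stream):         # blocks 306, 308, 324
        ic += 1                                # blocks 312 and 314
        if rc_fn(value):                       # block 310
            rc += 1
            cache.append((t, value))           # block 312
        if rc == per_output:                   # block 316 (assumed output rule)
            if rc / ic >= rt:                  # blocks 318, 320
                table[t] = list(cache)
            cache.clear()                      # block 322
            ic = rc = 0
    return table


if __name__ == "__main__":
    data = [60, 10, 70, 80, 10, 10, 10, 10, 90, 95]
    print(build_provenance_table(data, lambda v: v > 50, rt=0.5, per_output=2))
```

Resetting the cache and both counters after every output event matches the assumption that each input element is associated with at most one output element.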

One skilled in the art will realize that this embodiment assumes that each input element is associated with at most one output element. This can be easily overcome with additional bookkeeping of metadata.

Referring to FIG. 4, query processing for a data provenance query is illustratively shown. When a data provenance query is issued for an output data element O(t), the table T is consulted in block 402 for any set of input data elements labeled with O(t). In block 404, a check is made to determine whether such a set of input entries labeled with the output event exists. If there is such a set, the set is returned as the answer to the provenance query in block 406. Otherwise, backtracing is employed in block 408.
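The lookup-then-fallback flow of FIG. 4 can be sketched as follows. The function name and the representation of table T as a dictionary keyed by output time are assumptions for illustration; the backtracing routine is passed in as a callable so that the sketch stays independent of any particular backtracing implementation.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[int, float]  # (time stamp, value)


def answer_provenance_query(
    t: int,
    table: Dict[int, List[Event]],
    backtrace: Callable[[int], List[Event]],
) -> List[Event]:
    """Answer a provenance query for output O(t)."""
    cached = table.get(t)          # blocks 402 and 404
    if cached is not None:
        return cached              # block 406: answer from table T
    return backtrace(t)            # block 408: fall back to backtracing


if __name__ == "__main__":
    table_t = {5: [(3, 70.0), (4, 80.0)]}
    slow_path = lambda t: [(t, 0.0)]  # stand-in for a real backtracing routine
    print(answer_provenance_query(5, table_t, slow_path))
    print(answer_provenance_query(6, table_t, slow_path))
```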

Referring to FIG. 5, a system/method for selection of a provenance output/input dependency function in a stream-based data processing infrastructure to optimize backtracing performance in response to a provenance query is illustratively shown in accordance with one illustrative embodiment. In block 502, the performance of a set of dependency functions (e.g., output/input dependency functions) associated with an analysis component (e.g., a processing element) is observed or determined in order to determine the relevancy of each input event received by the analysis component. This may include determining a relevancy measurement for each input event.

For example, if the output event at time t is computed as the average value of the input events at time t, t−2, t−4, . . . , t−18, the dependency function could be O(t)→{I(t−i)|0<=i<20}. However, a better dependency function would be O(t)→{I(t−i)|0<=i<=18 and i is an even number}. In general, the more effective/precise the dependency function is in selecting the relevant input elements, the better backtracing performs in answering provenance queries.
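The difference between the two dependency functions in this example can be made concrete by comparing the input indices each one selects against the indices that actually contribute to O(t). The sketch below is illustrative only; the function names and the `precision` measure (fraction of selected inputs that are truly relevant) are assumptions introduced here and are not terms defined in the description.

```python
from typing import Set


def coarse_dependency(t: int) -> Set[int]:
    """O(t) -> {I(t-i) | 0 <= i < 20}."""
    return {t - i for i in range(20)}


def precise_dependency(t: int) -> Set[int]:
    """O(t) -> {I(t-i) | 0 <= i <= 18 and i is an even number}."""
    return {t - i for i in range(0, 19, 2)}


def precision(selected: Set[int], relevant: Set[int]) -> float:
    """Fraction of selected input indices that are truly relevant."""
    return len(selected & relevant) / len(selected)


if __name__ == "__main__":
    t = 100
    relevant = precise_dependency(t)  # inputs at t, t-2, ..., t-18
    print(precision(coarse_dependency(t), relevant))
    print(precision(precise_dependency(t), relevant))
```

Under this measure the coarse function selects twice as many inputs as are needed, so a backtrace driven by it examines twice as many elements for the same answer.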

In block 504, the relevancy of each input event is determined according to each (output/input) dependency function, and a record is stored for each event that is determined to be relevant according to a recording method. Recording methods may include creating a table in a relational database and inserting all the relevant input events together with their time stamps in the table, or caching all the relevant input events using an in-memory data structure. Other recording methods may also be employed.
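The relational recording method can be sketched with an in-memory SQLite database from the Python standard library. The table name, column names, and function names are hypothetical and chosen for illustration; the description does not specify a schema.

```python
import sqlite3
from typing import Iterable, List, Tuple

Event = Tuple[int, float]  # (time stamp, value)


def record_relevant_events(
    conn: sqlite3.Connection, output_t: int, events: Iterable[Event]
) -> None:
    """Insert the relevant input events, with time stamps, for output O(output_t)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS provenance ("
        "output_t INTEGER, input_t INTEGER, value REAL)"
    )
    conn.executemany(
        "INSERT INTO provenance VALUES (?, ?, ?)",
        [(output_t, t, v) for t, v in events],
    )
    conn.commit()


def lookup_relevant_events(conn: sqlite3.Connection, output_t: int) -> List[Event]:
    """Return the recorded input events for O(output_t), oldest first."""
    cursor = conn.execute(
        "SELECT input_t, value FROM provenance WHERE output_t = ? ORDER BY input_t",
        (output_t,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    record_relevant_events(db, 9, [(4, 80.0), (7, 65.0)])
    print(lookup_relevant_events(db, 9))
```

The in-memory alternative mentioned above corresponds to keeping the same (output, input events) association in an ordinary dictionary instead of a database table.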

In block 506, the relevancy results are aggregated for each dependency function, and the dependency functions are ordered according to a particular criterion. Ordering criteria may include most to least relevant, or comparison to a relevancy threshold (RT). The ordering criterion may be based on computing a relevancy ratio. A relevancy counter (rc), which counts relevant inputs, and an input counter (ic), which counts input events, may be employed to compute a relevancy ratio as the ratio of: relevancy counts/input event counts. The criterion may include the relevancy threshold, and the relevancy ratio may be compared to the relevancy threshold to determine whether an entry is added to a table indexed by a generated output event. The comparison of the relevancy ratio to the relevancy threshold may provide that if the relevancy ratio is less than or equal to the relevancy threshold, all aggregated entries are copied into the table; otherwise, the entry is deleted.
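The aggregation and ordering of block 506 can be sketched by computing a relevancy ratio rc/ic for each candidate dependency function over the same window of input events and sorting on that ratio. This is a sketch under stated assumptions: the function name, the modeling of a dependency function as a predicate over (offset i, value), and the default ascending sort are all hypothetical choices. The description leaves the ordering direction to the chosen criterion, so it is exposed here as a parameter.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A dependency function is modeled as a predicate over (offset i, value of I(t-i)).
DependencyFn = Callable[[int, float], bool]


def rank_dependency_functions(
    functions: Dict[str, DependencyFn],
    window: Sequence[float],
    descending: bool = False,
) -> List[Tuple[str, float]]:
    """Return (name, relevancy ratio) pairs ordered by ratio."""
    ic = len(window)  # input counter
    if ic == 0:
        raise ValueError("window must contain at least one input event")
    ratios = {}
    for name, fn in functions.items():
        rc = sum(1 for i, value in enumerate(window) if fn(i, value))  # relevancy counter
        ratios[name] = rc / ic
    return sorted(ratios.items(), key=lambda item: item[1], reverse=descending)


if __name__ == "__main__":
    candidates = {
        "all_of_last_20": lambda i, v: i < 20,
        "even_offsets": lambda i, v: i <= 18 and i % 2 == 0,
    }
    print(rank_dependency_functions(candidates, [1.0] * 20))
```

With the ascending default, the function that marks the fewest inputs as relevant is listed first; passing `descending=True` yields the opposite ordering.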

In block 508, the input event recording method associated with a best output/input dependency function according to the ordering criterion is employed when backtracing from a given output event. Data provenance is provided for a given output event using the input event recording method associated with a best dependency function according to the criterion. This may include looking up an output event in the table and determining whether a set of input event entries exist which are associated with the output data element. If the output event has associated input event entries, the set is returned as data provenance for the provenance query. Otherwise, if the output event has no associated input event entries, backtracing is used to determine data provenance for the provenance query.

Having described preferred embodiments of a system and method for provenance function window optimization (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope and spirit of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.