Kind Code:

Various aspects provide for receiving data associated with a plurality of samples. A sample generally includes data associated with one or more events. One or more traits may be determined, where a trait may be a set of or associated with one or more events. Generally, events included in a trait may be correlated (including anti-correlated) in some way. A trait may be associated with a sample, and the association may be recorded, an action may be triggered, and/or a user may be notified.

Ramberg, Charles (Los Altos, CA, US)
Application Number:
Publication Date:
Filing Date:
Primary Class:
Other Classes:
707/E17.001, 707/999.007
International Classes:

Primary Examiner:
Attorney, Agent or Firm:
Bresbo AB (Karlstad, SE)
What is claimed is:

1. A method comprising: receiving data associated with a plurality of samples, each sample including one or more events; determining one or more traits, each trait associated with a correlation among a plurality of events; associating one or more traits with one or more samples; and recording the association.



This description claims the priority benefit of U.S. provisional patent application No. 60/963,091, filed Aug. 3, 2007, entitled “Data Processing Method,” the disclosure of which is incorporated herein by reference.


1. Technical Field

The present invention relates generally to data processing.

2. Description of Related Art

Various methods exist to deconvolute data. Generally, many sources of data include multiple traits. Deconvoluting a response associated with a sample, particularly a sample having multiple and/or complex traits, may be challenging.

Increasing number and dimensionality of data may provide rich information. Combinatorial libraries, online behavior databases, large marketplaces and other sources of large amounts of data may yield information about traits associated with certain responses. However, analyzing large, complex sets of data may be challenging. Many analytical methods examine data on a “sample by sample” basis, which may not leverage information that may be present across a plurality of samples, particularly when the plurality of samples includes some degree of diversity with respect to one or more traits.


Various aspects provide for receiving data associated with a plurality of samples. A sample may generally include data associated with one or more events. One or more traits may be determined, where a trait may be a set of or associated with one or more events. Generally, events included in a trait may be correlated (including anti-correlated) in some way. A trait may be associated with a sample, and the association may be recorded, an action may be triggered, and/or a user may be notified.


FIG. 1 is a diagrammatic representation showing events as a function of sample number, according to certain aspects.


Various aspects provide for receiving a multidimensional data set and processing or analyzing the set to determine traits, trends, or characteristic features associated with the data. Typical data sets may include a plurality of samples, often with associated key/value pairs.

Table 1 is an illustrative example data set, and shows data associated with 11 samples, each identified with a sample identification S(i), or Sid, taking a value from 1 to 11. Each Sid includes a plurality of events. A sample may correspond to a substance that was measured using measurement equipment, and events may correspond to values associated with the measurement. A sample may also correspond to a security, financial instrument, person, clinical result, or any other quantified entity. In some embodiments, “sample” may correspond to anything from or within which a user wishes to identify a trait. A first sample may be a set of second samples.

A trait may be a characteristic of one or more samples, and/or a characteristic of a plurality of samples. A trait may generally be comprised of one or more events. An event may be associated with a sample. Events may include one or many measurements and/or observations. Events may include a temperature, a Raman peak, an FTIR wavenumber, a spectrum, a chromatograph, a peak, a gene, genome, a wavelength, a color, a price, a change in a quantity, a response to an input, a category, a chromosomal feature, DNA, RNA, a link in a network (e.g., a social network), one or more keywords (e.g., in a search), a database query, fracture, a fatigue response, an electrochemical response, a voltage, a sound, a question posed by an entity, a derivative of a function or security, traffic information, and fractions and/or components thereof.

For illustrative simplicity, various aspects are described herein using synthetic x-ray diffraction (XRD) data, simplified to the existence (or lack thereof) of a peak and a particular value of 2theta. Thus, events in Table 2 may correspond to “peaks,” and the value of each event may be a 2theta position associated with each peak. Other measurement data (Raman, FTIR, efficacy, concentration, price, change in price, number of members, change in a number) as well as synthesis data (composition, temperature, crystallization data) may be included. For simplicity, a two dimensional data set is used, although larger dimensionality data sets may also be used.

In some embodiments, a trait may be associated with one or more events. In certain cases, a trait is associated with a plurality of events. A trait may be associated with one or more samples; a trait may also be associated with a plurality of samples, yet not associated with any particular sample.

In some cases, the recording or measurement of an event may be accurate as received (e.g., a permanent IP address, or a stock symbol). In other cases, it may be advantageous to “bin” or categorize or group events for subsequent analysis. For example, an event may correspond to a time, recorded with microsecond accuracy, but it may be advantageous to mark two events differing by less than one millisecond as having occurred at the same time. Many measurement devices include some experimental error, and it may be advantageous to bracket each event by an “error bar” associated with this error. In some embodiments, two or more events whose value differs by less than three, two or even one error bars may be considered the same event.
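The binning described above may be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `bin_events`, the tolerance of 0.25, and the peak positions are all hypothetical, and the rule of comparing each value against the first member of the current bin is only one of several possible grouping rules.

```python
def bin_events(values, tolerance):
    """Group sorted values into bins.

    A value joins the current bin when it differs from that bin's first
    member by less than `tolerance` (assumed here to stand in for an
    error bar); otherwise it opens a new bin.
    """
    bins = []
    for v in sorted(values):
        if bins and v - bins[-1][0] < tolerance:
            bins[-1].append(v)
        else:
            bins.append([v])
    return bins


# Hypothetical 2theta peak positions pooled from several samples.
peaks = [19.0, 19.1, 19.2, 19.5, 23.4, 23.5, 30.1]
binned = bin_events(peaks, tolerance=0.25)
```

Each resulting bin may then be given an event identifier for subsequent analysis.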

Data in Table 1 may be binned for analysis, for example as shown in Table 2. Events in Table 2 have been assigned to bins based on their similarity with other events (e.g., a difference between two events being less than an error associated with a measurement of the events), annotated as EventID or E(i). Events may be binned with a finer granularity (e.g., more bins or tighter experimental error), and events may be analyzed without binning. A first event may be a plurality of second events.


A frequency of occurrence of an event across the data set may be calculated. For some data sets, assigning events to a bin may simplify the determination that two slightly different events may be treated as if they were the same event. An exemplary frequency of occurrence may correspond to a number of members in a given bin (e.g., the number of samples having an event associated with that bin). A value E(i,i) may be a self-correlation value, describing the number of times that Event (i) occurs with itself.

The co-occurrence of two or more events may be calculated (herein E(i,j)). For illustrative simplicity, co-occurrence and/or correlation is described herein using pairwise correlations, although higher order or more complex correlation functions may be implemented. For some data sets, pairwise correlation may yield sufficient results and speed calculations. For some data sets, multi-event co-correlations may provide increased accuracy and/or discrimination in trait determination.


A value for co-occurrence E(i,j) may be a number of times (e.g., in a number of samples or in the set of data) that Event (i) and Event (j) occur together. Table 3 shows a symmetrical matrix showing exemplary values for correlation between pairs of events (in this case, the count of samples for which both events occurred). More complex co-occurrence calculations are possible, including a probability associated with an event. Co-occurrence may also include additional information associated with an event (e.g., a peak intensity in addition to the existence of the peak).
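A pairwise co-occurrence count of this kind may be sketched as follows. The sample data below are hypothetical and are not the values of Table 1; the representation of each sample as a set of binned event identifiers, and the name `cooccurrence`, are assumptions made for illustration.

```python
from itertools import combinations_with_replacement


def cooccurrence(samples):
    """Count, for every pair of events (i, j), the samples containing both.

    `samples` maps a sample id to the set of binned event ids it contains.
    The diagonal entry (i, i) is the number of samples containing event i.
    Pairs that never occur together are simply absent from the result.
    """
    counts = {}
    for events in samples.values():
        for i, j in combinations_with_replacement(sorted(events), 2):
            counts[(i, j)] = counts.get((i, j), 0) + 1
            counts[(j, i)] = counts[(i, j)]  # keep the matrix symmetric
    return counts


# Hypothetical data set: four samples, events identified by integers.
samples = {1: {1, 2, 3}, 2: {1, 2}, 3: {2, 4}, 4: {1, 3}}
E = cooccurrence(samples)
```

In this sketch, `E[(1, 1)]` gives the frequency of event 1, and `E.get((1, 4), 0)` returns zero because events 1 and 4 never share a sample.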

Table 3 may also provide a graphical illustration for convenience, although various embodiments include automated analyses of these data. The body diagonal in Table 3, showing values of E(i,i), may provide information describing the frequency of an event within a data set (e.g., within the set of samples). For example, E(15, 15)=11, showing that event (15) was present in each sample. E(1,1)=6, showing the presence of Event (1) in 6/11 samples, and E(12,12)=1, showing that Event (12) occurred in only one sample.

Various embodiments provide for the assignment of events to one or more traits, and generally may provide for the determination of one or more traits associated with each sample. In some aspects, a trait may be extracted by processes somewhat like nucleation and growth, in which (for example) a pair of events forms a nucleus of a possible trait, and events are associated with the trait based on their association with events already present in the trait. In other aspects, traits may be formed with methods somewhat akin to spinodal decomposition, in which the correlation among a random combination of events is evaluated until a set of events having a small degree of correlation is determined. The set of somewhat correlated events may continually add and remove members, gradually iterating toward a final trait composition. Iteration may include adding a member when a figure of merit (e.g., a total trait self-correlation function) is improved. Iteration may also include deleting a member when deletion similarly improves this figure of merit.

Depending upon a type of trait and/or types of samples, trait analysis may incorporate the number of times that an event occurs in the sample set. In some cases, trait identification may be improved when events associated with a trait occur more frequently. In certain embodiments, an event frequency (between 0 and 1) is calculated for each Event (i) by dividing E(i,i) by the number of samples. In some embodiments, a trait identification process may focus in a first routine on events whose frequency is greater than an event frequency threshold, and in a second routine on events whose frequency is lower than a threshold. Exemplary thresholds include 0.01, 0.1, 0.5, 0.7, 0.9, and 1.
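The frequency calculation and the split into two routines may be sketched as follows. The data are hypothetical, the function names are illustrative, and placing events whose frequency equals the threshold in the second group is an assumption not stated in the description.

```python
def event_frequencies(samples):
    """Return, for each event, the fraction of samples in which it occurs."""
    n = len(samples)
    counts = {}
    for events in samples:
        for e in set(events):
            counts[e] = counts.get(e, 0) + 1
    return {e: c / n for e, c in counts.items()}


def split_by_threshold(freq, threshold):
    """Separate events above the threshold from the rest (assumed rule)."""
    frequent = sorted(e for e, f in freq.items() if f > threshold)
    infrequent = sorted(e for e, f in freq.items() if f <= threshold)
    return frequent, infrequent


# Hypothetical samples; event 15 is present in every sample.
samples = [{1, 2, 15}, {1, 15}, {2, 15}, {3, 15}]
freq = event_frequencies(samples)
frequent, infrequent = split_by_threshold(freq, threshold=0.5)
```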

A trait may include multiple events, and co-occurrence of events may be determined. Table 3 illustrates that some events always occur together, some events sometimes occur together, and some events do not occur together. For example, E(2,7)=E(7,7)=5, showing that event 2 occurs whenever event 7 occurs, although event 2 occurs once without event 7. Co-occurrence E(1,13)=0, showing that events 1 and 13 do not occur together, notwithstanding that both of these events occur separately several times (E(1,1)=6; E(13,13)=5). Similarly, E(2,14)=3, showing that event 2 occurs with event 14 in 3/6 samples, and event 14 occurs with event 2 in 3/5 samples.


In some embodiments, a first event that sometimes occurs with a second event may be annotated as a multitrait event, or an event that may be included in two or more traits (e.g., event 15) or in a sample having two or more traits.

In some embodiments, assigning events to one or more traits may begin with selecting an event (i) that has both strong co-occurrence with first other events (j) (e.g., E(i,j)=1) and strong anti-occurrence with second other events (k) (e.g., E(i,k)=0). An exemplary selection may include selecting one or more events (i) for which the standard deviation of the co-occurrence of the event (i) with a plurality of (e.g., all) other events (j) is a maximum. In certain cases, normalizing this standard deviation to the number of times that E(i) occurs may improve comparison among different events. Exemplary events (i) having large values for (e.g.,) the standard deviation of (E(i,j) over all j)/E(i) are events 6, 7, 12, 13, 18-21, 23 and 25. In some cases, a first trait may be nucleated with two or more events having strong correlation, and a second trait may be nucleated with an event having a strong anti-correlation with one or more events in the first trait. For example, E(6,7)=E(6,13)=0, suggesting that events 6 and 7 would be associated with different traits, and events 6 and 13 would be associated with different traits. Thus, a first trait may include event 6 and a second trait may include event 7 and/or event 13. Comparing events 7 and 13: E(7,13)=E(7,7)=E(13,13)=5, suggesting that events 7 and 13 are strongly correlated, and thus the second trait may include events 7 and 13.

In some embodiments, multitrait events are identified. These events may be associated with several traits and/or samples with several traits. In some cases, events (i) having small values for the standard deviation of (E(i,j) over all j)/E(i) may be associated with multitrait events. Exemplary events include events 1, 5, 15, 16, and 17. In certain cases, a particularly low value (e.g., event 15) may be associated with an event that corresponds to a background, a substrate, a standard, a control, and/or another feature associated with many (or even all) samples.
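The selection criterion described above may be sketched as follows. This is an illustration under stated assumptions: the data are hypothetical, the population standard deviation (`statistics.pstdev`) is used although the description does not specify which form of standard deviation is intended, and the names `count` and `nucleation_scores` are illustrative.

```python
from statistics import pstdev


def count(samples, i, j):
    """Number of samples containing both event i and event j."""
    return sum(1 for s in samples if i in s and j in s)


def nucleation_scores(samples):
    """Standard deviation of E(i,j)/E(i,i) over all other events j."""
    events = sorted(set().union(*samples))
    scores = {}
    for i in events:
        ratios = [count(samples, i, j) / count(samples, i, i)
                  for j in events if j != i]
        scores[i] = pstdev(ratios)
    return scores


# Hypothetical samples: events 1-2 and 3-4 form two groups, while
# event 9 appears in every sample (a background-like event).
samples = [{1, 2, 9}, {1, 2, 9}, {3, 4, 9}, {3, 4, 9}]
scores = nucleation_scores(samples)
```

In this sketch, events 1 through 4 receive a large score and may be candidates for nucleation, whereas event 9 receives a score of zero and may be treated as a multitrait or background event.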

In some aspects, a trait may be nucleated with one or more events (i) having a boolean correlation response with other events (j) (e.g., E(i,j) is either 0 or 1 for all j). In some embodiments, a trait may be nucleated by selecting an event that appears a small number of times (where small may largely be determined by the size of the data set). For example, a distribution of values of E(i,i) may be calculated, and events (j) for which E(j,j) is in the bottom 1%, 5%, 10% or 30% of the distribution may be selected. Certain selections appear once. Exemplary selections using the present example data may be events for which E(i,i)=1, such as events 12, 19, 23, and 25. In certain embodiments, a trait may be nucleated by selecting a first event that appears rarely and combining the first event with a second event that always occurs with the first event, to form a nucleus of two events. In some cases, the second event also appears rarely.

In some aspects, it may be advantageous to normalize co-occurrence of events (e.g., the co-occurrence matrix). For a pairwise correlation example, E(i,j) may be normalized to E(i) and/or E(j), depending upon a trait being analyzed. Table 4 shows an asymmetrical co-occurrence matrix in which the upper right quadrant shows values of E′(i,j) that have been normalized to the lower of E(i,i) and E(j,j). Events may also be normalized to the higher of E(i,i) or E(j,j), and in some cases, a trait nucleus may be created from events for which E(i,i)=E(i,j)=E(j,j) for two or more events in the nucleus.

Data in the upper right quadrant have been rounded to one decimal place for convenience. In some embodiments, it may be advantageous to round these data to either 0 or 1, and in some such cases, subsequent processing may be performed with integers, which may speed certain calculations.

In some cases, it may be advantageous to create a set of trait nuclei, wherein each nucleus includes one or more events, and an event never occurs with an event in another nucleus (i.e., E′(i,j)=0 for i and j in different nuclei). For example, events 6, 7, and 8 never occur together, and thus each could become a nucleus (termed N6, N7, and N8) for a different trait. Similarly, events 13, 22, 23 could each nucleate a different trait, forming nuclei (termed N13, N22, and N23).

Nuclei may be combined. A first set of trait nuclei (N6, N7, N8) may be compared with a second set of trait nuclei (N13, N22, N23) to examine their correlation, and nuclei may be merged and/or maintained as discrete traits. Such an analysis in this case may yield a first nucleus with events 6 and 22 (because E′(6,22)=1 and E′(6,13)=E′(6,23)=0); a second nucleus with events 7 and 13 (E′(7,13)=1, E′(7,22)=E′(7,23)=0); and a third nucleus with events 8 and 23 (E′(8,23)=1, E′(8,22)=0.3, E′(8,13)=0).
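The comparison of two sets of nuclei may be sketched as follows, using the normalized co-occurrence values quoted in the preceding paragraph. The function name `merge_nuclei` and the rule of merging only when the normalized co-occurrence meets a threshold of one are assumptions made for illustration; other merge rules may be used.

```python
def merge_nuclei(first, second, e_norm, threshold=1.0):
    """Pair each nucleus in `first` with nuclei in `second` whose
    normalized co-occurrence meets the threshold (assumed merge rule)."""
    merged = []
    for i in first:
        partners = [j for j in second if e_norm.get((i, j), 0.0) >= threshold]
        merged.append({i, *partners})
    return merged


# Normalized values as quoted in the description for events 6, 7, 8
# against events 13, 22, 23.
e_norm = {
    (6, 22): 1.0, (6, 13): 0.0, (6, 23): 0.0,
    (7, 13): 1.0, (7, 22): 0.0, (7, 23): 0.0,
    (8, 23): 1.0, (8, 22): 0.3, (8, 13): 0.0,
}
nuclei = merge_nuclei([6, 7, 8], [13, 22, 23], e_norm)
```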

As shown in Table 4, several normalized E′(i,j) values are equal to one, showing that for this normalization, certain events occur as often with another event as they themselves occur. For example, E(1,4)=E(4,4)=4, showing that when Event(4) occurs, Event(1) also occurs, so E′(1,4)=1. However, the converse is not true in this case, (which may be represented by a normalization matrix wherein the denominator is the larger of E(i,i) and E(j,j)), and Event (1) occurs twice in samples that do not display Event (4).
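The two normalizations discussed above may be sketched as follows, using co-occurrence values quoted in the description (E(1,1)=6, E(4,4)=4, E(1,4)=4, E(2,2)=6, E(14,14)=5, E(2,14)=3). The function name `normalized` and its `mode` parameter are hypothetical.

```python
def normalized(E, i, j, mode="min"):
    """Normalize E(i,j) to the lower ("min") or higher ("max") of
    E(i,i) and E(j,j)."""
    pick = min if mode == "min" else max
    return E[(i, j)] / pick(E[(i, i)], E[(j, j)])


# Values quoted in the description.
E = {(1, 1): 6, (4, 4): 4, (1, 4): 4,
     (2, 2): 6, (14, 14): 5, (2, 14): 3}

low_1_4 = normalized(E, 1, 4, mode="min")    # event 4 always occurs with event 1
high_1_4 = normalized(E, 1, 4, mode="max")   # event 1 also occurs without event 4
low_2_14 = normalized(E, 2, 14, mode="min")
high_2_14 = normalized(E, 2, 14, mode="max")
```

Normalizing to the lower value yields one whenever the rarer event never occurs without the other, while normalizing to the higher value yields one only when the two events always occur together.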

Various embodiments may be illustrated by focusing on normalized values of E′(i,j)=1, as highlighted in Table 4. In certain embodiments, two or more events are grouped together into a trait based on their co-occurrence.

In some embodiments, a trait may be constructed (e.g., a nucleus may be created) from two or more co-occurring events, and in some cases, a trait includes a set of events in which events within the set have a co-occurrence greater than a threshold (e.g., E′(i,j)>0.4, 0.6, 0.8, 0.9). A nucleus may be comprised of events for which E′(i,j)=1. In some cases, each pair of events in a trait has a co-occurrence greater than a threshold. In other cases, a correlation across multiple (e.g., two or more) events in a nucleus may be greater than a threshold.

Growth (e.g., a more thorough description) of a trait may include the addition of an event to a set of events associated with the trait (e.g., a nucleus). In some cases, an event may be added to more than one nucleus. In other cases, an event is added to a first nucleus and is annotated so that it is not added to a second nucleus.

Table 4. E′(i,j); upper right quadrant normalized; body diagonal and lower left quadrant not normalized.

A set of “possible additions” to a nucleus or set of events associated with a trait may contain one or more events E(k) for which E(i,k) is greater than the threshold. Co-occurrence of each event E(k) with the other members (j) of the trait may then be examined. A “possible addition” may be added to the trait when E(j,k) is also greater than the threshold, and the set (or trait) may expand to include members E(i), E(j), and E(k), for which the co-occurrences of these events are all greater than the threshold.

For example, consider three previously identified trait nuclei (herein traits). Trait-6-22 includes events 6 and 22. Trait-7-13 includes events 7 and 13. Trait-8-23 includes events 8 and 23. Various methods to determine a set boundary may be used to assign other events in the data set to one or more traits. In some embodiments, a list of possible additions to a trait may include events (outside the trait) that have a strong correlation (e.g., E(i,j) greater than 0.6, 0.8, 0.9, or equal to 1) with one or more events in the trait. A possible addition may be selected from this list, and a correlation between the selected possible addition and other members of the trait may then be examined. The selected possible addition may be added when the correlation is above a threshold. Using Trait-6-22 as an example, a set of possible additions to the trait may include a set of events (j) for which E′(6,j) and/or E′(22,j) are greater than or equal to a threshold (which may be chosen to be one). For example, events j=9, 10, 12, 14, 15, 16, and 17 have E′(6,j)=1. Of these, events j=12, 15, 16, and 17 have E′(22,j)=1, and E′(10,22)=E′(14,22)=0.7. Thus events 9, 10, 12, 14, 15, 16, and 17 may be added to the trait using a threshold of 0.7, and events 12 and 17 may be added using a threshold of 1. When a new event k is added to a trait comprising events (i), a new set of possible additions l (e.g., for which E(k,l)=1) may then be compared (e.g., by examining E(i,l)) to determine whether a particular event (l) is then added to the trait. Such an expansion of trait-6-22 may result in member events 3-6-9-10-12-14-15-17-22.
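Growth of a trait from a nucleus may be sketched as follows. The samples are hypothetical and do not reproduce Table 1; requiring a candidate to meet the threshold against every current member of the trait is one possible acceptance rule, chosen here for simplicity, and the names `cooccur`, `e_norm`, and `grow_trait` are illustrative.

```python
def cooccur(samples, i, j):
    """Number of samples containing both event i and event j."""
    return sum(1 for s in samples if i in s and j in s)


def e_norm(samples, i, j):
    """Co-occurrence normalized to the lower of E(i,i) and E(j,j)."""
    low = min(cooccur(samples, i, i), cooccur(samples, j, j))
    return cooccur(samples, i, j) / low if low else 0.0


def grow_trait(nucleus, samples, threshold):
    """Add events until no remaining candidate meets the threshold
    against all current members of the trait (assumed rule)."""
    trait = set(nucleus)
    candidates = set().union(*samples)
    changed = True
    while changed:
        changed = False
        for k in sorted(candidates - trait):
            if all(e_norm(samples, k, m) >= threshold for m in trait):
                trait.add(k)
                changed = True
    return trait


# Hypothetical samples; event 9 is present in every sample.
samples = [{1, 2, 3, 9}, {1, 2, 3, 9}, {1, 2, 9}, {4, 5, 9}, {4, 5, 6, 9}]
trait_a = grow_trait({1, 2}, samples, threshold=1.0)
trait_b = grow_trait({4, 5}, samples, threshold=1.0)
```

In this sketch the background-like event 9 is added to both traits, which mirrors the behavior described for an event present in every sample.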

Events having different levels of correlation with one or more existing members of the set may be used in assigning an importance or strength of each event with respect to a trait. For example, “core” events may be events that only occur with other members of the trait (e.g., events i for which normalized E′(i,j)=1 for all j in the trait), such as events 6 and 12. Some core events do not have a correlation with events outside the trait. Some events may be associated with many (or even all) traits (e.g., event 15), and in some cases, such events may not provide discriminatory information and so may be omitted from further analysis. “Associated” events (i) may be events that may occur in other traits or events that may occur in a sample having multiple traits. In some embodiments, the existence of an associated event within a sample may signify the presence of more than one trait. In some cases, association may be manifest as E′(i,j)<1 for one or more j within the trait. Some events may have a strong effect on a traitwide correlation coefficient and may be annotated as “key” events. For example, trait 6-22 may include key event 22 with a threshold of 0.7. However, the correlation of event 22 with each of events 3, 10 and 14 is 0.7. Removal of event 22 may then result in a set of events having E′(i,j)=1 for the remaining events. Thus, trait-6-22 may include core events 3, 6, 9, 10, 12, 14, 15, and 17, and associated or key event 22, based on E′(3,22)=0.7, E′(9,22)=0.7, E′(10,22)=0.7, and E′(14,22)=0.7.
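Identification of a key event may be sketched as follows. The sketch assumes that the traitwide correlation coefficient is the mean of the pairwise normalized co-occurrences, which the description does not specify, and it uses a hypothetical four-event subset of trait 6-22 with the values 1 and 0.7 quoted above. The function names are illustrative.

```python
from itertools import combinations


def mean_pairwise(trait, e_norm):
    """Mean normalized co-occurrence over all pairs in the trait
    (an assumed figure of merit)."""
    pairs = list(combinations(sorted(trait), 2))
    return sum(e_norm[p] for p in pairs) / len(pairs)


def key_event(trait, e_norm):
    """Event whose removal most increases the figure of merit."""
    return max(sorted(trait), key=lambda e: mean_pairwise(trait - {e}, e_norm))


# Hypothetical subset of trait 6-22; keys are (lower id, higher id).
e_norm = {(3, 6): 1.0, (3, 10): 1.0, (6, 10): 1.0,
          (3, 22): 0.7, (6, 22): 1.0, (10, 22): 0.7}
trait = {3, 6, 10, 22}
key = key_event(trait, e_norm)
merit_without_key = mean_pairwise(trait - {key}, e_norm)
```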

Trait-7-13 may be grown to include events 2, 5, 7, 9, 13, 15, and 17, of which events 2, 7, 13, 15, and 17 may be “core” events, and events 5 and 9 may be associated events. In some embodiments, events such as event 5 may be a key event, which may be an event whose inclusion/exclusion in a trait has a particularly large effect on the correlation among one or more traits.

Trait-8-23 may be grown to include core events 1, 4, 8, 11, 15, 16, 18, 19, 21, 23, 24, and 25.

Various embodiments may assign an event to one or more traits (or to no trait). In some cases, assignment may include a strength of association with a trait.

One or more traits may be assigned to a sample. Events in a sample may be evaluated to determine whether they are associated with one or more traits. In some cases, each event in a sample is assigned one or more traits (or assigned to no trait). The strength of a trait's representation within a sample may be determined (e.g., by determining how many events associated with the trait are present in the sample). A sample displaying only one trait may be annotated. A sample displaying multiple traits may be annotated. In some cases, samples may be annotated for further analysis, which may include an iterative re-examination of one or more samples associated with a trait.
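Determining the strength of each trait within a sample may be sketched as follows. The trait memberships are those given in the description for the grown traits; the sample, the function name `trait_strengths`, and the cutoff of 0.4 used to decide that a trait is present are hypothetical.

```python
def trait_strengths(sample_events, traits):
    """Fraction of each trait's events that are present in the sample."""
    return {name: len(sample_events & members) / len(members)
            for name, members in traits.items()}


# Trait memberships as listed in the description.
traits = {
    "6-22": {3, 6, 9, 10, 12, 14, 15, 17, 22},
    "7-13": {2, 5, 7, 9, 13, 15, 17},
    "8-23": {1, 4, 8, 11, 15, 16, 18, 19, 21, 23, 24, 25},
}

# Hypothetical sample, not taken from Table 1.
sample = {2, 5, 6, 7, 13, 15, 17, 22}
strengths = trait_strengths(sample, traits)

# Assumed cutoff: a trait is treated as present at 0.4 or above.
present = sorted(name for name, s in strengths.items() if s >= 0.4)
is_multi_trait = len(present) > 1
```

A sample for which more than one trait exceeds the cutoff may be annotated as a multi-trait sample. Events present in every sample (e.g., event 15) contribute to every trait and may be excluded before computing strengths.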

FIG. 1 is a diagrammatic representation showing events as a function of sample number, according to certain aspects. For illustrative purposes, the trait(s) associated with each event are annotated. Event 15, occurring within each sample, is not annotated.

Samples may be highlighted and/or annotated for further analysis by evaluating a representation of a sample's response by one or more traits (e.g., fitting traits to a sample). Absence of a trait associated with an event may be a marker. For example, samples 3 and 4 include event 20, which is unassigned to a trait. Cluster analysis of the E(20,i) within the trait shows that event 20 co-occurs with events 1, 4, 8, 11, 15, 16, and 18, but not with events 21, 23, 24, and 25. Thus, event 20 may be a marker for finer granularity within trait 8-23. A sample for which no events are assigned to more than one trait may be marked as a single-trait sample (e.g., sample 1). A sample having events assigned to more than one trait may be marked as a multi-trait sample (e.g., sample 8). A sample having many (or even all) of the core events associated with a trait may be marked as having a core of the trait.

Sample analysis may include determining how many events associated with a given trait are present in a sample (having at least one event of that trait). In some cases, samples may be marked for re-analysis (e.g., iteration) when a trait present in the sample has only a small fraction (e.g., 20%, 10%, or even one of) the events of a given trait. For example, events 1 and 16 are associated with trait 8-23 and trait 6-22. Samples 1-3 show several events associated with trait 8-23. However, events 1 and 16 are the only representatives of trait 8-23 in samples 10 and 11. Cluster analysis of event 1 may show that a single bin capturing all instances of event 1 may be inaccurate, and samples 10 and 11 have values (19.5, 19.5) different than those of samples 1, 2, 3, 4 (19.2, 19.1, 19.0, 19.0, resp.). In some aspects, sample analysis may include re-binning and subsequent re-analysis. In certain embodiments, binning includes calculating an average value for events assigned to a bin, calculating a maximum error associated with an allowed deviation from the average value, and assigning an event to a bin when the event falls within the allowed deviation from the average.
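The averaging form of binning described above may be sketched as follows, using the six values of event 1 quoted in the description. The allowed deviation of 0.2 and the function name `rebin` are assumptions; the description does not state a particular deviation.

```python
def rebin(values, max_dev):
    """Assign each value to the first bin whose running average lies
    within `max_dev` of the value; otherwise open a new bin."""
    bins = []
    for v in sorted(values):
        for b in bins:
            if abs(v - sum(b) / len(b)) <= max_dev:
                b.append(v)
                break
        else:
            bins.append([v])
    return bins


# Values of event 1 in samples 1, 2, 3, 4, 10, and 11, as quoted above.
event_1_values = [19.2, 19.1, 19.0, 19.0, 19.5, 19.5]
bins = rebin(event_1_values, max_dev=0.2)
```

With this assumed deviation, the values from samples 1 through 4 fall into one bin and the values from samples 10 and 11 fall into a second bin.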

Various steps (including binning, nucleation and growth) may be performed iteratively. For example, after sample analysis identifies event 1 as previously described, a re-binning analysis (e.g., cluster analysis, averaging/exclusion and the like) may be performed. Re-binning may separate event 1 into event 1′ (associated with samples 10, 11) and event 1″ (associated with samples 1-4), and subsequent nucleation/growth may then result in assignment of event 1′ to trait 6-22, and assignment of event 1″ to trait 8-23. Following such a procedure, traits may be associated with each sample as shown in Table 5.

Table 5

Sample    Associated traits and notes
1         8-23, largest number of 8-23 events, event 5 (key event, 7-13 trait)
2         8-23, event 5 (key event, 7-13 trait)
3         8-23, one 7-13 event, one 6-22 event, event
4         8-23, three 7-13 events, event 5 (key event, 7-13 trait), event 20
5         7-13, two 6-22 events (one also not 7-13), event 5 (key event, 7-13 trait)
6         7-13, two 6-22 events (one also not 7-13), event 5 (key event, 7-13 trait)
7         7-13, 6-22 (several events each), event 5 (key event, 7-13 trait)
8         7-13, 6-22 (several events each), event 5 (key event, 7-13 trait)
9         7-13, 6-22 (several events each), event 5 (key event, 7-13 trait)
10        6-22, one 7-13 (also 6-22)
11        6-22, one 7-13 (also 6-22), largest # of 6-22

In some embodiments, a trait may be matched to a known or canonical trait (e.g., a known diffraction pattern). In some cases, the correlation of an associated event with a trait may be increased or decreased based on the event's presence in the canonical trait. For some data sets, a “pre-binning” step may be performed. In such cases, a coarse grouping or clustering of samples sharing at least some feature may speed processing. For example, a library of crystallization experiments associated with an API may be analyzed as a library, as opposed to being analyzed simultaneously with libraries associated with other APIs. In other cases, many APIs may be analyzed simultaneously. For spectral data, a minimum overlap of spectra may be used for pre-binning.

Various embodiments provide for storing, notifying a user, triggering a process (e.g., a followup procedure), presenting an advertisement, and/or other action upon the identification of a trait, identification of a sample having a trait, and/or identification of a trait (or absence of a trait) within a sample. In some cases, a number of samples displaying a trait may trigger an event (e.g., a single sample displaying a trait, or a certain critical number of samples displaying a trait). In some embodiments, a time associated with each sample may be recorded, and in some cases, a sample displaying a ‘new’ trait may be annotated (and optionally an action triggered).

Various embodiments provide for a computing device including a processor, memory, storage, input, output, display, networking hardware, communications circuitry (including wireless communications) and storage media. Certain embodiments include a computer-readable storage medium having embodied thereon a program executable by a processor configured to perform a method.

The above description is illustrative and not restrictive. Many variations of the invention will become apparent to those of skill in the art upon review of this disclosure. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents.