Evaluating the validity of systematic reviews to identify empirically supported treatments.
Article Type:
Academic achievement (Research)
Educational programs (Management)
Slocum, Timothy A.
Detrich, Ronnie
Spencer, Trina D.
Name: Education & Treatment of Children Publisher: West Virginia University Press, University of West Virginia Audience: Professional Format: Magazine/Journal Subject: Education; Family and marriage; Social sciences Copyright: COPYRIGHT 2012 West Virginia University Press, University of West Virginia ISSN: 0748-8491
Date: May, 2012 Source Volume: 35 Source Issue: 2
Event Code: 310 Science & research; 200 Management dynamics Computer Subject: Company business management
Geographic Scope: United States Geographic Code: 1USA United States

Full Text:

The best available evidence is one of the three basic inputs into evidence-based practice. This paper sets out a framework for evaluating the quality of systematic reviews that are intended to identify empirically supported interventions as a way of summarizing the best available evidence. The premise of this paper is that the process of reviewing research literature and deriving practical recommendations is an assessment process similar to the assessment process that we use to understand student performance and derive educational recommendations. Systematic reviews assess the quality and quantity of evidence related to a particular intervention and apply standards to determine whether the evidence is sufficient to justify an endorsement of the intervention as "empirically supported." The concepts and methodological tools of measurement validity can be applied to the systematic review process to clarify its strengths and weaknesses. This paper describes ways in which these concepts and tools can be brought to bear on systematic reviews, and explores some of the implications of doing so.

Arguments for evidence-based practice (EBP) in education proceed from the basic value statement that the selection of educational strategies, materials, and programs is a high-stakes decision with socially important implications. The difference between more and less effective interventions makes meaningful differences in the lives of children, families, and society at large. If we agree that these decisions are important, then they should be made in the most effective manner possible. Evidence-based practice asserts that important decisions should be based on three kinds of inputs: (a) the best available evidence, (b) clinical expertise, and (c) client values and context (APA Presidential Task Force, 2006; Sackett, Rosenberg, Gray, Haynes, & Richardson, 1996; Whitehurst, 2007; see also Spencer, Detrich, & Slocum, 2012 [this issue]). The challenge for implementing a system of EBP is to develop strategies and procedures to identify each of these inputs in ways that make them useful in the decision-making process that is the essence of EBP. An earlier paper in this special issue (Slocum et al., 2012 [this issue]) explores several strategies for summarizing the best available evidence in ways that are useful to educators. That paper also acknowledged that any general strategy for summarizing the best available evidence might be carried out with greater or lesser quality (see also Wilczynski, 2012 [this issue]). Each of these strategies could be implemented in a very high quality manner and result in clear recommendations to practitioners that would improve outcomes for children; however, it is also true that each approach could be implemented poorly and could generate unclear recommendations or recommendations that would not improve outcomes. Advocates of EBP must be acutely concerned about the quality of these reviews and the validity of their outcomes.

The premise of this paper is that the process of reviewing research literature and deriving practical recommendations is an assessment process similar to the assessment process that we use to understand student performance and derive educational recommendations. When we plan student assessment, we must select appropriate tests. There are many different kinds of tests, each of which provides a different kind of information. For example, reading skill might be assessed with a curriculum-based measure, norm-referenced tests of decoding and comprehension, and analytical tests that identify specific skill strengths and weaknesses. Each type of test provides a different kind of information, and each may be useful in gaining a full understanding of a student's reading skill. A teacher would often be best informed by integrating the information from all three. And within each of these general types, some tests may be more valid for our purposes than others -- that is, some tests are more likely than others to give clear information that enables us to understand the student's skill and intervene more effectively.

The process of reviewing the available evidence, identifying the best of this evidence, and providing summaries for practitioners can be seen as an assessment process analogous to that of testing a student and deriving recommendations for intervention. This analogy is mapped in Table 1. In both cases, we begin with a complex construct--a student's reading skill on the one hand and the best available evidence on the other. In both cases, we need some systematic approach to understand the construct--we know that casual observation may not be an adequate guide to the important decisions that must be made. We apply a measurement system and derive scores--with students that process is educational testing that yields scores describing their reading skill; with best available evidence, that process is systematically reviewing research and deriving ratings that describe how well various treatments are supported by the evidence. We combine these test results with other kinds of information and considerations in a problem-solving process. In both cases, the assessment results should inform a decision-making process--it would be inappropriate to assume that test results determine a decision by themselves. When working with a reading problem, test results must be interpreted along with other information on the student's skills and background to decide on interventions; when working with the best available evidence, results of a systematic review should be combined with clinical expertise and contextual information to select treatments. The two processes lead to important decisions that impact children; therefore, we must be concerned about whether they work as well as possible. Both processes deserve careful scrutiny and ongoing improvement.

The literature on EBP has included numerous descriptions of systems for reviewing literature (e.g., Best Evidence Encyclopedia [BEE], n.d.; Cook, Landrum, Cook, & Tankersley, 2008; Gersten et al., 2005; Horner et al., 2005; Slavin, 2008; What Works Clearinghouse [WWC], 2008), critiques of particular review systems (e.g., Schoenfeld, 2006), and several important analyses of current systems along with recommendations for improvement (Briggs, 2008; Cook, Tankersley, & Landrum, 2009; Confrey, 2007). However, this discussion of review methods has not been organized and it lacks an overall framework. If we think of these systematic reviews as measurement processes, we can bring the sophisticated methods of measurement validity to bear on questions of whether the review systems are adequate and how they might be improved. This paper sets out a framework for evaluating the quality of systematic reviews that are intended to identify empirically supported treatments as a way of summarizing the best available evidence. We explore how the concepts and methods of measurement validity might be applied to the process of reviewing and learning from the best available evidence--a critical component of EBP.

The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 1999) define validity as "the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests" (p. 9). That is, validity is based on both empirical findings and logical analysis, and it is concerned with the interpretations of test scores that have implications for taking action (i.e., how we use the test results). By this definition, validity is not some arcane technical requirement important only to data nerds and ivory tower professors; it has everything to do with figuring out whether our measures actually enable us to take effective action to improve the lives of students. The point is that we should not assume that test results are an adequate basis for making important decisions unless we have evidence that they are up to the job. And this is exactly the sense of measurement validity that must be applied to EBP. We must ask thoughtful questions about how well systematic reviews enable educators to identify and implement the most effective interventions to improve outcomes for students. A validity review is a process of carefully examining a test and determining how well evidence and logic support the ways that its results are interpreted and used. The field of measurement validity has developed an elaborate set of strategies and methods for critically examining tests and gathering information about their adequacy. We believe that applying these methods to systematic reviews will support and strengthen the EBP movement.

Benefits of Applying a Validity Framework to EBP Reviews

Thoughtful application of the concepts and methods of measurement validity to systems for identifying and summarizing best available evidence may have several benefits, including (a) supporting a careful stance toward claims that a particular treatment is supported by the best available evidence and providing an organized way of talking about these issues, (b) providing a basis for determining the level of confidence that we should place in results from a systematic review, (c) allowing comparisons with alternative review methods (e.g., various styles of systematic reviews and meta-analyses vs. practice reports vs. narrative reviews), (d) identifying weak aspects in any given review method and strengthening those aspects, (e) suggesting ways to use multiple methods in a coordinated system to compensate for weaknesses in each, and (f) buttressing the perceived legitimacy of EBP. Each of these purposes is examined in greater detail below.

Perhaps the most basic and general benefit of approaching reviews from the perspective of measurement validity is the understanding that no measurement system (i.e., test) is perfect, and this includes measures of evidence related to an educational intervention (i.e., systematic reviews). Although these systems are designed to produce results that are as valid as possible, all such systems will produce some false positive and some false negative results. That is, they will inevitably support some treatments that are not effective (false positive) and fail to support some treatments that are actually effective (false negative). The possibility of error in these outcomes is a result of the fact that the best available evidence is never complete and sufficient, and that identifying and interpreting best available evidence is a highly complex process (see also Wilczynski, 2012 [this issue]).
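The four possible outcomes of a review can be made concrete with a small tally. The sketch below is illustrative only: the treatment labels and the "truly effective" column are invented, and in practice true effectiveness is never directly observable, which is exactly why validity evidence is needed.

```python
from collections import Counter

# Hypothetical data: (treatment, endorsed_by_review, truly_effective).
# The third value is an assumption made for illustration; it cannot be
# known with certainty for any real treatment.
reviews = [
    ("A", True, True),
    ("B", True, False),   # review endorses an ineffective treatment
    ("C", False, True),   # review fails to endorse an effective treatment
    ("D", False, False),
    ("E", True, True),
]


def classify(endorsed: bool, effective: bool) -> str:
    """Label one review outcome using the standard four-way classification."""
    if endorsed and effective:
        return "true positive"
    if endorsed and not effective:
        return "false positive"
    if not endorsed and effective:
        return "false negative"
    return "true negative"


counts = Counter(classify(endorsed, effective) for _, endorsed, effective in reviews)

for outcome, n in sorted(counts.items()):
    print(f"{outcome}: {n}")
```

A validity study would aim to estimate how often each kind of error occurs, not to eliminate error entirely.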

No measurement system is perfect--and fortunately measurement systems need not be perfect to improve decision-making. One important question is how to best work with a necessarily imperfect system for measuring the extent to which treatments are supported by the best available evidence. Because of the possibility of false positives and false negatives, results from these systematic reviews should not be assumed to be "objectively true" statements; they are measurement results that should be taken seriously but not followed blindly. We argue that reviews can be used most effectively if they are recognized as current best approximations rather than final truths, the sources of possible error are identified and understood, and review methods are continually improved.

Understanding the level of confidence that is warranted by a review process is important because in EBP systems, decisions are not based solely on research results. They also draw from professional judgment and contextual concerns. If a review of the best available evidence gives a clear and well-validated result, the result should be given heavy weight; but if the review of research is less definitive, then factors such as professional judgment should be given greater latitude. It serves no one's interests to present review results in a way that suggests that they are more certain and definitive than they actually are. By investigating the validity of review processes, we can better understand how much confidence we should place in their results.

Since there are multiple strategies for identifying and summarizing the best available evidence and each strategy has many variations (Slocum, Spencer, & Detrich, 2012 [this issue]), making comparisons among alternative review methods is important. In the absence of evidence, we cannot assume that the particular style of systematic review that has come to be strongly associated with EBP is the most valid strategy for linking best available evidence to practice recommendations (for a description of systematic reviews, see Slocum et al., 2012 [this issue]). Meta-analyses, for example, often consider a broader sample of literature on a given topic because they typically do not limit their scope to the most methodologically rigorous research, but rather treat methodological quality as a variable to be evaluated. In addition, meta-analyses typically derive an overall average effect size for a particular intervention rather than counting the number of studies in various categories. Therefore, rather than simply assuming that one style of review is superior for representing the best available evidence, we should carefully examine how well various candidates do this job. We are not arguing that the review systems typically used as a basis for EBP are weak, only that they should not be assumed to be the most effective methods without careful analysis.
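The contrast between counting studies and averaging effect sizes can be illustrated with a short sketch. Everything here is an assumption for illustration: the four studies are invented, the variance formula is the common large-sample approximation for a standardized mean difference with equal group sizes, and fixed-effect inverse-variance weighting is only one of several ways a meta-analysis might pool results.

```python
import math

# Hypothetical studies: (standardized mean difference d, participants per group).
studies = [(0.40, 20), (0.35, 25), (0.45, 15), (0.38, 200)]


def variance_of_d(d, n_per_group):
    """Large-sample approximation to the sampling variance of d (equal groups)."""
    n1 = n2 = n_per_group
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def is_significant(d, n_per_group, z_crit=1.96):
    """Two-sided z test of d against zero at roughly the .05 level."""
    return abs(d) / math.sqrt(variance_of_d(d, n_per_group)) > z_crit


def vote_count(studies):
    """Count the studies whose effect is individually significant."""
    return sum(is_significant(d, n) for d, n in studies)


def fixed_effect_mean(studies):
    """Inverse-variance weighted average effect size."""
    weights = [1 / variance_of_d(d, n) for d, n in studies]
    weighted_sum = sum(w * d for w, (d, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


print("significant studies:", vote_count(studies), "of", len(studies))
print("pooled effect size:", round(fixed_effect_mean(studies), 2))
```

With these invented numbers, only the largest study reaches significance on its own, yet all four studies point to a similar effect. A count of significant studies and a pooled effect size would therefore describe the same body of evidence quite differently.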

Evaluating the validity of review methods may promote ongoing refinement and improvement of these methods. The systematic review style associated with EBP is quite new to education and can undoubtedly be improved. All systems for reviewing best available evidence involve many trade-offs and risks (see also Wilczynski, 2012 [this issue]). It would be beneficial to carefully examine how these trade-offs affect the results of reviews so that we can improve our review systems over time.

When data-based judgments are high stakes, we often insist on multiple sources of data. For example, the decision to find that a student is eligible for special education services must include information from multiple sources. Clearly, decisions to adopt educational programs and interventions are high stakes. Some, such as adoption of a core language arts curriculum, may affect tens of thousands of students for years to come. Validity studies may indicate how we can best integrate information from multiple sources. For example, different review methods may have distinct strengths and weaknesses, so if these distinct methods converge on the same conclusion, we could have greater confidence in that outcome.

The ultimate purpose of EBP is to improve the quality of educational practices used with students and thereby improve student outcomes. The technical quality of the recommendations contributes to this outcome, but it cannot be realized without broad buy-in from educational decision makers at all levels. Among the key stakeholders are (1) producers of educational programs including researchers, educational developers, and publishers, (2) educational administrators, (3) teachers and other service providers, (4) political leaders, and (5) the general public. If recommendations based on best available evidence are seen as biased or ineffective, EBP efforts may encounter ongoing opposition and may be de-legitimized as another educational fad. Systematic validity studies may provide support for the legitimacy of EBP among these key constituencies.

The current paper outlines some of the validity issues and methods that are relevant to systematic reviews. We describe a validity framework for evaluating systematic reviews. We do not offer a validity study of any particular review system, but rather, describe issues that must be addressed in a validity study and explore some methods that could be used to address these issues.

Defining the construct

The validity framework offers numerous useful concepts for understanding systematic reviews that identify empirically supported treatments. This framework is based on the idea that we are attempting to measure a construct. One of the first tasks in thinking about validity is to clearly define and analyze the construct to be measured (AERA et al., 1999; Kane, 2006) -- this is a matter of being very clear about exactly what we are trying to measure. As we noted earlier, in the case of typical educational assessment, a target construct might be reading skill. In systematic reviews of educational treatments, the target construct is the best available evidence. In both cases, the target construct is complex and cannot be measured directly. Reading skill is generally considered to have multiple dimensions (e.g., decoding, fluency, comprehension) and it is understood that any single observation of a student reading may be influenced by factors outside of his/her reading skill (fatigue, background knowledge in the content, difficulty of the text, etc.). A clear understanding of the construct reading skill is necessary for judging the validity of a reading test. Similarly, a clear understanding of the construct best available evidence is a foundation for understanding the validity of systems that claim to measure it.

Definitions of EBP are our starting point. A previous article in this volume discussed definitions of EBP (Spencer et al., 2012 [this issue]). Our focus in this paper is on the best available evidence. This construct has several dimensions. The phrase best available implies a continuum of quality of evidence -- there is better quality evidence and lesser quality evidence. We might think about quality of evidence in at least two ways: (1) Best can refer to methodological quality, and (2) best can refer to relevance to our specific application. Both of these meanings of best are quite complicated and the full meaning of best includes both. There are many scales for rating methodological quality -- such a scale is required of a systematic review intended to identify empirically supported treatments. However, there has been relatively little explicit discussion of what we really mean by methodological quality and how rating systems correspond with this. Best available evidence might also be understood in terms of relevance to a particular question. In this sense, the best evidence is evidence from research that is most similar to the practical educational question we are asking. Research is similar (or dissimilar) to practical questions in a number of ways: the specific intervention, the specific population, the specific problem, the specific outcome measures, and the specific type of context. Since no research can match an application context exactly, generalization from research to an application that is somewhat different from the research is always an issue. Thus, any system for judging best evidence must somehow deal with the fact that research studies differ in their relevance to the practice question. Further, research studies do not simply match or fail to match the practice question; instead they are better matches in certain ways and worse matches in other ways. Relevance is more of a continuum than an either/or proposition. Further, the full meaning of best implies both methodological quality and relevance -- the best evidence has a combination of methodological quality and relevance to the specific application with which we are concerned. The term best available evidence also implies that standards for best evidence should be relative to what evidence is available. To say that we should use the best available evidence is different from saying that we should limit ourselves to impeccable evidence. This implies that systems for evaluating the evidence should help us draw lessons from the evidence that is available, even if that evidence is less than ideal.

In the previous paragraph, we have sketched a few of the main features of the construct best available evidence; a full validity study would take this process much farther. Each component of the construct best available evidence would be carefully analyzed and described in detail. A clearly defined and thoroughly analyzed construct establishes the target for systematic review methods -- the goal is to develop reviews that measure this construct as closely as possible. This process of analyzing the construct is a critical foundation of a validity study because only if we clearly understand what we are trying to measure can we evaluate how well a particular review system measures it. Once we clearly understand the construct, we can begin to evaluate how well our measurement methods and outcomes correspond with it. The remainder of this paper discusses concepts and methods that are useful for understanding the relationship between the measure (systematic reviews of research literature) and the construct (best available evidence).

Threats to validity

No measurement system is perfect; there is always some slippage (and sometimes a lot) between the complex construct that we are interested in and the results of measurement. We can talk about two sources of error in any measurement: (1) construct under-representation and (2) construct-irrelevant variance (Cook & Campbell, 1979; Messick, 1989).

First, construct under-representation is a problem when some of the important features of the construct do not influence the measurement system and its results -- when the measurement system does not represent the entire construct. A review system that fails to fully represent all aspects of methodological quality and relevance would under-represent the full meaning of best available evidence. This could be a serious problem because the results from such a review system would not reflect all that we mean by best available evidence and actions based on the review could be misdirected. For example, failure to represent the relevance of evidence (i.e., the match between details of the research and the specific practical application) could result in recommendation of an intervention in spite of the fact that the research supporting that intervention is based on substantially different versions of the intervention, populations of students with different learning needs, and so on. This would constitute a false positive outcome based on construct under-representation. In other words, a program was recommended for use in spite of inadequate support because the review did not fully represent the construct of best available evidence. On the other hand, this same under-representation of relevance could result in failure to recognize the importance of a study that is extremely relevant to the practice question. Thus, construct under-representation can result in both false positive and false negative results. If a review system does not take the full meaning of best available evidence into account, it can produce errors in recommendations. These errors can include recommendation of programs that are not well supported as well as possible failure to recommend programs that are well supported.

The second threat to validity is construct-irrelevant variance--the problem that a measure may be influenced by factors that are not included in the target construct. For example, a system for evaluating best available evidence may be influenced by the date of publication or the language in which a study is published. Neither of these factors is an aspect of best available evidence (at least as discussed above), but may be included in review systems as a matter of convenience for reviewers. If these factors resulted in exclusion of articles and altered the outcomes of the review, this would be considered construct-irrelevant variance.

The main objective of a validity study is to try to understand the ways in which the measure may be influenced by construct under-representation and construct-irrelevant variance. All of the methods of evaluating the validity of a measure (e.g., analyzing its content, examining correlations with other measures, conducting factor analysis, etc.) can be seen as means of obtaining evidence regarding the possible influence of construct under-representation and construct-irrelevant variance (Messick, 1989).

Construct validity -- The umbrella for all validity

Modern definitions of validity (e.g., AERA et al., 1999; Kane, 2006; Messick, 1989) assert that measurement validity is a single thing (i.e., a unitary construct) rather than a collection of separate types of validity (e.g., content validity, criterion validity, etc.). This view treats all evidence about validity as evidence of how well the results produced by the measure reflect the construct. All validity evidence contributes to understanding construct under-representation and construct-irrelevant variance. Thus, construct validity is the umbrella that includes all evidence about validity.

The separate "types of validity" from older conceptualizations become sources of evidence about how well results of the measure reflect the construct. Thus, "content validity" is not a distinct type of validity but a source of evidence about how well the content of a test represents the targeted construct. Similarly, "criterion validity" becomes another source of evidence about whether the test measures the targeted construct and is unaffected by unrelated constructs. In the case of systematic reviews intended to identify empirically supported treatments, this means that all sources of evidence will be used to understand how well a particular review process represents best available evidence. In the remainder of this paper, we will discuss five aspects (often referred to as "facets") of validity as they apply to measures of best available evidence. Each aspect suggests unique questions about how a systematic review process might (or might not) correspond with the construct of best available evidence. Table 2 lists these five aspects of measurement validity, briefly describes each, and lists how each aspect might be relevant to examining systematic reviews.

Content Aspect of Validity

The content aspect of validity focuses our attention on the measurement tasks -- how well the set of tasks represents all aspects of the target construct and excludes irrelevant influences (Messick, 1989). This is briefly summarized in the first row of Table 2. For a systematic review, the content aspect involves a careful examination of how the system identifies, evaluates, and rates the evidence related to a treatment. Systematic reviews generally include explicit procedures for (1) locating studies, (2) rating the relevance of studies, (3) rating methodological quality of studies, (4) rating outcomes of studies, and (5) rating the set of studies with respect to how well the treatment is supported by the evidence (e.g., BEE, n.d.; WWC, 2008). This sequence of procedural steps could be subjected to expert evaluation of how well it represents the construct best available evidence. More specifically, each of these five steps could be evaluated for how well it captures all that we mean by best available evidence (i.e., avoids potential construct under-representation) and is uninfluenced by factors that are not part of best available evidence (i.e., avoids construct-irrelevant variance). Of course, this kind of evaluation repeatedly refers back to the target construct and depends on a clear understanding of best available evidence.
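To make the five procedural steps easier to inspect, they can be written out as a small pipeline. This is a deliberately simplified sketch and does not reproduce the rules of any actual review system: the `Study` fields, the rating scales, the thresholds, and the wording of the final ratings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Study:
    name: str
    relevant: bool          # step 2: relevance rating, simplified to yes/no
    quality: int            # step 3: methodological quality, 0 (weak) to 2 (strong)
    positive_outcome: bool  # step 4: outcome rating, simplified to yes/no


def rate_evidence(studies, min_quality=1, min_supporting=2):
    """Step 5: rate the whole set of studies (thresholds are hypothetical)."""
    # Keep only studies that pass the relevance and quality screens.
    included = [s for s in studies if s.relevant and s.quality >= min_quality]
    supporting = [s for s in included if s.positive_outcome]
    if len(supporting) >= min_supporting and len(supporting) == len(included):
        return "empirically supported"
    if supporting:
        return "mixed or limited evidence"
    return "not supported by located evidence"


# Step 1: the located studies (invented for illustration).
located = [
    Study("S1", relevant=True, quality=2, positive_outcome=True),
    Study("S2", relevant=True, quality=1, positive_outcome=True),
    Study("S3", relevant=False, quality=2, positive_outcome=True),
    Study("S4", relevant=True, quality=0, positive_outcome=False),
]

print(rate_evidence(located))
print(rate_evidence(located, min_supporting=3))
```

Writing the rules down this explicitly shows where content-related questions arise. Each field that is collapsed to a single value, and each threshold, is a place where the construct could be under-represented or where an irrelevant factor could enter.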

In traditional tests, the content aspect of validity is typically assessed by having experts judge each item for relevance and having them judge the entire set of items for how well they cover the target construct. Similarly, experts might be asked to examine each element of a review system (i.e., all the rules for reviewing and standards for ratings) and judge the degree to which they correspond with the best available evidence construct and avoid influence by irrelevant factors. The first two columns of Table 3 organize some central ideas regarding the content aspect of validity for systematic reviews. The first column lists the main steps in a typical systematic review and the second column lists several key questions relevant to the content aspect of validity for each of these steps. It may be helpful to refer to these first two columns of Table 3 while reading the following sections.

Locating studies

With respect to locating studies, we can ask whether the procedural rules appear to describe a search process that will access all relevant studies. Or are there faults in the search process that could result in missing relevant research? For example, does the search process include use of the reference section of each located study as a means of locating additional studies? Does it include contacting authors directly to ask if they have conducted additional studies? We could ask a panel of experts to carefully examine procedures for locating studies and respond to these questions.

Rating relevance of study

Rating the relevance of studies for the particular review is an important component of the construct best available evidence. This process includes a review of each study and determination of its relevance to the review question. It may include a step of noting features of the study including the specific treatment, participants, outcomes, and context; developing a rating system that specifies what variations are to be considered relevant to the review question; and giving relevance ratings to each feature of each study (or the study as a whole). In many systematic reviews the process is not explicitly described, but there is always some strategy for determining the relevance of each study to the review question. Expert evaluators could examine the rules and procedures for determining relevance and provide an evaluation of the degree to which the system would tend to include all relevant studies (and perhaps weight the evidence so that more relevant evidence is given greater consideration) and exclude all irrelevant studies. Experts would attend to the potential sources of construct under-representation (undervaluing aspects of best available evidence) and construct-irrelevant variance (influence by factors that are not part of best available evidence).

Rating methodological quality

The construct best available evidence includes a component of methodological quality. Validity questions similar to those on relevance could be asked about the rating of methodological quality of each study. Validity reviewers could examine each step in the process of determining methodological quality; they could look at what methodological features of the study are rated, how ratings are determined, and how individual ratings on various aspects of the study are combined to arrive at a rating for the study. For example, for a group study, does the systematic review process consider whether the control group is equivalent to the treatment group prior to the intervention? And in a single subject study, does the system record whether the baseline phase was stable prior to beginning the treatment phase? Validity reviewers could examine this process and evaluate the degree to which it represents the construct best available evidence. They would be concerned about sources of construct-irrelevant variance, such as ratings of methodological factors that do not actually correspond with best available evidence, and they would watch for construct under-representation, such as aspects of methodological quality that are not included or not given sufficient importance. For example, a review system that excludes single subject research may be judged to seriously under-represent best available evidence.

Rating outcomes

The process of rating study outcomes may be subjected to a set of questions similar to those asked about methodological quality. Studies must be reviewed, outcomes recorded, and records converted to ratings based on a rating system. The rules and procedures for carrying out this process can be examined and evaluated by experts, and the outcomes (ratings of a specific set of studies) can be evaluated. In both cases, the key question for validity evaluators would be whether the procedures and outcomes represent the construct, best available evidence, with minimal underrepresentation and minimal intrusion of other factors. For example, a review system in which ratings of outcomes are based on statistical significance, but are not influenced by the effect size, may be judged to underrepresent best available evidence.

Substantive aspect of validity

Messick (1989) points out repeatedly that expert judgment is fallible and should be supported by other kinds of evidence. He makes extended arguments that expert judgment is an important contributor to overall validity judgments, but must be combined with empirical observations where this is possible and relevant. Thus, the content aspect of validity with its reliance on expert judgment should be combined with evidence regarding how well the measurement procedures actually work. The substantive aspect of validity focuses on empirical evidence of how well the components of tests represent the construct.

The nature of this empirical evidence depends on the nature of the construct to be measured. In many traditional tests, a series of items are assumed to reflect a single underlying construct. For example, a reading test might include numerous words to be read out loud -- all would be assumed to represent the construct decoding skill. These items would be expected to be highly inter-correlated. However, systematic reviews are not organized in this way; each step in the review procedure is designed to represent a separate and distinct aspect of best available evidence (e.g., obtaining all studies, judging relevance, judging methodological quality, etc.) and these aspects are not necessarily assumed to correlate. For example, a rating of the relevance of the study would not be expected to correlate with the methodological quality or the strength of outcomes -- each measures a distinct and independent feature of a study. Therefore, inter-correlation among steps of a systematic review has little relevance to validity of its recommendations. Further, when we look more closely at any single step, the various component scores that make up that step would not be expected to correlate with one another. For example, the step of judging relevance would include judgments of relevant participants, relevant treatment, relevant outcomes, relevant settings, and perhaps others. Based on our understanding of the construct, we would not expect these elements to correlate with one another. Knowing that a study was conducted with participants who are relevant to our review does not indicate that a treatment was also relevant to our review.

Instead, a different approach might be used to obtain empirical evidence relevant to how well each step of a systematic review represents the corresponding aspect of best available evidence. One approach would be to test each step in the review process by employing several alternative methods and compare their outcomes. This would give us an empirical picture of the effects of different methods for performing each step of the process. Again, Table 3 may be helpful in understanding how this approach could be applied to each step of a systematic review. The third (far right) column of Table 3 provides examples of the kinds of empirical questions that might be asked of each step of the process.

Locating studies

For the step of locating studies, we could compare a given search method to several different methods and record the results -- for example, how many studies are located, how many relevant studies are located, which specific studies are located by each method. Then methods could be compared based on what they produce. This comparison is not limited to gross number of studies located; review methods can also be compared based on the quality of these studies (relevance and methodological quality) and whether various methods locate different kinds of studies. In addition, we could ask the bottom line question of whether different systems for locating studies produce different final statements about how well the treatment is supported by best available evidence. These questions would help us understand the adequacy of each method and may suggest new approaches that might optimize the results.
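The comparison of search methods described above can be sketched in a few lines of code. The sketch below is illustrative only: the method names, study identifiers, and the set of studies treated as relevant are hypothetical, and the function `compare_search_methods` is not part of any existing review system. It records, for each method, how many studies were located, how many of those were relevant, and which studies no other method found.

```python
def compare_search_methods(results_by_method, relevant_ids):
    """Summarize what each search method located.

    results_by_method: dict mapping a method name to the study IDs it found.
    relevant_ids: set of study IDs judged relevant (assumed to be known here).
    """
    summary = {}
    for method, found in results_by_method.items():
        found = set(found)
        # Studies located by every other method, pooled together.
        others = set().union(
            *(set(ids) for name, ids in results_by_method.items() if name != method)
        )
        summary[method] = {
            "located": len(found),
            "relevant_located": len(found & set(relevant_ids)),
            "unique_to_method": sorted(found - others),
        }
    return summary


# Hypothetical data: two search methods and a set of relevant studies.
results = {
    "database_search": {"s1", "s2", "s3"},
    "hand_search": {"s2", "s4"},
}
relevant = {"s1", "s2", "s4"}

for method, stats in compare_search_methods(results, relevant).items():
    print(method, stats)
```

A fuller comparison would carry each method's set of located studies through the remaining review steps in order to ask the bottom line question of whether the final rating of the treatment changes.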

Rating relevance of study

The step of judging the relevance of studies is difficult to test empirically because the relevance of a study requires a judgment about how well results from that study generalize to the specific review question. For example, if a study was conducted with general education students who scored below the 20th percentile on a reading test, to what degree are the results relevant to students who have a diagnosis of learning disability in the area of reading? However, there are several ways in which we might gain insight into the effects of the rules and procedures used to rate relevance in EBP reviews. We could compare results from several systems for rating relevance of studies -- this would reveal the effects of differences among these systems. The set of studies that is found to be relevant by some, but not all, systems could be examined. How many studies are at stake? What features of the studies cause them to be judged as relevant by some but irrelevant by other systems? What are their characteristics in terms of methodology and outcomes? How would they affect overall judgments of the program or intervention? In addition, we might give experts a review question and ask them to rate the relevance of several studies to this question. We could compare this global judgment of relevance to the ratings derived by the review procedures. It is possible that, although the rules appear to be logical and well justified, their application does not match overall judgments of relevance. Experts examining a study as a whole might find it to be relevant even though the review procedures result in a rating of not relevant. Neither method is flawless, but the comparison would be instructive.
This analysis would provide information on robustness of results across variations in relevance standards--that is, it would help us understand whether small differences in our methods for determining relevance result in large differences in our conclusions about programs or interventions. The strongest evidence for robustness would come from review methods that are very different, yet yield similar results. This kind of evidence would support the case that the results are not dependent on the particular review methods that we have employed.

Rating methodological quality

Systematic reviews rate the methodological quality of each relevant study and accept studies that meet their quality standards; this step in the process might be empirically evaluated in ways similar to those suggested for the judgment of relevance. The results from different systematic review systems could be compared on total number of studies accepted and rejected, and more importantly, the particular studies in question could be examined. In addition, experts might closely examine studies and rate them for their overall methodological quality. These ratings could be compared to the results from systematic review systems. These comparisons might reveal how review procedures differ from overall judgments by experts and also how results from one system differ from another. It would help us understand the implications of variations in how methodological standards are set. We could better understand the implications of these standards for our conclusions about the best available evidence on a program or intervention. If differing methods tend to converge on a single result, this would support the validity of the methods.

Rating outcomes

Systematic review systems must describe the outcomes from the primary studies--most review systems categorize outcomes and give ratings such as supportive, neutral or no effect, and negative. The substantive aspect of validity would focus on how well these rating systems work. The ratings for a set of studies could be compared to several other ways of summarizing outcomes. They could be compared to expert judgment of the outcomes of the studies and other categorization systems. The comparison with expert judgment would help us understand how well a rating system captures the same qualities that an expert uses to judge the outcomes of a study. The comparison to other categorization systems would reveal how differences in details of these systems might (or might not) affect the apparent support for a program.

Interaction between content and substantive aspects of validity

The content and substantive aspects of validity are similar in that they both examine each component of the assessment -- in this case, each step of the systematic review process. They are different in that the content aspect consists of a logical analysis of the procedures, rules, and standards that make up each step and the substantive aspect involves collecting data on how each step functions. Thus we have two views of each step, one logical and one empirical. Messick (1989) emphasizes that the "confrontation" between these two views provides key insight into validity of the components of a measurement system. When these views agree in support of validity, the case for validity of that step is strengthened; when they agree that validity is weak, the case for validity is seriously weakened; and when the two views disagree, a new set of questions is raised about why they differ.

Structural aspect

The content and substantive aspects of validity focus on the components or steps of the assessment process. The structural aspect of validity asks questions about the ways in which scores on individual items are combined into component scores, and how these are combined into broader scores. For example, in many academic tests, each item is scored correct or incorrect, and a score on a subtest is simply the number of items correct. Subtest scores (perhaps after some kind of transformation) might then be added up to find a total test score that presumably represents the relevant construct. The question posed by the structural aspect of validity is how well these procedures for deriving scores match our understanding of the organization of the construct. Do we think that the overall construct is the sum of its parts, or is there some other type of relationship among components? For example, what is the relation between relevance of a study and that study's impact on best available evidence? In the most common conception, a study is either relevant or irrelevant to the review question, and if it is irrelevant, it has no impact on the best available evidence on that question. This would suggest that a score of zero for relevance should mean that no other features of the study impact the overall score. It does not matter how good the methods are, or how strong the outcomes; if it is not considered to be relevant, the study has no impact.
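The structural point about relevance can be made concrete with a small sketch. The two scoring functions below are hypothetical and are not drawn from any particular review system; the numeric scales and the product of quality and outcome are assumptions made purely for illustration. The first treats relevance as a gate, so an irrelevant study contributes nothing regardless of its other ratings; the second simply sums the ratings, which would let strong methods and outcomes carry an irrelevant study.

```python
def gated_contribution(relevant, quality, outcome):
    """Relevance acts as a gate: an irrelevant study contributes nothing.

    quality and outcome are hypothetical numeric ratings; multiplying them
    is an illustrative assumption, not a rule from any review system.
    """
    if not relevant:
        return 0.0
    return float(quality * outcome)


def additive_contribution(relevant, quality, outcome):
    """Sum-of-parts model: relevance is just one more score to add."""
    return float(int(relevant) + quality + outcome)


# An irrelevant study with strong methods (3) and strong outcomes (2).
print(gated_contribution(False, 3, 2))     # 0.0
print(additive_contribution(False, 3, 2))  # 5.0
```

Only the gated model matches the common conception described above, in which a relevance score of zero means that no other feature of the study affects the overall score.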

Systematic reviews typically include several different ways of combining scores in their overall system. In the review for relevance, each dimension of relevance (participants, treatments, measures, settings) may be scored separately, and if any dimension is judged to be irrelevant to the review question, the study is rejected. Similarly, in the review for methodological quality, a study may be rejected based on a single low-rated item. However, in other systems, a low rating on one item may be compensated by high ratings on other items. For example, the demonstration of pretest equivalence and other design features may balance lack of random assignment. In the rating of outcomes, some review systems use two scores (statistical significance and effect size) and others combine the two into a single scale with several possible values (e.g., a study may be said to give "strong support" if it finds statistically significant differences and the effect size is at least .25). Finally, the scoring model for rating the program or intervention typically sorts studies into categories based on both methodological quality and outcomes, counts the number of studies in each category, and determines a rating based on a set of rules.
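The contrast among these scoring models can be illustrated with a brief sketch. The item names, the 0-2 rating scale, and the cut points for acceptance are hypothetical; only the rule that "strong support" requires statistical significance and an effect size of at least .25 is taken from the example above. The first function rejects a study on a single low-rated item, the second allows high ratings to compensate for a low one, and the third combines significance and effect size into one judgment.

```python
def conjunctive_accept(item_ratings, minimum):
    """Reject the study if any single item falls below the minimum."""
    return all(rating >= minimum for rating in item_ratings.values())


def compensatory_accept(item_ratings, required_mean):
    """Accept if the mean rating is high enough, so strong items can offset a weak one."""
    mean_rating = sum(item_ratings.values()) / len(item_ratings)
    return mean_rating >= required_mean


def gives_strong_support(significant, effect_size, threshold=0.25):
    """Require both statistical significance and a minimum effect size."""
    return bool(significant and effect_size >= threshold)


# Hypothetical ratings on a 0-2 scale: no random assignment,
# but pretest equivalence and other design features are strong.
ratings = {
    "random_assignment": 0,
    "pretest_equivalence": 2,
    "attrition": 2,
    "fidelity": 2,
}

print(conjunctive_accept(ratings, minimum=1))           # False
print(compensatory_accept(ratings, required_mean=1.5))  # True
print(gives_strong_support(True, 0.30))                 # True
print(gives_strong_support(True, 0.10))                 # False
```

Submitting the same study to both acceptance rules shows how the choice of scoring model alone can change whether a study enters the evidence base.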

The first benefit of examining the structural aspect of validity of systematic reviews is to raise awareness that the complex scoring model that is built into each review system is only one of a number of ways that this process might be organized. There are other ways of deriving scores and the system used in a systematic review should be carefully and critically examined based on how well it reflects our understanding of the construct best available evidence. The second benefit of examining the structural aspect of validity is that it leads us to scrutinize specific scoring models employed by review systems, understand how they function, and compare them to other approaches.

One aspect of scoring assessments is the determination of whether scores will be interpreted relative to criteria or relative to other scores (e.g., norms). Most systematic reviews are basically criterion-referenced assessments. They are designed to determine whether individual studies meet predetermined standards of design, relevance, and outcomes. These criterion-based ratings are combined to determine whether there are a sufficient number of high quality studies to support the procedure; this determination is also based on predetermined criteria. Typically, review systems use discrete ordinal scales with 3 to 5 possible values. Again, validity analysis calls for critical examination based on our understanding of best available evidence.

Gathering evidence relevant to the structural aspect of validity can be approached in several ways. First, experts can examine the rules for rating the evidence base. They can evaluate whether all aspects of best available evidence are taken into account and given appropriate consideration, and whether other factors might influence these ratings. Also, experts could examine sets of evidence and evaluate whether the rating adequately represents their understanding of how well this set of evidence supports the treatment in question. Second, various different systems for combining scores from the steps of the rating process could be compared by submitting a given set of information to each system and evaluating any differences in how the program or intervention is rated. This would clarify how rating systems actually perform and expose differences across rating systems.

Generalizability aspect of validity

One of the central questions for any assessment process is about how well results from a particular observation or testing session inform decisions about other situations that differ in some respects -- the generalization question. Without some ability to generalize, assessment results give us no guidance in decision-making. Generalizability questions have obvious importance for understanding the results of systematic reviews -- do the results of the review generalize to the particular teaching situations in which we work? There are at least three important levels of generalizability of these reviews: (1) generalizability across raters and instances of the review process (i.e., reliability), (2) generalization of a finding that an intervention is supported by the best available evidence to somewhat different populations of students, variations on the treatment, related measures, and contexts of application, and (3) generalization of the review process across topics and literatures.

Generalization across raters and reviews -- Reliability

The most basic level of generalizability is reliability (Messick, 1989). Reliability includes the issue of whether two different raters tend to give the same ratings when reviewing a study (interrater agreement). Best available evidence review systems require raters to make numerous, and sometimes difficult, decisions in the coding and rating process. If we do not have confidence that coding and rating are consistent across reviewers, we can have little confidence in any interpretation of the results of an assessment system. Existing reviews and review systems have taken different approaches to this issue. Some have not discussed this issue, others have built extensive training and review systems but not reported interobserver reliability, and some have reported reliability. Several reviewers who have reported reliability and commented on these issues have found it to be a significant challenge. For example, Chard, Ketterlin-Geller, Baker, Doabler, & Apichatabutra (2009) reported that when a second reviewer independently rated features of single-subject studies, their scores matched the primary reviewer only 36% of the time, and for group studies they matched only 53% of the time. This is well below levels of agreement considered acceptable in primary research. Reliability may be reported for specific coding items (e.g., was the study done with the kind of participants targeted in the review), for categorical outcomes at each part of the review process (e.g., decision to consider a study as relevant or not-relevant), and for determination of the status of the intervention (e.g., the intervention is found to be strongly supported by the best available evidence). Each of these levels of reliability is important for understanding the extent to which the particular ratings given to an intervention are consistent across different raters. 
The overall rating of an intervention (as either being empirically supported or not empirically supported) can change on the basis of the coding of a single item on a single study; as a result, the reliability of these ratings is a critical concern. A small amount of unreliability could produce important errors in the overall ratings of interventions. Careful analysis of reliability of each item is also important for continual improvement in the system. When low-reliability items are identified, they can be revised, training of raters can be improved, or other modifications can be made.
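A minimal sketch of how agreement between two raters might be computed is given below. The rating data are hypothetical. Percent agreement corresponds to the kind of figure reported above; Cohen's kappa is not discussed in this paper and is included only as one commonly used index that corrects agreement for chance.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters gave the same code."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions, summed over codes.
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical code for every item
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two raters on eight items
# (e.g., "were participants of the kind targeted in the review?").
primary = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
second = ["yes", "no", "no", "no", "yes", "yes", "yes", "no"]

print(percent_agreement(primary, second))  # 0.75
print(cohens_kappa(primary, second))       # 0.5
```

The same calculation can be repeated at each level described above: for individual coding items, for the relevant or not-relevant decision on each study, and for the final status assigned to the intervention.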

Generalization of conclusions from research to practice

A second level of generalizability concerns the extent to which the results of a body of research findings are relevant to a range of applications. These questions take us to the heart of one of the most difficult issues in interpreting research. There is a large literature on this topic (e.g., Shadish, Cook, & Campbell, 2002). In addition, the extent to which test results predict performance in a range of situations has also been studied extensively (e.g., Cronbach, 1982; Messick, 1989; Schmidt & Hunter, 1981) and is highly controversial. The challenge stems from the fact that research is conducted with a specific group of students responding to specific treatments and specific measures in specific contexts. From this highly specific research, we attempt to predict the performance of a different group of students responding to a treatment that is likely somewhat different and a measure that may differ in a context that is different. Of course, we cannot assume that findings are completely generalizable to new situations -- this is one reason that systematic reviews typically require multiple independent studies that demonstrate effectiveness of a treatment. If we assume that research findings are highly generalizable, we may make false positive errors--conclusions that treatments are effective in situations where they are not actually effective. On the other hand, one could conduct a systematic review that is limited to studies that very closely match the local context (e.g., demographics, particular grade levels, etc.). However, such an approach would often find few if any studies that matched the local context very closely; even interventions that have been subjected to extensive research may not have any studies that closely match the particular context. This approach would produce numerous false negative results--the false conclusion that the intervention is not well supported.
The more we narrow the scope of evidence that we consider to be relevant, the smaller the research base and the greater the chance that effective interventions will be found to lack sufficient evidence (Gardner, Spencer, Boelter, DuBard, & Jennett, 2012 [this issue]). Said another way, limiting a review to the most specific evidence may not yield the best available evidence. This kind of undergeneralization from research may leave practitioners with little guidance from the evidence that is available. This takes us back to the very difficult and critical issue of sensible generalization from a set of research studies to a range of potential applications.

The recognition that any decision about generalizability (whether the decision favors broad or narrow generalization) is fraught with uncertainty is helpful if it supports thoughtful and careful interpretation of the best available evidence. Once the challenge of generalization is recognized, it can be approached analytically. Practitioners can list specific features of their local situation (students, likely variations on treatments, measures of importance, other aspects of context) and logically analyze the 'contextual fit' (Albin, Lucyshyn, Horner, & Flannery, 1996) of a practice above and beyond results from a review of the best available evidence. In this process, practitioners would have to make judgments about whether differences between research and practice settings are likely to change the effectiveness of the practice substantially.

The uncertainty of generalization also underscores the importance of ongoing monitoring of effectiveness. No evidence base is so strong and complete that it would justify implementation of an intervention without monitoring its effectiveness in practice. In addition, if intervention effectiveness is routinely monitored (e.g., through curriculum-based measurement and end-of-year testing) local agencies can generate highly relevant local evidence.

There is one conclusion that can be asserted strongly--generalization of results from reviews of research is complex and problematic. Strong claims of either applicability or limits of relevance should be examined carefully and critically. There will always be a great deal of uncertainty about how a set of research results informs a practice in a particular situation. The strongest position is to recognize and describe this uncertainty.

Generalization of review process to new topics

The third level of generality is the degree to which a particular review process can be applied with equal validity to a range of topics and literatures. To what degree is a review method that has been found to be valid for one literature assumed to be valid for another literature? The issue here is the extent to which we can generalize the validity of a systematic review process. The threat to validity is the claim that a review process that is adequate for the literature on one topic may not be equally adequate for the literature on another topic. For example, a review method might result in valid recommendations based on the best available evidence when applied to topics that have a very large literature with numerous randomized control trials; however, those same methods may result in less valid representation of the best available evidence in sparse literatures with few randomized experiments, or literatures with important contributions from single subject research. There has been little if any published discussion of how the nature of a set of literature might affect the validity of systematic reviews for identifying the best available evidence. A thorough study of the validity of any review system would need to attend to the range of situations in which that review system is appropriate.

External aspect of validity

The external aspect of validity has traditionally been called criterion validity (Messick, 1989). It is concerned with the relations between the test in question and other assessments. When other assessments are measures of the same construct, we generally expect high correlations between results; when they are similar or related constructs, we generally expect moderate correlations; and when they are measures of unrelated constructs, we expect low correlations in outcomes. For example, scores from a test of reading comprehension would be expected to be highly correlated with those from another test of reading comprehension, moderately correlated with those from a test of listening comprehension, and only modestly correlated with a test of mathematics. The key to the external aspect is not that all correlations are high, but that correlations be similar to what is predicted based on the constructs that are being measured.

Alternative measures can also be similar (or dissimilar) in their methods of measurement, and sometimes methods of measurement influence assessment results. For example, two multiple-choice tests have similar methods of measurement and would be expected to share "method variance" that is associated with students' skills in responding to multiple-choice questions. Thus, when two tests use very different methods of measurement (e.g., multiple-choice and open-ended essay) to assess a single construct, and the results correlate highly, this is very strong external evidence for validity of both tests.

These concepts can be applied to systematic reviews as well. We would expect that the results of two systematic reviews would converge on the same conclusions about the degree to which the best available evidence supports a treatment because these reviews target the same construct and also share many methods of review. If they did not converge, this evidence would tend to undermine the validity of one or both reviews and would suggest a need for detailed analysis of the specific review items or standards that caused the difference in outcomes. We would also expect a reasonably high level of convergence between systematic reviews and meta-analyses; both are attempts to characterize the best available evidence and they share some review methods, but they are also distinct in important ways. Because meta-analyses and systematic reviews are distinct in many of their methods, convergence of results would lend stronger credibility to the validity of each method as a measure of the best available evidence. Systematic reviews and best practice panels share even fewer methodological features, so convergence of outcomes would be particularly powerful evidence of validity--the convergence cannot be attributed to method variance. Messick (1989) commented, "... what is critical is that [results from the assessment] relate appropriately to other construct scores based on distinctly different measurement methods from its own. Otherwise, shared method variance becomes a plausible rival explanation of aspects of the external relational pattern, a perennial threat that must be systematically addressed in making external validity claims" (p. 46).

Lack of convergence in outcomes may be more difficult to interpret. We do not have a "gold standard" review method that we can assume is highly valid. Therefore, when conclusions from multiple reviews differ, it is not immediately clear whether one, the other, or both reviews are less than valid. In this case, it would be most productive to take this kind of result as a call for further analysis. We can compare the review processes and identify specific procedures and standards that account for the divergent findings. Depending on what features are responsible, we may conclude that one or the other is stronger in this case. For example, one review may have imposed a historical cutoff on its search for research, and several otherwise acceptable studies may have been published just outside the search window. In this case, results from the review that includes these studies would be preferred. In other cases, there may be no clear reason for considering one review to be more valid than another. For example, one review may have imposed higher standards for methodologically acceptable studies--such as a requirement for explicit description of detailed procedures for selection and assignment of participants to groups--this could exclude studies based on brief description as well as those with methodological flaws. With a slightly lower standard, more research evidence would be available to evaluate the effectiveness of the practice. In this case, there is no inherent basis for favoring one method or the other. This outcome would suggest that results of either review must be understood to be one of two reasonable interpretations of the best available evidence. This kind of outcome would also suggest that, in general, these review systems are subject to challenges of this sort. We would know that any conclusion about the practice is dependent upon a somewhat arbitrary methodological decision.
And in this case, we can describe the specific aspects of the research base that are ambiguous. Of course, the best solutions to cases like these are to generate additional research that can result in a less equivocal finding, and to implement treatments with progress monitoring to provide a local check on effectiveness.

Consequential aspect of validity

We use assessments to achieve some benefit beyond the testing and interpretation process itself. Testing is only valuable to the degree that it helps us make better decisions and achieve positive consequences. The Standards (AERA et al., 1999) state that, "a fundamental purpose of validation is to indicate whether these specific benefits are likely to be realized" (p. 16). Messick (1989, 1995) takes this argument to the next logical step; he argues that consequences of test use are relevant to validity whether these consequences were intended or unintended and whether they are positive or negative. He states, "judging validity in terms of whether a test does the job it is employed to do (Cureton, 1951; Rulon, 1946) -- that is, whether it serves its intended function or purpose -- requires evaluation of the intended or unintended social consequences of test interpretation and use" (Messick, 1989, p. 84). This is the consequential aspect of measurement validity.

Systematic reviews of research literature to identify practices supported by the best available evidence have the clear intended purpose of contributing to the overall goals of EBP -- to produce better outcomes for students and clients. The importance of including the consequential aspect in validity analysis of these review systems is very clear. The intended consequences of the use of systematic reviews are that practices employed in schools become more effective over time and student outcomes improve. However, measurement invalidity could result in false negative and false positive evaluations of practices. These errors could have negative unintended consequences for students -- poorer outcomes as a result of failure to implement potentially effective treatments (result of false negative) and the implementation of ineffective treatments (result of false positive).

It is also possible that there may even be unintended negative consequences of valid systematic reviews of the best available evidence. It is possible that results of these reviews may be used without the other components of evidence-based practice (professional judgment and contextual factors) resulting in inflexible implementation of identified treatments in inappropriate contexts -- that is, overgeneralization of the applicability of practices. This could result in poor student outcomes and increased resistance of staff to using empirically supported treatments.

However, we should recognize at the outset that the consequences of systematic reviews that identify empirically supported treatments are complicated by the numerous other factors that influence the selection and implementation of treatments and their effectiveness in practice. In this paper we have limited our discussion to the validity of processes for identifying empirically supported treatments based on the best available evidence. In the EBP model, best available evidence is one of three influences on educational decision-making and selection of treatments. The EBP model specifies that selection and implementation of treatments should not be simply a result of research reviews; they should be a result of reviews along with professional judgment and contextual factors. Thus, the important question of whether systematic reviews lead schools to select well-supported treatments is bound up with broader questions about decision making in EBP systems. Further, after a treatment is selected, numerous factors influence the effectiveness with which it is implemented (e.g., Fixsen, Naoom, Blase, Friedman, & Wallace, 2005). As a result, the important task of evaluating the degree to which systematic reviews realize their intended purposes takes us beyond narrowly defined validity studies of the systematic reviews to broader evaluation of EBP. This serves as an additional reminder of the importance of evaluating the overall function of EBP systems.

The recognition that features of the broader EBP system are critical in determining the ultimate impact of treatments, however, does not mean that the process of implementation is completely separate from reviews of the best available evidence. For example, a review process might find that the best available evidence supports a particular reading comprehension program for a particular grade level and type of student. However, that reading comprehension program may not be implemented with fidelity. The failure to realize improved outcomes cannot automatically be attributed to failure of the review process -- it may have been a result of inadequate training and coaching in the proper use of the intervention. On the other hand, a review process that reveals not only the effectiveness of the intervention under ideal conditions but also the requirements for producing these ideal conditions (e.g., administrative support, training, coaching, ongoing monitoring and problem solving, etc.), might better support effective selection of contextually appropriate interventions and their effective implementation. Thus, failures of implementation may suggest needed improvements in review systems.

We have argued that the consequences of using systematic reviews to identify empirically supported interventions are a result of both (a) the review processes and results and (b) the ways in which educators use these results (i.e., lists of empirically supported treatments) to select and implement interventions.

In his classic work on measurement validity, Messick (1989) argued that it is the responsibility of the user of assessment information to interpret and act on that information in ways that make sense in local circumstances. He stated that the users of test results are "in the best position to evaluate the meaning of individual scores under the specific circumstances" (p. 88) and that they bear "a heavy interpretive burden" (p. 88). As a result of this important role in interpreting and acting on test results, users also bear "a heavy ethical burden" (p. 88) to ensure that the consequences of their uses of test information are positive. This idea takes us back to the basic three-part definition of EBP. EBP is not a system of blindly implementing treatments listed in a research review; it is a decision-making process that includes the best available evidence along with professional judgment, client values, and contextual factors.

Summary and Conclusions

Educational decisions have meaningful consequences for students and society as a whole. The evidence-based practice movement argues that the best available evidence should be one of the primary contributors to these decisions. However, identifying the best available evidence and interpreting its message is neither simple nor straightforward. The process of identifying and summarizing the best available evidence in ways that are useful to educators is as difficult as it is important. This paper has suggested that the process of reviewing the best available evidence can be seen as a form of measurement and that the concepts of measurement validity can be usefully applied. The tools of measurement validity can support evidence-based practice in several ways. First, these tools constitute a well-established set of methods for compiling validity evidence and can provide a firm basis for understanding how much confidence should be placed in the results of any particular systematic review. Second, they provide a foundation for improving the validity of systematic reviews -- various review methods can be compared, strengths and weaknesses of each can be identified, and different methods may be combined to improve validity.

Because EBP argues for the inclusion of the best available evidence in educational decision-making, it is particularly appropriate and consistent to apply serious scientific scrutiny to EBP itself. The framework of measurement validity provides the tools to generate high quality evidence about the process of systematically reviewing research literature and deriving recommendations for practice. Just as EBP is proposed as an antidote to uncritical assumptions about the effectiveness of educational interventions, the measurement validity perspective is proposed as a counter to uncritical acceptance of claims regarding empirical support. There are many (perhaps infinitely many) possible systems for identifying and summarizing best available evidence, and the results of each of these systems will have imperfect validity. The tendency to uncritically accept systematic reviews and simplistically equate their results with best available evidence is a potentially serious weakness in the entire EBP system.

This examination of the application of the measurement validity framework to systematic reviews in support of evidence-based practice suggests several recommendations for future research and practice. First, validity studies should be conducted on specific review systems that are used to identify empirically supported treatments. For example, organizations that conduct numerous reviews and render ratings for a large number of treatments (e.g., BEE, WWC, and others) should conduct and report thorough studies of the validity of their review processes. These studies would not necessarily need to address every conceivable validity question, but should prioritize the most important issues. When the costs of validity reviews are compared to the social importance of the ratings promulgated by these groups, such validity reviews appear to be very well justified. Further, the Standards for Educational and Psychological Testing (AERA et al., 1999) make it clear that those who develop and conduct assessments bear a great deal of responsibility for evaluating and reporting on the validity of these assessments. The responsibility to address these standards might well be considered to apply to organizations that produce ratings of the empirical support for treatments just as it applies to publishers of tests. In addition to validity studies conducted by large reviewing organizations, outside researchers could conduct and publish validity studies as well. For example, researchers could use alternative methods for conducting the steps of a systematic review and compare the results of their alternative review to those of a WWC review. The two reviews could be compared based on expert evaluation of each step of the process (content aspect of validity), based on the results of each step of the process (substantive aspect), and based on the overall results (external aspect).
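As one illustration of the external aspect, agreement between two review systems that rate the same set of treatments could be summarized with a chance-corrected index such as Cohen's kappa. The sketch below is a minimal example and not a procedure described in this paper; the function name, the rating labels, and the six ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: proportion of treatments given the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: computed from each system's marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both systems used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of the same six treatments by two review systems.
system_a = ["supported", "supported", "not supported",
            "supported", "not supported", "not supported"]
system_b = ["supported", "not supported", "not supported",
            "supported", "not supported", "supported"]

kappa = cohens_kappa(system_a, system_b)
print(f"kappa = {kappa:.2f}")
```

In this invented data set the two systems agree on four of six treatments, but half of that agreement would be expected by chance, so kappa is modest. A real comparison would also need to examine which treatments are rated differently and why.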

Second, when systematic reviews are conducted on a smaller scale, such as those that are reported in journal articles, authors could begin to include information on the validity of the review process to the degree that this is possible. One simple move in this direction would be reporting interrater reliability for all coding and rating decisions in all systematic reviews. This is already a commonly accepted standard for primary research studies and meta-analyses. In addition, where appropriate, authors might test the effects of altering their review processes. For example, a review might require that research studies report the fidelity with which treatments were implemented and downgrade the methodological rigor rating of any study that did not report this information. These reviewers might conduct parallel analyses with and without this requirement and report its effects on the overall results. One possible result might be that this requirement resulted in rejection of older studies conducted before fidelity of implementation was commonly addressed. If this difference in the review process resulted in different conclusions about a treatment, conclusions might be moderated accordingly.
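The parallel analysis described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the four studies, the `Study` fields, and the decision rule (at least two included studies with positive effects) are all hypothetical, and the fidelity requirement is modeled as outright exclusion rather than as a downgraded rigor rating.

```python
from dataclasses import dataclass


@dataclass
class Study:
    label: str
    year: int
    reports_fidelity: bool
    positive_effect: bool


def rate_treatment(studies, require_fidelity, min_positive=2):
    """Return (rating, included study labels) under a hypothetical rule."""
    # Keep a study if it reports fidelity or if fidelity is not required.
    included = [s for s in studies if s.reports_fidelity or not require_fidelity]
    positives = sum(s.positive_effect for s in included)
    rating = "supported" if positives >= min_positive else "insufficient evidence"
    return rating, [s.label for s in included]


# Hypothetical body of research on one treatment.
studies = [
    Study("A", 1988, reports_fidelity=False, positive_effect=True),
    Study("B", 1994, reports_fidelity=False, positive_effect=True),
    Study("C", 2006, reports_fidelity=True, positive_effect=True),
    Study("D", 2009, reports_fidelity=True, positive_effect=False),
]

with_rule = rate_treatment(studies, require_fidelity=True)
without_rule = rate_treatment(studies, require_fidelity=False)
# Studies that the fidelity requirement removes from the review.
dropped = sorted(set(without_rule[1]) - set(with_rule[1]))

print("with fidelity requirement:   ", with_rule[0])
print("without fidelity requirement:", without_rule[0])
print("studies dropped by requirement:", dropped)
```

Here the requirement removes the two older studies and changes the overall rating, which is the kind of result a review might report alongside its main conclusion.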

Third, just as we understand that consumers of test results must be educated in the proper interpretation and integration of these results with other considerations, so too we who promote evidence-based practice and disseminate information on empirically supported treatments bear a responsibility to educate and support consumers in their role of carefully evaluating information on empirically supported treatments and combining that information with professional expertise, client values, and contextual issues.


References

Albin, R. W., Lucyshyn, J. M., Horner, R. H., & Flannery, K. B. (1996). Contextual fit for behavioral support plans: A model for "goodness of fit." In L. K. Koegel, R. L. Koegel, & G. Dunlap (Eds.), Positive behavioral support: Including people with difficult behavior in the community (pp. 81-98). Baltimore, MD: P.H. Brookes.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). The Standards for Educational and Psychological Testing (Rev. ed.). Washington, DC: American Educational Research Association.

APA Presidential Task Force on Evidence-Based Practice. (2006). Evidence-based practice in psychology. American Psychologist, 61, 271-285. DOI: 10.1037/0003-066X.61.4.271

Best Evidence Encyclopedia. (n.d.). Retrieved from http://www.bestevidence.org/

Briggs, D. C. (2008). Synthesizing causal inferences. Educational Researcher, 37, 15-22.

Chard, D. J., Ketterlin-Geller, L. R., Baker, S. K., Doabler, C., & Apichatabutra, C. (2009). Repeated reading interventions for students with learning disabilities: Status of the evidence. Exceptional Children, 75, 263-281.

Confrey, J. (2007). Comparing and contrasting the National Research Council report on evaluating curricular effectiveness with the What Works Clearinghouse approach. Educational Evaluation and Policy Analysis, 28, 195-213.

Cook, B. G., Landrum, T. J., Cook, L., & Tankersley, M. (Eds.). (2008). Evidence-based practices in special education [Special issue]. Intervention in School and Clinic, 44(2).

Cook, B. G., Tankersley, M., & Landrum, T. J. (2009). Determining evidence-based practices in special education. Exceptional Children, 75, 365-383.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston, MA: Houghton Mifflin.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco, CA: Jossey-Bass.

Fixsen, D. L., Naoom, S. F., Blase, K. A., Friedman, R. M., & Wallace, F. (2005). Implementation research: A synthesis of the literature (FMHI Publication #231). Tampa, FL: University of South Florida, Louis de la Parte Florida Mental Health Institute, The National Implementation Research Network.

Gardner, A. W., Spencer, T. D., Boelter, E. W., DuBard, M., & Jennett, H. K. (2012). A systematic review of brief experimental analysis methodology with typically developing children. Education and Treatment of Children, 35(2), 313-332.

Gersten, R., Fuchs, L. S., Compton, D., Coyne, M., Greenwood, C., & Innocenti, M. S. (2005). Quality indicators for group experimental and quasi-experimental research in special education. Exceptional Children, 71, 149-164.

Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165-179.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: American Council on Education/Praeger.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York, NY: American Council on Education/Macmillan.

Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50, 741-749.

O'Keeffe, B. V., & Slocum, T. A. (2012). Is repeated readings an empirically-supported intervention? A comparison of review methods. Education and Treatment of Children, 35(2), 333-366.

Sackett, D. L., Rosenberg, W. M., Gray, J. A., Haynes, R. B., & Richardson, W. S. (1996). Evidence based medicine: What it is and what it isn't. British Medical Journal, 312, 71-72.

Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.

Schoenfeld, A. H. (2006). What doesn't work: The challenge and failure of the What Works Clearinghouse to conduct meaningful reviews of studies of mathematics curricula. Educational Researcher, 35, 13-21.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Slavin, R. E. (2008). Evidence-based reform in education: Which evidence counts? Educational Researcher, 37(1), 47-50.

Slocum, T. A., Spencer, T. D., & Detrich, R. (2012). Best available evidence: Three complementary approaches. Education and Treatment of Children, 35(2), 153-181.

Spencer, T. D., Detrich, R., & Slocum, T. A. (2012). Evidence-based practice: A framework for making effective decisions. Education and Treatment of Children, 35(2), 127-151.

What Works Clearinghouse (2008, Dec.). WWC Procedures and Standards Handbook (Version 2.0). Retrieved from http://ies.ed.gov/ncee/wwc/references/idocviewer/doc.aspx?docid=19&tocid=1/.

Whitehurst, G. J. (2007). Evidence-Based Education (EBE) [PowerPoint slides]. Retrieved from http://www2.ed.gov/nclb/methods/whatworks/eb/evidencebased.pdf.

Wilczynski, S. M. (2012). Risk and strategic decision-making in developing evidence-based practice guidelines. Education and Treatment of Children, 35(2), 291-311.

Correspondence to Timothy A. Slocum, Department of Special Education and Rehabilitation, Utah State University, Logan UT 84322-2865. E-mail: tim.slocum@usu.edu.

Timothy A. Slocum, Utah State University
Ronnie Detrich, Wing Institute
Trina D. Spencer, Northern Arizona University

Table 1
Analogy Between Educational Testing and Assessing Best Available Evidence

                    Assessing Student      Assessing Best Available
                    Skill                  Evidence

Construct           Reading skill          Best Available Evidence

Measurement system  Educational test       Systematic review of
                                           research literature

Measurement         Scores reflecting      Ratings of strength of
outcomes            reading skill          evidence for

Other               Other information      Clinical judgment and
considerations      about student          context including family
                    including educational  values and capacity of
                    opportunities,         system
                    language background,

Decisions           Placement and          Selection of treatment or
                    instructional          intervention to solve
                    decisions for          particular problem
                    individual student

Table 2
Aspects of Measurement Validity Applied to Systematic Reviews

Aspect of Validity  General Issues              Application to
                                                Systematic Reviews

Content             Logical examination of how  Expert judgments of
                    well test items and         each element of
                    procedures correspond with  review system for
                    construct                   correspondence with
                                                best available

Substantive         Empirical evidence of how   Empirical comparisons
                    well components of the      of different ways of
                    measure correspond with     accomplishing each
                    construct                   step in review

Structural          Are scores from components  Examination of
                    integrated in a way that    systems for combining
                    reflects the construct?     results of each step
                                                in review process

Generalizability    Consideration of ways in    Consideration of how
                    which scores from testing   results of reviews
                    may or may not generalize   may or may not
                    to other settings and       generalize to various
                    situations                  practice contexts

External            Correlations between        Correlations among
                    results of target test and  systematic review
                    other measures              systems; correlations
                                                with other ways of
                                                reviewing evidence

Consequential       Examination of the effects  Examination of
                    of test use on socially     effects of systematic
                    important outcomes          reviews on
                                                effectiveness of

Table 3
Content and Substantive Aspects of Validity Applied to Systematic Reviews

Step in review  Content aspect of        Substantive aspect of
process         validity (Questions for  validity (Empirical
                Expert Judgment)         Questions)

Locating        Experts examine          Employ alternative search
studies         procedures for locating  methods and compare
                studies and judge their  results. For studies that
                ability to locate all    are identified by some,
                relevant studies.        but not all, methods, we
                Experts evaluate         could ask: (1) what
                whether there are        characteristics account
                limitations in the       for the study being missed
                system that could        by some search systems,
                result in missing        (2) what are their
                relevant studies.        characteristics on
                                         relevance, methodology,
                                         outcomes, (3) how would
                                         they affect overall

System for      Experts examine          Employ alternative
assessing       procedures for           standards for selection of
relevance of    reviewing, coding, and   relevant studies and
primary         rating relevance of      compare results across
studies.        studies. Is each study   variations in systems.
                reviewed in sufficient   Alternate standards may
                depth to allow an        include various sets of
                accurate appraisal of    rules as well as global
                relevance? Does each     expert judgment of the
                coding and rating        relevance of studies.
                factor represent an      Studies found to be
                aspect of "relevant      relevant by some, but not
                evidence on the          all, systems may be
                treatment,               analyzed to (1) identify
                participants, outcomes   characteristics that
                and contexts" (i.e.,     account for the difference
                exclude irrelevant       in relevance ratings, and
                factors)? Does the       (2) impact of these
                system as a whole        differences on overall
                include all of the       judgment.
                important features of
                relevance (i.e.,
                represent full
                construct)? Does the
                system appropriately
                weigh various aspects
                of relevance?

System for      Experts examine          Employ alternative
rating          procedures for           procedures/standards for
methodological  reviewing, coding, and   methodological quality and
quality of      rating methodological    compare results. Analyze
primary         quality. Is each study   characteristics of studies
studies.        reviewed in sufficient   that are rated as
                depth to allow for       acceptable by some, but
                accurate appraisal of    not all, systems. How many
                methodological quality?  studies are at stake? What
                Does each coded and      features cause the
                rated feature of         difference in ratings?
                methodological quality   What are their
                represent a feature of   characteristics on
                methodological quality?  relevance and outcomes?
                (i.e., exclude           How would differences in
                irrelevant factors).     methodological ratings
                Does the system as a     affect overall judgment of
                whole include all        intervention?
                important features of
                methodological quality
                (i.e., represent full
                construct)? Do the
                ratings and standards
                appropriately weight
                the various aspects of

System for      Experts examine          Employ alternative systems
assessing       procedures for           for rating outcomes of
outcomes of     reviewing, coding, and   primary studies and
primary         rating outcomes of       compare results across
studies         studies. Do all aspects  systems. How many studies
                of outcomes that are     change ratings? What
                summarized and rated     features differentiate
                (e.g., magnitude of      studies that are rated
                effects, statistical     differently by different
                significance, etc.)      systems? How do these
                represent the construct  differences affect overall
                strength of results      judgment of the
                (i.e., exclude           intervention?
                irrelevant factors)?
                Are there relevant
                aspects of outcomes
                that are not summarized
                and rated (i.e.,
                represent full
                construct)? Are overall
                ratings of strength of
                results in studies
                appropriate to the
                results reported?
                (e.g., if magnitude
                and stat significance
                of studies combined
                into a single rating,
                are the two