Risk and strategic decision-making in developing evidence-based practice guidelines.

Wilczynski, Susan M.

Education & Treatment of Children, West Virginia University Press. May 2012, Vol. 35, Issue 2. ISSN 0748-8491.

Evidence-based practice (EBP) represents an important approach to educating and treating individuals diagnosed with disabilities or disorders. Understanding research findings is the cornerstone of EBP. The methodology of systematic reviews, which involves carefully analyzing research findings, can result in a practice guideline that recommends treatments based on the best available evidence. Educators and practitioners will be best positioned to use these guidelines effectively when they recognize both the strengths and limitations of these documents. This article highlights some of those limitations by reviewing the decisions experts make when they develop practice guidelines. The risks associated with each of the decisions are outlined, with the National Standards Project serving as an example for each decision and its resulting risks. The implications of the risks are considered so that educators and practitioners will be better able to evaluate the usefulness of any practice guideline when they select treatments for the children they serve.

KEYWORDS: Evidence-based practice, practice guidelines, National Standards Project, systematic review

It is impossible to avoid risk in life. You cross the street, you jog, or you listen to music; each of these decisions involves risk. You can get hit by a car when crossing the street; you can strain a muscle when jogging; you can damage your hearing by listening to very loud music. Yet making the opposite decisions also involves risk. If you never cross the road, you will lose access to many good things in life. If you don't exercise, you risk heart disease and a range of other health problems. If you do not listen to music, your quality of life may be significantly diminished. Whether you take action or do not take action, you have made a decision and your choice has inherent risks.

Even though science contributes important answers that improve society, it is important to remember that the series of decisions that are required when the scientific process is followed also have their risks. Every introductory research methods class includes a discussion of possible errors that can occur even when a study employs a rigorous design. False positives occur when the results indicate a treatment is effective--but in reality, it does not actually produce favorable outcomes. Conversely, false negatives occur when results suggest that a treatment is not effective but it actually does produce benefit. These risks can be minimized but they cannot be avoided altogether. Introductory research students are taught to weigh each of these risks and decide how best to design their studies. These students also learn to responsibly remind people who consume their research that errors may have occurred despite their best efforts to avoid them. These students are not taught that the best way to avoid risks is to stop using the scientific method. We would think this viewpoint laughable because the risk of error is much greater when we do not use the scientific method. We understand that the failure to draw a scientific conclusion will mean that decisions will be made without systematic evidence. Thus, there is a third type of risk--the risk associated with failing to draw a conclusion from the best data that exist. Fortunately, although failing to draw a conclusion is the greatest risk, it is the easiest to avoid.

In recent decades, scientists and practitioners have developed an approach that minimizes the risk of failing to draw conclusions when decisions about treatment selection must be made. As defined by Spencer, Detrich, and Slocum (2012) in this special issue, evidence-based practice (EBP) involves the integration of the best available evidence, clinical expertise, and client values and context. These three components consistently appear in the EBP literature. The National Autism Center (NAC, 2009) included a fourth component, the capacity to implement a treatment with fidelity (i.e., accurately delivering the treatment). Despite wording or component differences that appear within the EBP literature, the research findings component consistently serves as the cornerstone of EBP, and those findings are derived from systematic reviews.

Systematic review is the process of evaluating the literature to determine if there is sufficient research evidence to conclude that a treatment produces benefit. The results of systematic reviews are often used to develop practice guidelines. These practice guidelines identify the methods used to conduct systematic reviews and give recommendations for selecting treatments based on the results. They often convey the limitations of the evidence as well as recommendations for how educators and practitioners should use the results.

This article addresses the decisions required to complete a systematic review and develop a practice guideline, along with the potential risks associated with each decision. This article is intended to shine a light on these risks so that we can all be more critical consumers of systematic reviews and the practice guidelines that are developed based on their results. Two important gains can be accomplished by acknowledging the risks associated with systematic reviews and practice guidelines. First, professionals completing systematic reviews can minimize the risks associated with each decision they make, and will be more likely to produce practice guidelines that are useful to educators and practitioners who guide the treatment selection process. Second, professionals relying on practice guidelines will become more critical consumers. As a result, these professionals will be better able to interpret the outcomes reported in practice guidelines and to determine if the recommendations should be applied when serving specific clients or students.

Acknowledging that the process of completing systematic reviews involves risk is not an indictment of practice guidelines any more than recognizing that false positives and false negatives exist is an indictment of science. Readers are cautioned not to throw out the proverbial baby with the bathwater. Rather, the information provided here is intended to help professionals to thoughtfully consider and adopt strategies for reducing possible sources of error - this is at the heart of the scientific process upon which practice guidelines are dependent.

Practice Guidelines

Science and practice have always had an interesting relationship. In the early years of the field of medicine, individuals with little medical training and bogus degrees could hang out a shingle and call themselves "doctor." These doctors were sometimes dynamic characters like John Brinkley, the "Goat Gonad King," who killed many of his patients when performing gonadal transplants from goats to human males with the goal of increasing virility. Despite their charisma and ability to attract a desperate patient population, these charlatans were often opposed by professionals who sought to use science as a foundation of clinical decision-making (Brock, 2009). These professionals began to establish an expectation for medicine to be based on believable research outcomes.

EBP first emerged in the field of medicine in the 1990s when it was recognized that physicians often made recommendations that were not consistent with the best available research outcomes (Evidence-Based Medicine Working Group, 1992). Other fields such as allied health, psychology, and education recognized this same concern and quickly adopted EBP as an approach to serving patients, clients, and students. EBP requires a careful review of available evidence in order to identify treatments that have been empirically demonstrated to be efficacious (i.e., shown to produce benefit in highly controlled studies). This process typically requires a systematic review, which involves a careful and detailed evaluation of the quality, quantity, and consistency of research outcomes. Scientists aggregate results across studies by comparing treatment outcomes to established criteria against which all treatments are judged. The methods for developing practice guidelines can be applied narrowly (e.g., treatments for depression in pre-adolescent girls) or to a broad array of treatments (e.g., identification of efficacious interventions for Autism Spectrum Disorders (ASD)). Practice guidelines have been offered by a single author publishing in a refereed journal or by organizations that work with large teams of experts who have a strong body of knowledge in both the relevant literature and research methodology.

EBP has gained such popularity that the phrase has sadly become ubiquitous, and its meaning has been attenuated because it has been applied to treatments for which no sound empirical evidence exists. In some ways, this is rather shocking, given the number of practice guidelines that require strict methodological rigor before recommendations are generated. But as has been the case throughout history, some people are motivated to use a popular term because it increases their income or fame.

Nefarious intentions do not explain most cases in which the term "evidence based" has been applied to a treatment that lacks adequate research support. Instead, this may occur because no universal methodology for identifying efficacious treatments has emerged. All fields evolve over time, so it is not surprising that such a new approach to evaluating research has not yet been set in stone. When reading about methods for completing a systematic review, the entire process may appear to be perfectly clear, objective, and simple to apply. However, this process can become extremely complicated in application. A host of decisions must be made to evaluate the quality, quantity, and consistency of research outcomes and although the process may be transparent, the risks are not always clear.

I served as the chair of the National Standards Project (NAC, 2009), the largest systematic review of the autism treatment literature completed to date, and will draw largely from this experience in the remainder of this article. Based on a review of 775 studies spanning 50 years of autism research, the National Standards Project identified 11 Established Treatments (i.e., efficacious interventions). In addition, 22 Emerging Treatments (i.e., treatments with preliminary research support but requiring more evidence) and five Unestablished Treatments (i.e., treatments without compelling evidence of efficacy) were identified. The National Autism Center endeavored to be utterly transparent regarding the process for developing their EBP guideline, and many critical decisions were discussed publicly so that input from the autism and scientific communities could inform these decisions.

Although this article identifies many of the critical decisions that were made in order to complete the National Standards Project (NAC, 2009), it is important to recognize that these decisions are not unique to this particular project. For all practice guidelines and the systematic reviews on which they are based, decisions entail risk, and informed consumers should examine these decisions when any recommendations are forwarded. Lest the fear of the risks associated with these decisions tempt consumers to turn away from EBP, they are reminded to bear the greatest risk in mind - the risk of failing to draw conclusions based on the best evidence that is available. Without a firm commitment to EBP, we are likely to be surrounded by more Dr. Brinkleys in the fields of medicine, allied health, psychology, and education. Without an evidence-based approach, educators will expend the resources (e.g., money, time) available to them on treatments that not only squander these valuable resources but could actually cause harm.

Strategic Decisions in Developing Practice Guidelines

There are many strategic decisions that have the potential to influence outcomes reported in practice guidelines. Decisions that can influence the results of systematic reviews and the value of practice guidelines will be considered in more detail below. The strategic decisions made by the experts of the National Standards Project are often provided as an example. In addition, recommendations are made to readers who are interested in learning strategies for determining how these risks may influence their own decision to adopt or reject a practice guideline.

Number and Variety of Experts

A systematic review begins with the identification of experts who will conduct the review and develop the practice guideline. Experts involved in the development of a practice guideline must have knowledge about the range of treatments being considered in the review as well as the scientific methods used to determine treatments' efficacy. But the selection of experts is not without risks. Without input from professionals who represent a sufficient range of perspectives or who have sufficient methodological training and experience, it is unlikely that a thorough review will be conducted. Erroneous recommendations could be made because the experts did not identify a sufficiently large set of articles to review or methods adequate to evaluate the literature. Further, the practice guideline could be broadly rejected when too few experts are involved, undermining the goal of providing useful information that helps educators select effective treatments.

On the other hand, these experts must be able to reach consensus about the model used to evaluate the literature. Obtaining timely and comprehensive feedback from a body of experts that is extremely large and holds widely divergent views can result in significant delays in the completion of a practice guideline - or failure to produce a practice guideline altogether. As the goal of developing a practice guideline is to put relevant information into the hands of those providing treatment, an especially sizeable body of experts could undermine this primary goal by delaying or failing to complete the guideline.

Number and Variety of Evaluation Models

A large number of practice guidelines have been developed since the inception of evidence-based medicine twenty years ago (Evidence-Based Medicine Working Group, 1992). Many of these guidelines have clarified and detailed the methods required to complete a thorough systematic review of the treatment literature. However, there is no universally accepted standard for systematically reviewing the literature - instead there are numerous models of how to conduct the review. Thus, experts must review previous models as a basis for establishing the specific procedures they will use for the review.

If the experts developing the practice guideline review only a small number of models for completing systematic reviews, they face two risks. First, because they did not review the range of models that apply to their particular situation, they may have to expend considerable time "reinventing the wheel." Second, the models that the experts do consider may be too narrow, failing to include all relevant variables or to minimize risks. Conversely, experts may invest a tremendous amount of time reviewing innumerable models for evaluating the literature and risk delaying the completion of the practice guideline. The delay might be worthwhile if it results in more useful recommendations. Unfortunately, this may not be the case because, after several models have been considered, reviewing additional models may yield few improvements in methodology. Thus, information that is critically important to professionals selecting treatments could be needlessly delayed.

Article Identification

Inclusionary and exclusionary criteria. The experts producing a practice guideline must determine which studies to include and which to exclude from the systematic review. Decisions must be made about the population studied, the type of treatment, the research design, as well as many other factors. One example of complications in defining the population of interest involves the definition of autism spectrum disorders (ASD) in the National Standards Project (NAC, 2009). There were difficult questions of co-morbidity - that is, should studies including participants who were diagnosed with ASD along with other disorders be included?

Every time an inclusionary or exclusionary rule is applied, it has a direct impact on the outcomes reported in the practice guideline because it has bearing on whether or not a study is considered for review. Most practice guidelines attempt to severely restrict the review so that the project is manageable and the results are very clear for a highly specific population. Without restriction, the outcomes may be difficult to interpret. For example, if a treatment study includes participants with autism and a major medical disorder and the study showed that the treatment did not produce favorable outcomes, does this mean the treatment is not efficacious for (a) individuals with autism, (b) individuals with major medical disorders, or (c) individuals with both autism and a major medical disorder? By adopting broadly inclusive rules and thus including a study like this in a systematic review for autism, the experts could state that a treatment does not benefit individuals with autism when, in fact, this is not the case (i.e., a false negative).

On the other hand, if the rules are so restrictive that they exclude the majority of the treatment research available for a given population, it raises concerns about whether the practice guideline reflects all that is known from the research. For example, many guidelines do not consider studies that use single-subject research design. Given that the vast majority of research with some populations (e.g., autism) is conducted using single-subject research design, applying this exclusionary rule would mean that most research on the treatment of these disorders would not be included in the review. By restricting the review in this way, it is possible to suggest that a treatment is experimental (i.e., not well supported), but this would simply be a result of applying overly restrictive exclusionary rules.

Literature Search. Once the inclusionary and exclusionary criteria are established, the experts must complete an exhaustive search for relevant research. This process typically begins with putting key words into a search engine such as PsycINFO or PubMed. Unfortunately, search engines do not identify all articles that should appropriately be included in a review. What process should be used to find relevant studies that are not found using standard search engines? Options include: (a) asking the experts to identify articles that are missing, (b) making the existing list available publicly and asking interested parties to identify articles that were not captured in the review, (c) requesting a second team of experts to secure additional articles, (d) reviewing recent articles/texts for missing articles, and (e) systematically reviewing references for all obtained articles to identify potentially relevant articles. There are risks associated with all methods for expanding the review beyond standard search engines. The options forwarded here could increase the number of articles identified, which could improve accuracy but would also require a great deal of additional time and resources. Imagine how long it would take to retrieve and evaluate the suitability of every article cited in each of the 775 studies reviewed in the National Standards Project (NAC, 2009). Further, given the large number of published studies that do not meet the highest criteria for most systematic reviews, the additional time may not significantly alter recommendations.

Reviewer Identification and Reliability

Once a set of research studies has been identified, experts must carefully review each study. A review of an extensive literature base can be completed more quickly if multiple qualified reviewers complete the reviews. However, using a large number of reviewers also increases the likelihood that different conclusions will be drawn about a given article due to human error, even when practice guidelines clearly document how the review should be completed. If reviewers do not agree, there is no point in combining results across their reviews because some treatments may be incorrectly categorized as efficacious or as lacking empirical support. A rigorous methodology requires that a reviewer cannot be deemed qualified unless he or she is able to demonstrate a high degree of agreement with other reviewers before beginning the extensive reviews.

By convention, individual studies with inter-observer agreement of less than .80 are not considered believable (Barlow, Nock, & Hersen, 2009; Cooper, Heron, & Heward, 2007). This same minimum criterion can be reasonably extended to systematic reviews and the practice guidelines on which they are based. Reviewers should establish inter-observer agreement at or above .80 before reviews are initiated. Further, inter-observer agreement should be maintained throughout the review process. Without meeting this criterion, the outcomes of the review and the recommendations upon which they are based should be considered suspect.
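The .80 convention can be checked mechanically. The sketch below (not from the article; the ratings are hypothetical) computes simple percent agreement between two reviewers rating the same studies; the National Standards Project's actual reliability procedures may have differed.

```python
def percent_agreement(reviewer_a, reviewer_b):
    """Proportion of studies on which two reviewers gave the same rating."""
    if len(reviewer_a) != len(reviewer_b):
        raise ValueError("Reviewers must rate the same set of studies")
    matches = sum(a == b for a, b in zip(reviewer_a, reviewer_b))
    return matches / len(reviewer_a)

# Hypothetical ratings for ten studies (e.g., 0-5 factor scores).
a = [5, 4, 4, 3, 2, 5, 1, 0, 3, 4]
b = [5, 4, 3, 3, 2, 5, 1, 0, 3, 4]

agreement = percent_agreement(a, b)
print(agreement)           # 0.9
print(agreement >= 0.80)   # True: these reviewers could proceed
```

Simple percent agreement is the most basic index; chance-corrected statistics such as Cohen's kappa are often preferred when rating categories are unevenly distributed.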

Reviewers cannot begin the review process until reliability has been established. If a large number of reviewers are to participate, the process of ensuring that all reviewers demonstrate reliability may require a good deal of time. Further, a reviewer may not maintain reliability even if inter-observer agreement was initially attained. This would mean that some reviews would need to be thrown out and each of these studies would need to be re-reviewed by another professional, resulting in a delay in the completion of a practice guideline. A larger pool of reviewers yields a greater likelihood that some reviewers will fail to maintain reliability.

Criteria for Determining Scientific Merit of a Study

Some studies are not capable of answering the question "Is this treatment efficacious for individuals from a specific population?" Many studies suggest a treatment may produce benefit but, in reality, the researchers did not design the study in a way that would definitively answer this critical question. When evaluating studies in a systematic review, reviewers must extract sufficient information from each study to determine if it is scientifically capable of answering this question. Although there is general agreement that scientific merit should be determined prior to drawing conclusions about treatment efficacy, the specific standards for rating scientific merit are not universally accepted (Kvernbekk, 2011).

The National Standards Project (NAC, 2009) evaluated five factors of each research study: (a) research design, (b) dependent measure, (c) treatment fidelity, (d) participant ascertainment, and (e) generalization. In the past, issues of research design and the dependent measure have been the main concerns in systematic reviews. More recently, the scientific community has come to see additional factors as critical for acceptable research. These more recently emphasized factors include treatment fidelity (i.e., evidence that a treatment was accurately implemented), participant ascertainment (i.e., that participants belong to the population they are intended to represent), and generalization (i.e., that the effects generalize to other situations). Given the historical importance of these factors and their relative impact on the decision about treatment efficacy, each of these factors was given a different weighting for the National Standards Project. The formula was: research design (.30) + dependent measure (.25) + participant ascertainment (.20) + treatment fidelity (.15) + generalization (.10) = scientific merit.
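The weighting formula above can be expressed directly in code. The weights below are those reported in the text for the National Standards Project; the factor scores for the example study are hypothetical, chosen only to show how the composite is computed.

```python
# Weights as described for the National Standards Project (NAC, 2009).
WEIGHTS = {
    "research_design": 0.30,
    "dependent_measure": 0.25,
    "participant_ascertainment": 0.20,
    "treatment_fidelity": 0.15,
    "generalization": 0.10,
}

def scientific_merit(scores):
    """Weighted composite of the five factor scores (each rated 0-5)."""
    return sum(WEIGHTS[factor] * scores[factor] for factor in WEIGHTS)

# Hypothetical study: strong design, weaker generalization evidence.
study = {
    "research_design": 5,
    "dependent_measure": 4,
    "participant_ascertainment": 3,
    "treatment_fidelity": 4,
    "generalization": 2,
}
print(round(scientific_merit(study), 2))  # 3.9
```

Because the weights sum to 1.0 and each factor ranges from 0 to 5, the composite also falls between 0 and 5, so merit scores remain on the same scale as the individual factor ratings.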

As simple as this solution seems, this process actually involves three levels of strategic decisions and each level brings different risks. First, a different group of experts may have selected different factors to determine scientific merit. For example, many systematic reviews do not include an indicator of generalization in the determination of scientific merit. A given study may receive a different score for scientific merit if a different arrangement of factors had been evaluated. Second, a six-point multidimensional scale was developed for each of these factors. Criteria were developed so that a score between 0 and 5 could be applied to each of these factors for every study that was reviewed (NAC, 2009). A given study could receive a very different score on any one of these factors if the experts had identified a different scale for evaluating that factor (e.g., every study receives a score of 0 or 1). Third, the weighting applied to each of these factors was based on expert opinion and agreement was not always unanimous. For example, not every expert involved with the National Standards Project agreed that participant ascertainment should receive a higher rating than treatment fidelity and generalization (NAC, 2009). In fact, one expert suggested that all factors should be given equal weightings. Had this recommendation been followed, the overall scientific merit score for any given study could have been different. There is no universal standard because there is no perfect solution. Risks exist irrespective of the decisions that are made. But the experts must weigh the risks and find a balance to best produce scientific merit scores that will contribute to identification of efficacious treatments.

Criteria for Determining Treatment Outcomes for a Study

Scientific merit only tells the consumer whether or not they should have confidence in the outcomes of a study - it does not tell them the nature of the outcomes. Does the treatment produce beneficial outcomes, adverse outcomes, no change, or is it impossible to determine its effects?

When conducting systematic reviews, the experts must develop criteria for describing treatment outcomes. The risk of setting treatment outcome criteria too high is that it becomes too difficult for researchers to show a treatment produces benefit or harm. Conversely, if the criteria are met too easily, many treatments may be identified as efficacious when, in reality, they are ineffective. For example, the criterion required to show that a treatment was effective when single-subject research design was evaluated in the National Standards Project (NAC, 2009) was that a clear relationship between the independent variable (i.e., treatment) and the dependent variable (i.e., outcomes) was demonstrated at least two times. If the criterion were that a relationship between the treatment and the outcomes could be demonstrated with only one comparison of baseline and treatment conditions (e.g., an AB design), many more studies would have been seen as indicating beneficial treatments. If the criteria required three demonstrations (e.g., an ABABAB design), many fewer studies would have provided this evidence.

Setting the criteria for treatment outcomes has clear practical implications. If less restrictive criteria are used, the likelihood of overestimating the number of treatments that produce beneficial outcomes increases. This means that school systems, insurance companies, and families could expend tremendous funds on treatments that do not actually improve lives and that an important opportunity for effective treatment will be lost. On the other hand, if more restrictive criteria are adopted, the review could fail to recognize treatments that are truly effective. Practice guidelines could offer few or no recommendations, despite the fact that substantial evidence is available to support the use of some treatments.

Categorization of Treatments

As noted previously, some practice guidelines involve a review of a broad literature base because the goal is to identify every treatment that is efficacious for a given population. All studies regarding a given disorder are reviewed, and treatments must be organized into meaningful categories. The most obvious strategy for categorizing treatments is to call each treatment by the name provided in the study. However, this strategy is inadequate because (a) two treatments may be very similar but have been given two different names by different researchers, (b) many treatments do not have a given name but are multi-component treatments that often emerge out of a single general approach (e.g., providing consequences), and (c) two substantially different procedures may be called by the same name.

If treatments are categorized based on similarity, the experts must determine how broadly or narrowly to define the treatment category. There is no exact science for completing this task - once again these decisions involve risk. If the experts define the treatment category broadly (i.e., they include a large number of treatments within a given treatment category), the results may lack sufficient specificity. The category may include some specific variations that are effective and others that are ineffective. In addition, broad categories of treatments may result in recommendations that are too vague to be useful. At the extreme, imagine if a treatment category was identified as "behavioral or social science" based on the language provided by the National Institutes of Health (2010).

Based on this definition, virtually all non-biological treatments would fall into one broad category. A practice guideline would simply recommend that "behavioral or social science" treatments should be used because they work. This would be of limited practical value to the educators and practitioners responsible for selecting treatments.

On the other hand, the treatment category can be defined too narrowly. For example, if treatments were categorized based on the name given in each specific article, it may actually be more difficult to interpret outcomes. Consider the case of 'reinforcement.' If the expert reviewers were to decide that a distinct treatment name would be given to each variation in how reinforcement was delivered, this would result in innumerable treatment categories and would miss the fundamental similarity of all the variations. For example, a different treatment category would have to be created if reinforcers were delivered after every demonstration of a target behavior (fixed ratio 1), every other demonstration (fixed ratio 2), every third demonstration (fixed ratio 3), etc. Similarly, a different treatment category would have to be created if reinforcers were delivered the first time the behavior was demonstrated after 1 minute passed (fixed interval 1), 2 minutes passed (fixed interval 2), three minutes passed (fixed interval 3), etc. Further, different treatment categories would be developed if the ratio or interval at which reinforcers were delivered were variable. The schedule of reinforcement is not the only factor that might result in novel treatment categories. The plan for gradually reducing the delivery of reinforcers (i.e., thinning the schedule of reinforcement) would also vary across experiments and every variation would result in a novel treatment category. Treatment categories would further proliferate for every type of reinforcer delivered (e.g., attention, escape, tangible, physiological stimulation). Even within these categories, a different treatment category could be generated for every different variation of the type of reinforcer (e.g., for tangibles, cookies, crackers, stuffed animals, books, stickers, etc.). 
Finally, every combination of (a) schedule of reinforcement, (b) thinning of reinforcement, (c) type of reinforcement, and (d) specific stimulus would have to receive a novel treatment category label. The experts would have to apply the same strategy when categorizing every treatment.
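The combinatorial explosion described above is easy to quantify. The sketch below is a toy illustration only; the dimension values are invented examples rather than an actual taxonomy of reinforcement procedures. The number of distinct "treatment categories" is simply the product of the number of options along each dimension:

```python
from itertools import product

# Hypothetical dimensions along which 'reinforcement' studies can differ.
# Even these short lists are far from exhaustive.
schedules = ["FR1", "FR2", "FR3", "VR2", "FI1", "FI2", "VI1"]
thinning_plans = ["none", "gradual", "abrupt"]
reinforcer_types = ["attention", "escape", "tangible", "physiological"]
stimuli = ["cookie", "cracker", "sticker", "book", "stuffed animal"]

# If every combination became its own treatment category, the number of
# categories is the product of the dimension sizes.
categories = list(product(schedules, thinning_plans, reinforcer_types, stimuli))
print(len(categories))  # 7 * 3 * 4 * 5 = 420 categories from a single treatment
```

Because no treatment literature contains hundreds of replications of each such micro-variation, almost every category would remain permanently "experimental" under this scheme.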

The effect of narrowly defining the treatments in this way would be twofold. First, it would be extremely difficult to test and replicate every treatment for every population. This would mean that it would be exceedingly unlikely that any treatment could ever receive a rating other than 'experimental' or 'unsupported.' Second, the recommendations made in the practice guideline would be of limited value to consumers of these documents because they would not be organized in a way that reflects actual treatment selection processes. In practical terms, if each study were its own category, there would never be enough research on any treatment to show that it produced benefit. There would be no reason to complete systematic reviews because they would produce no useful information.

For this reason, experts use their best judgment to identify the size of the treatment category and to define the category as clearly as possible. Given that the goal of a practice guideline is to provide information to support the selection of treatments in order to improve the quality of life for children, adolescents, and adults, the utility of the treatment category size decisions should be of paramount importance.

Criteria for Identifying Efficacious Treatments

All previous decisions lead to one culminating point: Information from all of the studies of a given treatment must be combined to determine how well the treatment is supported by the research. First, the scores a study receives for scientific merit and treatment outcomes are combined to describe how much that study contributes to our understanding of the treatment's efficacy. Second, the collective evidence from all studies of the treatment is compared against a set of criteria. The final statement about treatment efficacy is typically based on the quality, quantity, and consistency of research findings (West et al., 2002). These variables formed the foundation for the Strength of Evidence Classification System (SECS) in the National Standards Project (NAC, 2009).
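The general logic of combining quality, quantity, and consistency can be sketched in a few lines. To be clear, the function below is not the National Standards Project's actual SECS scoring system; the thresholds, field names, and category labels are all invented for illustration of how such criteria interact:

```python
def strength_of_evidence(studies, min_count=2, min_quality=3, max_inconsistency=0.25):
    """Toy classifier combining quality, quantity, and consistency of findings.

    Each study is a dict with 'quality' (a 0-5 scientific merit score) and
    'favorable' (True if the study reported a beneficial outcome). All
    thresholds are invented for illustration only."""
    # Quality: only studies meeting a minimum scientific merit score count.
    strong = [s for s in studies if s["quality"] >= min_quality]
    # Quantity: demand replication across a minimum number of such studies.
    if len(strong) < min_count:
        return "insufficient evidence"
    # Consistency: too many unfavorable findings means the evidence is mixed.
    unfavorable = sum(1 for s in strong if not s["favorable"])
    if unfavorable / len(strong) > max_inconsistency:
        return "mixed evidence"
    return "established"

studies = [{"quality": 4, "favorable": True},
           {"quality": 5, "favorable": True},
           {"quality": 2, "favorable": False}]  # low-merit study is screened out
print(strength_of_evidence(studies))  # "established"
```

Notice how every risk discussed below corresponds to a parameter choice: raising `min_count` or `min_quality` risks false negatives, lowering them risks false positives, and `max_inconsistency` determines how disagreements in the literature are handled.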

Quality and Quantity of Studies. Although replication is crucial for demonstrating the efficacy of a treatment, if the criterion for the number of studies is too high, it could take decades to accumulate sufficient evidence to support an effective treatment. Decision-makers could be left with no guidance despite the fact that valuable evidence is available in the treatment literature. Similarly, if the criterion for the quality of studies is too high (e.g., a perfect score for scientific merit is required across multiple studies), the likelihood that researchers would complete a sufficient number of such studies is remote given the realities of limited funding and participant interest. For instance, how many researchers have funding that allows them not only to demonstrate an effect but also to provide multiple examples of generalization (e.g., across multiple teachers, with multiple sets of materials, across multiple environments, over time)? Similarly, how many parents of children on the autism spectrum would be willing to participate in a treatment study in which their child was in the control condition over multiple years? Even if they were offered effective treatment at the conclusion of the study, few parents are likely to wait for years before their child receives help.

Conversely, if the criteria for number and quality of studies are set too low, false positives will occur. That is, treatments will be identified as efficacious even though, in reality, they do not produce benefit. Precious resources will be lost to treatments that are ineffective or could even cause harm.

Consistency of Outcomes. Quality and quantity are only two components of identifying the strength of evidence supporting an intervention. The experts must also have a system for addressing inconsistencies in the literature. Inconsistent findings might be reported in the literature for any number of reasons. The primary explanations are that: (a) despite strong scientific merit, one of the studies has reported an inaccurate finding that results in a false positive or false negative, and (b) a treatment may be identified as producing favorable outcomes in a given study but the scientific rigor of that study is too poor to produce clear outcomes. Experts must develop a strategy for interpreting inconsistencies in the literature with these explanations in mind. The risk associated with setting a criterion for inconsistencies is that if you set the criterion too high (i.e., any inconsistencies mean that the treatment cannot be rated as effective), the system may identify a treatment as ineffective when, in reality, it produces benefit to the population. On the other hand, if the criterion is set too low (i.e., even with many inconsistencies treatments are rated as effective), the system may identify a treatment as efficacious when it is not.

Levels of Classification. The experts must decide if they will identify only those treatments that reach the criteria for efficacy (i.e., treatments that 'work') or if they describe all treatments in terms of the level of scientific evidence currently available. Some systematic reviews or practice guidelines identify only those treatments that are deemed efficacious; however, many users of these guidelines want to know the quality, quantity, and consistency of outcomes for treatments that do not meet the highest standard.

Risks exist whether the experts select a two-level (e.g., efficacious or lacks sufficient evidence) or a multi-level system (e.g., efficacious, preliminary, no reliable evidence, or harmful/ineffective). In a two-level system, most treatments will not reach the efficacy criteria (even with very low standards). A two-level system does not provide information on the evidence that is available on these treatments. Are they just below the standard, do they lack any evidence at all, or have they been shown to cause harm? In the absence of this information about the majority of treatments, consumers may make inaccurate assumptions. On the other hand, when experts select a multi-level system, people often incorrectly assume that a treatment with some support (e.g., preliminary evidence) does not pose any risk. In reality, when additional well-controlled research is conducted, some of these treatments may actually be shown to be ineffective or harmful.


Individuals with special needs deserve to have access to treatments that work. Educators and practitioners are highly motivated to select treatments that will produce socially meaningful improvement in the lives of the children, adolescents, or adults in their care. Unfortunately, there are people and organizations that prey on the good intentions of individuals selecting treatments by promising miracle cures. Systematic reviews and the practice guidelines based on these reviews provide the most accurate way of identifying efficacious treatments. Educators and practitioners need the information that results from these guidelines so that they can identify treatments that are most likely to be effective.

EBP is currently riding a tidal wave of popularity with good cause. Practice guidelines can offer improved access to effective treatments by putting critical information in the hands of decision-makers. But consumers of these guidelines must also take action to avoid the undertow. Consumers should become informed about the methodology applied when practice guidelines are developed and the risks associated with each methodological decision. The intensity of the risk varies across decisions, but each has implications for the utility of these practice guidelines when selecting treatments for individual clients. Some decisions could delay the completion of the project, which leaves educators and practitioners without clear guidance about which treatments have adequate research support. In this case, almost certainly, some of the treatments selected will be ineffective or harmful. Other decisions may compromise the outcomes and recommendations, increasing the risk of Type I or Type II errors: falsely stating a treatment is efficacious when it is not, or stating a treatment is ineffective or lacks research support when it actually produces benefit. Either of these outcomes means that valuable resources could be spent on the wrong treatment for children, adolescents, or adults. Further, some decisions may undermine the goal of producing a guideline that is useful to professionals who have the responsibility of selecting treatments. That is, a lot of work may go into producing a practice guideline that is simply not helpful.

With all of the risks involved with producing practice guidelines, it would be easy to decide that the entire concept of EBP should be rejected. This course of action might even seem reasonable if it were not so dangerous: there are many modern-day John Brinkleys in our midst. In 1947, Winston Churchill stated, "Democracy is the worst form of government, except for all of those other forms that have been tried from time to time" (Notable Quotes, 2011). Similarly, it could be argued that following recommendations based on systematic reviews is the worst method of selecting treatments, except for all of those other methods that have been tried from time to time.

Democracy requires critical and engaged citizens, and EBP requires informed consumers. In the same way that introductory research methods students learn to critically evaluate research and to minimize risks, developers and consumers of practice guidelines must learn to weigh and systematically reduce the risks of producing errors or incorrectly applying the recommendations provided in practice guidelines. There is no formula for minimizing these risks, and the "best" decision may vary across different populations and treatments. But practitioners and educators can evaluate for themselves whether sound decisions were made and whether the recommendations forwarded should actually guide treatment selection for an individual client.

Although there are inherent risks associated with practice guidelines, consumers must not lose sight of the fact that the greatest risk comes from the failure to use the best available evidence as the basis for making important decisions. The likelihood of selecting ineffective or harmful treatments is much higher in the absence of a practice guideline. By becoming critical consumers of practice guidelines, practitioners and educators can identify those guidelines that provide them with trustworthy and useful information.

The responsibility for using effective treatments does not end with the scientists conducting research or the experts who aggregate the research outcomes in order to identify efficacious treatments. Educators and practitioners are encouraged to begin by selecting from among the array of efficacious treatments that have been identified in well-conducted systematic reviews. But this is only the first step. Professionals delivering treatment must collect data in a way that determines whether or not the intervention produces benefit for a given client. Reliable data must be collected and the treatment must be introduced in a systematic manner, such that the practitioner can quickly determine if the treatment improves quality of life. Only then can each individual receiving treatment reach his or her potential.


Special thanks are offered to Keith Allen, Tim Slocum, and Ronnie Detrich for helping to inspire and encourage the writing of this article.


Barlow, D. H., Nock, M. K., & Hersen, M. (2009). Single case experimental designs: Strategies for studying behavior change (3rd ed.). Boston, MA: Pearson Education Inc.

Brock, P. (2009). Charlatan: America's most dangerous huckster, the man who pursued him, and the age of flimflam. New York, NY: Crown Publishers.

Cooper, J. O., Heron, T. E., & Heward, W. L. (2007). Applied behavior analysis (2nd ed.). Upper Saddle River, NJ: Pearson Education Inc.

Evidence-Based Medicine Working Group. (1992). Evidence-based medicine: A new approach to teaching the practice of medicine. JAMA, 268, 2420-2425.

Kirschstein, R. L. (2000). Description of behavioral and social sciences research. National Institutes of Health. 1-68.

Kvernbekk, T. (2011). The concept of evidence in evidence-based practice. Educational Theory, 61, 515-532.

National Autism Center. (2009). National Standards Report: National Standards Project Addressing the need for evidence-based practice guidelines for autism spectrum disorders. Randolph, MA: National Autism Center, Inc.

Notable Quotes. (2011). Democracy quotes [Web log post]. Retrieved from http://www.notablequotes.com/d/democracy_quotes.html

Spencer, T. D., Detrich, R., & Slocum, T. A. (2012). Evidence-based practice: A framework for making effective decisions. Education and Treatment of Children.

West, S., King, V., Carey, T. S., Lohr, K. N., McKoy, N., Sutton, S. F., & Lux, L. (2002). Systems to rate the strength of scientific evidence (Evidence Report/Technology Assessment No. 47; AHRQ Publication No. 02-E016). Prepared by the Research Triangle Institute-University of North Carolina Evidence-Based Practice Center under Contract No. 290-97-0011. Research Triangle Park, NC: University of North Carolina.

Correspondence to Susan M. Wilczynski, Special Education Department, Ball State University, Muncie, IN 47306; email: smwilczynski@bsu.edu. The author identifies her previous role as Executive Director of the National Autism Center as critical to the development of this article. However, this article has not been reviewed by the National Autism Center in advance of publication and thus, it does not reflect the views of that organization.

Susan M. Wilczynski

Ball State University
Risk to adoption: An insufficient range of expertise could result in
  broad rejection of the practice guideline.

  Risk to timely access: An excessively large group of experts could
  result in significant delays in project completion, leaving educators
  without critical information for treatment selection.

  Recommendation to reader: Consumers should look at the credentials
  of the experts. If relevant professional perspectives have been
  represented, consumers can be more confident that the practice
  guideline has a broader consensus. In some cases, multiple fields
  of study (e.g., medicine, education, behavior analysis, etc.) are
  required to have sufficient representation.

Risk to accurate methodology: Insufficient examination of systematic
  review methodologies could produce false positives or false negatives.

  Risk to timely completion: Excessive time devoted to examining
  systematic review methodologies could delay the completion of the
  practice guideline without significant improvement to the final
  product.
  Recommendation to reader: Systematic reviews typically list
  the guidelines on which the methodology is based. Consumers should
  consider examining these methodologies to determine if critical
  variables have been omitted.

Risk to utility: Highly restrictive exclusion rules could produce a
  practice guideline that applies to a very small subset of the
  population who needs treatment. Conversely, adoption of extremely
  broad criteria could yield a practice guideline that provides
  practitioners no unique information about the specific population they
  serve.

  Risk to accuracy: Highly restrictive or extremely broad rules
  could result in a treatment being incorrectly identified as
  ineffective or experimental.

  Recommendation to reader: Consumers are encouraged to examine the
  inclusionary and exclusionary rules applied in a systematic review.
  If these rules result in a systematic review that does not apply to
  the practice questions, the practice guideline may not usefully inform
  treatment selection.

Risk to accuracy: Omission of relevant studies undermines the
  accuracy of reported outcomes.

  Risk to timely completion: When the most extensive methods for
  article retrieval are used, educators and practitioners may be
  deprived of information about treatment efficacy for a protracted
  period of time without a significant improvement in the accuracy of
  that information.

  Recommendation to reader:
  Consumers should decide if the experts used a sufficient number of
  search engines and used an adequate number of key words in their
  search. They should also determine if they applied reasonable
  strategies to identify articles that may have been missed by the
  search engine. This will help consumers decide if relevant
  studies are likely to have been identified.

Risk to timely completion: Delays to the practice guideline may result
  from (a) too few reviewers, (b) the time it takes to train a large
  panel of reviewers, or (c) the time required to re-review articles
  because some reviewers from the large pool did not evaluate them
  reliably.

  Recommendation to reader: Consumers should be extremely cautious in
  considering the results of any systematic review that does not
  report inter-observer agreement or that reports inter-observer
  agreement falling below .80.
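At its simplest, inter-observer agreement is the proportion of items on which two independent reviewers assign the same score. Below is a minimal sketch of this point-by-point method; the article scores are invented purely for illustration:

```python
def interobserver_agreement(ratings_a, ratings_b):
    """Point-by-point inter-observer agreement: agreements / total items."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must score the same set of items")
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return agreements / len(ratings_a)

# Two reviewers independently scoring ten articles on a 5-point merit scale:
reviewer_1 = [5, 4, 4, 3, 5, 2, 4, 4, 3, 5]
reviewer_2 = [5, 4, 3, 3, 5, 2, 4, 4, 3, 5]
print(interobserver_agreement(reviewer_1, reviewer_2))  # 0.9 -- above the .80 convention
```

An agreement coefficient below the conventional .80 threshold suggests the reviewers were not applying the rating criteria consistently, so the articles in question would typically need to be re-reviewed.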

Risk to accuracy: The accuracy of the practice guideline may be
  threatened when (a) all relevant variables for assessing scientific
  merit have not been included, (b) the criteria are not reasonable, or
  (c) the weightings are not appropriate given the field of study or
  area of investigation.

  Recommendation to reader: Consumers should be familiar with the
  factors that are being rated to determine the scientific merit of
  a given study. Are the factors sufficiently comprehensive or are
  only a few factors considered? If the factors do not adequately
  represent every variable that should be considered in order to
  critically analyze a given study, the outcomes reported in
  the practice guideline may not be accurate. Also, if
  weightings are assigned, do the experts have a strong rationale
  for these weightings? It is best if the weightings do not seem
  to inflate or deflate the scientific merit of each treatment.

Risk for false negatives: Stringent criteria may result in all studies
  being described as "not having enough evidence."

  Risk for false positives: Loose criteria may mean almost all
  treatments are reported to be efficacious, when in reality this is
  not the case.

  Recommendation to reader: Consumers should examine the criteria
  used to describe treatment outcomes. Are they consistent with the
  general rules taught in graduate programs to determine that a
  treatment produced a positive outcome? If the criteria seem much
  stronger or weaker, the results reported in the practice guideline may
  be questioned.

Behavioral and social sciences research is a large, multifaceted
  field, encompassing a wide array of disciplines ... employs a
  variety of methodological approaches ... several key
  cross-cutting themes [that] ... include: an emphasis on
  theory-driven research; the search for general principles
  of behavioral and social functioning; the importance ascribed
  to a developmental, lifespan perspective; an emphasis on
  individual variation, and variation across sociodemographic
  categories such as gender, age, and sociocultural status;
  and a focus on both social and  biological contexts of
  behavior (Para. 4).

Risk to comprehensive review: The experts may be unable to analyze or
  draw meaning from the majority of the treatment research if experts
  define treatment categories too narrowly.

  Risk to utility: Practice
  guidelines may not provide sufficiently specific information to help
  educators and practitioners select treatments if the treatment
  categories are defined very broadly.

  Recommendation to reader:
  Consumers should read the description of each treatment category and
  determine if it makes sense from an applied point of view. If the
  treatment categories are  not defined in a way that informs selection
  of treatments, the practice guideline will be of limited value.

Risk for false negatives: Truly efficacious treatments may be deemed
  to lack sufficient evidence and may not be used if the criteria for
  quality, quantity, and consistency are too high.

  Risk for false positives: Ineffective or possibly harmful treatments
  are more likely   to be endorsed as efficacious if the criteria are
  too low. Limited resources could be expended on a treatment that
  does not produce benefit.

  Risk for adoption of ineffective or harmful
  treatments: Although a multilevel system provides more information
  about available evidence, it may also result in greater
  misinterpretation. Treatments with preliminary evidence may be
  interpreted as "close enough" and adopted despite the fact that they
  actually do not work or produce adverse effects.

  Recommendation to reader: Irrespective of the levels of evidence
  provided in practice guidelines, it is always safest for consumers
  to select first among the treatments that have been shown to be
  efficacious. All other treatments have a greater possibility of
  being ineffective or harmful. But even when efficacious treatments
  are selected, the only way to know if the treatment is effective
  for a given client is to collect data and systematically introduce
  the treatment using one of the many different single subject
  research designs that can be appropriately used given the
  individual case.