Quality appraisal of single-subject experimental designs: an overview and comparison of different appraisal tools.
Article Type:
Report
Authors:
Wendt, Oliver
Miller, Bridget
Pub Date:
05/01/2012
Publication:
Name: Education & Treatment of Children; Publisher: West Virginia University Press; ISSN: 0748-8491
Issue:
Date: May 2012; Volume: 35; Issue: 2

Abstract

Critical appraisal of the research literature is an essential step in informing and implementing evidence-based practice. Quality appraisal tools that assess the methodological quality of experimental studies provide a means to identify the most rigorous research suitable for evidence-based decision-making. In single-subject experimental research, quality appraisal is still in its infancy. Seven different quality appraisal tools were identified and compared with respect to their compliance with current standards for conducting single-subject experiments as well as their performance in evaluating research reports. Considerable variability was noted relative to the construction and content of the tools, which consequently led to variability in their evaluation results. Few tools provided empirical support for the validity of item construction and reliability of use. The Evaluative Method, the Certainty Framework, the What Works Clearinghouse Standards, and the Evidence in Augmentative and Alternative Communication Scales were identified as the more suitable instruments currently available for the critical appraisal of single-subject experimental designs, noting their different strengths and limitations. In the absence of a "gold standard critical appraisal tool," applied researchers and practitioners need to proceed with caution when interpreting evaluation results obtained from the existing tools, keeping their context and intent in mind.

KEYWORDS: Appraisal tool, critical appraisal, evidence-based practice, quality assessment, single-case design, single-subject experiment, single-subject research, systematic review

Recent movements towards evidence-based practice (EBP) in applied and clinical disciplines have brought more attention to the value of single-subject experimental designs (SSEDs). The methodologies of SSEDs are uniquely suited to investigate the impacts of specific interventions, making SSEDs critical for implementing EBP. SSEDs represent a quasi-experimental approach to evaluating intervention efficacy in a single participant or a small group of participants, who serve as their own controls (Backman & Harris, 1999).

SSEDs can be a powerful source of evidence for EBP purposes: Several variants--for example, the adapted alternating treatment design or the parallel treatment design--allow a rigorous evaluation of different interventions and their effects simultaneously, and they are of utmost importance for the current emphasis on comparative effectiveness research in many applied fields (Agency for Healthcare Research and Quality, 2009). Unsurprisingly, the use of SSEDs has increased, especially when dealing with very heterogeneous populations--for example, individuals with autism spectrum and other developmental disorders, behavior disorders, communication disorders, learning disabilities, mental health disorders, and physical impairments. The problem of obtaining homogeneous samples of participants with similar characteristics and the high cost of applied research make group-comparison designs difficult to implement with these populations (Barlow, Nock, & Hersen, 2009). Consequently, SSEDs constitute a considerable percentage of intervention studies across the fields of behavioral, disability, educational and rehabilitation research (e.g., Schlosser, 2009; Wendt, 2007). A growing array of scholarly disciplines has incorporated SSEDs into their methodological repertoire; the growing popularity of SSEDs is reflected by over 45 professional, peer-reviewed journals now reporting single-subject experimental research (American Psychological Association [APA], 2002; Anderson, 2001).

Due to the ability to establish experimental control and reveal intervention efficacy, several authors recommend SSEDs for identifying empirically supported interventions (e.g., Horner et al., 2005; Schlosser, 2003; Schlosser & Sigafoos, 2008). This movement has led to increased scrutiny of how well SSED research is conducted (Perdices & Tate, 2009). Applied researchers and practitioners face the challenge of separating poorly conducted from well-conducted SSED studies before making intervention decisions and defining an intervention as empirically supported (i.e., the presence of several SSEDs that meet minimum standards). Intervention research that is compromised by fatal flaws in research design is not appropriate for guiding educational or clinical decisions, as reported outcomes cannot clearly be attributed to the intervention (Wendt, 2009). Consequently, consumers of SSED research need to evaluate the methodological rigor of any SSED investigation before relying on study results for EBP purposes. In a similar manner, applied researchers aiming to synthesize SSEDs in a systematic review have to assess study quality and assign more weight to sound studies when evaluating the overall research evidence. This process of examining aspects of reliability and internal and external validity of a published research report is called critical appraisal or assessing study quality (Petticrew & Roberts, 2006) and has become a fundamental element of EBP, particularly in those disciplines where group experimental designs are the norm.

For quite a long time, the fields of healthcare and medicine have utilized established critical appraisal guidelines to help clinicians and applied researchers assess the relevance and rigor of research findings (Crombie, 1996). Critical appraisal guidelines for surveys, cohort studies, case-control studies, and randomized controlled trials have long been available (Crombie, 1996; Straus, Richardson, Glasziou, & Haynes, 2005). Such guidelines can be turned into comprehensive checklists or rating scales, which can be used to assign a score or ranking to a study as an indication of methodological quality. These checklists and scales containing rigorous sets of criteria against which a research report is evaluated are known as critical appraisal tools.

Despite the prevalence and long history of SSEDs, critical appraisal of this methodology had been largely overlooked until recently. It was not until some years ago that the first explicit quality indicators for SSEDs were published by Horner et al. (2005) and subsequently applied to published research (Tankersley, Cook, & Cook, 2008). Shortly afterward, the first SSED quality appraisal tools appeared in the form of rating scales and were published in the fields of neurorehabilitation (Tate et al., 2008) and developmental medicine (Logan, Hickman, Harris, & Heriza, 2008).

The purpose of this article is to provide an overview and preliminary comparison of the different appraisal instruments that have evolved to evaluate the methodological rigor of SSEDs. In particular, our goals are to (a) introduce the currently available appraisal tools and summarize their defining features, (b) compare these tools to a current standard of research design for SSEDs, (c) use the tools across a variety of different SSEDs to investigate their performance in distinguishing study quality, and (d) reveal directions for the further development of SSED quality appraisal tools as well as recommendations for applied researchers and practitioners.

An Overview of Current Quality Appraisal Tools for Single-Subject Experimental Designs

Locating SSED appraisal tools

In order to locate currently available appraisal tools for SSEDs, searches were conducted in the Cumulative Index to Nursing and Allied Health Literature (CINAHL), Education Resources Information Center (ERIC), Linguistics and Language Behavior Abstracts (LLBA), MEDLINE, and PsycINFO, as well as search engines and publisher-specific databases including Google Scholar (http://scholar.google.com), Ixquick (http://www.ixquick.com), ScienceDirect (http://www.sciencedirect.com), Scirus (http://www.scirus.com), Scopus (http://www.scopus.com), and SpringerLink (http://www.springerlink.com). All databases were searched using search strings of the key words "single subject design" or "single case design" or "single subject experiment" in combination with "critical appraisal" or "scale" or "rating." Additionally, the authors conducted footnote chasing from reference lists of obtained qualifying publications (e.g., Perdices & Tate, 2009; Tankersley et al., 2008). To be included, articles needed to operationalize appraisal guidelines into a checklist or scale. Articles that merely discussed quality issues related to SSEDs but did not provide a concrete appraisal tool (e.g., Atkins & Sampson, 2002; Beeson & Robey, 2006) were not included. This yielded a total of seven articles providing the critical appraisal tools under review.

Background and characteristics of currently available tools

The most critical aspects of the seven tools are briefly summarized in Table 1. The various tools differ with respect to their background, composition, psychometric properties, and sponsorship.

Background. It seems noteworthy that some fields have advanced further than others in critical appraisal. Two instruments were created for the evaluation of intervention research related to autism spectrum disorders, including the Evaluative Method (Reichow, Volkmar, & Cicchetti, 2008) and the Smith, Jelen, and Patterson Scale (2010; hereafter referred to as Smith et al. Scale). Two other instruments were developed for or are primarily used in the field of augmentative and alternative communication, comprising the Certainty Framework (Simeonsson & Bailey, 1991) and the scales used for the Evidence in Augmentative and Alternative Communication (EVIDAAC) database. EVIDAAC includes the Single-Subject Scale (Schlosser, 2011) and the Comparative Single-Subject Experimental Design Rating Scale (CSSEDARS; Schlosser, Sigafoos, & Belfiore, 2009; both hereafter referred to as EVIDAAC Scales). The remaining three tools are not limited to a specific subject area. The Logan et al. (2008) Scale (hereafter referred to as Logan et al. Scale) targets the wider area of pediatric medicine and rehabilitation; the Single-Case Experimental Design Scale by Tate and colleagues (2008; hereafter referred to as SCED Scale) originated in the field of neurorehabilitation; and the What Works Clearinghouse Standards (Kratochwill et al., 2010; hereafter referred to as WWC Standards) were designed for the broader area of educational research.

Composition. Four of the seven appraisal tools follow the common appraisal format of providing a checklist with yes/no questions evaluating the presence of certain quality criteria. Depending on how many criteria are met, a final score is derived that for some tools is translated into a quality rating (e.g., strong/moderate/weak evidence). Three tools deviate from this format and require further explanation:

Certainty Framework. Unlike other quality appraisal tools, the Certainty Framework does not provide specific items on a scale that allow calculation of a final quality score. Instead, this framework classifies a research report in terms of the certainty of its findings as conclusive, preponderant, suggestive, or inconclusive. This is done by descriptively evaluating the three dimensions of (a) research design quality, (b) interobserver agreement (IOA) on the dependent variable, and (c) treatment integrity (TI). Conclusive evidence means that the outcomes are undoubtedly the results of the intervention based on a rigorous, flawless design and adequate or better IOA and TI. Preponderant evidence confirms that the outcomes are not only plausible but also more likely than not to have resulted from the intervention, based on a design with only minor flaws and adequate or better IOA and TI. Suggestive evidence verifies that the outcomes are plausible and within the realm of possibility due to a strong design but unsatisfactory IOA and/or TI, or minor design flaws and inadequate IOA and/or TI. Inconclusive evidence indicates that the outcomes are not plausible due to fatal flaws in research design. Because the three dimensions apply to intervention research in general and are not specific to any particular design, the Certainty Framework applies to both group and single-subject research; this versatility can be an advantage when synthesizing studies across the two methodologies in a systematic review.
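
To make these decision rules concrete, the following is a minimal sketch of how the three dimensions might map onto the four certainty levels; the function and category names are ours, not part of the published framework.

    from enum import Enum

    class Design(Enum):
        SOUND = "rigorous, no flaws"
        MINOR_FLAWS = "minor design flaws"
        FATAL_FLAWS = "fatal design flaws"

    def certainty(design: Design, ioa_adequate: bool, ti_adequate: bool) -> str:
        """Classify evidence certainty from design quality, IOA, and TI."""
        reliability_ok = ioa_adequate and ti_adequate
        if design is Design.FATAL_FLAWS:
            return "inconclusive"
        if design is Design.SOUND:
            return "conclusive" if reliability_ok else "suggestive"
        return "preponderant" if reliability_ok else "suggestive"  # minor flaws

    print(certainty(Design.MINOR_FLAWS, ioa_adequate=True, ti_adequate=True))
    # -> preponderant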

Evaluative Method. The Evaluative Method consists of three instruments: 1) Rubrics for evaluating the rigor of a research report include two levels of indicators, primary and secondary. Primary indicators refer to study elements that are deemed critical to the validity of the study, and these are ranked on a three level scale (high quality, acceptable quality, or unacceptable quality). The secondary indicators reflect quality design features that are of value but not strictly necessary for validity purposes, and those are evaluated on a dichotomous scale (evidence or no evidence); 2) The second instrument provides a method for synthesizing the ratings from the rubrics into a rating for the overall strength of the research report (strong, adequate, or weak); 3) The final instrument then provides the means for determining the level of EBP (i.e., rating interventions as "promising" or "established" based on amount and quality of empirical support). The authors combined group design and SSED criteria into the Evaluative Method, enabling quality assessment across methodologies with a single scoring system.

WWC Standards. These standards are divided into two parts, Design Standards and Evidence of Effect Standards. Design Standards are applied first and assess the internal validity of an SSED. Studies are classified as Meets Standards, Meets Standards with Reservations, or Does Not Meet Standards. If a study meets Design Standards (with or without reservations), its results are then evaluated primarily via visual analysis using the Evidence of Effect Standards. This process results in a rating of the strength of effects of that study on a 3-point scale: (1) Strong Evidence of a Causal Relation, (2) Moderate Evidence of a Causal Relation, and (3) No Evidence of a Causal Relation (Kratochwill et al., 2010).
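
This two-stage logic can be sketched as follows; the function and argument names are ours and simply encode the ratings described above, not the WWC's own procedures.

    def wwc_evidence(design_rating: str, effect_rating: str) -> str:
        """Design Standards gate the Evidence of Effect Standards: only studies
        meeting design standards (with or without reservations) are evaluated,
        primarily via visual analysis, for strength of effect."""
        if design_rating == "does not meet standards":
            return "Does Not Meet Standards (no effect rating assigned)"
        labels = {"strong": "Strong Evidence of a Causal Relation",
                  "moderate": "Moderate Evidence of a Causal Relation",
                  "none": "No Evidence of a Causal Relation"}
        return labels[effect_rating]

    print(wwc_evidence("meets standards with reservations", "moderate"))
    # -> Moderate Evidence of a Causal Relation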

Psychometric properties. Only the Evaluative Method and the SCED Scale provided information on content validity of the items used for evaluation. Inter-rater reliability of scale administration was assessed for the Evaluative Method, the Logan et al. Scale, and the SCED Scale.

Sponsorship. Two instruments, the EVIDAAC Scales and the WWC Standards, were supported by the U.S. Department of Education, either through its Institute of Education Science (WWC Standards) or through its National Institute on Disability and Rehabilitation Research (EVIDAAC Scales). The SCED Scale was sponsored by external funding sources in Australia.

Congruence of Appraisal Tools with Horner et al. (2005) Standards

As the EBP movement grew more and more prominent in education, healthcare, and related fields, various task forces outlined at what point an intervention may be considered "empirically validated" or "empirically supported" (Schlosser & Sigafoos, 2008). For example, the American Psychological Association (APA), as early as 1996, sponsored Task Force 12 to identify empirically validated therapies in clinical psychology (Chambless et al., 1998). In the field of special education, the Council for Exceptional Children's Division for Research (CEC-DR) took the initiative by sponsoring a special issue of its flagship journal Exceptional Children that included a series of papers describing quality indicators for special education research and proposing standards for identifying empirically supported treatments (Graham, 2005). Within this special issue, an article by Horner et al. (2005) explicitly outlined criteria SSEDs have to meet in order to be considered of high quality. For a quality appraisal tool to yield meaningful results, it seems critical that the tool be congruent with accepted standards in the field. Such correspondence can be considered a means of validating the content of the instrument. For this purpose, the following sections review the Horner et al. (2005) quality indicators and report the results of a comparison of each appraisal tool against these criteria. Despite its wide recognition in the field (cf., Tankersley et al., 2008), the Horner et al. (2005) article is not the only source outlining quality criteria for SSEDs; textbooks such as those by Di Noia and Tripodi (2008) or Hancock and Mueller (2010) offer similar sets of criteria to evaluate SSEDs. The various tools may have performed differently if any of these sources had been used as standards for comparison. Furthermore, content validity of the different tools could also be evaluated through other methods, such as expert ratings of tool items (e.g., having an expert panel identify the "best instrument" as demonstrated in Hootman, Driban, Sitler, Harris, & Cattano, 2011).

Comparing Appraisal Tools to Horner et al. (2005) Criteria

The comparison of the seven quality appraisal tools to the quality indicators provided by Horner and colleagues (2005; hereafter referred to as Horner Criteria) is shown in Table 2. The Horner Criteria were specified according to the table provided in Horner et al. (2005). These specifications seemed consistent with the presentation of these criteria in the professional SSED literature (e.g., Gast, 2010; Tankersley et al., 2008). It should be noted that text and table in Horner et al. (2005) do not always match, allowing different interpretations of required detail and rigor within the quality indicators. Therefore, when applying the Horner Criteria, some judgment is inevitably needed on how exactly to operationalize them. Consequently, researchers may formulate the quality indicators in different ways depending on context or scenario. Schlosser (2009), for example, presents a very condensed version that abbreviates many of the social validity aspects. For the purposes of this comparison, we supplemented the original 21 quality indicators from the table in Horner et al. (2005) with information from the text specifying the minimum number of baseline data points. We also separated one quality criterion describing several critical baseline characteristics into two separate items for a clearer comparison with the appraisal tools. Different interpretations and ways to operationalize the quality indicators are possible and may lead to different results in this kind of comparison.

The first and second author then independently coded each instrument in relation to the final set of 23 quality indicators. The authors used a binary format, that is, a checkmark indicates whether or not a criterion is addressed by the instrument. The criterion had to be covered to the full extent; no credit was given if an instrument addressed the Horner Criteria only partially (e.g., requiring interobserver agreement without specifying minimum levels). In the case of the Certainty Framework, checkmarks in parentheses indicate any Horner Criteria that can be incorporated into this tool. The Certainty Framework differs significantly from the remaining tools, as it does not contain operationalized evaluation items in a checklist; instead, it provides a more general context that permits the reviewer to specify more detailed items for evaluating a research report.

Inter-rater reliability was calculated using percentage agreement. The total number of agreements was divided by the number of agreements plus disagreements. This yielded an agreement rate of 91%. The first and second author then discussed incidences of disagreement and reached consensus to finalize coding. Both authors have considerable experience with SSEDs through actively conducting single-subject experimental research and teaching relevant doctoral-level course work.
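
For readers unfamiliar with this calculation, a brief sketch follows; the counts shown are purely illustrative, not the actual coding data.

    def percentage_agreement(agreements: int, disagreements: int) -> float:
        """Agreements divided by agreements plus disagreements, times 100."""
        return 100.0 * agreements / (agreements + disagreements)

    # Illustration: if two coders agreed on 146 of 161 binary codes
    # (7 tools x 23 indicators), agreement would be about 91%.
    print(round(percentage_agreement(146, 15)))  # -> 91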

Congruence and Deviation

The Horner Criteria are grouped into seven general categories, each representing an essential design component to be evaluated. The following is an outline of these seven major design components along with a summary of how the different appraisal tools addressed the items in each category.

Describing Participants and Setting. For replication purposes, quality single-subject research should provide detailed descriptions of participants and settings, including gender, age, diagnosis, disability, instrument(s) used to determine disability, and other details relevant to the study being conducted. This level of detail also pertains to the physical settings of the study. With the exception of the Certainty Framework and WWC Standards, all tools ask for a detailed description of participant characteristics. The Evaluative Method is most specific on this criterion by requesting interventionist information and participant test scores. The EVIDAAC Scales are the only tool that also addresses the criteria of describing participant selection processes and physical features of treatment settings.

Dependent Variable. Dependent variables of the study should be operationally defined with enough detail for future replication. The dependent variables must be measurable through a procedure that generates a quantifiable unit. Dependent variables should be measured repeatedly over time to identify patterns of performance such as level, trend, and variability. Interobserver agreement (IOA) should be conducted and meet a minimal standard of 80% for percentage agreement or 60% for Cohen's kappa.

All tools except the Certainty Framework and WWC Standards explicitly require an operational definition of dependent variables (DVs). The Evaluative Method goes as far as requiring that a link be established between selected measures and treatment outcomes. None of the scales was specific enough to address the Horner criterion of providing a quantifiable index when measuring DVs; valid measurement of DVs with provision of replicable detail was addressed by the Evaluative Method only. The EVIDAAC Scales and the SCED Scale include the criterion of repeated measurement over time. All tools require that IOA be conducted but differ in the rigor with which they assess this aspect. While the Evaluative Method, EVIDAAC Scales, and WWC Standards appear most rigorous by specifying minimum levels of agreement and minimum numbers of sessions to be evaluated, the Logan et al. Scale, SCED Scale, and Smith et al. Scale do not articulate agreement levels or session numbers. The Certainty Framework leaves it to the evaluator to set acceptable levels. The Logan et al. Scale and Smith et al. Scale merely require IOA for all phases of the study or prior to intervention.

Independent Variable. As with dependent measures, the independent variable should also be operationally defined with enough detail for future replication, including detail of materials used, actions and procedure descriptions. The independent variable should be directly and systematically manipulated under the control of the researcher, and treatment fidelity (treatment integrity) must be documented to ensure the consistency and accuracy of implementation.

All tools except the Certainty Framework, SCED Scale, and WWC Standards ask for replicable precision in describing the independent variable (IV). The Smith et al. Scale additionally requires specifying IV conditions including setting, interventionist, and duration of sessions. Only the Evaluative Method and WWC Standards address the criterion of systematic control and manipulation of the IV by the researcher. The last criterion in this category, measurement of IV implementation, is assessed by four of the seven tools: the Certainty Framework, Evaluative Method, EVIDAAC Scales, and Smith et al. Scale.

Baseline. Credible baselines consist of five or more repeated measures of the dependent variable, which establish a pattern of response that can be used to predict future performance. Baseline procedures and conditions should again be described with replicable detail.

The criterion of repeated measurement of DVs during baseline is addressed by all the tools except the Certainty Framework. These tools either have an item requesting repeated measurement or they specify a minimum number of baseline data points. The Logan et al. Scale and the Smith et al. Scale meet the criterion of a minimum of five baseline data points. The Evaluative Method and WWC Standards require a minimum of three data points. The Certainty Framework implicitly requires repeated measurement of DVs by asking for a sound design but leaves room for interpretation as it may be argued that some SSEDs may not need a baseline phase (e.g., alternating treatment designs as outlined in Kennedy, 2005).

Establishment of a pattern to predict future performance is assessed only by the Evaluative Method and the EVIDAAC Scales. The Evaluative Method is the only tool to assess whether baseline conditions are described with replicable detail.

Experimental Control and Internal Validity. To account for experimental control and internal validity, quality single-subject research must demonstrate at least three experimental effects at three different points in time. The design must account for threats to internal validity, eliminate rival hypotheses, control for confounding variables, and display a pattern that can demonstrate experimental control.

Only the Evaluative Method, the EVIDAAC Scales, and the WWC Standards explicitly assess whether the SSED shows three demonstrations of experimental effect. The Certainty Framework does not specify the number of effect demonstrations but allows this to be a requirement. The Logan et al. Scale and Smith et al. Scale ask for replication of treatment effect across subjects but do not specify a required number of replications. Instead, both tools ask for the type of SSED to be clearly and correctly identified.

The SCED Scale is the most generic; it requires that the design allow for demonstration of cause and effect. Controlling for threats to internal validity is addressed explicitly by the Certainty Framework but only partially by some of the other tools; the Evaluative Method, the Logan et al. Scale, and the Smith et al. Scale assess whether raters were blinded to experimental conditions. The SCED Scale asks for independence of those who collect data. Documentation of a pattern showing experimental control is required by all tools except the Certainty Framework and SCED Scale. Four tools--the Evaluative Method, the Logan et al. Scale, the Smith et al. Scale, and the WWC Standards--provide detailed instruction on the type of visual analysis that is required.

External Validity. Replication across settings, materials, and participants should be conducted to establish external validity.

The Logan et al. Scale most explicitly addresses this criterion by requiring replication across three or more research participants. Both the Evaluative Method and the Smith et al. Scale require a study to assess generalization or maintenance at the conclusion of treatment. The SCED Scale asks for inclusion of a generalization phase.

Social Validity. Up to four aspects of social validity may be desirable: the selection of a socially meaningful DV, a treatment effect large enough to be socially important, the demonstration of practicality and cost efficiency in implementing the treatment, and the facilitation of social validity through typical intervention agents and contexts. Only the Evaluative Method addresses all four of these. The Smith et al. Scale requires a qualitative or quantitative report of social validity but does not go into further detail. All other tools exclude social validity from their evaluation.

Other Criteria. It is noteworthy that some of the tools provide evaluation items that exceed the Horner Criteria. The Logan et al. Scale and Smith et al. Scale ask for appropriate construction of the visual analysis graph. The Logan et al. Scale also includes the use of a statistical test and evaluates whether the assumptions for the test were met. The SCED Scale, too, asks for statistical analysis plus the reporting of raw data.

Overall Results

The seven critical appraisal tools were compared against the 21 Horner Criteria, comprising a total of 23 specific features. The Evaluative Method includes the largest number of Horner Criteria (with 17 out of 23 addressed), followed by the EVIDAAC Scales (12 out of 23). The Smith et al. Scale (9 out of 23), the Logan et al. Scale (7 out of 23), the WWC Standards (7 out of 23), and the SCED Scale (5 out of 23) rank last. The Certainty Framework can incorporate up to 19 of the 23 specific features; its current version is not directly comparable to the remaining tools and would need extension to be transformed into an operationalized rating-scale format.

A Preliminary Field Test: Application of Appraisal Tools to Treatment Studies

A well-designed appraisal instrument would not only be aligned with current research standards in the field, but its most important asset would be the capacity to adequately determine the overall strengths and weaknesses of a research report and to clearly distinguish sound studies from flawed ones. A small field test was conducted to compare the seven appraisal tools in this regard. The authors selected four SSED treatment articles, each one representing one of the major design types: withdrawal design (represented by Crozier & Tincani, 2005), changing criterion design (represented by Ganz & Sigafoos, 2005), multiple baseline design (represented by Ozdemir, 2008), and alternating treatment design (represented by Tincani, 2004). All of the articles were taken from the field of treatment efficacy in autism for two reasons: (a) the Evaluative Method was specifically designed and promoted for this focus, and (b) treatment efficacy in autism is a top priority on the agenda of federal funding agencies (Interagency Autism Coordinating Committee, 2011); therefore, quality appraisal of this research is a pressing need. The first and second author independently applied each appraisal tool to each article and calculated inter-rater agreement using percentage agreement. This yielded an agreement rate of 85%. All discrepancies were discussed and resolved before evaluations were finalized.

Certain limitations of this field test need to be recognized: First, the sample of studies is very small in number and chosen purposefully, not randomly drawn from a larger body of research. A larger sample would permit a more formal statistical comparison--for example, a correlational analysis of final evaluation scores across the various tools and/or with expert ratings of study quality. Such an extended comparison may draw a different picture of the various strengths and weaknesses and/or reveal further issues warranting attention. The observations reported, however, were made independently by the two authors and verified by mutual agreement; therefore, we feel that the issues we raised are indeed worth further discussion. Other single-subject researchers are encouraged to conduct their own applications and evaluations of the various tools and share their experiences.

The results of this preliminary field test are shown in Table 3. It should be noted that the Certainty Framework and WWC Standards do not provide quantitative scores but rather qualitative rankings of evidence. All other tools yield a final quality score based on how many of their items are fulfilled by the study under evaluation. The Evaluative Method and the Logan et al. Scale provide guidelines regarding how to translate this final score into evidence rankings of "strong," "adequate/moderate," and "weak." To form a basis for comparison, final quality scores were transformed into percentage scores indicating the proportion of scale items that were met by the studies. A higher percentage indicates a higher rating by the tool. This field test gives an initial impression of how the quality tools may differ in their assessment and where the discrepancies can be found.
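
As a schematic illustration of this scoring and transformation, the sketch below tallies yes/no checklist responses and maps the percentage of items met onto an evidence ranking; the items, counts, and cut-offs are hypothetical and do not reproduce any particular tool.

    # Hypothetical checklist responses: True = criterion met, False = not met.
    responses = {"operational_dv": True, "replicable_iv": True, "ioa_reported": True,
                 "treatment_integrity": False, "stable_baseline": True,
                 "three_effect_demonstrations": True, "generalization": False}

    score = sum(responses.values())           # raw quality score (items met)
    percent = 100 * score / len(responses)    # proportion of items met

    # Illustrative cut-offs for translating the percentage into an evidence ranking.
    ranking = "strong" if percent >= 80 else "moderate" if percent >= 60 else "weak"
    print(f"{score}/{len(responses)} items met ({percent:.0f}%): {ranking}")
    # -> 5/7 items met (71%): moderate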

Withdrawal design (Crozier & Tincani, 2005 exemplar). The Certainty Framework ranks this study as "inconclusive" because the A-B-A-C design does not meet the criterion of three effect demonstrations, as experimental control over phase C is lacking. Similarly, the Evaluative Method derives a score of 58% and a final ranking of "weak." The WWC Standards, too, identify this study as "not meeting evidence standards." All other tools rate this study at a higher level, ranging from 70% by the SCED Scale to 87% by the Smith et al. Scale. On these scales, the serious flaw in experimental design is not as evident because the study fulfills most other items. On the Smith et al. Scale, for example, the flaw remains undetected because the tool does not require three demonstrations of effect. On the EVIDAAC Scales, the study misses the criterion for three effect demonstrations as well as the criterion for describing participants with replicable detail, but it still meets the remaining eight criteria, resulting in a high score of 80% (because all criteria are weighted equally). Overall, there is strong consistency among the Certainty Framework, Evaluative Method, and WWC Standards, but much less consistency with the remaining tools, which seem to derive an overly positive rating.

Changing criterion design (Ganz & Sigafoos, 2005 exemplar). The Certainty Framework yields a ranking of "suggestive"; despite a sound design, an adequate number of effect demonstrations, and strong inter-rater reliability, the study does not provide treatment integrity data. The Evaluative Method, Logan et al., SCED, and Smith et al. Scales all yield scores between 50% and 60%. The WWC Standards assign a final ranking of "moderate evidence." The EVIDAAC Scales stand out by assigning a score of 90%, mainly because the study fails only the treatment integrity criterion on the EVIDAAC Scales; this tool does not recognize the various criteria on which the study received low ratings from other scales (i.e., minimum number of data points, adequate statistical analysis, or blinding of raters, among others). Overall, there is considerable agreement between most of the appraisal tools with the exception of the EVIDAAC Scales.

Multiple baseline design (Ozdemir, 2008 exemplar). According to the Certainty Framework, this study ranks as "preponderant" because it has a strong design with sufficient replications of treatment effect and strong reliability on both dependent and independent variables. The lack of a stable baseline for one participant decreases the ranking from "conclusive" to "preponderant." Similarly, the WWC Standards rank the study as "strong evidence." The Evaluative Method, SCED, and Smith et al. Scales all score this study between 60% and 67%. However, the EVIDAAC Scales yield a score of 80%. The Logan et al. Scale gives the lowest score, 39%, indicating "weak" due to the lack of generalization data, statistical analysis and blinding. Overall, the large range in ratings is noteworthy (from "weak" on the Logan et al. Scale to "strong evidence" on the WWC Standards) and seems worrisome given that this type of design is used extensively in educational and clinical research (Carr, 2005; Gast & Ledford, 2010).

Alternating treatment design (Tincani, 2004 exemplar). This study has a sound design with strong inter-rater agreement and treatment integrity; however, generalization probes are not taken during baseline and are not graphed, and therefore the Certainty Framework yields a ranking of "preponderant." The WWC Standards yield a rating of "moderate evidence" due to considerable data overlap and no clear response differentiation in the alternating treatment phase of one participant. The remaining tools derive similar results: The Evaluative Method gives a score of 67%, resulting in a rating of "adequate," and the EVIDAAC Scales show a score of 79%. SCED and Smith et al. Scales yield scores of 70% and 80%, respectively, mostly because the study does not meet the minimum number of data points per phase (these scales require at least five), features no blinding of raters, and lacks statistical analysis. For similar reasons, the Logan et al. Scale ranks this study at 61%, implying "moderate" strength of evidence. Overall, the appraisal tools appear fairly consistent in rating this study within the range of moderate to adequate evidence.

General discrepancies and discriminability. From this initial field test, we observe that the discrepancies seen in the final ratings are often caused by (a) differences in composition of the tools (i.e., which appraisal criteria are included and which are not); (b) the weight given to each criterion; and (c) the rigor of the criteria (e.g., whether 3 or 5 data points are required in each phase). For example, most studies did not use raters who were blind to conditions, and as a result they received lower scores from the Evaluative Method and the Logan et al. and Smith et al. Scales, but not from the other tools. Tools also differ in the minimum number of data points required per phase: while the Logan et al. and Smith et al. Scales demand five data points, others require only three (e.g., Evaluative Method) or do not specify a minimum number (e.g., EVIDAAC Scales). Further discrepancies arise from criteria assessing statistical analyses and whether the related data assumptions were met; such criteria can be found in the Logan et al. and SCED Scales but not in the other tools. Discrepancies are also caused by some of the tools assessing studies for generalization data and/or social validity, while other tools do not include these design aspects.

Clearly distinguishing the level of research quality is an important function of an appraisal tool, and the field test reveals ways in which appraisal tools vary in how they make this discrimination. Within this small sample of studies, the Certainty Framework assigned three out of its four possible rankings. The Evaluative Method distinguished "weak" from "adequate" evidence but never assigned its highest ranking of "strong" (which may be due to the very small and selective sample of studies). The EVIDAAC Scales seemed to appraise all studies at a relatively high level, with a restricted score range from 79% to 90%. The Logan et al. Scale showed the largest range of scores, from 39% to 71%, and separated "weak" from "moderate" evidence. A similar range of scores, 53% to 87%, was observed with the Smith et al. Scale, while the SCED Scale assigned only two score values, 60% and 70%. For both scales, it is unclear how these scores should be interpreted in qualitative terms (e.g., "weak" or "moderate"). The WWC Standards produced all three potential ratings at the initial stage of design evaluation ("meets standards," "meets with reservations," "does not meet standards") and assigned two out of three potential ratings at the final stage of effects evaluation ("strong," "moderate," "no evidence"), resulting in a fine-grained separation of low and high quality studies.

Conclusions and Discussion

Quality appraisal of SSED research has become a critical issue for the advancement of EBP. Several quality appraisal tools have recently become available to assist applied researchers and practitioners with this task. These instruments were compared to the Horner Criteria, one important representative of current standards for SSEDs. Finally, the appraisal tools were put to a brief field test by applying them to four articles representing the major types of SSEDs. These initial comparisons led to discussion of (a) strengths and weaknesses of the various tools, (b) some general issues yet to be resolved, and (c) final recommendations and future research directions.

Strengths and weaknesses of the various tools

As shown throughout this report, the different appraisal instruments vary remarkably in their alignment with the Horner Criteria and in their evaluation results of SSED studies. Based on these initial experiences, some tools appear to be stronger and more rigorous in quality assessment than others; the four soundest tools are listed first, in hierarchical order starting with the more rigorous ones. The last three tools are not listed in any hierarchical sequence. In sum, strengths and weaknesses appear to be as follows:

Evaluative Method. This tool left the most compelling impression for a variety of reasons. Among all the instruments, it demonstrated the strongest congruence with the Horner Criteria. In the evaluation of SSED articles, the Evaluative Method appeared very rigorous, as it clearly identified study weaknesses and consequently distinguished between "weak" and "adequate" evidence. Another important advantage of the Evaluative Method is its well-researched psychometric properties, established in two validation studies, although these were not conducted by independent research teams (Reichow et al., 2008; Cicchetti, 2011). A feature that distinguishes the Evaluative Method from all other instruments is the attempt to separate primary and secondary quality indicators and give more weight to the former (see the importance of weighting discussed below). While there is no doubt that some elements of a research study are more crucial to its validity (especially internal validity) than others, it could be argued that some of the secondary quality indicators should be primary ones and vice versa: Providing participant characteristics is not related to internal validity and could be considered secondary, while interobserver agreement and treatment fidelity are crucial to internal validity (Schlosser, 2003) and therefore may need to be primary quality indicators.

Certainty Framework. This tool does not contain an operationalized rating scale; instead it provides guidelines to evaluate a research report along three essential elements of internal validity (i.e., sound design, strong reliability of the dependent and independent variables). The Certainty Framework was assessed on its ability to incorporate the Horner Criteria into its three dimensions and showed strong compatibility with these criteria. In other words, the vast majority of Horner Criteria can be used to add more detail to the Certainty Framework and specify what exactly creates a sound design and strong reliability of dependent and independent variables. If applied in this combination (for example, specifying that a sound design requires a minimum of three effect demonstrations at three different points in time), the Certainty Framework can produce fine-grained evaluations of evidence; in our sample, the Certainty Framework separated inconclusive from suggestive and preponderant studies. All evidence levels are clearly defined and the implications for EBP are outlined (e.g., inconclusive evidence is not suitable to inform educational practice). This distinction makes the Certainty Framework useful in translating research findings into treatment recommendations. One weakness, however, is the lack of a checklist-type format to guide evaluation. While an experienced reviewer, knowledgeable about SSED standards, can perform a very thorough and rigorous evaluation, the novice reviewer may not know exactly what to look for, thus generating a potentially superficial and/or flawed evaluation. This shortcoming poses a serious threat to the tool's inter-rater reliability. This and other psychometric properties are still unknown. The exclusive focus on internal validity for deriving evidence rankings also leaves little room to reveal further aspects of study quality that might be of interest, such as evidence for generalization or maintenance, participant descriptions, or social validity.

WWC Standards. The WWC Standards were applied as an appraisal tool, although they could also be seen as a set of evaluation criteria similar to the Horner Criteria. There is, however, a particular procedure for using the WWC Standards in the evaluation of SSEDs and deriving a final rating of evidence strength. The WWC Standards are somewhat similar to the Certainty Framework in terms of an exclusive focus on internal validity and the absence of a checklist-type format. Criteria are, however, much more operationalized and described in detail. Because of this narrower focus on internal validity, the WWC Standards address only some of the most critical items of the Horner Criteria. The question remains whether this kind of narrower focus is adequate to capture the overall quality of an SSED, as some essential elements (i.e., participant descriptions, dependent variable operationalization, and treatment integrity, among others) are not included. The scope of the WWC Standards may be sufficient within the context of producing WWC practice guidelines, and one has to keep in mind that this tool was developed for internal use to guide WWC reviewers; the average applied researcher or practitioner may benefit from a larger set of items to identify all relevant quality aspects of an SSED.

In our brief field test, the WWC Standards clearly identified major flaws in research design and adequately distinguished between low quality and high quality research reports. It has yet to be determined if such results can be produced with high reliability and validity. A noteworthy asset of the WWC Standards is the provision of very explicit and comprehensive instructions for conducting visual analysis of SSED data; none of the other tools provided such guidance.

EVIDAAC Scales. This instrument was congruent with about half of the Horner Criteria. The one-page, 10-item checklist format (for all SSEDs except comparative treatment designs) seems user-friendly and time-efficient. Its greatest strength is the provision of a separate 19-item checklist for comparative treatment designs (e.g., alternating treatment design and parallel treatment design, among others). None of the other instruments made an effort to provide separate, fine-grained criteria specific to design type. Having separate criteria for different types of SSED makes a lot of sense because some designs have specific requirements, and if those are not met the entire study can easily be flawed. An alternating treatment design, for example, may require separate instructional sets to minimize carry-over effects, which is crucial for internal validity. When evaluating actual SSED studies, however, the EVIDAAC Scales showed very little ability to discriminate between weak and strong evidence. The tool consistently yielded relatively high ratings with a minimal score range (79%-90%) across studies. This deficit seems to be caused by the fact that all scale items are counted equally; for example, in the case of a study lacking treatment integrity (which would affect its score or ranking heavily within other tools), the EVIDAAC Scales reduce the score by only one point. This instrument also lacks interpretational guidelines clarifying score cut-offs for evidence levels, and its psychometric properties have not yet been investigated.

Logan et al. Scale. This tool addressed a smaller portion of the Horner Criteria. It provides evidence ratings that are converted to qualitative descriptors of study quality. It is unclear, however, if the score cut-offs were based on an empirical validation or set arbitrarily; the authors copied the cut-off levels from a similar scale for group designs (Logan et al., 2008). Discrimination between low versus high quality studies appeared to be a weakness, as three out of four articles were ranked as "moderate." Only one article was ranked as "weak," and that rating was at odds with those from all other tools. This discrepancy was due in part to the scale's emphasis on statistical analysis--a feature that is rarely present and is currently controversial for SSEDs (see below). Other than content validity for scale items, little is known about the tool's psychometric properties.

SCED Scale. Among all the tools, the SCED Scale showed the lowest alignment with the Horner Criteria. One scale item defines acceptable SSEDs as either "A-B-A" or multiple baseline, a very arguable definition: The "A-B-A" design does not provide three demonstrations of experimental effect, is considered pre-experimental in the newer WWC Standards (see Kratochwill et al., 2010, p. 15), and raises ethical concerns because participants do not end in a treatment phase (Gast & Hammond, 2010). Alternating/parallel treatment designs and changing criterion designs are excluded, yet these represent valuable design strategies to reveal causal relationships and comprise a considerable portion of the SSED literature (Barlow et al., 2009). Discrimination is another concern with this tool: when evaluating the SSED reports, it assigned final scores of only 60% or 70%. Interpretation guidelines for final scores are missing. Nevertheless, aspects of content validity and reliability have been investigated and established.

Smith et al. Scale. Similar to the Logan et al. Scale from which it was derived, this tool addresses only a modest portion of the Horner Criteria (9 of the 23 specific features). During the field test, this instrument assigned a relatively high score of 87% to a study with serious design flaws. Interpretational guidelines for final scores are not provided. Psychometric properties have not been examined.

Some general issues yet to be resolved

Overall, our comparison and examination of the various instruments suggest that the ideal quality appraisal tool for SSEDs has yet to be developed. The field is off to a promising start by creating urgently needed procedures for the evaluation of SSEDs, but certain concerns still warrant further discussion.

First, when composing a quality appraisal instrument, researchers need to decide what the focus of that quality appraisal should be. Traditionally, quality appraisal is concerned with a thorough assessment of the internal validity of a research report. If a study lacks internal validity, its findings should not be taken as strong evidence. For this reason, quality appraisal items concentrate on assessing how well a study demonstrates a causal relationship and rules out threats to internal validity. Quality appraisal does not need to be limited to internal validity and may also evaluate further aspects of a study that would enhance its credibility--for example, providing information on social validity and attempts to reveal generalization and maintenance effects. These, however, are not critical to the soundness of an experimental investigation; in other words, a research report can have strong internal validity and completely neglect these additional quality elements. If the appraisal is concerned with internal validity only, then all items should strictly focus on this aspect. If the tool makes an attempt to go beyond internal validity by trying to reveal some type of overall strength of the research report, it needs to clearly separate items of internal validity from additional quality criteria and distinguish between primary and secondary evaluation items. A serious problem can occur if high scores for non-essential items can compensate for low scores on essential items. In this way, a very poorly controlled study with very low internal validity could appear to be adequate because it scores well on other items. By the same token, a high quality study that garners perfect scores on items related to internal validity could be pulled down by lower ratings on items that are desirable but not essential. In other words, when primary evaluation items for internal validity are mixed with secondary quality items, it is no longer clear what construct the tool is assessing and how representative this construct is of study quality. The distinction between primary and secondary quality indicators by the Evaluative Method is a laudable first step in this direction. Ultimately, there needs to be an empirically derived decision on what such primary and secondary quality indicators should be. To this end, a team of SSED researchers at the University of Sydney, Rehabilitation Studies Unit is currently working on reporting guidelines for SSEDs based on a Delphi exercise (U. Rosenkoetter, personal communication, August 19, 2011). Researchers from around the world can provide input regarding essential items of an SSED report, and the resulting guidelines could be used to further refine quality appraisal instruments.

Second, and along the same lines, is the issue of item weighting. Most of the currently existing tools give equal weight to each item/quality criterion on their scales. This practice is questionable, as some criteria could be considered of higher importance to internal validity than others. For example, a lack of reliability data on either the dependent or independent variable diminishes the credibility of a research report much more than a lack of sufficient detail on participants and settings. Again, final evaluation results can be distorted if items are not weighted properly, and our field test indicates that this easily happens. For example, on a 10-item scale, a study may fail the two reliability items for dependent and independent variables and, if all other criteria are addressed, still end up with a relatively high score of 80%. Thus, the study might wrongfully be classified as strong evidence despite critical flaws.
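
A brief sketch of how equal versus differential weighting can diverge on the 10-item example above follows; the item labels and weights are hypothetical and are not taken from any of the reviewed tools.

    # Hypothetical 10-item checklist: 1 = criterion met, 0 = not met.
    # The study fails only the two reliability items (IOA and treatment integrity).
    items = {"design": 1, "participants": 1, "setting": 1, "dv_defined": 1,
             "iv_defined": 1, "baseline": 1, "ioa": 0, "treatment_integrity": 0,
             "generalization": 1, "social_validity": 1}

    # Equal weighting: 8 of 10 items met -> 80%, masking the reliability flaws.
    unweighted = 100 * sum(items.values()) / len(items)

    # Illustrative weighting that treats design and reliability items as primary.
    weights = {k: (3 if k in ("design", "ioa", "treatment_integrity") else 1)
               for k in items}
    weighted = 100 * sum(items[k] * weights[k] for k in items) / sum(weights.values())

    print(f"unweighted: {unweighted:.1f}%  weighted: {weighted:.1f}%")
    # -> unweighted: 80.0%  weighted: 62.5%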

Third, for three of the seven tools there are no clear guidelines as to how the final evaluation scores are to be interpreted. Without clear instructions for what is considered inferior versus superior evidence, the utility of a quality appraisal tool is somewhat limited, especially for the practitioner trying to decide whether reported results are adequate to inform clinical or educational practice. For other tools, it is unclear how existing cut-off scores for evidence levels were set. Ideally, the thresholds between weak, moderate, and strong levels of evidence should have an empirical basis. For example, final evaluation scores drawn from a larger, heterogeneous sample of studies could be correlated with expert ratings of evidence levels to derive proper score cut-offs.
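
One way to ground such cut-offs empirically, sketched below with entirely invented numbers, would be to check how well tool scores track independent expert ratings (here via a rank correlation) before fixing thresholds.

    from scipy.stats import spearmanr

    # Invented data: percentage scores from one appraisal tool and expert
    # ratings of the same studies (1 = weak, 2 = moderate, 3 = strong).
    tool_scores = [39, 53, 58, 61, 67, 71, 80, 87]
    expert_ratings = [1, 1, 1, 2, 2, 2, 3, 3]

    rho, p = spearmanr(tool_scores, expert_ratings)
    print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
    # Cut-offs between weak/moderate/strong could then be placed where the
    # expert ratings change (here, around 60% and 75%), rather than set arbitrarily.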

Fourth, the usefulness of a criterion assessing proper application of statistics to SSEDs is questionable at best. The Logan et al. and SCED Scales explicitly ask for the report of a statistical test. Such a criterion may not be a good indicator of study quality in SSEDs for a number of reasons: Conventional parametric tests (such as t and F tests, inappropriately proposed by Logan et al., 2008) may not be appropriate for use with SSED data because the parametric assumptions of normal distribution, homogeneity of variances, and independence of observations are not usually met. Other statistical options are currently being discussed in the literature, but there is no clear consensus in support of any one approach. Therefore, including this as a criterion for quality SSEDs appears to be inappropriate.

Final recommendations and future research directions

A major finding of this comparison was the variation in intent, construction, components, and psychometric properties of the quality appraisal tools. Results show that different tools yield variable quality appraisals when applied to the same research reports. Applied researchers and practitioners undertaking quality appraisals are advised to approach the evaluation results obtained from the various tools with caution and to interpret such results with the context, focus, and limitations of the tool in mind. The variability in construction, content, and subsequent evaluation results seems mostly due to the missing empirical basis of tool construction and the limited attempts to establish reliability and validity of item composition. In addition, there may be a lack of agreement on a "gold standard" against which to compare a newly developed tool. The Horner Criteria or the general guidelines from the WWC Standards may be seen as only one of several important sources from which to derive quality indicators. Not all fields conducting single-subject experimental research may have embraced the Horner Criteria or WWC Standards. Prospective users of SSED quality appraisal tools therefore cannot be completely confident that the content of a given tool accurately reflects the most critical aspects of the SSED literature under evaluation.

Given the current situation, applied researchers and practitioners face difficult decisions when trying to identify the most suitable quality appraisal tool for their needs. Based on our observations, the four most useful quality appraisal tools for SSED research appear to be the Evaluative Method, the Certainty Framework, the WWC Standards, and the EVIDAAC Scales; these are the tools with the greatest clarity of intent, construction, and content. We recommend that applied researchers and practitioners select carefully among these four and distinguish among their different purposes. The Evaluative Method may be best suited for comprehensive systematic reviews that aim to inform both clinical/educational practice and policy. One of its strongest assets is the provision of separate evaluation rubrics for group designs and SSEDs. At the same time, the tool provides guidelines for integrating the evaluation results from both methodologies into a summative assessment of overall EBP status. It thus permits the aggregation of evidence across the major research designs, a scenario often encountered when synthesizing a larger body of treatment research or operating within a context not limited to one methodology. The summary assessment of the current level of support ("established," "promising," or weaker) facilitates policy-related decision-making and may be appreciated by funding agencies, advocacy groups, professional associations, and other entities trying to disseminate information about efficacious treatments. Although the Evaluative Method was originally designed for autism treatment research, we see no obstacles to applying it to other fields. Applied researchers and practitioners are encouraged to use this tool with the research base in their respective fields and thereby provide insight into the generality of its definitions and evaluation procedures.

The Certainty Framework appears most suitable for time-efficient literature reviews such as rapid evidence reviews (United Kingdom Civil Service, 2011) or critically appraised topics (Wendt, 2006). By reducing the appraisal process to three crucial criteria of internal validity, the Certainty Framework provides this efficiency and yields a summative evaluation that can easily be translated into a clinical recommendation. Its definitions are also broad enough to synthesize group and SSED research at the same time, an advantage when dealing with literature that includes both types of designs. Because the framework does not come in a user-friendly checklist format, a solid understanding of research design is crucial to its proper implementation. This drawback can be addressed by using some or all of the Horner Criteria to add operationalized appraisal items to the more general framework as needed in a specific context; with such support, even novice evaluators or graduate-level student clinicians/educators can be trained in this method.

The WWC Standards appear most suitable when reviews aim for a particularly thorough assessment of internal validity. The WWC Standards can assist in sorting SSED studies with little to no experimental control from those with strong experimental control. EBP often emphasizes decision-making based on the "best and most current research evidence" (Straus et al., 2005); consequently, many EBP reviews pursue a "best evidence" approach to narrow the included literature down to the highest quality research (Agency for Healthcare Research and Quality, 2012). This notion of "best evidence" is primarily linked to a high degree of internal validity, and the WWC Standards provide a procedure for deciding when "best evidence" SSEDs are present. Studies identified as meeting standards (with or without reservations) undergo further evaluation of the demonstrated strength of the causal relationship, which allows for a more fine-grained discrimination of evidence levels. This distinction is based on a comprehensive visual analysis that is applicable to the vast majority of SSEDs and familiar to most SSED researchers.

The EVIDAAC Scales might be useful when reviews or critical appraisals focus on a body of SSED research with a considerable proportion of comparative treatment designs, which can be evaluated with the separate Comparative Single-Subject Experimental Design Rating Scale (CSSEDARS). As comparative efficacy is a priority on the agenda of healthcare and related research (Agency for Healthcare Research and Quality, 2009), the ability to conduct a separate, in-depth assessment of these designs is a plus. Overall, the EVIDAAC Scales offer a very straightforward assessment of internal validity. Although they do not categorize studies into levels of design quality, the final score could be used as a moderator variable in a meta-analysis of SSEDs. The user-friendliness of the scales--that is, an easily accessible format and clear instructions on how to use the instrument--also makes them an option for the less experienced reviewer.
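As an illustration of the moderator idea just mentioned, the sketch below (hypothetical effect sizes, sampling variances, and quality scores; not a procedure prescribed by the EVIDAAC materials) fits a simple inverse-variance weighted meta-regression of study effect sizes on appraisal scores:

    # Hypothetical sketch: using a quality-appraisal score as a moderator in
    # a meta-analysis. Effect sizes, variances, and quality scores are
    # invented for illustration; this is not part of the EVIDAAC tools.
    import numpy as np

    effect_size = np.array([0.45, 0.62, 0.80, 0.55, 0.95, 0.70])  # per study
    variance    = np.array([0.04, 0.03, 0.05, 0.06, 0.02, 0.03])  # sampling var
    quality     = np.array([60, 70, 90, 50, 100, 80])             # % score

    # Inverse-variance weighted least squares: effect_size ~ b0 + b1 * quality
    w = 1.0 / variance
    X = np.column_stack([np.ones_like(quality, dtype=float), quality])
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * effect_size))
    print(f"intercept = {beta[0]:.3f}, slope per quality point = {beta[1]:.4f}")

A positive slope would suggest that larger effects tend to come from higher quality studies; in an actual meta-analysis, single-case-appropriate effect size metrics and random-effects models would need to be considered.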

In general, applied researchers and practitioners are well advised to proceed with caution when selecting quality appraisal tools for their EBP needs. Users should look for published empirical support for item construction and validity, as well as data documenting inter-rater reliability. Clear guidelines for proper use are crucial so that the instruments can be implemented and interpreted in a consistent manner (cf. Katrak, Bialocerkowski, Massy-Westropp, Kumar, & Grimmer, 2004).

Future research should investigate the reliability and validity of the currently available tools, as these are often unknown. The finding that different tools render considerably variable final evaluation results when applied to the same body of literature warrants further investigation to reduce sources of discrepancy and refine current instruments. Finally, there is a critical need to discuss (a) which criteria are truly important and should be core items on an SSED quality appraisal tool and (b) how these items should be weighted in relation to one another. Reaching consensus on these issues could lead to a more standardized and valid framework to guide quality appraisal tool development. In sum, considerable attempts have been made to provide initial quality assessment tools for SSEDs, but a "gold standard" critical appraisal tool has yet to be developed.

Acknowledgement

The authors would like to thank reviewer TS for the very valuable feedback provided on an earlier version of this manuscript.

References

Agency for Healthcare Research and Quality (2009). Testimony on comparative efficacy research. Retrieved from http://www.ahrq.gov/about/nac/autsp_tnf.htm.

Agency for Healthcare Research and Quality (2012). "Best evidence" approaches. Retrieved from http://www.ahrq.gov/clinic/tp/bestevtp.htm.

American Psychological Association (2002). Criteria for evaluating treatment guidelines. American Psychologist, 57, 1052-1059.

Anderson, N. H. (2001). Empirical direction in design and analysis. Mahwah, NJ: Erlbaum.

Atkins, C. F., & Sampson, J. (2002, June). Critical appraisal guidelines for single case study research. Paper presented at the 10th European Conference of Information Systems (ECIS), Gdansk, Poland. Paper retrieved from http://is2.lse.ac.uk/asp/aspecis/20020011.pdf.

Backman, C. L., & Harris, S. R. (1999). Case studies, single-subject research, and N of 1 randomized trials. Comparisons and contrasts. American Journal of Physical Medicine and Rehabilitation, 78, 170-176.

Barlow, D. H., Nock, M. K., & Hersen, M. (2009). Single-case experimental designs: Strategies for studying behavior change (3rd ed.). Boston, MA: Allyn & Bacon.

Beeson, P. M., & Robey, R. R. (2006). Evaluating single-subject treatment research: Lessons learned from the aphasia literature. Neuropsychology Review, 16, 161-169.

Carr, J. E. (2005). Recommendations for reporting multiple-baseline designs across participants. Behavioral Interventions, 20, 219-224.

Chambless, D. L., Baker, M. J., Baucom, D. H., Beutler, L. E., Calhoun, K. S., Crits-Christoph, P., ... Woody, S. R. (1998). Update on empirically validated therapies, II. The Clinical Psychologist, 51, 3-16.

Cicchetti, D. V. (2011). On the reliability and accuracy of the Evaluative Method for identifying evidence-based practices in autism. In B. Reichow, P. Doehring, D. V. Cicchetti, & F. R. Volkmar (Eds.), Evidence-based practices and treatments for children with autism (pp. 41-51). New York, NY: Springer.

Crombie, I. K. (1996). The pocket guide to critical appraisal. London, United Kingdom: BMJ Publishing Group.

Crozier, S., & Tincani, M. J. (2005). Using a modified social story to decrease disruptive behavior of a child with autism. Focus on Autism and Other Developmental Disabilities, 20, 150-157.

Di Noia, J., & Tripodi, T. (2008). Single-case design for clinical social workers (2nd ed.). Washington, DC: NASW Press.

Ganz, J. B., & Sigafoos, J. (2005). Self-monitoring: Are young adults with MR and autism able to utilize cognitive strategies independently? Education and Training in Developmental Disabilities, 40, 24-33.

Gast, D. L. (2010). Single subject research methodology in behavioral sciences. New York, NY: Routledge.

Gast, D. L., & Hammond, D. (2010). Withdrawal and reversal designs. In D. L. Gast (Ed.), Single subject research methodology in behavioral sciences (pp. 234-275). New York, NY: Routledge.

Gast, D. L., & Ledford, J. (2010). Multiple baseline and multiple probe designs. In D. L. Gast (Ed.), Single subject research methodology in behavioral sciences (pp. 276-328). New York, NY: Routledge.

Graham, S. (2005). Criteria for evidence-based practice in special education [special issue editorial]. Exceptional Children, 71(2), 135.

Hancock, G. R., & Mueller, R. O. (2010). The reviewer's guide to quantitative methods in the social sciences. New York, NY: Routledge.

Hootman, J. M., Driban, J. B., Sitler, M. R., Harris, K. P., & Cattano, N. M. (2011). Reliability and validity of three quality rating instruments for systematic reviews of observational studies. Research Synthesis Methods, 2, 110-118.

Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165-179.

Interagency Autism Coordinating Committee (2011). 2011 IACC Strategic Plan for Autism Spectrum Disorder Research. Retrieved from http://iacc.hhs.gov/strategic-plan/2011/index.shtml

Katrak, P., Bialocerkowski, A. E., Massy-Westropp, N., Kumar, S., & Grimmer, K. A. (2004). A systematic review of the content of critical appraisal tools. BMC Medical Research Methodology, 4:22. doi:10.1186/1471-2288-4-22

Kennedy, C. H. (2005). Single-case designs for educational research. Boston, MA: Allyn & Bacon.

Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Single-case designs technical documentation. Retrieved from What Works Clearinghouse website: http://ies.ed.gov/ncee/wwc/pdf/wwc_scd.pdf.

Logan, L. R., Hickman, R. R., Harris, S. R., & Heriza, C. B. (2008). Single-subject research design: Recommendations for levels of evidence and quality rating. Developmental Medicine & Child Neurology, 50, 99-103.

Ozdemir, S. (2008). The effectiveness of social stories on decreasing disruptive behavior on children with autism: Three case studies. Journal of Autism and Developmental Disorders, 38, 1689-1696.

Perdices, M., & Tate, R. L. (2009). Single-subject designs as a tool for evidence-based clinical practice: Are they unrecognized and undervalued? Neuropsychological Rehabilitation, 19, 904-927.

Petticrew, M., & Roberts, H. (2006). Systematic reviews in the social sciences: A practical guide. Malden, MA: Blackwell Publishing.

Reichow, B., Volkmar, F. R., & Cicchetti, D. V. (2008). Development of the evaluative method for evaluating and determining evidence-based practices in autism. Journal of Autism and Developmental Disorders, 38, 1311-1319.

Schlosser, R. W. (2003). The efficacy of augmentative and alternative communication: Toward evidence-based practice. San Diego, CA: Academic Press.

Schlosser, R. W. (2009). The role of single-subject experimental designs in evidence-based practice times. FOCUS, 22, 1-8.

Schlosser, R. W. (2011). EVIDAAC Single-Subject Scale. Retrieved October 19, 2011 from http://www.evidaac.com/ratings/Single_Sub_Scale.pdf.

Schlosser, R. W., & Sigafoos, J. (2008). Identifying "evidence-based practice" versus "empirically supported treatment." Evidence-Based Communication Assessment and Intervention, 2, 61-62.

Schlosser, R. W., Sigafoos, J., & Belfiore, P. (2009). EVIDAAC Comparative Single-Subject Experimental Design Scale (CSSEDARS). Retrieved from http://www.evidaac.com/ratings/CSSEDARS.pdf.

Smith, V., Jelen, M., & Patterson, S. (2010). Video Modeling to improve play skills in a child with autism: A procedure to examine single-subject experimental research. Evidence-based Practice Briefs, 4, 1-11.

Simeonsson, R., & Bailey, D. (1991). Evaluating programme impact: Levels of certainty. In D. Mitchell, & R. Brown (Eds.), Early intervention studies for young children with special needs (pp. 280-296). London, United Kingdom: Chapman and Hall.

Straus, S. E., Richardson, W. S., Glasziou, P., & Haynes, R. B. (2005). Evidence-based medicine: How to practice and teach EBM (3rd ed.). Edinburgh, United Kingdom: Elsevier Science.

Tankersley, M., Cook, B. G., & Cook, L. (2008). A preliminary examination to identify the presence of quality indicators in single-subject research. Education and Treatment of Children, 31, 523-548.

Tate, R. L., McDonald, S., Perdices, M., Togher, L., Schultz, R., & Savage, S. (2008). Rating the methodological quality of single-subject designs and n-of-1 trials: Introducing the single-case experimental design (SCED) scale. Neuropsychological Rehabilitation, 18, 385-401.

Tincani, M. (2004). Comparing the Picture Exchange Communication System and sign language training for children with autism. Focus on Autism and Other Developmental Disabilities, 19, 152-163.

United Kingdom Civil Service (2011). What is a Rapid Evidence Assessment? Retrieved from http://www.civilservice.gov.uk/networks/gsr/resources-and-guidance/rapid-evidence-assessment/what-is.

Wendt, O. (2006). Critically Appraised Topics: An approach to critical appraisal of evidence. Perspectives on Augmentative and Alternative Communication, 15, 24-26.

Wendt, O. (2007). The effectiveness of augmentative and alternative communication for individuals with autism spectrum disorders: A systematic review and meta-analysis. Dissertation Abstracts International, 68(2), 526-A.

Wendt, O. (2009). Research on the use of graphic symbols and manual signs. In P. Mirenda & T. Iacono (Eds.), Autism Spectrum Disorders and AAC (pp. 83-139). Baltimore, MD: Paul H. Brookes.

Correspondence to Oliver Wendt, Department of Speech, Language, and Hearing Sciences, 500 Oval Drive, West Lafayette, IN 47907-2038; e-mail: wendto@purdue.edu.

Oliver Wendt and Bridget Miller, Purdue University

Table 1
Current Quality Appraisal Tools for Single-Subject Experimental Designs (SSEDs)

Certainty Framework (no max. score)
Composition of tool: Ranks certainty of evidence as "conclusive" (highest), "preponderant," "suggestive," or "inconclusive" (lowest), based on research design, interobserver agreement of the dependent variable, and treatment integrity.
Content validity established: No
Inter-rater reliability provided: No

Evaluative Method (max. = 12)
Composition of tool: 12-item rating scale divided into primary and secondary indicators; strength of research ranked "strong," "adequate," or "weak" based on the number and level of indicators achieved.
Content validity established: Yes
Inter-rater reliability provided: Yes, including expert and novice raters

EVIDAAC Scales (max. = 10 or 19)
Composition of tool: One-treatment scale: 10 items; two-or-more-treatments scale: 19 items; higher score = higher quality.
Content validity established: No
Inter-rater reliability provided: No

Logan et al. Scale (max. = 14)
Composition of tool: 14 questions containing 16 items; studies are rated "strong" (11-14 points), "moderate" (7-10 points), or "weak" (less than 7 points).
Content validity established: No
Inter-rater reliability provided: Yes, including the four authors of the scale

SCED Scale (max. = 10)
Composition of tool: 11-item rating scale; item 1 assesses clinical history information; items 2-11 allow calculation of a quality score; higher score = higher quality.
Content validity established: No
Inter-rater reliability provided: Yes, including expert and novice raters

Smith et al. Scale (max. = 15)
Composition of tool: 15-item rating scale; higher score = higher quality.
Content validity established: No
Inter-rater reliability provided: No

WWC Standards (no max. score)
Composition of tool: Design Standards rank internal validity as "Meets Standards," "Meets Standards with Reservations," or "Does Not Meet Standards"; Evidence of Effect Standards rate the strength of effects as (1) "Strong Evidence," (2) "Moderate Evidence," or (3) "No Evidence."
Content validity established: No
Inter-rater reliability provided: No

Note. EVIDAAC = Evidence in Augmentative and Alternative Communication; SCED = Single-Case Experimental Design; WWC = What Works Clearinghouse.
* An extended version of this table containing further details on the various tools is available from the first author upon request.


Table 2
Comparison of Current Quality Appraisal Tools versus Quality Indicators by Horner et al. (2005)

For each quality indicator for SSEDs proposed by Horner et al. (2005), the tools that address the criterion are listed; additional or partial criteria provided by individual tools are noted in brackets.* Entries for the Certainty Framework appear in parentheses.**

1. Description of Participants and Settings
a. Participant characteristics are detailed enough for future replication (e.g., age, diagnosis, disability, gender): (Certainty Framework), Evaluative Method, EVIDAAC Scales, Logan et al. Scale, SCED Scale, Smith et al. Scale
b. Participant selection is detailed enough for future replication: (Certainty Framework), EVIDAAC Scales
c. Crucial setting features are detailed enough for future replication: (Certainty Framework), EVIDAAC Scales

2. Dependent Variables
a. Dependent variable is operationally defined: (Certainty Framework), Evaluative Method, EVIDAAC Scales, Logan et al. Scale, SCED Scale, Smith et al. Scale, WWC Standards
b. Each dependent variable is measured with techniques producing a quantifiable index: (Certainty Framework)
c. There is valid measurement of the dependent variable and provision of replicable detail: (Certainty Framework), Evaluative Method
d. Repeated measurements of the dependent variable are taken over time: (Certainty Framework), Evaluative Method, EVIDAAC Scales, SCED Scale, WWC Standards
e. Interobserver agreement is assessed for each dependent variable and meets a minimum standard (i.e., IOA = 80% or kappa = 60%): (Certainty Framework), Evaluative Method, EVIDAAC Scales, WWC Standards

3. Independent Variables
a. Independent variable is described with replicable detail: (Certainty Framework), Evaluative Method, EVIDAAC Scales, Logan et al. Scale, Smith et al. Scale
b. Independent variable is systematically controlled and manipulated by the researcher: (Certainty Framework), Evaluative Method, WWC Standards
c. Fidelity of implementation of the independent variable is documented: Certainty Framework, Evaluative Method, EVIDAAC Scales, Smith et al. Scale

4. Baseline
a. Baseline phase demonstrates repeated measurement of the dependent variable: (Certainty Framework), Evaluative Method, EVIDAAC Scales, Logan et al. Scale, SCED Scale, Smith et al. Scale, WWC Standards
b. Baseline includes five or more data points (not an explicit quality indicator but described in text): (Certainty Framework), Logan et al. Scale [>= 5 data points], Smith et al. Scale [>= 5 data points]; partial criteria: Evaluative Method [>= 3 data points], WWC Standards [>= 3 data points; >= 5 data points]
c. Baseline creates a pattern to predict future performance when independent variable absent: (Certainty Framework), Evaluative Method, EVIDAAC Scales
d. Baseline conditions are described with replicable detail: (Certainty Framework), Evaluative Method

5. Experimental Control & Internal Validity
a. Design shows a minimum of three demonstrations of experimental effect at three different points in time: (Certainty Framework), Evaluative Method, EVIDAAC Scales, WWC Standards
b. The design minimizes threats to internal validity: Certainty Framework; partial criteria: Evaluative Method [assessors blind], Logan et al. Scale [assessors blind], SCED Scale [assessors independent], Smith et al. Scale [assessors blind]
c. The results show a pattern indicating experimental control: (Certainty Framework), Evaluative Method, EVIDAAC Scales, Logan et al. Scale, Smith et al. Scale, WWC Standards

6. External Validity
a. Effects are reproduced across participants, settings, or materials: (Certainty Framework), Evaluative Method, Logan et al. Scale, SCED Scale, Smith et al. Scale

7. Social Validity
a. The dependent variable is socially meaningful: Evaluative Method, Smith et al. Scale
b. The amount of change in the dependent variable as a result of the intervention is socially important: Evaluative Method
c. Practicality and cost efficiency are shown when implementing the independent variable: Evaluative Method
d. Social validity is facilitated by carrying out the independent variable over longer time periods by typical intervention agents in typical contexts: Evaluative Method

Total Horner et al. (2005) criteria met: Certainty Framework (19/23)**; Evaluative Method 18/23; EVIDAAC Scales 12/23; Logan et al. Scale 7/23; SCED Scale 5/23; Smith et al. Scale 9/23; WWC Standards 6/23

Note. EVIDAAC = Evidence in Augmentative and Alternative Communication; IOA = Interobserver agreement; IV = Independent variable; SCED = Single-Case Experimental Design; SSED = Single-Subject Experimental Design; WWC = What Works Clearinghouse.
* An extended version of this table containing further details on the various tools is available from the first author upon request.
** The Certainty Framework is not an operationalized rating scale and was assessed on its ability to include the Horner Criteria.


Table 3
Comparison of Quality Appraisal Tools When Applied to Four Different Types of Single-Subject Experimental Designs

Crozier & Tincani, 2005; Withdrawal (A-B-A-C):
  Certainty Framework: "Inconclusive"
  Evaluative Method (max. = 12): 58%, "weak"
  EVIDAAC Scales (max. = 10): 80%
  Logan et al. Scale (max. = 14): 71%, "moderate"
  SCED Scale (max. = 10): 70%
  Smith et al. Scale (max. = 15): 87%
  WWC Standards: "Does not meet evidence standards"

Ganz & Sigafoos, 2005; Changing Criterion:
  Certainty Framework: "Suggestive"
  Evaluative Method (max. = 12): 50%, "weak"
  EVIDAAC Scales (max. = 10): 90%
  Logan et al. Scale (max. = 14): 54%, "moderate"
  SCED Scale (max. = 10): 60%
  Smith et al. Scale (max. = 15): 53%
  WWC Standards: "Meets standards with reservations"; "Moderate evidence"

Ozdemir, 2008; Multiple Baseline Across Participants:
  Certainty Framework: "Preponderant"
  Evaluative Method (max. = 12): 67%, "adequate"
  EVIDAAC Scales (max. = 10): 80%
  Logan et al. Scale (max. = 14): 39%, "weak"
  SCED Scale (max. = 10): 60%
  Smith et al. Scale (max. = 15): 60%
  WWC Standards: "Meets standards with reservations"; "Strong evidence"

Tincani, 2004; Alternating Treatment:
  Certainty Framework: "Preponderant"
  Evaluative Method (max. = 12): 67%, "adequate"
  EVIDAAC Scales (CSSEDARS): 79%
  Logan et al. Scale (max. = 14): 61%, "moderate"
  SCED Scale (max. = 10): 70%
  Smith et al. Scale (max. = 15): 80%
  WWC Standards: "Meets standards"; "Moderate evidence"

Note. CSSEDARS = Comparative Single-Subject Experimental Design Rating Scale; EVIDAAC = Evidence in Augmentative and Alternative Communication; IOA = Interobserver agreement; IV = Independent variable; SCED = Single-Case Experimental Design; SSED = Single-Subject Experimental Design; WWC = What Works Clearinghouse.
* An extended version of this table containing further appraisal details is available from the first author upon request.