Can performance raters be more accurate? Investigating the benefits of prior knowledge of performance dimensions.
Abstract:
Research on performance appraisal has found that most rating errors are due to the way that raters process information. This study builds on previous research investigating "on line" and "memory based" processing. When raters know before observing behavior that they will be rating performance, they should theoretically use on line processing and produce more accurate ratings based on a person impression organized around relevant performance dimensions. In previous studies, raters who do not have prior knowledge of the rating task and dimensions have been hypothesized to rely on memory based processing. However, because conditions in this study more closely approximate an actual rating situation, reliance on a less accurate impression of overall effectiveness (or ineffectiveness) was predicted instead. An in-basket test was used to create an environment representative of how performance information is typically received. As predicted, raters with prior knowledge of the rating task produced more accurate ratings. However, both groups showed evidence of relying on a general effective-ineffective impression. Also as predicted, memory based processing was not found.

Subject:
Performance appraisals (Research)
Performance standards (Analysis)
Author:
Day, Nancy E.
Pub Date:
09/22/1995
Publication:
Name: Journal of Managerial Issues Publisher: Pittsburg State University - Department of Economics Audience: Academic; Trade Format: Magazine/Journal Subject: Business; Human resources and labor relations Copyright: COPYRIGHT 1995 Pittsburg State University - Department of Economics ISSN: 1045-3695
Issue:
Date: Fall, 1995 Source Volume: v7 Source Issue: n3
Full Text:
Performance appraisals are not and probably will never be infallible reflections of human behavior. They are prone to many sources of error, most of which are due to raters' cognitive distortions (Landy and Farr, 1980; Vance et al., 1983). A significant amount of performance appraisal research has focused on how raters process information, including how information is acquired, how it is stored and organized in memory and how it is retrieved and integrated into performance evaluations (Ilgen et al., 1993). Much of this research has relied on theories that can be considered "limited capacity" theories of information processing (Lord and Maher, 1990), in that they concentrate on the cognitive shortcuts that individuals take to reduce or simplify a large amount of information in a limited amount of cognitive space. When raters use inappropriate simplifications, such as relying on general impressions of persons that are not related to relevant performance dimensions, rating errors occur. This can cause unfortunate consequences in organizations. Inaccurate decisions regarding performance may ultimately affect the productivity of the organization, and on an individual level, inadequate judgments can be devastating. The purpose of this paper is to describe a study investigating some of these simplification processes.

A recent line of research addresses "on line" and "memory based" processing. Hastie and Park (1986) describe on line processing as judgments made from "working memory" at the time the information is encoded. When we are asked to observe behavior in order to evaluate performance, on line processing is engaged. As we observe the person's behavior, we immediately interpret that behavior in accordance with relevant performance criteria. Memory based processing, on the other hand, could be compared to a "traditional" model of judgment (Borman, 1978), wherein bits of information are encoded directly into long-term memory and later, when a performance judgment is required, these memories are accessed to evaluate the behavior.

On line and memory based processing differ in fundamental ways. On line processing, since it manages a large amount of information to make immediate judgments, will rely on impressions and other shortcuts in order to handle the incoming information. Thus, raters using on line processing will be more likely to depend on dimensions of expected behavior in order to group information into meaningful patterns. Memory based processors, since they do not need to organize the incoming information in order to make an immediate judgment, will tend to record information into long-term memory in noncategory based configurations. Therefore, on line processors will be more likely to remember information that is relevant to the person judgment being made. However, since they are organizing information around a specific conception of behavior, they may remember fewer discrete behaviors than memory based processors, who are not constrained by a cognitive category. In performance appraisal research, on line processing has been initiated by telling raters in advance that they must evaluate the behavior of those they will observe. Memory based processing has been induced when the rating task was not known until after the behavior had been observed.

Recent research has studied these two processing modes in relationship to performance appraisal. In one study (Murphy et al., 1989), researchers informed one group of subjects that their primary task in viewing a videotape of a college professor's lecture was to evaluate his performance. Another group was told that although they would have to rate the lecturer's performance, their main task was to prepare for an exam over the content of the lecture. This study found that raters observing solely to evaluate performance tended to be more accurate both in ratings and recognition of critical behaviors. However, when ratings were collected up to one week later, the group whose primary purpose was to learn the material showed more accurate performance ratings. The researchers attributed this finding to the fact that on line processors, the ones for whom performing the ratings was the sole purpose, tended to forget individual behaviors over time since they used an impression based on an overall assessment of the person, while memory based processors, the ones for whom the content of the lecture was predominant, remembered individual behaviors more effectively.

In a similar study (Williams et al., 1990), one group of student subjects was told prior to viewing videotapes of woodworkers that they should watch the videos in anticipation of evaluating the performance of the workers. Another group of subjects was told to watch the tapes to assess the tasks' difficulty. The researchers found that those with prior knowledge of the appraisal tended to rely on relevant person categories and had greater recall of performance information than those who focused on task difficulty. Additionally, there were no significant correlations between the behaviors recalled and the ratings, indicating that even though recall was high, the ratings were based on the person category more than on actual behaviors.

These studies support the idea that on line processing will produce different results than memory based processing. Prior knowledge of the rating task will elicit on line processing, which will organize information based on relevant dimensions regarding the person's performance. A body of related research on "social schemata," or the general categories of expectations we hold about people, indicates that creating these mental expectations allows us to process large amounts of information much more efficiently than individually handling each piece of information (Fiske and Taylor, 1984). In other words, pure "memory based" processing is rare, unless artificially elicited, since we automatically simplify our cognitive processing by using preconceived categories. For example, in the studies cited above, it could be argued that the groups for whom performance rating was not the primary purpose were still operating on artificially presented categories. In the Murphy et al. (1989) study, they encoded incoming information based primarily on the content of the lecture, encouraging dependence on lecture content categories. In the Williams et al. (1990) study, they encoded information based on task difficulty, encouraging dependence on task categories. In more natural settings where a variety of both person and nonperson information is incoming, we tend to rely on categorical processing when interpreting information about people (Fiske and Taylor, 1984). Thus, true memory based processing of person information in the "real world" probably is not likely to occur.

Likewise, subjects in the on line conditions in the Murphy et al. (1989) and the Williams et al. (1990) research clearly knew that their only purpose as they observed the videotapes was to assess performance behavior. This is rarely an accurate reflection of real events. According to Wyer and Srull (1981), cognitive processing occurs as incoming information is placed in a cognitive "workspace." When a new piece of information comes in, the information presently in the workspace must be moved somewhere to make room for the new stimuli. If the first information could be considered representative of a certain prototype, it can be placed in a frequently used cognitive "bin" (a category), and the workspace is freed for the new stimuli. Through this process, the cognitive workspace is quickly cleared of older information so that new stimuli can be efficiently processed. Thus, in more realistic situations, where many bits of information are competing for attention, multiple categories may be accessed and the accuracy and richness of the information may be lost. This process contributes to a lessening of accurate dimensionality and an increase in dependence on prototypic information. When a rater is only presented with relevant behavioral information, as in the previous studies (since they knew their sole purpose was to rate behavior), the rater's cognitive workspace does not need to be "cleared." Since little irrelevant information is incoming, the workspace can be used solely for the purpose of appraisal. Thus, constant dependence on the person impression and accurate performance categories is unlikely to occur in real situations. Real-life raters observe behavior, write a memo, answer the phone, go to lunch, observe behavior, go to a meeting, etc. In the real world, one's workspace is frequently purged and refilled.

The Present Research

This study investigates the issue of on line versus memory based processing in a more realistic performance appraisal situation. The appraisal task is embedded in the context of a manager's "typical" workload, to be accomplished in an experimental session. This presents a situation where real-world conditions are better represented. Additionally, I argue that memory based processing will be less likely to occur in this more naturalistic setting.

The hypothesis to be studied is:

When manipulations that encourage on line processing are implemented, ratings will more accurately reflect the relevant performance dimensions. Specifically, when raters are presented with a description of the person and performance dimensions before observing the behavior and are told that they are to rate behavior based on these dimensions, their ratings will more accurately reflect the performance dimensions. Further, when conditions that do not encourage on line processing are present (i.e., they are not aware of having to make a later appraisal before observing behavior), raters will tend to rely on preexisting categories of effectiveness and ineffectiveness in rating performers. They will not tend to rely on memory based processing.

Thus, raters initially unaware of the performance rating task will tend to be less accurate in their ratings. The ratings will be skewed in the direction of an overall category of effectiveness or ineffectiveness.

Method

In this study, I induced on line processing in the same way as did the previous performance appraisal research cited. Before the exercise began, some subjects were told that they would be asked to rate performance and the dimensions were carefully explained. Another group of subjects was not told anything about the subsequent performance rating or the dimensions until immediately before the rating and after the observation of performance behaviors. According to the hypothesis, those in this condition would theoretically rely on a general effective or ineffective category with which to process the performance information, as opposed to using memory based processing.

In order to assess rater accuracy, I varied level of effectiveness, as Nathan and Lord did in a 1983 study. Thus, the design consisted of four conditions: prior knowledge of ratings and effective performance, prior knowledge and ineffective performance, no prior knowledge and effective performance, and lastly, no prior knowledge and ineffective performance. Subjects were undergraduate psychology students.

Materials and Procedure

An in-basket exercise was used. This exercise forced the rater to work on several tasks in one interval of time, partially creating a condition whereby on line processing would theoretically be less likely to occur since it would distract the rater from focusing on a person impression. At the same time, this exercise represented an actual working environment more closely than did previous research. Additionally, the in-basket technique allowed time lapses (although somewhat short) to occur between initial contact with the ratee's behavior and appraisal, also discouraging the initiation of on line processing. The types of tasks to be accomplished were similar to many of the activities of a manager, again creating a more realistic set of conditions than previous research, which focused on manual labor (Williams et al., 1990) and intellectual memory (Murphy et al., 1989). The paper-and-pencil exercise exposed the raters to the "employee" (a secretary) whom they were to appraise and conveyed pertinent information about the performance effectiveness (or ineffectiveness) of the employee through examples of behavior in seven performance dimensions (typing skill, organizational skill, initiative, social skills, telephone message-taking, learning ability, and attendance). These performance dimensions were determined through pretesting in which subjects were asked to name the skills most important for secretaries. The actual examples were also pretested, and only those examples in which over 75% of the responses were rated in the top (effective) or bottom (ineffective) five points of a ten-point scale were chosen. A secretary was chosen over other types of jobs under the assumption that the undergraduate student subjects would be familiar with this job. Since job knowledge increases the likelihood that impressions regarding the job are concrete and reliable (Smither and Reilly, 1989; Steiner and Rain, 1989), any categorization based processing that occurred would theoretically stem from performance behavior of the individual secretary rather than from uncertainties about the job itself.

All subjects began the in-basket with written instructions describing their role as a manager and relevant information about the company and coworkers, including the secretary whom the subject directly supervised. In the on line (prior knowledge) condition, the instructions contained a statement defining and explaining the performance dimensions, which read as follows:

One of the important tasks that you will perform in this in-basket is to rate the performance of your secretary, Pat. Some of the materials that follow will be examples of her work on several dimensions. These dimensions are typing skill, ability to take clear and complete telephone messages, attendance, social skills, ability to take initiative, learning ability, and organizing skills. Please try to be aware of the materials that indicate Pat's performance on these dimensions so that you can rate her accurately.

Each dimension was then briefly explained. The no prior knowledge conditions omitted the information about the upcoming appraisal and the performance dimensions in the introduction, but included it immediately before the performance rating, after all behavioral examples had been observed.

The experiment was conducted in one two-hour experimental session. The in-basket was presented in two parts to keep the subjects from looking back through the behavior examples in order to review the secretary's actual behaviors. Subjects were told that the study was in two parts so the timing of each exercise could be more precise. The first installment contained 21 secretary performance examples and 10 "fillers," or tasks that represented business activities unrelated to the secretary's performance. All tasks came in the form of memos, letters, or short reports to which the subject was to respond as best as he or she saw fit. After the in-basket was explained to them, they began the first installment. At the end of one hour, an experimenter entered the room, removed the materials and gave the subject the second installment.

The effective-ineffective dimension was created by manipulating the number of effective or ineffective examples of the secretary's work, as was done by Nathan and Lord (1983). Each of the seven dimensions contained three examples. Table 1 provides a listing of how the effective and ineffective examples were presented. In the effectiveness condition, the in-basket contained three examples of good performance in each of the three dimensions of typing, initiative and attendance (nine examples in total). In each of the four dimensions of social skills, taking telephone messages, organizational skills and learning ability, there was one example of good performance and two examples of poor performance. Thus, of seven dimensions, the effective performer had 13 out of 21 (62%) total examples that were of high quality and eight (38%) that were of low quality. Three dimensions were totally positive and four dimensions were mostly negative. The ineffectiveness condition was directly opposite to this arrangement.

The second part of the in-basket included more "fillers" unrelated to the secretary's performance and the performance appraisal exercises, which served as the dependent measures. These consisted of five-point rating scales for each performance dimension, similar to those used in industry (Eichel and Bender, 1984), and a checklist of behaviors that the secretary did or did not do. Subjects were requested to check the behaviors that they remembered the secretary demonstrating in the in-basket exercise. Additionally, subjects were asked to indicate, on a five-point scale, their overall impression of the secretary's performance.

Ninety-two subjects participated in the experiment for undergraduate psychology credit. Out of this group, 11 were rejected because they were suspicious of the true purpose of the experiment, were non-English-speaking, did not take the experiment seriously, etc. Since t-tests found no significant differences between women and men on the performance dimension ratings, both sexes were combined for subsequent analyses.

Results

To summarize the experiment, some subjects had knowledge of the rating task and dimensions before reviewing the performance information, while others did not. In the latter condition, raters should theoretically rely on a general effective-ineffective category, and their ratings should be less accurate. Raters with prior knowledge of the rating task should show more accurate ratings which reflect the performance appraisal dimensions.

Accuracy of Rating Dimensions

Difference scores were calculated between the averages of the consistent dimensions (those that were either all effective or all ineffective: typing, initiative and attendance) and the inconsistent dimensions (those of variable effectiveness: social skills, phone messages, organization and learning ability; see Table 1) for each condition. If raters who had prior knowledge of the rating task (those who should use on line processing) were relying more on the performance dimensions, their difference measure should be higher than that of the other raters, since on line processing should encourage reliance on the actual quality of performance in the dimensions themselves. This proved to be the case for the effective conditions, but not for the ineffective conditions (see Table 2). In the effective conditions, raters who did not know about the ratings before observing the behavior differentiated less sharply between the consistent and inconsistent dimensions, as indicated by their difference score of 1.02. Since these raters are hypothesized to process information using a general effective-ineffective category, their perceptions of poorer behavior (the inconsistent dimensions) should be more skewed toward a category of effectiveness than those of the group who did have prior rating knowledge. As can be seen, the "prior knowledge effectives" were more successful in distinguishing between performance dimensions that were consistently positive and those that were not, showing a difference score of 1.54. The difference scores of these two groups were significantly different from each other at the .01 level, indicating that the group with no prior knowledge did tend to show more reliance on a broad category of effectiveness, as predicted.
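
As a minimal illustration of this computation (not the author's original analysis), the following Python sketch derives difference scores from hypothetical per-rater ratings and compares the two groups with a t-test; all data values and dimension labels are invented for illustration.

```python
# Illustrative sketch only: hypothetical ratings, not the study's data.
import numpy as np
from scipy import stats

CONSISTENT = ["typing", "initiative", "attendance"]
INCONSISTENT = ["social", "phone", "organizing", "learning"]

def difference_score(ratings):
    """Mean rating on the consistent dimensions minus the mean rating on the
    inconsistent dimensions for one rater (ratings: dict of dimension -> 1-5)."""
    return (np.mean([ratings[d] for d in CONSISTENT])
            - np.mean([ratings[d] for d in INCONSISTENT]))

# Hypothetical raters in the effective condition.
prior_knowledge = [
    {"typing": 5, "initiative": 5, "attendance": 4,
     "social": 3, "phone": 2, "organizing": 3, "learning": 3},
    {"typing": 4, "initiative": 5, "attendance": 5,
     "social": 2, "phone": 3, "organizing": 3, "learning": 2},
    {"typing": 5, "initiative": 4, "attendance": 5,
     "social": 3, "phone": 3, "organizing": 2, "learning": 3},
]
no_prior_knowledge = [
    {"typing": 5, "initiative": 4, "attendance": 4,
     "social": 4, "phone": 3, "organizing": 4, "learning": 3},
    {"typing": 4, "initiative": 4, "attendance": 5,
     "social": 3, "phone": 4, "organizing": 3, "learning": 4},
    {"typing": 4, "initiative": 5, "attendance": 4,
     "social": 4, "phone": 4, "organizing": 3, "learning": 3},
]

pk = [difference_score(r) for r in prior_knowledge]
npk = [difference_score(r) for r in no_prior_knowledge]

# A larger difference score means a sharper separation of the consistently
# effective dimensions from the mostly ineffective ones.
t, p = stats.ttest_ind(pk, npk)
print(f"prior knowledge: {np.mean(pk):.2f}, no prior knowledge: {np.mean(npk):.2f}, "
      f"t = {t:.2f}, p = {p:.3f}")
```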

[TABULAR DATA FOR TABLE 2 OMITTED]

The ineffective conditions, however, did not support the hypothesis. In these conditions, the secretary's four inconsistent dimensions had been predominantly positive and her overall performance level was mostly negative. While the raters without prior knowledge rated the inconsistent dimensions correctly as being more positive than the consistent dimensions, the difference was not statistically significant. It may be that the larger number of ineffective behavior examples increased the accuracy of these raters' processing by crossing a "threshold" of information contrary to the effective-ineffective categorization processing mode (Feldman, 1981; Hastie and Park, 1986). Such a threshold is initiated when a large number of events occur that contradict what the rater expects. When this happens, the rater's attention is aroused and he or she discontinues reliance on the category.

Comparison of memory based and on line raters. We would expect that for effective conditions, raters without prior knowledge would be more dependent on general effective-ineffective categories than prior knowledge raters and thus would tend to rate all the performance dimensions higher. This result was found. Table 3 displays the means, standard deviations and results of analyses of variance. With the exception of attendance, the raters with no prior knowledge rated all dimensions higher than did the prior knowledge raters, even those that were predominantly ineffective (the inconsistent dimensions). Although we would expect the "mirror image" of these results for the ineffective conditions, this was not found. There were no significant differences between any performance dimension means for the predominantly ineffective secretary across the two processing types.

[TABULAR DATA FOR TABLE 3 OMITTED]

Comparison of effective and ineffective conditions. Subjects with no prior knowledge in the effective condition should theoretically process information in terms of a general effective impression. Even though the four inconsistent dimensions were relatively poor, the no prior knowledge raters' ratings for these dimensions should be higher than those for the ineffective condition. Although the results were in the predicted directions, none of the relationships were significant at the .05 level. The consistent dimensions and the overall impression were significantly higher in the effective group than in the ineffective group.

In the two prior knowledge conditions, we should expect raters to assess the secretary more accurately according to the performance dimensions, and thus the inconsistent dimensions should be rated lower in the effective condition than in the ineffective condition. The results were again found to be in the expected directions, but no difference was statistically significant (one dimension, telephone messages, approached significance; p = .072). Again, the consistent dimensions were significantly different in the predicted direction.

Contribution of the Overall Impression or Performance Level

In order to assess the contribution made by the overall impression, performance level (effective or ineffective) was used as a dummy variable (coded 0 and 1) and, along with the overall impression, was entered into stepwise regression equations (Table 4). The order of these variables was alternated so that the change in R² could assess each variable's unique contribution to the variance. In the no prior knowledge conditions, we would expect that the overall impression would account for more variance in the performance rating than the actual level of effectiveness or ineffectiveness. Indeed, the overall impression did account for significant unique variance in all dimensions, as would be predicted. However, actual secretarial effectiveness also accounted for significant unique variance (p < .05) in five out of the seven dimensions. Thus, contrary to expectations, actual performance does account for some variance for raters without prior knowledge of the rating task. However, the magnitude of the change in R² of the overall impression tended to be larger than that due to performance level. Only one R² between performance level and the dimensions was over .25, but five were greater than .25 between the overall impression and the dimensions. When the correlations were transformed to Fisher's z-scores and averaged, the average contribution of unique variance by performance level was .121, while that of the overall impression was .269. Therefore, although performance level accounts for significant variance in five of the seven dimensions for memory based processors, there is evidence that these raters are more dependent on a general impression than on the performance level.
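
The logic of this analysis can be sketched generically as follows, assuming hypothetical data; the dummy coding, change-in-R² computation, and Fisher's z averaging shown here illustrate the standard procedures, not the author's actual code or the study's data.

```python
# Illustrative sketch only: hypothetical data and generic hierarchical-regression
# logic, not the author's actual analysis.
import numpy as np

rng = np.random.default_rng(0)
n = 40
performance_level = rng.integers(0, 2, n).astype(float)   # dummy: 0 = ineffective, 1 = effective
overall_impression = rng.integers(1, 6, n).astype(float)  # 1-5 overall rating
dimension_rating = rng.integers(1, 6, n).astype(float)    # e.g., the typing rating

def r_squared(y, *predictors):
    """R-squared from an OLS fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Unique contribution of each variable = the change in R-squared when it is
# entered after the other (i.e., the order of entry is alternated).
r2_both = r_squared(dimension_rating, performance_level, overall_impression)
unique_impression = r2_both - r_squared(dimension_rating, performance_level)
unique_performance = r2_both - r_squared(dimension_rating, overall_impression)

# Averaging correlations across dimensions via Fisher's z (arctanh), then
# transforming the mean back to the correlation metric.
per_dimension_rs = np.array([0.30, 0.45, 0.25])  # hypothetical correlations
average_r = np.tanh(np.arctanh(per_dimension_rs).mean())

print(unique_impression, unique_performance, average_r)
```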

[TABULAR DATA FOR TABLE 4 OMITTED]

For the raters with prior knowledge, one would expect that the greatest amount of unique variance would be due to the level of performance, since these raters should be more likely to base their ratings on the performance dimensions themselves, and not on general impressions. Contrary to these predictions, both performance and overall impression accounted for significant unique variance, although overall impression did not account for unique variance in one dimension (attendance). However, the two sets of correlations did not differ as dramatically as they did in the no prior knowledge condition. The average of the Fisher's z-scores of the correlations was .277 for performance, and .289 for the overall impression. Thus, this analysis indicates that raters with prior knowledge depended more on performance than did the other raters, but they also relied to a large extent on their overall impression.

To investigate whether the effectiveness manipulation produced nonequivalent results in this analysis, the data were again split into effective and ineffective groups. The overall impression was correlated with each dimension in all four conditions (Table 5). Since raters in the prior knowledge conditions would be less likely to depend on an overall impression, these correlations should be lower than the correlations of the raters without prior knowledge. This hypothesis, however, was not consistently borne out by these data. Raters with prior knowledge in the effective condition showed ratings that were significantly related to the overall impression in four out of seven ratings. Raters without prior knowledge in the effective conditions showed ratings that were significantly related in five out of seven. "Prior knowledge ineffectives" showed significant ratings in six out of the seven, and "no prior knowledge ineffectives" in seven out of seven. Thus, subjects with prior knowledge showed only slightly less dependence on the overall impression than subjects without prior knowledge. Also, it can be seen from this analysis that both ineffective conditions were more dependent on the overall impression than were the effective conditions. The average Fisher's z-transformations of the correlations for the ineffective conditions were .472 (no prior knowledge) and .475 (prior knowledge), while the average z-transformations for the effective conditions were .285 (no prior knowledge) and .234 (prior knowledge).

Evidence of Halo Effect

Similar to analyses in previous studies (Nathan and Lord, 1983), I analyzed the correlation matrix to investigate the presence of halo. Since the inconsistent dimensions are for the most part "opposite" to the consistent dimensions in effectiveness, raters who rated behavior more accurately (hypothetically, those in the prior knowledge conditions) should show negative relationships between the consistent and inconsistent dimensions. Additionally, these raters should show high positive correlations among the consistent dimensions as well as among the inconsistent dimensions. These results were partially seen in the prior knowledge correlation matrix (lower triangle on the left in Table 6). The consistent dimensions were intercorrelated, as were the inconsistent dimensions. Correlations between the consistent and inconsistent dimensions were for the most part not significant (two of the twelve were significant at the .05 level). In the correlation matrix for the no prior knowledge conditions (lower triangle on the right in Table 6), by comparison, six of the twelve correlations between the consistent and inconsistent dimensions were significant, indicating that these raters were not distinguishing as well between effective and ineffective dimensions. Raters depending on an overall impression should exhibit a correlation matrix whose elements are fairly similar to each other (Nathan and Lord, 1983), which indeed is seen in the lower triangle of the no prior knowledge condition in Table 6.

[TABULAR DATA FOR TABLE 5 OMITTED]

Partialing out the overall impression produced a pattern more consistent with the predictions. In the on line conditions (upper triangle on the left in Table 6), this caused all but one of the inconsistent-consistent correlations to become negative, indicating that even the subjects who had prior knowledge of the ratings were relying somewhat on an overall impression. The no prior knowledge conditions were relying more heavily on the overall impression, as indicated in the upper triangle on the right in Table 6. Here, while all inconsistent-consistent correlations were positive previously and nine of the twelve were significant, all coefficients became negative when the overall impression was partialed out. However, only two were statistically significant, indicating that these raters without prior knowledge of the performance dimensions were indeed relying heavily on a general impression.
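
A minimal sketch of this partialing step follows, using hypothetical rating vectors and the standard residual-based first-order partial correlation; the article does not specify the exact computation, so this is an assumed, generic implementation.

```python
# Illustrative sketch only: hypothetical rating vectors and the standard
# residual-based first-order partial correlation.
import numpy as np

def partial_corr(x, y, control):
    """Correlation between x and y after removing the linear effect of `control`
    from each, i.e., the correlation between the two sets of OLS residuals."""
    def residuals(v, c):
        X = np.column_stack([np.ones(len(c)), c])
        beta, *_ = np.linalg.lstsq(X, v, rcond=None)
        return v - X @ beta
    x, y, control = (np.asarray(a, dtype=float) for a in (x, y, control))
    return np.corrcoef(residuals(x, control), residuals(y, control))[0, 1]

# Hypothetical ratings for a handful of raters: one consistent dimension,
# one inconsistent dimension, and the overall impression.
typing  = [5, 4, 5, 5, 3, 5, 4, 4]
social  = [4, 3, 4, 4, 3, 4, 3, 4]
overall = [5, 4, 4, 5, 3, 5, 4, 4]

raw = np.corrcoef(typing, social)[0, 1]
partialed = partial_corr(typing, social, overall)

# If a shared overall impression drives both ratings, the raw correlation is
# positive but shrinks (or turns negative) once the impression is partialed out.
print(round(raw, 3), round(partialed, 3))
```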

Behavioral Incidents

Raters were given a checklist of behaviors at the end of the in-basket test, after the performance rating task, and asked to check the incidents that "you were actually aware that the secretary did, not ones that you think she might have done or probably did." This list of 35 behaviors included actions that the secretary did perform as well as ones she did not.

[TABULAR DATA FOR TABLE 6 OMITTED]

We would expect that raters who depended on the overall effective-ineffective impression (those without prior knowledge) would recall these incidents less accurately than those who previously knew about the performance dimensions. For the behaviors that did not happen and were correctly left blank, the prior knowledge raters were more accurate (90% of the responses were correct) than the no prior knowledge raters (80% were correct), but only in the effective conditions. In the ineffective conditions, these responses were nearly equally accurate (97% for no prior knowledge, 96% for prior knowledge). For the behaviors that did happen and were correctly checked, there was substantially no difference between the processing conditions (for effective conditions, no prior knowledge raters were 70% correct and prior knowledge raters were 68% correct; for ineffective conditions, percentages correct were 73% and 71%, respectively). Thus, expected results concerning the behavioral incidents were only partially confirmed.
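
As a hedged illustration of how such checklist accuracy percentages can be computed, the sketch below uses invented responses; "hits" are behaviors that occurred and were checked, and "correct rejections" are behaviors that did not occur and were correctly left blank.

```python
# Illustrative sketch only: invented checklist responses, not the study's data.
import numpy as np

# Each entry: (behavior actually occurred in the in-basket?, rater checked it?)
responses = [
    (True, True), (True, False), (True, True), (True, True),
    (False, False), (False, False), (False, True), (False, False),
]

occurred = np.array([o for o, _ in responses])
checked = np.array([c for _, c in responses])

# Behaviors that did happen: proportion correctly checked ("hits").
hit_rate = checked[occurred].mean()
# Behaviors that did not happen: proportion correctly left blank ("correct rejections").
correct_rejection_rate = (~checked[~occurred]).mean()

print(f"hits: {hit_rate:.0%}, correct rejections: {correct_rejection_rate:.0%}")
```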

A summary of the results may be helpful. First, in the effective conditions, on line (prior knowledge) raters showed greater accuracy in distinguishing between effective and ineffective behaviors than did raters without prior knowledge. Second, raters without prior knowledge (again in the effective conditions) were more dependent on a general impression of effectiveness-ineffectiveness. Third, although results were in the predicted directions, the tendency for raters without prior knowledge to rate the effective-inconsistent dimensions (which were mostly poor) higher than the ineffective-inconsistent dimensions (which were mostly good) was not significant. Fourth, although the overall impression accounted for significant unique variance over and above the actual level of effectiveness for all raters, the magnitude of dependence on the overall impression was greater in the no prior knowledge condition, as predicted. Fifth, an analysis of correlation matrices showed that the no prior knowledge raters were relying more heavily on an overall impression. Lastly, on line (prior knowledge) raters were more accurate in identifying behaviors that did not occur, but were not more accurate in identifying behaviors that did occur.

Discussion

These data offer some support to the hypothesis that raters who did not have prior knowledge of the rating task relied more on a general effective-ineffective impression than did raters with prior knowledge, at least when behavior was mostly effective. This finding supports the assumption that understanding the rating task and performance dimensions may reduce rater error based on a general impression. Raters in the effective condition who were not told beforehand that they were to rate the performer tended to categorize the behavior in the direction of an effective impression, even though some of the performance dimensions were mostly poor. However, raters with prior knowledge, who are theoretically using on line processing, also showed evidence of dependence on an effective-ineffective impression.

A possible explanation for the evidence of impression dependence in both conditions is that the in-basket exercise itself tends to force raters to categorize to some degree. If Wyer and Srull's (1981) "workspace" idea is correct, all raters, not just the ones without prior knowledge of the rating, cleared their cognitive workspaces several times because of the "filler" materials that were designed to create a more ecologically valid exercise. Some of the results tend to support this contention, especially the regression analyses, which indicate that although raters without prior knowledge showed more unique variance in performance ratings due to the overall impression, significant variance due to the overall impression was also found in the conditions where on line processing was predicted. The partialed correlation matrices also support this contention. Since frequent and diverse incoming information is more representative of real world performance appraisal settings than settings previous studies have used, these findings may be more reflective of rating processes in the real world.

Implications for Managers

These results support the commonly accepted idea that understanding performance appraisal dimensions before supervising workers will result in more accurate ratings. This finding is encouraging, since progressive organizations have spent a significant amount of time and money training supervisors to assess performance accurately. It also underlines the need for less progressive organizations that have not spent sufficient resources on this training to reassess their activities and include development in this area (Woehr and Huffcutt, 1994).

However, these findings also indicate that even when raters are told of and instructed about upcoming ratings, dependence on general impressions, which are likely to be less accurate, may inevitably take place. For the prior knowledge groups, the time lapse between knowledge of the ratings and the ratings themselves was just a little over an hour, but the effects of the effective-ineffective impression were strong. In organizational settings, training in performance appraisals may take place weeks or months before ratings, with thousands of irrelevant "filler" events occurring as well as relevant behavioral observations that require encoding. Thus, impression dependence, to some degree, may be inevitable.

Given that some level of inaccurate judgments about workers may be inevitable, what can human resource managers do to ensure the most accurate performance assessments possible? Perhaps performance measurement systems should be designed to carefully define a simple, yet valid, person impression of the job at hand. For example, simple descriptions based on job analysis, perhaps in story form, of workers successfully performing the jobs could be provided. These "stories" could include information about the type of behaviors that are necessary to achieve good performance within the relevant dimensions. Limiting the number of dimensions and clarifying their meaning may create an accurate person impression that will be easier for supervisors to understand and retain, thus enhancing the chance that they will depend on it rather than another, less accurate impression. Also, human resource managers may want to provide supervisors with reminders of these dimensions throughout the year so that the accurate person impression is refreshed and less likely to be forgotten. Using fewer but more powerful performance dimensions might allow for more frequent appraisals. This would not only enhance the legal defensibility of the performance management system and provide the worker with more feedback, it would also increase the supervisor's exposure to the relevant person impression, thus increasing rater accuracy. For complex jobs, this approach could be somewhat cumbersome. However, if the work involved can be analyzed, defined and identified, then such a process should be possible. Also, if jobs are very broad, flexible and changing quickly, it may be that a traditional performance appraisal system is not appropriate at all.

A relevant point that has been made in the research literature concerns whether behavioral accuracy or classification accuracy is more important (Murphy, 1991). In some cases, correct recall of behaviors is critical, such as when developmental feedback is provided to employees. In making individual personnel decisions, however, classification accuracy may be more applicable since such ratings "should be evaluated in terms of the accuracy with which they rank-order employees" (Murphy, 1991: 46). Although this distinction is quite relevant in the world of research, in practical performance management systems it becomes somewhat fuzzy. For example, legal defensibility requires that performance appraisal ratings, especially ones that are high or low, be backed up with specific behavior incidents. Obviously, behavioral accuracy is critical here so that the behaviors cited can justify the rating and are accepted by the employee as relevant and true. At the same time, in organizations with sophisticated human resource systems, performance appraisal instruments such as these, rather than broad classifications, are used to make personnel decisions. Especially in the case of promotions, sophisticated HR systems would go beyond "rank-ordering" employees. They would assess the particular skills and skill deficits each candidate has in relation to the promotional opportunity in order to maximize the job-person fit. Therefore, some type of evaluation of relevant performance dimensions is required beyond a general classification judgment.

Also, is it a valid assumption that managers are really interested in accurate ratings? Would managers care if performance ratings were inaccurate as long as the employees being rated were satisfied with the process and outcome? The answer to these questions depends on the viability of the organization's human resource practices, particularly in identifying, defining and valuing work activities. Theoretically, all employee behavior is monitored, measured and allowed to remain if it increases the firm's productivity. However, in less well managed organizations little time may be spent assessing the actual work that individuals perform. In such cases, where the work itself is not relevant, accuracy of ratings will not be meaningful. In other words, if the work isn't important, it's not important that its performance be accurately evaluated.

The tendency in some restructuring organizations is to reduce the dependency on the job as a highly structured, discrete entity. Such a strategy stems from the need for organizations in highly competitive environments to achieve flexible staffing, challenge employees who are facing few promotional opportunities and optimize profitability. In these organizations, less emphasis may be placed on work behaviors than on group or organizational outcomes, and accuracy of ratings may indeed be less relevant. However, in more traditional work settings, where the workers' outputs are discrete and measurable, the need to accurately assess performance remains critical. As long as the work itself is relevant to unit and corporate strategies, accurate performance evaluations are highly desirable.

Another implication for managers concerns the distinctly different findings based on the secretary's level of effectiveness or ineffectiveness. Although this issue is discussed further below, it may indicate that supervisors do not rate good and poor behavior using quite the same mental processes (Werner, 1994). If managers find poor behavior more salient, would they neglect the proper observation and attention needed to accurately evaluate the good performers? The effect on the processing capabilities of supervisors who have one or more poorer performing employees is worthy of further attention.

Considerations for Researchers

This study has several implications for further research. First, it suggests that memory based processing, in which incoming information is stored accurately and retrieved later when a judgment is required, may not automatically occur. In the more realistic setting of this study, where a more "ecologically valid" information processing situation was introduced (compared to other studies), subjects without prior knowledge still relied on a person impression, although one that was less accurate than that used by subjects in a more on line processing mode. Thus, eliciting memory based processing in a realistic performance evaluation context may require significant engineering.

Second, findings in this study did not hold for raters in the ineffective conditions. While differences in means between the no prior knowledge and prior knowledge groups were statistically significant in the effective condition, they were not in the ineffective condition, indicating that raters without prior knowledge in the ineffective condition were appraising the dimensions more accurately than their counterparts in the effective conditions. It may be that the three consistently ineffective dimensions crossed the information "threshold" (Feldman, 1981; Hastie and Park, 1986), initiating more accurate processing. If these raters perceived that a large amount of the secretary's behavior was extremely poor, they may have become more aware of the situation and begun to process incoming information more consciously.

Of course, another possibility is that the differential effect of effective and ineffective performance levels is due to inadequately designed research materials. Every effort was made to make each ineffective example identical to its effective counterpart except in performance level; however, this was not always possible because changing performance level occasionally meant changing several pieces of information at once. Nevertheless, the majority of the examples were identical except for the performance manipulation, so the effects of the few that had to contain somewhat different material should be slight.

This study found that raters with prior knowledge of the rating task showed less evidence of halo than did raters without prior knowledge. Although I operationalized halo by using the general impression, caution must be taken in interpreting the overall impression as a true measure of halo or categorization. As Hulin (1982) points out, choosing an overall impression to serve as a measure of halo is arbitrary, and may remove more variance than is justified on statistical or theoretical grounds. Varying amounts of "true" and "illusory" halo are contained in an overall impression and its sole use as an indicator of halo is misleading, particularly when it is calculated by pooling across raters, as was done in this study (Murphy, 1982; Murphy et al., 1993). Additionally, if the groups without prior knowledge of the rating rely on Borman's (1978) "traditional" model, their overall impression would by definition be linearly related to the separate performance dimensions because the traditional model includes the weighting and summing of each dimension's evaluation to obtain a single rating. Thus, this measure may not adequately distinguish between raters depending on a general impression or the performance dimensions. However, here we are not as concerned with the measurement of individual performance dimensions and an exact measure of halo as we are with the differences between experimentally manipulated groups. Although using the overall impression as a measure of halo has drawbacks, it is satisfactory for these purposes.

Not included in this study was the possible effect of time on the rating task. One could assume that as time passes, the accuracy of some encoded information would be lost, and the effective-ineffective impression dependency would increase, even for those raters who were informed of the appraisal before observation of behavior. Williams et al. (1990) found that memory based raters improved in recall of behaviors over time, while on line raters' recall accuracy decreased. Therefore, it would be of interest in future research to again include this variable in the context of a more representative performance rating context, as in this study.

The results of this study indicate that when raters are in environments that represent the real world, in that much relevant and irrelevant information is incoming, they will tend to depend on overall impressions whether or not they are given prior instructions about the rating dimensions. Providing these instructions may, however, contribute to an increase in accuracy. When dealing with such notoriously imprecise devices as performance appraisals, even a slight improvement in accuracy may provide significant utility.

References

Borman, W.C. 1978. "Exploring Upper Limits of Reliability and Validity in Job Performance." Journal of Applied Psychology 63: 410-427.

Eichel, E., and H.E. Bender. 1984. Performance Appraisal: A Study of Current Techniques. Research & Information Service. American Management Association.

Feldman, J.M. 1981. "Beyond Attribution Theory: Cognitive Processes in Performance Appraisal." Journal of Applied Psychology 66: 127-148.

Fiske, S.T., and S.E. Taylor. 1984. Social Cognition. Reading, MA: Addison-Wesley.

Hastie, R., and B. Park. 1986. "The Relationship Between Memory and Judgment Depends on Whether the Judgment Task Is Memory-Based or On-Line." Psychological Review 93: 256-268.

Hulin, C.L. 1982. "Some Reflections on General Performance Dimensions and Halo Rating Error." Journal of Applied Psychology 67: 165-180.

Ilgen, D.R., J.L. Barnes-Farrell, and D.B. McKellin. 1993. "Performance Appraisal Process Research in the 1980s: What has it Contributed to Appraisals in Use?" Organizational Behavior and Human Decision Processes 54: 321-368.

Landy, F.J., and J.L. Farr. 1980. "Performance Rating." Psychological Bulletin 87: 72-107.

Lord, R.G., and K.J. Maher. 1990. "Alternative Information-Processing Models and Their Implications for Theory, Research and Practice." Academy of Management Review 15: 9-28.

Murphy, K.R. 1982. "Difficulties in the Statistical Control of Halo." Journal of Applied Psychology 67: 161-164.

-----. 1991. "Criterion Issues in Performance Appraisal Research: Behavior Accuracy versus Classification Accuracy." Organizational Behavior and Human Decision Processes 50: 45-50.

-----, T.A. Philbin, and S.R. Adams. 1989. "Effect of Purpose of Observation on Accuracy of Immediate and Delayed Performance Ratings." Organizational Behavior and Human Decision Processes 43: 336-354.

-----, R.A. Jako, and R.L. Anhalt. 1993. "Nature and Consequences of Halo Error: A Critical Analysis." Journal of Applied Psychology 78: 218-225.

Nathan, B.R., and R.G. Lord. 1983. "Cognitive Categorization and Dimensional Schemata: A Process Approach to the Study of Halo in Performance Ratings." Journal of Applied Psychology 68: 102-114.

Smither, J.W., and R.R. Reilly. 1989. "Relationship Between Job Knowledge and the Reliability of Conceptual Similarity Schemata." Journal of Applied Psychology 74: 530-534.

Steiner, D.D., and J.S. Rain. 1989. "Immediate and Delayed Primacy and Recency Effects in Performance Evaluation." Journal of Applied Psychology 74: 136-142.

Vance, R.J., P.S. Winne, and E.S. Wright. 1983. "A Longitudinal Examination of Rater and Ratee Effects in Performance Ratings." Personnel Psychology 36: 609-620.

Werner, J.M. 1994. "Dimensions that Make a Difference: Examining the Impact of In-Role and Extra-Role Behaviors in Supervisory Ratings." Journal of Applied Psychology 79: 98-107.

Williams, K.J., T.P. Cafferty, and A.S. DeNisi. 1990. "The Effect of Performance Appraisal Salience on Recall and Ratings." Organizational Behavior and Human Decision Processes 46: 217-239.

Woehr, D.J., and A.I. Huffcutt. 1994. "Rater Training for Performance Appraisal: A Quantitative Review." Journal of Occupational & Organizational Psychology 67: 189-205.

Wyer, R.S., and T.K. Srull. 1981. "Category Accessibility: Some Theoretical and Empirical Issues Concerning the Processing of Social Stimulus Information." In Social Cognition: The Ontario Symposium. Eds. E.T. Higgins, C.P. Herman, and M.P. Zanna. Hillsdale, NJ: Lawrence Erlbaum Associates.

Nancy E. Day, Assistant Professor of Human Resources, University of Missouri at Kansas City
Table 1

Number of Good or Poor Behaviors Included in Effective and
Ineffective Conditions

                                Effective            Ineffective
Performance dimension        Good      Poor       Good      Poor

Consistent dimensions
  Typing                       3         0          0         3
  Initiative                   3         0          0         3
  Attendance                   3         0          0         3

Inconsistent dimensions
  Social Skills                1         2          2         1
  Phone Messages               1         2          2         1
  Organizing Skills            1         2          2         1
  Learning Ability             1         2          2         1

Total                         13         8          8        13
                                    21                   21