
College Teaching, Winter 2001, v49 i1 p26
Understanding Student Evaluations.

Suzanne M. Hobson; Donna M. Talbot.
Copyright 2001 Heldref Publications

What All Faculty Should Know

With the new millennium, many people are reflecting on the past while preparing for the future. Within higher education, that reflection reveals a struggle between its very early history, which emphasized teaching (sitting at the feet of Aristotle), and its infatuation with research, which developed in the late 1800s (profoundly influenced by German institutions). Although universities have consistently valued both teaching and research, "there has been a subtle but pervasive transformation of faculty priorities in American higher education" (Glassick, Huber, and Maeroff 1997, 7). Specifically, scholarship has been defined as research and publication and has overshadowed teaching in tenure and promotion decisions. More recently, however, the definition and scope of scholarship are being questioned.

Today, on campuses across the nation, there is a recognition that the
faculty reward system does not match the full range of academic functions
and that professors are often caught between competing obligations....
While we speak with pride about the great diversity of American higher
education, the reality is that on many campuses standards of scholarship
have become increasingly restrictive, and campus priorities frequently are
more imitative than distinctive. (Boyer 1990, 1-2)

To address these criticisms, the Carnegie Foundation for the Advancement of Teaching has proposed a more inclusive perspective on what it means to be a scholar--"a recognition that knowledge is acquired through research, through synthesis, through practice, and through teaching" (Boyer 1990, 24) [emphasis added]. Although teaching is becoming a new, or renewed, commitment for the academy, it has consistently been, and continues to be, one of the activities that faculty themselves value most.

In a 1989 Carnegie Foundation national survey that asked faculty to identify their primary interest, 30 percent of the faculty indicated research was primary, while 70 percent said that teaching was primary. Similarly, in a study of more than 35,000 faculty members conducted by the Higher Education Research Institute in 1991, 98 percent of the respondents stated that "being a good teacher was an essential goal," yet only 10 percent "believed their institutions rewarded good teaching" (Centra 1993, 3). Clearly, faculty members value teaching and have a strong desire to teach well.

As universities re-examine their definition of scholarship, they may wish to consider the Carnegie Foundation recommendations, which incorporate four overlapping areas that define the work of faculty: the scholarship of discovery; the scholarship of integration; the scholarship of application; and the scholarship of teaching. Although all four are important, we would like to focus on the scholarship of teaching, broadly defined as the ideology, pedagogy, and evaluation of teaching.

Although teaching evaluations may include peer (colleague) evaluations, retrospective evaluations by alumni, and self-evaluations, universities have tended to rely primarily on students' evaluations when attempting to quantify an instructor's teaching effectiveness. Our purpose here, therefore, is to introduce the scholarship of teaching evaluations with a specific emphasis on student evaluations of teaching effectiveness (SETEs). We will discuss the role of SETEs in the scholarship of teaching; address their two primary uses, highlighting the difference between global and specific items contained within most SETE instruments; briefly address the research on their accuracy; identify and describe some of the more widely used SETE instruments; and provide recommendations.

Primary Uses of SETEs

SETEs serve two very different purposes within higher education today (Centra 1993). In the field of program evaluation, "Scriven (1967) first distinguished between the formative and summative roles of evaluation" (Worthen and Sanders 1987, 34).

Student evaluations are formative when their purpose is to help faculty members improve and enhance their teaching skills. SETEs may serve a formative purpose when we use informal evaluations during a semester to determine what is working well and not-so-well, to pinpoint needed changes, and to guide those changes. Teachers may have students complete a written evaluation, participate in a facilitated "classroom assessment" dialogue (Angelo and Cross 1993), or participate in discussions with an instructor over the course of a semester (Nuhfer 1992).

As Centra (1993) indicated, however, evaluations serve a formative purpose only if the following four conditions are met:

First, teachers must learn something new from them. Second, they must value the new information. Third, they must understand how to make improvements. And, finally, teachers must be motivated to make the improvements, either intrinsically or extrinsically. (81)

SETEs may also be formative when used at the end of the semester. Evaluation forms invite the student to comment on specific aspects of an instructor's teaching style such as organization, planning, or structure; teacher-student interaction or rapport; clarity in communication skills; course work load and difficulty; grading, examinations, and assignments; and student learning, student self-ratings of accomplishments, or progress (Centra 1993, 57). Marsh advocated the use of an instrument that measures the following dimensions of teaching effectiveness: learning/value; instructor enthusiasm; organization; group interaction; individual rapport; breadth of coverage; examinations/grading; assignments; and workload/difficulty (1984, 712-13). Similarly, Feldman (1988) identified twenty-two "instructional dimensions" of effective teaching in his research on SETEs. Those specific dimensions tend to be especially helpful in the formative evaluation process because they help teachers understand what students like and dislike about their teaching style.

The summative purpose of SETEs is for use in evaluating the overall effectiveness of an instructor, particularly for tenure and promotion decisions (Centra 1993). As stated earlier, universities rely on several sources of information about a faculty member's teaching, but only SETEs tend to be administered systematically (Cashin 1988), generally as end-of-course evaluation forms that are used by an entire university community. In addition to items designed to target specific dimensions or behaviors, the end-of-course evaluation forms also tend to include global items on the overall effectiveness of the teacher or the quality of the course. Although global ratings tend to correlate highly with a number of specific factors (Centra 1993), there exists an ongoing debate about the appropriate uses of each.

As Cashin and Downey (1992) observed, "one of the continuing debates concerning the use of student ratings of teaching is the debate revolving around what kind of measures should be used for summative evaluation of faculty, in making personnel decisions for retention, promotions, tenure, or salary increases, and of course, to assess their effectiveness" (563). Some researchers have argued that only global items should be used for summative decisions regarding tenure and promotion. Others assert either that only specific ratings should be used for summative and formative purposes (Cashin 1988) or that both global and specific items need to be considered in summative decisions (Marsh 1991a, 1991b; Marsh and Bailey 1993).

Because this debate is not yet resolved, it is important for faculty members to be aware of its existence and the implications of each perspective. Professors will want to inquire about the specific student evaluation form used by their departments or universities and ascertain the relative importance placed by the instrument and the institution on global and specific items.

Further, questions about the accuracy of SETEs in measuring teaching effectiveness are not uncommon, but there seems to be a discrepancy between anecdotal beliefs and empirical data (Arreola 1995). Faculty members may want to explore the data available on the specific SETEs used by their institution and become more familiar with the research on SETEs in general.

Research on the Accuracy of Student Evaluations

Although this issue has received less faculty attention in recent years, the majority of past research on the accuracy of SETEs has focused on SETE reliability and validity, and on potential sources of bias that may influence ratings by students.

Reliability of Student Evaluations

Research on the reliability of SETEs is most often specific to a single instrument and has focused on three areas: consistency, or inter-rater reliability; stability; and generalizability. Little consideration is given to reliability estimates based on measures of internal consistency, which, as Marsh pointed out, are usually inflated because "they ignore the substantial portion of error due to lack of agreement among different students" (1984, 716). The inter-rater reliability of SETE instruments appears to vary with the specific instrument being used and with class size. Estimates have been as low as .20 for the agreement between any two randomly selected students in a class (Marsh 1984), while estimates for the class as a whole have risen from .60 for a class of five and .69 for a class of ten to .81 for a class of twenty and .95 for a class of fifty (Cashin 1988). These data indicate a generally acceptable degree of consistency for class sizes of at least fifteen; for this reason, universities often do not conduct official student evaluations in small classes.
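The pattern of rising reliability with class size roughly follows the Spearman-Brown prophecy formula from classical test theory, which predicts the reliability of an average over n raters from the agreement between single raters. The sketch below is our illustration, not part of the cited studies; it assumes a single-student agreement of about .20, the low-end figure Marsh (1984) reports:

```python
def spearman_brown(single_rater_r: float, n_raters: int) -> float:
    """Predicted reliability of the mean of n raters' ratings,
    given the agreement (reliability) of a single rater."""
    return (n_raters * single_rater_r) / (1 + (n_raters - 1) * single_rater_r)

# Assuming single-student agreement of .20, the predicted reliability
# of the class-average rating rises quickly with class size:
for n in (5, 10, 20, 50):
    print(f"class of {n:2d}: predicted reliability = {spearman_brown(0.20, n):.2f}")
```

The predicted values (.56, .71, .83, .93) track the empirical estimates Cashin (1988) reports (.60, .69, .81, .95) reasonably well, which helps explain why the class-average rating is considered dependable only once a class reaches roughly fifteen students.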

The second aspect of reliability of interest in student evaluations is the stability of ratings over time. Cross-sectional studies comparing "the retrospective ratings of former students and those of currently enrolled students" (Marsh 1984, 717) have found "substantial agreement between current students and alumni (of five years) regarding who have been effective or ineffective teachers" (Centra 1974, 321; Feldman 1989). Longitudinal studies comparing the same students' end-of-course evaluations with their evaluations of the same professor/course at least one year later have found that "students' evaluations collected at the end of a course are remarkably similar to the retrospective ratings provided by the same students several years later" with a reliability coefficient of .83 (Overall and Marsh 1980, 321).

Third is the generalizability of the results. Specifically, this issue addresses a level of confidence that student evaluations are reflections of an instructor's effectiveness rather than artifacts of a particular course. As an example, faculty members frequently state that an instructor's SETE scores may be inflated simply as a result of teaching classes preferred by the majority of students. Marsh (1981) conducted a comprehensive study of the generalizability of student ratings using evaluations from 1,364 classes. His results suggested "the effect of the teacher on student ratings of teaching effectiveness is much larger than is the effect of the course being taught" (718). Other researchers have found similar results and concluded that SETEs are highly generalizable across courses and students (Barnes and Barnes 1993; Cashin 1988; Marsh and Overall 1981).

The research on SETEs has provided strong support for their reliability, and there has been little dispute about it.

Validity of Student Evaluations

Validity refers to the extent to which student evaluations actually measure what they are intended to measure--instructor effectiveness. Validity, however, is especially difficult to establish because researchers concede that there are no universally accepted criteria for what constitutes effective teaching (Aleamoni 1987; Cashin 1988; Feldman 1988; Marsh 1984). Research has therefore tended to compare SETEs either to a measure of student learning or to other evaluations of teacher effectiveness that are assumed to be valid, such as instructor self-evaluations, peer/colleague evaluations, and alumni evaluations.

The validity research correlating SETEs with student learning is based on the belief that instructor effectiveness should be reflected in the amount students learn. That research has used actual student grades as a measure of student progress. Cohen (1981, in Cashin 1988) reviewed a number of studies and found that student grades correlated only .47 with students' self-reported progress, .47 with overall ratings of course effectiveness, and .44 with overall ratings of instructor effectiveness.

Although these validity coefficients are substantially lower than the reliability coefficients cited earlier, Cashin (1988) asserted that validity coefficients between .20 and .49 are "practically useful ... especially when studying complex phenomenon, such as student learning" (2). Cruse (1987), on the other hand, also focused on Cohen's (1981) findings but disagreed with Cashin's conclusion. Among his criticisms of student evaluations is the fact that "the correlation between overall instructor ratings and student achievement can be .38 if the ratings are made before the students know their final grade but .85 if the ratings are made after final grading" (Cruse 1987, 729). That suggests that grade expectancies, along with other factors, bias student ratings.
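One way to weigh the magnitudes in this debate is to convert each correlation into shared variance (r squared), the proportion of variability in one measure accounted for by the other. This is a standard interpretive aid that we add for illustration; the arithmetic is ours, not part of the cited studies:

```python
# Shared variance (r squared) for the correlations cited above.
# Labels paraphrase Cohen (1981) and Cruse (1987); computation is illustrative.
correlations = {
    "instructor rating vs. achievement (overall)": 0.44,
    "rating vs. achievement, before grades known": 0.38,
    "rating vs. achievement, after grades known": 0.85,
}
for label, r in correlations.items():
    print(f"{label}: r = {r:.2f}, shared variance = {r * r:.0%}")
```

A correlation of .44 corresponds to about 19 percent shared variance, whereas .85 corresponds to about 72 percent, which makes Cruse's point concrete: knowing one's final grade dramatically tightens the apparent link between ratings and achievement.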

The correlations between instructor self-ratings and student ratings have ranged from .20 to .65 (Centra 1973; Marsh, Overall, and Kesler 1979). In his major review and synthesis of "research comparing the [actual] overall ratings of college teachers' effectiveness made by current and former students, ... and the teachers themselves," Feldman concluded that "teachers' self-ratings and current student ratings are, at best, moderately similar" (1989, 137).

Many of the early studies measuring the correlation between SETEs and ratings by the instructor's peers or colleagues were flawed because the peer ratings were not necessarily based on observations made during a classroom visit. For example, Blackburn and Clark (1975) found a correlation of .62 between SETEs and colleague ratings, but peer ratings in these studies were likely to have been influenced by conversations with students or knowledge of an instructor's student ratings (Centra 1975; Marsh 1984). Feldman (1988) and Marsh (1984) both concluded that the correlation between peer ratings and student ratings is unacceptably low. Marsh stated the following:

Peer ratings based on classroom visitation do not appear to be
substantially correlated with student ratings or with any other indicator
of effective teaching. Although these findings neither support nor refute
the validity of student ratings, they clearly indicate that the use of peer
evaluations of university teaching for personnel decisions is unwarranted.

Finally, research on the correlation between SETEs and ratings by recently graduated alumni has offered the strongest support for validity, with evidence of correlations consistently ranging from .75 (Centra 1974) to .83 (Overall and Marsh 1980). The premise is that "follow-up ratings allow former students to develop additional perspectives about, and to obtain emotional distance from, the person and situation being assessed," thus enabling former students to offer "more informed and mature perceptions of course and instructor effectiveness" (Overall and Marsh 1980, 321). Based on this logic, the long-term consistency of student ratings is often offered as evidence of both the reliability and the validity of student ratings (Marsh 1984).

Although some researchers have concluded that the above studies support the validity of SETEs (Cashin 1988; Centra 1993; Feldman 1988, 1989; Marsh 1984), other researchers have questioned that conclusion. Dowell and Neal (1982), for example, concluded that "the literature can be seen as yielding unimpressive estimates of the validity of student ratings.... Modest at best and quite variable" (59). Abrami, d'Apollonia, and Cohen (1990) also reviewed the literature on the validity of student ratings and noted that the conclusions reached by different researchers varied immensely.

Research on Potential Bias

In response to ongoing concerns about the validity of SETEs, research on potential biases has investigated whether factors unrelated to teaching skills (specifically, effects of instructor, student, and course characteristics) may explain variability in student ratings.

A heated debate continues about the influence of a teacher's sex and gender-role orientation (masculine, feminine, androgynous, or undifferentiated) on student evaluations (Basow and Howe 1987; Hobson 1997; Sidanius and Crane 1989). With respect to student characteristics, an interaction effect appears to exist between students' and instructors' sex/gender-role orientation, but it is not yet convincingly documented. Further, class size, subject matter, and type of class all appear to have a small but clear impact on student evaluations.

Although a more extensive discussion of research about potential bias is beyond the scope of this article, faculty should be aware that such research has been conducted. However, because it has not consistently found evidence of bias, universities have tended to adopt student evaluations and to assume that they have an appropriate level of validity. Although the debate regarding potential bias and the validity of SETEs is likely to continue, the literature suggests that few universities would consider student input unimportant.

In general, student evaluations can be taken [only] to report honestly
student perceptions .... Perceptions are not necessarily accurate
representations of the objective facts, but they nevertheless constitute,
for a variety of important factors, the entirety of the student end of the
teaching process. Thus, their importance in the teaching-learning
interchange should be obvious. (Machina 1987, 20)

Widely Used Student Evaluations

Though instruments are often developed on-site, a number of published instruments are available for purchase and are widely used for measuring student evaluations of teaching competence. Although an exhaustive review of all available SETE instruments is beyond our scope, we will introduce several widely used and well-researched SETEs, which we hope will allow for a better understanding of how one's own university SETE instrument compares to others. Refer to Centra (1993) and to the manuals for individual instruments for a more thorough understanding of the strengths and weaknesses of each instrument.

Instructional Development and Effectiveness Assessment (IDEA)

Cashin (1990a) stated that the IDEA System is one of the two "most widely used rating systems in North America" (114). It was developed by the Center for Faculty Evaluation and Development at Kansas State University, first published in 1977, and most recently revised in 1988 (Cashin 1990b; Centra 1993). It is a forty-six-item self-report inventory that inquires about the students' reactions to the instructor and to the course as well as the students' perceptions of their progress on a wide range of instructional goals.

The developers of the IDEA System (Hoyt and Cashin 1977) argue that the logical result of effective teaching is student learning. Although Marsh (1984) identified student learning as "the most widely accepted criterion of effective teaching" (720), the IDEA System is unique because it is the only widely used student evaluation that uses student learning as the major criterion for teaching effectiveness.

In addition to providing a global measure of teaching effectiveness based on student self-reported progress on learning objectives, the IDEA form also provides global ratings of overall teacher effectiveness and course quality. Finally, the instrument contains a number of items designed to elicit student ratings about specific teacher behaviors.

Student Instructional Report

The Student Instructional Report has also been identified as one of the two "most widely used rating systems in North America" (Cashin 1990a, 114). First published by the Educational Testing Service in 1971, and most recently revised in 1989, it is a self-report inventory of "thirty-nine questions, plus space for responses to ten additional questions that may be inserted locally" (Centra 1993, 188). It elicits students' opinions on specific characteristics and behaviors of the teacher, as well as on a variety of global qualities.

Student's Evaluation of Educational Quality

The Student's Evaluation of Educational Quality was developed in Australia and first published in 1976 by Marsh. Two forms exist: "an `Australianized' version ... incorporating minor modifications to reflect Australian spelling and usage" (Centra 1993, 204) and a version in standard American English. Most recently revised in 1991, the SEEQ consists of thirty-five items designed to measure the following evaluation factors: learning/value; instructor enthusiasm; organization; group interaction; individual rapport; breadth of coverage; examinations/grading; assignments; and workload/difficulty (Marsh 1984, 712-13).

The SEEQ also includes three items designed to assess the student's perception of the overall quality of the class and the teacher. Because Marsh recommends a complex approach involving the multidimensionality of teaching effectiveness, a level of administrative expertise in evaluation is advisable if this instrument is to be used for summative purposes (Abrami and d'Apollonia 1990).

Other Widely Used Student Evaluations

Instructor and Course Evaluation System "is a computer-based system through which faculty can select items from a catalogue of more than 400 items classified by content ... and specificity ..." (Centra 1993, 181). Only the global and general concept items are normed; the specific items are recommended only for formative use. This instrument is available from the University of Illinois at Urbana-Champaign.

Student Instructional Rating System was developed at Michigan State University. The standard form consists of twenty items related to specific ratings and one general, global affect item (asking students to rate their "general enjoyment of the course"). The instrument was copyrighted in 1982 and is available for use by other universities.

Instructional Assessment System actually consists of nine student evaluation forms, one each for "large lecture, small lecture-discussion, seminar, problem-solving course, skill acquisition course, quiz section, homework section, lab section, and clinical section" courses (Centra 1993, 179-80). The instrument yields both global and specific ratings and is available from the Educational Assessment Center at the University of Washington.


Recommendations

Despite discrepancies in opinions and research findings on the validity of student evaluations, it is essential for faculty to understand that SETEs are--and probably will continue to be--the primary institutional measure of their teaching effectiveness. Given this reality, we have several recommendations.

First, if there is a faculty evaluation office or faculty development office at your university, meet with a representative to request a copy of the official SETE used. Inquire about the development of the instrument, obtain reliability and validity research information on the specific SETE used, and ascertain the relative importance of the global and specific items in the summative decision-making process. Also find out about the relative importance of SETEs in the university's evaluation of teaching effectiveness and the relative importance of teaching evaluations in tenure and promotion.

Second, become familiar with the literature on the validity and potential biases of SETEs. Third, consider using some form of midterm evaluation for formative purposes. An official university SETE instrument or one of the widely used instruments described here is likely to contain specific items that will help gauge student opinions regarding a variety of teaching behaviors. Focus on these specific items and use the feedback to improve your effectiveness. Remember that the global items will be of little use to you in midcourse evaluations.

Finally, well-developed student evaluations with adequate reliability and validity data may provide some of the best measures of teaching effectiveness. In an era of growing accountability and outcomes evaluation, achieving a better understanding of the evaluation of teaching effectiveness may be a necessary step toward including the scholarship of teaching in decisions on faculty tenure and promotion.


References

Abrami, P. C., and S. d'Apollonia. 1990. The dimensionality of ratings and their use in personnel decisions. In Student ratings of instruction: Issues for improving practice, ed. M. Theall and J. Franklin. New Directions for Teaching and Learning 43. San Francisco: Jossey-Bass.

Abrami, P. C., S. d'Apollonia, and P. A. Cohen. 1990. Validity of student ratings of instruction: What we know and what we do not. Journal of Educational Psychology 82: 219-31.

Aleamoni, L. M. 1987. Typical faculty concerns about student evaluation of teaching. In Techniques for evaluating and improving instruction, ed. L. M. Aleamoni, 25-31. San Francisco: Jossey-Bass.

Angelo, T. A., and K. P. Cross. 1993. Classroom assessment techniques: A handbook for college teachers. 2nd ed. San Francisco: Jossey-Bass.

Arreola, R. M. 1995. Developing a comprehensive faculty evaluation system. Boston: Anker Publishing.

Barnes, L. L. B., and M. W. Barnes. 1993. Academic discipline and generalizability of student evaluations of instruction. Research in Higher Education 34:135-49.

Basow, S. A., and K. G. Howe. 1987. Evaluations of college professors: Effects of professors' sex-type and sex, and students' sex. Psychological Reports 60:671-8.

Blackburn, R. T., and M. J. Clark. 1975. An assessment of faculty performance: Some correlations between administrator, colleague, student, and self-ratings. Sociology of Education 48:242-56.

Boyer, E. L. 1990. Scholarship reconsidered: Priorities of the professoriate. Princeton: Carnegie Foundation for the Advancement of Teaching.

Cashin, W. E. 1988. Student ratings of teaching: A summary of the research. (IDEA paper no. 20). Manhattan: Kansas State University, Center for Faculty Evaluation and Development.

Cashin, W. E. 1990a. Students do rate different academic fields differently. In Student ratings of instruction: Issues for improving practice, ed. M. Theall and J. Franklin. New Directions for Teaching and Learning 43. San Francisco: Jossey-Bass.

Cashin, W. E. 1990b. Student ratings of teaching: Recommendations for use. (IDEA paper no. 22). Manhattan: Kansas State University, Center for Faculty Evaluation and Development.

Cashin, W. E., and R. G. Downey. 1992. Using global student rating items for summative evaluation. Journal of Educational Psychology 84:563-72.

Centra, J. A. 1973. Self-ratings of college teachers: A comparison with student ratings. Journal of Educational Measurement 10:287-95.

--. 1974. The relationship between student and alumni ratings of teachers. Educational and Psychological Measurement 34:321-25.

--. 1975. Colleagues as raters of instruction. Journal of Higher Education 46: 327-38.

--. 1993. Reflective faculty evaluation: Enhancing teaching and determining faculty effectiveness. San Francisco: Jossey-Bass.

Cohen, P. A. 1981. Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies. Review of Educational Research 51:281-309.

Cruse, D. B. 1987. Student evaluations and the university professor: Caveat professor. Higher Education 15:723-37.

Dowell, D. A., and J. A. Neal. 1982. A selective review of the validity of student ratings of teaching. Journal of Higher Education 53:51-62.

Feldman, K. A. 1988. Effective college teaching from the students' and faculty's view: Matched or mismatched priorities? Research in Higher Education 28:291-344.

Feldman, K. A. 1989. Instructional effectiveness of college teachers as judged by teachers themselves, current and former students, colleagues, administrators, and external (neutral) observers. Research in Higher Education 30:137-74.

Glassick, C. E., M. T. Huber, and G. I. Maeroff. 1997. Scholarship assessed: Evaluation of the professoriate. San Francisco: Jossey-Bass.

Hobson, S. M. 1997. The impact of sex and gender-role orientation on student evaluations of teaching effectiveness in counselor education and counseling psychology. Ed.D. diss. Western Michigan University, Kalamazoo.

Hoyt, D. P., and W. E. Cashin. 1977. Development of the IDEA System. IDEA Technical Report No. 1. Manhattan: Kansas State University, Center for Faculty Evaluation and Development.

Machina, K. 1987. Evaluating student evaluations. Academe 73:19-22.

Marsh, H. W. 1981. The use of path analysis to estimate teacher and course effects in student ratings of instructional effectiveness. Applied Psychological Measurement 6:47-60.

--. 1984. Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology 76:707-54.

--. 1991a. Multidimensional students' evaluations of teaching effectiveness: A test of alternative higher-order structures. Journal of Educational Psychology 83(2): 285-96.

--. 1991b. A multidimensional perspective on students' evaluations of teaching effectiveness: Reply to Abrami and d'Apollonia (1991). Journal of Educational Psychology 83(3): 416-21.

Marsh, H. W., and M. Bailey. 1993. Multidimensional students' evaluations of teaching effectiveness: A profile analysis. Journal of Higher Education 64:1-18.

Marsh, H. W., and J. U. Overall. 1981. The relative influence of course level, course type, and instructor on students' evaluations of college teaching. American Educational Research Journal 18:103-12.

Marsh, H. W., J. U. Overall, and S. P. Kesler. 1979. Validity of student evaluations of instructional effectiveness: A comparison of faculty self-evaluations and evaluations by their students. Journal of Educational Psychology 71:149-60.

Nuhfer, E. 1992. A handbook for student management teams. Platteville, Wis.: University of Wisconsin-Platteville, Teaching Excellence Center.

Overall, J. U., and H. W. Marsh. 1980. Students' evaluations of instruction: A longitudinal study of their stability. Journal of Educational Psychology 72:321-5.

Sidanius, J., and M. Crane. 1989. Job evaluation and gender: The case of university faculty. Journal of Applied Social Psychology 19:174-97.

Worthen, B. R., and J. R. Sanders. 1987. Educational evaluation: Alternative approaches and practical guidelines. White Plains, N.Y.: Longman.

Suzanne M. Hobson is an assistant professor in the Department of Leadership and Counseling at Eastern Michigan University, Ypsilanti, and Donna M. Talbot is an associate professor in the Department of Counselor Education and Counseling Psychology at Western Michigan University, Kalamazoo.