RESULTS-BASED FINANCING
RBF EDUCATION | EVIDENCE
CHINA

Can Classroom Observations Measure Improvements in Teaching?
FEBRUARY 2018

REACH funded a pilot of the Classroom Assessment Scoring System (CLASS) to test its usefulness as a tool to assess teaching practices that can help inform the design of incentives for teacher training providers.

The Results in Education for All Children (REACH) Trust Fund supports and disseminates research on the impact of results-based financing on learning outcomes. The EVIDENCE series highlights REACH grants around the world to provide empirical evidence and operational lessons helpful in the design and implementation of successful performance-based programs.

This note was adapted from Coflan, Andrew, Andrew Ragatz, Amer Hasan, and Yilin Pan (2018). Understanding effective teaching practices in Chinese classrooms: evidence from primary and junior secondary schools in Guangdong, Policy Research Working Paper, World Bank, Washington D.C.

While countries around the world have made enormous strides in terms of increasing access to education, many education systems are now working to ensure the high quality of learning for all students. There are two measurement challenges involved in the pursuit of this goal: how to measure student learning outcomes and how to measure teachers’ role in achieving those outcomes. While student test scores can be used to measure teacher quality, they miss a lot of what comprises effective teaching. Systematic observations of teaching practices can be a useful tool for shining a light on the “black box” of what makes a good teacher and can help policymakers to design better curricula and teacher training programs.

Furthermore, a consistent system for assessing teaching quality is a precondition for using results-based financing (RBF) schemes that make financing conditional on improvements in teaching quality. RBF can be used as an incentive to foster improvements at the level of the teacher, school, school system, or teacher training provider. However, the process of linking financing to learning outcomes is often plagued by concerns about cheating, “teaching to the test,” and exams that do not fully reflect what students have learned. Ensuring that RBF works effectively in the Chinese context will depend on developing ways to measure teaching quality that can be influenced by teachers, teacher trainers, and other actors but not manipulated.

The Results in Education for All Children (REACH) Trust Fund at the World Bank has funded a pilot of the Classroom Assessment Scoring System (CLASS) in Guangdong, China to test its usefulness as a tool to assess teaching practices. The pilot was also designed to establish a proof of concept for using classroom observations to measure the impact of teacher training and incentivize training providers within an RBF mechanism. In the pilot, the CLASS tool was used to conduct classroom observations of 36 teachers in Guangdong and to assess the strengths and weaknesses of their teaching practices. It sought to test whether this tool could be used to measure teaching practices in the context of China and how to introduce classroom observations into a quality assurance and monitoring and evaluation (QAME) system, as well as exploring how to establish the preconditions for introducing RBF into China.

The CLASS tool provided valid and reliable measures of teaching quality in Guangdong as it has done in other settings. On average, teachers in Guangdong scored high on classroom organization but lower on emotional support and instructional support. This was useful evidence showing how the teacher training system can fill these critical gaps in teaching practices. While there was substantial variation in performance among teachers, there was only modest variation by county, between urban and rural areas, by teacher type, by grade, or by years of experience. However, teachers with more student-centered beliefs scored significantly higher than those with teacher-centered beliefs.

This pilot has also yielded lessons about how to improve the design of classroom observations in the future to be more effective in an RBF context. To address discrepancies between the ratings of different classroom observers, it may be necessary to provide them additional training, or to modify the scoring rubric to take cultural issues into consideration. In designing an RBF mechanism using classroom observations, it is critical to design incentives targeted to teaching practices where there is significant room for improvement.

CONTEXT

The issue of ensuring that all students receive high-quality and equitable learning is particularly acute in China. While Guangdong’s GDP has grown by 13 percent per year since 1981, this rapid economic progress has coincided with growing urban-rural disparities. The urban-rural income gap increased from 1.7 in 1980 to over 3.0 in 2010.[1] Historically, the system for financing public services has resulted in substantial spending disparities across regions, particularly in the education sector, resulting in insufficient infrastructure, faculty, and operational resources.[2] While enrollment rates are close to 100 percent thanks to large investments in public schooling, schools in rural areas are often under-resourced and are staffed by teachers of variable quality.[3] These disparities also translate into gaps in learning performance. In 2015, results from four provinces (Beijing, Shanghai, Jiangsu, and Guangdong) on the Program for International Student Assessment (PISA) found that rural students lagged roughly 2.5 years of learning behind urban students.[4]

To reduce social inequality and achieve a more harmonious society, one of China’s key priorities is to ensure equity in both the quantity and quality of education.[5] To reach this ambitious goal, China must continue to invest in teacher quality, including increasing teacher training capacity. Training in Guangdong is typically provided by teacher training colleges but varies in quality from both a content and delivery standpoint. There is no consistent system for measuring training objectives. As a result, with no way to measure whether their training is meeting expected outcomes, teacher trainers are often unaware of how to improve their content or delivery method.

Achieving equity will also require a quality assurance mechanism, which is currently lacking. While the province measures some overall indicators of the education system, such as the number of backbone teachers (individuals who have been identified as excellent teachers by their peers; they also receive additional professional development), the overall system does not include any metrics of teaching quality in the classroom. Guangdong is in the process of improving its QAME system as part of the Guangdong Compulsory Education Project, a World Bank co-financed initiative to improve education in 16 under-performing counties in the province. These 16 project counties have significantly lower per capita GDP, higher levels of poverty, larger rural populations, and higher reliance on agriculture than the province overall. These 16 counties also lag behind in meeting Guangdong’s “Chuang Qiang” standards for school quality. This study focuses on three of the 16 project counties, Wuhua, Dianbai, and Lianjiang: one low-performing, one medium-performing, and one high-performing county in terms of education performance.

WHY WAS THE INTERVENTION CHOSEN?

In a recent literature review, pedagogical interventions and teacher training were found to be among the most effective methods to improve student learning outcomes.[6] However, changing teaching practices is notoriously difficult, and these efforts tend to fail if not well designed or properly targeted. A recent study found that, while teachers in China gained knowledge from training, their teaching behavior did not change, leading to no significant gains in learning.[7]

Furthermore, measuring classroom teaching practices consistently is a key challenge in ensuring teacher quality. While annual grade-wide examinations are used effectively in Guangdong to measure learning outcomes, attempts to measure the effect of teaching practices on learning outcomes have been less effective. Teachers in Guangdong are often observed in the classroom, but these observations are subjective and idiosyncratic and do not measure quality in a standardized way. Without a consistent metric of instructional quality, it is difficult to assess whether teaching is improving over time or whether teacher training is effective. Teacher training courses are not designed to address observed weaknesses of teacher practices, and there is little focus on measuring training outcomes. In the absence of any standardized measurement of teaching quality, there is little incentive for teachers to improve or for teacher training providers to develop content that will improve teaching practices. Before implementing RBF, school administrators must be able to demonstrate to teachers or trainers that there is a need for them to improve and in what areas.

Classroom observations accompanied by detailed feedback on each teacher have emerged as one of the most promising ways of improving teaching and learning. The CLASS tool was chosen for this study because it has been implemented in a number of countries and proven to be one of the most valid and reliable classroom observation instruments. It has also been shown to be particularly effective in producing teacher assessment results that are correlated with student learning outcomes, as measured by standardized test scores in the U.S.[8] and in developing country settings.[9]

A pilot of the CLASS tool was conducted in Guangdong as a proof of concept with three objectives: (a) to assess to what extent an internationally validated measure of teaching practices could be applied in the Chinese context; (b) to gain an understanding of how to use classroom observations to assess the strengths and weaknesses of existing teaching practices in order to improve teacher training and monitoring and evaluation (M&E); and (c) to establish the preconditions for an RBF mechanism for teacher trainers. The ultimate goal was to establish a reliable measure of teacher quality agreed on by all relevant stakeholders so that it could be used to assess how a teacher performed before and after training and to reward trainers with performance-based bonuses.

HOW DID THE INTERVENTION WORK?

CLASS is a classroom observation tool that measures the quality of teacher-student interactions, which are the main mechanism through which children learn. It was developed by researchers at the University of Virginia to provide an objective, quantitative measurement of teaching practices. Today, CLASS observation training is administered by Teachstone, a private U.S.-based company. As of April 2016, the tool had been validated and implemented in a number of countries. While such validation is always necessary when bringing a tool into a new setting, it is especially critical for China given that the CLASS tool is designed for a student-centered learning environment, while Chinese classrooms are historically more traditional and teacher-centered, focused on lecturing and the transmission of knowledge. Furthermore, before the government adopted the CLASS tool to determine incentives for teacher trainers, it was critical to demonstrate that it was reliable and accurate when applied in Chinese classrooms.

In other countries, the CLASS tool has been used in all grades from kindergarten to secondary school. In this pilot, it was tested in primary schools (grade four) and junior secondary schools (grade eight). While the tool is subject-agnostic, in this study it was used to observe English, math, and Chinese classes. In each classroom, two raters used the tool to score classroom behavior on a 1 to 7 scale in 12 dimensions, organized into four domains (Table 1).

Table 1. CLASS Dimensions
CLASS domain: CLASS dimensions
Emotional support: Positive climate; Negative climate; Teacher sensitivity; Regard for student perspective
Classroom organization: Behavior management; Productivity; Instructional learning formats
Instructional support: Content understanding; Quality of feedback; Analysis and inquiry; Instructional dialogue
Student engagement: Student engagement

In addition to the classroom observations, teachers were given a questionnaire to fill in to assess their beliefs and attitudes about teaching. This questionnaire consisted of questions on topics such as the role of teachers in the classroom, the role of assessments, and the structure of student-teacher interactions. Its purpose was to measure each teacher’s alignment with either student-centered or teacher-centered beliefs.

The study sample consisted of 36 teachers in 12 schools, with each of the three pilot counties contributing two primary schools (urban and rural) and two junior secondary schools (urban and rural). Each school included one new teacher (with less than three years of experience), one backbone teacher, and one potential backbone teacher (who has not yet completed backbone teacher training). Within each type of school, the teachers were also evenly distributed between English, math, and Chinese. This breakdown was designed to assess the CLASS tool’s validity across a balanced distribution of teachers and to assess teacher performance across several relevant dimensions, including teacher experience, school level, urban vs. rural, and school subject.

CLASS includes four cycles of 15-minute observations of teachers and students by certified observers. CLASS has been validated in over 2,000 schools, primarily in the United States.

WHAT WERE THE RESULTS?

The results of the pilot showed that teachers in Guangdong have strong classroom organization skills but score lower on emotional support and instructional support, which is similar to results that have been seen in the U.S. Teachers scored very high on classroom organization, with average scores of 6.5 (out of 7) for primary school teachers and 6.2 for junior secondary school teachers (Table 2). Scores of 6 out of 7 mean that most teachers are able to prevent and redirect students’ misbehavior, manage instructional time, and maximize students’ engagement effectively. In contrast, primary and junior secondary teachers averaged 4.2 and 3.7 respectively on emotional support, and 3.7 and 3.6 respectively on instructional support. Scores of 4 out of 7 in these areas imply that many teachers are unable to establish an emotional connection with students, create a positive learning environment, or promote a depth of understanding in their students. These scores suggest that teachers in China have similar levels of teaching quality and similar strengths to those of U.S. teachers as shown in three studies in the United States (although it is difficult to assess whether such comparisons may be influenced by the raters’ pre-existing cultural attitudes).

Table 2: Average Teacher Scores on Each CLASS Domain (out of 7)
Classroom organization: Primary 6.5; Junior secondary 6.2
Emotional support: Primary 4.2; Junior secondary 3.7
Instructional support: Primary 3.7; Junior secondary 3.6

However, these averages mask substantial variation in the classroom performance of individual teachers as well as modest variations by county, urban/rural location, teacher type, grade, and years of experience. The breakdown by individual teachers reveals large disparities in performance. For example, in grade eight, teachers’ emotional support scores ranged from 1.7 to 5.1, with the best teacher scoring three times as high as the one with the lowest score. However, average classroom performance was similar in each pilot county despite large differences between them in terms of economic prosperity and educational achievement. Similarly, there was only a slight difference in average scores between urban and rural schools, with rural classrooms performing slightly better in all three domains. That being said, even though the “urban” schools were located in the county seats, all schools in the pilot counties could be considered to be rural compared with Guangzhou.

The differences in scores between new teachers, backbone teachers, and potential backbone teachers were also relatively small. Potential backbone teachers scored highest in all three domains, perhaps because they are more likely to be tenured than new teachers, have been identified as strong performers, and are more motivated than backbone teachers to perform well. These gaps were larger in junior secondary schools, suggesting a more difficult transition for new teachers, perhaps due to the larger class sizes, more difficult curricula, and the students’ greater emotional needs. Grade four teachers consistently scored higher than their grade eight counterparts in all three domains. The youngest teachers performed best on emotional support and the oldest scored highest on instructional support and classroom organization.

Teachers with more student-centered beliefs scored significantly higher than those with teacher-centered beliefs. On the beliefs survey, all teachers were scored based on their alignment with either student-centered or teacher-centered beliefs. Scores associated with the most teacher-centered beliefs were negatively correlated with CLASS scores in the emotional support, classroom organization, and student engagement domains, while scores associated with student-centered beliefs were positively correlated with all domains and more strongly correlated with instructional support scores. This suggests that a student-centered teaching approach, in addition to fostering a positive classroom environment, is also more conducive to effective instruction and classroom organization.

The CLASS tool provided valid and reliable measures of teaching quality, with high levels of agreement between raters. The percent agreement, defined as the fraction of scores that were equal or adjacent (±1) between the two raters, was 75 percent on average, similar to the 77 to 80 percent agreement in the three comparable U.S. studies. In addition, many of the 12 dimensions were strongly correlated with each other, with an average correlation of 0.51, suggesting that the tool is internally consistent and that teachers tend to perform consistently well (or poorly) across all dimensions. The emotional support scores are strongly correlated with both classroom organization (0.71) and instructional support (0.78). However, there was a lower correlation between classroom organization and instructional support (0.46), which suggests that even teachers who can keep their class organized may not be effectively engaging their students in instruction.

However, there were larger discrepancies between raters on some dimensions for both technical and cultural reasons. Another way of examining differences between multiple raters is to consider the magnitude of the spread between their scores. For example, a spread of 3 or more on a 1 to 7 scale indicates a very serious discrepancy in judgment by one of the two raters. Certain dimensions proved harder for Chinese raters to agree on, including analysis and inquiry, instructional dialogue, regard for student perspectives, and teacher sensitivity, on which the raters scored at least 10 percent of teachers with a spread of 3 or more. In some cases, the challenges were mainly technical and could be addressed by improving the training provided to raters. However, in the cases of regard for student perspectives and teacher sensitivity, cultural factors may explain the discrepancies. These dimensions are not usually considered in existing teacher observations in Guangdong, and therefore raters may have very different opinions about what constitutes these characteristics in China.

WHAT WERE THE LESSONS LEARNED?

To use RBF to improve results in education, many preconditions must be met. One precondition is establishing a set of indicators that is accurate and reliable, easy to measure, and has a strong causal link to ultimate learning outcomes. The main objective of this pilot was to determine whether the CLASS tool could provide effective indicators to measure improvements in teaching practices over time. If so, this would pave the way to using RBF to incentivize training providers by rewarding them depending on the impact of their training.

While the observations in the pilot were not taken at different points in time, the variation in scores can help to answer this question. The scores in emotional support and instructional support varied widely, with teachers’ scores ranging from 1.7 to 5.1 and 1.7 to 4.6 respectively. However, teachers scored very high on classroom organization and student engagement with little variation. Ninety-eight percent of all teachers received a 5 or higher in classroom organization, and 90 percent received a 5 or higher in student engagement. With so little variation, there would be little possibility of capturing measurable changes over time. To measure meaningful improvements in these domains, it may be necessary to modify the scoring rubrics to identify areas where teachers are weak.

The pilot also identified some challenges in the rating of specific dimensions. In some cases, the issues are likely to have been technical and could be addressed by providing raters with more extensive training. In other cases, cultural factors may have caused discrepancies in scores across raters. In these cases, the CLASS scoring rubrics may need to be modified to make them more applicable to China. In fact, it is critical to pay careful attention to adapting any classroom observation tool to the Chinese context to ensure that all relevant stakeholders consider the tool to be valid and culturally relevant and trust its results.

This pilot also examined the implications of observing classrooms either in person or via a video link. There are advantages to each of these options. For example, a video can be scored by many different raters and can be re-watched at any time to resolve any discrepancies, while in-person observation is a one-time experience but generally allows for a better evaluation of the classroom environment and teacher-student interactions. No definitive conclusions could be drawn from the pilot’s analysis of this question. The in-person observations yielded higher scores in most dimensions, but further investigation will be needed to select the best option going forward.

While this pilot was conducted in only three counties in Guangdong and therefore cannot be considered to be representative of all of China or of the province, it shows that the CLASS tool is an accurate, reliable indicator of teacher quality in the Chinese context. The results from the pilot can be used to inform the design of an RBF scheme to establish performance-based contracts for teacher training providers. However, the design of such a program must also meet several additional preconditions, including how to ensure that training providers agree to the financial incentives, how to manage the additional complexity and cost of RBF contracts, and how to provide feedback to teacher trainers to ensure that they have the information that they need to improve how they do their jobs.

CONCLUSION

In June 2017, at a workshop attended by Department of Education officials, classroom observers, and teachers, there was strong agreement on the tool’s usefulness and applicability. Teachers scored high on classroom organization but lower on emotional support and instructional support, which was similar to CLASS results in other settings. This pilot established a baseline of teaching practices in Guangdong and identified strengths and weaknesses that could be taken into account in the design of teacher training and the new QAME system. The pilot has shown that classroom observations can be used as an outcome measure in an RBF scheme to give teacher training providers incentives to change teacher behavior. Ensuring that such a scheme is carefully designed will be critical to ensuring support from all stakeholders and to establishing clear links between training and expected outcomes.

[1] Coflan, Andrew, Andrew Ragatz, Amer Hasan, and Yilin Pan (2018). Understanding effective teaching practices in Chinese classrooms: evidence from primary and junior secondary schools in Guangdong, Policy Research Working Paper, World Bank, Washington D.C.
[2] Tsang, M.C., and Y. Ding (2005). “Resource utilization and disparities in compulsory education in China.” China Review: An Interdisciplinary Journal on Greater China, 5(1): 1–31.
[3] Wen J. Peng, Elizabeth McNess, Sally Thomas, Xiang Rong Wu, Chong Zhang, Jian Zhong Li, and Hui Sheng Tian (2014). “Emerging perceptions of teacher quality and teacher development in China.” International Journal of Educational Development, Volume 34, pages 77–89, http://dx.doi.org/10.1016/j.ijedudev.2013.04.005
[4] OECD (2016). PISA 2015 Results (Volume I): Excellence and Equity in Education, OECD Publishing, Paris, http://dx.doi.org/10.1787/9789264266490-en
[5] Li Keqiang (2016). Report on the Work of the Government. Delivered at the Fourth Session of the 12th National People’s Congress of the People’s Republic of China on March 5, 2016.
[6] Evans, D., and A. Popova (2016). “What really works to improve learning in developing countries? An analysis of divergent findings in systematic reviews.” The World Bank Research Observer, Volume 31, Issue 2, pages 242–270.
[7] Lu, M., P. Loyalka, Y. Shi, F. Chang, C. Liu, and S. Rozelle (2017). The Impact of Teacher Professional Development Programs on Student Achievement in Rural China, Rural Education Action Program Working Paper 313, Stanford University, Stanford, CA.
[8] Allen, Joseph, Anne Gregory, Amori Mikami, Janetta Lun, Bridget Hamre, and Robert Pianta (2013). “Observations of Effective Teacher–Student Interactions in Secondary School Classrooms: Predicting Student Achievement With the Classroom Assessment Scoring System—Secondary.” School Psychology Review, 42, 76–98.
[9] Araujo, M. Caridad, Pedro Carneiro, Yyannu Cruz-Aguayo, and Norbert Schady (2016). “Teacher Quality and Learning Outcomes in Kindergarten.” The Quarterly Journal of Economics, Volume 131, Issue 3, 1 August 2016, pages 1415–1453.

PHOTO CREDITS
Cover: Project photo courtesy of the World Bank.
Page 3: “China_2010_Peng-Yang” by SIM USA, license: CC BY-SA 2.0
Page 5: Project photo courtesy of the World Bank.
Page 7: “Classroom” by WabbitWanderer, license: CC BY-SA 2.0

RESULTS IN EDUCATION FOR ALL CHILDREN (REACH)
worldbank.org/reach
reach@worldbank.org
REACH is funded by the Government of Norway through NORAD, the Government of the United States of America through USAID, and the Government of Germany through the Federal Ministry for Economic Cooperation and Development.
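
The two inter-rater statistics reported in the results can be illustrated with a minimal Python sketch: percent agreement (scores equal or adjacent, ±1) and the share of ratings with a spread of 3 or more. The function names and the rater scores below are hypothetical and chosen only for illustration. They are not the pilot's data, and the study team's actual computation may have differed.

```python
from typing import Sequence


def _check_pairs(rater_a: Sequence[int], rater_b: Sequence[int]) -> None:
    """Both raters must have scored the same, non-empty set of items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of items")


def percent_agreement(rater_a: Sequence[int], rater_b: Sequence[int],
                      tolerance: int = 1) -> float:
    """Fraction of paired scores that are equal or within `tolerance` points."""
    _check_pairs(rater_a, rater_b)
    agreeing = sum(1 for a, b in zip(rater_a, rater_b) if abs(a - b) <= tolerance)
    return agreeing / len(rater_a)


def share_large_spread(rater_a: Sequence[int], rater_b: Sequence[int],
                       threshold: int = 3) -> float:
    """Fraction of paired scores whose spread is `threshold` points or more."""
    _check_pairs(rater_a, rater_b)
    large = sum(1 for a, b in zip(rater_a, rater_b) if abs(a - b) >= threshold)
    return large / len(rater_a)


# Hypothetical 1-to-7 scores from two raters for eight teachers on one dimension.
rater_1 = [6, 5, 4, 3, 7, 2, 4, 5]
rater_2 = [6, 4, 6, 3, 4, 3, 5, 5]

print(f"Percent agreement (within 1 point): {percent_agreement(rater_1, rater_2):.0%}")
print(f"Share with spread of 3 or more: {share_large_spread(rater_1, rater_2):.1%}")
```

Computed separately for each of the 12 dimensions, the second statistic would flag dimensions on which a given share of teachers (10 percent in the note's discussion) received scores that differed by 3 points or more.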