ASSESSING TEACHERS-MADE TEST, WHAT DOES IT IMPLY?

ASSESSING TEACHERS-MADE TEST, WHAT DOES IT IMPLY?
A paper presented at the 57th International TEFLIN Conference, Universitas Pendidikan Bandung, Bandung, Indonesia
Nur Hidayanto Pancoro Setyo Putro
Yogyakarta State University, 2010

INTRODUCTION
Everyone in Jogja was shocked by the news that 22% of the 49,126 junior high school (SMP) students in Yogyakarta failed the 2010 National Examination (UN). The result was surprising, since Yogyakarta has long been known as The City of Education; it placed Yogyakarta fourth among the five provinces in Java. However, this does not mean that Jogja has lost its prestige, since the government insists that Jogja ranked first in honesty in the 2010 National Examination. English and Mathematics have been the most feared subjects for many students, since the data show that most of the SMP students in Jogja who failed did so in those two subjects (Kompas.com, 07/05/2010). The large number of students who failed the SMP UN 2010 implies that the two subjects are still considered difficult by SMP students in Jogja, and statistically, English caused more students to fail the UN 2010 than any other subject. There must be reasons why English has been one of the major causes of students' failure in the SMP National Examination. The question is: what makes English so difficult for students? Is it the material, the teacher, the teaching technique, the assessment, or all of these factors together? This question lingers in teachers' minds. However, merely pondering the question will not solve the problem; the wiser course is to conduct research to search for the possible causes and then address them.
The assessment process, as one of the phases of the learning process, cannot be separated from the whole. A good teaching and learning process can be seen from the result of appropriate assessment. However, some teachers seem to take it for granted: they prefer to use the questions provided in the LKS (Lembar Kerja Siswa) or in the handbooks. This may not be an appropriate way to check students' achievement, since the questions in those two sources may not measure the skills that are supposed to be measured. One reason is that such questions come with no known characteristics that can be used to gauge students' ability. A good test should consist of items that meet the characteristics of good items, including item difficulty, item discrimination, the effectiveness of the distractors, and even the item information function.
Viswanathan (2005:1) claims that assessment is a process of assigning numbers to the result of a prior process, measurement, which represents quantities of attributes. The process of assessing students' achievement should therefore employ a standardized means: assessing students' achievement includes qualifying the scores obtained from the measurement process, and the instrument used in the assessment should itself be standardized, that is, meet the characteristics of a good test. Chittenden, in Syaifuddin Azwar (2008:6), suggests that the assessment process should be aimed at:
1) Investigating, an effort to find out whether the teaching and learning process has reached its targets; the teachers are to obtain information about students' improvement.
2) Checking, an effort to find out whether there are any discrepancies in the process of teaching and learning, especially in students; this is to check which targeted skills have and have not been mastered by students.
3) Searching, in which assessment is conducted to find the causes of discrepancies in the teaching and learning process, especially in students' achievement, as well as to find solutions to the problems that arise during the teaching and learning process.
4) Concluding, in which assessment is conducted to establish the level of students' achievement, which will in turn be reported as students' progress.
An English achievement test is closely related to the topic as well as to the curriculum. An achievement test should be valid, which means it should measure what it is supposed to measure, namely the material given in the teaching and learning process. The test may take the form of multiple choice, matching, filling in the blanks, or essay. Nowadays, multiple-choice items are commonly used to measure students' achievement. Alderson (2000:211) argues that this form of test is well liked because it can be used to control the range as well as the variety of students' answers. Another reason this form is commonly used is that the results can be checked by computer, which saves time, money, and energy; it is also considered the most objective form of test. Heaton (1991:30) argues that the stem and the options of a good item should be stated clearly and logically and must not be ambiguous: the stem should give general information on the problem or question being asked, and neither the stem nor the options may use words that could confuse students. Furthermore, Brown (2004:55-58) suggests four rules for writing a good test:
a. Each item should measure a single objective. For example, an item on a reading comprehension test measures only students' understanding of the setting of the text, not the actors or the plot.
b. Each item should be stated clearly, avoiding repetition.
c. There is only one correct answer in a multiple-choice item.
d.
Use the item characteristics (the level of difficulty, the discrimination index, and the effectiveness of the distractors) to accept or reject items.
Bachman and Palmer (1996:176) recommend that an English test measure the four basic skills. To do so, test makers have to decide which micro skills to measure in the test and specify a theoretical definition of the constructs to be measured; in practice, the construct definition itself will cover certain components of the language micro skills.
This research aims at finding the characteristics of the items in the teacher-made test used in the English final semester test for 2nd grade junior high school students in Gunungkidul, one of the regencies in Yogyakarta Special Province, in 2009, and its implications for the improvement of assessment in English teaching. The researcher was interested in investigating this field because the researcher believes that good assessment results in good quality teaching.

METHODS
This is a descriptive quantitative study. It used 2,000 samples of students' answer sheets from the 2nd grade junior high school English final semester test of 2009 in Gunungkidul. Several techniques were used to analyze the test set as well as the test results obtained from students' responses on their answer sheets.
a. Analyzing the test set. The questions in the test sets, especially the multiple-choice questions, were analyzed to make sure that the indicators are based on the standard competence (SK) and basic competence (KD). Since the test was given to 2nd grade junior high school students in Gunungkidul regency, the researcher used the guidelines from the KTSP for SMP.
b. Expert judgment. This was used to review the construct, the language, and the materials of the questions, with reference to the test-writing guidelines from Depdiknas.
All three experts are lecturers in the English Education Department of Yogyakarta State University: Drs. Suharso, M.Pd., Dra. Jamilah, M.Pd., and Ari Purnawan, M.Pd., M.A.
c. Students' responses were quantitatively analyzed with Iteman 3.00 to find the item characteristics. This software is commonly used to analyze test items within classical test theory.
d. Students' responses were also quantitatively analyzed with Bilog 3.0 under the three-parameter logistic (3PL) model to obtain information about the level of difficulty, the discrimination index, and the pseudo-guessing parameter.

FINDINGS
The research findings are as follows:
a. The qualitative analysis with expert judgment was conducted using the standardized format issued by Ditjen Dikmenum (1994), an analysis card comprising 18 points. The expert judgment shows that 72%, or 36 of the 50 questions, are categorized as good, while 28%, or 14 of the 50 questions, are categorized as not good. In other words, the 36 questions fulfil the requirements of good items in terms of materials, construct, and language. The questions considered not good are numbers 1, 4, 8, 12, 20, 23, 31, 32, 33, 34, 35, 37, 40, and 50. The result of the expert judgment is shown in the following table:

Table 1

| Item | Expert 1 | Expert 2 | Expert 3 | Decision | Notes |
|------|----------|----------|----------|----------|-------|
| 1 | Not Good | Good | Not Good | Not Good | The length of the options is not proportional. |
| 2 | Not Good | Good | Good | Good | |
| 3 | Not Good | Good | Good | Good | |
| 4 | Not Good | Not Good | Not Good | Not Good | The length of the options is not proportional. |
| 5 | Good | Good | Good | Good | |
| 6 | Good | Good | Not Good | Good | |
| 7 | Good | Good | Good | Good | |
| 8 | Not Good | Not Good | Good | Not Good | The options are not homogeneous and logical. |
| 9 | Good | Good | Good | Good | |
| 10 | Not Good | Good | Good | Good | |
| 11 | Good | Good | Good | Good | |
| 12 | Good | Not Good | Not Good | Not Good | The question is not clear. |
| 13 | Good | Good | Not Good | Good | |
| 14 | Good | Good | Good | Good | |
| 15 | Good | Good | Good | Good | |
| 16 | Good | Good | Good | Good | |
| 17 | Good | Not Good | Good | Good | |
| 18 | Good | Good | Good | Good | |
| 19 | Good | Good | Good | Good | |
| 20 | Good | Not Good | Not Good | Not Good | The question does not follow appropriate English grammar. |
| 21 | Good | Good | Not Good | Good | |
| 22 | Good | Good | Not Good | Good | |
| 23 | Not Good | Good | Not Good | Not Good | The length of the options is not proportional. |
| 24 | Good | Good | Not Good | Good | |
| 25 | Good | Good | Good | Good | |
| 26 | Good | Not Good | Good | Good | |
| 27 | Good | Good | Good | Good | |
| 28 | Good | Not Good | Good | Good | |
| 29 | Good | Good | Good | Good | |
| 30 | Good | Good | Good | Good | |
| 31 | Good | Not Good | Not Good | Not Good | The options are not homogeneous and logical. |
| 32 | Good | Not Good | Not Good | Not Good | The options are not homogeneous and logical. |
| 33 | Good | Not Good | Not Good | Not Good | The options are not homogeneous and logical. |
| 34 | Not Good | Not Good | Good | Not Good | The options are not homogeneous and logical. |
| 35 | Not Good | Not Good | Good | Not Good | The options are not homogeneous and logical. |
| 36 | Good | Good | Not Good | Good | |
| 37 | Not Good | Not Good | Good | Not Good | The options are not homogeneous and logical. |
| 38 | Good | Not Good | Good | Good | |
| 39 | Good | Not Good | Good | Good | |
| 40 | Good | Not Good | Not Good | Not Good | The question is not clear. |
| 41 | Good | Good | Good | Good | |
| 42 | Good | Good | Not Good | Good | |
| 43 | Good | Good | Good | Good | |
| 44 | Good | Good | Not Good | Good | |
| 45 | Good | Not Good | Good | Good | |
| 46 | Good | Good | Not Good | Good | |
| 47 | Good | Good | Not Good | Good | |
| 48 | Good | Good | Not Good | Good | |
| 49 | Good | Good | Not Good | Good | |
| 50 | Good | Not Good | Not Good | Not Good | The options are not homogeneous and logical. |

b. The quantitative analysis with Iteman 3.00 shows that 13 of the 50 questions (26%) are considered not good in terms of the level of difficulty, the item discrimination, or the effectiveness of the distractors. Those are questions number 5, 10, 12, 16, 22, 25, 28, 30, 36, 38, 39, 40, and 47. The reliability of the test can be seen in the alpha coefficient, which is 0.787.
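The classical indices reported by Iteman can be reproduced from a scored response matrix. The sketch below is a minimal illustration, not the Iteman implementation; the toy response matrix and function names are invented for the example. It computes item difficulty as the proportion of correct answers, item discrimination as the corrected point-biserial correlation between an item and the total score on the remaining items, and the reliability coefficient as Cronbach's alpha.

```python
from statistics import mean, pstdev

def item_difficulty(scores, item):
    """Proportion of examinees answering the item correctly (the p-value)."""
    return mean(row[item] for row in scores)

def point_biserial(scores, item):
    """Corrected point-biserial: item score vs. total score on the other items."""
    item_col = [row[item] for row in scores]
    rest = [sum(row) - row[item] for row in scores]
    mi, mr = mean(item_col), mean(rest)
    si, sr = pstdev(item_col), pstdev(rest)
    n = len(scores)
    cov = sum((x - mi) * (y - mr) for x, y in zip(item_col, rest)) / n
    return cov / (si * sr)

def cronbach_alpha(scores):
    """Cronbach's alpha from item variances and total-score variance."""
    k = len(scores[0])
    item_vars = [pstdev([row[i] for row in scores]) ** 2 for i in range(k)]
    total_var = pstdev([sum(row) for row in scores]) ** 2
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy 0/1 scored matrix: 6 examinees x 4 items (invented data for illustration).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(item_difficulty(scores, 0))               # 4 of 6 examinees correct
print(round(point_biserial(scores, 0), 3))      # 0.433
print(round(cronbach_alpha(scores), 3))         # 0.424
```

On real data the matrix would have one row per answer sheet; an item with difficulty outside roughly 0.25 to 0.75 or a point-biserial below about 0.2 would be flagged, matching the FIT / NOT FIT decisions in Table 2.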
The result of Iteman 3.00 can be seen in the following table (columns a-d give the proportion of students choosing each option):

Table 2

| Item | Difficulty | Discrimination | a | b | c | d | Decision |
|------|------------|----------------|------|------|------|------|----------|
| 1 | 0.439 | 0.417 | 0.22 | 0.16 | 0.44 | 0.18 | FIT |
| 2 | 0.472 | 0.431 | 0.07 | 0.18 | 0.27 | 0.47 | FIT |
| 3 | 0.482 | 0.164 | 0.17 | 0.15 | 0.48 | 0.20 | FIT |
| 4 | 0.661 | 0.494 | 0.08 | 0.15 | 0.11 | 0.66 | FIT |
| 5 | 0.157 | -0.044 | 0.36 | 0.37 | 0.16 | 0.11 | NOT FIT: difficulty and discrimination do not satisfy the requirements |
| 6 | 0.717 | 0.266 | 0.15 | 0.09 | 0.05 | 0.72 | FIT |
| 7 | 0.433 | 0.448 | 0.19 | 0.14 | 0.43 | 0.24 | FIT |
| 8 | 0.468 | 0.541 | 0.13 | 0.47 | 0.25 | 0.15 | FIT |
| 9 | 0.34 | 0.271 | 0.34 | 0.18 | 0.13 | 0.35 | FIT |
| 10 | 0.251 | 0.14 | 0.19 | 0.37 | 0.19 | 0.25 | NOT FIT: discrimination does not satisfy the requirements |
| 11 | 0.675 | 0.357 | 0.13 | 0.05 | 0.68 | 0.14 | FIT |
| 12 | 0.301 | 0.11 | 0.18 | 0.30 | 0.27 | 0.25 | NOT FIT: discrimination does not satisfy the requirements |
| 13 | 0.352 | 0.411 | 0.30 | 0.23 | 0.12 | 0.35 | FIT |
| 14 | 0.444 | 0.394 | 0.12 | 0.11 | 0.44 | 0.32 | FIT |
| 15 | 0.415 | 0.297 | 0.24 | 0.18 | 0.42 | 0.16 | FIT |
| 16 | 0.216 | 0.111 | 0.17 | 0.22 | 0.39 | 0.22 | NOT FIT: difficulty and discrimination do not satisfy the requirements |
| 17 | 0.329 | 0.38 | 0.33 | 0.24 | 0.19 | 0.24 | FIT |
| 18 | 0.292 | 0.339 | 0.17 | 0.28 | 0.26 | 0.29 | FIT |
| 19 | 0.463 | 0.367 | 0.33 | 0.46 | 0.15 | 0.06 | FIT |
| 20 | 0.365 | 0.46 | 0.13 | 0.37 | 0.29 | 0.21 | FIT |
| 21 | 0.752 | 0.532 | 0.75 | 0.16 | 0.03 | 0.05 | FIT |
| 22 | 0.221 | 0.011 | 0.10 | 0.34 | 0.34 | 0.22 | NOT FIT: difficulty and discrimination do not satisfy the requirements |
| 23 | 0.37 | 0.224 | 0.18 | 0.26 | 0.19 | 0.37 | FIT |
| 24 | 0.521 | 0.303 | 0.10 | 0.16 | 0.52 | 0.22 | FIT |
| 25 | 0.242 | 0.515 | 0.24 | 0.23 | 0.21 | 0.31 | NOT FIT: difficulty does not satisfy the requirements |
| 26 | 0.409 | 0.366 | 0.24 | 0.41 | 0.15 | 0.20 | FIT |
| 27 | 0.475 | 0.325 | 0.05 | 0.22 | 0.26 | 0.48 | FIT |
| 28 | 0.296 | 0.141 | 0.30 | 0.36 | 0.14 | 0.20 | NOT FIT: discrimination does not satisfy the requirements |
| 29 | 0.394 | 0.323 | 0.21 | 0.39 | 0.20 | 0.20 | FIT |
| 30 | 0.243 | 0.221 | 0.25 | 0.39 | 0.12 | 0.24 | NOT FIT: difficulty does not satisfy the requirements |
| 31 | 0.628 | 0.497 | 0.63 | 0.13 | 0.14 | 0.11 | FIT |
| 32 | 0.402 | 0.328 | 0.16 | 0.40 | 0.16 | 0.28 | FIT |
| 33 | 0.627 | 0.552 | 0.63 | 0.26 | 0.07 | 0.04 | FIT |
| 34 | 0.575 | 0.371 | 0.58 | 0.24 | 0.04 | 0.15 | FIT |
| 35 | 0.349 | 0.224 | 0.21 | 0.26 | 0.35 | 0.18 | FIT |
| 36 | 0.148 | -0.005 | 0.52 | 0.14 | 0.15 | 0.20 | NOT FIT: difficulty and discrimination do not satisfy the requirements |
| 37 | 0.373 | 0.216 | 0.37 | 0.28 | 0.14 | 0.20 | FIT |
| 38 | 0.169 | -0.067 | 0.37 | 0.33 | 0.17 | 0.13 | NOT FIT: difficulty and discrimination do not satisfy the requirements |
| 39 | 0.084 | -0.053 | 0.08 | 0.51 | 0.24 | 0.17 | NOT FIT: difficulty and discrimination do not satisfy the requirements |
| 40 | 0.275 | 0.048 | 0.26 | 0.25 | 0.22 | 0.28 | NOT FIT: discrimination does not satisfy the requirements |
| 41 | 0.524 | 0.295 | 0.25 | 0.52 | 0.13 | 0.09 | FIT |
| 42 | 0.396 | 0.365 | 0.18 | 0.17 | 0.25 | 0.40 | FIT |
| 43 | 0.34 | 0.366 | 0.19 | 0.11 | 0.36 | 0.34 | FIT |
| 44 | 0.266 | 0.169 | 0.10 | 0.27 | 0.07 | 0.56 | FIT |
| 45 | 0.447 | 0.264 | 0.29 | 0.13 | 0.13 | 0.45 | FIT |
| 46 | 0.53 | 0.266 | 0.21 | 0.53 | 0.18 | 0.08 | FIT |
| 47 | 0.21 | -0.017 | 0.52 | 0.21 | 0.17 | 0.10 | NOT FIT: difficulty and discrimination do not satisfy the requirements |
| 48 | 0.626 | 0.352 | 0.08 | 0.12 | 0.63 | 0.17 | FIT |
| 49 | 0.667 | 0.498 | 0.07 | 0.13 | 0.14 | 0.67 | FIT |
| 50 | 0.456 | 0.338 | 0.16 | 0.46 | 0.17 | 0.21 | FIT |

c. The quantitative analysis with Bilog 3.00 shows that 23 of the 50 questions (46%) are considered not good in terms of the item discrimination, the level of difficulty, or the pseudo-guessing parameter. Those are questions number 3, 8, 17, 21, 25, 26, 27, 28, 33, 34, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, and 50. This result is considered more informative since, unlike the Iteman 3.00 analysis, it does not depend on the ability of the particular sample of students. The findings, especially the result of the Bilog 3.00 analysis, show that only 54% of the teacher-made test items in Gunungkidul regency are considered good in terms of the level of difficulty, the discrimination index, and the pseudo-guessing parameter.
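The three-parameter logistic (3PL) model underlying the Bilog analysis gives the probability of a correct response as a function of examinee ability theta and the three item parameters reported by the program: slope a (discrimination), threshold b (difficulty), and lower asymptote c (pseudo-guessing). The sketch below uses illustrative parameter values, not values from the analysis, to show why a high pseudo-guessing parameter is problematic.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic (3PL) model: probability that an examinee
    with ability theta answers correctly, given slope a (discrimination),
    threshold b (difficulty), and lower asymptote c (pseudo-guessing).
    D = 1.7 is the usual scaling constant."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# The lower asymptote keeps the curve above c even for very low-ability
# examinees, which is why items with c above 0.25 were flagged: weak
# students can pass them by guessing. (Parameter values are illustrative.)
weak, strong = -3.0, 3.0
print(round(p_correct_3pl(weak, a=0.8, b=0.5, c=0.05), 3))    # 0.058
print(round(p_correct_3pl(weak, a=0.8, b=0.5, c=0.40), 3))    # 0.405
print(round(p_correct_3pl(strong, a=0.8, b=0.5, c=0.05), 3))  # 0.969
```

With c = 0.40, even an examinee three standard deviations below the mean answers correctly about 40% of the time, so the item tells us little about ability at the low end of the scale.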
This result implies that teachers of English at SMP level in Gunungkidul regency still need to improve their test-writing skills, since the minimum requirement for a good test set is that 70% of its items be categorized as good. The result of Bilog 3.00 can be seen in the following table:

Table 3

| Item | Slope (item discrimination) | Threshold (item difficulty) | Asymptote (pseudo-guessing) | Decision |
|------|------|------|------|----------|
| 1 | 0.835 | 0.584 | 0.183 | Fit |
| 2 | 0.593 | 2.564 | 0.398 | Fit |
| 3 | 0.319 | ******** | 0.238 | Not Fit: item difficulty overestimated |
| 4 | 0.414 | -0.686 | 0.267 | Fit |
| 5 | 0.985 | 0.608 | 0.164 | Fit |
| 6 | 0.656 | 1.873 | 0.231 | Fit |
| 7 | 0.413 | 5.422 | 0.25 | Fit |
| 8 | 1.605 | 0.543 | 0.499 | Not Fit: pseudo-guessing above 0.25 |
| 9 | 1.05 | 1.208 | 0.179 | Fit |
| 10 | 0.55 | 1.03 | 0.178 | Fit |
| 11 | 0.568 | 1.379 | 0.216 | Fit |
| 12 | 1.474 | 2.24 | 0.182 | Fit |
| 13 | 0.948 | 1.231 | 0.184 | Fit |
| 14 | 0.601 | 1.588 | 0.101 | Fit |
| 15 | 0.59 | 0.777 | 0.163 | Fit |
| 16 | 1.151 | -0.76 | 0.113 | Fit |
| 17 | 0.333 | ******** | 0.238 | Not Fit: item difficulty overestimated |
| 18 | 0.425 | 0.41 | 0.164 | Fit |
| 19 | 1.8 | 1.148 | 0.073 | Fit |
| 20 | 0.91 | 1.228 | 0.239 | Fit |
| 21 | 0.714 | 1.09 | 0.279 | Not Fit: pseudo-guessing above 0.25 |
| 22 | 0.975 | 2.512 | 0.252 | Fit |
| 23 | 0.553 | 1.106 | 0.141 | Fit |
| 24 | 0.801 | 2.326 | 0.176 | Fit |
| 25 | 0.38 | ******** | 0.237 | Not Fit: item difficulty overestimated |
| 26 | 0.365 | ******** | 0.238 | Not Fit: item difficulty overestimated |
| 27 | 0.349 | ******** | 0.209 | Not Fit: item difficulty overestimated |
| 28 | 0.703 | 1.028 | 0.337 | Not Fit: pseudo-guessing above 0.25 |
| 29 | 0.478 | 1.028 | 0.083 | Fit |
| 30 | 0.55 | 1.167 | 0.089 | Fit |
| 31 | 0.401 | 3.45 | 0.204 | Fit |
| 32 | 0.372 | 1.684 | 0.222 | Fit |
| 33 | 0.431 | 1.096 | 0.273 | Not Fit: pseudo-guessing above 0.25 |
| 34 | 0.332 | ******** | 0.238 | Not Fit: item difficulty overestimated |
| 35 | 0.493 | -0.227 | 0.178 | Fit |
| 36 | 1.22 | -0.378 | 0.103 | Fit |
| 37 | 0.169 | -6.54 | 0.246 | Fit |
| 38 | 0.19 | 2.371 | 0.361 | Not Fit: pseudo-guessing above 0.25 |
| 39 | 0.528 | 2.469 | 0.369 | Not Fit: pseudo-guessing above 0.25 |
| 40 | 0.483 | 4.462 | 0.465 | Not Fit: pseudo-guessing above 0.25 |
| 41 | 0.114 | -1.396 | 0.308 | Not Fit: pseudo-guessing above 0.25 |
| 42 | 0.175 | 0.138 | 0.313 | Not Fit: pseudo-guessing above 0.25 |
| 43 | 0.494 | 3.065 | 0.453 | Not Fit: pseudo-guessing above 0.25 |
| 44 | 0.443 | 5.383 | 0.291 | Not Fit: pseudo-guessing above 0.25 |
| 45 | 0.331 | 2.92 | 0.329 | Not Fit: pseudo-guessing above 0.25 |
| 46 | 0.373 | 3.418 | 0.375 | Not Fit: pseudo-guessing above 0.25 |
| 47 | 0.173 | -1.993 | 0.268 | Not Fit: pseudo-guessing above 0.25 |
| 48 | 0.414 | 3.208 | 0.383 | Not Fit: pseudo-guessing above 0.25 |
| 49 | 0.177 | -0.273 | 0.297 | Not Fit: pseudo-guessing above 0.25 |
| 50 | 0.351 | ******** | 0.238 | Not Fit: item difficulty overestimated |

DISCUSSION
The expert judgment findings show that 36 questions (72%) used in the 2009 final semester exam for SMP students in Gunungkidul fulfil the requirements of good items in terms of materials, construct, and language. In other words, 28% of the questions still did not meet the characteristics of good test items. There are several reasons why those items are categorized as not good. Question 33 is an example of an item which does not meet the guidelines issued by The Ministry of Education of Indonesia in terms of the homogeneity and the logic of the options. This violates Heaton's (1991:30) argument that a good item should be clear and logical in both its stem and its options.

33. a(1) – girl(2) – in the forest(3) – Sangkuriang(4) – beautiful(5) – met(6)
The best arrangement of the words to make a good sentence is ….
a. 4-6-1-5-2-3        c. 2-5-4-1-3-6
b. 4-3-1-5-2-6        d. 2-4-1-6-3-5

It can be seen that options c and d are clearly illogical, even to early learners of English. If such questions are used to test students' achievement, they may bias the results, since two of the options are not good. The Iteman analysis also shows that options c and d were chosen by only 7% and 4% of the students, a very small number. Even though this item is considered good in classical test theory, in terms of the effectiveness of the distractors the result shows that very few students chose those options.
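A simple way to operationalize distractor effectiveness is to flag any non-key option chosen by fewer than some small fraction of examinees. The 5% cut-off in the sketch below is a conventional rule of thumb, not a figure taken from the paper; the function name is invented for the example, and the proportions come from item 33 in Table 2.

```python
def ineffective_distractors(choice_props, key, min_prop=0.05):
    """Return the non-key options chosen by fewer than min_prop of the
    examinees. choice_props maps each option to the proportion choosing it.
    The 5% default is a common rule of thumb, not a fixed standard."""
    return [opt for opt, p in choice_props.items()
            if opt != key and p < min_prop]

# Item 33 from the Iteman table: a = 0.63 (key), b = 0.26, c = 0.07, d = 0.04.
item_33 = {"a": 0.63, "b": 0.26, "c": 0.07, "d": 0.04}
print(ineffective_distractors(item_33, key="a"))  # ['d']
```

Under the 5% rule only option d is flagged; raising the cut-off to 10% would flag option c as well, which matches the observation above that both options attracted very few students.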
This suggests that teachers sometimes arrange the jumbled words without considering possible arrangements other than the answer key. The analysis with classical test theory shows that 26% of the questions do not meet the characteristics of good test items: they are not good in terms of the level of difficulty, the discrimination index, or the effectiveness of the distractors. It can be seen that 7 questions (14%) are categorized