Running head: Construction of a Verbal Reasoning Instrument


The Construction and Evaluation of a Verbal Reasoning Instrument

Ingrid Campbell

The University of the West Indies, Mona


Abstract

 

Intelligence has mystified psychologists over the years, and as such, instruments to measure the concept as they see it have been extremely difficult to construct. Nevertheless, this experiment was conducted in an attempt to construct an adequate verbal reasoning instrument. The instrument consisted of 30 items chosen from a pool of over 300 items produced by 170 university-level psychology students. It was then administered to these same students in one-hour sessions over the course of a week. The instrument proved to be of acceptable difficulty, though skewed toward the easy end, but extremely poor at distinguishing between groups in the study. Furthermore, the test's reliability was not as high as desired, but upon the removal of several problem items it increased considerably. These findings suggest that, while the instrument has some obstacles to overcome, a combination of improvement and further testing will make it an acceptable verbal reasoning instrument.


Results

To facilitate statistical analysis, participants' test scores were arranged in ascending numerical order, after which they were divided into three parts: a lower third, a middle third, and an upper third. The lower and upper thirds, on which this analysis concentrates, yielded the results in Table 1.

 

Table 1.

Lower- and Upper-Third Scores for the Items on the Test, Along with the Corresponding Levels of Difficulty and Discrimination

         Q1        Q2        Q3        Q4        Q5        Q6
L        17        31        12        13        28        57
U        39        46        24        24        43        56
p        0.4912    0.6754    0.3158    0.3246    0.6228    0.9912
d        0.3860    0.2632    0.2105    0.1930    0.2632   -0.0175

         Q7        Q8        Q9        Q10       Q11       Q12
L        15        20        38        36        10         7
U        33        35        57        57        27        13
p        0.4211    0.4825    0.8333    0.8158    0.3246    0.1754
d        0.3158    0.2632    0.3333    0.3684    0.2982    0.1053

         Q13       Q14       Q15       Q16       Q17       Q18
L        50        46        28         3        26        41
U        56        55        30        19        26        55
p        0.9298    0.8860    0.5088    0.1930    0.4561    0.8421
d        0.1053    0.1579    0.0351    0.2807    0.0000    0.2456

         Q19       Q20       Q21       Q22       Q23       Q24
L        29        13         5        10        42         8
U        52        51        31        35        55        39
p        0.7105    0.5614    0.3158    0.3947    0.8509    0.4123
d        0.4035    0.6667    0.4561    0.4386    0.2281    0.5439

         Q25       Q26       Q27       Q28       Q29       Q30
L         8        42        43         4        42        50
U        45        57        57        34        57        57
p        0.4649    0.8684    0.8772    0.3333    0.8684    0.9386
d        0.6491    0.2632    0.2456    0.5263    0.2632    0.1228

Note. L = lower third; U = upper third; p = difficulty; d = discrimination.


Difficulty (p) on each item was calculated by summing the correct responses of the two thirds and then dividing by their combined population of 114. The test had a mean difficulty of 0.60, within the accepted range of 0.3 to 0.7. Two items, Q12 and Q16, were extremely difficult, exhibiting p scores of 0.18 and 0.19 respectively, while eleven items (Questions 6, 9, 10, 13, 14, 18, 23, 26, 27, 29 and 30) had p scores over 0.8, indicating low levels of difficulty; the remaining items fell well within the acceptable range. Although the test's mean p score lay toward the upper end of the acceptable range, its level of difficulty was still satisfactory.

Discrimination (d) on each item was calculated by dividing the difference between the upper and lower groups' scores by 57 (the population of each group). The test had a mean discrimination of 0.28, falling well below the acceptable range of 0.7 to 1.0. Other than Q20 and Q25, which had d values of 0.67 and 0.65 respectively, the bulk of the items exhibited d values closer to 0.0. One item, Q17, showed a d value of 0.0, as the two thirds had the same number of persons producing the correct answer. Another item, Q6, interestingly displayed a negative d value, as the lower group had one more person answering correctly than the upper group. These figures indicate that the test did not provide a meaningful distinction between the two thirds.
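
To make the two computations just described concrete, the following minimal sketch reproduces the Table 1 values for the first six items from their lower- and upper-third counts. Python is used here purely for illustration; it is not the software used in the actual analysis.

    GROUP_SIZE = 57  # participants per third, as stated above

    # Correct-answer counts for Q1-Q6, copied from Table 1
    lower = [17, 31, 12, 13, 28, 57]
    upper = [39, 46, 24, 24, 43, 56]

    for item, (low, up) in enumerate(zip(lower, upper), start=1):
        p = (low + up) / (2 * GROUP_SIZE)  # difficulty: overall proportion correct
        d = (up - low) / GROUP_SIZE        # discrimination: upper minus lower
        print(f"Q{item}: p = {p:.4f}, d = {d:.4f}")

    # Output matches Table 1, e.g. Q1: p = 0.4912, d = 0.3860,
    # and Q6: p = 0.9912, d = -0.0175 (the negative value discussed above).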

A Cronbach's alpha test revealed that the instrument had an α score of 0.6977. Good reliability scores fall within a range of 0.8 to 1.0, which indicates that the items on the instrument were not all measuring the same thing; hence its unreliability. As such, we set about removing the items that were making the instrument unreliable, in an attempt to bring its reliability as close to 0.8 as possible. Six items were removed, and the results of this exercise are presented in Table 2.

 

Table 2.

Deleted Items Which Improve Instrument Reliability

# of items    α         α if item deleted    Item to be deleted
30            0.6977    0.7144               Q17
29            0.7144    0.7280               Q15
28            0.7280    0.7350               Q4
27            0.7350    0.7418               Q3
26            0.7418    0.7479               Q12
25            0.7479    0.7528               Q2
24            0.7528


Question 17 was the single item that made the test most unreliable, and upon its removal the instrument's reliability jumped from an α of 0.6977 to 0.7144. Next, item Q15 was removed and reliability increased further to an α score of 0.7280. Subsequently, items Q4, Q3, Q12 and Q2 were removed, upon which the instrument's reliability increased to an α of 0.7528. Although 0.80 was the target α score, 0.7528 is an acceptable level of reliability, and as such there was no need to continue removing items. At α = 0.7528 it would seem that the remaining 24 items were more or less measuring the same thing.
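
For clarity, the sketch below shows how Cronbach's α and the "α if item deleted" values in Table 2 could be computed from a matrix of item scores, using the standard formula α = (k / (k - 1)) × (1 - Σ item variances / variance of total scores). The response matrix shown is hypothetical, since no raw responses appear in this report.

    import statistics

    def cronbach_alpha(items):
        """Cronbach's alpha for a list of item-score columns (one list per item)."""
        k = len(items)
        totals = [sum(scores) for scores in zip(*items)]  # each participant's total
        item_variance_sum = sum(statistics.variance(col) for col in items)
        return (k / (k - 1)) * (1 - item_variance_sum / statistics.variance(totals))

    def alpha_if_deleted(items, index):
        """Alpha recomputed with one item removed, as in Table 2."""
        return cronbach_alpha([col for j, col in enumerate(items) if j != index])

    # Hypothetical 0/1 scores for five participants on three items (not real data)
    responses = [
        [1, 0, 1, 1, 0],  # item 1
        [1, 1, 1, 0, 0],  # item 2
        [0, 0, 1, 1, 0],  # item 3
    ]
    print(cronbach_alpha(responses))       # α for the full item set
    print(alpha_if_deleted(responses, 0))  # α with item 1 removed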

Discussion

Although the test's mean difficulty places it within the acceptable difficulty range, the eleven items with p scores over 0.80, combined with the scarcity of items with p values below 0.3 to balance them out, skewed the test's difficulty toward the easier end of the acceptable range. With this in mind, the test was probably easier than it should have been. There are several possible reasons why this is so. One is the manner in which the items on the test were constructed: using novice students, who do not have a full appreciation of what it takes to construct a solid item, was not necessarily the best way of producing the instrument. Second, since the students themselves produced the items, it seems reasonable to assume that they were very familiar with some of them, thus increasing the likelihood of answering correctly. A third possibility is that the test was indeed difficult, but because it was administered only to presumably intelligent university students, the results gave a false sense of ease.

The test's mean discrimination fell well below the acceptable range, suggesting that the test was extremely poor at differentiating between the lower and upper thirds of the tested population. Low discrimination could be due to any or all of the following factors. First, there is the matter of the tested population: all the participants were university students, and as such they were relatively equal in their reasoning capabilities. Second, there is the manner in which the test was administered: participants were in close proximity to each other while completing the test, increasing the possibility of cheating and/or group discussion of the items. Consequently, if everyone had similar reasoning ability and similar answers due to consultation, there was technically nothing for the test to differentiate between.

The test's internal consistency, as indicated by the alpha score, was somewhat low, and while the removal of some items increased the instrument's reliability to an acceptable level, it still did not attain the desired level. This indicates that the items on the test were not cohesive enough to warrant sufficient confidence in the instrument's ability to measure reasoning capabilities.

Some of the test's and instrument's problems can be linked to the methodology. A verbal reasoning test is a time-consuming and extremely complex instrument to create, and giving inexperienced students an hour to create a valid, reliable instrument is simply unacceptable. Ideal testing conditions include isolating the participants; in this experiment, group testing gave the participants the opportunity to confer, which may have negatively affected the instrument's results. A single test administered to a homogeneous population does not foster effective comparison, and this may be evidenced in the discrimination scores.

The instrument is by no means useless, and to rule out other, non-instrument-related problems I recommend that the test, as it is, be re-administered to another group of university students, each in isolation. If this reveals similar results, then we can set about correcting the instrument. With regard to discrimination, it is advised that the test also be administered to non-university students in order to see if the instrument is able to make a distinction between the two groups. This would give insight as to whether the test itself was poor or whether the weak discrimination results were due to a flawed population. The test is reasoning based, which indicates that its constructors subscribe to Spearman's theory of general, and by extension fluid, intelligence. The standardized Raven's Progressive Matrices, which is based on Spearman's theory (Mackintosh, 1998), could be administered to the same population and the results compared and correlated to give a more accurate picture of the instrument's reliability. In the original test six items proved troublesome, and I am of the opinion that even in retesting they will continue to be so. Therefore, I suggest that these items be modified where possible and otherwise replaced with better-suited ones. In conclusion, the instrument and testing procedures produced enough problems to be discouraging; nevertheless, the instrument must not be discarded, as it can be improved, and measures to do so should be embarked upon.
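
Should the suggested comparison with Raven's Progressive Matrices be carried out, a Pearson correlation between the two sets of scores would quantify their agreement. The sketch below illustrates the computation; the paired scores are invented for the example and are not data from this study.

    def pearson_r(xs, ys):
        """Pearson correlation between two paired lists of scores."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
        sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
        return cov / (sd_x * sd_y)

    # Invented paired scores: this instrument vs. Raven's Progressive Matrices
    verbal = [18, 22, 25, 15, 28, 20]
    ravens = [31, 35, 40, 28, 44, 33]
    print(f"r = {pearson_r(verbal, ravens):.3f}")  # a high r would support convergent validity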


References

Mackintosh, N. J. (1998). IQ and human intelligence. New York: Oxford University Press.