Running head: Construction of a Verbal Reasoning Instrument

The Construction and Evaluation of a Verbal Reasoning Instrument

Ingrid Campbell

The University of the
Abstract

Intelligence has mystified psychologists over the years, and as such, instruments to measure the concept as they see it have been extremely difficult to construct. Nevertheless, this experiment was conducted in an attempt to construct an adequate verbal reasoning instrument. The instrument consisted of 30 items chosen from a pool of over 300 items produced by 170 university-level psychology students. It was then administered to those same students in one-hour sessions over the course of a week. The instrument proved to be of acceptable difficulty but extremely poor at distinguishing between groups in the study. Furthermore, the test's reliability was not as high as desired, but upon the removal of several problem items it increased considerably. These findings suggest that while the instrument has some obstacles to overcome, a combination of improvement and further testing will make it an acceptable verbal reasoning instrument.
Results

To facilitate statistical analysis, participants' test scores were arranged in ascending numerical order, after which they were divided into three parts: a lower third, a middle third, and an upper third. The lower and upper thirds, on which this analysis concentrates, yielded the results in Table 1.
Table 1.
Lower and Upper Third Scores for the Items on the Test, Along with the Corresponding Levels of Difficulty and Discrimination
     | Q1     | Q2     | Q3     | Q4     | Q5     | Q6
L    | 17     | 31     | 12     | 13     | 28     | 57
U    | 39     | 46     | 24     | 24     | 43     | 56
p    | 0.4912 | 0.6754 | 0.3158 | 0.3246 | 0.6228 | 0.9912
d    | 0.3860 | 0.2632 | 0.2130 | 0.1930 | 0.2632 | -0.0175

     | Q7     | Q8     | Q9     | Q10    | Q11    | Q12
L    | 15     | 20     | 38     | 36     | 10     | 7
U    | 33     | 35     | 57     | 57     | 27     | 13
p    | 0.4211 | 0.4825 | 0.8333 | 0.8158 | 0.3246 | 0.1754
d    | 0.3158 | 0.2632 | 0.3333 | 0.3684 | 0.2982 | 0.1053

     | Q13    | Q14    | Q15    | Q16    | Q17    | Q18
L    | 50     | 46     | 28     | 3      | 26     | 41
U    | 56     | 55     | 30     | 19     | 26     | 55
p    | 0.9298 | 0.8860 | 0.5088 | 0.1930 | 0.4561 | 0.8421
d    | 0.1053 | 0.1579 | 0.0351 | 0.2807 | 0.0000 | 0.2456

     | Q19    | Q20    | Q21    | Q22    | Q23    | Q24
L    | 29     | 13     | 5      | 10     | 42     | 8
U    | 52     | 51     | 31     | 35     | 55     | 39
p    | 0.7105 | 0.5614 | 0.3158 | 0.3947 | 0.8509 | 0.4123
d    | 0.4035 | 0.6667 | 0.4561 | 0.4386 | 0.2280 | 0.5439

     | Q25    | Q26    | Q27    | Q28    | Q29    | Q30
L    | 8      | 42     | 43     | 4      | 42     | 50
U    | 45     | 57     | 57     | 34     | 57     | 57
p    | 0.4649 | 0.8684 | 0.8772 | 0.3333 | 0.8684 | 0.9386
d    | 0.6491 | 0.2632 | 0.2456 | 0.5263 | 0.2632 | 0.1228
Note: "L", "U", "p", and "d" signify lower third, upper third, difficulty, and discrimination, respectively.
Difficulty on each item was calculated by summing the scores of the two thirds and dividing by their combined population. The test had a mean difficulty of 0.60, within the accepted range of 0.3 to 0.7. While one item, Q16, was extremely difficult, exhibiting a p score of 0.19, the majority of the items fell well within the acceptable range, with only a few exceptions (Questions 6, 9, 10, 13, 14, 18, 23, 29, and 30) that had p scores well over 0.8, indicating low levels of difficulty. Although the test's mean p score was toward the upper end of the acceptable range, its level of difficulty was still satisfactory.
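The difficulty computation described above can be sketched in a few lines. This is a minimal illustration, not part of the original analysis; the correct-answer counts for Q1 and Q16 are taken from Table 1, with 57 persons per third.

```python
# Item difficulty p: proportion of the combined lower and upper thirds
# answering an item correctly. Counts for Q1 and Q16 come from Table 1.
def difficulty(lower_correct, upper_correct, group_size=57):
    return (lower_correct + upper_correct) / (2 * group_size)

p_q1 = difficulty(17, 39)    # 56 / 114, roughly 0.4912
p_q16 = difficulty(3, 19)    # 22 / 114, roughly 0.1930 (the hardest item)
```

A higher p therefore means an easier item, which is why the items with p over 0.8 are described as having low difficulty.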
The distinction between the lower and upper groups on each item was calculated by dividing the score differential between the two groups by 57 (the population of each group). The test had a mean discrimination of 0.28, falling well below the acceptable range of 0.7 to 1.0. Other than Q20 and Q25, which had d values of 0.67 and 0.65 respectively, the bulk of the items exhibited d values closer to 0.0. One item, Q17, showed a d value of 0.0, as the two thirds had the same number of persons producing the correct answer. Another item, Q6, interestingly displayed a negative d value, as the lower group had one more person answering correctly than the upper group. These figures indicate that the test did not do well in providing a meaningful distinction between the two thirds.
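The discrimination index described above can likewise be sketched directly. Again, this is an illustration only; the counts for Q20 and Q6 are those reported in Table 1.

```python
# Item discrimination d: upper-third correct count minus lower-third
# correct count, divided by the group size of 57. Counts from Table 1.
def discrimination(lower_correct, upper_correct, group_size=57):
    return (upper_correct - lower_correct) / group_size

d_q20 = discrimination(13, 51)   # 38 / 57, roughly 0.67 (best discriminator)
d_q6 = discrimination(57, 56)    # -1 / 57, the one negative d value
```

A d near 0.0 means the two thirds performed almost identically on the item, and a negative d means the lower third actually outperformed the upper third.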
Upon running a Cronbach's alpha test, it was revealed that the instrument had an α score of 0.6977. Good reliability scores fall within an α range of 0.8 to 1.0, which indicates that the items on the instrument were not all measuring the same thing, hence its unreliability. As such, we set about removing the items that were making the instrument unreliable in an attempt to get its reliability as close to 0.8 as possible. Six items were removed, and the results of this exercise are presented in Table 2.
Table 2.
Deleted Items Which Improve Instrument Reliability

# of Items | α      | α if item deleted | Item to be deleted
30         | 0.6977 | 0.7144            | Q17
29         | 0.7144 | 0.7280            | Q15
28         | 0.7280 | 0.7350            | Q4
27         | 0.7350 | 0.7418            | Q3
26         | 0.7418 | 0.7479            | Q12
25         | 0.7479 | 0.7528            | Q2
24         | 0.7528 |                   |
Question 17 was the single item that made the test most unreliable, and upon its removal the instrument's reliability jumped from an α of 0.6977 to 0.7144. Next, item Q15 was removed, and reliability further increased to an α of 0.7280. Subsequently, items Q4, Q3, Q12, and Q2 were removed, upon which the instrument's reliability increased to an α of 0.7528. Although 0.80 was the target α score, 0.7528 is an acceptable level of reliability, and as such there was no need to continue removing items. At α = 0.7528 it would seem that the remaining 24 items were more or less measuring the same thing.
Discussion
Although the test's mean difficulty places it within the acceptable difficulty range, the nine items with p scores over 0.80, and the lack of items with p values below 0.3 to balance out this extreme, combined to skew the test's difficulty toward the easier end of the acceptable range. With this in mind, the test's overall difficulty was probably easier than it should have been. There are several possible reasons why this is so. One could be the manner in which the items on the test were constructed: using novice students who do not have a full appreciation of what it takes to construct a solid item was not necessarily the best way of producing the instrument. Secondly, since the students themselves produced the items, it seems reasonable to assume that they were very familiar with some of the items, increasing the likelihood of getting them correct. A third possibility is that the test was indeed difficult, but because it was administered only to presumably intelligent university students, the results gave a false sense of ease.
The mean discrimination of the test fell well below the acceptable range, suggesting that the test was extremely poor at differentiating between the lower and upper thirds of the tested population. Low discrimination could be due to any or all of the following factors. First, there is the matter of the tested population: all the participants were university students and as such were relatively equal in their reasoning capabilities. Second, there is the manner in which the test was administered: participants were in close proximity to each other while completing the test, increasing the possibility of cheating and/or group discussion of the items. Consequently, if everyone had similar reasoning ability and similar answers due to consultation, then technically there was nothing for the test to differentiate between.
The test's internal consistency, as indicated by the alpha score, was somewhat low, and while the removal of some items increased the instrument's reliability to an acceptable level, it still did not attain the desired level. This indicates that the items on the test were not sufficiently cohesive to warrant full confidence in the instrument's ability to measure reasoning capabilities.
Some of the instrument's problems can be linked to the test's methodology. A verbal reasoning test is a time-consuming and extremely complex instrument to create, and giving inexperienced students an hour to create a valid, reliable instrument is simply unacceptable. Ideal testing conditions include isolating the participants; in this experiment, group testing gave the participants the opportunity to confer, which may have negatively affected the instrument's results. A single test administered to a homogeneous population does not foster effective comparison; this may be evidenced in the discrimination scores.
The instrument is by no means absolutely useless, and to rule out other non-instrument-related problems, I recommend that the test, as it is, be re-administered to another group of university students, each in isolation. If this reveals similar results, then we can set about correcting the instrument. With regard to discrimination, it is advised that the test also be administered to non-university students in order to see if the instrument is able to make a distinction between the two groups. This would give insight as to whether the test itself was poor or whether the weak discrimination results were due to the flawed population. The test is reasoning based, which indicates that its constructors subscribe to Spearman's theory of general intelligence and, by extension, fluid intelligence. The standardized Raven's Matrices, which are based on Spearman's theory (Mackintosh, 1998), could be administered to the same population and the results compared and correlated to give a more accurate picture of the instrument's reliability. In the original test, six items proved troublesome, and I am of the opinion that even in retesting these will continue to be so. Therefore, I suggest that these items be modified where possible and, where not, replaced with better-suited ones. In conclusion, the instrument and testing procedures produced enough problems to be discouraging; nevertheless, the instrument must not be discarded, as it can be improved, and measures to do so should be embarked upon.
References

Mackintosh, N. J. (1998). IQ and Human Intelligence. Oxford University Press.