A Modified OSCE Assessing the Assimilation and Application of Ethical Principles Relevant to Obstetric and Gynaecological Practice
Dr H van Woerden, Dr F Agbo*, Mr NN Amso†, Mr I Stokes‡
A modified form of the Objective Structured Clinical Examination (OSCE)1,2 has become a standard part of Part II of the Obstetrics and Gynaecology membership examination in Wales. It is a useful method of examining knowledge and practice across a set of clinical, administrative and ethical areas of competence. The modified OSCE described here used an examiner, rather than an actor, to simulate the patient.
The modified OSCE developed for this study was designed to assess the assimilation and application of the principles outlined in the Royal College of Obstetricians and Gynaecologists' (RCOG) guidelines.3 These cover the following areas: general attitude to women, consent to treatment and examination by medical students, clinical training, use of tissue, research, innovative procedures and professional disagreement.
Four questions were developed to translate the principles of the RCOG guidelines on ethical practice into a modified OSCE. Two questions applied the principles in a clinical context, one in the area of clinical governance and one in an area of non-clinical practice (Appendix 1). The questions were developed by one of the authors (NNA). The modified OSCE was used with two groups of candidates (10 and 16 respectively) being interviewed for entry to a Specialist Registrar training scheme in obstetrics and gynaecology in South Wales in 2000.
All candidates were Senior House Officers in Obstetrics and Gynaecology seeking Registrar posts, and therefore had similar levels of experience. Candidates were given 15 minutes to read the RCOG Ethical Guidelines3 and were then interviewed by two doctors (NNA and IS) for 15 minutes. All 26 candidates were seen by the same two examiners. Prior to the start of the examination, the two examiners discussed the interpretation of specific items on the marking schedule. However, each examiner was blind to the marking of the other during the interview process.
The marking schedule was assessed retrospectively using a set of criteria proposed for the assessment of postgraduate medical examinations.4
Each item on the marking schedule was scored 0 if no attempt was made to address the relevant concept, 1 if a partial answer was provided and 2 if the item was fully addressed. Marks were entered in a Microsoft Excel® spreadsheet.
The mean and standard deviation of the scores given to candidates by each examiner were calculated and plotted in two histograms using the Statistical Package for the Social Sciences (SPSS®).
The level of agreement or disagreement between the two examiners was measured by calculating the average difference between the examiners in their assessment of whether candidates had addressed the issue sought in each item. If both examiners gave the same score (2, 1 or 0), the "difference between examiners" was 0 (i.e. no difference). If the two scores differed by one point (for example, one examiner scored a candidate as 2 and the other as 1), the "difference between examiners" was 1 (i.e. moderate). If one examiner scored a candidate as 2 and the other as 0, the "difference between examiners" was 2 (i.e. large). The average of the absolute differences between examiners for each candidate was then calculated.
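For illustration, a minimal Python sketch of this agreement measure, using hypothetical item scores rather than data from the study:

```python
# Illustrative sketch: average absolute difference between two examiners'
# item scores (0, 1 or 2) for one candidate. All scores are hypothetical.
examiner1 = [2, 1, 0, 2, 1]  # hypothetical item scores from examiner 1
examiner2 = [2, 0, 0, 1, 1]  # hypothetical item scores from examiner 2

# Per-item absolute difference: 0 = agreement, 1 = moderate, 2 = large.
differences = [abs(a - b) for a, b in zip(examiner1, examiner2)]
mean_difference = sum(differences) / len(differences)
print(mean_difference)  # 0.4 for the hypothetical data above
```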
Inter-rater reliability was also assessed by calculating weighted and unweighted Kappa statistics for each of the four questions using PEPI 3.0 software.
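PEPI is not required to reproduce this statistic; a minimal sketch of an equivalent calculation using scikit-learn's cohen_kappa_score, again with hypothetical ratings, would be:

```python
# Sketch of unweighted and linearly weighted Cohen's kappa for one question,
# computed with scikit-learn rather than PEPI; the ratings are hypothetical.
from sklearn.metrics import cohen_kappa_score

examiner1 = [2, 1, 0, 2, 1, 2, 0, 1, 2, 2]  # hypothetical ratings, examiner 1
examiner2 = [2, 1, 1, 2, 1, 1, 0, 1, 2, 2]  # hypothetical ratings, examiner 2

unweighted = cohen_kappa_score(examiner1, examiner2)
weighted = cohen_kappa_score(examiner1, examiner2, weights="linear")
print(unweighted, weighted)
```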
To assess the contribution of each of the four questions to the total score, correlation values were calculated in Excel® by correlating the average score of the two examiners on each item in the marking schedule with the total score of each candidate.
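The same item-total correlation can be sketched outside Excel; a minimal Python version (3.10+), using hypothetical averaged item scores and candidate totals:

```python
# Sketch of the item-total correlation: the two examiners' averaged score on
# one item is correlated with each candidate's total score. Data hypothetical.
from statistics import correlation  # Pearson's r; requires Python 3.10+

item_avg = [1.5, 2.0, 0.5, 1.0, 2.0, 1.5]  # mean of the two examiners' marks
totals = [28, 34, 15, 22, 36, 30]           # candidates' total scores

print(correlation(item_avg, totals))  # equivalent to Excel's CORREL function
```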
The ability of each question to discriminate among the candidates was assessed by examining the proportion of candidates who either scored full marks on each score sheet item or did not score on that item at all.
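As a sketch of this discrimination check, the proportions of candidates at the two extremes of each item can be tabulated as follows (item labels and marks are hypothetical):

```python
# Sketch of the discrimination check: for each marking-schedule item, the
# proportion of candidates scoring full marks (2) and the proportion scoring
# nothing (0). Item identifiers and marks below are hypothetical.
scores = {
    "2.2": [2, 2, 2, 2, 0],
    "3.6": [0, 0, 1, 0, 2],
}
for item, marks in scores.items():
    n = len(marks)
    full = sum(m == 2 for m in marks) / n
    zero = sum(m == 0 for m in marks) / n
    print(item, f"full marks: {full:.0%}", f"no marks: {zero:.0%}")
```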
The marking schedule completed by one of the examiners for one candidate was missing; as a result, some of the analyses include only 25 candidates. The marking schedule that was available for this candidate did not suggest that the candidate's marks were in any way atypical.
The mean, standard deviation and a histogram of the scores given to the candidates by each examiner are shown in Figure 1. The level of agreement between examiners for each part of the four questions is shown in Table 1.
Weighted and unweighted Kappa values and the correlation of each item with the total score are shown in Table 2. Table 3 presents the frequency of responses in the three rating categories for each item in the marking schedule; this provides a means of assessing the ability of each item to discriminate among the examinees. McNemar's test for bias gave p=1.00 for all four items, suggesting that there was no consistent bias by either of the examiners.
The results of assessing the marking schedule against the set of criteria proposed for the assessment of postgraduate medical examinations4 are outlined in Table 4.
An OSCE approach has previously been used to assess other aspects of obstetrics and gynaecology7-10 and to test knowledge of ethical aspects of clinical practice.11,12 However, no previously published OSCE could be identified that tested the assimilation and application of ethical principles in the practice of obstetrics and gynaecology. We were also unable to identify any previous use of an OSCE to select candidates for a postgraduate training scheme.
Figure 1 demonstrates the similarity in the characteristics of the two examiners' ratings, apart from the fact that Examiner 2's marks have a wider standard deviation and a different modal value. Examiner 2 also appears to dichotomize candidates' marks, producing a more bimodal distribution. This may simply be a random effect due to the small sample size. However, it is possible that it reflects a different marking style. Examiner 1 gave most candidates an average mark and a small number of candidates high or low marks. Examiner 2 may subconsciously have classified most candidates as falling into two groups: stronger candidates who would pass and weaker candidates who would fail. This hypothesis would require confirmation in a larger sample of examiners. A wider sample of examiners would also address other issues, including the effect of the gender of the examiner and any bias that may have been introduced by using one of the authors as an examiner.
The average absolute difference between the raters for each item was at most 0.68, and for most items substantially less. This suggests good agreement between the examiners in their marking of candidates. However, it should be noted that two sets of independently, randomly assigned scores of 0, 1 and 2 have a 'chance level of disagreement' of about 0.9 (a brief derivation is given below). Consequently, the result we obtained for the level of agreement between the examiners should be interpreted with caution. Inter-rater reliability was also measured using the Kappa statistic. Based on the weighted Kappa values, inter-rater reliability was good for questions 1, 2 and 3 and moderate for question 4, suggesting that the procedures used in the examination allow for reasonable inter-rater reliability.
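The chance figure follows directly if one assumes each examiner assigns 0, 1 or 2 independently and uniformly at random: of the nine equally likely score pairs, three agree, four differ by one point and two differ by two points, so the expected absolute difference is

$$E|X - Y| = 0 \cdot \tfrac{3}{9} + 1 \cdot \tfrac{4}{9} + 2 \cdot \tfrac{2}{9} = \tfrac{8}{9} \approx 0.89.$$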
Six items displayed relatively low discrimination while three had a high level of discrimination. Items 3.6, 4.3 and 4.4 appear difficult; however, there were no questions which all of the examinees missed.
The scale used in the marking schedule is categorical and may not reflect a true interval scale. The validity of adding and averaging such scores is therefore potentially suspect, and other methods of quantifying responses could be considered if this instrument were used in other settings.
The correlation between question scores and the total score (Table 2) appears to be in part related to the number of concepts included under each question, with questions encompassing more areas exhibiting a higher correlation with the total score. An alternative approach to scoring the exam, which addresses this issue, would involve weighting each question by the number of concepts it addresses.
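One way such a weighting might work, sketched in Python with hypothetical raw scores and concept counts (neither taken from the study):

```python
# Sketch of the alternative weighting: each question's raw score is divided
# by the number of concepts (marking-schedule items) it contains, so longer
# questions do not dominate the total. All figures below are hypothetical.
question_scores = {"Q1": 9, "Q2": 7, "Q3": 10, "Q4": 5}  # hypothetical raw scores
concept_counts = {"Q1": 6, "Q2": 6, "Q3": 7, "Q4": 4}    # hypothetical item counts

weighted_total = sum(
    question_scores[q] / concept_counts[q] for q in question_scores
)
print(weighted_total)  # each question now contributes at most 2 marks
```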
The high proportion of "not addressed" items in the clinical questions (questions 2 and 3) is of some concern. For example, items 2.2 (exchange pleasantries), 2.5 (enquire about progress of pregnancy), 2.6 (check notes), 3.2 (details of hysterectomy), 3.4 (types of incision), 3.5 (removal/conservation of ovaries in general) and 3.6 (oophorectomy if unexpected disease is found) are all very important aspects of everyday clinical practice, whether in terms of communication with patients or in obtaining preoperative consent. Failure to meet these standards may carry medical and medico-legal implications. Some candidates may have had a problem in this area. However, the candidates may not be entirely to blame: the wording of the questions, the manner in which the questions were delivered and the wording of the marking schedules were all potential influences on the responses, and may have raised the proportion of "not addressed" items. This could be addressed by exploring with candidates whether they felt that aspects of the examination had adversely influenced the marking of their answers.
Table 4 provides a useful schedule against which to assess an examination. The 'purpose', 'aims' and 'stakes' involved in the examination are clear. Aspects of content validity are also addressed. However, five of the categories (consent to treatment and examination by medical students, clinical training, use of tissue and professional disagreement) have been only partially addressed in the marking schedule. Several items may also be seen as evaluating clinical rather than ethical principles. These include: 2.6 'check notes for any relevant past medical/O&G history', 3.4 'types of incision and routes of hysterectomy, type of anaesthesia', 4.3 'does it require GA, local or overnight stay' and 4.4 'can it be done in the outpatient or day surgery unit?'
In real life, doctors increasingly have access to documents, particularly via the internet, and this is reflected in this OSCE. However, the effect of giving candidates the RCOG ethical guidelines beforehand is not clear: it may have influenced the marks achieved, and this merits further investigation. Potential bias due to ethnicity was not assessed in this study, although it has been shown to have a small effect in a recent study.6 A fully acted-out OSCE using actors could have been used instead of a modified OSCE; it has the advantage of mimicking real life more closely, but it is also much more resource intensive. Further work could also be done in the areas of consequential, construct and predictive validity, as there is always a question as to whether answers in examinations carry over into real-life practice.4,5 However, assessment of the modified OSCE against the schedule in Table 4 suggests that it is of a reasonable standard.
This modified OSCE examination demonstrates the feasibility of testing ethical principles in obstetrics and gynaecology among candidates for postgraduate posts, and provides a basis for further work. It meets most of the criteria laid down in a checklist developed to assess postgraduate medical examinations. This modified OSCE can appropriately be used to assess the assimilation and application of a range of ethical principles applicable to obstetric and gynaecological practice. We have also provided tentative evidence that a modified OSCE may be an appropriate method for selecting candidates for postgraduate training schemes. Finally, our results suggest that a number of postgraduate doctors may have deficits in this important area of competency.
Dr H van Woerden
Email address: email@example.com
van Woerden H, Agbo F, Amso NN, Stokes I. A modified OSCE assessing the assimilation and application of ethical principles relevant to obstetric and gynaecological practice. Med Educ Online [serial online] 2003;8:8.