The problem of inter-rater bias in manual segmentation of medical images is often overlooked in the context of deep neural networks (DNNs). During training, this bias is absorbed by the network, producing a training-induced bias in the automatic segmentations that can mislead medical diagnoses.

Medical image segmentation (e.g. the extraction of a pathology’s boundaries) can contribute considerably to supporting clinical decision making and devising individualised, patient-specific treatments such as surgical intervention.

Segmentation is considered a challenging task, not necessarily because image quality is only moderate, but because the pathological region is often not well defined. Different radiologists may therefore annotate the same region of interest in different ways.

Automatic image segmentation algorithms are often praised for being repeatable and objective, which would seem to bypass the problem of inter-rater bias. This is, however, only true of model-based approaches, such as applying a threshold function to the image intensities to extract darker or brighter regions.
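As a minimal sketch of what such a model-based approach looks like (a generic illustration, not a method evaluated in this work), thresholding can be written in a few lines; the threshold value and the synthetic test volume below are arbitrary placeholders:

```python
import numpy as np

def threshold_segmentation(image: np.ndarray, threshold: float) -> np.ndarray:
    """Model-based segmentation: mark every voxel brighter than `threshold`.

    Applied to the same image, the same rule always yields the same mask,
    which is why such methods are repeatable and free of inter-rater bias.
    """
    return image > threshold

# Illustrative usage on a synthetic volume; a real pipeline would first
# normalise intensities and post-process the mask (e.g. remove small blobs).
scan = np.random.rand(64, 64, 64)
mask = threshold_segmentation(scan, threshold=0.8)
print(int(mask.sum()), "voxels above threshold")
```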

MRI scans of MS lesions (first row) and CT scans of ICH patients (second row), along with the manual annotations (columns 1, 2) and the automatic segmentation contours provided by the networks (columns 3, 4): (a) segmentations provided by rater #1; (b) segmentations provided by rater #2; (c) automatic segmentations of the neural network trained on the data labelled by rater #1; (d) automatic segmentations of the neural network trained on the data labelled by rater #2.

The emergence of data-driven approaches, such as DNNs and their application to supervised semantic segmentation, has brought the issue of rater disagreement back to the fore. A supervised DNN is trained on input images and their corresponding manual annotations. This process has been shown to be very powerful, outperforming model-based segmentation approaches, as it allows us to generalise from seen to unseen data. Nevertheless, while the images to be segmented vary, the annotations are in most cases produced by a single annotator. It is therefore quite possible that the annotator's subjective outlook, or bias, shapes the segmentation model and consequently influences the resulting image segmentations.
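To make this training setup concrete, the sketch below shows a generic supervised segmentation loop in PyTorch in which every target mask comes from a single rater. The toy model and the `loader` of (image, mask) pairs are assumptions for illustration, not the architecture or data pipeline used in this study:

```python
import torch
import torch.nn as nn

# Stand-in for an encoder-decoder (e.g. U-Net-like) segmentation network.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()  # binary lesion-vs-background loss

def train_epoch(loader):
    """One pass over (image, mask) pairs annotated by a single rater."""
    for image, rater_mask in loader:      # rater_mask encodes one annotator's outlook
        optimiser.zero_grad()
        logits = model(image)             # (B, 1, H, W) predictions
        loss = loss_fn(logits, rater_mask)
        loss.backward()                   # the rater's bias shapes the gradients
        optimiser.step()
```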

While the issue of inter-rater bias has significant implications – particularly nowadays, when an increasing number of deep learning (DL) systems are utilised for the analysis of clinical data – it often seems to be neglected.

The main objective of this work was to quantify the impact of the rater's conception and competence on the segmentation predictions of a supervised DL framework. In a sequence of experiments, we show that inter-rater bias significantly affects the training process of a DNN and therefore leads to different segmentation results for exactly the same data.
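One way to frame such an experiment (a simplified sketch, not the exact evaluation protocol of the paper) is to train two identical networks, one per rater, and compare their predictions on the same test scans against the raters' own disagreement:

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def agreement(rater1_masks, rater2_masks, net1_masks, net2_masks):
    """Mean Dice between the two raters and between the two trained networks.

    If each network inherits its rater's bias, the network-vs-network
    disagreement on identical test data will mirror, or exceed, the
    rater-vs-rater disagreement seen in the training annotations.
    """
    rater_dice = float(np.mean([dice(a, b) for a, b in zip(rater1_masks, rater2_masks)]))
    net_dice = float(np.mean([dice(a, b) for a, b in zip(net1_masks, net2_masks)]))
    return rater_dice, net_dice
```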

Bar plot of the MS lesion loads of 21 brain MR volumes, calculated from the segmentations of the raters and the corresponding networks. The plot shows that the lesion loads calculated from the manual segmentations of rater #1 (yellow) are, for most brains (all but six, circled in red/black), lower than those calculated from the manual segmentations of rater #2 (red). The more notable result, however, concerns the network segmentations: lesion-load estimates based on the segmentations of network #1 (trained on the data labelled by rater #1, in black) are almost consistently (all brains but one, circled in black) much smaller than those of network #2 (trained on the data labelled by rater #2, in light blue), showing that inter-rater bias induces an increased and generalised bias between the networks.
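For reference, the lesion load behind such a plot is simply the segmented volume; a minimal sketch (assuming binary masks on a known voxel grid, with placeholder variable names) is:

```python
import numpy as np

def lesion_load_ml(mask: np.ndarray, voxel_volume_mm3: float) -> float:
    """Lesion load in millilitres: lesion-voxel count times voxel volume."""
    return mask.astype(bool).sum() * voxel_volume_mm3 / 1000.0

# Hypothetical per-brain comparison, e.g. on a 1 mm isotropic grid:
# loads = {name: lesion_load_ml(m, 1.0)
#          for name, m in {"rater1": m_r1, "rater2": m_r2,
#                          "net1": m_n1, "net2": m_n2}.items()}
```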

For this study, we used two different datasets: a multiple sclerosis (MS) challenge dataset, which includes MRI scans, each with annotations provided by two radiologists with different levels of expertise; and intracerebral haemorrhage (ICH) CT scans with manual and semi-manual segmentations.

The results underline a worrisome clinical implication of DNN bias induced by inter-rater bias during training. Specifically, the relative underestimation of the MS lesion load by the less experienced rater was amplified and became consistent when the volume calculations were based on the segmentation predictions of the DNN trained on that rater's annotations.

Similarly, the differences between ICH volumes calculated from the outputs of identical DNNs, each trained on annotations from a different source, were larger and more consistent than the differences between the ICH volumes calculated from the manual and semi-manual annotations used for training.

Or Shwartzman, who will deliver the oral presentation of this paper, is an MSc student at Ben Gurion University of the Negev, Israel.

He acknowledges the work of his co-authors on this paper:

Harel Gazit is an MSc student at Ben Gurion University of the Negev, Israel. Prof. Ilan Shelef is head of the radiology department at Soroka University Medical Center, Israel. Dr. Tammy Riklin Raviv works at the School of Electrical and Computer Engineering, Ben Gurion University of the Negev, Israel.

Research Presentation Session
RPS 1705 Artificial intelligence: technical aspects

The worrisome impact of an inter-rater bias on neural network training
O. Shwartzman, H. Gazit, I. Shelef, T. Riklin-Raviv; Beersheba/IL

Read the full abstract in the ECR 2020 Book of Abstracts
Shwartzman O, et al. The worrisome impact of an inter-rater bias on neural network training. Abstract RPS 1705-1 in: ECR 2020 Book of Abstracts. Insights Imaging 11, 34 (2020). DOI 10.1186/s13244-020-00851-0