Consensus
Because of the way the Consensus mechanism works under the hood, logic stages of the type "Annotator" and "Duration" may not work as expected when processing tasks output from a Consensus stage.
Requeuing tasks with open issues that have been output from a Consensus stage might lead to unexpected behavior with those issues. We recommend closing all issues on such tasks before requeuing them.
A Consensus stage is a way for you to present tasks to multiple annotators and have the task be output from either the Agreement or the Disagreement output, depending on how much the annotators agree with one another.
Essentially, the Consensus stage is a container for other Label or Plugin sub-stages.
The Consensus stage also accepts plugin sub-stages: for example, you can have a task labeled by both an annotator and a plugin, and route the task based on how closely the annotator's labels match the plugin's.
You may add at most ten sub-stages to a Consensus stage.
By default, the Consensus stage does not prevent the same person from labeling the same task more than once. To prevent this, assign different annotators to the different label sub-stages, as mentioned in the section for the Label stage. This can be done automatically by clicking Auto Assign in the Consensus stage settings.
More details are available in the section on Auto Assign.
Consensus agreement cannot be calculated for the following class types:
Brush
Voxel Brush
Segmentation
Polyline
Rotated Bounding Box
In video assets, consensus calculation ignores pages. Because of this, we do not recommend using consensus in video-based tasks yet.
As shown in the diagram above, whenever a task enters the Consensus stage, it is 'duplicated' into sub-tasks, and each sub-task is sent to its own sub-stage.
You may examine individual sub-tasks and check their current status from the "Tasks" tab, by clicking on the "Plus" next to the Consensus task to expand it and see details pertaining to the sub-tasks:
If a sub-task is in the "Archive" stage it means it has been completed and submitted.
Once all sub-tasks have been annotated, they will be archived and will no longer be accessible through the "Tasks" tab. They will, however, remain accessible from the "Stage History" panel in the labeling editor when opening the main task.
By default, Ango Hub does not prevent the same annotator from annotating the same asset more than once as part of a consensus stage.
For example, if you add two Label stages which can be annotated by Anyone, like so:
Labeler A will open their labeling queue and go through the tasks in Consensus_1.
If Labeler A then clicks on Start Labeling again before any other annotator has opened the tasks they annotated, they may enter the Consensus_2 queue and label the same tasks again. In that case, consensus will not be calculated between two different annotators, as usually expected, since the same annotator will have annotated both sub-tasks.
To prevent this, you'd have to assign each labeling stage in consensus to different annotators. Auto Assign automates this process for you.
From the Consensus stage settings, click on Auto Assign. The following dialog will pop up:
Toggle on the users you'd like to assign to the stages within the selected consensus container, and they'll be distributed to every consensus stage in the container. If, after doing so, there are no consensus stages in your container assigned to Anyone, then you have guaranteed that no labeler will see the same task twice.
You can mark certain stages in your Consensus stage as dynamic by turning on the toggle on the stage(s) you wish to mark as dynamic:
Dynamic stages are only activated if a task has not reached the consensus threshold in the non-dynamic (static) stages.
Take the following example, where we have four labeling stages in our consensus: two are static and two are dynamic:
When a task is sent to this Consensus stage, it will first be shown to the sub-stages Consensus_1 and Consensus_2. If the required consensus is not met, instead of the task being sent out from the 'Disagreement' output, it will be sent to the Consensus_3 sub-stage and annotated there.
If the threshold has still not been reached at that point, the task will then be sent to Consensus_4 and annotated there too. If the consensus threshold has now been reached, the task will be sent out from the 'Agreement' output, as explained in the How Consensus Works section. Otherwise, it will be sent out from the 'Disagreement' output.
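To make the routing concrete, here is a minimal sketch of the logic described above. The helpers collect_judgment and consensus_score are hypothetical placeholders for whatever gathers a sub-stage's annotations and computes the running consensus score; they are not part of Ango Hub.

```python
# A minimal sketch of the dynamic-consensus routing described above.
# `collect_judgment` and `consensus_score` are hypothetical stand-ins for
# whatever gathers a sub-stage's annotations and computes the running score.

def route_consensus_task(task, static_stages, dynamic_stages, threshold,
                         collect_judgment, consensus_score):
    """Return 'Agreement' or 'Disagreement', activating dynamic sub-stages
    one by one only while the consensus threshold has not been reached."""
    judgments = [collect_judgment(stage, task) for stage in static_stages]

    for stage in dynamic_stages:
        if consensus_score(judgments) >= threshold:
            break  # threshold already met; remaining dynamic stages stay inactive
        judgments.append(collect_judgment(stage, task))

    return "Agreement" if consensus_score(judgments) >= threshold else "Disagreement"
```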
Clicking on Add Label will add a label stage. Clicking on Add Plugin will add a plugin stage. Click on each individual stage to change its options. Enable the grey toggle to mark the stage as dynamic (see the section on Dynamic Consensus). Click on the trash can to delete the stage.
From this view, you will be able to pick what will be determined as Agreement and Disagreement. You will see a list of labeling tools present in your project.
To have a tool be included in the Consensus calculation, enable the toggle next to it.
In the example above, we have three tools: a bounding box named Vehicle, a radio classification named Color, and a single dropdown named Model. Here, the task will be marked as being in Agreement only when at least 30% of the annotators give the same answer to Color and at least 30% of annotators give the same answer to Model.
Since the "Vehicle" bounding box had its toggle turned off, annotations from that class will not be counted in the consensus score calculation.
The task sent as output is not the judgment of a single annotator. It is instead a composite task, the contents of which are determined by the adjudication method you pick here.
Best Answer
The output task contains the annotations with the highest consensus score, for each class, for classes where consensus can be calculated.
For example, if the consensus stage has three judgment sub-stages, and the task has three radio classifications A, B, and C, and one bounding box class D, the task output at the end will have, for each classification, the answer on which annotators agreed the most, and for class D, the bounding boxes created by the annotator with the highest class D consensus score.
For classes where consensus cannot be calculated (e.g. assume in our project there is a points class E and a rotated bounding box class F), the final task will have the non-calculable classes from the first user who has submitted them in the consensus stage.
So in this case, we would have the best answers from classes A, B, and C, then the bounding boxes drawn by the user with the highest class D consensus score, and for classes E and F, we would have the answers given by the first user to submit them in the consensus stage.
If some annotators in the Consensus stage did not create annotations using a certain class, or did not answer certain classifications, but others did, the output task will still contain those annotations and answers, even if not all consensus annotators provided them.
For example, if we have a project with a bounding box class A, a polygon class B, a radio classification C, and a text classification D, assuming:
User 1 only created 1 bounding box with class A, and answered the radio classification C (no other answers/annotations)
User 2 only created 1 bounding box with class A, and a polygon with class B (no other answers/annotations)
User 3 only created 1 bounding box with class A, and answered the text classification D (no other answers/annotations)
The output composite task will have:
The class A bounding boxes drawn by the user with the highest class A consensus score
The class B polygons created by User 2
The class C radio answer from User 1
The class D text answer from User 3
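A minimal sketch of this Best Answer merge is shown below, assuming per-class consensus scores have already been computed for each annotator. The data shapes and function name are illustrative assumptions, not Ango Hub's actual export format or API.

```python
# A minimal sketch of the "Best Answer" merge, assuming per-class consensus
# scores have already been computed. The data shapes (dicts keyed by user and
# class name) are illustrative, not Ango Hub's actual export format.

def merge_best_answer(judgments, class_scores, calculable_classes):
    """judgments: {user: {class_name: annotations_or_answer}}, in submission order.
    class_scores: {class_name: {user: consensus_score}} for calculable classes.
    Returns one composite dict of annotations keyed by class name."""
    composite = {}
    all_classes = {c for answers in judgments.values() for c in answers}

    for class_name in all_classes:
        # Only users who actually provided this class contribute candidates
        candidates = [u for u, answers in judgments.items() if class_name in answers]
        if class_name in calculable_classes:
            # Take the answer of the user with the highest consensus score
            best_user = max(candidates, key=lambda u: class_scores[class_name].get(u, 0))
        else:
            # Consensus not calculable (e.g. polyline): take the first submitter
            best_user = candidates[0]
        composite[class_name] = judgments[best_user][class_name]

    return composite
```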
Here is a visual representation of the algorithm, given three annotators working on the same image:
All Answers
The output task contains all annotations from all consensus stages/judgments, merged together.
Here is a visual representation of the algorithm, given three annotators working on the same image:
Let questionCount be the total number of classification questions in the project, and taskCount the total number of tasks assigned to an asset.
We calculate the single-question consensus for a single task as sameAnswers / taskCount, where sameAnswers is the count of answers that are equal to one another, the current one included.
We repeat the above calculation for all tasks in the asset; the overall consensus on a single question (classification), denoted y, is the highest value achieved during the repetitions.
We repeat the above calculation for all questions in the asset; the sum of these per-question values is represented as ∑(y) below.
The final consensus score, then, is calculated as ∑(y) / questionCount.
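The same calculation can be written as a short sketch. The input shape, a mapping from each classification question to the list of answers given by the consensus sub-tasks, is assumed for illustration.

```python
# A minimal sketch of the classification consensus calculation described above.
# `answers_by_question` maps each classification question to the list of answers
# given by the consensus sub-tasks (one answer per task); this shape is assumed.

def classification_consensus(answers_by_question):
    question_count = len(answers_by_question)
    per_question_scores = []

    for answers in answers_by_question.values():
        task_count = len(answers)
        # sameAnswers / taskCount for each task; keep the highest value (y)
        y = max(answers.count(a) / task_count for a in answers)
        per_question_scores.append(y)

    # Final score: the sum of per-question maxima divided by the question count
    return sum(per_question_scores) / question_count

# Example: three annotators, two classification questions
print(classification_consensus({
    "Color": ["Red", "Red", "Blue"],    # y = 2/3
    "Model": ["Sedan", "SUV", "Coupe"]  # y = 1/3
}))  # (2/3 + 1/3) / 2 = 0.5
```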
Note on Rank Benchmarking
In the Rank classification tool, if the annotator's answers differ in any way from the benchmark, their score for that classification will be 0. If they are exactly the same, it will be 1 (i.e. 100%) for that classification.
The algorithm checks the proportion between the distance of two points and the longest distance on the image. For example, let's say we have an image with a height of 500 and a width of 1200. The longest distance on this image (its diagonal) is 1300.
Say the point from the first consensus task is at [200, 200] and the point from the second is at [500, 600]. The distance between these points is 500. Based on the proportion of this distance to the longest distance, the consensus score for this pair of points comes out to 26%. Since each point is also compared with itself (scoring 100%) to adjust the consensus, the overall consensus will be 63%.
We calculate consensus for objects using the Intersection over Union (IoU) method.
We compare objects with one another to generate their IoU scores. If two annotations are completely separate, with not even a pixel in common, their IoU score is 0. If they overlap completely, their score is 100.
Note that objects are also compared with themselves, so in the non-intersecting example above, the score of each object would be 50. The highest score achieved across all tasks is taken as the consensus score of that tool (where a tool means a unique schema ID here, not the tool type).
We then average the IoU scores of all tools to calculate the final consensus score.
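Here is a minimal sketch of this calculation for axis-aligned bounding boxes. How objects are matched across sub-tasks is not specified above, so the sketch simply assumes one object per sub-task per tool; the function names are illustrative.

```python
# A minimal sketch of the IoU-based object consensus described above, using
# axis-aligned boxes (x1, y1, x2, y2). Object matching across sub-tasks is
# simplified: one object per sub-task per tool is assumed.

def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def tool_consensus(boxes):
    """Each box's score is its mean IoU against every box, itself included;
    the tool's consensus is the highest of those per-box scores."""
    scores = [sum(iou(a, b) for b in boxes) / len(boxes) for a in boxes]
    return max(scores)

# Two completely disjoint boxes: each scores (1.0 + 0.0) / 2 = 0.5
print(tool_consensus([(0, 0, 10, 10), (20, 20, 30, 30)]))  # 0.5

def final_consensus(per_tool_boxes):
    """Average the per-tool consensus scores to get the final score."""
    return sum(tool_consensus(b) for b in per_tool_boxes.values()) / len(per_tool_boxes)
```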
The Consensus stage has two outputs: Agreement and Disagreement.
If the consensus threshold has been achieved for all labeling tools and classifications specified in the stage setup, the consensus task will be output from the Agreement output. Otherwise, it will be sent from the Disagreement output.
The task you will get as the output will be determined by the method you pick in the stage's Adjudication tab. Please refer to the section on adjudication for more on the task being output.
In the Workflow view, dynamic stages are marked by a symbol next to their name.