Consensus

The current implementation of parallel consensus has a number of limitations which must be taken into consideration when creating a new project.

  1. It is currently not possible to re-queue tasks to and from a Consensus stage. They will need to be deleted and re-uploaded. This can be circumvented by re-queuing to a Hold or Start stage before the desired Consensus stage and then forwarding the tasks, but this is not an officially supported workaround and may result in unexpected behavior.

  2. Because of the way the Consensus mechanism works under the hood, logic stages of the type "Annotator" and "Duration" will not work as expected when processing tasks output from a Consensus stage.

A Consensus stage is a way for you to present tasks to multiple annotators, and have the task be output in either Agreement or Disagreement conditional upon how much the annotators agree with one another.

Essentially, the Consensus stage is a container for other Label or Plugin sub-stages.

The Consensus stage accepts plugin sub-stages, such that, for example, you can have a task be labeled by an annotator and a plugin, and you may return the task based on how similarly the annotator labeled the task compared to the plugin.

You may add at most ten sub-stages to a Consensus stage.

The Consensus stage, by default, does not prevent the same task from being labeled more than once by the same person. To prevent that from happening, you will have to assign different annotators to different label stages, as mentioned in the section for the Label stage. This can be done automatically by clicking on Auto Assign in the settings for the Consensus stage.

See the Auto Assign section for more details.

Consensus agreement cannot be calculated for the following class types:

  • Brush

  • Voxel Brush

  • Segmentation

  • Point

  • Polyline

  • Rotated Bounding Box

In video assets, consensus calculation ignores pages. Because of this, we do not recommend using consensus in video-based tasks yet.

Diagram of how Consensus works

As mentioned in the diagram above, whenever a task enters the Consensus stage, it is 'duplicated' into sub-tasks, and each sub-task is sent to its own sub-stage.

You may examine individual sub-tasks and check their current status from the "Tasks" tab, by clicking on the "Plus" next to the Consensus task to expand it and see details pertaining to the sub-tasks:

If a sub-task is in the "Archive" stage it means it has been completed and submitted.

Once all sub-stages have been annotated, the sub-tasks will be archived and will no longer be accessible through the "Tasks" tab. They will, however, remain accessible from the "Stage History" panel in the labeling editor when opening the main task.

Settings

Auto Assign

By default, Ango Hub does not prevent the same annotator from annotating the same asset more than once as part of a consensus stage.

For example, if you add two Label tasks which can be annotated by Anyone, like so:

Labeler A will open their labeling queue and go through the tasks in Consensus_1.

If no other annotator has opened the tasks annotated by Labeler A, and Labeler A clicks on Start Labeling and enters the labeling queue, they may enter the Consensus_2 queue and label the same tasks again. This way, consensus will not be calculated between two different annotators, as usually expected, since the same annotator will have annotated both tasks themselves.

To prevent this, you'd have to assign each labeling stage in consensus to different annotators. Auto Assign automates this process for you.

From the Consensus stage settings, click on Auto Assign. The following dialog will pop up:

Toggle on the users you'd like to assign to the stages within the selected consensus container, and they'll be distributed to every consensus stage in the container. If, after doing so, there are no consensus stages in your container assigned to Anyone, then you have guaranteed that no labeler will see the same task twice.

Dynamic Consensus

You can mark certain stages in your Consensus stage as dynamic by turning on the toggle on the stage(s) you wish to mark as dynamic:

Dynamic stages are only activated if a task has not reached the consensus threshold in the non-dynamic (static) stages.

Take the following example, where we have four labeling stages in our consensus: two are static and two are dynamic:

When a task is sent to this Consensus stage, it will be first shown to the sub-stages Consensus_1 and Consensus_2. If the required consensus is not met, instead of the task being sent out from the 'Disagreement' output, the task will be sent to the Consensus_3 sub-stage and annotated there.

If the threshold has still not been reached at this point, the task will then be sent to Consensus_4 and annotated there too. If the consensus threshold has been reached after that, the task will be sent out from the 'Agreement' output, as explained in the How Consensus Works section. Otherwise, it will be sent out from the 'Disagreement' output.
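The routing described above can be sketched in a few lines of Python. This is an illustrative model only, not Ango Hub's implementation: `route_task`, `majority_share`, and the 0.6 threshold are hypothetical names and values chosen for this sketch, each sub-stage is reduced to a single answer, and the sketch assumes a task leaves through 'Agreement' as soon as the threshold is met.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def majority_share(answers: List[str]) -> float:
    """Hypothetical score: share of sub-stages that gave the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def route_task(
    static_answers: List[str],
    dynamic_answers: Iterable[str],
    threshold: float,
    score: Callable[[List[str]], float] = majority_share,
) -> Tuple[str, int]:
    """Return the output name and the number of sub-stages that were used."""
    answers = list(static_answers)
    if score(answers) >= threshold:
        return "Agreement", len(answers)
    # Dynamic sub-stages are consulted one at a time, only while still needed.
    for extra in dynamic_answers:
        answers.append(extra)
        if score(answers) >= threshold:
            return "Agreement", len(answers)
    return "Disagreement", len(answers)


# Two static stages disagree; the first dynamic stage settles the task,
# so the second dynamic stage is never used.
result = route_task(["car", "truck"], ["car", "truck"], threshold=0.6)
```

In this example the task exits through 'Agreement' after three sub-stages. If every sub-stage gave a different answer, all four would be used and the task would exit through 'Disagreement'.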

Appearance

Setup

Clicking on Add Label will add a label stage. Clicking on Add Plugin will add a plugin stage. Click on each individual stage to change their options.

Threshold

From this view, you will be able to pick what will be determined as Agreement and Disagreement. You will see a list of labeling tools present in your project.

To have a tool be included in the Consensus calculation, enable the toggle next to it.

In the example above, we have three tools: a bounding box named Vehicle, a radio classification named Color, and a single dropdown named Model. In this example, the task will be considered in agreement when at least 30% of the annotators give the same answer to Color, and at least 30% of annotators give the same answer to Model. When both of these conditions are satisfied, the task is marked as being in Agreement.

Since the "Vehicle" bounding box had its toggle turned off, annotations from that class will not be counted in the consensus score calculation.
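As a rough illustration of the rule above, the following Python sketch checks a set of per-tool consensus scores against per-tool thresholds. The function name `is_agreement`, the dictionary layout, and the score values are assumptions made for this example; they are not part of Ango Hub's API.

```python
from typing import Dict


def is_agreement(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """True when every enabled tool meets its threshold (values in percent)."""
    # Tools whose toggle is off are simply absent from `thresholds`,
    # so their scores never influence the result.
    return all(scores[tool] >= minimum for tool, minimum in thresholds.items())


# Vehicle is toggled off, so only Color and Model are checked.
thresholds = {"Color": 30.0, "Model": 30.0}
scores = {"Vehicle": 10.0, "Color": 66.7, "Model": 33.3}
agreement = is_agreement(scores, thresholds)
```

Here the low Vehicle score is ignored and the task counts as being in Agreement, because both Color and Model reach 30%.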

How Consensus is Calculated

Classifications

Let questionCount be the total number of classification questions in the project, and taskCount the total number of tasks assigned to an asset.

We calculate the single-question consensus for a single task as sameAnswers / taskCount, where sameAnswers is the count of answers that are equal to one another, current one included.

We repeat the above calculation for all tasks in the asset. The overall consensus on a single question (classification) is the highest value achieved during the repetitions, denoted y.

We repeat the above calculation for all questions in the asset and sum the results to obtain Σ(y).

The final consensus score, then, is calculated as Σ(y) / questionCount.
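The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Ango Hub's code: the function names and the input layout (one list of answers per question, one answer per task) are hypothetical, and every task is assumed to have answered every question.

```python
from collections import Counter
from typing import Dict, List


def question_consensus(answers: List[str]) -> float:
    """Highest sameAnswers / taskCount over all tasks, for one question (y)."""
    task_count = len(answers)
    counts = Counter(answers)
    # For each task, sameAnswers is the number of answers equal to its own,
    # the task's own answer included.
    return max(counts[answer] / task_count for answer in answers)


def classification_consensus(answers_by_question: Dict[str, List[str]]) -> float:
    """Sum of per-question consensus values divided by questionCount."""
    question_count = len(answers_by_question)
    total = sum(question_consensus(a) for a in answers_by_question.values())
    return total / question_count


answers_by_question = {
    "Color": ["red", "red", "blue"],        # best agreement: 2 of 3 tasks
    "Model": ["sedan", "coupe", "wagon"],   # best agreement: 1 of 3 tasks
}
final_score = classification_consensus(answers_by_question)
```

With three tasks, Color reaches 2/3 and Model reaches 1/3, so the final score is (2/3 + 1/3) / 2, that is, one half.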

Note on Rank Benchmarking

In the Rank classification tool, if the annotator's answers differ in any way from the benchmark, their score for that classification will be 0. If they are exactly the same, it will be 1 (i.e. 100%) for that classification.

Objects (Bounding Box, Polygon, PDF Area)

We calculate consensus for objects using the Intersection over Union (IoU) method.

We compare objects with one another to generate their IoU scores. If some annotations are completely separate, for example, with not even a pixel in common, their IoU score would be 0. If they overlapped completely, their score would be 100.

Note that objects are also compared with themselves. In the non-intersecting example above, each object would therefore score 50. For each tool, the highest score achieved across all tasks is taken as that tool's consensus score (here, a tool means a unique schema ID, not a tool type).

We then average the IoU scores of all tools to calculate the final consensus score.
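The following Python sketch illustrates the idea for axis-aligned bounding boxes. It rests on several assumptions that the text above does not spell out: each annotator draws exactly one box per tool, a box is a hypothetical `(x_min, y_min, x_max, y_max)` tuple, and an object's score is the mean of its IoU against every annotator's box, itself included (which reproduces the score of 50 for two non-intersecting boxes). How Ango Hub matches multiple objects per task is not covered here.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes, on a 0 to 100 scale."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return 100.0 * inter / union if union > 0 else 0.0


def tool_consensus(boxes: List[Box]) -> float:
    """Highest per-object score; each object is also compared with itself."""
    scores = [sum(iou(box, other) for other in boxes) / len(boxes) for box in boxes]
    return max(scores)


def object_consensus(boxes_by_tool: Dict[str, List[Box]]) -> float:
    """Average of the per-tool consensus scores."""
    return sum(tool_consensus(b) for b in boxes_by_tool.values()) / len(boxes_by_tool)


disjoint = [(0, 0, 10, 10), (20, 20, 30, 30)]   # no pixel in common
identical = [(0, 0, 10, 10), (0, 0, 10, 10)]    # complete overlap
overall = object_consensus({"Vehicle": disjoint, "Sign": identical})
```

The disjoint pair yields 50 for its tool and the identical pair yields 100, so averaging the two tools gives 75. The tool names `Vehicle` and `Sign` are placeholders.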

Output

The Consensus stage has two outputs: Agreement and Disagreement.

If the consensus threshold has been achieved for all labeling tools and classifications specified in the stage setup, the consensus task will be output from the Agreement output. Otherwise, it will be sent from the Disagreement output.

The Output Task

The task sent as output is not the judgment from a single annotator – it is instead a composite task. Here are this task's properties:

  • The output task contains the annotations with the highest consensus score, for each class, for classes where consensus can be calculated.

    • For example, if the consensus stage has three judgment sub-stages, and the task has three radio classifications A, B, and C, and one bounding box class D, the task output at the end will have, for each classification, the answer annotators coalesced on the most, and for class D, the bounding boxes created by the annotator with the highest class D consensus score.

    • For classes where consensus cannot be calculated (e.g. assume in our project there is a points class E and a rotated bounding box class F), the final task will have the non-calculable classes from the first user who has submitted them in the consensus stage.

    • So in this case, we would have the best answers from classes A, B, and C, then the bounding boxes drawn by the user with the highest class D consensus score, and for classes E and F, we would have the answers given by the first user to submit them in the consensus stage.

  • If, in the Consensus stage, some annotators did not create annotations using a certain class, or did not answer some classification questions, but others did, the output task will contain those annotations and answers, even if not all consensus annotators responded.

    • For example, if we have a project with a bounding box class A, a polygon class B, a radio classification C, and a text classification D, assuming:

      • User 1 only created 1 bounding box with class A, and answered the radio classification C (no other answers/annotations)

      • User 2 only created 1 bounding box with class A, and a polygon with class B (no other answers/annotations)

      • User 3 only created 1 bounding box with class A, answered the text classification D (no other answers/annotations)

    • The output composite task will have:

      • The class A bounding boxes drawn by the user with the highest class A consensus score

      • The class B polygons created by User 2

      • The class C radio answer from User 1

      • The class D text answer from User 3
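The classification part of these rules can be sketched in Python as below. This is a simplified illustration, not Ango Hub's implementation: it covers classifications only, `composite_classifications` and the input layout are hypothetical, and the tie-breaking behavior (keeping the answer that was submitted first) is an assumption, since the text above does not say how ties are resolved.

```python
from collections import Counter
from typing import Dict, List


def composite_classifications(submissions: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-annotator answers into one composite set of answers.

    `submissions` is ordered by submission time; each dict maps a question
    to that annotator's answer, and unanswered questions are omitted.
    """
    output: Dict[str, str] = {}
    for submission in submissions:
        for question in submission:
            if question in output:
                continue
            # Collect every answer given to this question, in submission order.
            answers = [s[question] for s in submissions if question in s]
            # Keep the answer annotators agreed on the most.
            output[question] = Counter(answers).most_common(1)[0][0]
    return output


submissions = [
    {"A": "red", "C": "yes"},    # User 1
    {"A": "red"},                # User 2
    {"A": "blue", "D": "note"},  # User 3
]
composite = composite_classifications(submissions)
```

The composite keeps "red" for A, where two of three annotators agree, and carries over C and D even though only one annotator answered each.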
