Benchmarks
Set certain labeling tasks as the gold standard and measure annotator performance.
Ango Hub allows project managers to mark certain labeling tasks as benchmark (also known as 'Test Question' or 'Gold Standard' in other environments). Benchmarks allow the project manager to measure the performance of annotators.
Benchmarking is not enabled by default in new projects.
To enable benchmarking, navigate to your project's Settings page, then to the General section. Enable the toggle next to Benchmark.
Click on the Save button at the bottom of the page.
From the same menu, you may also choose the likelihood of annotators being shown a benchmark task whenever they are shown a new task from the queue. By default, it is 10%, meaning that each time an annotator clicks on "Submit" and a new task is shown to them, there is a 10% chance that the task is a benchmark (if any benchmarks are left for that user to annotate).
Disabling benchmarking in a project where benchmark tests have already taken place will delete all benchmarking information collected in the project so far. We therefore strongly recommend not disabling benchmarking in a project once it has been enabled.
If you wish to pause benchmarking on your project, you may set the likelihood annotators are shown a benchmark to 0% in the project settings. This will cause benchmark tasks not to appear to annotators.
Hub allows you to mark existing tasks as benchmarks. The task must already have been created and exist in the project. You cannot mark tasks as benchmarks during asset upload; assets must first be uploaded, and only then can they be marked.
Tasks may be marked as benchmark one at a time (single) or in bulk.
Navigate to and open the task you would like to set as benchmark. For example, you may click on the task from the Assets or the Tasks tab.
Once you have opened the task, from the three-dot menu at the top right of the labeling editor, click on Set as Benchmark.
To unset a task as benchmark, and turn it back to a normal task, follow the same steps, then click Remove as Benchmark. You may need to refresh the page if the benchmark status was changed recently.
The Set as Benchmark dialog will appear. Click on Set as Benchmark to finalize setting the task as benchmark. The task will be shown to all annotators in all labeling stages in the project.
By default, the benchmark score is calculated against the classification answers as they were when the task was set as benchmark. For example, if a radio classification with three answers, A, B, and C, has "B" marked as the benchmark answer, annotators who give any other answer receive a 0% score for that classification.
You may configure a classification so that more than one answer is accepted for a 100% benchmark score.
To do so, before you mark the task as benchmark, or after you have opened a benchmark task, look at the left-hand side of the screen and find the classification for which you would like to create a new potentially correct answer.
Answer the classification with the first correct answer. Then, click on "OR":
A new classification answer will appear below. You may then answer it to add a new potentially correct answer. Keep on clicking on "OR" to add more correct answers. Save the task.
In the example above, annotators may now answer either "Top-Down" or "Orthogonal" to the "Camera Angle" classification and, in both cases, receive a 100% score for this classification.
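If you script your own quality checks against exported annotations, the accepted-answer behavior boils down to a set-membership test. The following is a minimal, illustrative sketch, assuming answers are compared as plain strings; the function and variable names are hypothetical and are not Ango Hub identifiers.

```python
# Illustrative sketch: scoring a single classification against a set of
# accepted benchmark answers. Names are hypothetical, not Ango Hub API calls.

def classification_benchmark_score(annotator_answer, accepted_answers):
    """Return 1.0 if the answer matches any accepted answer, else 0.0."""
    return 1.0 if annotator_answer in accepted_answers else 0.0


accepted = {"Top-Down", "Orthogonal"}  # benchmark answers joined with "OR"
print(classification_benchmark_score("Top-Down", accepted))   # 1.0
print(classification_benchmark_score("Side View", accepted))  # 0.0
```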
From the Tasks tab of your project, select one or multiple tasks using the checkboxes to their left. Then, from the Items menu, click on Set as Benchmark.
To unset tasks as benchmark, and turn them back to normal tasks, follow the same steps, then click Remove as Benchmark.
The Set as Benchmark dialog will appear. Click on Set as Benchmark to finalize your selection. The tasks will be shown to all annotators in all labeling stages in the project.
What this means in practice is that:
The tasks you have selected will be marked as benchmarks.
The tasks you have selected will be moved to the Complete stage. Since you have marked the task(s) as the gold standard, they are assumed to be complete.
Benchmark tasks may be re-queued from Complete to other stages. If you do so, however, an annotator may annotate the task again, or a reviewer may alter it, changing the benchmark for users who have not yet been tested on it. We therefore strongly recommend that tasks marked as benchmark not be re-queued from Complete to stages where they can be edited.
Hub will make copies of the task(s) you have selected, one for each user in the project, and place them in every user's labeling queues in all label-type stages in the project. The tasks created this way are known as "benchmark tasks". Benchmark tasks are not included in the final export, are not sent to Complete, and are only shown to users in the stage you have selected in this dialog; they are archived afterwards. All users who annotate in label-type stages will be shown the benchmark tasks. To limit who can see them, you must limit who can annotate or review in all label-type stages in your project.
Tasks selected as benchmarks will be visually distinguished from other tasks, to the project manager only, by a small yellow crown in their row in both the Assets and Tasks tabs:
Users will not be able to tell that they are annotating a benchmark task. The task will look and feel exactly like any other task, with no indication whatsoever that the task they are annotating will be utilized in their performance evaluation.
Users will be able to annotate, create issues, skip, save, view instructions, and perform any other action they can normally perform on normal labeling tasks.
The only difference is that completing a benchmark task will not increase the number of "Completed" tasks, as benchmark tasks do not appear in the final export, are not sent to Complete, and are only used once in the stage where they have been created to measure the annotator's (or reviewer's) performance.
As long as there are benchmark tasks in the stage for the annotating user, whenever the user clicks on "Submit", the next task shown to them has, by default, a 10% chance of being a benchmark task.
If only benchmark tasks are remaining in the user's queue, the user will be exclusively shown benchmark tasks.
This frequency can be changed from the project Settings -> General section, under the benchmark toggle:
By default, there is a 10% chance that, whenever an annotator submits a task, the next task they will receive will be a benchmark (if there are any left for the user to annotate).
Setting this to 0% will stop benchmark tasks from being shown to users, and setting this to 100% will cause only benchmark tasks to be shown until all benchmark tasks are done, after which users will be shown normal labeling tasks.
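To get a feel for how this percentage behaves, below is a small, purely illustrative simulation of the queue behavior described above. It assumes each submission triggers an independent random draw, which approximates, but does not reimplement, Hub's internal selection logic.

```python
import random

# Illustrative simulation of benchmark-task frequency. The selection logic
# below is an assumption for demonstration, not Ango Hub's actual code.

def next_task(benchmark_queue, normal_queue, benchmark_chance=0.10):
    """Pick the next task for a user after they click Submit."""
    show_benchmark = (
        benchmark_queue
        and benchmark_chance > 0
        # If only benchmarks remain, the user sees benchmarks exclusively.
        and (not normal_queue or random.random() < benchmark_chance)
    )
    if show_benchmark:
        return benchmark_queue.pop()
    return normal_queue.pop() if normal_queue else None


benchmarks = [f"benchmark-{i}" for i in range(3)]
normals = [f"task-{i}" for i in range(20)]
while (task := next_task(benchmarks, normals)) is not None:
    print(task)
```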
As project manager, you can see the performance of each user, as well as the performance of each benchmark question.
To see each user's performance, enter the Performance tab. Each user's average benchmark score will be shown on the user's row:
To see the performance of each benchmark question, from the Tasks tab, filter by Benchmark. You will then only see tasks which have been set as benchmark.
Click on the "+" icon next to a benchmark task to see each user's answers and score as it relates to that benchmark question:
From each benchmark's row, you may see the average benchmark score for that question, as well as the number of annotators who have submitted an answer to that benchmark:
You may also download a JSON containing all information on all tasks used to benchmark users from Settings -> General -> Export Benchmark Tasks.
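The structure of the exported file may vary between Hub versions, so inspect your own export before relying on specific fields. Below is a minimal sketch for loading it, assuming the export is a JSON array and using a hypothetical file name.

```python
import json

# Minimal sketch: inspect the benchmark export downloaded from
# Settings -> General -> Export Benchmark Tasks. The file name and the
# assumption that the export is a JSON array are illustrative; check your
# own export's structure before relying on specific fields.

with open("benchmark_export.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"Entries in export: {len(data)}")
if isinstance(data, list) and data and isinstance(data[0], dict):
    print("Fields on the first entry:", sorted(data[0].keys()))
```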
Benchmark tasks shown to users:
Do not get sent to Complete
Do not contribute to completion statistics (e.g. the "Tasks Completed" number will not go up as benchmark tasks are completed)
Do contribute to all other statistics (TPT, etc.)
Are immediately archived after being submitted (e.g. they are available for project managers to inspect, but they are not present in any stage).
Let questionCount be the total number of classification questions in the project, and taskCount the total number of tasks assigned to an asset.
We calculate x, the single-question score for a single task, as sameAnswers / (taskCount - 1), where sameAnswers is the number of other answers to that question that are equal to the current task's answer (the current answer itself excluded).
We repeat the above calculation for all tasks in the asset; the sum of these results is represented as Σ(x) below.
We calculate y, the overall score on a single question (classification), as Σ(x) / taskCount.
We repeat the above calculation for all questions in the asset; the sum of these results is represented as Σ(y) below.
The final score, then, is calculated as Σ(y) / questionCount.
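As a worked illustration of the formula above, the sketch below computes the score for a small, made-up asset. It assumes classification answers are available as plain strings; the function name and data layout are illustrative, not Ango Hub identifiers.

```python
# Worked illustration of the classification score described above.
# The data layout (a list of {question: answer} dicts, one per task on the
# asset) and the function name are hypothetical, not Ango Hub identifiers.
# Assumes at least two tasks on the asset.

def classification_score(tasks):
    questions = sorted({q for task in tasks for q in task})
    question_count = len(questions)
    task_count = len(tasks)

    per_question_scores = []  # one y value per question
    for question in questions:
        x_values = []  # one x value per task for this question
        for i, task in enumerate(tasks):
            # sameAnswers: answers on the other tasks equal to this task's answer
            same_answers = sum(
                1
                for j, other in enumerate(tasks)
                if j != i and other.get(question) == task.get(question)
            )
            x_values.append(same_answers / (task_count - 1))
        # y = Σ(x) / taskCount for this question
        per_question_scores.append(sum(x_values) / task_count)

    # final score = Σ(y) / questionCount
    return sum(per_question_scores) / question_count


tasks = [
    {"Camera Angle": "Top-Down", "Weather": "Sunny"},
    {"Camera Angle": "Top-Down", "Weather": "Cloudy"},
    {"Camera Angle": "Orthogonal", "Weather": "Sunny"},
]
print(classification_score(tasks))  # ≈ 0.33 for this example
```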
Note on Rank Benchmarking
In the Rank classification tool, if the annotator's answer differs in any way from the benchmark, their score for that classification will be 0. If they are exactly the same, it will be 1 (i.e. 100%) for that classification.
We calculate benchmarks for objects using the Intersection over Union (IoU) method.
We compare objects with one another to generate their IoU scores. If two annotations are completely separate, with not even a pixel in common, their IoU score is 0. If they overlap completely, their score is 1 (i.e. 100%).
We then average the IoU scores of all annotations to calculate the final score.
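As an illustration, the sketch below computes IoU for two axis-aligned bounding boxes. It is a simplified example and does not cover polygons or other geometry types.

```python
# Minimal IoU sketch for two axis-aligned bounding boxes given as
# (x_min, y_min, x_max, y_max). Illustrative only.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (zero area if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0 -> identical boxes
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0 -> no overlap
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # ≈ 0.33 -> partial overlap
```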
Can I edit a task after setting it as a benchmark?
Yes. Open the task from the Tasks or the Assets tab, edit it, and save it.
Existing benchmark scores will not be changed. Users who have not yet been benchmarked on the task will see the new, updated task, and they will be tested on this new version of the task.
What happens to existing benchmark scores if I edit a benchmark task?
The benchmark score remains unchanged. Once a user has been tested on a benchmark task, their score for that task does not change. Users who have not yet seen the benchmark task, however, will be tested on the edited version of the task.