Examining Inter-Rater Reliability in a Reality Baking Show

An updated version of this post (with all the code the isn't rendered here) can be found here>>

Game of Chefs is an Israeli reality cooking competition show, where chefs compete for the title of "Israel's most talented chef". The show has four stages: blind auditions, training camp, kitchen battles, and finals. Here I will examine the 4th season's (Game of Chefs: Confectioner) inter-rater reliability in the blind auditions, using some simple R code.

The Format

In the blind auditions, candidates have 90 minutes to prepare a dessert, confection or other baked good, which is then sent to the show's four judges for a blind taste test. The judges don't see the candidate, and know nothing about him or her, until after they deliver their decision - A "pass" decision is signified by the judge granting the candidate a kitchen knife; a "fail" decision is signified by the judge not granting a knife. A candidate who receives a knife from at least three of the judges continues on to the next stage, the training camp.

The 4 judges and host, from left to right: Yossi Shitrit, Erez Komarovsky, Miri Bohadana (host), Assaf Granit, and Moshik Roth

Inter-Rater Reliability

I've watched all 4 episodes of the blind auditions (for purely academic reasons!), for a total of 31 candidates. For each dish, I recorded each of the four judges' verdict (fail / pass).

Let's load the data.

We will need the following packages:
We can now use the psych package to compute Cohen's Kappa coefficient for inter-rater agreement for categorical items, such as the fail/pass categories we have here. \$\kappa\$ ranges from -1 (full disagreement) through 0 (no pattern of agreement) to +1 (full agreement). Normally, \$\kappa\$ is computed between two raters, and for the case of more than two raters, the mean \$\kappa\$ across all pair-wise raters is used.

We can see that overall \$\kappa=0.25\$ - surprisingly low for what might be expected from a group of pristine, accomplished, professional chefs and bakers in a blind taste test.

When examining the pair-wise coefficients, we can also see that Erez seems to be in lowest agreement with each of the other judges (and even in a slight disagreement with Yossi!). This might be because Erez is new on the show (this is his first season as judge), but it might also be because of the four judges, he is the only one who is actually a confectioner (the other 3 are restaurant chefs).

For curiosity's sake, let's also look at the \$\kappa\$ coefficient between each judge's rating and the total fail/pass decision, based on whether a dish got a "pass" from at least 3 judges.

We can now use the wonderful new corrr package, which is intended for exploring correlations, but can also generally be used to manipulate any symmetric matrix in a tidy-fashion.
Perhaps unsurprisingly (to people familiar with previous seasons of the show), it seems that Assaf's judgment of a dish is a good indicator of whether or not a candidate will continue on to the next stage. Also, once again we see that Erez is barely aligned with the other judges' total decision.

Conclusion

Every man to his taste...

Even among the experts there is little agreement on what is fine cuisine and what is not worth a doggy bag. Having said that, if you still have your heart set on competing in Game of Chefs, it seems that you should at least appeal to Assaf's palate.

Bon Appétit!

Mattan S. Ben-Shachar

Search This Blog