Graphs of overfitting and underfitting.
Figure 2: Screenshot of Google Translate output. The English sentence "The doctor is on her lunch break" is translated to Turkish, and the Turkish output is then translated back to English as "The doctor is on his lunch break".
Figure 3: Screenshot of Google Translate output. The English sentence "The doctor is on her lunch break" is translated to Norwegian, and the Norwegian output is then translated back to English as "The doctor is on his lunch break".
Figure 4: A blurry, pixelated image (of Barack Obama).
Who is shown in this blurred picture?
Figure 5: The unblurred (upsampled) version of the pixelated picture; instead of showing Obama, it shows a white man.
While the picture is of Barack Obama, the upsampled image shows a
white face.
Credits: AAAI 2021 Tutorial on Explaining Machine Learning Predictions: State of the Art, Challenges, Opportunities.
Figure 2: Generated anchors for tabular datasets. For the adult dataset: predict under 50K if no capital gain or loss and never married; predict over 50K if the country is the US, married, and work hours over 45. For the RCDV dataset: predict not re-arrested if the person has no priors, no prison violations, and the crime was not against property; predict re-arrested if the person is male, black, has 1–5 priors, is not married, and the crime was not against property. For the Lending dataset: predict a bad loan if the FICO score is below 650; predict a good loan if the FICO score is between 650 and 700 and the loan amount is between 5,400 and 10,000.
Figure 3: A grid with 3 rows and 50 columns, with each cell colored on a scale from -1.5 (white) to 0.9 (dark blue); the darker colors are concentrated in seemingly random columns of the first row.
Figure 4: Two images. On the left, several antelope stand in the background of a grassy field. On the right, several zebra graze in a field in the background, with one antelope in the foreground and other antelope in the background.
Figure 5: Two rows of images (five per row). The leftmost column shows two different pictures, each containing a cat and a dog. The remaining columns show saliency maps produced by different techniques (VanillaGrad, InteGrad, GuidedBackProp, and SmoothGrad). Red dots indicate regions influential for predicting "dog" and blue dots indicate regions influential for predicting "cat". All methods except GuidedBackProp show good overlap between the dots and where the animals appear in the image, with SmoothGrad giving the most precise mapping.
Figure 6: The phrase "The nurse examined the farmer for injuries because PRONOUN" shown twice, once with PRONOUN=she and once with PRONOUN=he. Each word is annotated with the importance of three different attention heads. The distribution of which heads are important differs between the two pronouns for all words, but especially for "nurse" and "farmer".
From this plot, we see that sandals are more likely to be confused with T-shirts than with pants. It may also be surprising to see how much these data clouds overlap given their semantic differences. Why might this be?
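One way a 2-D view like this can be produced is by projecting the flattened images with PCA. The sketch below is a rough illustration rather than the lesson's exact code; `id_images`, `id_labels`, and `ood_images` are hypothetical names for the in-distribution and out-of-distribution arrays prepared earlier.

```python
# Rough sketch: fit PCA on flattened ID images and project OOD images
# into the same 2-D space. Array names are hypothetical placeholders.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
id_2d = pca.fit_transform(id_images.reshape(len(id_images), -1))
ood_2d = pca.transform(ood_images.reshape(len(ood_images), -1))

# Fashion-MNIST-style labels assumed: 0 = T-shirt, 1 = pants/trousers
for class_id, name in [(0, "T-shirt (ID)"), (1, "Pants (ID)")]:
    mask = id_labels == class_id
    plt.scatter(id_2d[mask, 0], id_2d[mask, 1], s=5, alpha=0.4, label=name)
plt.scatter(ood_2d[:, 0], ood_2d[:, 1], s=5, alpha=0.4, label="Sandal (OOD)")

plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.show()
```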
Figure 3: ID confusion matrix.
Figure 4: Histograms of ID and OOD data.
Alternatively, for a better comparison across all three classes, we can use a probability density plot. This allows for an easier comparison when the counts across classes lie on vastly different scales (i.e., a max of 35 vs. a max of 5000).
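A density plot like this can be drawn with a kernel density estimate for each group. The following is a minimal sketch, assuming `tshirt_scores`, `pants_scores`, and `ood_scores` (hypothetical names) hold the model's softmax probability for the T-shirt class for each group.

```python
# Rough sketch: kernel density estimates let groups with very different
# sample sizes be compared on the same axes. Score arrays are hypothetical.
import matplotlib.pyplot as plt
import seaborn as sns

for scores, name in [(tshirt_scores, "T-shirt (ID)"),
                     (pants_scores, "Pants (ID)"),
                     (ood_scores, "Sandal (OOD)")]:
    sns.kdeplot(scores, fill=True, alpha=0.3, label=name)

plt.xlabel("Softmax probability of the T-shirt class")
plt.ylabel("Density")
plt.legend()
plt.show()
```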
Figure 5: Probability densities.
Unfortunately, we observe a significant amount of overlap between the OOD data and high T-shirt probabilities. Furthermore, the blue line doesn’t seem to decrease much as you move from 0.9 to 1, suggesting that even a very high threshold is likely to lead to OOD contamination (while also tossing out a significant portion of ID data).
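To make the thresholding concrete, here is a minimal sketch of flagging samples whose top-class softmax probability falls below a cutoff; `probs` is a hypothetical (n_samples, n_classes) array of softmax outputs for the combined ID + OOD test set, not a variable defined in the lesson.

```python
# Rough sketch: treat any sample whose top-class softmax probability falls
# below the threshold as OOD. `probs` is a hypothetical numpy array of
# softmax outputs for the combined ID + OOD test set.
threshold = 0.9
max_probs = probs.max(axis=1)        # confidence in the predicted class
flagged_ood = max_probs < threshold  # True -> rejected as likely OOD

print(f"{flagged_ood.sum()} of {len(probs)} samples fall below the threshold")
```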
Figure 6: Probability densities.
Even with a high threshold of 0.9, we end up with nearly two hundred OOD samples classified as ID. In addition, over 800 ID samples had to be tossed out due to uncertainty.
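Counts like these can be obtained along the following lines; `max_probs_id` and `max_probs_ood` are hypothetical arrays holding the top-class softmax probability for the ID and OOD samples separately.

```python
# Rough sketch: count OOD contamination and discarded ID samples at a fixed
# threshold. `max_probs_id` and `max_probs_ood` are hypothetical numpy arrays
# of top-class softmax probabilities for the ID and OOD sets, respectively.
threshold = 0.9
ood_accepted = (max_probs_ood >= threshold).sum()  # OOD samples kept as ID
id_rejected = (max_probs_id < threshold).sum()     # ID samples tossed out

print(f"OOD samples accepted as ID: {ood_accepted}")
print(f"ID samples rejected as uncertain: {id_rejected}")
```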
Figure 7: OOD-detection metrics vs. softmax thresholds.
Figure 8: Optimized threshold confusion matrix.