Image

Multi-discipline college-level reasoning problems

59.4% (0-shot, pass@1)

Gemini Ultra (pixel only*)

56.8% (0-shot, pass@1)

GPT-4V

Natural image understanding

77.8% (0-shot)

Gemini Ultra (pixel only*)

77.2% (0-shot)

GPT-4V

OCR on natural images

82.3% (0-shot)

Gemini Ultra (pixel only*)

78% (0-shot)

GPT-4V

Document understanding

90.9% (0-shot)

Gemini Ultra (pixel only*)

88.4% (0-shot)

GPT-4V (pixel only)

Infographic understanding

80.3% (0-shot)

Gemini Ultra (pixel only*)

75.1% (0-shot)

GPT-4V (pixel only)

Mathematical reasoning in visual contexts

53% (0-shot)

Gemini Ultra (pixel only*)

49.9% (0-shot)

GPT-4V

Video

English video captioning

(CIDEr)

62.7 (4-shot)

Gemini Ultra

56 (4-shot)

DeepMind Flamingo

Video question answering

54.7% (0-shot)

Gemini Ultra

46.3% (0-shot)

SeViLA

Audio

Automatic speech translation

(BLEU score)

40.1 Gemini Pro

29.1 Whisper v2

Automatic speech recognition

(based on word error rate, lower is better)

7.6% Gemini Pro

17.6% Whisper v3