by Nick Greenspan

Nicholas Greenspan lives in New York City and is a freshman at Rice University majoring in computer science. He is interested in Machine Learning, Natural Language Processing, and their applications to various fields. He recently worked at the UTHealth School of Biomedical Informatics on a Natural Language Processing project to help doctors find relevant treatments for their patients, and is excited to work on more interesting and meaningful problems at NLMatics. Outside of CS, Nicholas likes to read, play ice hockey, and listen to many genres of music, including indie rock and electronic. Nicholas was one of NLMatics’ 2020 summer interns.

SQuAD 2.0 and Google Natural Questions:

A Comparison and Investigation into Model Performance


SQuAD 2.0 and Google Natural Questions are two of the most prominent datasets in NLP Question Answering today. Both include tens of thousands of training examples, each consisting of a question, a context, and an answer span. Though the two share the same general structure, there are many differences, both major and nuanced, that distinguish them. Whether you want to learn about the datasets for research purposes, or are trying to decide which one to use to train a QA model for your business, this is an in-depth guide to understanding the two datasets, their differences, and their relationships.

Overviews from the Dataset Websites

SQuAD 2.0: “SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.”

Google Natural Questions: “The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question.”

Important takeaways for each dataset

SQuAD 2.0:

  • Questions generated by hired workers whose task was to write questions about a given article.
  • Sets of questions are drawn from paragraphs of a Wikipedia article.
    • 442 different articles.
  • There are unanswerable questions (33.4% of the dataset is unanswerable), which forces a model to “know what it doesn’t know”.
    • SQuAD 2.0 ensures there are plausible answers in the paragraph when the question is unanswerable, so that a model can’t just use superficial clues to determine whether an answer is in the paragraph.
  • All questions end with a question mark.
  • Since there are series of questions that refer to the same article, some of the questions use pronouns such as “where did she grow up?”.
  • There are some ungrammatical sentences such as: “Why political movement was named for Joseph McCarthy?”.
  • There are occasional misspellings.
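As a quick illustration of the unanswerable-question statistic above, here is a small sketch that computes the fraction of impossible questions in a SQuAD 2.0 file. It assumes you have a local copy of the train file (the standard name is train-v2.0.json):

```python
import json

def unanswerable_fraction(squad_path):
    """Compute the share of unanswerable questions in a SQuAD 2.0 JSON file."""
    with open(squad_path) as f:
        data = json.load(f)["data"]
    total = impossible = 0
    for article in data:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1
                if qa["is_impossible"]:
                    impossible += 1
    return impossible / total
```

Run on train-v2.0.json, this should come out around the 33.4% figure quoted above.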

Google Natural Questions:

  • Real, user-generated questions from people who are actually seeking information, not people hired for the explicit purpose of writing questions.
  • Requires a model to read a whole Wikipedia article, not just a paragraph, and has two different tasks: identifying a long answer, which is the paragraph or table that contains the information needed to answer the question, and a short answer, which is the exact text that answers the question.
    • Note that because the articles are so long, BERT models are unequipped to handle them, as their max sequence length is capped at 512 tokens.
  • Not all questions are “questions” per se; some are phrase searches like “benefits of colonial life for single celled organisms”.
  • Questions don’t necessarily have a long answer or a short answer, so some of the unanswerable functionality of SQuAD 2.0 is present here as well.
  • If a question has a short answer, it definitely has a long answer, but not the other way around.
  • There is a “Yes or No” answer field, which is “Yes” if the answer is yes, “No” if the answer is no, and “None” if the question has no yes-or-no answer. If the “Yes or No” answer is not “None”, then there is no short answer.
  • Questions do not end with a question mark.
  • Since there is only one question per article, there is more variety in question topics.
  • There seem to be fewer ungrammatical or misspelled questions.
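The answer-field rules in the last few bullets can be summarized in a small sketch. The field names follow the GNQ JSON schema (including the uppercase "YES"/"NO"/"NONE" values and the -1 candidate_index for a missing long answer), but the annotation here is a plain dict rather than a full dataset record:

```python
def answer_type(annotation):
    """Classify a (simplified) GNQ annotation.

    Mirrors the rules above: a yes/no answer precludes a short answer,
    a short answer always implies a long answer, and a question may
    have neither.
    """
    if annotation["yes_no_answer"] in ("YES", "NO"):
        return "yes_no"
    if annotation["short_answers"]:
        return "short"  # a short answer always sits inside a long answer
    if annotation["long_answer"]["candidate_index"] != -1:
        return "long_only"  # long answer found, but no exact short span
    return "unanswerable"
```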

Datapoint Examples

In version 2 of the dataset, SQuAD 2.0 examples have two formats.

For answerable questions:

{
  "question": "In what country is Normandy located?",
  "id": "56ddde6b9a695914005b9628",
  "answers": [{ "text": "France", "answer_start": 159 }],
  "is_impossible": false
}

For impossible questions:

{
  "plausible_answers": [{ "text": "Normans", "answer_start": 4 }],
  "question": "Who gave their name to Normandy in the 1000's and 1100's",
  "id": "5ad39d53604f3c001a3fe8d1",
  "answers": [],
  "is_impossible": true
}
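Since "answer_start" is a character offset into the context paragraph, a quick sanity check (a hypothetical helper, not part of the official SQuAD tooling) can verify that each span actually matches its text:

```python
def check_answer_span(context, qa):
    """Verify that each SQuAD answer's answer_start offset matches its text."""
    if qa["is_impossible"]:
        return not qa["answers"]  # impossible questions carry no gold answers
    return all(
        context[a["answer_start"]: a["answer_start"] + len(a["text"])] == a["text"]
        for a in qa["answers"]
    )
```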

Google Natural Questions (abbreviated, since the document html and token list are very long):

{
  "annotations": [{
    "annotation_id": 6782080525527814293,
    "long_answer": { "candidate_index": 92, "end_byte": 96948, "end_token": 3538, "start_byte": 82798, "start_token": 2114 },
    "short_answers": [{ "end_byte": 96731, "end_token": 3525, "start_byte": 96715, "start_token": 3521 }],
    "yes_no_answer": "NONE"
  }],
  "document_html": "…</HTML>\n",
  "document_title": "The Walking Dead (season 8)",
  "document_tokens": [
    { "end_byte": 95, "html_token": false, "start_byte": 92, "token": "The" },
    { "end_byte": 103, "html_token": false, "start_byte": 96, "token": "Walking" },
    { "end_byte": 108, "html_token": false, "start_byte": 104, "token": "Dead" },
    …
  ],
  "document_url": "…;oldid=828222625",
  "example_id": 4549465242785278785,
  "long_answer_candidates": [
    { "end_byte": 57620, "end_token": 216, "start_byte": 53609, "start_token": 24, "top_level": true },
    { "end_byte": 53883, "end_token": 36, "start_byte": 53666, "start_token": 25, "top_level": false },
    { "end_byte": 54388, "end_token": 42, "start_byte": 53884, "start_token": 36, "top_level": false },
    …
  ],
  "question_text": "when is the last episode of season 8 of the walking dead",
  "question_tokens": ["when", "is", "the", "last", "episode", "of", "season", "8", "of", "the", "walking", "dead"]
}

Papers for more info on the datasets:
SQuAD 2.0
Google Natural Questions

Dataset Statistics

To help you find the best dataset for your use case, here are some statistics about the different types of questions in each dataset. Note that these numbers are approximate: general rules were used to sort questions into categories, and some questions fell through the cracks, mostly because they start with context and don’t lead with the question word, such as “in september 1849 where did chopin take up residence?”. The “other” category in the table contains all questions that didn’t fall into our general descriptions of the question types. The fact that there are many more “other” questions in Google Natural Questions than in SQuAD 2.0 is due both to the relative size of the dataset and to the greater prevalence of phrase-like “questions” such as “benefits of colonial life for single celled organisms”.
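The "general rules" were along these lines. The sketch below is a hypothetical reconstruction based on the first word of each question, not the exact script we used, and it shows why context-first questions land in "other":

```python
QUESTION_WORDS = {"what", "where", "when", "who", "whom", "why", "which", "how"}

def question_type(question):
    """Bucket a question by its leading question word; everything else is 'other'."""
    words = question.strip().lower().split()
    first = words[0].rstrip("?,") if words else ""
    return first if first in QUESTION_WORDS else "other"
```

For example, “in september 1849 where did chopin take up residence?” starts with “in”, so it is counted as “other” even though it contains “where”.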

Question Type Distribution of Train Sets

| Question type | Google Natural Questions | SQuAD 2.0 |
| --- | --- | --- |
| Total questions | 307,373 | 130,319 |
| What | 17.1% (52,535) | 59.6% (77,701) |
| Where | 10.3% (31,776) | 4.07% (5,303) |
| When | 13.6% (41,725) | 6.38% (8,308) |
| Who | 25.1% (77,281) | 10.4% (13,533) |
| Why | 1.31% (4,041) | 1.44% (1,881) |
| Which | 2.83% (8,721) | 6.21% (8,088) |
| Whom | ~0% (6) | 0.343% (447) |
| How | 5.87% (18,041) | 9.95% (12,969) |
| Other | 22.8% (70,157) | 1.50% (1,954) |

Side Note: The fact that the question distribution is much more balanced for Google Natural Questions than for SQuAD 2.0 is an interesting comment on the types of questions people come up with naturally versus artificially.

Cross Training Experiment

An interesting research question that we wanted to explore was how a model trained on one dataset would perform when tested on the other. This could show whether one dataset allows for better generalization to out-of-domain data, which in turn could help inform a decision about the best dataset for one’s needs.

Since the contexts in the Google Natural Questions (GNQ) dataset consist of whole Wikipedia articles, which are much too long for the BERT models we wanted to use, we decided to use the long answer as the context for training and the short answer as the target answer. Data points that had long answers but no short answers served as “unanswerable” questions. As there are 152,148 data points with long answers in GNQ, our reformatted dataset had 152,148 data points; none of the data points without long answers were used. 29.7% of the GNQ questions with a long answer don’t have a short answer, which is comparable to the 33.4% of SQuAD 2.0 train questions that are unanswerable.

For maximum convenience and fair comparability, we decided to use Huggingface’s script to train our model on our modified GNQ dataset, which meant converting the GNQ dataset into the SQuAD 2.0 format. This involved stripping the text of HTML and finding the correct answer span in the long answer context, among other things. If you want to do this yourself, or are just curious about the specifics, check out part 2 of this guide to model training that I helped write.
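A minimal sketch of one conversion step might look like this. This is a hypothetical helper, not our exact conversion code: it strips HTML with a naive regex and locates the short-answer text inside the cleaned long-answer context:

```python
import re

def gnq_to_squad(question, long_answer_html, short_answer_html):
    """Convert one GNQ (question, long answer, short answer) triple
    into a SQuAD 2.0-style record; a sketch, not the exact pipeline."""
    strip_html = lambda s: re.sub(r"<[^>]+>", "", s).strip()
    context = strip_html(long_answer_html)
    if short_answer_html is None:
        # Long answer but no short answer -> treat as unanswerable.
        return {"question": question, "context": context,
                "answers": [], "is_impossible": True}
    answer = strip_html(short_answer_html)
    start = context.find(answer)
    if start == -1:
        return None  # span not recoverable after cleaning; drop the example
    return {"question": question, "context": context,
            "answers": [{"text": answer, "answer_start": start}],
            "is_impossible": False}
```

A real pipeline also has to handle HTML entities, tables, and whitespace artifacts, which is where most of the fiddly work lives.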

One limitation we came across that caused us to change our training methods slightly was the amount of time it takes to train a model. Since a number of SQuAD 2.0 pretrained models are available for public use, we were able to use a model trained on the full SQuAD 2.0 dataset in this comparison. Since the GNQ dataset is very large, we initially chose to run the training on just 1/10th of the reformatted dataset; later, we trained that model on the next 4/10ths, so that the final model was trained on half of the reformatted GNQ dataset.


Each heading below gives the dataset the model was trained on, followed by its evaluation results on each dev set.

GNQ 1/10th data model
Evaluated on GNQ dev: {'exact': 53.137003841229195, 'f1': 57.66917657734894, 'total': 781, 'HasAns_exact': 40.0, 'HasAns_f1': 46.67854133379154, 'HasAns_total': 530, 'NoAns_exact': 80.87649402390439, 'NoAns_f1': 80.87649402390439, 'NoAns_total': 251, 'best_exact': 53.137003841229195, 'best_exact_thresh': 0.0, 'best_f1': 57.66917657734895, 'best_f1_thresh': 0.0}
Evaluated on SQuAD 2.0 dev: {'exact': 50.61062915859513, 'f1': 51.25986732355112, 'total': 11873, 'HasAns_exact': 12.179487179487179, 'HasAns_f1': 13.479825359737246, 'HasAns_total': 5928, 'NoAns_exact': 88.9318755256518, 'NoAns_f1': 88.9318755256518, 'NoAns_total': 5945, 'best_exact': 50.720121283584604, 'best_exact_thresh': 0.0, 'best_f1': 51.30938713471512, 'best_f1_thresh': 0.0}

GNQ 1/2 data model
Evaluated on GNQ dev: {'exact': 57.01398285972034, 'f1': 63.63002288066548, 'total': 2217, 'HasAns_exact': 50.23380093520374, 'HasAns_f1': 60.03190429287585, 'HasAns_total': 1497, 'NoAns_exact': 71.11111111111111, 'NoAns_f1': 71.11111111111111, 'NoAns_total': 720, 'best_exact': 57.01398285972034, 'best_exact_thresh': 0.0, 'best_f1': 63.63002288066556, 'best_f1_thresh': 0.0}
Evaluated on SQuAD 2.0 dev: {'exact': 47.23321822622758, 'f1': 50.06099675107974, 'total': 11873, 'HasAns_exact': 35.27327935222672, 'HasAns_f1': 40.93694575330119, 'HasAns_total': 5928, 'NoAns_exact': 59.158957106812444, 'NoAns_f1': 59.158957106812444, 'NoAns_total': 5945, 'best_exact': 50.08843594710688, 'best_exact_thresh': 0.0, 'best_f1': 50.75645811811136, 'best_f1_thresh': 0.0}

SQuAD 2.0 model
Evaluated on GNQ dev: {'exact': 30.985915492957748, 'f1': 35.324918748038996, 'total': 781, 'HasAns_exact': 15.283018867924529, 'HasAns_f1': 21.67690857022349, 'HasAns_total': 530, 'NoAns_exact': 64.14342629482071, 'NoAns_f1': 64.14342629482071, 'NoAns_total': 251, 'best_exact': 32.394366197183096, 'best_exact_thresh': 0.0, 'best_f1': 35.43007732875682, 'best_f1_thresh': 0.0}
Evaluated on SQuAD 2.0 dev: {'exact': 75.90331003116314, 'f1': 79.23560349162027, 'total': 11873, 'HasAns_exact': 64.97975708502024, 'HasAns_f1': 71.65390017813893, 'HasAns_total': 5928, 'NoAns_exact': 86.79562657695543, 'NoAns_f1': 86.79562657695543, 'NoAns_total': 5945, 'best_exact': 75.90331003116314, 'best_exact_thresh': 0.0, 'best_f1': 79.23560349162024, 'best_f1_thresh': 0.0}
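For reference, the 'exact' and 'f1' numbers in these dicts come from SQuAD-style string comparison between the predicted and gold answers. A simplified version of the metric in the official SQuAD 2.0 evaluation script looks like this:

```python
import collections
import re
import string

def normalize(s):
    """Lower-case, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

The 'HasAns' and 'NoAns' rows in the dicts are just these scores averaged separately over answerable and unanswerable questions.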


A number of interesting observations and plausible conclusions can be drawn from this data, and the eval results from the models trained on 1/10th and 1/2 of the reformatted GNQ dataset give insight into how performance varies as the model is fed more data.

  • One initial comparison is each model’s performance on the dev set of the dataset it was trained on. The SQuAD 2.0-trained model has a better f1 (79.2) than the GNQ-trained model trained on either 1/10th or 1/2 of the data (f1s of 57.7 and 63.6, respectively).
  • Note that GNQ is a much harder task: state-of-the-art f1 for short answer identification is around 0.64, versus around 0.93 for SQuAD 2.0. Much of the difficulty may have been offset by the fact that we used the long answer as context instead of the whole Wikipedia article, but GNQ’s questions are naturally generated and don’t tend to borrow pieces of the paragraph, which is one way GNQ remains a harder task than SQuAD 2.0.
  • The fact that the GNQ-trained model was only trained on half of its dataset, whereas the SQuAD 2.0-trained model was trained on the whole thing, also probably affected model performance.
  • Though the SQuAD 2.0 model did better on its own dev set than the GNQ model did on its own, the GNQ model did better on the SQuAD 2.0 dev set than the SQuAD 2.0 model did on the GNQ dev set, comparing the overall and has_ans exact and f1 scores. This implies that the GNQ dataset does a better job of instilling a general-purpose language understanding that generalizes to other domains.

One thing to track, which can yield interesting insights into a model’s behavior, is its tendency to answer questions or refrain from answering them.

  • The way the no_ans scores decrease and the has_ans scores increase for the GNQ model on the GNQ eval as the amount of training data grows suggests that, as the model is fed more data, it guesses more often that the question has an answer.
  • This is even clearer for the GNQ model on the SQuAD 2.0 eval. For the 1/10th-data model, the has_ans scores are very low and the no_ans scores are very high, likely demonstrating that the model has little real understanding and simply tends to say the question is unanswerable. For the 1/2-data model, the overall exact and f1 scores are lower, but the model clearly has better understanding and is not just guessing that every question is unanswerable, as the has_ans scores are much higher and the no_ans scores are lower.

A couple of differences between the datasets that may have affected the experimental results are as follows.

  • Unlike SQuAD 2.0 with its plausible answers, GNQ provides no explicit plausible-yet-wrong answer in the long answer of questions that lack a short answer. Though there is no plausible answer per se, the long answer paragraph is necessarily related to the question, so GNQ paragraphs without a short answer may still contain spans similar to the plausible answers found in SQuAD 2.0.
  • Also, when reformatting the dataset, we counted questions with a Yes or No answer as unanswerable, since they don’t have a short answer span.


Hopefully this article gave you further insight into whether SQuAD 2.0 or Google Natural Questions is right for training your model, or simply into the nature of the two datasets. NLP Question Answering is a very exciting area of active research, and there are a number of other interesting datasets out there beyond the two covered here, such as WikiQA, HotpotQA, and NarrativeQA. In the future, I hope to explore these datasets and run more experiments to understand their strengths, weaknesses, and relationships.


SQuAD Website
SQuAD Paper
Google Natural Questions Website
Google Natural Questions Paper