All you need to know about Question-Answering (QA) systems built using NLP

Explore various solutions for building a QA Utility

Blog Overview

The goal of this blog is to explore various solutions for building a Question-Answering (QA) utility. It gives a quick overview of what I gathered from research papers, blogs, and articles: the different classifications of QA systems, their frameworks and components, and the various ways to implement them.

Introduction to QA System

A Question-Answering (QA) system is a natural language processing (NLP) application that is designed to automatically answer questions posed in natural language. QA systems use a variety of techniques, including natural language understanding, information retrieval, and machine learning, to extract the relevant information from a given text corpus and generate an answer to a user’s question. Some QA systems are able to answer questions based on a specific domain or knowledge base, while others are more general and can answer a wide range of questions. They are widely used in fields like customer service, knowledge management, and e-commerce.

Classification of QA System

  • Open vs Closed Domain: Closed-domain question answering (CDQA) is the task of answering questions from a single domain, for example legal, medical, or engineering. Open-domain question answering (ODQA) is the task of answering questions from any domain, so a trained model can be asked about anything.
  • Open-Book vs Closed-Book: An open-book QA system can consult external resources, such as a document collection or knowledge base, when generating answers. A closed-book QA system is limited to the information it has already been provided with and cannot access external resources when answering.
  • Extractive vs Abstractive Answering: An extractive model answers a query by returning the substring of the context that is most relevant to the query. The substring is always found verbatim within the context. An abstractive model paraphrases this substring into a more human-readable form before returning it as the answer to the query.
  • Ability to Answer Factoid vs Non-Factoid Queries: Factoid queries are questions whose answers are short factual statements. Most queries that begin with “who”, “where” and “when” are factoid because they expect concise facts as answers. Non-factoid queries include questions that require logic and reasoning (e.g. most “why” and “how” questions) and those that involve mathematical calculations, ranking, sorting, etc.
  • Conversational vs Non-Conversational: Conversational QA models involve a back-and-forth exchange between the user and the system, with the system generating responses based on the user’s input. These models can use techniques such as dialogue management and natural language generation to produce appropriate responses. They provide a more natural and engaging experience for users but can be more difficult to design and build than non-conversational models.


Components of the QA System

  1. Retriever: Retriever is usually regarded as an Information Retrieval System, with the goal of retrieving related documents or passages that probably contain the correct answer, given a natural language question, as well as ranking them in descending order according to their relevancy. Broadly, current approaches to Retriever can be classified into three categories, i.e.
    • Sparse Retriever: It refers to the systems that search for relevant documents by adopting classical Information Retrieval methods such as TF-IDF and BM25.
    • Dense Retriever:
      • Representation-based Retriever: It encodes the question and each document independently (a dual-encoder design), so document representations can be pre-computed offline and compared with the question representation at query time.
      • Interaction-based Retriever: It takes the question and a document together as input and models token-level interactions between them. This usually improves matching quality but is much more expensive, since nothing can be pre-computed per document.
      • Representation-interaction Retriever: It combines the two: documents are still encoded into pre-computed representations, and a lightweight interaction step between the question and document representations is applied at query time.
    • Iterative Retriever: It aims to search for the relevant documents from a large collection in multiple steps, which is also called Multi-step Retriever. In order to obtain a sufficient amount of relevant documents, the search queries need to vary for different steps and be reformulated based on the context information in the previous step.
  2. Document Post-Processing: The documents returned by the Retriever often include irrelevant ones, and some are so large that they exceed the capacity of the Answer Extraction model. Post-processing addresses both issues. Common approaches include document filtering (to identify and remove noise with respect to a given question), document re-ranking (to sort documents in descending order of how likely they are to contain the correct answer), and document selection (to choose the top relevant documents).
  3. Reader: It is the main feature that differentiates QA systems from other IR (Information Retrieval) or IE (Information Extraction) systems, and it is usually implemented as a neural MRC (Machine Reading Comprehension) model. It aims to infer the answer to the question from a set of ordered documents. An Extractive Reader predicts an answer span within the retrieved text, while a Generative Reader aims to generate answers as naturally as possible instead of extracting answer spans, usually relying on Seq2Seq models. For example, S-Net combines extraction and generation methods so that they complement each other. It first employs an evidence extraction model to predict the boundary of the text span that serves as evidence for the answer, and then feeds that span into a Seq2Seq answer synthesis model to generate the final answer.
  4. Answer Post-Processing: This module helps detect the final answer from a set of answer candidates extracted by the Reader, taking into account their respective supporting facts. The methods adopted in existing systems can be classified into two categories: rule-based methods and learning-based methods.
    • Rule-based answer post-processing involves using a set of pre-defined rules or heuristics to refine or adjust the initial answer generated by the QA system. For example, a rule-based post-processing system might use the presence of certain keywords or phrases in the question to determine whether to include additional information or modify the answer in some way.
    • Learning-based post-processing involves using machine learning techniques to learn from past interactions with the QA system in order to improve the accuracy and relevance of the answers.

Learning-based post-processing can be more effective than rule-based post-processing in situations where the appropriate response to a question may depend on subtle patterns or relationships that are difficult to capture with rules.
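
As a rough illustration of the sparse Retriever described above, the following sketch ranks a few documents against a question using a simple TF-IDF score. It is a minimal sketch rather than a production retriever: the tokenizer, the smoothed IDF formula, the sample documents, and the function names (tokenize, build_index, retrieve) are all assumptions made for this example. Real systems more commonly rely on BM25 through a dedicated search library.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def build_index(docs):
    """Return per-document term counts and a smoothed IDF value per term."""
    doc_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())
    n = len(docs)
    # Smoothed IDF (an assumption of this sketch): rarer terms get larger weights.
    idf = {t: math.log((1 + n) / (1 + f)) + 1.0 for t, f in doc_freq.items()}
    return doc_counts, idf


def retrieve(question, docs, top_k=2):
    """Rank documents by the summed TF-IDF weight of the question terms."""
    doc_counts, idf = build_index(docs)
    q_terms = tokenize(question)
    scored = []
    for i, counts in enumerate(doc_counts):
        length = sum(counts.values()) or 1
        score = sum((counts[t] / length) * idf.get(t, 0.0) for t in q_terms)
        scored.append((score, i))
    # Highest score first; ties are broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:top_k]]


# Made-up sample corpus for illustration only.
DOCS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Cheese is made from milk.",
]

for doc, score in retrieve("What is the capital of France?", DOCS):
    print(f"{score:.3f}  {doc}")
```

The ranked list returned here is what a Document Post-Processing step would then filter or re-rank before handing it to the Reader.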

Retriever-Generator Model

The retriever-generator architecture is used to generate answers to user questions based on information stored in a large database. The retriever component searches the database for relevant information, and the generator component uses that information to generate a natural language answer.

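The generator component is typically implemented using a sequence-to-sequence machine learning model, such as an encoder-decoder transformer, which is trained to generate human-like text. (Encoder-only models such as BERT are better suited to the retriever or to an extractive reader than to generation.) The retriever component may be implemented using a search engine or other information retrieval system.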

The retriever-generator architecture is useful because it allows the QA system to generate answers based on a wide range of information sources, while also providing the flexibility to generate answers in a variety of different formats (e.g., text, tables, graphs). It is a popular choice for building QA systems that need to handle a large volume of queries and provide accurate and relevant answers in real-time.

Rule-based approaches 

One way to build a question-answering (QA) system is through a rule-based approach, where a set of rules is created for the model to follow in order to determine the correct answer to a given question. This approach can be effective for narrow, well-defined types of questions, but it also has its limitations.

Keyword Matching

One example of a rule-based approach is a “keyword matching” method. This approach involves looking for specific keywords in the question and then searching for those keywords in a database of information to find the answer. For example, if the question is “What is the capital of France?”, the model would look for the keywords “capital” and “France” and then find the answer “Paris” in the database. This method can be especially useful for simple factoid questions.
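
The keyword matching idea above can be sketched in a few lines. This is only an illustration under simple assumptions: the stopword list, the tiny FACTS "database", and the function names are hypothetical, and the answer is chosen purely by counting overlapping keywords.

```python
# Words that carry little meaning for matching (a small, hand-picked list).
STOPWORDS = {"what", "is", "the", "of", "a", "an", "who", "where", "when", "in"}

# Hypothetical toy database: each entry maps a set of keywords to an answer.
FACTS = [
    ({"capital", "france"}, "Paris"),
    ({"capital", "germany"}, "Berlin"),
    ({"author", "hamlet"}, "William Shakespeare"),
]


def extract_keywords(question):
    """Lowercase the question, strip punctuation, and drop stopwords."""
    words = "".join(c.lower() if c.isalnum() else " " for c in question).split()
    return {w for w in words if w not in STOPWORDS}


def answer(question):
    """Return the answer whose keywords overlap most with the question, or None."""
    keywords = extract_keywords(question)
    best_answer, best_overlap = None, 0
    for fact_keywords, fact_answer in FACTS:
        overlap = len(keywords & fact_keywords)
        if overlap > best_overlap:
            best_answer, best_overlap = fact_answer, overlap
    return best_answer


print(answer("What is the capital of France?"))
print(answer("How tall is Mount Everest?"))
```

The second question shares no keywords with the database, so the sketch returns None instead of guessing, which also shows how quickly coverage becomes the limiting factor.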

Syntax Parsing

Another example of a rule-based approach is a “syntax parsing” method. This approach involves analyzing the structure of the question to determine what information is being asked for, and then searching for that information in a database. For example, if the question is “Which country is known for its cheese?”, the model would understand that the question is looking for a country and that the answer should be related to cheese. By parsing the syntax, the model can then search for the answer “France” in the database.
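
A full syntactic parser is beyond a short example, so the sketch below swaps in a single hand-written regular expression as a stand-in for parsing. It only recognizes questions shaped like “Which <type> is known for its <attribute>?”, and the pattern, the FACTS table, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical toy database of (entity type, entity, attribute) facts.
FACTS = [
    ("country", "France", "cheese"),
    ("country", "Japan", "sushi"),
    ("city", "Vienna", "coffee houses"),
]

# One hand-written pattern standing in for a real parser.
PATTERN = re.compile(
    r"^which (\w+) is known for (?:its |the )?(.+?)\??$", re.IGNORECASE
)


def parse_question(question):
    """Extract the expected answer type and the attribute being asked about."""
    match = PATTERN.match(question.strip())
    if match is None:
        return None
    return {
        "answer_type": match.group(1).lower(),
        "attribute": match.group(2).lower(),
    }


def answer(question):
    """Look up an entity of the expected type that has the requested attribute."""
    parsed = parse_question(question)
    if parsed is None:
        return None
    for entity_type, entity, attribute in FACTS:
        if entity_type == parsed["answer_type"] and attribute == parsed["attribute"]:
            return entity
    return None


print(parse_question("Which country is known for its cheese?"))
print(answer("Which country is known for its cheese?"))
```

The useful part is the intermediate structure: once the question is reduced to an answer type and an attribute, the database lookup itself is trivial.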


While a rule-based approach can be effective for questions that fit its rules, it can be difficult to cover all possible variations and nuances of language with a fixed set of rules. Additionally, rule-based systems are not as flexible as machine learning-based approaches, which can learn to generate responses based on patterns in large amounts of training data. For example, a rule-based system would struggle to understand a question like “What is the best way to make macaroni and cheese?”, which requires understanding the context and making a subjective judgment.

Linguistic Approach to QA

When it comes to understanding natural language text, one approach is to use linguistic techniques such as tokenization, POS tagging, and parsing. These techniques can help to formulate a user’s question into a precise query that can extract the relevant information from a structured database. This approach is known as a linguistic approach to question answering (QA).


Linguistic approaches can be quite effective for generating abstractive responses to non-factoid questions, particularly when combined with machine learning techniques such as deep learning. However, the performance of these approaches will depend on the quality and coverage of the linguistic resources and knowledge bases being used, as well as the specific task being performed.


Linguistic approaches are heavily dependent on the quality of the data and knowledge bases being used. If the data is inaccurate or incomplete, the generated answer will be inaccurate too. Additionally, because this approach depends heavily on structured data, it may not work well for factoid questions whose short, precise answers are not covered by that data.

Statistical approach

The rise of big data and the internet has made statistical approaches more important in building question-answering (QA) systems. Rather than relying on hand-written rules, these methods learn patterns from large text collections, so they tend to do better than rule-based approaches when plenty of data is available. They can draw on online text repositories and accept natural language queries.

Advantages of Statistical Approach

The availability of huge amounts of data on the internet has made statistical approaches more significant. Because they learn from examples, these methods can cope with varied phrasing that a fixed set of rules would miss. They can also use online text repositories and formulate queries in natural language form, rather than relying on structured query languages.

Techniques Used in Statistical QA Systems

Many statistical-based QA systems use techniques such as Support Vector Machine classifiers, Bayesian Classifiers, and maximum entropy models. These techniques can help the system understand the underlying patterns in the data and generate accurate answers.

Support Vector Machine Classifiers

One statistical technique commonly used in QA systems is support vector machine (SVM) classifiers. SVM classifiers are a type of machine-learning algorithm that can be used for both classification and regression tasks. They work by finding the best boundary or “hyperplane” that separates different classes of data.

Bayesian Classifiers and Maximum Entropy Models

Another statistical technique used in QA systems is Bayesian classifiers. These classifiers use Bayes’ theorem to make predictions based on prior knowledge and new evidence. Maximum entropy models are also used in QA systems, which use the principle of maximum entropy to make predictions about the probability distribution of a set of variables.
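
To make the Bayesian classifier idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier that predicts the expected answer type of a question. This is one common use of such classifiers in QA pipelines, not necessarily how any particular system does it. The class name, the label set (PERSON, LOCATION, DATE), and the six training questions are made up for illustration, and add-one smoothing is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            # Bayes' theorem in log space: log prior + sum of log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training questions labeled with the expected answer type.
TRAIN = [
    ("Who wrote Hamlet?", "PERSON"),
    ("Who is the president of France?", "PERSON"),
    ("Where is the Eiffel Tower?", "LOCATION"),
    ("Where was Mozart born?", "LOCATION"),
    ("When did the war end?", "DATE"),
    ("When was the telephone invented?", "DATE"),
]

model = NaiveBayes().fit([q for q, _ in TRAIN], [label for _, label in TRAIN])
print(model.predict("Who discovered penicillin?"))
print(model.predict("Where is Berlin?"))
```

The predicted answer type can then narrow down which candidate answers a downstream component should consider.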

Neural Networks

Attention-over-Attention (AoA)

One example of a neural network used in QA systems is the Attention-over-Attention (AoA) model, originally proposed for cloze-style reading comprehension. AoA computes a matching score between every word in the context and every word in the question, and derives two attentions from those scores: an attention over the context for each question word, and an attention over the question that indicates how important each question word is. The second attention is placed over the first (hence the name) to produce a final attention over the context words, which is used to pick the answer. During training, the model is fed input-output pairs and its predicted answer is compared to the correct answer.


R-NET

R-NET is an extractive reading comprehension model developed by Microsoft. It is designed to extract an answer span from the input text, rather than generating abstractive responses. It uses a gated attention-based recurrent network to match a question and passage and then predicts the start and end positions of the answer inside the passage.

Reasoning Network

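Microsoft’s Reasoning Network (ReasoNet) is a machine comprehension model from Microsoft Research. Instead of reading the passage a fixed number of times, it performs multiple reasoning steps over the question and passage and uses reinforcement learning to decide when it has read enough to stop and produce an answer. It should not be confused with QnA Maker, a separate Microsoft service for building and deploying QA systems over a knowledge base.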


FusionNet

FusionNet is a neural model that has been developed for machine comprehension, which involves understanding and answering questions about a given piece of text. It is based on a fully-aware attention mechanism, which computes attention using the full history of each word’s representations, from the word embedding up to the higher-level layers, rather than a single layer.


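BiDAF

BiDAF (Bi-Directional Attention Flow) is a closed-domain, extractive Q&A model that is best suited to factoid questions. It computes attention in both directions, from the context to the query and from the query to the context. It requires a context to answer a query, and the answer returned is always a substring of the provided context.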


Seq2Seq

Seq2Seq models consist of an encoder network that processes the input sequence and a decoder network that generates the output sequence. The decoder can use an attention mechanism to selectively focus on different parts of the input sequence when generating the output.

Fig: seq2seq attention model

Q = question

P = paragraph (context)

H_P, H_Q = encoded paragraph and question matrices

H_C = weighted context under the specific question
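
The notation above can be illustrated with a small numeric sketch of dot-product attention. It is a simplified reading of the figure, not the exact model: here the question is summarized as the mean of the rows of H_Q (an assumption of this sketch), each row of H_P is scored against that summary, and H_C is the resulting weighted sum of paragraph rows. All numbers are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query_vec, H_P):
    """Weight each paragraph row by its similarity to the query vector."""
    weights = softmax([dot(query_vec, row) for row in H_P])
    dim = len(H_P[0])
    context = [sum(w * row[d] for w, row in zip(weights, H_P)) for d in range(dim)]
    return weights, context


# Toy encodings: 3 paragraph tokens, 2 question tokens, dimension 2.
H_P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H_Q = [[2.0, 0.0], [0.0, 0.0]]

# Summarize the question as the mean of its token vectors (sketch assumption).
q = [sum(col) / len(H_Q) for col in zip(*H_Q)]
weights, H_C = attend(q, H_P)
print([round(w, 3) for w in weights])
print([round(v, 3) for v in H_C])
```

Paragraph tokens that align with the question receive larger weights, so H_C is dominated by the parts of the paragraph most relevant to the question.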

When it comes to building a question-answering (QA) system with a neural network approach, there are a few key factors to consider:

  • The size and complexity of the data: Some neural network architectures, such as convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), may be better suited to handling large or complex datasets.
  • The type of questions being asked: Some neural network approaches, such as r-NET and BiDAF, have been specifically designed for question answering and may be more effective at handling a wide range of questions.
  • The available resources: Some neural network approaches, such as FusionNet and HyperQA, may require more computational resources and may be less feasible to implement in some cases.
  • The desired level of interpretability: Some neural network approaches, such as AoA and ReasoNet, may be more interpretable and transparent than others, which may be important for certain applications.

Transformer-Based QA system

Transformer-based QA systems have become increasingly popular in recent years. They have achieved state-of-the-art performance on a number of QA benchmarks and are widely used in industry and research. Several well-known QA models are based on transformer architectures.


BERT

One popular transformer-based QA model is BERT (Bidirectional Encoder Representations from Transformers). It is a bidirectional model, which means it can consider the full context of a word by looking at the words that come before and after it. This is particularly useful for understanding the intent behind a query. When it was released, BERT achieved state-of-the-art performance on a wide range of natural language processing tasks, including question answering, natural language inference, and text classification.

RoBERTa: Robustly Optimized BERT

RoBERTa is an extension of BERT that was developed by researchers at Facebook. It keeps BERT’s architecture but changes the training recipe: it is trained on more data, for longer and with larger batches, uses dynamic masking, and drops the next-sentence-prediction objective. This makes it more robust and more effective for a variety of natural language processing tasks.


DistilBERT

DistilBERT is a smaller and more efficient version of BERT that was developed by researchers at Hugging Face. It is trained with knowledge distillation, in which a compact student model learns to reproduce the behavior of the full BERT model, and it achieves similar performance using fewer parameters and less computation.


XLNet

XLNet is a transformer-based language model developed by researchers at Carnegie Mellon University and Google. It is trained on a large dataset of unstructured text using a training objective called “permutation language modeling,” which predicts tokens under many different orderings of the input and so allows the model to better capture the dependencies between words and phrases in both directions.

Graph-based approach 

Another option is a graph-based approach to building a question-answering (QA) system. This approach involves representing entities and the relationships between them in a graph structure and using that structure to reason about those relationships and predict answers to questions.

Knowledge Graph

One example of a graph-based model for question answering is the “knowledge graph” approach. In this approach, the model represents entities and relationships in the form of nodes and edges in a graph. The structure of the graph is then used to reason about the correct answer to a question. For example, if the question is “Who is the president of the United States?”, the model would find the node representing the United States, follow its “president” edge, and return the entity at the other end (“Joe Biden” at the time of writing).
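
A knowledge graph lookup of this kind can be sketched with a nested dictionary. This is a toy illustration only: the KnowledgeGraph class, its method names, the relation names, and the handful of facts are all assumptions made for this example, and real systems use dedicated graph stores and entity linking to map a question onto nodes and edges.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy graph: nodes are entities, labeled edges are relations."""

    def __init__(self):
        # adjacency: subject -> relation -> set of objects
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, subject, relation, obj):
        self.edges[subject][relation].add(obj)

    def query(self, subject, relation):
        """Follow one edge type from a node and return the neighbors."""
        return sorted(self.edges.get(subject, {}).get(relation, set()))

    def query_path(self, subject, relations):
        """Follow a chain of relations, which gives simple multi-hop reasoning."""
        frontier = {subject}
        for relation in relations:
            frontier = {
                o
                for s in frontier
                for o in self.edges.get(s, {}).get(relation, set())
            }
        return sorted(frontier)


kg = KnowledgeGraph()
kg.add("France", "capital", "Paris")
kg.add("France", "known_for", "cheese")
kg.add("Paris", "located_on", "Seine")

# "What is the capital of France?"
print(kg.query("France", "capital"))
# "Which river is the capital of France located on?" (two hops)
print(kg.query_path("France", ["capital", "located_on"]))
```

The second query shows why the graph structure matters: the answer is not stored as a single fact but is reached by chaining two edges.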

Graph Convolutional Network

Another example of a graph-based approach is the “graph convolutional network” (GCN) approach. In this approach, the model applies a convolution-like operation directly to the graph: each node’s representation is updated by aggregating the features of its neighbors, and the resulting node representations are used to make predictions about the correct answer. This handles graph-structured data more naturally than traditional convolutional networks, which assume grid-like inputs such as images, and it has achieved strong performance on various graph-based tasks.
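
The neighbor aggregation performed by a GCN can be sketched as a single layer in plain Python. The sketch assumes the commonly used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W), where A is the adjacency matrix, I adds self-loops, D is the degree matrix, H holds node features, and W holds layer weights. The graph, features, and weights below are made-up, untrained values used only to show the mechanics.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the propagation matrix with self-loops."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbors, transform, apply ReLU."""
    propagated = matmul(normalized_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


# Toy graph: 3 nodes in a chain 0 - 1 - 2.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up node features
weights = [[1.0, -1.0], [0.5, 1.0]]              # made-up, untrained weights

out = gcn_layer(adj, features, weights)
for row in out:
    print([round(v, 4) for v in row])
```

After one layer, each node’s output mixes its own features with those of its direct neighbors; stacking layers lets information travel further across the graph.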


RDF

RDF (Resource Description Framework) is a standardized model for representing data as a graph of relationships between resources. Each fact is stored as a subject-predicate-object triple, so RDF can represent a wide range of information, including facts and relationships between resources. By organizing information in this way, it may be possible to use RDF to represent the knowledge base that a question-answering system relies on, and to use graph-based algorithms or a query language such as SPARQL to search for and retrieve relevant information in response to a user’s question.

Different Approaches, Different Datasets

The specific types of data you need will depend on the approach you are using to build the question-answering system. For example, if you are using a rule-based approach, you may only need a dataset of questions and answers. If you are using an information retrieval approach, you may need a dataset of documents and corresponding answers. And if you are using a machine learning approach, you may need a dataset of contexts, questions, and answers. It’s important to carefully consider the types of data you need and plan accordingly, so that you end up with a high-quality dataset that enables you to build an accurate and reliable question-answering system.

Evaluating QA System Performance

When building a question-answering system, it’s important to have a way to evaluate its performance. There are several metrics that can be used, including:

  • Accuracy: This measures the percentage of questions that the system answers correctly.
  • Precision: This measures the percentage of returned answers that are correct.
  • Recall: This measures the percentage of all correct answers that the system actually returns.
  • F1 Score: A combination of precision and recall, calculated as the harmonic mean of the two. For extractive QA it is often computed over the tokens shared by the predicted and reference answers.
  • MRR: Mean Reciprocal Rank, which averages, over all questions, the reciprocal of the rank at which the first correct answer appears (1 for first place, 1/2 for second, and so on).
  • MAP: Mean Average Precision, which averages, over all questions, the precision measured at each rank where a correct answer appears.
  • Exact match ratio: Percentage of questions for which the system’s answer is an exact match to the correct answer.
  • Top-k accuracy: Percentage of questions for which the correct answer is among the top k answers returned by the system.

These metrics can help to identify areas where the system needs improvement and guide the development process.
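
Several of the metrics above are short enough to sketch directly. The functions below are illustrative only: the normalization (lowercasing and stripping punctuation) is a simplification chosen for this example, the function names are hypothetical, and benchmark evaluation scripts usually apply their own, stricter normalization rules.

```python
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation, and split into tokens."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def exact_match(prediction, truth):
    """1.0 if the normalized prediction equals the normalized truth, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked_answers, truth):
    """1/rank of the first correct answer, or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked_answers, start=1):
        if normalize(candidate) == normalize(truth):
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_ranked, truths):
    return sum(reciprocal_rank(r, t) for r, t in zip(all_ranked, truths)) / len(truths)


def top_k_accuracy(all_ranked, truths, k):
    """Fraction of questions whose correct answer is within the top k."""
    hits = sum(reciprocal_rank(r[:k], t) > 0 for r, t in zip(all_ranked, truths))
    return hits / len(truths)


# Made-up system output: ranked candidate answers for three questions.
ranked = [["Paris", "Lyon"], ["Bonn", "Berlin"], ["Rome"]]
truths = ["Paris", "Berlin", "Madrid"]

print(exact_match("Paris.", "paris"))
print(token_f1("the city of Paris", "Paris"))
print(mean_reciprocal_rank(ranked, truths))
print(top_k_accuracy(ranked, truths, k=1), top_k_accuracy(ranked, truths, k=2))
```

Exact match and token-level F1 judge a single returned answer, while MRR and top-k accuracy judge a ranked list, so the right metric depends on whether the system returns one answer or several candidates.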

Challenges in Q&A

Question-answering systems come with their own set of challenges. Some of the major ones include:

  • Context Extraction: Retrieving the passages that are actually relevant to a question, and locating the answer within them, can be complex.
  • Data Preprocessing: The data can be unstructured and contain irrelevant information, making it difficult to extract useful information for the Q&A system.
  • Lack of Labeled Data: It may be challenging to obtain a sufficient amount of labeled data, as the data may be informal and may not always contain clear answers.
  • Ambiguity and Subjectivity: The data may contain ambiguous or subjective language, making it difficult for a model to determine the correct answer.
  • Contextual Understanding: To accurately answer a question, the model must understand the context in which the question is being asked and how it relates to the information in the data.
  • Changes in Language and Content: The language and content of data can change over time, making it difficult for the model to accurately answer questions based on older data and requiring retraining.
