
Deep learning for understanding text data and related applications

Introduction

The volume of text data has been increasing rapidly in recent years. Businesses generate huge amounts of text data in the form of social media comments, customer service chat history, survey comments, internal FAQs, knowledge articles, etc. Humans can easily process and understand small amounts of text, but manual processing does not scale to large volumes of data. In recent years, many NLP techniques have been developed to automatically process large amounts of text data for various use-cases.

Applications of text data:

  • Extracting key phrases and sentiments from user reviews
  • Summarising documents
  • Chatbots
  • Semantic text matching use-cases, such as finding the closest questions in an FAQ database or surfacing relevant articles from a knowledge base
  • Document classification
  • Extracting structured data (attributes/entities) from unstructured text
  • Sentence auto-complete for customer service agents

 

Approaches to featurize text data

We will discuss a few approaches to transform text data into numerical features, so that they can be fed into machine learning or NLP algorithms. 

Traditional Approaches:

Bag of words:

We first prepare the vocabulary, which is the list of unique words in the corpus of documents. Each sentence or document can then be represented as a vector of integers, where each entry counts the number of times the corresponding vocabulary word appears in that document.

Example: (Figure: bag-of-words representation of a sample corpus)
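As a quick illustration, below is a minimal sketch of a bag-of-words representation in Python using scikit-learn's CountVectorizer; the library choice and the toy corpus are illustrative assumptions, as the article does not prescribe a specific tool.

# Minimal sketch: build a vocabulary and count word occurrences per document.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the customer loved the product",
    "the customer reported a product issue",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary (unique words in the corpus)
print(counts.toarray())                    # per-document word count vectors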

TF-IDF:

The TF-IDF (term frequency - inverse document frequency) technique incorporates the importance of a word in a document corpus by computing the IDF (inverse document frequency). If a word is present in a large number of documents, its importance is reduced.

          IDF(w) = log(C / Cw)

where C = total number of documents, and Cw = number of documents containing the word w.

          TF(w, d) = n(w, d) / Nd

where n(w, d) = number of times word w appears in document d, and Nd = total number of words in document d.

          TF-IDF(w, d) = TF(w, d) * IDF(w)

(Figure: TF-IDF example)
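A similar sketch computes TF-IDF with scikit-learn's TfidfVectorizer (again an illustrative choice; note that scikit-learn uses a smoothed variant of the IDF formula above and applies L2 normalisation, so the exact values differ slightly).

# Minimal sketch: TF-IDF weighting of the same toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the customer loved the product",
    "the customer reported a product issue",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # TF-IDF weights per document (smoothed IDF, L2-normalised)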

 

Deep learning based Approaches:

The main drawback of traditional approaches is that they consider neither the semantic meaning of words nor the order of words in the document. Deep learning based dense embedding techniques have been proposed to overcome this limitation.

Word Embeddings: Many dense embedding techniques such as word2vec, GloVe and fastText have been proposed in the literature. The objective of these techniques is to represent each word by a dense vector of a few hundred real numbers, such that vectors for semantically similar words lie close together in the embedding space.

(Figure: Word2vec architecture; paper: https://arxiv.org/pdf/1301.3781v3.pdf)
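As a minimal sketch, pre-trained word vectors can be loaded with the gensim library; the library and the specific model name below are illustrative assumptions rather than choices made in the article.

# Minimal sketch: load pre-trained word vectors and inspect word similarity.
# The gensim library and the "glove-wiki-gigaword-50" model are assumptions.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")      # small pre-trained GloVe vectors

print(model.most_similar("computer", topn=5))   # nearest words in the embedding space
print(model.similarity("computer", "laptop"))   # cosine similarity between two words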

Sentence Embeddings: One approach to generating sentence embeddings is to simply average the vectors of the words within a sentence, but such averaged embeddings do not perform very well on downstream tasks like semantic matching. BERT is a popular deep learning model, based on the transformer architecture, that can be used to generate sentence embeddings directly.

(Figure: BERT; paper: https://arxiv.org/abs/1810.04805)

 

Pre-trained word2vec and BERT models, trained on large amounts of text data (such as Wikipedia), are made available by various NLP groups and can be used directly to generate embeddings for our text data.
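For example, a pre-trained BERT-style sentence encoder can be used through the sentence-transformers library. The snippet below is a hedged sketch; the library and model name are illustrative assumptions.

# Minimal sketch: encode sentences into dense vectors and compare them.
# The sentence-transformers library and the model name are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # pre-trained sentence encoder

sentences = [
    "How do I reset my password?",
    "What are the steps to recover my account password?",
    "What is the step by step guide to invest in real estate?",
]
embeddings = model.encode(sentences)              # one dense vector per sentence

# Cosine similarity of the first sentence against the other two:
# the semantically similar question should score much higher.
print(util.cos_sim(embeddings[0], embeddings[1:]))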


 

Application - Semantic Text Matching

Let’s talk about one business application of text data, semantic text matching, which is the task of estimating the semantic similarity between source and target text pieces. Semantic text matching can be used in multiple use-cases. Let’s understand it with the use-case of finding the closest question: given a large corpus of questions, for any new question that is asked or searched, the goal is to find the most similar questions in this corpus. Semantic meaning is an important aspect of this task. For example, in the figure below, the question “what is the step by step guide to invest in the real estate market” is not very relevant to the source question, even though the two questions share many words.

(Figure: Finding the closest questions)

The following table lists some applications of semantic text matching in the domains of web search, sponsored search, question answering and product recommendation.

(Table: Applications of semantic text matching)

Building large scale semantic matching system:

To balance scale and quality, large-scale semantic text matching systems generally follow a two-step approach. We will use knowledge base search as an example to illustrate the system, though the techniques discussed are applicable to any other semantic matching use-case as well. In knowledge base search, we have a corpus of knowledge articles (also called documents), and the goal is to return the top knowledge articles in response to a user query.

Step 1 - Candidate Generation:

The first step is candidate generation. The goal of this step is to quickly retrieve a few hundred articles/documents that are relevant to the input query. The idea is to use simple techniques to make this step very fast, even if that results in some irrelevant documents being selected.

(Figure: Candidate generation)

Sentence embedding techniques (such as a pre-trained BERT model) combined with nearest neighbour search techniques (such as FAISS) can be used for efficient candidate generation.

(Figure: Candidate generation using BERT and FAISS)
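A minimal sketch of this step, assuming the sentence-transformers and faiss libraries and an illustrative pre-trained model, could look as follows.

# Minimal sketch: embed documents with a pre-trained sentence encoder and
# index them with FAISS for fast nearest-neighbour candidate generation.
# Library and model names are assumptions, not choices made in the article.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to file a tax return online",
    "Steps to reset a forgotten password",
    "How to categorise business expenses",
]
doc_vectors = model.encode(documents, normalize_embeddings=True).astype("float32")

# With normalised vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

query_vector = model.encode(["I forgot my login password"],
                            normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vector, 2)       # top-2 candidate documents
print([documents[i] for i in ids[0]], scores[0])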

Step 2 - Reranking:

Once we have a few hundred candidate articles/documents, the next step is reranking. The goal of the reranking step is to rank the candidate documents with respect to the input query, so that the top-ranked documents can be shown to the user.

(Figure: Reranking)

Duet and BERT-Ranking are two deep learning models that can be used to rerank the candidate documents and obtain the top documents. These models can be trained on historical click data for the specific use-case.

(Figure: Duet architecture; paper: https://arxiv.org/abs/1610.08136)
(Figure: BERT-Ranking architecture; paper: https://arxiv.org/abs/1905.09217)
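As a rough sketch of the reranking step, a BERT-based cross-encoder can jointly score each (query, candidate) pair and sort candidates by score. The snippet below uses the sentence-transformers CrossEncoder class with an illustrative pre-trained model; it is a stand-in for models like Duet or BERT-Ranking, not their exact implementations.

# Minimal sketch: rerank candidate documents with a cross-encoder.
# The library and the pre-trained model name are assumptions; in practice the
# model would be fine-tuned on historical click data for the use-case.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "I forgot my login password"
candidates = [
    "Steps to reset a forgotten password",
    "How to file a tax return online",
]
scores = reranker.predict([(query, c) for c in candidates])   # one relevance score per pair

# Sort candidates by score, highest first, and show the top results to the user.
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(ranked)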

 


Authors: 

Naveen K Kaveti, Data Scientist, Intuit
Shrutendra Harsola, Data Scientist, Intuit

 

