Deep learning for understanding text data and related applications

Introduction

The volume of text data has been increasing rapidly in recent years. Businesses generate huge amounts of text data in the form of social media comments, customer service chat history, survey comments, internal FAQs, knowledge articles and more. Humans can easily read and understand individual pieces of text, but cannot scale to large volumes of data. In recent years, many NLP techniques have been developed to automatically process large amounts of text data for various use-cases.

Applications of text data:

  • Extracting key phrases and sentiment from user reviews
  • Summarising documents
  • Chatbots
  • Semantic text matching use-cases, such as finding the closest questions in an FAQ database or surfacing relevant articles from a knowledge base
  • Document classification
  • Extracting structured data (attributes/entities) from unstructured text
  • Sentence auto-complete for customer service agents

 

Approaches to featurize text data

We will discuss a few approaches to transform text data into numerical features, so that they can be fed into machine learning or NLP algorithms. 

Traditional Approaches:

Bag of words:

We first prepare the vocabulary, which is the list of unique words in the corpus of documents. Each sentence or document can then be represented by a vector of integers, where each entry counts the number of times the corresponding token appears in that document.

Example:

[Figure: Bag-of-words representation]
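As an illustrative sketch (the toy corpus below is made up and not from the article), a bag-of-words representation can be computed with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of three short documents (illustrative only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Build the vocabulary and count how often each token appears in each document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)          # sparse document-term count matrix

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(X.toarray())                            # one count vector per document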

TF-IDF:

The TF-IDF (term frequency-inverse document frequency) technique incorporates the importance of a word across the document corpus by computing the inverse document frequency (IDF). If a word is present in a large number of documents, its importance is reduced.

          IDF(w) = log(C / C_w)

where C = total number of documents and C_w = number of documents containing word w.

          TF(w, d) = n(w, d) / N_d

where n(w, d) = number of times word w appears in document d and N_d = total number of words in document d.

          TF-IDF(w, d) = TF(w, d) * IDF(w)

[Figure: TF-IDF example]
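A similar sketch for TF-IDF on the same made-up corpus; note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalises each document vector by default, so its values differ slightly from the plain log(C / C_w) formula above:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same toy corpus as in the bag-of-words sketch (illustrative only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# TfidfVectorizer combines term counts with IDF weighting
# (by default: smoothed IDF and L2-normalised rows)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))   # TF-IDF weight of each vocabulary term per document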

 

Deep Learning-Based Approaches:

The main drawback of the traditional approaches is that they consider neither the semantic meaning of words nor the order of the words in the document. Deep learning-based dense embedding techniques have been proposed to overcome this issue.

Word Embeddings: Many dense embedding techniques such as word2vec, GloVe and fastText have been proposed in the literature. The objective of these techniques is to represent each word by a dense vector of a few hundred real numbers, such that the vectors of semantically similar words lie close to each other in this n-dimensional space.

Word2vec architecture: https://arxiv.org/pdf/1301.3781v3.pdf
[Figure: Word2vec architecture]
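As a hedged example (the article does not prescribe a specific library or checkpoint), pre-trained word2vec vectors can be loaded through gensim's downloader; the Google News model named below is an assumption for illustration:

import gensim.downloader

# Load pre-trained word2vec vectors (assumed checkpoint: 300-dimensional
# vectors trained on Google News; downloads on first use)
word_vectors = gensim.downloader.load("word2vec-google-news-300")

# A single word is represented by a 300-dimensional dense vector
print(word_vectors["computer"].shape)

# Semantically similar words lie nearby in the embedding space
print(word_vectors.most_similar("computer", topn=5))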

Sentence Embeddings: One approach to generating sentence embeddings is to simply average the vectors of the words in a sentence, but such averaged embeddings do not perform very well on downstream tasks like semantic matching. BERT is a popular deep learning model, based on the transformer architecture, that can generate sentence embeddings directly.

BERT: https://arxiv.org/abs/1810.04805
[Figure: BERT]
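A minimal sketch of generating sentence embeddings; the sentence-transformers library and the model name used here are illustrative assumptions, and any BERT-based sentence encoder could be substituted:

from sentence_transformers import SentenceTransformer

# Assumed pre-trained sentence-embedding model (illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover account access",
]

# Each sentence is mapped to a fixed-size dense vector
embeddings = model.encode(sentences)
print(embeddings.shape)   # (number of sentences, embedding dimension)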

 

Pre-trained word2vec and BERT models, trained on large amounts of text data (such as Wikipedia), are made available by various NLP groups and can be used directly to generate embeddings for our text data.


 

Application - Semantic Text Matching

Let’s look at one business application of text data: semantic text matching, the task of estimating the semantic similarity between source and target pieces of text. Semantic text matching can be used in multiple use-cases. Let’s understand it with the use-case of finding the closest question. We are given a large corpus of questions, and for any new question that is asked or searched, the goal is to find the most similar questions in this corpus. Semantic meaning is an important aspect of this task. For example, in the figure below, the question “what is the step by step guide to invest in the real estate market” is not very relevant to the source question, even though the two questions have many words in common.

[Figure: Finding closest questions]
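A small sketch of the closest-question idea, assuming the sentence-transformers library (an illustrative choice, not specified in the article); the questions are paraphrases of the example above, and a semantically closer candidate should receive a higher cosine-similarity score even when word overlap is similar:

from sentence_transformers import SentenceTransformer, util

# Assumed pre-trained sentence-embedding model (illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

source = "What is the step by step guide to invest in share market in India?"
candidates = [
    "What is the step by step guide to invest in share market?",
    "What is the step by step guide to invest in the real estate market?",
]

# Embed the source question and the candidate questions
source_embedding = model.encode(source, convert_to_tensor=True)
candidate_embeddings = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the source and each candidate question
scores = util.cos_sim(source_embedding, candidate_embeddings)
print(scores)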

The following table lists some applications of semantic text matching in the domains of web search, sponsored search, question answering and product recommendation.

[Table: Applications of semantic text matching]

Building a large-scale semantic matching system:

To balance scale and quality, large-scale semantic text matching systems generally follow a two-step approach. We will use knowledge base search as an example to illustrate the system, although the techniques discussed apply to any other semantic matching use-case as well. In knowledge base search, we have a corpus of knowledge articles (also called documents), and the goal is to return the top knowledge articles in response to a user query.

Step 1 - Candidate Generation:

The first step is candidate generation. The goal of this step is to quickly retrieve a few hundred articles/documents that are relevant to the input query. The idea is to use simple techniques to make this step very fast, even if that results in some irrelevant documents being selected.

[Figure: Candidate generation]

Sentence embedding techniques (like a pre-trained BERT model) combined with nearest-neighbour search libraries (like FAISS) can be used for efficient candidate generation.

[Figure: Candidate generation using BERT and FAISS]
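A minimal candidate-generation sketch combining sentence embeddings with a FAISS index; the model name and the tiny document corpus are illustrative assumptions:

import faiss
from sentence_transformers import SentenceTransformer

# Assumed pre-trained sentence-embedding model (illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Tiny made-up knowledge article corpus; in practice this holds many documents
documents = [
    "How to categorise business expenses",
    "Steps to file quarterly taxes",
    "Connecting a bank account to your accounting software",
]

# Embed all articles once, offline
doc_embeddings = model.encode(documents, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(doc_embeddings)               # normalise so inner product = cosine similarity

index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)

# At query time, embed the query and retrieve the top-k candidate documents
query_embedding = model.encode(["how do I track my expenses"], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query_embedding)

scores, ids = index.search(query_embedding, 2)   # top-2 candidates
print([(documents[i], float(s)) for i, s in zip(ids[0], scores[0])])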

Step 2 - Reranking:

Once we have a few hundred candidate articles/documents, the next step is reranking. The goal of the reranking step is to rank the candidate documents with respect to the input query, so that the top-ranked documents can be shown to the user.

[Figure: Reranking]

Duet and BERT-ranking are two deep learning models that can be used to rerank the candidate documents and obtain the top documents. These models can be trained on historical click data for the specific use-case.

Duet: https://arxiv.org/abs/1610.08136

BERT-Ranking: https://arxiv.org/abs/1905.09217
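A minimal reranking sketch using a BERT-style cross-encoder from the sentence-transformers library; the pre-trained MS MARCO checkpoint named below is an illustrative assumption and not the exact Duet or BERT-ranking model from the papers above (in practice such a reranker would be fine-tuned on the use-case's historical click data):

from sentence_transformers import CrossEncoder

# Assumed pre-trained cross-encoder checkpoint (a BERT-style pointwise reranker)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I track my expenses"
candidates = [
    "How to categorise business expenses",
    "Steps to file quarterly taxes",
]

# Score each (query, document) pair and sort candidates by predicted relevance
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)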

 

Further Reading

 

Authors:

Naveen K Kaveti, Data Scientist, Intuit
Shrutendra Harsola, Data Scientist, Intuit

 

