Persia, a Context-aware Intelligent Search Engine for the Enterprise
Search is increasingly a key productivity enhancer for the enterprise, and it is normally associated with querying unstructured data such as natural-language documents, while structured data is usually queried through SQL. At Course5, we have built Persia, a state-of-the-art, NLP-driven intelligent, interactive, and semantic search engine for interacting with structured (tabular) data in natural language. Business users can ask questions in natural language and consume insights in the form of Executive Summaries, Charts, Exceptions, Early Warning Signals, and Recommendations. Business users usually work through multiple reports and dashboards loaded with data; Intelligent Search is the capability within Augmented Analytics that lets them simply ask the questions they need answered, without going through all those reports and dashboards.

An effective business analyst must be able to comprehend the questions posed by a requester and apply hard skills, such as querying databases, creating charts, and writing narratives, to answer them. We call this a long process, since understanding the question, processing the data, and constructing the answer all take time. Figure 1 illustrates this flow.

Figure 1: Daily workflow of a Business Analyst

Sometimes questions are straightforward and the business analyst needs to resolve them quickly, which requires being very familiar with the data so that induction comes easily. We call this the short process. Essentially, Persia can imitate both the short and the long process, quickly and with great accuracy.

How does Persia support these processes? A key part of the answer is the semantic search engine behind it. Semantic search refers to a search engine's ability to consider the intent and contextual meaning of search words, phrases, or sentences when serving content to users. For domain-specific clients, Persia understands user requests and provides accurate, user-friendly answers with or without querying a database. The processes a business analyst works with can be automatically converted into knowledge, such as pre-generated questions and answers, knowledge graphs, etc., that is accessible to the search engine.

This leads to several key questions: What is the best method to accomplish this? How much data is needed to train such a model? And how fast can it be? The following sections provide an overview.

Design of the Model

A powerful semantic search engine relies on the ability to generate sentence embeddings from user queries that capture rich contextual information. With these embeddings, items with high relevance scores can be found directly within a scalable candidate database (10K-1B items). Given a question such as "How did we perform this quarter?", Persia first looks for key metrics such as revenue, volume, conversion rate, or their equivalents, then retrieves the most semantically similar question(s) based on those metrics. Using the selected question(s), the appropriate data can be fetched for analysis and reporting. However, this could be very slow for a large question database, e.g., 10M to 1B entries. To resolve this, several nearest-neighbour libraries, such as Faiss, Elasticsearch, Hnswlib, and Annoy, are integrated to ensure fast performance at scale. Additionally, related answers can be cached, since many questions are asked frequently, as in an FAQ system.
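The retrieval idea above, encode the question bank once, then answer each query with similarity scores against the cached embeddings, can be sketched as follows. The `embed` function here is a hypothetical hash-based stand-in for a real sentence encoder, used only so the sketch is self-contained:

```python
import hashlib
import numpy as np

def embed(texts):
    # Stand-in for a real sentence encoder; tokens are hashed
    # into a small dense vector (illustration only).
    vecs = np.zeros((len(texts), 32))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % 32
            vecs[i, bucket] += 1.0
    # L2-normalise so a dot product equals cosine similarity
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Pre-generated questions whose answers are cached, as in an FAQ system
questions = [
    "what was the revenue this quarter",
    "how did the conversion rate change last month",
    "show me the top products by volume",
]
index = embed(questions)  # built once, reused for every query

def search(query, top_k=1):
    scores = index @ embed([query])[0]  # cosine similarity to each question
    best = np.argsort(-scores)[:top_k]
    return [(questions[i], float(scores[i])) for i in best]
```

In production, the brute-force dot product is what a nearest-neighbour index (Faiss, Hnswlib, Annoy) replaces once the question bank grows to millions of entries.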

This raises the next question: how do we derive such sentence embeddings, and how many data points does it require? Figure 2 shows a diagram of the model.

Figure 2: Workflow of the Search Model

The purple box on the bottom right is the final model used to obtain sentence embeddings; it has a bi-encoder structure (Figure 3, left).

Figure 3: The left image shows the structure of a Bi-encoder, while the right image is a Cross-encoder.

As a first step, a large-scale pre-trained language model (PLM) is further trained on the domain corpus without using labels. In addition to the default pre-training task, masked language modelling, several auxiliary tasks were added to enhance the model's representational power, including word-level ordering [1], sentence order prediction (SOP) [2], and a broadly useful objective, causal language modelling (CLM). The word-structure objective must be applied carefully, as it may hinder some downstream tasks. The sentence-structure objective, on the other hand, improves performance on sentence-pair tasks such as Stanford Natural Language Inference (SNLI), Multi-Genre NLI (MNLI), and Quora Question Pairs (QQP), which is exactly what we aim for in Persia's AI-powered search. It is worth noting that different pre-training approaches, such as whole-word masking [3] and selective masking [4], were also used to further improve the model; these strategies are not discussed in detail here.
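To make the whole-word masking idea concrete, here is a minimal sketch over BERT-style subword pieces (the `##` continuation convention and the `[MASK]` token are assumptions from that tokeniser family, not a description of our exact pipeline): when a word is chosen for masking, every subword piece of that word is masked together.

```python
import random

def whole_word_mask(pieces, mask_prob=0.3, mask_token="[MASK]", seed=0):
    """Mask whole words: a piece starting with '##' continues the
    previous word and is masked together with it."""
    rng = random.Random(seed)
    # Group subword piece indices into words
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    out = list(pieces)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:  # mask every piece of the chosen word together
                out[i] = mask_token
    return out
```

Plain token-level masking can leave half a word visible (e.g., `embed` masked but `##ding` kept), which makes the prediction task artificially easy; grouping by word removes that leak.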

After further pre-training, two additional models, a bi-encoder and a cross-encoder (with different objectives, Figure 3), were fine-tuned on the labelled domain data (the augmented gold corpus is continually used to fine-tune the cross-encoder). A bi-encoder computes the representation of a single input sequence/sentence; given a pair of input sentences, a Siamese network architecture [5] computes the similarity between their representations. In contrast, a cross-encoder receives both sentences simultaneously and outputs a value between 0 and 1 indicating the similarity of the pair. The main difference is that cross-encoders do not produce sentence embeddings, which makes them inefficient for retrieval: they require labelled sentence pairs and are very slow when searching a large database. Table 1 compares the speed of the two models over 50 million sentences on a Tesla V100 8 GB GPU.

Table 1. Speed Comparison

Model Overall Speed (mins) Average Speed (seconds/sample)
Cross-Encoder 3900 0.468
Bi-Encoder 0.08 0.00001
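The efficiency gap in Table 1 follows directly from the two interfaces. A bi-encoder encodes each sentence once, caches the embeddings, and compares them with a cheap dot product; a cross-encoder needs a full forward pass for every (query, candidate) pair and can cache nothing. A rough cost sketch (forward-pass counts only, a simplification of real latency):

```python
import numpy as np

def encoder_passes_bi(n_database, n_queries):
    # Bi-encoder: encode the database once, cache the embeddings,
    # then one extra pass per incoming query.
    return n_database + n_queries

def encoder_passes_cross(n_database, n_queries):
    # Cross-encoder: every (query, candidate) pair needs its own pass.
    return n_database * n_queries

def cosine(u, v):
    # Comparing two cached bi-encoder embeddings costs only a dot product
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Each additional query costs the bi-encoder one forward pass, but costs the cross-encoder one pass per database entry, which is why the cross-encoder is reserved for scoring a small set of pairs rather than searching the whole database.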

Despite their slower encoding, cross-encoders generally perform better on sentence-scoring tasks. Leveraging this, the cross-encoder was used to improve the fine-tuned bi-encoder. Specifically, the fine-tuned cross-encoder assigned pseudo (soft) labels to silver sentence pairs, and a variety of sampling strategies were applied to reduce the number of sentence pairs, which would otherwise incur high computational overhead (up to n×(n−1)/2 possible combinations). In brief, the target data comprises a small fraction of labelled data (gold) and a large fraction of weakly labelled, initially unlabelled data (silver) from the fine-tuned cross-encoder. Finally, we further fine-tuned the bi-encoder on the extended training dataset (augmented gold + silver) and used the same data for knowledge distillation and pruning to produce a deployment-friendly student model. An interesting finding is that different training objectives, e.g., triplet loss and hard-negatives ranking loss, can be created from the target data, and these help the bi-encoder converge. Overall, the loss of the final bi-encoder model is
L = λ1·L1 + λ2·L2 + λ3·L3, where λ1 + λ2 + λ3 = 1 and the λi are hyper-parameters.
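A minimal sketch of such a weighted multi-objective loss. The specific component losses here (a distillation MSE term, a triplet term, a ranking term) and the weight values are illustrative assumptions, not the exact training recipe:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Pull the anchor towards the positive, push it away from the negative
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def combined_loss(losses, lambdas):
    # Weighted sum of component losses; the weights must sum to 1
    assert abs(sum(lambdas.values()) - 1.0) < 1e-9
    return sum(lambdas[name] * losses[name] for name in losses)

# Hypothetical component losses for one batch
losses = {"distill_mse": 0.10, "triplet": 0.30, "ranking": 0.20}
lambdas = {"distill_mse": 0.5, "triplet": 0.3, "ranking": 0.2}
total = combined_loss(losses, lambdas)  # 0.05 + 0.09 + 0.04 = 0.18
```

The triplet term goes to zero once the negative is already `margin` farther from the anchor than the positive, so well-separated examples stop contributing gradient, which is what makes hard-negative mining worthwhile.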

This completes the brief overview of the model design; a few more things must be addressed to make it a reasonable production-ready model:

  • Augmentation
  • Pair sampling strategies
  • Error-Analysis
  • Model Compression

Customer data is quite domain-specific, which means not much unique data is available. The good news is that many public datasets match some of our requirements, and some domain data can help make the model more accurate. For best results, it is crucial to apply several different data augmentation techniques. The four main methods are listed below; each helps the model learn different styles of sentences, which in deep-learning terms increases model robustness.

  • Paraphrasing
  • Back-translation
  • Unsupervised style transfer
  • Token shuffling, cut-off, and dropout
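The last family of methods, token shuffling and cut-off/dropout, can be sketched in a few lines. The window size and drop probability below are illustrative defaults, not tuned values:

```python
import random

def token_shuffle(tokens, window=3, seed=0):
    # Shuffle tokens inside small windows so local word order varies
    # while the overall sentence stays recognisable.
    rng = random.Random(seed)
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return out

def token_cutoff(tokens, drop_prob=0.2, seed=0):
    # Randomly drop tokens, a token-level analogue of dropout
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept or list(tokens)  # never return an empty sentence
```

Because these perturbations keep the sentence's meaning, the augmented copy can reuse the original's label, which is what makes them cheap training-pair generators.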

The release of AugLy, another data augmentation tool, coincided with our completing the model, so it was also adopted to make the model more robust. It essentially creates more visual variations of a sequence; for example, given the input sentence "Course5 provides cutting-edge AI solutions for the marketing industry", AugLy produces several perturbed variants.

Sampling can be a tricky process when building weak labels with the fine-tuned cross-encoder, since scoring all combinations is not only computationally expensive but also does not significantly improve performance. Methods such as semantic search, BM25 sampling, and kernel density estimation can be used to solve this problem.
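BM25 sampling picks lexically close candidates as pairs worth scoring with the cross-encoder. A self-contained Okapi BM25 scorer (a standard formulation; in practice a library or an Elasticsearch index would do this at scale):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised doc in `docs` against the tokenised `query`
    with Okapi BM25; higher means more lexically similar."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency of each term
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Taking the top-scoring candidates per query keeps the pairs fed to the cross-encoder far below the n×(n−1)/2 worst case while still surfacing plausible (and hard-negative) matches.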

A crucial step after designing a prototype model is error analysis, which attempts to determine how, why, and when the prototype fails. The findings point us in the right direction for improving the model and let us assess whether it is robust after data augmentation. Several handy open-source tools help here, such as TextAttack [6], Errudite [7], TextFlint [8], and CheckList [9].

Finally, even a well fine-tuned model is not ready for production; a natural solution is to apply a series of model compression techniques such as knowledge distillation, pruning, and quantization.
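Of these, quantization is the easiest to show in miniature. Below is a sketch of symmetric per-tensor int8 quantization, the basic scheme behind what toolkits like ONNX Runtime apply; the real pipelines add calibration and per-channel scales:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantisation: map float weights to int8
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.RandomState(0).randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Reconstruction error is bounded by half a quantisation step
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Storing int8 weights cuts memory 4x versus float32, and integer matrix kernels are what produce the CPU speed-ups reported later for the compressed models.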


Table 2 lists the backbone options for creating a bi-encoder model. Briefly, there are four architectures: BERT, RoBERTa, DeBERTa, and BigBird; other backbones can of course be added. These mainstream architectures were chosen because they have adequate capacity and are supported by model compression tools. A notable finding was that some models work in multi-lingual environments and do well in production. The last column shows Spearman's rank coefficient, which ranges from 0 to 1 and indicates how closely the sentence representations match the gold labels, with a higher value (close to 1) representing a closer match.

Table 2. Model Comparison

Model Layers Architecture Cross-Lingual Max Length Spearman's Coef
MiniLM-L6-Short 6 BERT   75 0.9381
MiniLM-L6 6 BERT   128 0.9477
TinyBERT-L6 6 BERT   128 0.9477
XLM-MiniLM-L12 12 RoBERTa   256 0.9528
DeClutr-L6 6 RoBERTa   128 0.9476
DistilBERT 6 BERT   128 0.9503
DeBERTa 12 DeBERTa   512 0.9586
BigBird 12 BigBird   4096 0.9328

Considering our benchmark dataset, all models provide good results, e.g., Spearman's rank coefficient > 0.93. It is also important to measure how fast they are. For a fair comparison, we only compare how quickly the models generate sentence embeddings, since the nearest-neighbour index is built later and is plug-and-play for any model. Figure 4 shows the results for the PyTorch model (original) and the optimized ONNX Runtime model (knowledge distillation + quantization), obtained using a single thread on an Intel(R) Xeon(R) Gold 5222 processor. Whichever framework is used for prediction, a simpler model tends to produce results faster, while larger models tend to be more accurate. Compression further reduces the speed gap between simpler and more complex models, although some accuracy is lost.

Figure 4: Speed comparison over different model choices

Finally, four nearest-neighbour algorithms with 4 threads were added to speed up queries over scalable databases ranging from 30K to 30M entries; the elapsed time of each is shown in Figure 5. The methods differ little except when the dataset is larger than 30M. This is due to Faiss' product quantization and clustering, which can give less accurate results even though the speed is high.

Figure 5: Query speed comparison among scalable databases

Lastly, let us examine Persia's response to the question "How is our quarterly performance?". The results (Figure 6) are embedded as a dashboard consisting of two parts: a textual answer (narratives) and a visual answer (charts). Since the question is vague, multiple key metrics are used, and the results are interactive, e.g., switching metrics produces different charts and narratives. Many key metrics are shown, such as revenue, conversion rate, average order value (AOV), and units per transaction (UPT). Unlike other services, Persia does not just show this quarter's results; it also shows results for the same quarter last year. Nor is the comparison limited to this: it can be made across different entities such as time, location, and organization. Providing a friendly user experience is what Course5 strives to achieve.

Figure 6: A screenshot of Persia demo

What Else?

This blog gave a rough overview of how Persia's semantic search model is built and how it behaves. Many more features of Persia are not covered here:

  • Multi-modal
  • Trade-offs balancing between speed and accuracy on nearest-neighbour approximations
  • Few-shot learning, e.g., when only 10, 50, or 100 examples are available
  • Mixed quantization which further compresses models for easier deployment
  • Task-adaptive knowledge distillation
  • Graph optimization with ONNX Runtime

More blogs about Course5 techniques for NLP/NLG will be coming soon.


[1] Wang, W., Bi, B., Yan, M., Wu, C., Bao, Z., Peng, L., & Si, L. (2020). StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. ArXiv, abs/1908.04577.
[2] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ArXiv, abs/1909.11942.
[3] Cui, Y., Che, W., Liu, T., Qin, B., Yang, Z., Wang, S., & Hu, G. (2019). Pre-Training with Whole Word Masking for Chinese BERT. ArXiv, abs/1906.08101.
[4] Gu, Y., Zhang, Z., Wang, X., Liu, Z., & Sun, M. (2020). Train No Evil: Selective Masking for Task-guided Pre-training. EMNLP.
[5] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP/IJCNLP.
[6] Morris, J.X., Lifland, E., Yoo, J.Y., & Qi, Y. (2020). TextAttack: A Framework for Adversarial Attacks in Natural Language Processing. ArXiv, abs/2005.05909.
[7] Wu, T., Ribeiro, M.T., Heer, J., & Weld, D.S. (2019). Errudite: Scalable, Reproducible, and Testable Error Analysis. ACL.
[8] Gui, T., Wang, X., Zhang, Q., Liu, Q., Zou, Y., Zhou, X., Zheng, R., Zhang, C., Wu, Q., Ye, J., Pang, Z., Zhang, Y., Li, Z., Ma, R., Fei, Z., Cai, R., Zhao, J., Hu, X., Yan, Z., Tan, Y., Hu, Y., Bian, Q., Liu, Z., Zhu, B., Qin, S., Xing, X., Fu, J., Zhang, Y., Peng, M., Zheng, X., Zhou, Y., Wei, Z., Qiu, X., & Huang, X. (2021). TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing. ArXiv, abs/2103.11441.
[9] Ribeiro, M.T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond Accuracy: Behavioral Testing of NLP models with CheckList. ACL.

Moyan Mei

Moyan Mei is an AI researcher with extensive experience in both machine learning research and data science applications. His interest is to research and develop...
