OpenSearch using k-NN: Improving Academic Literature Search
Finding relevant literature is crucial for research. However, traditional keyword search matches terms rather than meaning, so it often misses relevant documents, especially in large collections. OpenSearch provides k-nearest neighbor (k-NN) search, which can be used for semantic search. k-NN search returns the k data points closest to a query point under a distance metric; it is a retrieval technique rather than a clustering method. In this blog post, we will explore how to use OpenSearch k-NN as a semantic search engine, set up an index, and walk through code examples using data pulled from arXiv PDFs.
What is OpenSearch?
OpenSearch is an open-source search and analytics engine, derived from Elasticsearch 7.10.2 and released under the Apache 2.0 license. It stores documents in indices, exposes a REST API for indexing and querying them, and comes with OpenSearch Dashboards, a web interface for exploring data and running queries. It should not be confused with the older OpenSearch protocol, a set of formats for sharing search results between websites and browsers; this post is about the search engine.
What is k-NN?
k-NN search uses a distance metric to measure the similarity between data points and returns the k points closest to a query. OpenSearch k-NN performs semantic search by comparing the embedding vector of the query with the embedding vectors stored in the index. This makes it useful for unstructured data, such as natural language text, that has no predefined schema. The k-NN plugin supports several distance functions, including Euclidean distance and cosine similarity, and the one in use depends on how the index is configured. Cosine similarity measures the angle between two vectors and returns a value between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.
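To make the idea concrete, here is a minimal sketch of cosine similarity and an exact, brute-force k-NN lookup in plain Python. The function names and the toy 2-D vectors are made up for illustration. OpenSearch generally relies on approximate index structures rather than an exhaustive scan like this one, so treat it as an explanation of the concept, not of the plugin's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (ranges from -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings" standing in for real document vectors.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(nearest_neighbors([1.0, 0.1], docs, k=2))  # prints [0, 2]
```

The query points almost exactly along the first document vector, so that document ranks first, followed by the diagonal vector.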
Quick Setup and Code Examples
In order to create our search engine, we will require the following:
- Text embeddings that allow algorithms to perform similarity searches by finding sentences that have a similar meaning to those in a search query, even if the query does not contain all the same words.
- A database to store the embeddings.
- An algorithm that can locate similar embeddings.
OpenSearch can handle both the storage and the similarity search through its k-NN plugin. Hugging Face Transformers or Sentence Transformers can generate the embeddings. To implement k-NN search with OpenSearch, let's first create an index for our data.
1. Set Up the OpenSearch Environment
OpenSearch has several installation options, such as Docker, tarball, and RPM. We use Docker for this demonstration. Create a new directory for your OpenSearch environment, create a new file called docker-compose.yml in that directory, and add the following code:
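As a rough sketch, a single-node development setup could look like the file below. The service names, the `latest` image tags, the heap size, and the decision to disable the security plugin (so that the cluster answers over plain HTTP without credentials) are all assumptions made for a local demo; they are not taken from the original setup and are not suitable for production.

```yaml
services:
  opensearch:
    image: opensearchproject/opensearch:latest
    container_name: opensearch
    environment:
      - discovery.type=single-node        # run as a one-node cluster
      - DISABLE_INSTALL_DEMO_CONFIG=true  # skip the bundled demo security setup
      - DISABLE_SECURITY_PLUGIN=true      # plain HTTP, no authentication (local use only)
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:latest
    container_name: opensearch-dashboards
    environment:
      - 'OPENSEARCH_HOSTS=["http://opensearch:9200"]'
      - DISABLE_SECURITY_DASHBOARDS_PLUGIN=true
    ports:
      - "5601:5601"
    depends_on:
      - opensearch
```

With this configuration the REST API would be exposed on port 9200 and OpenSearch Dashboards on port 5601.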
Next, navigate to the directory of your docker-compose.yml and enter the following command to run OpenSearch and OpenSearch Dashboards:
docker compose up -d
Go to http://localhost:5601. You should be able to access OpenSearch Dashboards, as shown in the following figure. From there you can review your indices and use Dev Tools to run queries directly.
1. Getting Started
First, set up the Python environment. You can use any virtual environment manager, such as conda. For instance, creating a conda environment and installing the dependencies from a requirements.txt file looks like this:
conda create -n opensearch python=3.9
conda activate opensearch
Create a requirements.txt file in your project with the following content:
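The exact file is not reproduced here, so the list below is an assumption based on the packages this post uses: the OpenSearch client, Sentence Transformers, the arxiv package, and PyPDF2. Versions are deliberately left unpinned; pin them to whatever you test with.

```text
opensearch-py
sentence-transformers
arxiv
PyPDF2
```

With the environment activated, `pip install -r requirements.txt` installs them.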
For the k-NN plugin to work, the index must define at least one field of type knn_vector along with its dimension. The dimension depends on the embedding model you choose; in our case it is 384, the output size of the all-MiniLM-L6-v2 model (see its model card for more information).
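A minimal sketch of such an index definition, written as a Python dictionary, might look like this. The field name `embedding` and the choice of `text` for the other fields are assumptions made for illustration. No `method` block is given, so the plugin's default engine and distance function apply; if you want a specific one, such as cosine similarity, you would configure it on the field.

```python
EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2, as stated above


def build_index_body(dimension=EMBEDDING_DIM):
    """Settings and mappings for an index with a single knn_vector field."""
    return {
        # Enable k-NN search on this index.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "abstract": {"type": "text"},
                "body": {"type": "text"},
                "authors": {"type": "text"},
                # "embedding" is a hypothetical field name chosen for this sketch.
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


index_body = build_index_body()
print(index_body["mappings"]["properties"]["embedding"])
```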
We can achieve this with the opensearchpy package, which we use to connect to the OpenSearch cluster and create the index. We specify the hosts parameter to connect to our local OpenSearch setup:
We create the OpenSearch index called arxiv, pass the settings just defined in the body parameter, and ignore any HTTP 400 errors, such as the one OpenSearch returns when the index already exists.
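The connection and index creation described above can be sketched as follows. It assumes a local cluster reachable over plain HTTP on port 9200 without authentication, and the `embedding` field name is hypothetical. The client is passed in as a parameter so the helper can be tried against a stub, and `main()` is only defined here, not called.

```python
def create_index(client, index_name, body):
    """Create the index, ignoring HTTP 400 (for example, index already exists)."""
    return client.indices.create(index=index_name, body=body, ignore=400)


def main():
    # Imported here so that create_index can be used without the package installed.
    from opensearchpy import OpenSearch

    # Assumption: local single-node cluster, plain HTTP, security disabled.
    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    body = {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Hypothetical field name; 384 matches all-MiniLM-L6-v2.
                "embedding": {"type": "knn_vector", "dimension": 384},
            }
        },
    }
    print(create_index(client, "arxiv", body))
```

Calling `main()` with the cluster running should print the response from OpenSearch. Keep in mind that ignoring 400 also hides other bad-request errors, such as an invalid mapping.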
2. Creating a Dataset Using the arXiv API
The Arxiv API provides programmatic access to metadata and full-text articles from the Arxiv database. We can use the API to create a dataset of academic papers to index in OpenSearch.
We can use the Python arxiv package to retrieve papers from arXiv and then download the PDF for each paper using the download_pdf function. To populate the index, we need to extract the text from the PDFs, which we do with the PyPDF2 library. In a separate file called utils.py, we define a small function for the extraction, along with a cleanup helper that removes PDFs once they have been processed:
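A minimal sketch of what such a utils.py could contain is shown below. The function names `extract_text` and `remove_pdf` are hypothetical, and the code assumes a PyPDF2 release that provides the `PdfReader` class.

```python
import os


def extract_text(pdf_path):
    """Return the text of every page in the PDF, joined by newlines."""
    # Imported lazily so that remove_pdf stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for image-only pages, hence the `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def remove_pdf(pdf_path):
    """Delete a processed PDF; do nothing if it is already gone."""
    if os.path.exists(pdf_path):
        os.remove(pdf_path)
```

Text extracted from scientific PDFs is often noisy (broken words, stray symbols from formulas), so some extra cleaning may be worthwhile before embedding it.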
We use the Search class to retrieve up to NUM_PAPERS = 100 papers that match QUERY_SEARCH = "spectral residual techniques". For each paper, we create a dictionary that contains the title, abstract, and author names. Finally, we add the extracted text to the paper dict under the key 'body' and append the dict to a list called papers.
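The retrieval loop just described can be sketched roughly as follows. `NUM_PAPERS`, `QUERY_SEARCH`, and the dictionary keys come from the text above; the function names are hypothetical, and the text-extraction and cleanup helpers are passed in as callables so the sketch does not depend on a particular utils.py. Storing the authors as a list of names is also an assumption.

```python
NUM_PAPERS = 100
QUERY_SEARCH = "spectral residual techniques"


def paper_to_dict(result, body):
    """Keep the fields named in the text: title, abstract, authors, body."""
    return {
        "title": result.title,
        "abstract": result.summary,  # the arxiv package calls the abstract "summary"
        "authors": [author.name for author in result.authors],
        "body": body,
    }


def fetch_papers(extract_text, remove_pdf, query=QUERY_SEARCH, max_results=NUM_PAPERS):
    """Search arXiv, download each PDF, and collect one dict per paper."""
    import arxiv  # imported lazily; calling this function needs network access

    search = arxiv.Search(query=query, max_results=max_results)
    papers = []
    for result in arxiv.Client().results(search):
        pdf_path = result.download_pdf()  # saves the PDF and returns its path
        papers.append(paper_to_dict(result, extract_text(pdf_path)))
        remove_pdf(pdf_path)  # clean up the processed file
    return papers
```

Downloading and parsing 100 PDFs takes a while, so it can help to start with a smaller `max_results` while experimenting.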
3. Get the Embeddings and Populate the Index
We loop through the list of papers. For each one, we first compute an embedding of its title, body, and abstract combined, so that we can use it later in our k-NN semantic search. We then define a document with the paper's title, abstract, body, and authors, together with that embedding, and use the index method of the OpenSearch client to add it to the arxiv index:
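A minimal sketch of this step follows. The `embedding` field name and the function names are hypothetical. The encoder is passed in as a callable, which in `main` is the `encode` method of a SentenceTransformer model, and the cluster is assumed to be local and unauthenticated. `main` is only defined here, not called.

```python
def build_document(paper, encode):
    """Attach an embedding of title + body + abstract to the paper's fields."""
    text = " ".join([paper["title"], paper["body"], paper["abstract"]])
    return {
        "title": paper["title"],
        "abstract": paper["abstract"],
        "body": paper["body"],
        "authors": paper["authors"],
        # Convert to plain floats so the document serializes to JSON.
        "embedding": [float(x) for x in encode(text)],
    }


def index_papers(client, papers, encode, index_name="arxiv"):
    """Index one document per paper and return how many were sent."""
    for paper in papers:
        client.index(index=index_name, body=build_document(paper, encode))
    return len(papers)


def main(papers):
    # Imported here so the helpers above can be tried without these packages.
    from opensearchpy import OpenSearch
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Assumption: local cluster over plain HTTP without authentication.
    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    print(index_papers(client, papers, model.encode), "papers indexed")
```

For larger collections, sending documents one at a time becomes slow; a bulk indexing approach would be worth considering.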
4. Use the Search Engine
Now that we have our index set up, we can use k-NN to improve the search results. To use it, we take some text from the user, embed it, and send the resulting vector to OpenSearch, which returns the most similar documents. For reference, a query looks like the following:
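As an illustration, a k-NN query body could be built like this. The field name `embedding`, the default `k`, and the helper name are assumptions; the list of returned fields is a guess at the ones the post refers to.

```python
def build_knn_query(vector, k=5):
    """Build a k-NN query body for a knn_vector field named "embedding"."""
    return {
        "size": k,  # number of hits to return
        "_source": False,  # do not return the full stored document
        "fields": ["title", "abstract", "authors"],  # return only these fields
        "query": {
            "knn": {
                # Hypothetical field name; it must match the index mapping.
                "embedding": {"vector": list(vector), "k": k},
            }
        },
    }


query = build_knn_query([0.1] * 384, k=3)
print(query["query"]["knn"]["embedding"]["k"], len(query["query"]["knn"]["embedding"]["vector"]))
```

The query vector must have the same dimension as the indexed vectors, which is 384 for this model.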
The type of this query is knn, which operates on an embedding field by locating documents with vectors that closely resemble a given vector of numbers. I turned off the _source since it isn't necessary and chose to retrieve only the title, name, abstract and authors fields.
Below is the code needed to execute a search:
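A rough sketch of the search step is given here. The function names and the `embedding` field are hypothetical, and the cluster is again assumed to be local and unauthenticated. The query text is embedded with the same model used at indexing time, and `main` is only defined here, not called.

```python
def semantic_search(client, encode, text, k=5, index_name="arxiv"):
    """Embed the text, run a k-NN query, and return (score, fields) pairs."""
    vector = [float(x) for x in encode(text)]
    body = {
        "size": k,
        "_source": False,
        "fields": ["title", "abstract", "authors"],
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }
    response = client.search(index=index_name, body=body)
    return [(hit["_score"], hit.get("fields", {})) for hit in response["hits"]["hits"]]


def main():
    # Imported here so semantic_search can be tried without these packages.
    from opensearchpy import OpenSearch
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    text = input("What are you looking for? ")
    for score, fields in semantic_search(client, model.encode, text):
        # Values under "fields" come back as lists, one entry per stored value.
        print(f"{score:.3f}", fields.get("title"))
```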
The user is prompted with the question "What are you looking for?" and enters their search criteria, which are embedded using the same model and used to run a search against our arxiv index.
Final Thoughts
In this post, we've explored how OpenSearch and k-NN can be used to improve academic literature search. We've provided a quick setup guide for creating an OpenSearch index and code examples for indexing data from Arxiv PDFs, embedding the data using a sentence-transformer model called all-MiniLM-L6-v2, and querying the index using k-NN.
Needless to say, there are other ways to achieve the same goal. For instance, you could use a different embedding model, such as DistilBERT. You could also pre-process the data to remove unwanted characters, if any. Papers generally contain useful information, but formulas and mathematical expressions may not be well suited to this kind of search.
In our particular example, we may also want to add the pdf_url property and discard the body. The body can take up a lot of storage and is not really needed in the end, since we can offer the user the PDF URL and the search itself happens in the embedding space.