OpenSearch using k-NN: Improving Academic Literature Search

Revolutionize your literature search with OpenSearch using k-NN. Discover how to create a semantic search engine. With code examples too!

Natalia Pattarone
March 22, 2024

Finding relevant literature is crucial for research. However, traditional keyword searches can be limited in their ability to retrieve relevant documents, especially when dealing with large datasets. OpenSearch provides k-nearest neighbors (k-NN) search, which can be used for semantic search. k-NN search finds the indexed vectors that lie closest to a query vector, so documents can be matched by meaning rather than by exact keywords. In this blog post, we will explore how to use OpenSearch k-NN as a semantic search engine, set up an index, and provide code examples using data pulled from arXiv PDFs.

What is OpenSearch?

OpenSearch is an open-source search and analytics suite that began as a fork of Elasticsearch and Kibana and is released under the Apache 2.0 license. It stores documents in indices that you query over a REST API, and it ships with OpenSearch Dashboards for exploring and visualizing data. Its k-NN plugin adds vector fields and nearest-neighbor queries, which is what we rely on in this post.

What is k-NN?

The k-NN search algorithm uses a distance metric to measure the similarity between data points and returns the k points closest to a query. OpenSearch k-NN search performs a semantic search by comparing the distance between the embedded query and the embedded documents in the index. This is useful for unstructured data such as natural language text, where there is no predefined schema to filter on. The k-NN plugin supports several distance functions, including cosine similarity, which is a common choice for sentence embeddings. Cosine similarity is the cosine of the angle between two vectors and returns a value between -1 and 1, where 1 means the vectors point in the same direction.
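To make the idea concrete, here is a minimal sketch of an exact k-NN lookup with cosine similarity in plain Python. The function names and the toy two-dimensional vectors are our own illustration; OpenSearch itself uses approximate index structures rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_nearest(query, vectors, k):
    """Return the ids of the k vectors most similar to the query (exact scan)."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [-1.0, 0.0]}
print(k_nearest([1.0, 0.0], docs, k=2))  # prints ['a', 'b']
```

Document "c" points in the opposite direction to the query, so its similarity is -1 and it is ranked last.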

Quick Setup and Code Examples

In order to create our search engine, we will require the following:

  1. Text embeddings that allow algorithms to perform similarity searches by finding sentences that have a similar meaning to those in a search query, even if the query does not contain all the same words.
  2. A database to store the embeddings.
  3. An algorithm that can locate similar embeddings.

The OpenSearch platform can handle both items 2 and 3 using its k-NN plugin. Hugging Face Transformers or Sentence Transformers can generate the embeddings. To implement semantic search with OpenSearch k-NN, let's first create an index of our data.

1. Set Up the OpenSearch Environment

OpenSearch has several installation options, such as Docker, Tarball, and RPM. We use Docker for this demonstration. Create a new directory for your OpenSearch environment, create a new file called docker-compose.yml in that directory, and add the following code:

Next, navigate to the directory of your docker-compose.yml and enter the following command to run OpenSearch and OpenSearch Dashboards:

docker compose up -d

Go to http://localhost:5601. You should be able to access OpenSearch Dashboards, as shown in the following figure, where you can review your indices and use Dev Tools to run queries directly in the browser.

OpenSearch Dashboard

1. Getting Started

First, set up a Python environment. You can use conda or any other virtual environment manager. For instance, creating and activating a conda environment, into which we will install the dependencies from a requirements.txt file, looks like this:

conda create -n opensearch python=3.9

conda activate opensearch

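Create a requirements.txt file in your project that lists the packages used in this post (opensearch-py, arxiv, PyPDF2, and sentence-transformers), then install them with pip install -r requirements.txt.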

For the k-NN plugin to work, we need to define at least one field of type knn_vector and set its dimension, which depends on the embedding model you choose. In our case it is 384, the output size of the all-MiniLM-L6-v2 model (you can find more information about the model here).

We can achieve this by using the opensearchpy package to connect to the OpenSearch cluster and create the index. We specify the hosts parameter to connect to our local OpenSearch setup:
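A minimal sketch of this step might look as follows. It assumes a local cluster reachable over plain HTTP on port 9200 with the security plugin disabled. The text field mappings, the helper names, and the HNSW method on the Lucene engine with cosine similarity are our own assumptions, not necessarily the settings used in the original setup.

```python
INDEX_NAME = "arxiv"
EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2


def build_index_body(dimension=EMBEDDING_DIM):
    """Index settings and mappings with a single knn_vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "abstract": {"type": "text"},
                "body": {"type": "text"},
                "authors": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    # Assumption: HNSW on the Lucene engine, cosine similarity.
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",
                    },
                },
            }
        },
    }


def create_index(host="localhost", port=9200):
    """Connect to the cluster and create the index."""
    # Imported here so the mapping can be inspected without the package installed.
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": host, "port": port}])
    # ignore=400 keeps the call from raising if the index already exists.
    return client.indices.create(
        index=INDEX_NAME, body=build_index_body(), ignore=400
    )
```

If the security plugin is enabled on your cluster, the client would also need credentials and TLS options.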

We create the OpenSearch index called arxiv, pass the settings just defined in the body parameter, and ignore any HTTP 400 error, such as the one returned when the index already exists.

2. Creating a Dataset Using the arXiv API

The arXiv API provides programmatic access to metadata and full-text articles from the arXiv database. We can use the API to create a dataset of academic papers to index in OpenSearch.

We can use the Python arxiv package to retrieve papers from arXiv and download the PDF for each paper with its download_pdf function. To populate the index, we need to extract the text from the PDFs, which we do with the PyPDF2 library. In a separate file called utils.py, we define a small function for the extraction, plus a cleanup helper that removes the PDFs that have already been processed:
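A utils.py along these lines could be sketched as below. The function names are hypothetical, and the extraction assumes a PyPDF2 version that provides PdfReader (2.0 or later).

```python
import os


def extract_pdf_text(pdf_path):
    """Return the text of every page in the PDF, joined by newlines."""
    # Imported lazily so the cleanup helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty value for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def remove_processed_pdfs(directory):
    """Delete the .pdf files in a directory and return how many were removed."""
    removed = 0
    for name in os.listdir(directory):
        if name.lower().endswith(".pdf"):
            os.remove(os.path.join(directory, name))
            removed += 1
    return removed
```

Text extracted from scientific PDFs is often noisy (broken formulas, headers, hyphenation), so some post-processing may be worthwhile before embedding.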

We use the package's Search class to retrieve up to NUM_PAPERS = 100 papers that match QUERY_SEARCH = "spectral residual techniques". For each paper, we create a dictionary that contains the title, abstract, and author names. Finally, we add the extracted text to the paper dict under the key 'body' and append the paper dict to a list called papers.
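The retrieval loop described above can be sketched as follows. It assumes a 2.x release of the arxiv package (arxiv.Client, arxiv.Search, Result.download_pdf). The helper names paper_to_dict and fetch_papers, and the extract_text callable passed in, are hypothetical.

```python
NUM_PAPERS = 100
QUERY_SEARCH = "spectral residual techniques"


def paper_to_dict(result, body_text):
    """Keep only the metadata we index, plus the extracted PDF text."""
    return {
        "title": result.title,
        "abstract": result.summary,
        "authors": [author.name for author in result.authors],
        "body": body_text,
    }


def fetch_papers(extract_text, pdf_dir="."):
    """Search arXiv, download each PDF, and build the list of paper dicts.

    extract_text is any callable that maps a PDF path to its text.
    """
    import arxiv  # imported lazily; this function needs network access

    search = arxiv.Search(query=QUERY_SEARCH, max_results=NUM_PAPERS)
    papers = []
    for result in arxiv.Client().results(search):
        pdf_path = result.download_pdf(dirpath=pdf_dir)
        papers.append(paper_to_dict(result, extract_text(pdf_path)))
    return papers
```

Keeping paper_to_dict separate from the download loop makes it easy to check the document shape without contacting arXiv.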

3. Get the Embeddings and Populate the Index

We loop through the list of papers. For each one, we first compute an embedding of its title, abstract, and body combined, so we can use it later in our k-NN semantic search. We then define a document with the paper's title, abstract, body, authors, and embedding, and use the index method of the OpenSearch client to add it to the arxiv index:
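One way to sketch this step is shown below. It assumes the sentence-transformers package and a local, unsecured cluster. The helper names and the embedding field name in the document are our own choices for illustration.

```python
def build_document(paper, encode):
    """Combine the text fields, embed them, and return the document to index.

    encode is any callable that maps a string to a list of floats.
    """
    text = " ".join([paper["title"], paper["abstract"], paper["body"]])
    return {
        "title": paper["title"],
        "abstract": paper["abstract"],
        "body": paper["body"],
        "authors": paper["authors"],
        "embedding": encode(text),
    }


def index_papers(papers, host="localhost", port=9200):
    """Embed every paper with all-MiniLM-L6-v2 and index it into 'arxiv'."""
    # Imported lazily so build_document can be used without these packages.
    from opensearchpy import OpenSearch
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    client = OpenSearch(hosts=[{"host": host, "port": port}])
    for paper in papers:
        # .tolist() turns the NumPy vector into JSON-serializable floats.
        doc = build_document(paper, lambda text: model.encode(text).tolist())
        client.index(index="arxiv", body=doc)
```

Note that all-MiniLM-L6-v2 truncates long inputs, so for a full paper the embedding mostly reflects the title, the abstract, and the start of the body.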

4. Use the Search Engine

Now that we have our index set up, we can use k-NN to improve the search results. To use it, we need some text from the user, which is embedded and sent to OpenSearch, which will return the most similar documents. For reference, a query looks like the following:
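As a sketch, the request body can be built as a Python dictionary. The helper name build_knn_query, the default of five results, and the exact list of returned fields are assumptions; the vector would normally be the 384-dimensional embedding of the user's text.

```python
import json


def build_knn_query(vector, k=5):
    """Request body for a k-NN search on the 'embedding' field."""
    return {
        "size": k,
        "_source": False,  # skip the full stored document
        "fields": ["title", "abstract", "authors"],
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }


# A shortened vector, only to show the shape of the request.
print(json.dumps(build_knn_query([0.1, 0.2, 0.3], k=2), indent=2))
```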

The type of this query is knn, which operates on an embedding field by locating documents with vectors that closely resemble a given vector of numbers. I turned off the _source since it isn't necessary and chose to retrieve only the title, name, abstract and authors fields.

Below is the code needed to execute a search:
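A minimal sketch of such a search script, under the same assumptions as before (local unsecured cluster, an embedding field, the all-MiniLM-L6-v2 model), could look like this. The function names are hypothetical.

```python
def summarize_hits(response):
    """Turn a search response into (score, title) pairs."""
    results = []
    for hit in response["hits"]["hits"]:
        # Values requested through "fields" come back as lists.
        title = hit.get("fields", {}).get("title", [""])[0]
        results.append((hit["_score"], title))
    return results


def search(question, k=5, host="localhost", port=9200):
    """Embed the question and run a k-NN query against the 'arxiv' index."""
    from opensearchpy import OpenSearch
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vector = model.encode(question).tolist()
    query = {
        "size": k,
        "_source": False,
        "fields": ["title", "abstract", "authors"],
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }
    client = OpenSearch(hosts=[{"host": host, "port": port}])
    return summarize_hits(client.search(index="arxiv", body=query))


def main():
    """Prompt the user and print the best matches."""
    for score, title in search(input("What are you looking for? ")):
        print(f"{score:.3f}  {title}")
```

Calling main() from a script starts the prompt. In a longer-running application you would load the model and create the client once instead of on every search.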

The user is prompted with the question "What are you looking for?" to enter their search criteria, which are embedded using the same model and used to run a search against our arxiv index.

Final Thoughts

In this post, we've explored how OpenSearch and k-NN can be used to improve academic literature search. We've provided a quick setup guide for creating an OpenSearch index and code examples for indexing data from arXiv PDFs, embedding the data using a sentence-transformer model called all-MiniLM-L6-v2, and querying the index using k-NN.

There are other approaches to achieve the same goals. For instance, you could use a different embedding model, such as DistilBERT. You can also preprocess the data to remove unwanted characters (if any). Papers generally contain useful information, but formulas and mathematical expressions are perhaps not appropriate for this kind of search.

In our particular example, we may also want to store the pdf_url property and discard the body. The body can occupy a lot of storage and is not needed once the embedding has been computed, because we can offer the user the PDF URL and the search itself happens in the embedding space.
