How to read a scientific research paper using a large language model

A Python script that reads a PDF document, performs question-answering on it, and generates an answer to a given query

Introduction

GPT (3 or 4), developed by OpenAI, is a Transformer-based large language model that can understand and generate text. GPT is trained in two steps, pretraining and fine-tuning, on a large corpus of text data. In the pretraining step, the model is trained with unsupervised (self-supervised) learning on a large dataset drawn from parts of the Internet, consisting of billions of sentences from books, articles, websites, and other publicly available text. The fine-tuning step comes after pretraining: the model is further trained on more specific data to make it more useful and safe for practical applications.

Is it possible to use GPT model to read a research paper?

Yes, you can use GPT models like ChatGPT to read recent scientific research papers, but some customization is required. Accessing GPT models directly through web interfaces like ChatGPT has limitations: despite regular updates, the sheer volume of work published daily means the models' training data may not cover the most recent research. Moreover, asking ChatGPT about recent research can yield incorrect answers because of inconsistencies in its training data, and it does not provide the sources of its information. It is, however, possible to customize the model so that it answers solely from the input data you provide.

This limitation can be overcome by supplying the text of the research paper to the language model as context (a retrieval-augmented approach, rather than true fine-tuning of the model's weights), allowing us to retrieve information from the paper. The process entails converting the downloaded PDF research paper into a well-structured text format and segmenting the text into smaller pieces that the language model can handle, as sketched below.
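A minimal sketch of this step in Python, using PyPDF2 and a LangChain text splitter (the file name and splitter settings here are illustrative, not the script's actual values):

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Extract raw text from every page of the downloaded PDF
reader = PdfReader("paper.pdf")
raw_text = "".join(page.extract_text() or "" for page in reader.pages)

# Split the text into overlapping chunks small enough for the model
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(raw_text)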

Each text chunk is then transformed into a vector embedding using OpenAI text embeddings and stored in a FAISS index. These embeddings are numerical representations of the texts that preserve their semantic information, and together they form a new knowledge base. For a deeper understanding of how embeddings work, refer to this NLP course.
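Continuing the sketch above, this is roughly how the chunks are embedded and indexed with LangChain (import paths may differ slightly depending on your LangChain version):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Each chunk is embedded with OpenAI's text-embedding model and
# stored in a FAISS index that acts as the knowledge base
embeddings = OpenAIEmbeddings()
knowledge_base = FAISS.from_texts(chunks, embeddings)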

Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can “pool” the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and returning the documents with the greatest overlap.
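As a toy illustration of that dot-product ranking, with made-up two-dimensional embeddings (real OpenAI embeddings are much higher-dimensional):

import numpy as np

# Three pooled "document" embeddings and one query embedding (made up)
doc_vectors = np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]])
query_vector = np.array([0.2, 0.8])

# Dot-product similarity between the query and every document
scores = doc_vectors @ query_vector
print(np.argsort(scores)[::-1])  # document indices, most similar first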

When querying the input research paper through the vector store (the knowledge base), the results are ranked by their similarity to the query, and GPT is then used to generate a response from the best-matching chunks. Prompt Engineering's YouTube video gives a very good explanation of how to set up this kind of PDF-reading application.
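In LangChain, this retrieval-plus-generation step looks roughly like the following (continuing the earlier sketch; the chain type and temperature are illustrative choices, and exact call signatures vary with the LangChain version):

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

query = "who are the authors of the article?"

# Retrieve the chunks most similar to the query from the FAISS index
docs = knowledge_base.similarity_search(query)

# Let GPT compose an answer grounded only in the retrieved chunks
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
answer = chain.run(input_documents=docs, question=query)
print(answer)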

Implementation

I have written a Python script that reads a PDF document, performs question-answering on the document, and generates an output based on the provided query. You can download the script from here.

The script expects two arguments: configfile, the path to the configuration file, and query, the question to ask about the paper. The script retrieves the OpenAI API key from the environment variable OPENAI_API_KEY, so export your key in your OS environment before running it. You can get an API key from the OpenAI platform; the free credit you receive should be sufficient for heavy usage of this application for several months.
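For reference, the script's command-line handling boils down to something like this (an illustrative sketch based on the description above, not the downloadable script itself):

import argparse
import os

parser = argparse.ArgumentParser(description="Question-answering on a PDF research paper")
parser.add_argument("-c", "--configfile", required=True, help="path to the configuration file")
parser.add_argument("-q", "--query", required=True, help="question to ask about the paper")
args = parser.parse_args()

# The OpenAI API key comes from the environment, not the config file
api_key = os.environ["OPENAI_API_KEY"]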

The script also creates a local cache so that repeated runs on the same paper are relatively fast.
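One common way to implement such a cache is to persist the FAISS index to disk and reload it on later runs (a sketch only; the script's actual cache layout may differ, the directory name here is hypothetical, and newer LangChain versions may require extra arguments to load_local):

import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

CACHE_DIR = "faiss_cache"  # hypothetical cache location
embeddings = OpenAIEmbeddings()

if os.path.isdir(CACHE_DIR):
    # Reuse previously computed embeddings instead of calling the API again
    knowledge_base = FAISS.load_local(CACHE_DIR, embeddings)
else:
    knowledge_base = FAISS.from_texts(chunks, embeddings)  # chunks from the earlier sketch
    knowledge_base.save_local(CACHE_DIR)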


Setting up the Python environment

python3 -m venv venv
source venv/bin/activate
pip install langchain
pip install openai
pip install PyPDF2
pip install faiss-cpu
pip install tiktoken
pip install pyyaml

Usage

Who are the authors of this article?

python read_paper.py -c config.yaml -q "who are the authors of the article?"

Write a summary of this article

python read_paper.py -c config.yaml -q "write a summary of this article"

Example

As an example, I use this application to read the research paper: Whole-mantle radially anisotropic shear velocity structure from spectral-element waveform tomography

French and Romanowicz 2014
❯ python read_paper.py -c config.yaml -q "who are the authors of the article?"

Cache cleared
====================================================================================================
The authors of the article are S. W. French and B. A. Romanowicz.
----------------------------------------------------------------------------------------------------
❯ python read_paper.py -c config.yaml -q "How many b spline nodes did they use in their model?"

====================================================================================================
They used 20 knots with variable spacing between the core-mantle boundary (CMB) and 30-km depth for their radial b-spline basis.
----------------------------------------------------------------------------------------------------

❯ python read_paper.py -c config.yaml -q "Which optimization algorithm did they choose for their inversion?"

====================================================================================================
They used a Gauss-Newton optimization scheme for their waveform inversion.
----------------------------------------------------------------------------------------------------

Conclusion

The outcome of this application depends on the parameters defined in the configfile, such as chunk_size, chunk_overlap, and gpt_temperature. The chunk_size keeps each text chunk within the model's limit, and the chunk_overlap is important for preserving context across chunks: if the overlap is too small, the model may not be able to relate neighboring chunks to each other. The temperature parameter of OpenAI GPT models governs the randomness, and thus the creativity, of the responses. A temperature of 0 makes responses very straightforward, almost deterministic (you almost always get the same response to a given prompt), while a temperature of 1 allows responses to vary wildly.
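Putting these together, a config.yaml might look as follows (the three parameter names are those mentioned above; pdf_path is a hypothetical key for the input file, and all values are illustrative):

# config.yaml (illustrative values)
pdf_path: paper.pdf   # hypothetical key for the input PDF
chunk_size: 1000      # maximum characters per text chunk
chunk_overlap: 200    # overlap between consecutive chunks, preserving context
gpt_temperature: 0    # 0 = near-deterministic answers, 1 = highly varied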

