How to read a scientific research paper using a large language model?

A Python script that reads a PDF document, performs question answering on it, and generates an output based on the provided query

Introduction

GPT (3 or 4) is a Transformer-based large language model, developed by OpenAI, that can understand and generate text. GPT is trained in two steps: pretraining and fine-tuning. In the pretraining step, the model learns without labeled data (unsupervised, or more precisely self-supervised, learning) from a large corpus containing parts of the Internet: billions of sentences from sources such as books, articles, websites, and other publicly available text. In the fine-tuning step that follows, the model is trained further on more specific data to make it more useful and safe for practical applications.

Is it possible to use a GPT model to read a research paper?

Yes, you can use GPT models like ChatGPT to read recent scientific research papers, but some customization (loosely speaking, “fine-tuning”) is required. Accessing GPT models directly through web interfaces like ChatGPT has limitations. Despite regular updates, the training data of these models cannot keep pace with the volume of research published every day, so the most recent papers may be missing. Moreover, asking ChatGPT about recent research might yield incorrect answers because of gaps or inconsistencies in the data it was trained on, and it does not provide the sources of its information. However, it is possible to set up the model so that it responds solely on the basis of the input data you provide.

This limitation of GPT models can be overcome by supplying the text of the research paper to the language model, which lets us retrieve information from the paper. Strictly speaking, this does not fine-tune the model's weights; the relevant parts of the paper are passed to the model as context alongside the question. The process entails converting the downloaded PDF research paper into well-structured text and then splitting that text into smaller chunks that the language model can handle.
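The splitting step can be sketched in plain Python. This is only an illustration that assumes simple fixed-size character windows; the actual script may rely on a LangChain text splitter instead, and the function name split_into_chunks and the default values are hypothetical.

```python
def split_into_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut at
    one boundary still appears whole in the neighbouring chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

A larger overlap produces more chunks (and more embedding calls) but reduces the chance that a key sentence is split between two chunks.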

Each text chunk is then transformed into a vector embedding using OpenAI text embeddings and stored via FAISS. These embeddings serve as numerical equivalents of the texts, preserving their semantic information and creating a new knowledge base. For a deeper understanding of how embeddings function, refer to this NLP course.

 Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can “pool” the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and returning the documents with the greatest overlap.
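To make the pooling and similarity ideas concrete, here is a small sketch using toy vectors in plain Python. It assumes mean pooling and a dot product on length-normalized vectors (that is, cosine similarity); real embedding models may pool differently, and the helper names are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one vector for the whole text."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    q = normalize(query_vec)
    scores = [(dot(q, normalize(d)), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, key=lambda pair: pair[0], reverse=True)]


if __name__ == "__main__":
    sentence_vec = mean_pool([[1.0, 0.0], [3.0, 2.0]])  # two toy token vectors
    print(sentence_vec)
    print(rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]))
```

A vector store such as FAISS performs the same kind of ranking, but with index structures that keep the search fast when there are many chunks.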

When we query the vector store (which acts as the knowledge base) for information from the input research paper, the stored chunks are ranked by their similarity to the query. The top-ranked chunks are then passed to GPT, which generates a response. This YouTube video by Prompt Engineering gives a very good explanation of how to set up an application that reads a PDF.
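One way to hand the retrieved chunks to GPT is to place them in the prompt together with the question. The sketch below only builds such a prompt; the wording of the instructions, the function name build_prompt, and the top_k parameter are assumptions for illustration, not what the script or LangChain necessarily uses.

```python
def build_prompt(question, ranked_chunks, top_k=3):
    """Combine the top_k most similar chunks and the question into one prompt."""
    context = "\n\n".join(ranked_chunks[:top_k])
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    chunks = ["chunk-A about the authors", "chunk-B about the method", "chunk-C about results"]
    print(build_prompt("Who are the authors of the article?", chunks, top_k=2))
```

Restricting the model to the supplied context is what keeps the answers tied to the paper rather than to whatever the model memorized during training.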

Implementation

I have written a Python script that reads a PDF document, performs question-answering on the document, and generates an output based on the provided query. You can download the script from here.

The script expects two arguments: configfile, which is the path to the configuration file, and query, which is the question to ask about the paper. The script reads the OpenAI API key from the environment variable OPENAI_API_KEY, so export your key in your shell environment before running it. You can get an API key from the OpenAI platform. You will get some free credit that should be sufficient for heavy usage of this application for several months.
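The command-line handling described above might look roughly like the following. The -c and -q flags match the usage shown in this post, but the long option names and the helper functions are assumptions, and the real script may be organized differently.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the two arguments the script expects."""
    parser = argparse.ArgumentParser(description="Ask a question about a PDF research paper.")
    # Long option names are hypothetical; only -c and -q appear in the usage examples.
    parser.add_argument("-c", "--configfile", required=True, help="path to the configuration file")
    parser.add_argument("-q", "--query", required=True, help="question to ask about the paper")
    return parser.parse_args(argv)


def get_api_key(env=None):
    """Read the OpenAI API key from OPENAI_API_KEY, failing early if it is missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running the script")
    return key


if __name__ == "__main__":
    args = parse_args(["-c", "config.yaml", "-q", "who are the authors of the article?"])
    print(args.configfile, "|", args.query)
```

On Linux or macOS the key can be set for the current shell session with export OPENAI_API_KEY followed by your own key.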

The script also creates a local cache to keep the experience relatively fast.


Setting up the Python environment

python3 -m venv venv
source venv/bin/activate
pip install langchain
pip install openai
pip install PyPDF2
pip install faiss-cpu
pip install tiktoken
pip install pyyaml

Usage

Who are the authors of this article?

python read_paper.py -c config.yaml -q "who are the authors of the article?"

Summary of this article?

python read_paper.py -c config.yaml -q "write a summary of this article"

Example

As an example, I use this application to read the research paper: Whole-mantle radially anisotropic shear velocity structure from spectral-element waveform tomography

French and Romanowicz 2014
❯ python read_paper.py -c config.yaml -q "who are the authors of the article?"

Cache cleared
====================================================================================================
The authors of the article are S. W. French and B. A. Romanowicz.
----------------------------------------------------------------------------------------------------
❯ python read_paper.py -c config.yaml -q "How many b spline nodes did they use in their model?"

====================================================================================================
They used 20 knots with variable spacing between the core-mantle boundary (CMB) and 30-km depth for their radial b-spline basis.
----------------------------------------------------------------------------------------------------

❯ python read_paper.py -c config.yaml -q "Which optimization algorithm did they chose for their inversion?"

====================================================================================================
They used a Gauss-Newton optimization scheme for their waveform inversion.
----------------------------------------------------------------------------------------------------

Conclusion

The outcome of this application depends on the parameters defined in the configfile, such as chunk_size, chunk_overlap, and gpt_temperature. The chunk_size parameter keeps each text chunk within a size limit, and chunk_overlap preserves context across chunk boundaries. If the overlap is too small, a passage that straddles two chunks is cut apart, and the model may not be able to relate the chunks to each other. The temperature parameter of OpenAI GPT models governs the randomness, and thus the creativity, of the responses. A temperature of 0 makes the responses very straightforward and almost deterministic (you almost always get the same response to a given prompt). A temperature of 1 lets the responses vary much more from run to run.
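The effect of temperature can be illustrated with a toy example. In the standard formulation, the model's scores (logits) for the candidate next tokens are divided by the temperature before being turned into probabilities with a softmax. The logits below are made up, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Temperature 0 is the limiting case: always pick the highest-scoring token.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    for t in (0.1, 1.0):
        probs = softmax_with_temperature(toy_logits, t)
        print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability sits on the top-scoring token, which is why the answers barely change between runs; at a temperature of 1 the other tokens keep a noticeable share and the output varies more.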

Utpal Kumar

