User Documentation

pygetpapers

Research Papers right from python

What is pygetpapers

  • pygetpapers is a tool to assist text miners. It makes requests to open access scientific text repositories, analyses the hits, and systematically downloads the articles without further interaction.

  • It comes with the packages pygetpapers and downloadtools, which provide various functions to download, process, and save research papers and their metadata.

  • Users interact with it mainly through a command-line interface.

  • pygetpapers has a modular design, which keeps maintenance simple and makes it easy to add support for more repositories.



The developer documentation has been set up at readthedocs.

History

getpapers is a tool written by Rik Smith-Unna and funded by ContentMine, at https://github.com/ContentMine/getpapers. The OpenVirus community required a Python version, so Ayush Garg wrote an implementation from scratch, with some enhancements.

Formats supported by pygetpapers

  • pygetpapers provides fulltexts in XML and PDF format.

  • The metadata for papers can be saved in several formats, including JSON, CSV and HTML.

  • Queries can be saved in the form of an INI configuration file.

  • Supplementary files for papers can also be downloaded. References and citations for papers are given in XML format.

  • Log files can be saved in txt format.

Repository Structure

The main code is located in the pygetpapers directory. The supporting modules for the different repositories are located in the pygetpapers/repository directory.

Architecture

About the author and community

pygetpapers has been developed by Ayush Garg under the guidance of the OpenVirus community and Peter Murray-Rust. Ayush is currently a high school student who believes that the world can only truly progress when knowledge is open and accessible to all.

Testers from OpenVirus have given a lot of useful feedback to Ayush without which this project would not have been possible.

The community has taken time to ensure that everyone can contribute to this project. So, YOU, the developer, reader and researcher can also contribute by testing, developing, and sharing.

Installation

Ensure that pip is installed along with Python. Download Python from https://www.python.org/downloads/ and select the option "Add Python to PATH" while installing.

Check out https://pip.pypa.io/en/stable/installing/ if you have difficulties installing pip. Also, check out https://packaging.python.org/en/latest/tutorials/installing-packages/ to learn more about installing packages in Python.
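
Method one (Install from PyPI):

  • pygetpapers is published on PyPI, so the usual route (assuming pip is working) is:

    pip install pygetpapers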


Method two (Install Directly From Head):

  • Ensure the git CLI is installed and available on your PATH. Check out https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

  • Enter the command: pip install git+https://github.com/petermr/pygetpapers.git

  • Ensure pygetpapers has been installed by reopening the terminal and typing the command pygetpapers

  • You should see a help message come up.


Usage

pygetpapers is a command-line tool. You can ask for help by running:

pygetpapers --help
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT]
                   [--save_query] [-x] [-p] [-s] [-z] [--references REFERENCES]
                   [-n] [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE]
                   [-k LIMIT] [-r] [-u] [--onlyquery] [-c] [--makehtml]
                   [--synonym] [--startdate STARTDATE] [--enddate ENDDATE]
                   [--terms TERMS] [--notterms NOTTERMS] [--api API]
                   [--filter FILTER]

Welcome to Pygetpapers version 0.0.9.3. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file path to read query for pygetpapers
  -v, --version         output the version number
  -q QUERY, --query QUERY
                        query string transmitted to repository API. Eg.
                        "Artificial Intelligence" or "Plant Parts". To escape
                        special characters within the quotes, use backslash.
                        Incase of nested quotes, ensure that the initial quotes
                        are double and the qutoes inside are single. For eg:
                        `'(LICENSE:"cc by" OR LICENSE:"cc-by") AND
                        METHODS:"transcriptome assembly"' ` is wrong. We should
                        instead use `"(LICENSE:'cc by' OR LICENSE:'cc-by') AND
                        METHODS:'transcriptome assembly'"`
  -o OUTPUT, --output OUTPUT
                        output directory (Default: Folder inside current working directory named current date and time)
  --save_query          saved the passed query in a config file
  -x, --xml             download fulltext XMLs if available or save metadata as
                        XML
  -p, --pdf             [E][A] download fulltext PDFs if available (only eupmc
                        and arxiv supported)
  -s, --supp            [E] download supplementary files if available (only eupmc
                        supported)
  -z, --zip             [E] download files from ftp endpoint if available (only
                        eupmc supported)
  --references REFERENCES
                        [E] Download references if available. (only eupmc
                        supported)Requires source for references
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -n, --noexecute       [ALL] report how many results match the query, but don't
                        actually download anything
  --citations CITATIONS
                        [E] Download citations if available (only eupmc
                        supported). Requires source for citations
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -l LOGLEVEL, --loglevel LOGLEVEL
                        [All] Provide logging level. Example --log warning
                        <<info,warning,debug,error,critical>>, default='info'
  -f LOGFILE, --logfile LOGFILE
                        [All] save log to specified file in output directory as
                        well as printing to terminal
  -k LIMIT, --limit LIMIT
                        [All] maximum number of hits (default: 100)
  -r, --restart         [E] Downloads the missing flags for the corpus.Searches
                        for already existing corpus in the output directory
  -u, --update          [E][B][M][C] Updates the corpus by downloading new
                        papers. Requires -k or --limit (If not provided, default
                        will be used) and -q or --query (must be provided) to be
                        given. Searches for already existing corpus in the output
                        directory
  --onlyquery           [E] Saves json file containing the result of the query in
                        storage. (only eupmc supported)The json file can be given
                        to --restart to download the papers later.
  -c, --makecsv         [All] Stores the per-document metadata as csv.
  --makehtml            [All] Stores the per-document metadata as html.
  --synonym             [E] Results contain synonyms as well.
  --startdate STARTDATE
                        [E][B][M] Gives papers starting from given date. Format:
                        YYYY-MM-DD
  --enddate ENDDATE     [E][B][M] Gives papers till given date. Format: YYYY-MM-
                        DD
  --terms TERMS         [All] Location of the file which contains terms
                        serperated by a comma or an ami dict which will beOR'ed
                        among themselves and AND'ed with the query
  --notterms NOTTERMS   [All] Location of the txt file which contains terms
                        serperated by a comma or an ami dict which will beOR'ed
                        among themselves and NOT'ed with the query
  --api API             API to search [eupmc,
                        crossref,arxiv,biorxiv,medrxiv,rxivist] (default: eupmc)
  --filter FILTER       [C] filter by key value pair (only crossref supported)

Queries are built using the -q flag. The query format can be found at http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf. A condensed guide can be found at https://github.com/petermr/pygetpapers/wiki/query-format

Repository-specific flags

To convey which repositories a flag applies to, the first letter of the repository is given in square brackets in the flag's description: [E] EPMC, [C] crossref, [A] arxiv, [B] biorxiv, [M] medrxiv, and [All]/[ALL] for every repository.

What is CProject?

A CProject is a directory structure that the AMI toolset uses to gather and process data. Each paper gets its own folder. A CTree is a subdirectory of a CProject that deals with a single paper.
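
Once a download has finished, a CProject can also be walked programmatically. Below is a minimal sketch (an illustration, not part of pygetpapers itself) that assumes a CProject produced by pygetpapers in which each CTree contains a eupmc_result.json and, if requested, a fulltext.xml; the directory name and the "title" key are assumptions for the example:

import json
from pathlib import Path

cproject = Path("invasive_plant_species_test")  # hypothetical CProject directory

# every subdirectory of a CProject is a CTree (one folder per paper)
for ctree in sorted(p for p in cproject.iterdir() if p.is_dir()):
    metadata_file = ctree / "eupmc_result.json"       # per-paper metadata written by pygetpapers
    has_fulltext = (ctree / "fulltext.xml").exists()  # present when -x was used
    if metadata_file.exists():
        metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
        # "title" is assumed here to be one of the EPMC metadata fields in the JSON
        print(ctree.name, metadata.get("title", "<no title>"), "| fulltext:", has_fulltext)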

Tutorial

pygetpapers was on version 0.0.9.3 when the tutorials were documented.

pygetpapers supports multiple APIs, including eupmc, crossref, arxiv, biorxiv, medrxiv, rxivist-bio and rxivist-med. By default, it queries EPMC. You can specify the API using the --api flag.

You can also follow this colab notebook as part of the tutorial.

| Features         | EPMC            | crossref       | arxiv                | biorxiv         | medrxiv         | rxivist         |
|------------------|-----------------|----------------|----------------------|-----------------|-----------------|-----------------|
| Fulltext formats | xml, pdf        | NA             | pdf                  | xml             | xml             | xml             |
| Metadata formats | json, html, csv | json, xml, csv | json, csv, html, xml | json, csv, html | json, csv, html | json, html, csv |
| --query          | yes             | yes            | yes                  | NA              | NA              | NA              |
| --update         | yes             | NA             | NA                   | yes             | yes             | yes             |
| --restart        | yes             | NA             | NA                   | NA              | NA              | NA              |
| --citations      | yes             | NA             | NA                   | NA              | NA              | NA              |
| --references     | yes             | NA             | NA                   | NA              | NA              | NA              |
| --terms          | yes             | yes            | yes                  | NA              | NA              | NA              |

EPMC (Default API)

Example Query

Let’s break down the following query:

pygetpapers -q "METHOD: invasive plant species" -k 10 -o "invasive_plant_species_test" -c --makehtml -x --save_query

| Flag         | What it does                                            | In this case, pygetpapers...                                 |
|--------------|---------------------------------------------------------|--------------------------------------------------------------|
| -q           | specifies the query                                     | queries for 'invasive plant species' in the METHODS section  |
| -k           | number of hits (default 100)                            | limits hits to 10                                             |
| -o           | specifies the output directory                          | outputs to invasive_plant_species_test                        |
| -x           | downloads fulltext XML                                  | saves fulltext.xml in each paper's folder                     |
| -c           | saves per-paper metadata into a single CSV              | saves a single CSV named europe_pmc.csv                       |
| --makehtml   | saves per-paper metadata into a single HTML file        | saves a single HTML named europe_pmc.html                     |
| --save_query | saves the given query in a config .ini in the output dir| saves the query to saved_config.ini                           |

pygetpapers, by default, writes metadata to a JSON file within:

  • the individual paper directory for the corresponding paper (eupmc_result.json)

  • the output directory, covering all downloaded papers (eupmc_results.json)

OUTPUT:

INFO: Final query is METHOD: invasive plant species
INFO: Total Hits are 17910
0it [00:00, ?it/s]WARNING: Keywords not found for paper 1
WARNING: Keywords not found for paper 4
1it [00:00, 164.87it/s]
INFO: Saving XML files to C:\Users\shweata\invasive_plant_species_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:21<00:00,  2.11s/it]
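
The same request can also be made from a Python script through the run_command API described under "Run pygetpapers within the module" near the end of this page; a minimal sketch mirroring the command above:

from pygetpapers import Pygetpapers

pygetpapers_call = Pygetpapers()
# keyword arguments mirror the CLI flags (-q, -k, -o, -x, -c, --makehtml, --save_query)
pygetpapers_call.run_command(
    query="METHOD: invasive plant species",
    limit=10,
    output="invasive_plant_species_test",
    xml=True,
    makecsv=True,
    makehtml=True,
    save_query=True,
)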

Scope the number of hits for a query

If you just want to scope the number of hits for a given query, you can use the -n flag as shown below.

pygetpapers -n -q "essential oil"

OUTPUT:

INFO: Final query is essential oil
INFO: Total number of hits for the query are 190710

Update an existing CProject with new papers by feeding the metadata JSON

The --update flag is used to update a CProject with a new set of papers on the same or a different query. Say you have a corpus of 30 papers on 'essential oil' (like before) and would like to download 20 more papers into the same CProject directory; you use --update.

To update your CProject, give the -o flag the name of the already existing CProject. Additionally, add the --update flag. INPUT:

pygetpapers -q "invasive plant species" -k 10 -x -o lantana_test_5 --update

OUTPUT:

INFO: Final query is invasive plant species
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Total Hits are 32956
0it [00:00, ?it/s]WARNING: html url not found for paper 5
WARNING: pdf url not found for paper 5
WARNING: Keywords not found for paper 6
WARNING: Keywords not found for paper 7
WARNING: Author list not found for paper 10
1it [00:00, 166.68it/s]
INFO: Saving XML files to C:\Users\shweata\lantana_test_5\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [01:03<00:00,  3.16s/it]
How is --update different from just downloading x number of papers to the same output directory?

By using --update command you can be sure that you don’t overwrite the existing papers.

Restart downloading papers to an existing CProject

The --restart flag can be used for two purposes:

  • To download papers in a different format. Say you downloaded XMLs in the first round; if you want to download PDFs for the same set of papers, you use this flag.

  • To continue a download from the stage where it broke. This feature comes in handy particularly if you are on a poor connection. Let's start by forcefully interrupting a download. INPUT:

pygetpapers -q "pinus" -k 10 -o pinus_10 -x

OUTPUT:

INFO: Final query is pinus
INFO: Total Hits are 32086
0it [00:00, ?it/s]WARNING: html url not found for paper 10
WARNING: pdf url not found for paper 10
1it [00:00, 63.84it/s]
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
 60%|██████████████████████████████████████████████████████████████████████████████▌                                                    | 6/10 [00:20<00:13,  3.42s/it]
Traceback (most recent call last):
...
KeyboardInterrupt
^C

If you take a look at the CProject directory, there are 6 papers downloaded so far.

C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8216501
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309040
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8325914
        eupmc_result.json
        fulltext.xml

To download the rest, we can use the --restart flag. INPUT:

pygetpapers -q "pinus" -o pinus_10 --restart -x

OUTPUT:

INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊                          | 8/10 [00:27<00:07,  3.51s/it 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉             | 9/10 [00:38<00:05,  5.95s/it100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00,  4.49s/it100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00,  4.01s/it]

Now if we inspect the CProject directory, we see that we have 10 papers as specified.

C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8199922
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8216501
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309040
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309338
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8325914
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8399312
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8400686
        eupmc_result.json
        fulltext.xml

Under the hood, pygetpapers looks for eupmc_results.json, reads it and resumes the download.

You can also use --restart to download the fulltext or metadata in a different format from the ones you've already downloaded. For example, to get fulltext PDFs (and HTML metadata) for the 10 papers on pinus, run:

INPUT:

pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml

OUTPUT:

>pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
100%|█████████████████████████████████████████████| 10/10 [03:26<00:00, 20.68s/it]

Now, if we take a look at the CProject:

C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
...

We find that each paper now has fulltext PDFs and metadata in HTML.

Difference between --restart and --update
  • If you aren't looking to download a new set of papers but want the existing papers in a different format, --restart is the flag to use.

  • If you are looking to download a new set of papers into an existing CProject, use --update. Note that the chosen formats only apply to the new set of papers, not to the old ones.

Downloading citations and references for papers, if available

  • --references and --citations flags can be used to download the references and citations respectively.

  • Both require a source for the references or citations (AGR, CBA, CTX, ETH, HIR, MED, PAT, PMC, PPR).

    pygetpapers -q "lantana" -k 10 -o "test" -c -x --citation PMC

Downloading only the metadata

If you are looking to download just the metadata in the supported formats, --onlyquery is the flag to use. It saves the metadata in the output directory.

You can later use the --restart feature to download the fulltexts for these papers. INPUT:

pygetpapers --onlyquery -q "lantana" -k 10 -o "lantana_test" -c

OUTPUT:

INFO: Final query is lantana
INFO: Total Hits are 1909
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Author list not found for paper 5
WARNING: Author list not found for paper 8
WARNING: Keywords not found for paper 9
1it [00:00, 407.69it/s]

Download papers within certain start and end date range

By using --startdate and --enddate you can specify the date range within which the papers you want to download were first published.

pygetpapers -q "METHOD:essential oil" --startdate "2020-01-02" --enddate "2021-09-09"

Saving query for later use

To save a query for later use, you can use --save_query. It saves the query in a .ini file in the output directory.

pygetpapers -q "lantana" -k 10 -o "lantana_query_config" --save_query

Here is an example config file pygetpapers outputs

Feed query using config.ini file

You can re-run a query using the config.ini file you created with --save_query. To do so, give the --config flag the absolute path of the saved_config.ini file.

pygetpapers --config "C:\Users\shweata\lantana_query_config\saved_config.ini"

Querying using a term list

--terms flag

If your query is complex, with multiple ORs, you can use the --terms feature. To do so, you will:

  • Create a .txt file with a list of terms separated by commas, or an ami-dictionary (see "Using --terms with dictionaries" below to learn how to create dictionaries).

  • Give the --terms flag the absolute path of the .txt file or ami-dictionary (XML)

-q is optional. The terms are OR'ed with each other and AND'ed with the query, if given.

INPUT:

pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil" -x

OUTPUT:

C:\Users\shweata>pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil"
INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Total Hits are 43397
0it [00:00, ?it/s]WARNING: Author list not found for paper 9
1it [00:00, 1064.00it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:19<00:00,  1.99s/it]
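
To make the composition concrete, here is a toy illustration (not pygetpapers' own code) of how a comma-separated terms file is OR'ed together and then AND'ed with the query, matching the "Final query" line above; the terms file is the one used in the example:

def compose_query(query, terms_path):
    # read the comma-separated terms and drop surrounding whitespace
    with open(terms_path) as f:
        terms = [t.strip() for t in f.read().split(",") if t.strip()]
    ored = " OR ".join(terms)
    # terms are OR'ed among themselves and AND'ed with the query, if one is given
    return f"({query} AND ({ored}))" if query else f"({ored})"

print(compose_query("essential oil", "essential_oil_terms.txt"))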

You can also use this feature to download papers by their PMC IDs. Feed in a .txt file with the PMC IDs comma-separated. Make sure to give a large enough hit limit to download all the papers listed in the file.

An example text file can be found here. INPUT:

pygetpapers --terms C:\Users\shweata\PMCID_pygetpapers_text.txt -k 100 -o "PMCID_test"

OUTPUT:

INFO: Final query is (PMC6856665 OR  PMC6877543 OR  PMC6927906 OR  PMC7008714 OR  PMC7040181 OR  PMC7080866 OR  PMC7082878 OR  PMC7096589 OR  PMC7111464 OR  PMC7142259 OR  PMC7158757 OR  PMC7174509 OR  PMC7193700 OR  PMC7198785 OR  PMC7201129 OR  PMC7203781 OR  PMC7206980 OR  PMC7214627 OR  PMC7214803 OR  PMC7220991
)
INFO: Total Hits are 20
WARNING: Could not find more papers
1it [00:00, 505.46it/s]
100%|█████████████████████████████████████████████| 20/20 [00:32<00:00,  1.61s/it]
--notterms

Excluding papers that contain certain keywords might also be of interest to you. For example, if you want papers on essential oil that don't mention antibacterial, antiseptic or antimicrobial, create either a dictionary or a text file with these terms (comma-separated) and give its absolute path to the --notterms flag.

INPUT:

pygetpapers -q "essential oil" -k 10 -o essential_oil_not_terms_test --notterms C:\Users\shweata\not_terms_test.txt

OUTPUT:

INFO: Final query is (essential oil AND NOT (antimicrobial OR  antiseptic OR  antibacterial))
INFO: Total Hits are 165557
1it [00:00, ?it/s]
100%|█| 10/10 [00:49<00:00,  4.95s/

The number of hits is reduced accordingly. For comparison, the plain "essential oil" query gives us 193922 hits.

C:\Users\shweata>pygetpapers -q "essential oil" -n
INFO: Final query is essential oil
INFO: Total number of hits for the query are 193922
Using --terms with dictionaries

We will take the same example as before.

  • Assuming you have ami3 installed, you can create ami-dictionaries

    • Start off by listing the terms in a .txt file

    antimicrobial
    antiseptic
    antibacterial
    
    • Run the following command from the directory in which the text file exists

    amidict -v --dictionary pygetpapers_terms --directory pygetpapers_terms --input pygetpapers_terms.txt create --informat list --outformats xml
    

That’s it! You’ve now created a simple ami-dictionary. There are ways of creating dictionaries from Wikidata as well. You can learn more about how to do that in this Wiki page.

  • You can also use standard dictionaries that are already available. We then pass the absolute path of the dictionary to the --terms flag.

INPUT:

pygetpapers -q "essential oil" --terms C:\Users\shweata\pygetpapers_terms\pygetpapers_terms.xml -k 10 -o pygetpapers_dictionary_test -x

OUTPUT:

INFO: Final query is (essential oil AND (antibacterial OR antimicrobial OR antiseptic))
INFO: Total Hits are 28365
0it [00:00, ?it/s]WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 7
1it [00:00, ?it/s]
INFO: Saving XML files to C:\Users\shweata\pygetpapers_dictionary_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:36<00:00,  3.67s/it]

Log levels

You can specify the log level using the -l flag. The default, as you've already seen so far, is info.

INPUT:

pygetpapers -q "lantana" -k 10 -o lantana_test_10_2 --loglevel debug -x

Log file

You can also choose to write the log to a .txt file in your HOME directory, while simultaneously printing it out.

INPUT:

pygetpapers -q "lantana" -k 10 -o lantana_test_10_4 --loglevel debug -x --logfile test_log.txt

Here is the log file.

Crossref

You can query the crossref API for metadata only.

Sample query

  • The metadata format flags are applicable as described in the EPMC tutorial

  • --terms and -q are also applicable to crossref. INPUT:

pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml

OUTPUT:

INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Making csv files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 185.52it/s]
INFO: Making html files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 87.98it/s]
INFO: Making xml files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 366.97it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 996.82it/s]

We have 10 folders in the CProject directory.

C:\Users\shweata>cd terms_test_essential_oil_crossref_3

C:\Users\shweata\terms_test_essential_oil_crossref_3>tree
Folder PATH listing for volume Windows-SSD
Volume serial number is D88A-559A
C:.
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2016.1231597
├───10.1080_10412905.1989.9697767
├───10.1111_j.1745-4565.2012.00378.x
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122

--update

--update works the same as in EPMC. You can use this flag to increase the number of papers in a given CProject. INPUT

pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 5 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml --update

OUTPUT:

INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████| 5/5 [00:00<00:00, 346.84it/s]

The CProject after updating:

C:.
├───10.1002_mbo3.459
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2014.895156
├───10.1080_0972060x.2016.1231597
├───10.1080_0972060x.2017.1345329
├───10.1080_10412905.1989.9697767
├───10.1080_10412905.2021.1941338
├───10.1111_j.1745-4565.2012.00378.x
├───10.15406_oajs.2019.03.00121
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122

We started off with 10 paper folders, and increased the number to 15.

Filter

crossref results can additionally be filtered by a key-value pair using the --filter flag (see the --filter entry in the help text above); this flag is supported only for crossref.

arxiv

pygetpapers allows you to query arxiv for full text PDF and metadata in all supported formats.

Sample query

INPUT

pygetpapers --api arxiv -k 10 -o arxiv_test_3 -q "artificial intelligence" -x -p --makehtml -c

OUTPUT


INFO: Final query is artificial intelligence
INFO: Making request to Arxiv through pygetpapers
INFO: Got request result from Arxiv through pygetpapers
INFO: Requesting 10 results at offset 0
INFO: Requesting page of results
INFO: Got first page; 10 of 10 results available
INFO: Downloading Pdfs for papers
100%|█████████████████████████████████████████████| 10/10 [01:02<00:00,  6.27s/it]
INFO: Making csv files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 187.31it/s]
INFO: Making html files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 161.87it/s]
INFO: Making xml files for metadata at C:\Users\shweata\arxiv_test_3
100%|█████████████████████████████████████████████████████| 10/10 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 1111.22it/s]

Note: --update isn’t supported for arxiv

Biorxiv and Medrxiv

You can query biorxiv and medrxiv for fulltext and metadata (in all supported formats). However, passing a query string using the -q flag isn't supported for either repository. You can only provide a date range.

Sample Query - biorxiv

INPUT:

pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831

OUTPUT:

WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:23<00:00,  2.34s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 684.72it/s]

--update command

INPUT

pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831 --update

OUTPUT

WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00,  2.23s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 492.39it/s]

The CProject now has 20 papers in total after updating.

├───10.1101_008326
├───10.1101_010553
├───10.1101_035972
├───10.1101_046052
├───10.1101_060012
├───10.1101_067736
├───10.1101_086710
├───10.1101_092205
├───10.1101_092619
├───10.1101_093237
├───10.1101_121061
├───10.1101_135749
├───10.1101_145664
├───10.1101_145896
├───10.1101_165845
├───10.1101_180273
├───10.1101_181198
├───10.1101_191858
├───10.1101_194266
└───10.1101_196105

medrxiv works the same way as biorxiv.

rxivist

rxivist lets you send a query string to both biorxiv and medrxiv. The results are a mixture of papers from both repositories, since rxivist doesn't differentiate between them.

Another caveat is that you can only retrieve metadata from rxivist.

INPUT:

pygetpapers --api rxivist -q "biomedicine" -k 10 -c -x -o "biomedicine_rxivist" --makehtml -p

OUTPUT:

WARNING: Pdf is not supported for this api
INFO: Final query is biomedicine
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 125.54it/s]
INFO: Making html files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 124.71it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 633.38it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 751.09it/s]

Query hits only

As with the other repositories under pygetpapers, you can use the -n flag to get only the number of hits. INPUT:

C:\Users\shweata>pygetpapers --api rxivist -q "biomedical sciences" -n

OUTPUT:

INFO: Final query is biomedical sciences
INFO: Making Request to rxivist
INFO: Total number of hits for the query are 62

Update

--update works the same as for the other repositories. Make sure to provide rxivist as the --api.

INPUT:

pygetpapers --api rxivist -q "biomedical sciences" -k 20 -c -x -o "biomedicine_rxivist" --update

OUTPUT:

INFO: Final query is biomedical sciences
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 203.69it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1059.17it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1077.12it/s]

Run pygetpapers within the module

def run_command(output=False, query=False, save_query=False, xml=False, pdf=False, supp=False, zip=False, references=False, noexecute=False, citations=False, limit=100, restart=False, update=False, onlyquery=False, makecsv=False, makehtml=False, synonym=False, startdate=False, enddate=False, terms=False, notterms=False, api='europe_pmc', filter=None, loglevel='info', logfile=False, version=False)

Here’s an example script to download 50 papers from EPMC on ‘lantana camara’.

from pygetpapers import Pygetpapers
pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query='lantana camara', limit=50, output='lantana_camara', xml=True)

Test pygetpapers

To run automated testing on pygetpapers, do the following:

  1. Install pygetpapers

  2. Clone the pygetpapers repository

  3. Install pytest

  4. Run the command, pytest
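
In practice (assuming git and pip are on your PATH, and that you clone into the current directory), that sequence looks something like:

pip install pygetpapers
git clone https://github.com/petermr/pygetpapers.git
cd pygetpapers
pip install pytest
pytest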

Contributions

https://github.com/petermr/pygetpapers/blob/main/resources/CONTRIBUTING.md

Feature Requests

To request features, please put them in issues

User Documentation

pygetpapers

Research Papers right from python

What is pygetpapers

  • pygetpapers is a tool to assist text miners. It makes requests to open access scientific text repositories, analyses the hits, and systematically downloads the articles without further interaction.

  • Comes with the packages pygetpapers and downloadtools which provide various functions to download, process and save research papers and their metadata.

  • The main medium of its interaction with users is through a command-line interface.

  • pygetpapers has a modular design which makes maintenance easy and simple. This also allows adding support for more repositories simple.


img img img img img Documentation Status DOI badge

The developer documentation has been setup at readthedocs

History

getpapers is a tool written by Rik Smith-Unna funded by ContentMine at https://github.com/ContentMine/getpapers. The OpenVirus community requires a Python version and Ayush Garg has written an implementation from scratch, with some enhancements.

Formats supported by pygetpapers

  • pygetpapers gives fulltexts in xml and pdf format.

  • The metadata for papers can be saved in many formats including JSON, CSV, HTML.

  • Queries can be saved in form of an ini configuration file.

  • The additional files for papers can also be downloaded. References and citations for papers are given in XML format.

  • Log files can be saved in txt format.

Repository Structure

The main code is located in the pygetpapers directory. All the supporting modules for different repositories are described in the pygetpapers/repository directory.

Architecture

About the author and community

pygetpapers has been developed by Ayush Garg under the dear guidance of the OpenVirus community and Peter Murray Rust. Ayush is currently a high school student who believes that the world can only truly progress when knowledge is open and accessible by all.

Testers from OpenVirus have given a lot of useful feedback to Ayush without which this project would not have been possible.

The community has taken time to ensure that everyone can contribute to this project. So, YOU, the developer, reader and researcher can also contribute by testing, developing, and sharing.

Installation

Ensure that pip is installed along with python. Download python from: https://www.python.org/downloads/ and select the option Add Python to Path while installing.

Check out https://pip.pypa.io/en/stable/installing/ if difficulties installing pip. Also, checkout https://packaging.python.org/en/latest/tutorials/installing-packages/ to learn more about installing packages in python.


Method two (Install Directly From Head):

  • Ensure git cli is installed and is available in path. Check out (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)

  • Enter the command: pip install git+https://github.com/petermr/pygetpapers.git

  • Ensure pygetpapers has been installed by reopening the terminal and typing the command pygetpapers

  • You should see a help message come up.


Usage

pygetpapers is a commandline tool. You can ask for help by running:

pygetpapers --help
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT]
                   [--save_query] [-x] [-p] [-s] [-z] [--references REFERENCES]
                   [-n] [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE]
                   [-k LIMIT] [-r] [-u] [--onlyquery] [-c] [--makehtml]
                   [--synonym] [--startdate STARTDATE] [--enddate ENDDATE]
                   [--terms TERMS] [--notterms NOTTERMS] [--api API]
                   [--filter FILTER]

Welcome to Pygetpapers version 0.0.9.3. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file path to read query for pygetpapers
  -v, --version         output the version number
  -q QUERY, --query QUERY
                        query string transmitted to repository API. Eg.
                        "Artificial Intelligence" or "Plant Parts". To escape
                        special characters within the quotes, use backslash.
                        Incase of nested quotes, ensure that the initial quotes
                        are double and the qutoes inside are single. For eg:
                        `'(LICENSE:"cc by" OR LICENSE:"cc-by") AND
                        METHODS:"transcriptome assembly"' ` is wrong. We should
                        instead use `"(LICENSE:'cc by' OR LICENSE:'cc-by') AND
                        METHODS:'transcriptome assembly'"`
  -o OUTPUT, --output OUTPUT
                        output directory (Default: Folder inside current working directory named current date and time)
  --save_query          saved the passed query in a config file
  -x, --xml             download fulltext XMLs if available or save metadata as
                        XML
  -p, --pdf             [E][A] download fulltext PDFs if available (only eupmc
                        and arxiv supported)
  -s, --supp            [E] download supplementary files if available (only eupmc
                        supported)
  -z, --zip             [E] download files from ftp endpoint if available (only
                        eupmc supported)
  --references REFERENCES
                        [E] Download references if available. (only eupmc
                        supported)Requires source for references
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -n, --noexecute       [ALL] report how many results match the query, but don't
                        actually download anything
  --citations CITATIONS
                        [E] Download citations if available (only eupmc
                        supported). Requires source for citations
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -l LOGLEVEL, --loglevel LOGLEVEL
                        [All] Provide logging level. Example --log warning
                        <<info,warning,debug,error,critical>>, default='info'
  -f LOGFILE, --logfile LOGFILE
                        [All] save log to specified file in output directory as
                        well as printing to terminal
  -k LIMIT, --limit LIMIT
                        [All] maximum number of hits (default: 100)
  -r, --restart         [E] Downloads the missing flags for the corpus.Searches
                        for already existing corpus in the output directory
  -u, --update          [E][B][M][C] Updates the corpus by downloading new
                        papers. Requires -k or --limit (If not provided, default
                        will be used) and -q or --query (must be provided) to be
                        given. Searches for already existing corpus in the output
                        directory
  --onlyquery           [E] Saves json file containing the result of the query in
                        storage. (only eupmc supported)The json file can be given
                        to --restart to download the papers later.
  -c, --makecsv         [All] Stores the per-document metadata as csv.
  --makehtml            [All] Stores the per-document metadata as html.
  --synonym             [E] Results contain synonyms as well.
  --startdate STARTDATE
                        [E][B][M] Gives papers starting from given date. Format:
                        YYYY-MM-DD
  --enddate ENDDATE     [E][B][M] Gives papers till given date. Format: YYYY-MM-
                        DD
  --terms TERMS         [All] Location of the file which contains terms
                        serperated by a comma or an ami dict which will beOR'ed
                        among themselves and AND'ed with the query
  --notterms NOTTERMS   [All] Location of the txt file which contains terms
                        serperated by a comma or an ami dict which will beOR'ed
                        among themselves and NOT'ed with the query
  --api API             API to search [eupmc,
                        crossref,arxiv,biorxiv,medrxiv,rxivist] (default: eupmc)
  --filter FILTER       [C] filter by key value pair (only crossref supported)

Queries are build using -q flag. The query format can be found at http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf A condensed guide can be found at https://github.com/petermr/pygetpapers/wiki/query-format

Repository-specific flags

To convey the repository specificity, we’ve used the first letter of the repository in square brackets in its description.

What is CProject?

A CProject is a directory structure that the AMI toolset uses to gather and process data. Each paper gets its folder. A CTree is a subdirectory of a CProject that deals with a single paper.

Tutorial

pygetpapers was on version 0.0.9.3 when the tutorials were documented.

pygetpapers supports multiple APIs including eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist-bio,rxivist-med. By default, it queries EPMC. You can specify the API by using --api flag.

You can also follow this colab notebook as part of the tutorial.

Features

EPMC

crossref

arxiv

biorxiv

medarxiv

rxvist

Fulltext formats

xml, pdf

NA

pdf

xml

xml

xml

Metdata formats

json, html, csv

json, xml, csv

json, csv, html, xml

json, csv, html

json, csv, html

json, html, csv

--query

yes

yes

yes

NA

NA

NA

--update

yes

NA

NA

yes

yes

--restart

yes

NA

NA

NA

NA

NA

--citation

yes

NA

NA

NA

NA

NA

--references

yes

NA

NA

NA

NA

NA

--terms

yes

yes

yes

NA

NA

NA

EPMC (Default API)

Example Query

Let’s break down the following query:

pygetpapers -q "METHOD: invasive plant species" -k 10 -o "invasive_plant_species_test" -c --makehtml -x --save_query

Flag

What it does

In this case pygetpapers

-q

specifies the query

queries for ‘invasive plant species’ in METHODS section

-k

number of hits (default 100)

limits hits to 10

-o

specifies output directory

outputs to invasive_plant_species_test

-x

downloads fulltext xml

-c

saves per-paper metadata into a single csv

saves single CSV named europe_pmc.csv

--makehtml

saves per-paper metadata into a single HTML file

saves single HTML named europe_pmc.html

--save_query

saves the given query in a config.ini in output directory

saves query to saved_config.ini

pygetpapers, by default, writes metadata to a JSON file within:

  • individual paper directory for corresponding paper (epmc_result.json)

  • working directory for all downloaded papers (epmc_results.json)

OUTPUT:

INFO: Final query is METHOD: invasive plant species
INFO: Total Hits are 17910
0it [00:00, ?it/s]WARNING: Keywords not found for paper 1
WARNING: Keywords not found for paper 4
1it [00:00, 164.87it/s]
INFO: Saving XML files to C:\Users\shweata\invasive_plant_species_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:21<00:00,  2.11s/it]
Scope the number of hits for a query

If you are just scoping the number of hits for a given query, you can use -n flag as shown below.

pygetpapers -n -q "essential oil"

OUTPUT:

INFO: Final query is essential oil
INFO: Total number of hits for the query are 190710
Update an existing CProject with new papers by feeding the metadata JSON

The --update command is used to update a CProject with a new set of papers on same or different query. If let’s say you have a corpus of a 30 papers on ‘essential oil’ (like before) and would like to download 20 more papers to the same CProject directory, you use --update command.

To update your Cproject, you would give it the -o flag the already existing CProject name. Additionally, you should also add --update flag. INPUT:

pygetpapers -q "invasive plant species" -k 10 -x -o lantana_test_5 --update

OUTPUT:

INFO: Final query is invasive plant species
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Total Hits are 32956
0it [00:00, ?it/s]WARNING: html url not found for paper 5
WARNING: pdf url not found for paper 5
WARNING: Keywords not found for paper 6
WARNING: Keywords not found for paper 7
WARNING: Author list not found for paper 10
1it [00:00, 166.68it/s]
INFO: Saving XML files to C:\Users\shweata\lantana_test_5\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [01:03<00:00,  3.16s/it]
How is --update different from just downloading x number of papers to the same output directory?

By using --update command you can be sure that you don’t overwrite the existing papers.

Restart downloading papers to an existing CProject

--restart flag can be used for two purposes:

  • To download papers in different format. Let’s say you downloaded XMLs in the first round. If you want to download pdfs for same set of papers, you use this flag.

  • Continue the download from the stage where it broke. This feature would particularly come in handy if you are on poor lines. Let’s start off by forcefully interrupting the download. INPUT:

pygetpapers -q "pinus" -k 10 -o pinus_10 -x

OUTPUT:

INFO: Final query is pinus
INFO: Total Hits are 32086
0it [00:00, ?it/s]WARNING: html url not found for paper 10
WARNING: pdf url not found for paper 10
1it [00:00, 63.84it/s]
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
 60%|██████████████████████████████████████████████████████████████████████████████▌                                                    | 6/10 [00:20<00:13,  3.42s/it]
Traceback (most recent call last):
...
KeyboardInterrupt
^C

If you take a look at the CProject directory, there are 6 papers downloaded so far.

C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8216501
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309040
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8325914
        eupmc_result.json
        fulltext.xml

To download the rest, we can use --restart flag. INPUT

pygetpapers -q "pinus" -o pinus_10 --restart -x

OUTPUT:

INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊                          | 8/10 [00:27<00:07,  3.51s/it 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉             | 9/10 [00:38<00:05,  5.95s/it100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00,  4.49s/it100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00,  4.01s/it]

Now if we inspect the CProject directory, we see that we have 10 papers as specified.

C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8199922
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8216501
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309040
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8309338
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8325914
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8399312
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8400686
        eupmc_result.json
        fulltext.xml

Under the hood, pygetpapers looks for eupmc_results.json, reads it and resumes the download.

You could also use --restart to download the fulltext or metadata in different format other than the ones that you’ve already downloaded. For example, if I want all the fulltext PDFs of the 10 papers on pinus, I can run:

INPUT:

pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml

OUTPUT:

>pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
100%|█████████████████████████████████████████████| 10/10 [03:26<00:00, 20.68s/it]

Now, if we take a look at the CProject:

C:.
│   eupmc_results.json
│
├───PMC8157994
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
│
├───PMC8180188
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
│
├───PMC8198815
│       eupmc_result.html
│       eupmc_result.json
│       fulltext.pdf
│       fulltext.xml
...

We find that each paper now has fulltext PDFs and metadata in HTML.

Difference between --restart and --update
  • If you aren’t looking download new set of papers but would want to download a papers in different format for existing papers, --restart is the flag you’d want to use

  • If you are looking to download a new set of papers to an existing Cproject, then you’d use --update command. You should note that the format in which you download papers would only apply to the new set of papers and not for the old.

Downloading citations and references for papers, if available
  • --references and --citations flags can be used to download the references and citations respectively.

  • It also requires source for references (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR)

    pygetpapers -q "lantana" -k 10 -o "test" -c -x --citation PMC

Downloading only the metadata

If you are looking to download just the metadata in the supported formats--onlyquery is the flag you use. It saves the metadata in the output directory.

You can use --restart feature to download the fulltexts for these papers. INPUT:

pygetpapers --onlyquery -q "lantana" -k 10 -o "lantana_test" -c

OUTPUT:

INFO: Final query is lantana
INFO: Total Hits are 1909
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Author list not found for paper 5
WARNING: Author list not found for paper 8
WARNING: Keywords not found for paper 9
1it [00:00, 407.69it/s]
Download papers within certain start and end date range

By using --startdate and --enddate you can specify the date range within which the papers you want to download were first published.

pygetpapers -q "METHOD:essential oil" --startdate "2020-01-02" --enddate "2021-09-09"
Saving query for later use

To save a query for later use, you can use --save_query. What it does is that it saves the query in a .ini file in the output directory.

pygetpapers -q "lantana" -k 10 -o "lantana_query_config"--save_query

Here is an example config file pygetpapers outputs

Feed query using config.ini file

Using can use the config.ini file you created using --save_query, you re-run the query. To do so, you will give --config flag the absolute path of the saved_config.ini file.

pygetpapers --config "C:\Users\shweata\lantana_query_config\saved_config.ini"

Querying using a term list
--terms flag

If your query is complex with multiple ORs, you can use --terms feature. To do so, you will:

  • Create a .txt file with list of terms separated by commas or an ami-dictionary (Click here to learn how to create dictionaries).

  • Give the --terms flag the absolute path of the .txt file or ami-dictionary (XML)

-q is optional.The terms would be OR’ed with each other ANDed with the query, if given.

INPUT:

pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil" -x

OUTPUT:

C:\Users\shweata>pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil"
INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Total Hits are 43397
0it [00:00, ?it/s]WARNING: Author list not found for paper 9
1it [00:00, 1064.00it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:19<00:00,  1.99s/it]

You can also use this feature to download papers by using the PMC Ids. You can feed the .txt file with PMC ids comman-separated. Make sure to give a large enough hit number to download all the papers specified in the text file.

Example text file can be found, here INPUT:

pygetpapers --terms C:\Users\shweata\PMCID_pygetpapers_text.txt -k 100 -o "PMCID_test"

OUTPUT:

INFO: Final query is (PMC6856665 OR  PMC6877543 OR  PMC6927906 OR  PMC7008714 OR  PMC7040181 OR  PMC7080866 OR  PMC7082878 OR  PMC7096589 OR  PMC7111464 OR  PMC7142259 OR  PMC7158757 OR  PMC7174509 OR  PMC7193700 OR  PMC7198785 OR  PMC7201129 OR  PMC7203781 OR  PMC7206980 OR  PMC7214627 OR  PMC7214803 OR  PMC7220991
)
INFO: Total Hits are 20
WARNING: Could not find more papers
1it [00:00, 505.46it/s]
100%|█████████████████████████████████████████████| 20/20 [00:32<00:00,  1.61s/it]
--notterms

Excluded papers that have certain keywords might also be of interest for you. For example, if you want papers on essential oil which doesn’t mention antibacterial , antiseptic or antimicrobial, you can run either create a dictionary or a text file with these terms (comma-separated), specify its absolute path to --notterms flag.

INPUT:

pygetpapers -q "essential oil" -k 10 -o essential_oil_not_terms_test --notterms C:\Users\shweata\not_terms_test.txt

OUTPUT:

INFO: Final query is (essential oil AND NOT (antimicrobial OR  antiseptic OR  antibacterial))
INFO: Total Hits are 165557
1it [00:00, ?it/s]
100%|█| 10/10 [00:49<00:00,  4.95s/

The number of papers are reduced by a some proportion. For comparision, “essential oil” query gives us 193922 hits.

C:\Users\shweata>pygetpapers -q "essential oil" -n
INFO: Final query is essential oil
INFO: Total number of hits for the query are 193922
Using --terms with dictionaries

We will take the same example as before.

  • Assuming you have ami3 installed, you can create ami-dictionaries

    • Start off by listing the terms in a .txt file

    antimicrobial
    antiseptic
    antibacterial
    
    • Run the following command from the directory in which the text file exists

    amidict -v --dictionary pygetpapers_terms --directory pygetpapers_terms --input pygetpapers_terms.txt create --informat list --outformats xml
    

That’s it! You’ve now created a simple ami-dictionary. There are ways of creating dictionaries from Wikidata as well. You can learn more about how to do that in this Wiki page.

  • You can also use standard dictionaries that are already available. We then pass the absolute path of the dictionary to the --terms flag.

INPUT:

pygetpapers -q "essential oil" --terms C:\Users\shweata\pygetpapers_terms\pygetpapers_terms.xml -k 10 -o pygetpapers_dictionary_test -x

OUTPUT:

INFO: Final query is (essential oil AND (antibacterial OR antimicrobial OR antiseptic))
INFO: Total Hits are 28365
0it [00:00, ?it/s]WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 7
1it [00:00, ?it/s]
INFO: Saving XML files to C:\Users\shweata\pygetpapers_dictionary_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:36<00:00,  3.67s/it]
Log levels

You can specify the log level using the -l/--loglevel flag. The default, as you’ve already seen so far, is info.

INPUT:

pygetpapers -q "lantana" -k 10 -o lantana_test_10_2 --loglevel debug -x
Log file

You can also choose to write the log to a .txt file in your HOME directory, while simultaneously printing it out.

INPUT:

pygetpapers -q "lantana" -k 10 -o lantana_test_10_4 --loglevel debug -x --logfile test_log.txt

Here is the log file.

Crossref

You can query the Crossref API for metadata only.

Sample query
  • The metadata formats flags are applicable as described in the EPMC tutorial

  • --terms and -q are also applicable to crossref.

INPUT:

pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml

OUTPUT:

INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Making csv files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 185.52it/s]
INFO: Making html files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 87.98it/s]
INFO: Making xml files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 366.97it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 996.82it/s]

We have 10 folders in the CProject directory.

C:\Users\shweata>cd terms_test_essential_oil_crossref_3

C:\Users\shweata\terms_test_essential_oil_crossref_3>tree
Folder PATH listing for volume Windows-SSD
Volume serial number is D88A-559A
C:.
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2016.1231597
├───10.1080_10412905.1989.9697767
├───10.1111_j.1745-4565.2012.00378.x
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
--update

--update works the same as in EPMC. You can use this flag to increase the number of papers in a given CProject.

INPUT

pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 5 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml --update

OUTPUT:

INFO: Final query is (essential oil AND (antioxidant OR  antibacterial OR  antifungal OR  antiseptic OR  antitrichomonal agent))
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████| 5/5 [00:00<00:00, 346.84it/s]

The CProject after updating:

C:.
├───10.1002_mbo3.459
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2014.895156
├───10.1080_0972060x.2016.1231597
├───10.1080_0972060x.2017.1345329
├───10.1080_10412905.1989.9697767
├───10.1080_10412905.2021.1941338
├───10.1111_j.1745-4565.2012.00378.x
├───10.15406_oajs.2019.03.00121
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122

We started off with 10 paper folders, and increased the number to 15.

Filter

arxiv

pygetpapers allows you to query arxiv for full text PDF and metadata in all supported formats.

Sample query

INPUT

pygetpapers --api arxiv -k 10 -o arxiv_test_3 -q "artificial intelligence" -x -p --makehtml -c

OUTPUT


INFO: Final query is artificial intelligence
INFO: Making request to Arxiv through pygetpapers
INFO: Got request result from Arxiv through pygetpapers
INFO: Requesting 10 results at offset 0
INFO: Requesting page of results
INFO: Got first page; 10 of 10 results available
INFO: Downloading Pdfs for papers
100%|█████████████████████████████████████████████| 10/10 [01:02<00:00,  6.27s/it]
INFO: Making csv files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 187.31it/s]
INFO: Making html files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 161.87it/s]
INFO: Making xml files for metadata at C:\Users\shweata\arxiv_test_3
100%|█████████████████████████████████████████████████████| 10/10 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 1111.22it/s]

Note: --update isn’t supported for arxiv

Biorxiv and Medrxiv

You can query biorxiv and medrxiv for fulltext and metadata (in all supported formats). However, passing a query string using the -q flag isn’t supported for either repository. You can only provide a date range.

Sample Query - biorxiv

INPUT:

pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831

OUTPUT:

WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:23<00:00,  2.34s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 684.72it/s]
--update command

INPUT

pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831 --update

OUTPUT

WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00,  2.23s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 492.39it/s]

The CProject now has 20 papers in total after updating.

├───10.1101_008326
├───10.1101_010553
├───10.1101_035972
├───10.1101_046052
├───10.1101_060012
├───10.1101_067736
├───10.1101_086710
├───10.1101_092205
├───10.1101_092619
├───10.1101_093237
├───10.1101_121061
├───10.1101_135749
├───10.1101_145664
├───10.1101_145896
├───10.1101_165845
├───10.1101_180273
├───10.1101_181198
├───10.1101_191858
├───10.1101_194266
└───10.1101_196105

medrxiv works the same way as biorxiv.

rxivist

Lets you specify a query string for both biorxiv and medrxiv. The results you get will be a mixture of papers from both repositories, since rxivist doesn’t differentiate.

Another caveat here is that you can only retrieve metadata from rxivist.

INPUT:

pygetpapers --api rxivist -q "biomedicine" -k 10 -c -x -o "biomedicine_rxivist" --makehtml -p

OUTPUT:

WARNING: Pdf is not supported for this api
INFO: Final query is biomedicine
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 125.54it/s]
INFO: Making html files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 124.71it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 633.38it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 751.09it/s]
Query hits only

Like the other repositories under pygetpapers, you can use the -n flag to get only the number of hits.

INPUT:

C:\Users\shweata>pygetpapers --api rxivist -q "biomedical sciences" -n

OUTPUT:

INFO: Final query is biomedical sciences
INFO: Making Request to rxivist
INFO: Total number of hits for the query are 62
Update

--update works the same as for many other repositories. Make sure to provide rxivist as the --api.

INPUT:

pygetpapers --api rxivist -q "biomedical sciences" -k 20 -c -x -o "biomedicine_rxivist" --update

OUTPUT:

INFO: Final query is biomedical sciences
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 203.69it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1059.17it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1077.12it/s]

Run pygetpapers within the module

def run_command(output=False, query=False, save_query=False, xml=False, pdf=False, supp=False, zip=False, references=False, noexecute=False, citations=False, limit=100, restart=False, update=False, onlyquery=False, makecsv=False, makehtml=False, synonym=False, startdate=False, enddate=False, terms=False, notterms=False, api='europe_pmc', filter=None, loglevel='info', logfile=False, version=False)

Here’s an example script to download 50 papers from EPMC on ‘lantana camara’.

from pygetpapers import Pygetpapers
pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query='lantana camara', limit=50, output='lantana_camara', xml=True)
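
As a further illustration, here is a minimal sketch (the query, limit and output directory name are arbitrary) that uses the same run_command signature to fetch Crossref metadata only and also write CSV:

from pygetpapers import Pygetpapers

pygetpapers_call = Pygetpapers()
# metadata-only run against Crossref, also writing a CSV summary
pygetpapers_call.run_command(query='essential oil', limit=10,
                             output='essential_oil_crossref',
                             api='crossref', makecsv=True)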

Test pygetpapers

To run automated testing on pygetpapers, do the following:

  1. Install pygetpapers

  2. Clone the pygetpapers repository

  3. Install pytest

  4. Run the command pytest from the repository root (see the commands sketched below)
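
In shell terms, steps 2–4 amount to something like the following (assuming pygetpapers itself is already installed):

git clone https://github.com/petermr/pygetpapers.git
cd pygetpapers
pip install pytest
pytest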

Contributions

https://github.com/petermr/pygetpapers/blob/main/resources/CONTRIBUTING.md

Feature Requests

To request features, please put them in issues

pygetpapers module

class pygetpapers.pygetpapers.ApiPlugger(query_namespace)

Bases: object

add_terms_from_file()

Builds query from terms mentioned in a text file described in the argparse namespace object. See (https://pygetpapers.readthedocs.io/en/latest/index.html?highlight=terms#querying-using-a-term-list) Edits the namespace object’s query flag. :param query_namespace: namespace object from argparse (using --terms and --notterms)

check_query_logic_and_run()

Checks the logic in query_namespace and runs pygetpapers for the given query

setup_api_support_variables(config, api)

Reads in the configuration file namespace object and sets up class variable for the given api :param config: Configparser configured configuration file :type config: configparser object :param api: the repository to get the variables for :type api: string

class pygetpapers.pygetpapers.Pygetpapers

Bases: object

Entry point class that builds the CLI and runs pygetpapers for the given arguments.

create_argparser()

Creates the cli

generate_logger(query_namespace)

Creates logger for the given loglevel :param query_namespace: pygetpaper’s name space object :type query_namespace: dict

static makes_output_directory(query_namespace)

Makes the output directory for the given output in query_namespace :param query_namespace: pygetpaper’s name space object :type query_namespace: dict

run_command(output=None, query=None, save_query=False, xml=False, pdf=False, supp=False, zip=False, references=False, noexecute=False, citations=False, limit=100, restart=False, update=False, onlyquery=False, makecsv=False, makehtml=False, synonym=False, startdate=False, enddate=False, terms=False, notterms=False, api='europe_pmc', filter=None, loglevel='info', logfile=False, version=False)

Runs pygetpapers for the given parameters

runs_pygetpapers_for_given_args(query_namespace)

Runs pygetpapers for flags described in a dictionary :param query_namespace: pygetpaper’s namespace object :type query_namespace: dict

static write_configuration_file(query_namespace)

Writes the argparse namespace to SAVED_CONFIG_INI :param query_namespace: argparse namespace object

write_logfile(query_namespace, level)

This function stores logs to a logfile :param query_namespace: argparse namespace object :param level: level of logger (See https://docs.python.org/3/library/logging.html#logging-levels)

pygetpapers.pygetpapers.main()

Runs the CLI

download_tools module

class pygetpapers.download_tools.DownloadTools(api=None)

Bases: object

Generic tools for retrieving literature. Several are called by each repository
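
A minimal sketch of calling a few of the static helpers documented below (the directory name and DOI here are purely illustrative):

from pygetpapers.download_tools import DownloadTools

# create an output directory if it does not already exist
DownloadTools.check_or_make_directory("downloadtools_demo")

# encode a DOI into a name that can safely be used for a file or directory
safe_name = DownloadTools.url_encode_id("10.1016/j.bcab.2021.101913")

# dump a small metadata dictionary to disk as JSON
DownloadTools.dumps_json_to_given_path("downloadtools_demo/demo.json", {"doi": safe_name})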

check_if_content_is_zip(request_handler)

Checks if content in request object is a zip

Parameters

request_handler (request object) – request object for the given zip

Returns

whether the content is a zip file

Return type

bool

static check_or_make_directory(directory_url)

Makes directory if doesn’t already exist

Parameters

directory_url (string) – path to directory

static dumps_json_to_given_path(path, json_dict, filemode='w')

dumps json dict to given path

Parameters
  • path (string) – path to dump dict

  • json_dict (dictionary) – json dictionary

  • filemode (string, optional) – file mode, defaults to “w”

extract_zip_files(byte_content_to_extract_from, destination_url)

Extracts zip file to destination_url

Parameters
  • byte_content_to_extract_from (bytes) – byte content to extract from

  • destination_url (string) – path to save the extracted zip files to

get_metadata_results_file()

Gets the url of metadata file (eg. eupmc-results.json) from the current working directory

Returns

path of the master metadata file

Return type

string

get_parent_directory(path)

Returns path of the parent directory for given path

Parameters

path (string) – path of the file

Returns

path of the parent directory

Return type

string

get_request_endpoint_for_citations(identifier, source)

Gets endpoint to get citations from the configuration file

Parameters
  • identifier (string) – unique identifier present in the url for the particular paper

  • source (string) – which repository to get the citations from

Returns

request_handler.content

Return type

bytes

get_request_endpoint_for_references(identifier, source)

Gets endpoint to get references from the configuration file

Parameters
  • identifier (string) – unique identifier present in the url for the particular paper

  • source (string) – which repository to get the citations from

Returns

request_handler.content

Return type

bytes

get_request_endpoint_for_xml(identifier)

Gets endpoint to full text xml from the configuration file

Parameters

identifier (string) – unique identifier present in the url for the particular paper

Returns

request_handler.content

Return type

bytes

static get_version()

Gets version from the configuration file

Returns

version of pygetpapers as described in the configuration file

Return type

string

gets_result_dict_for_query(headers, data)

Queries query_url provided in configuration file for the given headers and payload and returns result in the form of a python dictionary

Parameters
  • headers (dict) – headers given to the request

  • data (dict) – payload given to the request

Returns

result in the form of a python dictionary

Return type

dictionary

getsupplementaryfiles(identifier, path_to_save, from_ftp_end_point=False)

Retrieves supplementary files for the given paper (according to identifier) and saves to path_to_save

Parameters
  • identifier (string) – unique identifier present in the url for the particular paper

  • path_to_save (string) – path to save the supplementary files to

  • from_ftp_end_point (bool, optional) – to get the results from eupmc ftp endpoint

handle_creation_of_csv_html_xml(makecsv, makehtml, makexml, metadata_dictionary, name)

Writes csv, html, xml for given conditions

Parameters
  • makecsv (bool) – whether to get csv

  • makehtml (bool) – whether to get html

  • makexml (bool) – whether to get xml

  • metadata_dictionary (dict) – dictionary to write the content for

  • name (string) – name of the file to save

make_citations(source, citationurl, identifier)

Retrieves the URL for the citations of the given paperid, gets the xml, and writes it to citationurl

Parameters
  • source (which repository to get the citations from) – which repository to get the citations from

  • citationurl (string) – path to save the citations to

  • identifier (string) – unique identifier present in the url for the particular paper

make_csv_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes csv content for the given dictionary to disk

Parameters
  • metadata_dictionary (dict) – dictionary to write the content for

  • name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)

  • name_result_file_for_paper (string) – name of the result file for a paper

make_html_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes html content for the given dictionary to disk

Parameters
  • metadata_dictionary (dict) – dictionary to write the content for

  • name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)

  • name_result_file_for_paper (string) – name of the result file for a paper

make_html_from_dataframe(dataframe, path_to_save)

Makes an html page from the given pandas dataframe

Parameters
  • dataframe (pandas dataframe) – pandas dataframe to convert to html

  • path_to_save (string) – path to save the dataframe to

make_references(paperid, identifier, path_to_save)

Writes references for the given paperid from source to reference url

Parameters
  • identifier (string) – identifier for the paper

  • source (string) – source to get references from

  • path_to_save (string) – path to store the references

make_xml_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes xml content for the given dictionary to disk

Parameters
  • metadata_dictionary (dict) – dictionary to write the content for

  • name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)

  • name_result_file_for_paper (string) – name of the result file for a paper

parse_request_handler(request_handler)
post_query(url, data=None, headers=None)

Queries url

Parameters
  • headers (dict) – headers given to the request

  • data (dict, optional) – payload given to the request

Returns

result in the form of a python dictionary

Return type

dictionary

static queries_the_url_and_writes_response_to_destination(url, destination)

queries the url and writes response to destination

Parameters
  • url (string) – url to query

  • destination (string) – destination to save response to

static readjsondata(path)

reads json from path and returns python dictionary

static removing_added_attributes_from_dictionary(resultant_dict)

pygetpapers adds some attributes like “pdfdownloaded” to track the progress of downloads for a particular corpus. When we are exporting data to a csv file, we don’t want these terms to appear. So this function makes a copy of the given dictionary, removes the added attributes from dictionaries inside the given dict and returns the new dictionary.

Parameters

resultant_dict (dictionary) – given parent dictionary

Returns

dictionary with additional attributes removed from the child dictionaries

Return type

dictionary

set_up_config_variables(config, api)

Sets class variable reading the configuration file for the provided api

Parameters
  • config (configparser object) – configparser object for the configuration file

  • api (string) – Name of api as described in the configuration file

setup_config_file(config_ini)

Reads config_ini file and returns configparser object

Parameters

config_ini (string) – path of configuration file

Returns

configparser object for the configuration file

Return type

configparser object

static url_encode_id(doi_of_paper)

Encodes the doi of a paper into a file-savable name

Parameters

doi_of_paper (string) – doi

Returns

url encoded doi

Return type

string

static write_or_append_to_csv(df_transposed, csv_path='europe_pmc.csv')

write pandas dataframe to given csv file

Parameters
  • df_transposed (pandas dataframe) – dataframe to save

  • csv_path (str, optional) – path to csv file, defaults to “europe_pmc.csv”

writexml(destination_url, xml_content)

writes xml content to given destination_url

Parameters
  • destination_url (string) – path to dump xml content

  • xml_content (byte string) – xml content

europe_pmc module

class pygetpapers.repository.europe_pmc.EuropePmc

Bases: RepositoryInterface

Downloads metadata and optionally fulltext from https://europepmc.org

apipaperdownload(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for given search parameters.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

build_and_send_query(maximum_hits_per_page, cursor_mark, query, synonym)

Retrieves metadata from EPMC for given query

Parameters
  • maximum_hits_per_page (int) – number of papers to get

  • cursor_mark (string) – cursor mark

  • query (string) – query

  • synonym (bool) – whether to get synonyms, defaults to True

Returns

metadata dictionary

Return type

dict

static buildquery(cursormark, page_size, query, synonym=True)

Builds query parameters

static create_parameters_for_paper_download()

Creates parameters for paper download

Returns

parameters for paper download tuple

Return type

tuple

get_supplementary_metadata(metadata_dictionary_with_all_papers, getpdf=False, makecsv=False, makehtml=False, makexml=False, references=False, citations=False, supplementary_files=False, zip_files=False)

Gets supplementary metadata

Parameters
  • metadata_dictionary_with_all_papers (dict) – metadata dictionary

  • getpdf (bool, optional) – whether to get pdfs

  • makecsv (bool, optional) – whether to create csv output

  • makehtml (bool, optional) – whether to create html output

  • makexml (bool, optional) – whether to download xml fulltext

  • references (bool, optional) – whether to download references

  • citations (bool, optional) – whether to download citations

  • supplementary_files (bool, optional) – whether to download supplementary_files

  • zip_files (bool, optional) – whether to download zip_files from the ftp endpoint

get_urls_to_write_to(identifier_for_paper)

Gets urls to write the metadata to

Parameters

identifier_for_paper (str) – identifier for paper

Returns

urls to write the metadata to

Return type

tuple

make_html_from_dict(dict_to_write_html_from, url, identifier_for_paper)

Makes html from dict

Parameters
  • dict_to_write_html_from (dict) – dict to write html from

  • url (str) – url to write html to

noexecute(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for the given search parameters, but only prints the output and does not write to disk.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

query(query, cutoff_size, synonym=True, cursor_mark='*')

Queries EPMC with the given query and retrieves up to cutoff_size papers

Parameters
  • query (string) – query

  • cutoff_size (int) – number of papers to get

  • synonym (bool, optional) – whether to get synonyms, defaults to True

Returns

list containing the papers

Return type

list
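
A minimal sketch of calling query() directly (assuming EuropePmc can be instantiated with no arguments, as documented above; the query and cutoff size are arbitrary):

from pygetpapers.repository.europe_pmc import EuropePmc

epmc = EuropePmc()
# retrieve metadata for up to 25 papers matching the query
papers = epmc.query("lantana", 25, synonym=True)
print(len(papers))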

restart(query_namespace)

Restarts query to add new metadata for existing papers

Parameters

query_namespace (dict) – pygetpaper’s name space object

run_eupmc_query_and_get_metadata(query, cutoff_size, update=None, onlymakejson=False, getpdf=False, makehtml=False, makecsv=False, makexml=False, references=False, citations=False, supplementary_files=False, synonym=True, zip_files=False)
update(query_namespace)

If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

arxiv module

class pygetpapers.repository.arxiv.Arxiv

Bases: RepositoryInterface

arxiv.org repository

This uses the PyPI package arxiv to download metadata. It is not clear whether this is created by the arXiv project or layered on top of the public API.

arXiv’s current practice for bulk data download (e.g. PDFs) is described at https://arxiv.org/help/bulk_data. Please be considerate and include a rate limit.

apipaperdownload(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for given search parameters.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

arxiv(query, cutoff_size, getpdf=False, makecsv=False, makexml=False, makehtml=False)

Builds the arxiv searcher and writes the xml, pdf, csv and html

Parameters
  • query (string) – query given to arxiv

  • cutoff_size (int) – number of papers to retrieve

  • getpdf (bool, optional) – whether to get pdf

  • makecsv (bool) – whether to get csv

  • makehtml (bool) – whether to get html

  • makexml (bool) – whether to get xml

Returns

dictionary of results retrieved from arxiv

Return type

dict

download_pdf(metadata_dictionary)

Downloads pdfs for papers in metadata dictionary

Parameters

metadata_dictionary (dict) – metadata dictionary for papers

static noexecute(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for the given search parameters, but only prints the output and does not write to disk.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

static update(query_namespace)

If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

write_metadata_json_from_arxiv_dict(metadata_dictionary)

Iterates through metadata_dictionary and makes json metadata file for papers

Parameters

metadata_dictionary (dict) – metadata dictionary for papers

rxivist module

class pygetpapers.repository.rxivist.Rxivist

Bases: RepositoryInterface

Rxivist wrapper for biorxiv and medrxiv

From the site (rxivist.org): “Rxivist combines biology preprints from bioRxiv and medRxiv with data from Twitter to help you find the papers being discussed in your field.”

Appears to be metadata-only. To get full-text you may have to submit the IDs to biorxiv or medrxiv or EPMC as this aggregates preprints.

apipaperdownload(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for given search parameters.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

download_and_save_results(query, size, update=False, makecsv=False, makexml=False, makehtml=False)
make_request_add_papers(query, cursor_mark, total_number_of_results, total_papers_list)
noexecute(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for the given search parameters, but only prints the output and does not write to disk.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

rxivist(query, size, update=None, makecsv=False, makexml=False, makehtml=False)
send_post_request(query, cursor_mark=0, page_size=20)
update(query_namespace)

If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

crossref module

class pygetpapers.repository.crossref.CrossRef

Bases: RepositoryInterface

CrossRef class which handles the crossref repository. It uses the habanero wrapper to make its queries.

apipaperdownload(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for given search parameters.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

crossref(query, cutoff_size, filter_dict=None, update=None, makecsv=False, makexml=False, makehtml=False)

Builds the crossref searcher and writes the xml, csv and html

Parameters
  • query (string) – query given to crossref

  • cutoff_size (int) – number of papers to retrieve

  • filter_dict (bool, optional) – filters for crossref search

  • makecsv (bool) – whether to get csv

  • makehtml (bool) – whether to get html

  • makexml (bool) – whether to get xml

  • update (dict) – dictionary containing results from previous run of pygetpapers

Returns

dictionary of results retrieved from crossref

Return type

dict

initiate_crossref()

Initiates the habanero wrapper for crossref

Returns

crossref object

noexecute(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for the given search parameters, but only prints the output and does not write to disk.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

update(query_namespace)

If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

rxiv module

class pygetpapers.repository.rxiv.Rxiv(api='biorxiv')

Bases: RepositoryInterface

Biorxiv and Medrxiv repositories

At present (2022-03) the API appears only to support date searches. The rxivist system is layered on top and supports fuller queries.

apipaperdownload(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for given search parameters.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

download_and_save_results(query, cutoff_size, source, update=False, makecsv=False, makexml=False, makehtml=False)
make_request_add_papers(interval, cursor_mark, source, total_number_of_results, total_papers_list)
make_request_url_for_rxiv(cursor_mark, interval, source)
make_xml_for_rxiv(dict_of_papers, xml_identifier, paper_id_identifier, filename)
noexecute(query_namespace)

Takes in the query_namespace object as the parameter and runs the query search for the given search parameters, but only prints the output and does not write to disk.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

rxiv(query, cutoff_size, source='biorxiv', update=None, makecsv=False, makehtml=False)
rxiv_update(interval, cutoff_size, source='biorxiv', update=None, makecsv=False, makexml=False, makehtml=False)
update(query_namespace)

If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.

Parameters

query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse

QueryBuilder

pygetpapers builds and runs all queries through a query builder module (pygetpapers.py). There are several reasons for this:

  • each repository may use its own query language and syntax

  • there is frequent need to use punctuation (e.g. (..), “..”, ‘..’) and these may be nested. Punctuation can also interact with command-line syntax

  • complex queries (e.g. repeated OR, AND, NOT ) are tedious and error-prone

  • many values (especially dates) need converting or standardising

  • some options require or forbid other options (e.g. --xml requires an --output value)

  • successful queries can be saved, edited, and rerun

  • queries may be rerun at a later date, or request a larger number of downloads.

Users may wish to build queries:

  • completely from the commandline (argparse Namespace).

  • from a saved query (configparser configuration file)

  • programmatically through an instance of Pygetpapers

  • mixtures of the above

QueryBuilder contains or creates flags indicating which of the following is to be processed

  • query strings to be submitted to the particular repository

  • flags controlling the execution (download rate, limits, formats)

  • creation of the local repository (CProject)

  • creation of the per-article subdirectories (CTree)

  • postprocessing options (e.g. docanalysis and py4ami, and standard Unix/Python libraries)

How to add a new repository

pygetpapers makes it really easy to add support for new repositories.

To add a new repository, clone the repo and cd into the pygetpapers directory. Then create a new module with the class for the repo. Make sure you edit the config.ini file with the specifications of the new repo.

Following is an example config

[europe_pmc]
posturl=https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
citationurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/citations?page=1&pageSize=1000&format=xml
referencesurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/references?page=1&pageSize=1000&format=xml
xmlurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML
suppurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles
zipurl= http://europepmc.org/ftp/suppl/OA/{key}/{pmcid}.zip
date_query=SUPPORTED
term=SUPPORTED
update=SUPPORTED
restart=SUPPORTED
class_name=EuropePmc
library_name= europe_pmc
features_not_supported = ["filter",]

After this, in the repo class, ensure that you can request scientific papers, download them, and do post-processing on them. There are multiple functions in the DownloadTools class which can help with this. We suggest looking at previously configured repos for reference.

Each repository class must implement three functions in particular.

  1. apipaperdownload

  2. noexecute

  3. update

Following is an example implementation.

    def update(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        logging.info("Reading old json metadata file")
        update_path = self.get_metadata_results_file()
        os.chdir(os.path.dirname(update_path))
        update = self.download_tools.readjsondata(update_path)
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=update,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI,
            name_of_file=CROSSREF_RESULTS
        )

    def noexecute(self, args):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        filter_dict = args.filter
        result_dict = self.crossref(
            query, size=10, filter_dict=filter_dict
        )
        totalhits = result_dict[NEW_RESULTS][TOTAL_HITS]
        logging.info("Total number of hits for the query are %s", totalhits)

    def apipaperdownload(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=None,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI, name_of_file=CROSSREF_RESULTS
        )

The ApiPlugger class looks for these functions, along with the config file, to serve the API on the CLI.

pygetpapers makes it really easy to add support for new repositories.

To add a new repository, clone the repo and cd into the directory pygetpapers. Thereafter, create a new module with the class for the repo. Make sure you edit the config.ini file with the specifications of the new repo.

Following is an example config

[europe_pmc]
posturl=https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
citationurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/citations?page=1&pageSize=1000&format=xml
referencesurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/references?page=1&pageSize=1000&format=xml
xmlurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML
suppurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles
zipurl= http://europepmc.org/ftp/suppl/OA/{key}/{pmcid}.zip
date_query=SUPPORTED
term=SUPPORTED
update=SUPPORTED
restart=SUPPORTED
class_name=EuropePmc
library_name= europe_pmc
features_not_supported = ["filter",]

After this, in the repo class, ensure that you can request scientific papers, download them and do post-processing on them. There are multiple functions in the class download_tools which can help you with the same. I suggest looking at previously configured repos for the same.

It is necessary to have three functions in particular.

  1. apipaperdownload

  2. noexecute

  3. update

Following is an example implementation.

    def update(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        logging.info("Reading old json metadata file")
        update_path = self.get_metadata_results_file()
        os.chdir(os.path.dirname(update_path))
        update = self.download_tools.readjsondata(update_path)
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=update,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI,
            name_of_file=CROSSREF_RESULTS
        )

    def noexecute(self, args):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        filter_dict = args.filter
        result_dict = self.crossref(
            query, size=10, filter_dict=filter_dict
        )
        totalhits = result_dict[NEW_RESULTS][TOTAL_HITS]
        logging.info("Total number of hits for the query are %s", totalhits)

    def apipaperdownload(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=None,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI, name_of_file=CROSSREF_RESULTS
        )

The class ApiPlugger looks for these functions along with the config file to serve the API on the cli.

pygetpapers makes it really easy to add support for new repositories.

To add a new repository, clone the repo and cd into the directory pygetpapers. Thereafter, create a new module with the class for the repo. Make sure you edit the config.ini file with the specifications of the new repo.

Following is an example config

[europe_pmc]
posturl=https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
citationurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/citations?page=1&pageSize=1000&format=xml
referencesurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/references?page=1&pageSize=1000&format=xml
xmlurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML
suppurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles
zipurl= http://europepmc.org/ftp/suppl/OA/{key}/{pmcid}.zip
date_query=SUPPORTED
term=SUPPORTED
update=SUPPORTED
restart=SUPPORTED
class_name=EuropePmc
library_name= europe_pmc
features_not_supported = ["filter",]

After this, in the repo class, ensure that you can request scientific papers, download them and do post-processing on them. There are multiple functions in the class download_tools which can help you with the same. I suggest looking at previously configured repos for the same.

It is necessary to have three functions in particular.

  1. apipaperdownload

  2. noexecute

  3. update

Following is an example implementation.

    def update(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        logging.info("Reading old json metadata file")
        update_path = self.get_metadata_results_file()
        os.chdir(os.path.dirname(update_path))
        update = self.download_tools.readjsondata(update_path)
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=update,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI,
            name_of_file=CROSSREF_RESULTS
        )

    def noexecute(self, args):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        filter_dict = args.filter
        result_dict = self.crossref(
            query, size=10, filter_dict=filter_dict
        )
        totalhits = result_dict[NEW_RESULTS][TOTAL_HITS]
        logging.info("Total number of hits for the query are %s", totalhits)

    def apipaperdownload(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=None,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI, name_of_file=CROSSREF_RESULTS
        )

The class ApiPlugger looks for these functions along with the config file to serve the API on the cli.

pygetpapers makes it really easy to add support for new repositories.

To add a new repository, clone the repo and cd into the directory pygetpapers. Thereafter, create a new module with the class for the repo. Make sure you edit the config.ini file with the specifications of the new repo.

Following is an example config

[europe_pmc]
posturl=https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
citationurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/citations?page=1&pageSize=1000&format=xml
referencesurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/references?page=1&pageSize=1000&format=xml
xmlurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML
suppurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles
zipurl= http://europepmc.org/ftp/suppl/OA/{key}/{pmcid}.zip
date_query=SUPPORTED
term=SUPPORTED
update=SUPPORTED
restart=SUPPORTED
class_name=EuropePmc
library_name= europe_pmc
features_not_supported = ["filter",]

After this, in the repo class, ensure that you can request scientific papers, download them and do post-processing on them. There are multiple functions in the class download_tools which can help you with the same. I suggest looking at previously configured repos for the same.

It is necessary to have three functions in particular.

  1. apipaperdownload

  2. noexecute

  3. update

Following is an example implementation.

    def update(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        logging.info("Reading old json metadata file")
        update_path = self.get_metadata_results_file()
        os.chdir(os.path.dirname(update_path))
        update = self.download_tools.readjsondata(update_path)
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=update,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI,
            name_of_file=CROSSREF_RESULTS
        )

    def noexecute(self, args):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        filter_dict = args.filter
        result_dict = self.crossref(
            query, size=10, filter_dict=filter_dict
        )
        totalhits = result_dict[NEW_RESULTS][TOTAL_HITS]
        logging.info("Total number of hits for the query are %s", totalhits)

    def apipaperdownload(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=None,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI, name_of_file=CROSSREF_RESULTS
        )

The class ApiPlugger looks for these functions along with the config file to serve the API on the cli.

pygetpapers makes it really easy to add support for new repositories.

To add a new repository, clone the repo and cd into the directory pygetpapers. Thereafter, create a new module with the class for the repo. Make sure you edit the config.ini file with the specifications of the new repo.

Following is an example config

[europe_pmc]
posturl=https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
citationurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/citations?page=1&pageSize=1000&format=xml
referencesurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/references?page=1&pageSize=1000&format=xml
xmlurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML
suppurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles
zipurl= http://europepmc.org/ftp/suppl/OA/{key}/{pmcid}.zip
date_query=SUPPORTED
term=SUPPORTED
update=SUPPORTED
restart=SUPPORTED
class_name=EuropePmc
library_name= europe_pmc
features_not_supported = ["filter",]

After this, in the repo class, ensure that you can request scientific papers, download them and do post-processing on them. There are multiple functions in the class download_tools which can help you with the same. I suggest looking at previously configured repos for the same.

It is necessary to have three functions in particular.

  1. apipaperdownload

  2. noexecute

  3. update

Following is an example implementation.

    def update(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        logging.info("Reading old json metadata file")
        update_path = self.get_metadata_results_file()
        os.chdir(os.path.dirname(update_path))
        update = self.download_tools.readjsondata(update_path)
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=update,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI,
            name_of_file=CROSSREF_RESULTS
        )

    def noexecute(self, args):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        filter_dict = args.filter
        result_dict = self.crossref(
            query, size=10, filter_dict=filter_dict
        )
        totalhits = result_dict[NEW_RESULTS][TOTAL_HITS]
        logging.info("Total number of hits for the query are %s", totalhits)

    def apipaperdownload(
        self,
        args
    ):
        """[summary]

        :param args: [description]
        :type args: [type]
        """
        query = args.query
        size = args.limit
        filter_dict = args.filter
        makecsv = args.makecsv
        makexml = args.xml
        makehtml = args.makehtml
        result_dict = self.crossref(
            query,
            size,
            filter_dict=filter_dict,
            update=None,
            makecsv=makecsv,
            makexml=makexml,
            makehtml=makehtml,
        )
        self.download_tools.make_json_files_for_paper(
            result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI, name_of_file=CROSSREF_RESULTS
        )

The class ApiPlugger looks for these functions along with the config file to serve the API on the cli.


Repository Interface

class pygetpapers.repositoryinterface.RepositoryInterface

Bases: ABC

abstract apipaperdownload(query_namespace)

Takes in the query_namespace object as the parameter, runs the search for the given parameters, and downloads the matching papers and their metadata.

Parameters

query_namespace (dict) – pygetpapers’ namespace object containing the queries from argparse

abstract noexecute(query_namespace)

Takes in the query_namespace object as the parameter and runs the search for the given parameters, but only prints the output and does not write anything to disk.

Parameters

query_namespace (dict) – pygetpapers’ namespace object containing the queries from argparse

abstract update(query_namespace)

If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.

Parameters

query_namespace (dict) – pygetpapers’ namespace object containing the queries from argparse
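
For clarity, query_namespace is the namespace object that pygetpapers builds from its command-line arguments. The hand-built namespace below is purely illustrative; the attribute names are the ones read by the example Crossref implementation above (args.query, args.limit, args.filter, args.xml, args.makecsv, args.makehtml).

    # Illustration only: pygetpapers normally constructs this object itself
    # from the command-line arguments, so it never needs to be built by hand.
    from argparse import Namespace

    query_namespace = Namespace(
        query="essential oil",  # free-text query string
        limit=10,               # maximum number of papers to fetch
        filter=None,            # repository-specific filter, if supported
        xml=True,               # save XML output for the papers
        makecsv=False,          # save metadata as CSV
        makehtml=False,         # save metadata as HTML
    )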