User Documentation
Research papers right from Python
What is pygetpapers
pygetpapers is a tool to assist text miners. It makes requests to open access scientific text repositories, analyses the hits, and systematically downloads the articles without further interaction.
It comes with the packages pygetpapers and downloadtools, which provide various functions to download, process and save research papers and their metadata. Its main medium of interaction with users is a command-line interface.
pygetpapers has a modular design, which keeps maintenance simple and makes adding support for more repositories easy.
The developer documentation has been set up at readthedocs.
History
getpapers
is a tool written by Rik Smith-Unna, funded by ContentMine, at https://github.com/ContentMine/getpapers. The OpenVirus community required a Python version, and Ayush Garg wrote an implementation from scratch, with some enhancements.
Formats supported by pygetpapers
pygetpapers provides fulltexts in XML and PDF formats.
The metadata for papers can be saved in many formats, including JSON, CSV and HTML.
Queries can be saved in the form of an INI configuration file.
The additional files for papers can also be downloaded. References and citations for papers are given in XML format.
Log files can be saved in txt format.
Repository Structure
The main code is located in the pygetpapers directory. All the supporting modules for the different repositories are located in the pygetpapers/repository directory.
Architecture
Installation
Ensure that pip is installed along with Python. Download Python from https://www.python.org/downloads/ and select the option "Add Python to Path" while installing.
Check out https://pip.pypa.io/en/stable/installing/ if you have difficulties installing pip. Also, check out https://packaging.python.org/en/latest/tutorials/installing-packages/ to learn more about installing packages in Python.
Method one (recommended):
Enter the command:
pip install pygetpapers
Ensure pygetpapers has been installed by reopening the terminal and typing the command
pygetpapers
You should see a help message come up.
Method two (Install Directly From Head):
Ensure the git CLI is installed and available on your PATH. Check out https://git-scm.com/book/en/v2/Getting-Started-Installing-Git if you need help installing it.
Enter the command:
pip install git+https://github.com/petermr/pygetpapers.git
Ensure pygetpapers has been installed by reopening the terminal and typing the command
pygetpapers
You should see a help message come up.
Usage
pygetpapers
is a command-line tool. You can ask for help by running:
pygetpapers --help
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT]
[--save_query] [-x] [-p] [-s] [-z] [--references REFERENCES]
[-n] [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE]
[-k LIMIT] [-r] [-u] [--onlyquery] [-c] [--makehtml]
[--synonym] [--startdate STARTDATE] [--enddate ENDDATE]
[--terms TERMS] [--notterms NOTTERMS] [--api API]
[--filter FILTER]
Welcome to Pygetpapers version 0.0.9.3. -h or --help for help
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file path to read query for pygetpapers
-v, --version output the version number
-q QUERY, --query QUERY
query string transmitted to repository API. Eg.
"Artificial Intelligence" or "Plant Parts". To escape
special characters within the quotes, use backslash.
Incase of nested quotes, ensure that the initial quotes
are double and the qutoes inside are single. For eg:
`'(LICENSE:"cc by" OR LICENSE:"cc-by") AND
METHODS:"transcriptome assembly"' ` is wrong. We should
instead use `"(LICENSE:'cc by' OR LICENSE:'cc-by') AND
METHODS:'transcriptome assembly'"`
-o OUTPUT, --output OUTPUT
output directory (Default: Folder inside current working directory named current date and time)
--save_query saved the passed query in a config file
-x, --xml download fulltext XMLs if available or save metadata as
XML
-p, --pdf [E][A] download fulltext PDFs if available (only eupmc
and arxiv supported)
-s, --supp [E] download supplementary files if available (only eupmc
supported)
-z, --zip [E] download files from ftp endpoint if available (only
eupmc supported)
--references REFERENCES
[E] Download references if available. (only eupmc
supported)Requires source for references
(AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
-n, --noexecute [ALL] report how many results match the query, but don't
actually download anything
--citations CITATIONS
[E] Download citations if available (only eupmc
supported). Requires source for citations
(AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
-l LOGLEVEL, --loglevel LOGLEVEL
[All] Provide logging level. Example --log warning
<<info,warning,debug,error,critical>>, default='info'
-f LOGFILE, --logfile LOGFILE
[All] save log to specified file in output directory as
well as printing to terminal
-k LIMIT, --limit LIMIT
[All] maximum number of hits (default: 100)
-r, --restart [E] Downloads the missing flags for the corpus.Searches
for already existing corpus in the output directory
-u, --update [E][B][M][C] Updates the corpus by downloading new
papers. Requires -k or --limit (If not provided, default
will be used) and -q or --query (must be provided) to be
given. Searches for already existing corpus in the output
directory
--onlyquery [E] Saves json file containing the result of the query in
storage. (only eupmc supported)The json file can be given
to --restart to download the papers later.
-c, --makecsv [All] Stores the per-document metadata as csv.
--makehtml [All] Stores the per-document metadata as html.
--synonym [E] Results contain synonyms as well.
--startdate STARTDATE
[E][B][M] Gives papers starting from given date. Format:
YYYY-MM-DD
--enddate ENDDATE [E][B][M] Gives papers till given date. Format: YYYY-MM-
DD
--terms TERMS [All] Location of the file which contains terms
serperated by a comma or an ami dict which will beOR'ed
among themselves and AND'ed with the query
--notterms NOTTERMS [All] Location of the txt file which contains terms
serperated by a comma or an ami dict which will beOR'ed
among themselves and NOT'ed with the query
--api API API to search [eupmc,
crossref,arxiv,biorxiv,medrxiv,rxivist] (default: eupmc)
--filter FILTER [C] filter by key value pair (only crossref supported)
Queries are built using the -q
flag. The query format can be found at http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf A condensed guide can be found at https://github.com/petermr/pygetpapers/wiki/query-format
Repository-specific flags
To indicate which repositories each flag applies to, the first letter of the repository is given in square brackets in the flag's description.
What is CProject?
A CProject is a directory structure that the AMI toolset uses to gather and process data. Each paper gets its own folder.
A CTree is a subdirectory of a CProject that deals with a single paper.
Tutorial
pygetpapers was on version 0.0.9.3 when these tutorials were documented.
pygetpapers supports multiple APIs including eupmc, crossref, arxiv, biorxiv, medrxiv, rxivist-bio and rxivist-med. By default, it queries EPMC. You can specify the API using the --api flag.
You can also follow this colab notebook as part of the tutorial.
Features | EPMC | crossref | arxiv | biorxiv | medrxiv | rxivist
---|---|---|---|---|---|---
Fulltext formats | xml, pdf | NA | xml | xml | xml |
Metadata formats | json, html, csv | json, xml, csv | json, csv, html, xml | json, csv, html | json, csv, html | json, html, csv
 | yes | yes | yes | NA | NA | NA
 | yes | NA | NA | yes | yes |
 | yes | NA | NA | NA | NA | NA
 | yes | NA | NA | NA | NA | NA
 | yes | NA | NA | NA | NA | NA
 | yes | yes | yes | NA | NA | NA
EPMC (Default API)
Example Query
Let’s break down the following query:
pygetpapers -q "METHOD: invasive plant species" -k 10 -o "invasive_plant_species_test" -c --makehtml -x --save_query
Flag | What it does | In this case
---|---|---
-q | specifies the query | queries for ‘invasive plant species’ in METHODS section
-k | number of hits (default 100) | limits hits to 10
-o | specifies output directory | outputs to invasive_plant_species_test
-x | downloads fulltext xml |
-c | saves per-paper metadata into a single csv | saves a single CSV file
--makehtml | saves per-paper metadata into a single HTML file | saves a single HTML file
--save_query | saves the given query in a .ini file | saves the query to saved_config.ini
pygetpapers, by default, writes metadata to a JSON file within the individual paper directory for the corresponding paper (eupmc_result.json) and within the working directory for all downloaded papers (eupmc_results.json); a small sketch for reading these files follows the example output below.
OUTPUT:
INFO: Final query is METHOD: invasive plant species
INFO: Total Hits are 17910
0it [00:00, ?it/s]WARNING: Keywords not found for paper 1
WARNING: Keywords not found for paper 4
1it [00:00, 164.87it/s]
INFO: Saving XML files to C:\Users\shweata\invasive_plant_species_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:21<00:00, 2.11s/it]
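To work with the saved metadata programmatically, a minimal Python sketch might look like this (the directory name is from the example above; the JSON structure itself isn't documented here, so the snippet only loads and inspects the files):

import json
from pathlib import Path

# CProject directory created by the earlier pygetpapers run
cproject = Path("invasive_plant_species_test")

# corpus-level metadata for all downloaded papers
with open(cproject / "eupmc_results.json", encoding="utf-8") as f:
    corpus_metadata = json.load(f)
print(type(corpus_metadata))

# per-paper metadata files
for paper_json in cproject.glob("*/eupmc_result.json"):
    with open(paper_json, encoding="utf-8") as f:
        paper_metadata = json.load(f)
    print(paper_json.parent.name, len(paper_metadata))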
Scope the number of hits for a query
If you are just scoping the number of hits for a given query, you can use the -n flag as shown below.
pygetpapers -n -q "essential oil"
OUTPUT:
INFO: Final query is essential oil
INFO: Total number of hits for the query are 190710
Update an existing CProject with new papers by feeding the metadata JSON
The --update command is used to update a CProject with a new set of papers on the same or a different query.
Say you have a corpus of 30 papers on ‘essential oil’ (like before) and would like to download 20 more papers into the same CProject directory; you use the --update command.
To update your CProject, you give the -o flag the name of the already existing CProject directory and also add the --update flag.
INPUT:
pygetpapers -q "invasive plant species" -k 10 -x -o lantana_test_5 --update
OUTPUT:
INFO: Final query is invasive plant species
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Total Hits are 32956
0it [00:00, ?it/s]WARNING: html url not found for paper 5
WARNING: pdf url not found for paper 5
WARNING: Keywords not found for paper 6
WARNING: Keywords not found for paper 7
WARNING: Author list not found for paper 10
1it [00:00, 166.68it/s]
INFO: Saving XML files to C:\Users\shweata\lantana_test_5\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [01:03<00:00, 3.16s/it]
How is --update
different from just downloading x number of papers to the same output directory?
By using the --update command you can be sure that you don't overwrite the existing papers.
Restart downloading papers to an existing CProject
--restart
flag can be used for two purposes:
To download papers in a different format. Let's say you downloaded XMLs in the first round; if you want to download PDFs for the same set of papers, you use this flag.
To continue a download from the stage where it broke. This feature particularly comes in handy if you are on a poor connection. Let's start off by forcefully interrupting the download.
INPUT:
pygetpapers -q "pinus" -k 10 -o pinus_10 -x
OUTPUT:
INFO: Final query is pinus
INFO: Total Hits are 32086
0it [00:00, ?it/s]WARNING: html url not found for paper 10
WARNING: pdf url not found for paper 10
1it [00:00, 63.84it/s]
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
60%|██████████████████████████████████████████████████████████████████████████████▌ | 6/10 [00:20<00:13, 3.42s/it]
Traceback (most recent call last):
...
KeyboardInterrupt
^C
If you take a look at the CProject directory, there are 6 papers downloaded so far.
C:.
│ eupmc_results.json
│
├───PMC8157994
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8180188
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8198815
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8216501
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8309040
│ eupmc_result.json
│ fulltext.xml
│
└───PMC8325914
eupmc_result.json
fulltext.xml
To download the rest, we can use the --restart
flag.
INPUT
pygetpapers -q "pinus" -o pinus_10 --restart -x
OUTPUT:
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
80%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 8/10 [00:27<00:07, 3.51s/it 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 9/10 [00:38<00:05, 5.95s/it100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00, 4.49s/it100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00, 4.01s/it]
Now if we inspect the CProject directory, we see that we have 10 papers as specified.
C:.
│ eupmc_results.json
│
├───PMC8157994
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8180188
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8198815
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8199922
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8216501
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8309040
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8309338
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8325914
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8399312
│ eupmc_result.json
│ fulltext.xml
│
└───PMC8400686
eupmc_result.json
fulltext.xml
Under the hood, pygetpapers
looks for eupmc_results.json
, reads it and resumes the download.
You could also use --restart
to download the fulltext or metadata in a different format from the ones you've already downloaded. For example, if I want all the fulltext PDFs of the 10 papers on pinus
, I can run:
INPUT:
pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
OUTPUT:
>pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
100%|█████████████████████████████████████████████| 10/10 [03:26<00:00, 20.68s/it]
Now, if we take a look at the CProject:
C:.
│ eupmc_results.json
│
├───PMC8157994
│ eupmc_result.html
│ eupmc_result.json
│ fulltext.pdf
│ fulltext.xml
│
├───PMC8180188
│ eupmc_result.html
│ eupmc_result.json
│ fulltext.pdf
│ fulltext.xml
│
├───PMC8198815
│ eupmc_result.html
│ eupmc_result.json
│ fulltext.pdf
│ fulltext.xml
...
We find that each paper now has fulltext PDFs and metadata in HTML.
Difference between --restart
and --update
If you aren't looking to download a new set of papers but want the existing papers in a different format, --restart is the flag you'd want to use.
If you are looking to download a new set of papers into an existing CProject, then you'd use the --update command. Note that the format in which you download papers only applies to the new set of papers and not to the old.
Downloading citations and references for papers, if available
The --references and --citations flags can be used to download the references and citations respectively. Both require a source (AGR, CBA, CTX, ETH, HIR, MED, PAT, PMC, PPR).
pygetpapers -q "lantana" -k 10 -o "test" -c -x --citations PMC
Downloading only the metadata
If you are looking to download just the metadata in the supported formats, --onlyquery is the flag to use. It saves the metadata in the output directory.
You can use the --restart feature to download the fulltexts for these papers later, as sketched after the example output below.
INPUT:
pygetpapers --onlyquery -q "lantana" -k 10 -o "lantana_test" -c
OUTPUT:
INFO: Final query is lantana
INFO: Total Hits are 1909
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Author list not found for paper 5
WARNING: Author list not found for paper 8
WARNING: Keywords not found for paper 9
1it [00:00, 407.69it/s]
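Later, a hedged follow-up (assuming the same query and output directory) could fetch the fulltext XMLs for this metadata-only corpus with the --restart flag:
pygetpapers -q "lantana" -o "lantana_test" --restart -x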
Download papers within a certain start and end date range
By using --startdate
and --enddate
you can specify the date range within which the papers you want to download were first published.
pygetpapers -q "METHOD:essential oil" --startdate "2020-01-02" --enddate "2021-09-09"
Saving query for later use
To save a query for later use, you can use --save_query. It saves the query in a .ini file in the output directory.
pygetpapers -q "lantana" -k 10 -o "lantana_query_config" --save_query
Here is an example config file pygetpapers
outputs
Feed query using a config.ini file
Using the config.ini file you created with --save_query, you can re-run the query. To do so, give the --config flag the absolute path of the saved_config.ini file.
pygetpapers --config "C:\Users\shweata\lantana_query_config\saved_config.ini"
Querying using a term list
--terms
flag
If your query is complex with multiple ORs, you can use the --terms feature. To do so, you will:
Create a .txt file with a list of terms separated by commas, or an ami-dictionary (click here to learn how to create dictionaries); a sample terms file is shown below.
Give the --terms flag the absolute path of the .txt file or ami-dictionary (XML).
-q is optional. The terms are OR'ed with each other and AND'ed with the query, if given.
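For reference, the terms file used in the example below (reconstructed from the final query printed in its output) would contain a single comma-separated line like:
antioxidant, antibacterial, antifungal, antiseptic, antitrichomonal agent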
INPUT:
pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil" -x
OUTPUT:
C:\Users\shweata>pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil"
INFO: Final query is (essential oil AND (antioxidant OR antibacterial OR antifungal OR antiseptic OR antitrichomonal agent))
INFO: Total Hits are 43397
0it [00:00, ?it/s]WARNING: Author list not found for paper 9
1it [00:00, 1064.00it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:19<00:00, 1.99s/it]
You can also use this feature to download papers by their PMC IDs. You can feed the .txt file with PMC IDs, comma-separated. Make sure to give a large enough hit number to download all the papers specified in the text file.
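A minimal illustration of such a file (using a few of the IDs from the example output below) might contain:
PMC6856665, PMC6877543, PMC6927906, PMC7008714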
An example text file can be found here.
INPUT:
pygetpapers --terms C:\Users\shweata\PMCID_pygetpapers_text.txt -k 100 -o "PMCID_test"
OUTPUT:
INFO: Final query is (PMC6856665 OR PMC6877543 OR PMC6927906 OR PMC7008714 OR PMC7040181 OR PMC7080866 OR PMC7082878 OR PMC7096589 OR PMC7111464 OR PMC7142259 OR PMC7158757 OR PMC7174509 OR PMC7193700 OR PMC7198785 OR PMC7201129 OR PMC7203781 OR PMC7206980 OR PMC7214627 OR PMC7214803 OR PMC7220991
)
INFO: Total Hits are 20
WARNING: Could not find more papers
1it [00:00, 505.46it/s]
100%|█████████████████████████████████████████████| 20/20 [00:32<00:00, 1.61s/it]
--notterms
Excluding papers that contain certain keywords might also be of interest to you. For example, if you want papers on essential oil which don't mention antibacterial, antiseptic or antimicrobial, you can either create a dictionary or a text file with these terms (comma-separated) and specify its absolute path to the --notterms flag.
INPUT:
pygetpapers -q "essential oil" -k 10 -o essential_oil_not_terms_test --notterms C:\Users\shweata\not_terms_test.txt
OUTPUT:
INFO: Final query is (essential oil AND NOT (antimicrobial OR antiseptic OR antibacterial))
INFO: Total Hits are 165557
1it [00:00, ?it/s]
100%|█| 10/10 [00:49<00:00, 4.95s/
The number of hits is noticeably reduced. For comparison, the plain "essential oil" query gives us 193922 hits.
C:\Users\shweata>pygetpapers -q "essential oil" -n
INFO: Final query is essential oil
INFO: Total number of hits for the query are 193922
Using --terms
with dictionaries
We will take the same example as before.
Assuming you have ami3 installed, you can create ami-dictionaries.
Start off by listing the terms in a .txt file:
antimicrobial antiseptic antibacterial
Run the following command from the directory in which the text file exists:
amidict -v --dictionary pygetpapers_terms --directory pygetpapers_terms --input pygetpapers_terms.txt create --informat list --outformats xml
That’s it! You’ve now created a simple ami-dictionary
. There are ways of creating dictionaries from Wikidata as well. You can learn more about how to do that in this Wiki page.
You can also use standard dictionaries that are already available. We then pass the absolute path of the dictionary to the --terms flag.
INPUT:
pygetpapers -q "essential oil" --terms C:\Users\shweata\pygetpapers_terms\pygetpapers_terms.xml -k 10 -o pygetpapers_dictionary_test -x
OUTPUT:
INFO: Final query is (essential oil AND (antibacterial OR antimicrobial OR antiseptic))
INFO: Total Hits are 28365
0it [00:00, ?it/s]WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 7
1it [00:00, ?it/s]
INFO: Saving XML files to C:\Users\shweata\pygetpapers_dictionary_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:36<00:00, 3.67s/it]
Log levels
You can specify the log level using the -l flag. The default, as you've seen so far, is info.
INPUT:
pygetpapers -q "lantana" -k 10 -o lantana_test_10_2 --loglevel debug -x
Log file
You can also choose to write the log to a .txt
file in your HOME directory, while simultaneously printing it out.
INPUT:
pygetpapers -q "lantana" -k 10 -o lantana_test_10_4 --loglevel debug -x --logfile test_log.txt
Here is the log file.
Crossref
You can query the Crossref API for metadata only.
Sample query
The metadata format flags are applicable as described in the EPMC tutorial. --terms and -q are also applicable to Crossref.
INPUT:
pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml
OUTPUT:
INFO: Final query is (essential oil AND (antioxidant OR antibacterial OR antifungal OR antiseptic OR antitrichomonal agent))
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Making csv files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 185.52it/s]
INFO: Making html files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 87.98it/s]
INFO: Making xml files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 366.97it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 996.82it/s]
We have 10 folders in the CProject directory.
C:\Users\shweata>cd terms_test_essential_oil_crossref_3
C:\Users\shweata\terms_test_essential_oil_crossref_3>tree
Folder PATH listing for volume Windows-SSD
Volume serial number is D88A-559A
C:.
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2016.1231597
├───10.1080_10412905.1989.9697767
├───10.1111_j.1745-4565.2012.00378.x
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
--update
--update
works the same as in EPMC
. You can use this flag to increase the number of papers in a given CProject.
INPUT
pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 5 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml --update
OUTPUT:
INFO: Final query is (essential oil AND (antioxidant OR antibacterial OR antifungal OR antiseptic OR antitrichomonal agent))
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████| 5/5 [00:00<00:00, 346.84it/s]
The CProject after updating:
C:.
├───10.1002_mbo3.459
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2014.895156
├───10.1080_0972060x.2016.1231597
├───10.1080_0972060x.2017.1345329
├───10.1080_10412905.1989.9697767
├───10.1080_10412905.2021.1941338
├───10.1111_j.1745-4565.2012.00378.x
├───10.15406_oajs.2019.03.00121
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
We started off with 10 paper folders, and increased the number to 15.
Filter
arxiv
pygetpapers
allows you to query arxiv
for full text PDF and metadata in all supported formats.
Sample query
INPUT
pygetpapers --api arxiv -k 10 -o arxiv_test_3 -q "artificial intelligence" -x -p --makehtml -c
OUTPUT
INFO: Final query is artificial intelligence
INFO: Making request to Arxiv through pygetpapers
INFO: Got request result from Arxiv through pygetpapers
INFO: Requesting 10 results at offset 0
INFO: Requesting page of results
INFO: Got first page; 10 of 10 results available
INFO: Downloading Pdfs for papers
100%|█████████████████████████████████████████████| 10/10 [01:02<00:00, 6.27s/it]
INFO: Making csv files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 187.31it/s]
INFO: Making html files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 161.87it/s]
INFO: Making xml files for metadata at C:\Users\shweata\arxiv_test_3
100%|█████████████████████████████████████████████████████| 10/10 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 1111.22it/s]
Note: --update
isn’t supported for arxiv
Biorxiv and Medrxiv
You can query biorxiv
and medrxiv
for fulltext and metadata (in all supported formats). However, passing a query string using the -q flag isn't supported for either repository. You can only provide a date range.
Sample Query - biorxiv
INPUT:
pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831
OUTPUT:
WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:23<00:00, 2.34s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 684.72it/s]
--update
command
INPUT
pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831 --update
OUTPUT
WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00, 2.23s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 492.39it/s]
The CProject now has 20 papers in total after updating.
├───10.1101_008326
├───10.1101_010553
├───10.1101_035972
├───10.1101_046052
├───10.1101_060012
├───10.1101_067736
├───10.1101_086710
├───10.1101_092205
├───10.1101_092619
├───10.1101_093237
├───10.1101_121061
├───10.1101_135749
├───10.1101_145664
├───10.1101_145896
├───10.1101_165845
├───10.1101_180273
├───10.1101_181198
├───10.1101_191858
├───10.1101_194266
└───10.1101_196105
medrxiv works the same way as biorxiv.
rxivist
rxivist lets you specify a query string to both biorxiv and medrxiv. The results you get would be a mixture of papers from both repositories, since rxivist doesn't differentiate. Another caveat is that you can only retrieve metadata from rxivist.
INPUT:
pygetpapers --api rxivist -q "biomedicine" -k 10 -c -x -o "biomedicine_rxivist" --makehtml -p
OUTPUT:
WARNING: Pdf is not supported for this api
INFO: Final query is biomedicine
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 125.54it/s]
INFO: Making html files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 124.71it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 633.38it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 751.09it/s]
Query hits only
Like the other repositories under pygetpapers, you can use the -n flag to get only the hit count.
INPUT:
C:\Users\shweata>pygetpapers --api rxivist -q "biomedical sciences" -n
OUTPUT:
INFO: Final query is biomedical sciences
INFO: Making Request to rxivist
INFO: Total number of hits for the query are 62
Update
--update works the same as for the other repositories. Make sure to provide rxivist as the --api.
INPUT:
pygetpapers --api rxivist -q "biomedical sciences" -k 20 -c -x -o "biomedicine_rxivist" --update
OUTPUT:
INFO: Final query is biomedical sciences
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 203.69it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1059.17it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1077.12it/s]
Run pygetpapers
within the module
def run_command(output=False, query=False, save_query=False, xml=False, pdf=False, supp=False, zip=False, references=False, noexecute=False, citations=False, limit=100, restart=False, update=False, onlyquery=False, makecsv=False, makehtml=False, synonym=False, startdate=False, enddate=False, terms=False, notterms=False, api='europe_pmc', filter=None, loglevel='info', logfile=False, version=False)
Here’s an example script to download 50 papers from EPMC on ‘lantana camara’.
from pygetpapers import Pygetpapers

# create a Pygetpapers instance and run a query, mirroring the CLI flags
pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query='lantana camara', limit=50, output='lantana_camara', xml=True)
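The other keyword arguments in the run_command signature above mirror the command-line flags. As a hedged sketch (assuming noexecute behaves like the -n/--noexecute flag and only reports the hit count), scoping a query from Python might look like:

from pygetpapers import Pygetpapers

# assumption: noexecute=True mirrors -n/--noexecute and reports hits without downloading
pygetpapers_call = Pygetpapers()
pygetpapers_call.run_command(query='essential oil', noexecute=True)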
Test pygetpapers
To run automated testing on pygetpapers
, do the following:
Install pygetpapers
Clone the pygetpapers repository
Install pytest
Run the command pytest
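A minimal command sequence for this, assuming git and pip are available and using the repository URL from the installation section, might be:
pip install pygetpapers
git clone https://github.com/petermr/pygetpapers.git
cd pygetpapers
pip install pytest
pytest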
Contributions
https://github.com/petermr/pygetpapers/blob/main/resources/CONTRIBUTING.md
Feature Requests
To request features, please put them in issues
Legal Implications
If you use pygetpapers, you should be careful to understand the law as it applies to your content mining, as you assume full responsibility for your actions when using the software.
Countries with copyright exceptions for content mining:
UK
Japan
Countries with proposed copyright exceptions:
Ireland
EU countries
Countries with permissive interpretations of ‘fair use’ that might allow content mining:
Israel
USA
Canada
General summaries and guides:
User Documentation
Research Papers right from python
What is pygetpapers
pygetpapers is a tool to assist text miners. It makes requests to open access scientific text repositories, analyses the hits, and systematically downloads the articles without further interaction.
Comes with the packages
pygetpapers
anddownloadtools
which provide various functions to download, process and save research papers and their metadata.The main medium of its interaction with users is through a command-line interface.
pygetpapers
has a modular design which makes maintenance easy and simple. This also allows adding support for more repositories simple.
The developer documentation has been setup at readthedocs
History
getpapers
is a tool written by Rik Smith-Unna funded by ContentMine at https://github.com/ContentMine/getpapers. The OpenVirus community requires a Python version and Ayush Garg has written an implementation from scratch, with some enhancements.
Formats supported by pygetpapers
pygetpapers gives fulltexts in xml and pdf format.
The metadata for papers can be saved in many formats including JSON, CSV, HTML.
Queries can be saved in form of an ini configuration file.
The additional files for papers can also be downloaded. References and citations for papers are given in XML format.
Log files can be saved in txt format.
Repository Structure
The main code is located in the pygetpapers directory. All the supporting modules for different repositories are described in the pygetpapers/repository directory.
Architecture
Installation
Ensure that pip
is installed along with python. Download python from: https://www.python.org/downloads/ and select the option Add Python to Path while installing.
Check out https://pip.pypa.io/en/stable/installing/ if difficulties installing pip. Also, checkout https://packaging.python.org/en/latest/tutorials/installing-packages/ to learn more about installing packages in python
.
Method one (recommended):
Enter the command:
pip install pygetpapers
Ensure pygetpapers has been installed by reopening the terminal and typing the command
pygetpapers
You should see a help message come up.
Method two (Install Directly From Head):
Ensure git cli is installed and is available in path. Check out (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
Enter the command:
pip install git+https://github.com/petermr/pygetpapers.git
Ensure pygetpapers has been installed by reopening the terminal and typing the command
pygetpapers
You should see a help message come up.
Usage
pygetpapers
is a commandline tool. You can ask for help by running:
pygetpapers --help
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT]
[--save_query] [-x] [-p] [-s] [-z] [--references REFERENCES]
[-n] [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE]
[-k LIMIT] [-r] [-u] [--onlyquery] [-c] [--makehtml]
[--synonym] [--startdate STARTDATE] [--enddate ENDDATE]
[--terms TERMS] [--notterms NOTTERMS] [--api API]
[--filter FILTER]
Welcome to Pygetpapers version 0.0.9.3. -h or --help for help
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file path to read query for pygetpapers
-v, --version output the version number
-q QUERY, --query QUERY
query string transmitted to repository API. Eg.
"Artificial Intelligence" or "Plant Parts". To escape
special characters within the quotes, use backslash.
Incase of nested quotes, ensure that the initial quotes
are double and the qutoes inside are single. For eg:
`'(LICENSE:"cc by" OR LICENSE:"cc-by") AND
METHODS:"transcriptome assembly"' ` is wrong. We should
instead use `"(LICENSE:'cc by' OR LICENSE:'cc-by') AND
METHODS:'transcriptome assembly'"`
-o OUTPUT, --output OUTPUT
output directory (Default: Folder inside current working directory named current date and time)
--save_query saved the passed query in a config file
-x, --xml download fulltext XMLs if available or save metadata as
XML
-p, --pdf [E][A] download fulltext PDFs if available (only eupmc
and arxiv supported)
-s, --supp [E] download supplementary files if available (only eupmc
supported)
-z, --zip [E] download files from ftp endpoint if available (only
eupmc supported)
--references REFERENCES
[E] Download references if available. (only eupmc
supported)Requires source for references
(AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
-n, --noexecute [ALL] report how many results match the query, but don't
actually download anything
--citations CITATIONS
[E] Download citations if available (only eupmc
supported). Requires source for citations
(AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
-l LOGLEVEL, --loglevel LOGLEVEL
[All] Provide logging level. Example --log warning
<<info,warning,debug,error,critical>>, default='info'
-f LOGFILE, --logfile LOGFILE
[All] save log to specified file in output directory as
well as printing to terminal
-k LIMIT, --limit LIMIT
[All] maximum number of hits (default: 100)
-r, --restart [E] Downloads the missing flags for the corpus.Searches
for already existing corpus in the output directory
-u, --update [E][B][M][C] Updates the corpus by downloading new
papers. Requires -k or --limit (If not provided, default
will be used) and -q or --query (must be provided) to be
given. Searches for already existing corpus in the output
directory
--onlyquery [E] Saves json file containing the result of the query in
storage. (only eupmc supported)The json file can be given
to --restart to download the papers later.
-c, --makecsv [All] Stores the per-document metadata as csv.
--makehtml [All] Stores the per-document metadata as html.
--synonym [E] Results contain synonyms as well.
--startdate STARTDATE
[E][B][M] Gives papers starting from given date. Format:
YYYY-MM-DD
--enddate ENDDATE [E][B][M] Gives papers till given date. Format: YYYY-MM-
DD
--terms TERMS [All] Location of the file which contains terms
serperated by a comma or an ami dict which will beOR'ed
among themselves and AND'ed with the query
--notterms NOTTERMS [All] Location of the txt file which contains terms
serperated by a comma or an ami dict which will beOR'ed
among themselves and NOT'ed with the query
--api API API to search [eupmc,
crossref,arxiv,biorxiv,medrxiv,rxivist] (default: eupmc)
--filter FILTER [C] filter by key value pair (only crossref supported)
Queries are build using -q
flag. The query format can be found at http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf A condensed guide can be found at https://github.com/petermr/pygetpapers/wiki/query-format
Repository-specific flags
To convey the repository specificity, we’ve used the first letter of the repository in square brackets in its description.
What is CProject?
A CProject is a directory structure that the AMI toolset uses to gather and process data. Each paper gets its folder.
A CTree is a subdirectory of a CProject that deals with a single paper.
Tutorial
pygetpapers
was on version 0.0.9.3
when the tutorials were documented.
pygetpapers
supports multiple APIs including eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist-bio,rxivist-med. By default, it queries EPMC. You can specify the API by using --api
flag.
You can also follow this colab notebook as part of the tutorial.
Features |
EPMC |
crossref |
arxiv |
biorxiv |
medarxiv |
rxvist |
---|---|---|---|---|---|---|
Fulltext formats |
xml, pdf |
NA |
xml |
xml |
xml |
|
Metdata formats |
json, html, csv |
json, xml, csv |
json, csv, html, xml |
json, csv, html |
json, csv, html |
json, html, csv |
|
yes |
yes |
yes |
NA |
NA |
NA |
|
yes |
NA |
NA |
yes |
yes |
|
|
yes |
NA |
NA |
NA |
NA |
NA |
|
yes |
NA |
NA |
NA |
NA |
NA |
|
yes |
NA |
NA |
NA |
NA |
NA |
|
yes |
yes |
yes |
NA |
NA |
NA |
EPMC (Default API)
Example Query
Let’s break down the following query:
pygetpapers -q "METHOD: invasive plant species" -k 10 -o "invasive_plant_species_test" -c --makehtml -x --save_query
Flag |
What it does |
In this case |
---|---|---|
|
specifies the query |
queries for ‘invasive plant species’ in METHODS section |
|
number of hits (default 100) |
limits hits to 10 |
|
specifies output directory |
outputs to invasive_plant_species_test |
|
downloads fulltext xml |
|
|
saves per-paper metadata into a single csv |
saves single CSV named |
|
saves per-paper metadata into a single HTML file |
saves single HTML named |
|
saves the given query in a |
saves query to |
pygetpapers
, by default, writes metadata to a JSON file within:
individual paper directory for corresponding paper (
epmc_result.json
)working directory for all downloaded papers (
epmc_results.json
)
OUTPUT:
INFO: Final query is METHOD: invasive plant species
INFO: Total Hits are 17910
0it [00:00, ?it/s]WARNING: Keywords not found for paper 1
WARNING: Keywords not found for paper 4
1it [00:00, 164.87it/s]
INFO: Saving XML files to C:\Users\shweata\invasive_plant_species_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:21<00:00, 2.11s/it]
Scope the number of hits for a query
If you are just scoping the number of hits for a given query, you can use -n
flag as shown below.
pygetpapers -n -q "essential oil"
OUTPUT:
INFO: Final query is essential oil
INFO: Total number of hits for the query are 190710
Update an existing CProject with new papers by feeding the metadata JSON
The --update
command is used to update a CProject with a new set of papers on same or different query.
If let’s say you have a corpus of a 30 papers on ‘essential oil’ (like before) and would like to download 20 more papers to the same CProject directory, you use --update
command.
To update your Cproject, you would give it the -o
flag the already existing CProject name. Additionally, you should also add --update
flag.
INPUT:
pygetpapers -q "invasive plant species" -k 10 -x -o lantana_test_5 --update
OUTPUT:
INFO: Final query is invasive plant species
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Total Hits are 32956
0it [00:00, ?it/s]WARNING: html url not found for paper 5
WARNING: pdf url not found for paper 5
WARNING: Keywords not found for paper 6
WARNING: Keywords not found for paper 7
WARNING: Author list not found for paper 10
1it [00:00, 166.68it/s]
INFO: Saving XML files to C:\Users\shweata\lantana_test_5\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [01:03<00:00, 3.16s/it]
How is --update
different from just downloading x number of papers to the same output directory?
By using --update
command you can be sure that you don’t overwrite the existing papers.
Restart downloading papers to an existing CProject
--restart
flag can be used for two purposes:
To download papers in different format. Let’s say you downloaded XMLs in the first round. If you want to download pdfs for same set of papers, you use this flag.
Continue the download from the stage where it broke. This feature would particularly come in handy if you are on poor lines. Let’s start off by forcefully interrupting the download. INPUT:
pygetpapers -q "pinus" -k 10 -o pinus_10 -x
OUTPUT:
INFO: Final query is pinus
INFO: Total Hits are 32086
0it [00:00, ?it/s]WARNING: html url not found for paper 10
WARNING: pdf url not found for paper 10
1it [00:00, 63.84it/s]
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
60%|██████████████████████████████████████████████████████████████████████████████▌ | 6/10 [00:20<00:13, 3.42s/it]
Traceback (most recent call last):
...
KeyboardInterrupt
^C
If you take a look at the CProject directory, there are 6 papers downloaded so far.
C:.
│ eupmc_results.json
│
├───PMC8157994
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8180188
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8198815
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8216501
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8309040
│ eupmc_result.json
│ fulltext.xml
│
└───PMC8325914
eupmc_result.json
fulltext.xml
To download the rest, we can use --restart
flag.
INPUT
pygetpapers -q "pinus" -o pinus_10 --restart -x
OUTPUT:
INFO: Saving XML files to C:\Users\shweata\pinus_10\*\fulltext.xml
80%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 8/10 [00:27<00:07, 3.51s/it 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 9/10 [00:38<00:05, 5.95s/it100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00, 4.49s/it100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00, 4.01s/it]
Now if we inspect the CProject directory, we see that we have 10 papers as specified.
C:.
│ eupmc_results.json
│
├───PMC8157994
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8180188
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8198815
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8199922
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8216501
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8309040
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8309338
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8325914
│ eupmc_result.json
│ fulltext.xml
│
├───PMC8399312
│ eupmc_result.json
│ fulltext.xml
│
└───PMC8400686
eupmc_result.json
fulltext.xml
Under the hood, pygetpapers
looks for eupmc_results.json
, reads it and resumes the download.
You could also use --restart
to download the fulltext or metadata in different format other than the ones that you’ve already downloaded. For example, if I want all the fulltext PDFs of the 10 papers on pinus
, I can run:
INPUT:
pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
OUTPUT:
>pygetpapers -q "pinus" -o pinus_10 --restart -p --makehtml
100%|█████████████████████████████████████████████| 10/10 [03:26<00:00, 20.68s/it]
Now, if we take a look at the CProject:
C:.
│ eupmc_results.json
│
├───PMC8157994
│ eupmc_result.html
│ eupmc_result.json
│ fulltext.pdf
│ fulltext.xml
│
├───PMC8180188
│ eupmc_result.html
│ eupmc_result.json
│ fulltext.pdf
│ fulltext.xml
│
├───PMC8198815
│ eupmc_result.html
│ eupmc_result.json
│ fulltext.pdf
│ fulltext.xml
...
We find that each paper now has fulltext PDFs and metadata in HTML.
Difference between --restart
and --update
If you aren’t looking download new set of papers but would want to download a papers in different format for existing papers,
--restart
is the flag you’d want to useIf you are looking to download a new set of papers to an existing Cproject, then you’d use
--update
command. You should note that the format in which you download papers would only apply to the new set of papers and not for the old.
Downloading citations and references for papers, if available
--references
and--citations
flags can be used to download the references and citations respectively.It also requires source for references (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR)
pygetpapers -q "lantana" -k 10 -o "test" -c -x --citation PMC
Downloading only the metadata
If you are looking to download just the metadata in the supported formats--onlyquery
is the flag you use. It saves the metadata in the output directory.
You can use --restart
feature to download the fulltexts for these papers.
INPUT:
pygetpapers --onlyquery -q "lantana" -k 10 -o "lantana_test" -c
OUTPUT:
INFO: Final query is lantana
INFO: Total Hits are 1909
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Author list not found for paper 5
WARNING: Author list not found for paper 8
WARNING: Keywords not found for paper 9
1it [00:00, 407.69it/s]
Download papers within certain start and end date range
By using --startdate
and --enddate
you can specify the date range within which the papers you want to download were first published.
pygetpapers -q "METHOD:essential oil" --startdate "2020-01-02" --enddate "2021-09-09"
Saving query for later use
To save a query for later use, you can use --save_query
. What it does is that it saves the query in a .ini
file in the output directory.
pygetpapers -q "lantana" -k 10 -o "lantana_query_config"--save_query
Here is an example config file pygetpapers
outputs
Feed query using config.ini
file
Using can use the config.ini
file you created using --save_query
, you re-run the query. To do so, you will give --config
flag the absolute path of the saved_config.ini
file.
pygetpapers --config "C:\Users\shweata\lantana_query_config\saved_config.ini"
Querying using a term list
--terms
flag
If your query is complex with multiple ORs, you can use --terms
feature. To do so, you will:
Create a
.txt
file with list of terms separated by commas or anami-dictionary
(Click here to learn how to create dictionaries).Give the
--terms
flag the absolute path of the.txt
file orami-dictionary
(XML)
-q
is optional.The terms would be OR’ed with each other ANDed with the query, if given.
INPUT:
pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil" -x
OUTPUT:
C:\Users\shweata>pygetpapers -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil"
INFO: Final query is (essential oil AND (antioxidant OR antibacterial OR antifungal OR antiseptic OR antitrichomonal agent))
INFO: Total Hits are 43397
0it [00:00, ?it/s]WARNING: Author list not found for paper 9
1it [00:00, 1064.00it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:19<00:00, 1.99s/it]
You can also use this feature to download papers by using the PMC Ids. You can feed the .txt
file with PMC ids comman-separated. Make sure to give a large enough hit number to download all the papers specified in the text file.
Example text file can be found, here INPUT:
pygetpapers --terms C:\Users\shweata\PMCID_pygetpapers_text.txt -k 100 -o "PMCID_test"
OUTPUT:
INFO: Final query is (PMC6856665 OR PMC6877543 OR PMC6927906 OR PMC7008714 OR PMC7040181 OR PMC7080866 OR PMC7082878 OR PMC7096589 OR PMC7111464 OR PMC7142259 OR PMC7158757 OR PMC7174509 OR PMC7193700 OR PMC7198785 OR PMC7201129 OR PMC7203781 OR PMC7206980 OR PMC7214627 OR PMC7214803 OR PMC7220991
)
INFO: Total Hits are 20
WARNING: Could not find more papers
1it [00:00, 505.46it/s]
100%|█████████████████████████████████████████████| 20/20 [00:32<00:00, 1.61s/it]
--notterms
Excluded papers that have certain keywords might also be of interest for you. For example, if you want papers on essential oil which doesn’t mention antibacterial
, antiseptic
or antimicrobial
, you can run either create a dictionary or a text file with these terms (comma-separated), specify its absolute path to --notterms
flag.
INPUT:
pygetpapers -q "essential oil" -k 10 -o essential_oil_not_terms_test --notterms C:\Users\shweata\not_terms_test.txt
OUTPUT:
INFO: Final query is (essential oil AND NOT (antimicrobial OR antiseptic OR antibacterial))
INFO: Total Hits are 165557
1it [00:00, ?it/s]
100%|█| 10/10 [00:49<00:00, 4.95s/
The number of papers are reduced by a some proportion. For comparision, “essential oil” query gives us 193922 hits.
C:\Users\shweata>pygetpapers -q "essential oil" -n
INFO: Final query is essential oil
INFO: Total number of hits for the query are 193922
Using --terms
with dictionaries
We will take the same example as before.
Assuming you have
ami3
installed, you can createami-dictionaries
Start off by listing the terms in a
.txt
file
antimicrobial antiseptic antibacterial
Run the following command from the directory in which the text file exists
amidict -v --dictionary pygetpapers_terms --directory pygetpapers_terms --input pygetpapers_terms.txt create --informat list --outformats xml
That’s it! You’ve now created a simple ami-dictionary
. There are ways of creating dictionaries from Wikidata as well. You can learn more about how to do that in this Wiki page.
You can also use standard dictionaries that are available. we, then, pass the absolute path of the dictionary to
--terms
flag.
INPUT:
pygetpapers -q "essential oil" --terms C:\Users\shweata\pygetpapers_terms\pygetpapers_terms.xml -k 10 -o pygetpapers_dictionary_test -x
OUTPUT:
INFO: Final query is (essential oil AND (antibacterial OR antimicrobial OR antiseptic))
INFO: Total Hits are 28365
0it [00:00, ?it/s]WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 7
1it [00:00, ?it/s]
INFO: Saving XML files to C:\Users\shweata\pygetpapers_dictionary_test\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:36<00:00, 3.67s/it]
Log levels
You can specify the log level using the -l
flag. The default as you’ve already seen so far is info.
INPUT:
pygetpapers -q "lantana" -k 10 -o lantana_test_10_2 --loglevel debug -x
Log file
You can also choose to write the log to a .txt
file in your HOME directory, while simultaneously printing it out.
INPUT:
pygetpapers -q "lantana" -k 10 -o lantana_test_10_4 --loglevel debug -x --logfile test_log.txt
Here is the log file.
Crossref
You can query crossref api for the metadata only.
Sample query
The metadata formats flags are applicable as described in the EPMC tutorial
--terms
and-q
are also applicable to crossref INPUT:
pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 10 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml
OUTPUT:
INFO: Final query is (essential oil AND (antioxidant OR antibacterial OR antifungal OR antiseptic OR antitrichomonal agent))
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Making csv files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 185.52it/s]
INFO: Making html files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 87.98it/s]
INFO: Making xml files for metadata at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 366.97it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 996.82it/s]
We have 10 folders in the CProject directory.
C:\Users\shweata>cd terms_test_essential_oil_crossref_3
C:\Users\shweata\terms_test_essential_oil_crossref_3>tree
Folder PATH listing for volume Windows-SSD
Volume serial number is D88A-559A
C:.
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2016.1231597
├───10.1080_10412905.1989.9697767
├───10.1111_j.1745-4565.2012.00378.x
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
--update
--update
works the same as in EPMC
. You can use this flag to increase the number of papers in a given CProject.
INPUT
pygetpapers --api crossref -q "essential oil" --terms C:\Users\shweata\essential_oil_terms.txt -k 5 -o "terms_test_essential_oil_crossref_3" -x -c --makehtml --update
OUTPUT:
INFO: Final query is (essential oil AND (antioxidant OR antibacterial OR antifungal OR antiseptic OR antitrichomonal agent))
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making request to crossref
INFO: Got request result from crossref
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\terms_test_essential_oil_crossref_3
100%|██████████████████████████████████████████████| 5/5 [00:00<00:00, 346.84it/s]
The CProject after updating:
C:.
├───10.1002_mbo3.459
├───10.1016_j.bcab.2021.101913
├───10.1055_s-0029-1234896
├───10.1080_0972060x.2014.895156
├───10.1080_0972060x.2016.1231597
├───10.1080_0972060x.2017.1345329
├───10.1080_10412905.1989.9697767
├───10.1080_10412905.2021.1941338
├───10.1111_j.1745-4565.2012.00378.x
├───10.15406_oajs.2019.03.00121
├───10.17795_bhs-24733
├───10.23880_oajmms-16000131
├───10.34302_crpjfst_2019.11.2.8
├───10.5220_0008855200960099
└───10.5220_0009957801190122
We started off with 10 paper folders, and increased the number to 15.
Filter
arxiv
pygetpapers allows you to query arxiv for full-text PDF and metadata in all supported formats.
Sample query
INPUT
pygetpapers --api arxiv -k 10 -o arxiv_test_3 -q "artificial intelligence" -x -p --makehtml -c
OUTPUT
INFO: Final query is artificial intelligence
INFO: Making request to Arxiv through pygetpapers
INFO: Got request result from Arxiv through pygetpapers
INFO: Requesting 10 results at offset 0
INFO: Requesting page of results
INFO: Got first page; 10 of 10 results available
INFO: Downloading Pdfs for papers
100%|█████████████████████████████████████████████| 10/10 [01:02<00:00, 6.27s/it]
INFO: Making csv files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 187.31it/s]
INFO: Making html files for metadata at C:\Users\shweata\arxiv_test_3
100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 161.87it/s]
INFO: Making xml files for metadata at C:\Users\shweata\arxiv_test_3
100%|█████████████████████████████████████████████████████| 10/10 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 1111.22it/s]
Note: --update isn't supported for arxiv.
Biorxiv and Medrxiv
You can query biorxiv and medrxiv for fulltext and metadata (in all supported formats). However, passing a query string using the -q flag isn't supported for either repository; you can only provide a date range.
Sample Query - biorxiv
INPUT:
pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831
OUTPUT:
WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:23<00:00, 2.34s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 684.72it/s]
--update command
INPUT
pygetpapers --api biorxiv -k 10 -x --startdate 2021-01-01 -o biorxiv_test_20210831 --update
OUTPUT
WARNING: Currently biorxiv api is malfunctioning and returning wrong DOIs
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxiv
INFO: Making xml for paper
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:22<00:00, 2.23s/it]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biorxiv_test_20210831
100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 492.39it/s]
The CProject now has 20 papers in total after updating.
├───10.1101_008326
├───10.1101_010553
├───10.1101_035972
├───10.1101_046052
├───10.1101_060012
├───10.1101_067736
├───10.1101_086710
├───10.1101_092205
├───10.1101_092619
├───10.1101_093237
├───10.1101_121061
├───10.1101_135749
├───10.1101_145664
├───10.1101_145896
├───10.1101_165845
├───10.1101_180273
├───10.1101_181198
├───10.1101_191858
├───10.1101_194266
└───10.1101_196105
The working of medrxiv is the same as biorxiv.
rxivist
rxivist lets you specify a query string covering both biorxiv and medrxiv. The results you get would be a mixture of papers from both repositories, since rxivist doesn't differentiate.
Another caveat here is that you can only retrieve metadata from rxivist.
INPUT:
pygetpapers --api rxivist -q "biomedicine" -k 10 -c -x -o "biomedicine_rxivist" --makehtml -p
OUTPUT:
WARNING: Pdf is not supported for this api
INFO: Final query is biomedicine
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 125.54it/s]
INFO: Making html files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 124.71it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 633.38it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 751.09it/s]
Query hits only
Like the other repositories supported by pygetpapers, you can use the -n flag to get only the number of hits.
INPUT:
C:\Users\shweata>pygetpapers --api rxivist -q "biomedical sciences" -n
OUTPUT:
INFO: Final query is biomedical sciences
INFO: Making Request to rxivist
INFO: Total number of hits for the query are 62
Update
--update works the same as for the other repositories. Make sure to provide rxivist as the api.
INPUT:
pygetpapers --api rxivist -q "biomedical sciences" -k 20 -c -x -o "biomedicine_rxivist" --update
OUTPUT:
INFO: Final query is biomedical sciences
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
INFO: Reading old json metadata file
INFO: Making Request to rxivist
INFO: Making csv files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 203.69it/s]
INFO: Making xml files for metadata at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1059.17it/s]
INFO: Wrote metadata file for the query
INFO: Writing metadata file for the papers at C:\Users\shweata\biomedicine_rxivist
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1077.12it/s]
Run pygetpapers within the module
def run_command(output=False, query=False, save_query=False, xml=False, pdf=False, supp=False, zip=False, references=False, noexecute=False, citations=False, limit=100, restart=False, update=False, onlyquery=False, makecsv=False, makehtml=False, synonym=False, startdate=False, enddate=False, terms=False, notterms=False, api='europe_pmc', filter=None, loglevel='info', logfile=False, version=False)
Here’s an example script to download 50 papers from EPMC on ‘lantana camara’.
from pygetpapers import Pygetpapers
pygetpapers_call=Pygetpapers()
pygetpapers_call.run_command(query='lantana camara', limit=50, output='lantana_camara', xml=True)
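The run_command parameters mirror the command-line flags, so other repositories can be used the same way by passing the api parameter. A minimal sketch (the query and output name are only illustrative) for fetching crossref metadata with CSV and HTML output:
from pygetpapers import Pygetpapers
pygetpapers_call = Pygetpapers()
# metadata-only request to crossref, also writing CSV and HTML summaries
pygetpapers_call.run_command(query='essential oil', limit=10, output='essential_oil_crossref', api='crossref', makecsv=True, makehtml=True)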
Test pygetpapers
To run automated testing on pygetpapers, do the following:
Install pygetpapers
Clone the pygetpapers repository
Install pytest
Run the command pytest
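One possible sequence of commands (assuming git and pip are available on your path):
pip install pygetpapers
git clone https://github.com/petermr/pygetpapers.git
cd pygetpapers
pip install pytest
pytest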
Contributions
https://github.com/petermr/pygetpapers/blob/main/resources/CONTRIBUTING.md
Feature Requests
To request features, please open an issue on the GitHub repository.
Legal Implications
If you use pygetpapers, you should be careful to understand the law as it applies to your content mining, as you assume full responsibility for your actions when using the software.
Countries with copyright exceptions for content mining:
UK
Japan
Countries with proposed copyright exceptions:
Ireland
EU countries
Countries with permissive interpretations of ‘fair use’ that might allow content mining:
Israel
USA
Canada
General summaries and guides:
pygetpapers module
- class pygetpapers.pygetpapers.ApiPlugger(query_namespace)
Bases:
object
- add_terms_from_file()
Builds query from terms mentioned in a text file described in the argparse namespace object. See (https://pygetpapers.readthedocs.io/en/latest/index.html?highlight=terms#querying-using-a-term-list) Edits the namespace object's query flag. :param query_namespace: namespace object from argparse (using --terms and --notterms)
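For example, in the crossref query shown earlier, -q "essential oil" combined with a --terms file that presumably lists antioxidant, antibacterial, antifungal, antiseptic and antitrichomonal agent produces the final query (essential oil AND (antioxidant OR antibacterial OR antifungal OR antiseptic OR antitrichomonal agent)).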
- check_query_logic_and_run()
Checks the logic in query_namespace and runs pygetpapers for the given query
- setup_api_support_variables(config, api)
Reads in the configuration file namespace object and sets up class variable for the given api :param config: Configparser configured configuration file :type config: configparser object :param api: the repository to get the variables for :type api: string
- class pygetpapers.pygetpapers.Pygetpapers
Bases:
object
[summary]
- create_argparser()
Creates the cli
- generate_logger(query_namespace)
Creates logger for the given loglevel :param query_namespace: pygetpaper's namespace object :type query_namespace: dict
- static makes_output_directory(query_namespace)
Makes the output directory for the given output in query_namespace :param query_namespace: pygetpaper's namespace object :type query_namespace: dict
- run_command(output=None, query=None, save_query=False, xml=False, pdf=False, supp=False, zip=False, references=False, noexecute=False, citations=False, limit=100, restart=False, update=False, onlyquery=False, makecsv=False, makehtml=False, synonym=False, startdate=False, enddate=False, terms=False, notterms=False, api='europe_pmc', filter=None, loglevel='info', logfile=False, version=False)
Runs pygetpapers for the given parameters
- runs_pygetpapers_for_given_args(query_namespace)
Runs pygetpapers for flags described in a dictionary :param query_namespace: pygetpaper’s namespace object :type query_namespace: dict
- static write_configuration_file(query_namespace)
Writes the argparse namespace to SAVED_CONFIG_INI :param query_namespace: argparse namespace object
- write_logfile(query_namespace, level)
This functions stores logs to a logfile :param query_namespace: argparse namespace object :param level: level of logger (See https://docs.python.org/3/library/logging.html#logging-levels)
- pygetpapers.pygetpapers.main()
Runs the CLI
download_tools module
- class pygetpapers.download_tools.DownloadTools(api=None)
Bases:
object
Generic tools for retrieving literature. Several are called by each repository
- check_if_content_is_zip(request_handler)
Checks if content in request object is a zip
- Parameters
request_handler (request object) – request object for the given zip
- Returns
if zip file exists
- Return type
bool
- static check_or_make_directory(directory_url)
Makes directory if it doesn't already exist
- Parameters
directory_url (string) – path to directory
- static dumps_json_to_given_path(path, json_dict, filemode='w')
dumps json dict to given path
- Parameters
path (string) – path to dump dict
json_dict (dictionary) – json dictionary
filemode (string, optional) – file mode, defaults to “w”
- extract_zip_files(byte_content_to_extract_from, destination_url)
Extracts zip file to destination_url
- Parameters
byte_content_to_extract_from (bytes) – byte content to extract from
destination_url (string) – path to save the extracted zip files to
- get_metadata_results_file()
Gets the url of metadata file (eg. eupmc-results.json) from the current working directory
- Returns
path of the master metadata file
- Return type
string
- get_parent_directory(path)
Returns path of the parent directory for given path
- Parameters
path (string) – path of the file
- Returns
path of the parent directory
- Return type
string
- get_request_endpoint_for_citations(identifier, source)
Gets endpoint to get citations from the configuration file
- Parameters
identifier (string) – unique identifier present in the url for the particular paper
source (string) – which repository to get the citations from
- Returns
request_handler.content
- Return type
bytes
- get_request_endpoint_for_references(identifier, source)
Gets endpoint to get references from the configuration file
- Parameters
identifier (string) – unique identifier present in the url for the particular paper
source (string) – which repository to get the citations from
- Returns
request_handler.content
- Return type
bytes
- get_request_endpoint_for_xml(identifier)
Gets endpoint to full text xml from the configuration file
- Parameters
identifier (string) – unique identifier present in the url for the particular paper
- Returns
request_handler.content
- Return type
bytes
- static get_version()
Gets version from the configuration file
- Returns
version of pygetpapers as described in the configuration file
- Return type
string
- gets_result_dict_for_query(headers, data)
Queries query_url provided in configuration file for the given headers and payload and returns result in the form of a python dictionary
- Parameters
headers (dict) – headers given to the request
data (dict) – payload given to the request
- Returns
result in the form of a python dictionary
- Return type
dictionary
- getsupplementaryfiles(identifier, path_to_save, from_ftp_end_point=False)
Retrieves supplementary files for the given paper (according to identifier) and saves to path_to_save
- Parameters
identifier (string) – unique identifier present in the url for the particular paper
path_to_save (string) – path to save the supplementary files to
from_ftp_end_point (bool, optional) – to get the results from eupmc ftp endpoint
- handle_creation_of_csv_html_xml(makecsv, makehtml, makexml, metadata_dictionary, name)
Writes csv, html, xml for given conditions
- Parameters
makecsv (bool) – whether to get csv
makehtml (bool) – whether to get html
makexml (bool) – whether to get xml
metadata_dictionary (dict) – dictionary to write the content for
name (string) – name of the file to save
- make_citations(source, citationurl, identifier)
Retrieves URL for the citations for the given paperid, gets the xml, writes to citationurl
- Parameters
source (which repository to get the citations from) – which repository to get the citations from
citationurl (string) – path to save the citations to
identifier (string) – unique identifier present in the url for the particular paper
- make_csv_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)
Writes csv content for the given dictionary to disk
- Parameters
metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper
- make_html_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)
Writes html content for the given dictionary to disk
- Parameters
metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper
- make_html_from_dataframe(dataframe, path_to_save)
Makes html page from the given pandas dataframe
- Parameters
dataframe (pandas dataframe) – pandas dataframe to convert to html
path_to_save (string) – path to save the dataframe to
- make_references(paperid, identifier, path_to_save)
Writes references for the given paperid from source to reference url
- Parameters
identifier (string) – identifier for the paper
source (string) – source to get references from
path_to_save (string) – path to store the references
- make_xml_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)
Writes xml content for the given dictionary to disk
- Parameters
metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper
- parse_request_handler(request_handler)
- post_query(url, data=None, headers=None)
Queries url
- Parameters
headers (dict) – headers given to the request
data (dict) – payload given to the request
- Returns
result in the form of a python dictionary
- Return type
dictionary
- static queries_the_url_and_writes_response_to_destination(url, destination)
queries the url and writes response to destination
- Parameters
url (string) – url to query
destination (string) – destination to save response to
- static readjsondata(path)
reads json from path and returns python dictionary
- static removing_added_attributes_from_dictionary(resultant_dict)
pygetpapers adds some attributes like “pdfdownloaded” to track the progress of downloads for a particular corpus. When we are exporting data to a csv file, we don't want these terms to appear. So this function makes a copy of the given dictionary, removes the added attributes from dictionaries inside the given dict and returns the new dictionary.
- Parameters
resultant_dict (dictionary) – given parent dictionary
- Returns
dictionary with additional attributes removed from the child dictionaries
- Return type
dictionary
- set_up_config_variables(config, api)
Sets class variable reading the configuration file for the provided api
- Parameters
config (configparser object) – configparser object for the configuration file
api (string) – Name of api as described in the configuration file
- setup_config_file(config_ini)
Reads config_ini file and returns configparser object
- Parameters
config_ini (string) – path of configuration file
- Returns
configparser object for the configuration file
- Return type
configparser object
- static url_encode_id(doi_of_paper)
Encodes the doi of paper in a file savable name
- Parameters
doi_of_paper (string) – doi
- Returns
url encoded doi
- Return type
string
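For example, in the crossref CProject listing shown earlier, the DOI 10.1016/j.bcab.2021.101913 appears as the folder name 10.1016_j.bcab.2021.101913; this encoding makes the DOI safe to use as a directory name.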
- static write_or_append_to_csv(df_transposed, csv_path='europe_pmc.csv')
write pandas dataframe to given csv file
- Parameters
df_transposed (pandas dataframe) – dataframe to save
csv_path (str, optional) – path to csv file, defaults to “europe_pmc.csv”
- writexml(destination_url, xml_content)
writes xml content to given destination_url
- Parameters
destination_url (string) – path to dump xml content
xml_content (byte string) – xml content
europe_pmc module
- class pygetpapers.repository.europe_pmc.EuropePmc
Bases:
RepositoryInterface
Downloads metadata and optionally fulltext from https://europepmc.org
- apipaperdownload(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- build_and_send_query(maximum_hits_per_page, cursor_mark, query, synonym)
Retrieves metadata from EPMC for given query
- Parameters
maximum_hits_per_page (int) – number of papers to get
cursor_mark (string) – cursor mark
query (string) – query
synonym (bool) – whether to get synonyms, defaults to True
- Returns
metadata dictionary
- Return type
dict
- static buildquery(cursormark, page_size, query, synonym=True)
Builds query parameters
- static create_parameters_for_paper_download()
Creates parameters for paper download
- Returns
parameters for paper download tuple
- Return type
tuple
- get_supplementary_metadata(metadata_dictionary_with_all_papers, getpdf=False, makecsv=False, makehtml=False, makexml=False, references=False, citations=False, supplementary_files=False, zip_files=False)
Gets supplementary metadata
- Parameters
metadata_dictionary_with_all_papers (dict) – metadata dictionary
getpdf (bool, optional) – whether to get pdfs
makecsv (bool, optional) – whether to create csv output
makehtml (bool, optional) – whether to create html output
makexml (bool, optional) – whether to download xml fulltext
references (bool, optional) – whether to download references
citations (bool, optional) – whether to download citations
supplementary_files (bool, optional) – whether to download supplementary_files
zip_files (bool, optional) – whether to download zip_files from the ftp endpoint
- get_urls_to_write_to(identifier_for_paper)
Gets urls to write the metadata to
- Parameters
identifier_for_paper (str) – identifier for paper
- Returns
urls to write the metadata to
- Return type
tuple
- make_html_from_dict(dict_to_write_html_from, url, identifier_for_paper)
Makes html from dict
- Parameters
dict_to_write_html_from (dict) – dict to write html from
url (str) – url to write html to
- noexecute(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters but only prints the output and does not write to disk.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- query(query, cutoff_size, synonym=True, cursor_mark='*')
Queries EPMC for the given query for the given number (cutoff_size) of papers
- Parameters
query (string) – query
cutoff_size (int) – number of papers to get
synonym (bool, optional) – whether to get synonyms, defaults to True
- Returns
list containing the papers
- Return type
list
- restart(query_namespace)
Restarts query to add new metadata for existing papers
- Parameters
query_namespace (dict) – pygetpaper's namespace object
- run_eupmc_query_and_get_metadata(query, cutoff_size, update=None, onlymakejson=False, getpdf=False, makehtml=False, makecsv=False, makexml=False, references=False, citations=False, supplementary_files=False, synonym=True, zip_files=False)
- update(query_namespace)
If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
arxiv module
- class pygetpapers.repository.arxiv.Arxiv
Bases:
RepositoryInterface
arxiv.org repository
This uses the PyPI package arxiv to download metadata. It is not clear whether this is created by the arXiv project or layered on top of the public API.
arXiv current practice for bulk data download (e.g. PDFs) is described in
https://arxiv.org/help/bulk_data. Please be considerate and also include a rate limit.
- apipaperdownload(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- arxiv(query, cutoff_size, getpdf=False, makecsv=False, makexml=False, makehtml=False)
Builds the arxiv searcher and writes the xml, pdf, csv and html
- Parameters
query (string) – query given to arxiv
cutoff_size (int) – number of papers to retrieve
getpdf (bool, optional) – whether to get pdf
makecsv (bool) – whether to get csv
makehtml (bool) – whether to get html
makexml (bool) – whether to get xml
- Returns
dictionary of results retrieved from arxiv
- Return type
dict
- download_pdf(metadata_dictionary)
Downloads pdfs for papers in metadata dictionary
- Parameters
metadata_dictionary (dict) – metadata dictionary for papers
- static noexecute(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters but only prints the output and does not write to disk.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- static update(query_namespace)
If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- write_metadata_json_from_arxiv_dict(metadata_dictionary)
Iterates through metadata_dictionary and makes json metadata file for papers
- Parameters
metadata_dictionary (dict) – metadata dictionary for papers
rxivist module
- class pygetpapers.repository.rxivist.Rxivist
Bases:
RepositoryInterface
Rxivist wrapper for biorxiv and medrxiv
From the site (rxivist.org): “Rxivist combines biology preprints from bioRxiv and medRxiv with data from Twitter to help you find the papers being discussed in your field.”
Appears to be metadata-only. To get full-text you may have to submit the IDs to biorxiv or medrxiv or EPMC as this aggregates preprints.
- apipaperdownload(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- download_and_save_results(query, size, update=False, makecsv=False, makexml=False, makehtml=False)
- make_request_add_papers(query, cursor_mark, total_number_of_results, total_papers_list)
- noexecute(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters but only prints the output and does not write to disk.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- rxivist(query, size, update=None, makecsv=False, makexml=False, makehtml=False)
- send_post_request(query, cursor_mark=0, page_size=20)
- update(query_namespace)
If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
crossref module
- class pygetpapers.repository.crossref.CrossRef
Bases:
RepositoryInterface
CrossRef class which handles the crossref repository. It uses the habanero wrapper to make its queries
- apipaperdownload(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- crossref(query, cutoff_size, filter_dict=None, update=None, makecsv=False, makexml=False, makehtml=False)
Builds the crossref searcher and writes the xml, csv and html
- Parameters
query (string) – query given to crossref
cutoff_size (int) – number of papers to retrieve
filter_dict (bool, optional) – filters for crossref search
makecsv (bool) – whether to get csv
makehtml (bool) – whether to get html
makexml (bool) – whether to get xml
update (dict) – dictionary containing results from previous run of pygetpapers
- Returns
dictionary of results retrieved from crossref
- Return type
dict
- initiate_crossref()
Initiate habanero wrapper for crossref
- Returns
crossref object
- noexecute(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters but only prints the output and does not write to disk.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- update(query_namespace)
If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
rxiv module
- class pygetpapers.repository.rxiv.Rxiv(api='biorxiv')
Bases:
RepositoryInterface
Biorxiv and Medrxiv repositories
At present (2022-03) the API appears only to support date searches. The rxivist system is layered on top and supports fuller queries
- apipaperdownload(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- download_and_save_results(query, cutoff_size, source, update=False, makecsv=False, makexml=False, makehtml=False)
- make_request_add_papers(interval, cursor_mark, source, total_number_of_results, total_papers_list)
- make_request_url_for_rxiv(cursor_mark, interval, source)
- make_xml_for_rxiv(dict_of_papers, xml_identifier, paper_id_identifier, filename)
- noexecute(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters but only prints the output and does not write to disk.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- rxiv(query, cutoff_size, source='biorxiv', update=None, makecsv=False, makehtml=False)
- rxiv_update(interval, cutoff_size, source='biorxiv', update=None, makecsv=False, makexml=False, makehtml=False)
- update(query_namespace)
If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
QueryBuilder
pygetpapers builds and runs all queries through a query_builder module (Pygetpapers.py). There are several reasons:
each repository may use its own query language and syntax
there is frequent need to use punctuation (e.g. (..), “..”, ‘..’) and these may be nested. Punctuation can also interact with command-line syntax
complex queries (e.g. repeated OR, AND, NOT ) are tedious and error-prone
many values (especially dates) need converting or standardising
some options require or forbid other options (e.g. --xml requires an --output value)
successful queries can be saved, edited, and rerun (see the example at the end of this section)
queries may be rerun at a later date, or request a larger number of downloads.
Users may wish to build queries:
completely from the commandline (argparse Namespace).
from a saved query (configparser configuration file)
programmatically through an instance of Pygetpapers
mixtures of the above
QueryBuilder contains or creates flags indicating which of the following is to be processed
query strings to be submitted to the particular repository
flags controlling the execution (download rate, limits, formats)
creation of the local repository (CProject)
creation of the per-article subdirectories (CTree)
postprocessing options (e.g. docanalysis and py4ami, and standard Unix/Python libraries)
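For example, a query run with --save_query can later be rerun from the saved configuration file. A minimal sketch, assuming the configuration is written as saved_config.ini inside the output directory:
pygetpapers -q "lantana" -k 10 -o lantana_saved_query -x --save_query
pygetpapers --config lantana_saved_query/saved_config.ini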
How to add a new repository
pygetpapers makes it really easy to add support for new repositories.
To add a new repository, clone the repo and cd into the directory pygetpapers. Thereafter, create a new module with the class for the repo. Make sure you edit the config.ini file with the specifications of the new repo.
Following is an example config
[europe_pmc]
posturl=https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
citationurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/citations?page=1&pageSize=1000&format=xml
referencesurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{pmcid}/references?page=1&pageSize=1000&format=xml
xmlurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML
suppurl=https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles
zipurl= http://europepmc.org/ftp/suppl/OA/{key}/{pmcid}.zip
date_query=SUPPORTED
term=SUPPORTED
update=SUPPORTED
restart=SUPPORTED
class_name=EuropePmc
library_name= europe_pmc
features_not_supported = ["filter",]
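For a new repository, a similar section would be added with its own endpoints and class details. The following is a purely hypothetical sketch; the section name, URL and class/module names are placeholders, not a real repository:
[my_new_repo]
posturl=https://api.example.org/search
date_query=SUPPORTED
term=SUPPORTED
update=SUPPORTED
restart=SUPPORTED
class_name=MyNewRepo
library_name=my_new_repo
features_not_supported = ["filter",]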
After this, in the repo class, ensure that you can request scientific papers, download them and do post-processing on them. There are multiple functions in the download_tools class which can help you with this. Look at previously configured repositories for examples.
It is necessary to have three functions in particular.
apipaperdownload
noexecute
update
Following is an example implementation.
# Excerpt adapted from the crossref repository class; logging, os and the constants
# (NEW_RESULTS, UPDATED_DICT, DOI, TOTAL_HITS, CROSSREF_RESULTS) are imported at module level.
def update(self, args):
    """Reads the existing JSON metadata file, re-runs the crossref query and
    adds the newly found papers to the corpus.
    :param args: pygetpapers namespace object from argparse
    """
    logging.info("Reading old json metadata file")
    update_path = self.get_metadata_results_file()
    os.chdir(os.path.dirname(update_path))
    update = self.download_tools.readjsondata(update_path)
    query = args.query
    size = args.limit
    filter_dict = args.filter
    makecsv = args.makecsv
    makexml = args.xml
    makehtml = args.makehtml
    result_dict = self.crossref(
        query,
        size,
        filter_dict=filter_dict,
        update=update,
        makecsv=makecsv,
        makexml=makexml,
        makehtml=makehtml,
    )
    self.download_tools.make_json_files_for_paper(
        result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI,
        name_of_file=CROSSREF_RESULTS
    )

def noexecute(self, args):
    """Runs the query but only logs the total number of hits; nothing is written to disk.
    :param args: pygetpapers namespace object from argparse
    """
    query = args.query
    filter_dict = args.filter
    result_dict = self.crossref(
        query, size=10, filter_dict=filter_dict
    )
    totalhits = result_dict[NEW_RESULTS][TOTAL_HITS]
    logging.info("Total number of hits for the query are %s", totalhits)

def apipaperdownload(self, args):
    """Runs the query for the given parameters and writes the metadata
    (and any requested formats) for each paper.
    :param args: pygetpapers namespace object from argparse
    """
    query = args.query
    size = args.limit
    filter_dict = args.filter
    makecsv = args.makecsv
    makexml = args.xml
    makehtml = args.makehtml
    result_dict = self.crossref(
        query,
        size,
        filter_dict=filter_dict,
        update=None,
        makecsv=makecsv,
        makexml=makexml,
        makehtml=makehtml,
    )
    self.download_tools.make_json_files_for_paper(
        result_dict[NEW_RESULTS], updated_dict=result_dict[UPDATED_DICT], key_in_dict=DOI, name_of_file=CROSSREF_RESULTS
    )
The class ApiPlugger looks for these functions, along with the config file, to serve the API on the CLI.
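Putting this together, a new repository module would define a class that subclasses RepositoryInterface (see the Repository Interface section below) and implements the three methods. The following is a minimal, hypothetical skeleton; the class name, module name and method bodies are placeholders for illustration only:
import logging

from pygetpapers.download_tools import DownloadTools
from pygetpapers.repositoryinterface import RepositoryInterface

class MyNewRepo(RepositoryInterface):
    """Hypothetical repository plug-in; the real logic would call the repository's API."""

    def __init__(self):
        # DownloadTools provides the generic download/metadata helpers;
        # the name passed here is assumed to match the config.ini section
        self.download_tools = DownloadTools("my_new_repo")

    def apipaperdownload(self, query_namespace):
        # run the query for query_namespace.query and query_namespace.limit,
        # then write metadata (and any requested formats) for each paper
        ...

    def noexecute(self, query_namespace):
        # run the query but only report the total number of hits; write nothing to disk
        ...

    def update(self, query_namespace):
        # read the existing metadata file, re-run the query and add the new papers
        update_path = self.download_tools.get_metadata_results_file()
        ...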
Repository Interface
- class pygetpapers.repositoryinterface.RepositoryInterface
Bases:
ABC
- abstract apipaperdownload(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- abstract noexecute(query_namespace)
Takes in the query_namespace object as the parameter and runs the query search for given search parameters but only prints the output and does not write to disk.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse
- abstract update(query_namespace)
If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.
- Parameters
query_namespace (dict) – pygetpaper’s namespace object containing the queries from argparse