arxiv module

class pygetpapers.repository.arxiv.Arxiv

Bases: RepositoryInterface

arxiv.org repository

This uses the PyPI package arxiv to download metadata. It is not clear whether this package is maintained by the arXiv project or is layered on top of the public API.

arXiv's current practice for bulk data download (e.g. PDFs) is described at https://arxiv.org/help/bulk_data. Please be considerate and include a rate limit.
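One way to honour the rate-limit request is sketched below. It is not part of pygetpapers: the helper name rate_limited, its parameters, and the three-second default are assumptions chosen for illustration. The sleep and clock arguments are injectable so the pacing can be checked without real waiting.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def rate_limited(
    items: Iterable[T],
    action: Callable[[T], R],
    min_interval: float = 3.0,  # assumed default; adjust to the service's guidance
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[R]:
    """Apply `action` to each item, leaving at least `min_interval` seconds
    between the start of consecutive calls."""
    last_call: Optional[float] = None
    for item in items:
        if last_call is not None:
            # Wait only for the part of the interval that has not yet elapsed.
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        yield action(item)
```

A caller could wrap any per-paper download function in this generator, for example list(rate_limited(urls, fetch)), where fetch is whatever hypothetical function retrieves a single PDF.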

apipaperdownload(query_namespace)

Takes the query_namespace object and runs the search for the given search parameters.

Parameters

query_namespace (dict) – pygetpapers' namespace object containing the queries from argparse
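As a rough illustration of what a dict built from argparse looks like, the sketch below parses a few flags and converts the result with vars(). The flag names --query, --limit and --pdf are hypothetical stand-ins; pygetpapers defines its own parser, and the keys it produces may differ.

```python
import argparse

# Hypothetical flags for illustration only; the real parser lives in pygetpapers.
parser = argparse.ArgumentParser(prog="example")
parser.add_argument("--query", type=str, default=None)
parser.add_argument("--limit", type=int, default=100)
parser.add_argument("--pdf", action="store_true")

# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = parser.parse_args(["--query", "quantum error correction", "--limit", "5"])

# vars() turns the argparse.Namespace into a plain dict keyed by option name.
query_namespace = vars(args)
```

A dict of this shape is the kind of object the query_namespace parameter describes.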

arxiv(query, cutoff_size, getpdf=False, makecsv=False, makexml=False, makehtml=False)

Builds the arxiv search and writes the XML, PDF, CSV and HTML output that was requested

Parameters
  • query (string) – query given to arxiv

  • cutoff_size (int) – number of papers to retrieve

  • getpdf (bool, optional) – whether to download PDFs

  • makecsv (bool, optional) – whether to write CSV output

  • makexml (bool, optional) – whether to write XML output

  • makehtml (bool, optional) – whether to write HTML output

Returns

dictionary of results retrieved from arxiv

Return type

dict

download_pdf(metadata_dictionary)

Downloads PDFs for the papers in the metadata dictionary

Parameters

metadata_dictionary (dict) – metadata dictionary for papers

static noexecute(query_namespace)

Takes the query_namespace object and runs the search for the given search parameters, but only prints the output and does not write to disk.

Parameters

query_namespace (dict) – pygetpapers' namespace object containing the queries from argparse

static update(query_namespace)

If there is a previously existing corpus, this function reads in the ‘cursor mark’ from the previous run, increments it, and adds new papers for the given parameters to the existing corpus.

Parameters

query_namespace (dict) – pygetpapers' namespace object containing the queries from argparse
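The step of adding new papers to an existing corpus can be sketched as a dictionary merge. This is an assumption about the general idea, not the pygetpapers implementation: merge_new_papers is a hypothetical helper, both arguments are assumed to be dicts keyed by paper identifier, and the cursor-mark handling is left out because its format is not described here.

```python
from typing import Dict, Tuple


def merge_new_papers(
    existing: Dict[str, dict], fetched: Dict[str, dict]
) -> Tuple[Dict[str, dict], int]:
    """Return a merged corpus and the number of papers that were newly added.

    Papers already present in `existing` keep their stored metadata.
    """
    # Keep only the fetched papers whose identifiers are not yet in the corpus.
    added = {pid: meta for pid, meta in fetched.items() if pid not in existing}
    # Build a new dict so the caller's existing corpus is not modified in place.
    merged = {**existing, **added}
    return merged, len(added)
```

Returning a new dictionary instead of mutating the old one keeps the previous corpus available if a later write fails.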

write_metadata_json_from_arxiv_dict(metadata_dictionary)

Iterates through metadata_dictionary and writes a JSON metadata file for each paper

Parameters

metadata_dictionary (dict) – metadata dictionary for papers
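A minimal sketch of writing one JSON file per paper follows. The function name write_metadata_json, the output_dir parameter, the one-directory-per-paper layout and the file name metadata.json are all assumptions made for illustration; the real method's layout may differ. The metadata dictionary is assumed to map a paper identifier to a JSON-serialisable dict.

```python
import json
from pathlib import Path
from typing import Dict, List, Union


def write_metadata_json(
    metadata_dictionary: Dict[str, dict], output_dir: Union[str, Path]
) -> List[Path]:
    """Write each paper's metadata to <output_dir>/<paper_id>/metadata.json."""
    output_dir = Path(output_dir)
    written = []
    for paper_id, metadata in metadata_dictionary.items():
        # Older arXiv identifiers contain "/", which is not safe in a folder name.
        paper_dir = output_dir / paper_id.replace("/", "_")
        paper_dir.mkdir(parents=True, exist_ok=True)
        target = paper_dir / "metadata.json"  # hypothetical file name
        target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Returning the list of written paths makes it easy for a caller to log or verify the output.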