download_tools module

class pygetpapers.download_tools.DownloadTools(api=None)

Bases: object

Generic tools for retrieving literature. Several of them are called by each repository.

check_if_content_is_zip(request_handler)

Checks whether the content of the request object is a zip file

Parameters

request_handler (request object) – request object for the given zip

Returns

True if the content is a zip file

Return type

bool
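
A check of this kind can be sketched with the standard library alone. The helper name `looks_like_zip` is hypothetical, and it takes raw bytes (for example, the body of an HTTP response) rather than a request object; this is an illustration of the idea, not necessarily how pygetpapers implements the method.

```python
import io
import zipfile


def looks_like_zip(content: bytes) -> bool:
    """Return True if the bytes form a readable zip archive (hypothetical helper)."""
    # zipfile.is_zipfile accepts a file-like object as well as a path.
    return zipfile.is_zipfile(io.BytesIO(content))


# Build a small archive in memory to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hi")

print(looks_like_zip(buffer.getvalue()))
print(looks_like_zip(b"<html>not a zip</html>"))
```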

static check_or_make_directory(directory_url)

Makes the directory if it doesn’t already exist

Parameters

directory_url (string) – path to directory
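
The "make it only if missing" behaviour can be sketched as follows. `ensure_directory` is a hypothetical stand-in for the method, and the path used is made up for the demonstration.

```python
import os
import tempfile


def ensure_directory(directory_path: str) -> None:
    """Create the directory (and any missing parents) if it is absent."""
    # exist_ok=True makes a repeated call harmless instead of raising.
    os.makedirs(directory_path, exist_ok=True)


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "corpus", "PMC0000000")  # made-up path
    ensure_directory(target)
    ensure_directory(target)  # second call finds the directory and does nothing
    created = os.path.isdir(target)

print(created)
```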

static dumps_json_to_given_path(path, json_dict, filemode='w')

Dumps a json dict to the given path

Parameters
  • path (string) – path to dump dict

  • json_dict (dictionary) – json dictionary

  • filemode (string, optional) – file mode, defaults to “w”
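
A minimal sketch of such a dump, assuming the dictionary is serialisable by the standard `json` module. `dump_json` is a hypothetical helper that mirrors the documented parameters; the file name and contents are invented for the example.

```python
import json
import os
import tempfile


def dump_json(path: str, json_dict: dict, filemode: str = "w") -> None:
    """Serialise json_dict to path, opening the file with filemode."""
    with open(path, filemode, encoding="utf-8") as handle:
        json.dump(json_dict, handle)


with tempfile.TemporaryDirectory() as base:
    path = os.path.join(base, "results.json")  # made-up file name
    dump_json(path, {"total_hits": 2})
    # Read the file back to confirm the round trip.
    with open(path, encoding="utf-8") as handle:
        round_trip = json.load(handle)

print(round_trip)
```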

extract_zip_files(byte_content_to_extract_from, destination_url)

Extracts zip file to destination_url

Parameters
  • byte_content_to_extract_from (bytes) – byte content to extract from

  • destination_url (string) – path to save the extracted zip files to
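
Extracting from bytes rather than from a file on disk can be sketched like this. `extract_zip_bytes` is a hypothetical helper, and the archive is built in memory only so that the example is self-contained.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(content: bytes, destination: str) -> list:
    """Unpack an in-memory zip archive into destination; return member names."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(destination)
        return archive.namelist()


# Build a one-file archive in memory (file name is made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("supplementary/table1.csv", "a,b\n1,2\n")

with tempfile.TemporaryDirectory() as destination:
    names = extract_zip_bytes(buffer.getvalue(), destination)
    extracted = os.path.isfile(
        os.path.join(destination, "supplementary", "table1.csv")
    )

print(names, extracted)
```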

get_metadata_results_file()

Gets the path of the metadata file (e.g. eupmc-results.json) from the current working directory

Returns

path of the master metadata file

Return type

string

get_parent_directory(path)

Returns path of the parent directory for given path

Parameters

path (string) – path of the file

Returns

path of the parent directory

Return type

string
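
One way to compute a parent directory is with `pathlib`. This is a sketch under the assumption that the input is an ordinary file path; `parent_directory` and the example path are hypothetical.

```python
from pathlib import Path


def parent_directory(path: str) -> str:
    """Return the directory that contains the given path."""
    return str(Path(path).parent)


example = parent_directory("corpus/PMC0000000/fulltext.xml")  # made-up path
print(example)
```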

get_request_endpoint_for_citations(identifier, source)

Gets endpoint to get citations from the configuration file

Parameters
  • identifier (string) – unique identifier present in the url for the particular paper

  • source (string) – which repository to get the citations from

Returns

request_handler.content

Return type

bytes

get_request_endpoint_for_references(identifier, source)

Gets endpoint to get references from the configuration file

Parameters
  • identifier (string) – unique identifier present in the url for the particular paper

  • source (string) – which repository to get the references from

Returns

request_handler.content

Return type

bytes

get_request_endpoint_for_xml(identifier)

Gets endpoint to full text xml from the configuration file

Parameters

identifier (string) – unique identifier present in the url for the particular paper

Returns

request_handler.content

Return type

bytes

static get_version()

Gets version from the configuration file

Returns

version of pygetpapers as described in the configuration file

Return type

string

gets_result_dict_for_query(headers, data)

Queries query_url provided in configuration file for the given headers and payload and returns result in the form of a python dictionary

Parameters
  • headers (dict) – headers given to the request

  • data (dict) – payload given to the request

Returns

result in the form of a python dictionary

Return type

dictionary

getsupplementaryfiles(identifier, path_to_save, from_ftp_end_point=False)

Retrieves supplementary files for the given paper (according to identifier) and saves to path_to_save

Parameters
  • identifier (string) – unique identifier present in the url for the particular paper

  • path_to_save (string) – path to save the supplementary files to

  • from_ftp_end_point (bool, optional) – to get the results from eupmc ftp endpoint

handle_creation_of_csv_html_xml(makecsv, makehtml, makexml, metadata_dictionary, name)

Writes csv, html, xml for given conditions

Parameters
  • makecsv (bool) – whether to get csv

  • makehtml (bool) – whether to get html

  • makexml (bool) – whether to get xml

  • metadata_dictionary (dict) – dictionary to write the content for

  • name (string) – name of the file to save

make_citations(source, citationurl, identifier)

Retrieves the URL for the citations of the given paper id, gets the xml, and writes it to citationurl

Parameters
  • source (string) – which repository to get the citations from

  • citationurl (string) – path to save the citations to

  • identifier (string) – unique identifier present in the url for the particular paper

make_csv_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes csv content for the given dictionary to disk

Parameters
  • metadata_dictionary (dict) – dictionary to write the content for

  • name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)

  • name_result_file_for_paper (string) – name of the result file for a paper

make_html_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes html content for the given dictionary to disk

Parameters
  • metadata_dictionary (dict) – dictionary to write the content for

  • name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)

  • name_result_file_for_paper (string) – name of the result file for a paper

make_html_from_dataframe(dataframe, path_to_save)

Makes an html page from the given pandas dataframe

Parameters
  • dataframe (pandas dataframe) – pandas dataframe to convert to html

  • path_to_save (string) – path to save the dataframe to

make_references(paperid, identifier, path_to_save)

Writes references for the given paperid from source to reference url

Parameters
  • identifier (string) – identifier for the paper

  • source (string) – source to get references from

  • path_to_save (string) – path to store the references

make_xml_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes xml content for the given dictionary to disk

Parameters
  • metadata_dictionary (dict) – dictionary to write the content for

  • name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)

  • name_result_file_for_paper (string) – name of the result file for a paper

parse_request_handler(request_handler)

post_query(url, data=None, headers=None)

Queries url

Parameters
  • url (string) – url to query

  • data (dict) – payload given to the request

  • headers (dict) – headers given to the request

Returns

result in the form of a python dictionary

Return type

dictionary

static queries_the_url_and_writes_response_to_destination(url, destination)

queries the url and writes response to destination

Parameters
  • url (string) – url to query

  • destination (string) – destination to save response to

static readjsondata(path)

Reads json from path and returns a python dictionary

static removing_added_attributes_from_dictionary(resultant_dict)

pygetpapers adds some attributes like “pdfdownloaded” to track the progress of downloads for a particular corpus. When exporting data to a csv file, we don’t want these terms to appear. So this function makes a copy of the given dictionary, removes the added attributes from the dictionaries inside it, and returns the new dictionary.

Parameters

resultant_dict (dictionary) – given parent dictionary

Returns

dictionary with additional attributes removed from the child dictionaries

Return type

dictionary
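
The copy-then-strip behaviour described above can be sketched as follows. `strip_tracking_keys` is a hypothetical helper, and the key list is an assumption: only “pdfdownloaded” is named in the description, so the sketch removes just that one. The identifier and title are made up.

```python
import copy

# Assumed list of bookkeeping keys; only "pdfdownloaded" is named in the docs.
TRACKING_KEYS = ("pdfdownloaded",)


def strip_tracking_keys(resultant_dict: dict, keys=TRACKING_KEYS) -> dict:
    """Return a deep copy with the tracking keys removed from each child dict."""
    cleaned = copy.deepcopy(resultant_dict)
    for paper in cleaned.values():
        for key in keys:
            paper.pop(key, None)  # ignore papers that lack the key
    return cleaned


corpus = {"PMC0000000": {"title": "Example", "pdfdownloaded": True}}
cleaned = strip_tracking_keys(corpus)
print(cleaned)
```

Working on a deep copy keeps the original corpus, with its progress markers, intact for later downloads.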

set_up_config_variables(config, api)

Sets class variables by reading the configuration file for the provided api

Parameters
  • config (configparser object) – configparser object for the configuration file

  • api (string) – Name of api as described in the configuration file

setup_config_file(config_ini)

Reads config_ini file and returns configparser object

Parameters

config_ini (string) – path of configuration file

Returns

configparser object for the configuration file

Return type

configparser object
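
Reading an ini file into a configparser object can be sketched with the standard library. `load_config` is a hypothetical helper, and the section name `example_api` and the URL are invented for the demonstration; the real configuration file may be laid out differently.

```python
import configparser
import os
import tempfile


def load_config(config_ini: str) -> configparser.ConfigParser:
    """Parse the ini file at config_ini and return the parser object."""
    config = configparser.ConfigParser()
    # read_file raises if the file is missing, unlike read(), which skips it.
    with open(config_ini, encoding="utf-8") as handle:
        config.read_file(handle)
    return config


with tempfile.TemporaryDirectory() as base:
    ini = os.path.join(base, "config.ini")
    with open(ini, "w", encoding="utf-8") as handle:
        # Hypothetical section and value, written only for this example.
        handle.write("[example_api]\nquery_url = https://example.org/search\n")
    config = load_config(ini)
    query_url = config.get("example_api", "query_url")

print(query_url)
```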

static url_encode_id(doi_of_paper)

Encodes the doi of the paper as a name that can be saved as a file

Parameters

doi_of_paper (string) – doi

Returns

url encoded doi

Return type

string
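
A DOI contains a slash, which a file system would read as a directory separator. One way to make it safe, assuming percent-encoding is acceptable, is `urllib.parse.quote`; the library may use a different scheme. `encode_identifier` and the DOI shown are hypothetical.

```python
from urllib.parse import quote


def encode_identifier(doi_of_paper: str) -> str:
    """Percent-encode a DOI so it can be used as a single file name."""
    # safe="" also escapes "/", which quote() leaves untouched by default.
    return quote(doi_of_paper, safe="")


encoded = encode_identifier("10.1000/example.doi")  # made-up DOI
print(encoded)
```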

static write_or_append_to_csv(df_transposed, csv_path='europe_pmc.csv')

Writes the pandas dataframe to the given csv file

Parameters
  • df_transposed (pandas dataframe) – dataframe to save

  • csv_path (str, optional) – path to csv file, defaults to “europe_pmc.csv”

writexml(destination_url, xml_content)

writes xml content to given destination_url

Parameters
  • destination_url (string) – path to dump xml content

  • xml_content (byte string) – xml content