download_tools module

class pygetpapers.download_tools.DownloadTools(api=None)

Bases: object

Generic tools for retrieving literature. Several of them are called by each repository.

check_if_content_is_zip(request_handler)

Checks if content in request object is a zip

Parameters: request_handler (request object) – request object for the given zip
Returns: whether the content is a zip file
Return type: bool
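
A minimal sketch of this kind of check, using only the standard library. It is not the pygetpapers implementation: the helper name looks_like_zip is hypothetical, and it takes the raw response bytes rather than a request object.

```python
import io
import zipfile


def looks_like_zip(content: bytes) -> bool:
    """Return True if the byte string can be recognised as a zip archive."""
    # is_zipfile accepts a file-like object, so wrap the bytes in BytesIO.
    return zipfile.is_zipfile(io.BytesIO(content))


# Build a tiny archive in memory to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello")

zip_detected = looks_like_zip(buffer.getvalue())
html_detected = looks_like_zip(b"<html>not a zip</html>")
```

With a request object, the same check would be applied to its content attribute.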

static check_or_make_directory(directory_url)

Makes the directory if it doesn’t already exist

Parameters: directory_url (string) – path to directory
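
The behaviour described above can be sketched with os.makedirs. The helper name ensure_directory and the example path are hypothetical; the library's own code may differ.

```python
import os
import tempfile


def ensure_directory(directory_path: str) -> None:
    """Create the directory (and any missing parents) if it is absent."""
    os.makedirs(directory_path, exist_ok=True)


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "corpus", "paper_1")
    ensure_directory(target)
    ensure_directory(target)  # calling again on an existing directory is harmless
    created = os.path.isdir(target)
```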

static dumps_json_to_given_path(path, json_dict, filemode='w')

Dumps json dict to the given path

Parameters

path (string) – path to dump dict
json_dict (dictionary) – json dictionary
filemode (string, optional) – file mode, defaults to “w”
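
A sketch of what such a dump amounts to, assuming the standard json module is used. The helper name dump_json, the indent setting and the sample dictionary are illustrative assumptions, not taken from the library.

```python
import json
import os
import tempfile


def dump_json(path: str, json_dict: dict, filemode: str = "w") -> None:
    """Serialise json_dict to path using the given file mode."""
    with open(path, filemode, encoding="utf-8") as handle:
        json.dump(json_dict, handle, indent=2)


with tempfile.TemporaryDirectory() as base:
    out_path = os.path.join(base, "results.json")
    dump_json(out_path, {"total_hits": 2})
    # Read the file back to confirm the round trip.
    with open(out_path, encoding="utf-8") as handle:
        round_trip = json.load(handle)
```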

extract_zip_files(byte_content_to_extract_from, destination_url)

Extracts zip file to destination_url

Parameters

byte_content_to_extract_from (bytes) – byte content to extract from
destination_url (string) – path to save the extracted zip files to
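
Extracting an archive that is held in memory can be sketched as follows. This is a standard-library illustration under the assumption that the bytes form a valid zip; extract_zip_bytes and the archive member name are hypothetical.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(byte_content: bytes, destination: str) -> list:
    """Extract an in-memory zip archive to destination and list its members."""
    with zipfile.ZipFile(io.BytesIO(byte_content)) as archive:
        archive.extractall(destination)
        return archive.namelist()


# Create a small archive in memory to stand in for a downloaded response.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("supp/table1.csv", "a,b\n1,2\n")

with tempfile.TemporaryDirectory() as base:
    names = extract_zip_bytes(buffer.getvalue(), base)
    extracted_ok = os.path.isfile(os.path.join(base, "supp", "table1.csv"))
```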

get_metadata_results_file()

Gets the path of the metadata file (e.g. eupmc-results.json) from the current working directory

Returns: path of the master metadata file
Return type: string

get_parent_directory(path)

Returns path of the parent directory for given path

Parameters: path (string) – path of the file
Returns: path of the parent directory
Return type: string
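
One way to obtain a parent directory is pathlib, sketched here. The helper name parent_directory and the example path are assumptions for illustration.

```python
from pathlib import Path


def parent_directory(path: str) -> str:
    """Return the directory that contains the given path."""
    return str(Path(path).parent)


parent = parent_directory("corpus/paper_1/fulltext.xml")
```

The returned string uses the separator of the operating system it runs on.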

get_request_endpoint_for_citations(identifier, source)

Gets endpoint to get citations from the configuration file

Parameters

identifier (string) – unique identifier present in the url for the particular paper
source (string) – which repository to get the citations from

Returns

request_handler.content

Return type

bytes

get_request_endpoint_for_references(identifier, source)

Gets endpoint to get references from the configuration file

Parameters

identifier (string) – unique identifier present in the url for the particular paper
source (string) – which repository to get the references from

Returns

request_handler.content

Return type

bytes

get_request_endpoint_for_xml(identifier)

Gets endpoint to full text xml from the configuration file

Parameters: identifier (string) – unique identifier present in the url for the particular paper
Returns: request_handler.content
Return type: bytes

static get_version()

Gets version from the configuration file

Returns: version of pygetpapers as described in the configuration file
Return type: string

gets_result_dict_for_query(headers, data)

Queries query_url provided in configuration file for the given headers and payload and returns result in the form of a python dictionary

Parameters

headers (dict) – headers given to the request
data (dict) – payload given to the request

Returns

result in the form of a python dictionary

Return type

dictionary

getsupplementaryfiles(identifier, path_to_save, from_ftp_end_point=False)

Retrieves supplementary files for the given paper (according to identifier) and saves to path_to_save

Parameters

identifier (string) – unique identifier present in the url for the particular paper
path_to_save (string) – path to save the supplementary files to
from_ftp_end_point (bool, optional) – to get the results from eupmc ftp endpoint

handle_creation_of_csv_html_xml(makecsv, makehtml, makexml, metadata_dictionary, name)

Writes csv, html, xml for given conditions

Parameters

makecsv (bool) – whether to get csv
makehtml (bool) – whether to get html
makexml (bool) – whether to get xml
metadata_dictionary (dict) – dictionary to write the content for
name (string) – name of the file to save

make_citations(source, citationurl, identifier)

Retrieves the URL for the citations of the given paper identifier, gets the xml, and writes it to citationurl

Parameters

source (string) – which repository to get the citations from
citationurl (string) – path to save the citations to
identifier (string) – unique identifier present in the url for the particular paper

make_csv_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes csv content for the given dictionary to disk

Parameters

metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper

make_html_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes html content for the given dictionary to disk

Parameters

metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper

make_html_from_dataframe(dataframe, path_to_save)

Makes html page from the given pandas dataframe

Parameters

dataframe (pandas dataframe) – pandas dataframe to convert to html
path_to_save (string) – path to save the dataframe to

make_references(paperid, identifier, path_to_save)

Writes references for the given paperid from source to reference url

Parameters

identifier (string) – identifier for the paper
source (string) – source to get references from
path_to_save (string) – path to store the references

make_xml_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)

Writes xml content for the given dictionary to disk

Parameters

metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper

parse_request_handler(request_handler)

post_query(url, data=None, headers=None)

Queries the given url

Parameters

url (string) – url to query
data (dict, optional) – payload given to the request, defaults to None
headers (dict, optional) – headers given to the request, defaults to None

Returns

result in the form of a python dictionary

Return type

dictionary
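
The following sketch shows how a url, payload and headers combine into a request, assuming from the method name that a POST request is issued. It only builds the request with the standard library and sends nothing. The helper build_post_request, the example url and the payload keys are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, data=None, headers=None):
    """Assemble (but do not send) a form-encoded POST request."""
    body = urlencode(data or {}).encode("utf-8")
    return Request(url, data=body, headers=headers or {}, method="POST")


request = build_post_request(
    "https://example.org/search",  # placeholder url, not a real endpoint
    data={"query": "lantana", "pageSize": 10},  # hypothetical payload keys
    headers={"Content-type": "application/x-www-form-urlencoded"},
)
method = request.get_method()
body = request.data
```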

static queries_the_url_and_writes_response_to_destination(url, destination)

Queries the url and writes the response to destination

Parameters

url (string) – url to query
destination (string) – destination to save response to

static readjsondata(path)

Reads json from path and returns a python dictionary

static removing_added_attributes_from_dictionary(resultant_dict)

pygetpapers adds some attributes, such as “pdfdownloaded”, to track the progress of downloads for a particular corpus. These attributes should not appear when data is exported to a csv file. This function makes a copy of the given dictionary, removes the added attributes from the dictionaries inside it, and returns the new dictionary.

Parameters: resultant_dict (dictionary) – given parent dictionary
Returns: dictionary with additional attributes removed from the child dictionaries
Return type: dictionary
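
The copy-then-strip behaviour can be sketched as below. Only “pdfdownloaded” is named in the description; the tuple ADDED_ATTRIBUTES, the helper name and the sample corpus are assumptions, and the real list of tracked attributes may be longer.

```python
import copy

# Hypothetical list of tracking keys; only "pdfdownloaded" comes from the docs.
ADDED_ATTRIBUTES = ("pdfdownloaded",)


def strip_added_attributes(resultant_dict: dict, added=ADDED_ATTRIBUTES) -> dict:
    """Return a deep copy with the tracking keys removed from each child dict."""
    cleaned = copy.deepcopy(resultant_dict)
    for paper in cleaned.values():
        for key in added:
            paper.pop(key, None)  # ignore papers that lack the key
    return cleaned


corpus = {"paper_1": {"title": "Example", "pdfdownloaded": True}}
exported = strip_added_attributes(corpus)
```

Because the copy is deep, the original dictionary keeps its tracking attributes and can still be used to resume downloads.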

set_up_config_variables(config, api)

Sets class variables by reading the configuration file for the provided api

Parameters

config (configparser object) – configparser object for the configuration file
api (string) – Name of api as described in the configuration file

setup_config_file(config_ini)

Reads config_ini file and returns configparser object

Parameters: config_ini (string) – path of configuration file
Returns: configparser object for the configuration file
Return type: configparser object
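
Reading an ini file into a configparser object, and then looking up values for one api section, might look like this sketch. The section name example_api, the url and the helper load_config are hypothetical; the actual layout of the pygetpapers configuration file is not shown here.

```python
import configparser
import os
import tempfile


def load_config(config_ini: str) -> configparser.ConfigParser:
    """Read the ini file at config_ini and return the parser object."""
    config = configparser.ConfigParser()
    config.read(config_ini)
    return config


# Hypothetical file contents, used only to exercise the function.
sample = "[example_api]\nquery_url = https://example.org/search\n"

with tempfile.TemporaryDirectory() as base:
    ini_path = os.path.join(base, "config.ini")
    with open(ini_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    config = load_config(ini_path)
    # Per-api settings are then looked up by section name.
    query_url = config.get("example_api", "query_url")
```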

static url_encode_id(doi_of_paper)

Encodes the doi of a paper into a name that can be saved as a file

Parameters: doi_of_paper (string) – doi
Returns: url encoded doi
Return type: string
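
A doi contains a slash, which cannot appear in a file name. Percent-encoding is one way to make it safe, sketched here with urllib.parse.quote. The helper name and the placeholder doi are assumptions, and the library may encode differently.

```python
from urllib.parse import quote, unquote


def encode_identifier(doi_of_paper: str) -> str:
    """Percent-encode every reserved character, including the slash."""
    return quote(doi_of_paper, safe="")


# Placeholder doi, not a real paper.
encoded = encode_identifier("10.1000/example.doi")
decoded = unquote(encoded)
```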

static write_or_append_to_csv(df_transposed, csv_path='europe_pmc.csv')

Writes pandas dataframe to the given csv file

Parameters

df_transposed (pandas dataframe) – dataframe to save
csv_path (str, optional) – path to csv file, defaults to “europe_pmc.csv”
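
One plausible reading of "write or append" is: create the file with a header if it is missing, otherwise add rows to it. The sketch below illustrates that idea with the standard csv module instead of pandas, so it is a stand-in technique rather than the library's code. The helper name and sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_or_append_rows(rows, fieldnames, csv_path):
    """Append rows to csv_path, writing the header only for a new file."""
    file_exists = os.path.isfile(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as base:
    csv_path = os.path.join(base, "europe_pmc.csv")
    columns = ["id", "title"]
    write_or_append_rows([{"id": "paper_1", "title": "First"}], columns, csv_path)
    write_or_append_rows([{"id": "paper_2", "title": "Second"}], columns, csv_path)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        saved_rows = list(csv.reader(handle))
```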

writexml(destination_url, xml_content)

Writes xml content to the given destination_url

Parameters

destination_url (string) – path to dump xml content
xml_content (byte string) – xml content