download_tools module
- class pygetpapers.download_tools.DownloadTools(api=None)
Bases: object
Generic tools for retrieving literature. Several of them are called by each repository.
- check_if_content_is_zip(request_handler)
Checks if content in request object is a zip
- Parameters
request_handler (request object) – request object for the given zip
- Returns
True if the content is a zip file
- Return type
bool
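A check of this kind can be sketched with the standard library alone. The helper name `looks_like_zip` is hypothetical, and it takes raw bytes rather than a request object; this is an illustration, not necessarily how pygetpapers performs the check.

```python
import io
import zipfile


def looks_like_zip(content: bytes) -> bool:
    """Return True if the bytes can be read as a zip archive (hypothetical helper)."""
    return zipfile.is_zipfile(io.BytesIO(content))


# Build a tiny in-memory archive to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "hello")

zip_result = looks_like_zip(buffer.getvalue())
text_result = looks_like_zip(b"plain text, not an archive")
```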
- static check_or_make_directory(directory_url)
Makes the directory if it doesn’t already exist
- Parameters
directory_url (string) – path to directory
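The "create only if missing" behaviour described here might look like the following sketch. The function name `ensure_directory` and the directory names are placeholders chosen for illustration, not taken from pygetpapers.

```python
import os
import tempfile


def ensure_directory(directory_path: str) -> None:
    """Create the directory (and any missing parents) unless it already exists."""
    os.makedirs(directory_path, exist_ok=True)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "corpus", "paper_1")  # placeholder names
    ensure_directory(target)
    ensure_directory(target)  # calling again is harmless
    directory_created = os.path.isdir(target)
```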
- static dumps_json_to_given_path(path, json_dict, filemode='w')
Dumps json dict to given path
- Parameters
path (string) – path to dump dict
json_dict (dictionary) – json dictionary
filemode (string, optional) – file mode, defaults to “w”
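A minimal sketch of dumping a dictionary with a configurable file mode, and reading it back to confirm the round trip, is shown below. The helper names `dump_json` and `read_json` and the sample data are assumptions made for this example.

```python
import json
import os
import tempfile


def dump_json(path: str, json_dict: dict, filemode: str = "w") -> None:
    """Write json_dict to path using the given file mode (hypothetical helper)."""
    with open(path, filemode, encoding="utf-8") as handle:
        json.dump(json_dict, handle)


def read_json(path: str) -> dict:
    """Read a JSON file back into a python dictionary (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "results.json")
    dump_json(target, {"paper_1": {"title": "Example"}})  # placeholder data
    round_tripped = read_json(target)
```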
- extract_zip_files(byte_content_to_extract_from, destination_url)
Extracts zip file to destination_url
- Parameters
byte_content_to_extract_from (bytes) – byte content to extract from
destination_url (string) – path to save the extracted zip files to
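Extracting an archive that arrives as bytes can be sketched as follows. The function name `extract_zip_bytes` and the archive contents are hypothetical; the sketch only assumes that the byte content is a valid zip file.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(byte_content: bytes, destination: str) -> list:
    """Extract an in-memory zip archive to destination and return the member names."""
    with zipfile.ZipFile(io.BytesIO(byte_content)) as archive:
        archive.extractall(destination)
        return archive.namelist()


# Build a small archive in memory (placeholder content).
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    archive.writestr("supplementary/table.csv", "a,b\n1,2\n")

with tempfile.TemporaryDirectory() as workdir:
    extracted_names = extract_zip_bytes(source.getvalue(), workdir)
    extracted_file = os.path.join(workdir, "supplementary", "table.csv")
    with open(extracted_file, encoding="utf-8") as handle:
        extracted_text = handle.read()
```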
- get_metadata_results_file()
Gets the path of the metadata file (e.g. eupmc-results.json) in the current working directory
- Returns
path of the master metadata file
- Return type
string
- get_parent_directory(path)
Returns path of the parent directory for given path
- Parameters
path (string) – path of the file
- Returns
path of the parent directory
- Return type
string
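For illustration, the parent of a path can be computed with `pathlib`. The helper name `parent_directory` and the sample path are assumptions; `PurePosixPath` is used here only so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def parent_directory(path: str) -> str:
    """Return the parent directory of a path as a string (hypothetical helper)."""
    return str(PurePosixPath(path).parent)


parent = parent_directory("corpus/paper_1/fulltext.xml")  # placeholder path
```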
- get_request_endpoint_for_citations(identifier, source)
Gets endpoint to get citations from the configuration file
- Parameters
identifier (string) – unique identifier present in the url for the particular paper
source (string) – which repository to get the citations from
- Returns
request_handler.content
- Return type
bytes
- get_request_endpoint_for_references(identifier, source)
Gets endpoint to get references from the configuration file
- Parameters
identifier (string) – unique identifier present in the url for the particular paper
source (string) – which repository to get the references from
- Returns
request_handler.content
- Return type
bytes
- get_request_endpoint_for_xml(identifier)
Gets endpoint to full text xml from the configuration file
- Parameters
identifier (string) – unique identifier present in the url for the particular paper
- Returns
request_handler.content
- Return type
bytes
- static get_version()
Gets version from the configuration file
- Returns
version of pygetpapers as described in the configuration file
- Return type
string
- gets_result_dict_for_query(headers, data)
Queries query_url provided in configuration file for the given headers and payload and returns result in the form of a python dictionary
- Parameters
headers (dict) – headers given to the request
data (dict) – payload given to the request
- Returns
result in the form of a python dictionary
- Return type
dictionary
- getsupplementaryfiles(identifier, path_to_save, from_ftp_end_point=False)
Retrieves supplementary files for the given paper (according to identifier) and saves to path_to_save
- Parameters
identifier (string) – unique identifier present in the url for the particular paper
path_to_save (string) – path to save the supplementary files to
from_ftp_end_point (bool, optional) – to get the results from eupmc ftp endpoint
- handle_creation_of_csv_html_xml(makecsv, makehtml, makexml, metadata_dictionary, name)
Writes csv, html, xml for given conditions
- Parameters
makecsv (bool) – whether to get csv
makehtml (bool) – whether to get html
makexml (bool) – whether to get xml
metadata_dictionary (dict) – dictionary to write the content for
name (string) – name of the file to save
- make_citations(source, citationurl, identifier)
Retrieves URL for the citations for the given paperid, gets the xml, writes to citationurl
- Parameters
source (string) – which repository to get the citations from
citationurl (string) – path to save the citations to
identifier (string) – unique identifier present in the url for the particular paper
- make_csv_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)
Writes csv content for the given dictionary to disk
- Parameters
metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper
- make_html_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)
Writes html content for the given dictionary to disk
- Parameters
metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper
- make_html_from_dataframe(dataframe, path_to_save)
Makes html page from the given pandas dataframe
- Parameters
dataframe (pandas dataframe) – pandas dataframe to convert to html
path_to_save (string) – path to save the dataframe to
- make_references(paperid, identifier, path_to_save)
Writes references for the given paperid from source to reference url
- Parameters
identifier (string) – identifier for the paper
source (string) – source to get references from
path_to_save (string) – path to store the references
- make_xml_for_dict(metadata_dictionary, name_main_result_file, name_result_file_for_paper)
Writes xml content for the given dictionary to disk
- Parameters
metadata_dictionary (dict) – dictionary to write the content for
name_main_result_file (string) – name of the main result file (eg. eupmc-results.xml)
name_result_file_for_paper (string) – name of the result file for a paper
- parse_request_handler(request_handler)
- post_query(url, data=None, headers=None)
Queries url
- Parameters
url (string) – url to query
data (dict, optional) – payload given to the request
headers (dict, optional) – headers given to the request
- Returns
result in the form of a python dictionary
- Return type
dictionary
- static queries_the_url_and_writes_response_to_destination(url, destination)
Queries the url and writes the response to destination
- Parameters
url (string) – url to query
destination (string) – destination to save response to
- static readjsondata(path)
Reads json from path and returns a python dictionary
- static removing_added_attributes_from_dictionary(resultant_dict)
pygetpapers adds some attributes like “pdfdownloaded” to track the progress of downloads for a particular corpus. When exporting data to a csv file, these attributes should not appear. This function makes a copy of the given dictionary, removes the added attributes from the dictionaries inside it, and returns the new dictionary.
- Parameters
resultant_dict (dictionary) – given parent dictionary
- Returns
dictionary with additional attributes removed from the child dictionaries
- Return type
dictionary
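The copy-then-strip behaviour described above can be sketched as follows. Only “pdfdownloaded” is named in the description, so the key tuple is a placeholder; the helper name `strip_tracking_keys` and the sample corpus are also assumptions.

```python
import copy

# Assumption: only "pdfdownloaded" is named in the description above;
# the real set of tracking attributes may be larger.
TRACKING_KEYS = ("pdfdownloaded",)


def strip_tracking_keys(resultant_dict: dict) -> dict:
    """Return a deep copy with tracking keys removed from each child dictionary."""
    cleaned = copy.deepcopy(resultant_dict)
    for paper in cleaned.values():
        for key in TRACKING_KEYS:
            paper.pop(key, None)  # ignore papers that lack the key
    return cleaned


corpus = {"paper_1": {"title": "Example", "pdfdownloaded": True}}
cleaned_corpus = strip_tracking_keys(corpus)
```

Because the copy is deep, the original dictionary keeps its tracking attributes and can still be used to resume downloads.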
- set_up_config_variables(config, api)
Sets class variables by reading the configuration file for the provided api
- Parameters
config (configparser object) – configparser object for the configuration file
api (string) – Name of api as described in the configuration file
- setup_config_file(config_ini)
Reads config_ini file and returns configparser object
- Parameters
config_ini (string) – path of configuration file
- Returns
configparser object for the configuration file
- Return type
configparser object
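Reading an ini file into a configparser object and looking up a value for one api could be sketched like this. The section name `example_api`, the URL, and the helper name `load_config` are hypothetical; the real configuration file may be laid out differently.

```python
import configparser
import os
import tempfile


def load_config(config_ini: str) -> configparser.ConfigParser:
    """Read an ini file and return the populated ConfigParser (hypothetical helper)."""
    config = configparser.ConfigParser()
    config.read(config_ini)
    return config


with tempfile.TemporaryDirectory() as workdir:
    ini_path = os.path.join(workdir, "config.ini")
    with open(ini_path, "w", encoding="utf-8") as handle:
        # Placeholder section and value, invented for this example.
        handle.write("[example_api]\nquery_url = https://example.org/search\n")
    config = load_config(ini_path)
    query_url = config.get("example_api", "query_url")
```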
- static url_encode_id(doi_of_paper)
Encodes the DOI of a paper into a name that can be saved as a file
- Parameters
doi_of_paper (string) – doi
- Returns
url encoded doi
- Return type
string
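One way to make a DOI safe for use in a file name is percent-encoding with `urllib.parse.quote`. This is a sketch under that assumption; pygetpapers may use a different encoding scheme, and the helper name and sample DOI are placeholders.

```python
from urllib.parse import quote


def encode_doi(doi_of_paper: str) -> str:
    """Percent-encode a DOI so that it contains no path separators."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    return quote(doi_of_paper, safe="")


encoded = encode_doi("10.1000/xyz123")  # placeholder DOI
```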
- static write_or_append_to_csv(df_transposed, csv_path='europe_pmc.csv')
Writes pandas dataframe to given csv file
- Parameters
df_transposed (pandas dataframe) – dataframe to save
csv_path (str, optional) – path to csv file, defaults to “europe_pmc.csv”
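The write-or-append pattern can be illustrated without pandas. The sketch below swaps in the standard `csv` module for the dataframe, writes the header only when the file is first created, and appends rows on later calls. The function name, field names, and rows are all assumptions.

```python
import csv
import os
import tempfile


def write_or_append_rows(rows, fieldnames, csv_path):
    """Append rows to csv_path, writing the header only if the file is new."""
    file_exists = os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as workdir:
    csv_path = os.path.join(workdir, "papers.csv")
    fields = ["id", "title"]  # placeholder columns
    write_or_append_rows([{"id": "paper_1", "title": "First"}], fields, csv_path)
    write_or_append_rows([{"id": "paper_2", "title": "Second"}], fields, csv_path)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        written_rows = list(csv.reader(handle))
```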
- writexml(destination_url, xml_content)
Writes xml content to given destination_url
- Parameters
destination_url (string) – path to dump xml content
xml_content (byte string) – xml content