Welcome to SciBib’s documentation!

The SciBib Package

This package retrieves scientific bibliographic data from an author’s ORCID iD. Its main goals are to collect BibTeX entries for the author’s works and to collect abstracts for these works.

BibTeX collection works provided the author has an ORCID record and the sought article is referenced there; see data_query.OrcidWork.bibtex.

Abstract retrieval can be performed using arXiv’s API if the article is on arXiv and the author has associated their ORCID iD with their arXiv account; see data_query.AuthorData.work_summary_from_arxiv.

Another option is to get a (sometimes more up-to-date) abstract by scraping the journal’s website. In this case, legal or technical obstacles might arise. A tool to try this technique is nevertheless provided, namely the scrape_abstract method of the OrcidWork class.

Another useful feature is using the DOI of a work to build a URL that leads to the article on the publisher’s website. The DOI is available as OrcidWork.doi.
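As a minimal sketch of this idea, a DOI can be turned into a link by appending it to the public resolver at https://doi.org/, which redirects to the publisher’s page. The helper name doi_to_url is hypothetical and not part of SciBib.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a bare DOI such as '10.1000/182'.

    Hypothetical helper for illustration; not part of SciBib.
    """
    doi = doi.strip()
    # Accept DOIs that are written with a leading 'doi:' prefix.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode characters that are unsafe in a URL path, keeping '/'.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1000/182"))
```

With SciBib itself, the argument would presumably be the value of OrcidWork.doi.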

Other data sources and other outputs could be added in the future, depending on users’ suggestions and pull requests.

The data_query module

This module defines two classes for parsing author data from ORCID and arXiv: AuthorData and OrcidWork.

class scibib.data_query.OrcidWork

Methods:

`__init__`(work_data)	Instantiate single work object.
`scrape_abstract`()	Scrape the work's summary from the publisher/journal's site.

Attributes:

`path`	Orcid path to the data.
`title`	Work title.
`doi`	The Work's doi.
`bibtex`	Return the bibtex entry for self from source.

__init__(work_data)

Instantiate single work object.

Parameters: work_data (nested lists/dictionaries) – the part of the loaded JSON data corresponding to a single work, as obtained from Orcid’s API.
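To give an idea of what such nested data can look like, here is a sketch that reads a title and a DOI from a single work entry. The sample dictionary and the field names (title, external-ids, external-id-type) follow the layout of ORCID public API work summaries as an assumption; the helpers work_title and work_doi are hypothetical and do not reflect how OrcidWork is implemented.

```python
from typing import Optional

# Hypothetical sample imitating one work summary from the ORCID public API.
# The field names are an assumption about the API's JSON layout.
sample_work = {
    "path": "/0000-0002-1825-0097/work/123456",
    "title": {"title": {"value": "An example article"}},
    "external-ids": {
        "external-id": [
            {"external-id-type": "eid", "external-id-value": "2-s2.0-0000000000"},
            {"external-id-type": "doi", "external-id-value": "10.1000/182"},
        ]
    },
}


def work_title(work_data: dict) -> Optional[str]:
    """Return the title string, or None if the entry has no title."""
    title = (work_data.get("title") or {}).get("title") or {}
    return title.get("value")


def work_doi(work_data: dict) -> Optional[str]:
    """Return the first external identifier of type 'doi', or None."""
    ids = (work_data.get("external-ids") or {}).get("external-id") or []
    for ext_id in ids:
        if ext_id.get("external-id-type") == "doi":
            return ext_id.get("external-id-value")
    return None


print(work_title(sample_work), work_doi(sample_work))
```

The `or {}` fallbacks guard against fields that are present but null, which can happen in JSON records.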

property path: Orcid path to the data.

property title: Work title.

property doi

The Work’s doi.

Returns: the doi.
Return type: str

property bibtex

Return the bibtex entry for self from source.

Parameters:

source (str, optional) – Equals ‘doi’. Defaults to ‘doi’. Other sources might be available in the future.
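The ‘doi’ source suggests that the entry is resolved from the work’s DOI. One common way to do this, which is not necessarily what SciBib does internally, is DOI content negotiation: the resolver at https://doi.org/ is asked for the application/x-bibtex media type. The sketch below only builds the request; the function names bibtex_request and fetch_bibtex are hypothetical.

```python
import urllib.request


def bibtex_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a request asking the DOI resolver for BibTeX."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/x-bibtex"},
    )


def fetch_bibtex(doi: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body (needs network access)."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request involves no network traffic.
req = bibtex_request("10.1000/182")
print(req.full_url, req.get_header("Accept"))
```

Whether a given DOI actually returns BibTeX depends on the registration agency behind it, so callers of fetch_bibtex should be prepared for HTTP errors.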

scrape_abstract(): Scrape the work’s summary from the publisher/journal’s site. Beware that you might need authorization from the publisher/journal to use this functionality.

class scibib.data_query.AuthorData

A class to parse Orcid author entries.

Methods:

`__init__`(orcid_id)	Instantiator
`work_summary_from_arxiv`(orcid_work)	Match work with an arxiv entry to provide a summary.

Attributes:

`orcid_record`	The raw orcid record as a parsed json.
`arxiv_record`	The raw arxiv record as an atom feed.
`articles`	list of article entries in the author's Orcid entry.
`orcid_id_is_on_arxiv`	Check if the author associated his/her Arxiv with Orcid.
`arxiv_summaries_dic`	Return dict that maps arxiv_entries -> abstracts for the author.

__init__(orcid_id)

Instantiator

Parameters: orcid_id (str) – The author’s orcid id

property orcid_record

The raw orcid record as a parsed json.

Returns: The raw orcid record as a parsed json (using json.load).
Return type: list

property arxiv_record: The raw arxiv record as an atom feed.

property articles

List of article entries in the author’s Orcid entry.

Returns: list of article entries, formatted as OrcidWork instances.
Return type: list[OrcidWork]

property orcid_id_is_on_arxiv

Check whether the author has associated their arXiv account with their Orcid id.

Returns: True if the Orcid id is associated with an arXiv account, False otherwise.
Return type: bool

property arxiv_summaries_dic: Return dict that maps arxiv_entries -> abstracts for the author.
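To illustrate how such a mapping could be built from an Atom feed, here is a sketch using only the standard library. It assumes that each entry carries title and summary elements in the Atom namespace and that titles serve as keys; the actual keys used by arxiv_summaries_dic may differ. The function summaries_by_title and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, heavily trimmed feed for illustration.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>First example paper</title>
    <summary>  Abstract of the first
    paper. </summary>
  </entry>
  <entry>
    <title>Second example paper</title>
    <summary>Abstract of the second paper.</summary>
  </entry>
</feed>
"""


def summaries_by_title(atom_xml: str) -> dict:
    """Map each entry title to its summary, with whitespace collapsed."""
    root = ET.fromstring(atom_xml)
    result = {}
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        # Feeds often wrap long text over several lines; normalise the spacing.
        result[" ".join(title.split())] = " ".join(summary.split())
    return result


print(summaries_by_title(sample_feed))
```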

work_summary_from_arxiv(orcid_work)

Match work with an arxiv entry to provide a summary.

Parameters: orcid_work (OrcidWork) – the work that needs a summary.
Returns: The guessed summary.
Return type: str
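The summary is described as guessed, which suggests approximate matching between the Orcid entry and the arXiv entries. The matching strategy SciBib actually uses is not documented here; the following sketch shows one possible approach, fuzzy title matching with difflib, under the assumption that abstracts are stored in a dict keyed by title. The function guess_summary and the sample data are hypothetical.

```python
import difflib
from typing import Optional


def guess_summary(work_title: str, summaries: dict, cutoff: float = 0.8) -> Optional[str]:
    """Return the abstract whose title is closest to work_title, if close enough."""

    def norm(text: str) -> str:
        # Ignore case and spacing differences between the two data sources.
        return " ".join(text.lower().split())

    normalised = {norm(title): abstract for title, abstract in summaries.items()}
    matches = difflib.get_close_matches(
        norm(work_title), list(normalised), n=1, cutoff=cutoff
    )
    return normalised[matches[0]] if matches else None


# Hypothetical data for illustration.
summaries = {
    "Quantum widgets in two dimensions": "We study widgets.",
    "A survey of gadgets": "We survey gadgets.",
}
print(guess_summary("Quantum Widgets in Two Dimensions.", summaries))
```

The cutoff trades false matches against missed ones: a high value avoids attaching the wrong abstract, at the cost of returning None more often.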

The abstract_collector module

This module defines the main_paragraph function.

Functions:

`main_paragraph`(url)	From a web page, return the longest paragraph.

scibib.abstract_collector.main_paragraph(url)

From a web page, return the longest paragraph.

Parameters: url (str) – the URL of the web page to process.
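The longest-paragraph heuristic can be sketched with the standard library alone. The example below works on an HTML string rather than a URL so that it runs offline, and it is not SciBib’s implementation; the names ParagraphCollector and longest_paragraph are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # list of text chunks while inside a <p>, else None

    def _flush(self):
        if self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def longest_paragraph(html: str) -> str:
    """Return the longest <p> text in html, or '' if there is none."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return max(parser.paragraphs, key=len, default="")


sample_html = """
<html><body>
  <p>Short note.</p>
  <p>This is the <em>abstract</em> of the article, which is
     the longest paragraph on the page.</p>
  <p>Footer</p>
</body></html>
"""
print(longest_paragraph(sample_html))
```

The heuristic relies on the abstract being the longest paragraph of the page, which holds for many article landing pages but can fail when a cookie notice or licence text is longer.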