
Hello! This is my blog, powered by Known. I post articles and links about coding, FOSS and other topics, mostly in French, or in English when I could not find anything similar already written.


Comparison of tools to fetch references for scientific papers

3 min read

EDIT: Finally, the build problem with CERMINE was just a temporary issue, and they distribute standalone JAR files, which makes it very easy to ship it with another program. See this GitHub issue for more info. You might also be interested in the CERMINE paper, which presents some comparisons similar to the ones below.

 

Recently, I tried to aggregate in a single place various pieces of code I had written to handle scientific papers. One feature I was missing, and wanted to add, was the ability to automatically fetch the references of a given paper. For arXiv papers, I had a simple solution using the LaTeX sources, but I wanted something more universal, taking a simple PDF file as input (thanks John for the suggestion, and Al for the tips on existing software solutions).

I compared three existing pieces of software that extract references from a PDF file:

  • pdfextract from Crossref, very easy to use, written in Ruby.
  • Grobid, more advanced (using machine learning models), written in Java, but quite easy to use too.
  • Cermine, using the same approach as Grobid, but I could not get it to build on my computer. I used their REST service instead.

To compare them, I asked Antonin to build, from Dissemin, a list of the most important journals and to take five papers from each of them. This gives us a JSON file containing around 500 papers.

I downloaded some of the articles, to get a (hopefully) representative set of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran pdfextract, Grobid and Cermine on each of them and compared the results.

The raw results for each paper are available here, and I generated a single-page comparison to ease the visual diff between the three results, available here (note that this webpage is very heavy, around 16 MB).

Briefly comparing the results, the machine-learning-based tools (Cermine and Grobid) seem to give far better results than the simpler approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine returns a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.

EDIT:

  • I also found ParsCit, which may be of interest. However, you first need to extract the text from your PDF file yourself. I have not yet tested it in depth.

  • This tweet tends to confirm the results I had, that Grobid is the best one.

  • If it can be useful, here is a small web service written in Python that lets a user upload a paper, parses its citations and tries to assess the open-access availability of the cited papers (see the sketch below). It uses CERMINE, as it was the easiest way to go, especially since it offers a web API, which allows me to distribute a simple working script without any additional requirements.
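For reference, talking to that web API boils down to uploading the PDF in a single HTTP request. Here is a minimal sketch in Python, assuming the extraction endpoint still lives at http://cermine.ceon.pl/extract.do and takes the raw PDF bytes as the request body (check the CERMINE documentation if the service has moved or changed):

```python
# Minimal sketch of calling the hosted CERMINE web service on a PDF file.
# The endpoint URL and the Content-Type header follow the CERMINE README;
# adjust them if the service has changed since this post was written.
import requests

CERMINE_URL = "http://cermine.ceon.pl/extract.do"


def extract_metadata(pdf_path):
    """Send a PDF to CERMINE and return the XML (metadata + references) it produces."""
    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            CERMINE_URL,
            data=pdf.read(),
            headers={"Content-Type": "application/binary"},
        )
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    print(extract_metadata("paper.pdf"))
```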


 

Let's add some metadata on arXiv!

7 min read

This article contains ideas and explanations around this code; I will refer to it throughout the article.

Disclaimer: The above code is here as a proof of concept and to back this article with some code. It is clearly not designed (nor scalable) to run in production. However, the reference_fetcher part was giving good results on the arXiv papers I tested it on.

Nowadays, most published scientific papers are available online, either directly on the publisher's website, or as preprints in Open access repositories. For physics and computer science, depending on the research topic, a large part of them is available on arXiv.org, a major, worldwide Open access repository managed by Cornell. All published papers get a unique (global) identifier, called a DOI, which can be used to identify them and link to them. For instance, if you go to https://dx.doi.org/10.1103%2FPhysRevB.47.7312, you are automatically redirected to the Physical Review B website, on the page of the paper with DOI 10.1103/PhysRevB.47.7312. This is really useful to target a paper and identify it uniquely, in a machine-readable way and in a way that will last. However, very little use seems to be made of this system. This is why I had the idea to put some extra metadata on published papers, using such identifiers.

From now on, I will mainly focus on arXiv, for two main reasons. First, it is Open access, so it is accessible everywhere (not depending on the subscriptions of a particular institution) and reusable; second, arXiv provides sources for most of the papers, which is of great interest as we will see below. arXiv gives a unique identifier to each preprint. The correspondence between DOIs and arXiv identifiers can be established quite easily, as some publishers push DOIs back to arXiv upon publication, and authors manually update the field on arXiv for the remaining publishers.
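As an illustration of that correspondence, here is a minimal sketch using the public arXiv API (http://export.arxiv.org/api/query). The arxiv:doi element is only present when a DOI has actually been pushed back or filled in, so the lookup may return nothing:

```python
# Look up the DOI that arXiv exposes for a preprint, via the public arXiv API
# (an Atom feed). Returns None when no DOI has been associated with the entry.
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query?id_list={id}"
DOI_TAG = "{http://arxiv.org/schemas/atom}doi"


def doi_for_arxiv_id(arxiv_id):
    """Return the DOI associated with an arXiv identifier, or None."""
    with urllib.request.urlopen(ARXIV_API.format(id=arxiv_id)) as response:
        tree = ET.parse(response)
    element = tree.find(".//" + DOI_TAG)
    return element.text if element is not None else None


if __name__ == "__main__":
    print(doi_for_arxiv_id("1506.06690"))
```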

Using services such as Crossref or the publisher's website, it is really easy to get a formatted bibliography (plaintext, BibTeX, …) from a given identifier (e.g. see some codes for DOI or arXiv id for BibTeX output). Then, writing a bibliography should be as easy as keeping track of a list of identifiers!
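For DOIs, for instance, this can be done through DOI content negotiation: asking dx.doi.org for application/x-bibtex directly returns a BibTeX entry. A minimal sketch (just an illustration, not the code linked above):

```python
# Turn a DOI into a BibTeX entry using DOI content negotiation:
# requesting the DOI resolver with an application/x-bibtex Accept header
# returns a formatted BibTeX record instead of redirecting to the landing page.
import requests


def bibtex_from_doi(doi):
    """Fetch a BibTeX entry for a DOI using content negotiation."""
    response = requests.get(
        "https://dx.doi.org/" + doi,
        headers={"Accept": "application/x-bibtex"},
    )
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    print(bibtex_from_doi("10.1103/PhysRevB.47.7312"))
```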

Let's make a graph of citations!

In scientific papers, references are usually given as a plain-text list of the cited papers, at the end of the article. This list follows some rules and formats, but there is a wide variety of formats, and it is often really difficult to parse them automatically (see http://arxiv.org/abs/1506.06690 for an example of reference formatting).

If you want to automatically fetch the references of a given paper (to download them in batch, for instance), you basically have to parse a PDF file, find the references section, and parse each textual item, which is really difficult and error-prone. Some repositories, such as arXiv, offer sources for the published preprints. In this case, one can deal with a LaTeX-formatted bibliography (a thebibliography environment, not a full BibTeX file though), which is a bit better, but still a pain to deal with. When referencing an article, nobody uses DOIs!

The first idea is then to try to automatically fetch references for arXiv preprints and mark them as relationships between articles.

Fortunately, arXiv provides .bbl source files for most of the articles (which are LaTeX-formatted bibliographies). We can then avoid having to parse a PDF file and directly get some structured text, but the bibliography is still plain text, without any machine-readable identifier. Here comes Crossref, which offers a wonderful API to try to fetch a DOI from a plain-text citation (see http://labs.crossref.org/resolving-citations-we-dont-need-no-stinkin-parser/). And it gives surprisingly good results!

This automatic fetching of DOIs for the references of a given arXiv paper is available in this code.
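To give an idea of the Crossref part, here is a rough sketch against the current Crossref REST API (api.crossref.org) and its query.bibliographic parameter; the blog post linked above describes an older endpoint, and a real implementation would also check the match score before trusting the returned DOI:

```python
# Resolve a plain-text citation to a DOI by asking the Crossref REST API for
# its best bibliographic match and returning the DOI of the top result.
import requests

CROSSREF_API = "https://api.crossref.org/works"


def doi_for_citation(citation):
    """Return the DOI of the best Crossref match for a plain-text citation, or None."""
    response = requests.get(
        CROSSREF_API,
        params={"query.bibliographic": citation, "rows": 1},
    )
    response.raise_for_status()
    items = response.json()["message"]["items"]
    return items[0]["DOI"] if items else None


if __name__ == "__main__":
    print(doi_for_citation(
        "N. Regnault and B. A. Bernevig, Fractional Chern Insulator, "
        "Phys. Rev. X 1, 021014 (2011)"
    ))
```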

Then, one can write a simple API accepting POST requests to add papers to a database, fetch the referenced papers, and mark relationships between them. This is how https://github.com/Phyks/arxiv_metadata began.

If you post a paper to it, identified either by its DOI (when a valid associated arXiv id can be found) or directly by its arXiv id, it will add it to the database, resolve its references and record the relationships between this paper and the referenced papers. One can then simply query the graph of "citations", in direct or reverse order, to get all papers cited by a given one, or citing a given one.
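To make the workflow concrete, a client call could look roughly like the sketch below. The route, port and field names are made up for the example and are not the actual API of the repository; see its README for the real endpoints:

```python
# Purely illustrative client for a local instance of such a service. The
# /papers route, the port and the arxiv_id / doi field names are assumptions
# made for this example, not the repository's real API.
import requests

API_URL = "http://localhost:8080/papers"  # hypothetical local instance


def add_paper(arxiv_id=None, doi=None):
    """Ask the service to add a paper and resolve its references."""
    payload = {}
    if arxiv_id:
        payload["arxiv_id"] = arxiv_id
    if doi:
        payload["doi"] = doi
    response = requests.post(API_URL, json=payload)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(add_paper(arxiv_id="1506.06690"))
```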

The only similar service I know of on the web is the one provided by SAO/NASA ADS. See for instance how it deals with the introductory paper. It is quite fantastic, giving both the papers citing this one and the papers cited by it, in a browsable form, but its core is not open source (or I did not find it), and I have no idea how it works in the background. There is no easily accessible API, and it only works in some very specific fields (typically physics).

Let's add even more relations!

Now that we have a base API to add papers and relationships between them to a database, we can imagine going one step further and marking any kind of relation between papers.

For instance, one may find that a given paper would be a relevant reference for another one that does not cite it. We could then collaboratively add extra metadata to scientific papers, such as extra references, which would be useful to everyone.

Such relationships could also be of other kinds: similar_to, introductory_course, etc. This is quite limitless, and the above code can already handle it. :)

Let's go one step further and add tags!

So, by now, we can have uniquely identified papers, with any kind of relationships between them, which we can crowdsource. Let's take some time to look at how arXiv stores papers.

They classify them by "general categories" (e.g. cond-mat, a (very) large category called "Condensed Matter") and subcategories (e.g. cond-mat.quant-gas for "Quantum Gases" under "Condensed Matter"). An RSS feed is offered for each of these categories, and researchers usually follow the subcategories of their research area to keep up to date with newly published articles.
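Following such a feed programmatically is straightforward, for instance with the feedparser library; the sketch below uses the historical export.arxiv.org URL pattern, which may differ from what arXiv serves today:

```python
# Fetch the announcement feed of one arXiv subcategory and list the titles.
# The feed URL pattern is the historical export.arxiv.org one; adjust it if
# arXiv now serves its RSS feeds from a different host.
import feedparser

FEED_URL = "http://export.arxiv.org/rss/cond-mat.quant-gas"


def latest_titles():
    """Return the titles of the papers announced in the category feed."""
    feed = feedparser.parse(FEED_URL)
    return [entry.title for entry in feed.entries]


if __name__ == "__main__":
    for title in latest_titles():
        print(title)
```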

Although some articles are released under multiple categories, most of them have only one category, very often because they do not fit anywhere else, but sometimes because the author did not think the paper could be relevant to another field. Moreover, some researchers work at the edge of two fields, and following everything published in both of them is a very time-consuming task.

The next step is then to collaboratively tag articles. We could have tags as targeted or as general as we want, and everyone could follow the tags they are interested in. Doing it collaboratively also allows someone who finds an article interesting for their own field, even though it was not the author's field, to make it appear in the feeds of their colleagues.

Conclusion

We finally have the tools to mark relations between papers, to annotate them, complete them, and tag them, all of this collaboratively. With DOIs and similar unique identifiers, we can get rid of the painful plain-text citations and references and use easily machine-manageable identifiers, while still getting nicely rendered BibTeX citations automagically.

People are already doing this kind of thing for webpages (identified by their URL) with Reddit or Hacker News and so on; let's do the same for scientific papers! :)

A demo instance should be available at http://arxiv.phyks.me/. It may not be very stable or highly available, though. Note that the Content-Type is that of a JSON API, so your browser may force you to download the response rather than displaying it. The easiest way to browse it is to use cURL, as described in the README.
