Decapoda AToL::ReCite Reference Reformatter

What is this?

This web-based tool will assist you in taking formatted literature references and turning them into records for a reference database.

In practice, what this means is that you can cut-and-paste references from the end of an article or paper into a web form here. This tool will attempt to parse the references into the various fields (authors, title, journal, book title, etc.), allow you to edit that parsing, and then let you download the references formatted for a bibliographic database program.

Note that this is an “expert tool”: you still need to know how to interpret academic references. You cannot expect to just blindly copy & paste stuff in here and get well-parsed reference data. ReCite will assist you in parsing the reference information, but it still requires editing and interpretation.
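To illustrate why hand editing is still needed, here is a minimal sketch of pattern-based reference parsing. It is not the code ReCite runs: the pattern, the field names, and the function `parse_reference` are assumptions made up for this example, and both sample references are invented. The pattern covers a single journal-article layout, so a reference in any other layout simply fails to parse.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for one layout only:
#   Authors Year. Title. Journal Volume: FirstPage-LastPage.
# Illustration only; not the pattern set used by ReCite.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+"              # everything before the year
    r"\(?(?P<year>\d{4})\)?\.\s+"        # four-digit year, optional parentheses
    r"(?P<title>.+?)\.\s+"               # title, up to the first period
    r"(?P<journal>[^.]+?)\s+"            # journal name (no periods allowed)
    r"(?P<volume>\d+)\s*:\s*"            # volume number, then a colon
    r"(?P<pages>\d+\s*[-\u2013]\s*\d+)"  # page range with hyphen or en dash
    r"\.?$"
)


def parse_reference(text: str) -> Optional[Dict[str, str]]:
    """Return the parsed fields, or None if the layout is not recognized."""
    match = REFERENCE_PATTERN.match(text.strip())
    return match.groupdict() if match else None


# Invented journal-article reference in the supported layout.
article = (
    "Smith, J. A. and Jones, B. 1999. A made-up title about crabs. "
    "Journal of Examples 12: 34-56."
)
# Invented book-chapter reference; the pattern does not cover this layout.
chapter = (
    "Smith, J. A. 1999. A made-up chapter. In: Jones, B. (ed.), "
    "An Invented Book. Example Press, Nowhere."
)

print(parse_reference(article))
print(parse_reference(chapter))
```

The first call returns a dictionary of fields, and the second returns `None`, which is the kind of case you would need to sort out by hand on the editing page.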

How do I start?

Paste preformatted reference text into the submission page. When you submit that page, you'll be taken to an editing page where you can clean up the parsing. Nothing is saved while you are editing. When you finish editing and submit the results, you'll be taken to a retrieval page, and the parsed and edited references are saved (temporarily). From there you can fetch your references in a database format, revise your edits on one or more references, clear the references you just processed, or add more references.

What will it do?

Preformatted references can be cut-and-pasted into a web form, parsed, edited, then pulled back for use in a bibliographic database. At this point, the only output format supported is the XML format used by EndNote Version 9 (or, presumably, later versions).
The system is best adapted for scientific, and particularly biological, references.
Non-ASCII characters (accented characters, etc.) should work. The web form sends back Unicode (formatted as UTF-8) to the server, and the software should properly handle those characters. Of course, internationalization issues are notoriously difficult, and we do anticipate some problems here.
Multiple references can be entered on the form at once (they will be parsed separately). Once references are edited, they accumulate until you download and then clear them, so you can enter long lists of references in shorter “batches”.
The parsing routines will usually show several versions of the parsed references. You can select which version (or which parts of which versions) you'd like to keep in the “final” parsed version. Of course, you can also edit the fields directly.
Because italicization of species names in titles is common and important in the biological literature, you can mark sections for italicization by surrounding them with underscore characters (“_”), since there is no way to directly italicize text in a web form window.
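The underscore convention for italics can be sketched as follows. This is an illustration under stated assumptions, not ReCite's own code: the function name `mark_italics` and the choice of HTML-style `<i>` tags as the output markup are hypothetical, and the sample title is invented.

```python
import re
from html import escape

# A span is an underscore, one or more non-underscore characters, and a
# closing underscore. Unpaired underscores are left untouched.
ITALIC_SPAN = re.compile(r"_([^_]+)_")


def mark_italics(title: str) -> str:
    """Convert _..._ spans to <i>...</i> (assumed output markup)."""
    # Escape &, < and > first so that only the tags added here are markup.
    escaped = escape(title, quote=False)
    return ITALIC_SPAN.sub(r"<i>\1</i>", escaped)


# Invented title used only to show the conversion.
print(mark_italics("Larval development of _Homarus americanus_ in culture"))
```

A title with a single stray underscore passes through unchanged, because the pattern only matches a complete opening and closing pair.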

What will it not do?

This is not a reference management system. It will hold onto references for about 30 days after you edit them, or until you download and clear your set of references. However, there are no facilities for sorting or formatting those reference records. The intent is to make them available to you so that you can use your own reference database system for that.
Some references in bizarre formats will not be properly parsed. For those references, you'll end up doing a fair bit of work on the editing page to get them into acceptable form.
This tool does not connect to any electronic reference databases either to fetch or confirm references. It simply reformats the text that is pasted into it.
There is no facility for managing the papers themselves in electronic format. This tool is limited to dealing with the bibliographic references only. Digitizing and databasing the papers themselves is a different project.

To whom can I direct questions and comments?

This was developed by Dean Pentcheff (pentcheff@gmail.com) at the Los Angeles County Museum of Natural History. I'd be happy to receive questions or comments about this tool. While our main focus is on the Decapod Tree of Life project (see below), we are open to modifications to make this more useful for other researchers in other disciplines.

Why was this developed?

This tool was developed to facilitate processing references for the Decapod Crustacean “Tree of Life” project. Biological systematics depends critically on the literature references that represent the initial or revised descriptions of species. Capturing the references (and ultimately the papers themselves) electronically is a surprisingly important and difficult part of modern research.

Assembling the Tree of Life (ATOL) is a major National Science Foundation initiative that is dedicated to working out the phylogeny of the (estimated) 1.7 million species of organisms on Earth. Grant awards are made to groups of researchers with expertise in particular groups of organisms.

This tool was developed (by Dean Pentcheff at the Natural History Museum of Los Angeles County) to assist the group working on the decapod crustaceans. That group is supported by the following NSF awards:
DEB-EF-0531603 to Darryl Felder at the University of Louisiana at Lafayette,
DEB-EF-0531616 to Joel Martin at the Los Angeles County Museum of Natural History,
DEB-EF-0531670 to Rodney Feldmann and Carrie Schweitzer at Kent State University, and
DEB-EF-0531762 to Keith Crandall and Nikki Hanegan at Brigham Young University.

Upon what software does it depend?

The impetus to create this came from seeing the ParaCite service. It didn't do exactly what we were looking for, but the “ParaTools” Perl modules provided the core of the reference parsing. The originals are available via CPAN as Biblio::Citation::Parser. The modules were extensively modified for this application.

Also useful were the author reformatting modules: Text::BibTeX, Text::BibTeX::Name, and Text::BibTeX::NameFormat.

Unicode and UTF-8 issues proved “interesting”. The standard Perl module Encode was indispensable. To get UTF-8 into Normalization Form D (needed to process strings prior to basic pattern matching), Unicode::Normalize did the job.
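The effect of Normalization Form D can be sketched with Python's standard `unicodedata` module. This is an analogue for illustration, not the Perl code used on the site, and the helper names `decode_and_decompose` and `base_letters` are hypothetical. In Form D an accented letter becomes a base letter followed by a combining mark, so a simple letter pattern can be applied to the base letters.

```python
import re
import unicodedata


def decode_and_decompose(raw: bytes) -> str:
    """Decode UTF-8 bytes, then convert the text to Normalization Form D."""
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFD", text)


def base_letters(text: str) -> str:
    """Drop combining marks, keeping only the base characters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# "Gu\u00e9rin" contains a precomposed e-acute (one code point).
raw = "Gu\u00e9rin".encode("utf-8")
decomposed = decode_and_decompose(raw)

# After decomposition the e-acute is "e" plus U+0301 (combining acute accent).
print(len(raw.decode("utf-8")), len(decomposed))

# A plain ASCII letter pattern now matches the base letters.
print(bool(re.fullmatch(r"[A-Za-z]+", base_letters(decomposed))))
```

Whether to discard the combining marks or keep them for the stored record is a separate design choice; the sketch drops them only for the purpose of matching.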

The site framework is written in Mason.


Copyright NHMLAC Design: Dean Pentcheff pentcheff@gmail.com