By Audrey Hamelers, Europe PMC, with contributions from Frederick Atherden, eLife
If you’ve heard of Schematron, you’ve probably read why and how to build your own, and seen that JATS4R offers a Schematron for its recommendations in the JATS4R validator. But using a provided set of Schematron rules and writing your own entirely from scratch aren’t the only two options. There are a number of open source JATS Schematron rule sets, with licenses that allow cutting, copying, and mixing things up into a custom Schematron collage.
Schematron sources
Schematron is a validation language for XML that uses rules to assert or report information about patterns in XML documents. Schematron can be used to check for specific requirements or style not defined by the JATS DTD, and to check that different requirements are met at different stages of document production.
Read Schematron: a handy XML tool that’s not just for villains! for more on the benefits and basics of Schematron.
Some content in JATS XML, even at very different stages of production, can have similar requirements. The style and content of other parts of JATS XML can vary widely from system to system, or even in the same system at different stages of production. Combining existing rules, where reuse is encouraged, with rules written for your exceptional needs lets you build a Schematron to your exact requirements with less effort than starting from scratch. In my example I’ve done just that, by combining rules specific to the Europe PMC plus manuscript submission system with others from two excellent sources:
- The JATS4R validator tool validates XML using open source Schematron files, already handily divided into separate patterns and documents for different JATS sections and JATS4R recommendations.
- eLife provides an open source Schematron system, with a base set of rules that gets divided into separate schemas for different stages of their production process.
JATS4R, eLife, and Europe PMC encourage open science and its sister principle, the open sourcing of software. The open licensing of their Schematron schemas illustrates these principles, and with them we can demonstrate the efficiency gained through reuse.
Combining existing and new tests
As a first step, place different <pattern> elements in separate .sch files. Housing each of your schema <pattern> elements in a separate file makes it very easy to combine new and reused Schematron tests into one schema clearly and efficiently, and to share your patterns with others. Here are some ways to decide which <rule> elements and tests (<assert> and <report> elements) should go in which patterns:
- For ease of use and understanding, break rules and tests up into patterns that make sense to you, based on which stage of your production they are meant for, which section of the XML they apply to, what kind of checks they perform, or other considerations that work for your system.
- If you assign roles to your tests, such as “error”, “info”, or “warning”, grouping errors and warnings into separate patterns is very convenient for usage and for sharing.
- If rules with the same or overlapping context are in the same pattern, only the first matching rule will fire. If you want all your rules to fire, put rules for the same context in different patterns.
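The last point can be sketched with a small example. The pattern ids, messages, and tests below are hypothetical and are not taken from any of the rule sets discussed here. Both rules can match an author <contrib>. If they sat in a single pattern, an author would only ever be checked by whichever rule came first, so the sketch keeps them in two patterns:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">

  <!-- Hypothetical error-level pattern: applies to author contributors only -->
  <pattern id="contrib-name-errors">
    <rule context="contrib[@contrib-type='author']">
      <assert test="name or collab" role="error">
        An author contrib must contain a name or a collab.
      </assert>
    </rule>
  </pattern>

  <!-- Hypothetical warning-level pattern: applies to every contributor.
       Because this rule lives in its own pattern, it is still applied to
       author contribs that the rule above has already matched. -->
  <pattern id="contrib-aff-warnings">
    <rule context="contrib">
      <assert test="aff or xref[@ref-type='aff']" role="warning">
        A contrib should have an affiliation or a link to one.
      </assert>
    </rule>
  </pattern>

</schema>
```

Splitting by role in this way also means each pattern could be moved to its own .sch file and activated separately.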
Specially written tests
In my example, I’ve divided up our Europe PMC Schematron rules and tests into patterns depending on the element or area of the XML document they check, and depending on whether the tests are “error” or “warning” level checks. Each pattern containing specially written tests can be found in the repository as an individual .sch file with an ‘epmc-‘ prefix.
Here’s an example, epmc-email-warning.sch, which contains a single test that checks the XML for email addresses that have not been tagged inside <email>:
<pattern id="email-warning" xmlns="http://purl.oclc.org/dsdl/schematron">
  <rule context="text()[matches(., '(\W|^)[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}(\W|$)')]">
    <report test="not(parent::email)" role="warning">All email addresses should be inside an &lt;email&gt; element</report>
  </rule>
</pattern>
The specially written tests in the ‘epmc-‘ pattern files are checks for specific Europe PMC plus XML requirements, or for errors or issues we particularly want to watch out for.
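To see what the email-warning pattern is looking for, here is a stripped-down input fragment. It is an illustration only, not a valid JATS article, and the ids and addresses are invented. It assumes a Schematron processor that supports XPath 2.0 (needed for matches()) and that applies rules to text nodes:

```xml
<article>
  <front>
    <article-meta>
      <author-notes>
        <!-- Untagged address: this text node contains an email-like string
             and its parent is <corresp>, so the warning should be reported -->
        <corresp id="cor1">Correspondence: jane.doe@example.org</corresp>

        <!-- Tagged address: the matching text node sits inside <email>,
             so not(parent::email) is false and nothing is reported -->
        <corresp id="cor2">Correspondence: <email>jane.doe@example.org</email></corresp>
      </author-notes>
    </article-meta>
  </front>
</article>
```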
About the Europe PMC Schematron
Europe PMC is a free repository of life sciences literature, containing both published and preprint abstracts and full text articles from a variety of sources. Part of Europe PMC is Europe PMC plus, a manuscript submission system that processes pre-publication content, in the form of author manuscripts and preprints, into JATS XML for inclusion in Europe PMC.
The metadata content and other requirements of these articles, which undergo no editorial steps, are very different from those of published articles. Crucially, the XML for preprints and the XML for journal-accepted manuscripts also have different requirements from each other. Our Schematron complements existing checks against the JATS DTDs and our version of the PMC stylechecker, by asserting important differences between the two major types of files processed by our system, and by checking for specific content and style requirements.
Reused tests
More than half of the tests in the Europe PMC Schematron are reused from other sources. Our schema incorporates tests from both JATS4R and eLife. Because the JATS4R Schematron follows the same approach of separating errors and warnings into different patterns and placing each pattern in an individual file, I was able to reuse entire files from the JATS4R Schematron with no major changes. Our schema incorporates:
- abstract-errors.sch
- abstract-warnings-1.sch
- auths-affs-warnings.sch
- display-object-errors.sch
- display-object-warnings-1.sch
- display-object-warnings-2.sch
- math-errors.sch
In the Europe PMC Schematron file structure, I’ve prefixed each of these filenames with ‘jats-‘ to indicate their provenance.
eLife’s Schematron, on the other hand, offers a very large set of tests in a single schema file. This file is not intended to be used as-is: the eLife system breaks these tests up into smaller schemas for use at different stages of their production process. Many of these tests are specific to eLife’s particular house style and labelling schemes. However, some are of general use, or can demonstrate one particular way to solve a problem using Schematron.
We were impressed with the work eLife has put into validating author names and identities. I grabbed some eLife patterns, rules, and tests around names, and grouped them into two files included in our schema: elife-name-errors.sch and elife-name-warnings.sch.
The errors and warnings from JATS4R and eLife are very valuable checks on the quality and correctness of the Europe PMC plus XML, and due to open source licensing I was able to use them freely and with very little modification or technical effort.
Creating the combined schema
It’s easy to bring individual pattern files together into one Schematron. The main file of the Europe PMC Schematron is epmc.sch, which imports all the individual pattern files. The main file contains the <schema> element and other, non-pattern child elements. The patterns in their outside files are imported into the main schema element with <include>:
...
<include href="epmc-url-errors.sch"/>
<include href="epmc-article-type-errors.sch"/>
<include href="elife-name-errors.sch"/>
<include href="elife-name-warnings.sch"/>
<include href="jats-abstract-errors.sch"/>
<include href="jats-abstract-warnings-1.sch"/>
...
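For orientation, a main file built this way might look roughly like the sketch below. This is not the actual epmc.sch: the title, the queryBinding value, and the namespace declarations are assumptions, and only a few of the pattern files named in this post are included. queryBinding="xslt2" is chosen because the email rule shown earlier uses matches(), an XPath 2.0 function.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

  <!-- Hypothetical title -->
  <title>Example combined JATS Schematron</title>

  <!-- Assumed namespace declarations, only needed if the included
       patterns use these prefixes in their XPath expressions -->
  <ns prefix="xlink" uri="http://www.w3.org/1999/xlink"/>
  <ns prefix="mml" uri="http://www.w3.org/1998/Math/MathML"/>

  <!-- Each referenced file has a single <pattern> as its root element;
       the processor puts that element in place of the <include> -->
  <include href="epmc-email-warning.sch"/>
  <include href="elife-name-errors.sch"/>
  <include href="jats-abstract-errors.sch"/>

</schema>
```

Because each included file declares the Schematron namespace on its own root <pattern>, the same file can be dropped into another project's main schema without editing.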
Patterns are grouped together into separate <phase> elements. Some Schematron processors allow you to run phases individually. In this example, phases can be used to run either errors or warnings alone, but phases could also be used to divide patterns into different sets for different stages of a production process.
<phase id="errors">
  <active pattern="article-type-errors"/>
  <active pattern="abstract-errors"/>
  <active pattern="name-errors"/>
  <active pattern="url-errors"/>
  <active pattern="attribute-space-errors"/>
  <active pattern="formula-errors"/>
  <active pattern="math-errors"/>
  <active pattern="position-errors"/>
  <active pattern="display-object-errors"/>
  <active pattern="fn-group-error"/>
</phase>
<phase id="warnings">
  <active pattern="corresp-author-warning"/>
  <active pattern="auths-aff-warnings"/>
  <active pattern="abstract-warnings-1"/>
  <active pattern="email-warning"/>
  <active pattern="name-warnings"/>
  <active pattern="xref-warnings"/>
  <active pattern="display-object-warnings-1"/>
  <active pattern="display-object-warnings-2"/>
</phase>
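As a sketch of the second use of phases, dividing patterns by production stage, the fragment below defines two stage-based phases. The phase ids and the choice of patterns in each are hypothetical and do not describe how Europe PMC actually runs its checks. The pairing of pattern ids with file names is also an assumption. A pattern can be active in more than one phase, and the defaultPhase attribute names the phase to use when the processor is not told which one to run.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        queryBinding="xslt2"
        defaultPhase="preprint">

  <!-- Hypothetical phase for preprint XML -->
  <phase id="preprint">
    <active pattern="article-type-errors"/>
    <active pattern="abstract-errors"/>
    <active pattern="email-warning"/>
  </phase>

  <!-- Hypothetical phase for journal-accepted manuscripts;
       it shares two patterns with the phase above -->
  <phase id="accepted-manuscript">
    <active pattern="article-type-errors"/>
    <active pattern="abstract-errors"/>
    <active pattern="name-errors"/>
  </phase>

  <!-- Pattern files; each id above is assumed to be defined in one of these -->
  <include href="epmc-article-type-errors.sch"/>
  <include href="jats-abstract-errors.sch"/>
  <include href="elife-name-errors.sch"/>
  <include href="epmc-email-warning.sch"/>

</schema>
```

How a phase is selected at run time depends on the processor, so check your tool's documentation for the relevant option.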
Create and share your own
You can pick and choose tests from open source Schematron sets that meet your specific needs, and combine them with others written just for you. All three of the Schematron rule sets mentioned here are freely available to copy, modify, merge, publish, distribute, and sublicense.
See Schematron: a handy XML tool that’s not just for villains! for information on Schematron basics and writing your own rules and tests. Each additional JATS Schematron that is open sourced in turn adds even more relevant tests to the pool of resources to choose from, and can help everyone in the community save time in the long run!
Combining Schematron with other tools
Schematron can form part of a wider validation service that uses existing public APIs (such as those provided by Crossref, DataCite, ROR, ORCID, or PubMed) to check that content is valid and complete, and that it conforms to editorial policies.
Implementing Schematron validation
Here are some existing open source tools you can use to implement your own Schematron validation:
- The JATS4R validator web service and UI
- eLife’s BaseX validator
These provide validation via API and/or via a user interface.
Notes
Audrey Hamelers wrote and assembled the Europe PMC Schematron, and Frederick Atherden manages and maintains the JATS4R and eLife Schematron.