This article covers the techniques web authors can use to reference articles and publications in a robust manner. It was written to highlight some of the inadequacies of relying on non-persistent URI references for academic and technical papers. It was partially born out of my considerations on holistic hypertext and how this could be integrated into both CMS and KMS.
Ever 404ed on a linked reference?
URLs are not necessarily persistent. I came across this while reading over a series of reports I had authored: clicking on the references at the end of a report more often than not led to a 404 error page or a redirection to the home page of the site. How can the utility of hyperlinking to academic papers be preserved when the actual locations are volatile? Thankfully some clever people have already thought about this problem and devised a solution for it, the URI.
The URI, aka the Uniform Resource Identifier
Historically the term URL has long been associated with the World Wide Web. The idea of a few bytes of data being able to locate any publicly available document over the internet is an appealing one. It is this flexibility that has led to the vast explosion of the web and its self-organising structure.
The popular, though “informal”, term URL is associated with common URI schemes like ftp, http, gopher and mailto. The L is what separates the URL from the URI: it stands for locator. Why is this a problem? To answer this question let us use our imagination…
…You are feeling sleepy…
You are in a large bookstore looking to purchase Bertrand Meyer’s book, “Object-Oriented Software Construction, 2nd edition” (highly recommended by myself). Think for a moment of the various ways the book could be located when you walk into the store, then compare with the list below:
- Look for the “computer books” section and then scan the shelves.
- Ask an attendant:
  - Response, “Go to the third aisle, second shelf.”
- Wander around aimlessly and hope you stumble upon it.
- Use an in store catalogue access point to search by title and/or author to check it is in stock and where it is.
Most of these approaches will locate the book you are looking for (wandering about might not). The point is that you will have located the book you wanted to buy. Let’s go and buy it: head to the counter and put your hand in your pocket for your wallet. Hmm, not enough cash, you poor student. Come back next week after the next instalment of your student loan comes through.
One week later…
Having safely stored the location of the book in your brain, you return and travel directly to the shelf it was on. Unfortunately all you can see is Delia Smith’s cookbooks, and Jamie “the Naked Chef” Oliver trying to sell you on lightly toasted mashed potatoes or some other abomination. That’s right, the bookstore has 404ed you.
The point
A reference mechanism that relies exclusively on the physical location of a resource is liable to break when that resource moves. The same is true of the more abstract notion of web URLs. In his analysis of the problems caused by changed URIs, Tim Berners-Lee notes that a URI should be persistent, and that when we tie it too closely to the underlying architecture, whether technical or political, we open ourselves to future problems.
This article is concerned with citations. If you are citing a particular work you need to make sure that its URI is strong; if the URI is liable to break, then all the references you make to it are likely to break as well. Why is this important? Citing previous work is a healthy way for knowledge to be disseminated and passed on from reader to reader. A citation can give a more grounded understanding of the subject under discussion. We have got where we are today because we have learnt from the past, from those minds who have gone before. We may challenge their conclusions and theories, but without them the world would be a poorer place.
URN, committed persistence
As we have seen, for scholarly works and other serious writing that we commit to the web, the danger of failing URIs is a serious concern. The URI advances a step beyond the common URL in that it conceptually separates the identifier of a resource from its location. To make this more useful we now take the next step and use URNs.
Confused yet? We have introduced three very similar abbreviations: URL, URI and URN. These terms are explained in the addressing overview provided by the W3C; a slightly abridged version is given below:
- URL, Uniform Resource Locator
  - Commonly used term associated with popular URI schemes.
- URI, Uniform Resource Identifier
  - The generic set of all names/addresses that are short strings that refer to resources.
- URN, Uniform Resource Name
  - An URI that has an institutional commitment to persistence and availability.
  - A particular scheme, urn:, specified by RFC2141 and related documents, intended to serve as persistent, location-independent, resource identifiers.
We have thus moved from location-specific identifiers to the more general URI, and now to the URN, which is built upon the URI syntax. Using URNs where available will help to strengthen the links between web documents and the resources they cite.
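To make the shape of a URN concrete, here is a minimal Python sketch that splits one into its namespace identifier and namespace-specific string, following the general “urn:” form from RFC 2141. The names `Urn` and `parse_urn` are my own for illustration, and the check is deliberately loose: it does not validate every character the RFC restricts.

```python
from typing import NamedTuple


class Urn(NamedTuple):
    nid: str  # namespace identifier, e.g. "isbn"
    nss: str  # namespace-specific string, e.g. "0-345-33971-1"


def parse_urn(text: str) -> Urn:
    """Split 'urn:<NID>:<NSS>' into its two parts."""
    # Split at most twice so that colons inside the NSS are preserved.
    parts = text.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn":
        raise ValueError(f"not a URN: {text!r}")
    nid, nss = parts[1], parts[2]
    if not nid or not nss:
        raise ValueError(f"URN is missing its namespace or name: {text!r}")
    # The "urn" prefix and the NID are case-insensitive, so normalise the NID.
    return Urn(nid.lower(), nss)


print(parse_urn("urn:isbn:0-345-33971-1"))
```

Note that nothing in the result says where the book lives; the name only says which namespace it belongs to and what it is called there.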
Examples of citations using a URN namespace
After that rather lengthy prelude let’s get some examples in front of us so we can see how to do this stuff. Examples of URNs for citing journals, books, magazines and specific articles in publications will be given.
Citing a Book, the ISBN URN namespace
When quoting from a book and giving a URI for the citation, it is common to pick a particular web page that provides the text of the book, or details on how it may be purchased. The following example quotes from a book and provides a link that can be followed to purchase the item.
<blockquote cite="http://www.amazon.com/exec/obidos/ASIN/0345339711/">
<p>They went in single file, running like hounds on a strong scent, and an eager light was in their eyes. Nearly due west the broad swath of the marching Orcs tramped its ugly slot; the sweet grass of Rohan had been bruised and blackened as they passed.</p>
</blockquote>
The earlier explanations have shown that the cited URI, “http://www.amazon.com/exec/obidos/ASIN/0345339711/”, is not a good example of a persistent identifier. It is quite conceivable that the owner of the site may rearrange the site structure, or change the technology used by the site, so that the referenced link no longer exists.
The ISBN namespace is designed to uniquely identify a book and is a suitable method for explicitly referencing a book without incurring a potential loss of persistence.
To use the ISBN namespace, begin the URI with “urn:isbn:”, followed by the ISBN of the book, which should be easily obtainable from the book itself. In the case of J.R.R. Tolkien’s “The Two Towers”, the ISBN is “0-345-33971-1”. The URI we should use as the value of the cite attribute is “urn:isbn:0-345-33971-1”, and the example we gave then becomes:
<blockquote cite="urn:isbn:0-345-33971-1">
<p>They went in single file, running like hounds on a strong scent, and an eager light was in their eyes. Nearly due west the broad swath of the marching Orcs tramped its ugly slot; the sweet grass of Rohan had been bruised and blackened as they passed.</p>
</blockquote>
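Since a mistyped ISBN produces a URN that names nothing, it can be worth checking the number before building the URN. The following Python sketch is one way to do that, assuming a ten-digit ISBN like the one above (the standard ISBN-10 check: weights 10 down to 1, sum divisible by 11). The function names are mine and 13-digit ISBNs are not handled.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Check the ISBN-10 check digit; hyphens and spaces are ignored."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10, and only as the final character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value  # weights run from 10 down to 1
    return total % 11 == 0


def isbn_urn(isbn: str) -> str:
    """Return 'urn:isbn:<ISBN>', refusing ISBNs whose check digit is wrong."""
    if not isbn10_is_valid(isbn):
        raise ValueError(f"invalid ISBN-10: {isbn!r}")
    return "urn:isbn:" + isbn


print(isbn_urn("0-345-33971-1"))  # urn:isbn:0-345-33971-1
```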
Citing a Journal or Magazine, the ISSN URN namespace
Most people have heard of the ISBN and can easily find it on a book. A similar numbering scheme is in place for periodical publications like magazines and scientific journals: the ISSN, or International Standard Serial Number. These numbers are easy to discover, as they are usually printed somewhere on the magazine along with circulation details, or can be found on the internet.
To use the ISSN URN namespace to reference a journal you must first obtain the ISSN of the periodical. To cite the journal “Science of Computer Programming” you would take its ISSN, “0167-6423”, and append it to “urn:issn:”. The resulting value is used in the following example:
<q cite="urn:issn:0167-6423">The state machine model could be decomposed into lower states, however these lower states were still sequential not contemporaneous</q>
The given example demonstrates how a specific journal or magazine can be referenced. This is undoubtedly useful, but when writing reports or articles it is more common to refer directly to an article, not just to the journal in which it was published.
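As with the ISBN, the last character of an ISSN is a check character, so an ISSN can be sanity-checked before it is turned into a URN. This Python sketch assumes the usual ISSN rule (weights 8 down to 1, sum divisible by 11, “X” meaning 10 in the last position); the function names are illustrative only.

```python
def issn_is_valid(issn: str) -> bool:
    """Check the ISSN check character; the hyphen is ignored."""
    chars = issn.upper().replace("-", "")
    if len(chars) != 8:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 7:
            value = 10  # "X" is only allowed as the check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (8 - position) * value  # weights run from 8 down to 1
    return total % 11 == 0


def issn_urn(issn: str) -> str:
    """Return 'urn:issn:<ISSN>', refusing ISSNs whose check character is wrong."""
    if not issn_is_valid(issn):
        raise ValueError(f"invalid ISSN: {issn!r}")
    return "urn:issn:" + issn


print(issn_urn("0167-6423"))  # urn:issn:0167-6423
```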
Citing a specific Journal article, the SICI URN namespace
Having seen that periodical publications can be referenced, the natural next question is how an individual article can be referenced. The SICI (Serial Item and Contribution Identifier) URN namespace is a proposed namespace that will handle these kinds of references. Unlike the ISSN and ISBN URN namespaces it is not yet formally registered, although it is going through the formal process at the moment. It may still be instructive to see how such a reference might be cited using a recent draft. Before heading straight into the URI syntax it helps to understand how a SICI is formed, since unlike ISBNs or ISSNs they are not easily found on the back cover of the item. The SICI is discussed in RFC 2288, which provides the following example:
An example of a SICI code is:
0015-6914(19960101)157:1<62:KTSW>2.0.TX;2-F
The first nine characters are the ISSN identifying the serial title. The second component, in parentheses, is the chronology information giving the date the particular serial issue was published. In this example that date was January 1, 1996. The third component, 157:1, is enumeration information (volume, number) for the particular issue of the serial. These three components comprise the “item segment” of a SICI code. By augmenting the ISSN with the chronology and/or enumeration information, specific issues of the serial can be identified. The next segment, <62:KTSW>, identifies a particular contribution within the issue. In this example we provide the starting page number and a title code constructed from the initial characters of the title. Identifiers assigned to a contribution can be used in the contribution segment if page numbers are inappropriate. The rest of the identifier is the control segment, which includes a check character. Interested readers are encouraged to consult the standard for an explanation of the fields in that segment.
To reference the SICI given in the example, certain characters must be escaped as defined in RFC 2141; in this example these are the < and > characters. Prefixed with the “urn:sici:” namespace used in RFC 2288, the resulting SICI URN would be “urn:sici:0015-6914(19960101)157:1%3C62:KTSW%3E2.0.TX;2-F”, which could be included as the value of a cite or href attribute as in the other examples.
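The escaping step can be sketched in a few lines of Python. The set of characters left unescaped below is my reading of what RFC 2141 permits in the namespace-specific string (letters, digits and a short list of punctuation); everything else is percent-encoded. The helper name `escape_nss` is hypothetical, and the “urn:sici:” prefix follows the form used in RFC 2288 for a namespace that, as noted, is only proposed.

```python
import string

# Characters assumed to be allowed unescaped in the namespace-specific string.
URN_SAFE = set(string.ascii_letters + string.digits + "()+,-.:=@;$_!*'")


def escape_nss(nss: str) -> str:
    """Percent-encode every character that is not in URN_SAFE."""
    out = []
    for char in nss:
        if char in URN_SAFE:
            out.append(char)
        else:
            # Encode as UTF-8 and write each byte as %XX.
            out.extend(f"%{byte:02X}" for byte in char.encode("utf-8"))
    return "".join(out)


sici = "0015-6914(19960101)157:1<62:KTSW>2.0.TX;2-F"
print("urn:sici:" + escape_nss(sici))
# urn:sici:0015-6914(19960101)157:1%3C62:KTSW%3E2.0.TX;2-F
```

The function expects a raw SICI. Passing in a string that has already been escaped would encode its percent signs a second time.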
Citing an IETF RFC, the IETF URN namespace
Referencing an RFC is a reasonably common occurrence on the internet, since RFCs are the backbone upon which many of the applications we use are built. As you would expect from the people who standardised the URN syntax, the IETF has its own namespace for the documents it publishes. These cover a variety of document types, the most popular being the RFC. Most references to an RFC are made by simply linking to a copy of it, usually the one held on the ietf.org web site. For example, a link to RFC 2141 would simply be “<a href="http://www.ietf.org/rfc/rfc2141.txt">RFC 2141</a>”. However not everyone links directly to the IETF version; many people link to versions held elsewhere, often because other formats contain internal hyperlinks to related documents whereas the original text files do not.
The syntax for the IETF namespace is given in RFC 2648 (RFC 2141 defines the general URN syntax). Creating a reference to this RFC results in the URI “urn:ietf:rfc:2648”; an example using it is given below:
<blockquote cite="urn:ietf:rfc:2648">
<p>Assignment of URNs from this namespace occurs in three ways. The first is through publication of a new RFC, FYI, STD or BCP is by the RFC Editor. This new document will have a new series number and will therefore define a new URN.</p>
</blockquote>
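The separation between name and location can be illustrated with a small Python sketch: the citation stores only the URN, and a location is worked out when the reader actually wants the document. The URL template is an assumption taken from the ietf.org link quoted earlier and may not stay valid, which is exactly why it is kept apart from the name. `rfc_urn` and `locate_rfc` are hypothetical helpers, not part of any standard resolver.

```python
import re

# Assumed location template, based on the ietf.org link quoted earlier.
RFC_LOCATION = "http://www.ietf.org/rfc/rfc{number}.txt"

# Matches URNs of the form urn:ietf:rfc:<number>, ignoring case.
RFC_URN = re.compile(r"urn:ietf:rfc:([0-9]+)", re.IGNORECASE)


def rfc_urn(number: int) -> str:
    """Build the persistent name for an RFC."""
    return f"urn:ietf:rfc:{number}"


def locate_rfc(urn: str, template: str = RFC_LOCATION) -> str:
    """Map an RFC URN to one possible location for the document."""
    match = RFC_URN.fullmatch(urn)
    if match is None:
        raise ValueError(f"not an IETF RFC URN: {urn!r}")
    return template.format(number=int(match.group(1)))


print(locate_rfc(rfc_urn(2141)))  # http://www.ietf.org/rfc/rfc2141.txt
```

If the documents move, only the template changes; every stored citation stays as it was.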
Conclusion
This article has given an overview of the relationship between URLs, URIs and URNs. It has shown the usefulness of the URN in referencing documents while writing a technical or academic item on the internet. In addition to explaining the usefulness of these standards I have given a variety of examples of how the URN standard can be applied in practice, ranging from books to magazines, journals and RFCs.
As I mentioned at the beginning of this article, I was often frustrated when trying to follow hyperlinked references to academic papers I had cited in the reports I have written. I was frequently confronted with 404 errors because papers had been moved or deleted from the server. Locating the references again and then rewriting my reports to include them was a drain on my resources and not a scalable solution. Using URNs to reference academic papers has given my links a persistent quality. This is a great benefit to me, and I hope that in the future it will also help in integrating the articles and reports I have written into the semantic web.
Further Reading
- Citations in HTML
  - Cite in HTML 4.01
  - Cite in XHTML 2 (work in progress)
  - XML Schema Part 2: Data types, basis for the anyURI data type used in draft version of XHTML 2 (work in progress)
- RFCs and other documents about URNs
- Miscellaneous