

Linking History, an exercise in linked open data

By News Team | Posted under News

The Linking History site has been created for use as a research tool by people wanting to upload material to the Portrait of a Nation website. It was also used by students at Mount Evelyn Primary School and Mildura Primary School to do research for the films they created about Nellie Melba and the Chaffey brothers during the History In Place project.

 

Whilst the interface doesn't look all that unusual, it's an experimental pilot in the practical application of linked open data. It aimed to aggregate collection items relating to a selected group of people commemorated through Canberra place names, and expose the results both through a web application and as Linked Data.

 

The pilot also aimed to increase community engagement with Australia’s history – particularly as it relates to our national capital – and increase access to archival material via Linked Data. The project was funded by the Centenary of Canberra.

 

The site was built by Tim Sherratt, and the following section of this blog (written by Tim) describes his process in detail...

 

Design considerations

There were three elements to be considered in the design of the application: the mechanism for storing and publishing the RDF data, the code to query and retrieve details from the RDF storage, and the way in which these details were presented to users. In addition, for this to be an example of Linked Data publishing, the RDF had to be discoverable according to one of the standard LOD publishing patterns.

 

I considered creating a standalone triple store for use with this project, but as I wanted to avoid any software dependencies on the server side, I decided to simply store and deliver static RDF/XML files, and display the details using javascript and html. This had the advantage of making the whole project portable – it could be published on any web server (although the server root values in the RDF would need to be adjusted).

 

Another advantage of this approach is that it provides an example of how collections, exposed as Linked Open Data, might be integrated into new forms of online publication that are not dependent on particular platforms.

 

Using this approach, the RDF data is exposed according to the Linked Open Data ‘autodiscovery’ pattern, using ‘link’ tags to point to the structured data underlying the web interface.
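
For example, a page can advertise its underlying data with tags such as <link rel="alternate" type="application/rdf+xml" href="people.rdf"> (the exact attributes shown here are my assumption rather than a quote from the project), and any consumer can then gather those references with a few lines of javascript:

```javascript
// Sketch only: collect the RDF files advertised through the page's link tags.
var rdfFiles = $('link[rel="alternate"][type="application/rdf+xml"]')
  .map(function () { return $(this).attr('href'); })
  .get();   // e.g. ['people.rdf', 'canberra_places.rdf', 'mov_collections.rdf']
```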

 

Data structure

 

As the project was aggregating records from a number of cultural institutions I looked towards Europeana for guidance on how I could best express information about the resources and their relationships as Linked Data. At the centre of the Europeana Data Model (EDM) is the idea of an ‘aggregation’ which brings the cultural heritage object together with any digital representations of that object. This same model is used by the Digital Public Library of America (DPLA).

 

I followed the same approach. Each object is represented by an aggregation that bundles together links to the object itself, to a web page that provides information about that object, and, if available, to an image of the object. So a typical object is represented as four interlinked entities: aggregation, object, web page and image. The aggregation is also linked to details of the institution providing the data.
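
As an illustration only (this is not the project's code, and the web page and image urls below are placeholders), the four entities for a single item hang together like this, using the example identifiers from the data model tables later in this post:

```javascript
// Illustration only: the four interlinked entities for one item, shown as a
// plain object. The aggregation and object identifiers come from the data
// model tables below; the web page and image urls are placeholders.
var exampleItem = {
  aggregation: 'http://www.cv.vic.gov.au/LOD-project/mov_collections.rdf#mov-61442',     // ore:Aggregation
  object:      'http://www.cv.vic.gov.au/LOD-project/mov_collections.rdf#mov-cho-61442', // edm:aggregatedCHO
  webPage:     'http://example.org/item-page',                                            // edm:isShownAt (placeholder)
  image:       'http://example.org/item-image.jpg',                                       // edm:isShownBy (placeholder)
  institution: 'the contributing institution'                                             // edm:dataProvider
};
```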

 

Most of the familiar descriptive metadata – such as title, description and date – are attached to the object itself. So are relationships between the object and people or places. To keep things manageable I used a very limited set of relationships in object descriptions: subject, creator, and has met (an EDM property indicating a connection between a person and a thing).

 

People and places are described as entities in their own right, with their own properties and relationships. People are related to places through birth or death events, and via the ‘named for’ property that indicates when a place was named after a particular person.

 

More detail on the data model I used is provided below.

 

Data processing

The structure and format of the collection data provided varied considerably. I used Open Refine (formerly Google Refine) to clean and normalise the data, and to generate the RDF for use in the interface. It’s a very powerful tool and saved a lot of manual handling.

 

Open Refine’s clustering and batch editing tools make it easy to group items. For example, I needed to reduce the wide variety of formats described within the different collections to a small subset for use within the interface. Similarly, the clustering and filtering features were useful in assigning relationships between objects and people.

 

In some cases I wanted to modify punctuation, or combine separate fields into a single value. Open Refine’s own programming language, GREL, makes these sorts of transformations pretty easy.

 

There are also some more advanced features that allow you to retrieve and process related data. For example, some records didn’t include direct urls for images. If I couldn’t predict these from the web interface, I used Open Refine to retrieve and save the html of the web page for each item. I then used GREL to find the image links and save them to a new field – a simple form of screen scraping.
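
A javascript equivalent of that GREL step might look something like the sketch below; the regular expression is only a guess at how the image markup looked, since the real patterns varied between collections:

```javascript
// Sketch only: pull image urls out of a saved item page, roughly what the
// GREL expression did. The regex here is an assumption about the markup.
function extractImageLinks(html) {
  var links = [];
  var re = /<img[^>]+src="([^"]+\.(?:jpg|jpeg|png|gif))"/gi;
  var match;
  while ((match = re.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}
```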

 

Open Refine’s reconciliation services allow you to find links with other data sets. I experimented with these using the National Gallery of Victoria data. First I copied and normalised the artist field, then I sent these values off for reconciliation against DBPedia. A number of useful matches were found and I added these additional people to the data set. They could then be related as creators to the original items.

 

You can also send values off to any number of third party APIs for further processing. I attempted some named entity recognition by feeding titles from the National Gallery records to the Alchemy API. I then extracted details of any place names mentioned and used DBPedia and GeoNames to harvest useful metadata, such as coordinates. These places could then be related as subjects to the original items. I’ve since discovered that the named entity extraction extension created by Free Your Metadata for Open Refine simplifies much of this.

 

My reconciliation and named entity recognition attempts showed some promising results, though there was considerable inconsistency, and a fair bit of manual intervention was still required. As such, results were only deployed for the National Gallery of Victoria. The value, of course, lies in the way such techniques can create connections to widely-used datasets such as Wikipedia/DBPedia – this is what really puts the ‘Linked’ into Linked Open Data and opens up new opportunities for discovery.

 

Data about the people and places identified through these semi-automated techniques was stored in Open Refine alongside the collection items from which they were extracted. The five main subjects of the project, and the Canberra places that are named after them, were stored and managed separately. In these cases I undertook a considerable amount of manual enrichment. Birth and death details were added, as was information about associated places. Links were created to DBPedia and a range of other biographical sources. In the case of the Canberra places, links were added to the Portrait of a Nation site and to the ACT Government’s place name database.

 

RDF Generation

The RDF extension for Open Refine makes it easy (and fun!) to design, manage and export Linked Data for consumption by other applications.

 

For each data set I defined an RDF skeleton that set out the basic entities – aggregation, object, web page and image – and mapped values from the data to properties attached to each of the entities. The properties are listed below.

 

As part of the mapping, values can be transformed on the fly using GREL expressions. For example, identifiers for the aggregations and the cultural heritage objects were created by combining the project’s namespace, the contributing institution’s name and a unique system id.
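
The post doesn't include the actual GREL expression, but the construction is simple enough to sketch in javascript (the function name and namespace constant here are mine, not the project's):

```javascript
// Sketch only: build an identifier from the project namespace, the data file
// for the contributing institution, and the item's system id, e.g.
// makeUri('mov_collections', 'mov', '61442')
//   -> 'http://www.cv.vic.gov.au/LOD-project/mov_collections.rdf#mov-61442'
var PROJECT_NAMESPACE = 'http://www.cv.vic.gov.au/LOD-project/';

function makeUri(dataFile, prefix, systemId) {
  return PROJECT_NAMESPACE + dataFile + '.rdf#' + prefix + '-' + systemId;
}
```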

 

Once the skeleton was defined, the data could be exported in RDF/XML format and added to the interface.


Interface construction

As mentioned, the web application was constructed using javascript and html only. Twitter Bootstrap was used to provide an adaptive framework, standard widgets and the basic look and feel of the site. jQuery and a number of specialised javascript libraries were used to provide the functionality.

 

A series of simple browse lists and an interactive map were created to provide users with the ability to explore the collections, people and places.

 

The RDF data is accessed using the RDFQuery javascript library. When the site loads, the RDF data files identified through the link tags are fed to RDFQuery, which adds the contents to an in-memory triple store. Each time a new page is accessed, this triple store is queried and the results are displayed.
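
A rough sketch of that pattern is below. It is not the site's actual code, and the exact RDFQuery calls are my assumption about how the library is typically used:

```javascript
// Sketch only: fetch the RDF/XML files named in the page's link tags, load
// their triples into an rdfQuery databank, then query that store when a page
// is rendered. Exact rdfQuery calls are an assumption, not the project's code.
var databank = $.rdf.databank();

$('link[type="application/rdf+xml"]').each(function () {
  $.get($(this).attr('href'), function (doc) {
    databank.load(doc);   // add this file's triples to the in-memory store
  }, 'xml');
});

// Later, when a browse list is built:
$.rdf({ databank: databank })
  .prefix('dc', 'http://purl.org/dc/elements/1.1/')
  .where('?object dc:title ?title')
  .each(function () {
    console.log(this.object.value.toString() + ': ' + this.title.value);
  });
```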

 

The Address javascript library is used to manage in-site navigation, enable use of the back button, and provide bookmarkable links.

 

Thumbnail images are used wherever possible to enhance the browse lists. However, all the images on the site are loaded directly from the providing institutions. No images are stored on the site. This means that the images need to be resized on the fly. The NailThumb javascript library is being used to give more control over the resizing and improve the look of the thumbnails.

 

The map is created using the popular Leaflet javascript library.

 

While RDFQuery generally worked well, performance suffered once there were a few hundred items in the triple store. This was particularly evident on browsers with slower javascript engines and on mobile devices. After much experimentation and gnashing of teeth, I realised that the real performance hits came when a page looped through a list of items and queried the triple store for additional metadata.

 

To try to improve performance I simplified some of the browse lists to reduce the number of calls to the triple store. I also introduced a simple form of caching and added pagination to long lists. Page loads did get faster, but the experience on mobile devices remains poor. While I still believe that the creation of platform-independent applications that consume and display Linked Data is important, more experimentation is needed to understand the performance limits.
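
Conceptually, a simple cache of this kind just remembers lookups that have already been made – something like the sketch below (illustrative only; the names are mine, not the site's):

```javascript
// Sketch only: remember triple store lookups so that rebuilding a browse
// list doesn't repeat the same query for every item on every page view.
var metadataCache = {};

function getMetadata(uri, lookupInStore) {
  if (!(uri in metadataCache)) {
    metadataCache[uri] = lookupInStore(uri);   // the expensive triple store query
  }
  return metadataCache[uri];
}
```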

 

Possibilities and problems

In general, once I had a clear model and a good understanding of the data, the processes of normalisation, mapping and RDF generation were pretty straightforward. Open Refine made the job quite enjoyable.

 

Where there were difficulties they mostly related to the unavailability of data, such as direct image urls. Some web interfaces embedded images within frames or javascript widgets that made it difficult for me to dig them out automatically.

 

While we’re all familiar with the arguments for publishing our web resources using persistent urls, this project made me realise that it is equally important to manage the urls of assets, such as images, that make up our resources. Given an item identifier, it should be possible to retrieve an image of it in a variety of sizes.

 

As described above, my experiments with named entity extraction were encouraging. Another useful feature of Open Refine is that a series of processing steps can be saved and imported into another project. This means that it might be possible to build and share a series of formulas that could be used for enrichment across data sets such as these.

 

The aim of this project was to aggregate and link resources to promote discovery. But aggregation of resources generally comes at the cost of a simplified data model – the richness of individual descriptive models can be lost. The purposes of aggregation, its costs and its benefits, need to be considered.

 

Key data considerations for cultural institutions

• You need to have ways of getting your data out of whatever system you use to manage it. This seems obvious, but some of the contributors to this project had difficulty exporting to simple formats for data exchange.

• Tools like Open Refine facilitate data cleanup and normalisation, so concerns about data quality or differing vocabularies shouldn't inhibit sharing in projects such as this.

• Good examples of Linked Data based models for resource aggregation already exist in Europeana and the DPLA. It's worth thinking about how your data might map to such structures.

• The importance of unique identifiers and persistent URLs can't be stressed enough. Some contributors had trouble supplying these because of limitations in their software. Linked Data needs reliable links to individual resources.

 

Data model

Schemas

Prefix | Namespace
dc | http://purl.org/dc/elements/1.1/
schema | http://schema.org/
rdfs | http://www.w3.org/2000/01/rdf-schema#
edm | http://www.europeana.eu/schemas/edm/
foaf | http://xmlns.com/foaf/0.1/
dct | http://purl.org/dc/terms/
owl | http://www.w3.org/2002/07/owl#
ore | http://www.openarchives.org/ore/terms/
geo | http://www.w3.org/2003/01/geo/wgs84_pos#
dbpprop | http://dbpedia.org/property/
dbpont | http://dbpedia.org/ontology/
bio | http://purl.org/vocab/bio/0.1/

 

Aggregation

Field | Property | Value
Identifier | | eg: http://www.cv.vic.gov.au/LOD-project/mov_collections.rdf#mov-61442
Type | | ore:Aggregation
Object | edm:aggregatedCHO | edm:ProvidedCHO
Web page | edm:isShownAt | edm:WebResource
Image | edm:isShownBy | edm:WebResource
Institution | edm:dataProvider | foaf:Organization

 

Object

Field | Property | Value
Identifier | | eg: http://www.cv.vic.gov.au/LOD-project/mov_collections.rdf#mov-cho-61442
Type | | edm:ProvidedCHO
Title | dc:title | string
Label | rdfs:label | string
Description | dc:description | string
Format | dc:format | Controlled list
Provenance | dct:provenance | string
Date created | dc:created | string
Subject | dc:subject | foaf:Person, schema:Place
Creator | dc:creator | foaf:Person, string
Association | edm:hasMet | foaf:Person

 

Web page

Field | Property | Value
Identifier | | url
Type | | edm:WebResource, schema:ItemPage
Format | dc:format | schema:ItemPage

 

Image

Field | Property | Value
Identifier | | url
Type | | edm:WebResource, schema:ImageObject
Format | dc:format | schema:ImageObject
Depicts | foaf:depicts | foaf:Person

 

Person

Field | Property | Value
Identifier | | DBPedia uri or eg: http://www.cv.vic.gov.au/LOD-project/people.rdf#melba
Type | | foaf:Person
Label | rdfs:label | string
Name | foaf:name | string
Same as | owl:sameAs | uri
Web page about | foaf:isPrimaryTopicOf | schema:ItemPage
Birth | bio:birth | bio:Birth
Death | bio:death | bio:Death

 

Birth event

Field | Property | Value
Identifier | | eg: http://www.cv.vic.gov.au/LOD-project/people.rdf#melba_birth
Type | | bio:Birth
Date | bio:date | string
Place | bio:place | schema:Place

 

Death event

Field | Property | Value
Identifier | | eg: http://www.cv.vic.gov.au/LOD-project/people.rdf#melba_death
Type | | bio:Death
Date | bio:date | string
Place | bio:place | schema:Place

 

Place

Field | Property | Value
Identifier | | DBPedia uri or http://www.cv.vic.gov.au/LOD-project/canberra_places.rdf#[label]
Type | | schema:Place, dbpont:PopulatedPlace, dbpont:Road
Label | rdfs:label | string
Latitude | geo:lat | float
Longitude | geo:long | float
Same as | owl:sameAs | uri
Web page about | foaf:isPrimaryTopicOf | schema:ItemPage
Named in honour of | dbpprop:namedFor | foaf:Person

 

Formats

Drawing http://purl.org/dc/dcmitype/StillImage

Painting http://schema.org/Painting

Photograph http://schema.org/Photograph

Text http://purl.org/dc/dcmitype/Text

Object http://purl.org/dc/dcmitype/PhysicalObject
