The Orlando British Women's Writing Dataset Release 2: Writing, Biography, Bibliography

Release Notes

This dataset supersedes the Orlando British Women's Writing Dataset Release 1: Biography and Bibliography, 2019. That dataset with accompanying documentation is archived in the University of Guelph Borealis data repository: doi:10.5683/SP2/EOB9S6
See the previous release notes for more details. Interim versions of the dataset may be accessed through the GitHub repository.

Release 2 is a more comprehensive release of data that reflects substantial corrections and linking, remodeling and ontology corrections, and expansion of the dataset to include more writing relationships.

Introduction

This dataset provides a rich set of linked open data representing women's literary history from the beginnings to the present, concentrated on writing in English in the British Isles but with tentacles out to other languages, literary traditions, and parts of the world. It emerges from the ongoing experiments in literary history being conducted by The Orlando Project, whose textbase has been published and regularly updated and augmented as Orlando: Women's Writing in the British Isles from the Beginnings to the Present by Cambridge University Press since 2006, and from the Canadian Writing Research Collaboratory's work in Linked Open Data as a means of enabling digital scholarship and collaboration in the humanities.

The Orlando Textbase is a semi-structured collection of biocritical profiles providing detailed information on the lives and writing of more than 1400 writers with accompanying literary, social, and political materials to provide context to its representation of literary history. It does not contain digitized versions of primary texts.

The aim of extracting linked data from Orlando's textbase is to make the data accessible in new ways to discovery, querying, analysis, and visualization; to promote interlinking between Orlando and other related materials on the web; and to experiment with the potential of Linked Open Data technologies to support knowledge production and dissemination in the humanities.

This is the second release of the dataset, which is continuously being augmented and refined. This release is focused on the internal linking of biographical information using the CWRC ontology, with selective linking out to other ontology terms and other linked data entities. All of Orlando's bibliographical data, linked to Orlando authors, is also included in this release.

Using the data

CWRC uses an OWL RDF ontology schema, which means that data is represented in triple format (subject, predicate, object). The RDF is then exported to one of several formats.
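
As a minimal sketch of what triple format means in practice, the following represents a few assertions as (subject, predicate, object) tuples in plain Python and follows a link from one subject to another. All identifiers, including the ex: prefix and the predicate names, are hypothetical placeholders and are not terms taken from the CWRC ontology.

```python
# A triple is a (subject, predicate, object) statement.
# Every identifier below is a hypothetical placeholder for illustration.
from collections import defaultdict

triples = [
    ("ex:writer1", "ex:hasName", "Example Writer"),
    ("ex:writer1", "ex:hasBirthPlace", "ex:place1"),
    ("ex:place1", "ex:hasName", "Example Town"),
]


def index_by_subject(triples):
    """Group (predicate, object) pairs under each subject."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return dict(index)


def objects(triples, subject, predicate):
    """Return every object asserted for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


graph = index_by_subject(triples)

# Traverse the graph: writer -> birth place -> name of that place.
birth_place = objects(triples, "ex:writer1", "ex:hasBirthPlace")[0]
birth_place_name = objects(triples, birth_place, "ex:hasName")[0]
```

A dedicated RDF library would normally replace these helper functions; the tuples are used here only to show the shape of the data.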

Various tools and libraries exist to handle and produce Linked Open Data (LOD) in its various formats. See the Linked Data page for tools and libraries, including ones for Python, Java, and PHP. Each of these libraries allows one to load the RDF data and traverse it as a graph, which is useful in identifying links between data. More comprehensive ones, like Apache Jena for Java, allow you to embed the dataset within your application to run inference and queries over the data.

The data can also be queried and downloaded using the SPARQL endpoint available on this site. See the introduction to SPARQL and sample queries to get started.
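
A query can also be sent to an endpoint programmatically. The sketch below only builds the HTTP request, following the SPARQL 1.1 Protocol convention of passing the query text in a query parameter; it does not send anything. The endpoint URL is a hypothetical placeholder, and the query is a generic one that does not assume any CWRC terms.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL: substitute the address published on the site.
ENDPOINT = "https://example.org/sparql"

# Generic query that lists ten arbitrary triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
# To run the query, pass `request` to urllib.request.urlopen() and
# decode the response body with json.load().
```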

Research questions

The ontology contains a number of competency questions focused on use of the data for literary historical research. The introduction to the main CWRC ontology is helpful in understanding how the structure can elucidate the data. One key difference between this data and some other datasets is the large number of variables or data values: these are elucidated by consulting the CWRC ontologies and the other ontologies whose namespaces appear at the top of the dataset.

The dataset aims to illuminate the following types of questions. These could be approached by visualizations in a range of formats including charts, network graphs, geospatial maps, heat maps, trend visualization, and infographics.

Charts

For the research question, "What is the relevance of family size to religious affiliation, socioeconomic status, or other factors?", creating .csv files with SPARQL queries could produce a series of charts (bar, line, pie charts, etc.) that speak to this question. Making the process interactive so that the user can select particular factors would create a multi-dimensional visualization.
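
One way to produce such a .csv file, sketched here under assumptions, is to flatten a response in the W3C SPARQL 1.1 Query Results JSON Format into rows. The sample response, its variable names (religion, familySize), and its values are invented for illustration and do not come from the dataset.

```python
import csv
import io
import json

# Made-up response in the SPARQL 1.1 Query Results JSON Format.
# Variable names and values are invented for illustration only.
SAMPLE = json.loads("""
{
  "head": {"vars": ["religion", "familySize"]},
  "results": {"bindings": [
    {"religion": {"type": "literal", "value": "A"},
     "familySize": {"type": "literal", "value": "4"}},
    {"religion": {"type": "literal", "value": "B"}}
  ]}
}
""")


def results_to_csv(results):
    """Flatten SPARQL JSON bindings into CSV text, one row per solution."""
    variables = results["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)
    for binding in results["results"]["bindings"]:
        # Unbound variables (e.g. from OPTIONAL patterns) become empty cells.
        writer.writerow([binding.get(v, {}).get("value", "") for v in variables])
    return buffer.getvalue()


csv_text = results_to_csv(SAMPLE)
```

Leaving unbound variables as empty cells, rather than dropping the row, keeps the absence of an assertion visible in the resulting chart data.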

Network Graph

A network graph approach could investigate the kinds of biographical networks that connect British women writers to each other and to other writers. How extensive are kinship networks as opposed to networks based on political or religious affiliation? The networks in these graphs are dense so subsetting the data by historical period, connection to a particular place, or another shared attribute is recommended.
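
The subsetting step can be sketched as a filter applied to an edge list before any graph is drawn. The people, relation types, and period labels below are hypothetical; in practice they would come from query results.

```python
from collections import defaultdict

# Hypothetical edges: (person, relation type, person).
edges = [
    ("writerA", "kinship", "writerB"),
    ("writerA", "religious", "writerC"),
    ("writerB", "political", "writerC"),
    ("writerC", "kinship", "writerD"),
]

# Hypothetical attribute used to subset the network.
period = {
    "writerA": "1800s",
    "writerB": "1800s",
    "writerC": "1800s",
    "writerD": "1900s",
}


def subset(edges, keep):
    """Keep only edges whose two endpoints both satisfy keep()."""
    return [(a, r, b) for a, r, b in edges if keep(a) and keep(b)]


def count_by_relation(edges):
    """Count edges per relation type, e.g. kinship versus political."""
    counts = defaultdict(int)
    for _, relation, _ in edges:
        counts[relation] += 1
    return dict(counts)


nineteenth = subset(edges, lambda person: period.get(person) == "1800s")
relation_counts = count_by_relation(nineteenth)
```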

Map

A map visualization might represent all the geographic information relevant to a person in Orlando, the places associated with one person’s immediate network, or aspects of a particular ethnicity, religious denomination or geographical heritage.

Heat Map

A heat map approach could elucidate a question such as "Does social identity become more diverse over time?" by representing cultural forms as a series of heat maps that plot out social identity (religion, ethnicity, etc.) over time. A video such as Temperature Anomalies by Country 1880-2017 could inform such an approach.

Trend Analysis

Trend analysis could help answer the question, "How is the number of publications related to the number of children a woman writer had?" An example of this can be seen in the charts created by Karen Bourrier and John Brosz to support their claim “Women Writers Have Had Plenty of Babies” based on Orlando data in Slate. An even more ambitious take on this could use dynamic graphs, examples of which can be seen in videos of Hans Rosling's Gapminder visualizations such as 200 Countries, 200 Years.

Infographic

Google’s Knowledge Graph is the basis for a kind of infographic in which search results for well-known writers such as Emily Brontë are embedded, and it increasingly provides information on lesser-known ones such as L.E.L. However, such results are oriented towards the commercial web.

Orlando’s linked data can produce more detailed and nuanced infographics reflecting a scholarly perspective, or infographics about groups of writers connected by factors such as their authorship of a particular genre of literature.

Other approaches

See the Competency Questions in the introduction to the CWRC ontology for a fuller sense of the kinds of inquiry the data is designed to support.

Datasets

[Table with links to current data pending.]

Provenance

The datasets have been derived from the textbase of the Orlando Project, which explores and harnesses the power of digital tools and methods to advance feminist literary scholarship.

The dataset is based on the Orlando textbase, which comprises 1400+ author profiles, mostly on British women writers with some men and some international women writers; 13,000+ free-standing and 29,000+ embedded chronology entries; 30,000+ bibliographical listings; nearly 3 million XML tags and more than 9 million words exclusive of markup. The data is regularly updated and more profiles added.

The Orlando dataset employs several XML schemas that encode aspects of writers’ biocritical profiles and contextualizing information on the social, political, and literary context. The data elucidates the conditions of production and reception of texts, as well as the features of the texts themselves, from a recuperative critical perspective that considers gender and other intersecting social forces to be a salient factor in literary history.

The textbase employs several XML schemas:

Profile - for Orlando documents covering in detail the lives, literary careers, and oeuvres of individual writers
Event - for free-standing events
The Library of Congress MODS Schema - for bibliographical records

The semantic data structure of these schemas is the basis for the extraction process that has produced the linked dataset provided here.

Data extraction and transformation

The data extraction was guided by the CWRC ontologies, which can be retrieved in various formats. The separate ontologies are:

CWRC ontology - the main ontology
CWRC genre ontology - literary genres
Illnesses and injuries ontology - illnesses and injury classification based on the International Classification of Diseases

The code used in extraction can be found on GitHub.

The basic methodology for producing the RDF from the Orlando XML is to extract relationships from selected XPath locations in the document that have been mapped to specific relationships in the ontology.
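
This methodology can be illustrated with a minimal sketch that uses the limited XPath support in Python's standard library. The XML fragment, its element names, the mapping table, and the ex: predicates are all invented for illustration; they are not the Orlando schemas, the CWRC ontology terms, or the project's extraction scripts.

```python
import xml.etree.ElementTree as ET

# Invented XML fragment; element names are NOT the Orlando schema.
SAMPLE_XML = """
<profile id="writer1">
  <biography>
    <birth><place>Example Town</place></birth>
    <education><school>Example School</school></education>
  </biography>
</profile>
"""

# Hypothetical mapping from an XPath location to an ontology predicate.
XPATH_TO_PREDICATE = {
    "./biography/birth/place": "ex:hasBirthPlace",
    "./biography/education/school": "ex:attended",
}


def extract_triples(xml_text, mapping):
    """Emit one triple per element found at each mapped XPath location."""
    root = ET.fromstring(xml_text)
    subject = "ex:" + root.get("id")
    triples = []
    for xpath, predicate in mapping.items():
        for element in root.findall(xpath):
            triples.append((subject, predicate, element.text.strip()))
    return triples


extracted = extract_triples(SAMPLE_XML, XPATH_TO_PREDICATE)
```

Keeping the mapping in a table separate from the traversal code means that a mapping can be adjusted when a check against the source documents shows that a location yields misleading assertions.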

Data subsets

This draft dataset is available as a whole or in two subsets:

Biography, containing:
- Birth and Death - includes birth and death dates for writers for whom these are known, in some cases including birth order within family and cause of death;
- Cultural identities - contains information on the social identities associated with writers, ranging from language, religion, social class, race, colour, or ethnicity to nationality. Such identities shift both historically and at times within writers' lives;
- Family relations - information on the family members, including spouses, of writers, and at times information related to their occupations;
- Education - includes links to instructors, schools, subjects of study, and credentials earned;
- Friends - includes information about ties ranging from loose associations through close and enduring friendships to intimate relationships, both erotic and non-erotic;
- Leisure and Society - information on social activities;
- Occupations - covers both significant activities of and jobs held by writers;
- Political Affiliation - information on writers' political activities including their affiliations with particular groups or organizations and their degrees of involvement;
- Spatial activities - information on writers' residences, visits, travels, and migration to particular locations. Spatial data coordinates are granular to the level of settlement only; that is, they do not distinguish between different locations in the same place, such as London;
- Violence - information on writers' experiences of violence on a range of scales;
- Wealth - information concerning writers' poverty, income, and wealth;
- Health - information on writers' physical and mental health and illnesses;
- General biographical materials that don't align with the specific categories.
Bibliography:
- Standard bibliographic data about works published by the authors whose lives are described in the dataset, plus all works referenced in the Orlando textbase.
- Genre classification for the texts by women writers who have Orlando profiles.

Precision and Accuracy

This is an "open world" dataset, which is to say that the absence of an assertion does not indicate that the assertion is untrue.

The relationships between people represented in this dataset are based on inference from the XML. Results of the scripts have been read against key documents to ensure that the assertions that have been created in RDF are generally reasonable, based on the markup, and the scripts adjusted in response. However, not all results could be checked by human beings, and it is important to remember that the XML relationships were created by human beings producing discursive accounts of literary history and therefore seeking to tag notable features of the material they were writing, without awareness that this extraction would later take place. The result is that at times this dataset will produce misleading assertions based on how the data is extracted.

The most common cases are where a person mentioned within an XML tag used to create a relationship is tangentially rather than centrally involved in the relationship associated with that tag. For example, the discussion of Virginia Woolf's novel Orlando contains a mention of Woolf's diary, and that mention is tagged as a genre; as a result, Orlando is identified as a diary as well as a biography, drama, fiction, novel, history, masque, and mock form. Another example is that the name of a commentator or a contemporary witness, if mentioned and tagged in the XML within the prose of a tag, may be extracted and create a false assertion of a relationship between the subject of an Orlando profile and the commentator or witness, when they were being named in the document only as a source of information about the relationship. We have worked to eliminate such false assertions where we can do so systematically during the data extraction process, and plan to develop more sophisticated means of eliminating them, for example by removing asserted personal relationships between people whose lives did not overlap.
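
Users who want to screen out one class of false assertion themselves could apply a lifespan check of the kind mentioned above. The following is only a sketch under assumptions, not the project's method: the people and years are hypothetical, and unknown dates are treated as inconclusive, in keeping with the open-world character of the data.

```python
# Hypothetical birth and death years; None marks an unknown date.
lifespans = {
    "personA": (1700, 1760),
    "personB": (1740, 1800),
    "personC": (1850, 1920),
    "personD": (None, None),
}


def lives_overlap(a, b):
    """Return False only when both lifespans are known and disjoint."""
    (a_birth, a_death), (b_birth, b_death) = a, b
    if None in (a_birth, a_death, b_birth, b_death):
        # Open world: unknown dates cannot rule a relationship out.
        return True
    return a_birth <= b_death and b_birth <= a_death


def plausible(relationships, lifespans):
    """Drop asserted personal relationships between non-contemporaries."""
    unknown = (None, None)
    return [
        (x, y)
        for x, y in relationships
        if lives_overlap(lifespans.get(x, unknown), lifespans.get(y, unknown))
    ]


candidates = [
    ("personA", "personB"),
    ("personA", "personC"),
    ("personC", "personD"),
]
kept = plausible(candidates, lifespans)
```

A check like this suits personal relationships only; relationships of influence or commentary can legitimately connect people whose lives did not overlap.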

However, these and other factors also mean that not every relationship indicated by the XML has necessarily been extracted, particularly if such extraction would lead to a substantial number of false assertions in addition to true ones. The extraction scripts are available for consultation here.

Data is extracted so as to maintain links to the source data so that the provenance of each assertion can be examined.

Omissions and limitations

This dataset is not in any sense comprehensive, given that it is based on historical sources that are full of gaps and on selective inclusion of particular details. The markup from which the data is extracted is also on the more interpretive end of the spectrum, meaning that there are inconsistencies in application, even though encoders receive extensive training and all markup has been reviewed by a senior scholar.

This is the first complete release of linked open data from the Orlando Project representing all major entities and relationships in the biocritical dataset. We anticipate future iterations with further details including:

Some finer details of biographical and literary properties and relationships;
Freestanding events about other writers and historical contexts;
Scholarly notes.

In the meantime, such details are included in the full Orlando textbase.

Other limitations are related to the provenance of the dataset and the priorities of the Orlando Project itself. Specific genre information is present only for the works of women writers with profiles and not for all bibliographic records. In general, information on men writers and writers from outside Britain is less full than information on British women writers.

This is a 5-star LODset that includes dereferenceable URIs for all entities, with names, places, and organizations reconciled, where possible, against alternative authorities. Titles are also reconciled but less fully, in part because of the paucity of identifiers for literary Works. The data includes some blank nodes for Web Annotation components of the data. For many purposes, it may be useful to filter out components that are not needed and work with a subset of the data.

Key decisions and strategies

Regularization, disambiguation, and linking of data

Regularization was available in the dataset only for personal names and organizations. In other areas, such as for religions, it has been achieved through the ontology, wherein the skos:altLabel property indicates strings from the data that have been grouped together under the same concept.
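
The grouping of variant strings under one concept can be sketched as a lookup table built from preferred and alternative labels. The concept URIs and label strings below are hypothetical, and the dictionary stands in for skos:prefLabel and skos:altLabel values that would be read from the ontology.

```python
# Hypothetical concepts with a preferred label and alternative labels,
# standing in for skos:prefLabel and skos:altLabel values.
concepts = {
    "ex:ConceptA": {
        "prefLabel": "Concept A",
        "altLabels": ["A-ist", "Aism"],
    },
    "ex:ConceptB": {
        "prefLabel": "Concept B",
        "altLabels": ["B-ite"],
    },
}


def build_lookup(concepts):
    """Map every normalized label string to its concept URI."""
    lookup = {}
    for uri, labels in concepts.items():
        for label in [labels["prefLabel"], *labels["altLabels"]]:
            lookup[label.casefold().strip()] = uri
    return lookup


def regularize(string, lookup):
    """Return the concept URI for a source string, or None if unmatched."""
    return lookup.get(string.casefold().strip())


lookup = build_lookup(concepts)
matched = regularize(" AISM ", lookup)
unmatched = regularize("something idiosyncratic", lookup)
```

Returning None for unmatched strings, rather than guessing, mirrors the decision to leave idiosyncratic values as strings.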

In the case of place names, the results of automated matching (of the combination of settlement name and its associated region and geo-political unit) have been reviewed for accuracy, so geospatial coordinates should be accurate for places down to the level of the settlement or populated place. We use Geonames identifiers (and provide their spatial coordinates) for most places, supplemented by Getty place names where needed. These matches support mapping. However, we recognize that historical shifts and political contestation will in some cases make the labels associated with particular past and present places problematic.

While it is highly desirable to convert strings of text within the XML markup to things, that is, to defined, dereferenceable entities with URIs (Uniform Resource Identifiers) or PIDs (Persistent Identifiers), this is not possible or indeed desirable in all cases, given the nature of the source data. Where it has been possible to link to an external ontology or to regularize vocabulary within the data and create LOD instances of terms within the CWRC ontology, this dataset has done so. However, in some cases the data has been so idiosyncratic or heterogeneous that such regularization was not possible. This is often the case for a few instances within a larger subset of instances that have been regularized, such as religion. In the case of education, there were many outliers that have not been created as linked data instances. There are thousands of occupations, which are at present represented as strings, but a subset of common occupations grouping terms together is forthcoming.

Notable features

Contexts are a key feature of this dataset, which uses the Web Annotation Data Model to provide links from particular entities to their discursive contexts. Annotations with an oa:identifying motivation indicate all entities mentioned in a particular discursive context, so one can group mentions of a certain person or persons by Context type. Annotations with an oa:describing motivation include as annotation bodies the writer who is being described.

Events provide content with temporal and geospatial locators amenable to timelines or mapping. We draw on the Simple Event Model, extending it to indicate separately from date values themselves the degree of certainty associated with an event's date, and providing typing of events specific to our domain.

Spatial data, for instance for place of death, is often but not always regularized to Geonames or the Getty Thesaurus of Geographic Names identifiers.

Bibliography data is structured using the BIBFRAME RDF schema supplemented by a couple of terms from schema.org.

License

This dataset is made available under a CC-BY-NC license. If you make use of this dataset, we would appreciate being informed of this at cwrc@ualberta.ca.

Dataset Contributors

Jeffery Antoniuk. University of Alberta. Orlando Project programmer and systems analyst. Assisted in preparation and extraction of dataset.

Susan Brown. University of Guelph. Project lead. Produced extraction guidelines and guided process.

Joel Cummings. Responsible for overseeing and contributing to technical work on the ontology and making key design decisions to support extraction from a variety of sources.

Jasmine Drudge-Willson. University of Guelph. Research Assistant. Responsible for researching and modeling structural and theoretical aspects of external ontologies, and how they relate to the aims of the CWRC ontology.

Hannah Stewart. University of Guelph. Research Assistant. Responsible for definition refinements within the CWRC/GENRE ontologies and supporting ontology modelling.

Abigel Lemak. University of Guelph. PhD student in Literary Studies. Overall project management, as well as drafting ontology terms and definitions, particularly those that deal with Cultural Forms.

Kim Martin. University of Guelph. Helped with project management. Produced sample triples for testing accuracy of extraction.

Alliyya Mo. University of Guelph. Wrote the extraction scripts for cultural form extraction and much of the rest of the biography data. Mined instance data for cultural forms, genres, religions, political affiliations. Responsible also for much ontology refinement reflecting the extraction process.

Michaela Rye. University of Guelph. Undergraduate research assistant. Responsible for tag cleanup as well as disambiguating and drafting ontology terms, particularly those that deal with Occupations.

Gurjap Singh. University of Guelph. Co-op student. Responsible for initial extraction of birth, death, and family data from Orlando data. Queried Geonames API to get URIs for locations in Orlando.

Thomas Smith. University of Guelph. Undergraduate research assistant. Responsible for tag cleanup as well as disambiguating and drafting ontology terms, particularly those that deal with geospatial data, educational awards, literary awards, and Occupations.

Deborah Stacey. University of Guelph. Associate professor in the School of Computer Science. Helped coordinate the process. Wrote scripts for extracting cause of death triples.

See also the Orlando Project credits and the CWRC Ontology Credits.
