(6.1.1) XSL - Characteristics, Status, and Potentials for Text Processing Applications in the Humanities
Wendell Piez
Mulberry Technologies, USA
What XSL Is
The Extensible Stylesheet Language (XSL) is a specification currently being finalized (May 2000) by the World Wide Web Consortium (W3C), the vendor consortium that proposes recommendations for web standards including HTML, CSS and now XML and its related technologies. XSL's immediate purpose is to support various kinds of presentation of arbitrarily marked-up documents in XML format. In an XSL system, any well-formed XML document could be formatted for print, displayed in hypertext (including on the web), or presented in other media, more easily and more effectively than is currently the case, and in a standards-based way. In a networked environment, processing documents for display on screen could happen on either server or client.
In order to support its task of presenting XML (that is, applying to an arbitrary tag set a formatting description for a user interface such as screen or printer), XSL evidently has to provide for granular access to markup structures, so as to be able, for example, to derive tables of contents, text for running heads, indexes, and other common (presentational) expressions of underlying document architecture. In the course of working out XSL it became increasingly clear that (as is often the case with computer data processing problems) this problem was more easily, and more powerfully, addressed, if it was treated as a special case of a more general capability, namely the "transformation" of one markup structure into another.
Accordingly, XSL is formally divided into two parts:
"XSL Transformations" (XSLT): on a standalone basis, provides a language to describe many of the kinds of rearrangement and filtering of markup structures that a reasonably powerful XML presentation language requires.
"XSL Formatting objects" (XSLFO): provides a vocabulary for describing, in a standard and abstract way, formatting of text for visual display in print or on screen (and possibly for alternative media presentation).
XSLT was accepted as an "official" W3C recommendation in November 1999. XSLFO is expected to be completed in mid-2000. The relative maturity of the specifications is reflected in the tools available: already by the end of 1999 a number of tools supporting XSLT were available on the Net for free. As of this writing, tools supporting XSLFO (which could, for example, convert XML via XSL into PDF format) are less mature.
XSLT, on the other hand, was instantly in use by mid-1999, primarily (though not exclusively) as a way of converting XML into HTML. Because XML becomes instantly useful as soon as HTML can be reliably created out of it, this has in effect jump-started the XML presentation industry, at the price of keeping on-line published versions of XML source documents limited to the capabilities of HTML, the current state of the art in browsing on the web. As a result, even before the ink is dry, we are beginning to get a sense of XSLT's capabilities for processing - while at the same time we are still unclear as to what XSL's own "design language" (its formatting objects) will look like.
XSLT's Capabilities
-Presentational XSLT
XSLT is already used to convert XML into HTML. In this, it is a ready
alternative to a scripting approach (Perl, Omnimark etc.) or to the ISO
standard DSSSL - and easier to learn than either. It also compares favorably
in price: tools for XSLT conversions are free.
-Analytic XSLT
XSL processing is dependent on markup in the source text for navigation
as opposed to (say) character offsets or line numbers. While very good
at presenting information encoded in markup, it is not good at recognizing
or construing implicit information such as character patterns. It does
no tokenizing, hence cannot recognize "word" boundaries. By default, string
processing and matching in XSLT is case-sensitive, and cannot readily be
configured otherwise.
Somewhat surprisingly, however, XSL is nevertheless useful for certain kinds of analytical functions, including certain kinds of XML validation (cf. Rick Jelliffe's "Schematron"). For example, one could write an XSL stylesheet that would check the conformance of an instance of a TEI Header to a certain model, that went beyond the DTD to specify element dependencies - for example, reporting a warning (or providing defaults) if the publication statement were not filled out according to house standards. This could be done with a stylesheet and would not require altering the DTD.
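The kind of house-rule check described above can be sketched as follows. This is only an illustration in Python, not an XSLT stylesheet and not Schematron itself; the rule (a publication statement must carry a filled-in publisher and date), the function name `check_publication_stmt`, and the sample header are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rule: every <publicationStmt> must contain a
# non-empty <publisher> and a non-empty <date>. A DTD can require the
# elements to be present, but not that they are actually filled in.
REQUIRED_CHILDREN = ("publisher", "date")

def check_publication_stmt(xml_text):
    """Return a list of warnings for a TEI-like header given as a string."""
    root = ET.fromstring(xml_text)
    warnings = []
    statements = list(root.iter("publicationStmt"))
    if not statements:
        warnings.append("no publicationStmt found")
    for stmt in statements:
        for name in REQUIRED_CHILDREN:
            child = stmt.find(name)
            if child is None or not (child.text or "").strip():
                warnings.append(f"publicationStmt lacks a filled-in <{name}>")
    return warnings

# Illustrative header: the date element is present but empty.
SAMPLE_HEADER = """<teiHeader><fileDesc>
  <titleStmt><title>Sample</title></titleStmt>
  <publicationStmt><publisher>Example Press</publisher><date/></publicationStmt>
</fileDesc></teiHeader>"""

if __name__ == "__main__":
    for warning in check_publication_stmt(SAMPLE_HEADER):
        print("WARNING:", warning)
```

As in the stylesheet approach, the rule lives outside the DTD, so tightening or relaxing house standards means editing the check rather than the document model.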
And because it can perform testing on strings, XSLT can also be used for generating rudimentary concordances. A concordancer in the form of an XSL style sheet will be demonstrated as a part of this presentation.
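A rudimentary concordance of the kind meant here might look like the sketch below. It is written in Python for illustration and is not the stylesheet to be demonstrated; the element name `l`, the context width, and the sample text are assumptions. Like string tests in XSLT, it matches case-sensitive substrings and does no tokenizing.

```python
import xml.etree.ElementTree as ET

def concordance(xml_text, keyword, element="l", width=15):
    """Return (unit number, context) pairs for each case-sensitive hit.

    Units are the elements named `element`, numbered in document order,
    so navigation relies on markup rather than on character offsets.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for number, elem in enumerate(root.iter(element), start=1):
        text = "".join(elem.itertext())
        start = text.find(keyword)
        while start != -1:
            left = text[max(0, start - width):start]
            end = start + len(keyword)
            right = text[end:end + width]
            hits.append((number, f"{left}[{keyword}]{right}"))
            start = text.find(keyword, start + 1)
    return hits

# Illustrative source: three verse lines in a line group.
SAMPLE = "<lg><l>The rose is red</l><l>A Rose by any name</l><l>rose and rose again</l></lg>"

if __name__ == "__main__":
    for number, context in concordance(SAMPLE, "rose", width=6):
        print(number, context)
```

Because matching is on substrings, a search for "rose" would also report "primrose"; separating the two requires either markup of words in the source or an external tokenizing function.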
The thing to keep in mind about any computer-facilitated analytic work is that, without being supported by information from an external source (such as a thesaurus of terms or a morphological dictionary), no algorithm is able to reveal something about a text that is not implicit in the text already. That is, while a computer can rearrange information in a text, and therefore perform such operations as counting incidences or providing indexes, it cannot actually add any "knowledge". What it does, is present a text, and information derived about the text, in such a way that a careful reader can come to conclusions about it that would otherwise be very difficult to demonstrate. This is merely to point out that, for example, a concordance is not an analysis, and by itself makes no argument, although it may facilitate the development of one.
XSLT-based analytic work is no different, and since XSLT is not designed specifically with analytic work in mind, it is in some respects an unexpected benefit if it can support this work at all. Even given its fairly rudimentary capabilities, however, XSLT has certain incidental advantages:
1. It leverages investments made in markup:
Many repositories have XML texts, or texts readily convertible into
XML. These are all ready for XSL processing, and can be enhanced to support
more sophisticated processing.
2. It produces "publishable" results as a natural work product:
Since the end result of an XSL transformation can be HTML or an XML
format ready for further processing, it is easy to generate results in
a form that can be displayed as is.
3. An investment in XSL is worth making for other reasons:
Since XSLT processors are so inexpensive (free), the real investment
is in time to learn it. And XSLT is so portable and versatile, it pays
off this investment in expertise fairly quickly.
4. It can be combined with other methods:
An XSLT stylesheet can also be used to prepare XML texts for other
kinds of work. An XSLT stylesheet can generate COCOA encoding from XML,
that can be used to support TACT or another tool that takes advantage of
COCOA markup of events in a text stream (such as chapter breaks or shifts
in narrative voice). [An XSL stylesheet that creates COCOA markup from
an XML TEI source can be demonstrated.]
Or, an XSLT stylesheet can be used to derive SVG (Scalable Vector Graphics) files from descriptive XML source. SVG is a graphics format which is expected, by some, to revolutionize distribution of graphics for certain kinds of applications on the web. Graphic representations of phenomena accessible to XSL transformations can already be displayed in prototype SVG viewers. [SVG frequency distribution graphs of strings in an XML file can be demonstrated.]
These are only two examples of ways XSLT can be applied to help prepare XML texts for a variety of further uses. The basic principle being applied is a layered architecture: the source data is maintained in a stable format, such as TEI XML, useful over the long term. Applied "on top" of this repository layer a separate process can expose a "view" or presentation of the source data (some readers may be familiar with the "model-controller-view" model of computer application design), ready for the special format requirements of an arbitrary application.
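The COCOA case mentioned above gives a sense of how thin such a "view" layer can be. The following is a minimal sketch in Python rather than XSLT; the element names (`text`, `title`, `div`, `p`), the category letters `T` and `C`, and the sample document are assumptions for illustration, not the mapping used in the demonstration.

```python
import xml.etree.ElementTree as ET

def to_cocoa(xml_text):
    """Flatten a TEI-like text into lines carrying COCOA-style references."""
    root = ET.fromstring(xml_text)
    lines = []
    title = root.findtext("title")
    if title:
        lines.append(f"<T {title.strip()}>")
    for div in root.iter("div"):
        # Each division becomes an event in the text stream; the
        # category letter C (for chapter) is an illustrative choice.
        lines.append(f"<C {div.get('n', '?')}>")
        for para in div.findall("p"):
            # Drop inline markup and normalize whitespace.
            lines.append(" ".join("".join(para.itertext()).split()))
    return "\n".join(lines)

# Illustrative source document with two numbered chapters.
SAMPLE = """<text><title>A Tale</title>
<div type="chapter" n="1"><p>It began   quietly.</p></div>
<div type="chapter" n="2"><p>It ended <hi>loudly</hi>.</p></div>
</text>"""

if __name__ == "__main__":
    print(to_cocoa(SAMPLE))
```

The XML source is left untouched; the COCOA file is a derived product that can be regenerated whenever the source or the mapping changes.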
Role Of XSL/XSLT In The Future
- Possibilities for XSL extension:
The XSL specification also provides allowance for its extension. Extension
functions, in Java or an alternative scripting language, could be made
available to an XSL processor. Tokenizing functions, sophisticated string
processing and matching, database-integration services (for retrieving
data such as morphological variants or checking values against an authority
list) could all be addressable, given a good API, from within XSL stylesheets.
It is unlikely, however, that such extensions (at least, those especially suited for the types of analysis academic humanists are interested in) would be developed in the private sector - not that they would be without profitable application there. But academic researchers, with clear focus on their own functional requirements, have to lead the way.
-An XSL browser as "analytical engine":
XSL's potentials in these respects suggest that it could play a role
in the markup-aware "analytical engine" that many of us keep envisioning
(cf. the ELTA initiative). An XML browser that supported XSL stylesheets
could be integrated with an editing environment allowing on-the-fly emendation
of the stylesheets, and/or the extension functions they call. Stylesheets
and function libraries could be pulled "off the shelf," or written especially
to address local problems and questions. Specialized functions would have
the capability of integrating XSL's presentation/analytical capabilities
with other tools such as databases or network applications.
Not only would such a system be very versatile; also, in it, research results could take the form of ready-made publishable material, in HTML or any other markup-based form. Since it would basically be an XML web browser, it could also be readily networked, especially as concerns the XML source text (the text under analysis), which could be located anywhere on the Internet. Analytical stylesheets in XSL would be portable and applicable to any text that conformed to the same (sufficiently constrained) document model.
Present Advantages [as of the end of 1999]
-XSL tools are freely available:
As of this writing, free XSLT processors are available in Java, and
are not difficult to set up and run. Learning the stylesheet language itself
is the biggest barrier to entry, and there are free and inexpensive resources
for this as well.
-XSL is easy to get going with:
By design, XSL is a declarative language, abstracted at a fairly high
level. As a result, it is not difficult to learn, at least for most ordinary
operations, and is very portable (making it easier to learn from others'
work).
Present Disadvantages [as of the end of 1999]
-XSL is somewhat arcane:
Although the rudiments of XSL are not difficult, some users take to
it less easily than others. It is a "functional" and "declarative" language
unlike most scripting languages, so expertise in other computer languages
is not readily applicable to it. Naïve users seem to have less trouble
learning it than experts. The model of the text on which it operates, the
"document tree," although it leverages document markup in a very simple
and powerful way, is not a self-evident approach to developers used to
looking at text as a stream of characters.
-XSL processing is XML-based; requires well-formed XML to start:
Obviously, XSL requires an XML text to operate on. Either this is a
problem, or it isn't.
-Tools are rudimentary (although improving):
Strong support for internationalization, for example, is envisioned
by the specification but not yet widely implemented in interfaces or tools.
As mentioned above, it is unlikely that the private sector would, on its own initiative, develop function libraries that would provide for all the kinds of functions wanted by scholars in the Humanities. (Some, like support for sorting texts in major European and Asian languages, can be hoped for, although not necessarily for free.)
Conclusions
-What XSL will be good for:
Presentation, filtering/rearrangement, markup-based processing such
as indexing supported by markup. Some kinds of validation. Especially extended
or in combination with other methods, XSL will also be capable of supporting
sophisticated analytical functions on text marked up in XML.
-What the emergence of XSL tells us about our markup projects:
(6.1.2) Metainformation Strategies for Electronic Resources
Susan Schreibman
University College Dublin, Eire
This paper will address the theoretical and practical issues in devising and implementing a project-specific metainformation scheme for electronic resources. While one can argue that a scheme like the Text Encoding Initiative provides for encoding which greatly enhances plain text retrieval, in practice without extensive use of the keyword or indexing elements, retrieval of information is limited to what is explicit in the text. Searching for what is explicit in the text, even if that text has been encoded logically (as opposed to physically), does not provide the kind of functionality most humanists expect from digital archives.
This paper then is an exploration of the advantages and disadvantages in creating a meta-meta information or classification scheme for electronic resources. For this talk I will draw heavily on theoretical models (both pre- and post-computer indexing models) from library and information studies. I will also adopt the position that creators of electronic resources are encoding their primary material in an SGML or XML-based metainformation scheme, such as the Text Encoding Initiative. I will also assume that the project directors have already made certain specific decisions in encoding what is explicit in the text in accordance with the project's goals. In other words, I am assuming that a digital project is already taking advantage of the tagging structure afforded in a scheme like the TEI in providing for the encoding of titles of text, place, personal, geographic and organisation names, etc., as deemed important to a particular project.
There can be no doubt that this type of tagging greatly enhances retrieval, for example by distinguishing the occurrence of WB Yeats as a title as opposed to a personal name, or facilitating the searching of all strings within a <placename> element. And although this type of encoding of electronic resources gives users unprecedented access in locating very specific strings of text, in practice users are frustrated by limited and relatively simplistic search and retrieval strategies. In most electronic resources, users are limited to retrieving only what is explicit in the text, i.e. strings of text, some of which have been encoded logically. In the case of images, the situation is even more problematic. Unless a project has developed a header consisting of detailed metainformation, most images can only be retrieved by image title. Boolean and proximity searches go a very small way in solving the problem of retrieving more than single word searches, but do not provide the conceptually and theoretically rigorous searches most scholars in the humanities want and expect from electronic resources.
Specifically, this paper will address the practical and theoretical issues raised by devising a classification or indexing scheme which facilitates search and retrieval by going beyond encoding what is explicit in the text. To this end, several points will be raised:
To this end, the rest of the paper will be divided into three parts.
Part I will provide an overview of some of the major metainformation schemes
which were developed in a pre-digital environment, such as AACR2, the Dewey
Decimal Classification, and the Library of Congress Subject Headings. Topics
to be covered will include:
The third part of this paper will explore metainformation schemes devised for several specific digital archives, including The Blake Archive and The Thomas MacGreevy Archive, both published at the Institute for Advanced Technology in the Humanities at the University of Virginia. In the case of the Thomas MacGreevy Archive, I will demonstrate how we, working within the TEI, developed a metainformation scheme which facilitated very specific genre searching for both texts and images.
(6.2.1) On Mock-scholarly, Faux-casual, and Cod Philosophy: Patterns of Diachronic Change in a British Newspaper
R. Harald Baayen
University of Nijmegen, The Netherlands
Antoinette Renouf
University of Liverpool, UK
Introduction
In British newspapers, various 'vogue' prefix-like forms regularly appear (Renouf & Baayen, 1998; see also Baayen & Renouf, 1996, for standard affixation). This study considers three such forms: MOCK, COD, and FAUX. These forms are primarily used to modify nouns, as in
(1) he explodes in mock-outrage
chain belts, studded with faux gems around the hips
a quick snipe at the cod-mysticism
and to modify adjectives, as in
(2) a charmingly mock modest touch
coming back on stage, faux sheepish
a Russian night club - cod-glamorous
Occasionally, one finds instances of adverbial modification,
(3) Rabelais at his most mock heroically cloacal
ordered him out of the house, faux-crossly
and for MOCK, examples also exist of verbal modification:
(4) he mock bows, gallantly
he has to mock-apologise for his tedious bleating
This study addresses the productivity of MOCK, COD, and FAUX by investigating their use in a British newspaper, The Independent, from a diachronic perspective. To this end, we extracted all occurrences of these forms from a corpus of this newspaper, compiled from 1988 onwards, and currently containing some 360 million word forms. For each of the years 1989-1998, we measured the type and token frequencies of these words in 4 successive 3-month chunks. For 1988, the last 3-month chunk was also taken into account. The questions to be addressed are whether changes in the frequency with which these vogue forms are used across time can be observed, and whether the forms with a hyphen (as in FAUX-SHABBY) reveal different patterns from the forms without a hyphen (as in FAUX SHEEPISH), which would suggest that syntactic context would co-determine the productivity of these combining forms.
Results
Figure 1 will summarize the results obtained. The solid lines and the dots represent the numbers of tokens counted for the successive chunks. The dashed lines represent the numbers of new types observed across sampling time. Both line types were obtained using a non-parametric regression smoother (Cleveland, 1979). For reasons of space, we defer discussion of the type counts to the presentation at the conference. The left panels represent the forms without a hyphen, the right panels the forms with the hyphen.
The top panels of Figure 1 show the results obtained for MOCK and MOCK-. The left panel reveals a slow but steady increase for the token counts (r = 0.388, t(39) = 2.6294, p = .0122). The right panel suggests that the use of MOCK- did not change in the last 10 years (r = 0.194, t(39) = 1.237, p = .2234). A closer investigation of the MOCK data revealed that the increase in the number of tokens is primarily carried by the lowest-frequency types - the highest-frequency types have a relatively stable use across the years.
The central panels of Figure 1 concern COD (left) and COD- (right). For both forms, we observe a reliable increase over time, which appears to be more linear for COD- (r = 0.656, t(39) = 5.437, p < .0001) than for COD (r = 0.473, t(39) = 3.357, p = .0018). For both COD and COD-, an autocorrelation analysis indicates a reliable correlation at short lags, suggesting that the extent to which a form is used in a given month is co-determined by the extent to which it was popular or unpopular in the preceding months.
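The t values reported for the token counts follow from the correlation coefficients by the standard relation t = r * sqrt(df) / sqrt(1 - r^2), with df = 39 (41 three-month chunks minus 2). The sketch below, in Python, simply recomputes them as a consistency check. The function name `t_from_r` is ours, and small discrepancies are to be expected because r is reported to three decimals only.

```python
import math

def t_from_r(r, df):
    """t statistic for testing a Pearson correlation r against zero."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)

# (r, reported t) pairs as given in the text for the four token series.
REPORTED = {
    "MOCK": (0.388, 2.6294),
    "MOCK-": (0.194, 1.237),
    "COD-": (0.656, 5.437),
    "COD": (0.473, 3.357),
}

# 4 chunks for each of 1989-1998, plus the last chunk of 1988.
CHUNKS = 4 * 10 + 1
DF = CHUNKS - 2

if __name__ == "__main__":
    for form, (r, reported_t) in REPORTED.items():
        print(f"{form:6s} r={r:.3f} t({DF})={t_from_r(r, DF):.3f} reported={reported_t}")
```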
The bottom panels of Figure 1 suggest that FAUX and FAUX- were hardly used initially, but that in the second half of the sampling period they enjoyed greater productivity. Both non-parametric regression lines and parametric change point analysis suggest a breakpoint around the end of 1992, after which FAUX and FAUX- began to become more and more productive. What we probably are witnessing here is the birth of what may eventually become a fully-fledged new prefix of English.
Intriguingly, all three left panels show a (local) minimum at this point in time (marked in the plots by a vertical dotted line). Possibly, the general fashion for vogue modification with any of the near synonyms MOCK, COD, and FAUX reached an all-time low around the end of 1992, from which both MOCK and COD recovered. Note that Figure 1 shows that both forms show an increase in use during the immediately following chunks. This general fashion for vogue modification may have led in its wake to the upsurge in productivity of FAUX and FAUX-.
Conclusions
We have observed different patterns of diachronic change through an investigation of vogue modification in The Independent: a steady state (MOCK-), a linear increase over time (MOCK, COD-), a slightly oscillating pattern (COD), and a birth pattern (FAUX). The forms of MOCK- (with hyphen) show a clearly different pattern than the forms with MOCK (without hyphen), which shows that forms with and without hyphen should not be lumped together a priori. The syntactic contexts that favor hyphenation (e.g., modification of a prenominal noun or adjective) lead to slightly different diachronic patterns. Thanks to the accumulation of large diachronic computerized corpora, it is, at the end of the second millennium, finally becoming possible to directly monitor ongoing language change.
References
Baayen, R. Harald and Renouf, Antoinette (1996). Chronicling
The Times: Productive Lexical Innovations in an English Newspaper. Language
72, 69-96.
Cleveland, W.S. (1979). Robust locally weighted regression and
smoothing scatterplots, JASA 74, 829-836.
Renouf, Antoinette and Baayen, R. Harald (1998). Aviating among
the hapax legomena: Morphological grammaticalisation in current British
newspaper English. In A. Renouf (ed) Explorations in Corpus Linguistics.
Rodopi, Amsterdam. 181-189.
(6.2.2) Primroses and Power: a Study on Linguistic Excellence in Political Discourse
Anna Loiacono
Angela Maria D'Uggento
Barbara Cafarelli
Rosaria Romita
Università degli Studi di Bari, Italy
The aim of this paper is to focus on a linguistic corpus in order to detect, by statistical methods and computational tools, either correspondences or contrasts between the hypotheses made by the linguist and the applied procedures made by the software experts.
The corpus is selected from nineteenth century Victorian speeches; in particular, Parliamentary and political speeches delivered by Benjamin Disraeli, both as a member of Parliament and as Prime Minister.
Analogous types of research have been recently carried out [see Labbé, Bolasco, Lebart, Salem, 1995] with reference to contemporary politicians such as de Gaulle, Mitterrand and the Italian Berlusconi. In fact, computational research in political discourse is a highly significant new branch of criticism, which contributes to the modern notion of political contest by revealing some otherwise hidden truths. Nonetheless, we should say that, in most cases, where contemporary politicians are concerned, the use of "ghost writers" for writing public speeches definitely undermines that language structure/personality relationship which justifies work in textual analysis.
Dealing with 19th century England, we have considered Victorian politics as being the first true "arena" for party politics and as constituting fundamental principles for political dialectics.
The main source of reference was the collection of Selected Speeches by T. E. Kebbel in two volumes, borrowed from the Parliament Library in Rome. Disraeli's personal letters written to Lady Bradford and Lady Chesterfield also constitute a valuable source of reference, in that they unveil psychological traits and personal idiosyncrasies of the statesman. Moreover, in classical biographies we have found revealing hints of modernity and wisdom that we think useful to compare to our times. In a further perspective, we would aim at setting up a series of parameters which might characterise English political discourse, defining also its specific contexts by comparing corpora (for example Tory vs Whig discourse or Conservative vs Radical, etc.).
Some intriguing questions have led us to pursue this goal, bearing in mind the old lessons of rhetoricians concerning oratory and trying to see them in the light of modern standard rules: how great is the analogy existing between oratorical qualities, original style and words, phrases and discourse markers more or less unconsciously uttered by an individual? By which standards can this analogy be assessed? And what is the "benchmark" for moral evaluative judgement, when we use a meta-linguistic code to filter the quintessence of discourse? Should, for instance, the most frequent items adhere in meaning and collocation to their speaker's actual socio-political presuppositions? And to what degree should their collocation be found in accordance with given parameters? Moreover, what can computational analysis discover beyond or in contrast to all previous evaluative criticism on the same subject to which so many scholars have contributed in the course of centuries?
We have tried to answer these questions by examining, through data processing, which linguistic marks might define or confirm the features for linguistic excellence.
Our stylometric study is concerned with the Victorian century, an age rich in syntactical perfection and lexical complexities.
We have selected thirty-four speeches given by Disraeli between 1830 and 1870, and have processed them with the software programme "Lexico 2" and with SPSS. The corpus has 205,800 occurrences (see Notes), with a type/token ratio of 9.83% (also known as a measure of vocabulary richness) and a Guiraud's coefficient G=44.61.
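The two richness measures quoted here are simple functions of the number of occurrences N and the number of different forms V: the type/token ratio is V/N and Guiraud's coefficient is V divided by the square root of N. The sketch below, in Python, shows both and checks that the published figures are mutually consistent. Since V itself is not stated in the text, it is inferred from N and the type/token ratio, which is an assumption of the example; the function names are ours.

```python
import math
from collections import Counter

def type_token_ratio(v, n):
    """Number of different forms divided by total occurrences."""
    return v / n

def guiraud(v, n):
    """Guiraud's coefficient: V divided by the square root of N."""
    return v / math.sqrt(n)

def lexical_stats(tokens):
    """Return N, V and V1 (hapax count) plus both measures for a token list."""
    counts = Counter(tokens)
    n = len(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    return {"N": n, "V": v, "V1": v1,
            "TTR": type_token_ratio(v, n), "G": guiraud(v, n)}

# Figures from the text; V is inferred, not reported.
N_REPORTED = 205_800
TTR_REPORTED = 0.0983
V_INFERRED = round(TTR_REPORTED * N_REPORTED)

if __name__ == "__main__":
    print("inferred V:", V_INFERRED)
    print("Guiraud G: %.2f" % guiraud(V_INFERRED, N_REPORTED))
    print(lexical_stats("the cat and the dog and the bird".split()))
```

With V inferred in this way, G comes out within a few hundredths of the reported 44.61; the remaining difference reflects the rounding of the type/token ratio to two decimals.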
In table 1, for all the sub corpora in the decades, we show the number of occurrences (N), forms (V) and hapax (V1), and the word with maximum frequency for each decade.
Hapax forms (words used only once) constitute nearly 50% of the total number of different words, a ratio that marks a highly varied though integrated vocabulary.
In table 2, we show a list of the fifty most important words considered as truly connotative of Disraeli's mind and world, displayed according to total decreasing frequency, making a distinction between the decades, the absolute frequency (F) and the normalized occurrences (x 1000 words).
Moreover, we have disambiguated words (textual forms) in order to obtain the correct rank and we have made a distinction, for some forms, between pronouns, nouns, verbs, adjectives, adverbs and conjunctions. The choice was made to build up the so-called lexical universe of Disraeli's discourse: the chosen words (i.e. power, principle, democracy, oligarchy, empire, etc.) have first been seen as collocates and, for them, we have built their lexical universe, in order to define specific contexts and co-texts through the relationship of proximity and distance.
We have patterned some meaningful words from a diachronic point of view, marking the evolution of his political thought throughout his long parliamentary activity and detecting their characteristic trends, and we have also carried out a factor analysis for some groups of words, clustered according to some semantic categories (i.e. self, key words, generic words, geography, negative words, etc.).
What clearly appears by means of statistical exploitation is as follows:
What might appear strange is the zero occurrence of the word "Judaism", even considering the very low presence of all vocabulary concerned with race and religion. He supposedly divided with care his choice of language genres (novel, essays and speeches) according to some well defined goals to pursue.
The high occurrence of generic words, i.e. those with a neutral connotative value, seems to confirm some critical judgements expressed by his opponents, according to which Disraeli's political commitment was in most cases supported only by generic principles of a well re-constructed political heritage, and only to a lesser degree did he offer detailed policies on various occasions.
A possible consequence that originates from this analysis and supports the original hypothesis made is that Disraeli's political success and his prolonged prominence as a first-rate politician (even when in opposition) was mainly due to his gifts of linguistic excellence and flamboyant oratory. From a different perspective, his Italian-Jewish ancestry adds an odd and mysterious flavour to his success, in the pure aristocratic context of Victorian England.
Notes
According to an empirical criterion, a corpus can be considered wide enough if it is greater than 200,000 words; moreover the ratio V/N (vocabulary, or number of different words, divided by total occurrences) is 9.8% (when this ratio is over 20%, the corpus is not to be considered wide enough). See Bolasco, Analisi Multidimensionale dei dati. Carocci Editore. Roma. 1999. p.203.
References
Blake, R. (1978). Disraeli. Methuen & Co. Ltd. London.
Bolasco, S. (1999). Analisi Multidimensionale dei dati.
Carocci Editore. Roma.
Bolasco, S. (1994). L'individuazione di forme testuali per
lo studio statistico dei testi con tecniche di analisi multidimensionale,
SIS Atti della XXXVII Riunione Scientifica 1994, Vol.2, pp.95-103.
Bolasco, S. (1995). Criteri di lemmatizzazione per l'individuazione
di coordinate semantiche. In S. Bolasco and R. Cipriani (eds)
Ricerca qualitativa e computer: teorie metodi e applicazioni.
Franco Angeli, Milano. pp.87-111.
Bolasco, S. (1997). L'analisi informatica dei testi, in L. Ricolfi,
La ricerca qualitativa, NIS, pp.165-202.
Kebbel, T. E. (1882) Selected speeches of the Late Earl of
Beaconsfield. 2 vols.
Holmes, D. I. (1985). The analysis of Literary Style-A review.
J.R.Statistical Society, part 4, pp.328-41.
Labbé, D. (1990). Le vocabulaire de François
Mitterrand. Paris, Presses de la FNSP.
Labbé, D., Hubert, P. La structure du vocabulaire de
Général De Gaulle. JADT 1995. Atti Analisi Statistica
dei Dati Testuali, vol.II CISU. Roma.
Lebart, L., Memmi, D. (1984). Analisi dei dati testuali: applicazione
al discorso politico. SIS Atti della XXXII Riunione Scientifica 1984,
Vol. I, pp. 27-41.
Lebart, L., Salem, A. (1994). Statistique Textuelle.
Paris, Dunod.
Lebart, L., Salem, A. Berry, L. (1998). Exploring textual
data. Kluwer Academic Publishers.
Levinson, S. C. (1995). Pragmatics. Cambridge Textbooks
in Linguistics. Cambridge University Press.
Monypenny and Buckle, G.E. (1910-20). The life of Benjamin
Disraeli Earl of Beaconsfield. 6 vols.
Oakes, M. P. (1998). Statistics for Corpus Linguistics. Edinburgh
University Press.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford
University Press.
Smith, P. (1996). Disraeli. A brief life. Cambridge University
Press.
Tweedie, F. J., Baayen, R. H. (1998). How variable may a constant
be? Measures of Lexical Richness in Perspective. Computers and the Humanities,
32 (5): 323-352.
Zetland, Marquess of (ed) (1929). The Letters of Disraeli
to Lady Bradford and Lady Chesterfield. 2 vols.
(6.2.3) The Statistical Analysis of Style: How Language Means in Beckett
C.W.F. McKenna
The University of Newcastle, Australia
This paper will analyse narrative style in selected fiction of Beckett (French and English versions). It will use computational stylistics for a formalist discrimination of patterns of language in the texts and will thus begin with a descriptive base. Previous published work by Burrows, Love, Craig, Holmes, Forsyth, Tweedie, Baayen, and Smith has shown the strength of computational procedures, particularly in such areas as the identification of authorship, genre, period, and character. This present work builds upon these foundations by examining issues in narrative theory and translation theory, with particular reference to Bakhtin's ideas on the way different discourses interact. Whilst Bakhtin recognised "the positive influence of Formalism" (1986: 169) he exposed its limitations, recognising the need to extend analysis to broader cultural issues. If, as Bakhtin argues, all utterance is ideologically governed and can never be neutral then the differentiations in language patterns revealed in our work cannot be construed as merely linguistic phenomena. That would be to rest with the formalist approach that Bakhtin wished to move beyond. The cultural questions arise as soon as one applies Bakhtinian concepts. Beckett is highly appropriate for such an investigation given the complexity, subtlety, and significance of his narrative experiments. When Beckett came to consider an English version of Molloy, which had been written in French in 1947 and published in 1951, he talked of producing a "new" text. To ask in what sense the text might be "new" is to open the large question of what can be and what cannot be achieved in translation - a question whose implications range from the immediate practical realities of searching for the nearest equivalent of a given word to the philosophical issues pertaining to language and how it means. 
At the philosophic level, translation raises ontological questions, the very questions raised by Molloy himself early in the text: "But my ideas on this subject were always horribly confused, for my knowledge of men was scant and the meaning of being beyond me" (52). The reference to a distinctive phrase in Heidegger's work alerts us to the link between the problem of translation and central ideas in Beckett's trilogy: commentators such as O'Hara argue, for example, that Beckett's work 'could almost be seen as a literary exploration of Heideggerian metaphysics', and that Beckett's fundamental inquiry in the trilogy centres around the question of how language means. This question then becomes refocussed through consideration of what can be achieved in translation. Heidegger's reservations about the way in which translation violates meaning in the source text appear eventually to be taken up by Beckett, who is reported as stating during a London rehearsal of Endgame: "The more I go on the more I think things are untranslatable" (Cockerham 144). This issue of 'untranslatability', Steiner argues, 'is founded upon the conviction, formal and pragmatic, that there can be no true symmetry, no adequate mirroring, between two different semantic systems' (1975: 239). Expanding this argument concerning 'semantic dissonance', Steiner writes that 'Because all human speech consists of arbitrarily selected but intensely conventionalized signals, meaning can never be wholly separated from expressive form. Even the most purely ostensive, apparently neutral terms are embedded in linguistic particularity, in an intricate mould of cultural-historical habit. There are no surfaces of absolute transparency.'
Patterns of prepositionality or conjunctivity, such as emerge in our analyses as a feature both of Beckett's English and French versions, impact upon our understanding of each text's meaning. How might these patterns influence our reading of Beckett? Does Beckett provide one work of literature called Molloy, or does that title mask two works of literature? How different is it really to read Beckett in French as opposed to reading him in English? These questions have not, I believe, been addressed in quite the way that this project proposes. The closest work is that of Opas (1995), who has studied translations of Beckett's How It Is and All Strange Away into Finnish, German, and Swedish. Using the University of Toronto's TACT program as the basis for her computational analysis, and applying the postulate of van Leuven-Zwart (1989, 1990) that 'if there are enough consistent changes between a text and its translation on the microstructural level, it will affect the macrostructure of the text also', she has provided evidence of the ways in which common words influence syntactic structures and of how translations of them can influence the meanings we read in a text.
The research data used in this paper derives from analyses of word frequencies, using established statistical techniques (e.g. principal component analysis, t-test, Mann-Whitney test). In order to produce this computational evidence texts are first prepared for the computer programs in accordance with protocols developed by Burrows. Frequency counts are established for each of the 99 most common words in the texts. These counts are standardized to allow for the variations in the total size of each section of text and each count is correlated with every other count so as to produce a matrix (using the Pearson product-moment method of correlation). We also use a technique of multivariate statistics known as principal component analysis and plot the results so as to show the relationships between the variables in the data. The plots show which words behave most like each other and which sections most resemble each other in their word-frequency patterns.
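The pipeline described above can be illustrated with a minimal sketch. It is not the Burrows protocol itself: it assumes that "standardized" means a rate per 100 tokens, it approximates only the first principal component (by power iteration on the correlation matrix), and every function name is hypothetical. It also assumes that each word's rate varies across sections, since the correlation is undefined otherwise.

```python
import math
from collections import Counter


def top_words(sections, n=99):
    """Return the n most frequent word types across all sections."""
    counts = Counter(token for tokens in sections for token in tokens)
    return [word for word, _ in counts.most_common(n)]


def relative_frequencies(sections, vocabulary):
    """One row per section: occurrences of each word per 100 tokens (assumed standardization)."""
    table = []
    for tokens in sections:
        counts = Counter(tokens)
        table.append([100.0 * counts[w] / len(tokens) for w in vocabulary])
    return table


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(table):
    """Correlate every word (column of the table) with every other word."""
    columns = list(zip(*table))
    return [[pearson(a, b) for b in columns] for a in columns]


def first_component(matrix, iterations=200):
    """Approximate the dominant eigenvector of the matrix by power iteration."""
    vector = [1.0 / (i + 1) for i in range(len(matrix))]
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    return vector
```

Given each section as a list of word tokens, `correlation_matrix(relative_frequencies(sections, top_words(sections)))` yields the word-by-word matrix, and the entries of `first_component` indicate which words pull together along the strongest axis of variation.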
The significance of this evidence is further tested by using distribution tests such as the t-test and Mann-Whitney test, which assess whether the variations in the data occur at a level of probability that statisticians would deem likely to be an effect of chance or a significant outcome. These procedures enable identification of the words that discriminate significantly - in computational terms - between narrative styles. The discriminating words will also be examined through "scatter plots" which generate the scatter of values for each word in each section of the text. This procedure will reveal how sporadically or consistently each word discriminates in a particular comparison.
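As an illustration of the two distribution tests named above, the following sketch computes only the test statistics for one word's rates in two groups of sections. The choice of Welch's unequal-variance form of the t-test is an assumption (the abstract does not say which variant is used), the function names are hypothetical, and turning either statistic into a probability would still require the appropriate reference distribution.

```python
import math


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a's value exceeds b's, ties counting one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

A word whose rates in one narrative style lie wholly above its rates in the other gives the extreme value of U (the product of the two sample sizes), which is the pattern a consistently discriminating word would show in the scatter plots.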
Although this research therefore begins with computational evidence, it will move from the quantifiable data to consider the literary significance of a word's use in context. As McCarty (1996) writes, "no tool is 'just a tool' but is an agent of perception and means of thinking". Common words are significant because they point to the larger linguistic structures in which they participate. With Beckett's work, investigations of stylistic differentiation show how translated texts maintain in the second language similar kinds of discriminations as those operating in the first language. The present work on a range of Beckett's early, middle, and late fiction extends previously published work by McKenna, Burrows, and Antonia on Molloy (1999) and on the trilogy (1999 forthcoming).
References:
Bakhtin, M. (1981). The Dialogic Imagination. Univ. Texas, Austin.
Bakhtin, M. (1986). Speech Genres and other late essays. Univ. Texas, Austin.
Burrows, J. F. (1987). Computation into Criticism: a study of Jane Austen and an experiment in method. Clarendon Press, Oxford.
Burrows, J. F. and Hassall, A. J. (1988). "Anna Boleyn and the authenticity of Fielding's feminine narratives", Eighteenth-Century Studies, 21 (1988) 427-53.
Burrows, J. F. and Love, H. (1999). "Attribution Tests and the Editing of Seventeenth-Century Poetry", The Yearbook of English Studies, 29 (1999) 151-70.
Burrows, J. F. and Sussex, L. (1997). "Whodunit?: Literary Forensics and the Crime Writing of James Skipp Borlase and Mary Fortune", Bibliographical Society of Australia and New Zealand Bulletin, 21 (1997) 73-106.
Burrows, J. F. (1997). "Style". In E. Copeland and J. McMaster (eds). The Cambridge Companion to Jane Austen, Cambridge Univ Press. 170-88.
Butler, L. St John (1984). Samuel Beckett and the meaning of being: a study in ontological parable, Macmillan, London.
Craig, D. H. (1992). "Authorial Styles and the Frequencies of Very Common Words: Jonson, Shakespeare, and the Additions to The Spanish Tragedy", Style 26 (1992) 199-220.
Craig, D. H. (1999). "Authorial Attribution and Computational Stylistics: If you can tell authors apart, have you learned anything about them?". Literary and Linguistic Computing 14 (1999) 103-13.
Holmes, David I. (1998). "The Evolution of Stylometry in Humanities Scholarship", Literary and Linguistic Computing, 13 (1998) 111-17.
Holmes, David I. and Forsyth, R. S. (1995). "The Federalist revisited: New Directions in Authorship Attribution", Literary and Linguistic Computing, 10 (1995) 111-27.
Leuven-Zwart, K. Van (1989). "Translation and Original. Similarities and Dissimilarities I". Target 1.2 (1989) 151-81.
Leuven-Zwart, K. Van (1990). "Translation and Original. Similarities and Dissimilarities II". Target 2.1 (1990) 69-95.
McCarty, Willard. "What is humanities computing? Toward a definition of the field", <http://ilex.cc.kcl.ac.uk/wlm/essays/what/what_is.html>
McCarty, Willard. (1996). "Peering through the Skylight: towards an electronic edition of Ovid's Metamorphoses". In Susan Hockey and Nancy Ide (eds) Research in Humanities Computing 4: Selected papers for the ALLC/ACH Conference, Oxford 1992. Clarendon, Oxford. 240-262.
McKenna, C. W. F., and Antonia, Alexis (1994). "Intertextuality and Joyce's 'Oxen of the Sun' episode in Ulysses: the relation between literary and computational evidence", Revue Informatique et Statistique dans les Sciences humaines 30 (1994) 75-9.
McKenna, C. W. F. (1996). "'A Few Simple Words' of Interior Monologue in Ulysses: Reconfiguring the Evidence". Literary and Linguistic Computing 11 (1996) 55-66.
McKenna, C. W. F., Burrows, J. F., and Antonia, A. (1999). "Beckett's Molloy: computational stylistics and the meaning of translation". In M. Ramsland (ed) Variété: Perspectives in French Literature, Society, and Culture. Peter Lang, Frankfurt. 79-91.
O'Hara, Mary. Unpublished thesis, quoted in Butler (1984) 7.
Opas, L.L. and Kujamaki, P. (1995). "A cross-linguistic study of stream-of-consciousness techniques", Literary and Linguistic Computing 10 (1995) 287-91.
Smith, M. W. A. (1990). "Attribution by statistics: a critique of four recent studies". Revue Informatique et Statistique dans les sciences humaines, 26 (1990) 233-51.
Steiner, G. (1975). After Babel. Oxford UP, London.
Tweedie, F. and Baayen, H. (1998). "How variable may a constant be? Measures of lexical richness in perspective". Computers and the Humanities 32 (1998) 323-52.
(6.3.1) Border Crossing: Engineers, Papyrologists, and the Graphical User Interface
Melissa Terras
Oxford University, UK
The discovery of the writing and stylus tablets from Vindolanda, a Roman Fort built in the late 80s AD near Hadrian's Wall at modern day Chesterholm, has provided a unique resource regarding the Roman occupation of northern Britain and the use and development of Latin around the turn of the first century AD. However, although papyrologists have been able to transcribe and translate most of the writing tablets, the meaning of the majority of the stylus tablets remains obscure. These small wooden plaques with a hollow recess which was once filled with wax and written on by means of a metal stylus are well preserved; however, the wax has deteriorated leaving a wooden surface with small strokes where the stylus pen has occasionally breached the wax covering. Making sense of such minute indentations by eye alone has proved impossible for papyrologists, and the problem is compounded by the fact that many tablets were reused again and again leaving the remains of many texts on the damaged surfaces.
An EPSRC funded project at the Department of Engineering Science, University of Oxford, was initiated two years ago to analyse these tablets and develop new image processing techniques to retrieve information from small incisions in damaged surfaces (the techniques developed being applicable to a wide variety of engineering problems). Some headway has been made using wavelet filtering to remove woodgrain in images of the stylus tablets, and developing and appropriating Phase Congruency and Shadow Stereo techniques to identify candidate writing strokes. However, until some interface is put into place between these engineering techniques and the papyrologist, the stylus tablets' texts will not become decipherable because of the complex nature of the papyrology process and the difficulties encountered in expecting a non-mathematical user to utilise such algorithms.
This paper gives a brief background to the project, before discussing the steps taken and problems encountered whilst developing the computer application and user interface for this system. Focussing on the interaction between historians, papyrologists and the application developer, and the design of the graphical user interface, the paper discusses the benefits and problems surrounding such a multi-disciplinary project.
(6.3.2) Trajan's Column: Building a WWW Image-Database
Geoffrey Rockwell
Gretchen Umholtz
Michele George
Martin Beckmann
Paul Barrette
McMaster University, Canada
One of the challenges when designing image databases in the humanities is to provide interfaces that are drawn from the object being represented rather than from the technology used. In this paper we will discuss an image database of over 400 images that was built by a team of graduate students and faculty at McMaster University. The images relate to carving technique on Trajan's Column, one of the most extensive and important surviving sculptural monuments of ancient Rome. In addition to various text-based search tools, we also created a cartoon sketch interface to the entire frieze of Trajan's Column that allows users to move around a virtual column to access the images and to see them in narrative context. This virtual column is an example of a content driven interface to an image collection that is based on the subject matter of the collection.
Trajan, Roman emperor from AD 98 to 117, has generally been remembered as a 'good' emperor. A gifted general, he was close to his soldiers and led them on a number of campaigns which substantially expanded the Roman empire. Two of these campaigns, the first and second Dacian wars, are commemorated on Trajan's Column, a 100-foot-tall marble column decorated with a continuous, upwardly spiraling, sculpted frieze. The scenes on this frieze are an invaluable source of information about many aspects of Roman military practice, imperial ideology, and also sculptural technique. The column still stands today in its original place in the Forum of Trajan in Rome, but, owing to its great height, the visibility of individual scenes from the ground is quite limited; students and scholars must often rely on old photographs and plaster casts for the study of many details.
In 1997 a team of graduate students and faculty at McMaster University were given access to a unique collection of over 400 slides taken by Peter Rockwell and Claudio Martini when Trajan's Column was being restored and had scaffolding around it. The images are primarily of details that show carving technique. In addition we were given access to a series of cartoon sketches of the entire column that show the narrative of the frieze. The challenge was to create an image database that would provide a context for these photographs so that scholars and students of Roman art could use these resources easily. The WWW site is located at <http://cheiron.humanities.mcmaster.ca/~trajan/>.
In this paper we will do the following:
1. Discuss the background of the project.
This project was initiated by Peter Rockwell who was looking for a venue for making his collection of slides available to the research and education community. While he had published on the column, the full collection of images on which his research was based could not be published economically. A WWW site seemed the most effective way to make this collection available. The team was limited by the fact that the slide collection was in Rome, Italy and could not be removed to Canada for digitization. In this paper we will discuss the steps taken to digitize the images on site and build the database from information gathered by Peter Rockwell as part of his research.
2. Discuss the technical design of the WWW site.
After initially experimenting with a design that used hundreds of static WWW pages we developed a dynamic database that could generate the pages for the individual images on the fly. This gave us the freedom to easily alter the interface of the image pages and make site-wide changes in real time. This in turn allowed us to refine the design of the site in an interactive fashion as suggestions came in and users reported back. Finally the design of the database allowed editors involved in the project to make corrections over secure connections. In the paper we will comment on the virtues of moving to dynamically generated information for projects that involve teams of people, continuous correction and iterative interface design. We will contrast the development of this site with two previous projects, the Emily Pauline Johnson Archive (http://www.humanities.mcmaster.ca/~pjohnson/home.html) and the Cradle of Collective Bargaining site (http://www.humanities.mcmaster.ca/~cradle/). Both of these other sites use static WWW pages to hold the images. We will also comment on the choice of image resolutions that we settled on. In the Trajan site we make available small thumbnail images within WWW pages for timely access and links to medium and high resolution copies of each image and cartoon. This allows the user to look at and download images suitable for teaching and research.
3. Discuss the interface problems we faced and the solutions we settled on.
The Trajan Project provides three ways for users to navigate the images. The first is to search the image database. The second is to use indexes generated from the database, which give the user an overview of the keywords used. The third and most visual is to navigate the column using the cartoon sketches and click on the links to individual images. In effect we have two means of access that are driven by the technology and one that is driven by the content of the database. In the paper we will discuss at length the need for content-driven interfaces that give users access to images in ways that correspond to the original artifact. Content-driven interfaces cannot be standardized because they vary according to the artifact(s) represented, but they provide a way of organizing information that the user familiar with the area can use. In our case we used the column as a whole as a visual index to the collection. The visual index not only provides access but creates a context for the images which would be hard to abstract from the database. This visual context augments the interpretative essays that introduce the site.
4. Conclude by discussing uses and benefits of the site.
As part of this project we have been showing the site to interested academics to get their feedback on the interface and to get ideas about how it might be used. As many of the images were taken to document carving techniques used in the sculpting of the frieze, this site is of greatest use to those interested in Roman carving. Many of the images, however, contain enough detail to be useful to those interested in Roman sculpture in general or those interested in studying Roman military equipment and activities. The WWW site provides an economical alternative to print for research image collections to be made available. In addition, the cartoons provide a continuous sketch of the narrative of the column that can be downloaded and annotated by those interested in this depiction of the events around the Dacian wars.
Viewers invited to try the site have commented on how they could use the high-resolution images and cartoons for teaching purposes since they are freely available and easily accessed. Comments of this kind have led us to think about the image database not only as a coherent research and teaching collection, but also as an accessible collection of free images that can be mined by those who are shifting to electronic teaching tactics. Whenever you make a digitized collection of images available you are also providing the equivalent of academic clip art, which can be reused in ways that are only remotely connected to the original content. This unintended outcome of the digitization of images will be discussed in the conclusion of the paper.
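The dynamically generated image pages and the database-derived keyword indexes discussed under points 2 and 3 can be sketched as follows. This is an illustrative sketch only, not the project's implementation: the table layout, the three demonstration records, the file-naming scheme for thumbnail, medium and high-resolution copies, and all function names are assumptions made for the example.

```python
import sqlite3
from html import escape


def build_demo_database():
    """Create an in-memory database with invented records (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE images (id INTEGER PRIMARY KEY, title TEXT, scene INTEGER, keywords TEXT)"
    )
    conn.executemany(
        "INSERT INTO images VALUES (?, ?, ?, ?)",
        [
            (1, "Chisel marks on a shield", 24, "chisel;shield"),
            (2, "Drill work in foliage", 24, "drill;foliage"),
            (3, "Unfinished shield rim", 40, "shield;unfinished"),
        ],
    )
    return conn


def render_image_page(conn, image_id):
    """Generate the page for one image on request instead of storing a static file."""
    row = conn.execute(
        "SELECT title, scene, keywords FROM images WHERE id = ?", (image_id,)
    ).fetchone()
    if row is None:
        return "<html><body><p>No such image.</p></body></html>"
    title, scene, keywords = row
    return (
        "<html><body>"
        f"<h1>{escape(title)}</h1>"
        f"<p>Scene {scene}. Keywords: {escape(keywords.replace(';', ', '))}</p>"
        f'<img src="thumb/{image_id}.jpg" alt="{escape(title)}">'
        f'<p><a href="medium/{image_id}.jpg">medium</a> '
        f'<a href="high/{image_id}.jpg">high resolution</a></p>'
        "</body></html>"
    )


def keyword_index(conn):
    """Build the keyword overview: each keyword mapped to the images that carry it."""
    index = {}
    for image_id, keywords in conn.execute("SELECT id, keywords FROM images ORDER BY id"):
        for keyword in keywords.split(";"):
            index.setdefault(keyword, []).append(image_id)
    return dict(sorted(index.items()))
```

Because every page is produced by the one `render_image_page` template, a change to that function alters the presentation of all image pages at once, which is the advantage over hundreds of static pages noted above.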
Bibliography
Arnheim, R. (1969). Visual Thinking. Berkeley.
Baecker, R. M., et al. (1995). "Vision, Graphic Design, and Visual Display." In Ronald M. Baecker et al. (eds) Human-Computer Interaction: Toward the Year 2000. 2nd ed. San Francisco.
Bertin, J. (1983). Semiology of Graphics: Diagrams, Networks, Maps. Trans. W. J. Berg. Madison, Wisconsin.
Cichorius, C. (1896, 1900). Die Reliefs der Trajanssäule. Berlin.
Lepper, F. and Frere, S. (1988). Trajan's Column. Gloucester.
Marcus, A. (1991). Graphic Design for Electronic Documents and User Interfaces. New York.
Mitchell, W. J. (1992). The Reconfigured Eye: Visual Truth in the Post-Photographic Era. Cambridge, Massachusetts.
Monti, P.M. (1980). La Colonna Traiana. Rome.
Settis, S., et al. (1988). La Colonna Traiana. Rome.
Rockwell, P. (1985). "Preliminary Study of the Carving Techniques on the Column of Trajan". Marmi Antichi Studi Miscellanei 26. pp. 101-111.
Rockwell, P. (1993). The Art of Stoneworking: a Reference Guide. Cambridge.
Tufte, E. (1983). The Visual Display of Quantitative Information. Cheshire, CT.
Tufte, E. (1990). Envisioning Information. Cheshire, CT.
(6.3.3) Digital Images for Online Distance Education in the Humanities
Sue Tickner
Richard Hooker
Francis Halsall
University of Glasgow, UK
This paper will describe the design and development of a web-based distance education course in Art History at the University of Glasgow. In planning such courses the interplay between new media, distance education and the existing systems in a campus-based institution raises numerous issues which need to be resolved before such a course is feasible. Not least of these are the obstacles still inherent in accessing and delivering online images: in terms of locating the desired resources, copyright and costing implications, institutional agreements with other bodies such as the CLA and its digitising licence and dual-mode delivery both on and off-campus. We suggest that, two years on from Dan Greenstein's call for "new and creative ways to manage relations between those who 'own' rights in image resources and those who have an interest in acquiring access to them" (1) no simple methodology, clear guidelines for use or satisfactory search interface yet exists.
In common with other distance education developments in recent years, the work has been carried out in close partnership between the academic department concerned and GUIDE (Glasgow University Initiative in Distance Education). Following successful small-scale, campus-based attempts to implement new educational methods in teaching Art History, and in response to encouragement within the department, the Level 2 class convenor approached GUIDE for advice on potential distance options. Accordingly, in keeping with institutional policies and national pressures to widen participation (in particular flexible, distance and part-time provision), a 'flagship' course is to be offered at Level 2 of an undergraduate degree course from January 2001.
The course will consist of two 30 credit modules (a total of 600 notional learning hours). One module will be on Scottish Art and the other on Art, Culture and the Avant-Garde. On the understanding that development work is never right first time, the initial module will be run once, closely evaluated, analysed and amended where appropriate. If this proves to meet our goals in the second presentation it will serve as a template, or model, for the design of the second module. A record of the history of the production of the course is being kept, to show how and why decisions to use particular methods of delivery and/or technology are made. It is intended that such a record will contribute to decision-making in the design of similar online distance education courses.
This development work constitutes a new departure, raising research issues in several areas. One aspect of this is in the innovative nature of these plans, and their place within the existing setting. For GUIDE, and indeed the University as a whole, this will be one of the first ventures into part-time distance provision at undergraduate level.
If the course proves a success in pedagogical terms and there is sufficient demand, we envisage the development of future courses, with the long-term aim of offering a distance education route through access to postgraduate provision. Achieving this aim requires a long-term commitment to developing other part-time distance courses that articulate with these modules in a cross-disciplinary degree structure within the Humanities. Ultimately, it requires the existence of sufficient comparable courses to enable a broad range of studies on the basis of part-time distance learning, in any subject, and at any level across the institution. As a flagship undergraduate distance course, it must also meet University requirements to an exemplary degree, and specifically those Quality Assurance guidelines being developed by GUIDE for distance education. The template we produce must therefore be of the highest possible quality, whilst remaining reasonably adaptable to future requirements. The development experience must inform and enable GUIDE to better assist future projects employing digital resources.
GUIDE has been working with a lecturer in the History of Art department to develop the first module on which the template will be based. In the first year, the module will run parallel to the face-to-face one, although it will cover different themes and use very different methods. It is not simply a distance version of the existing campus-based module. The process of designing a new distance learning course is very different from that involved in attempts to translate existing teaching materials into distance learning format. Existing Level-2 provision is heavily reliant on lectures. It is partly in order to investigate more exploratory, active learning approaches that this course has been conceived. It will be largely, though not entirely, web-based, and make use of group and peer-learning methods. We see the medium of online distance learning as enabling the transformation of the relationship between student, lecturer and the sources of learning materials - a transformation which, according to constructivist advocates of networked learning, will be necessary for the success of these methods.
In the development of this course, therefore, we have sought to identify the main features of this transformation with particular reference to art history. We have designed the course with a view to minimising the negative effects of losing face-to-face contact and maximising the benefits of technology-assisted distance education. We do this with the understanding that this constitutes not merely a compromise, but a fundamental change in what it means to teach, learn and communicate.
The fit with existing provision is only one of the issues raised. Another results from these different aims for the distance modules. The teaching of the discipline of art history - and indeed the discipline itself - would have been unthinkable without the photographic transparency. Although it is exponentially more portable than a work of art, the transparency is also exponentially less portable than its digital equivalent. The new methods and web-based delivery will only add value to lecture-based provision if the technology is harnessed to provide easy, 'on-demand' access to source materials, especially digital resources. Ironically, the emergence of the technology for high quality digitisation of images promises a utopia of accessibility whilst simultaneously aggravating the problem of copyright. It has sometimes seemed as though the complexity of copyright law has resulted in a situation where the ability to access and use images is more restricted now than before digitisation.
Other complex issues in the course design include the need to retain and extend successful features of the on-campus provision in Art History without competing too heavily for recruitment from the full-time student population (especially in the early stages when the options for continuation by these methods will be severely limited), and generic issues in online teaching and learning, especially those affecting developments in the Humanities. The effects of these considerations will be demonstrated in the design and evaluation of the course which will, by July 2000, be preparing for delivery.
Some suggestions which might facilitate the use of visual resources in teaching, drawn from our experience, will be offered to digital image service providers. Specific complications we encountered will be explored and those solutions we have found described. We hope these will help to ease the path for future users of these exciting resources in Humanities teaching.
Reference
(1) Greenstein, Daniel (1997). Digital Images and Virtual Scholarly Collections. Abstract, Joint International Conference of the Association for Computers and the Humanities and the Association for Literary & Linguistic Computing, Queen's University, Kingston, Ontario, CANADA, June 3 - 7, 1997 <http://www.qucis.queensu.ca/achallc97/papers/s008.html>
(6.4.1) Books into Bytes: The "Deutsches Wörterbuch" on CD-ROM and on the Internet
Ruth Christmann
University of Trier, Germany
I. Starting position, targets:
Jacob and Wilhelm Grimm's "Deutsches Wörterbuch" (DWB) comprises the most extensive documentation of the German language. Its outstanding position is confirmed by its history: for more than one hundred years - the longest period of publication for a German dictionary - generations of lexicographers have contributed about 350,000 entries to the DWB, which is divided into 16 volumes (bound as 32), containing a total of 67,744 columns.
The DWB reflects more than one hundred years of political, cultural, and institutional history. Moreover, it shows the influence of varying preferences of numerous philologists concerning practical lexicography as well as changed insights into philology and linguistics.
Digitizing the DWB not only means preserving the outstanding achievements of German lexicography but also opens up new possibilities in using the rich dictionary material. Since November 1998, a project at the University of Trier has been creating a computerized version of the DWB to be published on CD-ROM and also made available via the Internet. It is intended to provide user-friendly search and display software that offers the best possible opportunities for data retrieval. It will as a result appeal to anyone interested in the German language. In this way, the poor situation of electronic dictionaries of German, when compared internationally, will be decisively improved.
II. Technical issues:
Taking into account the development of international standards for text encoding, the TEI Guidelines are used for structured markup of the dictionary. This prepares the way for the production of the CD-ROM version by means of special SGML tools. The starting point for the application of CoST (Copenhagen SGML Tool) is our pool of SGML-encoded files, which have been created from the TUSTEP (TUebingen System of Text Processing Programs) data and validated by an SGML parser. CoST is a general-purpose SGML post-processing tool. It is a structure-controlled SGML application, that is, it operates on the element-structure information-set (ESIS) representation of SGML documents. CoST provides a flexible set of low-level primitives upon which sophisticated applications can be built. These include a powerful query language for navigating the document tree and extracting ESIS information, an event-driven programming interface, and a specification mechanism which binds properties to nodes based on queries. On the one hand, CoST generates a set of HTML pages for displaying the dictionaries in traditional web browsers; on the other, it transforms the SGML data into Tcl/Tk command scripts for the graphical user interface of the CD-ROM. Tcl, the Tool Command Language, is a very simple programming language. It provides basic language features such as variables, procedures, and control structures, and runs on almost any modern operating system, including Unix, Macintosh, and Windows 95/98/NT. Tk is a Tcl extension, written in C, designed to give the user a relatively high-level interface to the windowing environment. Finally, CoST is used to set up a database that contains all the information from the dictionary entries needed to query the different components that might interest those studying the German language from its very beginnings, such as etymology, language, and quotations, together of course with traditional full-text retrieval.
The database is accessible from both platforms: the web browsers connect to it via CGI scripting, and the CD-ROM GUI uses an integrated Tcl interface.
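The idea of binding presentation properties to nodes of the document tree and generating HTML from them can be sketched in a few lines. The sketch below is a stand-in written in Python with its standard XML parser; it is not CoST, which is Tcl-based and works on SGML. The element names are borrowed loosely from the TEI dictionary tag set and are assumptions about the project's encoding, and the sample entry, including its quotation and author name, is invented.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented sample entry; element names are assumptions, not the DWB encoding.
SAMPLE = (
    "<entry><form><orth>Baum</orth><gramGrp><gen>m.</gen></gramGrp></form>"
    "<cit><quote>ein Baum im Garten</quote>"
    "<bibl><author>Beispielautor</author></bibl></cit></entry>"
)

# Binding of element names to the HTML written before and after their content.
STYLE = {
    "entry": ("<p>", "</p>"),
    "orth": ("<b>", "</b>"),
    "gen": (" <i>", "</i>"),
    "quote": (" <q>", "</q>"),
    "author": (' <span class="author">', "</span>"),
}


def to_html(element):
    """Walk the tree depth-first, wrapping each node in the HTML bound to its name."""
    open_tag, close_tag = STYLE.get(element.tag, ("", ""))
    parts = [open_tag, escape(element.text or "")]
    for child in element:
        parts.append(to_html(child))
        parts.append(escape(child.tail or ""))
    parts.append(close_tag)
    return "".join(parts)


def render_entry(xml_text):
    """Parse one encoded entry and return its HTML rendering."""
    return to_html(ET.fromstring(xml_text))
```

Keeping the bindings in a single table means that a second table, for example one emitting interface commands rather than HTML, could be driven by the same tree walk, which mirrors the way one encoded source serves both the web and the CD-ROM presentation.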
III. Software demonstration:
The software demonstration will present the way in which valuable information can be extracted from an electronic version of the DWB by using full-text retrieval, links, and a database that facilitates complex queries.
The possibility of using different retrieval options enables the user of a CD-ROM or Internet version of the DWB to search for certain phenomena in up to 33 substantial volumes, independently of headwords. The information hidden within the different entries is made even more explicit in various ways: via hyperlinks, the index volume will be connected with the references appearing in the dictionary. The information needed in order to quote from the sources of the DWB can thus be accessed very easily via pop-up windows. As is common in electronic dictionaries, a prepared list of headwords will make it possible to look up any headword simply by activating the corresponding link and, moreover, to access individual parts of the longer articles separately, following the original organization of the entries. Specific information such as the grammatical gender of headwords or sublemmata, or the occurrence of certain words in quotations from certain authors or literature, will be obtainable from a database generated from the TEI-compliant markup mentioned above.
One of the major aims of the project is, however, not only to show the user different means of making use of the DWB for lexicographical, historical, or linguistic studies and to present the DWB in an appealing way, but to increase the use of the DWB in general, i.e. as a book in several volumes to be read by those interested in the German language and the history of German words. For such purposes, it is absolutely necessary to give the user easy access to the electronic version of the DWB, to the information stored within the entries, and to the printed version of the DWB as presented on screen via PostScript files. Encoding the entries according to the TEI Guidelines is as important as presenting the characters of the different languages exactly as they appear in print.
The procedure of digitizing the DWB is very closely connected to that of one of the other retrodigitization projects at the University of Trier, Digital Middle High German Dictionaries Interlinked. As the Middle High German Dictionaries are quoted very often within the DWB, it is also intended to include links to the electronic version of these dictionaries. Their CD-ROM is meant as a prototype for the retrodigitization of other historical dictionaries, thus the presentation of the Dictionaries Interlinked may conclude the software demonstration to show how the DWB will look in a final state.
At the time of the conference, at least two major volumes of the DWB including the index volume will be fully encoded, converted into a CD-ROM, and accessible for searches. Questions to be discussed when presenting the DWB in its prospective digital version might focus on the application of SGML/TEI to a dictionary as heterogeneously structured as the DWB, on the necessity for developing new entities for character representation, and the importance of a digital DWB for future research.
Literature:
Christmann, Ruth/Hildenbrandt, Vera/Schares, Thomas (forthcoming, 2000): Digitalisierung des Deutschen Woerterbuchs von Jacob und Wilhelm Grimm, Hg. von Nicolas Castrillo Benito et al. TUSTEP educa. 6. ITUG-Jahrestagung. Burgos.
(6.4.2) Construction of Russian Corpus-Driven Dictionary Based and Monitor Corpora
Serge A. Yablonsky
St. Petersburg University of Transport, Russicon Company, Russia
1. Introduction
Monitor corpora are of interest to lexicographers and language learners who can trawl a stream of new texts looking for the occurrence of new words, or for changing meanings of old words (Collins COBUILD, 1995; McEnery T., Wilson A., 1996). Their main advantages are that they are not static and provide for a large and broad sample of language.
The application of language processing technologies to the construction of shareable and multifunctional language corpora has led to promising results (Varile G. B., Zampolli A., 1996).
Progress in Russian language processing makes it possible to apply its results to the creation of Russian monitor corpora that are closely connected, with the help of linguistic software, to a set of electronic dictionaries. Our approach depends in particular on the language processor Russicon and on extensive use of the Russicon electronic dictionaries (Yablonsky S.A., 1998).
2. Composition of the corpora
The main part of the corpus was described in (Yablonsky S.A., 1999 a,b). Today's corpus offers a broad representation of 19th- and 20th-century Russian literature, criticism, philosophy, religion, newspapers, memoirs, law, business, computing, historical documents, stenographic records, translations, folklore, Internet literature, "underground" literature, etc.
The texts are taken from printed resources, CD resources and the Internet. ASCII and Unicode are the basic text encoding standards. Additionally, SGML, HTML and XML markup is added by purpose-written conversion programs in C. SGML configuring of texts is done with the SoftQuad SGML Publishing Suite. The text collection will continue to grow as resources are created and encoded. The open-ended (constantly growing) Russian monitor corpus helps in dictionary building, as it enables lexicographers to keep track of new words entering the language, of existing words changing their meanings, and of the balance of their use according to genre, etc.
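How an open-ended corpus can flag new words as texts arrive may be sketched as follows. This is an illustration only, not the Russicon-based system: the class and function names are hypothetical, and the sketch compares surface word forms, whereas a system for Russian would first reduce inflected forms to lemmas with a morphological processor.

```python
import re
from collections import Counter

# Runs of Unicode letters; works for Cyrillic as well as Latin script.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lower-cased word forms."""
    return [word.lower() for word in WORD_RE.findall(text)]


class MonitorCorpus:
    """Growing corpus that records when each word form was first seen."""

    def __init__(self):
        self.frequencies = Counter()  # word form -> total occurrences
        self.first_seen = {}          # word form -> label of first batch

    def add_batch(self, label, texts):
        """Add a batch of texts; return the word forms new to the corpus."""
        new_words = set()
        for text in texts:
            for word in tokenize(text):
                if word not in self.frequencies:
                    new_words.add(word)
                    self.first_seen[word] = label
                self.frequencies[word] += 1
        return sorted(new_words)
```

Each call to add_batch returns the candidate list a lexicographer would review for that batch, while frequencies and first_seen accumulate the evidence needed to follow changes in usage over time.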
3. Corpus-driven dictionary
The chief distinction of the corpus is its strong connection with the set of Russian electronic dictionaries and language processing tools, in particular the language processor Russicon.
Every word of the corpus is simultaneously an entry word of the corpus-driven dictionary, and vice versa. For any form of a Russian word input, the dictionary outputs:
The pilot system is implemented on an IBM PC using Visual Basic 6.0 and MS SQL Server 7.0 and runs in standalone and local network modes.
References
Belyaev, B.M., Surcis, A.S. and Yablonsky, S.A. (1993). Russian Language Processor RUSSICON: Design and Applications. Proceedings of the East-West Artificial Intelligence Conference (EWAIC-93) (pp. 175-180), Moscow.
Collins COBUILD on CD ROM. (1995). HarperCollins, London.
McEnery, T., Wilson, A. (1996). Corpus Linguistics. Edinburgh University Press, Edinburgh.
Varile, G. B., Zampolli, A. (1996). Survey of the State of the Art in Human Language Technology. Cambridge University Press, Cambridge.
Yablonsky, S.A. (1998). Russicon Slavonic Language Resources and Software. In A. Rubio, N. Gallardo, R. Castro and A. Tejada (eds) Proceedings of the First International Conference on Language Resources & Evaluation (pp. 141-1147), Granada, Spain.
Yablonsky, S.A. (1999a). Russian Written Language Corpora Development. Proceedings of the International Seminar Dialog99, May 30-June 8, Tarussa, Russia.
Yablonsky, S.A. (1999b). Russian 20th Century Literature Digital Library for Language Teaching. Proceedings of the International Conference of the ACH/ALLC Digital Libraries for Humanities Scholarship and Teaching, June 9-13, 1999, University of Virginia, Charlottesville, Virginia, USA.
(6.4.3) The MATE Workbench - a Tool for XML Corpus Annotation
Amy Isard
David McKelvie
University of Edinburgh, UK
Andreas Mengel
Universitaet Stuttgart, Germany
Morten Baun Moeller
Odense University, Denmark
The MATE workbench is an annotation tool which allows transcription, annotation, display and querying of speech and text corpora. It is not tied to any particular annotation scheme, and allows users to define interfaces which suit their particular needs.
The main difference between the MATE workbench and other annotation tools is its flexibility. Any annotation scheme which can be expressed in XML [7] can be used with the workbench (for a discussion of overlapping hierarchies see below), and the display and annotation interfaces are defined using a language based on the stylesheet language XSLT [8]. The workbench is written entirely in Java and is therefore platform independent.
Annotation can be a very tedious task for humans, and many tools have been developed to make it easier. We conducted a review of some existing annotation tools before beginning development of the workbench [3], and many of them have in common a fixed user interface or a restriction to a particular coding scheme. One tool which has some similarities to MATE is GATE [1], which can also be used with any annotation scheme, but has a different internal architecture, based on the US Tipster architecture rather than XML, and a main aim of making it possible to integrate different automatic annotation components within one system. MATE aims to provide a framework where stylesheets can be used to provide user-defined annotation and display interfaces. Because the stylesheet language is quite high level, it is easier to write a stylesheet to provide a given interface than to write an entire coding tool from scratch.
The MATE project has developed annotation schemes for five sets of linguistic phenomena, and examples of markup using these schemes will be distributed with the workbench, along with stylesheets for their annotation and display. Users of the workbench are by no means limited to these schemes, however.
To display annotated data in the workbench, a user must have a MATE project file, which specifies one or more XML annotation or transcription files and sound files if appropriate, and a stylesheet which is to be applied to these files. Several examples of these will be provided with the workbench. When the workbench is started, a corpus folder window appears with a display of all the available project files. After selecting a project file, the user clicks on the "run" button, the specified files are processed, and one or more display or annotation windows appear, depending on what was specified in the stylesheet. A different stylesheet can be used with the same files to produce different behaviour.
The other major use of the workbench is in performing queries over a corpus. A query language [4] was developed within the project which is tailored to our internal representation of the data, including the treatment of multiple hierarchies as defined below. To perform a query, the user first loads in one or more documents, then opens a Query Window, which provides support for building complex queries. The results of the queries are saved in XML format within the workbench, and are then also displayed using a stylesheet, allowing a flexible representation of the results.
One question which often comes up when XML annotation schemes are being developed is how to handle multiple overlapping hierarchies of markup. The TEI [5] describes several possible ways to represent overlap in SGML. One of these, CONCUR, is not available in XML, which was designed to be a less complicated and easier-to-use version of SGML and therefore left out some features. We have chosen an approach different from any of these, but one which has proven successful in at least one large-scale corpus annotation project [2]. Our solution is to keep each level of annotation, and each data stream (in the case of multi-speaker conversations, for instance), separate, and to link each level to a common base level. This base level would normally be the smallest unit on which all the other annotations depend. This will often be the word level, but could also be phonemes in the case of speech, higher-level units such as sentences or paragraphs, or indeed anything else as appropriate. The MATE workbench will therefore deal appropriately with any data which are marked up in this way using hyperlinks, as defined in the XLINK proposal [6].
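The idea of separate annotation levels linked to a common base level can be made concrete with a small sketch. It is not the MATE workbench's own code or file format: the element names, the "first..last" form of the href attribute, the function names and the sample utterance are assumptions chosen for illustration, and real XLink syntax is richer than this.

```python
import xml.etree.ElementTree as ET

# Base level: one element per word, each with a unique id (invented sample).
BASE = """<words>
  <w id="w1">so</w><w id="w2">you</w><w id="w3">go</w>
  <w id="w4">left</w><w id="w5">okay</w>
</words>"""

# Two independent levels. Each element points at a range of base words.
# The "first..last" href notation is a simplification, not XLink syntax.
SYNTAX = """<syntax>
  <phrase type="clause" href="w1..w4"/>
  <phrase type="discourse-marker" href="w5..w5"/>
</syntax>"""

PROSODY = """<prosody>
  <unit type="intonation-phrase" href="w1..w2"/>
  <unit type="intonation-phrase" href="w3..w5"/>
</prosody>"""


def load_base(xml_text):
    """Return the ordered list of word ids and a mapping id -> word."""
    words = ET.fromstring(xml_text).findall("w")
    ids = [w.get("id") for w in words]
    text = {w.get("id"): w.text for w in words}
    return ids, text


def resolve(layer_xml, ids, text):
    """Resolve every element of a level to (type, start, end, words)."""
    spans = []
    for elem in ET.fromstring(layer_xml):
        first, last = elem.get("href").split("..")
        start, end = ids.index(first), ids.index(last)
        words = [text[i] for i in ids[start:end + 1]]
        spans.append((elem.get("type"), start, end, words))
    return spans


def crosses(a, b):
    """True if two (start, end) ranges overlap without either nesting."""
    (s1, e1), (s2, e2) = a, b
    return (s1 < s2 <= e1 < e2) or (s2 < s1 <= e2 < e1)
```

In this sample the clause spans words 1 to 4 while the second intonation phrase spans words 3 to 5, so the two units cross and could not both be element boundaries in a single XML tree; kept in separate levels that point at the same words, they coexist without conflict.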
Another advantage of the generality of the MATE workbench is that it makes it easier to combine views of annotation done using different schemes on the same corpus. These annotations may be done on different sites without any contact, but if both use hyperlinks to the same base level, then it is possible to create stylesheets which display both the annotations at the same time. It is also possible to write a stylesheet which will display one level and allow annotation of another level at the same time.
The MATE workbench has just been completed, and testing and evaluation are about to begin. We will be able to provide a section on this evaluation for the final paper. We will be asking testers to use the workbench for a variety of different annotation tasks, and provide feedback on ease of use and power, and also evaluating whether the stylesheet language allows testers to define new annotation and display interfaces easily. We will also be submitting a proposal to ALLC/ACH 2000 for a demo of the workbench.
References
[1] GATE <http://www.dcs.shef.ac.uk/research/groups/nlp%2Fgate/>
[2] Isard, McKelvie and Thompson (1998). Towards a Minimal Standard for Dialogue Transcripts: A New SGML Architecture for the HCRC Map Task Corpus. Proceedings of the 5th International Conference on Spoken Language Processing, ICSLP98, Sydney.
[3] MATE Deliverable D3.1: Specification of Coding Workbench <http://www.cogsci.ed.ac.uk/~amyi/mate/report.html>
[4] Mengel, A. and Heid, U. (1999). Enhancing Reusability of Speech Corpora by Hyperlinked Query Output, Eurospeech 99, Budapest, September 1999.
[5] Sperberg-McQueen, C. M. and Burnard, Lou (eds). TEI Guidelines for Electronic Text Encoding <http://etext.lib.virginia.edu/TEI.html>
[6] XLINK <http://www.w3.org/TR/xlink>
[7] XML <http://www.w3.org/TR/REC-xml>
[8] XSLT <http://www.w3.org/TR/xslt>
(6.4.4) Austrian Academy Corpus - Doing literary markup by means of XML
Karlheinz Moerth
Hanno Biber
Austrian Academy of Sciences, Austria
The poster we intend to present is supposed to give some details concerning the pilot phase of the newly founded Austrian Academy Corpus (AAC). The AAC is a project which has been started with the perspective of building up a digital corpus tailored to the particular needs of scholars doing research in the field of literary studies. The build-up of electronic text collections has been the domain of linguists so far. Although literary texts are becoming available in ever-increasing numbers, many of these products originated from linguistic endeavours to gather data for lexicographic or learners' purposes. Literary scholars have rather neglected the whole issue. Most of the existing digital literary resources are poorly edited and serve for little more than simplistic word searches. As the exigencies of work on literary texts differ considerably from those of linguistic studies the AAC has been trying to work towards solutions that meet the necessities of the literary domain. The AAC proceeds in its work from the basis of digital data which have been collected at the Austrian Academy of Sciences during the last few years. At that time the department's focus was placed on literary sources which were classified as functional literary text types, i.e. magazines, diaries, sermons, speeches, letters, obituaries and the like. Thus the materials in the corpus display considerable variety as to contents and structure. The data was accumulated primarily for text-lexicographic purposes preparing a phraseological dictionary which was published in 1999. This text-dictionary was based on a literary satirical magazine which appeared for the first time in 1899 in fin-de-siecle Vienna and was published until 1936. This magazine, 'Die Fackel', which was very popular among intellectuals in its time, is still of utmost interest to scholars of German literature trying to understand this crucial period of history. 
The electronic text of 'Die Fackel' was first generated by means of rather unsophisticated OCR, a process which was completed several years ago. During the compilation of the above-mentioned text-dictionary, the quality of this electronic text was improved in several rounds of proofreading and by adding basic markup developed specifically for this purpose. Only very few tags were applied, indicating for example the position of images or identifying special characters which were not supported by the standard code pages of the user interface. The texts being digitised at the moment (primarily historical magazines) undergo several stages of processing: they are first scanned and made readable by means of OCR, then corrected several times. The application of markup is the last step in this process. Formatting information of the digitised texts is initially preserved in file formats such as RTF or by means of standardized style sheets, which yield quite reasonable results. With the increasing availability of more advanced text encoding techniques, the working group of the AAC started to think about a more general approach to the manifold problems of integrating text and text-related secondary data. We had to look for a markup scheme which allowed different types of data to be labelled within one coherent markup system. In its endeavours to find appropriate ways of describing the structure of texts, linking data and facilitating structured searches, the working group started experimenting with the application of XML (Extensible Markup Language) to its data. The issue of XML, especially the question of XML versus SGML, has been touched upon repeatedly in humanities computing. XML, as the youngest descendant of the SGML/HTML family, is a subset of SGML and can be viewed as the logical further development of SGML.
As a general rule, one can assume that SGML-encoded data can easily be transformed into XML data, and to a certain extent even the other way round. The XML-related experiments at our department started in the early stages of XML's appearance on the scene, i.e. in 1998. As yet, only parts of the corpora of the AAC have been provided with XML-conforming markup, as we are still trying to gauge the potential benefits of XML for our work. Basic markup (start and end of pages, line breaks), character encoding, and the concatenation of hyphenated words were carried out by means of a special conversion engine. There are many different ways to perform such transformations. Many would prefer to accomplish such a task with scripting languages such as Perl, which can of course be a quite practicable and efficient way to do it. We tried to use graphical user interfaces from the very beginning of our work in order to ensure constant monitoring and control of the transformation processes. We therefore developed our tools in the programming language Delphi, a RAD tool which helped to cut development time. The absence of any XML-compliant browser software in the early phase forced us to seek a solution of our own and consequently to develop a graphical interface to bring our data on screen and display them in a readable form. These experiments showed that the fundamentally strict structure of XML documents makes parsing quite straightforward. Even in the absence of a definitive set of tags, XML files can still be processed without parsing an attached DTD (Document Type Definition). As the texts we are working on contain passages in various languages and scripts, XML's compliance with the Unicode Standard proved to be extremely helpful. Unicode has brought about the unification of a huge number of diverging standards and will in the future ensure the exchangeability and interoperability of all sorts of textual data.
XML not only allows generic, highly specialised encoding, but also enables the flawless exchange of data among different systems.
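Two of the points made above, that well-formed XML can be processed without a DTD and that words hyphenated at line breaks can be rejoined from basic markup, can be shown in a short sketch. It is written in Python rather than the Delphi of our own tools, and the tag names (text, page, lb, hi), the function names and the sample sentence are assumptions for illustration, not the AAC's actual scheme or data.

```python
import xml.etree.ElementTree as ET

# Invented sample with page and line-break markup; no DTD is attached.
SAMPLE = """<text>
  <page n="1">Die Zeit-<lb/>schrift erschien in <hi>Wien</hi> und<lb/>wurde viel gelesen.</page>
</text>"""


def tag_inventory(root):
    """List the tags actually used, discovered from the document itself."""
    return sorted({elem.tag for elem in root.iter()})


def page_text(page):
    """Return the running text of a page with line breaks resolved.

    A hyphen directly before a line break is treated as a word division
    and removed. This is a naive rule: a genuine hyphen that happens to
    fall at the end of a line would be removed as well.
    """
    text = page.text or ""
    for child in page:
        if child.tag == "lb":
            if text.endswith("-"):
                text = text[:-1]   # join the two halves of the word
            else:
                text += " "        # ordinary line break becomes a space
        else:
            text += "".join(child.itertext())
        text += child.tail or ""
    return text
```

Because the parser needs only well-formedness, tag_inventory can report which tags occur before any definitive tag set has been agreed, and page_text shows that page and line-break markup is sufficient to recover searchable running text.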