HCL - a Language for Internet Data Acquisition
Richard Connor and Keith Sibson
Department of Computing Science
University of Glasgow
{richard, sibsonk}@dcs.gla.ac.uk
Abstract
We identify problems in using traditional programming systems to write applications that access internet data. Some of these problems may be artefacts of current internet technology, and some are fundamental in any large heterogeneous and autonomous network. Our experimental approach to this is to design a new language system, with a semantic domain which corresponds, as far as possible, to that of the problem domain. Our primary aim is to allow the straightforward automation of tasks currently performed in a routine way by human beings interacting with Web browsers, in terms accessible to programmers who are not necessarily expert in the underlying protocols and mechanisms.
In this paper we describe a system which has been designed and implemented as a first experimental step along this path. We believe we have identified some simplifying concepts, and through early experiments have gained some experience in both their use and implementation. This is very much a position paper, since we are just about to embark on a phase of research where we try to build some more serious applications, to gain some more useful feedback. The language is not presented as a good solution to the identified problems, but as our particular choice of starting point.
In this paper we describe the key new concepts of the language and give some simple examples of its use.
1. Introduction
The problem we identify is that the automation of even simple tasks which require the fetching of data from the internet is an unnecessarily difficult task using current technology. We have identified two main reasons for this:
• Current abstractions over transport protocols are low-level and hard to use, often requiring explicit memory and cache management schemes to be provided to achieve an acceptable degree of efficiency [W3C].
• The computational models of traditional languages do not match the implicit semantics of the global network, where autonomous management decisions, network failures and significant delays figure largely, even in high-quality data collections [Car97].
Whilst the programming of a single URL fetch using the http protocol is straightforward, programs that deal with multiple fetches or repeated accesses of the same URLs require immediately to become involved with thread and cache management to provide acceptable performance. Such algorithms are inherently complex and therefore liable to error. Furthermore, these aspects are not a part of the semantics of the computation the programmer has in mind, and to force them to be coded within such applications shows a deficiency in the programming model. Ideally details of efficient management of the conceptual data value should be handled by a language run-time system, rather than by programmers. A useful analogy is the memory management associated with high-level data structures, described by the programmer in traditional systems, but handled by the run-time system in more modern languages.
Even without these problems, traditional semantic models do not usefully match with the nature of computations in this domain. For example, traditional systems are largely deterministic in their performance, even including languages that model non-determinism for the sake of parallel thread communication. However the internet domain implicitly includes serious degrees of indeterminacy. For example an attempt to fetch the same URL may result, at different times of the same day, in a successful fetch within a second, a successful fetch taking a minute, a successful fetch of only the first 80 per cent of the file, or a failure to even communicate with the machine on which it is resident.
Behaviour of this nature is part of the semantic model as understood by users, as can be observed in even a relatively naive user’s behaviour. It is common practice, for instance, to perform certain queries at certain times of day, according to some acquired model of response time variation over the 24-hour cycle; similarly, it is common to observe users testing a number of response times before deciding which set of resources to use. Such behaviour, however, can not easily be coded into an automated system built with a traditional language.
The non-deterministic nature resource fetching may be regarded partly as an artefact of bad engineering properties of the current network. At least to some degree, however, it must always feature in a network which is large, heterogeneous and autonomous. Our aim is to investigate systems with which a programmer can describe computations within the such a semantic domain, in terms which can be understood solely with respect to the identified problem and knowledge of the environment. It should therefore be possible to describe computations without any unnecessary clutter caused by a mismatch between the language semantics and the domain of the computation.
In the rest of this paper we give a description of the Hippo Core Language (HCL), which provides a platform from which to experiment with both the provision and implementation of high-level constructs and paradigms to help overcome the problems identified above.
2. Hippo Core Language concepts
HCL’s main simplifying concepts to allow the automation of internet queries are as follows:
• the unification of universal resource, text file and string types
• implicit URL fetching
• parallel execution
• alternate execution
This combination of concepts is not presented as a definitive answer to the problem, but only as a starting point. We believe that it provides a useful set of abstractions, and furthermore one that can be implemented within reasonable bounds of efficiency, but we do not yet have sufficient experience in the building of applications using these constructs to make any further assertions.
2.1. URL and file representation
Universal resources are stored as text files and passed as streams of characters. However, we have chosen to represent them within programs as an immutable string type. This is believed to be the closest approximation to most users’ view of the concept. Although a character stream model might provide more functionality, we have in the first instance avoided this for the sake of simplicity. The string model of universal resource is also unified with the result of local file system fetch and other strings modelled during a computation.
The immutable string type is a simple model, deriving from one first used in the language S-algol [Mor82]. There is a simple string algebra, consisting only of string literals, append, substring and length functions, implemented in a garbage-collected heap. Characters are represented as strings of length one. Our experience with other language implementations is that this model of string not only gives a higher-level abstraction than updateable character arrays, but in some cases can result in more efficient code. This is because buffer space is not gratuitously allocated, as is the common paradigm with fixed-size updateable character arrays, making the programs much more space-efficient [MG82]. This is however at the cost of execution time overhead for garbage collection. In the past we have modelled string-intensive applications, such as lexical analysers and string databases, with quite acceptable performance.
In HCL, all files fetched at run-time, whether from the local file system or as the result of URL fetch, are typed as immutable strings. An obvious difficulty with this simplification is the efficiency of the implementation. To date we have performed some simple experiments in which simple memory management is built into the string implementation, and can see no reason why an acceptable performance can not be achieved. We also intend to provide a string library to give support for operations we judge to be commonly used in the processing of HTML documents, to allow these to be coded at an efficient level.
2.2. URL fetching
HCL has a URL data type. There are only two ways of introducing a value of this type, either as a literal or as the result of the standard function stringToURL. There is otherwise no algebra over values of this type, and the meaning of a value of this type is the result of attempting to fetch the resource it represents. Thus in practice any occurrence of a URL in a program is immediately retyped as string, representing the result of an attempted fetch of the URL. This operation in HCL is guaranteed to eventually succeed; failure to fetch a URL will result in a string which represents an error, and there is a system timeout value to ensure eventual termination.
Figure 1 shows some simple examples of the use of URL fetching in HCL.
let hippoHomePage = http://www.dcs.gla.ac.uk/~hippo/default.html
let hippoBase = "http://www.dcs.gla.ac.uk/~hippo/"
let hippoHome = hippoBase ++ "default.html"
let anotherHomePage = stringToURL( hippoHome )
let firstHREFPos = findSubstring( "HREF", hippoHomePage )
Figure 1 : simple example fetches
This arrangement is rather non-standard in programming language design, and is captured by the following language extensions. As there is no URL algebra, it is straightforward to apply the re-typing rule from URLs to strings.
T ::= ... | string | URL
(Note to referees: the LICS font probably won’t work in a postscript version printed on a non-Apple printer...)
[[ e : URL ]] = attemptGlobalFetch( [[ e ]] )
In the first implementation, this URL fetch is a simple synchronous operation which delays until a string is returned from the attempted fetch. In the next version we are considering making this an asynchronous operation, and associating sychronicity instead with the implicit retyping operation. This would not affect the meaning of programs, but would allow URL fetches to proceed in parallel with the rest of the code until such time as their results were required. One interesting side issue is how to model this in the formal definition of the language; although there would be no difference in a denotational semantics, the behaviour of delay is one we do wish to model formally. The two different models of URL fetch would make the same programs behave noticeably differently in terms of this, although without affecting the set of possible results.
2.3. Parallel and alternate execution
HCL has two ways of specifying concurrent computation. The infix operator par causes its operands, which may be any void expressions, to be evaluated in parallel, allowing the sharing of the state of the context in which they are executed. par is a fairly standard way of introducing parallelism into a language, and its main intention is to overcome the potential blocking caused by the synchronous behaviour of the URL fetch. Figure 2 shows a simple use of this construct.
{ writeToFile( "/users/hippo/hippo", http://www.dcs.gla.ac.uk/~hippo/ ) }
par
{ writeToFile( "/users/hippo/richard", http://www.dcs.gla.ac.uk/~richard/ ) }
Figure 2 Parallel URL fetches
It may also be useful for the branches of a par construct to communicate with each other or with another thread. Figure 3 shows a program which performs the same tasks, but gives information about its progress to another parallel thread.
let doneHippo = loc( false )
let doneRichard = loc( false )
{ writeToFile( "/users/hippo/hippo", http://www.dcs.gla.ac.uk/~hippo/ )
doneHippo := true }
par
{ writeToFile( "/users/hippo/richard", http://www.dcs.gla.ac.uk/~richard/ )
doneRichard := true }
par
// something else that uses doneHippo and doneRichard
Figure 3 Communicating parallel branches
The infix operator alt also causes its operands, which may be void or non-void expressions, to be evaluated in parallel. The result of the alt operation is non-deterministically chosen from the operands. However this non-deterministic choice is governed by the heuristic that, if one of the branches terminates significantly sooner than the other, then this will be the result chosen. Given the orders of magnitude difference in speed between URL fetches and other run-time operations this can be used to govern the large-grained progress of a computation. Figure 4 shows how alt can be used to program the fetch of a mirrored resource, and a timeout.
let hippoHome = { http://www.dcs.gla.ac.uk/~hippo/ }
alt
{ http://www.hippo.org.uk }
let AVHome = { http://www.altavista.dec.com }
alt
{ pause( 30 ) ; "error: timeout after 30 seconds" }
Figure 4 Example uses of alt
The branches of the alt operator can not communicate with each other or any other parallel threads, and the unsuccessful branch leaves no trace of its execution history after it has been terminated. Thus the two branches of the alt operator have the effective semantics of heavyweight transactions, with one succeeding and the other aborting, giving a clean semantic model of the alternative computation. This concept is recursively applied within nested branches of an alt operation applied, giving a model similar to nested transactions [Mos??]. Figure 5 shows the use of this transaction model: it is not possible for the alt expression to return the result of the second branch and at the same time cause a global increment to the filesFetched location.
let filesFetched = loc( 0 )
let AVHome = { filesFetched := at( filesFetched ) + 1
http://www.altavista.dec.com }
alt
{ pause( 30 ) ; "error: timeout after 30 seconds" }
Figure 5 Example uses of alt
3. Related Work
Cardelli and Davies [CD97] describe language features which allow the explicit inclusion of the concepts of non-determinism and transfer rates within computations, which gives the programmer the ability to explicitly code with respect to the internet semantics of delay and failure. The unit of transfer in this language is assumed to be the text file, and the transfer of typed data is not addressed. Surprisingly we have not been able to discover further work in the field of "real" internet languages, by these authors or others
4. Further Work
Our direction from this point, where we have a basic working model of a language incorporating the above features, is to write a number of applications to determine efficacy of the constructs and standard library provision. We also expect to have to iterate over various aspects of the implementation, particularly with respect to string memory management and URL caching. We may also experiment with the semantic models provided of data download, possibly with asynchronous or character stream approaches, and a semantic model of caching.
One aspect we have not talked about in this paper is a further aspiration to incorporate a model of typed data transfer within the semantic domain of a similar language. This will use a model of orthogonal persistence [ABC83, AM95] where the persistent namespace is ranged over by URLs. An outline of this work has been described separately [CS97]. This project gives rise to the acronym Hippo (High-level Internet Programming with Persistent Objects), which up to this point has been left as a mystery to the reader!
5. References
[ABC83] M.P. Atkinson, P. Bailey, K.J. Chisholm, W.P. Cockshott and R. Morrison An Approach to Persistent Programming The Computer Journal 26, 4 ( 1983 ) pp 360 - 365
[AM95] M.P Atkinson and R Morrison Orthogonally Persistent Object Systems VLDB Journal 4, 3 pp 319 - 401
[Car97] Luca Cardelli Mobile Computation In Mobile Object Systems - Towards the Programmable Internet, J. Vitek and C. Tschudin Eds., 3-6, LNCS 1222, Springer-Verlag, 1997.
[CD97] Luca Cardelli and Rowan Davies. Service Combinators for Web Computing. SRC Research Report 148, Digital Equipment Corporation Systems Research Center. 1997.
[CS97] Richard Connor and Keith Sibson An Overview of the Hippo project. In preparation
[Hip97] http://www.dcs.glasgow.ac.uk/~hippo
[Mor82] R. Morrison S-algol: a Simple Algol Computer Bulletin 2, 31 ( March 1982 )
[Mos??] Eliot Moss Nested Transactions PhD Thesis, MIT
[MG82] R. Morrison and H. Gunn The String as a Simple Data Type Software, Practice and Experience, ?,?
[W3C] World-Wide Web Consortium, http://www.w3.org/pub/WWW/