Some of the recommendations in this section of the report apply to the National Science Foundation and some apply to the economics profession as a whole.
The following sections discuss some of the key components of the
computing infrastructure in economics.
A. Software
One area of significant software under-investment is that of
interfaces between various statistical, mathematical, and graphical
tool kits used by economists. The state of computing software in
economics resembles the characters of the Ray Bradbury novel,
Fahrenheit 451, where each researcher is expert in one of many
distinct computing environments, such as Gauss, GAMS,
Mathematica, RATS, S-plus, SAS, SST, Troll, or TSP. On a
given project, many researchers would like to transfer data
between windows in separate environments to take advantage of
specialized features in different packages. An all-purpose
software research environment is not a reasonable concept, but it
would be useful to encourage development of software interfaces
between existing environments. One mode of design that might
help is object-oriented analysis software, such as that being
developed by Oldford (1988), where inherited characteristics,
such as the method of data construction (modeling), are carried
along with the data. Object-oriented coding could help alleviate a
key problem with imported software, namely, that the output
produced by the imported algorithm has to be re-connected to the
remainder of the user's software analysis tool kit.
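As a rough illustration of the object-oriented idea described above (and not a reproduction of Oldford's design), the following Python sketch shows a data object that carries a record of how it was constructed. Each transformation returns a new object that inherits and extends that record, so output from an imported routine stays connected to its provenance. The class names, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List


@dataclass
class Series:
    """A data series that carries its own construction history."""
    name: str
    values: List[float]
    history: List[str] = field(default_factory=list)

    def apply(self, label: str, func: Callable[[float], float]) -> "Series":
        """Return a transformed copy; the history is carried along and extended."""
        return replace(
            self,
            values=[func(v) for v in self.values],
            history=self.history + [label],
        )


@dataclass
class DeflatedSeries(Series):
    """Subclass that also records the (hypothetical) base year of the deflator."""
    base_year: int = 1982


raw = DeflatedSeries("gnp", [100.0, 110.0, 121.0], ["loaded from archive"])
scaled = raw.apply("divided by 100", lambda v: v / 100.0)
print(type(scaled).__name__, scaled.base_year, scaled.history)
```

Because `apply` copies the object with `dataclasses.replace`, the subclass and its extra attribute survive the transformation, which is the sense in which characteristics are "carried along with the data."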
One byproduct of the lack of adequate interfaces among alternative software environments is the notable inefficiency of duplicative coding of research innovations in mathematical and statistical analysis (a recent case in point is the coast-to-coast recoding of the Johansen algorithm for co-integration analysis). Although one can rely on vendors to eventually incorporate new wrinkles in computational analysis, the time lag is considerable. Whether or not this is viewed as a market failure may depend on the marginal cost of graduate students' time, but we suspect the translation lag slows disinterested scrutiny of algorithms that are the basis of published work. Although national software banks, analogous to national data banks, are a possibility, perhaps growth of traffic on national networks by economists will provide a solution.
One software problem that will not be alleviated by more network software swapping is the need for economists to develop standard modeling protocols and nomenclature. User agreements for modeling standards could encourage development of software modules in high-level languages that can be used by the variety of computing systems adopted by economics departments and agencies.
Finally, many of us feel that software availability is an important
bottleneck and that NSF should encourage investigators to add
software items to their grant request rather than to discourage
them, as is the current policy.
B. Visualization
Visualization has been used principally by non-social scientists
with applications such as the visualization of molecular
phenomena, smog in Los Angeles and the effects of alternative
policy interventions, and visualization of patterns of fires, with
simulations of what the landscape would look like at various
points in history in the absence of fires. This novel technology
would bring another dimension to the understanding of complex
phenomena in economics in a dynamic setting. Moreover, it
would enable the researcher to recognize patterns and trends not
apparent from conventional data analysis, and to actually see the
evolutions of economic systems. Further, it would be a graphic
way of representing questions raised by the discipline and of
providing answers to researchers, laymen and policymakers alike.
It, inevitably, will also suggest new theories.
This tool could also be used to visually measure the performance
of algorithms and to explain their behavior to individuals not
necessarily schooled in mathematical programming fundamentals.
Economics, with its fascinating problems, should take the
initiative in this direction.
C. Unix-based Networks and Workstations
The organization of computing in most economics departments is
driven by the selection of either mainframes or PCs, or some
mixture of both. On the one hand, mainframes are generally well-
suited for SAS manipulations of large-scale data sets. Similarly,
PCs appear to be adequate for small-scale macro modeling, have
better graphical capabilities for pre-test evaluations and
examination of estimation surfaces, and avoid the hassles of
petitioning for larger departmental allocations on university
mainframes. In addition, business schools may prefer to prepare
students for the spreadsheet computing that graduates will face in
the business world.
On the other hand, in adopting this either/or setup, students and researchers in economics are generally shut out from a wide midrange of hardware power and scientific computing software options that are available to users of Unix-based workstations and servers. Perhaps most damaging, the lack of access to the Unix environment deprives economists of low-cost imports of computing technologies developed in other scientific computing disciplines, ranging from freeware editors for multi-tasking to experimental software for large-scale optimizations and simulations.
Given the reduced prices of workstations and servers and the ability to establish networks of relatively inexpensive terminals (with varying degrees of local power) that can access more expensive compute servers, it is hard to believe that this state of affairs is only a consequence of sunk costs. Many colleges and universities encourage or subsidize student-owned personal computers, which could be attached to departmental Unix-based servers. However, the need for skilled maintenance and pooled resources in a Unix network does require overcoming bureaucratic inertia and incentives for noncooperative behavior.
The previous sentence may seem benign and the reader may rush
across it easily. However, we urge a full stop here. In our
experience, workstations are wonderful to use but difficult to set
up and to maintain. Single users should not look at their PCs and
think that a workstation will be only slightly more difficult to
manage. Rather they should think of their workstation as
belonging to a cluster around a server, and there should be a
skilled and permanent staff member to support the entire set of
machines.
D. Distributed Processing and Parallel Computing
One of the first things late-night users of a network observe is the
underutilization of processor capacity on other network machines.
Our impression is that network-distributed processing, a special
case of parallel processing involving ten or twenty network
processors, is not employed by economics departments and should
be encouraged.
At a minimum, a mechanical adaptation of distributed processing is cost-effective and would permit even small departments to approach cranking speeds usually available only to privileged users of array or vector processors. Thus far, experience at the Federal Reserve Board of Governors is mostly limited to spawning independent processes, such as Monte Carlo runs to generate confidence intervals for staff model forecast production runs. In the case of independent processes, the speedup (the ratio of serial processing time to parallel time) is approximately proportional to the number of parallel processors, so it is relatively straightforward to obtain an order of magnitude speedup.
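The independent-process case can be sketched in a few lines of Python. This is only a minimal illustration, assuming a toy "replication" (the mean of simulated draws) in place of a real model forecast run, and using processes on one machine as a stand-in for processors on a network. With p workers and runs that never communicate, the speedup S = T_serial / T_parallel is roughly p, less any start-up overhead. The function names are hypothetical.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor


def replication(seed: int) -> float:
    """One independent Monte Carlo run: the mean of 10,000 standard normal draws."""
    rng = random.Random(seed)
    return statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(10_000))


def run_serial(seeds):
    """Run every replication one after another on a single processor."""
    return [replication(s) for s in seeds]


def run_parallel(seeds, workers: int = 4):
    """Farm the same runs out to separate processes; no run depends on another."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(replication, seeds))


if __name__ == "__main__":
    seeds = list(range(20))
    # Each run is seeded separately, so both schedules give identical results.
    print(run_parallel(seeds) == run_serial(seeds))
```

Seeding each run separately is the design choice that makes the runs independent: the results do not depend on which worker executes which run, or in what order.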
In the case of non-independent processes, the concurrent message passing associated with distributed processing is slowly altering appreciation of modeling systems, in contrast to a single fixed model specification. For example, to examine the robustness of policy options to alternative model specifications, different sectors from domestic and international macroeconometric models can be combined by message-passing simulations among various models. The causal structure, or the identification of "exogenous" and "endogenous" variables in a particular sector, is determined by the lists of variables imported or exported by particular sectors of the component models. In this manner, a variety of models can be generated by a small set of competing specifications. Also, alternative sectors, as might be constructed by a specialist, can be embedded in a larger simulation environment. Other obvious extensions include numerical analysis of differential games by agents with limited perceptions.
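A minimal sketch of this idea follows, assuming a deliberately tiny two-sector system and a sequential loop standing in for true concurrent message passing. Each sector declares the variables it imports (which it treats as exogenous) and the variables it exports (which it determines); a coordinator passes values among sectors until they stop changing. The `Sector` class, the `simulate` function, and the toy equations are hypothetical and are not taken from any Federal Reserve model.

```python
from typing import Callable, Dict, List


class Sector:
    """A model sector defined by the variables it imports and exports."""

    def __init__(self, name: str, imports: List[str], exports: List[str],
                 solve: Callable[[Dict[str, float]], Dict[str, float]]):
        self.name, self.imports, self.exports, self.solve = name, imports, exports, solve


def simulate(sectors, exogenous, start, tol=1e-10, max_rounds=500):
    """Pass variable values among sectors until a full round changes nothing."""
    board = {**start, **exogenous}                      # shared message board
    for _ in range(max_rounds):
        previous = dict(board)
        for s in sectors:
            inbox = {v: board[v] for v in s.imports}    # exogenous to this sector
            outbox = s.solve(inbox)
            board.update({v: outbox[v] for v in s.exports})  # endogenous to it
        if all(abs(board[v] - previous[v]) < tol for v in board):
            return board
    raise RuntimeError("message passing did not converge")


# Toy sectors: consumption C depends on income Y; income is C plus investment I.
consumption = Sector("consumption", ["Y"], ["C"], lambda m: {"C": 10 + 0.5 * m["Y"]})
alt_consumption = Sector("consumption", ["Y"], ["C"], lambda m: {"C": 10 + 0.6 * m["Y"]})
income = Sector("income", ["C", "I"], ["Y"], lambda m: {"Y": m["C"] + m["I"]})

start = {"Y": 0.0, "C": 0.0}
base = simulate([consumption, income], {"I": 20.0}, start)
alt = simulate([alt_consumption, income], {"I": 20.0}, start)
print(round(base["Y"], 6), round(alt["Y"], 6))
```

Swapping `alt_consumption` for `consumption` produces a different model without touching the income sector, which is the sense in which a small set of competing sector specifications can generate a variety of models.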
In theory, it is relatively straightforward to extend parallel computing to numerical analysis problems such as gradient methods of nonlinear optimization. However, in contrast to a range of accessible libraries, such as NAG, that contain pre-coded numerical modules for serial processing, we are not aware of a software library for network-distributed processing implementations of standard numerical algorithms.
To suggest an example of a more conceptual transfer of the structure of parallel computing to economic theory, an attractive application would be more explicit analysis of how heterogeneous agents learn or how markets generate transaction prices. In contrast to the concept of a representative agent with a single mode of learning, it would appear to be more realistic to specify that agents operate in parallel, with a variety of learning heuristics, and with restricted inter-agent messaging but periodic communication with a central message center that provides averaged market characteristics.
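One way to make this structure concrete is the toy simulation below. It is only a sketch under strong assumptions: agents differ solely in the speed of an adaptive learning rule, they never read one another's state, and a "message center" publishes a price that is a linear function of the average expectation. The price rule, the parameter values, and all names are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Agent:
    expectation: float
    gain: float  # learning heuristic: 1.0 is naive, smaller values learn more slowly

    def learn(self, price: float) -> None:
        """Move the expectation part of the way toward the published price."""
        self.expectation += self.gain * (price - self.expectation)


def message_center(agents, a: float = 5.0, b: float = 0.5) -> float:
    """Publish a market price that depends only on the average expectation."""
    return a + b * fmean(ag.expectation for ag in agents)


def run(agents, periods: int = 300) -> float:
    price = message_center(agents)
    for _ in range(periods):
        price = message_center(agents)   # periodic communication with the center
        for ag in agents:                # conceptually parallel: no inter-agent messages
            ag.learn(price)
    return price


agents = [Agent(0.0, 0.2), Agent(20.0, 0.5), Agent(7.0, 1.0)]
final_price = run(agents)
print(round(final_price, 6), [round(ag.expectation, 6) for ag in agents])
```

With these particular parameters the expectations settle on the common value a / (1 - b) = 10; other price rules or learning heuristics need not converge, which is exactly the kind of question such a simulation is meant to explore.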
Although parallel computing at this point does not yet play a major
role in economics, it represents a technology whose potential
usefulness and power cannot be ignored. Effective exploitation of
parallel architectures is still very time-consuming, although
coarse-grain architectures, such as the IBM 3090-600 series, are, in a
sense, easier to use than massively parallel architectures such as
the Thinking Machines CM-2 machine. Nevertheless, economists
merit the support that permits such investments, since much can be
gained by rethinking algorithmic procedures from the parallel
computational standpoint. Of course, emphasis should be placed
on parallel algorithms which are useful in a variety of problems in
econometrics, economic theory and applied economics. Finally,
there is now an excellent text by Bertsekas and Tsitsiklis (1989)
which lays the foundations for such algorithm development, albeit
in a general setting.
E. Networks and Supercomputers
John Rust has been involved in a 4-year study of the retirement
behavior of older men. He is trying to see whether the
mathematical principle of dynamic programming (DP) can provide
a good representation of their behavior. Since the DP model is
very complex, he used a supercomputer to solve the model and
compare its predictions to observations on a very large data set on
the actual retirement behavior of 8,131 men followed over 10
years. This is an interactive process which involves repeated
transfers of large amounts of data (up to 10 megabytes per model)
between a supercomputer and an IBM 386 workstation. To date
he has transferred close to 400 megabytes of data over the
Internet, and he expects to transfer roughly 20 times that amount
before the project is done. Some of the data is "raw" input data
prepared on his PC for transfer to the supercomputer for
estimation, and part is output data generated on the
supercomputer, which he transfers back to his workstation for
graphical post-processing (even the Internet does not yet have
sufficient bandwidth for real-time, interactive graphical analyses).
Without access to the Internet, he would be forced to use the
supercomputer via a dial-in line and a modem, at much slower
rates. Transfer of a 10 megabyte data file by this method would
take approximately 12 to 15 hours, effectively making it
impossible to interactively evaluate his models. With the Internet,
he can transfer these files in less than 15 minutes. Thus, without
the Internet, it would be virtually impossible to fully analyze his
supercomputer output, and he would have to severely cut back the
scope of his project.
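The arithmetic behind these transfer times can be checked with a short calculation. The report does not state the modem speed for this example, so the 2400 bit/s figure below is an assumption, as is the convention of 10 transmitted bits per byte (8 data bits plus a start and a stop bit) on a serial line. The function name is hypothetical.

```python
def transfer_hours(megabytes: float, bits_per_second: float, bits_per_byte: int = 10) -> float:
    """Hours needed to move a file over a serial line.

    bits_per_byte=10 assumes 8 data bits plus one start and one stop bit.
    """
    total_bits = megabytes * 1024 * 1024 * bits_per_byte
    return total_bits / bits_per_second / 3600


# Assumed 2400 bit/s dial-up modem moving a 10 megabyte file.
modem_hours = transfer_hours(10, 2400)

# Effective network rate implied by moving 10 megabytes in 15 minutes (8 bits per byte).
implied_bps = 10 * 1024 * 1024 * 8 / (15 * 60)

print(round(modem_hours, 1), round(implied_bps))
```

Under these assumptions the modem transfer takes about 12.1 hours, at the low end of the range quoted above, while a 15-minute transfer corresponds to an effective rate of roughly 93 kilobits per second.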
We can offer an even more compelling example of how inadequate
communications can seriously hinder use of supercomputer
facilities. In 1989, IBM granted Rust 200 cpu hours on their 3090
supercomputer as part of their "Research Support Program." For
security reasons, the IBM 3090 (located at IBM's Palo Alto
Scientific Center) was not accessible via Internet, but only over a
special call-back modem on a dial-up line. IBM went to the
expense of connecting a dedicated phone line to Rust's office so
that he could use the 3090 without tying up his office phone line.
Despite the use of clever terminal emulation software designed by
IBM, the transfer rate of a 1200 baud modem is too slow to move
files larger than 200 kilobytes in a reasonable amount of time.
IBM recognizes this problem and provides a standard $5,000
travel budget to all research support program participants so they
can fly to Palo Alto to install and remove large data-sets on site.
However, from the perspective of NSF, it is probably cheaper in
the long run to support fast network access over Internet than to
pay for the costs of dedicated phone lines and the travel expenses to
physically transfer datasets on disk or tape. In terms of research
turn-around time, it is difficult to see how one could do large
computational tasks in a reasonable amount of time without a
medium such as Internet. Rust found that the slowness of modem
access to the IBM 3090 severely limited the types of projects he
could undertake.
1. Long distance research collaborations.
The Internet is perhaps most useful on a day-to-day basis to
facilitate joint research with colleagues around the world. Using a
telephone is prohibitively expensive and also inconvenient since it
is often difficult to reach busy colleagues or those in different time
zones. The key advantage of Internet is the ability to jointly work
on papers or programs, sending electronic copies of the paper
back and forth in the process of revisions. John Rust's and his
colleagues' work on developing double auction software shows
that a team of workers can closely collaborate on software
development even though they are separated by thousands of
miles. In the Double Auction project, Richard Palmer was located
at Duke University, John Rust was at Wisconsin and John Miller
was at the Santa Fe Institute. Despite the distance, they managed
to develop over 15,000 lines of code, and a 100-page participant's
manual entirely via Internet communications.
Another example is Rust's paper "Dynamic Structural Models: Problems and Prospects" written jointly with Ariel Pakes at Yale University. Rust and Pakes were under very tight time deadlines in writing up the paper, and they cannot imagine how they would have done it by exchanging trial drafts via Federal Express. Certainly it would have been much more costly and would have severely cut back the number of revisions they would have attempted.
Internet is very useful for a number of other activities, including: 1) jointly-organized conferences, sending out program updates and confirmations via E-mail, 2) using E-mail to speed up sending in referee reports for journals and NSF, and 3) obtaining free software over electronic bulletin boards. In the latter case, one can obtain a vast array of free software that would cost thousands if one were to purchase similar proprietary programs through standard vendors.
Our general observation is that Internet has been a significant
factor in promoting not only our own research, but that of many of
our colleagues. Last year, the University of Wisconsin spent $12
million on new wiring and network equipment to take
advantage of the Internet system. Clearly, the research community
is coming to depend on Internet: it is no longer a luxury but a
necessity for many researchers. Several years ago, John Geweke
suggested that investment in the Internet might be a substitute for
further investment in supercomputers, since it provided a way of
tapping unused computer power on researchers' desktops
throughout the country. John Rust thought this was somewhat
preposterous at the time; however, given his subsequent
experience with the amazing speeds of (certain segments of) the
Internet and the recent development of distributed computing
software, this idea no longer seems preposterous but eminently
practical. Currently, one of the main obstacles to practical
implementation of this idea is concern about security in the wake
of recent break-ins by computer hackers and destruction of data
by computer viruses and worms.
2. Supercomputers
While distributed computing arrangements can offer great
advantages, for the very largest problems they are inadequate. For
example, Rust has a problem involving arrays of several to
hundreds of millions of elements. Most workstations have only
about 8 to 16 megabytes of memory and 200 to 300 megabytes of
disk storage. Solving a large-scale dynamic programming
problem in a distributed computing environment would be
infeasible because small sections of the problem would have to be
distributed to many machines, requiring considerable
programming effort to figure out how to divide the arrays,
distribute the data to the individual machines, and synchronize the
computations so that each component computation could be re-
assembled to produce the overall solution. There simply is no
software that allows one to do this automatically, so the time
required to write the software would be considerable (and
probably rife with errors), and the ultimate throughput of such an
arrangement would still be significantly less than supercomputers
such as the Cray-2 that regularly achieve rates of 400 megaflops or
better.
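The scale of the bookkeeping involved can be illustrated with a small calculation. This is a sketch only: it assumes 8 bytes per array element (double precision), counts all of a workstation's memory as usable, and shows just the first step, dividing the array indices into blocks. Distributing the data and synchronizing the computations, the hard parts described above, are not attempted. The function names are hypothetical.

```python
import math


def machines_needed(elements: int, bytes_per_element: int = 8, memory_mb: int = 16) -> int:
    """Minimum number of machines whose combined memory could hold the array."""
    return math.ceil(elements * bytes_per_element / (memory_mb * 1024 * 1024))


def partition(elements: int, parts: int):
    """Split indices 0..elements-1 into contiguous, nearly equal half-open blocks."""
    base, extra = divmod(elements, parts)
    blocks, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        blocks.append((start, stop))
        start = stop
    return blocks


n = 100_000_000                      # one array of 100 million elements
k = machines_needed(n)               # machines with 16 megabytes each
print(k, partition(n, k)[:2])
```

Even under these generous assumptions, a single array of 100 million elements would have to be spread across 48 machines with 16 megabytes of memory each.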
Thus, we would argue for a balanced approach to economic
computing. Both workstations and supercomputers will be
needed, and a fast network infrastructure will be the glue that
binds them together.
F. Infrastructure in Economics and in Statistics
The hardware-software situation in economics is curious.
Statistics is firmly part of the "hard science" and engineering
community and culture, using Unix-based workstations, TeX
word-processing and the Unix-based statistical package S. Use of
Internet is widespread. With a few exceptions, economists use
PCs (rarely reliably tied into Internet), a wide array of word-
processors and several statistical packages of which Gauss may be
the most commonly used. The computational infrastructure in the
statistical community is much better suited to computational
economics than is the computational infrastructure of the
economics community itself. We could elaborate on this at length,
but will refrain from doing so here. In part, this situation can be
traced to the suspicion that computational work is not highly
valued in the economics community, but this is changing very
quickly. More important is the fact that statistics departments are
often on the "hard science" side of campus, and administrators
who are accustomed to providing over $100K for new assistant
professors in the lab sciences will see their way to spending $10K
per active researcher to install, maintain and upgrade distributed
computing in statistics (the level identified in the "Eddy Report" in
Statistical Science a few years ago). Most important is the fact
that statisticians have been active in identifying their needs (see the
Eddy Report) resulting in, among other things, their active
participation in the ideally leveraged Scientific Computing
Research Equipment in the Mathematical Sciences (SCREMS)
program at NSF. The characteristic of workstations and
associated software that is important for computational economics
is the ability for very rapid communication between researchers,
and the ability to run software essentially interchangeably. The
maturing of computational economics will lead to a system with
these same characteristics, and our opinion is this will entail the
adoption of workstations by economists. NSF should encourage
and facilitate this movement in the same way it has in the
mathematical sciences.
Some economists have found the economics program at NSF very responsive to their requests for hardware and software support in grant applications, providing flat amounts each year for maintenance and upgrading of a base funded originally through combinations of SCREMS and university money. However, many economists who do very good work and are regularly funded by NSF-economics report routine cutting of hardware and software requests.
It is our opinion that, on balance, NSF has been a follower rather
than a leader in computational economics. In some areas, being
more of a leader means saving money. A clear example is the
archiving of data, which ought to be extended to software. When
one sends software to a typical economist, one mails a PC floppy
with code and quite a few pages of documentation, asking the
person on the other end to return the disk. A secretary often
handles this. When one sends software to a statistician, who
makes a request by e-mail, one transfers the files on Internet
which takes about a minute. NSF could do a lot to encourage (to
the point of requiring) funded research to use Internet-accessible
machines. This would have little or no impact on NSF's budget,
but would appropriately place great pressure on campuses to
upgrade their wiring.
G. Databases
One welcome development in applied macroeconomics in the
1980's was the move to subject macro theories to microdata
analysis. The Center for Research in Security Prices, the
Longitudinal Research Database at Census, the Manufacturing
Sector Master File at the NBER, and the Surveys of Consumer
Finance by the Federal Reserve Board are only a few of the
sources of high-quality, detailed data for researchers. Four
generic issues worth mentioning as areas for future funding are:
(1) network access to standard archival data banks; (2) reduction
of barriers to public access of disaggregated data, such as
aggregations that preserve information but protect the
confidentiality of reporters; (3) user-friendly concordances among
alternative data bases, including links to corresponding foreign
industry data banks; and (4) establishment of time-dated or "as of"
data bases, where successive measurements of events are
preserved. On the latter point, access to "as of" perspectives that
indicate the calendar timing of data revisions would help identify
information actually available to agents in historical time.
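The "as of" idea in point (4) amounts to a simple data structure: every published measurement of an observation is kept together with its release date, and a query returns the latest value released on or before a chosen date. The sketch below assumes an in-memory store; the class name, the series name, and the sample figures and dates are invented for illustration and are not actual published data.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


class VintageStore:
    """Keeps every published measurement of an observation, keyed by release date."""

    def __init__(self):
        # (series, period) -> list of (release_date, value), kept sorted by date
        self._releases = defaultdict(list)

    def publish(self, series: str, period: str, released: date, value: float) -> None:
        rows = self._releases[(series, period)]
        rows.append((released, value))
        rows.sort()

    def as_of(self, series: str, period: str, when: date):
        """Latest value released on or before `when`, or None if nothing was out yet."""
        rows = self._releases.get((series, period), [])
        i = bisect_right(rows, (when, float("inf")))
        return rows[i - 1][1] if i else None


store = VintageStore()
store.publish("gnp_growth", "1990Q1", date(1990, 4, 27), 2.1)   # invented first estimate
store.publish("gnp_growth", "1990Q1", date(1990, 5, 25), 1.3)   # invented revision

print(store.as_of("gnp_growth", "1990Q1", date(1990, 5, 1)))
print(store.as_of("gnp_growth", "1990Q1", date(1990, 6, 1)))
```

A query dated between the two releases returns the first estimate, which is the figure an agent could actually have known at that time; a later query returns the revision.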
In these preliminary comments on databases, we are less
concerned with specific areas of research than we are with the
components of a good program in computational economics.
Such a program can only be developed through a unified or
balanced approach emphasizing the three basic foundations of
empirical research. These foundations of good empirical
economics are: 1) a solid theoretical base and econometric
methodology, 2) appropriate data, and 3) accessibility of the data
to a wide variety of researchers. Although we will focus our
remarks here on the data component, we emphasize that a unified
approach to an initiative in computational economics is needed.
1. Theory and Econometric Methodology
The first need is for a theoretical framework rich enough to allow
for estimation of basic structural relationships among economic
agents. Here we are thinking about establishments, firms,
individuals and households as the economic agents of interest.
The models developed must be rich enough to provide for
stochastic outcomes, incorporate equilibrium concepts that enable
researchers to evaluate the impacts of spillovers on behavior, and
provide for efficient use of both cross-section and time-series
variations in agents' behavior. For example, there is a growing
literature on stochastic models of industry evolution such as the
one recently developed by Pakes and Ericson (1988). To estimate
such models, longitudinal data on individual agents is necessary.
Longitudinal micro databases can be developed and, to a limited
extent, are already available. (See Olley and Pakes (1991) for
a successful implementation of a dynamic stochastic model with
endogenous exit behavior using the Longitudinal Research
Database (LRD) at the Census Bureau.)
2. Economic Data
It is not enough to develop models. (Parenthetically, we note that
the existence of appropriate databases will tend to spur theory
development. We think the expansion of industrial organization
data in the 1960's directly fed many theoretical advances in the
1970's and 1980's.) In order to make progress on answering
basic policy questions, these models must be parameterized and
"experiments" with different policy instruments undertaken. This
requires microdata panels. Such panels are possible to create, but
they will require substantial matching across existing databases,
extensive editing and long-term commitments of resources,
including mechanisms for feedback between the entities collecting
data and the information developed from the matched database(s).
In addition, since the data involved will be subject to
confidentiality restrictions, methods for researcher access will
need to be a part of any plan.
A comprehensive initiative will seek to bring together establishment and firm information on outputs (products and services), inputs (capital, labor, R&D, materials, purchased services), financial and management characteristics, and demographic information on workers (work histories, education, training, etc.). We are stressing a rather broad view of a database initiative for a couple of reasons. First, we think such an effort is possible, although we recognize that political and other realities may suggest a less grandiose design. Second, we think that many problems will require elements of both demographic and establishment data for resolution. For example, the production function is basic to most economic theories. It requires information on input flows that are in turn derived from estimates of capital, both human and physical. If we want to get correct answers we must develop the ability to measure the basic variables in the underlying structural equations of our theories.
One might ask why the emphasis on micropanels in these comments. The answer is that it is increasingly apparent that, for a wide range of problems, the assumptions that allow one to use aggregate data to draw inferences about the behavior of individual economic agents are not valid. Whether the application of the data is productivity measurement, business cycle analysis, R&D resource allocations, job turnover, minority and small business development, environmental or merger policy, the stylized facts suggest that it is essential to take account of the heterogeneity of economic agents. For example, economic studies at the Census Bureau are finding that the basic building blocks of the statistical system show substantial variation along a variety of dimensions and these variations are not constant over time. As outlined in McGuckin (1989) and numerous papers produced by the Center for Economic Studies, this has potentially important implications for economic measurement. In particular, the size and nature of observed heterogeneity call into question the use of, for example, simple fixed-effects models of economic performance utilizing aggregate industry data based on a "representative firm" as a basis for measurement and policy analysis.
Even at very detailed levels of aggregation, such as the 4- and 5-
digit levels of industry and product classifications, there exists
substantial heterogeneity among establishments and firms along a
wide variety of dimensions. This heterogeneity in economic
behavior is associated with observables such as age, size, location
and ownership of the plant. Nonetheless, it is also true that even
after controlling for these observables, there remains substantial
residual heterogeneity in the behavior of establishments. A similar
story can be told from the perspective of wage equations estimated
with supply-side demographic data. Moreover, some forms of
behavior that often are not treated as endogenous, such as
ownership changes, entry and exit, migration of production and
distribution systems, and regulatory changes -- all interesting
phenomena in their own right -- have all been shown to affect the
behavior and performance of firms and establishments. Thus, not
only is there heterogeneity in the behavior of economic entities at a
point in time, but the nature of that heterogeneity is also changing
over time. Hence our emphasis on micro-panel data.
3. Accessibility
It is important that a wide array of researchers have access to the
data both for replication purposes and tests of alternative models.
This is a major difficulty because of the inherent confidentiality of
data on individual agents. Most official economic data
publications are based on aggregations of microdata collected in
statistical surveys of individual respondents. These data are used
by policy makers, researchers and market analysts as economic
indicators and as a source of information for developing economic
policy and testing economic theory. As useful as these aggregate
data are, the underlying microdata provide even more valuable
information for the study of the economy. Many hypotheses
concerning the nature of production, technical change, and the
interaction of individual firms can only be tested using detailed
microdata. For example, John Solow (1987) argues convincingly
that, using aggregate data, it is impossible to determine whether
energy and capital are complements or substitutes. Moreover, the
extent of aggregation bias can only be evaluated with the use of
microdata. As a result, the necessity for use of detailed microdata
by the public and private research communities cannot be
minimized.
Faced with this, statistical agencies such as the Census Bureau have sought ways to make microdata available to outside researchers and policy makers without violating confidentiality commitments to individual respondents. Aside from cost and legal issues, the confidentiality commitment to respondents is of great concern because statistical officials fear that low rates of response to statistical surveys will get lower if the released microdata reveal confidential information about individual respondents.
All masking techniques create surrogate data by adding either stochastic or systematic (or both) measurement errors to the data. In turn, undoing or correcting for such errors can only be accomplished within the context of specific econometric models. Put differently, evaluation of the effects of measurement error on parameter estimates depends on the model describing the relationships among the variables associated with the unmasked data. Thus, determining the usefulness of a public use data file is essentially a problem in evaluating the effects of measurement error.
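A small simulation illustrates why the effect of masking depends on the model. It assumes the simplest possible case, which is not any agency's actual masking procedure: a linear regression of y on a single confidential variable x, masked by adding independent random noise. In this textbook errors-in-variables setting, the estimated slope shrinks toward zero by the factor var(x) / (var(x) + var(noise)). All names and parameter values are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20_000, beta=2.0, noise_sd=1.0, seed=0):
    """Return the slope estimated from the true data and from the masked data."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]              # confidential values
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]        # outcome, reported exactly
    masked = [xi + rng.gauss(0.0, noise_sd) for xi in x]     # released surrogate for x
    return ols_slope(x, y), ols_slope(masked, y)


true_slope, masked_slope = simulate()
print(round(true_slope, 2), round(masked_slope, 2))
```

With unit variance for both x and the noise, the slope from the unmasked data should come out near the true value of 2.0 and the slope from the masked data near 1.0. A researcher who knows the noise variance can correct for the shrinkage in this model, but a different model (for example, one nonlinear in x) would call for a different correction.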
It would be convenient to have one public-use file that could provide researchers with sufficient information to test hypotheses and estimate models, while maintaining confidentiality protection for respondents. Unfortunately, the masking techniques used to preserve confidentiality limit the economic studies that can be carried out with any particular public use data set. (See McGuckin and Nguyen (1990)). Thus, it is extremely unlikely that any single public use file will satisfy all users, particularly for data on business populations that are highly skewed. Ideally, it would be best to release many different files to satisfy the needs of different researchers. But this complicates disclosure analysis because the release of a particular public use file often makes it possible to identify individual respondents in another file, that by itself would not reveal confidential information.
These issues are of clear importance to economists. Yet, the confidentiality issues are not widely understood, and there has been little research on the subject even within the statistical community. One approach in use at the Center for Economic Studies of the Census Bureau is to have outside researchers visit as special sworn Census employees with access to the microdata under controlled conditions. In fact, this mission is a major component of the Center for Economic Studies. Costs of such access cover, among other things, security agreements and disclosure analysis. Unfortunately, for some projects these costs may be substantial when viewed from the perspective of the individual researchers without grant support. Procedures for data access must be considered as part of any comprehensive initiative in computational economics.
Currently the Center for Economic Studies is working on
proposals to link workers to Longitudinal Research Database
(LRD) plant-level data, to link environmental emissions to LRD
plant-level data, and to link non-manufacturing plants
longitudinally to make the LRD truly economy-wide, as well as on
a number of other smaller efforts.
4. An Economics Server
An argument can be made that many U.S. economic statistics
are still organized in a form that was appropriate to the
econometric models of twenty years ago but is completely out
of step with the possibilities now available for databases
maintained on high-speed servers that are accessible over
networks of workstations.
It would seem worthwhile to organize our data so as to permit the user to impose his or her own view on the data and then retrieve it with the type of aggregation needed for the work at hand. While organizational lines are very important to those who create and maintain economic data, they are frequently not important to users. Therefore, relational databases should be organized to let the user access the data in ways which are independent of the organizations which create and maintain the databases.
This could be accomplished by the creation of a National Economics Server (NES). This server would provide an interface between data users and suppliers. It would have a menu-driven interface that would make it simple for the user to download data without having to be concerned about the agency which maintains each portion of the desired data. The economic data as such would not be stored on the National Economics Server, but rather links would be maintained to the organizations which maintain major economic databases. A single account could be maintained on the server by the user, and the server organization could then pass along the appropriate fees to the organizations which maintain the databases.
In the future we will want to have databases of models as well as of numbers. The NES might also maintain a database of economic models and the documentation which supports them. Thus if a user were interested in models of the economic effects of global warming, he or she could access the models and their databases and build on the existing work.
Because science depends on reproducible results, it is extremely important that the manipulations undertaken to develop estimates are documented in a machine-independent form. Query languages offer a platform-independent way of describing data manipulation. They also use a standard vocabulary of verbs, instead of "cute" or "ingenious" code that is not easily interpreted by persons attempting replication.
In addition, replication demands attention to the naming of the measurement objects. If different researchers use different names for the same measurement, scientific discourse is obscured. If different scientists use the same name for different measurements, anarchy results. Economists need a standard nomenclature for measures, just as much as chemists. Databases enforce permanent nomenclature on the source of measurements and standardize labelling used in computing derivatives from a particular data source.
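The points about query languages and fixed names can be illustrated with a small sketch. It uses SQL through Python's built-in sqlite3 module; the two source tables, their hypothetical "agency" owners, the column names and the figures are all invented for illustration. The user defines a view that joins the sources and aggregates them to the industry level, so the entire manipulation is recorded in a handful of standard verbs (SELECT, JOIN, GROUP BY) and the derived measures carry explicit names.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical source tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Plant-level output, as maintained by one (hypothetical) agency.
        CREATE TABLE agency_a_output (
            plant_id INTEGER, industry TEXT, year INTEGER, shipments REAL);
        -- Plant-level employment, as maintained by another.
        CREATE TABLE agency_b_employment (
            plant_id INTEGER, year INTEGER, workers INTEGER);

        INSERT INTO agency_a_output VALUES
            (1, 'steel', 1990, 100.0),
            (2, 'steel', 1990, 300.0),
            (3, 'paper', 1990, 60.0);
        INSERT INTO agency_b_employment VALUES
            (1, 1990, 10),
            (2, 1990, 30),
            (3, 1990, 5);

        -- The user's own view: industry-by-year aggregates that ignore
        -- which organization supplied each column.
        CREATE VIEW industry_year AS
            SELECT a.industry, a.year,
                   SUM(a.shipments) AS shipments,
                   SUM(b.workers)   AS workers
            FROM agency_a_output AS a
            JOIN agency_b_employment AS b
              ON a.plant_id = b.plant_id AND a.year = b.year
            GROUP BY a.industry, a.year;
        """
    )
    return conn


def shipments_per_worker(conn):
    """Return (industry, year, shipments per worker), ordered by industry."""
    return conn.execute(
        "SELECT industry, year, shipments / workers "
        "FROM industry_year ORDER BY industry"
    ).fetchall()


if __name__ == "__main__":
    for row in shipments_per_worker(build_demo_db()):
        print(row)
```

Anyone attempting replication needs only the view definition and the final query, not the program that happened to run them.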
The following sections are particularly concerned with information
systems for complex data and with the supporting computational
hardware and software.
5. Complex data
This kind of data has been growing in importance and potential
over the last twenty years. Panel data (Panel Study of Income
Dynamics - PSID), matched observations from clients and
administrative records (Wisconsin Assets and Income Studies),
designs with multiple units of analysis (High School and
Beyond), and data from the Social Experiments (Seattle-Denver
Income Maintenance) are all examples of complex data. Such data
are characterized by: 1) group design, 2) parallel exploitation, 3)
extended periods of data collection, and 4) alterations in design
during that period.
Group design in complex data implies that no single scientist is fully informed about the design in the absence of good internal communication within the organization responsible for data collection. Parallel exploitation of the data implies that scientists other than the collectors have access to the data. Secondary users need design and process information that is used by the collectors to affect their design. Extended periods of data collection imply that information about a changing design must be archived for use by a younger cohort of scientists at a later date.
The task of scientific journalling of the activities that generate
complex data and collating the knowledge that is generated from
the data is beyond the capacity of traditional social science and its
journals. Repeated failures demonstrate the inadequacy of current
institutions for supporting secondary users and maintaining
archival data (David (1980)).
6. Exploiting complex data
Fortunately, we can overcome errors of the past. Technology and
knowledge have changed the cost-effective mode for exploiting
complex data. At the same time they have created an environment
in which researchers can be more strongly linked to data
producers. The new technology creates opportunities for scientific
collaboration that furthers the agendas of both scientists and policy
makers.
The technology and knowledge that create a new mode for exploiting complex data contain three elements: 1) high-capacity communication networks, 2) relational database management systems, and 3) "necessary support" (David (1991)). When these elements are augmented by a specialist "expert" and adequate computational capacity in an institution that provides incentives for cooperation, the combination makes low-cost access to data feasible.
Data, relationships, metadata, and aggregates of data are a complex object that must be transmitted in its entirety if it is to be used scientifically (David (1991)). An objective for databases in use in economics is to implement standards for that complex object. Relational databases come closer than any other tool we now have to realizing the objective of working with a complex data object.
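One way to picture such a complex object is as a single container in which the data, the relationships among tables, the metadata and the derived aggregates travel together. The sketch below is only an illustration under assumed names (the DataObject class, its fields and the sample content are hypothetical and do not describe any existing standard). It shows the object being serialized and restored as one unit, so that no component can be separated from the rest in transmission.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DataObject:
    """Hypothetical container for a complex data object."""
    tables: dict         # relation name -> list of row dictionaries
    relationships: list  # links between relations (key pairs)
    metadata: dict       # variable definitions, design notes
    aggregates: dict = field(default_factory=dict)  # derived summaries

    def to_json(self) -> str:
        """Serialize the whole object; nothing is transmitted piecemeal."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "DataObject":
        """Rebuild the object from its serialized form."""
        return cls(**json.loads(text))


def example_object() -> DataObject:
    """Build a tiny illustrative object with invented content."""
    return DataObject(
        tables={
            "household": [{"hh_id": 1, "region": "west"}],
            "person": [{"person_id": 10, "hh_id": 1, "income": 2500}],
        },
        relationships=[{"from": "person.hh_id", "to": "household.hh_id"}],
        metadata={"income": "monthly income in dollars, self-reported"},
        aggregates={"mean_income": 2500},
    )


if __name__ == "__main__":
    original = example_object()
    restored = DataObject.from_json(original.to_json())
    print(restored == original)
```

A relational database plays this role more completely, since the catalog, keys and constraints are stored alongside the data rather than in a separate codebook.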
Special-purpose programming to optimize on computer cycles is not efficient in an economic sense if the integrity of the data object is lost. The inefficiency comes about because many potential research questions go unasked or unanswered, as users do not have an understanding of the special purpose programming system. The relational model provided by the computer science community gives a common basis that can be taught and learned widely, then applied to a broad range of problems. The special purpose system does not meet that criterion.
We anticipate that improvements upon the present relational model
for data will be forthcoming, due to experiments now going on
with object-oriented databases, temporal databases, and interfaces
between hypertext and databases. Nonetheless, these
improvements will necessarily retain much relational structure, and
it behooves the profession to begin organizing relational
databases, since future extensions are unlikely to be incompatible
with the improvement in data access that is already technically
possible.
7. An Information System for Complex Data
It is precisely this combination that was created for five years to
foster research on the Survey of Income and Program Participation
(SIPP) (David (1985) and David and Robbin (1989)). National
Science Foundation support for this project recognized the
pressing national need to develop solutions for storing, accessing
and retrieving data from very large, dynamic statistical data sets
with complex designs (Fienberg, Martin and Straf (1985) and
Aborn (1988)). SIPP ACCESS designed an information system
that integrated statistical data and metadata (information about the
data), including the database design and contents, survey design,
collection and processing procedures, and the results of data
analysis. More than 2.2 gigabytes of data released as cross-
sectional public use files were reduced by about 75 percent using
relational database management system (RDBMS) software. New
data structures were designed to facilitate an understanding of the
complexity of longitudinal panel surveys, reduce conceptual
errors, and obtain large reductions in the amount of time required
to prepare data for longitudinal panel analysis. Technical
memoranda reflecting the scientific design and processing at the
Bureau of the Census were collated and catalogued to create the
only complete record of the data collection activity.
The foregoing activities reduced the cost of research access to these complex data dramatically. Learning time, assembling necessary documentation, computing resources and the cost of mistakes in research procedures were reduced by two orders of magnitude.
At the same time, the facility created a node through which social scientists could maintain active communication with the data collectors and with each other. This resulted in:
The facility has served 45 research projects nationwide. More
than 150 faculty, graduate students and programmers in
universities across the country; policy analysts in private, not-for-
profit research institutes; and members of federal agencies have
been trained in SIPP ACCESS workshops. They have learned the
SIPP design, how to apply relational database management
systems (RDBMS) to complex data, and how to use the new laser
disk storage and communication technologies.
8. Resource Requirements for a Database Program
To multiply the number of databases that are handled in
information systems for complex data requires:
Given the cost of designing a DBMS and adapting it to alternative computer architectures, it would be foolhardy for social science to "reinvent the wheel" by creating its own system. Social science should lobby for adaptations of existing systems and create "applications programs" if professional agreement is reached on the capability required. The number of competing systems and the non-standard languages used, despite the SQL standard, imply that social scientists need to coordinate development of databases for related complex data sets. Social scientists also need to be aware of the interfaces that exist between database systems and statistical processors. The latter have a comparative advantage at sweeping through rectangular arrays, but they do not generally support set-theoretic operations on relations, and they do not offer the protections of concurrency control, integrity and dynamic independence that are the forte of "truly relational" RDBMS (Date (1988)).
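To make the phrase "set-theoretic operations on relations" concrete, the following sketch uses SQL set operators through Python's built-in sqlite3 module. The two relations, which stand for the respondents present in two waves of a hypothetical panel, and their identifiers are invented for illustration. Intersection and difference identify stayers, leavers and entrants directly, without first forcing the waves into one rectangular array.

```python
import sqlite3


def panel_sets():
    """Return (stayers, leavers, entrants) for two hypothetical waves."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wave1 (person_id INTEGER PRIMARY KEY);
        CREATE TABLE wave2 (person_id INTEGER PRIMARY KEY);
        INSERT INTO wave1 VALUES (1), (2), (3), (4);
        INSERT INTO wave2 VALUES (2), (3), (5);
        """
    )

    def ids(sql):
        # Collect the single result column into a plain list.
        return [row[0] for row in conn.execute(sql)]

    # Present in both waves: set intersection.
    stayers = ids(
        "SELECT person_id FROM wave1 INTERSECT "
        "SELECT person_id FROM wave2 ORDER BY person_id"
    )
    # Present in wave 1 only: set difference.
    leavers = ids(
        "SELECT person_id FROM wave1 EXCEPT "
        "SELECT person_id FROM wave2 ORDER BY person_id"
    )
    # Present in wave 2 only: set difference in the other direction.
    entrants = ids(
        "SELECT person_id FROM wave2 EXCEPT "
        "SELECT person_id FROM wave1 ORDER BY person_id"
    )
    conn.close()
    return stayers, leavers, entrants


if __name__ == "__main__":
    print(panel_sets())
```

Once the relevant set has been selected, the resulting rows can be handed to a statistical package as an ordinary rectangular file, which is the division of labor the paragraph above suggests.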
One might ask if the resources required for the complex data
program outlined above are large. Compared to solo investigator
projects, the costs of retrofitting information systems onto data
acquired in conventional ways are large. However, this
comparison is fallacious. First, the costs of conventional solo
investigator projects will fall if database technology is used.
Economies of scale are substantial; the second and subsequent
analyses have lower marginal costs. Also, learning about data will
be more efficient, replication will be less costly, and
transformation of data to particular conceptual models will be more
efficient. Second, the quality of data and the quality of analysis
will rise because error can be more systematically studied, more
completely detected, and eliminated from the data source and the
estimates. Lastly, processing of data that is required for the
capture of administrative records and statistical protocols can be
much more efficiently accomplished by database management.
The real possibility exists that databases dominate earlier
technologies for data production in statistical inquiry, and if
adopted by the data collector, the investment required on the part
of secondary analysts would be minimal.
9. What can NSF do?
NSF must make clear that it views each information system for
complex data as a national laboratory facility. The facility would
supply ongoing maintenance, assurance of a host for each
complex data set, and a commitment through which necessary
institutional arrangements can be secured. Because the
information system is infrastructure for science it requires a
different kind of peer review than the applications which can be
made with the data.
NSF can facilitate relationships between software vendors and the information system. Our experience is that vendors do not understand the academic market and exercise monopolistic power in licensing software.
NSF can become a focus for pressure on the data producers to invest in information systems. If we are correct, and the technology is dominant over conventional means, this is a "win-win" situation.
NSF can sponsor planning meetings at which priorities for the
scope of database laboratories can be established, and agreement
can be reached on the standards that are appropriate to scientific
design. Thorough understanding of the strengths and weaknesses
of alternative software, and its likely evolution, can be reached.
The latter activity will require close interface to the computer
science community (which has not been educated about the
complexity of typical social science retrieval activity).
10. How does this relate to the need for computation in the
social sciences?
Experiments with SIPP data have established that Ethernet connections between MicroVAX and 80386SX machines make downloading of data from the database engine to microcomputers economical. Handling very large-scale survey data in this way is feasible. High data-transfer rates are required, and more memory in the microcomputer is desirable. (One of the limitations of microcomputer versions of RDBMS is that they do not all support use of large memory stores.)
Development of the information system for complex data will need to take place on a machine that is networked to the NSFnet, and which can support multiple users. With SIPP, our experience shows that even a relatively large-scope, large-scale survey can be handled with either a micro-mini computer (MicroVAX) or a well-endowed workstation. The limitations on development come from the scarcity of human capital -- people who know systems programming, database theory, research in social science and information science. Limitations also come from the failure to endow database development with control over the computer environment used. (The need to adapt queues, system parameters, updates, and storage to competing demands from other irrelevant activities imposes high costs on the development of a database.)
How can we reduce the costs of acquiring computer expertise? In our experience, NSF's policy of providing researchers with individual workstations has had a much bigger impact on the general level of computer expertise than providing extra supercomputer funding. It involves difficult issues of the proper balance of funding for "big science" vs. "little science." The supercomputer is frankly intimidating to many researchers: its sheer power and remoteness seem to discourage "playing around" that seems to be essential to really understanding how to use a computer effectively. On the other hand, a desktop workstation is a much more friendly device: it is always available whenever you turn it on, and there are no cpu-limits, account expiration deadlines, software revisions and "bug-fixes," and other bureaucratic hassles of working on the supercomputer. The "PC revolution" had such an amazing impact precisely because users had much more incentive to learn to understand a computer that was their own personal property than a computer they could only rent space on for an indefinite period of time. We find the administration of supercomputer accounts at the NSF supercomputer centers to involve a number of seemingly unnecessary bureaucratic obstacles. An example is NCSA's policy of setting relatively short 1-year account expiration deadlines. If one is doing anything that is slightly non-standard and encounters significant data or programming problems, it can be very hard to use up a cpu allocation in only one year. It would be much easier on the user to be allocated a block of cpu time with either no calendar expiration date, or a longer 2- to 4-year period before expiration. Office workstations do not impose such unnecessary deadlines, and as a result many researchers would rather use their workstations than a supercomputer even if their runs take 100 times longer.
Given the amazingly rapid improvement in chip technology, today's desktop workstations are essentially little supercomputers. Thus, it is getting to the point where even the personal workstation can be a fairly complex and intimidating object. To avoid wasting idle cpu cycles, we have seen a migration away from single-user, single-tasking operating systems like DOS to multi-user, multi-tasking operating systems like Unix and Mach. Owning a Unix workstation requires the owner to be a "super-user" in charge of system administration and this involves a fairly high degree of computer sophistication. For example, given the concern about network viruses and worms, it is important to make sure that various security procedures are adhered to, including setting up user logins and passwords and read/write permissions, monitoring of network access and disk I/O, in addition to dealing with normal hardware and software maintenance. These tasks were formerly performed by a full-time professional; now they must be a sideline activity of any researcher who acquires their own workstation. This requires the user to acquire a fair amount of computer-specific human capital in order to effectively and safely use modern generation workstations.
The upshot is that while for sufficiently simple machines (like the early DOS PCs), the incentive effect of personal ownership was sufficient to overcome the relatively small human capital investment (resulting in a larger group of computer literate users), for the newer high-performance workstations the personal ownership incentive may not be enough: it may be optimal to revert to the centralized bureaucratic mainframe cpu-time allocations to economize on duplication of computer-specific human capital.
There is an alternative that still allows a healthy degree of decentralization in computing resources: namely to have local "computer consultants" who can assist in the running and maintenance of many departmental workstations and help write software and answer questions of the non-specialist users. This type of arrangement is difficult to support under current funding arrangements: the money seems to be there to acquire the machinery, but not to pay for a professional to help manage, maintain and instruct new users about this machinery. It ends up diverting all the costs onto the professor, and eventually it becomes a deterrent to trying to acquire more computing resources.
While NSF does support student research assistants, for many
detailed multi-year projects it can be inefficient to support a
graduate student for a year, lose him or her to the job market and
face incurring the search and retraining costs for a new graduate
student. In our experience, the lack of students with sufficient
expertise in computers, the high costs of training and the lack of
continuity in one-year contracting arrangements ended up making
it easier and safer for us to do our own programming. An
intermediate situation that we would much prefer is to have a
departmental "computer expert" who can help develop a "pipeline"
of relatively trained student research assistants, while also
providing the continuity to ensure that if one student familiar with a
detailed computing project quits, another substitute could be
quickly found. Our experience would suggest that departments
will need about one computer expert per twenty workstations.
I. Training of Graduate Students
The implications of the emergence of computational economics for
the way we train people are becoming clear. It's essential for first-
year graduate students to be fluent in computer software, as they
must be in mathematics. Our experience is that economics is some
years behind statistics -- not because of inherent differences in the
problems, but because of differences in the ways computing has
been treated within the structures of universities and funding
agencies, and in peer evaluation. The model emerging in the near
future will be one in which a workstation or associated high-level
terminal is available to each graduate student. These machines will
be used for classwork and research. In addition, with appropriate
software for distributed processing, these machines will be
available for intense background computational tasks (something
which is essentially impossible in the PC architecture). It's a
small step to distributed processing over networks. Currently,
economics is behind statistics and other computationally-oriented
fields in this regard, and the distance grows greater all the time.
Much of the investment will have to come from universities. The
funding requirement is shrinking, however, not growing. Ideally,
an important contribution of NSF would be the one it has
historically made in the lab sciences, the provision of training
grants for graduate students.
Training in computational economics has specific implications for
the continued training of mature investigators as well. Experience
in statistics again suggests a model. In the summer of 1989,
Nancy Flournoy from the Statistics and Probability Program
arranged for one week of the annual AMS research meetings (at
Arcata, that year) to be devoted to statistical multiple integration.
It brought together statisticians, computer scientists and numerical
analysts (40 or 50 all together) for a week. A primary reason for
the meeting was that while multiple integration is a research topic
for all three groups, the focus is different. In statistics the concern
is with algorithms that work well in the vast majority of a loosely-
structured set of cases, with diagnostics indicating when they do
not. In contrast, numerical analysts seek procedures that work
most reliably in worst-case scenarios for more narrowly defined
problems. For several years John Geweke had worked in this
area with the suspicion that he might be reinventing someone
else's wheel. The meeting quickly brought the difference in
orientation into focus, and showed that the attendees were
working on related but distinct problems. A few new ideas were
provided by the numerical analysts and a few by the computer
scientists. Mostly, John learned that the numerical analysts were
not able to help with statistical multiple integration problems. This
result has been important in directing his research since then, and
it only could have emerged through a meeting like the one in
Arcata. The wheel-reinvention problem, discussed by some of his
panel colleagues, is a very real one. Meetings with a specific
focus to bring together well-chosen groups in economics and other
areas (computer science, mathematics or other fields) would be
very productive.
J. Postdoctoral Programs
It is not common to have postdoctoral programs in economics -
perhaps in part because economics is not a laboratory science.
However, computational economics has some of the flavor of a
laboratory science and we suggest that some postdoctoral
programs be initiated by NSF on an experimental basis. A subset
of these fellowships might be set aside to take advantage of an
important opportunity that we perceive. Thirty years ago most of
the Ph.D. programs in economics were in the developed countries.
That situation has now changed and many first-rate Ph.D.s are
now being trained in the developing countries. Those students
could benefit substantially from an opportunity to work as a
postdoctoral student for a few years in a U.S. university where
they could become acquainted with some computer hardware and
software which may not be available in their home country.
In summary then, we suggest three actions on postdoctoral
programs. First, principal investigators should be encouraged to
include postdocs in their proposals where that is appropriate.
Second, there should be a national competition for postdoctoral
awards the way there is now a program for doctoral students.
Third, a portion of this last program should be reserved for
students who have received their Ph.D. degrees in the developing
countries.
K. Continuing Education
In addition to being able to push the frontiers of research in
computational economics, the computational economics initiative
can go a long way in helping speed up the education of the
professoriate. Only a small percentage of researchers are
knowledgeable about the most up-to-date equipment, software,
techniques, etc. Given the large investment of time that is needed
simply to gather information about how to change the
configuration of an individual's or department's computer usage,
educational activities could significantly reduce the search costs to
the profession. Specific suggestions include short courses,
workshops, lectures and newsletters. For example, at the annual
American Economic Association (AEA) meetings and at one or
more of the more widely attended regional meetings (such as the
Western or Southern) there could be one session on "recent
developments" in hardware, another in software, another on new
algorithms, etc. Alternatively, there could be a short course with a
nominal charge on each of these topics, or a one-day course
encompassing all three. It would also be a great benefit to the
profession if the AEA could be persuaded to begin a newsletter (to
be published perhaps twice a year) identifying new developments
in these areas. There is already an AEA committee on Economic
Education, so perhaps it could get involved or another committee
could be formed.
There has been much discussion of the problem of "wheel re-invention" and NSF could certainly play a role here. One common complaint is that there is no mechanism for sharing code and algorithms or at least getting information out to the community about small- or large-scale programs. Part of this is wrapped up in the incentive problem, as well. If it takes a researcher days or weeks to get a program written and modified and running, there are no incentives to share with colleagues. Although a journal, such as the new Computer Science in Economics and Management journal, can help by publishing articles on software, smaller scale and more quickly disseminated methods are also needed. Perhaps a bulletin board arrangement of some kind could be set up and every time a program is used or requested it would be counted as a citation. Although it clearly would not "count" as much as a publication, it would be appropriate for smaller scale problems such as writing Gauss code or other things along those lines. NSF could help develop and underwrite such a project which could eventually be taken over by the AEA.
Another way that NSF funds could help in this area would be by
sponsoring institute-like activities that bring together computer
scientists and econometricians, or theoretical and applied
econometricians. Such a program could be modeled after the new
NSF initiative for institute-like activities to bring together
statisticians and econometricians, which is being jointly funded by
the Economics Program and the Statistics and Probability
Program.
L. Digital Journals
Access to scientific information will be made easier and more
precise as refereed journals become digital in form. The vision is
to provide direct access over the NREN to journal articles, or
portions of them, commentaries on them and pointers to related
work. There are about a dozen refereed journals on the network
now. Most of them are not in scientific fields. But that will
change. We are not sure the existing ones fit our model of a
digital journal. What is that model? First, it is not a formal
collection of articles bound into volumes and owned by a
publishing house. That is a model dictated by the economics and
organization of the print technology. With the ability to store bits
instead of pages, and to conduct intelligent searches over massive
amounts of data located on distributed databases, a new model
evolves. It allows the scientist to reference subjects, authors, data
sets, and multi-media material including images. Parameters of
such searches are flexible and more finely tuned than the way we
search on printed articles (by date, by journal and by author). For
example, retrieval should be possible on qualitative factors such as
how well liked by peer reviewers, or the perceived excellence of
the "journal." It should be possible, also, to retrieve portions of
articles and data sets, and portions of related articles down to a
fine level of detail.
Four kinds of developments are needed before the vision can become reality.
First, the network architecture and capacity to handle high-volume, high-density file search and retrieval has to be designed. This is well under way in the NSFNET research program.
Second, the design of large-scale, heterogeneous, distributed databases has to be accomplished. This is a basic research challenge that is just beginning. Currently, NSF has a Scientific Database initiative underway, and there are also initiatives in biology and geo-sciences sections of NSF.
Third, basic research is needed on how to design intelligent agents on the network that can search across many distributed databases for the information needed. Funding of this research is also relatively new in the Computer and Information Sciences and Engineering Directorate of NSF.
Fourth, we need to design the institutions and incentives that will encourage an industry (or its economic equivalent) of digital journals. This is a challenge for economists interested in the design of institutions and mechanisms as well as for lawyers who are concerned with intellectual property issues. Academics, generally, need to be concerned with the implications of digital journals for the system of priority and the determinants of tenure decisions. It may not be too early for interested economics journals or new entrants to propose experiments on these problems.